
Differentially Private Simple Linear Regression

Daniel Alabi (Harvard John A. Paulson School of Engineering and Applied Sciences), Audra McMillan (Department of Computer Science, Boston University; Khoury College of Computer Sciences, Northeastern University), Jayshree Sarathy (Harvard John A. Paulson School of Engineering and Applied Sciences), Adam Smith (Department of Computer Science, Boston University), and Salil Vadhan (Harvard John A. Paulson School of Engineering and Applied Sciences)
Abstract.

Economics and social science research often require analyzing datasets of sensitive personal information at fine granularity, with models fit to small subsets of the data. Unfortunately, such fine-grained analysis can easily reveal sensitive individual information. We study algorithms for simple linear regression that satisfy differential privacy, a constraint which guarantees that an algorithm’s output reveals little about any individual input data record, even to an attacker with arbitrary side information about the dataset. We consider the design of differentially private algorithms for simple linear regression for small datasets, with tens to hundreds of datapoints, which is a particularly challenging regime for differential privacy. Focusing on a particular application to small-area analysis in economics research, we study the performance of a spectrum of algorithms we adapt to the setting. We identify key factors that affect their performance, showing through a range of experiments that algorithms based on robust estimators (in particular, the Theil-Sen estimator) perform well on the smallest datasets, but that other more standard algorithms do better as the dataset size increases.

differential privacy, linear regression, robust statistics, small-area analysis
copyright: US Government
CCS Concepts: Security and privacy; Privacy-preserving protocols

1. Introduction

The analysis of small datasets, with sizes in the dozens to low hundreds, is crucial in many social science applications. For example, neighborhood-level household income, high school graduation rate, and incarceration rates are all studied using small datasets that contain sensitive, and often protected, data (e.g., (Chetty et al., 2018)). However, if too many statistical estimates based on these data are released too accurately, they can allow reconstruction of the original dataset (Dinur and Nissim, 2003). The possibility of such attacks led to differential privacy (Dwork et al., 2006), a rigorous mathematical definition used to quantify privacy loss. Differentially private (DP) algorithms limit the information that is leaked about any particular individual by introducing random distortion. The amount of distortion, and its effect on utility, are most often studied for large datasets, using asymptotic tools. When datasets are small, one has to be very careful when calibrating differentially private statistical estimates to preserve utility.

In this work, we focus on simple (i.e. one-dimensional) linear regression and show that this prominent statistical task can have accurate differentially private algorithms even on small datasets. Our goal is to provide insight and guidance into how to choose a DP algorithm for simple linear regression in a variety of realistic parameter regimes.

Even without a privacy constraint, small sample sizes pose a problem for traditional statistics, since the variability from sample to sample, called the sampling error, can overwhelm the signal about the underlying trend. A reasonable concrete goal for a DP mechanism, then, is that it not introduce substantially more uncertainty into the estimate. Specifically, we compare the noise added in order to maintain privacy to the standard error of the (nonprivate) OLS estimate. Our experiments indicate that for a wide range of realistic datasets and moderate values of the privacy parameter $\varepsilon$, it is possible to choose a DP linear regression algorithm that introduces distortion less than the standard error. In particular, in our motivating use-case of the Opportunity Atlas (Chetty et al., 2018), described below, we design a differentially private algorithm that matches or outperforms the heuristic method currently deployed, which does not formally satisfy differential privacy.

1.1. Problem Setup

1.1.1. Simple Linear Regression

In this paper we consider the most common model of linear regression: one-dimensional linear regression with homoscedastic noise. This model is defined by a slope $\alpha \in \mathbb{R}$, an intercept $\beta \in \mathbb{R}$, a noise distribution $F_e$ with mean 0 and variance $\sigma_e^2$, and the relationship

$\textbf{y} = \alpha\cdot\textbf{x} + \beta + \textbf{e}$

where $\textbf{x}, \textbf{y} \in \mathbb{R}^n$. Our data consist of $n$ pairs $(x_i, y_i)$, where $x_i$ is the explanatory variable, $y_i$ is the response variable, and each random variable $e_i$ is drawn i.i.d. from the distribution $F_e$. The goal of simple linear regression is to estimate $\alpha$ and $\beta$ given the data $\{(x_1, y_1), \ldots, (x_n, y_n)\}$.
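For concreteness, the data-generating process can be sketched as follows. This is a minimal illustration, not code from the paper: we take Gaussian noise as one choice of $F_e$, and sampling the $x_i$ uniformly from $[0,1]$ is our own assumption (the model itself places no distribution on $\textbf{x}$).

```python
import random

def generate_data(n, alpha, beta, sigma_e, seed=None):
    """Draw n points from y_i = alpha * x_i + beta + e_i, with Gaussian
    noise e_i ~ N(0, sigma_e^2) as one choice of F_e (the model fixes
    only the mean and variance of F_e).  The x_i are sampled uniformly
    from [0, 1] purely for illustration."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [alpha * x + beta + rng.gauss(0.0, sigma_e) for x in xs]
    return xs, ys
```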

Let $\textbf{x} = (x_1, \ldots, x_n)^T$ and $\textbf{y} = (y_1, \ldots, y_n)^T$. The Ordinary Least Squares (OLS) solution to the simple linear regression problem is the solution to the following optimization problem:

(1) $(\hat{\alpha}, \hat{\beta}) = \operatorname{argmin}_{\alpha,\beta} \|\textbf{y} - \alpha\textbf{x} - \beta\textbf{1}\|_2.$

It is the most commonly used solution in practice since it is the maximum likelihood estimator when $F_e$ is Gaussian, and it has a closed-form solution. We define the following empirical quantities:

$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i,$
$\text{ncov}(\textbf{x}, \textbf{y}) = \langle \textbf{x} - \bar{x}\mathbf{1},\, \textbf{y} - \bar{y}\mathbf{1} \rangle,$
$\text{nvar}(\textbf{x}) = \langle \textbf{x} - \bar{x}\mathbf{1},\, \textbf{x} - \bar{x}\mathbf{1} \rangle = n\cdot\text{var}(\textbf{x}).$

Then

(2) $\hat{\alpha} = \frac{\text{ncov}(\textbf{x},\textbf{y})}{\text{nvar}(\textbf{x})} \quad\text{and}\quad \hat{\beta} = \bar{y} - \hat{\alpha}\bar{x}.$
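The closed-form OLS estimates above can be computed directly; a minimal Python sketch (the function name is ours):

```python
def ols(xs, ys):
    """OLS slope and intercept via the closed form:
    alpha_hat = ncov(x, y) / nvar(x),  beta_hat = ybar - alpha_hat * xbar."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    ncov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    nvar = sum((x - xbar) ** 2 for x in xs)
    alpha_hat = ncov / nvar
    beta_hat = ybar - alpha_hat * xbar
    return alpha_hat, beta_hat
```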

In this paper, we focus on predicting the (mean of the) response variable $y$ at a single value of the explanatory variable $x$. For $x_{new} \in [0,1]$, the prediction at $x_{new}$ is defined as:

$p_{x_{new}} = \alpha x_{new} + \beta.$

Let $\widehat{p}_{x_{new}}$ be the estimate of the prediction at $x_{new}$ computed using the OLS estimates $\hat{\alpha}$ and $\hat{\beta}$. The quantity $\widehat{p}_{x_{new}}$ is a random variable, where the randomness is due to the sampling process. The standard error $\widehat{\sigma}(\widehat{p}_{x_{new}})$ is an estimate of the standard deviation of $\widehat{p}_{x_{new}}$. If we assume that the noise $F_e$ is Gaussian, then we can compute the standard error as follows (see (Yan and Su, 2009), for example):

(3) $\widehat{\sigma}(\widehat{p}_{x_{new}}) = \frac{\|\textbf{y} - \hat{\alpha}\textbf{x} - \hat{\beta}\textbf{1}\|_2}{\sqrt{n-2}} \sqrt{\frac{1}{n} + \frac{(x_{new} - \bar{x})^2}{n\text{var}(\textbf{x})}}.$

It can be shown that the variance of $(\widehat{p}_{x_{new}} - p_{x_{new}})/\widehat{\sigma}(\widehat{p}_{x_{new}})$ approaches $1$ as $n$ increases.
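The standard error of Eq. (3) can be sketched in a few lines of Python (the function name is ours; the OLS estimates are taken as inputs):

```python
import math

def prediction_se(xs, ys, alpha_hat, beta_hat, x_new):
    """Standard error of the OLS prediction at x_new, per Eq. (3),
    under the Gaussian-noise assumption on F_e."""
    n = len(xs)
    xbar = sum(xs) / n
    # residual sum of squares ||y - alpha_hat*x - beta_hat*1||^2
    rss = sum((y - alpha_hat * x - beta_hat) ** 2 for x, y in zip(xs, ys))
    nvar = sum((x - xbar) ** 2 for x in xs)
    return math.sqrt(rss / (n - 2)) * math.sqrt(1.0 / n + (x_new - xbar) ** 2 / nvar)
```

On a perfectly linear dataset the residuals vanish, so the standard error is zero, matching the formula.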

1.1.2. Differential Privacy

In this section, we define the notion of privacy that we are using: differential privacy (DP). Since our algorithms often include hyperparameters, we include a definition of DP for algorithms that take as input not only the dataset, but also the desired privacy parameters and any required hyperparameters. Let $\mathcal{X}$ be a data universe and $\mathcal{X}^n$ be the space of datasets. Two datasets $d, d' \in \mathcal{X}^n$ are neighboring, denoted $d \sim d'$, if they differ on a single record. Let $\mathcal{H}$ be a hyperparameter space and $\mathcal{Y}$ be an output space.

Definition 1 ($(\varepsilon,\delta)$-Differential Privacy (Dwork et al., 2006)).

A randomized mechanism $M: \mathcal{X}^n \times \mathbb{R}_{\geq 0} \times [0,1] \times \mathcal{H} \rightarrow \mathcal{Y}$ is differentially private if for all datasets $d \sim d' \in \mathcal{X}^n$, privacy-loss parameters $\varepsilon \geq 0$, $\delta \in [0,1]$, $\text{hyperparams} \in \mathcal{H}$, and events $E$,

$\Pr[M(d, \varepsilon, \delta, \text{hyperparams}) \in E] \leq e^{\varepsilon}\cdot\Pr[M(d', \varepsilon, \delta, \text{hyperparams}) \in E] + \delta,$

where the probabilities are taken over the random coins of MM.

For strong privacy guarantees, the privacy-loss parameter is typically taken to be a small constant less than $1$ (note that $e^{\varepsilon} \approx 1 + \varepsilon$ as $\varepsilon \rightarrow 0$), but we will sometimes consider larger constants such as $\varepsilon = 8$ to match our motivating application (described in Section 1.2).

The key intuition for this definition is that the distribution of outputs on input dataset $d$ is almost indistinguishable from the distribution of outputs on input dataset $d'$. Therefore, given the output of a differentially private mechanism, it is impossible to confidently determine whether the input dataset was $d$ or $d'$.

1.1.3. Other Notation

The DP estimates will be denoted by $\widetilde{p}_{x_{new}}$ and the OLS estimates by $\widehat{p}_{x_{new}}$. We will focus on data with values bounded between 0 and 1, so $0 \leq x_i, y_i \leq 1$ for $i = 1, \ldots, n$.

We will be primarily concerned with the predicted values at $x_{new} = 0.25$ and $0.75$, which for ease of notation we denote as $p_{25}$ and $p_{75}$, respectively. Correspondingly, we will use $\widehat{p}_{25}, \widehat{p}_{75}$ to denote the OLS estimates of the predicted values and $\widetilde{p}_{25}, \widetilde{p}_{75}$ to denote the DP estimates. Our use of the 25th and 75th percentiles is motivated by the Opportunity Atlas tool (Chetty et al., 2018), described in Section 1.2, which releases estimates of $p_{25}$ and $p_{75}$ for certain regressions done for every census tract in each state.

1.1.4. Error Metric

We will be concerned primarily with empirical performance measures. In particular, we will restrict our focus to high probability error bounds that can be accurately computed empirically through Monte Carlo experiments. The question of providing tight theoretical error bounds for DP linear regression is an important one, which we leave to future work. Since the relationship between the OLS estimate $\widehat{p}_{x_{new}}$ and the true value $p_{x_{new}}$ is well-understood, we focus on measuring the difference between the private estimate $\widetilde{p}_{x_{new}}$ and $\widehat{p}_{x_{new}}$, which is something we can measure experimentally on real datasets. Specifically, we define the prediction error at $x_{new}$ to be $|\widetilde{p}_{x_{new}} - \widehat{p}_{x_{new}}|$.

For a dataset $d$, $x_{new} \in [0,1]$, and $q \in [0,100]$, we define the $q\%$ error bound as

$C(q)(d) = \min\left\{c : \mathbb{P}(|\widetilde{p}_{x_{new}} - \widehat{p}_{x_{new}}| \leq c) \geq \frac{q}{100}\right\},$

where the dataset dd is fixed, and the probability is taken over the randomness in the DP algorithm.

We empirically estimate $C(q)$ by running many trials of the algorithm on the same dataset $d$:

$\hat{C}(q)(d) = \min\left\{c : \text{for at least } q\% \text{ of trials, } |\widetilde{p}_{x_{new}} - \widehat{p}_{x_{new}}| \leq c\right\}.$

We term $\hat{C}(q)(d)$ the $q\%$ empirical error bound. We will often drop the reference to $d$ from the notation. This error metric only accounts for the randomness in the algorithm, not the sampling error.
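Computing the empirical error bound from Monte Carlo trials amounts to taking an order statistic of the observed errors; a sketch (the function name and the ceiling-index convention are ours):

```python
import math

def empirical_error_bound(dp_errors, q):
    """q% empirical error bound: the smallest c such that at least q% of
    trials satisfy |p_tilde - p_hat| <= c.  dp_errors holds the absolute
    prediction errors from repeated runs of the DP algorithm on one
    fixed dataset."""
    errs = sorted(dp_errors)
    m = len(errs)
    # smallest index k (0-based) with (k + 1) / m >= q / 100
    k = math.ceil(q / 100.0 * m) - 1
    return errs[max(k, 0)]
```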

When the ground truth is known (e.g., for synthetically generated data), we can compute error bounds compared to the ground truth, rather than to the non-private OLS estimate. So, let $C_{\text{true}}(q)(d)$ and $\hat{C}_{\text{true}}(q)(d)$ be defined analogously to the error bounds above, except that the prediction error is measured as $|\widetilde{p}_{x_{new}} - p_{x_{new}}|$. This error metric accounts for the randomness in both the sampling and the algorithm. When we say the noise added for privacy is less than the sampling error, we are referring to the technical statement that $\hat{C}(68)$ is less than the standard error, $\widehat{\sigma}(\widehat{p}_{x_{new}})$.

1.2. Motivating Use-Case: Opportunity Atlas

The Opportunity Atlas, designed and deployed by the economics research group Opportunity Insights, is an interactive tool designed to study the link between the neighbourhood a child grows up in and their prospects for economic mobility (Chetty et al., 2018). The tool provides valuable insights to researchers and policy-makers, and is built from Census data, protected under Title 13 and authorized by the Census Bureau’s Disclosure Review Board, linked with federal income tax returns from the US Internal Revenue Service. The Atlas provides individual statistics on each census tract in the country, with tract data often being refined by demographics to contain only a small subset of the individuals who live in that tract; for example, black male children with low parental income. The resulting datasets typically contain 100 to 400 datapoints, but can be as small as 30 datapoints. A separate regression estimate is released in each of these small geographic areas. The response variable $y_i \in [0,1]$ is the child’s income percentile at age 35 and the explanatory variable $x_i \in [0,1]$ is the parent’s income percentile, each with respect to the national income distribution. The coefficient $\alpha$ in the model $y_i = \alpha\cdot x_i + \beta + e_i$ is a measure of economic mobility for that particular Census tract and demographic. The small size of the datasets used in the Opportunity Atlas is the result of Chetty et al.’s motivation to study inequality at the neighbourhood level: “the estimates permit precise targeting of policies to improve economic opportunity by uncovering specific neighborhoods where certain subgroups of children grow up to have poor outcomes. Neighborhoods matter at a very granular level: conditional on characteristics such as poverty rates in a child’s own Census tract, characteristics of tracts that are one mile away have little predictive power for a child’s outcomes” (Chetty et al., 2018).

1.3. Robustness and DP Algorithm Design

Simple linear regression is one of the most fundamental statistical tasks with well-understood convergence properties in the non-private literature. However, finding a differentially private estimator for this task that is accurate across a range of datasets and parameter regimes is surprisingly nuanced. As a first attempt, one might consider the global sensitivity (Dwork et al., 2006):

Definition 2 (Global Sensitivity).

For a query $q: \mathcal{X}^n \rightarrow \mathbb{R}^k$, the global sensitivity is

$GS_q = \max_{\textbf{x} \sim \textbf{x}'} \|q(\textbf{x}) - q(\textbf{x}')\|_1.$
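As an illustration of how global sensitivity is used downstream: adding Laplace noise with scale $GS_q/\varepsilon$ to a query value yields an $(\varepsilon, 0)$-DP release. A hedged sketch (Python; sampling Laplace noise as a difference of two exponentials is our implementation choice, not taken from the paper):

```python
import random

def laplace_mechanism(q_value, gs_q, epsilon, rng=None):
    """Release q(x) + Lap(0, GS_q / epsilon).  q_value is the query
    answer q(x) computed non-privately, and gs_q must be a valid upper
    bound on the global sensitivity GS_q.  Note this cannot be applied
    directly to p25/p75, whose global sensitivity is infinite."""
    rng = rng or random.Random()
    scale = gs_q / epsilon
    # Laplace(0, scale) as the difference of two independent exponentials
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return q_value + noise
```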

One can create a differentially private mechanism by adding noise proportional to $GS_q/\varepsilon$. Unfortunately, the global sensitivities of $p_{25}$ and $p_{75}$ are both infinite (even though we consider bounded $\textbf{x}, \textbf{y} \in [0,1]^n$, the point estimates $p_{25}$ and $p_{75}$ are unbounded). For the type of datasets that we typically see in practice, however, changing one datapoint does not result in a major change in the point estimates. For such datasets, where the point estimates are reasonably stable, one might hope to take advantage of the local sensitivity:

Definition 3 (Local Sensitivity (Nissim et al., 2007)).

The local sensitivity of a query $q: \mathcal{X}^n \rightarrow \mathbb{R}^k$ with respect to a dataset $\textbf{x}$ is

$LS_q(\textbf{x}) = \max_{\textbf{x}' : \textbf{x} \sim \textbf{x}'} \|q(\textbf{x}) - q(\textbf{x}')\|_1.$

Adding noise proportional to the local sensitivity is typically not differentially private, since the local sensitivity itself can reveal information about the underlying dataset. To get around this, one could try to add noise that is larger, but not too much larger, than the local sensitivity. As DP requires that the amount of noise added cannot depend too strongly on the particular dataset, DP mechanisms of this flavor often involve calculating the local sensitivity on neighbouring datasets. So far, we have been unable to design a computationally feasible algorithm for performing the necessary computations for OLS. Furthermore, computationally feasible upper bounds for the local sensitivity have, so far, proved too loose to be useful.

The Opportunity Insights (OI) algorithm takes a heuristic approach by adding noise proportional to a non-private, heuristic upper bound on the local sensitivity of data from tracts in any given state. However, their heuristic approach does not satisfy the formal requirements of differential privacy, leaving open the possibility that there is a realistic attack.

The OI algorithm incorporates a “winsorization” step in their estimation procedure (e.g., dropping the bottom and top 10% of data values). This sometimes has the effect of greatly reducing the local sensitivity (and also their upper bound on it) due to the possible removal of outliers. This suggests that for finding an effective differentially private algorithm, we should consider differentially private analogues of robust linear regression methods rather than of OLS. Specifically, we consider Theil-Sen, a robust estimator for linear regression proposed by Theil (Theil, 1950) and further developed by Sen (Sen, 1968). Similar to the way in which the median is less sensitive to changes in the data than the mean, the Theil-Sen estimator is more robust to changes in the data than OLS.
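The non-private Theil-Sen estimator (at prediction point 0.25) can be sketched as follows. The function name is ours; the pairwise prediction formula is the one the paper uses for $z^{(p25)}_{j,l}$ in its pseudocode:

```python
from itertools import combinations
from statistics import median

def theil_sen_p25(xs, ys):
    """Non-private Theil-Sen estimate of p25, the prediction at
    x_new = 0.25: for every pair of points, fit the line through them,
    evaluate it at 0.25, and return the median of those estimates."""
    estimates = []
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 == x2:
            continue  # a vertical pair defines no slope
        s = (y2 - y1) / (x2 - x1)
        # line through the pair, evaluated at x_new = 0.25
        estimates.append(s * (0.25 - (x1 + x2) / 2) + (y1 + y2) / 2)
    return median(estimates)
```

Replacing the final `median` call with a differentially private median algorithm is exactly the route the DP variants below take.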

In this work, we consider three differentially private algorithms based on both robust and non-robust methods:

  • NoisyStats is the DP mechanism that most closely mirrors OLS. It involves perturbing the sufficient statistics $\text{ncov}(\textbf{x},\textbf{y})$ and $\text{nvar}(\textbf{x})$ from the OLS computation. This algorithm is inspired by the “Analyze Gauss” technique (Dwork et al., 2014), although the noise we add ensures pure differential privacy rather than approximate differential privacy. NoisyStats has two main benefits: it is as computationally efficient as its non-private analogue, and it allows us to release DP versions of the sufficient statistics with no extra privacy cost.

  • DPTheilSen is a DP version of Theil-Sen. The non-private estimator computes the $p_{25}$ estimates based on the lines defined by all pairs of datapoints $(x_i, y_i), (x_j, y_j) \in [0,1]^2$ for all $i \neq j \in [n]$, then outputs the median of these pairwise estimates. To create a differentially private version, we replace the median computation with a differentially private median algorithm. We will consider three DP versions of this algorithm which use different DP median algorithms: DPExpTheilSen, DPWideTheilSen, and DPSSTheilSen. We also consider more computationally efficient variants that pair points according to one or more random matchings, rather than using all $\binom{n}{2}$ pairs. A DP algorithm obtained by using one matching was previously considered by Dwork and Lei (Dwork and Lei, 2009) (their “Short-Cut Regression Algorithm”). Our algorithms can be viewed as updated versions, reflecting improvements in DP median estimation since (Dwork and Lei, 2009), as well as incorporating benefits accrued by considering more than one matching or all $\binom{n}{2}$ pairs.

  • DPGradDescent is a DP mechanism that uses DP gradient descent to solve the convex optimization problem that defines OLS: $\operatorname{argmin}_{\alpha,\beta} \|\textbf{y} - \alpha\textbf{x} - \beta\mathbf{1}\|_2$. We use the private stochastic gradient descent technique proposed by Bassily et al. (Bassily et al., 2014). Versions that satisfy pure, approximate, and zero-concentrated differential privacy are considered.

1.4. Our Results

Our experiments indicate that for a wide range of realistic datasets, and moderate values of $\varepsilon$, it is possible to choose a DP linear regression algorithm where the error due to privacy is less than the standard error. In particular, in our motivating use-case of the Opportunity Atlas, we can design a differentially private algorithm that outperforms the heuristic method used by the Opportunity Insights team. This is promising, since the error added by the heuristic method was deemed acceptable by the Opportunity Insights team for deployment of the tool, and for use by policy makers. One particular differentially private algorithm of the robust variety, called DPExpTheilSen, emerges as the best algorithm in a wide variety of settings for this small-dataset regime.

Our experiments reveal three main settings where an analyst should consider alternate algorithms:

  • When $\varepsilon\,\text{nvar}(\textbf{x})$ is large and $\sigma_e^2$ is large, a DP algorithm NoisyStats that simply perturbs the Ordinary Least Squares (OLS) sufficient statistics, $\text{nvar}(\textbf{x})$ and $\text{ncov}(\textbf{x},\textbf{y})$, performs well. This algorithm is preferable in this setting since it is more computationally efficient, and allows for the release of noisy sufficient statistics without any additional privacy loss.

  • When the standard error $\widehat{\sigma}(\widehat{p}_{x_{new}})$ is very small, DPExpTheilSen can perform poorly. In this setting, one should switch to a different DP estimator based on Theil-Sen. We give two potential alternatives, which are both useful in different situations.

  • The algorithm DPExpTheilSen requires as input a range in which to search for the output predicted value. If this range is very large and $\varepsilon$ is small ($\varepsilon \ll 1$), then DPExpTheilSen can perform poorly, so it is good to use other algorithms like NoisyStats that do not require a range for the output (but do require that the input variables $x_i$ and $y_i$ are bounded, which is not required for DPExpTheilSen).

The quantity $\varepsilon\,\text{nvar}(\textbf{x})$ is connected to the size of the dataset, how concentrated the independent variable of the data is, and how private the mechanism is. Experimentally, this quantity has proved to be a good indicator of the relative performance of the DP algorithms. Roughly, when $\varepsilon\,\text{nvar}(\textbf{x})$ is small, the OLS estimate can be very sensitive to small changes in the data, and thus we recommend switching to differentially private versions of the Theil-Sen estimator. In the opposite regime, when $\varepsilon\,\text{nvar}(\textbf{x})$ is large, NoisyStats typically suffices and is a simple non-robust method to adopt in practice. In this regime the additional noise added for privacy by NoisyStats can be less than the difference between the non-private OLS and Theil-Sen point estimates. The other non-robust algorithm we consider, DPGradDescent, may be a suitable replacement for NoisyStats depending on the privacy model used (i.e., pure, approximate, or zero-concentrated DP). Our comparison of NoisyStats and DPGradDescent is not comprehensive, but we find that any performance advantages of DPGradDescent over NoisyStats appear to be small in the regime where the non-robust methods outperform the robust ones. NoisyStats is simpler and has fewer hyperparameters, however, so we find that it may be preferable in practice.

In addition to the quantity $\varepsilon\,\text{nvar}(\textbf{x})$, the magnitude of noise in the dependent variable affects the relative performance of the algorithms. When the dependent variable is not very noisy (i.e., $\sigma_e^2$ is small), the Theil-Sen-based estimators perform better since they are better at leveraging a strong linear signal in the data.

These results show that even in this one-dimensional setting, the story is already quite nuanced. Indeed, which algorithm performs best depends on properties of the dataset, such as $\text{nvar}(\textbf{x})$, which cannot be directly used without violating differential privacy. So, one has to make the choice based on guesses (e.g., using similar public datasets) or develop differentially private methods for selecting the algorithm, a problem which we leave to future work. Moreover, most of our methods come with hyperparameters that govern their behavior. How to optimally choose these parameters is still an open problem. In addition, we focus on outputting accurate point estimates, rather than confidence intervals. Computing confidence intervals is an important direction for future work.

1.5. Other Related Work

Linear regression is one of the most prevalent statistical methods in the social sciences, and hence has been studied previously in the differential privacy literature. These works have included both theoretical analysis and experimental exploration, with the majority of work focusing on large datasets.

One of our main findings, that robust estimators perform better than parametric estimators in the differentially private setting even when the data come from a parametric model, corroborates insights by (Dwork and Lei, 2009) with regard to the connection between robust statistics and differential privacy, and by (Couch et al., 2019) in the context of hypothesis testing.

Other systematic studies of DP linear regression have been performed by Sheffet (Sheffet, 2017) and Wang (Wang, 2018). Sheffet (Sheffet, 2017) considered differentially private ordinary least squares methods and estimated confidence intervals for the regression coefficients. He assumes normality of the explanatory variables, while we do not make any distributional assumptions on our covariates.

Private linear regression in the high-dimensional setting is studied by Cai et al. (Cai et al., 2019) and Wang (Wang, 2018). Wang (Wang, 2018) considered private ridge regression, where the ridge parameter is adaptively and privately chosen, using techniques similar to output perturbation (Chaudhuri et al., 2011). These papers present methods and experiments for high-dimensional data (the datasets used contain at least 13 explanatory variables), whereas we are concerned with the one-dimensional setting. We find that even in the one-dimensional setting, the choice of optimal algorithm is already quite complex.

A Bayesian approach to DP linear regression is taken by Bernstein and Sheldon (Bernstein and Sheldon, 2019). Their method uses sufficient statistic perturbation (similar to our NoisyStats algorithm) for private release and Markov Chain Monte Carlo sampling over a posterior of the regression coefficients. Their Bayesian approach can produce tight credible intervals for the regression coefficients, but unlike ours, requires distributional assumptions on both the regression coefficients and the underlying independent variables. In order to make private releases, we assume the data is bounded but make no other distributional assumptions on the independent variables.

2. Algorithms

In this section we detail the practical differentially private algorithms we will evaluate experimentally. Pseudocode for efficient implementations of each algorithm described can be found in the Appendix, and real code can be found in our GitHub repository. We will assume throughout that $(x_i, y_i) \in [0,1]^2$, as in the Opportunity Atlas, where this is achieved by preprocessing the data. While we would ideally ensure differentially private preprocessing of the data, we consider this outside the scope of this work.

2.1. NoisyStats

In NoisyStats (Algorithm 1), we add Laplace noise, with standard deviation approximately $1/\varepsilon$, to the OLS sufficient statistics $\text{ncov}(\textbf{x},\textbf{y})$ and $\text{nvar}(\textbf{x})$, and then use the noisy sufficient statistics to compute the predicted values. Note that this algorithm fails if the denominator of the OLS estimator, the noisy version of $\text{nvar}(\textbf{x})$, becomes 0 or negative, in which case we output $\perp$ (failure). The probability of failure decreases as $\varepsilon$ or $\text{nvar}(\textbf{x})$ increases. NoisyStats is the simplest and most efficient algorithm that we will study. In addition, the privacy guarantee is maintained even if we additionally release the noisy statistics $\text{ncov}(\textbf{x},\textbf{y}) + L_1$ and $\text{nvar}(\textbf{x}) + L_2$, which may be of independent interest to researchers. We also note that the algorithm is biased due to dividing by a Laplace random variable centered at $\text{nvar}(\textbf{x})$.

Data: $\{(x_i, y_i)\}_{i=1}^{n} \in ([0,1]\times[0,1])^n$
Privacy params: $\varepsilon$
Hyperparams: n/a
Define $\Delta_1 = \Delta_2 = (1 - 1/n)$
Sample $L_1 \sim \text{Lap}(0, 3\Delta_1/\varepsilon)$
Sample $L_2 \sim \text{Lap}(0, 3\Delta_2/\varepsilon)$
if $\text{nvar}(\textbf{x}) + L_2 > 0$ then
       $\tilde{\alpha} = \frac{\text{ncov}(\textbf{x},\textbf{y}) + L_1}{\text{nvar}(\textbf{x}) + L_2}$
       $\Delta_3 = \frac{1}{n}(1 + |\tilde{\alpha}|)$
       Sample $L_3 \sim \text{Lap}(0, 3\Delta_3/\varepsilon)$
       $\tilde{\beta} = (\bar{y} - \tilde{\alpha}\bar{x}) + L_3$
       $\widetilde{p}_{25} = 0.25\cdot\tilde{\alpha} + \tilde{\beta}$
       $\widetilde{p}_{75} = 0.75\cdot\tilde{\alpha} + \tilde{\beta}$
       return $\widetilde{p}_{25}, \widetilde{p}_{75}$
else
       return $\perp$
Algorithm 1 NoisyStats: $(\varepsilon, 0)$-DP Algorithm
Lemma 1.

Algorithm 1 (NoisyStats) is $(\varepsilon, 0)$-DP.
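For illustration, a direct Python transcription of Algorithm 1 might look as follows. Sampling Laplace noise as a difference of two exponentials is our implementation choice, and `None` stands in for the failure output $\perp$:

```python
import random

def noisy_stats(xs, ys, epsilon, rng=None):
    """Python transcription of Algorithm 1 (NoisyStats).  Returns DP
    estimates (p25, p75), or None on the failure event where the noisy
    nvar(x) is non-positive."""
    rng = rng or random.Random()
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    ncov = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    nvar = sum((x - xbar) ** 2 for x in xs)

    def lap(scale):
        # Laplace(0, scale) as a difference of two independent exponentials
        return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

    delta = 1.0 - 1.0 / n                      # Delta_1 = Delta_2
    l1 = lap(3.0 * delta / epsilon)
    l2 = lap(3.0 * delta / epsilon)
    if nvar + l2 <= 0:
        return None                            # output "bottom" (failure)
    alpha = (ncov + l1) / (nvar + l2)
    delta3 = (1.0 + abs(alpha)) / n            # Delta_3 depends on alpha
    beta = (ybar - alpha * xbar) + lap(3.0 * delta3 / epsilon)
    return 0.25 * alpha + beta, 0.75 * alpha + beta
```

With a very large $\varepsilon$ the noise is negligible and the output approaches the non-private OLS predictions, which is a useful sanity check.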

2.2. DP TheilSen

The standard Theil-Sen estimator is a robust estimator for linear regression. It computes the $p_{25}$ estimates based on the lines defined by all pairs of datapoints $(x_i, y_i), (x_j, y_j) \in [0,1]^2$ for all $i \neq j \in [n]$, then outputs the median of these pairwise estimates. To create a differentially private version, we can replace the median computation with a differentially private median algorithm. We implement this approach using three DP median algorithms: two based on the exponential mechanism (McSherry and Talwar, 2007) and one based on the smooth sensitivity of (Nissim et al., 2007) and the noise distributions of (Bun and Steinke, 2019).

In the “complete” version of Theil-Sen, all pairwise estimates are included in the final median computation. A similar algorithm can be run on the point estimates computed using $k$ random matchings of the $(x_i, y_i)$ pairs. The case $k = 1$ amounts to the differentially private “Short-cut Regression Algorithm” proposed by Dwork and Lei (Dwork and Lei, 2009). This results in a more computationally efficient algorithm.

Furthermore, while in the non-private setting $k = n-1$ is the optimal choice, this is not necessarily true when we incorporate privacy. Each datapoint affects $k$ of the entries in $\textbf{z}^{(p25)}$ (using the notation from Algorithm 2) out of the $kn/2$ entries in total. In some private algorithms, the amount of noise added for privacy is a function of the fraction of points in $\textbf{z}^{(p25)}$ influenced by each $(x_i, y_i)$, which is independent of $k$, meaning that one should always choose $k$ as large as is computationally feasible. However, this is not always the case. Increasing $k$ can result in more noise being added for privacy, creating a trade-off between decreasing the noise added for privacy (with smaller $k$) and decreasing the non-private error (with larger $k$).

While intuitively, for small k, one can think of using k random matchings, we will actually ensure that no edge is included twice. We say that the permutations τ_1, …, τ_{n−1} are a decomposition of the complete graph on n vertices, K_n, into n−1 matchings if ∪_k Σ_k = {(x_i, x_j) | i ≠ j ∈ [n]}, where Σ_k = {(x_{τ_k(1)}, x_{τ_k(2)}), …, (x_{τ_k(n−1)}, x_{τ_k(n)})}. We will focus mainly on k = n−1, which uses all pairs of points and which we refer to simply as DPTheilSen, and on k = 1, which we refer to as DPTheilSenMatch. For any other k, we denote the algorithm DPTheilSenkMatch; its pseudo-code is given in Algorithm 2. In the following subsections we discuss the different differentially private median algorithms we use as subroutines.
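For concreteness, a decomposition of K_n into n−1 perfect matchings (for even n) can be generated with the classical round-robin (circle) method. The following is a minimal sketch; the function name and 0-based vertex labeling are our own, not the paper's:

```python
def matchings(n):
    """Decompose K_n (n even) into n-1 perfect matchings via the
    round-robin (circle) method: vertex n-1 stays fixed while the
    other n-1 vertices rotate; each round pairs opposite positions."""
    assert n % 2 == 0
    others = list(range(n - 1))
    rounds = []
    for _ in range(n - 1):
        row = others + [n - 1]
        # pair row[i] with row[n-1-i] for the first n/2 positions
        rounds.append([(row[i], row[n - 1 - i]) for i in range(n // 2)])
        others = others[-1:] + others[:-1]  # rotate the moving vertices
    return rounds
```

Each of the n−1 rounds is a perfect matching, and together they cover each of the n(n−1)/2 edges exactly once.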

Lemma 2.

If DPmed(\textbf{z}^{(p25)},\varepsilon,(n,k,\text{hyperparameters}))=\mathcal{M}(\textbf{z}^{(p25)},\text{hyperparameters}) for some (\varepsilon/k,0)(\varepsilon/k,0)-DP mechanism \mathcal{M}, then Algorithm 2 (DPTheilSenkMatch) is (ε,0)(\varepsilon,0)-DP.

Data: {(x_i, y_i)}_{i=1}^{n} ∈ ([0,1] × [0,1])^n
Privacy params: ε
Hyperparams: n, k, DPmed, hyperparams
z^(p25), z^(p75) = [ ]
Let τ_1, …, τ_{n−1} be a decomposition of K_n into matchings.
for k iterations do
      Sample (without replacement) h ∈ [n−1]
      for i = 0, 2, 4, …, with i < n−1 do
            j = τ_h(i)
            l = τ_h(i+1)
            if x_l − x_j ≠ 0 then
                  s = (y_l − y_j) / (x_l − x_j)
                  z^(p25)_{j,l} = s(0.25 − (x_l + x_j)/2) + (y_l + y_j)/2
                  z^(p75)_{j,l} = s(0.75 − (x_l + x_j)/2) + (y_l + y_j)/2
                  Append z^(p25)_{j,l} to z^(p25) and z^(p75)_{j,l} to z^(p75)
p̃_25 = DPmed(z^(p25), ε, (n, k, hyperparams))
p̃_75 = DPmed(z^(p75), ε, (n, k, hyperparams))
return p̃_25, p̃_75
Algorithm 2 DPTheilSenkMatch: (ε, 0)-DP Algorithm
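The inner loop of Algorithm 2 can be sketched as follows. This is a simplified illustration (names ours): matchings are drawn by shuffling indices rather than from a fixed decomposition of K_n, so repeated edges across matchings are possible.

```python
import random

def pairwise_estimates(xs, ys, k, seed=0):
    """Compute the z^(p25) and z^(p75) lists of Algorithm 2 from k
    matchings of the datapoints (matchings drawn here by shuffling)."""
    rng = random.Random(seed)
    n = len(xs)
    z25, z75 = [], []
    for _ in range(k):
        idx = list(range(n))
        rng.shuffle(idx)
        for i in range(0, n - 1, 2):
            j, l = idx[i], idx[i + 1]
            if xs[l] - xs[j] != 0:
                # line through the pair, evaluated at x = 0.25 and 0.75
                s = (ys[l] - ys[j]) / (xs[l] - xs[j])
                mid_x, mid_y = (xs[l] + xs[j]) / 2, (ys[l] + ys[j]) / 2
                z25.append(s * (0.25 - mid_x) + mid_y)
                z75.append(s * (0.75 - mid_x) + mid_y)
    return z25, z75
```

On a dataset lying exactly on the line y = 0.5x + 0.2, every pairwise estimate equals the true predictions 0.325 and 0.575.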

2.2.1. DP Median using the Exponential Mechanism

The first differentially private algorithm for the median that we will consider is an instantiation of the exponential mechanism (McSherry and Talwar, 2007), a differentially private algorithm designed for general optimization problems. The exponential mechanism is defined with respect to a utility function uu, which maps (dataset, output) pairs to real values. For a dataset z, the mechanism aims to output a value rr that maximizes u(z,r)u(\textbf{z},r).

Definition 3 (Exponential Mechanism (McSherry and Talwar, 2007)).

Given dataset zn\textbf{z}\in{\mathbb{R}}^{n} and the range of the outputs, [rl,ru][r_{l},r_{u}], the exponential mechanism outputs r[rl,ru]r\in[r_{l},r_{u}] with probability proportional to exp(εu(z,r)2GSu)\exp\left(\frac{\varepsilon u(\textbf{z},r)}{2GS_{u}}\right), where

GSu=maxr[rl,ru]maxz,zneighbors|u(z,r)u(z,r)|.GS_{u}=\max_{r\in[r_{l},r_{u}]}\max_{\textbf{z},\textbf{z}^{\prime}\text{neighbors}}|u(\textbf{z},r)-u(\textbf{z}^{\prime},r)|.

One way to instantiate the exponential mechanism to compute the median is by using the following utility function. Let

u(z,r)=|#above r#below r|u(\textbf{z},r)=-\left|\#\text{above }r-\#\text{below }r\right|

where #above rr and #below rr denote the number of datapoints in z that are above and below rr in value respectively, not including rr itself. An example of the shape of the output distribution of this algorithm is given in Figure 1. An efficient implementation is given in the Appendix.

Figure 1. Unnormalized distribution of outputs of the exponential mechanism for differentially privately computing the median of dataset z.
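Because the utility is constant on each interval between consecutive sorted datapoints, the mechanism can be sampled exactly: choose an interval with probability proportional to its width times exp(ε·u/2), then sample uniformly within it. The following is our own sketch (not the paper's appendix implementation); we take the utility's sensitivity to be 1, which corresponds to add/remove-one neighboring, and adjust the exponent accordingly under other conventions.

```python
import bisect, math, random

def dp_median_expmech(z, eps, rl, ru, seed=0):
    """Sample from the exponential mechanism for the median with
    utility u(z, r) = -|#above r - #below r| over output range [rl, ru].
    The utility is piecewise constant between sorted datapoints, so we
    sample an interval, then a uniform point inside it."""
    rng = random.Random(seed)
    zs = sorted(min(max(v, rl), ru) for v in z)  # clamp data to the range
    pts = [rl] + zs + [ru]
    n = len(zs)
    logw = []
    for i in range(n + 1):
        width = pts[i + 1] - pts[i]
        u = -abs(n - 2 * i)  # i points below, n - i above on this interval
        logw.append(-math.inf if width <= 0 else math.log(width) + eps * u / 2)
    m = max(logw)
    w = [math.exp(lw - m) for lw in logw]  # stabilized weights
    t = rng.random() * sum(w)
    acc = 0.0
    for i, wi in enumerate(w):
        acc += wi
        if t <= acc:
            return pts[i] + rng.random() * (pts[i + 1] - pts[i])
    return pts[-1]
```

With a large ε, nearly all the probability mass sits on the interval containing the median, as in the peak of Figure 1.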

We will write DPExpTheilSenkMatch to refer to DPTheilSenkMatch where DPmed is the DP exponential mechanism described above with privacy parameter ε/k\varepsilon/k. Again, we write DPExpTheilSenMatch when k=1k=1 and DPExpTheilSen when k=n1k=n-1.

Lemma 4.

DPExpTheilSenkMatch is (ε,0)(\varepsilon,0)-DP.

2.2.2. DP Median using Widened Exponential Mechanism

When the output space is the real line, the standard exponential mechanism for the median behaves poorly when the data is highly concentrated. For example, imagine in Figure 1 that all the datapoints coincided. In that case, the output distribution of the mechanism is simply uniform on [r_l, r_u], despite the fact that the median of the dataset is very stable. To mitigate this issue, we use a variation on the standard utility function. For a widening parameter θ > 0, the widened utility function is

u(z,r)=min{|#above a#below a|:|ar|θ}u(\textbf{z},r)=-\min\left\{\left|\#\text{above }a-\#\text{below }a\right|\;:\;|a-r|\leq\theta\right\}

where #above aa and #below aa are defined as before. This has the effect of increasing the probability mass around the median, as shown in Figure 2.

Figure 2. Unnormalized distribution of outputs of the θ-widened exponential mechanism for differentially privately computing the median of dataset z.

The parameter θ\theta needs to be carefully chosen. All outputs within θ\theta of the median are given the same utility score, so θ\theta represents a lower bound on the error. Conversely, choosing θ\theta too small may result in the area around the median not being given sufficient weight in the sampled distribution. Note that θ>0\theta>0 is only required when one expects the datasets zp25\textbf{z}^{p25} to be highly concentrated (for example when the dataset has strong linear signal). We defer the question of optimally choosing θ\theta to future work.
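The widened utility can be evaluated exactly by checking finitely many candidate positions a: the two window endpoints, the datapoints inside the window, and the midpoints between consecutive such datapoints (the utility is piecewise constant between datapoints). A sketch, with naming our own:

```python
import bisect

def widened_utility(zs, r, theta):
    """u(z, r) = -min_{|a - r| <= theta} |#above(a) - #below(a)|,
    where points equal to a count as neither above nor below.
    zs must be sorted."""
    inside = [v for v in zs if r - theta <= v <= r + theta]
    # window endpoints, datapoints in the window, and midpoints between them
    cands = ([r - theta, r + theta] + inside +
             [(a + b) / 2 for a, b in zip(inside, inside[1:])])
    n = len(zs)
    return -min(abs((n - bisect.bisect_right(zs, a)) - bisect.bisect_left(zs, a))
                for a in cands)
```

When all n datapoints coincide, the widened utility is 0 for any r within θ of the point mass, whereas the unwidened utility (θ = 0) is −n just off it; this is the widening effect shown in Figure 2.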

An efficient implementation of the θ\theta-widened exponential mechanism for the median can be found in the Appendix as Algorithm 4. We will use DPWideTheilSenkMatch to refer to DPTheilSenkMatch where DPmed is the θ\theta-widened exponential mechanism with privacy parameter ε/k\varepsilon/k. Again, we use
DPWideTheilSenMatch when k=1k=1 and DPWideTheilSen when k=n1k=n-1.

Lemma 5.

DPWideTheilSenkMatch is (ε,0)(\varepsilon,0)-DP.

2.2.3. DP Median using Smooth Sensitivity Noise Addition

The final algorithm we consider for releasing a differentially private median adds noise scaled to the smooth sensitivity – a smooth upper bound on the local sensitivity function. Intuitively, this algorithm should perform well when the datapoints are clustered around the median; that is, when the median is very stable.

Definition 6 (Smooth Upper Bound on LSfLS_{f} (Nissim et al., 2007)).

For t>0t>0, a function Sf,t:𝒳nS_{f,t}:\mathcal{X}^{n}\rightarrow{\mathbb{R}} is a tt-smooth upper bound on the local sensitivity of a function f:𝒳nf:\mathcal{X}^{n}\rightarrow{\mathbb{R}} if:

z𝒳n:LSf(z)Sf,t(z);\forall\textbf{z}\in\mathcal{X}^{n}:LS_{f}(\textbf{z})\leq S_{f,t}(\textbf{z});
z,z𝒳n,d(z,z)=1:Sf,t(z)etSf,t(z).\forall\textbf{z},\textbf{z}^{\prime}\in\mathcal{X}^{n},d(\textbf{z},\textbf{z}^{\prime})=1:S_{f,t}(\textbf{z})\leq e^{t}\cdot S_{f,t}(\textbf{z}^{\prime}).

where d(z,z)d(\textbf{z},\textbf{z}^{\prime}) is the distance between datasets z and z\textbf{z}^{\prime}.

Let 𝒵_k : ([0,1] × [0,1])^n → ℝ^{kn/2} denote the function that transforms a set of point coordinates into the estimates for each pair of points in our k matchings, so that in the notation of Algorithm 2, z^(p25) = 𝒵_k(x, y). The function whose smooth sensitivity we need is med ∘ 𝒵_k, which maps (x, y) to med(z^(p25)). We will use the following smooth upper bound on the local sensitivity:

Lemma 7.

Let z_{1}\leq z_{2}\leq\cdots\leq z_{2m} be a sorting of \mathcal{Z}_{k}(\textbf{x},\textbf{y}). Then

S_{\text{med}\circ\mathcal{Z},t}^{k}((\textbf{x},\textbf{y}))=\max\Big\{z_{m+k}-z_{m},\;z_{m}-z_{m-k},\;\max_{l=1,\ldots,n}\,\max_{s=0,\ldots,k(l+1)}e^{-lt}\big(z_{m+s}-z_{m-k(l+1)+s}\big)\Big\}

is a t-smooth upper bound on the local sensitivity of med\circ\mathcal{Z}_{k}.

Proof.

Proof in the Appendix. ∎

The algorithm then adds noise proportional to S^k_{med∘𝒵,t}((x, y))/ε to med ∘ 𝒵_k(x, y). A further discussion of the formula for S^k_{med∘𝒵,t}((x, y)) and pseudo-code can be found in Appendix E. The noise is sampled from the Student's T distribution. There are several other valid choices of noise distribution (see (Nissim et al., 2007) and (Bun and Steinke, 2019)), but we found the Student's T distribution preferable since the mechanism remains stable across values of ε.

Lemma 8.

DPSSTheilSenkMatch is (ε,0)(\varepsilon,0)-DP.
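A direct (quadratic-time) evaluation of the bound from Lemma 7 can be sketched as follows. The handling of out-of-range order statistics (reading them as assumed output bounds lo and hi) is a boundary convention we chose for this sketch, and the names are our own:

```python
import math

def smooth_sens_median(est, t, k, n, lo, hi):
    """t-smooth upper bound on the local sensitivity of the median of
    the pairwise estimates, following the formula of Lemma 7.
    est: the list z of point estimates; order statistics with indices
    below/above the range are read as lo/hi respectively."""
    zs = sorted(min(max(v, lo), hi) for v in est)
    N = len(zs)
    m = N // 2  # median index (0-based upper median)

    def zi(i):  # order statistic with boundary convention
        if i < 0:
            return lo
        if i >= N:
            return hi
        return zs[i]

    cands = [zi(m + k) - zi(m), zi(m) - zi(m - k)]
    for l in range(1, n + 1):
        for s in range(0, k * (l + 1) + 1):
            cands.append(math.exp(-l * t) * (zi(m + s) - zi(m - k * (l + 1) + s)))
    return max(cands)
```

As the intuition above suggests, the bound is tiny when the estimates are tightly clustered around the median and large when the median is unstable (e.g., bimodal estimates).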

2.3. DP Gradient Descent

Ordinary Least Squares (OLS) for simple 1-dimensional linear regression is defined as the solution to the optimization problem

argminα,βi=1n(yiαxiβ)2.\operatorname*{argmin}_{\alpha,\beta}\sum_{i=1}^{n}(y_{i}-\alpha x_{i}-\beta)^{2}.

There has been an extensive line of work on solving convex optimization problems in a differentially private manner. We use the private gradient descent algorithm of (Bassily et al., 2014) to produce private estimates of the predictions (p_25, p_75) at x = 0.25 and x = 0.75. This algorithm performs standard gradient descent, except that noise is added to a clipped version of the gradient at each round (clipped to the range [−τ, τ]² for some setting of τ > 0). The number of gradient calls must be chosen carefully, since the privacy budget has to be split amongst them. We note that we have not yet fully explored the suite of parameter settings for this method. See the Appendix for pseudo-code for our implementation and (Bassily et al., 2014) for more information on differentially private stochastic gradient descent. We consider three versions of DPGradDescent: DPGDPure, DPGDApprox, and DPGDzCDP, which are (ε, 0)-DP, (ε, δ)-DP, and (ε²/2)-zCDP algorithms, respectively. Zero-concentrated differential privacy (zCDP) is a slightly weaker notion of differential privacy that is well suited to iterative algorithms. For the purposes of this paper, we set δ = 2^{−30} whenever DPGDApprox is run, and set the privacy parameter of DPGDzCDP to allow for a fair comparison. DPGDPure provides the strongest privacy guarantee, followed by DPGDApprox and then DPGDzCDP. As expected, DPGDzCDP typically has the best performance.
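A minimal sketch of the pure-DP flavor of this idea follows. It is our own simplified illustration, not the implementation of (Bassily et al., 2014): it uses full-batch rather than stochastic gradients, clips per-example gradients coordinate-wise to [−τ, τ]², splits ε evenly over the T rounds, and assumes an L1 sensitivity of 4τ/n for the averaged clipped gradient when one example changes.

```python
import math, random

def dp_grad_descent(xs, ys, eps, T=80, tau=1.0, lr=0.2, seed=0):
    """Pure-DP gradient descent for argmin_{a,b} sum_i (y_i - a*x_i - b)^2.
    Per-example gradients are clipped coordinate-wise to [-tau, tau],
    averaged, and perturbed with Laplace noise each round."""
    rng = random.Random(seed)
    n = len(xs)
    scale = (4 * tau / n) / (eps / T)  # L1 sensitivity / per-round budget

    def lap():  # Laplace sample via inverse CDF
        u = rng.random() - 0.5
        return -scale * math.copysign(math.log(1 - 2 * abs(u)), u)

    clip = lambda v: max(-tau, min(tau, v))
    a = b = 0.0
    for _ in range(T):
        ga = sum(clip(2 * (a * x + b - y) * x) for x, y in zip(xs, ys)) / n
        gb = sum(clip(2 * (a * x + b - y)) for x, y in zip(xs, ys)) / n
        a -= lr * (ga + lap())
        b -= lr * (gb + lap())
    return a, b
```

With a very large ε (negligible noise) and enough rounds, the iterates approach the OLS solution.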

2.4. NoisyIntercept

We also compare the above DP mechanisms to simply adding noise to the average y-value. For a given dataset (x, y) ∈ ([0,1] × [0,1])^n, this method computes ȳ = (1/n)Σ_{i=1}^{n} y_i and outputs the noisy estimate ỹ = ȳ + Lap(0, 1/(εn)) as both the p̃_25 and p̃_75 estimate. This method performs well when the slope α is very small.
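A sketch of this baseline (Laplace sampling via inverse CDF; function name ours):

```python
import math, random

def noisy_intercept(ys, eps, seed=0):
    """Release the mean y-value with Laplace noise of scale 1/(eps*n).
    The mean of n values in [0, 1] changes by at most 1/n when one
    record changes, so this single release is (eps, 0)-DP."""
    rng = random.Random(seed)
    n = len(ys)
    u = rng.random() - 0.5
    noise = -(1.0 / (eps * n)) * math.copysign(math.log(1 - 2 * abs(u)), u)
    return sum(ys) / n + noise
```

The same noisy value serves as both the p̃_25 and p̃_75 prediction.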

2.5. A Note on Hyperparameters

We leave the question of how to choose the optimal hyperparameters for each algorithm to future work. Unfortunately, since the optimal hyperparameter settings may reveal sensitive information about the dataset, one cannot tune the hyperparameters on a hold-out set. However, we found that for most of the hyperparameters, once a good setting was found, it could be used for a variety of similar datasets. Thus, one realistic way to tune the parameters may be to tune them on a public dataset similar to the dataset of interest. For example, for applications using census data, one could tune the parameters on previous years' census data.

We note that, as presented here, NoisyStats requires no hyperparameter tuning, while DPExpTheilSen and DPSSTheilSen only require a small amount of tuning (they require upper and lower bounds on the output). However, NoisyStats requires knowledge of the range of the inputs xi,yix_{i},y_{i}, which is not required by the Theil-Sen methods. Both DPGradDescent and DPWideTheilSen can be quite sensitive to the choice of hyperparameters. In fact, DPExpTheilSen is simply DPWideTheilSen with θ=0\theta=0, and we will often see a significant difference in the behaviour of these algorithms. Preliminary experiments on the robustness of DPWideTheilSen to the choice of θ\theta can be found in the Appendix.

3. Experimental Outline

The goal of this work is to provide insight and guidance into what features of a dataset should be considered when choosing a differentially private algorithm for simple linear regression. As such, in the following sections, we explore the behavior of these algorithms on a variety of real datasets. We also consider some synthetic datasets where we can further explore how properties of the dataset affect performance.

3.1. Description of the Data

3.1.1. Opportunity Insights data

The first dataset we consider is a simulated version of the data used by the Opportunity Insights team in creating the Opportunity Atlas tool described in Section 1.2. In the appendix, we describe the data generation process used by the Opportunity Insights team to generate the simulated data. Each datapoint (x_i, y_i) corresponds to a pair (parent income percentile rank, child income percentile rank), where parent income is defined as the total pre-tax income at the household level, averaged over the years 1994-2000, and child income is defined as the total pre-tax income at the individual level, averaged over the years 2014-2015 when the children are between the ages of 31 and 37. In the Census data, every state in the United States is partitioned into small neighborhood-level blocks called tracts. We perform the linear regression on each tract individually. The “best” differentially private algorithm differs from state to state, and even tract to tract. We focus on results for Illinois (IL), which has a total of n = 219,594 datapoints divided among 3,108 tracts. The individual datasets (corresponding to tracts in the Census data) each contain between n = 30 and n = 400 datapoints. The statistic n·var(x) ranges between 0 and 25, with the majority of tracts having n·var(x) ∈ [0, 5]. We use ε = 16 in all our experiments on this data since that is the value approved by the Census for, and currently used in, the release of the Opportunity Atlas (Chetty et al., 2018).

3.1.2. Washington, DC Bikeshare UCI Dataset

Next, we consider a family of small datasets containing data from the Capital Bikeshare system in Washington D.C., USA, from the years 2011 and 2012 (Fanaee-T and Gama, 2013); this data is publicly available at http://capitalbikeshare.com/system-data. Each datapoint (x_i, y_i) contains the temperature (x_i) and user count of the bikeshare program (y_i) for a single hour in 2011 or 2012. The x_i and y_i values are both normalized to lie between 0 and 1 by a linear rescaling of the data. In order to obtain smaller datasets to work with, we segment this dataset into 288 (12 x 24) smaller datasets, each corresponding to a (month, hour of the day) pair. Each smaller dataset contains between n = 45 and n = 62 datapoints. We linearly regress the number of active users of the bikeshare program on the temperature. While the variables are positively correlated, the correlation is not obviously linear (the data is not necessarily well fit by a line) within many of these datasets, so we see reasonably large standard error values, ranging between 0.0003 and 0.32, for this data. Note that our privacy guarantee is at the level of hours rather than individuals, so this serves more as an example to evaluate the utility of our methods than as a real privacy application. While this is an important distinction for a deployment of differential privacy, we will not dwell on it here since our goal is to evaluate the statistical utility of our algorithms on real datasets. An example of one of these datasets is displayed in Figure 3.

Figure 3. Example of dataset from Washington, DC Bikeshare UCI Dataset.

3.1.3. Carbon Nanotubes UCI Dataset

While we are primarily concerned with evaluating the behavior of the private algorithms on small datasets, we also present results on a good candidate for private analysis: a large dataset with a strong linear relationship. In contrast to the Bikeshare UCI data and the Opportunity Insights data, this is a single dataset rather than a family of datasets. The data is drawn from a dataset studying atomic coordinates in carbon nanotubes (Aci and Avci, 2016). For each datapoint (x_i, y_i), x_i is the u-coordinate of the initial atomic coordinates of a molecule and y_i is the u-coordinate of the calculated atomic coordinates after the energy of the system has been minimized. After filtering out all points not contained in [0,1] x [0,1], the resulting dataset contains n = 10,683 datapoints. A graphical representation of the data is included in Figure 4. Due to the size of this dataset, we run the efficient DPTheilSenMatch algorithm with k = 1 on it. This dataset does not contain sensitive information; however, we include it to evaluate the behaviour of the DP algorithms on a variety of real datasets.

Figure 4. Carbon Nanotubes UCI Dataset.

3.1.4. Stock Exchange UCI Dataset

Our final real dataset studies the relationship between the Istanbul Stock Exchange and the USD Stock Exchange (Akb, 2013); this data was collected from imkb.gov.tr and finance.yahoo.com. It compares x_i = the Istanbul Stock Exchange national 100 index to y_i = the USD International Securities Exchange. This dataset has n = 250 datapoints; a representation is included in Figure 5. It is a smaller, noisier dataset than the Carbon Nanotubes UCI dataset.

Figure 5. The Stock Exchange UCI Dataset.

3.1.5. Synthetic Data

The synthetic datasets were constructed by sampling x_i ∈ ℝ, for i = 1, …, n, independently from a uniform distribution with mean 0.5 and variance σ_x². For each x_i, the corresponding y_i is generated as y_i = αx_i + β + e_i, where α = 0.5, β = 0.2, and e_i is sampled from 𝒩(0, σ_e²). The (x_i, y_i) datapoints are then clipped to the box [0,1]². The DP algorithms estimate the prediction at x_new using privacy parameter ε.

The values of n, σ_x², σ_e², x_new, and ε vary across the individual experiments. The synthetic data experiments are designed to study which properties of the data and privacy regime determine the performance of the private algorithms. Thus, in these experiments, we vary one of the parameters listed above and observe the impact on the accuracy of the algorithms and their relative performance to each other. Since we know the ground truth on this synthetically generated data, we plot empirical error bounds that take into account both the sampling error and the error due to the DP algorithms, Ĉ_true(68)/σ(p̂_{x_new}). We evaluate DPTheilSenkMatch with k = 10 in the synthetic experiments rather than DPTheilSen, since the former is computationally more efficient and still gives us insight into the performance of the latter.
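The generation process above can be sketched as follows (function name ours; we read "uniform with mean 0.5 and variance σ_x²" as a uniform interval of half-width √3·σ_x centered at 0.5):

```python
import random

def make_synthetic(n, sigma_x, sigma_e, alpha=0.5, beta=0.2, seed=0):
    """Generate one synthetic dataset: x_i uniform with mean 0.5 and
    standard deviation sigma_x (Var(U[a,b]) = (b-a)^2/12, so the
    half-width is sqrt(3)*sigma_x), y_i = alpha*x_i + beta + N(0,
    sigma_e^2), with both coordinates clipped to [0, 1]."""
    rng = random.Random(seed)
    half = (3 * sigma_x ** 2) ** 0.5
    xs, ys = [], []
    for _ in range(n):
        x = rng.uniform(0.5 - half, 0.5 + half)
        y = alpha * x + beta + rng.gauss(0, sigma_e)
        xs.append(min(max(x, 0.0), 1.0))
        ys.append(min(max(y, 0.0), 1.0))
    return xs, ys
```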

3.2. Hyperparameters

Table 1. Hyperparameters used in experiments on the OI, UCI, and synthetic datasets.
Algorithm | OI/UCI data | Synthetic data
NoisyStats | None | None
NoisyIntercept | None | N/A
DPExpTheilSen | r_l = −0.5, r_u = 1.5 | r_l = −2, r_u = 2
DPWideTheilSen | r_l = −0.5, r_u = 1.5, θ = 0.01 | r_l = −2, r_u = 2, θ = 0.01
DPSSTheilSen | r_l = −0.5, r_u = 1.5, d = 3 | r_l = −2, r_u = 2, d = 3
DPGradDescent | τ = 1, T = 80 | τ = 1, T = 80

Hyperparameters were tuned on the semi-synthetic Opportunity Insights data by experimenting with many different choices, and choosing the best. The hyperparameters are listed in Table 1. We leave the question of optimizing hyperparameters in a privacy-preserving way to future work. Hyperparameters for the UCI and synthetic datasets were chosen according to choices that seemed to perform well on the Opportunity Insights datasets. This ensures that the hyperparameters were not tuned to the specific datasets (ensuring validity of the privacy guarantees), but also leaves some room for improvement in the performance by more careful hyperparameter tuning.

4. Results and Discussion

In this section we discuss the findings from the experiments described in the previous section. In the majority of the real datasets tested, DPExpTheilSen is the best performing algorithm. We’ll discuss some reasons why DPExpTheilSen performs well, as well as some regimes when other algorithms perform better.

A note on the statement of privacy parameters in the experiments: we state the privacy budget used to compute the pair (p_25, p_75), but we only show empirical error bounds for p_25. The empirical error bounds for p_75 display similar phenomena. The algorithms NoisyStats and DPGradDescent inherently release both point estimates together, so the privacy loss is the same whether we release the pair (p_25, p_75) or just p_25. However, DPExpTheilSen, DPWideTheilSen, DPSSTheilSen, and the algorithm used by the Opportunity Insights team use half their budget to release p_25 and half to release p_75, separately.

4.1. Real Data and the Opportunity Insights Application

Figure 6 shows the results of all the DP linear regression algorithms, as well as the mechanism used by the Opportunity Insights team, on the Opportunity Insights data for the states of Illinois and North Carolina. For each algorithm, we build an empirical cumulative distribution of the empirical error bounds over the tracts in that state. The vertical dotted line in Figure 6 intercepts each curve at the point where the noise due to privacy exceeds the standard error. The Opportunity Insights team used a heuristic method inspired by, but not satisfying, DP to deploy the Opportunity Atlas (Chetty et al., 2018). The privacy parameter ε = 16 was selected by the Opportunity Insights team and a Census Disclosure Review Board by balancing privacy and utility considerations. Figure 6 shows that there exist differentially private algorithms that are competitive with, and in many cases more accurate than, the algorithm currently deployed in the Opportunity Atlas.

Additionally, the methods used by the Opportunity Insights team are highly tailored to their setting, relying on coordination across Census tracts to compute an upper bound on the local sensitivity, as discussed in Section 1.3. The formally differentially private methods, by contrast, are general-purpose and do not require coordination across tracts.

(a) Illinois (IL)
(b) North Carolina (NC)
Figure 6. Empirical CDF of the empirical 68% error bounds, Ĉ(68), normalized by the empirical OLS standard error, when evaluated on Opportunity Insights data for the states of Illinois and North Carolina. The algorithm denoted OI refers to the heuristic algorithm used in the deployment of the Opportunity Atlas tool (Chetty and Friedman, 2019). Privacy parameter ε = 16 for the pair (p_25, p_75).

Figure 7 shows empirical cumulative distributions of the empirical error bounds on the 288 datasets in the Bikeshare dataset. Figure 8 shows the empirical cumulative distribution function of the output distribution on the Stock Exchange dataset; note that this is of a different form from the empirical CDFs in Figures 6 and 7. Figure 9(a) shows the empirical cumulative distribution function of the output distribution on the Carbon Nanotubes dataset. Figures 7, 8 and 9(a) show that on these three other real datasets, for a range of realistic ε values, the additional noise due to privacy (in particular using DPExpTheilSen) is less than the standard error.

Figure 7. Bikeshare data. Empirical CDFs of the performance over the set of datasets when ε = 10.
Figure 8. Stock Exchange UCI Data. Empirical CDF of the output distribution of the estimate of p_25 after 100 trials of each algorithm with ε = 2. The grey region includes all values within one standard error of p̂_25. The curve labelled “standard error” shows the non-private posterior belief on the value of p_25 assuming Gaussian noise.
(a) Domain knowledge that p_25, p_75 ∈ [−0.5, 1.5]
(b) Domain knowledge that p_25, p_75 ∈ [−50, 50]
Figure 9. Carbon Nanotubes UCI Data. Empirical CDF of the output distribution of the estimate of p_25 after 100 trials of each algorithm. The grey region includes all values within one standard error of p̂_25. The curve labelled “standard error” shows the non-private posterior belief on the value of p_25 assuming Gaussian noise.

4.2. Robustness vs. Non-robustness: Guidance for Algorithm Selection

The DP algorithms we evaluate can be divided into two classes: robust DP estimators based on Theil-Sen (DPSSTheilSen, DPExpTheilSen and DPWideTheilSen) and non-robust DP estimators based on OLS (NoisyStats and DPGradDescent). Experimentally, we found that the algorithms' behaviour tends to cluster into these two classes, with the robust estimators outperforming the non-robust estimators in a wide variety of parameter regimes. In most of the experiments in the previous section (Figures 6, 7, 8 and 9(a)), DPExpTheilSen was the best performing algorithm, followed by DPWideTheilSen and DPSSTheilSen. However, below we will see that in experiments on synthetic data, the non-robust estimators outperform the robust estimators in some parameter regimes.

4.2.1. NoisyStats and DPExpTheilSen

In Figure 10, we investigate the relative performance (Ĉ_true(68)/σ(p̂_25)) of the algorithms in several parameter regimes of n, ε, and σ_x² on synthetically generated data. For each parameter setting and each algorithm, we plot the average value of Ĉ_true(68)/σ(p̂_25) over 50 trials on a single dataset, averaged again over 500 independently sampled datasets. Across large ranges for each of these three parameters (n ∈ [30, 10,000]; σ_x² ∈ [0.003, 0.03]; ε ∈ [0.01, 10], all varied on a logarithmic scale), we see that either DPExpTheilSen, DPGDzCDP, or NoisyStats is consistently the best performing algorithm, or close to it. Note that of these three algorithms, only DPExpTheilSen and NoisyStats fulfill the stronger pure (ε, 0)-DP guarantee. We see that DPGDzCDP and NoisyStats trend towards taking over as the best algorithms as εnσ_x² increases.

(a) Varying n
(b) Varying σ_x²
(c) Varying ε
Figure 10. Relative error (Ĉ_true/σ(p̂_25)) of DP and non-private algorithms on synthetic data.
(a) εnσ_x² = 40; varying σ_e²
(b) εnσ_x² = 40; varying x_new
(c) εnσ_x² = 216; varying σ_e²
(d) εnσ_x² = 216; varying x_new
(e) εnσ_x² = 800; varying σ_e²
(f) εnσ_x² = 800; varying x_new
Figure 11. Comparing NoisyStats and DPExpTS10Match on synthetic data as σ_e² and x_new vary, for x̄ = 0.5. Plotting Ĉ_true/σ(p̂_25). OLS and TheilSen10Match included for reference.

For parameter regimes in which the non-robust algorithms outperform the robust estimators, NoisyStats is preferable to DPGDzCDP since it is more efficient, requires no hyperparameter tuning (except bounds on the inputs xix_{i} and yiy_{i}), fulfills a stronger privacy guarantee, and releases the noisy sufficient statistics with no additional cost in privacy. Experimentally, we find that the main indicator for deciding between robust estimators and non-robust estimators is the quantity εnvar(x)\varepsilon n\text{var}(\textbf{x}) (which is a proxy for εnσx2\varepsilon n\sigma_{x}^{2}). Roughly, when εnvar(x)\varepsilon n\text{var}(\textbf{x}) and σe2\sigma_{e}^{2} are both large, NoisyStats is close to optimal among the DP algorithms tested; otherwise, the robust estimator DPExpTheilSen typically has lower error. Hyperparameter tuning and the quantity |xnewx¯||x_{new}-\bar{x}| also play minor roles in determining the relative performance of the algorithms.

4.2.2. Role of εnvar(x)\varepsilon n\text{var}(\textbf{x})

The quantity εn·var(x) is related to the size of the dataset, to how concentrated the independent variable of the data is, and to how private the mechanism is. It appears to be a better indicator of the performance of DP mechanisms than any of the individual statistics ε, n, or var(x) in isolation. In Figure 11 we compare the performance of NoisyStats and DPExpTheilSen10Match as we vary σ_e² and x_new. For each parameter setting and each algorithm, the error is computed as the average over 20 trials and 500 independently sampled datasets. The blue line shows the error of the non-private OLS estimator, which is our baseline. The quantity we control in these experiments is εnσ_x², although we expect this to align closely with εn·var(x), which is the empirically measurable quantity. In all of our synthetic data experiments, in which the x_i's are uniform and the e_i's are Gaussian, once εnσ_x² > 400 and σ_e² ≥ 10⁻², NoisyStats is close to the best performing algorithm. It is also important to note that once εn·var(x) is large, both NoisyStats and DPExpTheilSen perform well. See Figure 11 for a demonstration of this on the synthetic data.

The error, as measured by C^true(68)\hat{C}_{\text{true}}(68), of both OLS and Theil-Sen estimators converges to 0 as nn\to\infty at the same asymptotic rate. However, OLS converges a constant factor faster than Theil-Sen resulting in its superior performance on relatively large datasets. As εnvar(x)\varepsilon n\text{var}(\textbf{x}) increases, the error of NoisyStats decreases, and the outputs of this algorithm concentrate around the OLS estimate. While the outputs of DPExpTheilSen tend to be more concentrated, they converge to the Theil-Sen estimate, which has higher sampling error. Thus, as we increase εnvar(x)\varepsilon n\text{var}(\textbf{x}), eventually the additional noise due to privacy added by NoisyStats is less than the difference between the OLS and Theil-Sen point estimates, resulting in NoisyStats outperforming DPExpTheilSen. This phenomenon can be seen in Figure 10 and Figure 11.

The classifying power of $\varepsilon n\text{var}(\textbf{x})$ is a result of its impact on the performance of NoisyStats. Recall that NoisyStats works by plugging noisy sufficient statistics into the closed-form solution for OLS given in Equation 2. The noisy sufficient statistic $\text{nvar}(\textbf{x})+\text{Lap}(0,3\Delta_{2}/\varepsilon)$ appears in the denominator of this closed-form solution. When $\varepsilon n\text{var}(\textbf{x})$ is small, this noisy statistic has a constant fraction of its probability mass concentrated near, or below, 0, and hence its inverse, $1/(\text{nvar}(\textbf{x})+\text{Lap}(0,3\Delta_{2}/\varepsilon))$, which appears in NoisyStats, can be an arbitrarily bad estimate of $1/\text{nvar}(\textbf{x})$. In contrast, when $\varepsilon n\text{var}(\textbf{x})$ is large, the distribution of $\text{nvar}(\textbf{x})+\text{Lap}(0,3\Delta_{2}/\varepsilon)$ is concentrated around $\text{nvar}(\textbf{x})$, so that with high probability $1/(\text{nvar}(\textbf{x})+\text{Lap}(0,3\Delta_{2}/\varepsilon))$ is close to $1/\text{nvar}(\textbf{x})$.
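For concreteness, the mechanism just described can be sketched in a few lines. This is our own illustrative code, not the paper's reference implementation; the $3\Delta/\varepsilon$ noise scales, the positivity check on the noisy denominator, and the intercept noise scale follow the description of NoisyStats and its privacy proof in Appendix C.

```python
import numpy as np

def noisy_stats(x, y, eps, rng=None):
    """Sketch of NoisyStats: perturb the OLS sufficient statistics
    ncov(x, y) and nvar(x) with Laplace noise of scale 3*Delta/eps,
    where Delta = 1 - 1/n bounds the global sensitivity of both
    (Lemma 1 in Appendix C). Inputs are assumed to lie in [0, 1]."""
    if rng is None:
        rng = np.random.default_rng()
    n = len(x)
    delta = 1 - 1 / n
    ncov = np.dot(x - x.mean(), y - y.mean())
    nvar = np.dot(x - x.mean(), x - x.mean())
    noisy_nvar = nvar + rng.laplace(0, 3 * delta / eps)
    if noisy_nvar <= 0:
        return None  # failure: the noisy denominator is not positive
    alpha = (ncov + rng.laplace(0, 3 * delta / eps)) / noisy_nvar
    # Intercept: sensitivity of (ybar - alpha*xbar) is (1+|alpha|)/n.
    beta = (y.mean() - alpha * x.mean()) \
        + rng.laplace(0, 3 * (1 + abs(alpha)) / (eps * n))
    return alpha, beta
```

If the noisy denominator is not positive, the algorithm reports failure, which is exactly the event that becomes rare once $\varepsilon n\text{var}(\textbf{x})$ is large.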

In Figures 10(a), 10(b) and 10(c), the performance of all the algorithms, including NoisyStats and DPExpTheilSen, is shown on synthetic data as we vary each of the parameters $\varepsilon$, $n$ and $\sigma_{x}^{2}$ while holding the other variables constant. In doing so, each of these plots effectively varies $\varepsilon n\text{var}(\textbf{x})$ as a whole. These plots suggest that all three parameters help determine which algorithm is the most accurate. Figure 11 confirms that $\varepsilon n\text{var}(\textbf{x})$ is a strong indicator of the relative performance of NoisyStats and DPExpTheilSen, even as other variables in the OLS standard error equation (3), including the noise level of the dependent variable, $\sigma_{e}$, and the distance between $x_{new}$ and the mean of the $x$ values, $|x_{new}-\bar{x}|$, are varied.

4.2.3. The Role of $\sigma_{e}^{2}$

One main advantage that the DPTheilSen algorithms have over NoisyStats is that they adapt better to the specific dataset. When $\sigma_{e}^{2}$ is small, there is a strong linear signal in the data that the non-private OLS estimator and the DPTheilSen algorithms can leverage. Since the noise-addition mechanism of NoisyStats does not exploit this strong signal, its relative performance is worse when $\sigma_{e}^{2}$ is small. Thus, $\sigma_{e}^{2}$ affects the relative performance of the algorithms even when $\varepsilon n\text{var}(\textbf{x})$ is large. We saw this effect in Figure 9(a) on the Carbon Nanotubes UCI data, where $\varepsilon n\text{var}(\textbf{x})=889.222$ is large, but NoisyStats performed poorly relative to the DPTheilSen algorithms.

In each of the plots 11(a), 11(c), and 11(e), the quantity $\varepsilon n\text{var}(\textbf{x})$ is held constant while $\sigma_{e}^{2}$ is varied. These plots confirm that the performance of NoisyStats degrades, relative to the other algorithms, when $\sigma_{e}^{2}$ is small. As $\sigma_{e}^{2}$ increases, the relative performance of NoisyStats improves until its error drops below that of the Theil-Sen estimates. The cross-over point varies with $\varepsilon n\text{var}(\textbf{x})$. In Figure 11(a), we see that when $\varepsilon n\text{var}(\textbf{x})$ is small, the methods based on Theil-Sen are the better choice for all values of $\sigma_{e}^{2}$ studied. When we increase $\varepsilon n\text{var}(\textbf{x})$ in Figure 11(e), we see a clear cross-over point. These plots add evidence to our conjecture that robust estimators are a good choice when $\varepsilon n\text{var}(\textbf{x})$ and $\sigma_{e}^{2}$ are both small, while non-robust estimators perform well when $\varepsilon n\text{var}(\textbf{x})$ is large.

4.2.4. The Role of $|x_{new}-\bar{x}|$

The final factor we found to play a role in the relative performance of NoisyStats and DPExpTheilSen is $|x_{new}-\bar{x}|$. This effect is less pronounced than that of $\varepsilon n\text{var}(\textbf{x})$ or $\sigma_{e}^{2}$, and seems to rarely change the ordering of the DP algorithms. In Figures 11(b), 11(d) and 11(f), we explore the effect of this quantity on the relative performance of OLS, TheilSen10Match, DPExpTheilSen10Match, and NoisyStats. We saw in Equation 3 that $|x_{new}-\bar{x}|$ affects the standard error $\hat{\sigma}(\widehat{p}_{25})$: the further $x_{new}$ is from the centre of the data, $\bar{x}$, the less certain we are about the OLS point estimate. All algorithms perform better when $|x_{new}-\bar{x}|$ is small; however, this effect is most pronounced for NoisyStats. In NoisyStats, $\widetilde{p}_{x_{new}}=\tilde{\alpha}(x_{new}-\bar{x})+\bar{y}+L_{3}$, where $L_{3}\sim\text{Lap}(0,3(1+|\tilde{\alpha}|)/(\varepsilon n))$ and $\tilde{\alpha}$ is the noisy slope. Thus, we expect a large $|x_{new}-\bar{x}|$ to amplify the error present in $\tilde{\alpha}$. We vary $x_{new}$ between 0 and 1. As expected, the error is minimized when $x_{new}=0.5$, since $\bar{x}=0.5$. Since we expect the error in $\tilde{\alpha}$ to decrease as we increase $\varepsilon n\text{var}(\textbf{x})$, this explains why the quantity $|x_{new}-\bar{x}|$ has a larger effect when $\varepsilon n\text{var}(\textbf{x})$ is small.

4.2.5. The Role of Hyperparameter Tuning

A final major distinguishing feature between the DPTheilSen algorithms, NoisyStats, and DPGradDescent is the amount of prior knowledge the data analyst needs in order to choose the hyperparameters appropriately. Notably, NoisyStats requires no hyperparameter tuning other than a bound on the data. The DPTheilSen algorithms require reasonable knowledge of the range in which $p_{25}$ and $p_{75}$ lie in order to set $r_{l}$ and $r_{u}$. Finally, DPGradDescent requires some knowledge of where the input values lie so that it can set $\tau$ and $T$.

In Figure 9(b), NoisyStats and all three DPGradDescent algorithms outperform the robust estimators. This experiment differs from Figure 9(a) in two important ways: the privacy parameter $\varepsilon$ has decreased from 1 to 0.01, and the feasible output region for the DPTheilSen methods has increased from $[-0.5,1.5]$ to $[-50,50]$. When $\varepsilon$ is large, the DPTheilSen algorithms are robust to the choice of this region, since any area outside the range of the data is exponentially down-weighted (see Figure 1). However, when $\varepsilon$ is small, the size of this region can have a large effect on the stability of the output. As $\varepsilon$ decreases, the output distributions of the DPTheilSen estimators flatten, so that they are essentially sampling from the uniform distribution on the feasible output region. This effect likely explains the poor performance of the robust estimators in Figure 9(b), and highlights the importance of choosing hyperparameters carefully. If $\varepsilon$ is small (much less than 1) and the analyst does not have a decent estimate of the range of $p_{25}$, then NoisyStats may be a safer choice than DPTheilSen.

4.3. Which robust estimator?

In the majority of the regimes we have tested, DPExpTheilSen outperforms all the other private algorithms. While DPSSTheilSen can be competitive with DPExpTheilSen and DPWideTheilSen, it rarely outperforms them. However, DPWideTheilSen can significantly outperform DPExpTheilSen when the standard error is small. Figure 12 compares the performance of DPExpTheilSen and DPWideTheilSen on the Bikeshare UCI data. When there is little noise in the data, we expect the set of pairwise estimates whose median Theil-Sen takes to be highly concentrated. We discussed in Section 2.2 why this is a difficult setting for DPExpTheilSen: in the continuous setting, the exponential-mechanism-based median algorithm can fail to put sufficient probability mass around the median, even if the data is concentrated at the median (see Figure 1). DPWideTheilSen was designed exactly as a fix to this problem. The parameter $\theta$ needs to be chosen carefully: in Figure 9(a), $\theta$ is set much larger than the standard error, resulting in DPWideTheilSen performing poorly.
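The failure mode discussed above is easiest to see in a sketch of the exponential-mechanism median that DPExpTheilSen builds on. This is our own simplified code, not the paper's implementation; DPWideTheilSen additionally widens the high-utility region around the median by $\theta$, which we do not implement here. Note that when the data is tightly concentrated, the inter-point intervals have near-zero length and carry almost no mass, which is precisely the problem described in Section 2.2.

```python
import numpy as np

def dp_median_exp(z, eps, rl, ru, rng=None):
    """Exponential-mechanism median over the output range [rl, ru].
    The utility of an output t is -|#{z_i < t} - n/2|, which is
    constant between adjacent data points, so we sample one of the
    n+1 inter-point intervals with probability proportional to
    length * exp(eps * utility / 2) (the utility has sensitivity 1),
    then draw uniformly within the chosen interval."""
    if rng is None:
        rng = np.random.default_rng()
    z = np.clip(np.sort(np.asarray(z, dtype=float)), rl, ru)
    n = len(z)
    edges = np.concatenate(([rl], z, [ru]))
    lengths = np.diff(edges)            # n+1 interval lengths
    ranks = np.arange(n + 1)            # points lying below interval i
    utility = -np.abs(ranks - n / 2)
    # Work in log space to avoid overflow/underflow.
    logw = np.log(np.maximum(lengths, 1e-300)) + eps * utility / 2
    logw -= logw.max()
    probs = np.exp(logw) / np.exp(logw).sum()
    i = rng.choice(n + 1, p=probs)
    return rng.uniform(edges[i], edges[i + 1])
```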

Figure 12. Comparison of DPExpTheilSen and DPWideTheilSen on the Bikeshare UCI data with $\varepsilon=2$ and $\theta=0.01$. Datasets are sorted by their standard errors. For each dataset, there is a dot corresponding to the $\hat{C}(68)$ value of each algorithm; both dots lie on the same vertical line.

4.4. Which non-robust estimator?

There are two main non-robust estimators we consider: DPGradDescent and NoisyStats. DPGradDescent has three versions, DPGDPure, DPGDApprox, and DPGDzCDP, corresponding to the pure, approximate, and zero-concentrated variants of differential privacy. Amongst the DPGradDescent algorithms, as expected, DPGDzCDP provides the best utility, followed by DPGDApprox and then DPGDPure. But how do these compare to NoisyStats? NoisyStats outperforms both DPGDPure and DPGDApprox for small $\delta$ (e.g. $\delta=2^{-30}$ in our experiments). DPGDzCDP consistently outperforms NoisyStats, but the gap in performance is small in the regime where the non-robust estimators beat the robust estimators. Moreover, NoisyStats achieves a stronger privacy guarantee: pure $(\varepsilon,0)$-DP rather than $\varepsilon^{2}/2$-zCDP. A fairer comparison is to use the natural $\varepsilon^{2}/2$-zCDP analogue of NoisyStats (using Gaussian noise and zCDP composition), in which case we have found that the advantage of DPGDzCDP shrinks significantly and in some cases is reversed. (Experiments omitted.)

The performance of the DPGradDescent algorithms also depends on hyperparameters that need to be carefully tuned, such as the number of gradient calls $T$ and the clipping range $[-\tau,\tau]$. Since NoisyStats requires fewer hyperparameters, DPGradDescent is potentially harder to use in practice. In addition, NoisyStats is more efficient and can be used to release the noisy sufficient statistics with no additional privacy loss. Since the performance of the two algorithms is similar in the regime where non-robust methods appear to have more utility than robust ones, the additional benefits of NoisyStats may make it preferable in practice.

We leave a more thorough evaluation and optimization of these algorithms in the regime of large $n$, including how to optimize the hyperparameters in a DP manner, to future work.

4.5. Analyzing the bias

Let $(p_{25}^{TS},p_{75}^{TS})$ be the prediction estimates produced by the non-private Theil-Sen estimator. The non-robust DP methods, NoisyStats and DPGradDescent, approach $(\widehat{p}_{25},\widehat{p}_{75})$ as $\varepsilon\to\infty$, while the DPTheilSen methods approach $(p_{25}^{TS},p_{75}^{TS})$. For any fixed dataset, $(\widehat{p}_{25},\widehat{p}_{75})$ and $(p_{25}^{TS},p_{75}^{TS})$ are not necessarily equal; a good illustration of this bias can be seen in Figure 9(a). However, both the Theil-Sen estimator and the OLS estimator are consistent estimators of the true slope in simple linear regression: as $n\to\infty$, both $(p_{25}^{TS},p_{75}^{TS})$ and $(\widehat{p}_{25},\widehat{p}_{75})$ tend to the true value $(p_{25},p_{75})$. Thus, for fixed $\varepsilon$, all the private algorithms converge to the true prediction estimates as $n\to\infty$.

5. Conclusion

It is possible to design DP simple linear regression algorithms where the distortion added by the private algorithm is less than the standard error, even for small datasets. In this work, we found that in order to achieve this we needed to switch from OLS regression to the more robust linear regression estimator, Theil-Sen. We identified key factors that analysts should consider when deciding whether DP methods based on robust or non-robust estimators are right for their application.

Acknowledgements.
We thank Raj Chetty, John Friedman, and Daniel Reuter from Opportunity Insights for many motivating discussions, providing us with simulated Opportunity Atlas data for our experiments, and helping us implement the OI algorithm. We also thank Cynthia Dwork and James Honaker for helpful conversations at an early stage of this work; Garett Bernstein and Dan Sheldon for discussions about their work on private Bayesian inference for linear regression; Sofya Raskhodnikova, Thomas Steinke, and others at the Simons Institute Spring 2019 program on data privacy for helpful discussions on specific technical aspects of this work.

References

  • Akb (2013) 2013. A novel Hybrid RBF Neural Networks model as a forecaster. Statistics and Computing (2013). https://doi.org/10.1007/s11222-013-9375-7
  • Abernethy et al. (2016) Jacob Abernethy, Chansoo Lee, and Ambuj Tewari. 2016. Perturbation techniques in online learning and optimization. Perturbations, Optimization, and Statistics (2016), 233.
  • Aci and Avci (2016) Mehmet Aci and Mutlu Avci. 2016. Artificial Neural Network Approach for Atomic Coordinate Prediction of Carbon Nanotubes. Applied Physics A 122, 631 (2016).
  • Bassily et al. (2014) Raef Bassily, Adam D. Smith, and Abhradeep Thakurta. 2014. Private Empirical Risk Minimization, Revisited. CoRR abs/1405.7085 (2014). http://arxiv.org/abs/1405.7085
  • Bernstein and Sheldon (2019) Garrett Bernstein and Daniel R Sheldon. 2019. Differentially Private Bayesian Linear Regression. In Advances in Neural Information Processing Systems 32. 523–533.
  • Bun and Steinke (2016) Mark Bun and Thomas Steinke. 2016. Concentrated differential privacy: Simplifications, extensions, and lower bounds. In Theory of Cryptography Conference. Springer, 635–658.
  • Bun and Steinke (2019) Mark Bun and Thomas Steinke. 2019. Average-Case Averages: Private Algorithms for Smooth Sensitivity and Mean Estimation. arXiv preprint arXiv:1906.02830 (2019).
  • Cai et al. (2019) Tony Cai, Yichen Wang, and Linjun Zhang. 2019. The Cost of Privacy: Optimal Rates of Convergence for Parameter Estimation with Differential Privacy. CoRR abs/1902.04495 (2019). http://arxiv.org/abs/1902.04495
  • Chaudhuri et al. (2011) Kamalika Chaudhuri, Claire Monteleoni, and Anand D. Sarwate. 2011. Differentially Private Empirical Risk Minimization. Journal of Machine Learning Research 12 (2011), 1069–1109.
  • Chetty and Friedman (2019) Raj Chetty and John N. Friedman. 2019. A Practical Method to Reduce Privacy Loss When Disclosing Statistics Based on Small Samples. American Economic Review Papers and Proceedings 109 (2019), 414–420.
  • Chetty et al. (2018) Raj Chetty, John N Friedman, Nathaniel Hendren, Maggie R Jones, and Sonya R Porter. 2018. The opportunity atlas: Mapping the childhood roots of social mobility. Technical Report. National Bureau of Economic Research.
  • Chetty et al. (2014) Raj Chetty, Nathaniel Hendren, Patrick Kline, and Emmanuel Saez. 2014. Where is the land of opportunity? The geography of intergenerational mobility in the United States. The Quarterly Journal of Economics 129, 4 (2014), 1553–1623.
  • Cormode ([n.d.]) Graham Cormode. [n.d.]. Building Blocks of Privacy: Differentially Private Mechanisms. ([n. d.]), 18–19.
  • Couch et al. (2019) Simon Couch, Zeki Kazan, Kaiyan Shi, Andrew Bray, and Adam Groce. 2019. Differentially Private Nonparametric Hypothesis Testing. arXiv preprint arXiv:1903.09364 (2019).
  • Dinur and Nissim (2003) Irit Dinur and Kobbi Nissim. 2003. Revealing information while preserving privacy. In Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems. 202–210.
  • Dwork and Lei (2009) Cynthia Dwork and Jing Lei. 2009. Differential privacy and robust statistics.. In STOC, Vol. 9. 371–380.
  • Dwork et al. (2006) Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam D. Smith. 2006. Calibrating Noise to Sensitivity in Private Data Analysis. In Theory of Cryptography, Third Theory of Cryptography Conference, TCC 2006, New York, NY, USA, March 4-7, 2006, Proceedings. 265–284.
  • Dwork et al. (2014) Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. 2014. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In STOC. 11–20.
  • Fanaee-T and Gama (2013) Hadi Fanaee-T and Joao Gama. 2013. Event labeling combining ensemble detectors and background knowledge. Progress in Artificial Intelligence (2013), 1–15. https://doi.org/10.1007/s13748-013-0040-3
  • McSherry and Talwar (2007) Frank McSherry and Kunal Talwar. 2007. Mechanism Design via Differential Privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS 2007), October 20-23, 2007, Providence, RI, USA, Proceedings. 94–103.
  • Nissim et al. (2007) Kobbi Nissim, Sofya Raskhodnikova, and Adam D. Smith. 2007. Smooth sensitivity and sampling in private data analysis. In Proceedings of the 39th Annual ACM Symposium on Theory of Computing, San Diego, California, USA, June 11-13, 2007. 75–84.
  • Sen (1968) Pranab Kumar Sen. 1968. Estimates of the regression coefficient based on Kendall’s tau. Journal of the American statistical association 63, 324 (1968), 1379–1389.
  • Sheffet (2017) Or Sheffet. 2017. Differentially Private Ordinary Least Squares. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017. 3105–3114. http://proceedings.mlr.press/v70/sheffet17a.html
  • Theil (1950) Henri Theil. 1950. A rank-invariant method of linear and polynomial regression analysis, 3; confidence regions for the parameters of polynomial regression equations. Indagationes Mathematicae 1, 2 (1950), 467–482.
  • Wang (2018) Yu-Xiang Wang. 2018. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. CoRR abs/1803.02596 (2018). arXiv:1803.02596 http://arxiv.org/abs/1803.02596
  • Yan and Su (2009) Xin Yan and Xiao Gang Su. 2009. Linear Regression Analysis: Theory and Computing. World Scientific Publishing Co., Inc., USA.

Appendix A Opportunity Insights Application

A.1. Data Generating Process for OI Synthetic Datasets

In this subsection, we describe the data generating process, provided by the Opportunity Insights team, for simulating census microdata. In the model, assume that each set of parents $i$ in tract $t$ and state $m$ has one child (also indexed by $i$).

Size of Tract: The size of each tract in any state is a random variable. Let $\operatorname{Exp}(b)$ represent the exponential distribution with scale $b$. If $n_{tm}$ is the number of people in tract $t$ and state $m$, then $n_{tm}\sim\lfloor\operatorname{Exp}(52)+20\rfloor$. This distribution over simulated counts was chosen by the Opportunity Insights team because it closely matches the distribution of tract sizes in the real data.

Linear Income Relationship: Let $x_{itm}$ be the child income given the parent income $y_{itm}$. We enforce the following relationship between $x_{itm}$ and $y_{itm}$:

$\ln(x_{itm})=\alpha_{tm}+\beta_{tm}\ln(y_{itm})+e_{i},$

where $e_{i}\sim\mathcal{N}(0,(0.20)^{2})$ and $\alpha_{tm},\beta_{tm}$ are calculated from public estimates of child income rank given parent income rank. (Footnote 3: Gleaned from tables the Opportunity Insights team publicly released with tract-level (micro) data. See https://opportunityatlas.org/.) The dataset used to calculate $\alpha_{tm},\beta_{tm}$ is aggregated from publicly-available income data for all tracts within all U.S. states between 1978 and 1983. Next, $p_{im}$ is calculated as parent $i$'s percentile income within state $m$'s parent income distribution (rounded up to the 2nd decimal place).

Parent Income Distribution: Let $\mu_{tm}$ denote the public estimate of the mean household income for tract $t$ in state $m$ (also obtained from the publicly-available income data used to calculate $\alpha_{tm},\beta_{tm}$). Empirically, the Opportunity Insights team found that the within-tract standard deviation of parent incomes is about twice the between-tract standard deviation. Let $\operatorname{Var}(\mu_{tm})$ denote the sample variance of the estimator $\mu_{tm}$. Then enforce that $\operatorname{Var}_{tm}(y_{itm})=4\operatorname{Var}(\mu_{tm})$, where $\operatorname{Var}_{tm}(y_{itm})$ is the variance of parental income $y_{itm}$ in tract $t$ and state $m$. Furthermore, assume that the $y_{itm}$ are lognormally distributed within each tract and draw $\ln(y_{itm})$ from $\mathcal{N}(\mathbb{E}_{tm}[\ln(y_{itm})],\operatorname{Var}_{tm}[\ln(y_{itm})])$ for $i=1,\ldots,n_{tm}$, where

$\mathbb{E}_{tm}[\ln(y_{itm})]=2\ln(\mu_{tm})-0.5\ln(\operatorname{Var}_{tm}(y_{itm})+\mu_{tm}^{2}),$

and

$\operatorname{Var}_{tm}[\ln(y_{itm})]=-2\ln(\mu_{tm})+\ln(\operatorname{Var}_{tm}(y_{itm})+\mu_{tm}^{2}).$
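Putting the three steps above together, a minimal sketch of the generator for a single tract might look as follows. The function `simulate_tract` and its signature are ours; the public estimates $\alpha_{tm}$, $\beta_{tm}$, $\mu_{tm}$, and $\operatorname{Var}(\mu_{tm})$ are assumed to be given as inputs.

```python
import numpy as np

def simulate_tract(alpha_tm, beta_tm, mu_tm, var_mu, rng=None):
    """Sketch of the OI synthetic data process for one tract:
    (1) draw a tract size n ~ floor(Exp(52) + 20);
    (2) draw lognormal parent incomes matching mean mu_tm and
        within-tract variance 4 * Var(mu_tm);
    (3) generate child incomes from the log-linear model with
        N(0, 0.20^2) noise."""
    if rng is None:
        rng = np.random.default_rng()
    n = int(np.floor(rng.exponential(52) + 20))
    var_y = 4 * var_mu
    # Lognormal parameters matching the given mean and variance.
    mean_ln = 2 * np.log(mu_tm) - 0.5 * np.log(var_y + mu_tm**2)
    var_ln = -2 * np.log(mu_tm) + np.log(var_y + mu_tm**2)
    ln_y = rng.normal(mean_ln, np.sqrt(var_ln), size=n)
    ln_x = alpha_tm + beta_tm * ln_y + rng.normal(0, 0.20, size=n)
    return np.exp(ln_y), np.exp(ln_x)   # parent, child incomes
```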

A.2. The Maximum Observed Sensitivity Algorithm

Opportunity Insights (Chetty and Friedman, 2019) provided a practical method, which they term the "Maximum Observed Sensitivity" (MOS) algorithm, to reduce the privacy loss of their released estimates. This method is not formally differentially private. We use MOS and OI interchangeably to refer to their statistical disclosure limitation method. The crux of their algorithm is as follows: the maximum observed sensitivity, an upper envelope on the largest local sensitivity across tracts in a state in the dataset, is calculated; Laplace noise of this magnitude divided by the number of people in the cell is added to each estimate, which is then released. The statistics they release include the predictions at the 25th and 75th parent income percentiles per cell, the standard errors of these predictions, and the count.

Two notes are in order. First, the MOS algorithm does not calculate the local sensitivity exactly but uses a lower bound obtained by overlaying an $11\times 11$ grid on the $[0,1]\times[0,1]$ space of possible $\textbf{x},\textbf{y}$ values. The algorithm then adds a datapoint from this grid to the dataset and calculates the maximum resulting change in the statistic, which is used as the lower bound. Second, analyses are performed only on census tracts that satisfy the following property: they must have at least 20 individuals, with 10% of parent income percentiles in the tract above the state parent income median percentile and 10% below. If a tract does not satisfy this condition, then no regression estimate is released for that tract.

Appendix B Some Results in Differential Privacy

In this section we will briefly review some of the fundamental definitions and results pertaining to general differentially private algorithms.

For any query function $f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{K}$, let

$\text{GS}_{f}=\max_{d\sim d'}\|f(d)-f(d')\|,$

called the global sensitivity, be the maximum amount the query can differ on neighboring datasets.

Theorem 1 (Laplace Mechanism (Dwork et al., 2006)).

For any privacy parameter $\varepsilon>0$, query function $f:\mathcal{X}^{n}\rightarrow\mathbb{R}^{K}$, and database $d\in\mathcal{X}^{n}$, the Laplace mechanism outputs

$\tilde{f}_{L}(d)=f(d)+(R_{1},\ldots,R_{K}),$

where $R_{1},\ldots,R_{K}\sim\text{Lap}(0,\frac{\text{GS}_{f}}{\varepsilon})$ are i.i.d. random variables drawn from the 0-mean Laplace distribution with scale $\frac{\text{GS}_{f}}{\varepsilon}$.

The Laplace mechanism is $(\varepsilon,0)$-DP.
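A minimal sketch of Theorem 1 (our own illustrative code; the global sensitivity of the query is supplied by the caller):

```python
import numpy as np

def laplace_mechanism(f_d, gs_f, eps, rng=None):
    """Laplace mechanism (Theorem 1): add i.i.d. Lap(0, GS_f/eps)
    noise to each coordinate of the query answer f(d)."""
    if rng is None:
        rng = np.random.default_rng()
    f_d = np.atleast_1d(np.asarray(f_d, dtype=float))
    return f_d + rng.laplace(0, gs_f / eps, size=f_d.shape)
```

For example, a counting query has global sensitivity 1, so `laplace_mechanism(count, 1.0, eps)` releases an $(\varepsilon,0)$-DP count.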

Theorem 2 (Exponential Mechanism (McSherry and Talwar, 2007)).

Given an arbitrary range $\mathcal{R}$, let $u:\mathcal{X}^{n}\times\mathcal{R}\rightarrow\mathbb{R}$ be a utility function that maps database/output pairs to utility scores, and let $\text{GS}_{u}=\max_{r}\text{GS}_{u(\cdot,r)}$. For a fixed database $d\in\mathcal{X}^{n}$ and privacy parameter $\varepsilon>0$, the exponential mechanism outputs an element $r\in\mathcal{R}$ with probability proportional to $\exp\left(\frac{\varepsilon\cdot u(d,r)}{2\text{GS}_{u}}\right)$.

The exponential mechanism is $(\varepsilon,0)$-DP.
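Over a finite range, Theorem 2 can be sketched directly (our own illustrative code; `utilities` holds the scores $u(d,r)$ for each candidate $r$):

```python
import numpy as np

def exponential_mechanism(utilities, gs_u, eps, rng=None):
    """Exponential mechanism (Theorem 2) over a finite range:
    select index r with probability proportional to
    exp(eps * u(d, r) / (2 * GS_u))."""
    if rng is None:
        rng = np.random.default_rng()
    u = np.asarray(utilities, dtype=float)
    logits = eps * u / (2 * gs_u)
    logits -= logits.max()          # stabilize before exponentiating
    probs = np.exp(logits) / np.exp(logits).sum()
    return rng.choice(len(u), p=probs)
```

The continuous-output median selection used by the DPTheilSen algorithms follows the same principle, with intervals between data points playing the role of the finite range.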

The following results allow us to use differentially private algorithms as building blocks in larger algorithms.

Lemma 3 (Post-Processing (Dwork et al., 2006)).

Let $M:\mathcal{X}^{n}\rightarrow\mathcal{Y}$ be an $(\varepsilon,\delta)$-differentially private algorithm and let $f:\mathcal{Y}\rightarrow\mathcal{R}$ be a (randomized) function. Then $f\circ M:\mathcal{X}^{n}\rightarrow\mathcal{R}$ is an $(\varepsilon,\delta)$-differentially private algorithm.

Theorem 4 (Basic Composition (Dwork et al., 2006)).

For each $k\in[K]$, let $M_{k}$ be an $(\varepsilon_{k},\delta_{k})$-differentially private algorithm. Then the composition of the $K$ mechanisms, $M=(M_{1},\ldots,M_{K})$, is $(\varepsilon,\delta)$-differentially private, where $\varepsilon=\sum_{k\in[K]}\varepsilon_{k}$ and $\delta=\sum_{k\in[K]}\delta_{k}$.

Definition 5 (Coupling).

Let $\textbf{z}$ and $\textbf{z}'$ be two random variables defined over the probability spaces $Z$ and $Z'$, respectively. A coupling of $\textbf{z}$ and $\textbf{z}'$ is a joint random variable $(\textbf{z}_{c},\textbf{z}'_{c})$ taking values in the product space $Z\times Z'$ such that $\textbf{z}_{c}$ has the same marginal distribution as $\textbf{z}$ and $\textbf{z}'_{c}$ has the same marginal distribution as $\textbf{z}'$.

Definition 6 (cc-Lipschitz randomized transformations).

A randomized transformation $T:\mathcal{X}^{n}\rightarrow\mathcal{Y}^{m}$ is $c$-Lipschitz if for all datasets $d,d'\in\mathcal{X}^{n}$, there exists a coupling $(\textbf{z}_{c},\textbf{z}'_{c})$ of the random variables $\textbf{z}=T(d)$ and $\textbf{z}'=T(d')$ such that with probability 1,

$H(\textbf{z}_{c},\textbf{z}'_{c})\leq c\cdot H(d,d'),$

where $H$ denotes Hamming distance.

Lemma 7 (Composition with Lipschitz transformations (well-known)).

Let $M$ be an $(\varepsilon,\delta)$-DP algorithm, and let $T$ be a $c$-Lipschitz transformation of the data with respect to the Hamming distance. Then $M\circ T$ is $(c\varepsilon,\delta)$-DP.

Proof.

Let $H(d,d')$ denote the Hamming distance (in terms of additions/removals or swaps) between datasets $d$ and $d'$. By definition of the Lipschitz property, there is a coupling of $T(d)$ and $T(d')$ under which the outputs satisfy $H(\textbf{z}_{c},\textbf{z}'_{c})\leq c\cdot H(d,d')$ with probability 1. The lemma follows directly from applying this property to adjacent databases together with the definition of $(\varepsilon,\delta)$-differential privacy. ∎

Appendix C NoisyStats

C.1. Privacy Proof of NoisyStats

Lemma 1.

We are given two vectors $\textbf{x},\textbf{y}\in[0,1]^{n}$. Let

$\text{ncov}(\textbf{x},\textbf{y})=\langle\textbf{x}-\bar{x}\mathbf{1},\textbf{y}-\bar{y}\mathbf{1}\rangle=\left(\sum_{i=1}^{n}x_{i}y_{i}\right)-\frac{(\sum_{i=1}^{n}x_{i})(\sum_{i=1}^{n}y_{i})}{n}$

and

$\text{nvar}(\textbf{x})=\langle\textbf{x}-\bar{x}\mathbf{1},\textbf{x}-\bar{x}\mathbf{1}\rangle=\left(\sum_{i=1}^{n}x_{i}^{2}\right)-\frac{(\sum_{i=1}^{n}x_{i})^{2}}{n}.$

Here $\bar{x},\bar{y}$ are the means of $\textbf{x}$ and $\textbf{y}$ respectively, and $\mathbf{1}$ is the all-ones vector.

Then if $\text{GS}_{\text{ncov}}$ and $\text{GS}_{\text{nvar}}$ are the global sensitivities of the functions ncov and nvar, we have $\text{GS}_{\text{ncov}}=1-\frac{1}{n}$ and $\text{GS}_{\text{nvar}}=1-\frac{1}{n}$.

Proof.

Let $\textbf{z}=\langle\textbf{x},\textbf{y}\rangle$ and $\textbf{z}'=\langle\textbf{x}',\textbf{y}'\rangle$ be neighbouring databases differing on the $n$th datapoint. (Footnote 4: This is without loss of generality, as we can always "rotate" both databases until the index on which they differ becomes the $n$th datapoint.) Let $a=\sum_{i=1}^{n-1}x_{i}$ and $b=\sum_{i=1}^{n-1}y_{i}$, and note that $\max\{a,b\}\leq n-1$. Then,

$\text{nvar}(\textbf{x})-\text{nvar}(\textbf{x}')=x_{n}^{2}-x_{n}'^{2}-\frac{2ax_{n}}{n}-\frac{x_{n}^{2}}{n}+\frac{2ax_{n}'}{n}+\frac{x_{n}'^{2}}{n}=\left(1-\tfrac{1}{n}\right)(x_{n}^{2}-x_{n}'^{2})+\frac{2a}{n}(x_{n}'-x_{n}).$

If $x_{n}'-x_{n}\leq 0$, then $\text{nvar}(\textbf{x})-\text{nvar}(\textbf{x}')\leq(1-\frac{1}{n})(x_{n}^{2}-x_{n}'^{2})\leq 1-\frac{1}{n}$. Otherwise,

$\text{nvar}(\textbf{x})-\text{nvar}(\textbf{x}')\leq\left(1-\tfrac{1}{n}\right)(x_{n}^{2}-x_{n}'^{2})+\frac{2(n-1)}{n}(x_{n}'-x_{n})=\left(1-\tfrac{1}{n}\right)(x_{n}^{2}-2x_{n}+2x_{n}'-x_{n}'^{2}).$

Since $x_{n}\in[0,1]$, we have $x_{n}^{2}-2x_{n}\in[-1,0]$, and since $x_{n}'\in[0,1]$, we have $2x_{n}'-x_{n}'^{2}\in[0,1]$, so $\text{nvar}(\textbf{x})-\text{nvar}(\textbf{x}')\leq 1-\frac{1}{n}$.

Also,

$\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')=x_{n}y_{n}-x_{n}'y_{n}'+\frac{a(y_{n}'-y_{n})+b(x_{n}'-x_{n})+x_{n}'y_{n}'-x_{n}y_{n}}{n}=\left(1-\tfrac{1}{n}\right)(x_{n}y_{n}-x_{n}'y_{n}')+\frac{a(y_{n}'-y_{n})+b(x_{n}'-x_{n})}{n}.$

If $y_{n}'-y_{n}\leq 0$ and $x_{n}'-x_{n}\leq 0$, then $\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')\leq(1-\frac{1}{n})(x_{n}y_{n}-x_{n}'y_{n}')\leq 1-\frac{1}{n}$. If $y_{n}'-y_{n}\leq 0$ and $x_{n}'-x_{n}>0$, then $\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')\leq(1-\frac{1}{n})(x_{n}y_{n}-x_{n}'y_{n}'+(x_{n}'-x_{n}))$. Since $x_{n}y_{n}-x_{n}\leq 0$ and $x_{n}'-x_{n}'y_{n}'\leq 1$, we have $\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')\leq 1-\frac{1}{n}$. Similarly, if $y_{n}'-y_{n}>0$ and $x_{n}'-x_{n}\leq 0$, then $\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')\leq 1-\frac{1}{n}$. Finally, if $y_{n}'-y_{n}>0$ and $x_{n}'-x_{n}>0$, we have

$\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')\leq\left(1-\tfrac{1}{n}\right)(x_{n}y_{n}-x_{n}'y_{n}'+(y_{n}'-y_{n})+(x_{n}'-x_{n}))=\left(1-\tfrac{1}{n}\right)((x_{n}-1)(y_{n}-1)-(x_{n}'-1)(y_{n}'-1)).$

Since $(x_{n}-1)(y_{n}-1)\in[0,1]$ and $(x_{n}'-1)(y_{n}'-1)\in[0,1]$, we have $\text{ncov}(\textbf{x},\textbf{y})-\text{ncov}(\textbf{x}',\textbf{y}')\leq 1-\frac{1}{n}$. ∎
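As an informal sanity check of Lemma 1 (ours, not from the paper), one can verify numerically that random neighbouring datasets in $[0,1]^{n}$ never violate the $1-\frac{1}{n}$ bound; `nvar` and `ncov` below are direct translations of the definitions above:

```python
import numpy as np

def nvar(x):
    return np.dot(x - x.mean(), x - x.mean())

def ncov(x, y):
    return np.dot(x - x.mean(), y - y.mean())

# Empirical check of Lemma 1: over random neighbouring datasets in
# [0,1]^n differing in one point, |nvar(x) - nvar(x')| and
# |ncov(x,y) - ncov(x',y')| never exceed 1 - 1/n.
rng = np.random.default_rng(0)
n = 25
bound = 1 - 1 / n
for _ in range(1000):
    x, y = rng.random(n), rng.random(n)
    x2, y2 = x.copy(), y.copy()
    x2[-1], y2[-1] = rng.random(), rng.random()  # change one datapoint
    assert abs(nvar(x) - nvar(x2)) <= bound + 1e-12
    assert abs(ncov(x, y) - ncov(x2, y2)) <= bound + 1e-12
```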

Proof of Lemma 1 (NoisyStats).

The global sensitivity of both $\text{ncov}(\textbf{x},\textbf{y})$ and $\text{nvar}(\textbf{x})$ is bounded by $\Delta = 1-1/n$ (by Lemma 1).

As a result, if we sample $L_1, L_2 \sim \text{Lap}(0, 3\Delta/\varepsilon)$, then $\text{ncov}(\textbf{x},\textbf{y})+L_1$ and $\text{nvar}(\textbf{x})+L_2$ are each $(\varepsilon/3, 0)$-DP estimates by the Laplace mechanism guarantees (see Theorem 1). By the post-processing property of differential privacy (Lemma 3), $1/(\text{nvar}(\textbf{x})+L_2)$ is a private release, as is the test $\text{nvar}(\textbf{x})+L_2 > 0$. As a result, $\tilde{\alpha}$ is a $(2\varepsilon/3, 0)$-DP release. To compute the private intercept $\tilde{\beta}$, we use the fact that the global sensitivity of $\bar{y}-\tilde{\alpha}\bar{x}$ is at most $(1+|\tilde{\alpha}|)/n$, since each of the means of $\textbf{x}$ and $\textbf{y}$ can change by at most $1/n$ between neighboring datasets. (Alternatively, one could estimate $\tilde{\beta}$ by computing $\tilde{x}, \tilde{y}$, private estimates of $\bar{x}, \bar{y}$, obtained by adding Laplace noise drawn from $\text{Lap}(0, 1/n)$, and then computing $\tilde{\beta} = \tilde{y} - \tilde{\alpha}\tilde{x}$.) The Laplace noise we add ensures that the release of the intercept is $(\varepsilon/3, 0)$-DP.

Finally, by the composition properties of differential privacy (Theorem 4), Algorithm 1 is $(\varepsilon, 0)$-DP. ∎
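For concreteness, the mechanism analyzed above can be sketched in a few lines of Python. This is our own rendering of Algorithm 1 following the proof, not the authors' reference code; the function and variable names are ours, and inputs are assumed rescaled to $[0,1]$:

```python
import numpy as np

def noisy_stats(x, y, eps, rng=None):
    """Sketch of Algorithm 1 (NoisyStats), following the proof above.

    Assumes x and y lie in [0, 1].  Returns a private (slope, intercept)
    pair, or None in place of the failure symbol ⊥.
    """
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    n = len(x)
    ncov = np.sum((x - x.mean()) * (y - y.mean()))
    nvar = np.sum((x - x.mean()) ** 2)
    delta = 1.0 - 1.0 / n                      # sensitivity bound (Lemma 1)
    L1, L2 = rng.laplace(0.0, 3.0 * delta / eps, size=2)
    if nvar + L2 <= 0:
        return None                            # failure: output ⊥
    alpha = (ncov + L1) / (nvar + L2)          # (2ε/3)-DP slope
    # sensitivity of ȳ − α̃·x̄ is at most (1 + |α̃|)/n
    L3 = rng.laplace(0.0, 3.0 * (1.0 + abs(alpha)) / (eps * n))
    beta = (y.mean() - alpha * x.mean()) + L3  # remaining ε/3 of the budget
    return alpha, beta
```

Each of the three Laplace draws consumes $\varepsilon/3$ of the budget, matching the composition argument in the proof.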

C.2. On the Failure Rate of NoisyStats

Figure 13. Failure rate of NoisyStats on two IL tracts: (a) Tract 800100 in County 31, IL ($\hat{\alpha}=0.45577$, $\text{nvar}(\textbf{x})=0.0245$, $n=39$); (b) Tract 520100 in County 31, IL ($\hat{\alpha}=0.25217$, $\text{nvar}(\textbf{x})=28.002$, $n=381$).

Figure 14. Failure rate of NoisyStats on two NC tracts: (a) Tract 2008 in County 63, NC ($\hat{\alpha}=0.0965$, $\text{nvar}(\textbf{x})=0.0245$, $n=39$); (b) Tract 603 in County 147, NC ($\hat{\alpha}=0.03757$, $\text{nvar}(\textbf{x})=31.154$, $n=433$).

Figure 15. Failure rate of NoisyStats for all tracts in IL, sorted by $\varepsilon\,\text{nvar}(\textbf{x})$.

Unlike all the other algorithms in this paper, NoisyStats (Algorithm 1) can fail by returning $\perp$. It fails if and only if the noisy $\text{nvar}(\textbf{x})$ sufficient statistic becomes zero or negative, i.e., $\text{nvar}(\textbf{x})+L_2 \leq 0$, where $L_2 \sim \text{Lap}(0, 3(1-1/n)/\varepsilon)$. Since the statistic $\text{nvar}(\textbf{x})$ is always non-negative, we also require the private version of this statistic to be non-negative. Intuitively, $\text{nvar}(\textbf{x})+L_2$ is more likely to fall at or below 0 when $\text{nvar}(\textbf{x})$ is small or $\varepsilon$ is small: the setting of $\varepsilon$ directly controls the scale of the noise distribution added to ensure privacy, and the smaller $\varepsilon$ is, the more spread out the noise distribution is. So we would expect Algorithm 1 to fail more often when $\varepsilon$ or $\text{nvar}(\textbf{x})$ is small. We show experimentally that this observation holds on some tracts in IL and NC (from the semi-synthetic datasets given to us by the Opportunity Insights team).

In Figures 13(a), 13(b), 14(a), and 14(b), we see the failure rates for both high- and low-$\text{nvar}(\textbf{x})$ census tracts in IL and NC as we vary $\varepsilon$ between $2^{-7}$ and $8$. The failure rate is about 40% for every value of $\varepsilon$ on the low-$\text{nvar}(\textbf{x})$ tracts, and is on average less than 5% on the high-$\text{nvar}(\textbf{x})$ tracts. In Figure 15, we show the failure rate of NoisyStats when evaluated on data from all tracts in IL, averaged over 100 trials. The failure rate for $\varepsilon=8$ is 0% for the majority of tracts; for tracts with small $\text{nvar}(\textbf{x})$, the rate of failure is at most 12%. Thus, we conclude that the failure rate approaches 0 as either $\varepsilon$ or $\text{nvar}(\textbf{x})$ increases.
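Because $L_2$ is Laplace, the failure probability also has a simple closed form that can be used to sanity-check the empirical rates above. The formula below is our own derivation from the Laplace CDF, not taken from the paper:

```python
import math

def noisystats_failure_prob(nvar, eps, n):
    """P[nvar + L2 <= 0] for L2 ~ Lap(0, 3(1 - 1/n)/eps) and nvar >= 0,
    using the Laplace CDF: P[L2 <= -v] = exp(-v / b) / 2 for v >= 0."""
    b = 3.0 * (1.0 - 1.0 / n) / eps  # noise scale used by NoisyStats
    return 0.5 * math.exp(-nvar / b)

# The two IL tracts from Figure 13, evaluated at eps = 1:
p_low = noisystats_failure_prob(0.0245, eps=1.0, n=39)     # ≈ 0.496
p_high = noisystats_failure_prob(28.002, eps=1.0, n=381)   # ≈ 4e-5
```

The closed form matches the qualitative behavior in the figures: a failure rate near $1/2$ for the low-$\text{nvar}(\textbf{x})$ tract across the range of $\varepsilon$, and a negligible rate for the high-$\text{nvar}(\textbf{x})$ tract.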

Appendix D DPExpTheilSen and DPWideTheilSen

D.1. Privacy Proofs for DPExpTheilSen and DPWideTheilSen

Lemma 1.

Let $T$ be the following randomized algorithm. For a dataset $d = (x_i, y_i)_{i=1}^n$, let $K_n(d)$ be the complete graph on the $n$ datapoints, where edges denote points paired together to compute estimates in Theil-Sen. Then $K_n(d)$ can be decomposed into $n-1$ matchings, $\Sigma_1, \ldots, \Sigma_{n-1}$. Suppose $T(d)$ samples $k$ of the $n-1$ matchings without replacement and computes the corresponding pairwise estimates (up to $kn/2$ estimates). Then $T$ is a $k$-Lipschitz randomized transformation.

Proof.

Let $\textbf{z} = T(d)$ and $\textbf{z}' = T(d')$ denote the multisets of estimates that result from applying $T$ to datasets $d$ and $d'$, respectively. We define a coupling $(\textbf{z}_c, \textbf{z}'_c)$ of $\textbf{z}$ and $\textbf{z}'$. First, use $k$ matchings sampled randomly without replacement from $K_n(d)$, say $\Sigma_1, \ldots, \Sigma_k$, to compute the multiset of estimates $\textbf{z}_c = \{z_{j,l}^{(p_{x_{new}})} : (x_j, x_l) \in \Sigma_1 \cup \ldots \cup \Sigma_k\}$. Then, use the corresponding $k$ matchings from $K_n(d')$ to compute a multiset of estimates $\textbf{z}'_c = \{z_{j,l}^{(p_{x_{new}})} : (x'_j, x'_l) \in \Sigma_1 \cup \ldots \cup \Sigma_k\}$. This is a valid coupling because the $k$ matchings are sampled randomly without replacement from the complete graphs $K_n(d)$ and $K_n(d')$, respectively, so the marginal distributions of $\textbf{z}_c$ and $\textbf{z}'_c$ match those of $\textbf{z}$ and $\textbf{z}'$.

Notice that every datapoint $x_j$ is used to compute exactly $k$ estimates in $\textbf{z}_c$. Therefore, for every datapoint at which $d$ and $d'$ differ, $\textbf{z}_c$ and $\textbf{z}'_c$ differ in at most $k$ estimates, and by the triangle inequality we are done. ∎
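The decomposition of $K_n$ into $n-1$ matchings (for even $n$) can be produced by the standard round-robin construction; a sketch of the transformation $T$ is below. All names are ours, and for brevity each pair contributes a slope estimate rather than the $z^{(p_{x_{new}})}$ point estimates used in the paper:

```python
import random

def round_robin_matchings(n):
    """Decompose the complete graph K_n (n even) into n - 1 perfect
    matchings using the circle (round-robin tournament) method."""
    assert n % 2 == 0
    order = list(range(n))
    matchings = []
    for _ in range(n - 1):
        matchings.append([(order[i], order[n - 1 - i]) for i in range(n // 2)])
        # keep order[0] fixed and rotate the remaining vertices one step
        order = [order[0]] + [order[-1]] + order[1:-1]
    return matchings

def k_match_estimates(x, y, k, rng=random):
    """The transformation T of the lemma: sample k matchings without
    replacement and emit one pairwise estimate per sampled edge."""
    chosen = rng.sample(round_robin_matchings(len(x)), k)
    return [(y[i] - y[j]) / (x[i] - x[j])
            for m in chosen for (i, j) in m if x[i] != x[j]]
```

Each datapoint appears in exactly one edge per matching, so it influences exactly $k$ of the emitted estimates; this is the property the coupling argument relies on.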

Proof of Lemma 2.

Write the DP median subroutine as $\text{DPmed}(z^{(p25)}, \varepsilon, (n, k, \text{hyperparameters})) = \mathcal{M}(z^{(p25)}, \text{hyperparameters})$. Then Algorithm 2 is a composition of two algorithms, $\mathcal{M} \circ T$, where by Lemma 1, $T$ is a $k$-Lipschitz randomized transformation, and $\mathcal{M}$ is $(\varepsilon/k, 0)$-DP. By the Lipschitz composition lemma (Lemma 7), Algorithm 2 (DPTheilSenkMatch) is $(\varepsilon, 0)$-DP. ∎

Proofs of Lemmas 4 and 5.

The privacy of DPExpTheilSenkMatch and DPWideTheilSenkMatch follows directly from Theorem 2 and Lemma 2. ∎

D.2. Sensitivity of Hyperparameter

Choosing optimal hyperparameters is beyond the scope of this work. However, in this section we present some preliminary work exploring the behavior of DPWideTheilSen with respect to the choice of $\theta$. In particular, we consider how robust this algorithm is to the setting of this hyperparameter. Figure 16 shows performance as a function of the widening parameter $\theta$ on synthetic (Gaussian) data. Note that in each graph both axes are on a log scale, so we see very large variation in quality depending on the choice of hyperparameter.

Figure 16. Experimental results exploring the sensitivity of DPWideTheilSen to the choice of hyperparameter. For each dataset, $n=40$ datapoints are generated as $x_i \sim \mathcal{N}(0, \sigma^2)$, $y_i = 0.5 x_i + 0.5 + \mathcal{N}(0, \tau^2)$, with $\sigma = 10^{-3}$ and $\tau = 10^{-4}$. The datapoints are then truncated to lie between 0 and 1. Note that both axes are on a log scale.

D.3. Pseudo-code for DPExpTheilSen

In Algorithm 3, we give an efficient implementation of the DP median algorithm used as a subroutine in DPExpTheilSen: the exponential mechanism for computing medians. To sample efficiently from this distribution, we implement a two-step algorithm following (Cormode, [n.d.]): first, we sample an interval according to the exponential mechanism, and then we sample an output uniformly at random from that interval. To implement the exponential mechanism efficiently, we use the fact that sampling from it is equivalent to choosing the value with maximum utility score after i.i.d. Gumbel-distributed noise has been added to the utility scores (Dwork et al., 2014; Abernethy et al., 2016).

Data: $\textbf{z}$
Privacy params: $\varepsilon$
Input: $n, k, r_l, r_u$
$\varepsilon = \varepsilon/k$
Sort $\textbf{z}$ in increasing order
Clip $\textbf{z}$ to the range $[r_l, r_u]$
Insert $r_l$ and $r_u$ into $\textbf{z}$ and set $n = n + 2$
Set maxNoisyScore $= -\infty$
Set argMaxNoisyScore $= -1$
for $i \in [1, n)$ do
       logIntervalLength $= \log(\textbf{z}[i] - \textbf{z}[i-1])$
       distFromMedian $= \lceil|i - \frac{n}{2}|\rceil$
       score $=$ logIntervalLength $- \frac{\varepsilon}{2} \cdot$ distFromMedian
       $N \sim \text{Gumbel}(0, 1)$
       noisyScore $=$ score $+ N$
       if noisyScore $>$ maxNoisyScore then
              maxNoisyScore $=$ noisyScore
              argMaxNoisyScore $= i$

left $= \textbf{z}[\text{argMaxNoisyScore} - 1]$
right $= \textbf{z}[\text{argMaxNoisyScore}]$
Sample $\tilde{m} \sim \text{Unif}[\text{left}, \text{right}]$
return $\tilde{m}$
Algorithm 3 Exponential Mechanism for Median: $(\varepsilon/k, 0)$-DP Algorithm
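A direct Python transcription of Algorithm 3 (our own, with a per-call budget assumed to already be $\varepsilon/k$) looks like the following; empty intervals get utility $-\infty$ and are never selected:

```python
import numpy as np

def dp_median_exp(z, eps, r_l, r_u, rng=None):
    """Exponential mechanism for the median over clipped, sorted data,
    sampled via the Gumbel-max trick as in Algorithm 3."""
    rng = np.random.default_rng() if rng is None else rng
    z = np.sort(np.clip(np.asarray(z, dtype=float), r_l, r_u))
    z = np.concatenate(([r_l], z, [r_u]))  # insert the range endpoints
    n = len(z)
    best_i, best_score = None, -np.inf
    for i in range(1, n):
        if z[i] <= z[i - 1]:
            continue                       # empty interval: utility -inf
        score = (np.log(z[i] - z[i - 1])
                 - (eps / 2.0) * np.ceil(abs(i - n / 2.0))
                 + rng.gumbel(0.0, 1.0))   # Gumbel-max = exp. mechanism
        if score > best_score:
            best_i, best_score = i, score
    # step two: sample uniformly from the selected interval
    return rng.uniform(z[best_i - 1], z[best_i])
```

Selecting the noisy arg-max over intervals and then drawing uniformly within the winner reproduces the two-step sampling procedure described above.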

D.4. Pseudo-code for DPWideTheilSen

The pseudo-code for DPWideTheilSen is given in Algorithm 4. It is a small variant on Algorithm 3.

Data: $\textbf{z}$
Privacy params: $\varepsilon$
Input: $n, k, \theta, r_l, r_u$
$\varepsilon = \varepsilon/k$
Sort $\textbf{z}$ in increasing order
Clip $\textbf{z}$ to the range $[r_l, r_u]$
if $n$ is even then
       Insert $m$, the true median, into $\textbf{z}$
       Set $n = n + 1$
for $i \in [0, \lfloor\frac{n}{2}\rfloor]$ do
       $\textbf{z}[i] = \max(r_l, \textbf{z}[i] - \theta)$
       $\textbf{z}[n-i-1] = \min(r_u, \textbf{z}[n-i-1] + \theta)$
Insert $r_l$ and $r_u$ into $\textbf{z}$ and set $n = n + 2$
Set maxNoisyScore $= -\infty$
Set argMaxNoisyScore $= -1$
for $i \in [1, n)$ do
       logIntervalLength $= \log(\textbf{z}[i] - \textbf{z}[i-1])$
       distFromMedian $= \lceil|i - \frac{n}{2}|\rceil$
       score $=$ logIntervalLength $- \frac{\varepsilon}{2} \cdot$ distFromMedian
       $N \sim \text{Gumbel}(0, 1)$
       noisyScore $=$ score $+ N$
       if noisyScore $>$ maxNoisyScore then
              maxNoisyScore $=$ noisyScore
              argMaxNoisyScore $= i$

left $= \textbf{z}[\text{argMaxNoisyScore} - 1]$
right $= \textbf{z}[\text{argMaxNoisyScore}]$
Sample $\tilde{m} \sim \text{Unif}[\text{left}, \text{right}]$
return $\tilde{m}$
Algorithm 4 $\theta$-Widened Exponential Mechanism for Median: $(\varepsilon/k, 0)$-DP Algorithm

Appendix E DPSSTheilSen

Suppose we are given a dataset $(\textbf{x}, \textbf{y})$, and consider a neighboring dataset $(\textbf{x}', \textbf{y}')$ that differs from the original in exactly one row. Let $\textbf{z}$ be the set of point estimates (e.g., the $p25$ or $p75$ point estimates) induced by the dataset $(\textbf{x}, \textbf{y})$ under Theil-Sen, and let $\textbf{z}'$ be the set of point estimates induced by $(\textbf{x}', \textbf{y}')$. Formally, for $N = kn/2$, we let $\mathcal{Z}_k : [0,1]^n \times [0,1]^n \rightarrow \mathbb{R}^N$ denote the function that transforms a set of point coordinates into estimates for each pair of points, so that $\textbf{z} = \mathcal{Z}_k(\textbf{x}, \textbf{y})$ and $\textbf{z}' = \mathcal{Z}_k(\textbf{x}', \textbf{y}')$. Notice that changing one datapoint in $(\textbf{x}, \textbf{y})$ changes at most $k$ of the point estimates in $\textbf{z}$. Assume that both $\textbf{z}$ and $\textbf{z}'$ are in sorted order. Recall the definition of $S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x}, \textbf{y}))$:

$S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x},\textbf{y})) = \max\Big\{ z_{m+k}-z_m,\; z_m-z_{m-k},\; \max_{l=1,\ldots,n}\,\max_{s=0,\ldots,k(l+1)} e^{-lt}\left(z_{m+s}-z_{m-k(l+1)+s}\right)\Big\}.$

Let

$LS_{\text{med}}^k(\textbf{z}) = \max_{\textbf{z}'\in\mathbb{R}^N,\,\text{Ham}(\textbf{z},\textbf{z}')\leq k}|\text{med}(\textbf{z})-\text{med}(\textbf{z}')|$

be the distance-$k$ local sensitivity of the dataset $\textbf{z}$ with respect to the median. In order to prove that $S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x},\textbf{y}))$ is a $t$-smooth upper bound on $LS_{\text{med}\circ\mathcal{Z}_k}$, we will use the observation that

$LS_{\text{med}\circ\mathcal{Z}_k}(\textbf{x},\textbf{y}) \leq LS_{\text{med}}^k(\textbf{z}).$

Now Figure 17 outlines the maximal changes we can make to $\textbf{z}$. For $l \geq 1$ and any interval of $lk+k+1$ points containing the median, we can move the median to one side of the interval by moving $kl$ points, and to the other side by moving an additional $k$ points. Therefore, for $l \geq 1$,

(4) $\max_{\textbf{z}':d(\textbf{z},\textbf{z}')\leq lk} LS_{\text{med}}^k(\textbf{z}') = \max_{s=0,\ldots,lk+k}\{z_{m+s}-z_{m-(lk+k)+s}\}$

so

$S_{\text{med},t}^k(\textbf{z}) = \max_{l=0,\ldots,n} e^{-lt} \max_{\textbf{z}':d(\textbf{z},\textbf{z}')\leq lk} LS_{\text{med}}^k(\textbf{z}').$

Figure 17. A brief proof by pictures of Equation 4.
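Equation 4 gives a direct $O(n^2 k)$ way to compute the smooth upper bound from the sorted estimates. The sketch below uses our own index conventions (out-of-range order statistics are padded with the endpoints of the data range $[r_l, r_u]$), so off-by-one details may differ from the authors' implementation:

```python
import numpy as np

def smooth_sensitivity_median(z, k, t, r_l=0.0, r_u=1.0):
    """t-smooth upper bound on the distance-k local sensitivity of the
    median, per Equation 4: max_l e^{-lt} max_s (z_{m+s} - z_{m-(lk+k)+s})."""
    z = np.sort(np.clip(np.asarray(z, dtype=float), r_l, r_u))
    n = len(z)
    m = n // 2                       # median index (one convention)

    def zval(i):                     # pad with the endpoints of the range
        return r_l if i < 0 else (r_u if i >= n else z[i])

    S = 0.0
    for l in range(0, n + 1):
        width = max(zval(m + s) - zval(m - (l * k + k) + s)
                    for s in range(0, l * k + k + 1))
        S = max(S, np.exp(-l * t) * width)
    return S
```

Since every interval width is at most $r_u - r_l$ and $e^{-lt} \leq 1$, the bound never exceeds the data range, and the $l = 0$ term recovers the distance-$k$ local sensitivity.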
Proof of Lemma 7.

We need to show that $S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x},\textbf{y}))$ is lower bounded by the local sensitivity, and that for any dataset $(\textbf{x}',\textbf{y}')$ such that $d((\textbf{x},\textbf{y}),(\textbf{x}',\textbf{y}'))\leq l$, we have $S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x},\textbf{y})) \leq e^{tl} S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x}',\textbf{y}'))$.

By definition of $S_{\text{med},t}^k$, we see that $S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x},\textbf{y})) \geq LS_{\text{med}}^k$ (take $l = 0$ in the formula for $S_{\text{med},t}^k$). Next, we see that

(5) $S_{\text{med},t}^k(\textbf{z}) = \max_{l=0,\ldots,n} e^{-lt} \max_{\textbf{z}':d(\textbf{z},\textbf{z}')\leq lk} LS_{\text{med}}^k(\textbf{z}')$
(6) $\leq e^t \cdot \max_{l=1,\ldots,n} e^{-lt} \max_{\textbf{z}'':d(\textbf{z}',\textbf{z}'')\leq lk} LS_{\text{med}}^k(\textbf{z}'')$
(7) $\leq e^t \cdot S_{\text{med},t}^k(\textbf{z}'),$

which completes the proof. ∎

Lemma 1.

Let $M(\textbf{x}) = \text{median}(\textbf{x}) + \frac{1}{s} S_{\text{med},t}^k(\textbf{x}) \cdot N$, where $N$, $s$, and $t$ are computed according to Algorithm 5. Then $M$ is $(\varepsilon, 0)$-DP.

Proof.

Let $D_\infty(P\,\|\,Q) = \sup_{x \in \text{supp}(Q)} \log\frac{p(x)}{q(x)}$ denote the max-divergence between distributions $P$ and $Q$. Let $N$ be a random variable sampled from StudentsT$(d)$, where $d > 0$ is the degrees of freedom. From Theorem 31 in (Bun and Steinke, 2019), we have that for $s, t > 0$,

$\left.\begin{array}{l} D_\infty(N\,\|\,e^t N + s) \\ D_\infty(e^t N + s\,\|\,N) \end{array}\right\} \leq |t|(d+1) + |s|\cdot\frac{d+1}{2\sqrt{d}}.$

The parameters $s$ and $t$ correspond to the translation (shifting) and dilation (scaling) of the StudentsT$(d)$ distribution.

Setting $s = 2\sqrt{d}\left(\frac{\varepsilon - |t|(d+1)}{d+1}\right)$ as in Algorithm 5, we have that for $|t|(d+1) < \varepsilon$,

(8) $\begin{cases} D_\infty(N\,\|\,e^t N + s) \\ D_\infty(e^t N + s\,\|\,N) \end{cases} \leq \varepsilon.$

If Equation 8 is satisfied, then by Theorem 46 in (Bun and Steinke, 2019), the mechanism in Algorithm 5, $M(\textbf{z}) = \text{median}(\textbf{z}) + \frac{1}{s} S_{\text{med},t}^k(\textbf{z}) \cdot N$, is $(\varepsilon, 0)$-DP. ∎

Data: $\textbf{z}$, $\{(x_i, y_i)\}_{i=1}^n \in ([0,1]\times[0,1])^n$
Privacy params: $\varepsilon$
Hyperparams: $k, n, r_l, r_u, d$
Set $t = \frac{\varepsilon}{2(d+1)}$ and $s = \frac{\varepsilon\sqrt{d}}{d+1}$
$S_{\text{median}} = S_{\text{med},t}^k((\textbf{x},\textbf{y}))$
Sample $N \sim$ Student's T$(d)$
Set $\tilde{m} = \text{median}(\textbf{z}) + \frac{1}{s} \cdot S_{\text{median}} \cdot N$
return $\tilde{m}$
Algorithm 5 Smooth Sensitivity Student's T Noise Addition for Median: $(\varepsilon, 0)$-DP Algorithm
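A minimal Python rendering of Algorithm 5 (ours; it takes the precomputed smooth-sensitivity bound as an argument rather than recomputing it) might look like:

```python
import numpy as np

def dp_median_students_t(z, eps, S_median, d=3, rng=None):
    """Student's T noise addition for the median (Algorithm 5).

    S_median must be a t-smooth sensitivity bound for t = eps / (2(d + 1)),
    the beta = 1/2 setting used in the pseudocode."""
    rng = np.random.default_rng() if rng is None else rng
    s = eps * np.sqrt(d) / (d + 1)   # noise scale divisor
    return np.median(z) + (S_median / s) * rng.standard_t(d)
```

With $d$ degrees of freedom the noise has heavier tails than a Gaussian, which is what allows a pure-DP guarantee when calibrated to smooth sensitivity.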

E.1. Sensitivity of Hyperparameters

In Algorithm 5, we set the smoothing parameters to a specific function of $\epsilon$ and $d$: $t = \frac{\epsilon}{2(d+1)}$ and $s = \frac{\epsilon\sqrt{d}}{d+1}$. There are other valid choices for these parameters. For any $\beta \in [0,1]$, the $(\epsilon, 0)$-DP guarantee is preserved if we set

$t = \frac{\epsilon\beta}{d+1} \quad\text{and}\quad s = 2\sqrt{d}\left(\frac{\epsilon - t(d+1)}{d+1}\right).$

Algorithm 5 corresponds to setting $\beta = 1/2$. Increasing $\beta$ increases $t$, which decreases the smooth bound $S_{\text{med}\circ\mathcal{Z}_k,t}^k((\textbf{x},\textbf{y}))$; however, increasing $\beta$ also decreases $s$, which increases the scale of the added noise. In Figure 18, we explore the performance of DPSSTheilSen as a function of $\beta$ on synthetic (Gaussian) data. The performance does not seem very sensitive to the choice of $\beta$, and $\beta = 1/2$ is a good choice on this data.

Figure 18. Experimental results exploring the sensitivity of DPSSTheilSen to the choice of hyperparameter. For each dataset, $n=30$ datapoints are generated as $x_i \sim \mathcal{N}(0.5, \sigma^2)$, $y_i = 0.5 x_i + 0.2 + \mathcal{N}(0, \tau^2)$, with $\sigma = 0.01$ and $\tau = 0.005$. The datapoints are then truncated to lie between 0 and 1. Note that both axes are on a log scale.

Appendix F DPGradDescent

There are three main versions of DPGradDescent we consider:

  1. DPGDPure: $(\varepsilon, 0)$-DP;
  2. DPGDApprox: $(\varepsilon, \delta)$-DP; and
  3. DPGDzCDP: $(\varepsilon^2/2)$-zCDP.

Algorithm 6 is the specification of an $(\varepsilon, 0)$-DP algorithm, and Algorithm 7 is a $\rho$-zCDP algorithm from which we can obtain both an $(\varepsilon, \delta)$-DP algorithm and an $(\varepsilon^2/2)$-zCDP algorithm. As with traditional gradient descent, several design choices must be made: the step size, the batch size for the gradients, how many of the iterates are averaged to form the final estimate, and how the privacy budget is distributed across iterations. We include this pseudo-code for completeness, to document the choices made in our experiments. We do not claim to have extensively explored the space of parameter choices, and it is possible that a better choice of these parameters would result in a better-performing algorithm. Differentially private gradient descent has received a lot of attention in the literature; for a more in-depth discussion, see (Bassily et al., 2014).

Data: $\{(x_i, y_i)\}_{i=1}^n \in ([0,1]\times[0,1])^n$
Privacy params: $\varepsilon$
Hyperparams: $n, T, \tau, \tilde{p}_{25}^0, \tilde{p}_{75}^0$
for $t = 0 : T-1$ do
       $\varepsilon_t = \varepsilon/T$
       for $i = 1 : n$ do
              $\tilde{y}_{i,t} = 2(\tilde{p}_{25}^t(3/4 - x_i) + \tilde{p}_{75}^t(x_i - 1/4))$
              $\Delta_{i,t} = \left([2(y_i - \tilde{y}_{i,t})(3/4 - x_i)]_{-\tau}^{\tau},\; [2(y_i - \tilde{y}_{i,t})(x_i - 1/4)]_{-\tau}^{\tau}\right)$
       $\Delta_t = \sum_{i=1}^n \Delta_{i,t} + \text{Lap}_2(0, 4\tau/\varepsilon_t)$
       $\gamma_t = \frac{1}{\sqrt{\sum_{l=0}^t \Delta_l^2}}$
       $[\tilde{p}_{25}^{t+1}, \tilde{p}_{75}^{t+1}] = [\tilde{p}_{25}^t, \tilde{p}_{75}^t] - \gamma_t \Delta_t$
return $\frac{2}{T}\sum_{t=T/2}^{T-1}[\tilde{p}_{25}^t, \tilde{p}_{75}^t]$
Algorithm 6 DPGDPure: $(\varepsilon, 0)$-DP Algorithm
Data: $\{(x_i, y_i)\}_{i=1}^n \in ([0,1]\times[0,1])^n$
Privacy params: $\rho$
Hyperparams: $n, T, \tau, \tilde{p}_{25}^0, \tilde{p}_{75}^0$
for $t = 0 : T-1$ do
       $\rho_t = \rho/T$
       for $i = 1 : n$ do
              $\tilde{y}_{i,t} = 2(\tilde{p}_{25}^t(3/4 - x_i) + \tilde{p}_{75}^t(x_i - 1/4))$
              $\Delta_{i,t} = \left([2(y_i - \tilde{y}_{i,t})(3/4 - x_i)]_{-\tau}^{\tau},\; [2(y_i - \tilde{y}_{i,t})(x_i - 1/4)]_{-\tau}^{\tau}\right)$
       $\Delta_t = \sum_{i=1}^n \Delta_{i,t} + \mathcal{N}_2(0, (2\tau/\sqrt{\rho_t})^2)$
       $\gamma_t = \frac{1}{\sqrt{\sum_{l=0}^t \Delta_l^2}}$
       $[\tilde{p}_{25}^{t+1}, \tilde{p}_{75}^{t+1}] = [\tilde{p}_{25}^t, \tilde{p}_{75}^t] - \gamma_t \Delta_t$
return $\frac{2}{T}\sum_{t=T/2}^{T-1}[\tilde{p}_{25}^t, \tilde{p}_{75}^t]$
Algorithm 7 DPGDzCDP: $\rho$-zCDP Algorithm
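In Python, the shared loop of Algorithms 6 and 7 might be sketched as below for the pure-DP variant. All names are ours. One caveat of our reading: $\Delta_t$ as defined is, up to clipping, the negative gradient of the squared loss, so we apply it as a descent direction here:

```python
import numpy as np

def dp_gd_pure(x, y, eps, T=20, tau=1.0, p0=(0.25, 0.75), rng=None):
    """Sketch of DPGDPure (Algorithm 6): noisy clipped gradient descent
    on the (p25, p75) parametrization of the regression line."""
    rng = np.random.default_rng() if rng is None else rng
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    params = np.array(p0, dtype=float)
    iterates, sum_sq = [], 0.0
    eps_t = eps / T                                  # per-step budget
    for _ in range(T):
        y_hat = 2.0 * (params[0] * (0.75 - x) + params[1] * (x - 0.25))
        resid = y - y_hat
        # per-example clipped contributions (Delta_{i,t}), then Laplace noise
        delta = np.array([np.clip(2.0 * resid * (0.75 - x), -tau, tau).sum(),
                          np.clip(2.0 * resid * (x - 0.25), -tau, tau).sum()])
        delta += rng.laplace(0.0, 4.0 * tau / eps_t, size=2)
        sum_sq += float(delta @ delta)               # AdaGrad-style step size
        params = params + delta / np.sqrt(sum_sq)    # move along -grad(loss)
        iterates.append(params.copy())
    return np.mean(iterates[T // 2:], axis=0)        # average the last T/2
```

The zCDP variant differs only in drawing Gaussian rather than Laplace noise and splitting $\rho$ instead of $\varepsilon$ across the $T$ steps.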
Lemma 1.

For any $\rho > 0$, Algorithm 7 is $\rho$-zCDP.

Proof.

By the composition properties of zCDP, it suffices to show that $\Delta_t = \sum_{i=1}^n \Delta_{i,t} + \mathcal{N}_2(0, (2\tau/\sqrt{\rho_t})^2)$ is $\rho_t$-zCDP, where $\mathcal{N}_2(0, s^2)$ represents two independent draws from a normal distribution with standard deviation $s$.

The $L_2$-sensitivity of $\sum_{i=1}^n \Delta_{i,t}$ is at most $2\sqrt{2}\tau$, and the standard deviation of the Gaussian distribution from which noise is drawn is $2\tau/\sqrt{\rho_t}$. Then by Proposition 1.6 in (Bun and Steinke, 2016), the procedure to compute $\Delta_t$ is $\rho_t$-zCDP. ∎

Lemma 2.

For any $\delta \in (0,1]$ and any $\rho > 0$, Algorithm 7 is $(\varepsilon, \delta)$-DP, where

$\varepsilon = \rho + \sqrt{4\rho\log\left(\frac{\sqrt{\pi\rho}}{\delta}\right)}.$
Proof.

Follows from Lemma 3.6 in (Bun and Steinke, 2016). ∎
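The conversion in Lemma 2 is easy to evaluate numerically, e.g. to choose $\rho$ for a target $(\varepsilon, \delta)$ guarantee; a small helper (ours) is:

```python
import math

def zcdp_to_dp_eps(rho, delta):
    """epsilon such that a rho-zCDP algorithm is (epsilon, delta)-DP,
    per Lemma 2: eps = rho + sqrt(4 rho log(sqrt(pi rho) / delta)).
    Assumes sqrt(pi * rho) >= delta, so the log term is non-negative."""
    return rho + math.sqrt(4.0 * rho * math.log(math.sqrt(math.pi * rho) / delta))
```

For example, $\rho = 0.5$ with $\delta = 10^{-6}$ yields $\varepsilon \approx 5.8$, and shrinking $\delta$ increases the resulting $\varepsilon$.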

Lemma 3.

For any $\varepsilon > 0$, Algorithm 6 is $(\varepsilon, 0)$-DP.

Proof.

By basic composition, it suffices to show that $\Delta_t = \sum_{i=1}^n \Delta_{i,t} + \text{Lap}_2(0, 4\tau/\varepsilon_t)$ is $(\varepsilon_t, 0)$-DP, where $\text{Lap}_2(0, s)$ represents two independent draws from a Laplace distribution with scale $s$. This holds since the $L_1$-sensitivity of $\sum_{i=1}^n \Delta_{i,t}$ is at most $4\tau$. ∎