Differentially Private Generalized Linear Models Revisited
Abstract
We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of , where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^*$ is the minimizer of the population risk. Apart from the dependence on $\|w^*\|$, our bound is essentially tight in all parameters. In particular, we show a lower bound of . We also revisit the previously studied case of Lipschitz losses [SSTT21]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) , where rank denotes the rank of the design matrix. This improves over existing work in the high-privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a priori knowledge of $\|w^*\|$.
1 Adapting to $\|w^*\|$
Our method for privately adapting to is given in Algorithm 1. We start by giving a high-level overview and defining some necessary preliminaries. The algorithm works in the following manner. First, we define a number of “guesses” for , where . Then, given black-box access to a DP optimization algorithm , Algorithm 1 generates candidate vectors using , training set , and the guesses . We assume satisfies the following accuracy assumption for some confidence parameter .

{assumption} There exists a function such that for any , whenever , w.p. at least under the randomness of and it holds that .

After generating the candidate vectors, the goal is to pick the guess with the smallest excess population risk in a differentially private manner using a validation set . The following assumption on allows us both to ensure the privacy of the model selection algorithm and to verify that provides a tight estimate of .

{assumption} There exists a function such that for any dataset and
Specifically, our strategy will be to use the Generalized Exponential Mechanism, , of [RS15] in conjunction with a penalized score function. Roughly, this penalty ensures that the looser guarantees on the population loss estimate when is large do not interfere with the loss estimates at smaller values of . We provide the relevant details for in Appendix D.1. We now state our result.
[t] Private Grid Search {algorithmic}[1] \REQUIRE Dataset , grid parameter , optimization algorithm , privacy parameters
Partition into two disjoint sets, and , of size
Set as the output of run with privacy parameter , confidence parameter , and sensitivity/score pairs ,
Output
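To make the structure above concrete, the following is a minimal Python sketch of the grid-search pipeline. The interface `base_opt(S1, B, eps, delta)` for the black-box optimizer of Assumption 1, the sensitivity bound `sens(B)`, and the penalty `tau(B)` are illustrative assumptions, and the private selection step is a simple report-noisy-min stand-in rather than the Generalized Exponential Mechanism used in Algorithm 1.

```python
import numpy as np


def private_grid_search(S, K, base_opt, loss, eps, delta, beta, rng=None):
    """Sketch of the structure of Algorithm 1 (Private Grid Search).

    Illustrative assumptions (not from the paper's pseudocode): the interface
    base_opt(S1, B, eps, delta) for the black-box DP optimizer of Assumption 1,
    the sensitivity bound sens(B), and the penalty tau(B).  The selection step
    is a report-noisy-min stand-in for the Generalized Exponential Mechanism
    of [RS15], which additionally adapts to per-candidate sensitivities.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(S)
    S1, S2 = S[: n // 2], S[n // 2:]            # disjoint train / validation split

    Bs = [2.0 ** j for j in range(K)]           # geometric grid of guesses for ||w*||

    # Candidate predictors from the black-box DP optimizer (private w.r.t. S1).
    ws = [base_opt(S1, B, eps, delta) for B in Bs]

    def sens(B):                                # assumed per-score sensitivity
        return B ** 2 / len(S2)

    def tau(B):                                 # illustrative penalty: keeps the looser
        return sens(B) * np.log(4 * K / beta)   # estimates at large B from dominating

    # Penalized validation scores on S2.
    scores = [np.mean([loss(w, z) for z in S2]) + tau(B) for w, B in zip(ws, Bs)]

    # Report-noisy-min selection, calibrated to the worst-case sensitivity.
    worst = max(sens(B) for B in Bs)
    noisy = [s + rng.laplace(scale=2.0 * worst / eps) for s in scores]
    return ws[int(np.argmin(noisy))]
```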
Let be a smooth non-negative loss function such that for any . Let . Let satisfy . Let be an -DP algorithm satisfying Assumption 1. Then Algorithm 1 is -DP. Further, if satisfies Assumption 1 and then Algorithm 1 outputs s.t. with probability at least ,
\begin{align*}
\excessrisk(\bar{w};\cD) \leq \min\left\{\ybound^2,\ \err(2\max\bc{\norm{w^*},1}) + \frac{4\ybound^2\log(4K/\beta)}{n} + \frac{5\Delta(2\max\bc{\mnorm,1})\log(4K/\beta)}{n\epsilon}\right\}.
\end{align*}
We note that we develop a generic confidence-boosting approach to obtain high-probability guarantees for our previously described algorithms (see the Confidence Boosting paragraph below and Appendix E), and thus obtaining algorithms which satisfy Assumption 1 is straightforward. We provide more details on how our algorithms satisfy Assumption 1 in Appendix D.4.
The following theorem details the guarantees implied by this method for output perturbation with boosting (see Theorem LABEL:thm:boosted-smooth-output-perturbation in Appendix E). Full details are in Appendix D.3.
{theorem}
Let and be the algorithm formed by running LABEL:alg:constrained-reg-erm-output-perturbation with boosting and privacy parameters , .
Then there exists a setting of such that , and Algorithm 1 run with and is -DP and, when given , satisfies the following w.p. at least (letting ):
\begin{align*}
\excessrisk(\out) = \tilde{O}\left(\min\left\{\ybound^2,\ \frac{\br{HB^*\norm{\cX}}^{4/3}\norm{\cY}^{2/3} + \br{HB^*\norm{\cX}}^{2}}{(n\epsilon)^{2/3}} + \frac{HB^*\norm{\cX}\max\bc{\norm{\cY},1} + \norm{\cY}^2}{n} + \frac{\ybound^2 + H(B^*\xbound)^2}{n\epsilon}\right\}\right).
\end{align*}
Confidence Boosting:
We give an algorithm to boost the confidence of unconstrained, smooth DP-SCO (with possibly non-Lipschitz losses). We split the dataset into chunks, run an -DP algorithm over the chunks to obtain models, and then use the Report Noisy Max mechanism to select a model with approximately the least empirical risk; a brief sketch follows below. We show that this achieves the optimal rate of , whereas the previous high-probability result of [SST10] had an additional term and was limited to GLMs. The key idea is that non-negativity, convexity, smoothness, and the loss being bounded at zero together enable strong bounds on the variance of the loss, and consequently give stronger concentration bounds. The details are deferred to Appendix E.
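The following Python sketch illustrates this reduction under illustrative assumptions: `base_opt(chunk)` stands in for the black-box DP optimizer run on each chunk, `loss(w, z)` is the per-sample loss, and `score_sens` is an assumed upper bound on the sensitivity of each candidate's empirical-risk score; the exact scoring set, noise calibration, and privacy accounting follow Appendix E.

```python
import numpy as np


def boosted_dp_sco(S, base_opt, loss, k, eps, score_sens, rng=None):
    """Sketch of the confidence-boosting reduction described above.

    Illustrative assumptions (not the exact pseudocode of Appendix E):
    base_opt(chunk) is the black-box DP optimizer run on one chunk,
    loss(w, z) is the per-sample loss, and score_sens upper-bounds the
    sensitivity of each candidate's empirical-risk score.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(S)
    m = n // k
    chunks = [S[i * m:(i + 1) * m] for i in range(k)]

    # One candidate model per disjoint chunk; each run only touches its chunk.
    candidates = [base_opt(chunk) for chunk in chunks]

    # Report Noisy Max (here, a noisy min over empirical risks): perturb each
    # candidate's empirical risk with Laplace noise and return the apparent best.
    risks = [np.mean([loss(w, z) for z in S]) for w in candidates]
    noisy = [r + rng.laplace(scale=2.0 * score_sens / eps) for r in risks]
    return candidates[int(np.argmin(noisy))]
```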
Acknowledgements
RA and EU are supported, in part, by NSF BIGDATA award IIS-1838139 and NSF CAREER award IIS-1943251. RB’s and MM’s research is supported by NSF Award AF-1908281 and Google Faculty Research Award. CG’s research is partially supported by INRIA Associate Teams project, FONDECYT 1210362 grant, and National Center for Artificial Intelligence CENIA FB210017, Basal ANID. Part of this work was done while CG was at the University of Twente.
Appendix A Missing Proofs from Section LABEL:sec:smooth_upper (Smooth Upper Bounds)
Appendix B Missing Proofs from Section LABEL:sec:lip_upper (Lipschitz Upper Bounds)
Appendix C Missing proofs from Section LABEL:sec:lip_lower (Lipschitz Lower Bounds)
Appendix D Missing Details for Section 1 (Adapting to \texorpdfstring)
D.1 Generalized Exponential Mechanism
[RS15] Let and . Let be functions s.t. for any adjacent datasets it holds that . There exists an algorithm, , such that when given sensitivity-score pairs , privacy parameter , and confidence parameter , it outputs such that with probability at least it satisfies .
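To make the role of this primitive concrete, the following is a simplified Python stand-in that runs the standard exponential mechanism with the worst-case sensitivity; it is not the GEM of [RS15], whose guarantee above depends only on the sensitivities of near-optimal candidates.

```python
import numpy as np


def exp_mech_select(scores, sensitivities, eps, rng=None):
    """Illustrative stand-in for the Generalized Exponential Mechanism of [RS15].

    Runs the standard exponential mechanism using the worst-case sensitivity
    max_k Delta_k; the GEM improves on this by paying only for the sensitivities
    of near-optimal candidates.  Lower scores are treated as better, matching
    the loss-minimization use in Algorithm 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    scores = np.asarray(scores, dtype=float)
    delta_max = float(max(sensitivities))

    # P(k) proportional to exp(-eps * score_k / (2 * Delta_max)).
    logits = -eps * scores / (2.0 * delta_max)
    logits -= logits.max()                      # numerical stabilization
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(scores), p=probs))
```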
D.2 Proof of Theorem 1
Note that by the assumptions on , the process of generating is . Furthermore, by Assumption 1, with probability at least the sensitivity values passed to indeed bound the sensitivity of the corresponding scores. Thus, by the privacy guarantees of and composition, we have that the entire algorithm is -DP.
We now prove accuracy. In order to do so, we first prove that, with high probability, every is an upper bound on the true population loss of . Specifically, define (i.e., the term added to each in Algorithm 1). Note that it suffices to prove
(1)
Fix some . Note that the non-negativity of the loss implies that . The excess risk assumption then implies that , which in turn bounds the variance. Further, with probability at least it holds for all that . Thus, by Bernstein's inequality, we have
Thus it suffices to set to ensure . Taking a union bound over all establishes \eqref{eq:ploss-est-bound}. We now condition on this event for the rest of the proof.
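For reference, the generic form of Bernstein's inequality invoked above, stated for i.i.d. random variables $Z_1,\dots,Z_m$ with $|Z_i - \mathbb{E}[Z_1]| \leq M$ almost surely and $\mathrm{Var}(Z_1) \leq \sigma^2$, is the standard one below; the variance and range bounds plugged in are those derived in the preceding paragraph, and the constants here are the textbook ones.
\begin{align*}
\mathbb{P}\left[\left|\frac{1}{m}\sum_{i=1}^{m} Z_i - \mathbb{E}[Z_1]\right| > t\right] \;\leq\; 2\exp\left(\frac{-m t^2}{2\sigma^2 + \tfrac{2}{3}Mt}\right).
\end{align*}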
Now consider the case where and . Note that the unconstrained minimizer is the constrained minimizer with respect to any for .
With this in mind, let (i.e. the index of the smallest ball containing ).
In the following, we condition on the event that the parameter vector satisfies excess population risk at most . We note by Assumption 1 that this (in addition to the event given in \eqref{eq:ploss-est-bound}) happens with probability at least .
By the guarantees of , with probability at least we (additionally) have
\begin{align*}
L(w_{j^*};\cD) \leq L(w_{j^*};S_2) + \tau_{j^*} &\leq \min_{j\in[K]}\bc{L(w_j;S_2) + \tau_j + \frac{4\Delta(B_j)\log(4K/\beta)}{n\epsilon}} \\
&\leq L(w_{j'};S_2) + \tau_{j'} + \frac{4\Delta(B_{j'})\log(4K/\beta)}{n\epsilon} \\
&\leq L(w_{j'};\cD) + 2\tau_{j'} + \frac{4\Delta(B_{j'})\log(4K/\beta)}{n\epsilon}.
\end{align*}
Since we have
\begin{align*}
L(w_{j^*};\cD) - L(w^*;\cD) &\leq L(w_{j'};\cD) - L(w^*;\cD) + 2\tau_{j'} + \frac{4\Delta(B_{j'})\log(4K/\beta)}{n\epsilon} \\
&\leq \err(2\max\bc{\mnorm,1}) + 2\tau_{j'} + \frac{4\Delta(\max\bc{2\mnorm,1})\log(4K/\beta)}{n\epsilon} \\
&\leq \err(2\max\bc{\mnorm,1}) + \frac{4\ybound^2\log(4K/\beta)}{n} + \frac{5\Delta(\max\bc{2\mnorm,1})\log(4K/\beta)}{n\epsilon}
\end{align*}
where the second inequality follows from the assumption that .
Now note that, by the assumption that , whenever it holds that . However, since the sensitivity/score pair is passed to , the guarantees of ensure that the excess risk of the output is at most .
D.3 Proof of Theorem 1
Let denote the output of the regularized output perturbation method with boosting, with noise and privacy parameters and . We have by Theorem LABEL:thm:boosted-smooth-output-perturbation that with probability at least ,
\begin{align*}
L(\out;\cD) - L(w^*;\cD) = \tilde{O}\left(\frac{HB\norm{\cX}\norm{\cY} + \norm{\cY}^2}{n} + \frac{\br{\br{HB\norm{\cX}}^{4/3}\norm{\cY}^{2/3} + \br{HB\norm{\cX}}^{2}}}{(n\epsilon)^{2/3}} + \frac{\br{\ybound^2 + HB^2\xbound^2}}{n\epsilon} + \frac{\br{\ybound + HB\xbound}}{n}\right).
\end{align*}
Note that this is no smaller than when , and thus it suffices to set to satisfy the condition of the theorem statement.
Let denote the noise level used for when the guess for is . To establish Assumption 1, by Lemma D.4 we have that this assumption is satisfied with . In particular, we note that for the setting of implied by Theorem LABEL:thm:boosted-smooth-output-perturbation and the setting of , we have for all that . Thus . The result then follows from Theorem 1.
D.4 Stability Results for Assumption 1
Algorithm LABEL:alg:noisySGD run with constraint set satisfies Assumption 1 with . The proof is straightforward using Lemma LABEL:lem:smooth_loss_bound (provided in the Appendix). For the output perturbation method, we can obtain similar guarantees. Here, however, we must account for the fact that the output may not lie in the constraint set. We also remark that the JL-based method (Algorithm LABEL:alg:jl-constrained-dp-erm) enjoys this same bound; however, in this case one must apply the norm adaptation method to the intermediate vector , as may have large norm.

{lemma} Algorithm LABEL:alg:constrained-reg-erm-output-perturbation run with parameter and satisfies Assumption 1 with

{proof} Note that since and differ in only one point, it suffices to show that for any that and similarly for . Let and let where . We have by previous analysis that . Since is distributed as a zero-mean Gaussian with variance at most , we have . Setting we obtain . Thus with probability at least it holds that .
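As a concrete illustration of the output perturbation mechanism discussed in this subsection, the following is a minimal Python sketch for regularized least squares; the closed-form solve, the noise scale `sigma`, and the absence of a projection step are illustrative assumptions rather than the exact setting of Algorithm LABEL:alg:constrained-reg-erm-output-perturbation.

```python
import numpy as np


def output_perturbation_ridge(X, y, lam, sigma, rng=None):
    """Sketch of Gaussian output perturbation for regularized least squares.

    lam is the ridge regularization strength and sigma is an assumed bound on
    the Gaussian noise standard deviation, calibrated offline to the L2
    sensitivity of the regularized minimizer; deriving sigma from (eps, delta)
    and the data bounds follows the referenced analysis and is not reproduced.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape

    # Non-private regularized ERM: argmin_w (1/n) * ||X w - y||^2 + lam * ||w||^2.
    w_hat = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

    # Gaussian output perturbation; note the released vector may leave the
    # constraint set, which is exactly the issue handled in the lemma above.
    return w_hat + rng.normal(scale=sigma, size=d)
```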
Appendix E Missing Details for Boosting
[h] Boosting {algorithmic}[1] \REQUIRE Dataset , loss function , Algorithm , privacy parameters \STATE Split the dataset into equally sized chunks \STATE For each , \STATE \ENSURE
We state the result of the boosting procedure in a setup general enough to apply to our proposed algorithms. This leads to additional conditions on the base algorithm, since our proposed methods may not produce an output in the constraint set.
Let be a non-negative, smooth, convex loss function. Let . Let be an algorithm such that
1. satisfies -DP
2. For any fixed , is sub-Gaussian [vershynin2018high]:
3. For any fixed ,
4. Given a dataset of i.i.d. points,
Let and . \cref{alg:boosting} with Algorithm as input satisfies -DP. Given a dataset of samples, with probability at least , the excess risk of its output is bounded as
\begin{align*}
L(\hat{w};\cD) - L(w^*;\cD) \leq \tilde{O}\left(\err\br{\frac{n}{4\log(4/\beta)}, \frac{\epsilon}{2}, \gamma} + \frac{2\Delta(\gamma,\beta/2)}{n\epsilon} + \frac{2\lboundtwo}{n} + \frac{32\gamma\tilde{H}\ybound}{n} + \frac{16\ybound}{n} + \frac{128\tilde{H}\gamma^2}{n}\right).
\end{align*}
We first establish the following concentration bound for convex smooth non-negative functions.
Let be a convex smooth non-negative function.
Let be a dataset of i.i.d. samples.
Let be a random variable which is sub-Gaussian and independent of and let be such that . Then, with probability at least ,
\begin{align*}
\hat{L}(w;S) &\leq \br{1+T(n,\beta)}\,L(w;\cD) + U(n,\beta) \\
L(w;\cD) &\leq \br{1+T(n,\beta)}\,\hat{L}(w;S) + V(n,\beta)
\end{align*}
where , and .
With probability at least , for each . We condition on this event and apply Bernstein's inequality to the random variable :
\begin{align*}
\mathbb{P}\left[\abs{L(w;\cD)-\hat{L}(w;S)} > t\right] \leq \exp\left(\frac{-3nt^2}{6n\,\mathbb{E}\left[\br{L(w;\cD)-\hat{L}(w;S)}^2\right] + 2\lbound t}\right).
\end{align*}
This gives us that
\begin{align}
\abs{L(w;\cD)-\hat{L}(w;S)} \leq \frac{\lbound\log(2/\beta)}{n} + \sqrt{\mathbb{E}\br{L(w;\cD)-\hat{L}(w;S)}^2\log(2/\beta)}.
\end{align}
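For completeness, the inversion from the tail bound to the deviation bound uses the following generic fact, sketched here with textbook constants which may be stated slightly differently in the original (writing $\sigma_n^2 = \mathbb{E}\br{L(w;\cD)-\hat{L}(w;S)}^2$): if $\mathbb{P}[|Z| > t] \leq \exp\big(\frac{-3nt^2}{6n\sigma_n^2 + 2\lbound t}\big)$ for all $t > 0$, then with probability at least $1-\beta$,
\begin{align*}
|Z| \;\leq\; \sqrt{2\sigma_n^2\log(1/\beta)} + \frac{2\lbound\log(1/\beta)}{3n},
\end{align*}
since this choice of $t$ makes the exponent at least $\log(1/\beta)$.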
The term .
Now,
\begin{align*}
\mathbb{E}[(\ell(w;(x,y)))^2] &\leq 2\,\mathbb{E}[(\ell(w;(x,y)) - \ell(0;(x,y)))^2] + 2\,\mathbb{E}[(\ell(0;(x,y)))^2]
\end{align*}