
Differentially Private Generalized Linear Models Revisited

Raman Arora (Department of Computer Science, The Johns Hopkins University, [email protected]), Raef Bassily (Department of Computer Science & Engineering and the Translational Data Analytics Institute (TDAI), The Ohio State University, [email protected]), Cristóbal Guzmán (Department of Applied Mathematics, University of Twente, and Institute for Mathematical and Computational Engineering, Pontificia Universidad Católica de Chile, [email protected]), Michael Menart (Department of Computer Science & Engineering, The Ohio State University, [email protected]), Enayat Ullah (Department of Computer Science, The Johns Hopkins University, [email protected])
Abstract

We study the problem of $(\epsilon,\delta)$-differentially private learning of linear predictors with convex losses. We provide results for two subclasses of loss functions. The first case is when the loss is smooth and non-negative but not necessarily Lipschitz (such as the squared loss). For this case, we establish an upper bound on the excess population risk of $\tilde{O}\left(\frac{\|w^{*}\|}{\sqrt{n}}+\min\left\{\frac{\|w^{*}\|^{2}}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\|w^{*}\|^{2}}{n\epsilon}\right\}\right)$, where $n$ is the number of samples, $d$ is the dimension of the problem, and $w^{*}$ is the minimizer of the population risk. Apart from the dependence on $\|w^{*}\|$, our bound is essentially tight in all parameters. In particular, we show a lower bound of $\tilde{\Omega}\left(\frac{1}{\sqrt{n}}+\min\left\{\frac{\|w^{*}\|^{4/3}}{(n\epsilon)^{2/3}},\frac{\sqrt{d}\|w^{*}\|}{n\epsilon}\right\}\right)$. We also revisit the previously studied case of Lipschitz losses [SSTT21]. For this case, we close the gap in the existing work and show that the optimal rate is (up to log factors) $\Theta\left(\frac{\|w^{*}\|}{\sqrt{n}}+\min\left\{\frac{\|w^{*}\|}{\sqrt{n\epsilon}},\frac{\sqrt{\mathrm{rank}}\,\|w^{*}\|}{n\epsilon}\right\}\right)$, where $\mathrm{rank}$ is the rank of the design matrix. This improves over existing work in the high privacy regime. Finally, our algorithms involve a private model selection approach that we develop to enable attaining the stated rates without a-priori knowledge of $\|w^{*}\|$.

1 Adapting to $\|w^{*}\|$

Our method for privately adapting to $\|w^{*}\|$ is given in Algorithm 1. We start by giving a high-level overview and defining some necessary preliminaries. The algorithm works in the following manner. First, we define a number of "guesses" $K$ for $\|w^{*}\|$, namely $B_{1},\ldots,B_{K}$ where $B_{j}=2^{j}$ for all $j\in[K]$. Then, given black-box access to a DP optimization algorithm $\mathcal{A}$, Algorithm 1 generates $K$ candidate vectors $w_{1},\ldots,w_{K}$ using $\mathcal{A}$, a training set $S_{1}\in(\mathcal{X}\times\mathcal{Y})^{n/2}$, and the guesses $B_{1},\ldots,B_{K}$. We assume $\mathcal{A}$ satisfies the following accuracy assumption for some confidence parameter $\beta>0$.

Assumption 1. There exists a function $\mathrm{err}:\mathbb{R}^{+}\mapsto\mathbb{R}^{+}$ such that for any $B\in\mathbb{R}^{+}$, whenever $B\geq\|w^{*}\|$, with probability at least $1-\frac{\beta}{4K}$ under the randomness of $S_{1}\sim\mathcal{D}^{\frac{n}{2}}$ and $\mathcal{A}$, it holds that $L(\mathcal{A}(S_{1},B);\mathcal{D})-L(w^{*};\mathcal{D})\leq\mathrm{err}(B)$.

After generating the candidate vectors, the goal is to pick the guess with the smallest excess population risk in a differentially private manner using a validation set $S_{2}$. The following assumption on $\mathcal{A}$ allows us both to ensure the privacy of the model selection algorithm and to verify that $\hat{L}(w_{j};S_{2})$ provides a tight estimate of $L(w_{j};\mathcal{D})$.

Assumption 2. There exists a function $\Delta:\mathbb{R}^{+}\mapsto\mathbb{R}^{+}$ such that for any dataset $S_{2}\in(\mathcal{X}\times\mathcal{Y})^{n/2}$ and $B>0$,

\[
\underset{\mathcal{A}}{\mathbb{P}}\left[\exists(x,y)\in S_{2}:\left|\ell(\mathcal{A}(S_{1},B);(x,y))\right|\geq\Delta(B)\right]\leq\frac{\min\{\delta,\beta\}}{4K}.
\]

Specifically, our strategy will be to use the Generalized Exponential Mechanism (GEM) of [RS15] in conjunction with a penalized score function. Roughly, this score function penalty ensures that the looser guarantees on the population loss estimate at large values of $B$ do not interfere with the loss estimates at smaller values of $B$. We provide the relevant details for GEM in Appendix D.1. We now state our result.

Algorithm 1: Private Grid Search

Require: dataset $S\in(\mathcal{X}\times\mathcal{Y})^{n}$, grid parameter $K\in\mathbb{N}$, optimization algorithm $\mathcal{A}:(\mathcal{X}\times\mathcal{Y})^{n}\times\mathbb{R}\mapsto\mathbb{R}^{d}$, privacy parameters $(\epsilon,\delta)$

1. Partition $S$ into two disjoint sets, $S_{1}$ and $S_{2}$, each of size $\frac{n}{2}$.
2. $w_{0}=0$
3. For $j\in[K]$:
4.     $B_{j}=2^{j}$
5.     $w_{j}=\mathcal{A}(S_{1},B_{j})$
6.     $\tilde{L}_{j}=\hat{L}(w_{j};S_{2})+\frac{\Delta(B_{j})\log(K/\beta)}{n}+\sqrt{\frac{4\ybound^{2}\log(K/\beta)}{n}}$
7. End for
8. Set $j^{*}$ to the output of GEM run with privacy parameter $\frac{\epsilon}{2}$, confidence parameter $\frac{\beta}{4}$, and sensitivity/score pairs $(0,\ybound^{2}),(\Delta(B_{1}),\tilde{L}_{1}),\ldots,(\Delta(B_{K}),\tilde{L}_{K})$.
9. Output $w_{j^{*}}$.
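To make the data flow concrete, here is a minimal Python sketch of Algorithm 1 under the assumptions above. The names A, Delta, loss, and gem_select are placeholders for the black-box DP optimizer $\mathcal{A}$, the sensitivity function $\Delta$ from Assumption 2, the loss $\ell$, and the generalized exponential mechanism of Appendix D.1; none of these interfaces come from the paper.

import numpy as np

def private_grid_search(S, K, A, Delta, loss, y_bound, eps, delta, beta, gem_select):
    """Sketch of Algorithm 1 (Private Grid Search).

    A(S1, B, eps, delta): black-box DP optimizer, assumed to be run at
        budget (eps / (2K), delta / (2K)) as required by Theorem 1.
    Delta(B): sensitivity bound from Assumption 2.
    loss(w, xy): per-example loss ell(w; (x, y)).
    gem_select(pairs, eps, beta): generalized exponential mechanism
        (Appendix D.1); given (sensitivity, score) pairs, returns an index.
    All helper names are placeholders, not interfaces from the paper.
    """
    rng = np.random.default_rng()
    n = len(S)
    perm = rng.permutation(n)
    S1 = [S[i] for i in perm[: n // 2]]
    S2 = [S[i] for i in perm[n // 2:]]

    ws = []
    pairs = [(0.0, y_bound ** 2)]   # trivial model w_0 = 0: zero sensitivity, score y_bound^2
    for j in range(1, K + 1):
        B_j = 2.0 ** j
        w_j = A(S1, B_j, eps / (2 * K), delta / (2 * K))
        emp_loss = np.mean([loss(w_j, xy) for xy in S2])   # empirical loss on S2
        L_tilde = (emp_loss
                   + Delta(B_j) * np.log(K / beta) / n
                   + np.sqrt(4 * y_bound ** 2 * np.log(K / beta) / n))
        ws.append(w_j)
        pairs.append((Delta(B_j), L_tilde))

    j_star = gem_select(pairs, eps / 2, beta / 4)
    return np.zeros_like(ws[0]) if j_star == 0 else ws[j_star - 1]

The zero-sensitivity pair $(0,\ybound^{2})$ corresponds to the trivial model $w_{0}=0$; as in the proof in Appendix D.2, it caps the excess risk of the selected model at $\ybound^{2}$ even when every guess $B_{j}$ is too small.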

Theorem 1.

Let $\ell:\mathbb{R}^{d}\times(\mathcal{X}\times\mathcal{Y})\mapsto\mathbb{R}$ be a smooth, non-negative loss function such that $\ell(0;(x,y))\leq\ybound^{2}$ for any $(x,y)\in\mathcal{X}\times\mathcal{Y}$. Let $\epsilon,\delta,\beta\in[0,1]$. Let $K>0$ satisfy $\mathrm{err}(2^{K})\geq\ybound^{2}$. Let $\mathcal{A}$ be an $(\frac{\epsilon}{2K},\frac{\delta}{2K})$-DP algorithm satisfying Assumption 2. Then Algorithm 1 is $(\epsilon,\delta)$-DP. Further, if $\mathcal{A}$ satisfies Assumption 1 and $S_{1}\sim\mathcal{D}^{n/2}$, then Algorithm 1 outputs $\bar{w}$ such that, with probability at least $1-\beta$,
\begin{align*}
L(\bar{w};\mathcal{D})-L(w^{*};\mathcal{D})\leq\min\left\{\ybound^{2},\ \mathrm{err}(2\max\{\|w^{*}\|,1\})+4\sqrt{\frac{\ybound^{2}\log(4K/\beta)}{n}}+\frac{5\Delta(2\max\{\|w^{*}\|,1\})\log(4K/\beta)}{n\epsilon}\right\}.
\end{align*}

We note that we develop a generic confidence-boosting approach to obtain high-probability guarantees from our previously described algorithms (see the "Confidence Boosting" paragraph below and Appendix E), and thus obtaining algorithms which satisfy Assumption 1 is straightforward. We provide more details on how our algorithms satisfy Assumption 2 in Appendix D.4. The following theorem details the guarantees implied by this method for output perturbation with boosting (see Theorem LABEL:thm:boosted-smooth-output-perturbation). Full details are in Appendix D.3.

Theorem 2. Let $K,\epsilon,\delta,\beta>0$ and let $\mathcal{A}$ be the algorithm formed by running Algorithm LABEL:alg:constrained-reg-erm-output-perturbation with boosting and privacy parameters $\epsilon^{\prime}=\frac{\epsilon}{K}$, $\delta^{\prime}=\frac{\delta}{K}$. Then there exists a setting of $K$ with $K=\Theta\left(\log\max\left\{\frac{\ybound\sqrt{n}}{\xbound\sqrt{H}},\frac{\ybound^{2}(n\epsilon)^{2/3}}{\sqrt{H}\xbound^{2}}\right\}\right)$ such that Algorithm 1 run with $\mathcal{A}$ and $K$ is $(\epsilon,\delta)$-DP and, when given $S\sim\mathcal{D}^{n}$, its output $\hat{w}$ satisfies the following with probability at least $1-\beta$ (letting $B^{*}=2\max\{\|w^{*}\|,1\}$):
\begin{align*}
L(\hat{w};\mathcal{D})-L(w^{*};\mathcal{D})=\tilde{O}\Bigg(\min\Bigg\{\ybound^{2},\ &\frac{\left(\sqrt{H}B^{*}\|\mathcal{X}\|\right)^{4/3}\|\mathcal{Y}\|^{2/3}+\left(\sqrt{H}B^{*}\|\mathcal{X}\|\right)^{2}}{(n\epsilon)^{2/3}}\\
&+\frac{\sqrt{H}B^{*}\|\mathcal{X}\|\max\{\|\mathcal{Y}\|,1\}+\|\mathcal{Y}\|^{2}}{\sqrt{n}}+\frac{\ybound^{2}+H(B^{*}\xbound)^{2}}{n\epsilon}\Bigg\}\Bigg).
\end{align*}

Confidence Boosting:

We give an algorithm to boost the confidence of unconstrained, smooth DP-SCO (with possibly non-Lipschitz losses). We split the dataset $S$ into $m+1$ chunks, run an $(\epsilon,\delta)$-DP algorithm on each of the first $m$ chunks to get $m$ models, and then use the Report Noisy Max mechanism on the remaining chunk to select a model with approximately the least empirical risk. We show that this achieves the optimal rate of $\tilde{O}\left(\frac{\sqrt{H}B\xbound\ybound}{\sqrt{n}}\right)$, whereas the previous high-probability result of [SST10] had an additional $\tilde{O}\left(\frac{\ybound^{2}}{\sqrt{n}}\right)$ term and was limited to GLMs. The key idea is that non-negativity, convexity, smoothness, and a bounded loss at zero together enable strong bounds on the variance of the loss, and consequently give stronger concentration bounds. The details are deferred to Appendix E.

Acknowledgements

RA and EU are supported, in part, by NSF BIGDATA award IIS-1838139 and NSF CAREER award IIS-1943251. RB's and MM's research is supported by NSF Award AF-1908281 and a Google Faculty Research Award. CG's research is partially supported by an INRIA Associate Teams project, FONDECYT grant 1210362, and National Center for Artificial Intelligence CENIA FB210017, Basal ANID. Part of this work was done while CG was at the University of Twente.

Appendix A Missing Proofs from Section LABEL:sec:smooth_upper (Smooth Upper Bounds)

Appendix B Missing Proofs from Section LABEL:sec:lip_upper (Lipschitz Upper Bounds)

Appendix C Missing proofs from Section LABEL:sec:lip_lower (Lipschitz Lower Bounds)

Appendix D Missing Details for Section 1 (Adapting to $\|w^{*}\|$)

D.1 Generalized Exponential Mechanism

{theorem}

[RS15] Let $K>0$ and $S\in\mathcal{Z}^{n}$. Let $q_{1},\ldots,q_{K}$ be functions such that for any adjacent datasets $S,S^{\prime}$ it holds that $|q_{j}(S)-q_{j}(S^{\prime})|\leq\gamma_{j}$ for all $j\in[K]$. There exists an $\epsilon$-DP algorithm, GEM, which, when given sensitivity-score pairs $(\gamma_{1},q_{1}(S)),\ldots,(\gamma_{K},q_{K}(S))$, privacy parameter $\epsilon>0$, and confidence parameter $\beta>0$, outputs $j\in[K]$ such that with probability at least $1-\beta$, $q_{j}(S)\leq\min\limits_{k\in[K]}\left\{q_{k}(S)+\frac{4\gamma_{k}\log(K/\beta)}{\epsilon}\right\}$.
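To illustrate how sensitivity-score pairs of this kind can be consumed, here is a small Python sketch of a mechanism in this spirit. It is a simplified normalized-score variant of the idea behind GEM, not the exact [RS15] construction (whose utility guarantee is the sharper one stated above); the confidence parameter is accepted only to match the interface used in Algorithm 1, and the function name is ours.

import numpy as np

def gem_select(pairs, epsilon, beta=None, rng=None):
    """Pick an index with an (approximately) smallest score when each score
    has its own sensitivity.

    Candidate i receives the normalized score
        s_i = max_k (q_i - q_k) / (gamma_i + gamma_k),
    which has sensitivity at most 1, and we run the exponential mechanism
    on -s_i. `pairs` is a list of (sensitivity, score) tuples.
    """
    rng = np.random.default_rng() if rng is None else rng
    gamma = np.array([g for g, _ in pairs], dtype=float)
    q = np.array([s for _, s in pairs], dtype=float)
    denom = gamma[:, None] + gamma[None, :]
    denom[denom == 0.0] = 1e-12                  # guard against all-zero sensitivity pairs
    s = np.max((q[:, None] - q[None, :]) / denom, axis=1)
    logits = -0.5 * epsilon * s                  # exponential mechanism weights (eps-DP)
    logits -= logits.max()                       # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(q), p=probs))

Since each normalized score has sensitivity at most $1$, the exponential mechanism applied to $-s_{i}$ is $\epsilon$-DP; the utility of this simplified variant, however, depends on the sensitivities of both candidates being compared, which is weaker than the guarantee stated in the theorem above.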

D.2 Proof of Theorem 1

Note that, by the assumptions on $\mathcal{A}$, the process of generating $w_{1},\ldots,w_{K}$ is $(\epsilon/2,\delta/2)$-DP. Furthermore, by Assumption 2, with probability at least $1-\delta/2$ the sensitivity values passed to GEM are valid upper bounds on the sensitivity of the corresponding scores. Thus, by the privacy guarantees of GEM and composition, the entire algorithm is $(\epsilon,\delta)$-DP.

We now prove accuracy. To do so, we first show that with high probability every $\tilde{L}_{j}$ is an upper bound on the true population loss of $w_{j}$. Specifically, define $\tau_{j}=\frac{\Delta(B_{j})\log(4K/\beta)}{n}+\sqrt{\frac{4\ybound^{2}\log(4K/\beta)}{n}}$ (i.e., the term added to each $\hat{L}(w_{j};S_{2})$ in Algorithm 1). Note that it suffices to prove

\begin{equation}
\mathbb{P}\left[\exists j\in[K]:\left|\hat{L}(w_{j};S_{2})-L(w_{j};\mathcal{D})\right|\geq\tau_{j}\right]\leq\frac{\beta}{2}.
\tag{1}
\end{equation}

Fix some $j\in[K]$. Note that non-negativity of the loss implies that $\ell(w_{B}^{*};(x,y))\geq 0$. The excess risk assumption then implies that $\mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\ell(w_{j};(x,y))^{2}\right]\leq 4\ybound^{2}$, which in turn bounds the variance. Further, by Assumption 2, with probability at least $1-\frac{\beta}{4K}$ it holds for all $(x,y)\in S_{2}$ that $\ell(w_{j};(x,y))\leq\Delta(B_{j})$. Thus, by Bernstein's inequality we have

\[
\mathbb{P}\left[\left|\hat{L}(w_{j};S_{2})-L(w_{j};\mathcal{D})\right|\geq t\right]\leq\exp\left(-\frac{t^{2}n^{2}}{\Delta(B_{j})tn+4n\ybound^{2}}\right)+\frac{\beta}{4K}.
\]

Thus it suffices to set $t=\frac{\Delta(B_{j})\log(4K/\beta)}{n}+\sqrt{\frac{4\ybound^{2}\log(4K/\beta)}{n}}$ to ensure $\mathbb{P}\left[\left|\hat{L}(w_{j};S_{2})-L(w_{j};\mathcal{D})\right|\geq t\right]\leq\frac{\beta}{2K}$. Taking a union bound over all $j\in[K]$ establishes Eq. (1).
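For completeness, here is the short calculation behind this choice of $t$. Writing $a=\frac{\Delta(B_{j})\log(4K/\beta)}{n}$ and $b=\sqrt{\frac{4\ybound^{2}\log(4K/\beta)}{n}}$, so that $t=a+b$, we have
\begin{align*}
t^{2}n^{2}=(a+b)^{2}n^{2}\geq\left(a(a+b)+b^{2}\right)n^{2}=\log(4K/\beta)\left(\Delta(B_{j})tn+4n\ybound^{2}\right),
\end{align*}
and hence $\exp\left(-\frac{t^{2}n^{2}}{\Delta(B_{j})tn+4n\ybound^{2}}\right)\leq\exp\left(-\log(4K/\beta)\right)=\frac{\beta}{4K}$.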

We now condition on this event for the rest of the proof. Now consider the case where $j^{*}\neq 0$ and $\|w^{*}\|\leq 2^{K}$. Note that the unconstrained minimizer $w^{*}$ is also the constrained minimizer with respect to any ball $\mathcal{B}_{r}$ with $r\geq\|w^{*}\|$. With this in mind, let $j^{\prime}=\min_{j\in[K]}\{j:w^{*}\in\mathcal{B}_{2^{j}}\}$ (i.e., the index of the smallest ball containing $w^{*}$). In the following, we condition on the event that for all $j\in[K]$ with $j\geq j^{\prime}$, the parameter vector $w_{j}$ has excess population risk at most $\mathrm{err}(2^{j})$. We note by Assumption 1 that this event (in addition to the event given in Eq. (1)) happens with probability at least $1-\frac{3\beta}{4}$. By the guarantees of GEM, with probability at least $1-\beta$ we (additionally) have
\begin{align*}
L(w_{j^{*}};\mathcal{D})\leq\hat{L}(w_{j^{*}};S_{2})+\tau_{j^{*}}&\leq\min_{j\in[K]}\left\{\hat{L}(w_{j};S_{2})+\tau_{j}+\frac{4\Delta(B_{j})\log(4K/\beta)}{n\epsilon}\right\}\\
&\leq\hat{L}(w_{j^{\prime}};S_{2})+\tau_{j^{\prime}}+\frac{4\Delta(B_{j^{\prime}})\log(4K/\beta)}{n\epsilon}\\
&\leq L(w_{j^{\prime}};\mathcal{D})+2\tau_{j^{\prime}}+\frac{4\Delta(B_{j^{\prime}})\log(4K/\beta)}{n\epsilon}.
\end{align*}

Since $2^{j^{\prime}}\leq 2\max\{\|w^{*}\|,1\}$, we have
\begin{align*}
L(w_{j^{*}};\mathcal{D})-L(w^{*};\mathcal{D})&\leq L(w_{j^{\prime}};\mathcal{D})-L(w^{*};\mathcal{D})+2\tau_{j^{\prime}}+\frac{4\Delta(B_{j^{\prime}})\log(4K/\beta)}{n\epsilon}\\
&\leq\mathrm{err}(2\max\{\|w^{*}\|,1\})+2\tau_{j^{\prime}}+\frac{4\Delta(2\max\{\|w^{*}\|,1\})\log(4K/\beta)}{n\epsilon}\\
&\leq\mathrm{err}(2\max\{\|w^{*}\|,1\})+4\sqrt{\frac{\ybound^{2}\log(4K/\beta)}{n}}+\frac{5\Delta(2\max\{\|w^{*}\|,1\})\log(4K/\beta)}{n\epsilon},
\end{align*}
where the second inequality uses the assumption that $\|w^{*}\|\leq\ybound^{2}$. Now note that, by the assumption that $\mathrm{err}(2^{K})\geq\ybound^{2}$, whenever $\|w^{*}\|\geq 2^{K}$ it holds that $\ybound^{2}\leq\mathrm{err}(\|w^{*}\|)$. However, since the sensitivity-score pair $(0,\ybound^{2})$ is passed to GEM, the excess risk of the output is bounded by $\ybound^{2}$ by the guarantees of GEM.

D.3 Proof of Theorem 2

Let $\hat{w}$ denote the output of the regularized output perturbation method with boosting, run with noise and privacy parameters $\epsilon^{\prime}=\frac{\epsilon}{K}$ and $\delta^{\prime}=\frac{\delta}{K}$. By Theorem LABEL:thm:boosted-smooth-output-perturbation, with probability at least $1-\frac{\beta}{4K}$,
\begin{align*}
L(\hat{w};\mathcal{D})-L(w^{*};\mathcal{D})=\tilde{O}\Bigg(&\frac{\sqrt{H}B\|\mathcal{X}\|\|\mathcal{Y}\|+\|\mathcal{Y}\|^{2}}{\sqrt{n}}+\frac{\left(\sqrt{H}B\|\mathcal{X}\|\right)^{4/3}\|\mathcal{Y}\|^{2/3}+\left(\sqrt{H}B\|\mathcal{X}\|\right)^{2}}{(n\epsilon)^{2/3}}\\
&+\frac{\ybound^{2}+HB^{2}\xbound^{2}}{n\epsilon}+\frac{\ybound+\sqrt{H}B\xbound}{n}\Bigg).
\end{align*}
Note that this is no smaller than $\ybound^{2}$ when $B=\Omega\left(\max\left\{\frac{\ybound\sqrt{n\epsilon}}{\xbound\sqrt{H}},\frac{\ybound^{2}(n\epsilon)^{2/3}}{\sqrt{H}\xbound^{2}}\right\}\right)$, and thus it suffices to set $K=\Theta\left(\log\max\left\{\frac{\ybound\sqrt{n}}{\xbound\sqrt{H}},\frac{\ybound^{2}(n\epsilon)^{2/3}}{\sqrt{H}\xbound^{2}}\right\}\right)$ to satisfy the condition of the theorem statement.

Let $\sigma_{j}$ denote the noise level used when the guess for $\|w^{*}\|$ is $B_{j}$. To establish Assumption 2, we have by the corresponding lemma in Appendix D.4 that this assumption is satisfied with $\Delta(B)=\ybound^{2}+H\xbound^{2}\sigma_{j}^{2}\log(K/\min\{\beta,\delta\})+HB^{2}\xbound^{2}$. In particular, we note that for the setting of $\sigma_{j}$ implied by Theorem LABEL:thm:boosted-smooth-output-perturbation and the above setting of $K$, we have for all $j\in[K]$ that $H\xbound^{2}\sigma_{j}^{2}=\tilde{O}(\ybound^{2})$. Thus $\Delta(B)=\tilde{O}\left(\ybound^{2}+HB^{2}\xbound^{2}\right)$. The result then follows from Theorem 1.

D.4 Stability Results for Assumption 2

{lemma}

Algorithm LABEL:alg:noisySGD run with constraint set $\mathcal{B}_{B}$ satisfies Assumption 2 with $\Delta(B)=\ybound^{2}+HB^{2}\xbound^{2}$. The proof is straightforward using Lemma LABEL:lem:smooth_loss_bound (provided in the Appendix). For the output perturbation method, we can obtain similar guarantees. Here, however, we must account for the fact that the output may not lie in the constraint set. We also remark that the JL-based method (Algorithm LABEL:alg:jl-constrained-dp-erm) enjoys this same bound; however, in this case one must apply the norm adaptation method to the intermediate vector $\tilde{w}$, as $\Phi^{\top}\tilde{w}$ may have large norm.

{lemma} Algorithm LABEL:alg:constrained-reg-erm-output-perturbation run with parameters $B$ and $\sigma$ satisfies Assumption 2 with $\Delta(B)=\ybound^{2}+H\xbound^{2}\sigma^{2}\log(K/\delta)+HB^{2}\xbound^{2}$.

{proof} Note that since $S$ and $S^{\prime}$ differ in only one point, it suffices to show that for any $(x,y),(x^{\prime},y^{\prime})$ we have $\ell(\hat{w};(x,y))\leq\ybound^{2}+HB^{2}\xbound^{2}+H\xbound^{2}\sigma^{2}\log(K/\delta)$ and similarly for $\ell(\hat{w};(x^{\prime},y^{\prime}))$. Let $w\in\mathcal{B}_{B}$ and let $\hat{w}=w+b$ where $b\sim\mathcal{N}(0,\mathbb{I}_{d}\sigma^{2})$. By the previous analysis we have $\ell(\hat{w};(x,y))\leq\ybound^{2}+HB^{2}\xbound^{2}+H\langle b,x\rangle^{2}$. Since $\langle b,x\rangle$ is distributed as a zero-mean Gaussian with variance at most $\xbound^{2}\sigma^{2}$, we have $\mathbb{P}[|\langle b,x\rangle|\geq t]\leq\exp\left(\frac{-t^{2}}{\xbound^{2}\sigma^{2}}\right)$. Setting $t=\xbound\sigma\sqrt{\log(K/\delta)}$ we obtain $\mathbb{P}\left[|\langle b,x\rangle|^{2}\geq\xbound^{2}\sigma^{2}\log(K/\delta)\right]\leq\delta/K$. Thus, with probability at least $1-\delta/K$, it holds that $\ell(\hat{w};(x,y))\leq\ybound^{2}+HB^{2}\xbound^{2}+H\xbound^{2}\sigma^{2}\log(K/\delta)$.

Appendix E Missing Details for Boosting

Algorithm: Boosting

Require: dataset $S$, loss function $\ell$, algorithm $\mathcal{A}$, noise parameter $\tilde{\sigma}$, privacy parameters $(\epsilon,\delta)$

1. Split the dataset $S$ into equally sized chunks $\{S_{i}\}_{i=1}^{m+1}$.
2. For each $i\in[m]$, $\hat{w}_{i}=\mathcal{A}\left(S_{i},\frac{\epsilon}{2},\delta\right)$.
3. $i^{*}=\operatorname*{argmax}_{i\in[m]}\left(-\hat{L}(\hat{w}_{i};S_{m+1})+\operatorname{Lap}(0,\tilde{\sigma})\right)$.

Output: $\hat{w}_{i^{*}}$
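A minimal Python sketch of this procedure follows. The names A, loss, and sigma_tilde stand for the base DP algorithm $\mathcal{A}$, the loss $\ell$, and the Laplace scale $\tilde{\sigma}$ from the theorem below; they are placeholders rather than interfaces from the paper.

import numpy as np

def boost_confidence(S, A, loss, m, eps, delta, sigma_tilde, rng=None):
    """Sketch of the Boosting procedure above.

    A(S_i, eps, delta): base DP algorithm, run at budget (eps / 2, delta).
    loss(w, xy): per-example loss ell(w; (x, y)).
    sigma_tilde: Laplace scale for the Report Noisy Max selection step.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(S)
    chunk = n // (m + 1)
    chunks = [S[i * chunk:(i + 1) * chunk] for i in range(m + 1)]
    S_val = chunks[m]                            # held-out chunk S_{m+1}

    # Train m candidate models on disjoint chunks.
    models = [A(chunks[i], eps / 2, delta) for i in range(m)]

    # Report Noisy Max over the negated empirical risks on the held-out chunk.
    noisy_scores = [
        -float(np.mean([loss(w, xy) for xy in S_val])) + rng.laplace(0.0, sigma_tilde)
        for w in models
    ]
    return models[int(np.argmax(noisy_scores))]

The $m$ training runs operate on disjoint chunks, so intuitively their privacy cost does not compose sequentially, and only the single Laplace-noised argmax touches the held-out chunk $S_{m+1}$.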

We state the result of the boosting procedure in a setup general enough to apply to our proposed algorithms. This leads to additional conditions on the base algorithm, since our proposed methods may not produce an output in the constraint set.

{theorem}

Let $\ell$ be a non-negative, $\tilde{H}$-smooth, convex loss function. Let $\epsilon,\delta>0$. Let $\mathcal{A}:(S,\epsilon,\delta)\mapsto\mathcal{A}(S,\epsilon,\delta)$ be an algorithm such that:

  1. $\mathcal{A}$ satisfies $(\epsilon,\delta)$-DP.

  2. For any fixed $S$, $\mathcal{A}(S)$ is $\gamma^{2}$ sub-Gaussian [vershynin2018high]: $\sup_{\|u\|=1}\mathbb{E}\left[\exp\left(\langle\mathcal{A}(S),u\rangle^{2}/\gamma^{2}\right)\right]\leq 2$.

  3. For any fixed $S$, $\mathbb{P}_{(x,y)}\left[\ell(\mathcal{A}(S);(x,y))>\lbounddef\right]<\beta$.

  4. Given a dataset $S$ of $n$ i.i.d. points, $\mathbb{E}\left[L(\mathcal{A}(S);\mathcal{D})-\min_{w\in\mathcal{B}_{B}}L(w;\mathcal{D})\right]\leq\mathrm{err}(n,\epsilon,\gamma)$.

Let $\tilde{\sigma}^{2}=\frac{4(\ybound^{2}+\tilde{H}\tilde{\gamma}^{2}\xbound^{2})}{n\epsilon}$ and $n_{0}=\frac{16\gamma^{2}\log^{8}(4/\beta)\tilde{H}}{\ybound^{2}}$. The Boosting algorithm above, run with algorithm $\mathcal{A}$ as input, satisfies $(\epsilon,\delta)$-DP. Given a dataset $S$ of $n\geq n_{0}$ i.i.d. samples, with probability at least $1-\beta$, the excess risk of its output $\hat{w}_{i^{*}}$ is bounded as

\begin{align*}
L(\hat{w}_{i^{*}};\mathcal{D})-L(w^{*};\mathcal{D})\leq\tilde{O}\Bigg(&\mathrm{err}\left(\frac{n}{4\log(4/\beta)},\frac{\epsilon}{2},\gamma\right)+\frac{2\Delta(\gamma,\beta/2)}{n\epsilon}+\frac{2\lboundtwo}{n}\\
&+\frac{32\gamma\sqrt{\tilde{H}}\ybound}{\sqrt{n}}+\frac{16\ybound}{\sqrt{n}}+\frac{128\tilde{H}\gamma^{2}}{n}\Bigg).
\end{align*}

We first establish the following concentration bound for convex, $\tilde{H}$-smooth, non-negative functions.

{lemma}

Let $\ell$ be a convex, $\tilde{H}$-smooth, non-negative function. Let $S$ be a dataset of $n$ i.i.d. samples. Let $w$ be a random variable which is $\gamma^{2}$ sub-Gaussian and independent of $S$, and let $\lbounddef$ be such that $\mathbb{P}_{(x,y)}[\ell(w;(x,y))>\lbounddef]\leq\beta$. Then, with probability at least $1-\beta$,
\begin{align*}
\hat{L}(w;S)&\leq\left(1+T(n,\beta)\right)L(w;\mathcal{D})+U(n,\beta)\\
L(w;\mathcal{D})&\leq\left(1+T(n,\beta)\right)\hat{L}(w;S)+V(n,\beta),
\end{align*}
where $T(n,\beta):=\frac{4\gamma\log(4/\beta)\sqrt{\tilde{H}}}{\ybound\sqrt{n}}$, $U(n,\beta):=\frac{4\gamma\log(4/\beta)\ybound\sqrt{\tilde{H}}}{\sqrt{n}}+\frac{\ybound\sqrt{\log(2/\beta)}}{\sqrt{n}}$, and
$V(n,\beta):=\frac{4\gamma\log(4/\beta)\sqrt{\tilde{H}}\ybound}{\sqrt{n}}+\frac{2\lbound\log(2/\beta)}{n}+\frac{\ybound\sqrt{\log(2/\beta)}}{\sqrt{n}}+\frac{48\tilde{H}\gamma^{2}\log^{2}(4/\beta)}{n}$.

{proof}

With probability at least $1-\frac{\beta}{4}$, for each $(x,y)\in S$, $\ell(w;(x,y))\leq\lbound$. We condition on this event and apply Bernstein's inequality to the random variable $L(w;\mathcal{D})-\hat{L}(w;S)$:
\[
\mathbb{P}\left[\left|L(w;\mathcal{D})-\hat{L}(w;S)\right|>t\right]\leq\exp\left(-\frac{3nt^{2}}{6n\,\mathbb{E}\left[\left(L(w;\mathcal{D})-\hat{L}(w;S)\right)^{2}\right]+2\lbound t}\right).
\]
This gives us that
\begin{align}
\left|L(w;\mathcal{D})-\hat{L}(w;S)\right|\leq\frac{\lbound\log(2/\beta)}{n}+\sqrt{\mathbb{E}\left[\left(L(w;\mathcal{D})-\hat{L}(w;S)\right)^{2}\right]\log(2/\beta)}.
\end{align}

The term $\mathbb{E}\left[\left(L(w;\mathcal{D})-\hat{L}(w;S)\right)^{2}\right]=\frac{1}{n}\mathbb{E}\left[\left(\ell(w;(x,y))-\mathbb{E}[\ell(w;(x,y))]\right)^{2}\right]\leq\frac{1}{n}\mathbb{E}\left[\ell(w;(x,y))^{2}\right]$.

Now,
\begin{align*}
\mathbb{E}\left[\ell(w;(x,y))^{2}\right]&\leq 2\,\mathbb{E}\left[\left(\ell(w;(x,y))-\ell(0;(x,y))\right)^{2}\right]+2\,\mathbb{E}\left[\ell(0;(x,y))^{2}\right]
\end{align*}