
Differentially Private \ell_{1}-norm Linear Regression with Heavy-tailed Data

Di Wang^{1} and Jinhui Xu^{3}
^{1}Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. Email: [email protected]
^{3}Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY. Email: [email protected]
Abstract

We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on \ell_{1}-norm linear regression in the \epsilon-DP model. While most previous work assumes the loss function is Lipschitz, here we only assume the variates have bounded moments. We first study the case where the \ell_{2} norm of the data has bounded second order moment. We propose an algorithm based on the exponential mechanism and show that it achieves an upper bound of \tilde{O}(\sqrt{\frac{d}{n\epsilon}}) (with high probability). Next, we relax the assumption to a bounded \theta-th order moment for some \theta\in(1,2) and show that an upper bound of \tilde{O}((\frac{d}{n\epsilon})^{\frac{\theta-1}{\theta}}) is achievable. Our algorithms can also be extended to the more relaxed setting where only each coordinate of the data has bounded moments, in which case we obtain upper bounds of \tilde{O}(\frac{d}{\sqrt{n\epsilon}}) and \tilde{O}(\frac{d}{(n\epsilon)^{\frac{\theta-1}{\theta}}}) in the second and \theta-th moment cases, respectively.

I Introduction

Stochastic Convex Optimization under the differential privacy (DP) [1] constraint, i.e., Differentially Private Stochastic Convex Optimization (DP-SCO), is one of the most fundamental problems in both statistical machine learning and differential privacy, and it has received much attention in recent years [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. In DP-SCO, we have a loss function \ell and a dataset D=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{n},y_{n})\} of size n, where each pair of variate and label/response (x_{i},y_{i}) is i.i.d. sampled from some unknown distribution \mathcal{D}. The goal of DP-SCO is to privately minimize the population risk function L_{\mathcal{D}}(w)=\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(w,x,y)] over a parameter space \mathcal{W}\subseteq\mathbb{R}^{d}, i.e., we aim to design a DP algorithm whose output w^{priv} makes the excess population risk L_{\mathcal{D}}(w^{priv})-\min_{w\in\mathcal{W}}L_{\mathcal{D}}(w) as small as possible with high probability.

Although DP-SCO and its empirical counterpart, Differentially Private Empirical Risk Minimization (DP-ERM), have been extensively studied for many years and a long line of work has attacked these problems from different perspectives, most of this work needs to assume the data distribution is bounded, which implies that the loss function is Lipschitz; this is unrealistic and does not always hold in practice. To relax the Lipschitz assumption, starting from [16], several recent works have studied DP-SCO with heavy-tailed data, where the Lipschitz assumption is removed and one only assumes that the distribution of the gradient of the loss function has bounded finite-order moments [16, 17, 18]. However, there are still several open problems in DP-SCO with heavy-tailed data. First, previous work only considered the case where the loss function is smooth. Second, most of this work studied the (\epsilon,\delta)-DP model, and the behavior of the problem in the \epsilon-DP model is still far from well understood. Third, all previous results need to assume the data or the gradient of the loss has at least bounded second (order) moments and cannot be extended to the more relaxed case where it only has a \theta-th moment with \theta\in(1,2). In this paper, we continue along the directions of these open problems. Specifically, we study the problem of \ell_{1}-norm linear regression (i.e., \ell(w,x,y)=|\langle x,w\rangle-y|) with heavy-tailed data in the \epsilon-DP model. Our contributions can be summarized as follows.

  • We first consider the case where the second moment of the \ell_{2}-norm of the variate x, i.e., \mathbb{E}\|x\|^{2}_{2}, is bounded. Specifically, we propose a method based on the exponential mechanism and show that it achieves an upper bound of \tilde{O}(\sqrt{\frac{d}{n\epsilon}}) with high probability. Moreover, instead of the \ell_{2}-norm, we also consider the case where each coordinate of x has bounded second moment, i.e., \mathbb{E}|x_{j}|^{2}<\infty for every j\in[d]. We show that our algorithm achieves an error bound of \tilde{O}(\frac{d}{\sqrt{n\epsilon}}).

  • We then investigate the relaxed case where the data only has a \theta-th moment with \theta\in(1,2). First, similarly to the second moment case, we assume that \mathbb{E}\|x\|^{\theta}_{2}<\infty and show that a rate of \tilde{O}\big((\frac{d}{n\epsilon})^{\frac{\theta-1}{\theta}}\big) is achievable. Then, under the relaxed condition that \mathbb{E}|x_{j}|^{\theta}<\infty for every j\in[d], we show that our algorithm achieves an error of \tilde{O}\big(\frac{d}{(n\epsilon)^{\frac{\theta-1}{\theta}}}\big). To the best of our knowledge, this is the first theoretical result for DP-SCO with heavy-tailed data that only has a \theta-th moment with \theta\in(1,2).

II Related Work

Although there is a long list of work studying either DP-SCO/DP-ERM or robust estimation, DP-SCO with heavy-tailed data is not well understood. Below we discuss the related work on DP-SCO with heavy-tailed data and on private and robust mean estimation.

For private estimation of heavy-tailed distributions, [19] provides the first study on private mean estimation for distributions with bounded second moment and establishes minimax private rates. Later, [20] also studies heavy-tailed mean estimation under a relaxed assumption, which has also been studied by [21, 22] recently. However, due to the complex nature of \ell_{1} regression, these methods for mean estimation cannot be applied to our problem. Moreover, it is unknown whether their methods can be extended to the case where each coordinate of the data only has a \theta-th order moment with \theta\in(1,2).

For DP-SCO with heavy-tailed data, [16] first studies the problem by proposing three methods based on different assumptions. The first method is based on smooth sensitivity and the Sample-and-Aggregate framework [23]; however, it requires strong assumptions and its error bound is quite large. Their second method is also based on smooth sensitivity [24], but it needs to assume the distribution is sub-exponential. [16] also provides a new private estimator motivated by previous work in robust statistics. [18] recently revisits the problem and improves the (expected) excess population risk for both convex and strongly convex loss functions. It also provides lower bounds for the mean estimation problem in both the (\epsilon,\delta)-DP and \epsilon-DP models. However, as mentioned earlier, all of these algorithms are for the (\epsilon,\delta)-DP model and need to assume the loss function is smooth; thus, their methods cannot be applied to our problem. Later, [17] studies the problem in the high dimensional setting where the dimension can be far greater than the data size. It focuses on the case where the loss function is smooth and the constraint set is a polytope or an \ell_{0}-norm ball, which is quite different from our setting.

III Preliminaries

Definition 1 (Differential Privacy [1]).

Given a data universe \mathcal{X}, we say that two datasets D,D^{\prime}\subseteq\mathcal{X} are neighbors if they differ by only one entry, which is denoted as D\sim D^{\prime}. A randomized algorithm \mathcal{A} is \epsilon-differentially private (DP) if for all neighboring datasets D,D^{\prime} and for all events S in the output space of \mathcal{A}, the following holds: \text{Pr}(\mathcal{A}(D)\in S)\leq e^{\epsilon}\text{Pr}(\mathcal{A}(D^{\prime})\in S).

Definition 2 (Exponential Mechanism).

The exponential mechanism allows differentially private computation over an arbitrary domain and range \mathcal{R}, parametrized by a score function u(D,r) which maps a pair of input dataset D and candidate result r\in\mathcal{R} to a real valued score. With the score function u and privacy budget \epsilon, the mechanism yields an output with exponential bias in favor of high scoring outputs. Let \mathcal{M}(D,u,\mathcal{R}) denote the exponential mechanism, and let \Delta u be the sensitivity of u over the range \mathcal{R}, i.e., \Delta u=\max_{r\in\mathcal{R}}\max_{D\sim D^{\prime}}|u(D,r)-u(D^{\prime},r)|. If \mathcal{M}(D,u,\mathcal{R}) selects and outputs an element r\in\mathcal{R} with probability proportional to \exp(\frac{\epsilon u(D,r)}{2\Delta u}), then it preserves \epsilon-differential privacy.
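To make the sampling step concrete, the following is a minimal sketch (ours, not from [25]) of the exponential mechanism over a finite candidate set; the names exponential_mechanism, score_fn, and sensitivity are our own illustrative choices.

import numpy as np

def exponential_mechanism(candidates, score_fn, sensitivity, eps, rng=None):
    # candidates: a finite list of possible outputs r in R
    # score_fn(r): the data-dependent score u(D, r); higher is better
    # sensitivity: Delta u, the maximum change of u over neighboring datasets
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([score_fn(r) for r in candidates], dtype=float)
    # Subtracting the max score is for numerical stability only; it does not
    # change the distribution proportional to exp(eps * u / (2 * Delta u)).
    logits = eps * (scores - scores.max()) / (2.0 * sensitivity)
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]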

Lemma 1.

[25] For the exponential mechanism \mathcal{M}(D,u,\mathcal{R}), we have

\text{Pr}\{u(\mathcal{M}(D,u,\mathcal{R}))\leq OPT_{u}(D)-\frac{2\Delta u}{\epsilon}(\ln|\mathcal{R}|+t)\}\leq e^{-t},

where OPT_{u}(D) is the highest score over the range \mathcal{R}, i.e., \max_{r\in\mathcal{R}}u(D,r).

Definition 3 (DP-SCO [2]).

Given a dataset D=\{z_{1},\cdots,z_{n}\} from a data universe \mathcal{Z}, where each z_{i}=(x_{i},y_{i}) (with feature vector x_{i} and label/response y_{i}) is an i.i.d. sample from some unknown distribution \mathcal{D}, a convex constraint set \mathcal{W}\subseteq\mathbb{R}^{d}, and a convex loss function \ell:\mathcal{W}\times\mathcal{Z}\mapsto\mathbb{R}, Differentially Private Stochastic Convex Optimization (DP-SCO) is to find w^{\text{priv}} that minimizes the population risk L_{\mathcal{D}}(w)=\mathbb{E}_{z\sim\mathcal{D}}[\ell(w,z)] with the guarantee of being DP. The utility of the algorithm is measured by the excess population risk, that is, L_{\mathcal{D}}(w^{\text{priv}})-\min_{w\in\mathcal{W}}L_{\mathcal{D}}(w) (for convenience we denote the optimal solution by w^{*}). Besides the population risk, we can also measure the empirical risk over the dataset D: \hat{L}(w,D)=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i}). Note that in the high probability setting we need a high probability bound on the excess population risk: given a failure probability 0<\eta<1, we want to find a (polynomial) function f(d,\log\frac{1}{\eta},\frac{1}{n},\frac{1}{\epsilon}) such that with probability at least 1-\eta (over the randomness of the algorithm and the data distribution),

L_{\mathcal{D}}(w^{\text{priv}})-L_{\mathcal{D}}(w^{*})\leq O(f(d,\log\frac{1}{\eta},\frac{1}{n},\frac{1}{\epsilon})).

In this paper, we mainly focus on \ell_{1}-norm linear regression:

\min_{w\in\mathcal{W}}L_{\mathcal{D}}(w)=\mathbb{E}_{(x,y)\sim\mathcal{D}}|\langle x,w\rangle-y|, (1)

where the convex constraint set \mathcal{W} is bounded with diameter \Delta=\max_{w,w^{\prime}\in\mathcal{W}}\|w-w^{\prime}\|_{2}<\infty.

Definition 4 (\zeta-Net).

Let (T,d) be a metric space. Consider a subset \mathcal{W}\subset T and let \zeta>0. A subset \mathcal{S}\subseteq\mathcal{W} is called a \zeta-net of \mathcal{W} if every point in \mathcal{W} is within distance \zeta of some point of \mathcal{S}, i.e., \forall x\in\mathcal{W},\exists x_{0}\in\mathcal{S}:d(x,x_{0})\leq\zeta. The smallest possible cardinality of a \zeta-net of \mathcal{W} is called the covering number of \mathcal{W} and is denoted by \mathcal{N}(\mathcal{W},d,\zeta). Equivalently, the covering number is the smallest number of closed balls with centers in \mathcal{W} and radii \zeta whose union covers \mathcal{W}.

Lemma 2 ([26]).

Consider the Euclidean space (\mathbb{R}^{d},\|\cdot\|_{2}) and a bounded subset \mathcal{W}\subseteq\mathbb{R}^{d} with diameter \Delta. Then its covering number satisfies \mathcal{N}(\mathcal{W},\zeta)\leq(\frac{3\Delta}{\zeta})^{d}.
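As a concrete (non-private) illustration, the following sketch builds a \zeta-net of an \ell_{2} ball of diameter \Delta with an axis-aligned grid; the construction and the helper name grid_net are our own, and its cardinality is larger than the (\frac{3\Delta}{\zeta})^{d} bound of Lemma 2 by dimension-dependent factors, which only affects logarithmic terms in the bounds below.

import itertools
import numpy as np

def grid_net(diameter, zeta, d):
    # Build a zeta-net of the l2 ball of radius diameter/2 centered at the
    # origin, using a grid with per-coordinate spacing 2*zeta/sqrt(d), so any
    # point of the ball is within l2 distance zeta of its nearest grid point.
    spacing = 2.0 * zeta / np.sqrt(d)
    radius = diameter / 2.0
    axis = np.arange(-radius, radius + spacing, spacing)
    net = []
    for point in itertools.product(axis, repeat=d):
        p = np.array(point)
        if np.linalg.norm(p) <= radius + zeta:
            # Project back onto the ball: projection onto a convex set is
            # 1-Lipschitz, so the projected points still form a zeta-net.
            if np.linalg.norm(p) > radius:
                p = p * (radius / np.linalg.norm(p))
            net.append(p)
    return net  # feasible only for small d; the size grows exponentially in d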

IV Main Methods

IV-A Bounded second moment case

In this section we consider the case of bounded second moment. As mentioned earlier, previous work on DP-SCO with heavy-tailed data [16, 17, 18] assumes the loss function is smooth and that the distribution of its gradient has bounded moments. However, for \ell_{1} regression the loss function is non-differentiable. Thus, instead of the gradient, here we directly assume that the second moments of x are bounded, which implies that the second moments of the sub-gradient of the loss function are also bounded. In general, there are two assumptions on the heavy-tailedness of x: one assumes the distribution of \|x\|_{2} has bounded moment; the other assumes the distribution of each coordinate of x has bounded moment. Formally,

Assumption 1.

The second moment of \|x\|_{2} is bounded by \tau^{2}=O(1), that is, \mathbb{E}_{(x,y)\sim\mathcal{D}}\|x\|_{2}^{2}\leq\tau^{2}.

Assumption 2.

The second moment of each coordinate of x is bounded by \tau^{2}=O(1), that is, \forall j\in[d],\ \mathbb{E}_{(x,y)\sim\mathcal{D}}x_{j}^{2}\leq\tau^{2}.

From the above two assumptions we can see that Assumption 1 is more restrictive than Assumption 2. Before presenting our algorithm, we first give a brief overview of the approach for solving problem (1) in the non-private case, proposed by [27]. Specifically, instead of studying the empirical risk of the population risk (1), [27] considers the following minimization problem with a truncated loss:

\min_{w\in\mathcal{W}}\hat{L}_{\iota}(w,D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|), (2)

where \iota>0 is a parameter that will be specified later and the truncation function \psi(\cdot) is a non-decreasing function which satisfies the following property:

-\log(1-x+\frac{x^{2}}{2})\leq\psi(x)\leq\log(1+x+\frac{x^{2}}{2}). (3)

Specifically, [27] shows the following result:

Lemma 3 (Corollary 2 in [27]).

Under Assumption 1, assume \psi(\cdot) satisfies (3). Then for any given failure probability \eta and some specified \iota=\iota(\frac{1}{n},d,\Delta,\log\frac{1}{\eta}) in (2), the optimal solution \hat{w}_{\iota} of (2) satisfies the following excess population risk bound with probability at least 1-\eta:

L_{\mathcal{D}}(\hat{w}_{\iota})-L_{\mathcal{D}}(w^{*})\leq\tilde{O}\big(\frac{1}{n}\mathbb{E}\|x\|_{2}+ (4)
\sqrt{\frac{d\log\frac{1}{\eta}}{n}}(\frac{1}{n^{2}}+\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2})\big)=\tilde{O}(\sqrt{\frac{d\log\frac{1}{\eta}}{n}}). (5)

The previous lemma indicates that solving problem (2) is sufficient for solving the \ell_{1} regression problem if \sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{2}=O(1). To solve problem (2) differentially privately, we adopt the following specific form of \psi(\cdot):

\psi(x)=\begin{cases}-\log(1-x+\frac{x^{2}}{2}),&0\leq x\leq 1\\ \log 2,&x\geq 1\\ -\psi(-x),&x\leq 0\end{cases} (6)

It is easy to see that (6) satisfies the property in (3). Moreover, by the non-decreasing property, the function is upper bounded (in absolute value) by \log 2. Hence, for a fixed w, the sensitivity of \hat{L}_{\iota}(w,D) is bounded by \frac{2\log 2}{n\iota}. Motivated by this, we can use the exponential mechanism to solve (2) in the \epsilon-DP model; see Algorithm 1 for details.
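The following is a minimal sketch (our own illustrative code, with hypothetical names psi and truncated_loss) of the truncation function (6) and the truncated empirical risk (2). Since |\psi|\leq\log 2, changing a single sample (x_{i},y_{i}) changes truncated_loss by at most \frac{2\log 2}{n\iota}, which is exactly the sensitivity used above.

import numpy as np

def psi(x):
    # Truncation function in (6): odd, non-decreasing, |psi(x)| <= log(2).
    x = np.asarray(x, dtype=float)
    s, a = np.sign(x), np.abs(x)
    return s * np.where(a <= 1.0, -np.log(1.0 - a + a**2 / 2.0), np.log(2.0))

def truncated_loss(w, X, y, iota):
    # Truncated empirical risk \hat{L}_iota(w, D) from (2).
    n = len(y)
    residuals = np.abs(y - X @ w)
    return np.sum(psi(iota * residuals)) / (n * iota)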

Algorithm 1 Exponential mechanism for \ell_{1}-regression (second moment)

\mathbf{Input}: D=\{(x_{i},y_{i})\}_{i=1}^{n}; privacy parameter \epsilon; parameters \iota,\zeta (to be specified later); truncated empirical risk \hat{L}_{\iota} in (2) with \psi in (6).

1: Find a \zeta-net of \mathcal{W} with covering number N(\mathcal{W},\zeta), denoted by \tilde{W}_{\zeta}=\{w_{1},\cdots,w_{N(\mathcal{W},\zeta)}\}.
2: Run the exponential mechanism with range R=\tilde{W}_{\zeta} and score function u(D,w)=-\hat{L}_{\iota}(w,D). That is, output a w\in\tilde{W}_{\zeta} with probability proportional to \exp(\frac{-n\iota\epsilon\hat{L}_{\iota}(w,D)}{\log 2}).
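For concreteness, here is a short end-to-end sketch of Algorithm 1 built from the earlier illustrative helpers grid_net and truncated_loss (all names are our own); the exponent follows the display in step 2, and the exact scaling constant should be read from the algorithm, not from this sketch.

import numpy as np

def algorithm_1(X, y, eps, iota, zeta, diameter, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    # Step 1: zeta-net of the constraint set W (here taken to be an l2 ball).
    net = grid_net(diameter, zeta, d)
    # Step 2: exponential mechanism with score u(D, w) = -\hat{L}_iota(w, D),
    # sampling w with probability proportional to
    # exp(-n * iota * eps * \hat{L}_iota(w, D) / log 2).
    losses = np.array([truncated_loss(w, X, y, iota) for w in net])
    logits = -n * iota * eps * losses / np.log(2.0)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return net[rng.choice(len(net), p=probs)]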
Theorem 1.

For any \epsilon>0, Algorithm 1 is \epsilon-DP. Moreover, under Assumption 1, given any failure probability \eta\in(0,1), for the output \tilde{w} we have the following with probability at least 1-\eta for any \iota>0:

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\zeta\tau+\iota\zeta^{2}\tau^{2}+\iota\tau^{2}\Delta^{2}+\frac{1}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}).

Furthermore, by setting \zeta=O(\frac{1}{n}) and \iota=O(\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon\tau^{2}}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\tau\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon}}). (7)

Moreover, under Assumption 2, setting \zeta=O(\frac{1}{n}) and \iota=O(\sqrt{\frac{\log n\log\frac{1}{\eta}}{\tau^{2}n\epsilon}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\tau\sqrt{\frac{d^{2}\log n\log\frac{1}{\eta}}{n\epsilon}}), (8)

where the Big-O notations omit factors of \Delta.
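As a sanity check on the choice of \iota (a back-of-the-envelope calculation of ours, with \Delta and constants suppressed as in the theorem): with \zeta=O(\frac{1}{n}), the terms \zeta\tau and \iota\zeta^{2}\tau^{2} are of lower order and \log N(\mathcal{W},\zeta)\leq d\log(3\Delta n), so the bound in Theorem 1 behaves like \iota\tau^{2}\Delta^{2}+\frac{d\log n\log\frac{1}{\eta}}{n\iota\epsilon}. Equating the two terms gives \iota=O(\frac{1}{\tau\Delta}\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon}}) and an excess risk of O(\tau\Delta\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon}}), which recovers (7) once the \Delta factor is absorbed into the Big-O notation.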

Remark 1.

First, notice that since Assumption 1 is stronger, there is a gap of O(\sqrt{d}) between (7) and (8). Moreover, the upper bound in (8) matches the lower bound for private mean estimation under Assumption 2 given in [18]; however, it is still unknown whether this lower bound is tight for DP-SCO with general convex losses. To the best of our knowledge, this is the first \epsilon-DP algorithm achieving the bound of \tilde{O}(\frac{d}{\sqrt{n\epsilon}}) under this assumption.

Proof of Theorem 1.

The \epsilon-DP guarantee follows from the fact that the sensitivity of the score function is bounded by \frac{2\log 2}{n\iota} and the guarantee of the exponential mechanism. Next we prove the utility. For convenience, denote w_{\zeta}=\arg\min_{w\in\tilde{W}_{\zeta}}\hat{L}_{\iota}(w,D) and the optimal solution of (2) by \hat{w}_{\iota}. By the utility of the exponential mechanism (Lemma 1) with t=\log\frac{1}{\eta}, we have with probability at least 1-\eta, -\hat{L}_{\iota}(\tilde{w},D)\geq-\hat{L}_{\iota}(w_{\zeta},D)-\frac{4\log 2}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. That is,

\hat{L}_{\iota}(\tilde{w},D)\leq\hat{L}_{\iota}(w_{\zeta},D)+\frac{4\log 2}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (9)

For the term \hat{L}_{\iota}(\tilde{w},D) in (9) we have the following.

Lemma 4 ([27]).

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq L_{\mathcal{D}}(w)-\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (10)

For the term \hat{L}_{\iota}(w_{\zeta},D) in (9): since \tilde{W}_{\zeta} is a \zeta-net, there exists a \tilde{w}_{\iota}\in\tilde{W}_{\zeta} such that \|\tilde{w}_{\iota}-\hat{w}_{\iota}\|_{2}\leq\zeta, where \hat{w}_{\iota}=\arg\min_{w\in\mathcal{W}}\hat{L}_{\iota}(w,D). By the definition of w_{\zeta} we have

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D). (11)

For the term \hat{L}_{\iota}(\tilde{w}_{\iota},D) we have the following lemma.

Lemma 5 ([27]).

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)\leq L_{\mathcal{D}}(w)+\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (12)

Thus, combining (11) and Lemma 5, we have with probability at least 1-\eta

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D)
\leq L_{\mathcal{D}}(\tilde{w}_{\iota})+\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\leq L_{\mathcal{D}}(\hat{w}_{\iota})+\zeta\tau+\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (13)

where the last inequality is due to

L_{\mathcal{D}}(\tilde{w}_{\iota})-L_{\mathcal{D}}(\hat{w}_{\iota})=\mathbb{E}[|y-\langle x,\tilde{w}_{\iota}\rangle|-|y-\langle x,\hat{w}_{\iota}\rangle|]\leq\mathbb{E}|\langle x,\tilde{w}_{\iota}-\hat{w}_{\iota}\rangle|\leq\zeta\mathbb{E}\|x\|_{2}\leq\zeta\tau,

where the last step uses Jensen's inequality, \mathbb{E}\|x\|_{2}\leq\sqrt{\mathbb{E}\|x\|_{2}^{2}}\leq\tau.

The relation between L_{\mathcal{D}}(\hat{w}_{\iota}) and L_{\mathcal{D}}(w^{*}) is given by the following lemma from [27].

Lemma 6.

For any given failure probability \eta, under Assumption 1, we have the following with probability at least 1-2\eta:

L_{\mathcal{D}}(\hat{w}_{\iota})-L_{\mathcal{D}}(w^{*})\leq 2\zeta\tau+\iota\zeta^{2}\tau^{2}+\frac{3\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.

Thus, plugging Lemma 4, (13) and Lemma 6 into (9), we have with probability at least 1-4\eta

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq 3\zeta\tau+\iota\zeta^{2}\tau^{2}+\frac{5\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{3}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}+\frac{4\log 2}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (14)

By \log N(\mathcal{W},\zeta)\leq d\log\frac{3\Delta}{\zeta} and the following inequality we can complete the proof:

\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}\leq 2\mathbb{E}(y-\langle x,w^{*}\rangle)^{2}+2\mathbb{E}\|x\|_{2}^{2}\sup_{w\in\mathcal{W}}\|w-w^{*}\|_{2}^{2}=O(\Delta^{2}\tau^{2}),

which follows from (a+b)^{2}\leq 2a^{2}+2b^{2} and the Cauchy-Schwarz inequality.

For (8), note that the same proof goes through with \tau replaced by \mathbb{E}\|x\|_{2} in (14). By Assumption 2 we have \mathbb{E}\|x\|_{2}\leq\tau\sqrt{d} and \mathbb{E}\|x\|^{2}_{2}\leq d\tau^{2}. Thus, taking \zeta=O(\frac{1}{n}) and \iota=O(\sqrt{\frac{\log n\log\frac{1}{\eta}}{\tau^{2}n\epsilon}}) finishes the proof. ∎

IV-B Bounded \theta-th moment case

Motivated by [28], for the \ell_{1}-regression problem we can further relax the second moment assumptions in Assumptions 1 and 2 to finite \theta-th moment assumptions with any \theta\in(1,2). Similarly to the second moment case, we consider two settings:

Assumption 3.

There exists a \theta\in(1,2) such that the \theta-th order moment of \|x\|_{2} is bounded by \tau^{\theta}<\infty for some constant \tau (we write \tau^{\theta} for convenience of comparison with the second moment case), that is, \mathbb{E}_{(x,y)\sim\mathcal{D}}\|x\|_{2}^{\theta}\leq\tau^{\theta}=O(1).

Assumption 4.

We assume that the \theta-th moment of each coordinate of x is bounded by \tau^{\theta}=O(1), that is, \forall j\in[d],\ \mathbb{E}_{(x,y)\sim\mathcal{D}}|x_{j}|^{\theta}\leq\tau^{\theta}=O(1).

Our main idea here is almost the same as in the bounded second moment case, and we still focus on the function \hat{L}_{\iota}(w,D) in (2). However, we adjust the non-decreasing truncation function \psi:\mathbb{R}\mapsto\mathbb{R} so that it satisfies the following inequality instead of (3):

-\log(1-x+\frac{|x|^{\theta}}{\theta})\leq\psi(x)\leq\log(1+x+\frac{|x|^{\theta}}{\theta}). (15)

Motivated by (6), here we use the following specific form for \psi:

\psi(x)=\begin{cases}-\log(1-x+\frac{|x|^{\theta}}{\theta}),&0\leq x\leq 1\\ \log\theta,&x\geq 1\\ -\psi(-x),&x\leq 0\end{cases} (16)
Algorithm 2 Exponential mechanism for \ell_{1}-regression (\theta-th moment)

\mathbf{Input}: D=\{(x_{i},y_{i})\}_{i=1}^{n}; privacy parameter \epsilon; parameters \iota,\zeta; truncated empirical risk \hat{L}_{\iota} in (2) with \psi in (16).

1: Find a \zeta-net of \mathcal{W} with covering number N(\mathcal{W},\zeta), denoted by \tilde{W}_{\zeta}=\{w_{1},\cdots,w_{N(\mathcal{W},\zeta)}\}.
2: Run the exponential mechanism with range R=\tilde{W}_{\zeta} and score function u(D,w)=-\hat{L}_{\iota}(w,D). That is, output a w\in\tilde{W}_{\zeta} with probability proportional to \exp(\frac{-n\iota\epsilon\hat{L}_{\iota}(w,D)}{\log\theta}).

It is easy to see that (16) satisfies the inequality in (15). Moreover, its absolute value is upper bounded by \log\theta, so the sensitivity of \hat{L}_{\iota}(w,D) is upper bounded by \frac{2\log\theta}{n\iota}. Therefore, we again use the exponential mechanism (see Algorithm 2 for details), whose output has the following utility.
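For completeness, a sketch of the \theta-version of the truncation function in (16), analogous to the earlier psi sketch (again our own illustrative code); Algorithm 2 is then the Algorithm 1 sketch with psi replaced by psi_theta and \log 2 replaced by \log\theta in the exponent.

import numpy as np

def psi_theta(x, theta):
    # Truncation function in (16) for theta in (1, 2): odd, non-decreasing,
    # and bounded in absolute value by log(theta).
    x = np.asarray(x, dtype=float)
    s, a = np.sign(x), np.abs(x)
    return s * np.where(a <= 1.0,
                        -np.log(1.0 - a + a**theta / theta),
                        np.log(theta))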

Theorem 2.

For any \epsilon>0, Algorithm 2 is \epsilon-DP. Moreover, under Assumption 3, given any failure probability \eta\in(0,1), for the output \tilde{w} we have the following with probability at least 1-\eta for any \iota>0:

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O\big(\zeta\tau+\iota^{\theta-1}\tau^{\theta}+\iota^{\theta-1}\zeta^{\theta}\tau^{\theta}+\frac{1}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}\big). (17)

Taking \zeta=O(\frac{1}{n}) and \iota=O(\frac{1}{\tau}(\frac{d\log\frac{1}{\eta}}{n\epsilon})^{\frac{1}{\theta}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq\tilde{O}(\tau(\frac{d\log n\log\frac{1}{\eta}}{n\epsilon})^{\frac{\theta-1}{\theta}}). (18)

Moreover, under Assumption 4, setting \zeta=O(\frac{1}{n}) and \iota=O(\frac{1}{\tau}(\frac{\log n\log\frac{1}{\eta}}{n\epsilon})^{\frac{1}{\theta}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\tau d(\frac{\log n\log\frac{1}{\eta}}{n\epsilon})^{\frac{\theta-1}{\theta}}). (19)
Proof of Theorem 2.

The \epsilon-DP guarantee follows from the fact that the sensitivity of the score function is bounded by \frac{2\log\theta}{n\iota} and the guarantee of the exponential mechanism. Next we prove the utility. For convenience, denote w_{\zeta}=\arg\min_{w\in\tilde{W}_{\zeta}}\hat{L}_{\iota}(w,D) and the optimal solution of (2) by \hat{w}_{\iota}. By the utility of the exponential mechanism (Lemma 1) with t=\log\frac{1}{\eta}, we have with probability at least 1-\eta, -\hat{L}_{\iota}(\tilde{w},D)\geq-\hat{L}_{\iota}(w_{\zeta},D)-\frac{4\log\theta}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. That is,

\hat{L}_{\iota}(\tilde{w},D)\leq\hat{L}_{\iota}(w_{\zeta},D)+\frac{4\log\theta}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (20)

For the term \hat{L}_{\iota}(\tilde{w},D) in (20) we have the following inequality.

Lemma 7 ([28]).

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq L_{\mathcal{D}}(w)-\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.

For the term \hat{L}_{\iota}(w_{\zeta},D) in (20): since \tilde{W}_{\zeta} is a \zeta-net, there exists a \tilde{w}_{\iota}\in\tilde{W}_{\zeta} such that \|\tilde{w}_{\iota}-\hat{w}_{\iota}\|_{2}\leq\zeta, where \hat{w}_{\iota}=\arg\min_{w\in\mathcal{W}}\hat{L}_{\iota}(w,D). By the definition of w_{\zeta} we have

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D). (21)

For the term \hat{L}_{\iota}(\tilde{w}_{\iota},D) we have the following lemma.

Lemma 8.

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)\leq L_{\mathcal{D}}(w)+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (22)

Thus, combining (21) and Lemma 8, we have with probability at least 1-\eta

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D)
\leq L_{\mathcal{D}}(\tilde{w}_{\iota})+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\leq L_{\mathcal{D}}(\hat{w}_{\iota})+\zeta\tau+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (23)

where the last inequality follows as in the second moment case, since L_{\mathcal{D}}(\tilde{w}_{\iota})-L_{\mathcal{D}}(\hat{w}_{\iota})\leq\zeta\mathbb{E}\|x\|_{2}\leq\zeta\tau.

In the following we show the relation between L_{\mathcal{D}}(\hat{w}_{\iota}) and L_{\mathcal{D}}(w^{*}). We first state the following lemma:

Lemma 9 ([28]).

With probability at least 1-\eta,

L_{\mathcal{D}}(\hat{w}_{\iota})-\hat{L}_{\iota}(\hat{w}_{\iota},D)\leq 2\zeta\tau+\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.

By Lemma 9 and the definition of \hat{w}_{\iota}, we have with probability at least 1-2\eta

L_{\mathcal{D}}(\hat{w}_{\iota})\leq\hat{L}_{\iota}(\hat{w}_{\iota},D)+2\zeta\tau+\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\leq L_{\mathcal{D}}(w^{*})+2\zeta\tau+\frac{2(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{2}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (24)

where the last inequality of (24) uses \hat{L}_{\iota}(\hat{w}_{\iota},D)\leq\hat{L}_{\iota}(w^{*},D) (by the definition of \hat{w}_{\iota}) and the following inequality, which holds with probability at least 1-\eta and whose proof is the same as that of Lemma 8 (we omit it here):

\hat{L}_{\iota}(w^{*},D)\leq L_{\mathcal{D}}(w^{*})+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{1}{\eta}. (25)

Thus, combining (20), Lemma 7, (23) and (24), we have with probability at least 1-5\eta

L_{\mathcal{D}}(\tilde{w})\leq L_{\mathcal{D}}(w^{*})+3\zeta\tau+\frac{2(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{2(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{8\log\theta}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (26)

Since \log N(\mathcal{W},\zeta)\leq d\log\frac{3\Delta}{\zeta}, we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\zeta\tau+\iota^{\theta-1}\tau^{\theta}+\iota^{\theta-1}\zeta^{\theta}\tau^{\theta}+\frac{d}{n\iota\epsilon}\log\frac{1}{\zeta\eta}), (27)

which is due to the fact that \sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}\leq O(\mathbb{E}|y-\langle x,w^{*}\rangle|^{\theta}+\mathbb{E}|\langle x,w^{*}-w\rangle|^{\theta})\leq O(\Delta^{\theta}\tau^{\theta}). Taking \iota=O(\frac{1}{\tau}(\frac{d\log\frac{1}{\eta}}{n\epsilon})^{\frac{1}{\theta}}) and \zeta=\frac{1}{n} yields (18).

For (19), we can replace \tau by \mathbb{E}\|x\|_{2} in (27). Under Assumption 4, using the inequality \mathbb{E}\|x\|_{2}^{\theta}\leq\mathbb{E}\|x\|_{\theta}^{\theta}=\sum_{j=1}^{d}\mathbb{E}|x_{j}|^{\theta}\leq d\tau^{\theta}, we obtain the result by taking \iota=O(\frac{1}{\tau}(\frac{\log n}{n\epsilon})^{\frac{1}{\theta}}) and \zeta=O(\frac{1}{n}). ∎
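For completeness, a short justification (ours; a standard comparison of \ell_{p} norms) of the step \|x\|_{2}\leq\|x\|_{\theta} used above: for \theta\in(1,2) and any x\in\mathbb{R}^{d},

\|x\|_{2}^{2}=\sum_{j=1}^{d}(|x_{j}|^{\theta})^{\frac{2}{\theta}}\leq\big(\sum_{j=1}^{d}|x_{j}|^{\theta}\big)^{\frac{2}{\theta}}=\|x\|_{\theta}^{2},

since \frac{2}{\theta}>1 and \sum_{j}a_{j}^{p}\leq(\sum_{j}a_{j})^{p} for p\geq 1 and a_{j}\geq 0. Hence \mathbb{E}\|x\|_{2}^{\theta}\leq\mathbb{E}\|x\|_{\theta}^{\theta}=\sum_{j=1}^{d}\mathbb{E}|x_{j}|^{\theta}\leq d\tau^{\theta}.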

References

  • [1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference.   Springer, 2006, pp. 265–284.
  • [2] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on.   IEEE, 2014, pp. 464–473.
  • [3] D. Wang and J. Xu, “Differentially private empirical risk minimization with smooth non-convex loss functions: A non-stationary view,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 1182–1189.
  • [4] D. Wang, C. Chen, and J. Xu, “Differentially private empirical risk minimization with non-convex loss functions,” in International Conference on Machine Learning, 2019, pp. 6526–6535.
  • [5] R. Bassily, V. Feldman, K. Talwar, and A. Thakurta, “Private stochastic convex optimization with optimal rates,” in NeurIPS, 2019.
  • [6] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar, “Stability of stochastic gradient descent on nonsmooth convex losses,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [7] V. Feldman, T. Koren, and K. Talwar, “Private stochastic convex optimization: optimal rates in linear time,” in Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 439–449.
  • [8] S. Song, O. Thakkar, and A. Thakurta, “Characterizing private clipped gradient descent on convex generalized linear problems,” arXiv preprint arXiv:2006.06783, 2020.
  • [9] J. Su and D. Wang, “Faster rates of differentially private stochastic convex optimization,” arXiv preprint arXiv:2108.00331, 2021.
  • [10] H. Asi, J. Duchi, A. Fallah, O. Javidbakht, and K. Talwar, “Private adaptive gradient methods for convex optimization,” in International Conference on Machine Learning.   PMLR, 2021, pp. 383–392.
  • [11] R. Bassily, C. Guzmán, and A. Nandi, “Non-euclidean differentially private stochastic convex optimization,” arXiv preprint arXiv:2103.01278, 2021.
  • [12] J. Kulkarni, Y. T. Lee, and D. Liu, “Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps,” arXiv preprint arXiv:2103.15352, 2021.
  • [13] R. Bassily, C. Guzmán, and M. Menart, “Differentially private stochastic optimization: New results in convex and non-convex settings,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [14] H. Asi, V. Feldman, T. Koren, and K. Talwar, “Private stochastic convex optimization: Optimal rates in \ell_{1} geometry,” arXiv preprint arXiv:2103.01516, 2021.
  • [15] Z. Xue, S. Yang, M. Huai, and D. Wang, “Differentially private pairwise learning revisited,” in IJCAI.   ijcai.org, 2021, pp. 3242–3248.
  • [16] D. Wang, H. Xiao, S. Devadas, and J. Xu, “On differentially private stochastic convex optimization with heavy-tailed data,” arXiv preprint arXiv:2010.11082, 2020.
  • [17] L. Hu, S. Ni, H. Xiao, and D. Wang, “High dimensional differentially private stochastic optimization with heavy-tailed data,” arXiv preprint arXiv:2107.11136, 2021.
  • [18] G. Kamath, X. Liu, and H. Zhang, “Improved rates for differentially private stochastic convex optimization with heavy-tailed data,” arXiv preprint arXiv:2106.01336, 2021.
  • [19] R. F. Barber and J. C. Duchi, “Privacy and statistical risk: Formalisms and minimax bounds,” arXiv preprint arXiv:1412.4451, 2014.
  • [20] G. Kamath, V. Singhal, and J. Ullman, “Private mean estimation of heavy-tailed distributions,” arXiv preprint arXiv:2002.09464, 2020.
  • [21] X. Liu, W. Kong, S. Kakade, and S. Oh, “Robust and differentially private mean estimation,” arXiv preprint arXiv:2102.09159, 2021.
  • [22] Y. Tao, Y. Wu, P. Zhao, and D. Wang, “Optimal rates of (locally) differentially private heavy-tailed multi-armed bandits,” arXiv preprint arXiv:2106.02575, 2021.
  • [23] K. Nissim, S. Raskhodnikova, and A. Smith, “Smooth sensitivity and sampling in private data analysis,” in Proceedings of the thirty-ninth annual ACM symposium on Theory of computing.   ACM, 2007, pp. 75–84.
  • [24] M. Bun and T. Steinke, “Average-case averages: Private algorithms for smooth sensitivity and mean estimation,” arXiv preprint arXiv:1906.02830, 2019.
  • [25] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy.” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
  • [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science.   Cambridge university press, 2018, vol. 47.
  • [27] L. Zhang and Z.-H. Zhou, “\ell_{1}-regression with heavy-tailed distributions,” in Advances in Neural Information Processing Systems, 2018, pp. 1076–1086.
  • [28] P. Chen, X. Jin, X. Li, and L. Xu, “A generalized Catoni’s M-estimator under finite \alpha-th moment assumption with \alpha\in(1,2),” Electronic Journal of Statistics, vol. 15, no. 2, pp. 5523–5544, 2021.

The following omitted proofs have been shown in [27] and [28]; we include them here for completeness.

Proof of Lemma 4.

First, note that our truncation function \psi satisfies \psi(x)\geq-\log(1-x+\frac{x^{2}}{2}). Thus we have

\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1-\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{2}(y_{i}-\langle x_{i},w\rangle)^{2}}{2})] (28)
=(\mathbb{E}[1-\iota|y-\langle x,w\rangle|+\frac{\iota^{2}(y-\langle x,w\rangle)^{2}}{2}])^{n}
=\big(1-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2}\big)^{n}
\leq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})\big), (29)

where (28) is due to the previous inequality and independence, and (29) is due to the inequality 1+x\leq e^{x}. By Chernoff's method, we have

\text{Pr}\{-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)}\leq\eta. (30)

Thus, with probability at least 1-\eta we have

-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}
\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} (which replaces \log\frac{1}{\eta} by \log\frac{N(\mathcal{W},\zeta)}{\eta}) completes the proof. ∎

Proof of Lemma 5.

First, note that the truncation function \psi satisfies \psi(x)\leq\log(1+x+\frac{x^{2}}{2}). Thus we have

\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1+\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{2}(y_{i}-\langle x_{i},w\rangle)^{2}}{2})] (31)
=(\mathbb{E}[1+\iota|y-\langle x,w\rangle|+\frac{\iota^{2}(y-\langle x,w\rangle)^{2}}{2}])^{n}
=\big(1+\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2}\big)^{n}
\leq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})\big), (32)

where (31) is due to the previous inequality and independence, and (32) is due to the inequality 1+x\leq e^{x}. By Chernoff's method,

\text{Pr}\{\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)}\leq\eta.

Thus, with probability at least 1-\eta we have

\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}
\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} completes the proof. ∎

Proof of Lemma 7.

First, note that our truncation function \psi satisfies \psi(x)\geq-\log(1-x+\frac{|x|^{\theta}}{\theta}). Thus we have

\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1-\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{\theta}|y_{i}-\langle x_{i},w\rangle|^{\theta}}{\theta})] (33)
=(\mathbb{E}[1-\iota|y-\langle x,w\rangle|+\frac{\iota^{\theta}|y-\langle x,w\rangle|^{\theta}}{\theta}])^{n}
=\big(1-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}\big)^{n}
\leq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})\big), (34)

where (33) is due to the previous inequality and independence, and (34) is due to the inequality 1+x\leq e^{x}. By Chernoff's method, we have

\text{Pr}\{-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)}\leq\eta. (35)

Thus, with probability at least 1-\eta we have

-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}
\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} completes the proof. ∎

Proof of Lemma 8.

First, note that the truncation function \psi satisfies \psi(x)\leq\log(1+x+\frac{|x|^{\theta}}{\theta}). Thus we have

\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1+\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{\theta}|y_{i}-\langle x_{i},w\rangle|^{\theta}}{\theta})] (36)
=(\mathbb{E}[1+\iota|y-\langle x,w\rangle|+\frac{\iota^{\theta}|y-\langle x,w\rangle|^{\theta}}{\theta}])^{n}
=\big(1+\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}\big)^{n}
\leq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})\big), (37)

where (36) is due to the previous inequality and independence, and (37) is due to the inequality 1+x\leq e^{x}. By Chernoff's method, we have

\text{Pr}\{\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)}\leq\eta.

Thus, with probability at least 1-\eta we have

\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}
\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} completes the proof. ∎

Proof of Lemma 9.

As mentioned before, there exists a \tilde{w}_{\iota}\in\tilde{W}_{\zeta} such that \|\tilde{w}_{\iota}-\hat{w}_{\iota}\|_{2}\leq\zeta. This implies that

|y_{i}-\langle x_{i},\hat{w}_{\iota}\rangle|\geq|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-|\langle x_{i},\tilde{w}_{\iota}-\hat{w}_{\iota}\rangle|\geq|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-\zeta\|x_{i}\|_{2}.

Since \psi is non-decreasing, this implies that

\hat{L}_{\iota}(\hat{w}_{\iota},D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\hat{w}_{\iota}\rangle|)\geq\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-\iota\zeta\|x_{i}\|_{2}).

We then prove the following lemma.

Lemma 10.

For any w\in\tilde{W}_{\zeta}, with probability at least 1-\eta, the following inequality holds:

-\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|-\iota\zeta\|x_{i}\|_{2})\leq-L_{\mathcal{D}}(w)+\zeta\tau+\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.
Proof of Lemma 10.

The idea of the proof is almost the same as that of Lemma 8. First, note that our truncation function \psi satisfies \psi(x)\geq-\log(1-x+\frac{|x|^{\theta}}{\theta}). Thus we have

\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|-\iota\zeta\|x_{i}\|_{2}))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1-\iota|y_{i}-\langle x_{i},w\rangle|+\iota\zeta\|x_{i}\|_{2}+\frac{\iota^{\theta}(|y_{i}-\langle x_{i},w\rangle|-\zeta\|x_{i}\|_{2})^{\theta}}{\theta})] (38)
=(\mathbb{E}[1-\iota|y-\langle x,w\rangle|+\iota\zeta\|x\|_{2}+\frac{\iota^{\theta}(|y-\langle x,w\rangle|-\zeta\|x\|_{2})^{\theta}}{\theta}])^{n}
=\big(1-\iota L_{\mathcal{D}}(w)+\iota\zeta\mathbb{E}\|x\|_{2}+\frac{\iota^{\theta}}{\theta}\mathbb{E}(|y-\langle x,w\rangle|-\zeta\|x\|_{2})^{\theta}\big)^{n}
\leq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\iota\zeta\mathbb{E}\|x\|_{2}+\frac{\iota^{\theta}2^{\theta-1}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{\iota^{\theta}2^{\theta-1}\zeta^{\theta}}{\theta}\mathbb{E}\|x\|_{2}^{\theta})\big), (39)

where the last step uses 1+x\leq e^{x} and |a-b|^{\theta}\leq 2^{\theta-1}(|a|^{\theta}+|b|^{\theta}). By Chernoff's method, we have with probability at least 1-\eta

-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|-\iota\zeta\|x_{i}\|_{2})\leq n\big(-\iota L_{\mathcal{D}}(w)+\iota\zeta\mathbb{E}\|x\|_{2}+\frac{\iota^{\theta}2^{\theta-1}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{\iota^{\theta}2^{\theta-1}\zeta^{\theta}}{\theta}\mathbb{E}\|x\|_{2}^{\theta}\big)+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta}, dividing by n\iota, bounding \mathbb{E}|y-\langle x,w\rangle|^{\theta} by its supremum over \mathcal{W}, and using \mathbb{E}\|x\|_{2}\leq\tau and \mathbb{E}\|x\|_{2}^{\theta}\leq\tau^{\theta}, we complete the proof. ∎

Thus, by Lemma 10 we have with probability at least 1-\eta

\hat{L}_{\iota}(\hat{w}_{\iota},D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\hat{w}_{\iota}\rangle|)
\geq\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-\iota\zeta\|x_{i}\|_{2})
\geq L_{\mathcal{D}}(\tilde{w}_{\iota})-\zeta\tau-\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}-\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\geq L_{\mathcal{D}}(\hat{w}_{\iota})-2\zeta\tau-\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}-\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (40)

where the last inequality of (40) is due to

L_{\mathcal{D}}(\tilde{w}_{\iota})-L_{\mathcal{D}}(\hat{w}_{\iota})=\mathbb{E}[|y-\langle x,\tilde{w}_{\iota}\rangle|-|y-\langle x,\hat{w}_{\iota}\rangle|]\leq\mathbb{E}|\langle x,\tilde{w}_{\iota}-\hat{w}_{\iota}\rangle|\leq\zeta\mathbb{E}\|x\|_{2}\leq\zeta\tau.

Rearranging the terms in (40) completes the proof of Lemma 9. ∎