
Differentially Private \ell_{1}-norm Linear Regression with Heavy-tailed Data

Di Wang^{1} and Jinhui Xu^{3}
^{1}Division of Computer, Electrical and Mathematical Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. Email: [email protected]
^{3}Department of Computer Science and Engineering, State University of New York at Buffalo, Buffalo, NY. Email: [email protected]
Abstract

We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on \ell_{1}-norm linear regression in the \epsilon-DP model. While most previous work assumes the loss function is Lipschitz, here we only assume the variates have bounded moments. We first study the case where the \ell_{2} norm of the data has bounded second order moment. We propose an algorithm based on the exponential mechanism and show that it achieves an upper bound of \tilde{O}(\sqrt{\frac{d}{n\epsilon}}) (with high probability). Next, we relax the assumption to a bounded \theta-th order moment for some \theta\in(1,2) and show that an upper bound of \tilde{O}((\frac{d}{n\epsilon})^{\frac{\theta-1}{\theta}}) is achievable. Our algorithms can also be extended to the more relaxed setting where only each coordinate of the data has bounded moments, in which case we obtain upper bounds of \tilde{O}(\frac{d}{\sqrt{n\epsilon}}) and \tilde{O}(\frac{d}{(n\epsilon)^{\frac{\theta-1}{\theta}}}) in the second and \theta-th moment cases, respectively.

I Introduction

Stochastic Convex Optimization under the differential privacy (DP) [1] constraint, i.e., Differentially Private Stochastic Convex Optimization (DP-SCO), is one of the most fundamental problems in both statistical machine learning and differential privacy, and it has received much attention in recent years [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. In DP-SCO, we have a loss function \ell and a dataset D=\{(x_{1},y_{1}),(x_{2},y_{2}),\cdots,(x_{n},y_{n})\} of size n, where each pair of variate and label/response (x_{i},y_{i}) is i.i.d. sampled from some unknown distribution \mathcal{D}. The goal of DP-SCO is to privately minimize the population risk function L_{\mathcal{D}}(w)=\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(w,x,y)] over a parameter space \mathcal{W}\subseteq\mathbb{R}^{d}, i.e., we aim to design a DP algorithm whose output w^{priv} makes the excess population risk L_{\mathcal{D}}(w^{priv})-\min_{w\in\mathcal{W}}L_{\mathcal{D}}(w) as small as possible with high probability.

Although DP-SCO and its empirical counterpart, Differentially Private Empirical Risk Minimization (DP-ERM), have been extensively studied for many years and a long line of work has attacked these problems from different perspectives, most of this work needs to assume the data distribution is bounded, which implies that the loss function is Lipschitz; this is unrealistic and does not always hold in practice. To relax the Lipschitz assumption, starting from [16], several recent works have studied DP-SCO with heavy-tailed data, where the Lipschitz assumption is removed and one only assumes that the distribution of the gradient of the loss function has bounded finite-order moments [16, 17, 18]. However, there are still several open problems in DP-SCO with heavy-tailed data. First, previous work only considered the case where the loss function is smooth. Second, most of this work studied the (\epsilon,\delta)-DP model, and the behavior of the problem in the \epsilon-DP model is still far from well understood. Third, all previous results need to assume the data or the gradient of the loss has at least bounded second (order) moments and cannot be extended to the more relaxed case where it only has a \theta-th moment with \theta\in(1,2). In this paper, we continue along the directions of these open problems. Specifically, we study the problem of \ell_{1}-norm linear regression (i.e., \ell(w,x,y)=|\langle x,w\rangle-y|) with heavy-tailed data in the \epsilon-DP model. Our contributions can be summarized as follows.

  • We first consider the case where the second moment of the \ell_{2}-norm of the variate x, i.e., \mathbb{E}\|x\|^{2}_{2}, is bounded. Specifically, we propose a method based on the exponential mechanism and show that it achieves an upper bound of \tilde{O}(\sqrt{\frac{d}{n\epsilon}}) with high probability. Moreover, instead of the \ell_{2}-norm, we also consider the case where each coordinate of x has bounded second moment, i.e., \mathbb{E}|x_{j}|^{2}<\infty for every j\in[d]. We show that our algorithm achieves an error bound of \tilde{O}(\frac{d}{\sqrt{n\epsilon}}).

  • We then investigate the relaxed case where the data only has a \theta-th moment with \theta\in(1,2). First, similarly to the second moment case, we assume that \mathbb{E}\|x\|^{\theta}_{2}<\infty and show that a rate of \tilde{O}\big((\frac{d}{n\epsilon})^{\frac{\theta-1}{\theta}}\big) is achievable. Then, under the relaxed condition that \mathbb{E}|x_{j}|^{\theta}<\infty for every j\in[d], we show that our algorithm achieves an error of \tilde{O}\big(\frac{d}{(n\epsilon)^{\frac{\theta-1}{\theta}}}\big). To the best of our knowledge, this is the first theoretical result for DP-SCO with heavy-tailed data that only has a \theta-th moment with \theta\in(1,2).

II Related Work

Although there is a long list of work studying either DP-SCO/DP-ERM or robust estimation, DP-SCO with heavy-tailed data is not well understood. Below we discuss the related work on DP-SCO with heavy-tailed data and on private and robust mean estimation.

For private estimation of heavy-tailed distributions, [19] provides the first study on private mean estimation for distributions with bounded second moment and establishes minimax private rates. Later, [20] also studies heavy-tailed mean estimation under a relaxed assumption, which has also been studied by [21, 22] recently. However, due to the complex nature of \ell_{1} regression, these methods for mean estimation cannot be applied to our problem. Moreover, it is unknown whether their methods can be extended to the case where each coordinate of the data only has a \theta-th order moment with \theta\in(1,2).

For DP-SCO with heavy-tailed data, [16] first studies the problem by proposing three methods based on different assumptions. The first method is based on smooth sensitivity and the Sample-and-Aggregate framework [23]; however, it requires strong assumptions and its error bound is quite large. Their second method is also based on smooth sensitivity [24], but it needs to assume the distribution is sub-exponential. [16] also provides a new private estimator motivated by previous work in robust statistics. [18] recently revisits the problem and improves the (expected) excess population risk for both convex and strongly convex loss functions. It also provides lower bounds for the mean estimation problem in both the (\epsilon,\delta)-DP and \epsilon-DP models. However, as mentioned earlier, all of these algorithms are for the (\epsilon,\delta)-DP model and need to assume the loss function is smooth; thus, their methods cannot be applied to our problem. Later, [17] studies the problem in the high dimensional setting where the dimension can be far greater than the data size. It focuses on the case where the loss function is smooth and the constraint set is a polytope or an \ell_{0}-norm ball, which is quite different from our setting.

III Preliminaries

Definition 1 (Differential Privacy [1]).

Given a data universe \mathcal{X}, we say that two datasets D,D^{\prime}\subseteq\mathcal{X} are neighbors if they differ by only one entry, which is denoted as D\sim D^{\prime}. A randomized algorithm \mathcal{A} is \epsilon-differentially private (DP) if for all neighboring datasets D,D^{\prime} and for all events S in the output space of \mathcal{A}, the following holds: \text{Pr}(\mathcal{A}(D)\in S)\leq e^{\epsilon}\text{Pr}(\mathcal{A}(D^{\prime})\in S).

Definition 2 (Exponential Mechanism).

The exponential mechanism allows differentially private computation over an arbitrary domain and range \mathcal{R}, parametrized by a score function u(D,r) which maps a pair of input dataset D and candidate result r\in\mathcal{R} to a real valued score. With the score function u and privacy budget \epsilon, the mechanism yields an output with exponential bias in favor of high scoring outputs. Let \mathcal{M}(D,u,\mathcal{R}) denote the exponential mechanism, and let \Delta u be the sensitivity of u over the range \mathcal{R}, i.e., \Delta u=\max_{r\in\mathcal{R}}\max_{D\sim D^{\prime}}|u(D,r)-u(D^{\prime},r)|. If \mathcal{M}(D,u,\mathcal{R}) selects and outputs an element r\in\mathcal{R} with probability proportional to \exp(\frac{\epsilon u(D,r)}{2\Delta u}), then it preserves \epsilon-differential privacy.
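To make the sampling step concrete, the following is a minimal sketch (ours, not from [25]) of the exponential mechanism over a finite candidate set; the names exponential_mechanism, score_fn, and sensitivity are our own illustrative choices.

import numpy as np

def exponential_mechanism(candidates, score_fn, sensitivity, eps, rng=None):
    # candidates: a finite list of possible outputs r in R
    # score_fn(r): the data-dependent score u(D, r); higher is better
    # sensitivity: Delta u, the maximum change of u over neighboring datasets
    rng = np.random.default_rng() if rng is None else rng
    scores = np.array([score_fn(r) for r in candidates], dtype=float)
    # Subtracting the max score is for numerical stability only; it does not
    # change the distribution proportional to exp(eps * u / (2 * Delta u)).
    logits = eps * (scores - scores.max()) / (2.0 * sensitivity)
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]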

Lemma 1.

[25] For the exponential mechanism \mathcal{M}(D,u,\mathcal{R}), we have

\text{Pr}\{u(\mathcal{M}(D,u,\mathcal{R}))\leq OPT_{u}(D)-\frac{2\Delta u}{\epsilon}(\ln|\mathcal{R}|+t)\}\leq e^{-t},

where OPT_{u}(D) is the highest score over the range \mathcal{R}, i.e., \max_{r\in\mathcal{R}}u(D,r).

Definition 3 (DP-SCO [2]).

Given a dataset D=\{z_{1},\cdots,z_{n}\} from a data universe \mathcal{Z}, where each z_{i}=(x_{i},y_{i}) (with feature vector x_{i} and label/response y_{i}) is an i.i.d. sample from some unknown distribution \mathcal{D}, a convex constraint set \mathcal{W}\subseteq\mathbb{R}^{d}, and a convex loss function \ell:\mathcal{W}\times\mathcal{Z}\mapsto\mathbb{R}, Differentially Private Stochastic Convex Optimization (DP-SCO) is to find w^{\text{priv}} that minimizes the population risk L_{\mathcal{D}}(w)=\mathbb{E}_{z\sim\mathcal{D}}[\ell(w,z)] with the guarantee of being DP. The utility of the algorithm is measured by the excess population risk, that is, L_{\mathcal{D}}(w^{\text{priv}})-\min_{w\in\mathcal{W}}L_{\mathcal{D}}(w) (for convenience we denote the optimal solution by w^{*}). Besides the population risk, we can also measure the empirical risk over the dataset D: \hat{L}(w,D)=\frac{1}{n}\sum_{i=1}^{n}\ell(w,z_{i}). Note that in the high probability setting we need a high probability bound on the excess population risk: given a failure probability 0<\eta<1, we want to find a (polynomial) function f(d,\log\frac{1}{\eta},\frac{1}{n},\frac{1}{\epsilon}) such that with probability at least 1-\eta (over the randomness of the algorithm and the data distribution),

L_{\mathcal{D}}(w^{\text{priv}})-L_{\mathcal{D}}(w^{*})\leq O(f(d,\log\frac{1}{\eta},\frac{1}{n},\frac{1}{\epsilon})).

In this paper, we mainly focus on \ell_{1}-norm linear regression:

\min_{w\in\mathcal{W}}L_{\mathcal{D}}(w)=\mathbb{E}_{(x,y)\sim\mathcal{D}}|\langle x,w\rangle-y|, (1)

where the convex constraint set \mathcal{W} is bounded with diameter \Delta=\max_{w,w^{\prime}\in\mathcal{W}}\|w-w^{\prime}\|_{2}<\infty.

Definition 4 (\zeta-Net).

Let (T,d) be a metric space. Consider a subset \mathcal{W}\subset T and let \zeta>0. A subset \mathcal{S}\subseteq\mathcal{W} is called a \zeta-net of \mathcal{W} if every point in \mathcal{W} is within distance \zeta of some point of \mathcal{S}, i.e., \forall x\in\mathcal{W},\exists x_{0}\in\mathcal{S}:d(x,x_{0})\leq\zeta. The smallest possible cardinality of a \zeta-net of \mathcal{W} is called the covering number of \mathcal{W} and is denoted by \mathcal{N}(\mathcal{W},d,\zeta). Equivalently, the covering number is the smallest number of closed balls with centers in \mathcal{W} and radii \zeta whose union covers \mathcal{W}.

Lemma 2 ([26]).

Consider the Euclidean space (\mathbb{R}^{d},\|\cdot\|_{2}) and a bounded subset \mathcal{W}\subseteq\mathbb{R}^{d} with diameter \Delta. Then its covering number satisfies \mathcal{N}(\mathcal{W},\zeta)\leq(\frac{3\Delta}{\zeta})^{d}.
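As a concrete (non-private) illustration, the following sketch builds a \zeta-net of an \ell_{2} ball of diameter \Delta with an axis-aligned grid; the construction and the helper name grid_net are our own, and its cardinality is larger than the (\frac{3\Delta}{\zeta})^{d} bound of Lemma 2 by dimension-dependent factors, which only affects logarithmic terms in the bounds below.

import itertools
import numpy as np

def grid_net(diameter, zeta, d):
    # Build a zeta-net of the l2 ball of radius diameter/2 centered at the
    # origin, using a grid with per-coordinate spacing 2*zeta/sqrt(d), so any
    # point of the ball is within l2 distance zeta of its nearest grid point.
    spacing = 2.0 * zeta / np.sqrt(d)
    radius = diameter / 2.0
    axis = np.arange(-radius, radius + spacing, spacing)
    net = []
    for point in itertools.product(axis, repeat=d):
        p = np.array(point)
        if np.linalg.norm(p) <= radius + zeta:
            # Project back onto the ball: projection onto a convex set is
            # 1-Lipschitz, so the projected points still form a zeta-net.
            if np.linalg.norm(p) > radius:
                p = p * (radius / np.linalg.norm(p))
            net.append(p)
    return net  # feasible only for small d; the size grows exponentially in d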

IV Main Methods

IV-A Bounded second moment case

In this section we consider the case of bounded second moment. As mentioned earlier, previous work on DP-SCO with heavy-tailed data [16, 17, 18] assumes the loss function is smooth and that the distribution of its gradient has bounded moments. However, for \ell_{1} regression the loss function is non-differentiable. Thus, instead of the gradient, here we directly assume that the second moments of x are bounded, which implies that the second moments of the sub-gradient of the loss function are also bounded. In general, there are two assumptions on the heavy-tailedness of x: one assumes the distribution of \|x\|_{2} has bounded moment; the other assumes the distribution of each coordinate of x has bounded moment. Formally,

Assumption 1.

The second moment of \|x\|_{2} is bounded by \tau^{2}=O(1), that is, \mathbb{E}_{(x,y)\sim\mathcal{D}}\|x\|_{2}^{2}\leq\tau^{2}.

Assumption 2.

The second moment of each coordinate of x is bounded by \tau^{2}=O(1), that is, \forall j\in[d],\ \mathbb{E}_{(x,y)\sim\mathcal{D}}x_{j}^{2}\leq\tau^{2}.

From the above two assumptions we can see that Assumption 1 is more restrictive than Assumption 2. Before presenting our algorithm, we first give a brief overview of the approach for solving problem (1) in the non-private case, proposed by [27]. Specifically, instead of studying the empirical risk of the population risk (1), [27] considers the following minimization problem with a truncated loss:

\min_{w\in\mathcal{W}}\hat{L}_{\iota}(w,D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|), (2)

where \iota>0 is a parameter that will be specified later and the truncation function \psi(\cdot) is a non-decreasing function which satisfies the following property:

-\log(1-x+\frac{x^{2}}{2})\leq\psi(x)\leq\log(1+x+\frac{x^{2}}{2}). (3)

Specifically, [27] shows the following result:

Lemma 3 (Corollary 2 in [27]).

Under Assumption 1, assume \psi(\cdot) satisfies (3). Then for any given failure probability \eta and some specified \iota=\iota(\frac{1}{n},d,\Delta,\log\frac{1}{\eta}) in (2), the optimal solution \hat{w}_{\iota} of (2) satisfies the following excess population risk bound with probability at least 1-\eta:

L_{\mathcal{D}}(\hat{w}_{\iota})-L_{\mathcal{D}}(w^{*})\leq\tilde{O}\big(\frac{1}{n}\mathbb{E}\|x\|_{2}+ (4)
\sqrt{\frac{d\log\frac{1}{\eta}}{n}}(\frac{1}{n^{2}}+\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2})\big)=\tilde{O}(\sqrt{\frac{d\log\frac{1}{\eta}}{n}}). (5)

The previous lemma indicates that solving problem (2) is sufficient for solving the \ell_{1} regression problem if \sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{2}=O(1). To solve problem (2) differentially privately, we adopt the following specific form of \psi(\cdot):

\psi(x)=\begin{cases}-\log(1-x+\frac{x^{2}}{2}),&0\leq x\leq 1\\ \log 2,&x\geq 1\\ -\psi(-x),&x\leq 0\end{cases} (6)

It is easy to see that (6) satisfies the property in (3). Moreover, by the non-decreasing property, the function is upper bounded (in absolute value) by \log 2. Hence, for a fixed w, the sensitivity of \hat{L}_{\iota}(w,D) is bounded by \frac{2\log 2}{n\iota}. Motivated by this, we can use the exponential mechanism to solve (2) in the \epsilon-DP model; see Algorithm 1 for details.
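The following is a minimal sketch (our own illustrative code, with hypothetical names psi and truncated_loss) of the truncation function (6) and the truncated empirical risk (2). Since |\psi|\leq\log 2, changing a single sample (x_{i},y_{i}) changes truncated_loss by at most \frac{2\log 2}{n\iota}, which is exactly the sensitivity used above.

import numpy as np

def psi(x):
    # Truncation function in (6): odd, non-decreasing, |psi(x)| <= log(2).
    x = np.asarray(x, dtype=float)
    s, a = np.sign(x), np.abs(x)
    return s * np.where(a <= 1.0, -np.log(1.0 - a + a**2 / 2.0), np.log(2.0))

def truncated_loss(w, X, y, iota):
    # Truncated empirical risk \hat{L}_iota(w, D) from (2).
    n = len(y)
    residuals = np.abs(y - X @ w)
    return np.sum(psi(iota * residuals)) / (n * iota)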

Algorithm 1 Exponential mechanism for \ell_{1}-regression (second moment)

\mathbf{Input}: D=\{(x_{i},y_{i})\}_{i=1}^{n}; privacy parameter \epsilon; parameters \iota,\zeta (to be specified later); truncated empirical risk \hat{L}_{\iota} in (2) with \psi in (6).

1: Find a \zeta-net of \mathcal{W} with covering number N(\mathcal{W},\zeta), denoted by \tilde{W}_{\zeta}=\{w_{1},\cdots,w_{N(\mathcal{W},\zeta)}\}.
2: Run the exponential mechanism with range R=\tilde{W}_{\zeta} and score function u(D,w)=-\hat{L}_{\iota}(w,D). That is, output a w\in\tilde{W}_{\zeta} with probability proportional to \exp(\frac{-n\iota\epsilon\hat{L}_{\iota}(w,D)}{\log 2}).
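For concreteness, here is a short end-to-end sketch of Algorithm 1 built from the earlier illustrative helpers grid_net and truncated_loss (all names are our own); the exponent follows the display in step 2, and the exact scaling constant should be read from the algorithm, not from this sketch.

import numpy as np

def algorithm_1(X, y, eps, iota, zeta, diameter, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    # Step 1: zeta-net of the constraint set W (here taken to be an l2 ball).
    net = grid_net(diameter, zeta, d)
    # Step 2: exponential mechanism with score u(D, w) = -\hat{L}_iota(w, D),
    # sampling w with probability proportional to
    # exp(-n * iota * eps * \hat{L}_iota(w, D) / log 2).
    losses = np.array([truncated_loss(w, X, y, iota) for w in net])
    logits = -n * iota * eps * losses / np.log(2.0)
    logits -= logits.max()  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return net[rng.choice(len(net), p=probs)]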
Theorem 1.

For any \epsilon>0, Algorithm 1 is \epsilon-DP. Moreover, under Assumption 1, given any failure probability \eta\in(0,1), for the output \tilde{w} we have the following with probability at least 1-\eta for any \iota>0:

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\zeta\tau+\iota\zeta^{2}\tau^{2}+\iota\tau^{2}\Delta^{2}+\frac{1}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}).

Furthermore, by setting \zeta=O(\frac{1}{n}) and \iota=O(\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon\tau^{2}}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\tau\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon}}). (7)

Moreover, under Assumption 2, setting \zeta=O(\frac{1}{n}) and \iota=O(\sqrt{\frac{\log n\log\frac{1}{\eta}}{\tau^{2}n\epsilon}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\tau\sqrt{\frac{d^{2}\log n\log\frac{1}{\eta}}{n\epsilon}}), (8)

where the Big-O notations omit factors of \Delta.
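As a sanity check on the choice of \iota (a back-of-the-envelope calculation of ours, with \Delta and constants suppressed as in the theorem): with \zeta=O(\frac{1}{n}), the terms \zeta\tau and \iota\zeta^{2}\tau^{2} are of lower order and \log N(\mathcal{W},\zeta)\leq d\log(3\Delta n), so the bound in Theorem 1 behaves like \iota\tau^{2}\Delta^{2}+\frac{d\log n\log\frac{1}{\eta}}{n\iota\epsilon}. Equating the two terms gives \iota=O(\frac{1}{\tau\Delta}\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon}}) and an excess risk of O(\tau\Delta\sqrt{\frac{d\log n\log\frac{1}{\eta}}{n\epsilon}}), which recovers (7) once the \Delta factor is absorbed into the Big-O notation.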

Remark 1.

First, notice that since Assumption 1 is stronger, there is a gap of O(\sqrt{d}) between (7) and (8). Moreover, the upper bound in (8) matches the lower bound for private mean estimation under Assumption 2 given in [18]; however, it is still unknown whether this lower bound is tight for DP-SCO with general convex losses. To the best of our knowledge, this is the first \epsilon-DP algorithm achieving the bound of \tilde{O}(\frac{d}{\sqrt{n\epsilon}}) under this assumption.

Proof of Theorem 1.

The \epsilon-DP guarantee follows from the fact that the sensitivity of the score function is bounded by \frac{2\log 2}{n\iota} and the guarantee of the exponential mechanism. Next we prove the utility. For convenience, denote w_{\zeta}=\arg\min_{w\in\tilde{W}_{\zeta}}\hat{L}_{\iota}(w,D) and the optimal solution of (2) by \hat{w}_{\iota}. By the utility of the exponential mechanism (Lemma 1) with t=\log\frac{1}{\eta}, we have with probability at least 1-\eta, -\hat{L}_{\iota}(\tilde{w},D)\geq-\hat{L}_{\iota}(w_{\zeta},D)-\frac{4\log 2}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. That is,

\hat{L}_{\iota}(\tilde{w},D)\leq\hat{L}_{\iota}(w_{\zeta},D)+\frac{4\log 2}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (9)

For the term \hat{L}_{\iota}(\tilde{w},D) in (9) we have the following.

Lemma 4 ([27]).

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq L_{\mathcal{D}}(w)-\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (10)

For the term \hat{L}_{\iota}(w_{\zeta},D) in (9): since \tilde{W}_{\zeta} is a \zeta-net, there exists a \tilde{w}_{\iota}\in\tilde{W}_{\zeta} such that \|\tilde{w}_{\iota}-\hat{w}_{\iota}\|_{2}\leq\zeta, where \hat{w}_{\iota}=\arg\min_{w\in\mathcal{W}}\hat{L}_{\iota}(w,D). By the definition of w_{\zeta} we have

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D). (11)

For the term \hat{L}_{\iota}(\tilde{w}_{\iota},D) we have the following lemma.

Lemma 5 ([27]).

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)\leq L_{\mathcal{D}}(w)+\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (12)

Thus, combining (11) and Lemma 5, we have with probability at least 1-\eta

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D)
\leq L_{\mathcal{D}}(\tilde{w}_{\iota})+\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\leq L_{\mathcal{D}}(\hat{w}_{\iota})+\zeta\tau+\frac{\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (13)

where the last inequality is due to

L_{\mathcal{D}}(\tilde{w}_{\iota})-L_{\mathcal{D}}(\hat{w}_{\iota})=\mathbb{E}[|y-\langle x,\tilde{w}_{\iota}\rangle|-|y-\langle x,\hat{w}_{\iota}\rangle|]\leq\mathbb{E}|\langle x,\tilde{w}_{\iota}-\hat{w}_{\iota}\rangle|\leq\zeta\mathbb{E}\|x\|_{2}\leq\zeta\tau,

where the last step uses Jensen's inequality, \mathbb{E}\|x\|_{2}\leq\sqrt{\mathbb{E}\|x\|_{2}^{2}}\leq\tau.

The relation between L_{\mathcal{D}}(\hat{w}_{\iota}) and L_{\mathcal{D}}(w^{*}) is given by the following lemma from [27].

Lemma 6.

For any given failure probability \eta, under Assumption 1, we have the following with probability at least 1-2\eta:

L_{\mathcal{D}}(\hat{w}_{\iota})-L_{\mathcal{D}}(w^{*})\leq 2\zeta\tau+\iota\zeta^{2}\tau^{2}+\frac{3\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.

Thus, plugging Lemma 4, (13) and Lemma 6 into (9), we have with probability at least 1-4\eta

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq 3\zeta\tau+\iota\zeta^{2}\tau^{2}+\frac{5\iota}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}+\frac{3}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}+\frac{4\log 2}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (14)

By \log N(\mathcal{W},\zeta)\leq d\log\frac{3\Delta}{\zeta} and the following inequality we can complete the proof:

\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2}\leq 2\mathbb{E}(y-\langle x,w^{*}\rangle)^{2}+2\mathbb{E}\|x\|_{2}^{2}\sup_{w\in\mathcal{W}}\|w-w^{*}\|_{2}^{2}=O(\Delta^{2}\tau^{2}),

which follows from (a+b)^{2}\leq 2a^{2}+2b^{2} and the Cauchy-Schwarz inequality.

For (8), note that the same proof goes through with \tau replaced by \mathbb{E}\|x\|_{2} in (14). By Assumption 2 we have \mathbb{E}\|x\|_{2}\leq\tau\sqrt{d} and \mathbb{E}\|x\|^{2}_{2}\leq d\tau^{2}. Thus, taking \zeta=O(\frac{1}{n}) and \iota=O(\sqrt{\frac{\log n\log\frac{1}{\eta}}{\tau^{2}n\epsilon}}) finishes the proof. ∎

IV-B Bounded \theta-th moment case

Motivated by [28], for the \ell_{1}-regression problem we can further relax the second moment assumptions in Assumptions 1 and 2 to finite \theta-th moment assumptions with any \theta\in(1,2). Similarly to the second moment case, we consider two settings:

Assumption 3.

There exists a \theta\in(1,2) such that the \theta-th order moment of \|x\|_{2} is bounded by \tau^{\theta}<\infty for some constant \tau (we write \tau^{\theta} for convenience of comparison with the second moment case), that is, \mathbb{E}_{(x,y)\sim\mathcal{D}}\|x\|_{2}^{\theta}\leq\tau^{\theta}=O(1).

Assumption 4.

We assume that the \theta-th moment of each coordinate of x is bounded by \tau^{\theta}=O(1), that is, \forall j\in[d],\ \mathbb{E}_{(x,y)\sim\mathcal{D}}|x_{j}|^{\theta}\leq\tau^{\theta}=O(1).

Our main idea here is almost the same as in the bounded second moment case, and we still focus on the function \hat{L}_{\iota}(w,D) in (2). However, we adjust the non-decreasing truncation function \psi:\mathbb{R}\mapsto\mathbb{R} so that it satisfies the following inequality instead of (3):

-\log(1-x+\frac{|x|^{\theta}}{\theta})\leq\psi(x)\leq\log(1+x+\frac{|x|^{\theta}}{\theta}). (15)

Motivated by (6), here we use the following specific form for \psi:

\psi(x)=\begin{cases}-\log(1-x+\frac{|x|^{\theta}}{\theta}),&0\leq x\leq 1\\ \log\theta,&x\geq 1\\ -\psi(-x),&x\leq 0\end{cases} (16)
Algorithm 2 Exponential mechanism for \ell_{1}-regression (\theta-th moment)

\mathbf{Input}: D=\{(x_{i},y_{i})\}_{i=1}^{n}; privacy parameter \epsilon; parameters \iota,\zeta; truncated empirical risk \hat{L}_{\iota} in (2) with \psi in (16).

1: Find a \zeta-net of \mathcal{W} with covering number N(\mathcal{W},\zeta), denoted by \tilde{W}_{\zeta}=\{w_{1},\cdots,w_{N(\mathcal{W},\zeta)}\}.
2: Run the exponential mechanism with range R=\tilde{W}_{\zeta} and score function u(D,w)=-\hat{L}_{\iota}(w,D). That is, output a w\in\tilde{W}_{\zeta} with probability proportional to \exp(\frac{-n\iota\epsilon\hat{L}_{\iota}(w,D)}{\log\theta}).

It is easy to see that (16) satisfies the inequality in (15). Moreover, its absolute value is upper bounded by \log\theta, so the sensitivity of \hat{L}_{\iota}(w,D) is upper bounded by \frac{2\log\theta}{n\iota}. Therefore, we again use the exponential mechanism (see Algorithm 2 for details), whose output has the following utility.
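For completeness, a sketch of the \theta-version of the truncation function in (16), analogous to the earlier psi sketch (again our own illustrative code); Algorithm 2 is then the Algorithm 1 sketch with psi replaced by psi_theta and \log 2 replaced by \log\theta in the exponent.

import numpy as np

def psi_theta(x, theta):
    # Truncation function in (16) for theta in (1, 2): odd, non-decreasing,
    # and bounded in absolute value by log(theta).
    x = np.asarray(x, dtype=float)
    s, a = np.sign(x), np.abs(x)
    return s * np.where(a <= 1.0,
                        -np.log(1.0 - a + a**theta / theta),
                        np.log(theta))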

Theorem 2.

For any \epsilon>0, Algorithm 2 is \epsilon-DP. Moreover, under Assumption 3, given any failure probability \eta\in(0,1), for the output \tilde{w} we have the following with probability at least 1-\eta for any \iota>0:

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O\big(\zeta\tau+\iota^{\theta-1}\tau^{\theta}+\iota^{\theta-1}\zeta^{\theta}\tau^{\theta}+\frac{1}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}\big). (17)

Taking \zeta=O(\frac{1}{n}) and \iota=O(\frac{1}{\tau}(\frac{d\log\frac{1}{\eta}}{n\epsilon})^{\frac{1}{\theta}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq\tilde{O}(\tau(\frac{d\log n\log\frac{1}{\eta}}{n\epsilon})^{\frac{\theta-1}{\theta}}). (18)

Moreover, under Assumption 4, setting \zeta=O(\frac{1}{n}) and \iota=O(\frac{1}{\tau}(\frac{\log n\log\frac{1}{\eta}}{n\epsilon})^{\frac{1}{\theta}}) we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\tau d(\frac{\log n\log\frac{1}{\eta}}{n\epsilon})^{\frac{\theta-1}{\theta}}). (19)
Proof of Theorem 2.

The \epsilon-DP guarantee follows from the fact that the sensitivity of the score function is bounded by \frac{2\log\theta}{n\iota} and the guarantee of the exponential mechanism. Next we prove the utility. For convenience, denote w_{\zeta}=\arg\min_{w\in\tilde{W}_{\zeta}}\hat{L}_{\iota}(w,D) and the optimal solution of (2) by \hat{w}_{\iota}. By the utility of the exponential mechanism (Lemma 1) with t=\log\frac{1}{\eta}, we have with probability at least 1-\eta, -\hat{L}_{\iota}(\tilde{w},D)\geq-\hat{L}_{\iota}(w_{\zeta},D)-\frac{4\log\theta}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. That is,

\hat{L}_{\iota}(\tilde{w},D)\leq\hat{L}_{\iota}(w_{\zeta},D)+\frac{4\log\theta}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (20)

For the term \hat{L}_{\iota}(\tilde{w},D) in (20) we have the following inequality.

Lemma 7 ([28]).

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq L_{\mathcal{D}}(w)-\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.

For the term \hat{L}_{\iota}(w_{\zeta},D) in (20): since \tilde{W}_{\zeta} is a \zeta-net, there exists a \tilde{w}_{\iota}\in\tilde{W}_{\zeta} such that \|\tilde{w}_{\iota}-\hat{w}_{\iota}\|_{2}\leq\zeta, where \hat{w}_{\iota}=\arg\min_{w\in\mathcal{W}}\hat{L}_{\iota}(w,D). By the definition of w_{\zeta} we have

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D). (21)

For the term \hat{L}_{\iota}(\tilde{w}_{\iota},D) we have the following lemma.

Lemma 8.

For any given \eta, the following inequality holds for all w\in\tilde{W}_{\zeta} with probability at least 1-\eta:

\hat{L}_{\iota}(w,D)\leq L_{\mathcal{D}}(w)+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (22)

Thus, combining (21) and Lemma 8, we have with probability at least 1-\eta

\hat{L}_{\iota}(w_{\zeta},D)\leq\hat{L}_{\iota}(\tilde{w}_{\iota},D)
\leq L_{\mathcal{D}}(\tilde{w}_{\iota})+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\leq L_{\mathcal{D}}(\hat{w}_{\iota})+\zeta\tau+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (23)

where the last inequality follows as in the second moment case, since L_{\mathcal{D}}(\tilde{w}_{\iota})-L_{\mathcal{D}}(\hat{w}_{\iota})\leq\zeta\mathbb{E}\|x\|_{2}\leq\zeta\tau.

In the following we show the relation between L_{\mathcal{D}}(\hat{w}_{\iota}) and L_{\mathcal{D}}(w^{*}). We first state the following lemma:

Lemma 9 ([28]).

With probability at least 1-\eta,

L_{\mathcal{D}}(\hat{w}_{\iota})-\hat{L}_{\iota}(\hat{w}_{\iota},D)\leq 2\zeta\tau+\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.

By Lemma 9 and the definition of \hat{w}_{\iota}, we have with probability at least 1-2\eta

L_{\mathcal{D}}(\hat{w}_{\iota})\leq\hat{L}_{\iota}(\hat{w}_{\iota},D)+2\zeta\tau+\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\leq L_{\mathcal{D}}(w^{*})+2\zeta\tau+\frac{2(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{2}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (24)

where the last inequality of (24) uses \hat{L}_{\iota}(\hat{w}_{\iota},D)\leq\hat{L}_{\iota}(w^{*},D) (by the definition of \hat{w}_{\iota}) and the following inequality, which holds with probability at least 1-\eta and whose proof is the same as that of Lemma 8 (we omit it here):

\hat{L}_{\iota}(w^{*},D)\leq L_{\mathcal{D}}(w^{*})+\frac{\iota^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{1}{n\iota}\log\frac{1}{\eta}. (25)

Thus, combining (20), Lemma 7, (23) and (24), we have with probability at least 1-5\eta

L_{\mathcal{D}}(\tilde{w})\leq L_{\mathcal{D}}(w^{*})+3\zeta\tau+\frac{2(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{2(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{8\log\theta}{n\iota\epsilon}\log\frac{N(\mathcal{W},\zeta)}{\eta}. (26)

Since \log N(\mathcal{W},\zeta)\leq d\log\frac{3\Delta}{\zeta}, we have

L_{\mathcal{D}}(\tilde{w})-L_{\mathcal{D}}(w^{*})\leq O(\zeta\tau+\iota^{\theta-1}\tau^{\theta}+\iota^{\theta-1}\zeta^{\theta}\tau^{\theta}+\frac{d}{n\iota\epsilon}\log\frac{1}{\zeta\eta}), (27)

which is due to the fact that \sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}\leq O(\mathbb{E}|y-\langle x,w^{*}\rangle|^{\theta}+\mathbb{E}|\langle x,w^{*}-w\rangle|^{\theta})\leq O(\Delta^{\theta}\tau^{\theta}). Taking \iota=O(\frac{1}{\tau}(\frac{d\log\frac{1}{\eta}}{n\epsilon})^{\frac{1}{\theta}}) and \zeta=\frac{1}{n} yields (18).

For (19), we can replace \tau by \mathbb{E}\|x\|_{2} in (27). Under Assumption 4, using the inequality \mathbb{E}\|x\|_{2}^{\theta}\leq\mathbb{E}\|x\|_{\theta}^{\theta}=\sum_{j=1}^{d}\mathbb{E}|x_{j}|^{\theta}\leq d\tau^{\theta}, we obtain the result by taking \iota=O(\frac{1}{\tau}(\frac{\log n}{n\epsilon})^{\frac{1}{\theta}}) and \zeta=O(\frac{1}{n}). ∎
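For completeness, a short justification (ours; a standard comparison of \ell_{p} norms) of the step \|x\|_{2}\leq\|x\|_{\theta} used above: for \theta\in(1,2) and any x\in\mathbb{R}^{d},

\|x\|_{2}^{2}=\sum_{j=1}^{d}(|x_{j}|^{\theta})^{\frac{2}{\theta}}\leq\big(\sum_{j=1}^{d}|x_{j}|^{\theta}\big)^{\frac{2}{\theta}}=\|x\|_{\theta}^{2},

since \frac{2}{\theta}>1 and \sum_{j}a_{j}^{p}\leq(\sum_{j}a_{j})^{p} for p\geq 1 and a_{j}\geq 0. Hence \mathbb{E}\|x\|_{2}^{\theta}\leq\mathbb{E}\|x\|_{\theta}^{\theta}=\sum_{j=1}^{d}\mathbb{E}|x_{j}|^{\theta}\leq d\tau^{\theta}.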

References

  • [1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference.   Springer, 2006, pp. 265–284.
  • [2] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on.   IEEE, 2014, pp. 464–473.
  • [3] D. Wang and J. Xu, “Differentially private empirical risk minimization with smooth non-convex loss functions: A non-stationary view,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 1182–1189.
  • [4] D. Wang, C. Chen, and J. Xu, “Differentially private empirical risk minimization with non-convex loss functions,” in International Conference on Machine Learning, 2019, pp. 6526–6535.
  • [5] R. Bassily, V. Feldman, K. Talwar, and A. Thakurta, “Private stochastic convex optimization with optimal rates,” in NeurIPS, 2019.
  • [6] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar, “Stability of stochastic gradient descent on nonsmooth convex losses,” Advances in Neural Information Processing Systems, vol. 33, 2020.
  • [7] V. Feldman, T. Koren, and K. Talwar, “Private stochastic convex optimization: optimal rates in linear time,” in Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 439–449.
  • [8] S. Song, O. Thakkar, and A. Thakurta, “Characterizing private clipped gradient descent on convex generalized linear problems,” arXiv preprint arXiv:2006.06783, 2020.
  • [9] J. Su and D. Wang, “Faster rates of differentially private stochastic convex optimization,” arXiv preprint arXiv:2108.00331, 2021.
  • [10] H. Asi, J. Duchi, A. Fallah, O. Javidbakht, and K. Talwar, “Private adaptive gradient methods for convex optimization,” in International Conference on Machine Learning.   PMLR, 2021, pp. 383–392.
  • [11] R. Bassily, C. Guzmán, and A. Nandi, “Non-euclidean differentially private stochastic convex optimization,” arXiv preprint arXiv:2103.01278, 2021.
  • [12] J. Kulkarni, Y. T. Lee, and D. Liu, “Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps,” arXiv preprint arXiv:2103.15352, 2021.
  • [13] R. Bassily, C. Guzmán, and M. Menart, “Differentially private stochastic optimization: New results in convex and non-convex settings,” Advances in Neural Information Processing Systems, vol. 34, 2021.
  • [14] H. Asi, V. Feldman, T. Koren, and K. Talwar, “Private stochastic convex optimization: Optimal rates in \ell_{1} geometry,” arXiv preprint arXiv:2103.01516, 2021.
  • [15] Z. Xue, S. Yang, M. Huai, and D. Wang, “Differentially private pairwise learning revisited,” in IJCAI.   ijcai.org, 2021, pp. 3242–3248.
  • [16] D. Wang, H. Xiao, S. Devadas, and J. Xu, “On differentially private stochastic convex optimization with heavy-tailed data,” arXiv preprint arXiv:2010.11082, 2020.
  • [17] L. Hu, S. Ni, H. Xiao, and D. Wang, “High dimensional differentially private stochastic optimization with heavy-tailed data,” arXiv preprint arXiv:2107.11136, 2021.
  • [18] G. Kamath, X. Liu, and H. Zhang, “Improved rates for differentially private stochastic convex optimization with heavy-tailed data,” arXiv preprint arXiv:2106.01336, 2021.
  • [19] R. F. Barber and J. C. Duchi, “Privacy and statistical risk: Formalisms and minimax bounds,” arXiv preprint arXiv:1412.4451, 2014.
  • [20] G. Kamath, V. Singhal, and J. Ullman, “Private mean estimation of heavy-tailed distributions,” arXiv preprint arXiv:2002.09464, 2020.
  • [21] X. Liu, W. Kong, S. Kakade, and S. Oh, “Robust and differentially private mean estimation,” arXiv preprint arXiv:2102.09159, 2021.
  • [22] Y. Tao, Y. Wu, P. Zhao, and D. Wang, “Optimal rates of (locally) differentially private heavy-tailed multi-armed bandits,” arXiv preprint arXiv:2106.02575, 2021.
  • [23] K. Nissim, S. Raskhodnikova, and A. Smith, “Smooth sensitivity and sampling in private data analysis,” in Proceedings of the thirty-ninth annual ACM symposium on Theory of computing.   ACM, 2007, pp. 75–84.
  • [24] M. Bun and T. Steinke, “Average-case averages: Private algorithms for smooth sensitivity and mean estimation,” arXiv preprint arXiv:1906.02830, 2019.
  • [25] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy.” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
  • [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science.   Cambridge university press, 2018, vol. 47.
  • [27] L. Zhang and Z.-H. Zhou, “\ell_{1}-regression with heavy-tailed distributions,” in Advances in Neural Information Processing Systems, 2018, pp. 1076–1086.
  • [28] P. Chen, X. Jin, X. Li, and L. Xu, “A generalized Catoni’s M-estimator under finite \alpha-th moment assumption with \alpha\in(1,2),” Electronic Journal of Statistics, vol. 15, no. 2, pp. 5523–5544, 2021.

The following omitted proofs have been shown in [27] and [28]; we include them here for completeness.

Proof of Lemma 4.

First, note that our truncation function \psi satisfies \psi(x)\geq-\log(1-x+\frac{x^{2}}{2}). Thus we have

\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1-\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{2}(y_{i}-\langle x_{i},w\rangle)^{2}}{2})] (28)
=(\mathbb{E}[1-\iota|y-\langle x,w\rangle|+\frac{\iota^{2}(y-\langle x,w\rangle)^{2}}{2}])^{n}
=\big(1-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2}\big)^{n}
\leq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})\big), (29)

where (28) is due to the previous inequality and independence, and (29) is due to the inequality 1+x\leq e^{x}. By Chernoff's method, we have

\text{Pr}\{-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)}\leq\eta. (30)

Thus, with probability at least 1-\eta we have

-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}
\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} (which replaces \log\frac{1}{\eta} by \log\frac{N(\mathcal{W},\zeta)}{\eta}) completes the proof. ∎

Proof of Lemma 5.

First, note that the truncation function \psi satisfies \psi(x)\leq\log(1+x+\frac{x^{2}}{2}). Thus we have

\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1+\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{2}(y_{i}-\langle x_{i},w\rangle)^{2}}{2})] (31)
=(\mathbb{E}[1+\iota|y-\langle x,w\rangle|+\frac{\iota^{2}(y-\langle x,w\rangle)^{2}}{2}])^{n}
=\big(1+\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2}\big)^{n}
\leq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})\big), (32)

where (31) is due to the previous inequality and independence, and (32) is due to the inequality 1+x\leq e^{x}. By Chernoff's method,

\text{Pr}\{\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}\big)}\leq\eta.

Thus, with probability at least 1-\eta we have

\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}
\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{2}}{2}\sup_{w\in\mathcal{W}}\mathbb{E}(y-\langle x,w\rangle)^{2})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} completes the proof. ∎

Proof of Lemma 7.

First, note that our truncation function \psi satisfies \psi(x)\geq-\log(1-x+\frac{|x|^{\theta}}{\theta}). Thus we have

\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1-\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{\theta}|y_{i}-\langle x_{i},w\rangle|^{\theta}}{\theta})] (33)
=(\mathbb{E}[1-\iota|y-\langle x,w\rangle|+\frac{\iota^{\theta}|y-\langle x,w\rangle|^{\theta}}{\theta}])^{n}
=\big(1-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}\big)^{n}
\leq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})\big), (34)

where (33) is due to the previous inequality and independence, and (34) is due to the inequality 1+x\leq e^{x}. By Chernoff's method, we have

\text{Pr}\{-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)}\leq\eta. (35)

Thus, with probability at least 1-\eta we have

-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}
\leq n(-\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} completes the proof. ∎

Proof of Lemma 8.

First, note that the truncation function \psi satisfies \psi(x)\leq\log(1+x+\frac{|x|^{\theta}}{\theta}). Thus we have

\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1+\iota|y_{i}-\langle x_{i},w\rangle|+\frac{\iota^{\theta}|y_{i}-\langle x_{i},w\rangle|^{\theta}}{\theta})] (36)
=(\mathbb{E}[1+\iota|y-\langle x,w\rangle|+\frac{\iota^{\theta}|y-\langle x,w\rangle|^{\theta}}{\theta}])^{n}
=\big(1+\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}\big)^{n}
\leq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})\big), (37)

where (36) is due to the previous inequality and independence, and (37) is due to the inequality 1+x\leq e^{x}. By Chernoff's method, we have

\text{Pr}\{\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\geq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\}
=\text{Pr}\{\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))\geq\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)\}
\leq\frac{\mathbb{E}[\exp(\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|))]}{\exp\big(n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}\big)}\leq\eta.

Thus, with probability at least 1-\eta we have

\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|)\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}
\leq n(\iota L_{\mathcal{D}}(w)+\frac{\iota^{\theta}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta})+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta} completes the proof. ∎

Proof of Lemma 9.

As mentioned before, there exists a \tilde{w}_{\iota}\in\tilde{W}_{\zeta} such that \|\tilde{w}_{\iota}-\hat{w}_{\iota}\|_{2}\leq\zeta. This implies that

|y_{i}-\langle x_{i},\hat{w}_{\iota}\rangle|\geq|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-|\langle x_{i},\tilde{w}_{\iota}-\hat{w}_{\iota}\rangle|\geq|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-\zeta\|x_{i}\|_{2}.

Since \psi is non-decreasing, this implies that

\hat{L}_{\iota}(\hat{w}_{\iota},D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\hat{w}_{\iota}\rangle|)\geq\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-\iota\zeta\|x_{i}\|_{2}).

We then prove the following lemma.

Lemma 10.

For any w\in\tilde{W}_{\zeta}, with probability at least 1-\eta, the following inequality holds:

-\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|-\iota\zeta\|x_{i}\|_{2})\leq-L_{\mathcal{D}}(w)+\zeta\tau+\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}+\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}.
Proof of Lemma 10.

The idea of the proof is almost the same as that of Lemma 8. First, note that our truncation function \psi satisfies \psi(x)\geq-\log(1-x+\frac{|x|^{\theta}}{\theta}). Thus we have

\mathbb{E}[\exp(-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|-\iota\zeta\|x_{i}\|_{2}))]
\leq\mathbb{E}[\Pi_{i=1}^{n}(1-\iota|y_{i}-\langle x_{i},w\rangle|+\iota\zeta\|x_{i}\|_{2}+\frac{\iota^{\theta}(|y_{i}-\langle x_{i},w\rangle|-\zeta\|x_{i}\|_{2})^{\theta}}{\theta})] (38)
=(\mathbb{E}[1-\iota|y-\langle x,w\rangle|+\iota\zeta\|x\|_{2}+\frac{\iota^{\theta}(|y-\langle x,w\rangle|-\zeta\|x\|_{2})^{\theta}}{\theta}])^{n}
=\big(1-\iota L_{\mathcal{D}}(w)+\iota\zeta\mathbb{E}\|x\|_{2}+\frac{\iota^{\theta}}{\theta}\mathbb{E}(|y-\langle x,w\rangle|-\zeta\|x\|_{2})^{\theta}\big)^{n}
\leq\exp\big(n(-\iota L_{\mathcal{D}}(w)+\iota\zeta\mathbb{E}\|x\|_{2}+\frac{\iota^{\theta}2^{\theta-1}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{\iota^{\theta}2^{\theta-1}\zeta^{\theta}}{\theta}\mathbb{E}\|x\|_{2}^{\theta})\big), (39)

where the last step uses 1+x\leq e^{x} and |a-b|^{\theta}\leq 2^{\theta-1}(|a|^{\theta}+|b|^{\theta}). By Chernoff's method, we have with probability at least 1-\eta

-\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},w\rangle|-\iota\zeta\|x_{i}\|_{2})\leq n\big(-\iota L_{\mathcal{D}}(w)+\iota\zeta\mathbb{E}\|x\|_{2}+\frac{\iota^{\theta}2^{\theta-1}}{\theta}\mathbb{E}|y-\langle x,w\rangle|^{\theta}+\frac{\iota^{\theta}2^{\theta-1}\zeta^{\theta}}{\theta}\mathbb{E}\|x\|_{2}^{\theta}\big)+\log\frac{1}{\eta}.

Taking the union bound over all w\in\tilde{W}_{\zeta}, dividing by n\iota, bounding \mathbb{E}|y-\langle x,w\rangle|^{\theta} by its supremum over \mathcal{W}, and using \mathbb{E}\|x\|_{2}\leq\tau and \mathbb{E}\|x\|_{2}^{\theta}\leq\tau^{\theta}, we complete the proof. ∎

Thus, by Lemma 10 we have with probability at least 1-\eta

\hat{L}_{\iota}(\hat{w}_{\iota},D)=\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\hat{w}_{\iota}\rangle|)
\geq\frac{1}{n\iota}\sum_{i=1}^{n}\psi(\iota|y_{i}-\langle x_{i},\tilde{w}_{\iota}\rangle|-\iota\zeta\|x_{i}\|_{2})
\geq L_{\mathcal{D}}(\tilde{w}_{\iota})-\zeta\tau-\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}-\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}
\geq L_{\mathcal{D}}(\hat{w}_{\iota})-2\zeta\tau-\frac{(2\iota)^{\theta-1}}{\theta}\sup_{w\in\mathcal{W}}\mathbb{E}|y-\langle x,w\rangle|^{\theta}-\frac{(2\iota)^{\theta-1}\zeta^{\theta}}{\theta}\tau^{\theta}-\frac{1}{n\iota}\log\frac{N(\mathcal{W},\zeta)}{\eta}, (40)

where the last inequality of (40) is due to

L_{\mathcal{D}}(\tilde{w}_{\iota})-L_{\mathcal{D}}(\hat{w}_{\iota})=\mathbb{E}[|y-\langle x,\tilde{w}_{\iota}\rangle|-|y-\langle x,\hat{w}_{\iota}\rangle|]\leq\mathbb{E}|\langle x,\tilde{w}_{\iota}-\hat{w}_{\iota}\rangle|\leq\zeta\mathbb{E}\|x\|_{2}\leq\zeta\tau.

Rearranging the terms in (40) completes the proof of Lemma 9. ∎