Differentially Private $\ell_1$-norm Linear Regression with Heavy-tailed Data
Abstract
We study the problem of Differentially Private Stochastic Convex Optimization (DP-SCO) with heavy-tailed data. Specifically, we focus on $\ell_1$-norm linear regression in the $\epsilon$-DP model. While most previous work focuses on the case where the loss function is Lipschitz, here we only need to assume that the variates have bounded moments. First, we study the case where the $\ell_2$-norm of the variate has a bounded second-order moment. We propose an algorithm based on the exponential mechanism and show that it achieves a high-probability upper bound on the excess population risk. Next, we relax the assumption to a bounded $\theta$-th order moment with some $\theta \in (1, 2)$ and show that a corresponding upper bound is still achievable. Our algorithms can also be extended to the more relaxed setting where only each coordinate of the data has bounded moments, and we obtain upper bounds for both the second and the $\theta$-th moment cases.
I Introduction
As one of the most fundamental problems in both statistical machine learning and differential privacy (DP), Stochastic Convex Optimization under the differential privacy constraint [1], i.e., Differentially Private Stochastic Convex Optimization (DPSCO), has received much attention in recent years [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]. In DPSCO, we have a loss function and an $n$-size dataset in which each pair of variate $x$ and label/response $y$ is i.i.d. sampled from some unknown distribution $\mathcal{D}$. The goal of DPSCO is to privately minimize the population risk function over a parameter space, i.e., we aim to design a DP algorithm whose output makes the excess population risk as small as possible with high probability.
Although DPSCO and its empirical form, Differentially Private Empirical Risk Minimization (DPERM), have been extensively studied for many years and a long line of work has attacked these problems from different perspectives, most of this work needs to assume that the data distribution is bounded, which implies that the loss function is Lipschitz; this is unrealistic and does not always hold in practice. To relax the Lipschitz assumption, starting from [16], several recent works have studied DPSCO with heavy-tailed data, where the Lipschitz assumption is removed and we only need to assume that the distribution of the gradient of the loss function has bounded moments of finite order [16, 17, 18]. However, there are still several open problems in DP-SCO with heavy-tailed data. First, previous work only considered the case where the loss function is smooth. Moreover, most of those works studied the $(\epsilon, \delta)$-DP model, and the behavior of the problem in the $\epsilon$-DP model is still far from well understood. Thirdly, all previous results need to assume that the data or the gradient of the loss has at least bounded second-order moments and cannot be extended to the more relaxed case where it only has a bounded $\theta$-th moment with $\theta \in (1, 2)$. In this paper, we continue along these directions. Specifically, we study the problem of $\ell_1$-norm linear regression (i.e., the loss is $|\langle w, x \rangle - y|$) with heavy-tailed data in the $\epsilon$-DP model. Our contributions can be summarized as follows.
• We first consider the case where the second moment of the $\ell_2$-norm of the variate $x$, i.e., $\mathbb{E}[\|x\|_2^2]$, is bounded. Specifically, we propose a method based on the exponential mechanism and show that it achieves a high-probability upper bound on the excess population risk. Moreover, instead of the $\ell_2$-norm, we also consider the case where each coordinate of $x$ has a bounded second moment, i.e., $\mathbb{E}[x_j^2]$ is bounded for every $j \in [d]$. We show that our algorithm also achieves a corresponding error bound in this setting.
• We then investigate the relaxed case where the data only has a bounded $\theta$-th moment with $\theta \in (1, 2)$. First, similar to the second moment case, we assume that the $\theta$-th moment of $\|x\|_2$ is bounded and show that a corresponding excess risk bound is achievable. Then, under the relaxed condition that each coordinate of $x$ has a bounded $\theta$-th moment, we show that our algorithm again achieves a corresponding error bound. To the best of our knowledge, this is the first theoretical result for DPSCO with heavy-tailed data that only has a bounded $\theta$-th moment with $\theta \in (1, 2)$.
II Related Work
Although there is a long list of work studying either DPSCO/DPERM or robust estimation, DPSCO with heavy-tailed data is still not well understood. Below we discuss the related work on DPSCO with heavy-tailed data and on private and robust mean estimation.
For private estimation of heavy-tailed distributions, [19] provides the first study of private mean estimation for distributions with bounded second moments and establishes the minimax private rates. Later, [20] also studies heavy-tailed mean estimation under a relaxed assumption, which has been further studied by [21, 22] recently. However, due to the more complex nature of regression, these mean estimation methods cannot be directly applied to our problem. Moreover, it is unknown whether their methods could be extended to the case where each coordinate of the data has only a bounded $\theta$-th order moment with $\theta \in (1, 2)$.
For DPSCO with heavy-tailed data, [16] first studies the problem by proposing three methods based on different assumptions. The first method is based on the smooth sensitivity and the Sample-and-Aggregate framework [23]; however, it needs many assumptions and its error bound is quite large. The second method is still based on the smooth sensitivity [24], but it needs to assume the distribution is sub-exponential. [16] also provides a new private estimator motivated by previous work in robust statistics. [18] recently revisits the problem and improves the (expected) excess population risk for both convex and strongly convex loss functions. It also provides lower bounds for the mean estimation problem in both the $\epsilon$-DP and $(\epsilon, \delta)$-DP models. However, as we mentioned earlier, all of those algorithms are for the $(\epsilon, \delta)$-DP model and need to assume the loss function is smooth; thus, their methods cannot be applied to our problem. Later, [17] studies the problem in the high-dimensional setting where the dimension could be far greater than the sample size. It focuses on the case where the loss function is smooth and the constraint set is some polytope or $\ell_1$-norm ball, which is quite different from our setting.
III Preliminaries
Definition 1 (Differential Privacy [1]).
Given a data universe $\mathcal{X}$, we say that two datasets $D, D' \subseteq \mathcal{X}$ are neighbors if they differ by only one entry, which is denoted as $D \sim D'$. A randomized algorithm $\mathcal{A}$ is $\epsilon$-differentially private (DP) if for all neighboring datasets $D \sim D'$ and for all events $S$ in the output space of $\mathcal{A}$, the following holds:
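Concretely, for pure $\epsilon$-DP (the notion used by our exponential-mechanism-based algorithms), the requirement reads
$$\Pr[\mathcal{A}(D) \in S] \;\le\; e^{\epsilon}\, \Pr[\mathcal{A}(D') \in S].$$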
Definition 2 (Exponential Mechanism).
The Exponential Mechanism allows differentially private computation over an arbitrary domain and range $\mathcal{R}$, parametrized by a score function $q(D, r)$ which maps a pair of input dataset $D$ and candidate result $r \in \mathcal{R}$ to a real-valued score. With the score function $q$ and privacy budget $\epsilon$, the mechanism yields an output with exponential bias in favor of high-scoring outputs. Let $\mathcal{M}_E(D, q, \mathcal{R})$ denote the exponential mechanism, and let $\Delta q = \max_{r \in \mathcal{R}} \max_{D \sim D'} |q(D, r) - q(D', r)|$ be the sensitivity of $q$ over the range $\mathcal{R}$. Then if $\mathcal{M}_E$ selects and outputs an element $r \in \mathcal{R}$ with probability proportional to $\exp\big(\frac{\epsilon\, q(D, r)}{2 \Delta q}\big)$, it preserves $\epsilon$-differential privacy.
Lemma 1.
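The utility guarantee invoked below is the standard high-probability bound for the exponential mechanism (see, e.g., [25]); we recall it here in its generic form: for any $t > 0$,
$$\Pr\Big[\, q\big(D, \mathcal{M}_E(D, q, \mathcal{R})\big) \;\le\; \max_{r \in \mathcal{R}} q(D, r) \;-\; \frac{2\Delta q}{\epsilon}\big(\ln|\mathcal{R}| + t\big) \Big] \;\le\; e^{-t}.$$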
Definition 3 (DPSCO [2]).
Given a dataset $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ from a data universe $\mathcal{X} \times \mathcal{Y}$, where each pair of feature vector $x_i$ and label/response $y_i$ is an i.i.d. sample from some unknown distribution $\mathcal{D}$, a convex constraint set $\mathcal{W} \subseteq \mathbb{R}^d$, and a convex loss function $\ell: \mathcal{W} \times \mathcal{X} \times \mathcal{Y} \rightarrow \mathbb{R}$, Differentially Private Stochastic Convex Optimization (DPSCO) is to find $w^{\mathrm{priv}}$ so as to minimize the population risk $L_{\mathcal{D}}(w) = \mathbb{E}_{(x, y) \sim \mathcal{D}}[\ell(w; x, y)]$ with the guarantee of being DP. The utility of the algorithm is measured by the excess population risk, that is, $L_{\mathcal{D}}(w^{\mathrm{priv}}) - \min_{w \in \mathcal{W}} L_{\mathcal{D}}(w)$ (for convenience we denote the optimal solution by $w^*$). Besides the population risk, we can also measure the empirical risk on the dataset $D$: $\hat{L}(w, D) = \frac{1}{n} \sum_{i=1}^n \ell(w; x_i, y_i)$. It is notable that in the high-probability setting we need a high-probability bound on the excess population risk; that is, given a failure probability $0 < \zeta < 1$, we want to get a (polynomial) function $f$ such that, with probability at least $1 - \zeta$ (over the randomness of the algorithm and the data distribution), $L_{\mathcal{D}}(w^{\mathrm{priv}}) - L_{\mathcal{D}}(w^*) \le f\big(d, n, \epsilon, \log\tfrac{1}{\zeta}\big)$.
In this paper, we mainly focus on $\ell_1$-norm linear regression, i.e., the loss is $\ell(w; x, y) = |\langle w, x \rangle - y|$ and the population risk is
$$\min_{w \in \mathcal{W}} L_{\mathcal{D}}(w) = \mathbb{E}_{(x, y) \sim \mathcal{D}}\big[\, |\langle w, x \rangle - y| \,\big], \qquad (1)$$
where the convex constraint set $\mathcal{W}$ is bounded with diameter $\Delta$.
Definition 4 ($\epsilon$-Net).
Let $(X, \mathrm{dist})$ be a metric space. Consider a subset $K \subseteq X$ and let $\epsilon > 0$. A subset $\mathcal{N} \subseteq K$ is called an $\epsilon$-net of $K$ if every point in $K$ is within distance $\epsilon$ of some point of $\mathcal{N}$, i.e., for every $x \in K$ there exists $y \in \mathcal{N}$ with $\mathrm{dist}(x, y) \le \epsilon$. The smallest possible cardinality of an $\epsilon$-net of $K$ is called the covering number of $K$ and is denoted by $N(K, \epsilon)$. Equivalently, the covering number is the smallest number of closed balls with centers in $K$ and radii $\epsilon$ whose union covers $K$.
Lemma 2 ([26]).
For the Euclidean space $\mathbb{R}^d$ and a bounded subset $\mathcal{W} \subset \mathbb{R}^d$ with diameter $\Delta$, the covering number satisfies $N(\mathcal{W}, \epsilon) \le \big(\frac{c\Delta}{\epsilon}\big)^d$ for an absolute constant $c$ and any $0 < \epsilon \le \Delta$.
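As a concrete illustration (not a construction prescribed by the paper), the following Python sketch builds a net of an $\ell_2$-ball by uniform gridding; its size is exponential in $d$, consistent with the covering-number bound, and such a set can serve as the candidate set in the exponential-mechanism sketches given later.

```python
import itertools
import numpy as np

def grid_net(radius, gamma, d):
    """Return a gamma-net (in l2 distance) of the l2-ball of the given radius in R^d.

    Illustrative construction: a uniform grid with l_inf spacing 2*gamma/sqrt(d),
    so every point of the ball is within l2 distance gamma of some kept grid point.
    The number of points is exponential in d, as the covering-number bound suggests.
    """
    step = 2.0 * gamma / np.sqrt(d)
    ticks = np.arange(-radius, radius + step, step)
    grid = np.array(list(itertools.product(ticks, repeat=d)))
    # Keep only grid points near the ball so the net stays (close to) inside it.
    return grid[np.linalg.norm(grid, axis=1) <= radius + gamma]
```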
IV Main Methods
IV-A Bounded second moment case
In this section we consider the case of a bounded second moment. As mentioned earlier, previous work on DPSCO with heavy-tailed data always assumes that the distribution of the gradient of the (smooth) loss function has bounded moments [16, 17, 18]. However, for $\ell_1$ regression the loss function is non-differentiable. Thus, instead of the gradient, here we will directly assume that the second moments of the variate $x$ are bounded, which implies that the second moments of the sub-gradient of the loss function are also bounded (as we explain below). In general, there are two assumptions on the heavy-tailedness of $x$: one assumes the distribution of $\|x\|_2$ has a bounded moment; the other assumes the distribution of each coordinate of $x$ has a bounded moment. Formally,
Assumption 1.
The second moment of $\|x\|_2$ is bounded, that is, $\mathbb{E}[\|x\|_2^2] \le \tau^2$ for some known constant $\tau > 0$.
Assumption 2.
The second moment of each coordinate of $x$ is bounded, that is, $\mathbb{E}[x_j^2] \le \tau^2$ for every $j \in [d]$.
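These moment assumptions suffice because, for the $\ell_1$ loss, a sub-gradient with respect to $w$ is
$$\operatorname{sign}(\langle w, x \rangle - y)\, x \;\in\; \partial_w\, |\langle w, x \rangle - y|,$$
so its $\ell_2$-norm is at most $\|x\|_2$ and the magnitude of its $j$-th coordinate is at most $|x_j|$; hence any moment bound on $x$ transfers directly to the sub-gradient.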
From the above two assumptions we can see that Assumption 1 is more restrictive than Assumption 2. Before presenting our algorithm, we first give a brief overview of the approach proposed by [27] for solving problem (1) in the non-private case. Specifically, instead of studying the empirical counterpart of the population risk (1), [27] considers the following minimization problem with a truncated loss:

(2)
where the scaling parameter will be specified later, and the truncation function $\phi$ is a non-decreasing function which should satisfy the following property:

(3)
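For intuition, truncation conditions of this kind are of the classical Catoni type; the standard such condition (cf. [27]) is
$$-\log\Big(1 - x + \frac{x^2}{2}\Big) \;\le\; \phi(x) \;\le\; \log\Big(1 + x + \frac{x^2}{2}\Big),$$
which is what allows the bias introduced by truncation to be controlled under a second-moment assumption.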
Specifically, [27] shows the following result:
Lemma 3 (Corollary 2 in [27]).
The previous lemma indicates that solving problem (2) is sufficient for solving the regression problem (1) once the parameters are chosen appropriately. To solve problem (2) differentially privately, we adopt the following specific form of $\phi$:

(6)

We can easily see that (6) satisfies the property in (3). Moreover, since (6) is non-decreasing and uniformly bounded, for a fixed $w$ the sensitivity of the truncated objective in (2) is bounded. Motivated by this, we can use the exponential mechanism to solve (2) in the $\epsilon$-DP model; see Algorithm 1 for details.
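To make the approach concrete, here is a minimal Python sketch of an exponential-mechanism solver of this kind over a finite candidate net. It is an illustration under assumptions rather than the exact Algorithm 1: the truncation `catoni_phi` (a standard bounded Catoni-style choice, not necessarily the paper's (6)), the scaling parameter `alpha`, and the construction of `candidates` (e.g., via `grid_net` above) are all placeholders.

```python
import numpy as np

def catoni_phi(x):
    # A bounded, non-decreasing Catoni-style truncation (an assumed choice, not
    # necessarily the paper's (6)): phi(x) = x - x^3/6 on [-sqrt(2), sqrt(2)],
    # clamped to its endpoint values outside, so that |phi| <= 2*sqrt(2)/3.
    x = np.clip(x, -np.sqrt(2), np.sqrt(2))
    return x - x**3 / 6.0

def exponential_mechanism_regression(X, y, candidates, alpha, eps, rng=None):
    """Sketch of an eps-DP exponential-mechanism solver for a truncated l1-regression objective.

    X: (n, d) variates, y: (n,) responses, candidates: (m, d) finite net of the constraint set,
    alpha: truncation scale, eps: privacy budget.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = X.shape[0]
    residuals = np.abs(candidates @ X.T - y[None, :])         # (m, n) absolute residuals
    scores = catoni_phi(alpha * residuals).sum(axis=1) / (n * alpha)
    # Because |catoni_phi| <= 2*sqrt(2)/3, replacing one sample changes a score
    # by at most 2 * (2*sqrt(2)/3) / (n * alpha): this is the sensitivity.
    sensitivity = 2.0 * (2.0 * np.sqrt(2) / 3.0) / (n * alpha)
    # Exponential mechanism: lower truncated loss => exponentially higher probability.
    logits = -eps * scores / (2.0 * sensitivity)
    logits -= logits.max()                                     # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```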
Theorem 1.
Remark 1.
First, notice that since Assumption 1 is stronger, there is a gap between the bounds in (7) and (8). Moreover, the upper bound in (8) matches the lower bound for private mean estimation under Assumption 2 given in [18]. However, it is still unknown whether this lower bound is tight for DPSCO with general convex losses. To the best of our knowledge, this is the first $\epsilon$-DP algorithm that achieves such a bound under these assumptions.
Proof of Theorem 1.
The $\epsilon$-DP guarantee follows from the exponential mechanism together with the bounded sensitivity of the score function. Next we prove the utility. For convenience, we denote the truncated objective and the optimal solution of (2) accordingly. By the utility of the exponential mechanism (Lemma 1), with a suitable choice of parameters we have, with high probability,

(9)
For the term in (9) we have the following.
Lemma 4 ([27]).
For any given $\zeta \in (0, 1)$, we have the following inequality for all $w \in \mathcal{W}$ with probability at least $1 - \zeta$:

(10)
For the other term in (9), since the candidate set is a net of $\mathcal{W}$, there exists a candidate in the net that is close to the point under consideration, and by the definition we have

(11)
For the remaining term, we have the following lemma.
Lemma 5 ([27]).
For any given $\zeta \in (0, 1)$, we have the following inequality for all $w \in \mathcal{W}$ with probability at least $1 - \zeta$:

(12)
Thus, combining (11) with Lemma 5, we have with high probability

(13)
where the last inequality is due to
The relation between the two quantities above is given by the following lemma from [27].
Lemma 6.
For any given failure probability $\zeta \in (0, 1)$, under Assumption 1 we have the following with probability at least $1 - \zeta$:
IV-B Bounded $\theta$-th moment case
Motivated by [28], for the $\ell_1$-regression problem we can in fact relax the second moment assumptions in Assumptions 1 and 2 to finite $\theta$-th moment assumptions with any $\theta \in (1, 2)$. Similar to the second moment case, here we consider two settings:
Assumption 3.
There exists a $\theta \in (1, 2)$ such that the $\theta$-th order moment of $\|x\|_2$ is bounded by a constant (the constant is written in a form that eases comparison with the second moment case), that is,
Assumption 4.
We assume that the $\theta$-th moment of each coordinate of $x$ is bounded, that is,
Here our main idea is almost the same as in the bounded second-order moment case, and we will still focus on the objective in (2). However, here we adjust the non-decreasing truncation function $\phi$ so that it satisfies the following inequality instead of (3):

(15)
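For intuition, conditions of this kind replace the quadratic term in the Catoni-type condition by a $\theta$-th power; in the spirit of the generalized Catoni estimator of [28], such a condition has roughly the form
$$-\log\big(1 - x + c_\theta |x|^{\theta}\big) \;\le\; \phi(x) \;\le\; \log\big(1 + x + c_\theta |x|^{\theta}\big)$$
for a constant $c_\theta$ depending only on $\theta$ (the exact form of (15) may differ in its constants).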
Motivated by (6), here we use the following specific form of $\phi$:

(16)

We can easily see that it satisfies the inequality in (15). Moreover, its absolute value is uniformly upper bounded, so for a fixed $w$ the sensitivity of the truncated objective is again bounded. Therefore, we can still use the exponential mechanism (see Algorithm 2 for details), and its output has the following utility guarantee.
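Structurally, the private solver is unchanged from the second moment case: only the truncation function and its uniform bound (and hence the sensitivity) change. The following Python sketch makes this explicit; any bounded, non-decreasing truncation satisfying (15) can be plugged in for the placeholder `phi`, with the paper's specific choice (16) being one such option.

```python
import numpy as np

def make_exponential_mechanism(phi, phi_bound):
    """Build an eps-DP exponential-mechanism solver for a truncated l1-regression objective,
    given any bounded, non-decreasing truncation `phi` with |phi(x)| <= phi_bound."""
    def mechanism(X, y, candidates, alpha, eps, rng=None):
        rng = np.random.default_rng() if rng is None else rng
        n = X.shape[0]
        residuals = np.abs(candidates @ X.T - y[None, :])   # (m, n) absolute residuals
        scores = phi(alpha * residuals).sum(axis=1) / (n * alpha)
        sensitivity = 2.0 * phi_bound / (n * alpha)          # one sample changes a score by at most this
        logits = -eps * scores / (2.0 * sensitivity)         # lower truncated loss => higher probability
        logits -= logits.max()
        probs = np.exp(logits)
        probs /= probs.sum()
        return candidates[rng.choice(len(candidates), p=probs)]
    return mechanism

# Usage sketch: the second-moment algorithm corresponds to
#   make_exponential_mechanism(catoni_phi, 2 * np.sqrt(2) / 3),
# while the theta-th moment variant plugs in a truncation satisfying (15) together with its own bound.
```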
Theorem 2.
Proof of Theorem 2.
The $\epsilon$-DP guarantee again follows from the exponential mechanism together with the bounded sensitivity of the score function. Next we prove the utility. For convenience, we denote the truncated objective and the optimal solution of (2) accordingly. By the utility of the exponential mechanism (Lemma 1), with a suitable choice of parameters we have, with high probability,

(20)
For the term in (20) we have the following inequality.
For the other term in (20), since the candidate set is a net of $\mathcal{W}$, there exists a candidate in the net that is close to the point under consideration, and by the definition we have

(21)
For the remaining term, we have the following lemma.
Lemma 8.
For any given $\zeta \in (0, 1)$, we have the following inequality for all $w \in \mathcal{W}$ with probability at least $1 - \zeta$:

(22)
Thus, combining (21) with Lemma 8, we have with high probability

(23)
In the following we show the relation between the two quantities above. We first state the following lemma.
Lemma 9 ([28]).
With probability at least $1 - \zeta$,
By Lemma 9 and the definition of the truncated objective, we have with high probability

(24)
where the last inequality of (24) is due to the following bound, which holds with high probability and whose proof is the same as that of Lemma 8 (we omit it here):

(25)
Thus, combining (20), Lemma 7, (23) and (24), we have with high probability

(26)
Putting these together, we have

(27)

where the inequality uses the stated conditions on the parameters. Choosing the parameters appropriately completes the proof of the first bound.
For (19), we can replace the corresponding quantity in (27). Under Assumption 4, using the analogous inequality and choosing the parameters accordingly, we obtain the result.
∎
References
- [1] C. Dwork, F. McSherry, K. Nissim, and A. Smith, “Calibrating noise to sensitivity in private data analysis,” in Theory of cryptography conference. Springer, 2006, pp. 265–284.
- [2] R. Bassily, A. Smith, and A. Thakurta, “Private empirical risk minimization: Efficient algorithms and tight error bounds,” in Foundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on. IEEE, 2014, pp. 464–473.
- [3] D. Wang and J. Xu, “Differentially private empirical risk minimization with smooth non-convex loss functions: A non-stationary view,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, 2019, pp. 1182–1189.
- [4] D. Wang, C. Chen, and J. Xu, “Differentially private empirical risk minimization with non-convex loss functions,” in International Conference on Machine Learning, 2019, pp. 6526–6535.
- [5] R. Bassily, V. Feldman, K. Talwar, and A. Thakurta, “Private stochastic convex optimization with optimal rates,” in NeurIPS, 2019.
- [6] R. Bassily, V. Feldman, C. Guzmán, and K. Talwar, “Stability of stochastic gradient descent on nonsmooth convex losses,” Advances in Neural Information Processing Systems, vol. 33, 2020.
- [7] V. Feldman, T. Koren, and K. Talwar, “Private stochastic convex optimization: optimal rates in linear time,” in Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, 2020, pp. 439–449.
- [8] S. Song, O. Thakkar, and A. Thakurta, “Characterizing private clipped gradient descent on convex generalized linear problems,” arXiv preprint arXiv:2006.06783, 2020.
- [9] J. Su and D. Wang, “Faster rates of differentially private stochastic convex optimization,” arXiv preprint arXiv:2108.00331, 2021.
- [10] H. Asi, J. Duchi, A. Fallah, O. Javidbakht, and K. Talwar, “Private adaptive gradient methods for convex optimization,” in International Conference on Machine Learning. PMLR, 2021, pp. 383–392.
- [11] R. Bassily, C. Guzmán, and A. Nandi, “Non-euclidean differentially private stochastic convex optimization,” arXiv preprint arXiv:2103.01278, 2021.
- [12] J. Kulkarni, Y. T. Lee, and D. Liu, “Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps,” arXiv preprint arXiv:2103.15352, 2021.
- [13] R. Bassily, C. Guzmán, and M. Menart, “Differentially private stochastic optimization: New results in convex and non-convex settings,” Advances in Neural Information Processing Systems, vol. 34, 2021.
- [14] H. Asi, V. Feldman, T. Koren, and K. Talwar, “Private stochastic convex optimization: Optimal rates in $\ell_1$ geometry,” arXiv preprint arXiv:2103.01516, 2021.
- [15] Z. Xue, S. Yang, M. Huai, and D. Wang, “Differentially private pairwise learning revisited,” in IJCAI. ijcai.org, 2021, pp. 3242–3248.
- [16] D. Wang, H. Xiao, S. Devadas, and J. Xu, “On differentially private stochastic convex optimization with heavy-tailed data,” arXiv preprint arXiv:2010.11082, 2020.
- [17] L. Hu, S. Ni, H. Xiao, and D. Wang, “High dimensional differentially private stochastic optimization with heavy-tailed data,” arXiv preprint arXiv:2107.11136, 2021.
- [18] G. Kamath, X. Liu, and H. Zhang, “Improved rates for differentially private stochastic convex optimization with heavy-tailed data,” arXiv preprint arXiv:2106.01336, 2021.
- [19] R. F. Barber and J. C. Duchi, “Privacy and statistical risk: Formalisms and minimax bounds,” arXiv preprint arXiv:1412.4451, 2014.
- [20] G. Kamath, V. Singhal, and J. Ullman, “Private mean estimation of heavy-tailed distributions,” arXiv preprint arXiv:2002.09464, 2020.
- [21] X. Liu, W. Kong, S. Kakade, and S. Oh, “Robust and differentially private mean estimation,” arXiv preprint arXiv:2102.09159, 2021.
- [22] Y. Tao, Y. Wu, P. Zhao, and D. Wang, “Optimal rates of (locally) differentially private heavy-tailed multi-armed bandits,” arXiv preprint arXiv:2106.02575, 2021.
- [23] K. Nissim, S. Raskhodnikova, and A. Smith, “Smooth sensitivity and sampling in private data analysis,” in Proceedings of the thirty-ninth annual ACM symposium on Theory of computing. ACM, 2007, pp. 75–84.
- [24] M. Bun and T. Steinke, “Average-case averages: Private algorithms for smooth sensitivity and mean estimation,” arXiv preprint arXiv:1906.02830, 2019.
- [25] C. Dwork, A. Roth et al., “The algorithmic foundations of differential privacy.” Foundations and Trends in Theoretical Computer Science, vol. 9, no. 3-4, pp. 211–407, 2014.
- [26] R. Vershynin, High-dimensional probability: An introduction with applications in data science. Cambridge university press, 2018, vol. 47.
- [27] L. Zhang and Z.-H. Zhou, “$\ell_1$-regression with heavy-tailed distributions,” in Advances in Neural Information Processing Systems, 2018, pp. 1076–1086.
- [28] P. Chen, X. Jin, X. Li, and L. Xu, “A generalized Catoni’s M-estimator under finite $\alpha$-th moment assumption with $\alpha \in (1, 2)$,” Electronic Journal of Statistics, vol. 15, no. 2, pp. 5523–5544, 2021.
The following omitted proofs have been shown in [27] and [28]; we include them here for completeness.
Proof of Lemma 4.
Proof of Lemma 5.
Proof of Lemma 7.
Proof of Lemma 8.
Proof of Lemma 9.
As we mentioned before, there exists a point in the net close to the point under consideration. This implies that
Since $\phi$ is non-decreasing, this implies that
We then prove the following lemma.
Lemma 10.
For any given $\zeta \in (0, 1)$, with probability at least $1 - \zeta$ the following inequality holds: