Linear-Time User-Level DP-SCO via Robust Statistics
Abstract
User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise for privacy purposes and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility trade-off but also maintains computational efficiency. We complement our algorithm with an information-theoretic lower bound, showing that our upper bound is optimal up to logarithmic factors and the dependence on . This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.
1 Introduction
With the rapid development and widespread applications of modern machine learning and artificial intelligence, particularly driven by advancements in large language models (LLMs), privacy concerns have come to the forefront. For example, recent studies have highlighted significant privacy risks associated with LLMs, including well-documented instances of training data leakage [CTW+21, LSS+23]. These challenges underscore the urgent need for privacy-preserving mechanisms in machine learning systems.
Differential Privacy (DP) [DMNS06] has emerged as a rigorous mathematical framework for ensuring privacy and is now the gold standard for addressing privacy concerns in machine learning. The classic definition of DP, referred to as item-level DP, guarantees that the replacement of any single training example has a negligible impact on the model’s output. Formally, a mechanism $\mathcal{M}$ is said to satisfy $(\varepsilon,\delta)$-item-level DP if, for any pair $D$ and $D'$ of neighboring datasets that differ by a single item, and for any event $E$, the following condition holds:
$$\Pr[\mathcal{M}(D) \in E] \le e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in E] + \delta. \qquad (1)$$
Item-level DP provides a strong guarantee against the leakage of private information associated with any single element in a dataset. However, in many real-world applications, a single user may contribute multiple elements to the training dataset [XZ24]. This introduces the need to safeguard the privacy of each user as a whole, which motivates a stronger notion of user-level DP. This notion ensures that replacing one user (who may contribute up to items) in the dataset has only a negligible effect on the model’s output. Formally, (1) still holds when and differ by one user, i.e., up to items in total. Notably, when , user-level DP reduces to item-level DP.
DP-SCO
As one of the central problems in privacy-preserving machine learning and statistical learning, DP stochastic convex optimization (DP-SCO) has garnered significant attention in recent years (e.g., [BST14, BFTT19, FKT20, BFGT20, BGN21, SHW22, GLL22, ALT24]). In DP-SCO, we are provided with a dataset of users, where each user contributes samples drawn from an underlying distribution . The goal is to minimize the population objective function under the DP constraint:
where is a convex function defined on the convex domain .
DP-SCO was initially studied in the item-level setting, where significant progress has been made with numerous exciting results. The study of user-level DP-SCO, to our knowledge, was first initiated by [LSA+21], where the problem was considered in Euclidean spaces, with the Lipschitz constant and the diameter of the convex domain defined under the -norm. They achieved an error rate of for smooth functions, along with a lower bound of .
Building on this, [BS23] introduced new algorithms based on DP-SGD, incorporating improved mean estimation procedures to achieve an asymptotically optimal rate of . However, their approach relies on the smoothness of the loss function and imposes parameter restrictions, including and . On the other hand, [GKK+23] observed that user-level DP-SCO has low local sensitivity to user deletions. Using the propose-test-release mechanism, they developed algorithms applicable even to non-smooth functions and requiring only users. However, their approach is computationally inefficient, running in super-polynomial time and achieving a sub-optimal error rate of . [AL24] proposed an algorithm that achieves optimal excess risk in polynomial time, requiring only users and accommodating non-smooth losses. However, their algorithm is also computationally expensive: for -smooth losses, it requires gradient evaluations, while for non-smooth losses, it requires gradient evaluations. Motivated by these inefficiencies, [LLA24] focused on improving the computational cost of user-level DP-SCO while maintaining optimal excess risk. For -smooth losses, they designed an algorithm requiring gradient evaluations; for non-smooth losses, they achieved the same optimal excess risk using evaluations.
Linear-time algorithms have also been explored in the user-level DP-SCO setting. [BS23] proposed a linear-time algorithm in the local DP model, achieving an error rate of under the constraints , , and . Similarly, [LLA24] achieved the same rate under slightly relaxed conditions, requiring and .
In our work, we consider user-level DP-SCO under -norm assumptions, in contrast to most prior results, which were established under the -norm. This distinction is significant, as the -norm provides stronger per-coordinate guarantees and is hence more desirable. Furthermore, the -norm enjoys properties that are crucial to our algorithmic design but do not hold in the -norm setting. These properties shape both the development of our algorithm and the corresponding theoretical guarantees; we discuss the implications of this distinction and its role in our results in detail in the subsequent sections.
1.1 Technical Challenges in User-Level DP-SCO
A key challenge in solving user-level DP-SCO using DP-SGD lies in obtaining more accurate gradient estimates while maintaining privacy. Consider a simple scenario where we perform gradient descent for steps and seek an estimate of . To achieve this, we sample users to estimate and compute , the average of the gradients from user at point . If each user’s functions are i.i.d. drawn from the distribution , then with high probability, we know that
This naturally leads to the following mean-estimation problem: Given points in the unit ball, with most of them likely to be within a distance of from each other (under the i.i.d. assumption for utility guarantees), how can we accurately and privately estimate their mean?
A straightforward approach to recover the item-level rate is to apply the Gaussian mechanism:
$$\tilde{g} = \frac{1}{n}\sum_{i=1}^{n} g_i + \mathcal{N}(0, \sigma^2 I_d), \qquad (2)$$
where $g_i$ denotes the average gradient of user $i$ and the noise level $\sigma$ is set as .
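As a concrete baseline, the naive Gaussian-mechanism estimator in (2) can be sketched as follows; the clipping threshold and noise scale below are illustrative placeholders rather than the paper's exact calibration.

```python
import numpy as np

def gaussian_mechanism_mean(user_grads, clip_norm, sigma, rng):
    """Clip each user's average gradient to `clip_norm` in the l2 norm,
    average, and add isotropic Gaussian noise of scale `sigma`.
    Both parameters are illustrative, not the paper's exact calibration."""
    clipped = [g * min(1.0, clip_norm / max(float(np.linalg.norm(g)), 1e-12))
               for g in user_grads]
    mean = np.mean(clipped, axis=0)
    return mean + rng.normal(0.0, sigma, size=mean.shape)
```

The key drawback, as discussed above, is that the noise scale must be calibrated to the worst-case sensitivity of the average, which ignores the fact that most users' gradients are tightly concentrated.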
Mean Estimation and Sensitivity Control
To improve upon this, prior works [AL24, LLA24] designed mean-estimation sub-procedures with the following properties:
• Outlier Detection: The procedure tests whether the number of “bad” users (whose gradients significantly deviate from the majority) exceeds a predefined threshold (or “break point”).
• Outlier Removal and Sensitivity Reduction: If the number of “bad” users is below the threshold, the procedure removes outliers and produces an estimate with sensitivity . The privatized gradient is then: where .
• Better Variance Control: When all users provide consistent estimates, the output follows
resulting in significantly smaller noise compared to the naive Gaussian mechanism in (2).
By leveraging such sub-procedures, prior works have achieved the optimal excess risk rate in polynomial time. However, extending these to obtain linear-time algorithms poses new challenges.
Challenges in Linear-Time User-Level DP-SCO
Several linear-time algorithms exist for item-level DP-SCO. [ZTC22] and [CCGT24] achieved the optimal item-level rate under the smoothness assumption . (One can roughly interpolate the optimal item-level rate by setting in user-level DP.) Their approach maintains privacy by adding noise to all intermediate iterations of DP-SGD.
[FKT20] notably relaxed the smoothness requirement to by analyzing the stability of non-private SGD. They showed that for neighboring datasets, the sequence and remain close, ensuring that the sensitivity of the average iterate is low. This allows them to apply the Gaussian mechanism directly to privatize the average iterate.
Motivated by this stability-based analysis, [LLA24] attempted to generalize the linear-time approach of [FKT20] to the user-level setting. However, a key difficulty arises when incorporating the mean-estimation sub-procedure. Specifically, even if one can bound , where and represent the trajectories corresponding to neighboring datasets, there is no clear understanding of how applying the sub-procedure affects stability in subsequent iterations. In particular, after performing one gradient descent step using gradient estimates from the sub-procedure, we have no guarantee on how well remains bounded.
Due to this lack of stability analysis for the sub-procedure, [LLA24] resorted to privatizing all iterations, resulting in excessive Gaussian noise accumulation. Consequently, their algorithm achieved only a suboptimal error rate of , highlighting the fundamental challenge of designing a linear-time user-level DP-SCO algorithm by controlling the stability of the sub-procedures.
Generalizing Other Linear-Time Algorithms
The linear-time algorithms proposed in [ZTC22] and [CCGT24] for item-level DP-SCO represent promising approaches to generalize to the user-level setting. These algorithms achieve the optimal item-level rate by privatizing all intermediate iterations of variants of DP-SGD. This approach avoids the need for additional stability analysis, as the noise added at every iteration directly ensures privacy without relying on intermediate sensitivity bounds. Generalizing such algorithms to the user-level setting may, therefore, be easier compared to other approaches, as they sidestep the stability issues associated with mean-estimation sub-procedures.
In a private communication, the authors of [LLA24] indicated that it is possible to generalize the linear-time algorithm of [ZTC22] to the user-level setting. However, this generalization introduces a dependence on the smoothness parameter , which may impose restrictive smoothness constraints on the types of functions for which the algorithm is effective. Despite this limitation, such a generalization represents a natural direction for extending linear-time algorithms to user-level DP-SCO.
While these developments are promising, a more challenging and interesting direction lies in generalizing stability-based analyses from the item-level setting to the user-level setting. Stability-based methods, as seen in [FKT20], rely on carefully bounding the sensitivity of the entire optimization trajectory. Extending this approach to the user-level setting requires incorporating an appropriate sub-procedure for mean estimation, tailored to handle user-level sensitivity. This introduces additional layers of complexity, as the interactions between the sub-procedure and the iterative optimization process must be carefully analyzed to ensure stability and privacy. Overcoming these challenges would not only advance the theoretical understanding of user-level DP-SCO but might also lead to more efficient algorithms with broader applicability.
1.2 Our Techniques and Contributions
In this work, we design a novel mean-estimation sub-procedure based on robust statistics, such as the median and trimmed mean, specifically tailored for user-level DP-SCO. By incorporating the sub-procedure into (non-private) SGD, we establish an upper bound on for all iterations . This ensures stability throughout the optimization process.
Key Idea: 1-Lipschitz Property of Robust Statistics
In one dimension, many robust statistics, such as the median, satisfy a 1-Lipschitz property. This means that if each data point is perturbed by a distance of at most , the robust statistic shifts by at most . This property makes robust statistics particularly well-suited for mean estimation in user-level DP-SCO.
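As an illustrative sanity check of this property (not part of the paper's formal development), the following snippet perturbs each point by at most delta and verifies that the median moves by no more than delta; the check relies on the fact that each order statistic is 1-Lipschitz under such perturbations.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.05
for _ in range(200):
    xs = rng.normal(size=101)
    # move every point by at most delta
    ys = xs + rng.uniform(-delta, delta, size=xs.shape)
    # the median shifts by at most delta (order statistics are 1-Lipschitz)
    assert abs(np.median(ys) - np.median(xs)) <= delta + 1e-12
```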
To see this, recall that we define as the average of the gradients from user at point ; similarly, define for the gradients at . If is bounded, then by the smoothness of each loss, the two gradient averages are close for every user. By the 1-Lipschitz property, the robust statistics computed from and are then correspondingly close. As a result, the desired stability is naturally established in the one-dimensional setting.
Extending Stability to High Dimensions
In high-dimensional settings, robust statistics that satisfy the 1-Lipschitz property are not well understood. To address this, we adopt coordinate-wise robust statistics for gradient estimation. This approach ensures stability at each coordinate level. In turn, this allows us to establish iteration sensitivity in the -norm.
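A minimal sketch of coordinate-wise aggregation and its stability in the l_inf norm, using the median as the robust statistic: perturbing every point by at most delta in l_inf moves each coordinate's median by at most delta, hence the aggregate by at most delta in l_inf.

```python
import numpy as np

def coordinatewise_median(points):
    """Apply the one-dimensional median independently to each coordinate
    of an (n, d) array of per-user gradient estimates."""
    return np.median(points, axis=0)

rng = np.random.default_rng(0)
delta = 0.1
P = rng.normal(size=(50, 8))
# perturb every point by at most delta per coordinate
Q = P + rng.uniform(-delta, delta, size=P.shape)
shift = np.max(np.abs(coordinatewise_median(Q) - coordinatewise_median(P)))
assert shift <= delta + 1e-12
```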
Debiasing Technique
While robust statistics are effective in controlling sensitivity, using them directly introduces a significant bias in gradient estimation. This bias occurs even in "good" datasets where all functions are i.i.d. from the underlying distribution. If not handled properly, the bias can dominate the utility guarantee and degrade performance.
To address this issue, we propose a novel debiasing technique: If the mean and the robust statistic are sufficiently close, we directly use the mean; otherwise, we project the mean onto the ball centered at the robust statistic. Both the mean and robust statistics individually satisfy the 1-Lipschitz property. We prove that this coordinate-wise projection preserves the 1-Lipschitz property in the -norm. The resulting robust mean-estimation sub-procedure ensures iteration sensitivity while remaining unbiased when the dataset is well-behaved. This property holds with high probability when all functions are i.i.d. from the distribution.
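The debiasing step can be sketched as follows, using the coordinate-wise median as the robust statistic; the radius parameter is an illustrative stand-in for the paper's threshold. Since coordinate-wise clipping is the identity whenever the mean is already within the ball, the estimate is unchanged (hence unbiased) on well-behaved data.

```python
import numpy as np

def debiased_mean(points, radius):
    """Return the mean if it lies in the l_inf ball of `radius` around the
    coordinate-wise median; otherwise clip it, coordinate by coordinate,
    onto that ball. `radius` is an illustrative parameter."""
    mean = np.mean(points, axis=0)
    med = np.median(points, axis=0)
    # clipping is a no-op whenever |mean - med| <= radius in every coordinate
    return np.clip(mean, med - radius, med + radius)
```

On a clean sample the output equals the sample mean, while a single gross outlier can move the mean arbitrarily far but leaves the output within `radius` of the median.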
Improving Robust Mean Estimation: Smoothed Concentration Test
To further enhance stability, we introduce a smoothed version of the concentration score for testing whether the number of “bad” users exceeds a threshold (the “break point”). Prior works relied on an indicator function: which is non-smooth and prone to instability. We replace this with a smoother function: which allows for a more stable and robust concentration test by providing a continuous measure of closeness.
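The exact smoothed score is not reproduced above, but the idea can be illustrated with a piecewise-linear ramp replacing the hard indicator 1[dist <= tau]: a perturbation of size delta changes each per-user score by at most delta/tau, rather than flipping it between 0 and 1.

```python
def smooth_closeness(dist, tau):
    """Continuous surrogate for the indicator 1[dist <= tau]: equals 1 inside
    the threshold, 0 beyond 2*tau, and decays linearly in between.
    This ramp is illustrative; the paper's exact smoothing may differ."""
    if dist <= tau:
        return 1.0
    if dist >= 2.0 * tau:
        return 0.0
    return 2.0 - dist / tau  # linear ramp on (tau, 2*tau)

def concentration_score(dists, tau):
    """Smoothed count of users whose gradients are close to a candidate center."""
    return sum(smooth_closeness(d, tau) for d in dists)
```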
Main Result and Implications
Using our sub-procedure and sensitivity bounds, we achieve a utility rate of
for smooth functions defined over an -ball, with gradients bounded in the -norm and diagonally dominant Hessians (see Section 2 for detailed assumptions). We also construct a lower bound:
using the fingerprinting lemma, showing that our upper bound is nearly optimal except for the dependence on . We discuss this loose dependence further in Section 5.
These assumptions are favorable in terms of the properties of the norm: the -ball is the largest among all -balls, and Lipschitz continuity in the -norm implies that the -norm of the gradient is bounded, which is the weakest possible assumption on gradient norms.
Comparison with Prior Work
The best-known item-level rate for this setting (i.e., Lipschitz in the -norm and optimization over an -ball) is: as established in the work of [AFKT21]. To our knowledge, our result is the first to extend item-level rates to the user-level setting, incorporating the dependence on .
Existing user-level DP-SCO results have primarily been studied in Euclidean spaces, where functions are assumed to be Lipschitz in the -norm and optimized over an -ball. Since the diameter of an -ball is , applying existing linear-time algorithms to our setting yields a suboptimal rate: This suboptimal dependence on arises because existing methods privatize all intermediate steps, leading to excessive noise accumulation. However, we acknowledge that our approach requires an additional assumption on the diagonal dominance of Hessians. Despite this restriction, our techniques are well-suited to the -ball and gradient norm setting. We provide a detailed discussion of our assumptions, limitations, and open problems in Section 5.
DP and Robustness
There is a rich body of work exploring the connections between DP and robustness, with robust statistics playing a central role in many DP applications [DL09, SM12, LKO22, AUZ23]. However, to the best of our knowledge, the application of robust statistics in private optimization remains relatively under-explored. We view our work as an important step in this direction and hope it inspires further research into leveraging robust statistics for private optimization.
2 Preliminaries
In user-level DP-SCO, we are given a dataset of users, where is user ’s data, consisting of datapoints drawn i.i.d. from an (unknown) underlying distribution . The objective is to minimize the following population function under the user-level DP constraint:
In this section, we present the key definitions and assumptions. Discussions regarding the limitations of these assumptions can be found in Section 5. Additional tools, including those from differential privacy, are deferred to Appendix A.
Definition 2.1 (Lipschitz).
We say a function is -Lipschitz with respect to the -norm if, for any , we have . This means that for any , where .
Definition 2.2 (Smooth).
In this work, we say a function is -smooth if , where for a symmetric matrix . This implies that for any .
Definition 2.3 (Diagonal Dominance).
A matrix is diagonally dominant if
Assumption 2.4.
Assumption 2.5.
The Hessian of each function in the universe is diagonally dominant.
Diagonal dominance, although somewhat restrictive, is a commonly discussed assumption in the literature. For example, [WGZ+21] demonstrated the convergence rate of SGD in heavy-tailed settings under the assumption of diagonal dominance. Similarly, [DASD24] studied Adam’s preconditioning effect for quadratic functions with diagonally dominant Hessians. In the case of 1-hidden layer neural networks (a common focus of the NTK line of work), the Hessian is diagonal (see Section 3 in [LZB20]). Additionally, it has been shown that in practice, Hessians are typically block diagonal dominant [MG15, BRB17]. We also discuss potential ways to avoid this assumption in Section 5.
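Diagonal dominance is straightforward to verify numerically; a small helper for checking the row-wise condition |H_ii| >= sum over j != i of |H_ij|:

```python
import numpy as np

def is_diagonally_dominant(H):
    """Return True if |H[i, i]| >= sum_{j != i} |H[i, j]| for every row i."""
    A = np.abs(np.asarray(H, dtype=float))
    diag = np.diag(A)
    off_diag = A.sum(axis=1) - diag
    return bool(np.all(diag >= off_diag))
```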
Notation
For , we use to denote its th coordinate. For a vector and a convex set , we use . For and , we use to denote the -ball centered at of radius .
3 Main Algorithm
We present our main result in this section and explain the algorithm in a top-down manner. The algorithm is based on the localization framework of [FKT20]; see Algorithm 5 in the Appendix for details. The main result is stated formally below:
Theorem 3.1.
We briefly describe the localization framework. In the first phase, it runs (non-private) SGD using half of the dataset, and averages the iterates to get . Roughly speaking, the solution already provides a good approximation with a small population loss when the datasets are drawn i.i.d. from the underlying distribution. However, to ensure privacy, we require a sensitivity bound on and add noise correspondingly to privatize , yielding the private solution .
A naive bound on the excess loss due to the privatization is given by
but the magnitude of the noise is typically too large to achieve a good utility guarantee. Nevertheless, this process yields a much better initial point compared to the original starting point . As a result, a smaller dataset and a smaller step size are sufficient to find the next good solution in expectation, with smaller noise added to privatize .
This process is repeated over phases, where each subsequent solution is progressively refined, and the Gaussian noise becomes negligible. Ultimately, this iterative refinement balances privacy and utility, as established in Theorem 3.1. The formal argument about the utility guarantee and proof can be found in Lemma 3.
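The phased refinement described above can be sketched as follows. The schedule (halving the data and quartering the step size each phase, with noise proportional to the step size) follows the spirit of the [FKT20] localization framework but is an illustrative placeholder, and `sgd_phase` is a hypothetical caller-supplied routine returning the averaged iterate of one phase.

```python
import numpy as np

def localization_sgd(x0, data, sgd_phase, noise_scale, num_phases, eta0, rng):
    """Phased localization sketch: each phase runs (non-private) SGD from the
    current iterate with a geometrically shrinking step size, then privatizes
    only the averaged iterate of that phase."""
    x = np.asarray(x0, dtype=float)
    n = len(data)
    start = 0
    for i in range(num_phases):
        batch = data[start:start + n // (2 ** (i + 1))]  # shrinking data share
        start += len(batch)
        eta_i = eta0 / (4.0 ** i)                        # shrinking step size
        x_bar = sgd_phase(x, batch, eta_i)               # average iterate
        # privatize only the averaged iterate; noise shrinks with the step size
        x = x_bar + rng.normal(0.0, noise_scale * eta_i, size=x.shape)
    return x
```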
Our main contribution is in Algorithm 1, which uses a novel gradient estimation sub-procedure.
Iteration Sensitivity of Algorithm 1:
The contractivity of gradient descent plays a crucial role in the sensitivity analysis, for which we need the Hessians to be diagonally dominant (Assumption 2.5).
Lemma (Contractivity). Suppose is a convex and -smooth function satisfying Assumption 2.5. Then, for any two points , with step size , we have
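To see why diagonal dominance yields contractivity in the l_inf norm, consider a quadratic f(x) = (1/2) x^T H x: one gradient step is the linear map x -> (I - eta*H) x, and diagonal dominance keeps the l_inf operator norm of I - eta*H at most 1 for a small enough step. A numeric spot-check on an arbitrary diagonally dominant example (not the paper's construction):

```python
import numpy as np

# An arbitrary symmetric, diagonally dominant matrix (hence PSD by Gershgorin).
H = np.array([[2.0, 0.5, 0.5],
              [0.5, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
eta = 1.0 / np.max(np.sum(np.abs(H), axis=1))  # step below 1/smoothness

def gd_step(x):
    # gradient of f(x) = 0.5 * x^T H x is H x
    return x - eta * (H @ x)

rng = np.random.default_rng(0)
for _ in range(200):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # the step is non-expansive in the l_inf norm
    assert np.max(np.abs(gd_step(x) - gd_step(y))) <= np.max(np.abs(x - y)) + 1e-12
```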
Now, we discuss Algorithm 1. Given the dataset , we proceed in steps. At the th step, we draw users and compute the average gradient of each user. We then apply our gradient estimation algorithm (Algorithm 2) and perform normal gradient descent for steps.
In the second phase of Algorithm 1, we perform the concentration test (Algorithm 3) on the gradients at each step based on (Algorithm 4). If the concentration test passes for all steps (i.e., for all ), we output the average iterate. Otherwise, the algorithm fails and returns the initial point. As mentioned in the Introduction, the crucial novelty of Algorithm 1 and Algorithm 2 lies in bounding the sensitivity of each solution by incorporating the (coordinate-wise) robust statistics into SGD.
We utilize robust statistics in the gradient estimation sub-procedure. We make the following assumptions regarding the robust statistics used:
Assumption 3.2.
Given a set of vectors, let be any robust statistic satisfying the following properties:
(i) For any , if there exists a point such that more than points lie within , then .
(ii) If we perturb each point such that for any , and let be the robust statistic of , then .
(iii) For any real numbers and , if for each , and is the corresponding robust statistic of , then .
Remark 3.3.
Common robust statistics, such as the (coordinate-wise) median and trimmed mean, satisfy Assumption 3.2.
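For concreteness, here is a one-dimensional trimmed mean together with a randomized check of the perturbation property (item (ii) of Assumption 3.2); the trimming level k is an illustrative choice.

```python
import numpy as np

def trimmed_mean(xs, k):
    """Drop the k smallest and k largest values and average the rest."""
    s = np.sort(np.asarray(xs, dtype=float))
    return float(np.mean(s[k:len(s) - k]))

rng = np.random.default_rng(0)
delta, k = 0.05, 5
for _ in range(100):
    xs = rng.normal(size=60)
    ys = xs + rng.uniform(-delta, delta, size=xs.shape)
    # property (ii): moving every point by at most delta moves the statistic
    # by at most delta, since each order statistic is 1-Lipschitz
    assert abs(trimmed_mean(ys, k) - trimmed_mean(xs, k)) <= delta + 1e-12
```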
In Algorithm 2, we output the means if they are close to the robust statistics, in order to control the bias, and project the means coordinate-wise onto the sphere around the robust statistics when they are far apart. However, we still need to ensure that the sensitivity remains bounded when the projection is applied. The following technical lemma plays a crucial role in establishing iteration sensitivity in the presence of such projection operations.
Lemma. Consider four real numbers , such that , and . Let and for any . Then, we have
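The lemma's precise statement is partially elided above; under a natural reading (clipping a value y onto the interval [x - r, x + r] is jointly 1-Lipschitz when both x and y are perturbed by at most delta), the following randomized spot-check illustrates why the projection step does not amplify sensitivity:

```python
import numpy as np

def clip_to_interval(y, x, r):
    """Project y onto the interval [x - r, x + r]."""
    return min(max(y, x - r), x + r)

rng = np.random.default_rng(0)
r = 1.0
for _ in range(1000):
    x, y = rng.normal(), rng.normal()
    delta = abs(rng.normal()) * 0.1
    x2 = x + rng.uniform(-delta, delta)
    y2 = y + rng.uniform(-delta, delta)
    # moving x and y each by at most delta moves the projection by at most delta
    assert abs(clip_to_interval(y2, x2, r) - clip_to_interval(y, x, r)) <= delta + 1e-12
```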
Unfortunately, we are unaware of any robust statistic satisfying Assumption 3.2 in high dimensions under the -norm, and Lemma 3.3 does not hold in high dimensions either. These limitations restrict the applicability of our techniques in high-dimensional Euclidean spaces; see Section 5.
Let and be two trajectories corresponding to neighboring datasets that differ by one user. The crucial technical novelty is that, for any , we can control as long as the number of “bad” users in each phase ( in total) does not exceed the “break point”, say . Without loss of generality, assume that is the differing user in the neighboring dataset pairs considered in the following proof.
The first property of Assumption 3.2 ensures that when the number of “bad” users in each phase does not exceed the “break point” , the robust statistic remains close to most of the gradients, allowing us to control . To formalize this, we say that the neighboring dataset pair is -aligned if there exist points and such that and , where
This definition essentially states that the number of “bad” users does not exceed the “break point” in either or , ensuring that most gradients remain well-aligned within a bounded region.
Lemma. For some (unknown) parameter , suppose is -aligned. Then, by running Algorithm 2 with threshold parameter , we have .
The sequential sensitivity can be bounded using induction, with the base case already established. The formal statement is provided in Lemma 3.
(3)
Query Sensitivity of Concentration Test (Algorithm 3):
We have established iteration sensitivity for any aligned neighboring dataset pair . Next, we analyze the influence of the concentration test, which we use to check whether the number of “bad” users exceeds the “break point”.
To apply the privacy guarantee of (Lemma A.3), it suffices to bound the sensitivity of each query in the concentration test. Recall that we assume in the neighboring datasets. Thus, by the definition (Equation (3)), it is straightforward to observe that
(4)
Next, we consider the sensitivity of for . The sensitivity is proportional to , which we have already bounded by . Note that we can bound the iteration sensitivity if the neighboring datasets are aligned, meaning the number of “bad” users does not exceed the “break point”. We first show that if the number of “bad” users exceeds the “break point”, the algorithm is likely to halt after the first step by failing the first test.
Lemma. Suppose and we set . Suppose that for any point , we get , where . Then, with probability at least , the concentration test returns .
We now analyze the query sensitivity between the aligned neighboring datasets.
Lemma (Query Sensitivity). Suppose . Suppose is -aligned and we set the threshold parameter in Algorithm 4. Then the sensitivity of the query is bounded by . That is,
Equation (4) shows that the sensitivity is always bounded for . Lemma 3 shows that if the number of “bad” users exceeds the “break point”, we obtain , and the query sensitivities of the subsequent queries do not need to be considered. Lemma 3 establishes the query sensitivity in the concentration test when the neighboring datasets are aligned, and the number of "bad" users is below the threshold.
Privacy proof.
The final privacy guarantee, stated formally below, now follows easily from the previous lemmas. The full proof is deferred to Appendix B.7.
Utility proof.
We apply the localization framework in private optimization to finish the utility argument. We analyze the utility guarantee of Algorithm 1 based on the classic convergence rate of SGD on smooth convex functions (Lemma A.6) as follows:
Lemma. Let be any point in the domain. Suppose the dataset of users, whose size is larger than , is drawn i.i.d. from the distribution . Setting , the final output of Algorithm 1 satisfies
Now we apply the localization framework. We set to make the term depending on it negligible. The proof of the following lemma mostly follows from [FKT20].
4 Lower Bound
This section presents our main lower bound. As stated above, the lower bound is nearly tight, apart from lower-order terms and the dependency on .
Theorem 4.1.
The non-private term represents the information-theoretic lower bound for SCO under these assumptions (see, e.g., Theorem 1 in [AWBR09]).
5 Conclusion
We present a linear-time algorithm for user-level DP-SCO that leverages (coordinate-wise) robust statistics in the gradient estimation subprocedure and provide a lower bound that nearly matches the upper bound up to logarithmic terms and an additional dependence on . Despite this progress, several limitations and open problems remain, some of which we highlight below.
• Generalization to Euclidean and Other Spaces. Extending our results to Euclidean and other spaces is an interesting but technically challenging problem. One key challenge is the lack of robust statistics that are -Lipschitz under perturbations in the high-dimensional -norm (see the second item in Assumption 3.2). One may wonder whether commonly used robust statistics, such as the geometric median, are stable in this sense. However, we provide a simple counterexample involving the geometric median in the Appendix (Section B.10). Another challenge arises from our approach of projecting the mean towards the robust statistic to ensure unbiased gradient estimation. This projection is -Lipschitz under perturbations in one dimension (Lemma 3.3), but there are known counterexamples in higher dimensions [ALT24, Lemma 16]. Overcoming these issues is crucial for extending our method to general spaces.
• Additional Assumption on Diagonal Dominance. Our analysis assumes that the Hessians of functions in the universe are diagonally dominant, which is primarily used to show that gradient descent is contractive in the -norm (Lemma 3). This assumption is somewhat restrictive compared to the -norm setting, where it is sufficient to assume that the operator norm of the Hessians is bounded (i.e., smoothness). Addressing the aforementioned challenges and generalizing our results to the Euclidean space could potentially eliminate this additional assumption.
• Suboptimal Dependence on . Although our upper bound nearly matches the lower bound, it has a suboptimal dependence on . This issue arises from the loose dependence on sensitivity. Specifically, we draw users at each step and compute their average gradient, with . However, the final sensitivity remains roughly and does not improve with larger . An improvement in the dependence on could be achieved if the sensitivity of the robust statistic could benefit from larger .
Finally, it would be interesting to explore the use of robust statistics, such as the median used in this work, to address other private optimization problems.
References
- [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in l1 geometry. In ICML, pages 393–403, 2021.
- [AL24] Hilal Asi and Daogao Liu. User-level differentially private stochastic convex optimization: Efficient algorithms with optimal rates. In AISTATS, pages 4240–4248, 2024.
- [ALT24] Hilal Asi, Daogao Liu, and Kevin Tian. Private stochastic convex optimization with heavy tails: Near-optimality from simple reductions. arXiv, 2406.02789, 2024.
- [AUZ23] Hilal Asi, Jonathan Ullman, and Lydia Zakynthinou. From robustness to privacy and back. In ICML, pages 1121–1146, 2023.
- [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter Bartlett, and Pradeep Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In NIPS, 2009.
- [BFGT20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. arXiv, 2006.06914, 2020.
- [BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In NeurIPS, pages 11282–11291, 2019.
- [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In COLT, pages 474–499, 2021.
- [BRB17] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical gauss-newton optimisation for deep learning. In ICML, pages 557–565, 2017.
- [BS23] Raef Bassily and Ziteng Sun. User-level private stochastic convex optimization with optimal rates. In ICML, pages 1838–1851, 2023.
- [BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
- [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- [CCGT24] Christopher A Choquette-Choo, Arun Ganesh, and Abhradeep Guha Thakurta. Optimal rates for o (1)-smooth dp-sco with a single epoch and large batches. In ALT, 2024.
- [CTW+21] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX Security, pages 2633–2650, 2021.
- [DASD24] Rudrajit Das, Naman Agarwal, Sujay Sanghavi, and Inderjit S Dhillon. Towards quantifying the preconditioning effect of Adam. arXiv, 2402.07114, 2024.
- [DK09] Stephane Durocher and David Kirkpatrick. The projection median of a set of points. Computational Geometry, 42(5):364–375, 2009.
- [DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC, pages 371–380, 2009.
- [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
- [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- [FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In STOC, pages 439–449, 2020.
- [GKK+23] Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, and Chiyuan Zhang. On user-level private convex optimization. arXiv, 2305.04912, 2023.
- [GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. Private convex optimization via exponential mechanism. In COLT, pages 1948–1989, 2022.
- [JNG+19] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv, 1902.03736, 2019.
- [KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In COLT, pages 1853–1902, 2019.
- [LKO22] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In COLT, pages 1167–1246, 2022.
- [LLA24] Andrew Lowy, Daogao Liu, and Hilal Asi. Faster algorithms for user-level private stochastic convex optimization. In NeurIPS, 2024.
- [LSA+21] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. Learning with user-level privacy. NeurIPS, pages 12466–12479, 2021.
- [LSS+23] Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Analyzing leakage of personally identifiable information in language models. In S & P, pages 346–363, 2023.
- [LZB20] Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. In NeurIPS, pages 15954–15964, 2020.
- [MG15] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
- [SHW22] Jinyan Su, Lijie Hu, and Di Wang. Faster rates of private stochastic convex optimization. In ALT, pages 995–1002, 2022.
- [SM12] Aleksandra Slavkovic and Roberto Molinari. Perturbed m-estimation: A further investigation of robust statistics for differential privacy. In Statistics in the Public Interest: In Memory of Stephen E. Fienberg, pages 337–361. Springer, 2012.
- [WGZ+21] Hongjian Wang, Mert Gurbuzbalaban, Lingjiong Zhu, Umut Simsekli, and Murat A Erdogdu. Convergence rates of stochastic gradient descent under infinite noise variance. In NeurIPS, pages 18866–18877, 2021.
- [XZ24] Zheng Xu and Yanxiang Zhang. Advances in private training for production on-device language models. https://research.google/blog/advances-in-private-training-for-production-on-device-language-models/, 2024. Google Research Blog.
- [ZTC22] Qinzi Zhang, Hoang Tran, and Ashok Cutkosky. Differentially private online-to-batch for smooth losses. In NeurIPS, 2022.
Appendix A Preliminaries
A.1 Differential Privacy
Definition A.1 (User-level DP).
We say a mechanism is -user-level DP if, for any neighboring datasets and that differ in one user, and any output event set , we have
Proposition A.2 (Gaussian Mechanism).
Consider a function . If , where means and are neighboring datasets, then the Gaussian mechanism
where with is -DP.
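As a concrete illustration (not the paper's algorithm), the Gaussian mechanism of Proposition A.2 can be sketched in a few lines. The noise calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps is the classical one from [DR14], valid for eps <= 1; the function name and interface are our own:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-DP via the Gaussian mechanism.

    Uses the classical calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps
    from [DR14], which is valid for eps <= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    value = np.asarray(value, dtype=float)
    # Add isotropic Gaussian noise, one coordinate per output dimension.
    return value + rng.normal(0.0, sigma, size=value.shape)
```

Here `sensitivity` is the L2-sensitivity of the query over neighboring datasets, which in the user-level setting means datasets differing in one user's entire contribution.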
A.1.1 AboveThreshold
Our algorithms use the AboveThreshold algorithm [DR14], which is a key tool in DP to identify whether there is a query in a stream of queries that is above a certain threshold . The algorithm (Algorithm 4 presented in the Appendix) has the following guarantees:
Lemma A.3 ([DR14], Theorem 3.24).
is -DP. Moreover, let and . For any sequence of queries each of sensitivity , halts at time such that with probability at least ,
- For all , and ;
- and or .
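A minimal sketch of AboveThreshold (Algorithm 1 of [DR14]) is given below. The Laplace noise scales 2*sensitivity/eps for the threshold and 4*sensitivity/eps for each query match the standard analysis; the query/data representation is purely illustrative:

```python
import numpy as np

def above_threshold(queries, data, threshold, eps, sensitivity=1.0, rng=None):
    """AboveThreshold / sparse-vector technique [DR14, Algorithm 1].

    Scans a stream of queries (each of the given sensitivity) and halts at
    the first one whose noisy value exceeds the noisy threshold. The whole
    scan is eps-DP regardless of the stream length.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Noisy threshold, drawn once for the entire stream.
    noisy_T = threshold + rng.laplace(scale=2.0 * sensitivity / eps)
    for t, q in enumerate(queries):
        # Fresh noise per query, at twice the threshold's scale.
        if q(data) + rng.laplace(scale=4.0 * sensitivity / eps) >= noisy_T:
            return t  # index of the first (noisily) above-threshold query
    return None  # no query exceeded the threshold
```

With high probability the reported index is correct up to an O(log(k)/eps) additive slack in the query values, matching the guarantee in Lemma A.3.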
A.2 SubGaussian and Norm-SubGaussian Random Vectors
Definition A.4.
Let . We say a random vector is SubGaussian () with parameter if for any . A random vector is Norm-SubGaussian with parameter () if for all .
Theorem A.5 (Hoeffding-type inequality for norm-subGaussian, [JNG+19]).
Let be random vectors, and let for be the corresponding filtration. Suppose that for each , is zero-mean . Then there exists an absolute constant such that, for any ,
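The scaling asserted by Theorem A.5 can be checked empirically: a sum of n i.i.d. zero-mean unit vectors (which are norm-subGaussian with parameter O(1)) has norm on the order of sqrt(n), not n. The simulation below is a sanity check under these illustrative parameters, not part of the proof:

```python
import numpy as np

def norm_subgaussian_sum_demo(n=4000, d=20, trials=200, seed=0):
    """Max norm of the sum of n i.i.d. uniform unit vectors over many trials.

    Each unit vector is zero-mean and norm-subGaussian (its norm is bounded
    by 1), so Theorem A.5 predicts the sum has norm O(sqrt(n log(1/delta))).
    """
    rng = np.random.default_rng(seed)
    max_norm = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform unit vectors
        max_norm = max(max_norm, np.linalg.norm(X.sum(axis=0)))
    return max_norm
```

Since E[||sum||^2] = n for such vectors, the maximum over trials stays within a small constant factor of sqrt(n).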
A.3 Optimization
Lemma A.6 ([Bub15]).
Consider a -smooth convex function over a convex set . For any , suppose that the random initial point satisfies . Suppose we have an unbiased stochastic gradient oracle such that . Then running SGD for steps with a fixed step size satisfies
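For concreteness, a minimal projected-SGD loop of the kind analyzed in Lemma A.6 might look as follows; the fixed step size, Euclidean-ball constraint set, and averaged iterate are illustrative choices, and the oracle is any unbiased stochastic gradient:

```python
import numpy as np

def projected_sgd(grad_oracle, x0, steps, eta, radius):
    """Fixed-step-size SGD with Euclidean projection onto {x : ||x|| <= radius}.

    Returns the average iterate, whose expected suboptimality for smooth
    convex objectives is bounded as in Lemma A.6.
    """
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(steps):
        x = x - eta * grad_oracle(x)          # stochastic gradient step
        nrm = np.linalg.norm(x)
        if nrm > radius:
            x *= radius / nrm                  # project back onto the ball
        avg += x / steps                       # running average of iterates
    return avg
```

Averaging the iterates, rather than returning the last one, is what yields the stated rate for nonstrongly convex objectives.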
Appendix B Proofs for Section 3
B.1 Proof of Lemma 3
See 3
Proof.
By the diagonal dominance assumption and precondition that , we know
Note that by the mean-value theorem,
where is in the segment between and . Hence we have
∎
B.2 Proof of Lemma 3.3
See 3.3
Proof.
Without loss of generality, we assume . We proceed by case analysis.
Case (I): if no projection happens. Then it is straightforward that and and hence .
Case (II): if one projection happens. Without loss of generality, assume we project and get . We analyze the following sub-cases:
(i): . In this case we know .
If , then we know .
If , then which is impossible.
(ii): . If , then we know .
Consider the subsubcase when . If , then .
Else if , as , we have .
Case (III): if two projections happen.
(i): . This is a trivial case.
(ii): . This case is also trivial.
(iii): . We can show that .
(iv): . If , then we know .
Else, if , then we know .
∎
B.3 Proof of Lemma 3.3
See 3.3
Proof.
It suffices to show that . By the first property of Assumption 3.2 and the preconditions in the statement, we know , which leads to that
Moreover, we have that the robust statistic by the triangle inequality as and .
By the projection operation in Algorithm 2, we know and , and hence we know . This completes the proof. ∎
B.4 Proof of Lemma 3
See 3
Proof.
Recall that we assume all users are equal to but .
We actually show that
as the projection operator onto the same convex set is contractive in the - and -norms in our case.
Let and . Note that the users used in computing the gradients are the same. Let be the output of Algorithm 2 with as inputs, and be the corresponding output of . By the third property of Assumption 3.2, it suffices to show that
(6)
By Lemma 3, we know
B.5 Proof of Lemma 3
See 3
Proof.
The main randomness comes from the Laplace noise added to the queries and the threshold. Under the precondition that for any , we know the concentration score
Then by Lemma A.3, with probability at least , we have
which means . ∎
B.6 Proof of Lemma 3
See 3
B.7 Proof of Lemma 3
See 3
Proof.
Consider the implementation over two neighboring datasets and , and use the prime notation to denote the corresponding variables with respect to . Without loss of generality, we assume the differing user is used in the first phase.
To avoid confusion, let and be the outputs of Algorithm 1 with neighboring inputs and , and be the outputs of Algorithm 5 by privatizing and with Gaussian noise respectively.
The outputs and of Algorithm 1 depend on the intermediate random variables and .
As the query sensitivity for is always bounded, by our parameter setting and property of , we know .
We proceed by case analysis.
(i) is not -aligned. Then by Lemma 3, either or . Then by a union bound and , we have
If , then is the initial point. We have for any event range ,
This completes the privacy guarantee for case (i).
(ii) is -aligned. In this case, by Lemma 3.3 and Lemma 3, we know . Moreover, Lemma 3 implies that the query sensitivity is always bounded by 1. Then by the property of (Lemma A.3), we have . For any event range , we have
where we use the Gaussian Mechanism (Proposition A.2) to bound by . This completes the proof.
∎
B.8 Proof of Lemma 3
See 3
Proof.
By the Hoeffding inequality for norm-subGaussian vectors (Theorem A.5), for each and each , we have
Then, conditioning on the event that for all and , our setting of parameters ensures that we pass all the concentration tests with , and that
which means . Note that and when all functions are drawn i.i.d. from the distribution. By the small TV distance between and the good gradient estimation , it follows from Lemma A.6 that
where the last term comes from the worst value , and the small failure probability . ∎
B.9 Proof of Lemma 3
See 3
B.10 Counterexample of the 1-Lipschitz of Geometric Median
We use the counterexample from [DK09] to show that the geometric median is unstable even in two-dimensional space. Recall that, given a set of points in , the geometric median of , denoted by , is
Let and , where is a very small perturbation. However, we know and .
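For readers who want to reproduce such instability experiments numerically, geometric medians can be computed with Weiszfeld's fixed-point iteration. The sketch below is a generic implementation (not code from the paper); the epsilon guard in the denominator is our own safeguard for iterates landing on a data point:

```python
import numpy as np

def geometric_median(points, iters=200, tol=1e-9):
    """Weiszfeld's algorithm: fixed-point iteration for the geometric median,
    argmin_y sum_i ||y - x_i||. Each step re-weights points by the inverse of
    their distance to the current iterate.
    """
    pts = np.asarray(points, dtype=float)
    y = pts.mean(axis=0)  # initialize at the centroid
    for _ in range(iters):
        # Inverse distances; the epsilon avoids division by zero when the
        # iterate coincides with a data point.
        w = 1.0 / np.maximum(np.linalg.norm(pts - y, axis=1), 1e-12)
        y_new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

For symmetric configurations (e.g. the corners of a square) the iteration returns the center, and comparing outputs on two slightly perturbed point sets exhibits the non-Lipschitz behavior discussed above.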
Appendix C Details of Lower Bound
Now we construct a lower bound for the weighted sign estimation error.
Weighted sign estimation error.
We construct a distribution as follows: for each coordinate , we draw uniformly at random from , and i.i.d. for . The objective is to minimize the weighted sign estimation error with respect to .
Lemma C.1.
Let . For any -user-level-DP algorithm , there exists a distribution such that and almost surely, and, given dataset i.i.d. drawn from , we have
First, by an existing reduction, we can reduce the user-level setting to the item-level setting.
Lemma C.2 ([LSA+21]).
Suppose each user observes with known. For any -user-level-DP algorithm , there exists an -item-level-DP algorithm that takes inputs , where , and has the same performance as .
Hence by Lemma C.2, it suffices to consider the item-level lower bound. We prove the following lemma:
Lemma C.3.
Let . Let be i.i.d. drawn from . If is -DP, and
then .
By scale invariance, it suffices to consider the case when . To prove Lemma C.3, we need the following fingerprinting lemma:
Lemma C.4 (Lemma 6.8 in [KLSU19]).
For every fixed number and every , define . We have
(7)
By choosing uniformly from , we make the following observation about the expectation on the RHS of Equation (7).
Lemma C.5.
We have
Proof.
Using integration by parts, we have
∎
Now we use the fingerprinting lemma.
Lemma C.6.
One has
Proof.
Fix a column .
Construct the for our purpose. Define to be
That is, is the expectation of conditioned on . Similarly, define to be
That is, is the expectation of conditional on .
A small weighted sign error means a large , as demonstrated by the following lemma:
Lemma C.7.
Let . Suppose , and
then we have
Proof.
Note that
Moreover, one has that
This completes the proof. ∎
It remains to bound the sample complexity needed to achieve a large value of
(8)
Now we complete the proof of Lemma C.3.
Proof of Lemma C.3.
Let . We use the DP constraint of to upper bound .
Let denote with replaced with an independent draw from . Define
Due to the independence between and , we have . Moreover, as and , we have .
Split with and and split similarly. By the property of DP, we know
Then we have
where the last inequality used the fact that when .
When , setting , we have
It is standard to translate the sample complexity lower bound (Lemma C.3) into the error lower bound (Lemma C.1). We present a proof below.
Proof of Lemma C.1.
Let be the set of item-level DP mechanisms, and let be the set of user-level DP mechanisms.
Define the error term:
where .
Recall that we construct the distribution as follows: for each coordinate , we draw uniformly at random from , and i.i.d. for .
Let be the following: for each coordinate , we draw uniformly at random from , and i.i.d. for . This corresponds to averaging the samples of each user. By Lemma C.2, we have
By Lemma C.3,
Let . When we have a large dataset of size , construct , where is a Dirac distribution at .
Hence, by a Chernoff bound, with high probability, the number of samples drawn from in the dataset is no more than . Then we have
The precondition of requires almost surely for . However, involves Gaussian distributions, so we construct by truncating them.
More specifically, for each item drawn from , we set with . In other words, we get by projecting to . Fixing , we first show
(9)
It suffices to consider any coordinate , and prove
Letting and , we know
where is the density function of the standard Gaussian . As and , we establish Equation (9). Let denote the mean of conditional on .
Moreover, we have
where inequality (i) follows from the fact that we can always simulate samples from using samples from , which means the problem over is harder than the one over . This completes the proof. ∎
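The truncation step used in the proof above to enforce almost-sure boundedness can be sketched as follows. The radius and the standard-Gaussian samples are illustrative, and the demo only checks that clipping at a few standard deviations barely moves the mean, in the spirit of the bound in Equation (9):

```python
import numpy as np

def truncate_to_interval(samples, R):
    """Clip each scalar sample z to [-R, R], i.e. z -> sign(z) * min(|z|, R);
    this is the projection used to make the lower-bound instance bounded a.s.
    """
    z = np.asarray(samples, dtype=float)
    return np.clip(z, -R, R)

# Clipping a standard Gaussian at R = 3 standard deviations moves the
# empirical mean by a negligible amount, since the two tails nearly cancel.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)
mean_shift = abs(truncate_to_interval(z, 3.0).mean() - z.mean())
```

The same projection applies coordinate-wise in higher dimensions; the key point is that the truncated distribution is close to the original, so lower bounds transfer.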