Linear-Time User-Level DP-SCO via Robust Statistics
Abstract
User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise for privacy purposes and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility trade-off but also maintains computational efficiency. We complement our algorithm with an information-theoretic lower bound, showing that our upper bound is optimal up to logarithmic factors and the dependence on . This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.
1 Introduction
With the rapid development and widespread applications of modern machine learning and artificial intelligence, particularly driven by advancements in large language models (LLMs), privacy concerns have come to the forefront. For example, recent studies have highlighted significant privacy risks associated with LLMs, including well-documented instances of training data leakage [CTW+21, LSS+23]. These challenges underscore the urgent need for privacy-preserving mechanisms in machine learning systems.
Differential Privacy (DP) [DMNS06] has emerged as a rigorous mathematical framework for ensuring privacy and is now the gold standard for addressing privacy concerns in machine learning. The classic definition of DP, referred to as item-level DP, guarantees that the replacement of any single training example has a negligible impact on the model’s output. Formally, a mechanism $\mathcal{M}$ is said to satisfy $(\varepsilon,\delta)$-item-level DP if, for any pair $D$ and $D'$ of neighboring datasets that differ by a single item, and for any event $E$, the following condition holds:
$$\Pr[\mathcal{M}(D) \in E] \le e^{\varepsilon}\,\Pr[\mathcal{M}(D') \in E] + \delta. \qquad (1)$$
Item-level DP provides a strong guarantee against the leakage of private information associated with any single element in a dataset. However, in many real-world applications, a single user may contribute multiple elements to the training dataset [XZ24]. This introduces the need to safeguard the privacy of each user as a whole, which motivates a stronger notion of user-level DP. This notion ensures that replacing one user (who may contribute up to items) in the dataset has only a negligible effect on the model’s output. Formally, (1) still holds when and differ by one user, i.e., up to items in total. Notably, when , user-level DP reduces to item-level DP.
DP-SCO
As one of the central problems in privacy-preserving machine learning and statistical learning, DP stochastic convex optimization (DP-SCO) has garnered significant attention in recent years (e.g., [BST14, BFTT19, FKT20, BFGT20, BGN21, SHW22, GLL22, ALT24]). In DP-SCO, we are provided with a dataset of users, where each user contributes samples drawn from an underlying distribution . The goal is to minimize the population objective function under the DP constraint:
where is a convex function defined on the convex domain .
DP-SCO was initially studied in the item-level setting, where significant progress has been made with numerous exciting results. The study of user-level DP-SCO, to our knowledge, was first initiated by [LSA+21], where the problem was considered in Euclidean spaces, with the Lipschitz constant and the diameter of the convex domain defined under the -norm. They achieved an error rate of for smooth functions, along with a lower bound of .
Building on this, [BS23] introduced new algorithms based on DP-SGD, incorporating improved mean estimation procedures to achieve an asymptotically optimal rate of . However, their approach relies on the smoothness of the loss function and imposes parameter restrictions, including and . On the other hand, [GKK+23] observed that user-level DP-SCO has low local sensitivity to user deletions. Using the propose-test-release mechanism, they developed algorithms applicable even to non-smooth functions and requiring only users. However, their approach is computationally inefficient, running in super-polynomial time and achieving a sub-optimal error rate of . [AL24] proposed an algorithm that achieves optimal excess risk in polynomial time, requiring only users and accommodating non-smooth losses. However, their algorithm is also computationally expensive: for -smooth losses, it requires gradient evaluations, while for non-smooth losses, it requires gradient evaluations. Motivated by these inefficiencies, [LLA24] focused on improving the computational cost of user-level DP-SCO while maintaining optimal excess risk. For -smooth losses, they designed an algorithm requiring gradient evaluations; for non-smooth losses, they achieved the same optimal excess risk using evaluations.
Linear-time algorithms have also been explored in the user-level DP-SCO setting. [BS23] proposed a linear-time algorithm in the local DP model, achieving an error rate of under the constraints , , and . Similarly, [LLA24] achieved the same rate under slightly relaxed conditions, requiring and .
In our work, we consider user-level DP-SCO under -norm assumptions, in contrast to most prior results, which were established under the -norm. This distinction is significant, as the -norm provides stronger per-coordinate guarantees and is hence more desirable. Furthermore, the -norm enjoys properties that are crucial to our algorithmic design but do not hold in the -norm setting. These properties shape both the development of our algorithm and the corresponding theoretical guarantees; we discuss the implications of this distinction and its role in our results in detail in the subsequent sections.
1.1 Technical Challenges in User-Level DP-SCO
A key challenge in solving user-level DP-SCO using DP-SGD lies in obtaining more accurate gradient estimates while maintaining privacy. Consider a simple scenario where we perform gradient descent for steps and seek an estimate of . To achieve this, we sample users to estimate and compute , the average of the gradients from user at point . If each user’s functions are i.i.d. drawn from the distribution , then with high probability, we know that
This naturally leads to the following mean-estimation problem: Given points in the unit ball, with most of them likely to be within a distance of from each other (under the i.i.d. assumption for utility guarantees), how can we accurately and privately estimate their mean?
A straightforward approach to recover the item-level rate is to apply the Gaussian mechanism:
$$\tilde{g} = \frac{1}{n}\sum_{i=1}^{n} g_i + \mathcal{N}(0, \sigma^2 I_d), \qquad (2)$$
where $g_i$ denotes the average gradient of user $i$ and the noise level $\sigma$ is set as .
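As a concrete baseline, the naive Gaussian-mechanism estimator in (2) can be sketched as follows; the clipping threshold and noise scale below are illustrative placeholders rather than the paper's exact calibration.

```python
import numpy as np

def gaussian_mechanism_mean(user_grads, clip_norm, sigma, rng):
    """Clip each user's average gradient to `clip_norm` in the l2 norm,
    average, and add isotropic Gaussian noise of scale `sigma`.
    Both parameters are illustrative, not the paper's exact calibration."""
    clipped = [g * min(1.0, clip_norm / max(float(np.linalg.norm(g)), 1e-12))
               for g in user_grads]
    mean = np.mean(clipped, axis=0)
    return mean + rng.normal(0.0, sigma, size=mean.shape)
```

The key drawback, as discussed above, is that the noise scale must be calibrated to the worst-case sensitivity of the average, which ignores the fact that most users' gradients are tightly concentrated.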
Mean Estimation and Sensitivity Control
To improve upon this, prior works [AL24, LLA24] designed mean-estimation sub-procedures with the following properties:
• Outlier Detection: The procedure tests whether the number of “bad” users (whose gradients significantly deviate from the majority) exceeds a predefined threshold (or “break point”).
• Outlier Removal and Sensitivity Reduction: If the number of “bad” users is below the threshold, the procedure removes outliers and produces an estimate with sensitivity . The privatized gradient is then: where .
• Better Variance Control: When all users provide consistent estimates, the output follows
resulting in significantly smaller noise compared to the naive Gaussian mechanism in (2).
By leveraging such sub-procedures, prior works have achieved the optimal excess risk rate in polynomial time. However, extending these to obtain linear-time algorithms poses new challenges.
Challenges in Linear-Time User-Level DP-SCO
Several linear-time algorithms exist for item-level DP-SCO. [ZTC22] and [CCGT24] achieved the optimal item-level rate under the smoothness assumption . (One can roughly interpolate the optimal item-level rate by setting in user-level DP.) Their approach maintains privacy by adding noise to all intermediate iterations of DP-SGD.
[FKT20] notably relaxed the smoothness requirement to by analyzing the stability of non-private SGD. They showed that for neighboring datasets, the sequence and remain close, ensuring that the sensitivity of the average iterate is low. This allows them to apply the Gaussian mechanism directly to privatize the average iterate.
Motivated by this stability-based analysis, [LLA24] attempted to generalize the linear-time approach of [FKT20] to the user-level setting. However, a key difficulty arises when incorporating the mean-estimation sub-procedure. Specifically, even if one can bound , where and represent the trajectories corresponding to neighboring datasets, there is no clear understanding of how applying the sub-procedure affects stability in subsequent iterations. In particular, after performing one gradient descent step using gradient estimates from the sub-procedure, we have no guarantee on how well remains bounded.
Due to this lack of stability analysis for the sub-procedure, [LLA24] resorted to privatizing all iterations, resulting in excessive Gaussian noise accumulation. Consequently, their algorithm achieved only a suboptimal error rate of , highlighting the fundamental challenge of designing a linear-time user-level DP-SCO algorithm by controlling the stability of the sub-procedures.
Generalizing Other Linear-Time Algorithms
The linear-time algorithms proposed in [ZTC22] and [CCGT24] for item-level DP-SCO represent promising approaches to generalize to the user-level setting. These algorithms achieve the optimal item-level rate by privatizing all intermediate iterations of variants of DP-SGD. This approach avoids the need for additional stability analysis, as the noise added at every iteration directly ensures privacy without relying on intermediate sensitivity bounds. Generalizing such algorithms to the user-level setting may, therefore, be easier compared to other approaches, as they sidestep the stability issues associated with mean-estimation sub-procedures.
In a private communication, the authors of [LLA24] indicated that it is possible to generalize the linear-time algorithm of [ZTC22] to the user-level setting. However, this generalization introduces a dependence on the smoothness parameter , which may impose restrictive smoothness constraints on the types of functions for which the algorithm is effective. Despite this limitation, such a generalization represents a natural direction for extending linear-time algorithms to user-level DP-SCO.
While these developments are promising, a more challenging and interesting direction lies in generalizing stability-based analyses from the item-level setting to the user-level setting. Stability-based methods, as seen in [FKT20], rely on carefully bounding the sensitivity of the entire optimization trajectory. Extending this approach to the user-level setting requires incorporating an appropriate sub-procedure for mean estimation, tailored to handle user-level sensitivity. This introduces additional layers of complexity, as the interactions between the sub-procedure and the iterative optimization process must be carefully analyzed to ensure stability and privacy. Overcoming these challenges would not only advance the theoretical understanding of user-level DP-SCO but might also lead to more efficient algorithms with broader applicability.
1.2 Our Techniques and Contributions
In this work, we design a novel mean-estimation sub-procedure based on robust statistics, such as the median and trimmed mean, specifically tailored for user-level DP-SCO. By incorporating the sub-procedure into (non-private) SGD, we establish an upper bound on for all iterations . This ensures stability throughout the optimization process.
Key Idea: 1-Lipschitz Property of Robust Statistics
In one dimension, many robust statistics, such as the median, satisfy a 1-Lipschitz property. This means that if each data point is perturbed by a distance of at most , the robust statistic shifts by at most . This property makes robust statistics particularly well-suited for mean estimation in user-level DP-SCO.
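As an illustrative sanity check of this property (not part of the paper's formal development), the following snippet perturbs each point by at most delta and verifies that the median moves by no more than delta; the check relies on the fact that each order statistic is 1-Lipschitz under such perturbations.

```python
import numpy as np

rng = np.random.default_rng(0)
delta = 0.05
for _ in range(200):
    xs = rng.normal(size=101)
    # move every point by at most delta
    ys = xs + rng.uniform(-delta, delta, size=xs.shape)
    # the median shifts by at most delta (order statistics are 1-Lipschitz)
    assert abs(np.median(ys) - np.median(xs)) <= delta + 1e-12
```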
To see this, recall that we define as the average of the gradients from user at point ; similarly, define for the gradients at . If is bounded, then by the smoothness of each loss, the two gradient averages are close for every user. By the 1-Lipschitz property, the robust statistics computed from and are then correspondingly close. As a result, the desired stability is naturally established in the one-dimensional setting.
Extending Stability to High Dimensions
In high-dimensional settings, robust statistics that satisfy the 1-Lipschitz property are not well understood. To address this, we adopt coordinate-wise robust statistics for gradient estimation. This approach ensures stability at each coordinate level. In turn, this allows us to establish iteration sensitivity in the -norm.
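A minimal sketch of coordinate-wise aggregation and its stability in the l_inf norm, using the median as the robust statistic: perturbing every point by at most delta in l_inf moves each coordinate's median by at most delta, hence the aggregate by at most delta in l_inf.

```python
import numpy as np

def coordinatewise_median(points):
    """Apply the one-dimensional median independently to each coordinate
    of an (n, d) array of per-user gradient estimates."""
    return np.median(points, axis=0)

rng = np.random.default_rng(0)
delta = 0.1
P = rng.normal(size=(50, 8))
# perturb every point by at most delta per coordinate
Q = P + rng.uniform(-delta, delta, size=P.shape)
shift = np.max(np.abs(coordinatewise_median(Q) - coordinatewise_median(P)))
assert shift <= delta + 1e-12
```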
Debiasing Technique
While robust statistics are effective in controlling sensitivity, using them directly introduces a significant bias in gradient estimation. This bias occurs even in "good" datasets where all functions are i.i.d. from the underlying distribution. If not handled properly, the bias can dominate the utility guarantee and degrade performance.
To address this issue, we propose a novel debiasing technique: If the mean and the robust statistic are sufficiently close, we directly use the mean; otherwise, we project the mean onto the ball centered at the robust statistic. Both the mean and robust statistics individually satisfy the 1-Lipschitz property. We prove that this coordinate-wise projection preserves the 1-Lipschitz property in the -norm. The resulting robust mean-estimation sub-procedure ensures iteration sensitivity while remaining unbiased when the dataset is well-behaved. This property holds with high probability when all functions are i.i.d. from the distribution.
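The debiasing step can be sketched as follows, using the coordinate-wise median as the robust statistic; the radius parameter is an illustrative stand-in for the paper's threshold. Since coordinate-wise clipping is the identity whenever the mean is already within the ball, the estimate is unchanged (hence unbiased) on well-behaved data.

```python
import numpy as np

def debiased_mean(points, radius):
    """Return the mean if it lies in the l_inf ball of `radius` around the
    coordinate-wise median; otherwise clip it, coordinate by coordinate,
    onto that ball. `radius` is an illustrative parameter."""
    mean = np.mean(points, axis=0)
    med = np.median(points, axis=0)
    # clipping is a no-op whenever |mean - med| <= radius in every coordinate
    return np.clip(mean, med - radius, med + radius)
```

On a clean sample the output equals the sample mean, while a single gross outlier can move the mean arbitrarily far but leaves the output within `radius` of the median.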
Improving Robust Mean Estimation: Smoothed Concentration Test
To further enhance stability, we introduce a smoothed version of the concentration score for testing whether the number of “bad” users exceeds a threshold (the “break point”). Prior works relied on an indicator function: which is non-smooth and prone to instability. We replace this with a smoother function: which allows for a more stable and robust concentration test by providing a continuous measure of closeness.
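The exact smoothed score is not reproduced above, but the idea can be illustrated with a piecewise-linear ramp replacing the hard indicator 1[dist <= tau]: a perturbation of size delta changes each per-user score by at most delta/tau, rather than flipping it between 0 and 1.

```python
def smooth_closeness(dist, tau):
    """Continuous surrogate for the indicator 1[dist <= tau]: equals 1 inside
    the threshold, 0 beyond 2*tau, and decays linearly in between.
    This ramp is illustrative; the paper's exact smoothing may differ."""
    if dist <= tau:
        return 1.0
    if dist >= 2.0 * tau:
        return 0.0
    return 2.0 - dist / tau  # linear ramp on (tau, 2*tau)

def concentration_score(dists, tau):
    """Smoothed count of users whose gradients are close to a candidate center."""
    return sum(smooth_closeness(d, tau) for d in dists)
```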
Main Result and Implications
Using our sub-procedure and sensitivity bounds, we achieve a utility rate of
for smooth functions defined over an -ball, with gradients bounded in the -norm and diagonally dominant Hessians (see Section 2 for detailed assumptions). We also construct a lower bound:
using the fingerprinting lemma, showing that our upper bound is nearly optimal except for the dependence on . We discuss this loose dependence further in Section 5.
These assumptions are favorable in terms of the properties of the norm: the -ball is the largest among all -balls, and Lipschitz continuity in the -norm implies that the -norm of the gradient is bounded, which is the weakest possible assumption on gradient norms.
Comparison with Prior Work
The best-known item-level rate for this setting (i.e., Lipschitz in the -norm and optimization over an -ball) is: as established in the work of [AFKT21]. To our knowledge, our result is the first to extend item-level rates to the user-level setting, incorporating the dependence on .
Existing user-level DP-SCO results have primarily been studied in Euclidean spaces, where functions are assumed to be Lipschitz in the -norm and optimized over an -ball. Since the diameter of an -ball is , applying existing linear-time algorithms to our setting yields a suboptimal rate: This suboptimal dependence on arises because existing methods privatize all intermediate steps, leading to excessive noise accumulation. However, we acknowledge that our approach requires an additional assumption on the diagonal dominance of Hessians. Despite this restriction, our techniques are well-suited to the -ball and gradient norm setting. We provide a detailed discussion of our assumptions, limitations, and open problems in Section 5.
DP and Robustness
There is a rich body of work exploring the connections between DP and robustness, with robust statistics playing a central role in many DP applications [DL09, SM12, LKO22, AUZ23]. However, to the best of our knowledge, the application of robust statistics in private optimization remains relatively under-explored. We view our work as an important step in this direction and hope it inspires further research into leveraging robust statistics for private optimization.
2 Preliminaries
In user-level DP-SCO, we are given a dataset of users, where is user ’s data, consisting of datapoints drawn i.i.d. from an (unknown) underlying distribution . The objective is to minimize the following population function under the user-level DP constraint:
In this section, we present the key definitions and assumptions. Discussions regarding the limitations of these assumptions can be found in Section 5. Additional tools, including those from differential privacy, are deferred to Appendix A.
Definition 2.1 (Lipschitz).
We say a function is -Lipschitz with respect to the -norm if, for any , we have . This means that for any , where .
Definition 2.2 (Smooth).
In this work, we say a function is -smooth if , where for a symmetric matrix . This implies that for any .
Definition 2.3 (Diagonal Dominance).
A matrix is diagonally dominant if
Assumption 2.4.
Assumption 2.5.
The Hessian of each function in the universe is diagonally dominant.
Diagonal dominance, although somewhat restrictive, is a commonly discussed assumption in the literature. For example, [WGZ+21] demonstrated the convergence rate of SGD in heavy-tailed settings under the assumption of diagonal dominance. Similarly, [DASD24] studied Adam’s preconditioning effect for quadratic functions with diagonally dominant Hessians. In the case of 1-hidden layer neural networks (a common focus of the NTK line of work), the Hessian is diagonal (see Section 3 in [LZB20]). Additionally, it has been shown that in practice, Hessians are typically block diagonal dominant [MG15, BRB17]. We also discuss potential ways to avoid this assumption in Section 5.
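Diagonal dominance is straightforward to verify numerically; a small helper for checking the row-wise condition |H_ii| >= sum over j != i of |H_ij|:

```python
import numpy as np

def is_diagonally_dominant(H):
    """Return True if |H[i, i]| >= sum_{j != i} |H[i, j]| for every row i."""
    A = np.abs(np.asarray(H, dtype=float))
    diag = np.diag(A)
    off_diag = A.sum(axis=1) - diag
    return bool(np.all(diag >= off_diag))
```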
Notation
For , we use to denote its th coordinate. For a vector and a convex set , we use . For and , we use to denote the -ball centered at of radius .
3 Main Algorithm
We present our main result in this section and explain the algorithm in a top-down manner. The algorithm is based on the localization framework of [FKT20]; see Algorithm 5 in the Appendix for details. The main result is stated formally below:
Theorem 3.1.
We briefly describe the localization framework. In the first phase, it runs (non-private) SGD using half of the dataset, and averages the iterates to get . Roughly speaking, the solution already provides a good approximation with a small population loss when the datasets are drawn i.i.d. from the underlying distribution. However, to ensure privacy, we require a sensitivity bound on and add noise correspondingly to privatize , yielding the private solution .
A naive bound on the excess loss due to the privatization is given by
but the magnitude of the noise is typically too large to achieve a good utility guarantee. Nevertheless, this process yields a much better initial point compared to the original starting point . As a result, a smaller dataset and a smaller step size are sufficient to find the next good solution in expectation, with smaller noise added to privatize .
This process is repeated over phases, where each subsequent solution is progressively refined, and the Gaussian noise becomes negligible. Ultimately, this iterative refinement balances privacy and utility, as established in Theorem 3.1. The formal argument about the utility guarantee and proof can be found in Lemma 3.
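The phased refinement described above can be sketched as follows. The schedule (halving the data and quartering the step size each phase, with noise proportional to the step size) follows the spirit of the [FKT20] localization framework but is an illustrative placeholder, and `sgd_phase` is a hypothetical caller-supplied routine returning the averaged iterate of one phase.

```python
import numpy as np

def localization_sgd(x0, data, sgd_phase, noise_scale, num_phases, eta0, rng):
    """Phased localization sketch: each phase runs (non-private) SGD from the
    current iterate with a geometrically shrinking step size, then privatizes
    only the averaged iterate of that phase."""
    x = np.asarray(x0, dtype=float)
    n = len(data)
    start = 0
    for i in range(num_phases):
        batch = data[start:start + n // (2 ** (i + 1))]  # shrinking data share
        start += len(batch)
        eta_i = eta0 / (4.0 ** i)                        # shrinking step size
        x_bar = sgd_phase(x, batch, eta_i)               # average iterate
        # privatize only the averaged iterate; noise shrinks with the step size
        x = x_bar + rng.normal(0.0, noise_scale * eta_i, size=x.shape)
    return x
```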
Our main contribution is in Algorithm 1, which uses a novel gradient estimation sub-procedure.
Iteration Sensitivity of Algorithm 1:
The contractivity of gradient descent plays a crucial role in the sensitivity analysis, for which we need the Hessians to be diagonally dominant (Assumption 2.5).
Lemma (Contractivity). Suppose is a convex and -smooth function satisfying Assumption 2.5. Then, for any two points , with step size , we have
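To see why diagonal dominance yields contractivity in the l_inf norm, consider a quadratic f(x) = (1/2) x^T H x: one gradient step is the linear map x -> (I - eta*H) x, and diagonal dominance keeps the l_inf operator norm of I - eta*H at most 1 for a small enough step. A numeric spot-check on an arbitrary diagonally dominant example (not the paper's construction):

```python
import numpy as np

# An arbitrary symmetric, diagonally dominant matrix (hence PSD by Gershgorin).
H = np.array([[2.0, 0.5, 0.5],
              [0.5, 3.0, 1.0],
              [0.5, 1.0, 2.0]])
eta = 1.0 / np.max(np.sum(np.abs(H), axis=1))  # step below 1/smoothness

def gd_step(x):
    # gradient of f(x) = 0.5 * x^T H x is H x
    return x - eta * (H @ x)

rng = np.random.default_rng(0)
for _ in range(200):
    x, y = rng.normal(size=3), rng.normal(size=3)
    # the step is non-expansive in the l_inf norm
    assert np.max(np.abs(gd_step(x) - gd_step(y))) <= np.max(np.abs(x - y)) + 1e-12
```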
Now, we discuss Algorithm 1. Given the dataset , we proceed in steps. At the th step, we draw users and compute the average gradient of each user. We then apply our gradient estimation algorithm (Algorithm 2) and perform normal gradient descent for steps.
In the second phase of Algorithm 1, we perform the concentration test (Algorithm 3) on the gradients at each step based on (Algorithm 4). If the concentration test passes for all steps (i.e., for all ), we output the average iterate. Otherwise, the algorithm fails and returns the initial point. As mentioned in the Introduction, the crucial novelty of Algorithm 1 and Algorithm 2 lies in bounding the sensitivity of each solution by incorporating the (coordinate-wise) robust statistics into SGD.
We utilize robust statistics in the gradient estimation sub-procedure. We make the following assumptions regarding the robust statistics used:
Assumption 3.2.
Given a set of vectors, let be any robust statistic satisfying the following properties:
(i) For any , if there exists a point such that more than points lie within , then .
(ii) If we perturb each point such that for any , and let be the robust statistic of , then .
(iii) For any real numbers and , if for each , and is the corresponding robust statistic of , then .
Remark 3.3.
Common robust statistics, such as the (coordinate-wise) median and trimmed mean, satisfy Assumption 3.2.
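For concreteness, here is a one-dimensional trimmed mean together with a randomized check of the perturbation property (item (ii) of Assumption 3.2); the trimming level k is an illustrative choice.

```python
import numpy as np

def trimmed_mean(xs, k):
    """Drop the k smallest and k largest values and average the rest."""
    s = np.sort(np.asarray(xs, dtype=float))
    return float(np.mean(s[k:len(s) - k]))

rng = np.random.default_rng(0)
delta, k = 0.05, 5
for _ in range(100):
    xs = rng.normal(size=60)
    ys = xs + rng.uniform(-delta, delta, size=xs.shape)
    # property (ii): moving every point by at most delta moves the statistic
    # by at most delta, since each order statistic is 1-Lipschitz
    assert abs(trimmed_mean(ys, k) - trimmed_mean(xs, k)) <= delta + 1e-12
```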
In Algorithm 2, we output the means if they are close to the robust statistics, in order to control the bias, and project the means coordinate-wise onto the sphere around the robust statistics when they are far apart. However, we still need to ensure that the sensitivity remains bounded when the projection is applied. The following technical lemma plays a crucial role in establishing iteration sensitivity in the presence of such projection operations.
Lemma. Consider four real numbers , such that , and . Let and for any . Then, we have
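The lemma's precise statement is partially elided above; under a natural reading (clipping a value y onto the interval [x - r, x + r] is jointly 1-Lipschitz when both x and y are perturbed by at most delta), the following randomized spot-check illustrates why the projection step does not amplify sensitivity:

```python
import numpy as np

def clip_to_interval(y, x, r):
    """Project y onto the interval [x - r, x + r]."""
    return min(max(y, x - r), x + r)

rng = np.random.default_rng(0)
r = 1.0
for _ in range(1000):
    x, y = rng.normal(), rng.normal()
    delta = abs(rng.normal()) * 0.1
    x2 = x + rng.uniform(-delta, delta)
    y2 = y + rng.uniform(-delta, delta)
    # moving x and y each by at most delta moves the projection by at most delta
    assert abs(clip_to_interval(y2, x2, r) - clip_to_interval(y, x, r)) <= delta + 1e-12
```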
Unfortunately, we are unaware of any robust statistic satisfying Assumption 3.2 in high dimensions under the -norm, and Lemma 3.3 does not hold in high dimensions either. These limitations restrict the applicability of our techniques in high-dimensional Euclidean spaces; see Section 5.
Let and be two trajectories corresponding to neighboring datasets that differ by one user. The crucial technical novelty is that, for any , we can control as long as the number of “bad” users in each phase ( in total) does not exceed the “break point”, say . Without loss of generality, assume that is the differing user in the neighboring dataset pairs considered in the following proof.
The first property of Assumption 3.2 ensures that when the number of “bad” users in each phase does not exceed the “break point” , the robust statistic remains close to most of the gradients, allowing us to control . To formalize this, we say that the neighboring dataset pair is -aligned if there exist points and such that and , where
This definition essentially states that the number of “bad” users does not exceed the “break point” in either or , ensuring that most gradients remain well-aligned within a bounded region.
Lemma. For some (unknown) parameter , suppose is -aligned. Then, by running Algorithm 2 with threshold parameter , we have .
The sequential sensitivity can be bounded using induction, with the base case already established. The formal statement is provided in Lemma 3.
(3)
Query Sensitivity of Concentration Test (Algorithm 3):
We have established iteration sensitivity for any aligned neighboring dataset pair . Next, we analyze the influence of the concentration test, which we use to check whether the number of “bad” users exceeds the “break point”.
To apply the privacy guarantee of (Lemma A.3), it suffices to bound the sensitivity of each query in the concentration test. Recall that we assume in the neighboring datasets. Thus, by the definition (Equation (3)), it is straightforward to observe that
(4)
Next, we consider the sensitivity of for . The sensitivity is proportional to , which we have already bounded by . Note that we can bound the iteration sensitivity if the neighboring datasets are aligned, meaning the number of “bad” users does not exceed the “break point”. We first show that if the number of “bad” users exceeds the “break point”, the algorithm is likely to halt after the first step by failing the first test.
Lemma. Suppose and we set . Suppose that for any point , we get , where . Then, with probability at least , the concentration test returns .
We now analyze the query sensitivity between the aligned neighboring datasets.
Lemma (Query Sensitivity). Suppose . Suppose is -aligned and we set the threshold parameter in Algorithm 4. Then the sensitivity of the query is bounded by . That is,
Equation (4) shows that the sensitivity is always bounded for . Lemma 3 shows that if the number of “bad” users exceeds the “break point”, we obtain , and the query sensitivities of the subsequent queries do not need to be considered. Lemma 3 establishes the query sensitivity in the concentration test when the neighboring datasets are aligned, and the number of "bad" users is below the threshold.
Privacy proof.
The final privacy guarantee, stated formally below, now follows easily from the previous lemmas. The full proof is deferred to Appendix B.7.
Utility proof.
We apply the localization framework in private optimization to finish the utility argument. We analyze the utility guarantee of Algorithm 1 based on the classic convergence rate of SGD on smooth convex functions (Lemma A.6) as follows:
Lemma. Let be any point in the domain. Suppose the dataset of users, whose size is larger than , is drawn i.i.d. from the distribution . Setting , the final output of Algorithm 1 satisfies
Now we apply the localization framework. We set to make the term depending on it negligible. The proof of the following lemma mostly follows from [FKT20].
4 Lower Bound
This section presents our main lower bound. As stated above, the lower bound is nearly tight, apart from lower-order terms and the dependency on .
Theorem 4.1.
The non-private term represents the information-theoretic lower bound for SCO under these assumptions (see, e.g., Theorem 1 in [AWBR09]).
5 Conclusion
We present a linear-time algorithm for user-level DP-SCO that leverages (coordinate-wise) robust statistics in the gradient estimation subprocedure and provide a lower bound that nearly matches the upper bound up to logarithmic terms and an additional dependence on . Despite this progress, several limitations and open problems remain, some of which we highlight below.
• Generalization to Euclidean and Other Spaces. Extending our results to Euclidean and other spaces is an interesting but technically challenging problem. One key challenge is the lack of robust statistics that are -Lipschitz under perturbations in the high-dimensional -norm (see the second item in Assumption 3.2). One may wonder whether commonly used robust statistics, such as the geometric median, are stable in this sense. However, we provide a simple counterexample involving the geometric median in the Appendix (Section B.10). Another challenge arises from our approach of projecting the mean towards the robust statistic to ensure unbiased gradient estimation. This projection is -Lipschitz under perturbations in one dimension (Lemma 3.3), but there are known counterexamples in higher dimensions [ALT24, Lemma 16]. Overcoming these issues is crucial for extending our method to general spaces.
• Additional Assumption on Diagonal Dominance. Our analysis assumes that the Hessians of functions in the universe are diagonally dominant, which is primarily used to show that gradient descent is contractive in the -norm (Lemma 3). This assumption is somewhat restrictive compared to the -norm setting, where it is sufficient to assume that the operator norm of the Hessians is bounded (i.e., smoothness). Addressing the aforementioned challenges and generalizing our results to the Euclidean space could potentially eliminate this additional assumption.
• Suboptimal Dependence on . Although our upper bound nearly matches the lower bound, it has a suboptimal dependence on . This issue arises from the loose dependence on sensitivity. Specifically, we draw users at each step and compute their average gradient, with . However, the final sensitivity remains roughly and does not improve with larger . An improvement in the dependence on could be achieved if the sensitivity of the robust statistic could benefit from larger .
Finally, it would be interesting to explore the use of robust statistics, such as the median used in this work, to address other private optimization problems.
References
- [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in l1 geometry. In ICML, pages 393–403, 2021.
- [AL24] Hilal Asi and Daogao Liu. User-level differentially private stochastic convex optimization: Efficient algorithms with optimal rates. In AISTATS, pages 4240–4248, 2024.
- [ALT24] Hilal Asi, Daogao Liu, and Kevin Tian. Private stochastic convex optimization with heavy tails: Near-optimality from simple reductions. arXiv, 2406.02789, 2024.
- [AUZ23] Hilal Asi, Jonathan Ullman, and Lydia Zakynthinou. From robustness to privacy and back. In ICML, pages 1121–1146, 2023.
- [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter Bartlett, and Pradeep Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In NIPS, 2009.
- [BFGT20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. arXiv, 2006.06914, 2020.
- [BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In NeurIPS, pages 11282–11291, 2019.
- [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In COLT, pages 474–499, 2021.
- [BRB17] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical gauss-newton optimisation for deep learning. In ICML, pages 557–565, 2017.
- [BS23] Raef Bassily and Ziteng Sun. User-level private stochastic convex optimization with optimal rates. In ICML, pages 1838–1851, 2023.
- [BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
- [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
- [CCGT24] Christopher A Choquette-Choo, Arun Ganesh, and Abhradeep Guha Thakurta. Optimal rates for o (1)-smooth dp-sco with a single epoch and large batches. In ALT, 2024.
- [CTW+21] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX Security, pages 2633–2650, 2021.
- [DASD24] Rudrajit Das, Naman Agarwal, Sujay Sanghavi, and Inderjit S Dhillon. Towards quantifying the preconditioning effect of Adam. arXiv, 2402.07114, 2024.
- [DK09] Stephane Durocher and David Kirkpatrick. The projection median of a set of points. Computational Geometry, 42(5):364–375, 2009.
- [DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC, pages 371–380, 2009.
- [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
- [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
- [FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In STOC, pages 439–449, 2020.
- [GKK+23] Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, and Chiyuan Zhang. On user-level private convex optimization. arXiv, 2305.04912, 2023.
- [GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. Private convex optimization via exponential mechanism. In COLT, pages 1948–1989, 2022.
- [JNG+19] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv, 1902.03736, 2019.
- [KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In COLT, pages 1853–1902, 2019.
- [LKO22] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In COLT, pages 1167–1246, 2022.
- [LLA24] Andrew Lowy, Daogao Liu, and Hilal Asi. Faster algorithms for user-level private stochastic convex optimization. In NeurIPS, 2024.
- [LSA+21] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. Learning with user-level privacy. NeurIPS, pages 12466–12479, 2021.
- [LSS+23] Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Analyzing leakage of personally identifiable information in language models. In S & P, pages 346–363, 2023.
- [LZB20] Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. In NeurIPS, pages 15954–15964, 2020.
- [MG15] James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
- [SHW22] Jinyan Su, Lijie Hu, and Di Wang. Faster rates of private stochastic convex optimization. In ALT, pages 995–1002, 2022.
- [SM12] Aleksandra Slavkovic and Roberto Molinari. Perturbed m-estimation: A further investigation of robust statistics for differential privacy. In Statistics in the Public Interest: In Memory of Stephen E. Fienberg, pages 337–361. Springer, 2012.
- [WGZ+21] Hongjian Wang, Mert Gurbuzbalaban, Lingjiong Zhu, Umut Simsekli, and Murat A Erdogdu. Convergence rates of stochastic gradient descent under infinite noise variance. In NeurIPS, pages 18866–18877, 2021.
- [XZ24] Zheng Xu and Yanxiang Zhang. Advances in private training for production on-device language models. https://research.google/blog/advances-in-private-training-for-production-on-device-language-models/, 2024. Google Research Blog.
- [ZTC22] Qinzi Zhang, Hoang Tran, and Ashok Cutkosky. Differentially private online-to-batch for smooth losses. In NeurIPS, 2022.
Appendix A Preliminaries
A.1 Differential Privacy
Definition A.1 (User-level DP).
We say a mechanism is -user-level DP if, for any neighboring datasets and that differ in one user, and any output event set , we have
Proposition A.2 (Gaussian Mechanism).
Consider a function . If , where means and are neighboring datasets, then the Gaussian mechanism
where with is -DP.
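As a concrete illustration (not the paper's algorithm), the Gaussian mechanism of Proposition A.2 can be sketched in a few lines. The noise calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps is the classical one from [DR14], valid for eps <= 1; the function name and interface are our own:

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-DP via the Gaussian mechanism.

    Uses the classical calibration sigma = sensitivity * sqrt(2 ln(1.25/delta)) / eps
    from [DR14], which is valid for eps <= 1.
    """
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    value = np.asarray(value, dtype=float)
    # Add isotropic Gaussian noise, one coordinate per output dimension.
    return value + rng.normal(0.0, sigma, size=value.shape)
```

Here `sensitivity` is the L2-sensitivity of the query over neighboring datasets, which in the user-level setting means datasets differing in one user's entire contribution.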
A.1.1 AboveThreshold
Our algorithms use the AboveThreshold algorithm [DR14], which is a key tool in DP to identify whether there is a query in a stream of queries that is above a certain threshold . The algorithm (Algorithm 4 presented in the Appendix) has the following guarantees:
Lemma A.3 ([DR14], Theorem 3.24).
is -DP. Moreover, let and . For any sequence of queries each of sensitivity , halts at time such that with probability at least ,
- For all , and ;
- and or .
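A minimal sketch of AboveThreshold (Algorithm 1 of [DR14]) is given below. The Laplace noise scales 2*sensitivity/eps for the threshold and 4*sensitivity/eps for each query match the standard analysis; the query/data representation is purely illustrative:

```python
import numpy as np

def above_threshold(queries, data, threshold, eps, sensitivity=1.0, rng=None):
    """AboveThreshold / sparse-vector technique [DR14, Algorithm 1].

    Scans a stream of queries (each of the given sensitivity) and halts at
    the first one whose noisy value exceeds the noisy threshold. The whole
    scan is eps-DP regardless of the stream length.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Noisy threshold, drawn once for the entire stream.
    noisy_T = threshold + rng.laplace(scale=2.0 * sensitivity / eps)
    for t, q in enumerate(queries):
        # Fresh noise per query, at twice the threshold's scale.
        if q(data) + rng.laplace(scale=4.0 * sensitivity / eps) >= noisy_T:
            return t  # index of the first (noisily) above-threshold query
    return None  # no query exceeded the threshold
```

With high probability the reported index is correct up to an O(log(k)/eps) additive slack in the query values, matching the guarantee in Lemma A.3.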
A.2 SubGaussian and Norm-SubGaussian Random Vectors
Definition A.4.
Let . We say a random vector is SubGaussian () with parameter if for any . A random vector is Norm-SubGaussian with parameter () if for all .
Theorem A.5 (Hoeffding-type inequality for norm-subGaussian, [JNG+19]).
Let be random vectors, and let for be the corresponding filtration. Suppose that for each , is zero-mean . Then there exists an absolute constant such that, for any ,
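The scaling asserted by Theorem A.5 can be checked empirically: a sum of n i.i.d. zero-mean unit vectors (which are norm-subGaussian with parameter O(1)) has norm on the order of sqrt(n), not n. The simulation below is a sanity check under these illustrative parameters, not part of the proof:

```python
import numpy as np

def norm_subgaussian_sum_demo(n=4000, d=20, trials=200, seed=0):
    """Max norm of the sum of n i.i.d. uniform unit vectors over many trials.

    Each unit vector is zero-mean and norm-subGaussian (its norm is bounded
    by 1), so Theorem A.5 predicts the sum has norm O(sqrt(n log(1/delta))).
    """
    rng = np.random.default_rng(seed)
    max_norm = 0.0
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        X /= np.linalg.norm(X, axis=1, keepdims=True)  # uniform unit vectors
        max_norm = max(max_norm, np.linalg.norm(X.sum(axis=0)))
    return max_norm
```

Since E[||sum||^2] = n for such vectors, the maximum over trials stays within a small constant factor of sqrt(n).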
A.3 Optimization
Lemma A.6 ([Bub15]).
Consider a -smooth convex function over a convex set . For any , suppose that the random initial point satisfies . Suppose we have an unbiased stochastic gradient oracle such that . Then running SGD for steps with a fixed step size satisfies
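For concreteness, a minimal projected-SGD loop of the kind analyzed in Lemma A.6 might look as follows; the fixed step size, Euclidean-ball constraint set, and averaged iterate are illustrative choices, and the oracle is any unbiased stochastic gradient:

```python
import numpy as np

def projected_sgd(grad_oracle, x0, steps, eta, radius):
    """Fixed-step-size SGD with Euclidean projection onto {x : ||x|| <= radius}.

    Returns the average iterate, whose expected suboptimality for smooth
    convex objectives is bounded as in Lemma A.6.
    """
    x = np.array(x0, dtype=float)
    avg = np.zeros_like(x)
    for _ in range(steps):
        x = x - eta * grad_oracle(x)          # stochastic gradient step
        nrm = np.linalg.norm(x)
        if nrm > radius:
            x *= radius / nrm                  # project back onto the ball
        avg += x / steps                       # running average of iterates
    return avg
```

Averaging the iterates, rather than returning the last one, is what yields the stated rate for nonstrongly convex objectives.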
Appendix B Proofs for Section 3
B.1 Proof of Lemma 3
See 3
Proof.
By the diagonal dominance assumption and precondition that , we know
Note that by the mean-value theorem,
where is in the segment between and . Hence we have
∎
B.2 Proof of Lemma 3.3
See 3.3
Proof.
Without loss of generality, we assume . We proceed by case analysis.
Case (I): if no projection happens. Then it is straightforward that and and hence .
Case (II): if one projection happens. Without loss of generality, assume we project and get . We analyze the following sub-cases:
(i): . In this case we know .
If , then we know .
If , then which is impossible.
(ii): . If , then we know .
Consider the subsubcase when . If , then .
Else if , as , we have .
Case (III): if two projections happen.
(i): . This is a trivial case.
(ii): . This case is also trivial.
(iii): . We can show that .
(iv): . If , then we know .
Else, if , then we know .
∎
B.3 Proof of Lemma 3.3
See 3.3
Proof.
It suffices to show that . By the first property of Assumption 3.2 and the preconditions in the statement, we know , which leads to that
Moreover, we have that the robust statistic by the triangle inequality as and .
By the projection operation in Algorithm 2, we know and , and hence we know . This completes the proof. ∎
B.4 Proof of Lemma 3
See 3
Proof.
Recall that we assume all users are equal to but .
We actually show that
as the projection operator onto the same convex set is contractive in the - and -norms in our case.
Let and . Note that the users used in computing the gradients are the same. Let be the output of Algorithm 2 with as inputs, and be the corresponding output of . By the third property of Assumption 3.2, it suffices to show that
(6)
By Lemma 3, we know
B.5 Proof of Lemma 3
See 3
Proof.
The main randomness comes from the Laplace noise added to the queries and the threshold. Under the precondition that for any , we know the concentration score
Then by Lemma A.3, with probability at least , we have
which means . ∎
B.6 Proof of Lemma 3
See 3
B.7 Proof of Lemma 3
See 3
Proof.
Consider the implementation over two neighboring datasets and , and use the prime notation to denote the corresponding variables with respect to . Without loss of generality, we assume the differing user is used in the first phase.
To avoid confusion, let and be the outputs of Algorithm 1 with neighboring inputs and , and be the outputs of Algorithm 5 by privatizing and with Gaussian noise respectively.
The outputs and of Algorithm 1 depend on the intermediate random variables and .
As the query sensitivity for is always bounded, by our parameter setting and property of , we know .
We proceed by case analysis.
(i) is not -aligned. Then by Lemma 3, either or . Then by a union bound and , we have
If , then is the initial point. We have for any event range ,
This completes the privacy guarantee for case (i).
(ii) is -aligned. In this case, by Lemma 3.3 and Lemma 3, we know . Moreover, Lemma 3 implies that the query sensitivity is always bounded by 1. Then by the property of (Lemma A.3), we have . For any event range , we have
where we use the Gaussian Mechanism (Proposition A.2) to bound by . This completes the proof.
∎
B.8 Proof of Lemma 3
See 3
Proof.
By the Hoeffding inequality for norm-subGaussian vectors (Theorem A.5), for each and each , we have
Then, conditioning on the event that for all and , our setting of parameters ensures that we pass all the concentration tests with , and that
which means . Note that and when all functions are drawn i.i.d. from the distribution. By the small TV distance between and the good gradient estimation , it follows from Lemma A.6 that
where the last term comes from the worst value , and the small failure probability . ∎
B.9 Proof of Lemma 3
See 3
B.10 Counterexample of the 1-Lipschitz of Geometric Median
We use the counterexample from [DK09] to show that the geometric median is unstable even in two-dimensional space. Recall that, given a set of points in , the geometric median of , denoted by , is
Let and , where is a very small perturbation. However, we know and .
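For readers who want to reproduce such instability experiments numerically, geometric medians can be computed with Weiszfeld's fixed-point iteration. The sketch below is a generic implementation (not code from the paper); the epsilon guard in the denominator is our own safeguard for iterates landing on a data point:

```python
import numpy as np

def geometric_median(points, iters=200, tol=1e-9):
    """Weiszfeld's algorithm: fixed-point iteration for the geometric median,
    argmin_y sum_i ||y - x_i||. Each step re-weights points by the inverse of
    their distance to the current iterate.
    """
    pts = np.asarray(points, dtype=float)
    y = pts.mean(axis=0)  # initialize at the centroid
    for _ in range(iters):
        # Inverse distances; the epsilon avoids division by zero when the
        # iterate coincides with a data point.
        w = 1.0 / np.maximum(np.linalg.norm(pts - y, axis=1), 1e-12)
        y_new = (pts * w[:, None]).sum(axis=0) / w.sum()
        if np.linalg.norm(y_new - y) < tol:
            return y_new
        y = y_new
    return y
```

For symmetric configurations (e.g. the corners of a square) the iteration returns the center, and comparing outputs on two slightly perturbed point sets exhibits the non-Lipschitz behavior discussed above.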
Appendix C Details of Lower Bound
Now we construct a lower bound for the weighted sign estimation error.
Weighted sign estimation error.
We construct a distribution as follows: for each coordinate , we draw uniformly at random from , and i.i.d. for . The objective is to minimize the weighted sign estimation error with respect to .
Lemma C.1.
Let . For any -user-level-DP algorithm , there exists a distribution such that and almost surely, and, given dataset i.i.d. drawn from , we have
First, by an existing reduction, we can reduce the user-level setting to the item-level setting.
Lemma C.2 ([LSA+21]).
Suppose each user observes with known. For any -user-level-DP algorithm , there exists an -item-level-DP algorithm that takes inputs , where , and has the same performance as .
Hence by Lemma C.2, it suffices to consider the item-level lower bound. We prove the following lemma:
Lemma C.3.
Let . Let be i.i.d. drawn from . If is -DP, and
then .
By scale invariance, it suffices to consider the case when . To prove Lemma C.3, we need the following fingerprinting lemma:
Lemma C.4 (Lemma 6.8 in [KLSU19]).
For every fixed number and every , define . We have
(7)
By choosing uniformly from , we make the following observation about the expectation on the RHS of Equation (7).
Lemma C.5.
We have
Proof.
Using integration by parts, we have
∎
Now we use the fingerprinting lemma.
Lemma C.6.
One has
Proof.
Fix a column .
Construct the for our purpose. Define to be
That is, is the expectation of conditioned on . Similarly, define to be
That is, is the expectation of conditional on .
A small weighted sign error means a large , as demonstrated by the following lemma:
Lemma C.7.
Let . Suppose , and
then we have
Proof.
Note that
Moreover, one has that
This completes the proof. ∎
It remains to bound the sample complexity needed to achieve a large value of
(8)
Now we complete the proof of Lemma C.3.
Proof of Lemma C.3.
Let . We use the DP constraint of to upper bound .
Let denote with replaced with an independent draw from . Define
Due to the independence between and , we have . Moreover, as and , we have .
Split with and and split similarly. By the property of DP, we know
Then we have
where the last inequality used the fact that when .
When , setting , we have
It is standard to translate the sample complexity lower bound (Lemma C.3) into the error lower bound (Lemma C.1). We present a proof below.
Proof of Lemma C.1.
Let be the set of item-level DP mechanisms, and let be the set of user-level DP mechanisms.
Define the error term:
where .
Recall that we construct the distribution as follows: for each coordinate , we draw uniformly at random from , and i.i.d. for .
Let be the following: for each coordinate , we draw uniformly at random from , and i.i.d. for . This corresponds to averaging the samples of each user. By Lemma C.2, we have
By Lemma C.3,
Let . When we have a large dataset of size , construct , where is a Dirac distribution at .
Hence, by a Chernoff bound, with high probability, the number of samples drawn from in the dataset is no more than . Then we have
The precondition of requires almost surely for . However, involves Gaussian distributions, so we construct by truncating them.
More specifically, for each item drawn from , we set with . In other words, we get by projecting to . Fixing , we first show
(9)
It suffices to consider any coordinate , and prove
Letting and , we know
where is the density function of the standard Gaussian . As and , we establish Equation (9). Let denote the mean of conditional on .
Moreover, we have
where inequality (i) follows from the fact that we can always simulate samples from using samples from , which means the problem over is harder than the one over . This completes the proof. ∎
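The truncation step used in the proof above to enforce almost-sure boundedness can be sketched as follows. The radius and the standard-Gaussian samples are illustrative, and the demo only checks that clipping at a few standard deviations barely moves the mean, in the spirit of the bound in Equation (9):

```python
import numpy as np

def truncate_to_interval(samples, R):
    """Clip each scalar sample z to [-R, R], i.e. z -> sign(z) * min(|z|, R);
    this is the projection used to make the lower-bound instance bounded a.s.
    """
    z = np.asarray(samples, dtype=float)
    return np.clip(z, -R, R)

# Clipping a standard Gaussian at R = 3 standard deviations moves the
# empirical mean by a negligible amount, since the two tails nearly cancel.
rng = np.random.default_rng(0)
z = rng.normal(0.0, 1.0, size=200_000)
mean_shift = abs(truncate_to_interval(z, 3.0).mean() - z.mean())
```

The same projection applies coordinate-wise in higher dimensions; the key point is that the truncated distribution is close to the original, so lower bounds transfer.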