

Linear-Time User-Level DP-SCO via Robust Statistics

Badih Ghazi, Ravi Kumar, Daogao Liu, Pasin Manurangsi
Google
Email: [email protected]
Abstract

User-level differentially private stochastic convex optimization (DP-SCO) has garnered significant attention due to the paramount importance of safeguarding user privacy in modern large-scale machine learning applications. Current methods, such as those based on differentially private stochastic gradient descent (DP-SGD), often struggle with high noise accumulation and suboptimal utility due to the need to privatize every intermediate iterate. In this work, we introduce a novel linear-time algorithm that leverages robust statistics, specifically the median and trimmed mean, to overcome these challenges. Our approach uniquely bounds the sensitivity of all intermediate iterates of SGD with gradient estimation based on robust statistics, thereby significantly reducing the gradient estimation noise for privacy purposes and enhancing the privacy-utility trade-off. By sidestepping the repeated privatization required by previous methods, our algorithm not only achieves an improved theoretical privacy-utility trade-off but also maintains computational efficiency. We complement our algorithm with an information-theoretic lower bound, showing that our upper bound is optimal up to logarithmic factors and the dependence on ε. This work sets the stage for more robust and efficient privacy-preserving techniques in machine learning, with implications for future research and application in the field.

1 Introduction

With the rapid development and widespread applications of modern machine learning and artificial intelligence, particularly driven by advancements in large language models (LLMs), privacy concerns have come to the forefront. For example, recent studies have highlighted significant privacy risks associated with LLMs, including well-documented instances of training data leakage [CTW+21, LSS+23]. These challenges underscore the urgent need for privacy-preserving mechanisms in machine learning systems.

Differential Privacy (DP) [DMNS06] has emerged as a rigorous mathematical framework for ensuring privacy and is now the gold standard for addressing privacy concerns in machine learning. The classic definition of DP, referred to as item-level DP, guarantees that the replacement of any single training example has a negligible impact on the model's output. Formally, a mechanism ℳ is said to satisfy (ε, δ)-item-level DP if, for any pair 𝒟 and 𝒟′ of neighboring datasets that differ by a single item, and for any event 𝒪 ⊆ Range(ℳ), the following condition holds:

Pr[ℳ(𝒟) ∈ 𝒪] ≤ e^ε · Pr[ℳ(𝒟′) ∈ 𝒪] + δ.  (1)

Item-level DP provides a strong guarantee against the leakage of private information associated with any single element in a dataset. However, in many real-world applications, a single user may contribute multiple elements to the training dataset [XZ24]. This introduces the need to safeguard the privacy of each user as a whole, which motivates a stronger notion of user-level DP. This notion ensures that replacing one user (who may contribute up to m items) in the dataset has only a negligible effect on the model's output. Formally, (1) still holds when 𝒟 and 𝒟′ differ by one user, i.e., up to m items in total. Notably, when m = 1, user-level DP reduces to item-level DP.

DP-SCO

As one of the central problems in privacy-preserving machine learning and statistical learning, DP stochastic convex optimization (DP-SCO) has garnered significant attention in recent years (e.g., [BST14, BFTT19, FKT20, BFGT20, BGN21, SHW22, GLL22, ALT24]). In DP-SCO, we are provided with a dataset {Z_i}_{i∈[n]} of users, where each user Z_i contributes samples drawn from an underlying distribution 𝒫. The goal is to minimize the population objective function under the DP constraint:

F(x) := 𝔼_{z∼𝒫} f(x; z),

where f(x; z): 𝒳 → ℝ is a convex function defined on the convex domain 𝒳.

DP-SCO was initially studied in the item-level setting, where significant progress has been made with numerous exciting results. The study of user-level DP-SCO, to our knowledge, was first initiated by [LSA+21], who considered the problem in Euclidean spaces, with the Lipschitz constant and the diameter of the convex domain defined under the ℓ₂-norm. They achieved an error rate of O(1/√(mn) + d/(nε√m)) for smooth functions, along with a lower bound of Ω(1/√(mn) + √d/(nε√m)).

Building on this, [BS23] introduced new algorithms based on DP-SGD, incorporating improved mean-estimation procedures to achieve an asymptotically optimal rate of 1/√(nm) + √d/(n√m·ε). However, their approach relies on the smoothness of the loss function and imposes parameter restrictions, including n ≥ √d/ε and m ≤ max{√d, nε²/√d}. On the other hand, [GKK+23] observed that user-level DP-SCO has low local sensitivity to user deletions. Using the propose-test-release mechanism, they developed algorithms applicable even to non-smooth functions and requiring only n ≥ log(d)/ε users. However, their approach is computationally inefficient, running in super-polynomial time and achieving a sub-optimal error rate of O(1/√(nm) + √d/(n√m·ε^{2.5})). [AL24] proposed an algorithm that achieves optimal excess risk in polynomial time, requiring only n ≥ log(md)/ε users and accommodating non-smooth losses. However, their algorithm is also computationally expensive: for β-smooth losses, it requires β·(nm)^{3/2} gradient evaluations, while for non-smooth losses, it requires (nm)³ gradient evaluations. Motivated by these inefficiencies, [LLA24] focused on improving the computational cost of user-level DP-SCO while maintaining optimal excess risk. For β-smooth losses, they designed an algorithm requiring max{β^{1/4}·(nm)^{9/8}, β^{1/2}·n^{1/4}·m^{5/4}} gradient evaluations; for non-smooth losses, they achieved the same optimal excess risk using n^{11/8}·m^{5/4} evaluations.

Linear-time algorithms have also been explored in the user-level DP-SCO setting. [BS23] proposed a linear-time algorithm in the local DP model, achieving an error rate of O(√d/(√(nm)·ε)) under the constraints m < d/ε², n > d/ε², and β ≤ √(n³/(m·d³)). Similarly, [LLA24] achieved the same rate under slightly relaxed conditions, requiring n ≥ log(ndm)/ε and β ≤ √(nmd).

In our work, we consider user-level DP-SCO under ℓ∞-norm assumptions, in contrast to most prior results, which were established under the ℓ₂-norm. This distinction is significant, as the ℓ∞-norm provides stronger per-coordinate guarantees and hence is more desirable. Furthermore, the ℓ∞-norm enjoys properties that are crucial to our algorithmic design but do not hold in the ℓ₂-norm setting. These properties influence both the development of our algorithm and the corresponding theoretical guarantees. The implications of this distinction and its role in our results will be discussed in detail in the subsequent sections.

1.1 Technical Challenges in User-Level DP-SCO

A key challenge in solving user-level DP-SCO using DP-SGD lies in obtaining more accurate gradient estimates while maintaining privacy. Consider a simple scenario where we perform gradient descent for t steps and seek an estimate of ∇F(x_t). To achieve this, we sample B users and compute q_t(Z_i) := (1/m) Σ_{z∈Z_i} ∇f(x_t; z), the average of the m gradients from user Z_i at point x_t. If each user's m functions are i.i.d. draws from the distribution 𝒫, then with high probability, we know that ‖q_t(Z_i) − ∇F(x_t)‖ ≤ Õ(1/√m).

This naturally leads to the following mean-estimation problem: given points q_t(Z_1), …, q_t(Z_B) in the unit ball, with most of them likely to be within a distance of 1/√m from each other (under the i.i.d. assumption for utility guarantees), how can we accurately and privately estimate their mean?

A straightforward approach to recover the item-level rate is to apply the Gaussian mechanism:

(1/B) Σ_{i∈[B]} q_t(Z_i) + 𝒩(0, σ₁² I_d),  (2)

where the noise level is set as σ₁ ∝ 1/B.
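As a concrete illustration, the naive mechanism in (2) can be sketched as follows (a minimal NumPy sketch; the function name is ours, and the exact calibration of σ₁ to (ε, δ) is omitted):

```python
import numpy as np

def gaussian_mechanism_mean(user_grads, sigma, rng=None):
    """Privatize the average of per-user gradient estimates q_t(Z_i)
    with isotropic Gaussian noise, as in (2).  Illustrative sketch:
    the constant linking sigma to (epsilon, delta) is omitted."""
    rng = np.random.default_rng(0) if rng is None else rng
    mean = user_grads.mean(axis=0)               # (1/B) sum_i q_t(Z_i)
    return mean + rng.normal(0.0, sigma, size=mean.shape)

B, d = 8, 3
grads = np.ones((B, d))                          # B users, d dimensions
out = gaussian_mechanism_mean(grads, sigma=1.0 / B)  # sigma_1 scales as 1/B
```

The scaling σ₁ ∝ 1/B reflects that replacing one user changes the average by at most O(1/B) when the gradients lie in the unit ball.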

Mean Estimation and Sensitivity Control

To improve upon this, prior works [AL24, LLA24] designed mean-estimation sub-procedures with the following properties:

  • Outlier Detection: The procedure tests whether the number of “bad” users (whose gradients significantly deviate from the majority) exceeds a predefined threshold (or “break point”).

  • Outlier Removal and Sensitivity Reduction: If the number of “bad” users is below the threshold, the procedure removes outliers and produces an estimate g_t with sensitivity Õ(1/(B√m)). The privatized gradient is then g_t + 𝒩(0, σ₂² I_d), where σ₂ ∝ 1/(B√m).

  • Better Variance Control: When all users provide consistent estimates, the output follows

    (1/B) Σ_{i∈[B]} q_t(Z_i) + 𝒩(0, σ₂² I_d),  σ₂ ∝ 1/(B√m),

    resulting in significantly smaller noise compared to the naive Gaussian mechanism in (2).

By leveraging such sub-procedures, prior works have achieved the optimal excess risk rate in polynomial time. However, extending these to obtain linear-time algorithms poses new challenges.

Challenges in Linear-Time User-Level DP-SCO

Several linear-time algorithms exist for item-level DP-SCO. [ZTC22] and [CCGT24] achieved the optimal item-level rate O(1/√n + √d/(nε)) under the smoothness assumption β = O(1). (One can roughly recover the optimal item-level rate by setting m = 1 in the user-level bounds.) Their approach maintains privacy by adding noise to all intermediate iterations of DP-SGD.

[FKT20] notably relaxed the smoothness requirement to β ≤ √n + √d/ε by analyzing the stability of non-private SGD. They showed that for neighboring datasets, the sequences {x_t}_{t∈[T]} and {x′_t}_{t∈[T]} remain close, ensuring that the sensitivity of the average iterate (1/T) Σ_{t∈[T]} x_t is low. This allows them to apply the Gaussian mechanism directly to privatize the average iterate.

Motivated by this stability-based analysis, [LLA24] attempted to generalize the linear-time approach of [FKT20] to the user-level setting. However, a key difficulty arises when incorporating the mean-estimation sub-procedure. Specifically, even if one can bound ‖x_t − x′_t‖, where {x_t} and {x′_t} represent the trajectories corresponding to neighboring datasets, there is no clear understanding of how applying the sub-procedure impacts stability in subsequent iterations. In particular, after performing one gradient descent step using gradient estimates from the sub-procedure, we have no guarantee on how well ‖x_{t+1} − x′_{t+1}‖ remains bounded.

Due to this lack of stability analysis for the sub-procedure, [LLA24] resorted to privatizing all iterations, resulting in excessive Gaussian noise accumulation. Consequently, their algorithm achieved only a suboptimal error rate of O(√d/(√(nm)·ε)), highlighting the fundamental challenge of designing a linear-time user-level DP-SCO algorithm by controlling the stability of the sub-procedures.

Generalizing Other Linear-Time Algorithms

The linear-time algorithms proposed in [ZTC22] and [CCGT24] for item-level DP-SCO represent promising approaches to generalize to the user-level setting. These algorithms achieve the optimal item-level rate O(1/√n + √d/(nε)) by privatizing all intermediate iterations of variants of DP-SGD. This approach avoids the need for additional stability analysis, as the noise added at every iteration directly ensures privacy without relying on intermediate sensitivity bounds. Generalizing such algorithms to the user-level setting may, therefore, be easier compared to other approaches, as they sidestep the stability issues associated with mean-estimation sub-procedures.

In a private communication, the authors of [LLA24] indicated that it is possible to generalize the linear-time algorithm of [ZTC22] to the user-level setting. However, this generalization introduces a dependence on the smoothness parameter β, which may impose restrictive smoothness constraints on the types of functions for which the algorithm is effective. Despite this limitation, such a generalization represents a natural direction for extending linear-time algorithms to user-level DP-SCO.

While these developments are promising, a more challenging and interesting direction lies in generalizing stability-based analyses from the item-level setting to the user-level setting. Stability-based methods, as seen in [FKT20], rely on carefully bounding the sensitivity of the entire optimization trajectory. Extending this approach to the user-level setting requires incorporating an appropriate sub-procedure for mean estimation, tailored to handle user-level sensitivity. This introduces additional layers of complexity, as the interactions between the sub-procedure and the iterative optimization process must be carefully analyzed to ensure stability and privacy. Overcoming these challenges would not only advance the theoretical understanding of user-level DP-SCO but might also lead to more efficient algorithms with broader applicability.

1.2 Our Techniques and Contributions

In this work, we design a novel mean-estimation sub-procedure based on robust statistics, such as the median and trimmed mean, specifically tailored for user-level DP-SCO. By incorporating the sub-procedure into (non-private) SGD, we establish an upper bound on ‖x_t − x′_t‖∞ for all iterations t ∈ [T]. This ensures stability throughout the optimization process.

Key Idea: 1-Lipschitz Property of Robust Statistics

In one dimension, many robust statistics, such as the median, satisfy a 1-Lipschitz property. This means that if each data point is perturbed by a distance of at most ι, the robust statistic shifts by at most ι. This property makes robust statistics particularly well-suited for mean estimation in user-level DP-SCO.

To see this, recall that we define q_t(Z_i) := (1/m) Σ_{z∈Z_i} ∇f(x_t; z) as the average of the m gradients from user Z_i at point x_t. Similarly, we define q′_t(Z_i) := (1/m) Σ_{z∈Z_i} ∇f(x′_t; z) for the gradients at x′_t. If |x_t − x′_t| is bounded, then by the smoothness of f, we have |q_t(Z_i) − q′_t(Z_i)| ≤ β|x_t − x′_t|. By the 1-Lipschitz property, this implies that the distance between the robust statistics computed from {q_t(Z_i)}_{i∈[B]} and {q′_t(Z_i)}_{i∈[B]} also remains bounded by β|x_t − x′_t|. As a result, the desired stability is naturally established in the one-dimensional setting.
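The 1-Lipschitz property of the one-dimensional median is easy to check numerically (an illustrative sketch; variable names are ours):

```python
import numpy as np

# Numerical check of the 1-Lipschitz property of the 1-D median:
# if every point moves by at most iota, the median moves by at most iota.
rng = np.random.default_rng(0)
xs = rng.normal(size=101)
perturb = rng.uniform(-0.05, 0.05, size=101)   # per-point shift, |.| <= iota
iota = np.max(np.abs(perturb))
shift = abs(np.median(xs + perturb) - np.median(xs))
assert shift <= iota + 1e-12
```

The bound holds because sorting is order-preserving under small perturbations: each order statistic, and hence the median, moves by at most the largest per-point shift.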

Extending Stability to High Dimensions

In high-dimensional settings, robust statistics that satisfy the 1-Lipschitz property are not well understood. To address this, we adopt coordinate-wise robust statistics for gradient estimation. This approach ensures stability at each coordinate level. In turn, this allows us to establish iteration sensitivity in the \ell_{\infty}-norm.

Debiasing Technique

While robust statistics are effective in controlling sensitivity, using them directly introduces a significant bias in gradient estimation. This bias occurs even in "good" datasets where all functions are i.i.d. from the underlying distribution. If not handled properly, the bias can dominate the utility guarantee and degrade performance.

To address this issue, we propose a novel debiasing technique: if the mean and the robust statistic are sufficiently close, we directly use the mean; otherwise, we project the mean onto the ball centered at the robust statistic. Both the mean and the robust statistic individually satisfy the 1-Lipschitz property, and we prove that this coordinate-wise projection preserves the 1-Lipschitz property in the ℓ∞-norm. The resulting robust mean-estimation sub-procedure ensures bounded iteration sensitivity while remaining unbiased when the dataset is well-behaved, which holds with high probability when all functions are i.i.d. draws from the distribution.

Improving Robust Mean Estimation: Smoothed Concentration Test

To further enhance stability, we introduce a smoothed version of the concentration score for testing whether the number of “bad” users exceeds a threshold (the “break point”). Prior works relied on an indicator function, 𝟏(‖q_t(Z_i) − q_t(Z_j)‖ ≤ 1/τ), which is non-smooth and prone to instability. We replace this with a smoother function, exp(−τ‖q_t(Z_i) − q_t(Z_j)‖), which allows for a more stable and robust concentration test by providing a continuous measure of closeness.
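A minimal sketch contrasting clustered and outlying users under the smoothed score (the function name and the aggregation over pairs are our illustration):

```python
import numpy as np

def concentration_scores(q, tau):
    """Smoothed concentration score: for each user i, sum over the other
    users j of exp(-tau * ||q_i - q_j||), a continuous stand-in for the
    indicator 1(||q_i - q_j|| <= 1/tau).  (Illustrative sketch.)"""
    dists = np.linalg.norm(q[:, None, :] - q[None, :, :], axis=-1)
    return np.exp(-tau * dists).sum(axis=1) - 1.0  # drop the self-term

rng = np.random.default_rng(1)
inliers = 0.01 * rng.normal(size=(9, 4))   # tightly clustered users
outlier = np.full((1, 4), 10.0)            # one "bad" user far away
scores = concentration_scores(np.vstack([inliers, outlier]), tau=1.0)
# inliers each score close to 8; the outlier's score is near 0
```

Unlike the indicator, a small shift in any q_i moves every score continuously, which is what makes the test stable under the iterate perturbations analyzed above.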

Main Result and Implications

Using our sub-procedure and sensitivity bounds, we achieve a utility rate of

Õ(d/√(nm) + d^{3/2}/(n·√m·ε²)),

for smooth functions defined over an ℓ∞-ball, with gradients bounded in the ℓ₁-norm and diagonally dominant Hessians (see Section 2 for detailed assumptions). We also construct a lower bound:

Ω̃(d/√(nm) + d^{3/2}/(n·√m·ε)),

using the fingerprinting lemma, showing that our upper bound is nearly optimal except for the dependence on ε. We discuss this loose dependence further in Section 5.

These assumptions are favorable in terms of the properties of the norm: the ℓ∞-ball is the largest among all ℓ_p-balls, and Lipschitz continuity in the ℓ₁-norm implies that the ℓ∞-norm of the gradient is bounded, which is the weakest possible assumption on gradient norms.

Comparison with Prior Work

The best-known item-level rate for this setting (i.e., Lipschitz in the ℓ₁-norm and optimization over an ℓ∞-ball) is O(d/√n + d^{3/2}/(nε)), as established in the work of [AFKT21]. To our knowledge, our result is the first to extend item-level rates to the user-level setting, incorporating the dependence on m.

Existing user-level DP-SCO results have primarily been studied in Euclidean spaces, where functions are assumed to be Lipschitz in the ℓ₂-norm and optimized over an ℓ₂-ball. Since the ℓ₂-diameter of an ℓ∞-ball is √d, applying existing linear-time algorithms to our setting yields a suboptimal rate of O(d^{3/2}/(√(nm)·ε)). This suboptimal dependence on n arises because existing methods privatize all intermediate steps, leading to excessive noise accumulation. However, we acknowledge that our approach requires an additional assumption on the diagonal dominance of Hessians. Despite this restriction, our techniques are well-suited to the ℓ∞-ball and gradient norm setting. We provide a detailed discussion of our assumptions, limitations, and open problems in Section 5.

DP and Robustness

There is a rich body of work exploring the connections between DP and robustness, with robust statistics playing a central role in many DP applications [DL09, SM12, LKO22, AUZ23]. However, to the best of our knowledge, the application of robust statistics in private optimization remains relatively under-explored. We view our work as an important step in this direction and hope it inspires further research into leveraging robust statistics for private optimization.

2 Preliminaries

In user-level DP-SCO, we are given a dataset 𝒟 = {Z_i}_{i∈[n]} of n users, where Z_i ∈ 𝒵^m is user i's data, consisting of m datapoints drawn i.i.d. from an (unknown) underlying distribution 𝒫. The objective is to minimize the following population function under the user-level DP constraint:

F(x) := 𝔼_{z∼𝒫} f(x; z).

In this section, we present the key definitions and assumptions. Discussions regarding the limitations of these assumptions can be found in Section 5. Additional tools, including those from differential privacy, are deferred to Appendix A.

Definition 2.1 (Lipschitz).

We say a function f: 𝒳 → ℝ is G-Lipschitz with respect to the ℓ_p-norm if, for any x, y ∈ 𝒳, we have |f(x) − f(y)| ≤ G‖x − y‖_p. This means ‖∇f(x)‖_q ≤ G for any x ∈ 𝒳, where 1/p + 1/q = 1.

Definition 2.2 (Smooth).

In this work, we say a function f is β-smooth if ‖∇²f(x)‖∞ ≤ β for all x ∈ 𝒳, where ‖A‖∞ := max_i Σ_j |A_{i,j}| for a symmetric matrix A. This implies that ‖∇f(x) − ∇f(y)‖∞ ≤ β‖x − y‖∞ for any x, y ∈ 𝒳.

Definition 2.3 (Diagonal Dominance).

A matrix A ∈ ℝ^{d×d} is diagonally dominant if

|A_{i,i}| ≥ Σ_{j≠i} |A_{i,j}|,  ∀i ∈ [d].
Assumption 2.4.

Each function f(·; z): 𝒳 → ℝ in the universe is convex, G-Lipschitz with respect to the ℓ₁-norm (Definition 2.1), and β-smooth (Definition 2.2). 𝒳 is a ball of radius D in the ℓ∞-norm.

Assumption 2.5.

The Hessian of each function f(·; z) in the universe is diagonally dominant.

Diagonal dominance, although somewhat restrictive, is a commonly discussed assumption in the literature. For example, [WGZ+21] demonstrated the convergence rate of SGD in heavy-tailed settings under the assumption of diagonal dominance. Similarly, [DASD24] studied Adam’s preconditioning effect for quadratic functions with diagonally dominant Hessians. In the case of 1-hidden layer neural networks (a common focus of the NTK line of work), the Hessian is diagonal (see Section 3 in [LZB20]). Additionally, it has been shown that in practice, Hessians are typically block diagonal dominant [MG15, BRB17]. We also discuss potential ways to avoid this assumption in Section 5.

Notation

For X ∈ ℝ^d, we use X[i] to denote its i-th coordinate. For a vector X ∈ ℝ^d and a convex set 𝒳 ⊂ ℝ^d, we use Π_𝒳(X) := argmin_{Y∈𝒳} ‖Y − X‖₂. For X ∈ ℝ^d and r ≥ 0, we use B∞(X, r) to denote the ℓ∞-ball centered at X of radius r.

3 Main Algorithm

We present our main result in this section and explain the algorithm in a top-down manner. The algorithm is based on the localization framework of [FKT20]; see Algorithm 5 in the Appendix for details. The main result is stated formally below:

Theorem 3.1.

Under Assumptions 2.4 and 2.5, suppose β ≤ (G/D)·(√n·ε/(√m·log(nmd/δ)) + √(d·log(1/δ)·log(nmd))/(√m·ε)), ε ≤ O(1), n ≥ log²(nd/δ)/ε, and m ≤ n^{O(log log n)}. Setting η ≤ (D/G)·min{B√m/√n, √m·ε/√(d·log(1/δ)·log(nmd))}, B = 100·log(mnd/δ)/ε, τ = O(G·log(nmd)/√m), and υ = 0.9B + 2log(T/δ)/ε, Algorithm 5 is (ε, δ)-user-level-DP. When the nm functions in dataset 𝒟 are drawn i.i.d. from the underlying distribution 𝒫, it takes mn gradient computations and outputs x_S such that

𝔼[F(x_S) − F(x*)] ≤ Õ(d/√(nm) + d^{3/2}/(n·ε²·√m)).

We briefly describe the localization framework. In the first phase, it runs (non-private) SGD using half of the dataset and averages the iterates to get x̄₁. Roughly speaking, the solution x̄₁ already provides a good approximation with a small population loss when the datasets are drawn i.i.d. from the underlying distribution. However, to ensure privacy, we require a sensitivity bound on x̄₁ and add noise ζ₁ correspondingly to privatize x̄₁, yielding the private solution x₁ ← x̄₁ + ζ₁.

A naive bound on the excess loss due to the privatization is given by

𝔼[F(x₁) − F(x̄₁)] ≤ G‖ζ₁‖₂,

but the magnitude of the noise ‖ζ₁‖₂ is typically too large to achieve a good utility guarantee. Nevertheless, this process yields a much better initial point x₁ compared to the original starting point x₀. As a result, a smaller dataset and a smaller step size are sufficient to find the next good solution x̄₂ in expectation, with smaller noise ‖ζ₂‖₂ added to privatize x̄₂.

This process is repeated over O(log n) phases, where each subsequent solution x̄_S is progressively refined and the Gaussian noise ‖ζ_S‖₂ becomes negligible. Ultimately, this iterative refinement balances privacy and utility, as established in Theorem 3.1. The formal utility guarantee and its proof can be found in Lemma 3.

Our main contribution is in Algorithm 1, which uses a novel gradient estimation sub-procedure.

1 Input: dataset 𝒟, privacy parameters ε, δ, other parameters η, τ, υ, B, initial point x₀;
2 Divide 𝒟 into B disjoint subsets of equal size, denoted by {𝒟_i}_{i∈[B]}, with 𝒟_i = {Z_{i,t}}_{t∈[|𝒟|/B]};
3 Set T = |𝒟|/B;
4 for Step t = 1, …, T do
5       For each i ∈ [B], get q_t(Z_{i,t}) := (1/m) Σ_{z∈Z_{i,t}} ∇f(x_{t−1}; z);
6       Let g_{t−1} be the output of Algorithm 2 with inputs {q_t(Z_{i,t})}_{i∈[B]} and threshold 1/τ;
7       x_t = Π_𝒳(x_{t−1} − η·g_{t−1})
8 end for
/* Concentration Test */
/* Recall the query q_t(Z_{i,t}) for each t ∈ [T], i ∈ [B] from above */
9 Run Algorithm 3 with queries {q_t}_{t∈[T]} and parameters 𝒟_t, ε, δ/(2Tmnd), τ, υ to get answers {a_t}_{t∈[T]};
10 if a_t = ⊤, ∀t ∈ [T] then
11       Return: Average iterate x̄ = (1/T) Σ_{t∈[T]} x_t;
12 end if
13 else
14       Output: Initial point x₀;
15 end if
Algorithm 1 SGD for User-level DP-SCO
Iteration Sensitivity of Algorithm 1:

The contractivity of gradient descent plays a crucial role in the sensitivity analysis, for which we need the Hessians to be diagonally dominant (Assumption 2.5).

Lemma (Contractivity). Suppose f: 𝒳 → ℝ is a convex and β-smooth function satisfying Assumption 2.5. Then for any two points x, y ∈ 𝒳, with step size η ≤ 2/β, we have

‖(x − η∇f(x)) − (y − η∇f(y))‖∞ ≤ ‖x − y‖∞.
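For a quadratic loss f(x) = ½xᵀAx with a symmetric, diagonally dominant A with nonnegative diagonal (hence positive semidefinite, so f is convex), the contractivity lemma can be verified numerically (an illustrative check under these assumptions; the names are ours):

```python
import numpy as np

def grad_step_gap(A, eta, x, y):
    """For f(x) = 0.5 x^T A x, compare the l_inf distance between two
    points before and after one gradient step x -> x - eta * A x."""
    before = np.max(np.abs(x - y))
    after = np.max(np.abs((x - eta * A @ x) - (y - eta * A @ y)))
    return after, before

# symmetric, diagonally dominant Hessian (nonnegative diagonal => PSD)
A = np.array([[2.0, 0.5, 0.5],
              [0.5, 1.5, 0.2],
              [0.5, 0.2, 1.0]])
beta = np.max(np.abs(A).sum(axis=1))   # smoothness beta = ||A||_inf
rng = np.random.default_rng(2)
x, y = rng.normal(size=3), rng.normal(size=3)
after, before = grad_step_gap(A, 2.0 / beta, x, y)
assert after <= before + 1e-12
```

The check succeeds because diagonal dominance together with η ≤ 2/β forces ‖I − ηA‖∞ ≤ 1, so the gradient step is non-expansive in the ℓ∞-norm.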

Now, we discuss Algorithm 1. Given the dataset 𝒟, we proceed in T = |𝒟|/B steps. At the t-th step, we draw B users {Z_{i,t}}_{i∈[B]} and compute the average gradient of each user. We then apply our gradient estimation algorithm (Algorithm 2) and perform standard gradient descent for T steps.

In the second phase of Algorithm 1, we perform the concentration test (Algorithm 3) on the B gradients at each step, based on AboveThreshold (Algorithm 4). If the concentration test passes for all steps (i.e., a_t = ⊤ for all t ∈ [T]), we output the average iterate. Otherwise, the algorithm fails and returns the initial point. As mentioned in the Introduction, the crucial novelty of Algorithm 1 and Algorithm 2 lies in bounding the sensitivity of each solution x_t by incorporating the (coordinate-wise) robust statistics into SGD.

1 Input: a set of d-dimensional vectors {X_i}_{i∈[B]}, threshold parameter ς > 0;
2 for Each dimension j = 1, …, d do
3       Compute the robust statistic X_rs[j] and the mean x̄[j] over {X_i[j]}_{i∈[B]};
4       if |X_rs[j] − x̄[j]| ≥ ς then
5             Set X_est[j] = Π_{B(X_rs[j], ς)}(x̄[j]);
6       end if
7       else
8             Set X_est[j] = x̄[j];
9       end if
10 end for
Return X_est
Algorithm 2 Gradient Estimation based on Robust Statistics
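A minimal Python sketch of Algorithm 2, instantiating the robust statistic with the coordinate-wise median (which satisfies Assumption 3.2 by Remark 3.3); the function name is ours:

```python
import numpy as np

def robust_gradient_estimate(X, varsigma):
    """Sketch of Algorithm 2 with the coordinate-wise median as the
    robust statistic: per coordinate, keep the mean if it is within
    varsigma of the median; otherwise project (clip) the mean onto
    the interval of radius varsigma around the median."""
    med = np.median(X, axis=0)    # robust statistic, per coordinate
    mean = np.mean(X, axis=0)
    # coordinate-wise projection of the mean onto B(med, varsigma);
    # when |mean - med| < varsigma the clip returns the mean unchanged
    return np.clip(mean, med - varsigma, med + varsigma)

# 9 consistent users and one corrupted user in d = 2
X = np.vstack([np.full((9, 2), 0.5), np.array([[100.0, -100.0]])])
est = robust_gradient_estimate(X, varsigma=0.1)
# the plain mean is ~[10.45, -9.55]; the robust estimate stays near 0.5
```

With ς on the order of the inlier spread, a single corrupted user moves the estimate by at most ς per coordinate, while on well-behaved inputs the plain mean is returned unchanged.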

We utilize robust statistics in the gradient estimation sub-procedure. We make the following assumptions regarding the robust statistics used:

Assumption 3.2.

Given a set {X_i}_{i∈[B]} of vectors, let X_rs be any robust statistic satisfying the following properties:

(i) For any ρ ≥ 0, if there exists a point X′ such that more than B/2 points lie within B∞(X′, ρ), then X_rs ∈ B∞(X′, ρ).

(ii) If we perturb each point to Y_i = X_i + ι_i with ‖ι_i‖∞ ≤ Δ for any Δ ≥ 0, and let Y_rs be the robust statistic of {Y_i}, then ‖X_rs − Y_rs‖∞ ≤ Δ.

(iii) For any real numbers a and b, if Z_i = aX_i + b for each i ∈ [B], and Z_rs is the corresponding robust statistic of {Z_i}_{i∈[B]}, then Z_rs = aX_rs + b.

Remark 3.3.

Common robust statistics, such as the (coordinate-wise) median and trimmed mean, satisfy Assumption 3.2.

In Algorithm 2, we output the means when they are close to the robust statistics, which controls the bias, and project the means onto the balls around the robust statistics in a coordinate-wise manner when they are far apart. However, we still need to ensure that the sensitivity remains bounded when the projection is applied. The following technical lemma plays a crucial role in establishing iteration sensitivity in the presence of projection operations.

{lem}

[] Consider four real numbers a,b,c,da,b,c,d, such that |ab|1|a-b|\leq 1, and |cd|1|c-d|\leq 1. Let c=ΠB(a,r)(c)c^{\prime}=\Pi_{B(a,r)}(c) and d=ΠB(b,r)(d)d^{\prime}=\Pi_{B(b,r)}(d) for any r0r\geq 0. Then, we have |cd|1.|c^{\prime}-d^{\prime}|\leq 1.
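The lemma can be sanity-checked by a randomized search for counterexamples (a numerical illustration, not a proof):

```python
import numpy as np

def proj(center, r, x):
    """Projection of x onto the interval [center - r, center + r]."""
    return min(max(x, center - r), center + r)

rng = np.random.default_rng(1)
for _ in range(10_000):
    a = rng.uniform(-5, 5)
    b = a + rng.uniform(-1, 1)      # |a - b| <= 1
    c = rng.uniform(-5, 5)
    d = c + rng.uniform(-1, 1)      # |c - d| <= 1
    r = rng.uniform(0, 3)
    # The two projected points remain within distance 1 of each other.
    assert abs(proj(a, r, c) - proj(b, r, d)) <= 1 + 1e-12
```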

Unfortunately, we are unaware of any robust statistic satisfying Assumption 3.2 in high dimensions under the 2\ell_{2}-norm, and Lemma 3.3 does not hold in high dimensions either. These limitations restrict the applicability of our techniques in high-dimensional Euclidean spaces; see Section 5.

Let {xt}t[T]\{x_{t}\}_{t\in[T]} and {yt}t[T]\{y_{t}\}_{t\in[T]} be two trajectories corresponding to neighboring datasets that differ in one user. The crucial technical novelty is that, for any t[T]t\in[T], we can control xtyt\|x_{t}-y_{t}\|_{\infty} as long as the number of “bad” users in each phase (out of BB users in total) does not exceed the “break point”, say 2B/32B/3. Without loss of generality, assume that Z1,1Z1,1Z_{1,1}\neq Z_{1,1}^{\prime} is the differing user in the neighboring dataset pair (𝒟,𝒟)(\mathcal{D},\mathcal{D}^{\prime}) considered in the proofs that follow.

The first property of Assumption 3.2 ensures that when the number of “bad” users in each phase does not exceed the “break point” 2B/32B/3, the robust statistic remains close to most of the gradients, allowing us to control x1y1\|x_{1}-y_{1}\|_{\infty}. To formalize this, we say that the neighboring dataset pair (𝒟,𝒟)(\mathcal{D},\mathcal{D}^{\prime}) is ρ\rho-aligned if there exist points XX^{\prime} and YY^{\prime} such that |Xgood|2B/3|X_{\mathrm{good}}|\geq 2B/3 and |Ygood|2B/3|Y_{\mathrm{good}}|\geq 2B/3, where

Xgood={q1(Zi,1):q1(Zi,1)B(X,ρ),i[B]}, and X_{\mathrm{good}}=\{q_{1}(Z_{i,1}):q_{1}(Z_{i,1})\in B_{\infty}(X^{\prime},\rho),i\in[B]\},\text{ and }
Ygood={q1(Zi,1):q1(Zi,1)B(Y,ρ),i[B]}.Y_{\mathrm{good}}=\{q_{1}^{\prime}(Z_{i,1}^{\prime}):q_{1}^{\prime}(Z_{i,1}^{\prime})\in B_{\infty}(Y^{\prime},\rho),i\in[B]\}.

This definition essentially states that the number of “bad” users does not exceed the “break point” in either 𝒟\mathcal{D} or 𝒟\mathcal{D}^{\prime}, ensuring that most gradients remain well-aligned within a bounded region.

{lem}

[] For some (unknown) parameter ρ>0\rho>0, suppose (𝒟,𝒟)(\mathcal{D},\mathcal{D}^{\prime}) is ρ\rho-aligned. Then, by running Algorithm 2 with threshold parameter ς0\varsigma\geq 0, we have x1y1η(4ρ+2ς)\|x_{1}-y_{1}\|_{\infty}\leq\eta(4\rho+2\varsigma).

The sequential sensitivity can be bounded using induction, with the base case x1y1\|x_{1}-y_{1}\|_{\infty} already established. The formal statement is provided in Lemma 3.

1 Input: Dataset 𝒟=(Z1,,ZB)\mathcal{D}=(Z_{1},\ldots,Z_{B}), privacy parameters ε,δ\varepsilon,\delta, parameters τ,υ\tau,\upsilon;
2 for t=1,,Tt=1,\ldots,T do
3       Receive a new concentration query qt:𝒵dq_{t}:\mathcal{Z}\to\mathbb{R}^{d};
4       Define the concentration score
st𝖼𝗈𝗇𝖼(𝒟,τ):=1BZ𝒟Z𝒟exp(τqt(Z)qt(Z))\displaystyle s^{\mathsf{conc}}_{t}(\mathcal{D},\tau):=\frac{1}{B}\sum_{Z\in\mathcal{D}}\sum_{Z^{\prime}\in\mathcal{D}}\exp(-\tau\|q_{t}(Z)-q_{t}(Z^{\prime})\|_{\infty})\; (3)
Return 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽(st𝖼𝗈𝗇𝖼,ε/2,υ)\mathsf{AboveThreshold}(s^{\mathsf{conc}}_{t},\varepsilon/2,\upsilon)
5 end for
Algorithm 3 Concentration Test
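The concentration score of Equation (3) is a direct computation over per-user gradients; the following Python sketch (function name ours) shows the quantity fed to AboveThreshold:

```python
import numpy as np

def concentration_score(grads, tau):
    """s_t^conc(D, tau) = (1/B) * sum_{i,j} exp(-tau * ||g_i - g_j||_inf)
    over the B per-user gradient vectors g_i, as in Equation (3)."""
    G = np.asarray(grads, dtype=float)                          # shape (B, d)
    pairwise = np.abs(G[:, None, :] - G[None, :, :]).max(-1)    # pairwise l_inf
    return np.exp(-tau * pairwise).sum() / len(G)
```

If all B gradients coincide, the score attains its maximum value B, while well-spread gradients drive it toward zero; comparing the noisy score against the threshold υ ≈ 0.9B is what flags a phase with too many “bad” users.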
{lem}

[Iteration Sensitivity] If we use a robust statistic satisfying Assumption 3.2 in Algorithm 2, then for all t[T]t\in[T], we have xtytx1y1\|x_{t}-y_{t}\|_{\infty}\leq\|x_{1}-y_{1}\|_{\infty}.

Lemmas 3.3 and 3 together establish the iteration sensitivity of Algorithm 1.

Query Sensitivity of Concentration Test (Algorithm 3):

We have established iteration sensitivity for any aligned neighboring dataset pair (𝒟,𝒟)(\mathcal{D},\mathcal{D}^{\prime}). Next, we analyze the influence of the concentration test, which we use to check whether the number of “bad” users exceeds the “break point”.

To apply the privacy guarantee of 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold} (Lemma A.3), it suffices to bound the sensitivity of each query in the concentration test. Recall that we assume Z1,1Z1,1Z_{1,1}\neq Z_{1,1}^{\prime} in the neighboring datasets. Thus, by the definition (Equation (3)), it is straightforward to observe that

|s1𝖼𝗈𝗇𝖼(𝒟,τ)s1𝖼𝗈𝗇𝖼(𝒟,τ)|2.\displaystyle|s^{\mathsf{conc}}_{1}(\mathcal{D},\tau)-s^{\mathsf{conc}}_{1}(\mathcal{D}^{\prime},\tau)|\leq 2. (4)

Next, we consider the sensitivity of st𝖼𝗈𝗇𝖼s^{\mathsf{conc}}_{t} for t2t\geq 2. The sensitivity is proportional to xtyt\|x_{t}-y_{t}\|_{\infty}, which we have already bounded by x1y1\|x_{1}-y_{1}\|_{\infty}. Note that we can bound the iteration sensitivity if the neighboring datasets are aligned, meaning the number of “bad” users does not exceed the “break point”. We first show that if the number of “bad” users exceeds the “break point”, the algorithm is likely to halt after the first step by failing the first test.

{lem}

[] Suppose B100log(T/δ)εB\geq\frac{100\log(T/\delta)}{\varepsilon}, εO(1)\varepsilon\leq O(1), and set υ=0.9B+2log(T/δ)ε\upsilon=0.9B+\frac{2\log(T/\delta)}{\varepsilon}. Suppose that for every point YY we have |Xgood|<B/3|X_{\mathrm{good}}|<B/3, where Xgood={q1(Zi,1):q1(Zi,1)B(Y,1/τ),i[B]}X_{\mathrm{good}}=\{q_{1}(Z_{i,1}):q_{1}(Z_{i,1})\in B_{\infty}(Y,1/\tau),i\in[B]\}. Then, with probability at least 1δ/(Texp(ε))1-\delta/(T\exp(\varepsilon)), 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold} returns a1=a_{1}=\bot.

We now analyze the query sensitivity between the aligned neighboring datasets.

{lem}

[Query Sensitivity] Suppose 6βηB16\beta\eta B\leq 1 and that (𝒟,𝒟)(\mathcal{D},\mathcal{D}^{\prime}) is (1/τ)(1/\tau)-aligned. Setting the threshold parameter ς=1/τ\varsigma=1/\tau in Algorithm 2, the sensitivity of each subsequent query is bounded by 22. That is,

|st𝖼𝗈𝗇𝖼(𝒟,τ)st𝖼𝗈𝗇𝖼(𝒟,τ)|2,\displaystyle|s^{\mathsf{conc}}_{t}(\mathcal{D},\tau)-s^{\mathsf{conc}}_{t}(\mathcal{D}^{\prime},\tau)|\leq 2, t2.\displaystyle\forall t\geq 2.

Equation (4) shows that the sensitivity is always bounded for s1𝖼𝗈𝗇𝖼s^{\mathsf{conc}}_{1}. Lemma 3 shows that if the number of “bad” users exceeds the “break point”, we obtain a1=a_{1}=\bot, and the query sensitivities of the subsequent queries do not need to be considered. Lemma 3 establishes the query sensitivity in the concentration test when the neighboring datasets are aligned, and the number of "bad" users is below the threshold.

Privacy proof.

The final privacy guarantee, stated formally below, now follows easily from the previous lemmas. The full proof is deferred to Appendix B.7.

{lem}

[Privacy Guarantee] Under Assumption 2.4 and Assumption 2.5, suppose εO(1),B100log(T/δ)ε\varepsilon\leq O(1),B\geq\frac{100\log(T/\delta)}{\varepsilon}, then Algorithm 5 is (ε,δ)(\varepsilon,\delta)-user-level-DP.

Utility proof.

We apply the localization framework in private optimization to finish the utility argument. We analyze the utility guarantee of Algorithm 1 based on the classic convergence rate of SGD on smooth convex functions (Lemma A.6) as follows:

{lem}

[] Let x𝒳x\in\mathcal{X} be any point in the domain. Suppose the dataset 𝒟\mathcal{D} of users, whose size |𝒟||\mathcal{D}| is larger than 100log(T/δ)ε\frac{100\log(T/\delta)}{\varepsilon}, is drawn i.i.d. from the distribution 𝒫\mathcal{P}. Setting τ=Glog(nmd/ω)/m\tau=G\log(nmd/\omega)/\sqrt{m}, the final output x¯\overline{x} of Algorithm 1 satisfies

𝔼[F(x¯)F(x)](β+1η)𝔼[x0x2]T+ηG2dBm+GDdω.\displaystyle\operatorname*{\mathbb{E}}[F(\overline{x})-F(x)]\lesssim\left(\beta+\frac{1}{\eta}\right)\frac{\operatorname*{\mathbb{E}}[\|x_{0}-x\|^{2}]}{T}+\frac{\eta G^{2}d}{Bm}+GDd\omega.

Now we apply the localization framework. We set ω=1/(nmd)3\omega=1/(nmd)^{3} to make the term depending on it negligible. The proof of the following lemma mostly follows from [FKT20].

{lem}

[Localization] Under Assumption 2.4 and Assumption 2.5, suppose βGDnεmlog(nmd/δ)\beta\leq\frac{G}{D}\cdot\frac{\sqrt{n}\varepsilon}{\sqrt{m}\log(nmd/\delta)}, nlog2(nd/δ)/εn\geq\log^{2}(nd/\delta)/\varepsilon, εO(1)\varepsilon\leq O(1), and mnO(loglogn)m\leq n^{O(\log\log n)}. Set ηDGmin{Bmn,mεdlog(1/δ)log(nmd)}\eta\leq\frac{D}{G}\cdot\min\{\frac{B\sqrt{m}}{\sqrt{n}},\frac{\sqrt{m}\varepsilon}{\sqrt{d\log(1/\delta)\log(nmd)}}\}, B=100log(mnd/δ)/εB=100\log(mnd/\delta)/\varepsilon, τ=O(Glog(nmd)/m)\tau=O(G\log(nmd)/\sqrt{m}), and υ=0.9B+2log(T/δ)ε\upsilon=0.9B+\frac{2\log(T/\delta)}{\varepsilon}. If the dataset is drawn i.i.d. from the distribution 𝒫\mathcal{P}, the final output xSx_{S} of Algorithm 5 satisfies

𝔼[F(xS)F(x)]O~(GD(dmn+d3/2nε2m)).\displaystyle\operatorname*{\mathbb{E}}[F(x_{S})-F(x^{*})]\leq\tilde{O}\Big{(}GD\Big{(}\frac{d}{\sqrt{mn}}+\frac{d^{3/2}}{n\varepsilon^{2}\sqrt{m}}\Big{)}\Big{)}.

Main Result: Theorem 3.1 follows directly from the privacy guarantee and localization lemmas above.

4 Lower Bound

This section presents our main lower bound. As stated above, the lower bound is nearly tight, apart from lower-order terms and the dependency on ε\varepsilon.

Theorem 4.1.

There exists a distribution 𝒫\mathcal{P} and a loss function ff satisfying Assumption 2.4 and Assumption 2.5, such that for any (ε,δ)(\varepsilon,\delta)-User-level-DP algorithm \mathcal{M}, given i.i.d. dataset 𝒟\mathcal{D} drawn from 𝒫\mathcal{P}, the output of \mathcal{M} satisfies

𝔼[F((𝒟))F(x)]GDΩ~(min{d,dmn+d3/2nεm}).\displaystyle\operatorname*{\mathbb{E}}[F(\mathcal{M}(\mathcal{D}))-F(x^{*})]\geq GD\cdot\tilde{\Omega}\Big{(}\min\Big{\{}d,\frac{d}{\sqrt{mn}}+\frac{d^{3/2}}{n\varepsilon\sqrt{m}}\Big{\}}\Big{)}.

The non-private term GDdmnGD\frac{d}{\sqrt{mn}} represents the information-theoretic lower bound for SCO under these assumptions (see, e.g., Theorem 1 in [AWBR09]).

We construct the hard instance as follows: let 𝒳=[1,1]d\mathcal{X}=[-1,1]^{d} be the unit \ell_{\infty}-ball and let f(x;z)=x,zf(x;z)=-\langle x,z\rangle for x𝒳x\in\mathcal{X} be a linear function. Let z[m,m]dz\in[-\sqrt{m},\sqrt{m}]^{d} with 𝔼z𝒫[z]=μ\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}}[z]=\mu. One can then easily verify that ff satisfies Assumptions 2.4 and 2.5 with G=m,D=1G=\sqrt{m},D=1 and β=0\beta=0. We have

F((𝒟))F(x)\displaystyle F(\mathcal{M}(\mathcal{D}))-F(x^{*}) =i=1d(sign(μ[i])(𝒟)[i])μ[i]\displaystyle=\sum_{i=1}^{d}(\operatorname{sign}(\mu[i])-\mathcal{M}(\mathcal{D})[i])\cdot\mu[i]
i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)[i])).\displaystyle\geq\sum_{i=1}^{d}|\mu[i]|\cdot\mathbf{1}\big{(}\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[i])\big{)}. (5)

By (5), we reduce the optimization error to a weighted sign-estimation error. Most existing lower bounds rely on the 22\ell^{2}_{2}-error of mean estimation. We adapt their techniques, especially the fingerprinting lemma, and provide the proof in Appendix C.
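Inequality (5) can be verified numerically for this hard instance, with randomly drawn means and algorithm outputs (a sanity check, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8
mu = rng.uniform(-1, 1, size=d)       # population mean of z, so F(x) = -<x, mu>
for _ in range(1000):
    out = rng.uniform(-1, 1, size=d)             # an arbitrary output in [-1, 1]^d
    excess = np.dot(np.sign(mu) - out, mu)       # F(out) - F(x*), x* = sign(mu)
    sign_err = np.sum(np.abs(mu) * (np.sign(mu) != np.sign(out)))
    assert excess >= sign_err - 1e-12            # inequality (5)
```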

5 Conclusion

We present a linear-time algorithm for user-level DP-SCO that leverages (coordinate-wise) robust statistics in the gradient estimation subprocedure and provide a lower bound that nearly matches the upper bound up to logarithmic terms and an additional dependence on ε\varepsilon. Despite this progress, several limitations and open problems remain, some of which we highlight below.

  • Generalization to Euclidean and Other Spaces. Extending our results to Euclidean and other spaces is an interesting but technically challenging problem. One key challenge is the lack of robust statistics that are 11-Lipschitz under perturbations in the high-dimensional 2\ell_{2}-norm (see the second item in Assumption 3.2). One may wonder whether commonly used robust statistics, such as the geometric median, are stable in this sense. However, we provide a simple counterexample involving the geometric median in the Appendix (Section B.10).

    Another challenge arises from our approach of projecting the mean towards the robust statistic to ensure unbiased gradient estimation. This projection is 11-Lipschitz under perturbations in one dimension (Lemma 3.3), but there are known counterexamples in higher dimensions [ALT24, Lemma 16]. Overcoming these issues is crucial for extending our method to general spaces.

  • Additional Assumption on Diagonal Dominance. Our analysis assumes that the Hessians of functions in the universe are diagonally dominant, which is primarily used to show that gradient descent is contractive in the \ell_{\infty}-norm (Lemma 3). This assumption is somewhat restrictive compared to the 2\ell_{2}-norm setting, where it is sufficient to assume that the operator norm of the Hessians is bounded (i.e., smoothness). Addressing the aforementioned challenges and generalizing our results to the Euclidean space could potentially eliminate this additional assumption.

  • Suboptimal Dependence on ε\varepsilon. Although our upper bound nearly matches the lower bound, it has a suboptimal dependence on ε\varepsilon. This issue arises from the loose dependence on sensitivity. Specifically, we draw BB users at each step and compute their average gradient, with B=O~(1/ε)B=\tilde{O}(1/\varepsilon). However, the final sensitivity remains roughly O~(1/m)\tilde{O}(1/\sqrt{m}) and does not improve with larger BB. An improvement in the dependence on ε\varepsilon could be achieved if the sensitivity of the robust statistic could benefit from larger BB.

Finally, it would be interesting to explore the use of robust statistics, such as the median used in this work, to address other private optimization problems.

References

  • [AFKT21] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in l1 geometry. In ICML, pages 393–403, 2021.
  • [AL24] Hilal Asi and Daogao Liu. User-level differentially private stochastic convex optimization: Efficient algorithms with optimal rates. In AISTATS, pages 4240–4248, 2024.
  • [ALT24] Hilal Asi, Daogao Liu, and Kevin Tian. Private stochastic convex optimization with heavy tails: Near-optimality from simple reductions. arXiv, 2406.02789, 2024.
  • [AUZ23] Hilal Asi, Jonathan Ullman, and Lydia Zakynthinou. From robustness to privacy and back. In ICML, pages 1121–1146, 2023.
  • [AWBR09] Alekh Agarwal, Martin J Wainwright, Peter Bartlett, and Pradeep Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In NIPS, 2009.
  • [BFGT20] Raef Bassily, Vitaly Feldman, Cristóbal Guzmán, and Kunal Talwar. Stability of stochastic gradient descent on nonsmooth convex losses. arXiv, 2006.06914, 2020.
  • [BFTT19] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. In NIPS, pages 11282–11291, 2019.
  • [BGN21] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In COLT, pages 474–499, 2021.
  • [BRB17] Aleksandar Botev, Hippolyt Ritter, and David Barber. Practical Gauss-Newton optimisation for deep learning. In ICML, pages 557–565, 2017.
  • [BS23] Raef Bassily and Ziteng Sun. User-level private stochastic convex optimization with optimal rates. In ICML, pages 1838–1851, 2023.
  • [BST14] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In FOCS, pages 464–473, 2014.
  • [Bub15] Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
  • [CCGT24] Christopher A Choquette-Choo, Arun Ganesh, and Abhradeep Guha Thakurta. Optimal rates for O(1)-smooth DP-SCO with a single epoch and large batches. In ALT, 2024.
  • [CTW+21] Nicholas Carlini, Florian Tramer, Eric Wallace, Matthew Jagielski, Ariel Herbert-Voss, Katherine Lee, Adam Roberts, Tom Brown, Dawn Song, Ulfar Erlingsson, et al. Extracting training data from large language models. In USENIX Security, pages 2633–2650, 2021.
  • [DASD24] Rudrajit Das, Naman Agarwal, Sujay Sanghavi, and Inderjit S Dhillon. Towards quantifying the preconditioning effect of Adam. arXiv, 2402.07114, 2024.
  • [DK09] Stephane Durocher and David Kirkpatrick. The projection median of a set of points. Computational Geometry, 42(5):364–375, 2009.
  • [DL09] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In STOC, pages 371–380, 2009.
  • [DMNS06] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In TCC, pages 265–284, 2006.
  • [DR14] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science, 9(3–4):211–407, 2014.
  • [FKT20] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In STOC, pages 439–449, 2020.
  • [GKK+23] Badih Ghazi, Pritish Kamath, Ravi Kumar, Raghu Meka, Pasin Manurangsi, and Chiyuan Zhang. On user-level private convex optimization. arXiv, 2305.04912, 2023.
  • [GLL22] Sivakanth Gopi, Yin Tat Lee, and Daogao Liu. Private convex optimization via exponential mechanism. In COLT, pages 1948–1989, 2022.
  • [JNG+19] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv, 1902.03736, 2019.
  • [KLSU19] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In COLT, pages 1853–1902, 2019.
  • [LKO22] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. In COLT, pages 1167–1246, 2022.
  • [LLA24] Andrew Lowy, Daogao Liu, and Hilal Asi. Faster algorithms for user-level private stochastic convex optimization. In NeurIPS, 2024.
  • [LSA+21] Daniel Levy, Ziteng Sun, Kareem Amin, Satyen Kale, Alex Kulesza, Mehryar Mohri, and Ananda Theertha Suresh. Learning with user-level privacy. NeurIPS, pages 12466–12479, 2021.
  • [LSS+23] Nils Lukas, Ahmed Salem, Robert Sim, Shruti Tople, Lukas Wutschitz, and Santiago Zanella-Béguelin. Analyzing leakage of personally identifiable information in language models. In S & P, pages 346–363, 2023.
  • [LZB20] Chaoyue Liu, Libin Zhu, and Misha Belkin. On the linearity of large non-linear models: when and why the tangent kernel is constant. Advances in Neural Information Processing Systems, 33:15954–15964, 2020.
  • [MG15] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, pages 2408–2417, 2015.
  • [SHW22] Jinyan Su, Lijie Hu, and Di Wang. Faster rates of private stochastic convex optimization. In ALT, pages 995–1002, 2022.
  • [SM12] Aleksandra Slavkovic and Roberto Molinari. Perturbed m-estimation: A further investigation of robust statistics for differential privacy. In Statistics in the Public Interest: In Memory of Stephen E. Fienberg, pages 337–361. Springer, 2012.
  • [WGZ+21] Hongjian Wang, Mert Gurbuzbalaban, Lingjiong Zhu, Umut Simsekli, and Murat A Erdogdu. Convergence rates of stochastic gradient descent under infinite noise variance. Advances in Neural Information Processing Systems, 34:18866–18877, 2021.
  • [XZ24] Zheng Xu and Yanxiang Zhang. Advances in private training for production on-device language models. https://research.google/blog/advances-in-private-training-for-production-on-device-language-models/, 2024. Google Research Blog.
  • [ZTC22] Qinzi Zhang, Hoang Tran, and Ashok Cutkosky. Differentially private online-to-batch for smooth losses. In NeurIPS, 2022.

Appendix A Preliminaries

A.1 Differential Privacy

Definition A.1 (User-level DP).

We say a mechanism \mathcal{M} is (ε,δ)(\varepsilon,\delta)-user-level DP if, for any neighboring datasets 𝒟\mathcal{D} and 𝒟\mathcal{D}^{\prime} that differ in one user, and any output event set 𝒪\mathcal{O}, we have

Pr[(𝒟)𝒪]eεPr[(𝒟)𝒪]+δ.\displaystyle\Pr[\mathcal{M}(\mathcal{D})\in\mathcal{O}]\leq e^{\varepsilon}\Pr[\mathcal{M}(\mathcal{D}^{\prime})\in\mathcal{O}]+\delta.
Proposition A.2 (Gaussian Mechanism).

Consider a function f:𝒫df:\mathcal{P}^{*}\to\mathbb{R}^{d}. If max𝒟𝒟f(𝒟)f(𝒟)2Δ\max_{\mathcal{D}\sim\mathcal{D}^{\prime}}\|f(\mathcal{D})-f(\mathcal{D}^{\prime})\|_{2}\leq\Delta, where 𝒟𝒟\mathcal{D}\sim\mathcal{D}^{\prime} means 𝒟\mathcal{D} and 𝒟\mathcal{D}^{\prime} are neighboring datasets, then the Gaussian mechanism

(𝒟):=f(𝒟)+ζ,\displaystyle\mathcal{M}(\mathcal{D}):=f(\mathcal{D})+\zeta,

where ζ𝒩(0,σ2Id)\zeta\sim\mathcal{N}(0,\sigma^{2}I_{d}) with σ22Δ2log(1.25/δ)ε2\sigma^{2}\geq\frac{2\Delta^{2}\log(1.25/\delta)}{\varepsilon^{2}} is (ε,δ)(\varepsilon,\delta)-DP.
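Proposition A.2 corresponds to the following standard noise-addition step (a sketch; the function name is ours):

```python
import numpy as np

def gaussian_mechanism(value, l2_sensitivity, eps, delta, rng=None):
    """Release `value` with (eps, delta)-DP by adding N(0, sigma^2 I) noise,
    with sigma calibrated to the l2-sensitivity as in Proposition A.2."""
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity * np.sqrt(2 * np.log(1.25 / delta)) / eps
    return np.asarray(value, dtype=float) + rng.normal(scale=sigma, size=np.shape(value))
```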

A.1.1 AboveThreshold

Our algorithms use the AboveThreshold algorithm [DR14], which is a key tool in DP to identify whether there is a query qi:𝒵q_{i}:\mathcal{Z}\to\mathbb{R} in a stream q1,,qTq_{1},\dots,q_{T} of queries that is above a certain threshold Δ\Delta. The 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold} algorithm (Algorithm 4 presented in the Appendix) has the following guarantees:

Lemma A.3 ([DR14], Theorem 3.24).

𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold} is (ε,0)(\varepsilon,0)-DP. Moreover, let α=8log(2T/γ)ε\alpha=\frac{8\log(2T/\gamma)}{\varepsilon} and 𝒟𝒵n\mathcal{D}\in\mathcal{Z}^{n}. For any sequence q1,,qT:𝒵nq_{1},\cdots,q_{T}:\mathcal{Z}^{n}\to\mathbb{R} of TT queries each of sensitivity 11, 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold} halts at time k[T+1]k\in[T+1] such that with probability at least 1γ1-\gamma,

  • For all t<kt<k, at=a_{t}=\top and qt(𝒟)Δαq_{t}(\mathcal{D})\geq\Delta-\alpha;

  • ak=a_{k}=\bot and qk(𝒟)Δ+αq_{k}(\mathcal{D})\leq\Delta+\alpha or k=T+1k=T+1.

A.2 SubGaussian and Norm-SubGaussian Random Vectors

Definition A.4.

Let ζ>0\zeta>0. We say a random vector XX is SubGaussian (SG(ζ)\mathrm{SG}(\zeta)) with parameter ζ\zeta if 𝔼[ev,X𝔼X]ev2ζ2/2\operatorname*{\mathbb{E}}[e^{\langle v,X-\operatorname*{\mathbb{E}}X\rangle}]\leq e^{\|v\|^{2}\zeta^{2}/2} for any vdv\in\mathbb{R}^{d}. Random vector XdX\in\mathbb{R}^{d} is Norm-SubGaussian with parameter ζ\zeta (nSG(ζ)\mathrm{nSG}(\zeta)) if [X𝔼X2t]2et22ζ2\mathbb{P}[\|X-\operatorname*{\mathbb{E}}X\|_{2}\geq t]\leq 2e^{-\frac{t^{2}}{2\zeta^{2}}} for all t>0t>0.

Theorem A.5 (Hoeffding-type inequality for norm-subGaussian, [JNG+19]).

Let X1,,XkdX_{1},\cdots,X_{k}\in\mathbb{R}^{d} be random vectors, and let i=σ(X1,,Xi)\mathcal{F}_{i}=\sigma(X_{1},\ldots,X_{i}) for i[k]i\in[k] be the corresponding filtration. Suppose for each i[k]i\in[k], Xii1X_{i}\mid\mathcal{F}_{i-1} is zero-mean nSG(ζi)\mathrm{nSG}(\zeta_{i}). Then, there exists an absolute constant c>0c>0 such that, for any γ>0\gamma>0,

[i[k]Xi2clog(d/γ)i[k]ζi2]γ.\displaystyle\mathbb{P}\left[\left\|\sum_{i\in[k]}X_{i}\right\|_{2}\geq c\sqrt{\log(d/\gamma)\sum_{i\in[k]}\zeta_{i}^{2}}\right]\leq\gamma.
1 Input: Dataset 𝒟=(Z1,,Zn)\mathcal{D}=(Z_{1},\dots,Z_{n}), threshold Δ\Delta\in\mathbb{R}, privacy parameter ε\varepsilon;
2 Let Δ^:=ΔLap(2ε)\widehat{\Delta}:=\Delta-\mathrm{Lap}(\frac{2}{\varepsilon});
3 for i=1i=1 to TT do
4       Receive a new query qi:𝒵nq_{i}:\mathcal{Z}^{n}\to\mathbb{R} ;
5       Sample νiLap(4ε)\nu_{i}\sim\mathrm{Lap}(\frac{4}{\varepsilon});
6       if qi(𝒟)+νi<Δ^q_{i}(\mathcal{D})+\nu_{i}<\widehat{\Delta} then
7             Output: ai=a_{i}=\bot;
8             Halt;
9       else
10             Output: ai=a_{i}=\top;
11       end if
12 end for
Algorithm 4 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold}
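A direct Python transcription of Algorithm 4 (the generator-based interface is our choice):

```python
import numpy as np

def above_threshold(queries, data, threshold, eps, rng=None):
    """Sparse-vector AboveThreshold: answer a stream of sensitivity-1
    queries, yielding True (top) while the noisy query exceeds the noisy
    threshold, and yielding False (bottom) once and halting otherwise."""
    rng = rng or np.random.default_rng()
    noisy_threshold = threshold - rng.laplace(scale=2.0 / eps)
    for q in queries:
        if q(data) + rng.laplace(scale=4.0 / eps) < noisy_threshold:
            yield False      # a_i = bottom: halt
            return
        yield True           # a_i = top: continue to the next query
```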

A.3 Optimization

Lemma A.6 ([Bub15]).

Consider a β\beta-smooth convex function ff over a convex set 𝒳\mathcal{X}. For any x𝒳x\in\mathcal{X}, suppose that the random initial point x0x_{0} satisfies 𝔼[x0x22]R2\operatorname*{\mathbb{E}}[\|x_{0}-x\|_{2}^{2}]\leq R^{2}. Suppose we have an unbiased stochastic gradient oracle such that 𝔼g~(xt)f(xt)22σt2\operatorname*{\mathbb{E}}\|\tilde{g}(x_{t})-\nabla f(x_{t})\|_{2}^{2}\leq\sigma_{t}^{2}, then running SGD for TT steps with fixed step size η\eta satisfies that

𝔼[f(1Tt=1Txt+1)f(x)](β+1η)R2T+ηtσt22T.\displaystyle\operatorname*{\mathbb{E}}\left[f\left(\frac{1}{T}\sum_{t=1}^{T}x_{t+1}\right)-f(x)\right]\leq\left(\beta+\frac{1}{\eta}\right)\frac{R^{2}}{T}+\frac{\eta\sum_{t}\sigma_{t}^{2}}{2T}.
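A small numerical illustration of Lemma A.6: averaged SGD on a smooth quadratic with a noisy but unbiased gradient oracle (all constants chosen arbitrarily for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma, eta, T = 1.0, 0.1, 0.5, 2000
f = lambda y: 0.5 * beta * np.dot(y, y)      # beta-smooth convex, minimum 0 at 0
x = np.ones(5)                               # x_0, so R^2 = 5
iterates = []
for _ in range(T):
    g = beta * x + sigma * rng.normal(size=x.shape)   # unbiased gradient oracle
    x = x - eta * g
    iterates.append(x.copy())
x_avg = np.mean(iterates, axis=0)            # averaged iterate, as in the lemma
# The excess risk of the averaged iterate ends up far below f(x_0) = 2.5,
# matching the (beta + 1/eta) R^2 / T + eta*sigma^2/2 trade-off.
assert f(x_avg) < 0.01
```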

Appendix B Proof of Section 3

1 Input: Dataset 𝒟\mathcal{D}, parameters ε,δ,B\varepsilon,\delta,B, initial point x0x_{0};
2 Process: Set S=logn/BS=\lceil\log n/B\rceil;
3 for Phase s=1,,Ss=1,\ldots,S do
4       Set ns=n/2sn_{s}=n/2^{s} and ηs=(logsm)η\eta_{s}=(\log^{-s}m)\eta;
5       Draw a dataset 𝒟s\mathcal{D}_{s} of size nsn_{s} from the unused users;
6       Run Algorithm 1 with inputs 𝒟s,ε,δ,ηs,τ,υ,B,xs1\mathcal{D}_{s},\varepsilon,\delta,\eta_{s},\tau,\upsilon,B,x_{s-1};
7       set xs=x¯s+ζsx_{s}=\overline{x}_{s}+\zeta_{s}, where ζs𝒩(0,σs2Id)\zeta_{s}\sim\mathcal{N}(0,\sigma_{s}^{2}I_{d}) with σs=O(ηsGdlog(exp(ε)/δ)log(nmd)mε)\sigma_{s}=O(\frac{\eta_{s}G\sqrt{d\log(\exp(\varepsilon)/\delta)\log(nmd)}}{\sqrt{m}\varepsilon});
8      
9 end for
Return: the final solution xSx_{S}
Algorithm 5 Localization for user-level DP-SCO
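A schematic Python version of this phase schedule, with `run_phase` standing in for Algorithm 1; all names and the packaging into a single function are ours.

```python
import numpy as np

def localize(x0, n, m, B, eta, noise_scale, run_phase, rng=None):
    """Localization wrapper (a sketch of Algorithm 5): each phase uses half
    as many users and a geometrically smaller step size, then privatizes the
    phase output with Gaussian noise proportional to that step size."""
    rng = rng or np.random.default_rng()
    S = max(1, int(np.ceil(np.log(n) / B)))      # number of phases
    x = np.asarray(x0, dtype=float)
    for s in range(1, S + 1):
        n_s = n // 2 ** s                        # users drawn for this phase
        eta_s = eta / np.log(m) ** s             # step size eta_s = (log^-s m) eta
        x = run_phase(x, n_s, eta_s)             # Algorithm 1 on fresh users
        x = x + rng.normal(scale=noise_scale * eta_s, size=x.shape)
    return x
```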

B.1 Proof of Lemma 3

See 3

Proof.

By the diagonal dominance assumption and precondition that η2/β\eta\leq 2/\beta, we know

Iη2f(z)=maxj{|1η2f(z)j,j|+ijη|2f(z)j,i|}1.\displaystyle\|I-\eta\nabla^{2}f(z)\|_{\infty}=\max_{j}\left\{|1-\eta\nabla^{2}f(z)_{j,j}|+\sum_{i\neq j}\eta|\nabla^{2}f(z)_{j,i}|\right\}\leq 1.

Note that by the mean-value theorem,

(xηf(x))(yηf(y))=(xy)η2f(z)(xy)=(Iη2f(z))(xy),(x-\eta\nabla f(x))-(y-\eta\nabla f(y))=(x-y)-\eta\nabla^{2}f(z)(x-y)=(I-\eta\nabla^{2}f(z))(x-y),

where zz lies on the segment between xx and yy. Hence we have

(xηf(x))(yηf(y))Iη2f(z)xyxy.\|(x-\eta\nabla f(x))-(y-\eta\nabla f(y))\|_{\infty}\leq\|I-\eta\nabla^{2}f(z)\|_{\infty}\cdot\|x-y\|_{\infty}\leq\|x-y\|_{\infty}. ∎
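The key matrix inequality can be checked numerically: for a symmetric, diagonally dominant Hessian and a small enough step size, the induced ℓ∞ operator norm of I − ηH is at most 1 (an illustrative check with arbitrary constants):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6
H = rng.uniform(-1, 1, size=(d, d))
H = (H + H.T) / 2
# Make H strictly diagonally dominant: H_jj exceeds the off-diagonal row sum.
np.fill_diagonal(H, np.abs(H).sum(axis=1) - np.abs(np.diag(H)) + 0.5)
eta = 1.0 / np.diag(H).max()                 # step size with eta * H_jj <= 1
M = np.eye(d) - eta * H
inf_norm = np.abs(M).sum(axis=1).max()       # induced l_inf operator norm
assert inf_norm <= 1 + 1e-12                 # the gradient map is contractive
```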

B.2 Proof of Lemma 3.3

See 3.3

Proof.

Without loss of generality, we assume aba\leq b. We do case analysis.

Case (I): if no projection happens. Then it is straightforward that c=cc^{\prime}=c and d=dd^{\prime}=d and hence |cd|=|cd|1|c^{\prime}-d^{\prime}|=|c-d|\leq 1.

Case (II): if one projection happens. Without loss of generality, assume we project cc and get cc^{\prime}. We analyze the following sub-cases:
(i): c=arc^{\prime}=a-r. In this case we know ccc\leq c^{\prime}. If dar=cd\geq a-r=c^{\prime}, then we know |cd||cd|1|c^{\prime}-d^{\prime}|\leq|c-d|\leq 1. If d<ard<a-r, then db<rd-b<-r which is impossible.
(ii): c=a+rc^{\prime}=a+r. If a+rba+r\leq b, then we know |cd||ab|1|c^{\prime}-d^{\prime}|\leq|a-b|\leq 1. Consider the subsubcase when a+r>ba+r>b. If da+rd\leq a+r, then |cd||cd|1|c^{\prime}-d^{\prime}|\leq|c-d|\leq 1. Else if da+rd\geq a+r, as b+rdc=a+rb+r\geq d\geq c^{\prime}=a+r, we have |cd||ab|1|c^{\prime}-d^{\prime}|\leq|a-b|\leq 1.

Case (III): if two projections happen.
(i): c=ar,d=brc^{\prime}=a-r,d^{\prime}=b-r. This is a trivial case.
(ii): c=a+r,d=b+rc^{\prime}=a+r,d^{\prime}=b+r. This case is also trivial.
(iii): c=ar,d=b+rc^{\prime}=a-r,d^{\prime}=b+r. Since carc\leq a-r and db+rd\geq b+r, we have |cd|(b+r)(ar)=|cd||c-d|\geq(b+r)-(a-r)=|c^{\prime}-d^{\prime}|, hence |cd||cd|1|c^{\prime}-d^{\prime}|\leq|c-d|\leq 1.
(iv): c=a+r,d=brc^{\prime}=a+r,d^{\prime}=b-r. If a+rba+r\leq b, then we know |cd||ab|1|c^{\prime}-d^{\prime}|\leq|a-b|\leq 1. Else, if a+r>ba+r>b, then we know |cd||cd|1|c^{\prime}-d^{\prime}|\leq|c-d|\leq 1. ∎

B.3 Proof of Lemma 3.3

See 3.3

Proof.

It suffices to show that g0g04ρ+2ς\|g_{0}-g_{0}^{\prime}\|_{\infty}\leq 4\rho+2\varsigma. By the first property of Assumption 3.2 and the preconditions in the statement, we know B(X,ρ)B(Y,ρ)B_{\infty}(X^{\prime},\rho)\cap B_{\infty}(Y^{\prime},\rho)\neq\emptyset, which implies

XY2ρ.\displaystyle\|X^{\prime}-Y^{\prime}\|_{\infty}\leq 2\rho.

Moreover, we have that the robust statistic XrsYrs4ρ\|X_{\mathrm{rs}}-Y_{\mathrm{rs}}\|_{\infty}\leq 4\rho by the triangle inequality as XrsXρ\|X_{\mathrm{rs}}-X^{\prime}\|_{\infty}\leq\rho and YrsYρ\|Y_{\mathrm{rs}}-Y^{\prime}\|_{\infty}\leq\rho.

By the projection operation in Algorithm 2, we know g0B(Xrs,ς)g_{0}\in B_{\infty}(X_{\mathrm{rs}},\varsigma) and g0B(Yrs,ς)g_{0}^{\prime}\in B_{\infty}(Y_{\mathrm{rs}},\varsigma), and hence we know g0g04ρ+2ς\|g_{0}-g_{0}^{\prime}\|_{\infty}\leq 4\rho+2\varsigma. This completes the proof. ∎

B.4 Proof of Lemma 3

See 3

Proof.

Recall that we assume all users Zi,tZ_{i,t} are equal to Zi,tZ_{i,t}^{\prime} except Z1,1Z_{1,1}.

We actually show that

(xt1ηgt1)(yt1ηgt1)xt1yt1,\displaystyle\|(x_{t-1}-\eta g_{t-1})-(y_{t-1}-\eta g_{t-1}^{\prime})\|_{\infty}\leq\|x_{t-1}-y_{t-1}\|_{\infty},

as the projection operator onto the same convex set is contractive in both the 2\ell_{2}- and \ell_{\infty}-norms in our case.

Let Xi,t=xt1ηqt(Zi,t)X_{i,t}=x_{t-1}-\eta q_{t}(Z_{i,t}) and Yi,t=yt1ηqt(Zi,t)Y_{i,t}=y_{t-1}-\eta q_{t}^{\prime}(Z_{i,t}). Note that the users used in computing the gradients are the same. Let XestX_{\operatorname{est}} be the output of Algorithm 2 with {Xi,t}i[B]\{X_{i,t}\}_{i\in[B]} as inputs, and YestY_{\operatorname{est}} be the corresponding output of {Yi,t}i[B]\{Y_{i,t}\}_{i\in[B]}. By the third property of Assumption 3.2, it suffices to show that

XestYestxt1yt1.\|X_{\operatorname{est}}-Y_{\operatorname{est}}\|_{\infty}\leq\|x_{t-1}-y_{t-1}\|_{\infty}. (6)

By Lemma 3, we know

Xi,tYi,txt1yt1.\displaystyle\|X_{i,t}-Y_{i,t}\|_{\infty}\leq\|x_{t-1}-y_{t-1}\|_{\infty}.

Then by the second property in Assumption 3.2, we know that XrsYrsxt1yt1\|X_{\mathrm{rs}}-Y_{\mathrm{rs}}\|_{\infty}\leq\|x_{t-1}-y_{t-1}\|_{\infty}. Similarly, by Lemma 3, we have x¯y¯xt1yt1\|\overline{x}-\overline{y}\|_{\infty}\leq\|x_{t-1}-y_{t-1}\|_{\infty}. Then (6) follows from Lemma 3.3. This completes the proof. ∎

B.5 Proof of Lemma 3

See 3

Proof.

The main randomness comes from the Laplace noise we add to the query and the threshold. Under the precondition that |Xgood|<B/3|X_{\mathrm{good}}|<B/3 for every point YY, the concentration score satisfies

s1𝖼𝗈𝗇𝖼(𝒟,τ)=1BZi,1Zj,1exp(τq1(Zi,1)q1(Zj,1))2B/3+exp(1)B/3<0.8B.\displaystyle s^{\mathsf{conc}}_{1}(\mathcal{D},\tau)=\frac{1}{B}\sum_{Z_{i,1}}\sum_{Z_{j,1}}\exp(-\tau\|q_{1}(Z_{i,1})-q_{1}(Z_{j,1})\|_{\infty})\leq 2B/3+\exp(-1)\cdot B/3<0.8B.

Then by Lemma A.3, with probability at least 1δ/(Texp(ε))1-\delta/(T\exp(\varepsilon)), we have

s1𝖼𝗈𝗇𝖼(𝒟,τ)+Lap(6/ε)υ,\displaystyle s^{\mathsf{conc}}_{1}(\mathcal{D},\tau)+\mathrm{Lap}(6/\varepsilon)\leq\upsilon,

which means a1=a_{1}=\bot. ∎

B.6 Proof of Lemma 3

See 3

Proof.

Consider the difference between qt(Zj,t)qt(Zi,t)qt(Zi,t)qt(Zj,t)\|q_{t}(Z_{j,t})-q_{t}(Z_{i,t})\|_{\infty}-\|q_{t}^{\prime}(Z_{i,t})-q_{t}^{\prime}(Z_{j,t})\|_{\infty}.

By Lemma 3.3 and Lemma 3, we know xtyt6η/τ\|x_{t}-y_{t}\|_{\infty}\leq 6\eta/\tau. Then by the smoothness assumption, we have

qt(Zj,t)qt(Zi,t)qt(Zi,t)qt(Zj,t)\displaystyle~{}\|q_{t}(Z_{j,t})-q_{t}(Z_{i,t})\|_{\infty}-\|q_{t}^{\prime}(Z_{i,t})-q_{t}^{\prime}(Z_{j,t})\|_{\infty}
\displaystyle\leq qt(Zi,t)qt(Zi,t)(qt(Zj,t)qt(Zj,t))\displaystyle~{}\|q_{t}(Z_{i,t})-q_{t}^{\prime}(Z_{i,t})-(q_{t}(Z_{j,t})-q_{t}^{\prime}(Z_{j,t}))\|_{\infty}
\displaystyle\leq qt(Zi,t)qt(Zi,t)+qt(Zj,t)qt(Zj,t)\displaystyle~{}\|q_{t}(Z_{i,t})-q_{t}^{\prime}(Z_{i,t})\|_{\infty}+\|q_{t}(Z_{j,t})-q_{t}^{\prime}(Z_{j,t})\|_{\infty}
\displaystyle\leq 2βxtyt.\displaystyle~{}2\beta\|x_{t}-y_{t}\|_{\infty}.

Hence we have

si𝖼𝗈𝗇𝖼(𝒟,τ)=\displaystyle s^{\mathsf{conc}}_{i}(\mathcal{D},\tau)= 1BZ,Z𝒟exp(τqi(Z)qi(Z))\displaystyle\frac{1}{B}\sum_{Z,Z^{\prime}\in\mathcal{D}}\exp(-\tau\|q_{i}(Z)-q_{i}(Z^{\prime})\|_{\infty})
\displaystyle\geq 1BZ,Z𝒟exp(τqi(Z)qi(Z))exp(12βη)\displaystyle\frac{1}{B}\sum_{Z,Z^{\prime}\in\mathcal{D}^{\prime}}\exp(-\tau\|q_{i}^{\prime}(Z)-q_{i}^{\prime}(Z^{\prime})\|_{\infty})\exp(-12\beta\eta)
\displaystyle\geq exp(12βη)si𝖼𝗈𝗇𝖼(𝒟,τ).\displaystyle\exp(-12\beta\eta)s^{\mathsf{conc}}_{i}(\mathcal{D}^{\prime},\tau).

As both si𝖼𝗈𝗇𝖼(𝒟,τ)s^{\mathsf{conc}}_{i}(\mathcal{D},\tau) and si𝖼𝗈𝗇𝖼(𝒟,τ)s^{\mathsf{conc}}_{i}(\mathcal{D}^{\prime},\tau) are in the range [0,B][0,B], we know that

si𝖼𝗈𝗇𝖼(𝒟,τ)si𝖼𝗈𝗇𝖼(𝒟,τ)(1exp(12βη))B12βηB,\displaystyle s^{\mathsf{conc}}_{i}(\mathcal{D}^{\prime},\tau)-s^{\mathsf{conc}}_{i}(\mathcal{D},\tau)\leq(1-\exp(-12\beta\eta))B\leq 12\beta\eta B,

where we use the fact 1exp(x)x1-\exp(-x)\leq x for any x0x\geq 0.

Similarly, we can bound si𝖼𝗈𝗇𝖼(𝒟,τ)si𝖼𝗈𝗇𝖼(𝒟,τ)12βηBs^{\mathsf{conc}}_{i}(\mathcal{D},\tau)-s^{\mathsf{conc}}_{i}(\mathcal{D}^{\prime},\tau)\leq 12\beta\eta B, and complete the proof. ∎

B.7 Proof of Lemma 3

See 3

Proof.

Consider the execution over two neighboring datasets 𝒟\mathcal{D} and 𝒟\mathcal{D}^{\prime}; we use primed notation to denote the corresponding variables with respect to 𝒟\mathcal{D}^{\prime}. Without loss of generality, we assume the differing user appears in the first phase.

To avoid confusion, let x¯1\overline{x}_{1} and x¯1\overline{x}_{1}^{\prime} be the outputs of Algorithm 1 with neighboring inputs 𝒟1\mathcal{D}_{1} and 𝒟1\mathcal{D}_{1}^{\prime}, and let x1x_{1} and x1x_{1}^{\prime} be the outputs of Algorithm 5 obtained by privatizing x¯1\overline{x}_{1} and x¯1\overline{x}_{1}^{\prime} with Gaussian noise, respectively.

The outputs x¯1\overline{x}_{1} and x¯1\overline{x}_{1}^{\prime} of Algorithm 1 depend on the intermediate random variables {at}t[T]\{a_{t}\}_{t\in[T]} and {at}t[T]\{a_{t}^{\prime}\}_{t\in[T]}.

As the query sensitivity for t=1t=1 is always bounded, by our parameter setting and the property of 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold}, we know a1ε/2,0a1a_{1}\approx_{\varepsilon/2,0}a_{1}^{\prime}.

We proceed by case analysis.

(i) (𝒟1,𝒟1)(\mathcal{D}_{1},\mathcal{D}_{1}^{\prime}) is not (1/τ)(1/\tau)-aligned. Then by Lemma 3, either Pr[a1=]1δ/2\Pr[a_{1}=\bot]\geq 1-\delta/2 or Pr[a1=]1δ/(2eε)\Pr[a_{1}^{\prime}=\bot]\geq 1-\delta/(2e^{\varepsilon}). Then, by a union bound and a1ε/2,0a1a_{1}\approx_{\varepsilon/2,0}a_{1}^{\prime}, we have

Pr[a1=a1=]1(1+exp(ε))δ/2eε.\displaystyle\Pr[a_{1}=a_{1}^{\prime}=\bot]\geq 1-(1+\exp(\varepsilon))\delta/2e^{\varepsilon}.

If a1=a1=a_{1}=a_{1}^{\prime}=\bot, then x¯1=x¯1\overline{x}_{1}=\overline{x}_{1}^{\prime} is the initial point. Then, for any measurable set 𝒪\mathcal{O}, we have

Pr[x1𝒪]=\displaystyle\Pr[x_{1}\in\mathcal{O}]= Pr[x1𝒪a1=]Pr[a1=]+Pr[x1𝒪a1=]Pr[a1=]\displaystyle\Pr[x_{1}\in\mathcal{O}\mid a_{1}=\bot]\Pr[a_{1}=\bot]+\Pr[x_{1}\in\mathcal{O}\mid a_{1}=\top]\Pr[a_{1}=\top]
\displaystyle\leq Pr[x1𝒪a1=]exp(ε/2)Pr[a1=]+eε(δ/2exp(ε))\displaystyle\Pr[x_{1}^{\prime}\in\mathcal{O}\mid a_{1}^{\prime}=\bot]\exp(\varepsilon/2)\Pr[a_{1}^{\prime}=\bot]+e^{\varepsilon}(\delta/2\exp(\varepsilon))
\displaystyle\leq eε/2Pr[x1𝒪]+δ/2.\displaystyle e^{\varepsilon/2}\Pr[x_{1}^{\prime}\in\mathcal{O}]+\delta/2.

This establishes the privacy guarantee for case (i).

(ii) (𝒟1,𝒟1)(\mathcal{D}_{1},\mathcal{D}_{1}^{\prime}) is (1/τ)(1/\tau)-aligned. In this case, by Lemma 3.3 and Lemma 3, we know xtyt6η/τ\|x_{t}-y_{t}\|_{\infty}\leq 6\eta/\tau. Moreover, Lemma 3 shows that the query sensitivity is always bounded by 1. Then, by the property of 𝖠𝖻𝗈𝗏𝖾𝖳𝗁𝗋𝖾𝗌𝗁𝗈𝗅𝖽\mathsf{AboveThreshold} (Lemma A.3), we have {at}t[T]ε/2,0{at}t[T]\{a_{t}\}_{t\in[T]}\approx_{\varepsilon/2,0}\{a_{t}^{\prime}\}_{t\in[T]}. Then, for any measurable set 𝒪\mathcal{O}, we have

Pr[x1𝒪]=\displaystyle\Pr[x_{1}\in\mathcal{O}]= Pr[x1𝒪t,at=]Pr[t,at=]+Pr[x1𝒪at=,t]Pr[at=,t]\displaystyle\Pr[x_{1}\in\mathcal{O}\mid\exists t,a_{t}=\bot]\Pr[\exists t,a_{t}=\bot]+\Pr[x_{1}\in\mathcal{O}\mid a_{t}=\top,\forall t]\Pr[a_{t}=\top,\forall t]
\displaystyle\leq Pr[x1𝒪t,at=]exp(ε/2)Pr[t,at=]+Pr[x1𝒪at=,t]Pr[at=,t]\displaystyle\Pr[x_{1}^{\prime}\in\mathcal{O}\mid\exists t,a_{t}^{\prime}=\bot]\exp(\varepsilon/2)\Pr[\exists t,a_{t}^{\prime}=\bot]+\Pr[x_{1}\in\mathcal{O}\mid a_{t}=\top,\forall t]\Pr[a_{t}=\top,\forall t]
\displaystyle\leq exp(ε/2)Pr[x1𝒪t,at=]Pr[t,at=]\displaystyle\exp(\varepsilon/2)\Pr[x_{1}^{\prime}\in\mathcal{O}\mid\exists t,a_{t}^{\prime}=\bot]\Pr[\exists t,a_{t}^{\prime}=\bot]
+(exp(ε/2)Pr[x1𝒪at=,t]+δ/2exp(ε))exp(ε/2)Pr[at=,t]\displaystyle~{}~{}~{}+(\exp(\varepsilon/2)\Pr[x_{1}^{\prime}\in\mathcal{O}\mid a_{t}^{\prime}=\top,\forall t]+\delta/2\exp(\varepsilon))\exp(\varepsilon/2)\Pr[a_{t}^{\prime}=\top,\forall t]
\displaystyle\leq exp(ε)Pr[x1𝒪]+δ,\displaystyle\exp(\varepsilon)\Pr[x_{1}^{\prime}\in\mathcal{O}]+\delta,

where we use the Gaussian Mechanism (Proposition A.2) to bound Pr[x1𝒪at=,t]\Pr[x_{1}\in\mathcal{O}\mid a_{t}=\top,\forall t] by Pr[x1𝒪at=,t]\Pr[x_{1}^{\prime}\in\mathcal{O}\mid a_{t}^{\prime}=\top,\forall t]. This completes the proof.

B.8 Proof of Lemma 3

See 3

Proof.

By the Hoeffding inequality for norm-subGaussian vectors (Theorem A.5), for each t[T]t\in[T] and each i[B]i\in[B], we have

Pr[qt(Zi,t)F(xt1)Glog(ndm/ω)/m]ω/(nm).\displaystyle\Pr\Big{[}\|q_{t}(Z_{i,t})-\nabla F(x_{t-1})\|_{\infty}\geq G\log(ndm/\omega)/\sqrt{m}\Big{]}\leq\omega/(nm).

Then, conditioning on the event that qt(Zi,t)F(xt1)τ\|q_{t}(Z_{i,t})-\nabla F(x_{t-1})\|_{\infty}\leq\tau for all i[B]i\in[B] and t[T]t\in[T], by our setting of parameters, all the concentration tests pass with at=,t[T]a_{t}=\top,\forall t\in[T], and we have

gt1=1Bi[B]qt(Zi,t),\displaystyle g_{t-1}=\frac{1}{B}\sum_{i\in[B]}q_{t}(Z_{i,t}),

which means dTV({gt1}t[T],{1Bi[B]qt(Zi,t)}t[T])ωd_{\mathrm{TV}}\Big{(}\{g_{t-1}\}_{t\in[T]},\{\frac{1}{B}\sum_{i\in[B]}q_{t}(Z_{i,t})\}_{t\in[T]}\Big{)}\leq\omega. Note that 𝔼[1Bi[B]qt(Zi,t)]=F(xt1)\operatorname*{\mathbb{E}}[\frac{1}{B}\sum_{i\in[B]}q_{t}(Z_{i,t})]=\nabla F(x_{t-1}) and 𝔼[1Bi[B]qt(Zi,t)F(xt1)22]G2d/Bm\operatorname*{\mathbb{E}}[\|\frac{1}{B}\sum_{i\in[B]}q_{t}(Z_{i,t})-\nabla F(x_{t-1})\|_{2}^{2}]\leq G^{2}d/Bm when all functions are drawn i.i.d. from the distribution. By the small TV distance between gt1g_{t-1} and the good gradient estimate 1Bi[B]qt(Zi,t)\frac{1}{B}\sum_{i\in[B]}q_{t}(Z_{i,t}), it follows from Lemma A.6 that

𝔼[F(x¯)F(x)](β+1η)𝔼[x0x22]T+ηG2dBm+GDdω,\displaystyle\operatorname*{\mathbb{E}}[F(\overline{x})-F(x)]\lesssim(\beta+\frac{1}{\eta})\frac{\operatorname*{\mathbb{E}}[\|x_{0}-x\|^{2}_{2}]}{T}+\frac{\eta G^{2}d}{Bm}+GDd\omega,

where the last term accounts for the worst-case value GDdGDd, incurred with the small failure probability ω\omega. ∎
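The variance bound 𝔼[1Bqt(Zi,t)F(xt1)22]G2d/Bm\operatorname*{\mathbb{E}}[\|\frac{1}{B}\sum q_{t}(Z_{i,t})-\nabla F(x_{t-1})\|_{2}^{2}]\leq G^{2}d/Bm driving the ηG2d/Bm\eta G^{2}d/Bm term can be sanity-checked numerically. A seeded Monte Carlo sketch, with standard Gaussian coordinates standing in for the per-function gradient errors (so G=1G=1 and N=BmN=Bm pooled i.i.d. samples; the constants are toy values, not the paper's):

```python
import random

rng = random.Random(0)
d, N, trials = 5, 100, 2000  # N plays the role of B*m pooled samples

# Average N i.i.d. standard Gaussian vectors in R^d and measure the
# squared l2 error of the average; its expectation is exactly d/N.
total = 0.0
for _ in range(trials):
    avg = [sum(rng.gauss(0.0, 1.0) for _ in range(N)) / N for _ in range(d)]
    total += sum(c * c for c in avg)
avg_sq_norm = total / trials  # concentrates near d/N = 0.05
```

This is the familiar 1/(Bm)1/(Bm) variance reduction from pooling all BmBm per-user gradient estimates before a single privatization step.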

B.9 Proof of Lemma 3

See 3

Proof.

Let x¯0=x\overline{x}_{0}=x^{*} and ζ0=x0x\zeta_{0}=x_{0}-x^{*}. Lemma 3 controls the utility of the iterates x¯s\overline{x}_{s}. Since we add Gaussian noise ζs\zeta_{s} to x¯s\overline{x}_{s} in each phase, we first analyze the influence of ζs\zeta_{s}.

By the assumption, we know ζ02Dd\|\zeta_{0}\|_{2}\leq D\sqrt{d}. Recall that, under the setting ηDGmεdlog(1/δ)log(nmd)\eta\leq\frac{D}{G}\cdot\frac{\sqrt{m}\varepsilon}{d\log(1/\delta)\log(nmd)}, for all s0s\geq 0,

𝔼[ζs22]=dσs2=ηs2dG2dlog(1/δ)log(nmd)mε2D2dlogsm.\displaystyle\operatorname*{\mathbb{E}}[\|\zeta_{s}\|_{2}^{2}]=d\sigma_{s}^{2}=\eta_{s}^{2}d\frac{G^{2}d\log(1/\delta)\log(nmd)}{m\varepsilon^{2}}\leq D^{2}d\cdot\log^{-s}m.

Then by Lemma 3, we have

𝔼[F(xS)]F(x)=\displaystyle\operatorname*{\mathbb{E}}[F(x_{S})]-F(x^{*})= s=1S𝔼[F(x¯s)F(x¯s1)]+𝔼[F(xS)F(x¯S)]\displaystyle\sum_{s=1}^{S}\operatorname*{\mathbb{E}}[F(\overline{x}_{s})-F(\overline{x}_{s-1})]+\operatorname*{\mathbb{E}}[F(x_{S})-F(\overline{x}_{S})]
\displaystyle\leq s=1S(𝔼[ζs122]ηsTs+ηsG2d2Bm)+G𝔼[ζS2]\displaystyle\sum_{s=1}^{S}\left(\frac{\operatorname*{\mathbb{E}}[\|\zeta_{s-1}\|_{2}^{2}]}{\eta_{s}T_{s}}+\frac{\eta_{s}G^{2}d}{2Bm}\right)+G\operatorname*{\mathbb{E}}[\|\zeta_{S}\|_{2}]
\displaystyle\leq s=1S(logm2)(s2)(D2dηn/B+ηG2d2Bm)+GDd(logm)logn\displaystyle\sum_{s=1}^{S}\left(\frac{\log m}{2}\right)^{-(s-2)}\left(\frac{D^{2}d}{\eta n/B}+\frac{\eta G^{2}d}{2Bm}\right)+\frac{GDd}{(\log m)^{\log n}}
\displaystyle\leq O~(GD(dnm+d3/2nmε2)),\displaystyle\tilde{O}\Big{(}GD(\frac{d}{\sqrt{nm}}+\frac{d^{3/2}}{n\sqrt{m}\varepsilon^{2}})\Big{)},

where we use the fact that 1(logm)logn1/nm\frac{1}{(\log m)^{\log n}}\leq 1/nm when mnloglognm\leq n^{\log\log n}. ∎

B.10 Counterexample to the 1-Lipschitzness of the Geometric Median

We use the counterexample from [DK09] to show that the geometric median is unstable even in two dimensions. Recall that, given a set of points PP in d\mathbb{R}^{d}, the geometric median of PP, denoted by M(P)M(P), is

M(P):=argminxpPxp2.\displaystyle M(P):=\arg\min_{x}\sum_{p\in P}\|x-p\|_{2}.

Let P={(0,0),(0,0),(1,0),(1,α)}P=\{(0,0),(0,0),(1,0),(1,\alpha)\} and P={(0,0),(0,α),(1,0),(1,0)}P^{\prime}=\{(0,0),(0,\alpha),(1,0),(1,0)\}, where α>0\alpha>0 is an arbitrarily small perturbation. Nevertheless, M(P)=(0,0)M(P)=(0,0) while M(P)=(1,0)M(P^{\prime})=(1,0): a perturbation of size α\alpha moves the geometric median by distance 11.
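This instability is easy to verify numerically. The sketch below approximates the geometric median by brute-force grid search (sufficient at this resolution, and more robust here than Weiszfeld's algorithm, since the minimizer coincides with a data point); α=0.01\alpha=0.01 is an assumed toy value:

```python
import math

def geometric_median_grid(points, lo=-0.5, hi=1.5, step=0.01):
    """Approximate argmin_x sum_p ||x - p||_2 by exhaustive 2-D grid search."""
    best, best_val = None, float("inf")
    n_steps = round((hi - lo) / step)
    for i in range(n_steps + 1):
        for j in range(n_steps + 1):
            x, y = lo + i * step, lo + j * step
            val = sum(math.hypot(x - px, y - py) for px, py in points)
            if val < best_val:
                best, best_val = (x, y), val
    return best

a = 0.01  # the small perturbation alpha (toy value)
P  = [(0, 0), (0, 0), (1, 0), (1, a)]
Pp = [(0, 0), (0, a), (1, 0), (1, 0)]
mP, mPp = geometric_median_grid(P), geometric_median_grid(Pp)
# mP lies near (0, 0) while mPp lies near (1, 0): an alpha-sized
# perturbation of the input moves the geometric median by distance ~1.
```

Intuition: with the perturbation removed (α=0\alpha=0), every point on the segment between (0,0)(0,0) and (1,0)(1,0) attains the same objective value 22; the tiny off-axis point breaks the tie and snaps the minimizer to one endpoint.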

Appendix C Details of Lower Bound

Now we construct a lower bound for the weighted sign estimation error.

Weighted sign estimation error.

We construct a distribution 𝒫1\mathcal{P}_{1} as follows: for each coordinate k[d]k\in[d], we draw μ[k]\mu[k] uniformly at random from [1,1][-1,1], and zi,j[k]𝒩(μ[k],m)z_{i,j}[k]\sim\mathcal{N}(\mu[k],m) i.i.d., for i[n],j[m]i\in[n],j\in[m]. The objective is to minimize the weighted sign estimation error with respect to μ\mu.
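To make the objective concrete, the following sketch samples from 𝒫1\mathcal{P}_{1} and evaluates the weighted sign estimation error of the naive (non-private) estimator that outputs the sign of the pooled empirical mean; the dimensions are assumed toy values and this estimator is only for illustration, not the DP mechanism of the lemma below:

```python
import random

rng = random.Random(1)
d, n, m = 4, 50, 9  # toy dimensions

# Draw mu[k] ~ Unif[-1, 1], then z_{i,j}[k] ~ N(mu[k], m), i.e. std sqrt(m).
mu = [rng.uniform(-1, 1) for _ in range(d)]
data = [[[rng.gauss(mu[k], m ** 0.5) for k in range(d)]
         for _ in range(m)] for _ in range(n)]

# Naive estimate: sign of the pooled mean over all n*m items.
pooled = [sum(data[i][j][k] for i in range(n) for j in range(m)) / (n * m)
          for k in range(d)]

def sign(x):
    return 1 if x >= 0 else -1

# Weighted sign estimation error with respect to mu: coordinates where the
# sign is wrong contribute |mu[k]|.
err = sum(abs(mu[k]) for k in range(d) if sign(mu[k]) != sign(pooled[k]))
```

The weighting by |μ[k]||\mu[k]| means only coordinates whose mean is far from zero are expensive to misclassify, which is exactly what the lower-bound construction exploits.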

Lemma C.1.

Let ε0.1,δ1/(dnm)\varepsilon\leq 0.1,\delta\leq 1/(dnm). For any (ε,δ)(\varepsilon,\delta)-user-level-DP algorithm \mathcal{M}, there exists a distribution 𝒫2\mathcal{P}_{2} such that 𝔼z𝒫2[z]1\|\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}_{2}}[z]\|_{\infty}\leq 1 and zO~(m)\|z\|_{\infty}\leq\tilde{O}(\sqrt{m}) almost surely, and, given dataset 𝒟\mathcal{D} i.i.d. drawn from 𝒫2\mathcal{P}_{2}, we have

𝔼𝒟,,μi=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)[i]))Ω~(d3/2nε).\displaystyle\operatorname*{\mathbb{E}}_{\mathcal{D},\mathcal{M},\mu}\sum_{i=1}^{d}|\mu[i]|\cdot\mathbf{1}\big{(}\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[i])\big{)}\geq\tilde{\Omega}(\frac{d^{3/2}}{n\varepsilon}).

First, by a known result, we can reduce the user-level setting to the item-level setting.

Lemma C.2 ([LSA+21]).

Suppose each user Zi,i[n]Z_{i},i\in[n] observes zi,1,,zi,mi.i.d.𝒩(μ,σ2Id)z_{i,1},\ldots,z_{i,m}\stackrel{{\scriptstyle\text{i.i.d.}}}{{\sim}}\mathcal{N}(\mu,\sigma^{2}I_{d}) with σ\sigma known. For any (ε,δ)(\varepsilon,\delta)-User-level-DP algorithm user\mathcal{M}_{\mathrm{user}}, there exists an (ε,δ)(\varepsilon,\delta)-Item-level-DP algorithm item\mathcal{M}_{\mathrm{item}} that takes inputs (Z¯1,,Z¯n)(\overline{Z}_{1},\ldots,\overline{Z}_{n}) where Z¯i=1mj[m]zi,j\overline{Z}_{i}=\frac{1}{m}\sum_{j\in[m]}z_{i,j} and has the same performance as user\mathcal{M}_{\mathrm{user}}.
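The reduction simply replaces each user's mm samples with their average Z¯i\overline{Z}_{i}. A quick seeded check, in the scalar case matching the construction of 𝒫1\mathcal{P}_{1} above (per-item variance mm, so the averages have unit variance); the constants are toy values:

```python
import random

rng = random.Random(0)
m, users, mu = 9, 20_000, 0.4  # toy values

# Each user's average of m i.i.d. N(mu, m) samples (std sqrt(m)).
zbar = [sum(rng.gauss(mu, m ** 0.5) for _ in range(m)) / m
        for _ in range(users)]
mean = sum(zbar) / users
var = sum((z - mean) ** 2 for z in zbar) / (users - 1)
# Var(Zbar) = m / m = 1, so var concentrates near 1 and mean near mu.
```

This is why the item-level dataset (Z¯1,,Z¯n)(\overline{Z}_{1},\ldots,\overline{Z}_{n}) in Lemma C.2 is distributed as nn draws from 𝒩(μ,1)\mathcal{N}(\mu,1) per coordinate, as used in 𝒫¯1\overline{\mathcal{P}}_{1} below.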

Hence by Lemma C.2, it suffices to consider the item-level lower bound. We prove the following lemma:

Lemma C.3.

Let {μ[k]}k[d]\{\mu[k]\}_{k\in[d]} be drawn i.i.d. uniformly from [σ,σ][-\sigma,\sigma]. Let {Z¯1,,Z¯n}\{\overline{Z}_{1},\ldots,\overline{Z}_{n}\} be i.i.d. drawn from 𝒩(μ,σ2Id)\mathcal{N}(\mu,\sigma^{2}I_{d}). If :n×d{1,1}d\mathcal{M}:\mathbb{R}^{n\times d}\to\{-1,1\}^{d} is (ε,δ)(\varepsilon,\delta)-DP, and

𝔼μ,𝒟,i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)[i]))αdσ8,\displaystyle\operatorname*{\mathbb{E}}_{\mu,\mathcal{D},\mathcal{M}}\sum_{i=1}^{d}|\mu[i]|\cdot\mathbf{1}\big{(}\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[i])\big{)}\leq\alpha\leq\frac{d\sigma}{8},

then nd32εn\geq\frac{\sqrt{d}}{32\varepsilon}.

By scale invariance, it suffices to consider the case σ=1\sigma=1. To prove Lemma C.3, we need the fingerprinting lemma:

Lemma C.4 (Lemma 6.8 in [KLSU19]).

For every fixed number pp and every f:n[1,1]f:\mathbb{R}^{n}\to[-1,1], define g(p):=𝔼X1,n𝒩(p,1)[f(X)]g(p):=\operatorname*{\mathbb{E}}_{X_{1,\ldots n}\sim\mathcal{N}(p,1)}[f(X)]. We have

𝔼X1,n𝒩(p,1)[(1p2)(f(X)p)i[n](Xip)]=(1p2)ddpg(p).\displaystyle\operatorname*{\mathbb{E}}_{X_{1,\ldots n}\sim\mathcal{N}(p,1)}[(1-p^{2})(f(X)-p)\sum_{i\in[n]}(X_{i}-p)]=(1-p^{2})\frac{\mathrm{d}}{\mathrm{d}p}g(p). (7)
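For intuition, Equation (7) can be checked in the simplest case f(X)=1ni[n]Xif(X)=\frac{1}{n}\sum_{i\in[n]}X_{i}, where g(p)=pg(p)=p and g(p)=1g^{\prime}(p)=1 (this ff has unbounded range, so it only illustrates the underlying Stein-type identity, not the lemma's bounded-ff setting). A seeded Monte Carlo sketch; nn, pp, and the trial count are arbitrary toy values:

```python
import random

rng = random.Random(0)
n, p, trials = 5, 0.3, 100_000

# With f(X) = mean(X) we have g(p) = p and g'(p) = 1, and the (1 - p^2)
# factor cancels on both sides of Equation (7), leaving
#   E[(f(X) - p) * sum_i (X_i - p)] = 1   for every fixed p.
total = 0.0
for _ in range(trials):
    X = [rng.gauss(p, 1.0) for _ in range(n)]
    xbar = sum(X) / n
    total += (xbar - p) * sum(x - p for x in X)
estimate = total / trials  # concentrates near g'(p) = 1
```

The fingerprinting argument uses this correlation between the mechanism's output and individual samples: any accurate estimator must correlate with its inputs, which differential privacy then caps.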

Choosing pp uniformly from [1,1][-1,1] yields the following observation about the expectation of the RHS of Equation (7).

Lemma C.5.

We have

𝔼p[1,1][(1p2)ddpg(p)]=𝔼p[g(p)p].\displaystyle\operatorname*{\mathbb{E}}_{p\sim[-1,1]}[(1-p^{2})\frac{\mathrm{d}}{\mathrm{d}p}g(p)]=\operatorname*{\mathbb{E}}_{p}[g(p)\cdot p].
Proof.

Using integration by parts, we have

𝔼p[1,1][(1p2)ddpg(p)]=\displaystyle\operatorname*{\mathbb{E}}_{p\sim[-1,1]}[(1-p^{2})\frac{\mathrm{d}}{\mathrm{d}p}g(p)]= 1211(1p2)ddpg(p)dp\displaystyle\frac{1}{2}\int_{-1}^{1}(1-p^{2})\frac{\mathrm{d}}{\mathrm{d}p}g(p)\mathrm{d}p
=\displaystyle= 1211(ddp((1p2)g(p))g(p)ddp(1p2))dp\displaystyle\frac{1}{2}\int_{-1}^{1}\Big{(}\frac{\mathrm{d}}{\mathrm{d}p}\big{(}(1-p^{2})g(p)\big{)}-g(p)\frac{\mathrm{d}}{\mathrm{d}p}\big{(}1-p^{2}\big{)}\Big{)}\mathrm{d}p
=\displaystyle= 11g(p)pdp\displaystyle\int_{-1}^{1}g(p)p\mathrm{d}p
=\displaystyle= 𝔼[g(p)p].\displaystyle\operatorname*{\mathbb{E}}[g(p)\cdot p].
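The integration-by-parts step can be verified exactly for polynomial test functions using rational arithmetic. The sketch below checks that 12 11(1p2)g(p)dp=11pg(p)dp\frac{1}{2}\int_{-1}^{1}(1-p^{2})g^{\prime}(p)\mathrm{d}p=\int_{-1}^{1}p\,g(p)\mathrm{d}p for the assumed test function g(p)=p3g(p)=p^{3}:

```python
from fractions import Fraction

def deriv(c):
    """Derivative of a polynomial given by coefficients c[k] of p^k."""
    return [Fraction(k) * c[k] for k in range(1, len(c))]

def mul(a, b):
    """Product of two polynomials in coefficient form."""
    out = [Fraction(0)] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] += ai * bj
    return out

def integrate_sym(c):
    """Exact integral of the polynomial over [-1, 1]; odd powers vanish."""
    return sum(Fraction(2, k + 1) * c[k] for k in range(0, len(c), 2))

g = [Fraction(0), Fraction(0), Fraction(0), Fraction(1)]   # g(p) = p^3
one_minus_p2 = [Fraction(1), Fraction(0), Fraction(-1)]    # 1 - p^2
lhs = Fraction(1, 2) * integrate_sym(mul(one_minus_p2, deriv(g)))
rhs = integrate_sym(mul([Fraction(0), Fraction(1)], g))    # int of p*g(p)
# lhs == rhs == 2/5, matching the integration-by-parts computation above.
```

The boundary term vanishes because (1p2)(1-p^{2}) is zero at both endpoints, which is exactly what the second line of the derivation uses.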

Now we use the fingerprinting lemma.

Lemma C.6.

One has

𝔼[i[n],k[d](1μ[k]2)((𝒟)[k]μ[k])(Z¯i[k]μ[k])]=𝔼[k[d](𝒟)[k]μ[k]].\displaystyle\operatorname*{\mathbb{E}}\left[\sum_{i\in[n],k\in[d]}(1-\mu[k]^{2})(\mathcal{M}(\mathcal{D})[k]-\mu[k])\cdot(\overline{Z}_{i}[k]-\mu[k])\right]=\operatorname*{\mathbb{E}}\left[\sum_{k\in[d]}\mathcal{M}(\mathcal{D})[k]\cdot\mu[k]\right].
Proof.

Fix a column k[d]k\in[d].

We now instantiate ff for our purpose. Define f:n[1,1]f:\mathbb{R}^{n}\to[-1,1] to be

f(X):=𝔼𝒟,[(𝒟kX)[k]].\displaystyle f(X):=\operatorname*{\mathbb{E}}_{\mathcal{D},\mathcal{M}}[\mathcal{M}(\mathcal{D}^{-k}\|X)[k]].

That is, f(X)f(X) is the expectation of (𝒟)[k]\mathcal{M}(\mathcal{D})[k] conditioned on Z¯i[k]=Xi,i[n]\overline{Z}_{i}[k]=X_{i},\forall i\in[n]. Define g:[1,1][1,1]g:[-1,1]\to[-1,1] to be

g(p):=𝔼μk,X1,n𝒩(p,1)[f(X)].\displaystyle g(p):=\operatorname*{\mathbb{E}}_{\mu^{-k},X_{1,\ldots n}\sim\mathcal{N}(p,1)}[f(X)].

That is, g(p)g(p) is the expectation of (𝒟)[k]\mathcal{M}(\mathcal{D})[k] conditioned on μ[k]=p\mu[k]=p.

Now we can calculate

𝔼[(1μ[k]2)((𝒟)[k]μ[k])i[n](Z¯i[k]μ[k])]\displaystyle\operatorname*{\mathbb{E}}\Big{[}(1-\mu[k]^{2})(\mathcal{M}(\mathcal{D})[k]-\mu[k])\sum_{i\in[n]}(\overline{Z}_{i}[k]-\mu[k])\Big{]}
=\displaystyle= 𝔼μ[k][1,1],X1,n𝒩(μ[k],1)[(1μ[k]2)(f(X)μ[k])i[n](Z¯i[k]μ[k])]\displaystyle\operatorname*{\mathbb{E}}_{\mu[k]\sim[-1,1],X_{1,\ldots n}\sim\mathcal{N}(\mu[k],1)}\Big{[}(1-\mu[k]^{2})(f(X)-\mu[k])\sum_{i\in[n]}(\overline{Z}_{i}[k]-\mu[k])\Big{]}
=(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{=}} 𝔼μ[k][g(μ[k])μ[k]]\displaystyle\operatorname*{\mathbb{E}}_{\mu[k]}[g(\mu[k])\cdot\mu[k]]
=\displaystyle= 𝔼[(𝒟)[k]μ[k]],\displaystyle\operatorname*{\mathbb{E}}[\mathcal{M}(\mathcal{D})[k]\mu[k]],

where (i)(i) follows from Lemma C.4 and Lemma C.5. The statement follows by summation over k[d]k\in[d]. ∎

A small weighted sign error means a large 𝔼[k[d](𝒟)[k]μ[k]]\operatorname*{\mathbb{E}}\left[\sum_{k\in[d]}\mathcal{M}(\mathcal{D})[k]\cdot\mu[k]\right], as demonstrated by the following lemma:

Lemma C.7.

Let :n×d[1,1]d\mathcal{M}:\mathbb{R}^{n\times d}\to[-1,1]^{d}. Suppose {μ[k]}k[d]\{\mu[k]\}_{k\in[d]} are drawn i.i.d. uniformly from [1,1][-1,1], and

𝔼μ,𝒟,[k=1d|μ[k]|𝟏(sign(μ[k])sign((𝒟)[k]))]α,\displaystyle\operatorname*{\mathbb{E}}_{\mu,\mathcal{D},\mathcal{M}}\left[\sum_{k=1}^{d}|\mu[k]|\cdot\mathbf{1}\big{(}\operatorname{sign}(\mu[k])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[k])\big{)}\right]\leq\alpha,

then we have

𝔼[k[d](𝒟)[k]μ[k]]d22α.\displaystyle\operatorname*{\mathbb{E}}\left[\sum_{k\in[d]}\mathcal{M}(\mathcal{D})[k]\cdot\mu[k]\right]\geq\frac{d}{2}-2\alpha.
Proof.

Note that

𝔼[k=1d|μ[k]|(𝟏(sign(μ[k])sign((𝒟)[k]))+𝟏(sign(μ[k])=sign((𝒟)[k])))]\displaystyle\operatorname*{\mathbb{E}}\left[\sum_{k=1}^{d}|\mu[k]|\cdot\Big{(}\mathbf{1}(\operatorname{sign}(\mu[k])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[k]))+\mathbf{1}(\operatorname{sign}(\mu[k])=\operatorname{sign}(\mathcal{M}(\mathcal{D})[k]))\Big{)}\right]
=𝔼[k=1d|μ[k]|]=d/2.\displaystyle=\operatorname*{\mathbb{E}}\left[\sum_{k=1}^{d}|\mu[k]|\right]=d/2.

Moreover, one has that

𝔼[k[d](𝒟)[k]μ[k]]\displaystyle\operatorname*{\mathbb{E}}\left[\sum_{k\in[d]}\mathcal{M}(\mathcal{D})[k]\cdot\mu[k]\right]
=\displaystyle= 𝔼[k[d]|μ[k]|(𝟏(sign(μ[k])=sign((𝒟)[k]))𝟏(sign(μ[k])sign((𝒟)[k])))]\displaystyle\operatorname*{\mathbb{E}}\left[\sum_{k\in[d]}|\mu[k]|\cdot\Big{(}\mathbf{1}(\operatorname{sign}(\mu[k])=\operatorname{sign}(\mathcal{M}(\mathcal{D})[k]))-\mathbf{1}(\operatorname{sign}(\mu[k])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[k]))\Big{)}\right]
\displaystyle\geq d/22α.\displaystyle d/2-2\alpha.

This completes the proof. ∎

It remains to bound the sample complexity needed to achieve a large value of

𝔼[i[n],k[d](1μ[k]2)((𝒟)[k]μ[k])(Z¯i[k]μ[k])]d/22α.\displaystyle\operatorname*{\mathbb{E}}\left[\sum_{i\in[n],k\in[d]}(1-\mu[k]^{2})(\mathcal{M}(\mathcal{D})[k]-\mu[k])\cdot(\overline{Z}_{i}[k]-\mu[k])\right]\geq d/2-2\alpha. (8)

Now we complete the proof of Lemma C.3.

Proof of Lemma C.3.

Let Ai=k[d](1μ[k]2)((𝒟)[k]μ[k])(Z¯i[k]μ[k])A_{i}=\sum_{k\in[d]}(1-\mu[k]^{2})(\mathcal{M}(\mathcal{D})[k]-\mu[k])(\overline{Z}_{i}[k]-\mu[k]). We use the DP constraint of \mathcal{M} to upper bound 𝔼[Ai]\operatorname*{\mathbb{E}}[A_{i}].

Let 𝒟i\mathcal{D}_{\sim i} denote 𝒟\mathcal{D} with Z¯i\overline{Z}_{i} replaced with an independent draw from 𝒫1\mathcal{P}_{1}. Define

A~i=k[d](1μ[k]2)((𝒟i)[k]μ[k])(Z¯i[k]μ[k]).\displaystyle\tilde{A}_{i}=\sum_{k\in[d]}(1-\mu[k]^{2})(\mathcal{M}(\mathcal{D}_{\sim i})[k]-\mu[k])(\overline{Z}_{i}[k]-\mu[k]).

Due to the independence between (𝒟i)\mathcal{M}(\mathcal{D}_{\sim i}) and Z¯i\overline{Z}_{i}, we have 𝔼[A~i]=0\operatorname*{\mathbb{E}}[\tilde{A}_{i}]=0. Moreover, as |1μ[k]2|1|1-\mu[k]^{2}|\leq 1 and |(𝒟i)[k]μ[k]|2|\mathcal{M}(\mathcal{D}_{\sim i})[k]-\mu[k]|\leq 2, we have 𝔼[|A~i|]2k[d]𝐕𝐚𝐫(Z¯i[k])2d\operatorname*{\mathbb{E}}[|\tilde{A}_{i}|]\leq 2\sqrt{\sum_{k\in[d]}\operatorname*{\mathbf{Var}}(\overline{Z}_{i}[k])}\leq 2\sqrt{d}.

Write AiA_{i} as Ai=Ai,++Ai,A_{i}=A_{i,+}+A_{i,-} with Ai,+=max{0,Ai}A_{i,+}=\max\{0,A_{i}\} and Ai,=min{0,Ai}A_{i,-}=\min\{0,A_{i}\}, and decompose A~i\tilde{A}_{i} similarly. By the DP property, we know

Pr[|Ai,+|t]exp(ε)Pr[|A~i,+|t]+δ,t0,\displaystyle\Pr[|A_{i,+}|\geq t]\leq\exp(\varepsilon)\Pr[|\tilde{A}_{i,+}|\geq t]+\delta,\forall t\geq 0,
Pr[|Ai,|t]exp(ε)Pr[|A~i,|t]δ,t0.\displaystyle\Pr[|A_{i,-}|\geq t]\geq\exp(-\varepsilon)\Pr[|\tilde{A}_{i,-}|\geq t]-\delta,\forall t\geq 0.

Then we have

𝔼[Ai]=\displaystyle\operatorname*{\mathbb{E}}[A_{i}]= 0Pr[|Ai,+|t]dt0Pr[|Ai,|t]dt\displaystyle\int_{0}^{\infty}\Pr[|A_{i,+}|\geq t]\mathrm{d}t-\int_{0}^{\infty}\Pr[|A_{i,-}|\geq t]\mathrm{d}t
\displaystyle\leq exp(ε)𝔼[|A~i,+|]exp(ε)𝔼[|A~i,|]+2δT+TPr[|Ai|t]dt\displaystyle\exp(\varepsilon)\operatorname*{\mathbb{E}}[|\tilde{A}_{i,+}|]-\exp(-\varepsilon)\operatorname*{\mathbb{E}}[|\tilde{A}_{i,-}|]+2\delta T+\int_{T}^{\infty}\Pr[|A_{i}|\geq t]\mathrm{d}t
\displaystyle\leq 𝔼[A~i]+(exp(ε)1)𝔼[|A~i,+|]+(1exp(ε))𝔼[|A~i,|]+2δT+TPr[|Ai|t]dt\displaystyle\operatorname*{\mathbb{E}}[\tilde{A}_{i}]+(\exp(\varepsilon)-1)\operatorname*{\mathbb{E}}[|\tilde{A}_{i,+}|]+(1-\exp(-\varepsilon))\operatorname*{\mathbb{E}}[|\tilde{A}_{i,-}|]+2\delta T+\int_{T}^{\infty}\Pr[|A_{i}|\geq t]\mathrm{d}t
\displaystyle\leq 𝔼[A~i]+2ε𝔼[|A~i|]+2δT+TPr[|Ai|t]dt,\displaystyle\operatorname*{\mathbb{E}}[\tilde{A}_{i}]+2\varepsilon\operatorname*{\mathbb{E}}[|\tilde{A}_{i}|]+2\delta T+\int_{T}^{\infty}\Pr[|A_{i}|\geq t]\mathrm{d}t,

where the last inequality used the fact that exp(ε)12ε\exp(\varepsilon)-1\leq 2\varepsilon when ε1/10\varepsilon\leq 1/10.

When δ1/(dn2)\delta\leq 1/(dn^{2}) and T=O(dlog(1/δ))T=O(\sqrt{d\log(1/\delta)}), we have

𝔼[Ai]4εd+d/8n.\displaystyle\operatorname*{\mathbb{E}}[A_{i}]\leq 4\varepsilon\sqrt{d}+d/8n.

When αd/8\alpha\leq d/8, Equation (8) implies that

n(4εd+d/8n)d/4,\displaystyle n(4\varepsilon\sqrt{d}+d/8n)\geq d/4,

which leads to

nd32ε.\displaystyle n\geq\frac{\sqrt{d}}{32\varepsilon}.

This completes the proof. ∎

It is standard to translate the sample complexity lower bound (Lemma C.3) to the error lower bound (Lemma C.1). We present a proof below.

Proof of Lemma C.1.

Let 𝒜ε,δitem\mathcal{A}_{\varepsilon,\delta}^{\mathrm{item}} be the set of item-level DP mechanisms, and let 𝒜ε,δuser\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}} be the set of user-level DP mechanisms.

Define the error term:

Error[𝒫,,n]=𝔼𝒟𝒫n,i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)[i])),\displaystyle\mathrm{Error}[\mathcal{P},\mathcal{M},n]=\operatorname*{\mathbb{E}}_{\mathcal{D}\sim\mathcal{P}^{n},\mathcal{M}}\sum_{i=1}^{d}|\mu[i]|\mathbf{1}(\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[i])),

where μ=𝔼z𝒫z\mu=\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}}z.

Recall that we construct the distribution 𝒫1\mathcal{P}_{1} as follows: for each coordinate k[d]k\in[d], we draw μ[k]\mu[k] uniformly random from [1,1][-1,1], and zi,j[k]𝒩(μ[k],m)z_{i,j}[k]\sim\mathcal{N}(\mu[k],m) i.i.d., for i[n],j[m]i\in[n],j\in[m].

Let 𝒫¯1\overline{\mathcal{P}}_{1} be the following: for each coordinate k[d]k\in[d], we draw μ[k]\mu[k] uniformly at random from [1,1][-1,1], and Z¯i[k]𝒩(μ[k],1)\overline{Z}_{i}[k]\sim\mathcal{N}(\mu[k],1) i.i.d., for i[n]i\in[n]. 𝒫¯1\overline{\mathcal{P}}_{1} corresponds to averaging the mm samples of each user. By Lemma C.2, we have

inf𝒜ε,δuserError[𝒫1,,nm]inf𝒜ε,δitemError[𝒫¯1,,n].\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\mathrm{Error}[\mathcal{P}_{1},\mathcal{M},nm]\geq\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{item}}}\mathrm{Error}[\overline{\mathcal{P}}_{1},\mathcal{M},n].

By Lemma C.3,

inf𝒜ε,δitemError[𝒫¯1,,d/32ε]Ω(d).\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{item}}}\mathrm{Error}[\overline{\mathcal{P}}_{1},\mathcal{M},\sqrt{d}/32\varepsilon]\geq\Omega(d).

Let n=O~(d/ε)n^{*}=\tilde{O}(\sqrt{d}/\varepsilon). Given a larger dataset of size nnn\gg n^{*}, construct 𝒫¯2=nn𝒫¯1+(1nn)𝒫3\overline{\mathcal{P}}_{2}=\frac{n^{*}}{n}\overline{\mathcal{P}}_{1}+(1-\frac{n^{*}}{n})\mathcal{P}_{3}, where 𝒫3\mathcal{P}_{3} is a Dirac distribution at 𝟎d\mathbf{0}\in\mathbb{R}^{d}.

Hence, by a Chernoff bound, with high probability, the number of samples drawn from 𝒫¯1\overline{\mathcal{P}}_{1} in the dataset 𝒟\mathcal{D} is no more than O(nlog(nd))=d32εO(n^{*}\cdot\log(nd))=\frac{\sqrt{d}}{32\varepsilon}. Then we have

inf𝒜ε,δuserError[𝒫1,,nm]\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\mathrm{Error}[\mathcal{P}_{1},\mathcal{M},nm]\geq inf𝒜ε,δitemError[𝒫¯2,,n]\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{item}}}\mathrm{Error}[\overline{\mathcal{P}}_{2},\mathcal{M},n]
\displaystyle\geq nninf𝒜ε,δitemError[𝒫¯1,,d/32ε]Ω~(d3/2εn).\displaystyle\frac{n^{*}}{n}\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{item}}}\mathrm{Error}[\overline{\mathcal{P}}_{1},\mathcal{M},\sqrt{d}/32\varepsilon]\geq\tilde{\Omega}(\frac{d^{3/2}}{\varepsilon n}).

The precondition of Lemma C.1 requires zO~(m)\|z\|_{\infty}\leq\tilde{O}(\sqrt{m}) almost surely for z𝒫2z\sim\mathcal{P}_{2}, but 𝒫1\mathcal{P}_{1} involves unbounded Gaussian distributions. We therefore construct 𝒫2\mathcal{P}_{2} by truncating the Gaussians.

More specifically, for each item zi,jz_{i,j} drawn from 𝒩(μ,mId)\mathcal{N}(\mu,mI_{d}), we set zi,j[k]=zi,j[k]max{1,|zi,j[k]|/G}z^{\prime}_{i,j}[k]=\frac{z_{i,j}[k]}{\max\{1,|z_{i,j}[k]|/G\}} with G=Θ(mlog(mnd))G=\Theta(\sqrt{m\log(mnd)}). In other words, we get zi,jz^{\prime}_{i,j} by projecting zi,jz_{i,j} to B(0,G)B_{\infty}(0,G). Fixing μ\mu, we first show

𝔼z𝒫2[z]μO(1/dn2).\displaystyle\|\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}_{2}}[z]-\mu\|_{\infty}\leq O(1/dn^{2}). (9)

It suffices to consider any coordinate k[d]k\in[d], and prove

|𝔼z𝒫2[z[k]]μ[k]|O(1/dn2).\displaystyle|\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}_{2}}[z[k]]-\mu[k]|\leq O(1/dn^{2}).

Letting β=μ[k]+Gm\beta=\frac{-\mu[k]+G}{\sqrt{m}} and α=μ[k]Gm\alpha=\frac{-\mu[k]-G}{\sqrt{m}}, we know

|𝔼z𝒫2[z[k]]μ[k]|nnmφ(α)φ(β)αβφ(x)dx,\displaystyle|\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}_{2}}[z[k]]-\mu[k]|\leq\frac{n^{*}}{n}\cdot\sqrt{m}\cdot\frac{\varphi(\alpha)-\varphi(\beta)}{\int_{\alpha}^{\beta}\varphi(x)\mathrm{d}x},

where φ\varphi is the density function of the standard Gaussian 𝒩(0,1)\mathcal{N}(0,1). As μ[k][1,1]\mu[k]\in[-1,1] and G=Θ(mlog(mnd))G=\Theta(\sqrt{m\log(mnd)}), we establish Equation (9). Denote μ=𝔼z𝒫2[zμ]\mu^{\prime}=\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}_{2}}[z\mid\mu], i.e., the mean of 𝒫2\mathcal{P}_{2} conditioned on μ\mu.

Moreover, we have

inf𝒜ε,δuser𝔼μ,𝒟𝒫2mn,i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)))\displaystyle~{}\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\operatorname*{\mathbb{E}}_{\mu,\mathcal{D}\sim\mathcal{P}_{2}^{mn},\mathcal{M}}\sum_{i=1}^{d}|\mu^{\prime}[i]|\mathbf{1}(\operatorname{sign}(\mu^{\prime}[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})))
\displaystyle\geq inf𝒜ε,δuser𝔼μ,𝒟𝒫2mn,(i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)))dμμ)\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\operatorname*{\mathbb{E}}_{\mu,\mathcal{D}\sim\mathcal{P}_{2}^{mn},\mathcal{M}}\Big{(}\sum_{i=1}^{d}|\mu[i]|\mathbf{1}(\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})))-d\|\mu-\mu^{\prime}\|_{\infty}\Big{)}
\displaystyle\geq inf𝒜ε,δuser𝔼μ,𝒟𝒫2mn,i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)))O(1/n2)\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\operatorname*{\mathbb{E}}_{\mu,\mathcal{D}\sim\mathcal{P}_{2}^{mn},\mathcal{M}}\sum_{i=1}^{d}|\mu[i]|\mathbf{1}(\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})))-O(1/n^{2})
(i)\displaystyle\stackrel{{\scriptstyle(i)}}{{\geq}} inf𝒜ε,δuser𝔼μ,𝒟𝒫1mn,i=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)))O(1/n2)\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\operatorname*{\mathbb{E}}_{\mu,\mathcal{D}\sim\mathcal{P}_{1}^{mn},\mathcal{M}}\sum_{i=1}^{d}|\mu[i]|\mathbf{1}(\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})))-O(1/n^{2})
\displaystyle\geq inf𝒜ε,δuserError(𝒫1,,nm)O(1/n2)\displaystyle\inf_{\mathcal{M}\in\mathcal{A}_{\varepsilon,\delta}^{\mathrm{user}}}\mathrm{Error}(\mathcal{P}_{1},\mathcal{M},nm)-O(1/n^{2})
\displaystyle\geq Ω~(d3/2εn),\displaystyle\tilde{\Omega}(\frac{d^{3/2}}{\varepsilon n}),

where inequality (i) follows since we can always simulate samples from 𝒫2\mathcal{P}_{2} using samples from 𝒫1\mathcal{P}_{1}, which means the problem over 𝒫2\mathcal{P}_{2} is at least as hard as that over 𝒫1\mathcal{P}_{1}. This completes the proof. ∎

The lower bound in Theorem 4.1 follows from Lemma C.1 and our construction of the function class; that is,

𝔼[F((𝒟))F(x)]\displaystyle\operatorname*{\mathbb{E}}[F(\mathcal{M}(\mathcal{D}))-F(x^{*})] 𝔼𝒟,,μi=1d|μ[i]|𝟏(sign(μ[i])sign((𝒟)[i]))\displaystyle\geq\operatorname*{\mathbb{E}}_{\mathcal{D},\mathcal{M},\mu}\sum_{i=1}^{d}|\mu[i]|\cdot\mathbf{1}\big{(}\operatorname{sign}(\mu[i])\neq\operatorname{sign}(\mathcal{M}(\mathcal{D})[i])\big{)}
Ω~(d3/2nε)=GDΩ~(d3/2nεm),\displaystyle\geq\tilde{\Omega}\left(\frac{d^{3/2}}{n\varepsilon}\right)=GD\cdot\tilde{\Omega}\left(\frac{d^{3/2}}{n\varepsilon\sqrt{m}}\right),

given G=O~(m)G=\tilde{O}(\sqrt{m}).