
Adaptive Batch Size for Privately Finding Second-Order Stationary Points

Daogao Liu University of Washington, work was done while interning at Apple, [email protected]    Kunal Talwar Apple, [email protected]
Abstract

There is a gap between finding a first-order stationary point (FOSP) and a second-order stationary point (SOSP) under differential privacy constraints, and it remains unclear whether privately finding an SOSP is more challenging than finding an FOSP. Specifically, Ganesh et al. (2023) claimed that an $\alpha$-SOSP can be found with $\alpha=\tilde{O}(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\varepsilon})^{3/7})$, where $n$ is the dataset size, $d$ is the dimension, and $\varepsilon$ is the differential privacy parameter. However, a recent analysis revealed an issue in their saddle point escape procedure, leading to weaker guarantees. Building on the SpiderBoost algorithm framework, we propose a new approach that uses adaptive batch sizes and incorporates the binary tree mechanism. Our method not only corrects this issue but also improves the results for privately finding an SOSP, achieving $\alpha=\tilde{O}(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\varepsilon})^{1/2})$. This improved bound matches the state-of-the-art for finding an FOSP, suggesting that privately finding an SOSP may be achievable at no additional cost.

1 Introduction

Privacy concerns have gained increasing attention with the rapid development of artificial intelligence and modern machine learning, particularly the widespread success of large language models. Differential privacy (DP) has become the standard notion of privacy in machine learning since it was introduced by Dwork et al. (2006). Given two neighboring datasets $\mathcal{D}$ and $\mathcal{D}^{\prime}$, differing by a single item, a mechanism $\mathcal{M}$ is said to be $(\varepsilon,\delta)$-differentially private if, for any event $\mathcal{X}$, it holds that:

\Pr[\mathcal{M}(\mathcal{D})\in\mathcal{X}]\leq e^{\varepsilon}\Pr[\mathcal{M}(\mathcal{D}^{\prime})\in\mathcal{X}]+\delta.
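As a concrete illustration of this definition, the following sketch releases the mean of bounded vectors with the standard Gaussian mechanism, whose noise scale is calibrated to the query's $\ell_2$-sensitivity. This is a generic textbook mechanism, not part of this paper's algorithm; the function name and parameter choices are ours.

```python
import math
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng):
    """Release a vector statistic with (eps, delta)-DP by adding Gaussian
    noise calibrated to its l2-sensitivity (standard bound, eps < 1)."""
    sigma = sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / eps
    return value + rng.normal(0.0, sigma, size=value.shape)

rng = np.random.default_rng(0)
# Mean of n vectors with coordinates in [-1, 1]: replacing one record
# moves the mean by at most 2 * sqrt(d) / n in l2; we use a loose bound.
n, d = 1000, 5
data = rng.uniform(-1.0, 1.0, size=(n, d))
private_mean = gaussian_mechanism(data.mean(axis=0),
                                  sensitivity=2.0 * math.sqrt(d) / n,
                                  eps=0.5, delta=1e-5, rng=rng)
```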

In this work, we focus on the stochastic optimization problem under the constraint of DP. The loss function is defined below:

F_{\mathcal{P}}(x):=\operatorname*{\mathbb{E}}_{z\sim\mathcal{P}}f(x;z),

where the functions may be non-convex, the underlying distribution $\mathcal{P}$ is unknown, and we are given a dataset $\mathcal{D}=\{z_{i}\}_{i\in[n]}$ drawn i.i.d. from $\mathcal{P}$. Notably, our goal is to design a private algorithm with provable utility guarantees under the i.i.d. assumption.

Minimizing non-convex functions is generally challenging and often intractable, but most models used in practice are not guaranteed to be convex. How, then, can we explain the success of optimization methods in practice? One possible explanation is the effectiveness of Stochastic Gradient Descent (SGD), which is well known to find an $\alpha$-first-order stationary point (FOSP) of a non-convex function $f$, that is, a point $x$ such that $\|\nabla f(x)\|\leq\alpha$, within $O(1/\alpha^{2})$ steps (Nesterov, 1998). However, FOSPs can include saddle points or even local maxima. Thus, we focus on finding second-order stationary points (SOSPs) for the non-convex function $F_{\mathcal{P}}$.

Non-convex optimization has been extensively studied in recent years due to its central role in modern machine learning, and we now have a solid understanding of the complexity involved in finding FOSPs and SOSPs (Ghadimi and Lan, 2013; Agarwal et al., 2017; Carmon et al., 2020; Zhang et al., 2020). Variance reduction techniques have been shown to improve the theoretical complexity, leading to the development of several promising algorithms such as Spider (Fang et al., 2018), SARAH (Nguyen et al., 2017), and SpiderBoost (Wang et al., 2019b). More recently, private non-convex optimization has emerged as an active area of research (Wang et al., 2019a; Tran and Cutkosky, 2022; Arora et al., 2023; Gao and Wright, 2023; Ganesh et al., 2023; Wang et al., 2023; Lowy et al., 2024; Kornowski et al., 2024; Menart et al., 2024).

1.1 Our Main Result

In this work, we study how to find an SOSP of $F_{\mathcal{P}}$ privately. Let us formally define the FOSP and the SOSP. For more on Hessian Lipschitz continuity and related background, see the preliminaries in Section 2.

Definition 1.1 (FOSP).

For $\alpha\geq 0$, we say a point $x$ is an $\alpha$-first-order stationary point ($\alpha$-FOSP) of a function $g$ if $\|\nabla g(x)\|\leq\alpha$.

Definition 1.2 (SOSP, Nesterov and Polyak (2006); Agarwal et al. (2017)).

For a function $g:\mathbb{R}^{d}\to\mathbb{R}$ which is $\rho$-Hessian Lipschitz, we say a point $x\in\mathbb{R}^{d}$ is an $\alpha$-second-order stationary point ($\alpha$-SOSP) of $g$ if $\|\nabla g(x)\|_{2}\leq\alpha\;\bigwedge\;\nabla^{2}g(x)\succeq-\sqrt{\rho\alpha}I_{d}$.
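To make the two conditions concrete, here is a small numerical check of Definition 1.2 (illustrative only): the saddle of $f(x)=x_1^2-x_2^2$ at the origin is an $\alpha$-FOSP but not an $\alpha$-SOSP, while the minimum of $\|x\|^2$ is both.

```python
import numpy as np

def is_sosp(grad, hess, alpha, rho):
    """Definition 1.2: ||grad|| <= alpha and lambda_min(hess) >= -sqrt(rho*alpha)."""
    lam_min = np.linalg.eigvalsh(hess).min()
    return bool(np.linalg.norm(grad) <= alpha and lam_min >= -np.sqrt(rho * alpha))

alpha, rho = 0.1, 1.0
grad_zero = np.zeros(2)  # both test points are exact FOSPs
# Saddle of f(x) = x1^2 - x2^2 at the origin: lambda_min = -2 < -sqrt(0.1).
saddle_is_sosp = is_sosp(grad_zero, np.diag([2.0, -2.0]), alpha, rho)
# Minimum of ||x||^2 at the origin: Hessian is positive definite.
minimum_is_sosp = is_sosp(grad_zero, 2.0 * np.eye(2), alpha, rho)
```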

Given a dataset of size $n$, privacy parameters $\varepsilon,\delta$, and functions defined over $d$-dimensional space, Ganesh et al. (2023) proposed a private algorithm that can find an $\alpha_{S}$-SOSP for $F_{\mathcal{P}}$ with

\alpha_{S}=\tilde{O}\big(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\varepsilon})^{3/7}\big).

However, as shown in Arora et al. (2023), the state-of-the-art bound for privately finding an $\alpha_{F}$-FOSP is tighter:\footnote{As proposed by Lowy et al. (2024), allowing exponential running time enables the use of the exponential mechanism to find a warm start, which can further improve the bounds for both FOSP and SOSP.}

\alpha_{F}=\tilde{O}\big(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\varepsilon})^{1/2}\big).

When the privacy parameter $\varepsilon$ is sufficiently small and the error term depending on it dominates the non-private term $1/n^{1/3}$, we observe that $\alpha_{F}\ll\alpha_{S}$. This raises the question: is finding an SOSP under differential privacy constraints more difficult than finding an FOSP?

Moreover, as pointed out by Tao et al. (2025), the results in Ganesh et al. (2023) may be overly optimistic. In particular, the saddle point escape subprocedure used in Ganesh et al. (2023) was designed for the full-batch setting. However, to mitigate dependence issues, the algorithm performs only a single pass over the $n$ functions and relies on minibatches instead. A direct fix for this issue leads to a weaker guarantee on the minimum eigenvalue of the Hessian, specifically,

\nabla^{2}F_{\mathcal{P}}(x)\succeq-\sqrt{\rho}\,d^{1/5}\alpha^{2/5}I_{d},

which fails to satisfy Definition 1.2. To address this, Tao et al. (2025) proposed an alternative correction, but their method resulted in weaker theoretical guarantees than Ganesh et al. (2023) originally claimed, yielding

\alpha=\tilde{O}\big(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\varepsilon})^{2/5}\big).

This work improves upon the results of Ganesh et al. (2023). We introduce a new saddle point escape subprocedure that correctly addresses the issue in their analysis while fully recovering, and even strengthening, the originally claimed guarantees.

Specifically, we present an algorithm that finds an $\alpha$-SOSP with privacy guarantees, where:

\alpha=\tilde{O}(\alpha_{F})=\tilde{O}\big(\frac{1}{n^{1/3}}+(\frac{\sqrt{d}}{n\varepsilon})^{1/2}\big).

This improved bound suggests that, when privately finding a stationary point, we can obtain an SOSP for free under additional (standard) assumptions.

It is also worth noting that our improvement primarily affects the terms depending on the privacy parameters. In the non-private setting, as $\varepsilon\to\infty$ (i.e., without privacy constraints), all the results discussed above achieve a bound of $\tilde{O}(1/n^{1/3})$, which matches the non-private lower bound established by Arjevani et al. (2023) in high-dimensional settings (where $d\geq\tilde{\Omega}(1/\alpha^{4})$). However, to our knowledge, whether this non-private term of $\tilde{O}(1/n^{1/3})$ can be further improved in low dimensions remains an open question.

1.2 Overview of Techniques

In this work, we build on the SpiderBoost algorithm framework, similar to prior approaches (Arora et al., 2023; Ganesh et al., 2023), to find second-order stationary points (SOSPs) privately. At a high level, our method leverages two types of gradient oracles: $\mathcal{O}_{1}(x)\approx\nabla f(x)$, which estimates the gradient at a point $x$, and $\mathcal{O}_{2}(x,y)\approx\nabla f(x)-\nabla f(y)$, which estimates the gradient difference between two points $x$ and $y$. When performing gradient descent, to compute the gradient estimator $\nabla_{t}$ at point $x_{t}$, we can either use $\nabla_{t}=\mathcal{O}_{1}(x_{t})$ for a direct estimate, or $\nabla_{t}=\nabla_{t-1}+\mathcal{O}_{2}(x_{t},x_{t-1})$ to update based on the previous gradient. In our setting, $\mathcal{O}_{1}$ is more accurate but incurs higher computational or privacy costs.

The approach in Arora et al. (2023) adopts SpiderBoost by querying $\mathcal{O}_{1}$ periodically: they call $\mathcal{O}_{1}$ once and then use $\mathcal{O}_{2}$ for $q$ subsequent queries, controlled by a hyperparameter $q$. Their method ensures the gradient estimators are sufficiently accurate on average, which works well for finding first-order stationary points (FOSPs). However, finding an SOSP, where greater precision is required, presents additional challenges when relying on average-accurate gradient estimators.

To address this, Ganesh et al. (2023) introduced a variable $\mathrm{drift}_{t}:=\sum_{i=t_{0}+1}^{t}\|x_{i}-x_{i-1}\|^{2}$, where $t_{0}$ is the index of the last iteration when $\mathcal{O}_{1}$ was queried. If $\mathrm{drift}_{t}$ remains small, the gradient estimator stays accurate enough, allowing further queries of $\mathcal{O}_{2}$. However, if $\mathrm{drift}_{t}$ grows large, the gradient estimator's accuracy deteriorates, signaling the need to query $\mathcal{O}_{1}$ for a fresh, more accurate estimate. This modification enables the algorithm to maintain the precision necessary for privately finding an SOSP.
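The drift rule can be sketched on a toy quadratic as follows. This is a simplified, non-private caricature (exact oracles, illustrative parameter choices), not the paper's full algorithm; it only shows when the loop falls back to a fresh $\mathcal{O}_1$ query.

```python
import numpy as np

def spider_with_drift(grad_f, x0, eta, kappa, steps):
    """Toy SpiderBoost loop with the drift rule: query O1 (a fresh
    gradient) once drift exceeds kappa, otherwise update the running
    estimator with the gradient difference O2."""
    x = np.asarray(x0, dtype=float)
    x_prev, nabla = None, None
    drift, refreshes = np.inf, 0          # large drift_0 forces O1 first
    for _ in range(steps):
        if drift >= kappa:                # O1: fresh gradient estimate
            nabla = grad_f(x)
            drift, refreshes = 0.0, refreshes + 1
        else:                             # O2: gradient-difference update
            nabla = nabla + (grad_f(x) - grad_f(x_prev))
        x_prev, x = x, x - eta * nabla
        drift += float(np.sum((x - x_prev) ** 2))  # accumulate ||x_i - x_{i-1}||^2
    return x, refreshes

# F(x) = 0.5 * ||x||^2, so grad F(x) = x; each step halves the iterate.
x_final, n_refresh = spider_with_drift(lambda x: x, np.ones(3),
                                       eta=0.5, kappa=0.5, steps=50)
```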

Our improvement introduces two new components: the use of the tree mechanism in place of the Gaussian mechanism used in Ganesh et al. (2023), and adaptive batch sizes for constructing $\mathcal{O}_{2}$.

In the prior approach using the Gaussian mechanism, a noisy gradient estimator $\nabla_{t-1}$ is computed, and the next estimator is updated via $\nabla_{t}=\nabla_{t-1}+\mathcal{O}_{2}(x_{t},x_{t-1})+g_{t}$, where $g_{t}$ is Gaussian noise added to preserve privacy. Over multiple iterations, the accumulated noise $\sum g_{t}$ can severely degrade the accuracy of the gradient estimator, requiring frequent re-queries of $\mathcal{O}_{1}$. The tree mechanism, by contrast, mitigates this issue when frequent queries to $\mathcal{O}_{2}$ are needed.

However, simply replacing the Gaussian mechanism with the tree mechanism and using a fixed batch size does not yield optimal results. In worst-case scenarios, where the function's gradients are large, the $\mathrm{drift}$ grows quickly, necessitating frequent calls to $\mathcal{O}_{1}$, which diminishes the advantages of the tree mechanism.

To address this, we introduce adaptive batch sizes. In Ganesh et al. (2023), the oracle $\mathcal{O}_{2}$ is constructed by drawing a fixed batch $S_{t}$ of size $B$ from the unused dataset and outputting $\mathcal{O}_{2}(x_{t},x_{t-1}):=\sum_{z\in S_{t}}\frac{\nabla f(x_{t};z)-\nabla f(x_{t-1};z)}{B}$. Given an upper bound on $\mathrm{drift}$, they guaranteed that $\|x_{t}-x_{t-1}\|\leq D$ for some parameter $D$, thereby bounding the sensitivity of $\mathcal{O}_{2}$.

In contrast, we dynamically adjust the batch size in proportion to $\|x_{t}-x_{t-1}\|$, setting $B_{t}\propto\|x_{t}-x_{t-1}\|$, and compute $\mathcal{O}_{2}(x_{t},x_{t-1}):=\sum_{z\in S_{t}}\frac{\nabla f(x_{t};z)-\nabla f(x_{t-1};z)}{B_{t}}$. Fixed batch sizes present two drawbacks: (i) when $\|x_{t}-x_{t-1}\|$ is large, the gradient estimator has higher sensitivity and variance, leading to worse estimation accuracy; (ii) when $\|x_{t}-x_{t-1}\|$ is small, progress in terms of function value decrease is limited. A fixed batch size forces us to handle both cases simultaneously: we must add noise and analyze accuracy assuming a worst-case large $\|x_{t}-x_{t-1}\|$, yet for the utility analysis we treat $\|x_{t}-x_{t-1}\|$ as small when examining the function value decrease. The adaptive batch size resolves this tension by controlling sensitivity and variance adaptively: when $\|x_{t}-x_{t-1}\|$ is small, we decrease the batch size while still controlling the variance and sensitivity; when it is large, the function value decreases significantly, aiding in finding an SOSP.
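A back-of-the-envelope sketch of why $B_t\propto\|x_t-x_{t-1}\|$ helps, with hypothetical constants of our choosing: by $M$-smoothness, each record's contribution to $\mathcal{O}_2$ has norm at most $M\|x_t-x_{t-1}\|$, so the replace-one-record sensitivity of the averaged oracle is roughly $2M\|x_t-x_{t-1}\|/B_t$, which stays bounded under the adaptive choice but varies wildly with a fixed batch.

```python
import math

M, c, B_fixed = 10.0, 50.0, 20   # hypothetical smoothness, scaling, fixed batch

def sensitivity(step_norm, batch):
    # Replace-one-record sensitivity of the averaged gradient difference:
    # each record contributes a vector of norm <= M * step_norm.
    return 2.0 * M * step_norm / batch

step_norms = [0.01, 0.1, 1.0, 5.0]
# Adaptive: B_t = ceil(c * ||x_t - x_{t-1}||) caps sensitivity at ~2M/c.
adaptive = [sensitivity(s, max(1, math.ceil(c * s))) for s in step_norms]
# Fixed: sensitivity scales linearly with the step length.
fixed = [sensitivity(s, B_fixed) for s in step_norms]
```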

By combining the tree mechanism with adaptive batch sizes, we improve the accuracy of gradient estimation and achieve better results for privately finding an SOSP.

Fixing the Error:

Building upon our discussion of obtaining more accurate gradient estimations through adaptive batch sizes, we extend this approach to Hessian estimations, achieving accuracy in terms of the operator norm. Upon encountering a potential saddle point, characterized by a small gradient norm, we utilize the Hessian to inform the gradient estimation over several iterations. This process is analogous to performing stochastic power iteration on the Hessian to identify the direction corresponding to the smallest eigenvalue. If the Hessian has a sufficiently negative eigenvalue (say, smaller than $-\sqrt{\rho\alpha}$), we demonstrate that this approach induces a significant drift, effectively enabling a successful escape from the saddle point. This idea can also be applied to Ganesh et al. (2023) to recover their claimed rate.
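The power-iteration analogy can be sketched as follows (illustrative, with an assumed bound $\|H\|\leq M$): running power iteration on the shifted matrix $MI-H$ extracts the eigenvector of $H$ with the smallest eigenvalue, which is the escape direction.

```python
import numpy as np

def min_eig_power(H, M, iters=200, seed=0):
    """Power iteration on the shifted matrix M*I - H: its top eigenvector
    is the eigenvector of H with the smallest eigenvalue (assumes ||H|| <= M)."""
    rng = np.random.default_rng(seed)
    v = rng.normal(size=H.shape[0])
    v /= np.linalg.norm(v)
    shifted = M * np.eye(H.shape[0]) - H
    for _ in range(iters):
        v = shifted @ v
        v /= np.linalg.norm(v)
    # Rayleigh quotient recovers the corresponding eigenvalue of H.
    return float(v @ H @ v), v

H = np.diag([1.0, 0.5, -0.8])          # hypothetical Hessian at a saddle
lam_min_est, v_min = min_eig_power(H, M=2.0)
```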

1.3 Other Related Work

A significant body of literature on private optimization focuses on the convex setting, where it is typically assumed that each function $f(\cdot;z)$ is convex for any $z$ in the universe (e.g., Chaudhuri et al., 2011; Bassily et al., 2014, 2019; Feldman et al., 2020; Asi et al., 2021; Kulkarni et al., 2021; Carmon et al., 2023; Gopi et al., 2023).

The tree mechanism, originally introduced by the differential privacy (DP) community for continual observation (Dwork et al., 2010; Chan et al., 2011), has inspired tree-structured private optimization algorithms such as Asi et al. (2021); Bassily et al. (2021); Arora et al. (2023); Zhang et al. (2024). One can also use the matrix mechanism (Fichtenberger et al., 2023), which improves upon the tree mechanism by a constant factor. Some prior works have explored adaptive batch size techniques in optimization. For instance, De et al. (2016) introduced adaptive batch sizing for stochastic gradient descent (SGD), while Ji et al. (2020) combined adaptive batch sizing with variance reduction to modify the SVRG and Spider algorithms. However, the motivations of these works and their approaches to setting adaptive batch sizes differ from ours. To the best of our knowledge, we are the first to propose using adaptive batch sizes in the context of optimization under differential privacy constraints.

Most of the non-convex optimization literature assumes that the functions being optimized are smooth. Recent work has begun addressing non-smooth, non-convex functions as well, as seen in Zhang et al. (2020); Kornowski and Shamir (2021); Davis et al. (2022); Jordan et al. (2023).

2 Preliminaries

Throughout the paper, we use $\|\cdot\|$ to represent both the $\ell_{2}$ norm of a vector and the operator norm of a matrix when there is no confusion.

Definition 2.1 (Lipschitz, Smoothness and Hessian Lipschitz).

Let $\mathcal{K}\subseteq\mathbb{R}^{d}$. Given a twice differentiable function $f:\mathcal{K}\to\mathbb{R}$, we say $f$ is $G$-Lipschitz if for all $x_{1},x_{2}\in\mathcal{K}$, $|f(x_{1})-f(x_{2})|\leq G\|x_{1}-x_{2}\|$; we say $f$ is $M$-smooth if for all $x_{1},x_{2}\in\mathcal{K}$, $\|\nabla f(x_{1})-\nabla f(x_{2})\|\leq M\|x_{1}-x_{2}\|$; and we say $f$ is $\rho$-Hessian Lipschitz if for all $x_{1},x_{2}\in\mathcal{K}$, $\|\nabla^{2}f(x_{1})-\nabla^{2}f(x_{2})\|\leq\rho\|x_{1}-x_{2}\|$.

2.1 Other Techniques

As mentioned in the introduction, we use the tree mechanism (Algorithm 1) to privatize the algorithm, whose formal guarantee is stated below:

Theorem 2.2 (Tree Mechanism, Dwork et al. (2010); Chan et al. (2011)).

Let $\mathcal{Z}_{1},\cdots,\mathcal{Z}_{\Sigma}$ be dataset spaces, and $\mathcal{X}$ be the state space. Let $\mathcal{M}_{i}:\mathcal{X}^{i-1}\times\mathcal{Z}_{i}\to\mathcal{X}$ be a sequence of algorithms for $i\in[\Sigma]$. Let $\mathcal{A}:\mathcal{Z}^{(1:\Sigma)}\to\mathcal{X}^{\Sigma}$ be the algorithm that, given a dataset $Z_{1:\Sigma}\in\mathcal{Z}^{(1:\Sigma)}$, sequentially computes $X_{i}=\sum_{j=1}^{i}\mathcal{M}_{j}(X_{1:j-1},Z_{j})+\mathrm{TREE}(i)$ for $i\in[\Sigma]$, and then outputs $X_{1:\Sigma}$.

Suppose for all $i\in[\Sigma]$ and neighboring $Z_{1:\Sigma},Z_{1:\Sigma}^{\prime}\in\mathcal{Z}^{(1:\Sigma)}$, $\|\mathcal{M}_{i}(X_{1:i-1},Z_{i})-\mathcal{M}_{i}(X_{1:i-1},Z_{i}^{\prime})\|\leq s$ for all auxiliary inputs $X_{1:i-1}\in\mathcal{X}^{i-1}$. Then setting $\sigma=\frac{4s\sqrt{\log\Sigma\log(1/\delta)}}{\varepsilon}$, Algorithm 1 is $(\varepsilon,\delta)$-DP. Furthermore, with probability at least $1-\Sigma\cdot\iota$, for all $t\in[\Sigma]$: $\|\mathrm{TREE}(t)\|\lesssim\sqrt{d\log(1/\iota)}\,\sigma$.

1:Input: Noise parameter $\sigma_{\mathrm{tree}}$, sequence length $\Sigma$
2:Define $\mathcal{T}:=\{(u,v):u=j\cdot 2^{\ell-1}+1,\;v=(j+1)\cdot 2^{\ell-1},\;1\leq\ell\leq\log\Sigma,\;0\leq j\leq\Sigma/2^{\ell-1}-1\}$
3:Sample and store $\zeta_{(u,v)}\sim\mathcal{N}(0,\sigma_{\mathrm{tree}}^{2}I_{d})$ for each $(u,v)\in\mathcal{T}$
4:for $t=1,\cdots,\Sigma$ do
5:     Let $\mathrm{TREE}(t)\leftarrow\sum_{(u,v)\in\mathrm{NODE}(t)}\zeta_{(u,v)}$
6:end for
7:Return: $\mathrm{TREE}(t)$ for each $t\in[\Sigma]$
8:
9:Function NODE:
10:Input: index $t\in[\Sigma]$
11:Initialize $S=\{\}$ and $k=0$
12:for $i=1,\cdots,\lceil\log\Sigma\rceil$ while $k<t$ do
13:     Set $k^{\prime}=k+2^{\lceil\log\Sigma\rceil-i}$
14:     if $k^{\prime}\leq t$ then
15:         $S\leftarrow S\cup\{(k+1,k^{\prime})\}$, $k\leftarrow k^{\prime}$
16:     end if
17:end for
Algorithm 1 Tree Mechanism

We also need the concentration inequality for norm-subGaussian random vectors.

Definition 2.3 (SubGaussian and Norm-SubGaussian).

We say a random vector $x\in\mathbb{R}^{d}$ is SubGaussian ($\mathrm{SG}(\zeta)$) if there exists a positive constant $\zeta$ such that $\operatorname*{\mathbb{E}}e^{\langle v,x-\operatorname*{\mathbb{E}}x\rangle}\leq e^{\|v\|^{2}\zeta^{2}/2}$ for all $v\in\mathbb{R}^{d}$. We say $x\in\mathbb{R}^{d}$ is norm-SubGaussian ($\mathrm{nSG}(\zeta)$) if there exists $\zeta$ such that $\Pr[\|x-\operatorname*{\mathbb{E}}x\|\geq t]\leq 2e^{-\frac{t^{2}}{2\zeta^{2}}}$ for all $t\in\mathbb{R}$.

Lemma 2.4 (Hoeffding type inequality for norm-subGaussian, Jin et al. (2019)).

Let $x_{1},\cdots,x_{k}\in\mathbb{R}^{d}$ be random vectors, and for each $i\in[k]$, $x_{i}\mid\mathcal{F}_{i-1}$ is zero-mean $\mathrm{nSG}(\zeta_{i})$, where $\mathcal{F}_{i}$ is the corresponding filtration. Then there exists an absolute constant $c$ such that for any $\omega>0$, with probability at least $1-\omega$, $\|\sum_{i=1}^{k}x_{i}\|\leq c\cdot\sqrt{\sum_{i=1}^{k}\zeta_{i}^{2}\log(2d/\omega)}$, which means $\sum_{i=1}^{k}x_{i}$ is $\mathrm{nSG}(\sqrt{c\log(d)\sum_{i=1}^{k}\zeta_{i}^{2}})$.

Theorem 2.5 (Matrix Azuma Inequality, Tropp (2012)).

Consider a finite sequence of random self-adjoint matrices $X_{1},\cdots,X_{n}$ with common dimension $d\times d$ such that $\operatorname*{\mathbb{E}}[X_{i}\mid\mathcal{F}_{i-1}]=0$ and $X_{i}\preceq A_{i}$ a.s. for all $i$. Let $\sigma^{2}=\|\sum_{i=1}^{n}A_{i}^{2}\|$ and $S=\sum_{i=1}^{n}X_{i}$. Then for any $t\geq 0$, we have

\Pr[\|S\|\geq t]\leq d\cdot\exp\big(-\frac{t^{2}}{8\sigma^{2}}\big).

3 SOSP

We make the following assumption for our main result.

Assumption 3.1.

Let $G,\rho,M,B>0$. Any function drawn from $\mathcal{P}$ is $G$-Lipschitz, $\rho$-Hessian Lipschitz, and $M$-smooth, almost surely. Moreover, we are given a public initial point $x_{0}$ such that $F_{\mathcal{P}}(x_{0})-\inf_{x}F_{\mathcal{P}}(x)\leq B$.

We modify the Stochastic SpiderBoost used in Ganesh et al. (2023) and state it in Algorithm 3. The following standard lemma plays a crucial role in finding stationary points of smooth functions:

Lemma 3.2.

Assume $F$ is $M$-smooth and let $\eta=1/M$. Let $x_{t+1}=x_{t}-\eta g_{t}$. Then we have $F(x_{t+1})\leq F(x_{t})+\eta\|g_{t}\|\cdot\|\nabla F(x_{t})-g_{t}\|-\frac{\eta}{2}\|g_{t}\|^{2}$. Moreover, if $\|\nabla F(x_{t})\|\geq\gamma$ and $\|g_{t}-\nabla F(x_{t})\|\leq\gamma/4$, we have

F(x_{t+1})-F(x_{t})\leq-\eta\|g_{t}\|^{2}/16.
Proof.

By the smoothness assumption, we know

F(x_{t+1})\leq F(x_{t})+\langle\nabla F(x_{t}),x_{t+1}-x_{t}\rangle+\frac{M}{2}\|x_{t+1}-x_{t}\|^{2}
=F(x_{t})-\frac{\eta}{2}\|g_{t}\|^{2}-\langle\nabla F(x_{t})-g_{t},\eta g_{t}\rangle
\leq F(x_{t})+\eta\|\nabla F(x_{t})-g_{t}\|\cdot\|g_{t}\|-\frac{\eta}{2}\|g_{t}\|^{2}.

When $\|\nabla F(x_{t})\|\geq\gamma$ and $\|g_{t}-\nabla F(x_{t})\|\leq\gamma/4$, we have $\|g_{t}\|\geq 3\gamma/4$ and hence $\eta\|\nabla F(x_{t})-g_{t}\|\cdot\|g_{t}\|\leq\frac{\eta}{3}\|g_{t}\|^{2}$, so the right-hand side is at most $F(x_{t})-\frac{\eta}{6}\|g_{t}\|^{2}\leq F(x_{t})-\frac{\eta}{16}\|g_{t}\|^{2}$. ∎

Lemma 3.2 shows that one can find an FOSP with inexact gradient estimates, as long as the estimation error is small enough. A key challenge in finding an SOSP, compared to an FOSP, is ensuring that the algorithm can effectively escape saddle points. Existing analyses of saddle point escape using inexact gradients are insufficient for our purposes. To address this, we propose a new approach that leverages both inexact gradients and Hessians for escaping saddle points. The details are presented in the following subsection.
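The descent inequality of Lemma 3.2 can be sanity-checked numerically on a toy $M$-smooth quadratic. This is a sketch with assumed constants of our choosing; since the derivation is exact for quadratics with $\|A\|\leq M$, the inequality should hold for every random draw.

```python
import numpy as np

# F(x) = 0.5 * x^T A x with eigenvalues of A in (0, M], so F is M-smooth.
rng = np.random.default_rng(1)
d, M = 4, 3.0
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
A = Q @ np.diag([3.0, 2.0, 1.0, 0.5]) @ Q.T
F = lambda x: 0.5 * x @ A @ x
gradF = lambda x: A @ x

eta = 1.0 / M
violations = 0
for _ in range(100):
    x = rng.normal(size=d)
    g = gradF(x) + 0.05 * rng.normal(size=d)   # inexact gradient estimate
    x_next = x - eta * g
    # Lemma 3.2: F(x') <= F(x) + eta ||g|| ||grad F(x) - g|| - (eta/2) ||g||^2
    rhs = (F(x) + eta * np.linalg.norm(g) * np.linalg.norm(gradF(x) - g)
           - 0.5 * eta * np.linalg.norm(g) ** 2)
    if F(x_next) > rhs + 1e-9:
        violations += 1
```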

3.1 Escape Saddle Point with Hessian

We present the subprocedure for escaping saddle points in this section, with its pseudocode provided in Algorithm 2. The core idea is straightforward: given a sufficiently accurate estimate of the function’s Hessian, we can apply the power method to escape the saddle point. If the Hessian’s smallest eigenvalue is sufficiently negative, then after a certain number of steps, the iterate will have moved a significant distance, resulting in a substantial decrease in function value—indicating a successful escape.

Definition 3.3.

We say the estimation $(g,H)$ of the pair of gradient and Hessian is $(\gamma,\varkappa)$-accurate at point $x$ if

\|g-\nabla F(x)\|_{2}\leq\gamma,\qquad\|H-\nabla^{2}F(x)\|_{2}\leq\varkappa.
Algorithm 2 Escape from Saddle Point
1:Input: initial point $x_{0}$ such that $\|\nabla F(x_{0})\|\leq\alpha$, $(\gamma,\varkappa)$-accurate estimation pair $(g,H)$ at $x_{0}$, parameters $\Gamma,\Xi$
2:Process:
3:$y\leftarrow x_{0}$
4:for $t=1,\cdots,\Gamma$ do
5:     $g_{t-1}=g+H(x_{t-1}-x_{0})+\zeta_{t-1}$, where $\zeta_{t-1}\sim\mathcal{N}(0,\sigma_{t-1}^{2}I_{d})$
6:     $x_{t}=x_{t-1}-\eta g_{t-1}$
7:     if $\|x_{t}-x_{0}\|\geq\Xi$ then
8:         $y\leftarrow x_{t}$
9:         Break
10:     end if
11:end for
12:Output: $y$
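A minimal numerical sketch of Algorithm 2 (with illustrative parameters of our choosing and an exact quadratic model): at a saddle of $F(x)=\frac{1}{2}x^{\top}Hx$ with $\lambda_{\min}(H)<0$, the Gaussian perturbation seeds a component along the escape direction, which the iteration amplifies geometrically until the iterate has moved distance $\Xi$.

```python
import numpy as np

def escape_saddle(x0, g, H, eta, Gamma, Xi, sigma, seed=0):
    """Sketch of Algorithm 2: descend on the quadratic model
    g + H (x - x0) with Gaussian perturbation until the iterate moves
    distance Xi from x0 (success) or Gamma steps elapse."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    y = x.copy()
    for _ in range(Gamma):
        grad_est = g + H @ (x - x0) + rng.normal(0.0, sigma, size=x.shape)
        x = x - eta * grad_est
        if np.linalg.norm(x - x0) >= Xi:
            y = x
            break
    return y

H = np.diag([1.0, -0.5])        # lambda_min = -0.5: an escapable saddle
F = lambda x: 0.5 * x @ H @ x   # F has a saddle point at the origin
y_out = escape_saddle(np.zeros(2), g=np.zeros(2), H=H,
                      eta=0.5, Gamma=500, Xi=0.5, sigma=1e-3)
```

With these parameters the unstable direction grows by a factor $1+\eta\cdot 0.5$ per step, so the iterate leaves the ball of radius $\Xi$ well within the step budget and lands at a strictly lower function value.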

The Gaussian noise $\zeta_{t-1}$ added in Line 5 is for privacy. We have the following guarantee for Algorithm 2:

Theorem 3.4.

Suppose the function $F$ satisfies Assumption 3.1, and the initial point $x_{0}$ is a saddle point such that $\|\nabla F(x_{0})\|\leq\alpha$ and $\lambda_{\min}(\nabla^{2}F(x_{0}))\leq-\sqrt{\rho\alpha}$. When $\alpha\geq\gamma\log^{3}(dBM/\rho\iota)$, $\sigma_{t}=\frac{\gamma}{\sqrt{d\log(T/\iota)}}$, $\Gamma=\tilde{O}(M/\sqrt{\rho\gamma})$, $\varkappa\Xi\leq\gamma$, $\eta=1/M$, and $\Xi=\sqrt{\gamma/\rho}$, then with probability at least $1-\iota$, the output $y$ satisfies

F(y)-F(x_{0})\leq-\Phi,

where $\Phi=\Omega(\frac{\gamma^{3/2}}{\sqrt{\rho}\log^{3}(dMZb/\rho\gamma\iota)})$.

We have the following guarantee on $g_{t}$:

Lemma 3.5.

With probability at least $1-\iota/2$, for all $t\in[\Gamma]$, we have

\|g_{t-1}-\nabla F(x_{t-1})\|_{2}\leq 2\gamma+(\varkappa+\rho\Xi)\Xi.

Let $\varrho_{t}=\nabla F(x_{t})-g_{t}$ denote the difference between the true gradient and our estimate. We use the following standard technical claims to connect the distance traveled by the trajectory and the change in function value.

Claim 3.6 (Lemma 10 in Wang et al. (2019a)).

We have

F(x_{t+1})-F(x_{t})\leq-\frac{\eta}{4}\|\nabla F(x_{t})\|_{2}^{2}+5\eta\|\varrho_{t}\|_{2}^{2}.
Claim 3.7 (Lemma 11 in Wang et al. (2019a)).

For any $t\in[\Gamma]$, one has

\|x_{t}-x_{0}\|_{2}^{2}\leq 8\eta\Gamma(F(x_{0})-F(x_{t}))+50\eta^{2}t\sum_{i=1}^{t}\|\varrho_{i}\|_{2}^{2}.

In Algorithm 2, we halt the algorithm whenever we find a point $x_{t}$ such that $\|x_{t}-x_{0}\|_{2}\geq\Xi$. The following lemma shows that a large distance traveled implies a large decrease in function value at $x_{t}$, that is, a successful escape from the saddle point:

Lemma 3.8.

Suppose Algorithm 2 halts early because the condition $\|x_{t}-x_{0}\|\geq\Xi$ is satisfied; then we have

F(x_{t})-F(x_{0})\leq-\Phi.

Similar to previous works (Jin et al., 2017; Wang et al., 2019a), we use a coupling argument to prove that we can escape the saddle point with high probability. Fix the sequence of Gaussian noise $(\zeta_{1},\cdots,\zeta_{\Gamma})$ and condition on the event that $\|\zeta_{t}\|_{2}\leq\gamma$ for all $t\in[\Gamma]$.

Let $y(x)$ denote the output $y$ conditioned on the first iterate being $x_{1}=x$. Define the stuck region around $x_{0}-\eta g$:

\mathcal{X}(x_{0})=\{x\mid x\in B(x_{0}-\eta g,r)\text{ and }\|y(x)-x_{0}\|_{2}\leq\Xi\},\qquad(1)

where $r=O(\eta\gamma)$. Suppose $x_{0}$ is the saddle point, and let $\mathbf{e}_{1}$ be the eigenvector corresponding to the minimum eigenvalue of $H$. We have the following lemma:

Lemma 3.9.

Suppose $\varkappa\leq\sqrt{\rho\gamma}/2$. For any two points $w,u\in B(x_{0},r)$, if $w-u=\mu r\mathbf{e}_{1}$ with $\mu\geq\iota^{2}/16\sqrt{d}$, then at least one of $w,u$ is not in $\mathcal{X}(x_{0})$.

With these lemmas, we can complete the proof of Theorem 3.4. In particular, Theorem 3.4 shows that if we encounter a saddle point, then over the following $\tilde{O}(1/\sqrt{\gamma})$ steps the function value decreases by at least $\tilde{\Omega}(\gamma^{3/2})$; that is, the function value decreases by $\tilde{\Omega}(\gamma^{2})$ on average per step.

3.2 Main Algorithm

Algorithm 3 Stochastic Spider with Escaping Saddle Point SubProcedure
1:Input: Dataset $\mathcal{D}$, privacy parameters $\varepsilon,\delta$, parameters of objective function $B,M,G,\rho$, parameter $\kappa$, failure probability $\omega$, batch size parameter $b$, noise parameters $\sigma_{\mathrm{tree}},\{\sigma_{t}\}$
2:set $\mathrm{drift}_{0}=\kappa$, $\mathrm{frozen}_{-1}=1$, $\Delta_{-1}=0$, $\mathcal{D}_{r}\leftarrow\mathcal{D}$, $t=0$, $\mathrm{EscapeFlag}=\mathbf{False}$
3:while $t<T$ and the number of unused functions is larger than $b$ do
4:   // Estimate gradient and Hessian with oracle (Algorithm 4)
5:     if $\mathrm{drift}_{t}\geq\kappa$ or $\mathrm{count}\geq\tau$ then
6:         $\nabla_{t}=\mathcal{O}_{1}(x_{t})$, $H_{t}=\mathcal{O}_{3}(x_{t})$
7:         $\mathrm{drift}_{t}=0$, $\mathrm{frozen}_{t}=\mathrm{frozen}_{t-1}-1$, $\mathrm{count}=0$
8:     else
9:         $\Delta_{t}=\mathcal{O}_{2}(x_{t},x_{t-1})$, $\nabla_{t}=\nabla_{t-1}+\Delta_{t}$, $\widetilde{\nabla}_{t}=\nabla_{t}+\mathrm{TREE}(t)$
10:         $\Delta_{t}^{H}=\mathcal{O}_{4}(x_{t},x_{t-1})$, $H_{t}=H_{t-1}+\Delta_{t}^{H}$
11:     end if
12:// Escape Saddle Point if $\mathrm{EscapeFlag}=\mathbf{True}$
13:     if $\|\widetilde{\nabla}_{t}\|\leq\gamma\log^{3}(BMd/\rho\omega)\bigwedge\mathrm{frozen}_{t-1}\leq 0$ then
14:         Set $\mathrm{frozen}_{t}=\Gamma$, $\mathrm{EscapeFlag}=\mathbf{True}$, $g=\widetilde{\nabla}_{t}$, $H=H_{t}$, $x_{\mathrm{anchor}}=x_{t}$, $\mathrm{count}=\mathrm{count}+1$
15:     end if
16:     if $\mathrm{EscapeFlag}==\mathbf{True}$ then
17:         $g_{t}=g+H(x_{t}-x_{\mathrm{anchor}})+\xi_{t}$, where $\xi_{t}\sim\mathcal{N}(0,\sigma_{t}^{2}I_{d})$
18:     else
19:         $g_{t}=\widetilde{\nabla}_{t}$ $\triangleright$ Normal gradient descent
20:     end if
21:     $x_{t+1}=x_{t}-\eta g_{t}$, $\mathrm{drift}_{t+1}=\mathrm{drift}_{t}+\|g_{t}\|_{2}$, $\mathrm{frozen}_{t+1}=\mathrm{frozen}_{t}-1$
22:     if $\|x_{t+1}-x_{\mathrm{anchor}}\|\geq\Xi$ or $\mathrm{frozen}_{t+1}\leq 0$ then
23:         $\mathrm{EscapeFlag}=\mathbf{False}$
24:     end if
25:     $t=t+1$
26:end while
27:Return: $\{x_{1},\cdots,x_{t}\}$
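To make the interplay of $\mathrm{drift}$, $\mathrm{count}$, and $\mathrm{frozen}$ concrete, here is a simplified trace of Algorithm 3's oracle schedule. This is a caricature with mock gradient-norm sequences and parameter choices of our own; it reproduces only the branching logic, not the private estimates themselves.

```python
def oracle_schedule(grad_norms, kappa, tau, big_gamma):
    """Return the 'O1' / 'O2' tag used at each step, following the
    drift / count / frozen bookkeeping of Algorithm 3 (simplified)."""
    drift, frozen, count = kappa, 1, 0   # drift_0 = kappa forces O1 at t = 0
    escape, calls = False, []
    for g in grad_norms:
        if drift >= kappa or count >= tau:
            calls.append("O1")           # fresh gradient + Hessian estimate
            drift, count = 0.0, 0
        else:
            calls.append("O2")           # gradient-difference update
        if g <= 0.1 and frozen <= 0 and not escape:   # potential saddle point
            frozen, escape, count = big_gamma, True, count + 1
        drift += g
        frozen -= 1
        if frozen <= 0:
            escape = False
    return calls

# Large gradients: drift alone dictates periodic O1 refreshes.
steady = oracle_schedule([1.0] * 6, kappa=2.5, tau=10, big_gamma=5)
# Small gradients: an escape bumps count past tau and forces a refresh.
saddle = oracle_schedule([0.05] * 3, kappa=2.5, tau=1, big_gamma=3)
```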

Algorithm 3 follows the SpiderBoost framework. For simplicity, we primarily focus on the gradient oracles and estimators, as the Hessian case follows the same principle; moreover, we allow a larger error tolerance for the Hessian estimate $H_{t}$, approximately $\sqrt{\rho\gamma}$, compared to $\gamma$ for the gradient. We now discuss its key variables and parameters.

We either query $\mathcal{O}_{1}$ to estimate the gradient directly or query $\mathcal{O}_{2}$ to estimate the gradient difference between consecutive points. The term $\mathrm{drift}$ controls the estimation error: when $\mathrm{drift}_{t}$ is small, $\Delta_{t}$ remains a reliable estimator; when $\mathrm{drift}_{t}$ exceeds the threshold determined by the parameter $\kappa$, we obtain a fresh estimate from $\mathcal{O}_{1}$.

The counter $\mathrm{frozen}$ serves a technical role in the application of Theorem 3.4: specifically, when a potential saddle point is detected, we invoke the subprocedure described in Algorithm 2.

The variable $\mathrm{count}$ tracks the number of saddle point escapes since the last fresh gradient estimate $\nabla_{t}$ and Hessian estimate $H_{t}$. Once $\mathrm{count}$ exceeds the threshold $\tau$, we force a fresh query to $\mathcal{O}_{1}$ and $\mathcal{O}_{3}$ to ensure privacy, following the Gaussian mechanism.

We have the following guarantee of Algorithm 3.

Proposition 3.10.

Under Assumption 3.1, suppose the oracles satisfy $\|\widetilde{\nabla}_{t}-\nabla F(x_{t})\|\leq\gamma$ and $\|H_{t}-\nabla^{2}F(x_{t})\|_{2}\leq\sqrt{\rho\gamma}/2$ for any $t\in[T]$. When $\sigma_{\mathrm{tree}}=\sigma_{t}=\frac{\gamma}{\sqrt{d\log(T/\iota)}}$, $\Gamma=\tilde{O}(M/\sqrt{\rho\gamma})$, $\varkappa\Xi\leq\gamma$, $\eta=1/M$, and $\Xi=\sqrt{\gamma/\rho}$, setting $T=\tilde{O}(B/\eta\gamma^{2})$ and supposing the algorithm does not halt before completing all $T$ iterations, then with probability at least $1-T\iota$, at least one point in the output set $\{x_{1},\cdots,x_{T}\}$ of Algorithm 3 is an $\tilde{O}(\gamma)$-SOSP.

The intuition behind the proof of Proposition 3.10 is that if the current point is not an O(γ)O(\gamma)-SOSP, then on average the function value decreases by at least Ω(ηγ2)\Omega(\eta\gamma^{2}) per step. Since F𝒫(x0)F𝒫+BF_{\mathcal{P}}(x_{0})\leq F_{\mathcal{P}}^{*}+B, O(B/ηγ2)O(B/\eta\gamma^{2}) steps suffice to find an O(γ)O(\gamma)-SOSP. See the Appendix for the full proof of Proposition 3.10.
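This counting argument can be made explicit. As a simplification, suppose every step taken at a point that is not an O(γ)-SOSP decreases the function value by at least c·η·γ² for some constant c > 0 (amortizing each escape episode over its Γ steps):

```latex
% Total decrease over T steps, assuming no iterate is an O(\gamma)-SOSP:
F(x_T) - F(x_0) \le -\, c\,\eta\gamma^{2}\, T .
% Assumption 3.1 bounds the total possible decrease by B, so
c\,\eta\gamma^{2}\, T \le B
\quad\Longrightarrow\quad
T \le \frac{B}{c\,\eta\gamma^{2}} = O\!\left(B/(\eta\gamma^{2})\right).
```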

It remains to build private oracles with provable utility guarantees. We construct the oracles below in Algorithm 4.

Lemma 3.11 (Oracles with bounded error).

Under Assumption 3.1, let ι>0\iota>0 and use Algorithm 4 as instantiations of the oracles 𝒪1\mathcal{O}_{1} through 𝒪4\mathcal{O}_{4}. If 𝒟\mathcal{D} is drawn i.i.d. from distribution 𝒫\mathcal{P}, we have:
(1)(1) for any xtx_{t}, we have 𝔼[𝒪1(xt)]=F(xt)\operatorname*{\mathbb{E}}[\mathcal{O}_{1}(x_{t})]=\nabla F(x_{t}) and

Pr[𝒪1(xt)F(xt)ζ1]ι,\displaystyle\Pr[\|\mathcal{O}_{1}(x_{t})-\nabla F(x_{t})\|\geq\zeta_{1}]\leq\iota,

where ζ1=O(Glog(d/ι)/b)\zeta_{1}=O(G\sqrt{\log(d/\iota)/b}).
(2)(2) for any xt,xt1x_{t},x_{t-1}, we have 𝔼[𝒪2(xt,xt1)]=F(xt)F(xt1)\operatorname*{\mathbb{E}}[\mathcal{O}_{2}(x_{t},x_{t-1})]=\nabla F(x_{t})-\nabla F(x_{t-1}) and

Pr[𝒪2(xt,xt1)(F(xt)F(xt1))ζ2]ι,\displaystyle\Pr[\|\mathcal{O}_{2}(x_{t},x_{t-1})-(\nabla F(x_{t})-\nabla F(x_{t-1}))\|\geq\zeta_{2}]\leq\iota,

where ζ2=O(Mxtxt1log(d/ι)/bt)\zeta_{2}=O(M\|x_{t}-x_{t-1}\|\sqrt{\log(d/\iota)/b_{t}}).
(3)(3) for any xtx_{t}, we have 𝔼[𝒪3(xt)]=2F(xt)\operatorname*{\mathbb{E}}[\mathcal{O}_{3}(x_{t})]=\nabla^{2}F(x_{t}) and

Pr[𝒪3(xt)2F(xt)ζ3]ι,\displaystyle\Pr[\|\mathcal{O}_{3}(x_{t})-\nabla^{2}F(x_{t})\|\geq\zeta_{3}]\leq\iota,

where ζ3=O(Mlog(d/ι)/b)\zeta_{3}=O(M\sqrt{\log(d/\iota)/b}).
(4)(4) for any xt,xt1x_{t},x_{t-1}, we have 𝔼[𝒪4(xt,xt1)]=2F(xt)2F(xt1)\operatorname*{\mathbb{E}}[\mathcal{O}_{4}(x_{t},x_{t-1})]=\nabla^{2}F(x_{t})-\nabla^{2}F(x_{t-1}) and

Pr[𝒪4(xt,xt1)(2F(xt)2F(xt1))ζ4]ι,\displaystyle\Pr[\|\mathcal{O}_{4}(x_{t},x_{t-1})-(\nabla^{2}F(x_{t})-\nabla^{2}F(x_{t-1}))\|\geq\zeta_{4}]\leq\iota,

where ζ4=O(ρxtxt1log(d/ι)/bt)\zeta_{4}=O(\rho\|x_{t}-x_{t-1}\|\sqrt{\log(d/\iota)/b_{t}}).

From now on, we adopt Algorithm 4 as the gradient oracles for Line 6 and Line 10 of Algorithm 3, respectively, and we set η=1/M\eta=1/M. We then bound the error between the gradient estimator t\nabla_{t} and the true gradient F(xt)\nabla F(x_{t}) in Algorithm 3.

Lemma 3.12.

Suppose the dataset is drawn i.i.d. from the distribution 𝒫\mathcal{P}. For any 1tT1\leq t\leq T, let τtt\tau_{t}\leq t be the largest integer such that driftτt\mathrm{drift}_{\tau_{t}} is set to 00. Then with probability at least 1Tι1-T\iota, for some universal constant C>0C>0, we have

tF(xt)2C(G2b+i=τt+1tM2xixi12/bi)log(Td/ι),\displaystyle\|\nabla_{t}-\nabla F(x_{t})\|^{2}\leq C(\frac{G^{2}}{b}+\sum_{i=\tau_{t}+1}^{t}M^{2}\|x_{i}-x_{i-1}\|^{2}/b_{i})\log(Td/\iota), (2)
Ht2F(xt)2C(M2b+i=τt+1tρ2xixi12/bi)log2(Td/ι).\displaystyle\|H_{t}-\nabla^{2}F(x_{t})\|^{2}\leq C(\frac{M^{2}}{b}+\sum_{i=\tau_{t}+1}^{t}\rho^{2}\|x_{i}-x_{i-1}\|^{2}/b_{i})\log^{2}(Td/\iota). (3)

Next, we consider the noise added by the tree mechanism and the noise added in the stochastic power method to make the algorithm private.

Lemma 3.13.

If we set σtree=(Gb+maxt~tbt)log(1/δ)/ε\sigma_{\mathrm{tree}}=(\frac{G}{b}+\max_{t}\frac{\|\widetilde{\nabla}_{t}\|}{b_{t}})\log(1/\delta)/\varepsilon in the tree mechanism (Algorithm 1), σt=(Mb+maxtρxtxt1bt)2ΞΓτlog(1/δ)ε\sigma_{t}=(\frac{M}{b}+\max_{t}\frac{\rho\|x_{t}-x_{t-1}\|}{b_{t}})\cdot\frac{2\Xi\sqrt{\Gamma\tau\log(1/\delta)}}{\varepsilon} and use Algorithm 4 as oracles, then Algorithm 3 is (ε,δ)(\varepsilon,\delta)-DP.
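To make the role of the tree mechanism concrete, the following minimal standalone sketch releases all prefix sums of a stream with Gaussian noise on dyadic intervals, so each released prefix aggregates only O(log T) noisy nodes. This is an illustrative simplification, not the exact Algorithm 1 referenced above:

```python
import random

def tree_mechanism_prefix_sums(values, sigma):
    """Release all prefix sums of `values` with binary-tree noise (sketch).

    Each prefix sum aggregates at most about log2(T) + 1 noisy nodes, so
    its noise standard deviation scales like sigma * sqrt(log T), which is
    what makes the sigma_tree calibration above suffice.
    """
    T = len(values)
    noisy = {}  # cached noisy partial sums, keyed by dyadic interval

    def node(start, length):
        # Noise for each dyadic interval is sampled once and reused.
        if (start, length) not in noisy:
            s = sum(values[start:start + length])
            noisy[(start, length)] = s + random.gauss(0.0, sigma)
        return noisy[(start, length)]

    prefixes = []
    for t in range(1, T + 1):
        # Decompose [0, t) into dyadic intervals via t's binary expansion.
        total, start, rem = 0.0, 0, t
        length = 1 << max(0, t.bit_length() - 1)
        while rem > 0:
            if length <= rem:
                total += node(start, length)
                start += length
                rem -= length
            length >>= 1
        prefixes.append(total)
    return prefixes
```

With `sigma = 0` the mechanism returns the exact prefix sums, which makes the interval decomposition easy to sanity-check.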

Algorithm 4 oracles
1:gradient oracle 𝒪1\mathcal{O}_{1}
2:inputs: xtx_{t}
3:draw a batch of size bb among the unused functions
4:return: 1bzf(xt;z)\frac{1}{b}\sum_{z}\nabla f(x_{t};z)
1:gradient difference oracle 𝒪2\mathcal{O}_{2}
2:inputs: xt,xt1x_{t},x_{t-1}
3:draw a batch of size btb_{t} among the unused functions
4:return: 1btz(f(xt;z)f(xt1;z))\frac{1}{b_{t}}\sum_{z}(\nabla f(x_{t};z)-\nabla f(x_{t-1};z))
1:Hessian oracle 𝒪3\mathcal{O}_{3}
2:inputs: xtx_{t}
3:draw a batch of size bb among the unused functions
4:return: 1bz2f(xt;z)\frac{1}{b}\sum_{z}\nabla^{2}f(x_{t};z)
1:Hessian difference oracle 𝒪4\mathcal{O}_{4}
2:inputs: xt,xt1x_{t},x_{t-1}
3:draw a batch of size btb_{t} among the unused functions
4:return: 1btz(2f(xt;z)2f(xt1;z))\frac{1}{b_{t}}\sum_{z}(\nabla^{2}f(x_{t};z)-\nabla^{2}f(x_{t-1};z))
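The four oracles above are plain mini-batch averages. A minimal Python sketch, assuming hypothetical per-example callables grad_f(x, z) and hess_f(x, z), and abstracting Algorithm 4's "unused functions" bookkeeping into a simple cursor:

```python
def make_oracles(grad_f, hess_f, data):
    """Sketch of Algorithm 4's four oracles on a toy 1-D problem.

    grad_f(x, z) and hess_f(x, z) are assumed per-example gradient and
    Hessian callables (hypothetical names, not the paper's interface).
    """
    state = {"cursor": 0}

    def draw(b):
        # Consume the next b unused examples.
        batch = data[state["cursor"]:state["cursor"] + b]
        state["cursor"] += b
        return batch

    def O1(x, b):            # gradient oracle
        batch = draw(b)
        return sum(grad_f(x, z) for z in batch) / len(batch)

    def O2(x, x_prev, b_t):  # gradient-difference oracle
        batch = draw(b_t)
        return sum(grad_f(x, z) - grad_f(x_prev, z) for z in batch) / len(batch)

    def O3(x, b):            # Hessian oracle
        batch = draw(b)
        return sum(hess_f(x, z) for z in batch) / len(batch)

    def O4(x, x_prev, b_t):  # Hessian-difference oracle
        batch = draw(b_t)
        return sum(hess_f(x, z) - hess_f(x_prev, z) for z in batch) / len(batch)

    return O1, O2, O3, O4
```

For instance, with the toy loss f(x; z) = (x - z)²/2, the per-example gradient is x - z and the per-example Hessian is 1.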

Accounting for the added noise, we obtain high-probability error bounds for the gradient estimator ~t\widetilde{\nabla}_{t} and the Hessian estimator HtH_{t}.

Lemma 3.14.

In Algorithm 3, setting fixed batch size b=Gd/εα+G2/α2+Mdρε+M2ραb=G\sqrt{d}/\varepsilon\alpha+G^{2}/\alpha^{2}+\frac{M\sqrt{d}}{\rho\varepsilon}+\frac{M^{2}}{\rho\alpha}, adaptive batch size bt=max{gtdαε,κgtα2,ρκgtM2α,1}b_{t}=\max\{\frac{\|g_{t}\|\cdot\sqrt{d}}{\alpha\varepsilon},\frac{\kappa\cdot\|g_{t}\|}{\alpha^{2}},\frac{\rho\kappa\cdot\|g_{t}\|}{M^{2}\alpha},1\}, Ξ=γ/ρ,τ=α3/2Mρ,Γ=O~(M/ργ)\Xi=\sqrt{\gamma/\rho},\tau=\frac{\alpha^{3/2}}{M\sqrt{\rho}},\Gamma=\tilde{O}(M/\sqrt{\rho\gamma}), σt=2log(1/δ)α/d,t\sigma_{t}=2\log(1/\delta)\alpha/\sqrt{d},\forall t and σtree=2log(1/δ)α/d\sigma_{\mathrm{tree}}=2\log(1/\delta)\alpha/\sqrt{d} in accordance with Lemma 3.13, Algorithm 3 is (ε,δ)(\varepsilon,\delta)-DP, and for each t[T]t\in[T], with probability at least 1ι1-\iota, gtF(xt)γ\|g_{t}-\nabla F(x_{t})\|\leq\gamma and Ht2F(xt)2ργ/2\|H_{t}-\nabla^{2}F(x_{t})\|_{2}\leq\sqrt{\rho\gamma}/2, where γ=O~(α).\gamma=\tilde{O}(\alpha).

We must show that we can find a γ\gamma-SOSP before exhausting all the functions. We need the following technical lemma:

Lemma 3.15.

Consider the implementation of Algorithm 3. Suppose the dataset 𝒟\mathcal{D} can be arbitrarily large, with functions drawn i.i.d. from 𝒫\mathcal{P}, and that we run the algorithm until it finds an O~(γ)\tilde{O}(\gamma)-SOSP. Then with probability at least 1Tι1-T\iota, the total number of functions used is bounded by

O~(bBMκγ+BM(dγ2ε+κγ3+ρκM2γ2+1γ2)).\displaystyle\tilde{O}\Big{(}\frac{bBM}{\kappa\gamma}+BM(\frac{\sqrt{d}}{\gamma^{2}\varepsilon}+\frac{\kappa}{\gamma^{3}}+\frac{\rho\kappa}{M^{2}\gamma^{2}}+\frac{1}{\gamma^{2}})\Big{)}.

Given this dataset size requirement, we can derive the final bound for finding an SOSP.

Lemma 3.16.

With 𝒟\mathcal{D} of size nn drawn i.i.d. from 𝒫\mathcal{P}, setting κ=max{αdε,(BGM)1/3}\kappa=\max\{\frac{\alpha\sqrt{d}}{\varepsilon},(BGM)^{1/3}\},

α=\displaystyle\alpha= O(((BGM)1/3+BM)(dnε)1/2+B29M29G59+B49M49G19n1/3)\displaystyle O\Big{(}((BGM)^{1/3}+\sqrt{BM})(\frac{\sqrt{d}}{n\varepsilon})^{1/2}+\frac{B^{\frac{2}{9}}M^{\frac{2}{9}}G^{\frac{5}{9}}+B^{\frac{4}{9}}M^{\frac{4}{9}}G^{\frac{1}{9}}}{n^{1/3}}\Big{)}
+O((MB)1/2n+BM2ρn+B1/3M4/3G1/6ρn+B2/3G1/6ρM1/3n+BρdMnε),\displaystyle+O\Big{(}\frac{(MB)^{1/2}}{\sqrt{n}}+\frac{\sqrt{BM^{2}}}{\rho\sqrt{n}}+\frac{B^{1/3}M^{4/3}}{G^{1/6}\sqrt{\rho n}}+\frac{B^{2/3}G^{1/6}\sqrt{\rho}}{M^{1/3}\sqrt{n}}+\frac{B\rho\sqrt{d}}{Mn\varepsilon}\Big{)},

and other parameters as in Lemma 3.14, with probability at least 1ι1-\iota, at least one of the outputs of Algorithm 3 is γ\gamma-SOSP, with γ=O~(α)\gamma=\tilde{O}(\alpha).

Combining Lemma 3.13 and Lemma 3.16, we obtain the following main result for privately finding an SOSP:

Theorem 3.17.

Given δ>0,εO(1)\delta>0,\varepsilon\leq O(1), with gradient oracles in Algorithm 4, setting b=Gd/εα+G2/α2+Mdρε+M2ρα,bt=max{gtdαε,κgtα2,ρκgtM2α,1}b=G\sqrt{d}/\varepsilon\alpha+G^{2}/\alpha^{2}+\frac{M\sqrt{d}}{\rho\varepsilon}+\frac{M^{2}}{\rho\alpha},b_{t}=\max\{\frac{\|g_{t}\|\cdot\sqrt{d}}{\alpha\varepsilon},\frac{\kappa\cdot\|g_{t}\|}{\alpha^{2}},\frac{\rho\kappa\cdot\|g_{t}\|}{M^{2}\alpha},1\}, κ=max{αdε,(BGM)1/3}\kappa=\max\{\frac{\alpha\sqrt{d}}{\varepsilon},(BGM)^{1/3}\}, Ξ=γ/ρ,τ=α3/2Mρ,Γ=O~(M/ργ)\Xi=\sqrt{\gamma/\rho},\tau=\frac{\alpha^{3/2}}{M\sqrt{\rho}},\Gamma=\tilde{O}(M/\sqrt{\rho\gamma}), and σtree=σt=α/d,t\sigma_{\mathrm{tree}}=\sigma_{t}=\alpha/\sqrt{d},\forall t, Algorithm 3 is (ε,δ)(\varepsilon,\delta)-DP; moreover, if the dataset is drawn i.i.d. from the underlying distribution 𝒫\mathcal{P}, then at least one of its output points is an O~(α)\tilde{O}(\alpha)-SOSP, where

α=\displaystyle\alpha= O(((BGM)1/3+BM)(dnε)1/2+B29M29G59+B49M49G19n1/3)\displaystyle O\Big{(}((BGM)^{1/3}+\sqrt{BM})(\frac{\sqrt{d}}{n\varepsilon})^{1/2}+\frac{B^{\frac{2}{9}}M^{\frac{2}{9}}G^{\frac{5}{9}}+B^{\frac{4}{9}}M^{\frac{4}{9}}G^{\frac{1}{9}}}{n^{1/3}}\Big{)}
+O((MB)1/2n+BM2ρn+B1/3M4/3G1/6ρn+B2/3G1/6ρM1/3n+BρdMnε).\displaystyle+O\Big{(}\frac{(MB)^{1/2}}{\sqrt{n}}+\frac{\sqrt{BM^{2}}}{\rho\sqrt{n}}+\frac{B^{1/3}M^{4/3}}{G^{1/6}\sqrt{\rho n}}+\frac{B^{2/3}G^{1/6}\sqrt{\rho}}{M^{1/3}\sqrt{n}}+\frac{B\rho\sqrt{d}}{Mn\varepsilon}\Big{)}.
Remark 3.18.

If we treat the parameters B,G,M,ρB,G,M,\rho as constants O(1)O(1), then we get α=O~((dnε)1/2+1n1/3)\alpha=\tilde{O}((\frac{\sqrt{d}}{n\varepsilon})^{1/2}+\frac{1}{n^{1/3}}) as claimed before in the abstract and introduction.
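For intuition about when each term of this simplified bound dominates, the following sketch evaluates the two terms with constants and logarithmic factors suppressed (purely illustrative):

```python
def alpha_terms(n, d, eps):
    """Evaluate the two dominant terms of the simplified bound
    alpha = O~((sqrt(d)/(n*eps))**(1/2) + n**(-1/3)),
    with constants and log factors suppressed (illustration only)."""
    private_term = ((d ** 0.5) / (n * eps)) ** 0.5   # privacy cost
    nonprivate_term = n ** (-1.0 / 3.0)              # statistical cost
    return private_term, nonprivate_term
```

For moderate dimension the non-private n^{-1/3} term dominates, while for large d the privacy term (√d/(nε))^{1/2} takes over.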

If we make further assumptions, such as assuming the functions are defined over a constrained domain 𝒳d\mathcal{X}\subset\mathbb{R}^{d} of diameter RR and allowing exponential running time, we can obtain other standard bounds that improve on Theorem 3.17 in some regimes. See Appendix C for further discussion.

4 Discussion

We combine the concepts of adaptive batch sizes and the tree mechanism to improve the previous best results for privately finding an SOSP. Our approach achieves the same bound as the state-of-the-art method for finding an FOSP, suggesting that privately finding an SOSP may incur no additional cost.

Several interesting questions remain. First, what is the tight bound for privately finding an FOSP and an SOSP? Second, can the adaptive batch size technique be applied in other settings? Could it offer additional advantages, such as reducing runtime in practice? Finally, while we can obtain a generalization error bound of d/n\sqrt{d/n} using concentration inequalities and the union bound, can we achieve a better generalization error bound for non-convex optimization?

References

  • Agarwal et al. [2017] Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing, pages 1195–1199, 2017.
  • Arjevani et al. [2023] Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199(1):165–214, 2023.
  • Arora et al. [2023] Raman Arora, Raef Bassily, Tomás González, Cristóbal A Guzmán, Michael Menart, and Enayat Ullah. Faster rates of convergence to stationary points in differentially private optimization. In International Conference on Machine Learning, pages 1060–1092. PMLR, 2023.
  • Asi et al. [2021] Hilal Asi, Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: Optimal rates in l1 geometry. In International Conference on Machine Learning, pages 393–403. PMLR, 2021.
  • Bassily et al. [2014] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In Proc. of the 2014 IEEE 55th Annual Symp. on Foundations of Computer Science (FOCS), pages 464–473, 2014.
  • Bassily et al. [2019] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in neural information processing systems, 32, 2019.
  • Bassily et al. [2021] Raef Bassily, Cristóbal Guzmán, and Anupama Nandi. Non-euclidean differentially private stochastic convex optimization. In Conference on Learning Theory, pages 474–499. PMLR, 2021.
  • Carmon et al. [2020] Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points i. Mathematical Programming, 184(1):71–120, 2020.
  • Carmon et al. [2023] Yair Carmon, Arun Jambulapati, Yujia Jin, Yin Tat Lee, Daogao Liu, Aaron Sidford, and Kevin Tian. Resqueing parallel and private stochastic convex optimization. In 2023 IEEE 64th Annual Symposium on Foundations of Computer Science (FOCS), pages 2031–2058. IEEE, 2023.
  • Chan et al. [2011] T-H Hubert Chan, Elaine Shi, and Dawn Song. Private and continual release of statistics. ACM Transactions on Information and System Security (TISSEC), 14(3):1–24, 2011.
  • Chaudhuri et al. [2011] Kamalika Chaudhuri, Claire Monteleoni, and Anand D Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12(3), 2011.
  • Davis et al. [2022] Damek Davis, Dmitriy Drusvyatskiy, Yin Tat Lee, Swati Padmanabhan, and Guanghao Ye. A gradient sampling method with complexity guarantees for lipschitz functions in high and low dimensions. Advances in neural information processing systems, 35:6692–6703, 2022.
  • De et al. [2016] Soham De, Abhay Yadav, David Jacobs, and Tom Goldstein. Big batch sgd: Automated inference using adaptive batch sizes. arXiv preprint arXiv:1610.05792, 2016.
  • Dwork et al. [2006] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Proc. of the Third Conf. on Theory of Cryptography (TCC), pages 265–284, 2006. URL http://dx.doi.org/10.1007/11681878_14.
  • Dwork et al. [2010] Cynthia Dwork, Moni Naor, Toniann Pitassi, and Guy N Rothblum. Differential privacy under continual observation. In Proceedings of the forty-second ACM symposium on Theory of computing, pages 715–724, 2010.
  • Fang et al. [2018] Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing systems, 31, 2018.
  • Feldman et al. [2020] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
  • Fichtenberger et al. [2023] Hendrik Fichtenberger, Monika Henzinger, and Jalaj Upadhyay. Constant matters: Fine-grained error bound on differentially private continual observation. In International Conference on Machine Learning, pages 10072–10092. PMLR, 2023.
  • Ganesh et al. [2023] Arun Ganesh, Daogao Liu, Sewoong Oh, and Abhradeep Thakurta. Private (stochastic) non-convex optimization revisited: Second-order stationary points and excess risks. Advances in Neural Information Processing Systems, 2023.
  • Gao and Wright [2023] Changyu Gao and Stephen J Wright. Differentially private optimization for smooth nonconvex erm. arXiv preprint arXiv:2302.04972, 2023.
  • Ghadimi and Lan [2013] Saeed Ghadimi and Guanghui Lan. Stochastic first-and zeroth-order methods for nonconvex stochastic programming. SIAM journal on optimization, 23(4):2341–2368, 2013.
  • Gopi et al. [2023] Sivakanth Gopi, Yin Tat Lee, Daogao Liu, Ruoqi Shen, and Kevin Tian. Private convex optimization in general norms. In Proceedings of the 2023 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), pages 5068–5089. SIAM, 2023.
  • Ji et al. [2020] Kaiyi Ji, Zhe Wang, Bowen Weng, Yi Zhou, Wei Zhang, and Yingbin Liang. History-gradient aided batch size adaptation for variance reduced algorithms. In International Conference on Machine Learning, pages 4762–4772. PMLR, 2020.
  • Jin et al. [2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In International conference on machine learning, pages 1724–1732. PMLR, 2017.
  • Jin et al. [2019] Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M Kakade, and Michael I Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
  • Jordan et al. [2023] Michael Jordan, Guy Kornowski, Tianyi Lin, Ohad Shamir, and Manolis Zampetakis. Deterministic nonsmooth nonconvex optimization. In The Thirty Sixth Annual Conference on Learning Theory, pages 4570–4597. PMLR, 2023.
  • Kornowski and Shamir [2021] Guy Kornowski and Ohad Shamir. Oracle complexity in nonsmooth nonconvex optimization. Advances in Neural Information Processing Systems, 34:324–334, 2021.
  • Kornowski et al. [2024] Guy Kornowski, Daogao Liu, and Kunal Talwar. Improved sample complexity for private nonsmooth nonconvex optimization. arXiv preprint arXiv:2410.05880, 2024.
  • Kulkarni et al. [2021] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth erm and sco in subquadratic steps. Advances in Neural Information Processing Systems, 34:4053–4064, 2021.
  • Lowy et al. [2024] Andrew Lowy, Jonathan Ullman, and Stephen Wright. How to make the gradients small privately: Improved rates for differentially private non-convex optimization. In Forty-first International Conference on Machine Learning, 2024.
  • Menart et al. [2024] Michael Menart, Enayat Ullah, Raman Arora, Raef Bassily, and Cristóbal Guzmán. Differentially private non-convex optimization under the kl condition with optimal rates. In International Conference on Algorithmic Learning Theory, pages 868–906. PMLR, 2024.
  • Nesterov [1998] Yurii Nesterov. Introductory lectures on convex programming volume i: Basic course. Lecture notes, 3(4):5, 1998.
  • Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of newton method and its global performance. Mathematical programming, 108(1):177–205, 2006.
  • Nguyen et al. [2017] Lam M Nguyen, Jie Liu, Katya Scheinberg, and Martin Takáč. Sarah: A novel method for machine learning problems using stochastic recursive gradient. In International conference on machine learning, pages 2613–2621. PMLR, 2017.
  • Tao et al. [2025] Youming Tao, Dongxiao Yu, Xiuzhen Cheng, Falko Dressler, and Di Wang. Private stochastic optimization for achieving second-order stationary points, 2025. URL https://openreview.net/forum?id=UVaLZMv0uk. OpenReview preprint, ICLR submission.
  • Tran and Cutkosky [2022] Hoang Tran and Ashok Cutkosky. Momentum aggregation for private non-convex erm. Advances in Neural Information Processing Systems, 35:10996–11008, 2022.
  • Tropp [2012] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12:389–434, 2012.
  • Wang et al. [2019a] Di Wang, Changyou Chen, and Jinhui Xu. Differentially private empirical risk minimization with non-convex loss functions. In International Conference on Machine Learning, pages 6526–6535. PMLR, 2019a.
  • Wang et al. [2023] Lingxiao Wang, Bargav Jayaraman, David Evans, and Quanquan Gu. Efficient privacy-preserving stochastic nonconvex optimization. In Uncertainty in Artificial Intelligence, pages 2203–2213. PMLR, 2023.
  • Wang et al. [2019b] Zhe Wang, Kaiyi Ji, Yi Zhou, Yingbin Liang, and Vahid Tarokh. Spiderboost and momentum: Faster variance reduction algorithms. Advances in Neural Information Processing Systems, 32, 2019b.
  • Zhang et al. [2020] Jingzhao Zhang, Hongzhou Lin, Stefanie Jegelka, Suvrit Sra, and Ali Jadbabaie. Complexity of finding stationary points of nonconvex nonsmooth functions. In International Conference on Machine Learning, pages 11173–11182. PMLR, 2020.
  • Zhang et al. [2024] Qinzi Zhang, Hoang Tran, and Ashok Cutkosky. Private zeroth-order nonsmooth nonconvex optimization. In The Twelfth International Conference on Learning Representations, 2024.

Appendix A Omitted Proof of Subsection 3.1

A.1 Proof of Lemma 3.5

Proof.

By Gaussian concentration, we know that with probability at least 1ι/21-\iota/2, ζt2γ\|\zeta_{t}\|_{2}\leq\gamma for all t[Γ]t\in[\Gamma]. The remainder of the proof is conditioned on the event that ζt2γ\|\zeta_{t}\|_{2}\leq\gamma for all tt.

By Assumption 3.1, for each t[Γ]t\in[\Gamma], we know

gt1F(xt1)2=\displaystyle\|g_{t-1}-\nabla F(x_{t-1})\|_{2}= g+H(xt1x0)+ζt1F(xt1)2\displaystyle\|g+H(x_{t-1}-x_{0})+\zeta_{t-1}-\nabla F(x_{t-1})\|_{2}
\displaystyle\leq g+H(xt1x0)F(xt1)2+ζt12\displaystyle\|g+H(x_{t-1}-x_{0})-\nabla F(x_{t-1})\|_{2}+\|\zeta_{t-1}\|_{2}
\displaystyle\leq g+H(xt1x0)F(xt1)2+γ\displaystyle\|g+H(x_{t-1}-x_{0})-\nabla F(x_{t-1})\|_{2}+\gamma
=\displaystyle= gF(x0)+H(xt1x0)(F(xt1)F(x0))2+γ\|g-\nabla F(x_{0})+H(x_{t-1}-x_{0})-(\nabla F(x_{t-1})-\nabla F(x_{0}))\|_{2}+\gamma
\displaystyle\leq 2γ+(H2F(z))(xt1x0)2,\displaystyle 2\gamma+\|(H-\nabla^{2}F(z))(x_{t-1}-x_{0})\|_{2},

where zz is a point on the segment between xt1x_{t-1} and x0x_{0}. Note that xt1x02<Ξ\|x_{t-1}-x_{0}\|_{2}<\Xi by the algorithm design; hence we know H2F(z)2ϰ+ρΞ\|H-\nabla^{2}F(z)\|_{2}\leq\varkappa+\rho\Xi. Hence, we have

gt1F(xt1)22γ+(ϰ+ρΞ)Ξ.\displaystyle\|g_{t-1}-\nabla F(x_{t-1})\|_{2}\leq 2\gamma+(\varkappa+\rho\Xi)\Xi.

This completes the proof.

A.2 Proof of Lemma 3.8

Proof.

By Claim 3.6, we have

F(xt)F(x0)η4i=1tF(xi)22+5ηi=1tϱi22.\displaystyle F(x_{t})-F(x_{0})\leq-\frac{\eta}{4}\sum_{i=1}^{t}\|\nabla F(x_{i})\|_{2}^{2}+5\eta\sum_{i=1}^{t}\|\varrho_{i}\|_{2}^{2}. (4)

Note that xtx0=η(i=1tF(xi)ϱi)x_{t}-x_{0}=\eta(\sum_{i=1}^{t}\nabla F(x_{i})-\varrho_{i}), which means η(i=1tF(xi)ϱi)2Ξ\eta\|(\sum_{i=1}^{t}\nabla F(x_{i})-\varrho_{i})\|_{2}\geq\Xi. By Lemma 3.5 and the preconditions, we know

ϱi3γ.\displaystyle\|\varrho_{i}\|\leq 3\gamma.

Hence we know

ηi=1tF(xi)Ξ3γΓγ/2ρ.\displaystyle\eta\|\sum_{i=1}^{t}\nabla F(x_{i})\|\geq\Xi-3\gamma\Gamma\geq\sqrt{\gamma/2\rho}.

Then by Equation 4, we know

F(xt)F(x0)\displaystyle F(x_{t})-F(x_{0})\leq η4Γ(i=1tF(xi))2+45ηΓγ2\displaystyle-\frac{\eta}{4\Gamma}(\sum_{i=1}^{t}\|\nabla F(x_{i})\|)^{2}+45\eta\Gamma\gamma^{2}
\displaystyle\leq γ8ηΓρ+45ηΓγ2-\frac{\gamma}{8\eta\Gamma\rho}+45\eta\Gamma\gamma^{2}
\displaystyle\leq Φ.\displaystyle-\Phi.

A.3 Proof of Lemma 3.9

Proof.

For proof purposes, let {xt}t[Γ]\{x_{t}\}_{t\in[\Gamma]} and {xt}t[Γ]\{x_{t}^{\prime}\}_{t\in[\Gamma]} be the two trajectories with x1=wx_{1}=w and x1=ux_{1}^{\prime}=u.

It suffices to show that

max{xΓx02,xΓx02}Ξ.\displaystyle\max\{\|x_{\Gamma}-x_{0}\|_{2},\|x_{\Gamma}^{\prime}-x_{0}\|_{2}\}\geq\Xi. (5)

Let zt=xtxtz_{t}=x_{t}-x_{t}^{\prime} be the difference. Hence we have

zt+1=\displaystyle z_{t+1}= ztη[gt1gt1]\displaystyle z_{t}-\eta[g_{t-1}-g_{t-1}^{\prime}]
=\displaystyle= ztηH(xtxt)\displaystyle z_{t}-\eta H(x_{t}-x_{t}^{\prime})
=\displaystyle= (IηH)zt.\displaystyle(I-\eta H)z_{t}.

Recall that z1=μr𝐞1z_{1}=\mu r\mathbf{e}_{1}, λmin(2F(x0))ργ\lambda_{\min}(\nabla^{2}F(x_{0}))\leq-\sqrt{\rho\gamma} and λmin(H)ργ/2\lambda_{\min}(H)\leq-\sqrt{\rho\gamma}/2. Then we know that

zt(1+ηργ/2)tμr,\displaystyle\|z_{t}\|\geq(1+\eta\sqrt{\rho\gamma}/2)^{t}\mu r,

which means

zΓ22Ξ\displaystyle\|z_{\Gamma}\|_{2}\geq 2\Xi

by our choices of parameters. This establishes Equation (5) and completes the proof.
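For intuition, the geometric growth of the coupled difference can be checked on a toy one-dimensional instance. The function below is an illustrative sketch (not the paper's exact construction): it iterates the scalar analogue of z_{t+1} = (I - ηH) z_t along the escape direction, where H has a negative eigenvalue -λ.

```python
def coupled_difference(lam, eta, z0, steps):
    """Iterate z_{t+1} = (1 + eta*lam) * z_t, the component of
    z_t = x_t - x_t' along the escape direction e_1 when H has a
    negative eigenvalue -lam (toy 1-D sketch of the coupling argument).
    """
    z = z0
    for _ in range(steps):
        z *= 1.0 + eta * lam
    return z
```

Since the component along e_1 is multiplied by (1 + ηλ) each step, it exceeds any threshold Ξ after about log(Ξ/z0)/(ηλ) steps, which is the source of the Γ = Õ(M/√(ργ)) bound.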

A.4 Proof of Proposition 3.10

Proof of Proposition 3.10.

By Lemma 3.2 and the precondition that ~tF(xt)γ\|\widetilde{\nabla}_{t}-\nabla F(x_{t})\|\leq\gamma, we know that if F(xt)4γ\|\nabla F(x_{t})\|\geq 4\gamma, then F(xt+1)F(xt)η~t2/16F(x_{t+1})-F(x_{t})\leq-\eta\|\widetilde{\nabla}_{t}\|^{2}/16. Otherwise, F(xt)<4γ\|\nabla F(x_{t})\|<4\gamma; if xtx_{t} is moreover a saddle point, then by Theorem 3.4, we know that with probability at least 1ι1-\iota,

F(xt+Γ)F(xt)Ω~(γ3/2/ρ),\displaystyle F(x_{t+\Gamma})-F(x_{t})\leq-\tilde{\Omega}(\gamma^{3/2}/\sqrt{\rho}),

where Γ=O~(M/ργ)\Gamma=\tilde{O}(M/\sqrt{\rho\gamma}). If none of the points in {xi}i[T]\{x_{i}\}_{i\in[T]} is an O~(γ)\tilde{O}(\gamma)-SOSP, then F(xT)F(x0)<BF(x_{T})-F(x_{0})<-B, contradicting Assumption 3.1. Hence at least one point in {xi}i[T]\{x_{i}\}_{i\in[T]} must be an O~(γ)\tilde{O}(\gamma)-SOSP, which completes the proof.

A.5 Proof of Theorem 3.4

Now we complete the proof of Theorem 3.4.

Proof of Theorem 3.4.

By Lemma 3.8 and the definition of 𝒳(x0)\mathcal{X}(x_{0}) (see Equation (1)), we have

Pr[F(y)F(x0)Φ]Pr[x1𝒳(x0)ζt2γ,t]ι/2.\displaystyle\Pr[F(y)-F(x_{0})\leq-\Phi]\geq\Pr\Big{[}x_{1}\notin\mathcal{X}(x_{0})\mid\|\zeta_{t}\|_{2}\leq\gamma,\forall t\Big{]}-\iota/2.

It suffices to upper bound Pr[x1𝒳(x0)ζt2γ,t]\Pr\Big{[}x_{1}\in\mathcal{X}(x_{0})\mid\|\zeta_{t}\|_{2}\leq\gamma,\forall t\Big{]}. Let μ=ι2/(4d)\mu=\iota^{2}/(4\sqrt{d}). Recall that the Gaussian noise we add has standard deviation σ0=γdlog(T/ι)\sigma_{0}=\frac{\gamma}{\sqrt{d\log(T/\iota)}}. Suppose we add two independent Gaussians ζ0,ζ0\zeta_{0},\zeta_{0}^{\prime} and obtain two independent first iterates x1x_{1} and x1x_{1}^{\prime}. By the properties of Gaussians, we have

Pr[|⟨𝐞1,ζ0ζ0⟩|μγ]=2erf(μγ/σ0)ι2/16.\displaystyle\Pr[|\langle\mathbf{e}_{1},\zeta_{0}-\zeta_{0}^{\prime}\rangle|\leq\mu\gamma]=2\mathrm{erf}(\mu\gamma/\sigma_{0})\leq\iota^{2}/16.

Then

Pr[|⟨𝐞1,ζ0ζ0⟩|μγζ0γ,ζ0γ]ι2/161ιι2/4.\displaystyle\Pr\big{[}|\langle\mathbf{e}_{1},\zeta_{0}-\zeta_{0}^{\prime}\rangle|\leq\mu\gamma\mid\|\zeta_{0}\|\leq\gamma,\|\zeta_{0}^{\prime}\|\leq\gamma\big{]}\leq\frac{\iota^{2}/16}{1-\iota}\leq\iota^{2}/4.

By Lemma 3.9 and independence between x1,x1x_{1},x_{1}^{\prime}, we have

Pr[x1𝒳(x0)ζt2γ,t]\displaystyle\Pr\Big{[}x_{1}\in\mathcal{X}(x_{0})\mid\|\zeta_{t}\|_{2}\leq\gamma,\forall t\Big{]}\leq Pr[x1,x1𝒳(x0)ζ0γ,ζ0γ]\displaystyle\sqrt{\Pr[x_{1},x_{1}^{\prime}\in\mathcal{X}(x_{0})\mid\|\zeta_{0}\|\leq\gamma,\|\zeta_{0}^{\prime}\|\leq\gamma]}
\displaystyle\leq Pr[|⟨𝐞1,ζ0ζ0⟩|μγζ0γ,ζ0γ]\sqrt{\Pr\big{[}|\langle\mathbf{e}_{1},\zeta_{0}-\zeta_{0}^{\prime}\rangle|\leq\mu\gamma\mid\|\zeta_{0}\|\leq\gamma,\|\zeta_{0}^{\prime}\|\leq\gamma\big{]}}
\displaystyle\leq ι/2.\displaystyle\iota/2.

Hence, we show that

Pr[F(y)F(x0)Φ]1ι.\displaystyle\Pr[F(y)-F(x_{0})\leq-\Phi]\geq 1-\iota.

Appendix B Proof of Subsection 3.2

B.1 Proof of Lemma 3.11

Proof.

We only prove the first two items, as (3) and (4) follow from similar arguments. For each data point z𝒫z\sim\mathcal{P}, we know

𝔼f(xt;z)F(xt)=0,f(xt;z)F(xt)2G.\displaystyle\operatorname*{\mathbb{E}}\nabla f(x_{t};z)-\nabla F(x_{t})=0,\quad\|\nabla f(x_{t};z)-\nabla F(x_{t})\|\leq 2G.

Then the conclusion (1) follows from Lemma 2.4.

Similarly, for each data point z𝒫z\sim\mathcal{P}, we know

𝔼(f(xt;z)f(xt1;z))(F(xt)F(xt1))=0,\displaystyle\operatorname*{\mathbb{E}}(\nabla f(x_{t};z)-\nabla f(x_{t-1};z))-(\nabla F(x_{t})-\nabla F(x_{t-1}))=0,
(f(xt;z)f(xt1;z))(F(xt)F(xt1))2Mxtxt1.\displaystyle\|(\nabla f(x_{t};z)-\nabla f(x_{t-1};z))-(\nabla F(x_{t})-\nabla F(x_{t-1}))\|\leq 2M\|x_{t}-x_{t-1}\|.

The statement (2) also follows from Lemma 2.4. ∎

B.2 Proof of Lemma 3.12

Proof.

We consider the case when t=τtt=\tau_{t} first, i.e., we query 𝒪1\mathcal{O}_{1} to get t\nabla_{t}. Then Equation (2) follows from Lemma 3.11.

When t>τtt>\tau_{t}, for each ii such that τt<it\tau_{t}<i\leq t, conditioned on i1\nabla_{i-1} we have

𝔼[Δii1]=F(xi)F(xi1).\displaystyle\operatorname*{\mathbb{E}}[\Delta_{i}\mid\nabla_{i-1}]=\nabla F(x_{i})-\nabla F(x_{i-1}).

That is, Δi(F(xi)F(xi1))\Delta_{i}-(\nabla F(x_{i})-\nabla F(x_{i-1})) is zero-mean and nSG(Mxixi1log(dT/ι)/bi)\mathrm{nSG}(M\|x_{i}-x_{i-1}\|\sqrt{\log(dT/\iota)}/\sqrt{b_{i}}) by Lemma 3.11. Equation (2) then follows from applying Lemma 2.4.

The case of the Hessian estimate involves some truncation. By Lemma 3.11, we have

Pr[Hτt2F(xτt)2ζ3]ι/T,\displaystyle\Pr[\|H_{\tau_{t}}-\nabla^{2}F(x_{\tau_{t}})\|_{2}\geq\zeta_{3}]\leq\iota/T,

with ζ3=O(Mlog(Td/ι)/b)\zeta_{3}=O(M\sqrt{\log(Td/\iota)/b}). A similar concentration bound holds for ΔtH\Delta_{t}^{H}. We truncate the distribution of HτtH_{\tau_{t}} around 2F(xτt)\nabla^{2}F(x_{\tau_{t}}), that is, H¯τt=Hτt𝟏(Hτt2F(xτt)ζ3)\bar{H}_{\tau_{t}}=H_{\tau_{t}}\cdot\mathbf{1}\Big{(}\|H_{\tau_{t}}-\nabla^{2}F(x_{\tau_{t}})\|\leq\zeta_{3}\Big{)}. It is straightforward to see that

𝔼H¯τt2F(xτt)ζ3Pr[Hτt2F(xτt)ζ]dζζ3/T.\displaystyle\|\operatorname*{\mathbb{E}}\bar{H}_{\tau_{t}}-\nabla^{2}F(x_{\tau_{t}})\|\leq\int_{\zeta_{3}}^{\infty}\Pr[\|H_{\tau_{t}}-\nabla^{2}F(x_{\tau_{t}})\|\geq\zeta]\mathrm{d}\zeta\leq\zeta_{3}/T.

Truncate ΔtH\Delta_{t}^{H} in a similar way. Then by Lemma 3.11 we can show that

Pr[H¯τt+i=τt+1tΔ¯iH𝔼[H¯τt+i=τt+1tΔ¯iH]2C2(M2b+i=τt+1tρ2xixi12/bi)log2(Td/ι)]ι.\displaystyle\Pr[\|\bar{H}_{\tau_{t}}+\sum_{i=\tau_{t}+1}^{t}\overline{\Delta}_{i}^{H}-\operatorname*{\mathbb{E}}[\bar{H}_{\tau_{t}}+\sum_{i=\tau_{t}+1}^{t}\bar{\Delta}_{i}^{H}]\|^{2}\geq\frac{C}{2}(\frac{M^{2}}{b}+\sum_{i=\tau_{t}+1}^{t}\rho^{2}\|x_{i}-x_{i-1}\|^{2}/b_{i})\log^{2}(Td/\iota)]\leq\iota. (6)

Note that

𝔼[H¯τt+i=τt+1tΔ¯iH]2F(xt)=\displaystyle\|\operatorname*{\mathbb{E}}[\bar{H}_{\tau_{t}}+\sum_{i=\tau_{t}+1}^{t}\bar{\Delta}_{i}^{H}]-\nabla^{2}F(x_{t})\|= 𝔼[H¯τt+i=τt+1tΔ¯iH]𝔼[Hτt+i=τt+1tΔiH]\displaystyle\|\operatorname*{\mathbb{E}}[\bar{H}_{\tau_{t}}+\sum_{i=\tau_{t}+1}^{t}\bar{\Delta}_{i}^{H}]-\operatorname*{\mathbb{E}}[H_{\tau_{t}}+\sum_{i=\tau_{t}+1}^{t}\Delta_{i}^{H}]\| (7)
\displaystyle\leq Clog(Td/ι)2T(Mb+i=τt+1tρxixi1/bi).\displaystyle\frac{C\log(Td/\iota)}{2T}(\frac{M}{\sqrt{b}}+\sum_{i=\tau_{t}+1}^{t}\rho\|x_{i}-x_{i-1}\|/\sqrt{b_{i}}).

By union bound, we have

Pr[H¯τt+i=τt+1tΔ¯iHHt]Tι/2.\displaystyle\Pr[\bar{H}_{\tau_{t}}+\sum_{i=\tau_{t}+1}^{t}\overline{\Delta}_{i}^{H}\neq H_{t}]\leq T\iota/2. (8)

Equations (6),(7) and (8) complete the proof. ∎

B.3 Proof of Lemma 3.13

Proof.

We use the tree mechanism to privatize the gradients when no potential saddle point is encountered, and we use the Gaussian mechanism while escaping from saddle points.

We first show the indistinguishability of the gradients. It suffices to consider the sensitivity of the gradient oracles.

Consider the sensitivity of 𝒪1\mathcal{O}_{1} first. Let 𝒪1(xt)\mathcal{O}_{1}(x_{t})^{\prime} denote the output on the neighboring dataset. Then it is clear that

𝒪1(xt)𝒪1(xt)Gb.\displaystyle\|\mathcal{O}_{1}(x_{t})-\mathcal{O}_{1}(x_{t})^{\prime}\|\leq\frac{G}{b}.

As for the sensitivity of 𝒪2\mathcal{O}_{2}, since xtxt1=η~t\|x_{t}-x_{t-1}\|=\eta\|\widetilde{\nabla}_{t}\| and η=1/M\eta=1/M, we have

𝒪2(xt,xt1)𝒪2(xt,xt1)Mxtxt1bt=~tbt.\displaystyle\|\mathcal{O}_{2}(x_{t},x_{t-1})-\mathcal{O}_{2}(x_{t},x_{t-1})^{\prime}\|\leq\frac{M\|x_{t}-x_{t-1}\|}{b_{t}}=\frac{\|\widetilde{\nabla}_{t}\|}{b_{t}}.

The privacy guarantee of gradients follows from the tree mechanism (Theorem 2.2), which means {~t}tε/2,δ/2{~t}t\{\widetilde{\nabla}_{t}\}_{t}\approx_{\varepsilon/2,\delta/2}\{\widetilde{\nabla}_{t}^{\prime}\}_{t}.

As for the Gaussian mechanism, note that for neighboring datasets with Hessian estimates HtH_{t} and HtH_{t}^{\prime} respectively, we know that

HtHt2Mb+maxtρxtxt1bt.\displaystyle\|H_{t}-H_{t}^{\prime}\|_{2}\leq\frac{M}{b}+\max_{t}\frac{\rho\|x_{t}-x_{t-1}\|}{b_{t}}.

Note that we force a fresh estimate from 𝒪3\mathcal{O}_{3} after τ\tau saddle-point escapes, and in each escape, we ensure that xtx0Ξ\|x_{t}-x_{0}\|\leq\Xi and that HtH_{t} is used for at most Γ\Gamma steps, which means the difference is

(HtHt)(xtx0)2HtHtxtx0(Mb+maxtρxtxt1bt)Ξ.\displaystyle\|(H_{t}-H_{t}^{\prime})(x_{t}-x_{0})\|_{2}\leq\|H_{t}-H_{t}^{\prime}\|\cdot\|x_{t}-x_{0}\|\leq(\frac{M}{b}+\max_{t}\frac{\rho\|x_{t}-x_{t-1}\|}{b_{t}})\cdot\Xi.

Then the properties of the Gaussian mechanism and composition complete the proof of the privacy guarantee. ∎

B.4 Proof of Lemma 3.14

Proof.

By our setting of parameters, we know ΞΓτα/ρ\Xi\sqrt{\Gamma\tau}\leq\alpha/\rho and hence

(Gb+maxtgtbt)2εα/d,\displaystyle(\frac{G}{b}+\max_{t}\frac{\|g_{t}\|}{b_{t}})\leq 2\varepsilon\alpha/\sqrt{d},
(Mb+maxtρxtxt1bt)(2ΞΓτ)2εα/d,\displaystyle(\frac{M}{b}+\max_{t}\frac{\rho\|x_{t}-x_{t-1}\|}{b_{t}})\cdot(2\Xi\sqrt{\Gamma\tau})\leq 2\varepsilon\alpha/\sqrt{d},

Then our choice of σtree\sigma_{\mathrm{tree}} and σt\sigma_{t} ensures the privacy guarantee by Lemma 3.13.

For any t[T]t\in[T], if the algorithm is in the process of escaping from a saddle point and gt=g+H(xtxanchor)+ξtg_{t}=g+H(x_{t}-x_{\mathrm{anchor}})+\xi_{t}, then Lemma 3.5 and our parameter settings ensure the accuracy of gtg_{t}. Consider the other case, when gt=~tg_{t}=\widetilde{\nabla}_{t}; we have

~tF(xt)~tt(1)+tF(xt)(2).\displaystyle\|\widetilde{\nabla}_{t}-\nabla F(x_{t})\|\leq\underbrace{\|\widetilde{\nabla}_{t}-\nabla_{t}\|}_{(1)}+\underbrace{\|\nabla_{t}-\nabla F(x_{t})\|}_{(2)}.

By Theorem 2.2, we know

(1)maxtTREE(t)σdlog(T)σdlognαlognO~(γ).\displaystyle(1)\leq\max_{t}\|\mathrm{TREE}(t)\|\leq\sigma\sqrt{d\log(T)}\leq\sigma\sqrt{d\log n}\lesssim\alpha\sqrt{\log n}\leq\tilde{O}(\gamma).

By Lemma 3.12 and our parameter settings, we have

tF(xt)2\displaystyle\|\nabla_{t}-\nabla F(x_{t})\|^{2}\lesssim (α2+i=τt+1tgi2/bi)log(nd/ι)\displaystyle(\alpha^{2}+\sum_{i=\tau_{t}+1}^{t}\|g_{i}\|^{2}/b_{i})\log(nd/\iota)
\displaystyle\lesssim (α2+i=τt+1tgimin{αεd,α2/κ})log(nd/ι)\displaystyle(\alpha^{2}+\sum_{i=\tau_{t}+1}^{t}\|g_{i}\|\cdot\min\{\frac{\alpha\varepsilon}{\sqrt{d}},\alpha^{2}/\kappa\})\log(nd/\iota)
\displaystyle\leq (α2+κmin{αεd,α2/κ})log(nd/ι)\displaystyle(\alpha^{2}+\kappa\cdot\min\{\frac{\alpha\varepsilon}{\sqrt{d}},\alpha^{2}/\kappa\})\log(nd/\iota)
\displaystyle\lesssim α2log(nd/ι).\displaystyle\alpha^{2}\log(nd/\iota).

Hence we conclude that (2)αlog(nd/ι)(2)\lesssim\alpha\sqrt{\log(nd/\iota)}.
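The key step above, where the adaptive batch size converts the sum of squared gradient norms into κ·min{αε/√d, α²/κ} ≤ α², can be checked numerically. The sketch below assumes, as an illustration consistent with the displayed chain, the batch-size rule bᵢ ≥ ‖gᵢ‖·max{√d/(αε), κ/α²} and the drift budget Σᵢ‖gᵢ‖ ≤ κ; the concrete constants are arbitrary.

```python
import numpy as np

# Numeric check of the chain above, under the assumed adaptive rule
# b_i >= ||g_i|| * max(sqrt(d)/(alpha*eps), kappa/alpha**2) and the
# drift budget sum_i ||g_i|| <= kappa (constants are illustrative).
rng = np.random.default_rng(1)
d, alpha, eps, kappa = 50, 0.1, 1.0, 5.0
g_norms = rng.uniform(0.1, 1.0, size=20)
g_norms *= kappa / g_norms.sum()              # enforce sum ||g_i|| = kappa
b = g_norms * max(np.sqrt(d) / (alpha * eps), kappa / alpha**2)
lhs = float(np.sum(g_norms**2 / b))           # the variance-like term
rhs = kappa * min(alpha * eps / np.sqrt(d), alpha**2 / kappa)
assert lhs <= rhs + 1e-9                      # so the term is O(alpha^2)
```

Here the cancellation ‖gᵢ‖²/bᵢ ≤ ‖gᵢ‖·min{αε/√d, α²/κ} is what makes the adaptive (gradient-proportional) batch size pay off: large-gradient steps buy proportionally larger batches.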

Similarly, by Lemma 3.12, we can conclude that

Ht2F(xt)2\displaystyle\|H_{t}-\nabla^{2}F(x_{t})\|^{2}\lesssim (ρα+i=τt+1tρ2gi2/M2bi)log2(nd/ι)\displaystyle(\rho\alpha+\sum_{i=\tau_{t}+1}^{t}\rho^{2}\|g_{i}\|^{2}/M^{2}b_{i})\log^{2}(nd/\iota)
\displaystyle\lesssim (ρα+i=τt+1tρ2gi(α/ρκ))log2(nd/ι)\displaystyle(\rho\alpha+\sum_{i=\tau_{t}+1}^{t}\rho^{2}\|g_{i}\|\cdot(\alpha/\rho\kappa))\log^{2}(nd/\iota)
\displaystyle\lesssim ραlog2(nd/ι),\displaystyle\rho\alpha\log^{2}(nd/\iota),

which completes the proof. ∎

B.5 Proof of Lemma 3.15

Proof.

By Proposition 3.10, setting T=O~(B/ηγ2)T=\tilde{O}(B/\eta\gamma^{2}) suffices to find an O~(γ)\tilde{O}(\gamma)-SOSP. Let {x1,,xt}\{x_{1},\cdots,x_{t}\} be the iterates of the algorithm, where tTt\leq T denotes the step at which we halt. We first show

i=1tgi2O~(BM/γ).\displaystyle\sum_{i=1}^{t}\|g_{i}\|_{2}\lesssim\tilde{O}(BM/\gamma). (9)

Denote the set S:={i[t]:giγ}S:=\{i\in[t]:\|g_{i}\|\leq\gamma\}. As |S|T=O~(B/ηγ2)|S|\leq T=\tilde{O}(B/\eta\gamma^{2}) and each such term is at most γ\gamma, we know iSgiO~(B/ηγ)\sum_{i\in S}\|g_{i}\|\leq\tilde{O}(B/\eta\gamma).

Now consider the set Sc:=[t]SS^{c}:=[t]\setminus S denoting the index of steps when the norm of the gradient estimator is large. It suffices to bound iScgi2\sum_{i\in S^{c}}\|g_{i}\|_{2}.

By Lemma 3.2, we know when giγ\|g_{i}\|\geq\gamma, F(xi+1)F(xi)ηgi2/16F(x_{i+1})-F(x_{i})\leq-\eta\|g_{i}\|^{2}/16, and when giγ,F(xi+1)F(xi)+ηγ2\|g_{i}\|\leq\gamma,F(x_{i+1})\leq F(x_{i})+\eta\gamma^{2}. Given the bound on the function values, we know

iScgi2O~(B/η).\displaystyle\sum_{i\in S^{c}}\|g_{i}\|^{2}\leq\tilde{O}(B/\eta).

Hence

iScgiiScgi2γO~(Bηγ).\displaystyle\sum_{i\in S^{c}}\|g_{i}\|\leq\frac{\sum_{i\in S^{c}}\|g_{i}\|^{2}}{\gamma}\leq\tilde{O}(\frac{B}{\eta\gamma}).

This completes the proof of Equation (9). Moreover, we know the total number of times we escape from a saddle point is at most O(B/Φ)O(B/\Phi). Hence, the total number of functions we use for 𝒪1\mathcal{O}_{1} and 𝒪3\mathcal{O}_{3} is upper bounded by

b(i[t]giκ+B/Φτ)=O~(bBM/κγ).\displaystyle b\cdot(\frac{\sum_{i\in[t]}\|g_{i}\|}{\kappa}+B/\Phi\tau)=\tilde{O}(bBM/\kappa\gamma).

The total number of functions we used for 𝒪2\mathcal{O}_{2} is upper bounded as follows:

i[t]bi(dαε+κα2+ρκM2α)i[t]gi+TBMO~(dγ2ε+κγ3+ρκM2γ2+1γ2).\displaystyle\sum_{i\in[t]}b_{i}\lesssim(\frac{\sqrt{d}}{\alpha\varepsilon}+\frac{\kappa}{\alpha^{2}}+\frac{\rho\kappa}{M^{2}\alpha})\sum_{i\in[t]}\|g_{i}\|+T\leq BM\cdot\tilde{O}(\frac{\sqrt{d}}{\gamma^{2}\varepsilon}+\frac{\kappa}{\gamma^{3}}+\frac{\rho\kappa}{M^{2}\gamma^{2}}+\frac{1}{\gamma^{2}}).

This completes the proof. ∎
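The descent accounting behind Equation (9) can be checked numerically. The sketch below runs exact gradient descent with step η = 1/M on a smooth quadratic, where the descent lemma yields the clean constant 1/2 in place of the noisy-gradient constant 1/16 from Lemma 3.2; the telescoping argument is the same.

```python
import numpy as np

# Sanity check of the descent accounting: for exact gradient descent with
# step eta = 1/M on an M-smooth function, the descent lemma gives
# F(x_{i+1}) <= F(x_i) - (eta/2)*||g_i||^2, so telescoping yields
# sum_i ||g_i||^2 <= 2*B/eta with B = F(x_0) - min F.
M = 4.0
A = np.diag([1.0, 2.0, M])           # quadratic with smoothness M
F = lambda x: 0.5 * x @ A @ x        # min F = 0
eta = 1.0 / M
x = np.array([3.0, -2.0, 1.0])
B = F(x)
total_sq = 0.0
for _ in range(200):
    g = A @ x
    total_sq += float(g @ g)
    x = x - eta * g
assert total_sq <= 2.0 * B / eta + 1e-9
```

Dividing the large-gradient portion of this sum by γ, exactly as in the proof, converts the bound on Σ‖gᵢ‖² into the bound on Σ‖gᵢ‖.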

B.6 Proof of Lemma 3.16

Proof.

By Lemma 3.15 and our parameter settings, we need

nΩ~(bBMκγ+BM(dγ2ε+κγ3+ρκM2γ2+1γ2)).\displaystyle n\geq\tilde{\Omega}\big{(}\frac{bBM}{\kappa\gamma}+BM(\frac{\sqrt{d}}{\gamma^{2}\varepsilon}+\frac{\kappa}{\gamma^{3}}+\frac{\rho\kappa}{M^{2}\gamma^{2}}+\frac{1}{\gamma^{2}})\big{)}.

First,

nΩ~(bBMκγ)=Θ(BGMdκεγ2+G2BMκγ3+BM2dκργε+BM3κργ2)\displaystyle n\geq\tilde{\Omega}(\frac{bBM}{\kappa\gamma})=\Theta(\frac{BGM\sqrt{d}}{\kappa\varepsilon\gamma^{2}}+\frac{G^{2}BM}{\kappa\gamma^{3}}+\frac{BM^{2}\sqrt{d}}{\kappa\rho\gamma\varepsilon}+\frac{BM^{3}}{\kappa\rho\gamma^{2}})
γO~((BGM)1/3d1/4nε+B29M29G59n1/3+BM2ρn+B1/3M4/3G1/6ρn).\displaystyle\Leftarrow\gamma\geq\tilde{O}(\frac{(BGM)^{1/3}d^{1/4}}{\sqrt{n\varepsilon}}+\frac{B^{\frac{2}{9}}M^{\frac{2}{9}}G^{\frac{5}{9}}}{n^{1/3}}+\frac{\sqrt{BM^{2}}}{\sqrt{\rho n}}+\frac{B^{1/3}M^{4/3}}{G^{1/6}\sqrt{\rho n}}).

Secondly,

nΩ~(BMd/γ2ε)γO~(BM(dnε)1/2).\displaystyle n\geq\tilde{\Omega}(BM\sqrt{d}/\gamma^{2}\varepsilon)\Leftarrow\gamma\geq\tilde{O}(\sqrt{BM}(\frac{\sqrt{d}}{n\varepsilon})^{1/2}).

Thirdly,

nΩ~(BMκ/γ3)nΩ~(BM(dγ2ε)+B4/3M4/3G1/3/γ3)\displaystyle n\geq\tilde{\Omega}(BM\kappa/\gamma^{3})\Leftrightarrow n\geq\tilde{\Omega}(BM(\frac{\sqrt{d}}{\gamma^{2}\varepsilon})+B^{4/3}M^{4/3}G^{1/3}/\gamma^{3})
γO~(BM(dnε)1/2+B49M49G19n1/3).\displaystyle\Leftarrow\gamma\geq\tilde{O}(\sqrt{BM}(\frac{\sqrt{d}}{n\varepsilon})^{1/2}+\frac{B^{\frac{4}{9}}M^{\frac{4}{9}}G^{\frac{1}{9}}}{n^{1/3}}).

Fourthly,

nΩ~(BρκMγ2)=Ω~(B4/3G1/3ρM2/3γ2+BρdMγε)\displaystyle n\geq\tilde{\Omega}(\frac{B\rho\kappa}{M\gamma^{2}})=\tilde{\Omega}(\frac{B^{4/3}G^{1/3}\rho}{M^{2/3}\gamma^{2}}+\frac{B\rho\sqrt{d}}{M\gamma\varepsilon})
γO~(B2/3G1/6ρM1/3n+BρdMnε).\displaystyle\Leftarrow\gamma\geq\tilde{O}(\frac{B^{2/3}G^{1/6}\sqrt{\rho}}{M^{1/3}\sqrt{n}}+\frac{B\rho\sqrt{d}}{Mn\varepsilon}).

Finally,

nΩ~(MB/γ2)γO~((MB)1/2n).\displaystyle n\geq\tilde{\Omega}(MB/\gamma^{2})\Leftarrow\gamma\geq\tilde{O}(\frac{(MB)^{1/2}}{\sqrt{n}}).

Combining these bounds, we get the claimed statement. ∎
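Each "⟸" above inverts a polynomial inequality in γ. As a quick numeric check of one such inversion (the second condition), the following sketch, with illustrative constants, verifies that the stated choice of γ makes the sample-size requirement hold with equality, so any larger γ satisfies it:

```python
import math

# Checking the inversion n >= B*M*sqrt(d)/(gamma^2 * eps)  <=
# gamma >= sqrt(B*M) * (sqrt(d)/(n*eps))**0.5.  At equality the two sides
# match exactly; constants are illustrative, not from the paper.
B, M, d, n, eps = 2.0, 3.0, 100.0, 10_000.0, 0.5
gamma = math.sqrt(B * M) * (math.sqrt(d) / (n * eps)) ** 0.5
assert math.isclose(n, B * M * math.sqrt(d) / (gamma**2 * eps))
larger = 2.0 * gamma                 # loosening gamma only helps
assert n >= B * M * math.sqrt(d) / (larger**2 * eps)
```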

Appendix C Other results

The first result combines the current result on finding an SOSP of the empirical function F𝒟(x):=1nζ𝒟f(x;ζ)F_{\mathcal{D}}(x):=\frac{1}{n}\sum_{\zeta\in\mathcal{D}}f(x;\zeta) with the following generalization error bound:

Theorem C.1.

Suppose 𝒟\mathcal{D} is drawn i.i.d. from the underlying distribution 𝒫\mathcal{P} and Assumption 3.1 holds. Additionally, assume f(⋅;ζ):𝒳f(\cdot;\zeta):\mathcal{X}\to\mathbb{R} for some constrained domain 𝒳d\mathcal{X}\subset\mathbb{R}^{d} of diameter RR. Then for any point x𝒳x\in\mathcal{X}, with probability at least 1ι1-\iota, we have

F𝒫(x)F𝒟(x)O~(d/n),2F𝒫(x)2F𝒟(x)O~(d/n).\displaystyle\|\nabla F_{\mathcal{P}}(x)-\nabla F_{\mathcal{D}}(x)\|\leq\tilde{O}(\sqrt{d/n}),\|\nabla^{2}F_{\mathcal{P}}(x)-\nabla^{2}F_{\mathcal{D}}(x)\|\leq\tilde{O}(\sqrt{d/n}).
Proof.

We construct a maximal packing 𝒴\mathcal{Y} of O((R/r)d)O((R/r)^{d}) points for 𝒳\mathcal{X}, such that for any x𝒳x\in\mathcal{X}, there exists a point y𝒴y\in\mathcal{Y} such that xyr\|x-y\|\leq r.

By the union bound, the Hoeffding inequality for norm-subGaussian random vectors (Lemma 2.4), and the matrix Bernstein inequality (Theorem 2.5), we know that with probability at least 1ι1-\iota, for all points y𝒴y\in\mathcal{Y}, we have

F𝒫(y)F𝒟(y)O~(Ldlog(R/r)/n),2F𝒫(y)2F𝒟(y)O~(Mdlog(R/r)/n).\displaystyle\|\nabla F_{\mathcal{P}}(y)-\nabla F_{\mathcal{D}}(y)\|\leq\tilde{O}(L\sqrt{d\log(R/r)/n}),\|\nabla^{2}F_{\mathcal{P}}(y)-\nabla^{2}F_{\mathcal{D}}(y)\|\leq\tilde{O}(M\sqrt{d\log(R/r)/n}). (10)

Condition on the event of Equation (10), and choose rmin{1,M/ρ}d/nr\leq\min\{1,M/\rho\}\sqrt{d/n}. Then, by the Lipschitz and smoothness assumptions, for any x𝒳x\in\mathcal{X} there exists y𝒴y\in\mathcal{Y} such that xyr\|x-y\|\leq r, and

F𝒫(x)F𝒟(x)\displaystyle\|\nabla F_{\mathcal{P}}(x)-\nabla F_{\mathcal{D}}(x)\|\leq F𝒫(x)F𝒫(y)+F𝒫(y)F𝒟(y)+F𝒟(y)F𝒟(x)\displaystyle\|\nabla F_{\mathcal{P}}(x)-\nabla F_{\mathcal{P}}(y)\|+\|\nabla F_{\mathcal{P}}(y)-\nabla F_{\mathcal{D}}(y)\|+\|\nabla F_{\mathcal{D}}(y)-\nabla F_{\mathcal{D}}(x)\|
\displaystyle\leq O~(Ld/n).\displaystyle\tilde{O}(L\sqrt{d/n}).

Similarly, we can show

2F𝒫(x)2F𝒟(x)O~((M+ρr)d/n)=O~(Md/n).\displaystyle\|\nabla^{2}F_{\mathcal{P}}(x)-\nabla^{2}F_{\mathcal{D}}(x)\|\leq\tilde{O}((M+\rho r)\sqrt{d/n})=\tilde{O}(M\sqrt{d/n}).

This completes the proof. ∎
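The √(d/n) concentration rate driving this bound can be illustrated empirically. In the sketch below, mean-zero Gaussian samples stand in for the bounded gradient and Hessian terms (an assumption for illustration only), and the deviation of the empirical mean from the population mean shrinks at the √(d/n) rate:

```python
import numpy as np

# Empirical illustration of the sqrt(d/n) rate: the deviation of the mean
# of n i.i.d. mean-zero vectors in R^d scales like sqrt(d/n), so a 64x
# larger sample should shrink the error by roughly 8x.
rng = np.random.default_rng(2)
d = 20
errs = {}
for n in (100, 6400):
    samples = rng.normal(0.0, 1.0, size=(n, d))  # population mean is 0
    errs[n] = float(np.linalg.norm(samples.mean(axis=0)))
assert errs[6400] < errs[100] / 2
```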

The current SOTA for privately finding an SOSP of F𝒟F_{\mathcal{D}} is from Ganesh et al. [2023], who find an O~((d/nε)2/3)\tilde{O}((\sqrt{d}/n\varepsilon)^{2/3})-SOSP. Combining this SOTA with Theorem C.1, we can privately find an α\alpha-SOSP of F𝒫F_{\mathcal{P}} with

α=O~(d/n+(dnε)2/3).\displaystyle\alpha=\tilde{O}(\sqrt{d/n}+(\frac{\sqrt{d}}{n\varepsilon})^{2/3}).

If we allow exponential running time, as Lowy et al. [2024] suggest, we can privately find an initial point x0x_{0} minimizing the empirical function and then use x0x_{0} as a warm start to further improve the final bound.