
DP-PCA: Statistically Optimal and
Differentially Private PCA

Xiyang Liu (Paul Allen School of Computer Science & Engineering, University of Washington, [email protected]), Weihao Kong (Google Research, [email protected]), Prateek Jain (Google Research, [email protected]), Sewoong Oh (Paul Allen School of Computer Science & Engineering, University of Washington, and Google Research, [email protected])
Abstract

We study the canonical statistical task of computing the principal component from $n$ i.i.d. data in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in the literature, existing solutions fall short on two key aspects: ($i$) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n=\Omega(d^{3/2})$, to obtain non-trivial results, while non-private PCA requires only $n=O(d)$, and ($ii$) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on private mean estimation, which adds the minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n=\tilde{O}(d)$. Furthermore, we provide a lower bound showing that a sub-Gaussian-style assumption is necessary to obtain the optimal error rate.

1 Introduction

Principal Component Analysis (PCA) is a fundamental statistical tool with multiple applications including dimensionality reduction, data visualization, and noise reduction. Naturally, it is a key part of most standard data analysis and ML pipelines. However, when applied to data collected from numerous individuals, such as the U.S. Census data, the outcome of PCA might reveal highly sensitive personal information. We investigate the design of privacy-preserving PCA algorithms, and the involved privacy/utility tradeoffs, for computing the first principal component, which should serve as the building block of more general rank-$k$ PCA.

Differential privacy (DP) is a widely accepted mathematical notion of privacy introduced in [22], which is a standard in releasing the U.S. Census data [2] and is also deployed in commercial systems [64, 26, 28]. A query to a database is said to be $(\varepsilon,\delta)$-differentially private if a strong adversary who knows all other entries but one cannot infer that one entry from the query output with high confidence. The parameters $\varepsilon$ and $\delta$ restrict this confidence as measured by the Type-I and Type-II errors [42]. Smaller values of $\varepsilon\in[0,\infty)$ and $\delta\in[0,1]$ imply stronger privacy and plausible deniability for the participants.

For non-private PCA with $n$ i.i.d. samples in $d$ dimensions, the popular Oja's algorithm (provided in Algorithm 1) achieves the optimal error of $\sin(\hat{v},v_{1})=\tilde{\Theta}(\sqrt{d/n})$, where the error is measured by the sine of the angle between the estimate, $\hat{v}$, and the principal component, $v_{1}$ [39]. For differentially private PCA, a natural and fundamental question arises: what is the extra cost we pay in the error rate for ensuring $(\varepsilon,\delta)$-DP?

We introduce a novel approach we call DP-PCA (Algorithm 3) and show that it achieves an error bounded by $\sin(\hat{v},v_{1})=\tilde{O}(\sqrt{d/n}+d/(\varepsilon n))$ for sub-Gaussian-like data defined in Assumption 1, which is a broad class of light-tailed distributions that includes Gaussian data as a special case. The second term characterizes the cost of privacy, and this is tight; we prove a nearly matching information-theoretic lower bound showing that no $(\varepsilon,\delta)$-DP algorithm can achieve a smaller error. This significantly improves upon a long line of existing private algorithms for PCA, e.g., [15, 10, 36, 34, 24]. These existing algorithms are analyzed for fixed and non-stochastic data and achieve sub-optimal error rates of $O(\sqrt{d/n}+d^{3/2}/(\varepsilon n))$ even in the stochastic setting we consider.

A remaining question is whether the sub-Gaussian-like assumption, namely Assumption A.4, is necessary or whether it is an artifact of our analysis and our algorithm. It turns out that such an assumption on the lightness of the tail is critical; we prove an algorithm-independent, information-theoretic lower bound (Theorem 5.4) showing that, without such an assumption, the cost of privacy is lower bounded by $\Omega(\sqrt{d/(\varepsilon n)})$. This proves a separation of the error depending on the lightness of the tail.

We start with the formal description of the stochastic setting in Section 2 and present Oja's algorithm for non-private PCA. Our first attempt at making this algorithm private in Section 3 already achieves near-optimal error if the data is strictly from a Gaussian distribution. However, there are two remaining challenges that we describe in detail in Section 4: ($i$) the excessive number of iterations of Private Oja's Algorithm (Algorithm 2) prevents using typical values of $\varepsilon$ used in practice, and ($ii$) for general sub-Gaussian-like distributions, the error does not vanish even when the noise in the data (as measured by a certain fourth moment of a function of the data) vanishes. The first challenge is due to the analysis requiring amplification by shuffling [25], which is restrictive. The second is due to its reliance on gradient norm clipping [1], which does not adapt to the variance of the current gradients. This drives the design of DP-PCA in Section 5, which critically relies on two techniques to overcome each challenge, respectively. First, minibatch SGD (instead of single-sample SGD) significantly reduces the number of iterations, thus obviating the need for amplification by shuffling. Next, private mean estimation (instead of gradient norm clipping and noise adding) adapts to the stochasticity of the problem and adds the minimal noise necessary to achieve privacy. The main idea of this variance-adaptive stochastic gradient update is explained in detail in Section 6, along with a sketch of a proof.

Notations. For a vector $x\in\mathbb{R}^{d}$, we use $\|x\|$ to denote the Euclidean norm. For a matrix $X\in\mathbb{R}^{d\times d}$, we use $\|X\|_{2}=\max_{\|v\|=1}\|Xv\|_{2}$ to denote the spectral norm. We use $\mathbf{I}_{d}$ to denote the $d\times d$ identity matrix. For $n\in\mathbb{Z}^{+}$, let $[n]:=\{1,2,\ldots,n\}$. Let $\mathbb{S}_{2}^{d-1}$ denote the $\ell_{2}$ unit sphere in $\mathbb{R}^{d}$, i.e., $\mathbb{S}_{2}^{d-1}:=\{x\in\mathbb{R}^{d}:\|x\|=1\}$. $\tilde{O}(\cdot)$ hides logarithmic factors in $n$, $d$, and the failure probability $\zeta$.

2 Problem formulation and background on DP

Typical PCA assumes i.i.d. data $\{x_{i}\in\mathbb{R}^{d}\}$ from a distribution and finds the first eigenvector of $\Sigma=\mathbb{E}[(x_{i}-\mathbb{E}[x_{i}])(x_{i}-\mathbb{E}[x_{i}])^{\top}]\in\mathbb{R}^{d\times d}$. Our approach allows for a more general class of data $\{A_{i}\in\mathbb{R}^{d\times d}\}$ that recovers the standard case when $A_{i}=(x_{i}-\mathbb{E}[x_{i}])(x_{i}-\mathbb{E}[x_{i}])^{\top}$.

Assumption 1 ($(\Sigma,\{\lambda_{i}\}_{i=1}^{d},M,V,K,\kappa,a,\gamma^{2})$-model).

Let $A_{1},A_{2},\ldots,A_{n}\in\mathbb{R}^{d\times d}$ be a sequence of (not necessarily symmetric) matrices sampled independently from the same distribution that satisfy the following, with PSD matrices $\Sigma\in\mathbb{R}^{d\times d}$ and $H_{u}\in\mathbb{R}^{d\times d}$, and positive scalar parameters $M,V,K$, $\kappa$, $a$, and $\gamma^{2}$:

  1. A.1.

    Let $\Sigma:=\mathbb{E}[A_{i}]$ for a symmetric positive semidefinite (PSD) matrix $\Sigma\in\mathbb{R}^{d\times d}$, let $\lambda_{i}$ denote the $i$-th largest eigenvalue of $\Sigma$, and let $\kappa:=\lambda_{1}/(\lambda_{1}-\lambda_{2})$,

  2. A.2.

    $\|A_{i}-\Sigma\|_{2}\leq\lambda_{1}M$ almost surely,

  3. A.3.

    $\max\left\{\left\|\mathbb{E}\left[(A_{i}-\Sigma)(A_{i}-\Sigma)^{\top}\right]\right\|_{2},\left\|\mathbb{E}\left[(A_{i}-\Sigma)^{\top}(A_{i}-\Sigma)\right]\right\|_{2}\right\}\leq\lambda_{1}^{2}V$,

  4. A.4.

    $\max_{\|u\|=1,\|v\|=1}\mathbb{E}\left[\exp\left(\left(\frac{|u^{\top}(A_{i}^{\top}-\Sigma)v|^{2}}{K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}}\right)^{1/(2a)}\right)\right]\leq 1$, where $H_{u}:=(1/\lambda_{1}^{2})\,\mathbb{E}[(A_{i}-\Sigma)uu^{\top}(A_{i}-\Sigma)^{\top}]$. We denote $\gamma^{2}:=\max_{\|u\|=1}\|H_{u}\|_{2}$.

The first three assumptions are required for PCA even if privacy is not needed. The last assumption provides a Gaussian-like tail bound that determines how much noise we need to introduce in the algorithm for $(\varepsilon,\delta)$-DP. The following lemma is useful in the analyses.

Lemma 2.1.

Under A.1 and A.4 in Assumption 1, for any unit vectors $u$ and $v$, with probability $1-\zeta$,

$$|u^{\top}(A_{i}^{\top}-\Sigma)v|^{2}\;\leq\;K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}\log^{2a}(1/\zeta)\;.\tag{1}$$

2.1 Oja’s algorithm

In a non-private setting, the following streaming algorithm introduced in [61] achieves optimal sample complexity, as analyzed in [39]. It is a projected stochastic gradient ascent on the objective defined on the empirical covariance: $\max_{\|w\|=1}(1/n)\sum_{i=1}^{n}w^{\top}A_{i}w$.

1 Choose $w_{0}$ uniformly at random from the unit sphere
2 for $t=1,2,\ldots,T$ do  $w_{t}^{\prime}\leftarrow w_{t-1}+\eta_{t}A_{t}w_{t-1}$,  $w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|$
Return $w_{T}$
Algorithm 1 (Non-private) Oja’s Algorithm
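For concreteness, the following is a minimal NumPy sketch of Algorithm 1; the streaming interface `A_stream` and the step-size schedule `eta(t)` are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def oja(A_stream, d, eta, seed=0):
    """Minimal sketch of (non-private) Oja's Algorithm (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)              # w_0 uniform on the unit sphere
    for t, A_t in enumerate(A_stream, start=1):
        w = w + eta(t) * (A_t @ w)      # w'_t <- w_{t-1} + eta_t A_t w_{t-1}
        w /= np.linalg.norm(w)          # w_t <- w'_t / ||w'_t||
    return w
```

For instance, with samples $A_{i}=x_{i}x_{i}^{\top}$ one may call `oja((np.outer(x, x) for x in xs), d, lambda t: 1.0 / (10 + t))`.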

Central to our analysis is the following error bound on Oja’s Algorithm from [39].

Theorem 2.2 ([39, Theorem 4.1]).

Under Assumptions A.1-A.3, suppose the step size is $\eta_{t}=\frac{\alpha}{(\lambda_{1}-\lambda_{2})(\xi+t)}$ for some $\alpha>1/2$ and $\xi:=20\max\left(\kappa M\alpha,\,\kappa^{2}(V+1)\alpha^{2}/\log(1+(\zeta/100))\right)$. If $T>\xi$, then there exists a constant $C>0$ such that Algorithm 1 outputs $w_{T}$ achieving, w.p. $1-\zeta$,

$$\sin^{2}\left(w_{T},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(\frac{\alpha^{2}\kappa^{2}V}{(2\alpha-1)T}+d\left(\frac{\xi}{T}\right)^{2\alpha}\right)\;.\tag{2}$$

2.2 Background on Differential Privacy

Differential privacy (DP), introduced in [22], is a de facto mathematical measure of the privacy leakage of a database accessed via queries. It ensures that even an adversary who knows all other entries cannot identify with high confidence whether a person of interest participated in the database or not.

Definition 2.3 (Differential privacy [22]).

Given two multisets $S$ and $S^{\prime}$, we say the pair $(S,S^{\prime})$ is neighboring if $|S\setminus S^{\prime}|+|S^{\prime}\setminus S|\leq 1$. We say a stochastic query $q$ over a dataset $S$ satisfies $(\varepsilon,\delta)$-differential privacy for some $\varepsilon>0$ and $\delta\in(0,1)$ if ${\mathbb{P}}(q(S)\in A)\leq e^{\varepsilon}\,{\mathbb{P}}(q(S^{\prime})\in A)+\delta$ for all neighboring $(S,S^{\prime})$ and all subsets $A$ of the range of $q$.

Small values of $\varepsilon$ and $\delta$ ensure that the adversary cannot identify any single data point with high confidence, thus providing plausible deniability. We provide useful DP lemmas in Appendix B. Within our stochastic gradient descent approach to PCA, we rely on the Gaussian mechanism to privatize each update. The sensitivity of a query $q$ is defined as $\Delta_{q}:=\sup_{\text{neighboring }(S,S^{\prime})}\|q(S)-q(S^{\prime})\|$.

Lemma 2.4 (Gaussian mechanism [23]).

For a query $q$ with sensitivity $\Delta_{q}$, $\varepsilon\in(0,1)$, and $\delta\in(0,1)$, the Gaussian mechanism outputs $q(S)+{\cal N}\big(0,(\Delta_{q}\sqrt{2\log(1.25/\delta)}/\varepsilon)^{2}\,\mathbf{I}_{d}\big)$ and achieves $(\varepsilon,\delta)$-DP.
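As a concrete illustration, a minimal NumPy sketch of this mechanism follows; the function name and interface are our own and not from the paper.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release value + N(0, sigma^2 I) with sigma = sensitivity * sqrt(2 log(1.25/delta)) / eps,
    which is (eps, delta)-DP for eps, delta in (0, 1) (Lemma 2.4)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    value = np.asarray(value, dtype=float)
    return value + sigma * rng.standard_normal(value.shape)
```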

Another mechanism we frequently use is the private histogram learner of [49], whose analysis is provided in Appendix B, along with various composition theorems to provide end-to-end guarantees.

2.3 Comparisons with existing results in private PCA

We briefly discuss the most closely related work and survey more prior work in Appendix A. Most existing results assume fixed data in a deterministic setting where each sample has a bounded norm, $\|x_{i}\|\leq\beta$, and the goal is to find the top eigenvector of $\hat{\Sigma}:=(1/n)\sum_{i=1}^{n}(x_{i}-\hat{\mu})(x_{i}-\hat{\mu})^{\top}$ for the empirical mean $\hat{\mu}$. For the purpose of comparison, consider Gaussian $x_{i}\sim{\cal N}(0,\Sigma)$ with $\|x_{i}\|\leq\beta=O(\sqrt{\lambda_{1}d\log(n/\zeta)})$ for all $i\in[n]$ with probability $1-\zeta$. The first line of approaches in [10, 15, 24] is a Gaussian mechanism that outputs ${\rm PCA}(\hat{\Sigma}+Z)$, where $Z$ is a symmetric matrix with i.i.d. Gaussian entries of variance $((\beta^{2}/n\varepsilon)\sqrt{2\log(1.25/\delta)})^{2}$ to ensure $(\varepsilon,\delta)$-DP. The tightest result in [24, Theorem 7] achieves

$$\sin(\hat{v},v_{1})\;=\;\tilde{O}\Big{(}\kappa\Big{(}\sqrt{\frac{d}{n}}+\frac{d^{3/2}\sqrt{\log(1/\delta)}}{\varepsilon n}\Big{)}\Big{)}\;,\tag{3}$$

with high probability, under a strong assumption that the spectral gap is very large: $\lambda_{1}-\lambda_{2}=\omega(d^{3/2}\sqrt{\log(1/\delta)}/(\varepsilon n))$. In a typical scenario with $\lambda_{1}=O(1)$, this requires a large sample size of $n=\omega(d^{3/2}/\varepsilon)$. Since this Gaussian mechanism does not exploit the statistical properties of i.i.d. samples, the second term in this upper bound is larger by a factor of $d^{1/2}$ compared to the proposed DP-PCA (Corollary 5.2). The error rate of Eq. (3) is also achieved in [36, 34] by adding Gaussian noise to the standard power method for computing the principal components. When the spectral gap, $\lambda_{1}-\lambda_{2}$, is smaller, it is possible to trade off the dependence on $\kappa$ and the sampling ratio $d/n$, which we do not address in this work but survey in Appendix A.

3 First attempt: making Oja’s Algorithm private

Following the standard recipe in training with DP-SGD, e.g., [1], we introduce Private Oja's Algorithm in Algorithm 2. At each gradient update, we first apply gradient norm clipping to limit the contribution of a single data point, and then add an appropriately chosen Gaussian noise from Lemma 2.4 to achieve $(\varepsilon,\delta)$-DP end-to-end. The choice of clipping threshold $\beta$ ensures that, with high probability under our assumption, we do not clip any gradients. The choice of noise multiplier $\alpha$ ensures $(\varepsilon,\delta)$-DP.

Input: $S=\{A_{i}\in\mathbb{R}^{d\times d}\}_{i=1}^{n}$, privacy $(\varepsilon,\delta)$, learning rates $\{\eta_{t}\}_{t=1}^{n}$
1 Randomly permute $S$ and choose $w_{0}$ uniformly at random from the unit sphere
2 Set DP noise multiplier: $\alpha\leftarrow C^{\prime}\log(n/\delta)/(\varepsilon\sqrt{n})$
3 Set clipping threshold: $\beta\leftarrow C\lambda_{1}\sqrt{d}\,(K\gamma\log^{a}(nd/\zeta)+1)$
4 for $t=1,2,\ldots,n$ do
5       Sample $z_{t}\sim{\cal N}(0,\mathbf{I}_{d})$
6       $w_{t}^{\prime}\leftarrow w_{t-1}+\eta_{t}\,{\rm clip}_{\beta}(A_{t}w_{t-1})+2\eta_{t}\beta\alpha z_{t}$, where ${\rm clip}_{\beta}(x)=x\cdot\min\{1,\beta/\|x\|_{2}\}$
7       $w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|$
Return $w_{n}$
Algorithm 2 Private Oja’s Algorithm
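A minimal NumPy sketch of Algorithm 2 follows; the constant $C^{\prime}$ in the noise multiplier, the clipping threshold `beta`, and the step-size schedule `eta(t)` are taken as given inputs here and are assumptions of this sketch.

```python
import numpy as np

def private_oja(A_list, eps, delta, eta, beta, C_prime=1.0, seed=0):
    """Sketch of Private Oja's Algorithm (Algorithm 2): shuffle the data, then take
    one clipped and noised Oja step per sample. C_prime stands in for the constant C'."""
    rng = np.random.default_rng(seed)
    n, d = len(A_list), A_list[0].shape[0]
    alpha = C_prime * np.log(n / delta) / (eps * np.sqrt(n))    # DP noise multiplier
    perm = rng.permutation(n)                                   # random shuffling of S
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(1, n + 1):
        g = A_list[perm[t - 1]] @ w
        g = g * min(1.0, beta / max(np.linalg.norm(g), 1e-12))  # clip_beta
        w = w + eta(t) * g + 2.0 * eta(t) * beta * alpha * rng.standard_normal(d)
        w /= np.linalg.norm(w)
    return w
```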

One caveat of streaming algorithms is that we access the data $n$ times, each time with a private mechanism, but accessing only a single data point at a time. To prevent excessive privacy loss due to such a large number of data accesses, we apply a random shuffling of $S$ (line 1 of Algorithm 2) in order to benefit from standard amplification by shuffling [25, 30]. This gives an amplified privacy guarantee that allows us to add a small noise proportional to $\alpha=O(\log(n/\delta)/(\varepsilon\sqrt{n}))$. Without the shuffle amplification, we would instead need a larger noise scaling as $\alpha=O(\log(n/\delta)/\varepsilon)$, resulting in a suboptimal utility guarantee. However, this comes with a restriction that the amplification holds only for small values of $\varepsilon=O(\sqrt{\log(n/\delta)/n})$. Our first contribution in the proposed DP-PCA (Algorithm 3) is to expand this range to $\varepsilon=O(1)$, which includes the practical regime of interest $\varepsilon\in[1/2,5]$.

Lemma 3.1 (Privacy).

If $\varepsilon=O(\sqrt{\log(n/\delta)/n})$ and the noise multiplier is chosen to be $\alpha=\Omega\left(\log(n/\delta)/(\varepsilon\sqrt{n})\right)$, then Algorithm 2 is $(\varepsilon,\delta)$-DP.

Under Assumption 1, we select the gradient norm clipping threshold $\beta$ such that, with high probability, no gradient exceeds $\beta$.

Lemma 3.2 (Gradient clipping).

Let $\beta=C\lambda_{1}\sqrt{d}\,(K\gamma\log^{a}(nd/\zeta)+1)$ for some constant $C>0$. Then, with probability $1-\zeta$, $\|A_{t}w_{t-1}\|\leq\beta$ for any fixed $w_{t-1}$ independent of $A_{t}$, for all $t\in[n]$.

We provide proofs of both lemmas and the next theorem in Appendix D. When no clipping is applied, we can use the standard analysis of Oja’s Algorithm from [39] to prove the following utility guarantee.

Theorem 3.3 (Utility).

Given $n$ i.i.d. samples $\{A_{i}\in\mathbb{R}^{d\times d}\}_{i=1}^{n}$ satisfying Assumption 1 with parameters $(\Sigma,M,V,K,\kappa,a,\gamma^{2})$, if

$$n\;=\;\tilde{O}\Big{(}\,\kappa^{2}+\kappa M+\kappa^{2}V+\frac{d\,\kappa\,(\gamma+1)\,\log(1/\delta)}{\varepsilon}\,\Big{)}\;,\tag{4}$$

with a large enough constant, then there exists a positive universal constant $c_{1}$ and a choice of learning rate $\eta_{t}$ that depends on $(t,M,V,K,a,\lambda_{1},\lambda_{1}-\lambda_{2},n,d,\varepsilon,\delta)$ such that Algorithm 2 with a choice of $\zeta=0.01$ outputs $w_{n}$ that achieves, with probability $0.99$,

$$\sin^{2}\left(w_{n},v_{1}\right)\;=\;\widetilde{O}\left(\kappa^{2}\Big{(}\frac{V}{n}+\frac{(\gamma+1)^{2}d^{2}\log^{2}(1/\delta)}{\varepsilon^{2}n^{2}}\,\Big{)}\,\right)\;,\tag{5}$$

where $\widetilde{O}(\cdot)$ hides poly-logarithmic factors in $n$, $d$, $1/\varepsilon$, and $\log(1/\delta)$, and polynomial factors in $K$.

The first term in Eq. (5) matches the non-private error rate for Oja's algorithm in Eq. (2) with $\alpha=O(\log n)$ and $T=n$, and the second term is the price we pay for ensuring $(\varepsilon,\delta)$-DP.

Remark 3.4.

For the canonical setting of Gaussian data with $A_{i}=x_{i}x_{i}^{\top}$ and $x_{i}\sim{\cal N}(0,\Sigma)$, we have, for example from [62, Lemma 1.12], that $M=O(d\log(n))$, $V=O(d)$, $K=4$, $a=1$, and $\gamma^{2}=O(1)$. Theorem 3.3 implies the following error rate:

$$\sin^{2}\left(w_{n},v_{1}\right)\;=\;\tilde{O}\Big{(}\kappa^{2}\Big{(}\frac{d}{n}+\frac{d^{2}\log^{2}(1/\delta)}{\varepsilon^{2}n^{2}}\Big{)}\Big{)}\;.\tag{6}$$

Compared to the lower bound in Theorem 5.3, this is already near-optimal. However, for general distributions satisfying Assumption 1, Algorithm 2 (in particular the second term in Eq. (5)) can be significantly sub-optimal. We explain this second weakness of Private Oja's Algorithm in the following section (the first weakness being the restriction to $\varepsilon=O(\sqrt{\log(n/\delta)/n})$).

4 Two remaining challenges

We explain the two remaining challenges in Private Oja's Algorithm and propose techniques to overcome these challenges, leading to the design of DP-PCA (Algorithm 3).

First challenge: restricted range of $\varepsilon=O(\sqrt{\log(n/\delta)/n})$. This is due to the large number, $n$, of iterations, which necessitates the use of amplification by shuffling in Theorem D.1. We reduce the number of iterations with minibatch SGD. For $T=O(\log^{2}n)$ and $t=1,2,\ldots,T$, we repeat

$$w^{\prime}_{t}\leftarrow w_{t-1}+\frac{\eta_{t}}{B}\sum_{i=1+B(t-1)}^{Bt}{\rm clip}_{\beta}(A_{i}w_{t-1})+\frac{2\eta_{t}\beta\alpha}{B}z_{t}\;,\;\;\text{ and }\;\;w_{t}\leftarrow w^{\prime}_{t}/\|w_{t}^{\prime}\|\;,\tag{7}$$

where $z_{t}\sim{\cal N}(0,\mathbf{I}_{d})$ and the minibatch size is $B=\lfloor n/T\rfloor$. Since the dataset is accessed only $T=O(\log^{2}n)$ times, the end-to-end privacy is analyzed with the serial composition (Lemma B.3) instead of the amplification by shuffling. This ensures $(\varepsilon,\delta)$-DP for any $\varepsilon=O(1)$, resolving the first challenge, and still achieves the utility guarantee of Eq. (5).
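The update in Eq. (7) can be sketched in NumPy as follows; the clipping threshold `beta` and the per-step noise multiplier `alpha` are taken as given, and the interface is our own.

```python
import numpy as np

def minibatch_private_oja_step(w, batch, eta_t, beta, alpha, rng):
    """One step of Eq. (7): average of clipped gradients over a minibatch of
    matrices A_i, plus Gaussian noise scaled by 2 * eta_t * beta * alpha / B."""
    B = len(batch)
    grads = np.stack([A @ w for A in batch])                  # A_i w_{t-1}, shape (B, d)
    norms = np.maximum(np.linalg.norm(grads, axis=1, keepdims=True), 1e-12)
    grads = grads * np.minimum(1.0, beta / norms)             # clip_beta per sample
    w_new = w + (eta_t / B) * grads.sum(axis=0) \
              + (2.0 * eta_t * beta * alpha / B) * rng.standard_normal(w.shape)
    return w_new / np.linalg.norm(w_new)
```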

Second challenge: excessive noise for privacy. This is best explained with an example.

Example 4.1 (Signal and noise separation).

Consider a setting with $A_{i}=x_{i}x_{i}^{\top}$ and $x_{i}=s_{i}+n_{i}$, where $s_{i}=v$ with probability half and $s_{i}=-v$ otherwise for a unit-norm vector $v$, and $n_{i}\sim{\cal N}(0,\sigma^{2}\mathbf{I})$. We want to find the principal component of $\Sigma=\mathbb{E}[x_{i}x_{i}^{\top}]=vv^{\top}+\sigma^{2}\mathbf{I}$, which is $v$. This construction separates the signal and the noise. For $A_{i}=vv^{\top}+s_{i}n_{i}^{\top}+n_{i}s_{i}^{\top}+n_{i}n_{i}^{\top}$, the signal component $vv^{\top}$ is deterministic because the signs cancel. The noise component $s_{i}n_{i}^{\top}+n_{i}s_{i}^{\top}+n_{i}n_{i}^{\top}$ is random. We can control the Signal-to-Noise Ratio (SNR), $1/\sigma^{2}$, by changing $\sigma^{2}$, and we are particularly interested in the regime where $\sigma^{2}$ is small. For $\sigma^{2}<1$, this example satisfies Assumption 1 with $\lambda_{1}=1+\sigma^{2}$, $\lambda_{2}=\sigma^{2}$, $V=O(d\sigma^{2})$, $K=O(1)$, $a=1$, and $\gamma^{2}=\sigma^{2}$. Substituting this into Eq. (5), Private Oja's Algorithm achieves

$$\sin^{2}(w_{n},v_{1})\;=\;\tilde{O}\Big{(}\frac{\sigma^{2}d}{n}+\frac{d^{2}\log(1/\delta)}{\varepsilon^{2}n^{2}}\Big{)}\;,\tag{8}$$

where we are interested in the regime $\sigma^{2}<1$.
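The construction in Example 4.1 can be simulated with the following sketch; the choice $v=e_{1}$ is an arbitrary illustrative choice.

```python
import numpy as np

def sample_example_4_1(n, d, sigma, seed=0):
    """Sample x_i = s_i + n_i as in Example 4.1: s_i = +/- v with equal probability
    for a fixed unit vector v, and n_i ~ N(0, sigma^2 I). Returns the samples and v,
    the principal component of E[x_i x_i^T] = v v^T + sigma^2 I."""
    rng = np.random.default_rng(seed)
    v = np.zeros(d)
    v[0] = 1.0                                      # v = e_1 (arbitrary unit vector)
    signs = rng.choice([-1.0, 1.0], size=(n, 1))    # s_i = +/- v with probability 1/2
    x = signs * v + sigma * rng.standard_normal((n, d))
    return x, v
```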

This is problematic since the second term, due to the DP noise, does not vanish as the randomness $\sigma^{2}$ in the data decreases. We do not observe this for Gaussian data, where the signal and the noise scale proportionally, as shown below. We reduce the noise added for privacy by switching from simple norm clipping, which adds noise proportional to the norm of the gradients, to private estimation, which only requires the noise to scale as the range of the gradients, i.e., the maximum distance between two gradients in the minibatch. The toy example above showcases that the range can be arbitrarily smaller than the maximum norm (Fig. 1). We want to emphasize that although the idea of using private estimation within an optimization has been conceptually proposed in abstract settings, e.g., in [44], DP-PCA is the first setting where ($i$) such separation between the norm and the range of the gradients holds under any statistical model, and hence ($ii$) the long line of recent advances in private estimation provides a significant gain over the simple DP-SGD [1].

Figure 1: 2-d PCA under the Gaussian data from Remark 3.4 (left) shows that the average gradient (red arrow) is smaller than the range of the minibatch of 400 gradients (blue dots). Under Example 4.1 (right), the range can be made arbitrarily smaller than the average gradient by changing $\sigma^{2}$.

5 Differentially Private Principal Component Analysis (DP-PCA)

Combining the two ideas of minibatch SGD and private mean estimation, we propose DP-PCA. We use minibatch SGD with minibatch size $B=O(n/\log^{2}n)$ to allow for the larger range of $\varepsilon=O(1)$. We use Private Mean Estimation to add an appropriate level of noise chosen adaptively according to Private Eigenvalue Estimation. We describe the details of both sub-routines in Section 6.

Input: $S=\{A_{i}\}_{i=1}^{n}$, privacy $(\varepsilon,\delta)$, batch size $B\in\mathbb{Z}_{+}$, learning rates $\{\eta_{t}\}_{t=1}^{\lfloor n/B\rfloor}$, probability $\zeta\in(0,1)$
1 Choose $w_{0}$ uniformly at random from the unit sphere
2 for $t=1,2,\ldots,T=\lfloor n/B\rfloor$ do
3       Run Private Top Eigenvalue Estimation (Algorithm 4) with $(\varepsilon/2,\delta/2)$-DP and failure probability $\zeta/(2T)$ on $\{A_{B(t-1)+i}w_{t-1}\}_{i=1}^{\lfloor B/2\rfloor}$. Let the returned estimate be $\hat{\Lambda}_{t}>0$.
4       Run Private Mean Estimation (Algorithm 5) with $(\varepsilon/2,\delta/2)$-DP, failure probability $\zeta/(2T)$, and the estimated eigenvalue $2\hat{\Lambda}_{t}$ on $\{A_{B(t-1)+\lfloor B/2\rfloor+i}w_{t-1}\}_{i=1}^{\lfloor B/2\rfloor}$. Let the returned mean gradient estimate be $\hat{g}_{t}\in\mathbb{R}^{d}$.
5       $w_{t}^{\prime}\leftarrow w_{t-1}+\eta_{t}\hat{g}_{t}$,   $w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|$
Return $w_{T}$
Algorithm 3 Differentially Private Principal Component Analysis (DP-PCA)
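A minimal NumPy sketch of the DP-PCA loop follows; the two sub-routines (Algorithms 4 and 5) are passed in as callables, and their interfaces here are our own assumptions.

```python
import numpy as np

def dp_pca(A_list, eps, delta, B, eta, zeta,
           private_top_eigenvalue, private_mean_estimation, seed=0):
    """Sketch of Algorithm 3 (DP-PCA). Each minibatch is split in half: the first
    half is used to privately estimate the top eigenvalue of the gradient covariance,
    the second half is used for the private mean estimate of the gradient."""
    rng = np.random.default_rng(seed)
    n, d = len(A_list), A_list[0].shape[0]
    T = n // B
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(1, T + 1):
        batch = A_list[B * (t - 1): B * t]
        grads = np.stack([A @ w for A in batch])
        half = B // 2
        Lambda_hat = private_top_eigenvalue(grads[:half], eps / 2, delta / 2, zeta / (2 * T))
        g_hat = private_mean_estimation(grads[half:], 2 * Lambda_hat, eps / 2, delta / 2, zeta / (2 * T))
        w = w + eta(t) * g_hat
        w /= np.linalg.norm(w)
    return w
```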

We show an upper bound on the error achieved by DP-PCA under an appropriate choice of the learning rate. We provide a complete proof in Appendix E.1, which includes the explicit choice of the learning rate $\eta_{t}$ in Eq. (60); a proof sketch is provided in Section 6.1.

Theorem 5.1.

For $\varepsilon\in(0,0.9)$, DP-PCA guarantees $(\varepsilon,\delta)$-DP for all $S$, $B$, $\zeta$, and $\delta$. Given $n$ i.i.d. samples $\{A_{i}\in\mathbb{R}^{d\times d}\}_{i=1}^{n}$ satisfying Assumption 1 with parameters $(\Sigma,M,V,K,\kappa,a,\gamma^{2})$, if

$$n\;=\;\tilde{O}\Big{(}\,e^{\kappa^{2}}+\frac{d^{1/2}(\log(1/\delta))^{3/2}}{\varepsilon}+\kappa M+\kappa^{2}V+\frac{d\,\kappa\,\gamma\,(\log(1/\delta))^{1/2}}{\varepsilon}\,\Big{)}\;,\tag{9}$$

with a large enough constant and $\delta\leq 1/n$, then there exists a positive universal constant $c_{1}$ and a choice of learning rate $\eta_{t}$ that depends on $(t,M,V,K,a,\lambda_{1},\lambda_{1}-\lambda_{2},n,d,\varepsilon,\delta)$ such that $T=\lfloor n/B\rfloor$ steps of DP-PCA in Algorithm 3 with choices of $\zeta=0.01$ and $B=c_{1}n/(\log n)^{2}$ output $w_{T}$ such that, with probability $0.99$,

$$\sin\left(w_{T},v_{1}\right)\;=\;\widetilde{O}\left(\kappa\Big{(}\sqrt{\frac{V}{n}}+\frac{\gamma d\sqrt{\log(1/\delta)}}{\varepsilon n}\,\Big{)}\,\right)\;,\tag{10}$$

where $\widetilde{O}(\cdot)$ hides poly-logarithmic factors in $n$, $d$, $1/\varepsilon$, and $\log(1/\delta)$, and polynomial factors in $K$.

We further interpret this analysis and show that ($i$) DP-PCA is nearly optimal when the data is from a Gaussian distribution, by comparing against a lower bound (Theorem 5.3), and ($ii$) DP-PCA significantly improves upon Private Oja's Algorithm under Example 4.1. We discuss the necessity of some of the assumptions at the end of this section, including how to agnostically find an appropriate learning rate schedule.

Near-optimality of DP-PCA under Gaussian distributions. Consider the case of i.i.d. samples $\{x_{i}\}_{i=1}^{n}$ from the Gaussian distribution of Remark 3.4.

Corollary 5.2 (Upper bound; Gaussian distribution).

Under the hypotheses of Theorem 5.1 and $\{A_{i}=x_{i}x_{i}^{\top}\}_{i=1}^{n}$ with Gaussian random vectors $x_{i}$, after $T=n/B$ steps, DP-PCA outputs $w_{T}$ that achieves, with probability $0.99$,

$$\sin(w_{T},v_{1})\;=\;\tilde{O}\left(\kappa\left(\sqrt{\frac{d}{n}}+\frac{d\sqrt{\log(1/\delta)}}{\varepsilon n}\right)\right)\;.\tag{11}$$

We prove a nearly matching lower bound, up to factors of $\sqrt{\lambda_{1}/\lambda_{2}}$ and $\sqrt{\log(1/\delta)}$. One caveat is that the lower bound assumes pure DP with $\delta=0$. We do not yet have a lower bound technique for approximate DP that is tight, and all known approximate-DP lower bounds have gaps to the achievable upper bounds in their dependence on $\log(1/\delta)$, e.g., [6, 56]. We provide a proof in Appendix C.1.

Theorem 5.3 (Lower bound; Gaussian distribution).

Let $\mathcal{M}_{\varepsilon}$ be a class of $(\varepsilon,0)$-DP estimators that map $n$ i.i.d. samples to an estimate $\hat{v}\in\mathbb{R}^{d}$. A set of Gaussian distributions with $(\lambda_{1},\lambda_{2})$ as the first and second eigenvalues of the covariance matrix is denoted by $\mathcal{P}_{(\lambda_{1},\lambda_{2})}$. There exists a universal constant $C>0$ such that

$$\inf_{\hat{v}\in\mathcal{M}_{\varepsilon}}\;\sup_{P\in\mathcal{P}_{(\lambda_{1},\lambda_{2})}}\;\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1})\right]\;\geq\;C\min\left(\kappa\left(\sqrt{\frac{d}{n}}+\frac{d}{\varepsilon n}\right)\sqrt{\frac{\lambda_{2}}{\lambda_{1}}},\,1\right)\;.\tag{12}$$

Comparisons with Private Oja's Algorithm. We demonstrate that DP-PCA can significantly improve upon Private Oja's Algorithm with Example 4.1, where DP-PCA achieves an error bound of $\sin(w_{T},v_{1})=\tilde{O}\big{(}\sigma\sqrt{d/n}+\sigma d\sqrt{\log(1/\delta)}/(\varepsilon n)\big{)}$. As the noise power $\sigma^{2}$ decreases, DP-PCA achieves a vanishing error, whereas Private Oja's Algorithm has a non-vanishing error in Eq. (8). This follows from the fact that the second term in the error bound in Eq. (10) scales as $\gamma$, which can be made arbitrarily smaller than the second term in Eq. (5), which scales as $(\gamma+1)$. Further, the error bound for DP-PCA holds for any $\varepsilon=O(1)$, whereas Private Oja's Algorithm requires a significantly smaller $\varepsilon=O(\sqrt{\log(n/\delta)/n})$.

Remarks on the assumptions of Theorem 5.1. We have an exponential dependence of the sample complexity on $\kappa^{2}$, i.e., $n\geq\exp(\kappa^{2})$. This ensures we have a large enough $T=\lfloor n/B\rfloor$ to reduce the non-dominant second term in Eq. (2), in balancing the learning rate $\eta_{t}$ and $T$ (shown explicitly in Eqs. (62) and (63) in the Appendix). It is possible to get rid of this exponential dependence at the cost of an extra term of $\tilde{O}(\kappa^{4}\gamma^{2}d^{2}\log(1/\delta)/(\varepsilon n)^{2})$ in the error rate in Eq. (10), by selecting a slightly larger $T=c\kappa^{2}\log^{2}n$. A Gaussian-like tail bound as in Assumption A.4 is necessary to get the desired upper bound scaling as $\tilde{O}(d\sqrt{\log(1/\delta)}/(\varepsilon n))$ in Eq. (11), for example. The next lower bound shows that, without such an assumption on the tail, the error due to privacy scales as $\Omega(\sqrt{(d\wedge\log(1/\delta))/(\varepsilon n)})$. We believe that the dependence on $\delta$ is loose, and it might be possible to get a tighter lower bound using [45]. We provide a proof and other lower bounds in Appendix C.

Theorem 5.4 (Lower bound without Assumption A.4).

Let $\mathcal{M}_{\varepsilon}$ be a class of $(\varepsilon,\delta)$-DP estimators that map $n$ i.i.d. samples to an estimate $\hat{v}\in\mathbb{R}^{d}$. A set of distributions satisfying Assumptions A.1-A.3 with $M=\tilde{O}(d+\sqrt{n\varepsilon/d})$, $V=O(d)$, and $\gamma=O(1)$ is denoted by $\tilde{\mathcal{P}}$. For $d\geq 2$, there exists a universal constant $C>0$ such that

$$\inf_{\hat{v}\in\mathcal{M}_{\varepsilon}}\;\sup_{P\in\tilde{\mathcal{P}}}\;\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1})\right]\;\geq\;C\kappa\min\left(\sqrt{\frac{d\wedge\log\left(\left(1-e^{-\varepsilon}\right)/\delta\right)}{\varepsilon n}},\,1\right)\;.\tag{13}$$

Currently, DP-PCA requires choices of the learning rates, $\eta_{t}$, that depend on possibly unknown quantities. Since we can privately evaluate the quality of our solution, one can instead run multiple instances of DP-PCA with varying $\eta_{t}=c_{1}/(c_{2}+t)$ and find the best choice of $c_{1}>0$ and $c_{2}>0$. Let $w_{T}(c_{1},c_{2})$ denote the resulting solution for one instance of $\{\eta_{t}=c_{1}/(c_{2}+t)\}_{t=1}^{T}$. We first set a target error $\zeta$. For each round $i=1,\ldots$, we run the algorithm for $(c_{1},c_{2})\in\{2^{i-1},2^{-i+1}\}\times\{2^{-i+1},2^{-i+2},\ldots,2^{i-1}\}$ and $(c_{1},c_{2})\in\{2^{-i+1},2^{-i+2},\ldots,2^{i-1}\}\times\{2^{i-1},2^{-i+1}\}$, and compute each $\sin(w_{T}(c_{1},c_{2}),v_{1})$ privately, each with privacy budget $\varepsilon_{i}=\frac{\varepsilon}{2^{i+1}(2i-1)}$, $\delta_{i}=\frac{\delta}{2^{i+1}(2i-1)}$. We terminate the search once there is a $w_{T}(c_{1},c_{2})$ satisfying $\sin(w_{T}(c_{1},c_{2}),v_{1})\leq\zeta$. This search meta-algorithm terminates in a logarithmic number of rounds, and the total sample complexity only blows up by a poly-logarithmic factor.
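A sketch of this doubling search is given below; `dp_pca_run(c1, c2)` and `private_error(w, eps_i, delta_i)` are assumed callables standing in for one run of DP-PCA and for the private evaluation of its error, respectively.

```python
from itertools import chain

def search_learning_rate(dp_pca_run, private_error, eps, delta, zeta, max_rounds=20):
    """Doubling search over (c1, c2) for the schedule eta_t = c1 / (c2 + t):
    at round i, try the boundary of the grid {2^{-i+1}, ..., 2^{i-1}}^2 and stop
    as soon as a candidate's privately estimated error is at most zeta."""
    for i in range(1, max_rounds + 1):
        grid = [2.0 ** k for k in range(-i + 1, i)]       # {2^{-i+1}, ..., 2^{i-1}}
        extremes = [2.0 ** (i - 1), 2.0 ** (-i + 1)]
        candidates = set(chain(((a, b) for a in extremes for b in grid),
                               ((a, b) for a in grid for b in extremes)))
        eps_i = eps / (2 ** (i + 1) * (2 * i - 1))        # per-evaluation privacy budget
        delta_i = delta / (2 ** (i + 1) * (2 * i - 1))
        for c1, c2 in candidates:
            w = dp_pca_run(c1, c2)
            if private_error(w, eps_i, delta_i) <= zeta:
                return w, (c1, c2)
    return None
```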

6 Private mean estimation for the minibatch stochastic gradients

DP-PCA critically relies on private mean estimation to reduce the variance of the noise required to achieve $(\varepsilon,\delta)$-DP. We follow a common recipe from [49, 43, 47, 9, 16]. First, we privately find an approximate range of the gradients in the minibatch (Alg. 4). Next, we apply the Gaussian mechanism to the truncated gradients, where the truncation is tailored to the estimated range (Alg. 5).

Step 1: estimating the range. We need to find an approximate range of the minibatch of gradients in order to adaptively truncate the gradients and bound the sensitivity. Inspired by a private preconditioning mechanism designed for mean estimation with unknown covariance from [46], we propose to use the privately estimated top eigenvalue of the covariance matrix of the gradients. For details on the version of the histogram learner we use in Alg. 4 in Appendix E.2, we refer to [55, Lemma D.1]. Unlike the private preconditioning of [46], which estimates all eigenvalues and requires $n=\widetilde{O}(d^{3/2}\log(1/\delta)/\varepsilon)$ samples, we only require the top eigenvalue, and hence the next theorem shows that we only need $n=\widetilde{O}(d\log(1/\delta)/\varepsilon)$.

Theorem 6.1.

Algorithm 4 is $(\varepsilon,\delta)$-DP. Let $g_{i}=A_{i}u$ for some fixed vector $u$, where $A_{i}$ satisfies A.1 and A.4 in Assumption 1, such that the mean is $\mathbb{E}[g_{i}]=\Sigma u$ and the covariance is $\mathbb{E}[(g_{i}-\Sigma u)(g_{i}-\Sigma u)^{\top}]=\lambda_{1}^{2}H_{u}$. With a large enough sample size scaling as

$$B\;=\;O\left(\frac{K^{2}\,d\,\log^{1+2a}(nd\log(1/(\delta\zeta))/\varepsilon)\log(1/(\zeta\delta))}{\varepsilon}\right)\;=\;\tilde{O}\left(\frac{K^{2}\,d\,\log(1/\delta)}{\varepsilon}\right)\;,\tag{14}$$

Algorithm 4 outputs $\hat{\Lambda}$ achieving $\hat{\Lambda}\in\left[(1/\sqrt{2})\lambda_{1}^{2}\|H_{u}\|_{2},\,\sqrt{2}\lambda_{1}^{2}\|H_{u}\|_{2}\right]$ with probability $1-\zeta$, where the pair $(K>0,a>0)$ parametrizes the tail of the distribution in A.4 and $\tilde{O}(\cdot)$ hides logarithmic factors in $B$, $d$, $1/\zeta$, $\log(1/\delta)$, and $\varepsilon$.

We provide a proof in Appendix E.2. There are other ways to privately estimate the range. Some approaches require known bounds such as $\sigma_{\rm min}^{2}\leq\lambda_{1}^{2}(H_{u})_{ii}\leq\sigma_{\rm max}^{2}$ for all $i\in[d]$ [49], and other agnostic approaches are more involved, such as the instance-optimal universal estimators of [18].

Step 2: Gaussian mechanism for mean estimation. Once we have a good estimate of the top eigenvalue from the previous step, we use it to select the bin size of the private histogram and compute the truncated empirical mean. Since the truncated empirical mean has bounded sensitivity, we can use the Gaussian mechanism to achieve DP. The algorithm is now standard in DP mean estimation, e.g., [49, 43]. However, the analysis is slightly different since our assumptions on the $g_{i}$'s are different. For completeness, we provide Algorithm 5 in Appendix E.3.
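Algorithm 5 itself is deferred to Appendix E.3; the following is only a heavily simplified sketch of the truncate-then-add-noise step, in which a non-private median stands in for the privately (histogram-based) located center and the truncation half-width follows the scale suggested by Lemma 6.2.

```python
import numpy as np

def private_mean_sketch(grads, Lambda_hat, eps, delta, K=1.0, a=1.0, zeta=0.01, seed=0):
    """Simplified sketch of Step 2: truncate each gradient coordinate-wise to a cube
    of half-width r around a rough center, then release the truncated mean with the
    Gaussian mechanism. NOT Algorithm 5 itself (which locates the center privately)."""
    rng = np.random.default_rng(seed)
    B, d = grads.shape
    r = 3.0 * K * np.sqrt(Lambda_hat) * np.log(B * d / zeta) ** a   # truncation half-width
    center = np.median(grads, axis=0)        # placeholder for the privately found center
    clipped = np.clip(grads, center - r, center + r)
    # One sample can move each coordinate of the truncated mean by at most 2r/B,
    # so the L2 sensitivity is at most 2 r sqrt(d) / B.
    sigma = (2.0 * r * np.sqrt(d) / B) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + sigma * rng.standard_normal(d)
```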

The next lemma shows that Private Mean Estimation is $(\varepsilon,\delta)$-DP and that, with high probability, clipping is not applied to any of the gradients. The returned private mean, therefore, is distributed as a spherical Gaussian centered at the empirical mean of the gradients. This result requires a good estimate of the top eigenvalue from Alg. 4, such that $\hat{\Lambda}\simeq\lambda_{1}^{2}\|H_{u}\|_{2}$. This analysis implies that we get an unbiased estimate of the gradient mean (which is critical in the analysis) with noise scaling as $\tilde{O}(\lambda_{1}\gamma\sqrt{d\log(1/\delta)}/(\varepsilon B))$, where $\gamma^{2}=\max_{u:\|u\|=1}\|H_{u}\|_{2}$ (which is critical in getting the tight sample complexity in the second term of the final utility guarantee in Eq. (10)). We provide a proof in Appendix E.3.

Lemma 6.2.

For $\varepsilon\in(0,0.9)$ and any $\delta\in(0,1)$, Algorithm 5 is $(\varepsilon,\delta)$-DP. Let $g_{i}=A_{i}u$ for some fixed vector $u$, where $A_{i}$ satisfies A.1 and A.4 in Assumption 1, such that the mean is $\mathbb{E}[g_{i}]=\Sigma u$ and the covariance is $\mathbb{E}[(g_{i}-\Sigma u)(g_{i}-\Sigma u)^{\top}]=\lambda_{1}^{2}H_{u}$. If $\hat{\Lambda}\in[\lambda_{1}^{2}\|H_{u}\|_{2}/\sqrt{2},\,\sqrt{2}\lambda_{1}^{2}\|H_{u}\|_{2}]$, $\delta\leq 1/B$, and $B=\Omega((\sqrt{d\log(1/\delta)}/\varepsilon)\log(d/(\zeta\delta)))$, then, with probability $1-\zeta$, we have $g_{i}\in\bar{g}+\left[-3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta),\,3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)\right]^{d}$ for all $i\in[B]$.

6.1 Proof sketch of Theorem 5.1

We choose $B=\Theta(n/\log^{2}n)$ such that we access the dataset only $T=\Theta(\log^{2}n)$ times. Hence we do not need to rely on amplification by shuffling. To add Gaussian noise that scales as the standard deviation of the gradients in each minibatch (as opposed to the potentially excessively large mean of the gradients), DP-PCA adopts techniques from recent advances in private mean estimation. Namely, we first get a private and accurate estimate of the range from Theorem 6.1. Using this estimate, $\hat{\Lambda}$, Private Mean Estimation returns an unbiased estimate of the empirical mean of the gradients, as long as no truncation has been applied, which is ensured by Lemma 6.2. This gives

$$w_{t}^{\prime}\;\leftarrow\;w_{t-1}+\eta_{t}\left(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}w_{t-1}+\beta_{t}z_{t}\right)\;,\tag{15}$$

for $z_{t}\sim{\cal N}(0,\mathbf{I})$ and $\beta_{t}=\frac{8K\sqrt{2\hat{\Lambda}_{t}}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}$. Using the rotation invariance of spherical Gaussian random vectors and the fact that $\|w_{t-1}\|=1$, we can reformulate it as

$$w_{t}^{\prime}\;\leftarrow\;w_{t-1}+\eta_{t}\underbrace{\left(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}\right)}_{\tilde{A}_{t}}w_{t-1}\;.\tag{16}$$

This process can be analyzed with Theorem 2.2, with $\tilde{A}_{t}$ substituted for $A_{t}$.

7 Discussion

For the canonical task of computing the principal component from i.i.d. samples, we give the first result achieving a near-optimal error rate. This critically relies on two ideas: minibatch SGD and private mean estimation. In particular, private mean estimation plays a critical role when the range of the gradients is significantly smaller than their norm; we achieve an optimal error rate that cannot be achieved with the standard recipe of gradient clipping, or even with the more sophisticated adaptive clipping of [4].

Assumption A.4 can be relaxed to heavy-tail bounds with a bounded $k$-th moment on $A_{i}$, in which case we expect the second term in Eq. (10) to scale as $O(d(\sqrt{\log(1/\delta)}/(\varepsilon n))^{1-1/k})$, drawing an analogy from a similar trend in a computationally inefficient DP-PCA without spectral gap [56, Corollary 6.10]. When a fraction of the data is corrupted, recent advances in [74, 51, 40] provide optimal algorithms for PCA. However, the existing approach of [56] for robust and private PCA is computationally intractable. Borrowing ideas from robust and private mean estimation in [55], one can design an efficient algorithm, but at the cost of sub-optimal sample complexity. It is an interesting direction to design an optimal and robust version of DP-PCA. Our lower bounds are loose in their dependence on $\log(1/\delta)$. Recently, a promising lower bound technique was introduced in [45] that might close this gap.

Acknowledgement

This work is supported in part by a Google faculty research award and NSF grants CNS-2002664, IIS-1929955, DMS-2134012, CCF-2019844 as a part of the NSF Institute for Foundations of Machine Learning (IFML), and CNS-2112471 as a part of the NSF AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE).

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [2] John M Abowd. The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2867–2867, 2018.
  • [3] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private assouad, fano, and le cam. In Algorithmic Learning Theory, pages 48–78. PMLR, 2021.
  • [4] Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34, 2021.
  • [5] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory, pages 284–309. PMLR, 2016.
  • [6] Rina Foygel Barber and John C Duchi. Privacy and statistical risk: Formalisms and minimax bounds. arXiv preprint arXiv:1412.4451, 2014.
  • [7] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32, 2019.
  • [8] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
  • [9] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. Coinpress: Practical private mean and covariance estimation. arXiv preprint arXiv:2006.06618, 2020.
  • [10] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.
  • [11] Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, and Lydia Zakynthinou. Covariance-aware private mean estimation without private covariance estimation. Advances in Neural Information Processing Systems, 34, 2021.
  • [12] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. J. Mach. Learn. Res., 20:94–1, 2019.
  • [13] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. Advances in Neural Information Processing Systems, 32, 2019.
  • [14] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.
  • [15] Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research, 14(1):2905–2943, 2013.
  • [16] Christian Covington, Xi He, James Honaker, and Gautam Kamath. Unbiased statistical estimation and valid confidence intervals under differential privacy. arXiv preprint arXiv:2110.14465, 2021.
  • [17] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin IP Rubinstein. Robust and private bayesian inference. In International Conference on Algorithmic Learning Theory, pages 291–305. Springer, 2014.
  • [18] Wei Dong and Ke Yi. Universal private estimators. arXiv preprint arXiv:2111.02598, 2021.
  • [19] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.
  • [20] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
  • [21] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.
  • [22] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [23] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [24] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 11–20, 2014.
  • [25] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.
  • [26] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067, 2014.
  • [27] Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Tight and robust private mean estimation with few users. arXiv preprint arXiv:2110.11876, 2021.
  • [28] Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies, 2016(3):41–61, 2016.
  • [29] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
  • [30] Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 954–964. IEEE, 2022.
  • [31] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pages 535–544. PMLR, 2018.
  • [32] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. arXiv preprint arXiv:1603.07294, 2016.
  • [33] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private mean estimation: z-test and tight confidence intervals. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2545–2554. PMLR, 2019.
  • [34] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. Advances in neural information processing systems, 27, 2014.
  • [35] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1255–1268, 2012.
  • [36] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 331–340, 2013.
  • [37] Samuel B Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. arXiv preprint arXiv:2111.12981, 2021.
  • [38] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. arXiv preprint arXiv:2107.11136, 2021.
  • [39] Prateek Jain, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming pca: Matching matrix bernstein and near-optimal finite sample guarantees for oja’s algorithm. In Conference on learning theory, pages 1147–1164. PMLR, 2016.
  • [40] Arun Jambulapati, Jerry Li, and Kevin Tian. Robust sub-gaussian principal component analysis and width-independent schatten packing. Advances in Neural Information Processing Systems, 33, 2020.
  • [41] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Steven Z Wu. Locally private gaussian estimation. Advances in Neural Information Processing Systems, 32, 2019.
  • [42] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376–1385. PMLR, 2015.
  • [43] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.
  • [44] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. Improved rates for differentially private stochastic convex optimization with heavy-tailed data. arXiv preprint arXiv:2106.01336, 2021.
  • [45] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. arXiv preprint arXiv:2205.08532, 2022.
  • [46] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded gaussians. arXiv preprint arXiv:2111.04609, 2021.
  • [47] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. arXiv preprint arXiv:2002.09464, 2020.
  • [48] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1395–1414. SIAM, 2013.
  • [49] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908, 2017.
  • [50] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1. JMLR Workshop and Conference Proceedings, 2012.
  • [51] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. Robust meta-learning for mixed linear regression with small batches. Advances in Neural Information Processing Systems, 33, 2020.
  • [52] Pravesh K Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. arXiv preprint arXiv:2112.03548, 2021.
  • [53] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
  • [54] Mengchu Li, Thomas B Berrett, and Yi Yu. On robustness and local differential privacy. arXiv preprint arXiv:2201.00751, 2022.
  • [55] Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34, 2021.
  • [56] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. arXiv preprint arXiv:2111.06578, 2021.
  • [57] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007.
  • [58] Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30, 2009.
  • [59] Kentaro Minami, HItomi Arai, Issei Sato, and Hiroshi Nakagawa. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pages 956–964, 2016.
  • [60] Darakhshan J Mir. Differential privacy: an exploration of the privacy-utility landscape. Rutgers The State University of New Jersey-New Brunswick, 2013.
  • [61] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273, 1982.
  • [62] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 813(814):46, 2015.
  • [63] Or Sheffet. Old techniques in differentially private linear regression. In Algorithmic Learning Theory, pages 789–827. PMLR, 2019.
  • [64] Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint arXiv:1709.02753, 2017.
  • [65] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • [66] Christos Tzamos, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Ilias Zadik. Optimal private median estimation under minimal distributional assumptions. Advances in Neural Information Processing Systems, 33:3301–3311, 2020.
  • [67] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [68] Duy Vu and Aleksandra Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops, pages 138–143. IEEE, 2009.
  • [69] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse pca in high dimensions. In Artificial intelligence and statistics, pages 1278–1286. PMLR, 2012.
  • [70] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [71] Di Wang, Hanshen Xiao, Srinivas Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In International Conference on Machine Learning, pages 10081–10091. PMLR, 2020.
  • [72] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv preprint arXiv:1803.02596, 2018.
  • [73] Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pages 2493–2502. PMLR, 2015.
  • [74] Huan Xu, Constantine Caramanis, and Shie Mannor. Principal component analysis with contaminated data: The high dimensional case. arXiv preprint arXiv:1002.4658, 2010.

Appendix

Appendix A Related work

Our work builds upon a series of advances in private SGD [44, 8, 7, 29, 53, 71, 38] to advance our understanding of the tradeoff between privacy and sample complexity for PCA. Such tradeoffs have been studied extensively for the canonical statistical estimation problems of mean (and covariance) estimation and linear regression.

Private mean estimation. As one of the most fundamental problems in private data analysis, mean estimation was initially studied under bounded-support assumptions, and the optimal error rate is now well understood. More recently, [6] considered private mean estimation for distributions with bounded k-th moments, where the support of the data is unbounded, and provided minimax error bounds in various settings. [49] studied private mean estimation from Gaussian samples and obtained an optimal error rate. There has been a lot of recent interest in private mean estimation under various assumptions, including joint mean and covariance estimation [43, 9], heavy-tailed mean estimation [47], mean estimation for general distributions [31, 66], distribution-adaptive mean estimation [13], estimation with unbounded distribution parameters [46], mean estimation under pure differential privacy [37], local differential privacy [19, 20, 33, 41, 54], user-level differential privacy [27], Mahalanobis distance [11, 56], and robust and differentially private mean estimation [55, 52, 56].

Private linear regression. The goal of private linear regression is to learn a linear predictor of a response variable y from a set of examples \{x_{i},y_{i}\}_{i=1}^{n} while guaranteeing the privacy of the examples. Again, the work on private linear regression can be divided into two categories. In the deterministic setting, where the data is given without any probabilistic assumptions, significant advances in DP linear regression have been made [68, 50, 60, 17, 8, 73, 32, 59, 72, 63]. In the randomized setting, where each example \{\mathbf{x}_{i},y_{i}\} is drawn i.i.d. from a distribution, [21] proposes an exponential-time algorithm that achieves asymptotic consistency. [14] provides an efficient and minimax optimal algorithm under sub-Gaussian design and nearly identity covariance assumptions. Very recently, [56] gives the first exponential-time algorithm that achieves the minimax risk for general covariance matrices under sub-Gaussian and hypercontractive assumptions.

Private PCA without spectral gap. There is a long line of work on private PCA [35, 36, 34, 10, 15, 48, 24, 5]. We explain the most closely related ones in Section 2.3, providing an interpretation for the case where the covariance matrix has a spectral gap.

When there is no spectral gap, one can still learn a principal component. However, since the principal component is not unique, the error is typically measured by how much of the variance is captured in the estimated direction: 1-\hat{v}^{\top}\Sigma\hat{v}/\|\Sigma\|. [15] introduces an exponential mechanism (from [57]) which samples an estimate from the distribution f_{\widehat{\Sigma}}(\hat{v})=(1/C)\exp\{((\varepsilon n)/c^{2})\hat{v}^{\top}\widehat{\Sigma}\hat{v}\}, where C is a normalization constant ensuring that the density integrates to one. This achieves the stronger pure DP, i.e., (\varepsilon,0)-DP, but is computationally expensive; [15] does not provide a tractable implementation, and [48] provides a polynomial-time implementation with time complexity at least cubic in d. This achieves an error rate of 1-\hat{v}^{\top}\Sigma\hat{v}/\|\Sigma\|=\tilde{O}(d^{2}/(\varepsilon n)) in [15, Theorem 7] when samples are drawn from the Gaussian {\cal N}(0,\Sigma), which, when there is a spectral gap, translates into

sin(v^,v1)2\displaystyle\sin(\hat{v},v_{1})^{2} =\displaystyle= O~(κd2εn),\displaystyle\tilde{O}\Big{(}\frac{\kappa d^{2}}{\varepsilon n}\Big{)}\;, (17)

with high probability. Closest to our setting is the analysis in [56, Corollary 6.5], which proposes an exponential mechanism achieving 1-\hat{v}^{\top}\Sigma\hat{v}/\|\Sigma\|=\tilde{O}(\sqrt{d/n}+(d+\log(1/\delta))/(\varepsilon n)) with high probability under (\varepsilon,\delta)-DP and Gaussian samples, but this algorithm is computationally intractable. This rate is shown to be tight when there is no spectral gap. When there is a spectral gap, it translates into

sin(v^,v1)2\displaystyle\sin(\hat{v},v_{1})^{2} =\displaystyle= O~(κ(dn+d+log(1/δ)εn)).\displaystyle\tilde{O}\Big{(}\kappa\Big{(}\sqrt{\frac{d}{n}}+\frac{d+\log(1/\delta)}{\varepsilon n}\Big{)}\Big{)}\;. (18)

As these algorithms and the corresponding analyses are tailored for gap-free cases, they have better dependence on κ\kappa and worse dependence on d/nd/n and d/εnd/\varepsilon n, compared to the proposed DP-PCA and its error rate in Corollary 5.2.

Appendix B Preliminary on differential privacy

Lemma B.1 (Stability-based histogram [49, Lemma 2.3]).

For every K\in\mathbb{N}\cup\{\infty\}, domain \Omega, collection of disjoint bins B_{1},\ldots,B_{K} defined on \Omega, n\in\mathbb{N}, \varepsilon\geq 0, \delta\in(0,1/n), \beta>0, and \alpha\in(0,1), there exists an (\varepsilon,\delta)-differentially private algorithm M:\Omega^{n}\to\mathbb{R}^{K} such that for any set of data points X_{1},\ldots,X_{n}\in\Omega^{n}, if

  1. \hat{p}_{k}=\frac{1}{n}\sum_{X_{i}\in B_{k}}1,

  2. (\tilde{p}_{1},\ldots,\tilde{p}_{K})\leftarrow M(X_{1},\ldots,X_{n}), and

  3. n\geq\min\left\{\frac{8}{\varepsilon\beta}\log(2K/\alpha),\frac{8}{\varepsilon\beta}\log(4/\alpha\delta)\right\},

then

\mathbb{P}(|\tilde{p}_{k}-\hat{p}_{k}|\leq\beta)\geq 1-\alpha\;.
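For concreteness, the following Python sketch illustrates a stability-based histogram in the spirit of Lemma B.1: Laplace noise is added only to the occupied bins, and noisy counts below a data-independent threshold are suppressed. The noise scale and threshold are illustrative choices, not the exact constants of [49].

```python
import numpy as np

def stable_histogram(samples, bin_of, epsilon, delta, rng=None):
    """Stability-based DP histogram sketch (illustrative constants).

    samples : list or array of data points
    bin_of  : function mapping a data point to a hashable bin label
    Returns a dict mapping bin labels to privatized empirical frequencies.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(samples)
    # Empirical frequencies over the occupied bins only.
    counts = {}
    for x in samples:
        b = bin_of(x)
        counts[b] = counts.get(b, 0) + 1
    # Add Laplace noise to occupied bins and suppress small noisy counts;
    # the threshold makes bins touched by a single point unlikely to appear.
    threshold = 1.0 / n + 2.0 * np.log(2.0 / delta) / (epsilon * n)
    private_hist = {}
    for b, c in counts.items():
        noisy = c / n + rng.laplace(scale=2.0 / (epsilon * n))
        if noisy > threshold:
            private_hist[b] = noisy
    return private_hist
```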

Since we focus on one-pass algorithms where a data point is only accessed once, we need a basic parallel composition of DP.

Lemma B.2 (Parallel composition [58]).

Consider a sequence of interactive queries {qk}k=1K\{q_{k}\}_{k=1}^{K} each operating on a subset SkS_{k} of the database and each satisfying (ε,δ)(\varepsilon,\delta)-DP. If SkS_{k}’s are disjoint then the composition (q1(S1),q2(S2),,qK(SK))(q_{1}(S_{1}),q_{2}(S_{2}),\ldots,q_{K}(S_{K})) is (ε,δ)(\varepsilon,\delta)-DP.

We also utilize the following serial composition theorem.

Lemma B.3 (Serial composition [23]).

If a database is accessed with an (ε1,δ1)(\varepsilon_{1},\delta_{1})-DP mechanism and then with an (ε2,δ2)(\varepsilon_{2},\delta_{2})-DP mechanism, then the end-to-end privacy guarantee is (ε1+ε2,δ1+δ2)(\varepsilon_{1}+\varepsilon_{2},\delta_{1}+\delta_{2})-DP.

When we apply the private histogram learner to each coordinate, we require the advanced composition theorem from [42].

Lemma B.4 (Advanced composition [42]).

For ε0.9\varepsilon\leq 0.9, an end-to-end guarantee of (ε,δ)(\varepsilon,\delta)-differential privacy is satisfied if a database is accessed kk times, each with a (ε/(22klog(2/δ)),δ/(2k))(\varepsilon/(2\sqrt{2k\log(2/\delta)}),\delta/(2k))-differential private mechanism.
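As a concrete illustration of Lemma B.4, the following snippet evaluates the per-access privacy parameters when an overall (\varepsilon,\delta) budget is split across k accesses; it directly implements the formula in the lemma (valid for \varepsilon\leq 0.9).

```python
import math

def per_access_budget(epsilon, delta, k):
    """Per-mechanism (eps_i, delta_i) such that k-fold composition satisfies
    (epsilon, delta)-DP by Lemma B.4; assumes epsilon <= 0.9."""
    eps_i = epsilon / (2.0 * math.sqrt(2.0 * k * math.log(2.0 / delta)))
    delta_i = delta / (2.0 * k)
    return eps_i, delta_i

# Example: Algorithm 5 below calls the histogram learner once per coordinate,
# i.e., k = d accesses.
print(per_access_budget(epsilon=0.5, delta=1e-6, k=100))
```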

Appendix C Lower bounds

When privacy is not required, we know from Theorem 2.2 that under Assumptions A.1–A.3 we can achieve an error rate of \tilde{O}(\kappa\sqrt{V/n}). In the regime of V=O(d) and \kappa=O(1), n=O(d) samples are enough to achieve an arbitrarily small error. The next lower bound shows that we need n=\Omega(d^{2}) samples when (\varepsilon=O(1),0)-DP is required, showing that private PCA is significantly more challenging than non-private PCA when assuming only the support and moment bounds of Assumptions A.1–A.3. We provide a proof in Appendix C.3.

Theorem C.1 (Lower bound without Assumption A.4).

Let ε\mathcal{M}_{\varepsilon} be a class of (ε,0)(\varepsilon,0)-DP estimators that map nn i.i.d. samples to an estimate v^d\hat{v}\in{\mathbb{R}}^{d}. A set of distributions satisfying Assumptions A.1A.3 with M=O(dlogn)M=O(d\log n) and V=O(d)V=O(d) is denoted by 𝒫~(λ1,λ2)\tilde{\mathcal{P}}_{(\lambda_{1},\lambda_{2})}. There exists a universal constant C>0C>0 such that

infv^εsupP𝒫~(λ1,λ2)𝔼SPn[sin(v^(S),v1)]Cmin(κd2εnλ2λ1,λ2λ1).\displaystyle\inf_{\hat{v}\in\mathcal{M}_{\varepsilon}}\sup_{P\in\tilde{\mathcal{P}}_{(\lambda_{1},\lambda_{2})}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1})\right]\;\geq\;C\min\left(\frac{\kappa d^{2}}{\varepsilon n}\sqrt{\frac{\lambda_{2}}{\lambda_{1}}},\sqrt{\frac{\lambda_{2}}{\lambda_{1}}}\right)\;. (19)

We next provide the proofs of all the lower bounds.

C.1 Proof of Theorem 5.3 on the lower bound for Gaussian case

Our proof is based on the following differentially private Fano's method [3, Corollary 4].

Theorem C.2 (DP Fano’s method [3, Corollary 4]).

Let \mathcal{P} denote the family of distributions of interest and \theta:\mathcal{P}\rightarrow\Theta denote the population parameter. Our goal is to estimate \theta from i.i.d. samples x_{1},x_{2},\ldots,x_{n}\sim P\in\mathcal{P}. Let \hat{\theta}_{\varepsilon} be an (\varepsilon,0)-DP estimator. Let \rho:\Theta\times\Theta\rightarrow{\mathbb{R}}^{+} be a pseudo-metric on the parameter space \Theta. Let \mathcal{V} be an index set with finite cardinality. Define \mathcal{P}_{\mathcal{V}}=\{P_{v}:v\in\mathcal{V}\}\subset\mathcal{P} to be an indexed family of probability measures on the measurable space (\mathcal{X},\mathcal{A}). If for any v\neq v^{\prime}\in\mathcal{V},

  1. \rho(\theta(P_{v}),\theta(P_{v^{\prime}}))\geq\tau,

  2. D_{\rm KL}\left(P_{v},P_{v^{\prime}}\right)\leq\beta,

  3. D_{\rm TV}\left(P_{v},P_{v^{\prime}}\right)\leq\phi,

then

infθ^εmaxP𝒫𝔼SPn[ρ(θ^ε(S),θ(P))]max(τ2(1nβ+log(2)log(|𝒱|)),0.4τmin(1,log(|𝒱|)nϕε)).\displaystyle\inf_{\hat{\theta}_{\varepsilon}}\max_{P\in\mathcal{P}}\mathbb{E}_{S\sim P^{n}}\left[\rho(\hat{\theta}_{\varepsilon}(S),\theta(P))\right]\geq\max\left(\frac{\tau}{2}\left(1-\frac{n\beta+\log(2)}{\log(|\mathcal{V}|)}\right),0.4\tau\min\left(1,\frac{\log(|\mathcal{V}|)}{n\phi\varepsilon}\right)\right)\;.

For our problem, we are interested in the Gaussian family \mathcal{P}_{\Sigma} and the metric \rho(u,v)=\sin(u,v). By Theorem C.2, it suffices to construct such an indexed set \mathcal{V} and indexed distribution family \mathcal{P}_{\mathcal{V}}. We use the same construction as in [69, Theorem 2.1], introduced to prove a lower bound for the (non-private) sparse PCA problem. The construction is given by the following lemma.

Lemma C.3 ([69, Lemma 3.1.2]).

Let d>5d>5. For α(0,1]\alpha\in(0,1], there exists 𝒱α𝕊2d1\mathcal{V}_{\alpha}\subset\mathbb{S}_{2}^{d-1} and an absolute constant c1>0c_{1}>0 such that for every vv𝒱αv\neq v^{\prime}\in\mathcal{V}_{\alpha}, α/2vv22α\alpha/\sqrt{2}\leq\|v-v^{\prime}\|_{2}\leq\sqrt{2}\alpha and log(|𝒱α|)c1d\log(|\mathcal{V}_{\alpha}|)\geq c_{1}d.

Fix \alpha\in(0,1]. For each v\in\mathcal{V}_{\alpha}, we define \Sigma_{v}=(\lambda_{1}-\lambda_{2})vv^{\top}+\lambda_{2}\mathbf{I}_{d} and P_{v}={\cal N}(0,\Sigma_{v}). It is easy to see that \Sigma_{v} has eigenvalues \lambda_{1}>\lambda_{2}=\cdots=\lambda_{d} and that the top eigenvector of \Sigma_{v} is v. Using Lemma F.4, we know for any v\neq v^{\prime}\in\mathcal{V}_{\alpha},

α2ρ(v,v)=1v,v2α.\displaystyle\frac{\alpha}{\sqrt{2}}\leq\rho(v,v^{\prime})=\sqrt{1-\left\langle v,v^{\prime}\right\rangle^{2}}\leq\alpha\;. (20)

Using [69, Lemma 3.1.3], we know

DKL(Pv,Pv)=(λ1λ2)2λ1λ2(1v,v2)(λ1λ2)2α2λ1λ2.\displaystyle D_{\rm KL}\left(P_{v},P_{v^{\prime}}\right)=\frac{(\lambda_{1}-\lambda_{2})^{2}}{\lambda_{1}\lambda_{2}}(1-\left\langle v,v^{\prime}\right\rangle^{2})\leq\frac{(\lambda_{1}-\lambda_{2})^{2}\alpha^{2}}{\lambda_{1}\lambda_{2}}\;. (21)

Using Pinsker’s inequality, we have

DTV(Pv,Pv)DKL(Pv,Pv)2α(λ1λ2)22λ1λ2.\displaystyle D_{\rm TV}\left(P_{v},P_{v^{\prime}}\right)\leq\sqrt{\frac{D_{\rm KL}\left(P_{v},P_{v^{\prime}}\right)}{2}}\leq\alpha\sqrt{\frac{(\lambda_{1}-\lambda_{2})^{2}}{2\lambda_{1}\lambda_{2}}}\;. (22)

Now we set

α:=min(1,dc1λ1λ22n(λ1λ2)2,c1dnε2λ1λ2(λ1λ2)2)\displaystyle\alpha:=\min\left(1,\sqrt{\frac{dc_{1}\lambda_{1}\lambda_{2}}{2n(\lambda_{1}-\lambda_{2})^{2}}},\frac{c_{1}d}{n\varepsilon}\sqrt{\frac{2\lambda_{1}\lambda_{2}}{(\lambda_{1}-\lambda_{2})^{2}}}\right) (23)

It follows from Theorem C.2 and d>8d>8 that there exists a constant CC such that

infv^supP𝒫Σ𝔼SPn[sin(v^(S),v1(Σ))]Cmin((dn+dεn)λ1λ2(λ1λ2)2,1).\displaystyle\inf_{\hat{v}}\sup_{P\in\mathcal{P}_{\Sigma}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1}(\Sigma))\right]\geq C\min\left(\left(\sqrt{\frac{d}{n}}+\frac{d}{\varepsilon n}\right)\sqrt{\frac{\lambda_{1}\lambda_{2}}{(\lambda_{1}-\lambda_{2})^{2}}},1\right)\;. (24)

C.2 Proof of Theorem 5.4

We first construct an indexed set 𝒱\mathcal{V} and indexed distribution family 𝒫𝒱\mathcal{P}_{\mathcal{V}} such that xixix_{i}x_{i}^{\top} satisfies A.1, A.2 and A.3 in Assumption 1. Our construction is defined as follows.

By [3, Lemma 6], there exists a finite set \mathcal{V}\subset\mathbb{S}_{2}^{d-1}, with cardinality |\mathcal{V}|\geq 2^{d}, such that for any v\neq v^{\prime}\in\mathcal{V}, \|v-v^{\prime}\|\geq 1/2.

Let f_{(0,\mathbf{I}_{d})} denote the density function of {\cal N}(0,\mathbf{I}_{d}). Let Q_{v} be the uniform distribution on the two point masses \{\pm\alpha^{-\frac{1}{4}}v\}. Let Q_{0} be the Gaussian distribution {\cal N}(0,\mathbf{I}_{d}). For \alpha\in(0,1], we construct P_{v}:=(1-\alpha)Q_{0}+\alpha Q_{v}. It is easy to see that P_{v} is a distribution over {\mathbb{R}}^{d} with the following density function.

Pv(x)={α2, if x=α14v,α2, if x=α14v,(1α)f(0,𝐈d)(x) otherwise .\displaystyle P_{v}(x)=\begin{cases}\frac{\alpha}{2},&\text{ if }x=-\alpha^{-\frac{1}{4}}v\;,\\ \frac{\alpha}{2},&\text{ if }x=\alpha^{-\frac{1}{4}}v\;,\\ (1-\alpha)f_{(0,\mathbf{I}_{d})}(x)&\text{ otherwise }\end{cases}\;. (25)

The mean of PvP_{v} is 0. The covariance of PvP_{v} is Σv=(1α)𝐈d+αvv\Sigma_{v}=(1-\alpha)\mathbf{I}_{d}+\sqrt{\alpha}vv^{\top}. The top eigenvalue is λ1=1α+α\lambda_{1}=1-\alpha+\sqrt{\alpha}, the top eigenvector is vv, and the second eigenvalue is λ2=1α\lambda_{2}=1-\alpha. And κ=O(α1/2)\kappa=O(\alpha^{-1/2}).

If x=α1/4vx=\alpha^{-1/4}v, then xxΣv2=O(α1/2)\|xx^{\top}-\Sigma_{v}\|_{2}=O(\alpha^{-1/2}). If x𝒩(0,𝐈d)x\sim{\cal N}(0,\mathbf{I}_{d}), we know xxΣv2=O(d)\|xx^{\top}-\Sigma_{v}\|_{2}=O(d). This implies PvP_{v} satisfies A.2 in Assumption 1 with M=O((d+α1/2)log(n))M=O((d+\alpha^{-1/2})\log(n)) for nn i.i.d. samples.

It is easy to see that 𝔼[(xxΣv)(xxΣv)]2=O(d)\|\mathbb{E}[(xx^{\top}-\Sigma_{v})(xx^{\top}-\Sigma_{v})^{\top}]\|_{2}=O(d). This means PvP_{v} satisfies A.3 in Assumption 1 with V=O(d)V=O(d).

By the fact that 𝔼[x,u2]=O(1)\mathbb{E}[\left\langle x,u\right\rangle^{2}]=O(1) and 𝔼[x,u4]=O(1)\mathbb{E}[\left\langle x,u\right\rangle^{4}]=O(1) for any unit vector uu, we have γ2=𝔼[(xxΣv)uu(xxΣv)]2=O(1)\gamma^{2}=\|\mathbb{E}[(xx^{\top}-\Sigma_{v})uu^{\top}(xx^{\top}-\Sigma_{v})^{\top}]\|_{2}=O(1) for any unit vector uu.

Our proof technique is based on the following lemma.

Lemma C.4 ([6, Theorem 3]).

Fix \alpha\in(0,1]. Define P_{v}=(1-\alpha)Q_{0}+\alpha Q_{v} for v\in\mathcal{V} such that \rho(\theta(P_{v}),\theta(P_{v^{\prime}}))\geq 2t. Let \hat{\theta} be an (\varepsilon,\delta)-differentially private estimator. Then,

\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}P_{v}\left(\rho\left(\hat{\theta},\theta(P_{v})\right)\geq t\right)\geq\frac{(|\mathcal{V}|-1)\cdot\left(\frac{1}{2}e^{-\varepsilon\lceil n\alpha\rceil}-\delta\,\frac{1-e^{-\varepsilon\lceil n\alpha\rceil}}{1-e^{-\varepsilon}}\right)}{1+(|\mathcal{V}|-1)\cdot e^{-\varepsilon\lceil n\alpha\rceil}}\;. (26)

Set ρ(θ(Pv),θ(Pv))=sin(v,v)/κ\rho(\theta(P_{v}),\theta(P_{v^{\prime}}))=\sin(v,v^{\prime})/\kappa. By Lemma F.4, ρ(θ(Pv),θ(Pv))vv/κ=Ω(α)\rho(\theta(P_{v}),\theta(P_{v^{\prime}}))\geq\|v-v^{\prime}\|/\kappa=\Omega(\sqrt{\alpha}).

Lemma C.4 implies

supP𝒫~𝔼SPn[sin(v^(S),v1(Σ))]\displaystyle\sup_{P\in\tilde{\mathcal{P}}}\mathbb{E}_{S\sim P^{n}}[\sin(\hat{v}(S),v_{1}(\Sigma))] 1|𝒱|v𝒱𝔼SPvn[sin(v^(S),v1(Σv))]\displaystyle\geq\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\mathbb{E}_{S\sim P_{v}^{n}}[\sin(\hat{v}(S),v_{1}(\Sigma_{v}))] (27)
\geq\kappa t\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}P_{v}\left(\frac{\sin(\hat{v}(S),v_{1}(\Sigma_{v}))}{\kappa}\geq t\right) (28)
κt(2d1)(12eεnαδ1eε)1+(2d1)eεnα,\displaystyle\gtrsim\kappa t\frac{(2^{d}-1)\cdot\left(\frac{1}{2}e^{-\varepsilon\lceil n\alpha\rceil}-\frac{\delta}{1-e^{-\varepsilon}}\right)}{1+(2^{d}-1)e^{-\varepsilon\lceil n\alpha\rceil}}\;, (29)

For d2d\geq 2, we know 2d1ed/22^{d}-1\geq e^{d/2}. We choose

α=min{1nε(d2ε),1nεlog(1eε4δeε),1}.\displaystyle\alpha=\min\left\{\frac{1}{n\varepsilon}\left(\frac{d}{2}-\varepsilon\right),\frac{1}{n\varepsilon}\log\left(\frac{1-e^{-\varepsilon}}{4\delta e^{\varepsilon}}\right),1\right\}\;. (30)

This implies

12eεnαδ1eε14eε(nα+1)>0.\displaystyle\frac{1}{2}e^{-\varepsilon\lceil n\alpha\rceil}-\frac{\delta}{1-e^{-\varepsilon}}\geq\frac{1}{4}e^{-\varepsilon(n\alpha+1)}>0\;. (31)

So there exists a constant C such that

infv^supP𝒫~𝔼SPn[sin(v^(S),v1(Σ))]\displaystyle\inf_{\hat{v}}\sup_{P\in\tilde{\mathcal{P}}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1}(\Sigma))\right] Cκα14ed/2eε(nα+1)1+ed/2eε(nα+1)\displaystyle\geq C\kappa\sqrt{\alpha}\frac{\frac{1}{4}e^{d/2}e^{-\varepsilon(n\alpha+1)}}{1+e^{d/2}e^{-\varepsilon(n\alpha+1)}} (32)
κmin(1,dlog((1eε)/δ)nε).\displaystyle\gtrsim\kappa\min\left(1,\sqrt{\frac{d\wedge\log\left(\left(1-e^{-\varepsilon}\right)/\delta\right)}{n\varepsilon}}\right)\;. (33)

C.3 Proof of Theorem C.1

Similar to the proof of Theorem 5.3, we use DP Fano’s method in Theorem C.2. It suffices to construct an indexed set 𝒱\mathcal{V} and indexed distribution family 𝒫𝒱\mathcal{P}_{\mathcal{V}} such that xixix_{i}x_{i}^{\top} satisfies A.1, A.2 and A.3 in Assumption 1. Our construction is defined as follows.

Let λ1>λ2>0\lambda_{1}>\lambda_{2}>0. By Lemma C.3, there exists a finite set 𝒱α𝕊2d1\mathcal{V}_{\alpha}\subset\mathbb{S}_{2}^{d-1}, with cardinality |𝒱α|=2Ω(d)|\mathcal{V}_{\alpha}|=2^{\Omega(d)}, such that for any vv𝒱αv\neq v^{\prime}\in\mathcal{V}_{\alpha}, α/2vv2\alpha/\sqrt{2}\leq\|v-v^{\prime}\|\leq\sqrt{2}, where α:=λ2/λ1\alpha:=\sqrt{\lambda_{2}/\lambda_{1}}.

Let f_{(0,S)} denote the density function of {\cal N}(0,S). We construct P_{v} over {\mathbb{R}}^{d} for v\in\mathcal{V}_{\alpha} with the following density function.

P_{v}(x)=\begin{cases}\frac{1-\lambda_{2}/\lambda_{1}}{2d},&\text{ if }x=-\sqrt{d\lambda_{1}}v\;,\\ \frac{1-\lambda_{2}/\lambda_{1}}{2d},&\text{ if }x=\sqrt{d\lambda_{1}}v\;,\\ \left(1-\frac{1-\lambda_{2}/\lambda_{1}}{d}\right)f_{(0,\frac{\lambda_{2}}{1-\frac{1-\lambda_{2}/\lambda_{1}}{d}}\mathbf{I}_{d})}(x)&\text{ otherwise }\end{cases}\;. (34)

The mean of PvP_{v} is 0. The covariance of PvP_{v} is Σv:=(λ1λ2)vv+λ2𝐈d\Sigma_{v}:=(\lambda_{1}-\lambda_{2})vv^{\top}+\lambda_{2}\mathbf{I}_{d}. It is easy to see that the top eigenvalue is λ1\lambda_{1}, the top eigenvector is vv, and the second eigenvalue is λ2\lambda_{2}.

If x=\sqrt{d\lambda_{1}}v, then \|xx^{\top}-\Sigma_{v}\|_{2}=\|(d\lambda_{1}-\lambda_{1}+\lambda_{2})vv^{\top}-\lambda_{2}\mathbf{I}_{d}\|_{2}=O(d\lambda_{1}). If x\sim{\cal N}(0,\frac{\lambda_{2}}{1-\frac{1-\lambda_{2}/\lambda_{1}}{d}}\mathbf{I}_{d}), by the fact that \frac{\lambda_{2}}{1-\frac{1-\lambda_{2}/\lambda_{1}}{d}}\leq\lambda_{1}, we know \|xx^{\top}-\Sigma_{v}\|_{2}\leq O(d\lambda_{1}). This implies P_{v} satisfies A.2 in Assumption 1 with M=O(d\log(n)) for n i.i.d. samples.

Similarly, \|\mathbb{E}[(xx^{\top}-\Sigma_{v})(xx^{\top}-\Sigma_{v})^{\top}]\|_{2}\leq\|d(\lambda_{1}^{2}-\lambda_{1}\lambda_{2})vv^{\top}+d\lambda_{2}\lambda_{1}\mathbf{I}_{d}+3\Sigma_{v}\Sigma_{v}^{\top}\|_{2}=O(d\lambda_{1}^{2}). This means P_{v} satisfies A.3 in Assumption 1 with V=O(d).

For vv𝒱αv\neq v^{\prime}\in\mathcal{V}_{\alpha}, we have DTV(Pv,Pv)=(1λ2/λ1)/dD_{\rm TV}(P_{v},P_{v^{\prime}})=(1-\lambda_{2}/\lambda_{1})/d. By Lemma F.4, sin(v,v)vv/2(λ2/λ1)/2\sin(v,v^{\prime})\geq\|v-v^{\prime}\|/\sqrt{2}\geq(\sqrt{\lambda_{2}/\lambda_{1}})/2.

By Theorem C.2, there exists a constant CC such that

infv^supP𝒫Σ𝔼SPn[sin(v^(S),v1(Σ))]Cmin(λ2λ1,d2nελ1λ2(λ1λ2)2).\displaystyle\inf_{\hat{v}}\sup_{P\in\mathcal{P}_{\Sigma}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1}(\Sigma))\right]\geq C\min\left(\sqrt{\frac{\lambda_{2}}{\lambda_{1}}},\frac{d^{2}}{n\varepsilon}\sqrt{\frac{\lambda_{1}\lambda_{2}}{(\lambda_{1}-\lambda_{2})^{2}}}\right)\;. (35)

Appendix D The analysis of Private Oja’s Algorithm

We analyze Private Oja’s Algorithm in Algorithm 2.

D.1 Proof of privacy in Lemma 3.1

We use the following Theorem D.1 to prove our privacy guarantees.

Theorem D.1 (Privacy amplification by shuffling [30, Theorem 3.8]).

For any domain 𝒟\mathcal{D}, let (i):𝒮(1)××𝒮(i1)×𝒟𝒮(i)\mathcal{R}^{(i)}:\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(i-1)}\times\mathcal{D}\rightarrow\mathcal{S}^{(i)} for i[n]i\in[n] (where 𝒮(i)\mathcal{S}^{(i)} is the range space of (i)\mathcal{R}^{(i)}) be a sequence of algorithms such that (i)(z1:i1,)\mathcal{R}^{(i)}(z_{1:i-1},\cdot) is an (ε0,δ0)(\varepsilon_{0},\delta_{0})-DP local randomizer for all values of auxiliary inputs z1:i1𝒮(1)××𝒮(i1)z_{1:i-1}\in\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(i-1)}. Let 𝒜S:𝒟n𝒮(1)××𝒮(n)\mathcal{A}_{S}:\mathcal{D}^{n}\rightarrow\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(n)} be the algorithm that given a dataset x1:n𝒟nx_{1:n}\in\mathcal{D}^{n}, samples a uniform random permutation π\pi over [n][n], then sequentially computes zi=(i)(z1:i1,xπ(i))z_{i}=\mathcal{R}^{(i)}(z_{1:i-1},x_{\pi(i)}) for i[n]i\in[n] and outputs z1:nz_{1:n}. Then for any δ[0,1]\delta\in[0,1] such that ε0log(n16log(2/δ))\varepsilon_{0}\leq\log\left(\frac{n}{16\log(2/\delta)}\right), 𝒜s\mathcal{A}_{s} is (ε,δ+O(eεδ0n))(\varepsilon,\delta+O(e^{\varepsilon}\delta_{0}n))-DP, where

ε=O((1eε0)(eε0log(1/δ)n+eε0n)).\displaystyle\varepsilon=O\left((1-e^{-\varepsilon_{0}})\left(\frac{\sqrt{e^{\varepsilon_{0}}\log(1/\delta)}}{\sqrt{n}}+\frac{e^{\varepsilon_{0}}}{n}\right)\right)\;. (36)
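To make the dependence on \varepsilon_{0} and n in Eq. (36) concrete, the following sketch evaluates its right-hand side; the absolute constant hidden in the O(\cdot) is not specified by Theorem D.1, so it is exposed as an argument here.

```python
import math

def shuffled_epsilon(eps0, delta, n, c=1.0):
    """Central epsilon after shuffling n (eps0, delta0)-DP local randomizers,
    following Eq. (36) up to the unspecified constant c."""
    return c * (1.0 - math.exp(-eps0)) * (
        math.sqrt(math.exp(eps0) * math.log(1.0 / delta)) / math.sqrt(n)
        + math.exp(eps0) / n
    )

# With eps0 of order one, the central epsilon decays as 1/sqrt(n).
print(shuffled_epsilon(eps0=0.5, delta=1e-6, n=10**5))
```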

Let (t)(wt1,Aπ(t)):=wt\mathcal{R}^{(t)}(w_{t-1},A_{\pi(t)}):=w_{t}. Let ε0=2log(1.25/δ0)α\varepsilon_{0}=\frac{\sqrt{2\log(1.25/\delta_{0})}}{\alpha}. We show (t)(wt1,)\mathcal{R}^{(t)}(w_{t-1},\cdot) is an (ε0,δ0)(\varepsilon_{0},\delta_{0})-DP local randomizer.

If there is no noise in each update step, the update rule is

wt\displaystyle w_{t}^{\prime} wt1+ηtclipβ(Atwt1),\displaystyle\leftarrow w_{t-1}+\eta_{t}{\rm clip}_{\beta}\left(A_{t}w_{t-1}\right)\;, (37)
w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\| (38)

The sensitivity of w_{t}^{\prime} with respect to replacing A_{t} is 2\beta\eta_{t}. By the Gaussian mechanism of Lemma 2.4 and the post-processing property of differential privacy, we know that w_{t} is an (\varepsilon_{0},\delta_{0})-DP local randomizer.
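The following Python sketch spells out this locally-randomized step of Algorithm 2: the Oja direction A_{t}w_{t-1} is clipped to norm \beta, Gaussian noise calibrated to the sensitivity 2\beta\eta_{t} at level (\varepsilon_{0},\delta_{0}) is added, and the iterate is re-normalized. The function name and interface are ours, for illustration only.

```python
import numpy as np

def private_oja_step(w_prev, A_t, eta_t, beta, eps0, delta0, rng):
    """One (eps0, delta0)-DP local-randomizer step of private Oja's algorithm.

    w_prev : current unit-norm iterate, shape (d,)
    A_t    : the sample matrix x_t x_t^T, shape (d, d)
    beta   : clipping threshold on ||A_t w_prev||
    """
    g = A_t @ w_prev
    norm = np.linalg.norm(g)
    if norm > beta:                       # clip_beta(A_t w_{t-1})
        g = g * (beta / norm)
    # Gaussian mechanism: replacing A_t changes w' by at most 2 * beta * eta_t
    # in l2 norm; this is the standard noise calibration for (eps0, delta0)-DP.
    sigma = 2.0 * beta * eta_t * np.sqrt(2.0 * np.log(1.25 / delta0)) / eps0
    w = w_prev + eta_t * g + rng.normal(scale=sigma, size=w_prev.shape)
    return w / np.linalg.norm(w)
```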

Assume that ε0=2log(1.25/δ0)α12\varepsilon_{0}=\frac{\sqrt{2\log(1.25/\delta_{0})}}{\alpha}\leq\frac{1}{2}. By Theorem D.1, for δ^[0,1]\hat{\delta}\in[0,1] such that ε0log(n16log(2/δ^))\varepsilon_{0}\leq\log\left(\frac{n}{16\log(2/\hat{\delta})}\right), Algorithm 2 is (ε^,δ^+O(eε^δ0n))(\hat{\varepsilon},\hat{\delta}+O(e^{\hat{\varepsilon}}\delta_{0}n))-DP and for some constant c1>0c_{1}>0,

ε^\displaystyle\hat{\varepsilon} c1((1eε0)(eε0log(1/δ^)n+eε0n))\displaystyle\leq c_{1}\left((1-e^{-\varepsilon_{0}})\left(\frac{\sqrt{e^{\varepsilon_{0}}\log(1/\hat{\delta})}}{\sqrt{n}}+\frac{e^{\varepsilon_{0}}}{n}\right)\right) (39)
\leq c_{1}\left((e^{0.5\varepsilon_{0}}-e^{-0.5\varepsilon_{0}})\frac{\sqrt{\log(1/\hat{\delta})}}{\sqrt{n}}+\frac{e^{\varepsilon_{0}}-1}{n}\right) (40)
\leq c_{1}\left(((1+\varepsilon_{0})-(1-\varepsilon_{0}/2))\frac{\sqrt{\log(1/\hat{\delta})}}{\sqrt{n}}+\frac{1+2\varepsilon_{0}-1}{n}\right) (41)
=c_{1}\varepsilon_{0}\left(\frac{3}{2}\sqrt{\frac{\log(1/\hat{\delta})}{n}}+\frac{2}{n}\right) (42)
c2log(1/δ0)αlog(1/δ^)n,\displaystyle\leq c_{2}\frac{\sqrt{\log(1/\delta_{0})}}{\alpha}\sqrt{\frac{\log(1/\hat{\delta})}{n}}\;, (43)

for some absolute constant c2>0c_{2}>0.

Set δ^=δ/2\hat{\delta}=\delta/2, δ0=c3δ/(eε^n)\delta_{0}=c_{3}\delta/(e^{\hat{\varepsilon}}n) for some c3>0c_{3}>0 and α=Clog(n/δ)/(εn)\alpha=C^{\prime}\log(n/\delta)/(\varepsilon\sqrt{n}). We have

ε^\displaystyle\hat{\varepsilon} c2log(eε^n/(c3δ))αlog(2/δ)n\displaystyle\leq c_{2}\frac{\sqrt{\log(e^{\hat{\varepsilon}}n/(c_{3}\delta))}}{\alpha}\sqrt{\frac{\log(2/\delta)}{n}} (44)
=log(eε^n/(c3δ))log(2/δ)Clog(n/δ)ε.\displaystyle=\frac{\sqrt{\log(e^{\hat{\varepsilon}}n/(c_{3}\delta))\log(2/\delta)}}{C^{\prime}\log(n/\delta)}\cdot\varepsilon. (45)

For any ε1\varepsilon\leq 1, by Eq. (45), there exists some sufficiently large C>0C^{\prime}>0 such that ε^ε\hat{\varepsilon}\leq\varepsilon.

Recall that we assume \varepsilon_{0}=\frac{\sqrt{2\log(1.25/\delta_{0})}}{\alpha}\leq\frac{1}{2}. This means \varepsilon=O\left(\sqrt{\frac{\log(n/\delta)}{n}}\right).

D.2 Proof of clipping in Lemma 3.2

Let zt=Atwt1z_{t}=A_{t}w_{t-1}. Let μt:=𝔼[zt]=Σwt1\mu_{t}:=\mathbb{E}[z_{t}]=\Sigma w_{t-1}. By Lemma 2.1, we know for any v=1\|v\|=1, with probability 1ζ1-\zeta,

|v(ztμt)|Kγλ1loga(1/ζ).\displaystyle|v^{\top}(z_{t}-\mu_{t})|\leq K\gamma\lambda_{1}\log^{a}(1/\zeta)\;. (46)

Applying union bound over all basis vectors v{e1,,ed}v\in\{e_{1},\ldots,e_{d}\} and all samples, we know with probability 1ζ1-\zeta, for all j[d]j\in[d] and t[n]t\in[n]

|zt,j|Kγλ1loga(nd/ζ)+λ1.\displaystyle|z_{t,j}|\leq K\gamma\lambda_{1}\log^{a}(nd/\zeta)+\lambda_{1}\;. (47)

This implies that with probability 1ζ1-\zeta, for all t[n]t\in[n], we have

zt(Kγloga(nd/ζ)+1)λ1d.\displaystyle\|z_{t}\|\leq(K\gamma\log^{a}(nd/\zeta)+1)\lambda_{1}\sqrt{d}\;. (48)

D.3 Proof of utility in Theorem 3.3

Lemma 3.2 implies that with probability 1O(ζ)1-O(\zeta), Algorithm 2 does not have any clipping. Under this event, the update rule becomes

wt\displaystyle w_{t}^{\prime} wt1+ηt(At+2αβGt)wt1\displaystyle\leftarrow w_{t-1}+\eta_{t}\left(A_{t}+2\alpha\beta G_{t}\right)w_{t-1} (49)
wt\displaystyle w_{t} wt/wt,\displaystyle\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|\;, (50)

where \beta=(K\gamma\log^{a}(nd/\zeta)+1)\lambda_{1}\sqrt{d} and each entry of G_{t}\in{\mathbb{R}}^{d\times d} is i.i.d. sampled from the standard Gaussian {\cal N}(0,1). This follows from the fact that \|w_{t-1}\|=1 and G_{t}w_{t-1}\sim{\cal N}(0,{\bf I}_{d}).

Let B_{t}=A_{t}+2\alpha\beta G_{t}. We show that B_{t} satisfies the three conditions in Theorem 2.2 ([39, Theorem 4.12]). It is easy to see that \mathbb{E}[B_{t}]=\Sigma from Assumption A.1. Next, we upper bound \max\left\{\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2},\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\right\}. We have

𝔼[(BtΣ)(BtΣ)]2\displaystyle\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2}
=\displaystyle= 𝔼[(At+2αβGtΣ)(At+2αβGtΣ)]2\displaystyle\;\left\|\mathbb{E}[(A_{t}+2\alpha\beta G_{t}-\Sigma)(A_{t}+2\alpha\beta G_{t}-\Sigma)^{\top}]\right\|_{2}
\displaystyle\leq 𝔼[(AtΣ)(AtΣ)]2+4α2β2𝔼[GtGt]2\displaystyle\;\left\|\mathbb{E}[(A_{t}-\Sigma)(A_{t}-\Sigma)^{\top}]\right\|_{2}+4\alpha^{2}\beta^{2}\|\mathbb{E}[G_{t}G_{t}^{\top}]\|_{2}
\displaystyle\leq Vλ12+4α2β2C2d,\displaystyle\;V\lambda_{1}^{2}+4\alpha^{2}\beta^{2}C_{2}d\;, (51)

where the last inequality follows from Lemma F.3 and C2>0C_{2}>0 is an absolute constant. Let V~:=Vλ12+4α2β2C2d\widetilde{V}:=V\lambda_{1}^{2}+4\alpha^{2}\beta^{2}C_{2}d. Similarly, we can show that 𝔼[(BtΣ)(BtΣ)]2V~\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\leq\widetilde{V}.

By Lemma F.2, we know with probability 1ζ1-\zeta, for all t[T]t\in[T],

BtΣ2\displaystyle\left\|B_{t}-\Sigma\right\|_{2}
=\displaystyle= At+2αβGtΣ2\displaystyle\left\|A_{t}+2\alpha\beta G_{t}-\Sigma\right\|_{2}
\displaystyle\leq AtΣ2+2αβGt2\displaystyle\left\|A_{t}-\Sigma\right\|_{2}+2\alpha\beta\|G_{t}\|_{2}
\displaystyle\leq Mλ1+2C3αβ(d+log(n/ζ)).\displaystyle M\lambda_{1}+2C_{3}\alpha\beta\left(\sqrt{d}+\sqrt{\log(n/\zeta)}\right)\;.

Let M~:=Mλ1+2C3αβ(d+log(n/ζ))\widetilde{M}:=M\lambda_{1}+2C_{3}\alpha\beta\left(\sqrt{d}+\sqrt{\log(n/\zeta)}\right).

Under the event that BtΣ2M~\left\|B_{t}-\Sigma\right\|_{2}\leq\widetilde{M} for all t[n]t\in[n], we apply Theorem 2.2 with a learning rate ηt=h(λ1λ2)(ξ+t)\eta_{t}=\frac{h}{(\lambda_{1}-\lambda_{2})(\xi+t)} where

ξ=20max(M~h(λ1λ2),(V~+λ12)h2(λ1λ2)2log(1+ζ100)).\displaystyle\xi=20\max\left(\frac{\widetilde{M}h}{(\lambda_{1}-\lambda_{2})},\frac{\left(\widetilde{V}+\lambda_{1}^{2}\right)h^{2}}{(\lambda_{1}-\lambda_{2})^{2}\log(1+\frac{\zeta}{100})}\right)\;. (52)

Then Theorem 2.2 implies that with probability 1ζ1-\zeta,

sin2(wn,v1)Clog(1/ζ)ζ2(d(ξn)2h+h2V~(2h1)(λ1λ2)2n),\displaystyle\sin^{2}\left(w_{n},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(d\left(\frac{\xi}{n}\right)^{2h}+\frac{h^{2}\widetilde{V}}{(2h-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}n}\right)\;, (53)

for some positive constant CC.

Set α=Clog(n/δ)εn\alpha=\frac{C^{\prime}\log(n/\delta)}{\varepsilon\sqrt{n}}, the above bound implies

sin2(wn,v1)Clog(1/ζ)ζ2(h2Vλ12(2h1)(λ1λ2)2n+(Kγloga(nd/ζ)+1)2λ12log2(n/δ)d2h2(2h1)(λ1λ2)2ε2n2+d(ξ~)h),\displaystyle\sin^{2}\left(w_{n},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(\frac{h^{2}V\lambda_{1}^{2}}{(2h-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}n}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{2}\lambda_{1}^{2}\log^{2}(n/\delta)d^{2}h^{2}}{(2h-1)(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{2}}+d\left(\tilde{\xi}\right)^{h}\right)\;, (54)

where ξ~=(ξ/n)2\tilde{\xi}=(\xi/n)^{2}, and

ξ~:=max\displaystyle\tilde{\xi}:=\max (M2λ12h2(λ1λ2)2n2+(Kγloga(nd/ζ)+1)2λ12log3(n/δ)h2d2(λ1λ2)2ε2n3,\displaystyle\left(\frac{M^{2}\lambda_{1}^{2}h^{2}}{(\lambda_{1}-\lambda_{2})^{2}n^{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{2}\lambda_{1}^{2}\log^{3}(n/\delta)h^{2}d^{2}}{(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{3}},\right.
V2λ14h4(λ1λ2)4log2(1+ζ100)n2+(Kγloga(nd/ζ)+1)4λ14log4(n/δ)h4d4(λ1λ2)4log2(1+ζ100)ε4n4\displaystyle\left.\frac{V^{2}\lambda_{1}^{4}h^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})n^{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{4}\lambda_{1}^{4}\log^{4}(n/\delta)h^{4}d^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})\varepsilon^{4}n^{4}}\right.
+λ14h4(λ1λ2)4log2(1+ζ100)n2).\displaystyle\left.+\frac{\lambda_{1}^{4}h^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})n^{2}}\right)\;. (55)

For ζ=O(1)\zeta=O(1) and K=O(1)K=O(1), selecting h=clognh=c\log n, and assuming

n\displaystyle n\geq C(Mλ1log(n)λ1λ2+(Kγloga(nd/ζ)+1)2/3λ12/3log(n/δ)log2/3(n)d2/3(λ1λ2)2/3ε2/3\displaystyle C\left(\frac{M\lambda_{1}\log(n)}{\lambda_{1}-\lambda_{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{2/3}\lambda_{1}^{2/3}\log(n/\delta)\log^{2/3}(n)d^{2/3}}{(\lambda_{1}-\lambda_{2})^{2/3}\varepsilon^{2/3}}\right.
+Vλ12(log(n))2(λ1λ2)2+(Kγloga(nd/ζ)+1)λ1log(n/δ)log(n)d(λ1λ2)ε+λ12log2(n)(λ1λ2)2),\displaystyle\left.+\frac{V\lambda_{1}^{2}(\log(n))^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)\lambda_{1}\log(n/\delta)\log(n)d}{(\lambda_{1}-\lambda_{2})\varepsilon}+\frac{\lambda_{1}^{2}\log^{2}(n)}{(\lambda_{1}-\lambda_{2})^{2}}\right)\;, (56)

with large enough positive constants c and C, we have \tilde{\xi}\leq 1 and d\tilde{\xi}^{h}\leq 1/n^{2}. Hence it is sufficient to have

n=O~(λ12(λ1λ2)2+Mλ1λ1λ2+Vλ12(λ1λ2)2+d(γ+1)λ1log(1/δ)(λ1λ2)ε),n=\tilde{O}\Big{(}\,\frac{\lambda_{1}^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{M\lambda_{1}}{\lambda_{1}-\lambda_{2}}+\frac{V\lambda_{1}^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{d\,(\gamma+1)\lambda_{1}\,\log(1/\delta)}{(\lambda_{1}-\lambda_{2})\varepsilon}\,\Big{)}\;,

with a large enough constant.

Appendix E The analysis of DP-PCA

We provide the proofs for Theorem 5.1, Theorem 6.1, and Lemma 6.2, which guarantee the privacy and utility of DP-PCA.

E.1 Proof of Theorem 5.1 on the privacy and utility of DP-PCA

From Theorem 6.1, we know that Alg. 4 returns \hat{\Lambda} satisfying 2\hat{\Lambda}\geq\lambda_{1}^{2}\|H_{u}\|_{2} with high probability. Then, from Lemma 6.2, we know that with high probability Alg. 5 returns an unbiased estimate of the gradient mean with added Gaussian noise. In this case, the update rule becomes

wt\displaystyle w_{t}^{\prime} wt1+ηt(1Bi=1BAB(t1)+i+βtGt)wt1\displaystyle\leftarrow w_{t-1}+\eta_{t}\left(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}\right)w_{t-1} (57)
wt\displaystyle w_{t} wt/wt,\displaystyle\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|\;, (58)

where \beta_{t}=\frac{8K\sqrt{2\hat{\Lambda}_{t}}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}, \hat{\Lambda}_{t} denotes the estimated top eigenvalue of the covariance of the gradients at the t-th iteration, and each entry of G_{t}\in{\mathbb{R}}^{d\times d} is i.i.d. sampled from the standard Gaussian {\cal N}(0,1). This follows from the fact that \|w_{t-1}\|=1 and G_{t}w_{t-1}\sim{\cal N}(0,{\bf I}_{d}).
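Read as pseudocode, one iteration of DP-PCA takes the form of the following Python sketch, where private_top_eigenvalue and private_mean_estimation stand in for Algorithms 4 and 5 (sketched in Appendices E.2 and E.3 below); the function names and the even privacy-budget split between the two subroutines are our illustrative choices, not part of the algorithm's specification.

```python
import numpy as np

def dp_pca_step(w_prev, batch_gradients, eps, delta, eta_t, zeta):
    """One DP-PCA iteration (illustrative sketch).

    batch_gradients : array of shape (B, d) whose rows are g_i = A_i w_prev.
    """
    # Algorithm 4: private estimate of the scale of the gradient fluctuations,
    # used to calibrate clipping and noise inside Algorithm 5.
    lam_hat = private_top_eigenvalue(batch_gradients, eps / 2, delta / 2, zeta)
    # Algorithm 5: private mean of the minibatch gradients; with high
    # probability this equals the empirical mean plus beta_t * N(0, I_d).
    mu_hat = private_mean_estimation(batch_gradients, eps / 2, delta / 2,
                                     zeta, lam_hat)
    w = w_prev + eta_t * mu_hat
    return w / np.linalg.norm(w)
```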

Let \beta:=\frac{16K\gamma\lambda_{1}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B} so that \beta\geq\beta_{t}, which follows from the fact that \hat{\Lambda}\leq\sqrt{2}\lambda_{1}^{2}\|H_{u}\|_{2}\leq\sqrt{2}\lambda_{1}^{2}\gamma^{2} (Theorem 6.1 and Assumption A.4). Let B_{t}=(1/B)\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}. We show that B_{t} satisfies the three conditions in Theorem 2.2 ([39, Theorem 4.12]). It is easy to see that \mathbb{E}[B_{t}]=\Sigma from Assumption A.1. Next, we upper bound \max\left\{\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2},\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\right\}. We have

𝔼[(BtΣ)(BtΣ)]2\displaystyle\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2}
=\displaystyle= 𝔼[(1Bi=1BAB(t1)+i+βtGtΣ)(1Bi=1BAB(t1)+i+βtGtΣ)]2\displaystyle\;\left\|\mathbb{E}[(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}-\Sigma)(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}-\Sigma)^{\top}]\right\|_{2}
\displaystyle\leq 𝔼[(1Bi=1BAB(t1)+iΣ)(1Bi=1BAB(t1)+iΣ)]2+β2𝔼[GtGt]2\displaystyle\;\left\|\mathbb{E}[(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}-\Sigma)(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}-\Sigma)^{\top}]\right\|_{2}+\beta^{2}\|\mathbb{E}[G_{t}G_{t}^{\top}]\|_{2}
=\displaystyle= Vλ12/B+β2𝔼[GtGt]2\displaystyle\;V\lambda_{1}^{2}/B+\beta^{2}\|\mathbb{E}[G_{t}G_{t}^{\top}]\|_{2}
\displaystyle\leq Vλ12/B+β2C2d,\displaystyle\;V\lambda_{1}^{2}/B+\beta^{2}C_{2}d\;, (59)

where the last inequality follows from Lemma F.3 and C2>0C_{2}>0 is an absolute constant. Let V~:=Vλ12/B+β2C2d\widetilde{V}:=V\lambda_{1}^{2}/B+\beta^{2}C_{2}d. Similarly, we can show that 𝔼[(BtΣ)(BtΣ)]2V~\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\leq\widetilde{V}. By Lemma F.5 and Lemma F.2, we know with probability 1ζ1-\zeta, for all t[T]t\in[T],

BtΣ2\displaystyle\left\|B_{t}-\Sigma\right\|_{2}
=\displaystyle= 1Bi=1BAB(t1)+i+βtGtΣ2\displaystyle\left\|\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}-\Sigma\right\|_{2}
\displaystyle\leq C3(Mλ1log(dT/ζ)B+Vλ12log(dT/ζ)B+β(d+log(T/ζ))).\displaystyle C_{3}\left(\frac{M\lambda_{1}\log(dT/\zeta)}{B}+\sqrt{\frac{V\lambda_{1}^{2}\log(dT/\zeta)}{B}}+\beta\left(\sqrt{d}+\sqrt{\log(T/\zeta)}\right)\right)\;.

Let M~:=C3(Mλ1log(dT/ζ)B+Vλ12log(dT/ζ)B+β(d+log(T/ζ)))\widetilde{M}:=C_{3}\left(\frac{M\lambda_{1}\log(dT/\zeta)}{B}+\sqrt{\frac{V\lambda_{1}^{2}\log(dT/\zeta)}{B}}+\beta\left(\sqrt{d}+\sqrt{\log(T/\zeta)}\right)\right). Under the event that BtΣ2M~\left\|B_{t}-\Sigma\right\|_{2}\leq\widetilde{M} for all t[T]t\in[T], we apply Theorem 2.2 with a learning rate ηt=α(λ1λ2)(ξ+t)\eta_{t}=\frac{\alpha}{(\lambda_{1}-\lambda_{2})(\xi+t)} where

ξ=20max(M~α(λ1λ2),(V~+λ12)α2(λ1λ2)2log(1+ζ100)).\displaystyle\xi=20\max\left(\frac{\widetilde{M}\alpha}{(\lambda_{1}-\lambda_{2})},\frac{\left(\widetilde{V}+\lambda_{1}^{2}\right)\alpha^{2}}{(\lambda_{1}-\lambda_{2})^{2}\log(1+\frac{\zeta}{100})}\right)\;. (60)

Then Theorem 2.2 implies that with probability 1ζ1-\zeta,

sin2(wT,v1)Clog(1/ζ)ζ2(d(ξT)2α+α2V~(2α1)(λ1λ2)2T),\displaystyle\sin^{2}\left(w_{T},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(d\left(\frac{\xi}{T}\right)^{2\alpha}+\frac{\alpha^{2}\widetilde{V}}{(2\alpha-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}T}\right)\;, (61)

for some positive constant CC. Using n=BTn=BT and Eq. (59), the above bound implies

sin2(wT,v1)Clog(1/ζ)ζ2(α2Vλ12(2α1)(λ1λ2)2n+K2γ2λ12log2a(nd/(Tζ))log(1/δ)d2α2T(2α1)(λ1λ2)2ε2n2+d(ξ~)α).\displaystyle\sin^{2}\left(w_{T},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(\frac{\alpha^{2}V\lambda_{1}^{2}}{(2\alpha-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}n}+\frac{K^{2}\gamma^{2}\lambda_{1}^{2}\log^{2a}(nd/(T\zeta))\log(1/\delta)d^{2}\alpha^{2}T}{(2\alpha-1)(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{2}}+d\left(\tilde{\xi}\right)^{\alpha}\right)\;. (62)

where ξ~=(ξ/T)2\tilde{\xi}=(\xi/T)^{2}, and

ξ~:=max\displaystyle\tilde{\xi}:=\max (M2λ12α2log2(dT/ζ)(λ1λ2)2n2+Vλ12log(dT/ζ)α2(λ1λ2)2nT+K2γ2λ12log2a(nd/(Tζ))log(1/δ)log(T/ζ)α2d2(λ1λ2)2ε2n2,\displaystyle\left(\frac{M^{2}\lambda_{1}^{2}\alpha^{2}\log^{2}(dT/\zeta)}{(\lambda_{1}-\lambda_{2})^{2}n^{2}}+\frac{V\lambda_{1}^{2}\log(dT/\zeta)\alpha^{2}}{(\lambda_{1}-\lambda_{2})^{2}nT}+\frac{K^{2}\gamma^{2}\lambda_{1}^{2}\log^{2a}(nd/(T\zeta))\log(1/\delta)\log(T/\zeta)\alpha^{2}d^{2}}{(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{2}},\right.
V2λ14α4(λ1λ2)4log2(1+ζ100)n2+K4γ4λ14log4a(nd/(Tζ))log2(1/δ)α4d4T2(λ1λ2)4log2(1+ζ100)ε4n4\displaystyle\left.\frac{V^{2}\lambda_{1}^{4}\alpha^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})n^{2}}+\frac{K^{4}\gamma^{4}\lambda_{1}^{4}\log^{4a}(nd/(T\zeta))\log^{2}(1/\delta)\alpha^{4}d^{4}T^{2}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})\varepsilon^{4}n^{4}}\right.
+λ14α4(λ1λ2)4log2(1+ζ100)T2).\displaystyle\left.+\frac{\lambda_{1}^{4}\alpha^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})T^{2}}\right)\;. (63)

For ζ=O(1)\zeta=O(1) and K=O(1)K=O(1), selecting α=clogn\alpha=c\log n, T=c(logn)2T=c^{\prime}(\log n)^{2}, and assuming lognλ12/(λ1λ2)2\log n\geq\lambda_{1}^{2}/(\lambda_{1}-\lambda_{2})^{2} and

n\displaystyle n\geq C(Mλ1log(n)log(dlog(n))λ1λ2+Vλ12log(dT)(λ1λ2)+γλ1loga(nd/log(n))log(1/δ)log(log(n))log(n)d(λ1λ2)ε\displaystyle C\left(\frac{M\lambda_{1}\log(n)\log(d\log(n))}{\lambda_{1}-\lambda_{2}}+\frac{\sqrt{V\lambda_{1}^{2}\log(dT)}}{(\lambda_{1}-\lambda_{2})}+\frac{\gamma\lambda_{1}\log^{a}(nd/\log(n))\sqrt{\log(1/\delta)\log(\log(n))}\log(n)d}{(\lambda_{1}-\lambda_{2})\varepsilon}\right.
+Vλ12(log(n))2(λ1λ2)2+γλ1loga(nd/log(n))log(1/δ)(log(n))2d(λ1λ2)ε),\displaystyle\left.+\frac{V\lambda_{1}^{2}(\log(n))^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{\gamma\lambda_{1}\log^{a}(nd/\log(n))\sqrt{\log(1/\delta)}(\log(n))^{2}d}{(\lambda_{1}-\lambda_{2})\varepsilon}\right)\;, (64)

with large enough positive constants cc, cc^{\prime}, and CC, we have ξ~1\tilde{\xi}\leq 1 and dξ~α1/n2d\tilde{\xi}^{\alpha}\leq 1/n^{2}. Hence it is sufficient to have

n=O~(exp(λ12/(λ1λ2)2)+Mλ1λ1λ2+Vλ12(λ1λ2)2+dγλ1log(1/δ)(λ1λ2)ε),n=\tilde{O}\Big{(}\,\exp(\lambda_{1}^{2}/(\lambda_{1}-\lambda_{2})^{2})+\frac{M\lambda_{1}}{\lambda_{1}-\lambda_{2}}+\frac{V\lambda_{1}^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{d\,\gamma\,\lambda_{1}\sqrt{\log(1/\delta)}}{(\lambda_{1}-\lambda_{2})\varepsilon}\,\Big{)}\;,

with a large enough constant.

E.2 Algorithm and proof of Theorem 6.1 on top eigenvalue estimation

Input: S=\{g_{i}\}_{i=1}^{B}, privacy parameters (\varepsilon,\delta), failure probability \zeta

  1. Let \tilde{g}_{i}\leftarrow g_{2i}-g_{2i-1} for i=1,2,\ldots,\lfloor B/2\rfloor, and let \tilde{S}=\{\tilde{g}_{i}\}_{i=1}^{\lfloor B/2\rfloor}
  2. Partition \tilde{S} into k={C_{1}\log(1/(\delta\zeta))}/{\varepsilon} disjoint subsets, each of size b=\lfloor B/(2k)\rfloor, and denote each subset by G_{j}\in{\mathbb{R}}^{d\times b}
  3. Let \lambda_{1}^{(j)} be the top eigenvalue of (1/b)G_{j}G_{j}^{\top} for all j\in[k]
  4. Partition [0,\infty) into \Omega\leftarrow\left\{\ldots,\left[2^{-2/4},2^{-1/4}\right),\left[2^{-1/4},1\right),\left[1,2^{1/4}\right),\left[2^{1/4},2^{2/4}\right),\ldots\right\}\cup\{[0,0]\}
  5. Run the (\varepsilon,\delta)-DP histogram learner of Lemma B.1 on \{\lambda_{1}^{(j)}\}_{j=1}^{k} over \Omega
  6. If all the bins are empty, return \perp
  7. Let [l,r] be a non-empty bin that contains the maximum number of points in the DP histogram
  8. Return \hat{\Lambda}=l

Algorithm 4: Private Top Eigenvalue Estimation

Taking the difference of consecutive gradients ensures that \tilde{g}_{i} is zero mean, so that we can directly use the top eigenvalue of (1/b)G_{j}G_{j}^{\top} for j\in[k]. We compute a histogram over those k top eigenvalues. This histogram is privatized by adding noise only to the occupied bins and thresholding small entries of the histogram to zero. The choice k=\Omega(\log(1/\zeta)/\varepsilon) ensures that the most occupied bin does not change after adding the DP noise to the histogram, and k=\Omega(\log(1/\delta)/\varepsilon) is necessary for handling an unbounded number of bins. We emphasize that we do not require any upper or lower bound on the eigenvalue, thanks to the private histogram learner from [12, 49] that gracefully handles an unbounded number of bins.
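A minimal Python sketch of Algorithm 4 following the description above, reusing the stable_histogram sketch from Appendix B; the constant C1 and the handling of edge cases are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def private_top_eigenvalue(gradients, eps, delta, zeta, C1=4.0, rng=None):
    """Sketch of Algorithm 4: subsample-and-aggregate top eigenvalues of the
    (zero-mean) gradient differences, then release one via a DP histogram.

    gradients : array of shape (B, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    B = gradients.shape[0]
    # Pair consecutive gradients so that the differences are zero mean.
    diffs = gradients[1::2][: B // 2] - gradients[0::2][: B // 2]
    k = max(1, int(C1 * np.log(1.0 / (delta * zeta)) / eps))
    b = max(1, len(diffs) // k)
    # Top eigenvalue of (1/b) G_j G_j^T on each of the k disjoint subsets.
    tops = []
    for j in range(k):
        Gj = diffs[j * b:(j + 1) * b]
        if len(Gj) == 0:
            continue
        tops.append(np.linalg.eigvalsh(Gj.T @ Gj / len(Gj))[-1])
    # Geometric bins with ratio 2^{1/4}; return the left end point of the
    # most occupied bin of the privatized histogram.
    bin_of = lambda lam: int(np.floor(4.0 * np.log2(max(lam, 1e-300))))
    hist = stable_histogram(tops, bin_of, eps, delta, rng)
    if not hist:
        return None                       # all bins empty
    return 2.0 ** (max(hist, key=hist.get) / 4.0)
```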

The privacy guarantee follows from the privacy guarantee of the histogram learner provided in Lemma B.1.

For the utility analysis, we follow the analysis of [46, Theorem 3.1]. The main difference is that we prove a smaller sample complexity since we only need the top eigenvalue, and we analyze a more general distribution family. The random vector \tilde{g}_{i} is zero mean with covariance 2\lambda_{1}^{2}H_{u}\in{\mathbb{R}}^{d\times d}, where H_{u}=\mathbb{E}[(A_{i}-\Sigma)uu^{\top}(A_{i}-\Sigma)^{\top}]/\lambda_{1}^{2}, and \tilde{g}_{i} satisfies, with probability 1-\zeta,

|g~i,v| 2Kλ1Hu2loga(1/ζ),\displaystyle|\left\langle\tilde{g}_{i},v\right\rangle|\;\leq\;2K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(1/\zeta)\;, (65)

which follows from Lemma 2.1. Applying union bound over all basis vectors v{e1,,ed}v\in\{e_{1},\ldots,e_{d}\}, we know with probability 1ζ1-\zeta,

g~i 2Kλ1dHu2loga(d/ζ).\displaystyle\|\tilde{g}_{i}\|\;\leq\;2K\lambda_{1}\sqrt{d\|H_{u}\|_{2}}\log^{a}(d/\zeta)\;.

We next show that conditioned on event ={g~i2Kλ1dHu2loga(d/ζ)}\mathcal{E}=\{\|\tilde{g}_{i}\|\leq 2K\lambda_{1}\sqrt{d\|H_{u}\|_{2}}\log^{a}(d/\zeta)\}, the covariance 𝔼[g~ig~i|]\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}|\mathcal{E}] is close to the true covariance 𝔼[g~ig~i]=2λ12Hu\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}]=2\lambda_{1}^{2}H_{u}. Note that

𝔼[g~ig~i|]\displaystyle\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}|\mathcal{E}] =𝔼[g~ig~i𝕀{g~i2Kλ1dHu2loga(d/ζ)}]()\displaystyle\;=\;\frac{\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}{\mathbb{I}}\{\|\tilde{g}_{i}\|\leq 2K\lambda_{1}\sqrt{d\|H_{u}\|_{2}}\log^{a}(d/\zeta)\}]}{{\mathbb{P}}(\mathcal{E})}
𝔼[g~ig~i]()2λ12Hu1ζ.\displaystyle\;\preceq\;\frac{\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}]}{{\mathbb{P}}(\mathcal{E})}\;\preceq\;\frac{2\lambda_{1}^{2}H_{u}}{1-\zeta}\;. (66)

We next show the empirical covariance (1/b)i=1bg~ig~i({1}/{b})\sum_{i=1}^{b}\tilde{g}_{i}\tilde{g}_{i}^{\top} concentrates around 2λ12Hu2\lambda_{1}^{2}H_{u}. First of all, using union bound on Eq. (65), we have with probability 1ζ1-\zeta, for all i[b]i\in[b] and j[d]j\in[d],

|g~ij|2Kλ1Hu2loga(bd/ζ).\displaystyle|\tilde{g}_{ij}|\leq 2K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(bd/\zeta)\;.

Under the event that |\tilde{g}_{ij}|\leq 2K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(bd/\zeta) for all i\in[b] and j\in[d], [70, Corollary 6.20] together with Eq. (66) implies

(1bi=1bg~ig~i2λ12Hu2α)2dexp(bα28K2λ12Hu2log2a(bdζ)d(2λ12Hu2/(1ζ)+α)).\displaystyle{\mathbb{P}}\left(\left\|\frac{1}{b}\sum_{i=1}^{b}\tilde{g}_{i}\tilde{g}_{i}^{\top}-2\lambda_{1}^{2}H_{u}\right\|_{2}\geq\alpha\right)\leq 2d\exp\left(-\frac{b\alpha^{2}}{8K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}\log^{2a}(\frac{bd}{\zeta})d(2\lambda_{1}^{2}\|H_{u}\|_{2}/(1-\zeta)+\alpha)}\right)\;.

The above bound implies that with probability 1ζ1-\zeta,

1bi=1bg~ig~iλ122Hu2=O(Kλ12Hu2loga(bd/ζ)dlog(d/ζ)b+K2λ12Hu2log2a(bd/ζ)dlog(d/ζ)b).\displaystyle\left\|\frac{1}{b}\sum_{i=1}^{b}\tilde{g}_{i}\tilde{g}_{i}^{\top}-\lambda_{1}^{2}2H_{u}\right\|_{2}\;=\;O\Big{(}\,K\lambda_{1}^{2}\|H_{u}\|_{2}\log^{a}(bd/\zeta)\sqrt{\frac{d\log(d/\zeta)}{b}}+K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}\log^{2a}(bd/\zeta)\frac{d\log(d/\zeta)}{b}\,\Big{)}\;.

This means that if b=\Omega(K^{2}d\log(dk/\zeta)\log^{2a}(bdk/\zeta)), then with probability 1-\zeta, for all j\in[k], 2^{-1/8}\cdot 2\lambda_{1}^{2}\|H_{u}\|_{2}\leq\lambda_{1}^{(j)}\leq 2^{1/8}\cdot 2\lambda_{1}^{2}\|H_{u}\|_{2}. This means all of the \lambda_{1}^{(j)} lie within an interval of multiplicative width 2^{1/4}. Thus, at most two consecutive buckets are filled with \lambda_{1}^{(j)}. By the private histogram guarantee of Lemma B.1, if k\geq\log(1/(\delta\zeta))/\varepsilon, one of those two bins is released. The resulting total multiplicative error is bounded by 2^{1/2}.

E.3 Algorithm and proof of Lemma 6.2 on DP mean estimation

Input: S=\{g_{i}\}_{i=1}^{B}, privacy parameters (\varepsilon,\delta), target error \alpha, failure probability \zeta, approximate top eigenvalue \hat{\Lambda}

  1. Let \tau=2^{1/4}K\sqrt{\hat{\Lambda}}\log^{a}(25)
  2. For j=1,2,\ldots,d do
     (a) Run the (\frac{\varepsilon}{4\sqrt{2d\log(4/\delta)}},\frac{\delta}{4d})-DP histogram learner of Lemma B.1 on \{g_{ij}\}_{i\in[B]} over \Omega=\{\cdots,(-2\tau,-\tau],(-\tau,0],(0,\tau],(\tau,2\tau],(2\tau,3\tau],\cdots\}
     (b) Let [l,h] be the bucket that contains the maximum number of points in the private histogram, and set \bar{g}_{j}\leftarrow l
     (c) Truncate the j-th coordinate of each gradient in \{g_{i}\}_{i\in[B]} to the interval [\bar{g}_{j}-3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta),\;\bar{g}_{j}+3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)]
  3. Let \tilde{g}_{i} denote the truncated version of g_{i}
  4. Compute the empirical mean of the truncated gradients, \tilde{\mu}=(1/B)\sum_{i=1}^{B}\tilde{g}_{i}, and add Gaussian noise: \hat{\mu}=\tilde{\mu}+{\cal N}\left(0,\left(\frac{12K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}\right)^{2}\mathbf{I}_{d}\right)
  5. Return \hat{\mu}

Algorithm 5: Private Mean Estimation [49, 43]

The histogram learner is called d times, each with an (\varepsilon/(4\sqrt{2d\log(4/\delta)}),\delta/(4d))-DP guarantee, and the end-to-end privacy guarantee of (\varepsilon/2,\delta/2) follows from Lemma B.4 for \varepsilon\in(0,0.9). The sensitivity of the clipped mean estimate is \Delta=6K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)\sqrt{d}/B. The Gaussian mechanism with covariance (2\Delta\sqrt{2\log(2.5/\delta)}/\varepsilon)^{2}{\bf I}_{d} satisfies (\varepsilon/2,\delta/2)-DP from Lemma 2.4 for \varepsilon\in(0,1). Putting these two together with the serial composition of Lemma B.3, we get the desired privacy guarantee.
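A matching Python sketch of Algorithm 5, with the same structure as above: a coordinate-wise DP histogram locates the mean, gradients are truncated around the located bin, and the truncated mean is released with Gaussian noise. It reuses stable_histogram and per_access_budget from Appendix B; the constants K and a are treated as inputs, and the sketch is illustrative rather than a faithful implementation.

```python
import numpy as np

def private_mean_estimation(gradients, eps, delta, zeta, lam_hat,
                            K=1.0, a=1.0, rng=None):
    """Sketch of Algorithm 5: locate, truncate, then privately average.

    gradients : array of shape (B, d); lam_hat : output of Algorithm 4.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, d = gradients.shape
    tau = 2.0 ** 0.25 * K * np.sqrt(lam_hat) * np.log(25.0) ** a
    radius = 3.0 * K * np.sqrt(lam_hat) * np.log(B * d / zeta) ** a
    # Budget for the d coordinate-wise histograms via advanced composition
    # (Lemma B.4); this reproduces the per-coordinate budget of Algorithm 5.
    eps_hist, delta_hist = per_access_budget(eps / 2.0, delta / 2.0, d)
    bin_of = lambda x: int(np.floor(x / tau))
    clipped = np.empty_like(gradients)
    for j in range(d):
        hist = stable_histogram(gradients[:, j], bin_of, eps_hist, delta_hist, rng)
        g_bar = (max(hist, key=hist.get) if hist else 0) * tau  # left end point
        clipped[:, j] = np.clip(gradients[:, j], g_bar - radius, g_bar + radius)
    # Gaussian mechanism on the truncated mean: per-coordinate range 2*radius
    # gives l2-sensitivity 2*radius*sqrt(d)/B when one sample is replaced.
    sigma = (2.0 * radius * np.sqrt(d) * np.sqrt(2.0 * np.log(2.5 / delta))
             / ((eps / 2.0) * B))
    return clipped.mean(axis=0) + rng.normal(scale=sigma, size=d)
```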

The proof of utility follows similarly to [55, Lemma D.2]. Let I_{l}=(l\tau,(l+1)\tau]. Denote the population probability of the j-th coordinate in I_{l} by h_{j,l}={\mathbb{P}}(g_{ij}\in I_{l}), the empirical probability by \hat{h}_{j,l}=\frac{1}{B}\sum_{i=1}^{B}{\mathbb{I}}(g_{ij}\in I_{l}), and the released private empirical probability by \tilde{h}_{j,l}.

Fix j\in[d]. Let I_{k} be the bin that contains \mu_{j}. Then we know [\mu_{j}-K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25),\mu_{j}+K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25)]\subseteq[\mu_{j}-\tau,\mu_{j}+\tau]\subset(I_{k-1}\cup I_{k}\cup I_{k+1}). By Lemma 2.1, we know {\mathbb{P}}(|g_{ij}-\mu_{j}|\geq\tau)\leq{\mathbb{P}}(|g_{ij}-\mu_{j}|\geq K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25))\leq 0.04. This means h_{j,k-1}+h_{j,k}+h_{j,k+1}\geq 0.96 and \max(h_{j,k-1},h_{j,k},h_{j,k+1})\geq 0.32.

By the Dvoretzky-Kiefer-Wolfowitz inequality and a union bound over j\in[d], we have that with probability 1-\zeta, \max_{j,l}|h_{j,l}-\hat{h}_{j,l}|\leq\sqrt{\log(d/\zeta)/B}. Using Lemma B.1, if B=\Omega((\sqrt{d\log(1/\delta)}/\varepsilon)\log(d/(\zeta\delta))), then with probability 1-\zeta we have \max_{j,l}|\tilde{h}_{j,l}-\hat{h}_{j,l}|\leq 0.005. Thus, under our assumption on B, with probability 1-\zeta, \max_{j,l}|\tilde{h}_{j,l}-h_{j,l}|\leq 0.01. Then we have \max(h_{j,k-1},h_{j,k},h_{j,k+1})-0.01\geq 0.31\geq 0.04+0.01\geq\max_{l\neq k-1,k,k+1}h_{j,l}+0.01. This implies that with probability 1-\zeta, the algorithm must pick one of the bins I_{k-1},I_{k},I_{k+1}. This means |\bar{g}_{j}-\mu_{j}|\leq 2\tau\leq 2^{1.5}K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25). By the tail bound of Lemma 2.1, we know for all j\in[d] and i\in[B], |g_{ij}-\bar{g}_{j}|\leq|g_{ij}-\mu_{j}|+|\bar{g}_{j}-\mu_{j}|\leq 3K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(Bd/\zeta). This completes our proof.

Appendix F Technical lemmas

Lemma F.1.

Let xd𝒩(0,Σ)x\in{\mathbb{R}}^{d}\sim{\cal N}(0,\Sigma). Then there exists universal constant CC such that with probability 1ζ1-\zeta,

x2CTr(Σ)log(1/ζ).\displaystyle\|x\|^{2}\leq C\operatorname{Tr}(\Sigma)\log(1/\zeta)\;. (67)
Proof.

Let x~:=Σ1/2x\tilde{x}:=\Sigma^{-1/2}x. Then x~\tilde{x} is also a Gaussian with x~𝒩(0,𝐈d)\tilde{x}\sim{\cal N}(0,\mathbf{I}_{d}). By Hanson-Wright inequality ( [67, Theorem 6.2.1]), there exists universal constant c>0c>0 such that with probability 1ζ1-\zeta,

x2=x~Σx~Tr(Σ)+c(Σ𝐅+Σ2)log(2/ζ)CTr(Σ)log(1/ζ).\displaystyle\|x\|^{2}=\tilde{x}^{\top}\Sigma\tilde{x}\leq\operatorname{Tr}(\Sigma)+c(\|\Sigma\|_{\mathbf{F}}+\|\Sigma\|_{2})\log(2/\zeta)\leq C\operatorname{Tr}(\Sigma)\log(1/\zeta)\;. (68)

Lemma F.2 ([67, Theorem 4.4.5]).

Let Gd×dG\in{\mathbb{R}}^{d\times d} be a random matrix where each entry GijG_{ij} is i.i.d. sampled from standard Gaussian 𝒩(0,1){\cal N}(0,1). Then there exists universal constant C>0C>0 such that with probability 12et21-2e^{-t^{2}}, G2C(d+t)\|G\|_{2}\leq C(\sqrt{d}+t) for t>0t>0.

Lemma F.3.

Let Gd×dG\in{\mathbb{R}}^{d\times d} be a random matrix where each entry GijG_{ij} is i.i.d. sampled from standard Gaussian 𝒩(0,1){\cal N}(0,1). Then we have 𝔼[GG]2C2d\|\mathbb{E}[GG^{\top}]\|_{2}\leq C_{2}d and 𝔼[GG]2C2d\|\mathbb{E}[G^{\top}G]\|_{2}\leq C_{2}d.

Proof.

By Lemma F.2, there exists a universal constant C_{1}>0 such that

(GC1(d+s))es2,s>0.\displaystyle{\mathbb{P}}\left(\|G\|\geq C_{1}(\sqrt{d}+s)\right)\leq e^{-s^{2}},\;\;\;\forall s>0\;. (69)

Then

𝔼[GG]2\displaystyle\|\mathbb{E}[GG^{\top}]\|_{2} 𝔼[GG2]\displaystyle\leq\mathbb{E}[\|GG^{\top}\|_{2}] (70)
𝔼[G22]\displaystyle\leq\mathbb{E}[\|G\|_{2}^{2}] (71)
=\int_{0}^{\infty}2r{\mathbb{P}}(\|G\|_{2}\geq r)dr\leq C_{1}d+C_{3}\int_{\sqrt{d}}^{\infty}2re^{-\frac{(r-\sqrt{d})^{2}}{2}}dr (72)
\leq C_{1}d+C_{3}(\sqrt{2\pi d}+2)\leq C_{2}d\;, (73)

where C2C_{2} is an absolute constant. The proof for the second claim follows similarly. ∎

Lemma F.4.

Let x,y𝕊2d1x,y\in\mathbb{S}_{2}^{d-1}. Then

1x,y2xy2.\displaystyle 1-\left\langle x,y\right\rangle^{2}\leq\|x-y\|^{2}\;. (74)

If \|x-y\|\leq\sqrt{2}, then

1x,y212xy2.\displaystyle 1-\left\langle x,y\right\rangle^{2}\geq\frac{1}{2}\|x-y\|^{2}\;. (75)

The following lemma follows from the matrix Bernstein inequality [65].

Lemma F.5.

Under A.1, A.2, and A.3, in Assumption 1, with probability 1ζ1-\zeta,

1Bi[B]AiΣ2=O(λ12Vlog(d/ζ)B+λ1Mlog(d/ζ)B).\displaystyle\Big{\|}\frac{1}{B}\sum_{i\in[B]}A_{i}-\Sigma\Big{\|}_{2}\;=\;O\Big{(}\,\sqrt{\frac{\lambda_{1}^{2}V\log(d/\zeta)}{B}}+\frac{\lambda_{1}M\log(d/\zeta)}{B}\,\Big{)}\;. (76)