
DP-PCA: Statistically Optimal and
Differentially Private PCA

Xiyang Liu (Paul Allen School of Computer Science & Engineering, University of Washington, [email protected]), Weihao Kong (Google Research, [email protected]), Prateek Jain (Google Research, [email protected]), Sewoong Oh (Paul Allen School of Computer Science & Engineering, University of Washington, and Google Research, [email protected])
Abstract

We study the canonical statistical task of computing the principal component from $n$ i.i.d. data in $d$ dimensions under $(\varepsilon,\delta)$-differential privacy. Although extensively studied in the literature, existing solutions fall short on two key aspects: ($i$) even for Gaussian data, existing private algorithms require the number of samples $n$ to scale super-linearly with $d$, i.e., $n=\Omega(d^{3/2})$, to obtain non-trivial results, while non-private PCA requires only $n=O(d)$, and ($ii$) existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on private mean estimation, which adds the minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for $n=\tilde{O}(d)$. Furthermore, we provide a lower bound showing that a sub-Gaussian-style assumption is necessary to obtain the optimal error rate.

1 Introduction

Principal Component Analysis (PCA) is a fundamental statistical tool with multiple applications including dimensionality reduction, data visualization, and noise reduction. Naturally, it is a key part of most standard data analysis and ML pipelines. However, when applied to data collected from numerous individuals, such as the U.S. Census data, the outcome of PCA might reveal highly sensitive personal information. We investigate the design of privacy-preserving PCA algorithms, and the involved privacy/utility tradeoffs, for computing the first principal component, which should serve as the building block of more general rank-$k$ PCA.

Differential privacy (DP) is a widely accepted mathematical notion of privacy introduced in [22], which is a standard in releasing the U.S. Census data [2] and is also deployed in commercial systems [64, 26, 28]. A query to a database is said to be $(\varepsilon,\delta)$-differentially private if a strong adversary who knows all other entries but one cannot infer that one entry from the query output with high confidence. The parameters $\varepsilon$ and $\delta$ restrict this confidence as measured by the Type-I and Type-II errors [42]. Smaller values of $\varepsilon\in[0,\infty)$ and $\delta\in[0,1]$ imply stronger privacy and plausible deniability for the participants.

For non-private PCA with $n$ i.i.d. samples in $d$ dimensions, the popular Oja's algorithm (provided in Algorithm 1) achieves the optimal error of $\sin(\hat{v},v_{1})=\tilde{\Theta}(\sqrt{d/n})$, where the error is measured by the sine of the angle between the estimate, $\hat{v}$, and the principal component, $v_{1}$ [39]. For differentially private PCA, a natural and fundamental question arises: what is the extra cost we pay in the error rate for ensuring $(\varepsilon,\delta)$-DP?

We introduce a novel approach we call DP-PCA (Algorithm 3) and show that it achieves an error bounded by $\sin(\hat{v},v_{1})=\tilde{O}(\sqrt{d/n}+d/(\varepsilon n))$ for sub-Gaussian-like data defined in Assumption 1, which is a broad class of light-tailed distributions that includes Gaussian data as a special case. The second term characterizes the cost of privacy, and this is tight; we prove a nearly matching information-theoretic lower bound showing that no $(\varepsilon,\delta)$-DP algorithm can achieve a smaller error. This significantly improves upon a long line of existing private algorithms for PCA, e.g., [15, 10, 36, 34, 24]. These existing algorithms are analyzed for fixed and non-stochastic data and achieve sub-optimal error rates of $O(\sqrt{d/n}+d^{3/2}/(\varepsilon n))$ even in the stochastic setting we consider.

A remaining question is whether the sub-Gaussian-like assumption, namely Assumption A.4, is necessary or whether it is an artifact of our analysis and our algorithm. It turns out that such an assumption on the lightness of the tail is critical; we prove an algorithm-independent, information-theoretic lower bound (Theorem 5.4) showing that, without such an assumption, the cost of privacy is lower bounded by $\Omega(\sqrt{d/(\varepsilon n)})$. This proves a separation of the error depending on the lightness of the tail.

We start with the formal description of the stochastic setting in Section 2 and present Oja's algorithm for non-private PCA. Our first attempt at making this algorithm private in Section 3 already achieves near-optimal error if the data is strictly from a Gaussian distribution. However, there are two remaining challenges that we describe in detail in Section 4: ($i$) the excessive number of iterations of Private Oja's Algorithm (Algorithm 2) prevents using typical values of $\varepsilon$ used in practice, and ($ii$) for general sub-Gaussian-like distributions, the error does not vanish even when the noise in the data (as measured by a certain fourth moment of a function of the data) vanishes. The first challenge is due to the analysis requiring amplification by shuffling [25], which is restrictive. The second is due to its reliance on gradient norm clipping [1], which does not adapt to the variance of the current gradients. This drives the design of DP-PCA in Section 5, which critically relies on two techniques to overcome each challenge, respectively. First, minibatch SGD (instead of single-sample SGD) significantly reduces the number of iterations, thus obviating the need for amplification by shuffling. Next, private mean estimation (instead of gradient norm clipping and noise adding) adapts to the stochasticity of the problem and adds the minimal noise necessary to achieve privacy. The main idea of this variance-adaptive stochastic gradient update is explained in detail in Section 6, along with a sketch of a proof.

Notations. For a vector $x\in\mathbb{R}^{d}$, we use $\|x\|$ to denote the Euclidean norm. For a matrix $X\in\mathbb{R}^{d\times d}$, we use $\|X\|_{2}=\max_{\|v\|=1}\|Xv\|_{2}$ to denote the spectral norm. We use $\mathbf{I}_{d}$ to denote the $d\times d$ identity matrix. For $n\in\mathbb{Z}^{+}$, let $[n]:=\{1,2,\ldots,n\}$. Let $\mathbb{S}_{2}^{d-1}$ denote the $\ell_{2}$ unit sphere in $\mathbb{R}^{d}$, i.e., $\mathbb{S}_{2}^{d-1}:=\{x\in\mathbb{R}^{d}:\|x\|=1\}$. $\tilde{O}(\cdot)$ hides logarithmic factors in $n$, $d$, and the failure probability $\zeta$.

2 Problem formulation and background on DP

Typical PCA assumes i.i.d. data $\{x_{i}\in\mathbb{R}^{d}\}$ from a distribution and finds the first eigenvector of $\Sigma=\mathbb{E}[(x_{i}-\mathbb{E}[x_{i}])(x_{i}-\mathbb{E}[x_{i}])^{\top}]\in\mathbb{R}^{d\times d}$. Our approach allows for a more general class of data $\{A_{i}\in\mathbb{R}^{d\times d}\}$ that recovers the standard case when $A_{i}=(x_{i}-\mathbb{E}[x_{i}])(x_{i}-\mathbb{E}[x_{i}])^{\top}$.

Assumption 1 ($(\Sigma,\{\lambda_{i}\}_{i=1}^{d},M,V,K,\kappa,a,\gamma^{2})$-model).

Let $A_{1},A_{2},\ldots,A_{n}\in\mathbb{R}^{d\times d}$ be a sequence of (not necessarily symmetric) matrices sampled independently from the same distribution that satisfy the following, with PSD matrices $\Sigma\in\mathbb{R}^{d\times d}$ and $H_{u}\in\mathbb{R}^{d\times d}$, and positive scalar parameters $M,V,K$, $\kappa$, $a$, and $\gamma^{2}$:

  1. A.1.

    Let $\Sigma:=\mathbb{E}[A_{i}]$ for a symmetric positive semidefinite (PSD) matrix $\Sigma\in\mathbb{R}^{d\times d}$, let $\lambda_{i}$ denote the $i$-th largest eigenvalue of $\Sigma$, and let $\kappa:=\lambda_{1}/(\lambda_{1}-\lambda_{2})$,

  2. A.2.

    $\|A_{i}-\Sigma\|_{2}\leq\lambda_{1}M$ almost surely,

  3. A.3.

    $\max\left\{\left\|\mathbb{E}\left[(A_{i}-\Sigma)(A_{i}-\Sigma)^{\top}\right]\right\|_{2},\left\|\mathbb{E}\left[(A_{i}-\Sigma)^{\top}(A_{i}-\Sigma)\right]\right\|_{2}\right\}\leq\lambda_{1}^{2}V$,

  4. A.4.

    $\max_{\|u\|=1,\|v\|=1}\mathbb{E}\left[\exp\left(\left(\frac{|u^{\top}(A_{i}^{\top}-\Sigma)v|^{2}}{K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}}\right)^{1/(2a)}\right)\right]\leq 1$, where $H_{u}:=(1/\lambda_{1}^{2})\,\mathbb{E}[(A_{i}-\Sigma)uu^{\top}(A_{i}-\Sigma)^{\top}]$. We denote $\gamma^{2}:=\max_{\|u\|=1}\|H_{u}\|_{2}$.

The first three assumptions are required for PCA even if privacy is not needed. The last assumption provides a Gaussian-like tail bound that determines how much noise we need to introduce in the algorithm for $(\varepsilon,\delta)$-DP. The following lemma is useful in the analyses.

Lemma 2.1.

Under A.1 and A.4 in Assumption 1, for any unit vectors $u$ and $v$, with probability $1-\zeta$,

$$|u^{\top}(A_{i}^{\top}-\Sigma)v|^{2}\;\leq\;K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}\log^{2a}(1/\zeta)\;.\tag{1}$$

2.1 Oja’s algorithm

In a non-private setting, the following streaming algorithm introduced in [61] achieves optimal sample complexity, as analyzed in [39]. It is a projected stochastic gradient ascent on the objective defined on the empirical covariance: $\max_{\|w\|=1}(1/n)\sum_{i=1}^{n}w^{\top}A_{i}w$.

1 Choose $w_{0}$ uniformly at random from the unit sphere
2 for $t=1,2,\ldots,T$ do  $w_{t}^{\prime}\leftarrow w_{t-1}+\eta_{t}A_{t}w_{t-1}$,  $w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|$
Return $w_{T}$
Algorithm 1 (Non-private) Oja’s Algorithm
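For concreteness, the following is a minimal NumPy sketch of Algorithm 1; the streaming interface `A_stream` and the step-size schedule `eta(t)` are illustrative assumptions rather than part of the algorithm's specification.

```python
import numpy as np

def oja(A_stream, d, eta, seed=0):
    """Minimal sketch of (non-private) Oja's Algorithm (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)              # w_0 uniform on the unit sphere
    for t, A_t in enumerate(A_stream, start=1):
        w = w + eta(t) * (A_t @ w)      # w'_t <- w_{t-1} + eta_t A_t w_{t-1}
        w /= np.linalg.norm(w)          # w_t <- w'_t / ||w'_t||
    return w
```

For instance, with samples $A_{i}=x_{i}x_{i}^{\top}$ one may call `oja((np.outer(x, x) for x in xs), d, lambda t: 1.0 / (10 + t))`.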

Central to our analysis is the following error bound on Oja’s Algorithm from [39].

Theorem 2.2 ([39, Theorem 4.1]).

Under Assumptions A.1-A.3, suppose the step size is $\eta_{t}=\frac{\alpha}{(\lambda_{1}-\lambda_{2})(\xi+t)}$ for some $\alpha>1/2$ and $\xi:=20\max\left(\kappa M\alpha,\,\kappa^{2}(V+1)\alpha^{2}/\log(1+(\zeta/100))\right)$. If $T>\xi$, then there exists a constant $C>0$ such that Algorithm 1 outputs $w_{T}$ achieving, w.p. $1-\zeta$,

$$\sin^{2}\left(w_{T},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(\frac{\alpha^{2}\kappa^{2}V}{(2\alpha-1)T}+d\left(\frac{\xi}{T}\right)^{2\alpha}\right)\;.\tag{2}$$

2.2 Background on Differential Privacy

Differential privacy (DP), introduced in [22], is a de facto mathematical measure of the privacy leakage of a database accessed via queries. It ensures that even an adversary who knows all other entries cannot identify with high confidence whether a person of interest participated in the database or not.

Definition 2.3 (Differential privacy [22]).

Given two multisets $S$ and $S^{\prime}$, we say the pair $(S,S^{\prime})$ is neighboring if $|S\setminus S^{\prime}|+|S^{\prime}\setminus S|\leq 1$. We say a stochastic query $q$ over a dataset $S$ satisfies $(\varepsilon,\delta)$-differential privacy for some $\varepsilon>0$ and $\delta\in(0,1)$ if ${\mathbb{P}}(q(S)\in A)\leq e^{\varepsilon}\,{\mathbb{P}}(q(S^{\prime})\in A)+\delta$ for all neighboring $(S,S^{\prime})$ and all subsets $A$ of the range of $q$.

Small values of $\varepsilon$ and $\delta$ ensure that the adversary cannot identify any single data point with high confidence, thus providing plausible deniability. We provide useful DP lemmas in Appendix B. Within our stochastic gradient descent approach to PCA, we rely on the Gaussian mechanism to privatize each update. The sensitivity of a query $q$ is defined as $\Delta_{q}:=\sup_{\text{neighboring }(S,S^{\prime})}\|q(S)-q(S^{\prime})\|$.

Lemma 2.4 (Gaussian mechanism [23]).

For a query $q$ with sensitivity $\Delta_{q}$, $\varepsilon\in(0,1)$, and $\delta\in(0,1)$, the Gaussian mechanism outputs $q(S)+{\cal N}\big(0,(\Delta_{q}\sqrt{2\log(1.25/\delta)}/\varepsilon)^{2}\,\mathbf{I}_{d}\big)$ and achieves $(\varepsilon,\delta)$-DP.
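As a concrete illustration, a minimal NumPy sketch of this mechanism follows; the function name and interface are our own and not from the paper.

```python
import numpy as np

def gaussian_mechanism(value, sensitivity, eps, delta, rng=None):
    """Release value + N(0, sigma^2 I) with sigma = sensitivity * sqrt(2 log(1.25/delta)) / eps,
    which is (eps, delta)-DP for eps, delta in (0, 1) (Lemma 2.4)."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    value = np.asarray(value, dtype=float)
    return value + sigma * rng.standard_normal(value.shape)
```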

Another mechanism we frequently use is the private histogram learner of [49], whose analysis is provided in Appendix B, along with various composition theorems to provide end-to-end guarantees.

2.3 Comparisons with existing results in private PCA

We briefly discuss the most closely related work and survey more prior work in Appendix A. Most existing results assume fixed data in a deterministic setting where each sample has a bounded norm, $\|x_{i}\|\leq\beta$, and the goal is to find the top eigenvector of $\hat{\Sigma}:=(1/n)\sum_{i=1}^{n}(x_{i}-\hat{\mu})(x_{i}-\hat{\mu})^{\top}$ for the empirical mean $\hat{\mu}$. For the purpose of comparison, consider Gaussian $x_{i}\sim{\cal N}(0,\Sigma)$ with $\|x_{i}\|\leq\beta=O(\sqrt{\lambda_{1}d\log(n/\zeta)})$ for all $i\in[n]$ with probability $1-\zeta$. The first line of approaches in [10, 15, 24] is a Gaussian mechanism that outputs ${\rm PCA}(\hat{\Sigma}+Z)$, where $Z$ is a symmetric matrix with i.i.d. Gaussian entries of variance $((\beta^{2}/n\varepsilon)\sqrt{2\log(1.25/\delta)})^{2}$ to ensure $(\varepsilon,\delta)$-DP. The tightest result in [24, Theorem 7] achieves

$$\sin(\hat{v},v_{1})\;=\;\tilde{O}\Big{(}\kappa\Big{(}\sqrt{\frac{d}{n}}+\frac{d^{3/2}\sqrt{\log(1/\delta)}}{\varepsilon n}\Big{)}\Big{)}\;,\tag{3}$$

with high probability, under a strong assumption that the spectral gap is very large: $\lambda_{1}-\lambda_{2}=\omega(d^{3/2}\sqrt{\log(1/\delta)}/(\varepsilon n))$. In a typical scenario with $\lambda_{1}=O(1)$, this requires a large sample size of $n=\omega(d^{3/2}/\varepsilon)$. Since this Gaussian mechanism does not exploit the statistical properties of i.i.d. samples, the second term in this upper bound is larger by a factor of $d^{1/2}$ compared to the proposed DP-PCA (Corollary 5.2). The error rate of Eq. (3) is also achieved in [36, 34] by adding Gaussian noise to the standard power method for computing the principal components. When the spectral gap, $\lambda_{1}-\lambda_{2}$, is smaller, it is possible to trade off the dependence on $\kappa$ and the sampling ratio $d/n$, which we do not address in this work but survey in Appendix A.

3 First attempt: making Oja’s Algorithm private

Following the standard recipe in training with DP-SGD, e.g., [1], we introduce Private Oja's Algorithm in Algorithm 2. At each gradient update, we first apply gradient norm clipping to limit the contribution of a single data point, and then add an appropriately chosen Gaussian noise from Lemma 2.4 to achieve $(\varepsilon,\delta)$-DP end-to-end. The choice of clipping threshold $\beta$ ensures that, with high probability under our assumption, we do not clip any gradients. The choice of noise multiplier $\alpha$ ensures $(\varepsilon,\delta)$-DP.

Input: $S=\{A_{i}\in\mathbb{R}^{d\times d}\}_{i=1}^{n}$, privacy $(\varepsilon,\delta)$, learning rates $\{\eta_{t}\}_{t=1}^{n}$
1 Randomly permute $S$ and choose $w_{0}$ uniformly at random from the unit sphere
2 Set DP noise multiplier: $\alpha\leftarrow C^{\prime}\log(n/\delta)/(\varepsilon\sqrt{n})$
3 Set clipping threshold: $\beta\leftarrow C\lambda_{1}\sqrt{d}\,(K\gamma\log^{a}(nd/\zeta)+1)$
4 for $t=1,2,\ldots,n$ do
5       Sample $z_{t}\sim{\cal N}(0,\mathbf{I}_{d})$
6       $w_{t}^{\prime}\leftarrow w_{t-1}+\eta_{t}\,{\rm clip}_{\beta}(A_{t}w_{t-1})+2\eta_{t}\beta\alpha z_{t}$, where ${\rm clip}_{\beta}(x)=x\cdot\min\{1,\beta/\|x\|_{2}\}$
7       $w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|$
Return $w_{n}$
Algorithm 2 Private Oja’s Algorithm
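A minimal NumPy sketch of Algorithm 2 follows; the constant $C^{\prime}$ in the noise multiplier, the clipping threshold `beta`, and the step-size schedule `eta(t)` are taken as given inputs here and are assumptions of this sketch.

```python
import numpy as np

def private_oja(A_list, eps, delta, eta, beta, C_prime=1.0, seed=0):
    """Sketch of Private Oja's Algorithm (Algorithm 2): shuffle the data, then take
    one clipped and noised Oja step per sample. C_prime stands in for the constant C'."""
    rng = np.random.default_rng(seed)
    n, d = len(A_list), A_list[0].shape[0]
    alpha = C_prime * np.log(n / delta) / (eps * np.sqrt(n))    # DP noise multiplier
    perm = rng.permutation(n)                                   # random shuffling of S
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(1, n + 1):
        g = A_list[perm[t - 1]] @ w
        g = g * min(1.0, beta / max(np.linalg.norm(g), 1e-12))  # clip_beta
        w = w + eta(t) * g + 2.0 * eta(t) * beta * alpha * rng.standard_normal(d)
        w /= np.linalg.norm(w)
    return w
```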

One caveat of streaming algorithms is that we access the data $n$ times, each time with a private mechanism, but accessing only a single data point at a time. To prevent excessive privacy loss due to such a large number of data accesses, we apply a random shuffling of $S$ (line 1 of Algorithm 2) in order to benefit from standard amplification by shuffling [25, 30]. This gives an amplified privacy guarantee that allows us to add a small noise proportional to $\alpha=O(\log(n/\delta)/(\varepsilon\sqrt{n}))$. Without the shuffle amplification, we would instead need a larger noise scaling as $\alpha=O(\log(n/\delta)/\varepsilon)$, resulting in a suboptimal utility guarantee. However, this comes with a restriction that the amplification holds only for small values of $\varepsilon=O(\sqrt{\log(n/\delta)/n})$. Our first contribution in the proposed DP-PCA (Algorithm 3) is to expand this range to $\varepsilon=O(1)$, which includes the practical regime of interest $\varepsilon\in[1/2,5]$.

Lemma 3.1 (Privacy).

If $\varepsilon=O(\sqrt{\log(n/\delta)/n})$ and the noise multiplier is chosen to be $\alpha=\Omega\left(\log(n/\delta)/(\varepsilon\sqrt{n})\right)$, then Algorithm 2 is $(\varepsilon,\delta)$-DP.

Under Assumption 1, we select the gradient norm clipping threshold $\beta$ such that, with high probability, no gradient exceeds $\beta$.

Lemma 3.2 (Gradient clipping).

Let $\beta=C\lambda_{1}\sqrt{d}\,(K\gamma\log^{a}(nd/\zeta)+1)$ for some constant $C>0$. Then, with probability $1-\zeta$, $\|A_{t}w_{t-1}\|\leq\beta$ for any fixed $w_{t-1}$ independent of $A_{t}$, for all $t\in[n]$.

We provide proofs of both lemmas and the next theorem in Appendix D. When no clipping is applied, we can use the standard analysis of Oja’s Algorithm from [39] to prove the following utility guarantee.

Theorem 3.3 (Utility).

Given $n$ i.i.d. samples $\{A_{i}\in\mathbb{R}^{d\times d}\}_{i=1}^{n}$ satisfying Assumption 1 with parameters $(\Sigma,M,V,K,\kappa,a,\gamma^{2})$, if

$$n\;=\;\tilde{O}\Big{(}\,\kappa^{2}+\kappa M+\kappa^{2}V+\frac{d\,\kappa\,(\gamma+1)\,\log(1/\delta)}{\varepsilon}\,\Big{)}\;,\tag{4}$$

with a large enough constant, then there exists a positive universal constant $c_{1}$ and a choice of learning rate $\eta_{t}$ that depends on $(t,M,V,K,a,\lambda_{1},\lambda_{1}-\lambda_{2},n,d,\varepsilon,\delta)$ such that Algorithm 2 with a choice of $\zeta=0.01$ outputs $w_{n}$ that achieves, with probability $0.99$,

$$\sin^{2}\left(w_{n},v_{1}\right)\;=\;\widetilde{O}\left(\kappa^{2}\Big{(}\frac{V}{n}+\frac{(\gamma+1)^{2}d^{2}\log^{2}(1/\delta)}{\varepsilon^{2}n^{2}}\,\Big{)}\,\right)\;,\tag{5}$$

where $\widetilde{O}(\cdot)$ hides poly-logarithmic factors in $n$, $d$, $1/\varepsilon$, and $\log(1/\delta)$, and polynomial factors in $K$.

The first term in Eq. (5) matches the non-private error rate for Oja's algorithm in Eq. (2) with $\alpha=O(\log n)$ and $T=n$, and the second term is the price we pay for ensuring $(\varepsilon,\delta)$-DP.

Remark 3.4.

For the canonical setting of Gaussian data with $A_{i}=x_{i}x_{i}^{\top}$ and $x_{i}\sim{\cal N}(0,\Sigma)$, we have, for example from [62, Lemma 1.12], that $M=O(d\log(n))$, $V=O(d)$, $K=4$, $a=1$, and $\gamma^{2}=O(1)$. Theorem 3.3 implies the following error rate:

$$\sin^{2}\left(w_{n},v_{1}\right)\;=\;\tilde{O}\Big{(}\kappa^{2}\Big{(}\frac{d}{n}+\frac{d^{2}\log^{2}(1/\delta)}{\varepsilon^{2}n^{2}}\Big{)}\Big{)}\;.\tag{6}$$

Compared to the lower bound in Theorem 5.3, this is already near-optimal. However, for general distributions satisfying Assumption 1, Algorithm 2 (in particular the second term in Eq. (5)) can be significantly sub-optimal. We explain this second weakness of Private Oja's Algorithm in the following section (the first weakness being the restriction to $\varepsilon=O(\sqrt{\log(n/\delta)/n})$).

4 Two remaining challenges

We explain the two remaining challenges in Private Oja's Algorithm and propose techniques to overcome these challenges, leading to the design of DP-PCA (Algorithm 3).

First challenge: restricted range of $\varepsilon=O(\sqrt{\log(n/\delta)/n})$. This is due to the large number, $n$, of iterations, which necessitates the use of amplification by shuffling in Theorem D.1. We reduce the number of iterations with minibatch SGD. For $T=O(\log^{2}n)$ and $t=1,2,\ldots,T$, we repeat

$$w^{\prime}_{t}\leftarrow w_{t-1}+\frac{\eta_{t}}{B}\sum_{i=1+B(t-1)}^{Bt}{\rm clip}_{\beta}(A_{i}w_{t-1})+\frac{2\eta_{t}\beta\alpha}{B}z_{t}\;,\;\;\text{ and }\;\;w_{t}\leftarrow w^{\prime}_{t}/\|w_{t}^{\prime}\|\;,\tag{7}$$

where $z_{t}\sim{\cal N}(0,\mathbf{I}_{d})$ and the minibatch size is $B=\lfloor n/T\rfloor$. Since the dataset is accessed only $T=O(\log^{2}n)$ times, the end-to-end privacy is analyzed with the serial composition (Lemma B.3) instead of the amplification by shuffling. This ensures $(\varepsilon,\delta)$-DP for any $\varepsilon=O(1)$, resolving the first challenge, and still achieves the utility guarantee of Eq. (5).
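The update in Eq. (7) can be sketched in NumPy as follows; the clipping threshold `beta` and the per-step noise multiplier `alpha` are taken as given, and the interface is our own.

```python
import numpy as np

def minibatch_private_oja_step(w, batch, eta_t, beta, alpha, rng):
    """One step of Eq. (7): average of clipped gradients over a minibatch of
    matrices A_i, plus Gaussian noise scaled by 2 * eta_t * beta * alpha / B."""
    B = len(batch)
    grads = np.stack([A @ w for A in batch])                  # A_i w_{t-1}, shape (B, d)
    norms = np.maximum(np.linalg.norm(grads, axis=1, keepdims=True), 1e-12)
    grads = grads * np.minimum(1.0, beta / norms)             # clip_beta per sample
    w_new = w + (eta_t / B) * grads.sum(axis=0) \
              + (2.0 * eta_t * beta * alpha / B) * rng.standard_normal(w.shape)
    return w_new / np.linalg.norm(w_new)
```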

Second challenge: excessive noise for privacy. This is best explained with an example.

Example 4.1 (Signal and noise separation).

Consider a setting with $A_{i}=x_{i}x_{i}^{\top}$ and $x_{i}=s_{i}+n_{i}$, where $s_{i}=v$ with probability half and $s_{i}=-v$ otherwise for a unit-norm vector $v$, and $n_{i}\sim{\cal N}(0,\sigma^{2}\mathbf{I})$. We want to find the principal component of $\Sigma=\mathbb{E}[x_{i}x_{i}^{\top}]=vv^{\top}+\sigma^{2}\mathbf{I}$, which is $v$. This construction separates the signal and the noise. For $A_{i}=vv^{\top}+s_{i}n_{i}^{\top}+n_{i}s_{i}^{\top}+n_{i}n_{i}^{\top}$, the signal component $vv^{\top}$ is deterministic because the signs cancel. The noise component $s_{i}n_{i}^{\top}+n_{i}s_{i}^{\top}+n_{i}n_{i}^{\top}$ is random. We can control the Signal-to-Noise Ratio (SNR), $1/\sigma^{2}$, by changing $\sigma^{2}$, and we are particularly interested in the regime where $\sigma^{2}$ is small. For $\sigma^{2}<1$, this example satisfies Assumption 1 with $\lambda_{1}=1+\sigma^{2}$, $\lambda_{2}=\sigma^{2}$, $V=O(d\sigma^{2})$, $K=O(1)$, $a=1$, and $\gamma^{2}=\sigma^{2}$. Substituting this into Eq. (5), Private Oja's Algorithm achieves

$$\sin^{2}(w_{n},v_{1})\;=\;\tilde{O}\Big{(}\frac{\sigma^{2}d}{n}+\frac{d^{2}\log(1/\delta)}{\varepsilon^{2}n^{2}}\Big{)}\;,\tag{8}$$

where we are interested in the regime $\sigma^{2}<1$.
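The construction in Example 4.1 can be simulated with the following sketch; the choice $v=e_{1}$ is an arbitrary illustrative choice.

```python
import numpy as np

def sample_example_4_1(n, d, sigma, seed=0):
    """Sample x_i = s_i + n_i as in Example 4.1: s_i = +/- v with equal probability
    for a fixed unit vector v, and n_i ~ N(0, sigma^2 I). Returns the samples and v,
    the principal component of E[x_i x_i^T] = v v^T + sigma^2 I."""
    rng = np.random.default_rng(seed)
    v = np.zeros(d)
    v[0] = 1.0                                      # v = e_1 (arbitrary unit vector)
    signs = rng.choice([-1.0, 1.0], size=(n, 1))    # s_i = +/- v with probability 1/2
    x = signs * v + sigma * rng.standard_normal((n, d))
    return x, v
```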

This is problematic since the second term, due to the DP noise, does not vanish as the randomness $\sigma^{2}$ in the data decreases. We do not observe this for Gaussian data, where the signal and the noise scale proportionally, as shown below. We reduce the noise added for privacy by switching from simple norm clipping, which adds noise proportional to the norm of the gradients, to private estimation, which only requires the noise to scale as the range of the gradients, i.e., the maximum distance between two gradients in the minibatch. The toy example above showcases that the range can be arbitrarily smaller than the maximum norm (Fig. 1). We want to emphasize that although the idea of using private estimation within an optimization has been conceptually proposed in abstract settings, e.g., in [44], DP-PCA is the first setting where ($i$) such separation between the norm and the range of the gradients holds under any statistical model, and hence ($ii$) the long line of recent advances in private estimation provides a significant gain over the simple DP-SGD [1].

Figure 1: 2-d PCA under the Gaussian data from Remark 3.4 (left) shows that the average gradient (red arrow) is smaller than the range of the minibatch of 400 gradients (blue dots). Under Example 4.1 (right), the range can be made arbitrarily smaller than the average gradient by changing $\sigma^{2}$.

5 Differentially Private Principal Component Analysis (DP-PCA)

Combining the two ideas of minibatch SGD and private mean estimation, we propose DP-PCA. We use minibatch SGD with minibatch size $B=O(n/\log^{2}n)$ to allow for the larger range of $\varepsilon=O(1)$. We use Private Mean Estimation to add an appropriate level of noise chosen adaptively according to Private Eigenvalue Estimation. We describe the details of both sub-routines in Section 6.

Input: $S=\{A_{i}\}_{i=1}^{n}$, privacy $(\varepsilon,\delta)$, batch size $B\in\mathbb{Z}_{+}$, learning rates $\{\eta_{t}\}_{t=1}^{\lfloor n/B\rfloor}$, probability $\zeta\in(0,1)$
1 Choose $w_{0}$ uniformly at random from the unit sphere
2 for $t=1,2,\ldots,T=\lfloor n/B\rfloor$ do
3       Run Private Top Eigenvalue Estimation (Algorithm 4) with $(\varepsilon/2,\delta/2)$-DP and failure probability $\zeta/(2T)$ on $\{A_{B(t-1)+i}w_{t-1}\}_{i=1}^{\lfloor B/2\rfloor}$. Let the returned estimate be $\hat{\Lambda}_{t}>0$.
4       Run Private Mean Estimation (Algorithm 5) with $(\varepsilon/2,\delta/2)$-DP, failure probability $\zeta/(2T)$, and the estimated eigenvalue $2\hat{\Lambda}_{t}$ on $\{A_{B(t-1)+\lfloor B/2\rfloor+i}w_{t-1}\}_{i=1}^{\lfloor B/2\rfloor}$. Let the returned mean gradient estimate be $\hat{g}_{t}\in\mathbb{R}^{d}$.
5       $w_{t}^{\prime}\leftarrow w_{t-1}+\eta_{t}\hat{g}_{t}$,   $w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|$
Return $w_{T}$
Algorithm 3 Differentially Private Principal Component Analysis (DP-PCA)
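A minimal NumPy sketch of the DP-PCA loop follows; the two sub-routines (Algorithms 4 and 5) are passed in as callables, and their interfaces here are our own assumptions.

```python
import numpy as np

def dp_pca(A_list, eps, delta, B, eta, zeta,
           private_top_eigenvalue, private_mean_estimation, seed=0):
    """Sketch of Algorithm 3 (DP-PCA). Each minibatch is split in half: the first
    half is used to privately estimate the top eigenvalue of the gradient covariance,
    the second half is used for the private mean estimate of the gradient."""
    rng = np.random.default_rng(seed)
    n, d = len(A_list), A_list[0].shape[0]
    T = n // B
    w = rng.standard_normal(d)
    w /= np.linalg.norm(w)
    for t in range(1, T + 1):
        batch = A_list[B * (t - 1): B * t]
        grads = np.stack([A @ w for A in batch])
        half = B // 2
        Lambda_hat = private_top_eigenvalue(grads[:half], eps / 2, delta / 2, zeta / (2 * T))
        g_hat = private_mean_estimation(grads[half:], 2 * Lambda_hat, eps / 2, delta / 2, zeta / (2 * T))
        w = w + eta(t) * g_hat
        w /= np.linalg.norm(w)
    return w
```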

We show an upper bound on the error achieved by DP-PCA under an appropriate choice of the learning rate. We provide a complete proof in Appendix E.1, which includes the explicit choice of the learning rate $\eta_{t}$ in Eq. (60); a proof sketch is provided in Section 6.1.

Theorem 5.1.

For $\varepsilon\in(0,0.9)$, DP-PCA guarantees $(\varepsilon,\delta)$-DP for all $S$, $B$, $\zeta$, and $\delta$. Given $n$ i.i.d. samples $\{A_{i}\in\mathbb{R}^{d\times d}\}_{i=1}^{n}$ satisfying Assumption 1 with parameters $(\Sigma,M,V,K,\kappa,a,\gamma^{2})$, if

$$n\;=\;\tilde{O}\Big{(}\,e^{\kappa^{2}}+\frac{d^{1/2}(\log(1/\delta))^{3/2}}{\varepsilon}+\kappa M+\kappa^{2}V+\frac{d\,\kappa\,\gamma\,(\log(1/\delta))^{1/2}}{\varepsilon}\,\Big{)}\;,\tag{9}$$

with a large enough constant and $\delta\leq 1/n$, then there exists a positive universal constant $c_{1}$ and a choice of learning rate $\eta_{t}$ that depends on $(t,M,V,K,a,\lambda_{1},\lambda_{1}-\lambda_{2},n,d,\varepsilon,\delta)$ such that $T=\lfloor n/B\rfloor$ steps of DP-PCA in Algorithm 3 with choices of $\zeta=0.01$ and $B=c_{1}n/(\log n)^{2}$ output $w_{T}$ such that, with probability $0.99$,

$$\sin\left(w_{T},v_{1}\right)\;=\;\widetilde{O}\left(\kappa\Big{(}\sqrt{\frac{V}{n}}+\frac{\gamma d\sqrt{\log(1/\delta)}}{\varepsilon n}\,\Big{)}\,\right)\;,\tag{10}$$

where $\widetilde{O}(\cdot)$ hides poly-logarithmic factors in $n$, $d$, $1/\varepsilon$, and $\log(1/\delta)$, and polynomial factors in $K$.

We further interpret this analysis and show that ($i$) DP-PCA is nearly optimal when the data is from a Gaussian distribution, by comparing against a lower bound (Theorem 5.3), and ($ii$) DP-PCA significantly improves upon Private Oja's Algorithm under Example 4.1. We discuss the necessity of some of the assumptions at the end of this section, including how to agnostically find an appropriate learning rate schedule.

Near-optimality of DP-PCA under Gaussian distributions. Consider the case of i.i.d. samples $\{x_{i}\}_{i=1}^{n}$ from the Gaussian distribution of Remark 3.4.

Corollary 5.2 (Upper bound; Gaussian distribution).

Under the hypotheses of Theorem 5.1 and $\{A_{i}=x_{i}x_{i}^{\top}\}_{i=1}^{n}$ with Gaussian random vectors $x_{i}$, after $T=n/B$ steps, DP-PCA outputs $w_{T}$ that achieves, with probability $0.99$,

$$\sin(w_{T},v_{1})\;=\;\tilde{O}\left(\kappa\left(\sqrt{\frac{d}{n}}+\frac{d\sqrt{\log(1/\delta)}}{\varepsilon n}\right)\right)\;.\tag{11}$$

We prove a nearly matching lower bound, up to factors of $\sqrt{\lambda_{1}/\lambda_{2}}$ and $\sqrt{\log(1/\delta)}$. One caveat is that the lower bound assumes pure DP with $\delta=0$. We do not yet have a lower bound technique for approximate DP that is tight, and all known approximate-DP lower bounds have gaps to the achievable upper bounds in their dependence on $\log(1/\delta)$, e.g., [6, 56]. We provide a proof in Appendix C.1.

Theorem 5.3 (Lower bound; Gaussian distribution).

Let $\mathcal{M}_{\varepsilon}$ be a class of $(\varepsilon,0)$-DP estimators that map $n$ i.i.d. samples to an estimate $\hat{v}\in\mathbb{R}^{d}$. A set of Gaussian distributions with $(\lambda_{1},\lambda_{2})$ as the first and second eigenvalues of the covariance matrix is denoted by $\mathcal{P}_{(\lambda_{1},\lambda_{2})}$. There exists a universal constant $C>0$ such that

$$\inf_{\hat{v}\in\mathcal{M}_{\varepsilon}}\;\sup_{P\in\mathcal{P}_{(\lambda_{1},\lambda_{2})}}\;\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1})\right]\;\geq\;C\min\left(\kappa\left(\sqrt{\frac{d}{n}}+\frac{d}{\varepsilon n}\right)\sqrt{\frac{\lambda_{2}}{\lambda_{1}}},\,1\right)\;.\tag{12}$$

Comparisons with Private Oja's Algorithm. We demonstrate that DP-PCA can significantly improve upon Private Oja's Algorithm with Example 4.1, where DP-PCA achieves an error bound of $\sin(w_{T},v_{1})=\tilde{O}\big{(}\sigma\sqrt{d/n}+\sigma d\sqrt{\log(1/\delta)}/(\varepsilon n)\big{)}$. As the noise power $\sigma^{2}$ decreases, DP-PCA achieves a vanishing error, whereas Private Oja's Algorithm has a non-vanishing error in Eq. (8). This follows from the fact that the second term in the error bound in Eq. (10) scales as $\gamma$, which can be made arbitrarily smaller than the second term in Eq. (5), which scales as $(\gamma+1)$. Further, the error bound for DP-PCA holds for any $\varepsilon=O(1)$, whereas Private Oja's Algorithm requires a significantly smaller $\varepsilon=O(\sqrt{\log(n/\delta)/n})$.

Remarks on the assumptions of Theorem 5.1. We have an exponential dependence of the sample complexity on $\kappa^{2}$, i.e., $n\geq\exp(\kappa^{2})$. This ensures we have a large enough $T=\lfloor n/B\rfloor$ to reduce the non-dominant second term in Eq. (2), in balancing the learning rate $\eta_{t}$ and $T$ (shown explicitly in Eqs. (62) and (63) in the Appendix). It is possible to get rid of this exponential dependence at the cost of an extra term of $\tilde{O}(\kappa^{4}\gamma^{2}d^{2}\log(1/\delta)/(\varepsilon n)^{2})$ in the error rate in Eq. (10), by selecting a slightly larger $T=c\kappa^{2}\log^{2}n$. A Gaussian-like tail bound as in Assumption A.4 is necessary to get the desired upper bound scaling as $\tilde{O}(d\sqrt{\log(1/\delta)}/(\varepsilon n))$ in Eq. (11), for example. The next lower bound shows that, without such an assumption on the tail, the error due to privacy scales as $\Omega(\sqrt{(d\wedge\log(1/\delta))/(\varepsilon n)})$. We believe that the dependence on $\delta$ is loose, and it might be possible to get a tighter lower bound using [45]. We provide a proof and other lower bounds in Appendix C.

Theorem 5.4 (Lower bound without Assumption A.4).

Let $\mathcal{M}_{\varepsilon}$ be a class of $(\varepsilon,\delta)$-DP estimators that map $n$ i.i.d. samples to an estimate $\hat{v}\in\mathbb{R}^{d}$. A set of distributions satisfying Assumptions A.1-A.3 with $M=\tilde{O}(d+\sqrt{n\varepsilon/d})$, $V=O(d)$, and $\gamma=O(1)$ is denoted by $\tilde{\mathcal{P}}$. For $d\geq 2$, there exists a universal constant $C>0$ such that

$$\inf_{\hat{v}\in\mathcal{M}_{\varepsilon}}\;\sup_{P\in\tilde{\mathcal{P}}}\;\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1})\right]\;\geq\;C\kappa\min\left(\sqrt{\frac{d\wedge\log\left(\left(1-e^{-\varepsilon}\right)/\delta\right)}{\varepsilon n}},\,1\right)\;.\tag{13}$$

Currently, DP-PCA requires choices of the learning rates, $\eta_{t}$, that depend on possibly unknown quantities. Since we can privately evaluate the quality of our solution, one can instead run multiple instances of DP-PCA with varying $\eta_{t}=c_{1}/(c_{2}+t)$ and find the best choice of $c_{1}>0$ and $c_{2}>0$. Let $w_{T}(c_{1},c_{2})$ denote the resulting solution for one instance of $\{\eta_{t}=c_{1}/(c_{2}+t)\}_{t=1}^{T}$. We first set a target error $\zeta$. For each round $i=1,\ldots$, we run the algorithm for $(c_{1},c_{2})\in\{2^{i-1},2^{-i+1}\}\times\{2^{-i+1},2^{-i+2},\ldots,2^{i-1}\}$ and $(c_{1},c_{2})\in\{2^{-i+1},2^{-i+2},\ldots,2^{i-1}\}\times\{2^{i-1},2^{-i+1}\}$, and compute each $\sin(w_{T}(c_{1},c_{2}),v_{1})$ privately, each with privacy budget $\varepsilon_{i}=\frac{\varepsilon}{2^{i+1}(2i-1)}$, $\delta_{i}=\frac{\delta}{2^{i+1}(2i-1)}$. We terminate the search once there is a $w_{T}(c_{1},c_{2})$ satisfying $\sin(w_{T}(c_{1},c_{2}),v_{1})\leq\zeta$. This search meta-algorithm terminates in a logarithmic number of rounds, and the total sample complexity only blows up by a poly-logarithmic factor.
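A sketch of this doubling search is given below; `dp_pca_run(c1, c2)` and `private_error(w, eps_i, delta_i)` are assumed callables standing in for one run of DP-PCA and for the private evaluation of its error, respectively.

```python
from itertools import chain

def search_learning_rate(dp_pca_run, private_error, eps, delta, zeta, max_rounds=20):
    """Doubling search over (c1, c2) for the schedule eta_t = c1 / (c2 + t):
    at round i, try the boundary of the grid {2^{-i+1}, ..., 2^{i-1}}^2 and stop
    as soon as a candidate's privately estimated error is at most zeta."""
    for i in range(1, max_rounds + 1):
        grid = [2.0 ** k for k in range(-i + 1, i)]       # {2^{-i+1}, ..., 2^{i-1}}
        extremes = [2.0 ** (i - 1), 2.0 ** (-i + 1)]
        candidates = set(chain(((a, b) for a in extremes for b in grid),
                               ((a, b) for a in grid for b in extremes)))
        eps_i = eps / (2 ** (i + 1) * (2 * i - 1))        # per-evaluation privacy budget
        delta_i = delta / (2 ** (i + 1) * (2 * i - 1))
        for c1, c2 in candidates:
            w = dp_pca_run(c1, c2)
            if private_error(w, eps_i, delta_i) <= zeta:
                return w, (c1, c2)
    return None
```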

6 Private mean estimation for the minibatch stochastic gradients

DP-PCA critically relies on private mean estimation to reduce the variance of the noise required to achieve $(\varepsilon,\delta)$-DP. We follow a common recipe from [49, 43, 47, 9, 16]. First, we privately find an approximate range of the gradients in the minibatch (Alg. 4). Next, we apply the Gaussian mechanism to the truncated gradients, where the truncation is tailored to the estimated range (Alg. 5).

Step 1: estimating the range. We need to find an approximate range of the minibatch of gradients in order to adaptively truncate the gradients and bound the sensitivity. Inspired by a private preconditioning mechanism designed for mean estimation with unknown covariance from [46], we propose to use the privately estimated top eigenvalue of the covariance matrix of the gradients. For details on the version of the histogram learner we use in Alg. 4 in Appendix E.2, we refer to [55, Lemma D.1]. Unlike the private preconditioning of [46], which estimates all eigenvalues and requires $n=\widetilde{O}(d^{3/2}\log(1/\delta)/\varepsilon)$ samples, we only require the top eigenvalue, and hence the next theorem shows that we only need $n=\widetilde{O}(d\log(1/\delta)/\varepsilon)$.

Theorem 6.1.

Algorithm 4 is $(\varepsilon,\delta)$-DP. Let $g_{i}=A_{i}u$ for some fixed vector $u$, where $A_{i}$ satisfies A.1 and A.4 in Assumption 1, such that the mean is $\mathbb{E}[g_{i}]=\Sigma u$ and the covariance is $\mathbb{E}[(g_{i}-\Sigma u)(g_{i}-\Sigma u)^{\top}]=\lambda_{1}^{2}H_{u}$. With a large enough sample size scaling as

$$B\;=\;O\left(\frac{K^{2}\,d\,\log^{1+2a}(nd\log(1/(\delta\zeta))/\varepsilon)\log(1/(\zeta\delta))}{\varepsilon}\right)\;=\;\tilde{O}\left(\frac{K^{2}\,d\,\log(1/\delta)}{\varepsilon}\right)\;,\tag{14}$$

Algorithm 4 outputs $\hat{\Lambda}$ achieving $\hat{\Lambda}\in\left[(1/\sqrt{2})\lambda_{1}^{2}\|H_{u}\|_{2},\,\sqrt{2}\lambda_{1}^{2}\|H_{u}\|_{2}\right]$ with probability $1-\zeta$, where the pair $(K>0,a>0)$ parametrizes the tail of the distribution in A.4 and $\tilde{O}(\cdot)$ hides logarithmic factors in $B$, $d$, $1/\zeta$, $\log(1/\delta)$, and $\varepsilon$.

We provide a proof in Appendix E.2. There are other ways to privately estimate the range. Some approaches require known bounds such as $\sigma_{\rm min}^{2}\leq\lambda_{1}^{2}(H_{u})_{ii}\leq\sigma_{\rm max}^{2}$ for all $i\in[d]$ [49], and other agnostic approaches are more involved, such as the instance-optimal universal estimators of [18].

Step 2: Gaussian mechanism for mean estimation. Once we have a good estimate of the top eigenvalue from the previous step, we use it to select the bin size of the private histogram and compute the truncated empirical mean. Since the truncated empirical mean has bounded sensitivity, we can use the Gaussian mechanism to achieve DP. The algorithm is now standard in DP mean estimation, e.g., [49, 43]. However, the analysis is slightly different since our assumptions on the $g_{i}$'s are different. For completeness, we provide Algorithm 5 in Appendix E.3.
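Algorithm 5 itself is deferred to Appendix E.3; the following is only a heavily simplified sketch of the truncate-then-add-noise step, in which a non-private median stands in for the privately (histogram-based) located center and the truncation half-width follows the scale suggested by Lemma 6.2.

```python
import numpy as np

def private_mean_sketch(grads, Lambda_hat, eps, delta, K=1.0, a=1.0, zeta=0.01, seed=0):
    """Simplified sketch of Step 2: truncate each gradient coordinate-wise to a cube
    of half-width r around a rough center, then release the truncated mean with the
    Gaussian mechanism. NOT Algorithm 5 itself (which locates the center privately)."""
    rng = np.random.default_rng(seed)
    B, d = grads.shape
    r = 3.0 * K * np.sqrt(Lambda_hat) * np.log(B * d / zeta) ** a   # truncation half-width
    center = np.median(grads, axis=0)        # placeholder for the privately found center
    clipped = np.clip(grads, center - r, center + r)
    # One sample can move each coordinate of the truncated mean by at most 2r/B,
    # so the L2 sensitivity is at most 2 r sqrt(d) / B.
    sigma = (2.0 * r * np.sqrt(d) / B) * np.sqrt(2.0 * np.log(1.25 / delta)) / eps
    return clipped.mean(axis=0) + sigma * rng.standard_normal(d)
```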

The next lemma shows that Private Mean Estimation is $(\varepsilon,\delta)$-DP and that, with high probability, clipping is not applied to any of the gradients. The returned private mean, therefore, is distributed as a spherical Gaussian centered at the empirical mean of the gradients. This result requires a good estimate of the top eigenvalue from Alg. 4, such that $\hat{\Lambda}\simeq\lambda_{1}^{2}\|H_{u}\|_{2}$. This analysis implies that we get an unbiased estimate of the gradient mean (which is critical in the analysis) with noise scaling as $\tilde{O}(\lambda_{1}\gamma\sqrt{d\log(1/\delta)}/(\varepsilon B))$, where $\gamma^{2}=\max_{u:\|u\|=1}\|H_{u}\|_{2}$ (which is critical in getting the tight sample complexity in the second term of the final utility guarantee in Eq. (10)). We provide a proof in Appendix E.3.

Lemma 6.2.

For $\varepsilon\in(0,0.9)$ and any $\delta\in(0,1)$, Algorithm 5 is $(\varepsilon,\delta)$-DP. Let $g_{i}=A_{i}u$ for some fixed vector $u$, where $A_{i}$ satisfies A.1 and A.4 in Assumption 1, such that the mean is $\mathbb{E}[g_{i}]=\Sigma u$ and the covariance is $\mathbb{E}[(g_{i}-\Sigma u)(g_{i}-\Sigma u)^{\top}]=\lambda_{1}^{2}H_{u}$. If $\hat{\Lambda}\in[\lambda_{1}^{2}\|H_{u}\|_{2}/\sqrt{2},\,\sqrt{2}\lambda_{1}^{2}\|H_{u}\|_{2}]$, $\delta\leq 1/B$, and $B=\Omega((\sqrt{d\log(1/\delta)}/\varepsilon)\log(d/(\zeta\delta)))$, then, with probability $1-\zeta$, we have $g_{i}\in\bar{g}+\left[-3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta),\,3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)\right]^{d}$ for all $i\in[B]$.

6.1 Proof sketch of Theorem 5.1

We choose $B=\Theta(n/\log^{2}n)$ such that we access the dataset only $T=\Theta(\log^{2}n)$ times. Hence we do not need to rely on amplification by shuffling. To add Gaussian noise that scales as the standard deviation of the gradients in each minibatch (as opposed to the potentially excessively large mean of the gradients), DP-PCA adopts techniques from recent advances in private mean estimation. Namely, we first get a private and accurate estimate of the range from Theorem 6.1. Using this estimate, $\hat{\Lambda}$, Private Mean Estimation returns an unbiased estimate of the empirical mean of the gradients, as long as no truncation has been applied, which is ensured by Lemma 6.2. This gives

$$w_{t}^{\prime}\;\leftarrow\;w_{t-1}+\eta_{t}\left(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}w_{t-1}+\beta_{t}z_{t}\right)\;,\tag{15}$$

for $z_{t}\sim{\cal N}(0,\mathbf{I})$ and $\beta_{t}=\frac{8K\sqrt{2\hat{\Lambda}_{t}}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}$. Using the rotation invariance of spherical Gaussian random vectors and the fact that $\|w_{t-1}\|=1$, we can reformulate it as

$$w_{t}^{\prime}\;\leftarrow\;w_{t-1}+\eta_{t}\underbrace{\left(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}\right)}_{\tilde{A}_{t}}w_{t-1}\;.\tag{16}$$

This process can be analyzed with Theorem 2.2, with $\tilde{A}_{t}$ substituted for $A_{t}$.

7 Discussion

For the canonical task of computing the principal component from i.i.d. samples, we give the first result achieving a near-optimal error rate. This critically relies on two ideas: minibatch SGD and private mean estimation. In particular, private mean estimation plays a critical role when the range of the gradients is significantly smaller than their norm; we achieve an optimal error rate that cannot be achieved with the standard recipe of gradient clipping, or even with the more sophisticated adaptive clipping of [4].

Assumption A.4 can be relaxed to heavy-tail bounds with a bounded $k$-th moment on $A_{i}$, in which case we expect the second term in Eq. (10) to scale as $O(d(\sqrt{\log(1/\delta)}/(\varepsilon n))^{1-1/k})$, drawing an analogy from a similar trend in a computationally inefficient DP-PCA without spectral gap [56, Corollary 6.10]. When a fraction of the data is corrupted, recent advances in [74, 51, 40] provide optimal algorithms for PCA. However, the existing approach of [56] for robust and private PCA is computationally intractable. Borrowing ideas from robust and private mean estimation in [55], one can design an efficient algorithm, but at the cost of sub-optimal sample complexity. It is an interesting direction to design an optimal and robust version of DP-PCA. Our lower bounds are loose in their dependence on $\log(1/\delta)$. Recently, a promising lower bound technique was introduced in [45] that might close this gap.

Acknowledgement

This work is supported in part by a Google faculty research award and NSF grants CNS-2002664, IIS-1929955, DMS-2134012, CCF-2019844 as a part of the NSF Institute for Foundations of Machine Learning (IFML), and CNS-2112471 as a part of the NSF AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE).

References

  • [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
  • [2] John M Abowd. The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2867–2867, 2018.
  • [3] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private assouad, fano, and le cam. In Algorithmic Learning Theory, pages 48–78. PMLR, 2021.
  • [4] Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34, 2021.
  • [5] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory, pages 284–309. PMLR, 2016.
  • [6] Rina Foygel Barber and John C Duchi. Privacy and statistical risk: Formalisms and minimax bounds. arXiv preprint arXiv:1412.4451, 2014.
  • [7] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32, 2019.
  • [8] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
  • [9] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. Coinpress: Practical private mean and covariance estimation. arXiv preprint arXiv:2006.06618, 2020.
  • [10] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.
  • [11] Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, and Lydia Zakynthinou. Covariance-aware private mean estimation without private covariance estimation. Advances in Neural Information Processing Systems, 34, 2021.
  • [12] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. J. Mach. Learn. Res., 20:94–1, 2019.
  • [13] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. Advances in Neural Information Processing Systems, 32, 2019.
  • [14] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.
  • [15] Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research, 14(1):2905–2943, 2013.
  • [16] Christian Covington, Xi He, James Honaker, and Gautam Kamath. Unbiased statistical estimation and valid confidence intervals under differential privacy. arXiv preprint arXiv:2110.14465, 2021.
  • [17] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin IP Rubinstein. Robust and private bayesian inference. In International Conference on Algorithmic Learning Theory, pages 291–305. Springer, 2014.
  • [18] Wei Dong and Ke Yi. Universal private estimators. arXiv preprint arXiv:2111.02598, 2021.
  • [19] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.
  • [20] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
  • [21] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.
  • [22] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
  • [23] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
  • [24] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 11–20, 2014.
  • [25] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.
  • [26] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067, 2014.
  • [27] Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Tight and robust private mean estimation with few users. arXiv preprint arXiv:2110.11876, 2021.
  • [28] Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies, 2016(3):41–61, 2016.
  • [29] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
  • [30] Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 954–964. IEEE, 2022.
  • [31] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pages 535–544. PMLR, 2018.
  • [32] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. arXiv preprint arXiv:1603.07294, 2016.
  • [33] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private mean estimation: z-test and tight confidence intervals. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2545–2554. PMLR, 2019.
  • [34] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. Advances in neural information processing systems, 27, 2014.
  • [35] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1255–1268, 2012.
  • [36] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 331–340, 2013.
  • [37] Samuel B Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. arXiv preprint arXiv:2111.12981, 2021.
  • [38] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. arXiv preprint arXiv:2107.11136, 2021.
  • [39] Prateek Jain, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming pca: Matching matrix bernstein and near-optimal finite sample guarantees for oja’s algorithm. In Conference on learning theory, pages 1147–1164. PMLR, 2016.
  • [40] Arun Jambulapati, Jerry Li, and Kevin Tian. Robust sub-gaussian principal component analysis and width-independent schatten packing. Advances in Neural Information Processing Systems, 33, 2020.
  • [41] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Steven Z Wu. Locally private gaussian estimation. Advances in Neural Information Processing Systems, 32, 2019.
  • [42] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376–1385. PMLR, 2015.
  • [43] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.
  • [44] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. Improved rates for differentially private stochastic convex optimization with heavy-tailed data. arXiv preprint arXiv:2106.01336, 2021.
  • [45] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. arXiv preprint arXiv:2205.08532, 2022.
  • [46] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded gaussians. arXiv preprint arXiv:2111.04609, 2021.
  • [47] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. arXiv preprint arXiv:2002.09464, 2020.
  • [48] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1395–1414. SIAM, 2013.
  • [49] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908, 2017.
  • [50] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1. JMLR Workshop and Conference Proceedings, 2012.
  • [51] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. Robust meta-learning for mixed linear regression with small batches. Advances in Neural Information Processing Systems, 33, 2020.
  • [52] Pravesh K Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. arXiv preprint arXiv:2112.03548, 2021.
  • [53] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
  • [54] Mengchu Li, Thomas B Berrett, and Yi Yu. On robustness and local differential privacy. arXiv preprint arXiv:2201.00751, 2022.
  • [55] Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34, 2021.
  • [56] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. arXiv preprint arXiv:2111.06578, 2021.
  • [57] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007.
  • [58] Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30, 2009.
  • [59] Kentaro Minami, HItomi Arai, Issei Sato, and Hiroshi Nakagawa. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pages 956–964, 2016.
  • [60] Darakhshan J Mir. Differential privacy: an exploration of the privacy-utility landscape. Rutgers The State University of New Jersey-New Brunswick, 2013.
  • [61] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273, 1982.
  • [62] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 813(814):46, 2015.
  • [63] Or Sheffet. Old techniques in differentially private linear regression. In Algorithmic Learning Theory, pages 789–827. PMLR, 2019.
  • [64] Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint arXiv:1709.02753, 2017.
  • [65] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
  • [66] Christos Tzamos, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Ilias Zadik. Optimal private median estimation under minimal distributional assumptions. Advances in Neural Information Processing Systems, 33:3301–3311, 2020.
  • [67] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
  • [68] Duy Vu and Aleksandra Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops, pages 138–143. IEEE, 2009.
  • [69] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse pca in high dimensions. In Artificial intelligence and statistics, pages 1278–1286. PMLR, 2012.
  • [70] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
  • [71] Di Wang, Hanshen Xiao, Srinivas Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In International Conference on Machine Learning, pages 10081–10091. PMLR, 2020.
  • [72] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv preprint arXiv:1803.02596, 2018.
  • [73] Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pages 2493–2502. PMLR, 2015.
  • [74] Huan Xu, Constantine Caramanis, and Shie Mannor. Principal component analysis with contaminated data: The high dimensional case. arXiv preprint arXiv:1002.4658, 2010.

Appendix

Appendix A Related work

Our work builds upon a series of advances in private SGD [44, 8, 7, 29, 53, 71, 38] to advance our understanding of the tradeoff between privacy and sample complexity for PCA. Such tradeoffs have been studied extensively for the canonical statistical estimation problems of mean (and covariance) estimation and linear regression.

Private mean estimation. As one of the most fundamental problems in private data analysis, mean estimation was initially studied under bounded-support assumptions, and the optimal error rate is now well understood. More recently, [6] considered private mean estimation for distributions with bounded k-th moments, where the support of the data is unbounded, and provided minimax error bounds in various settings. [49] studied private mean estimation from Gaussian samples and obtained an optimal error rate. There has been a lot of recent interest in private mean estimation under various assumptions, including joint mean and covariance estimation [43, 9], heavy-tailed mean estimation [47], mean estimation for general distributions [31, 66], distribution-adaptive mean estimation [13], estimation with unbounded distribution parameters [46], mean estimation under pure differential privacy [37], local differential privacy [19, 20, 33, 41, 54], user-level differential privacy [27], Mahalanobis distance [11, 56], and robust and differentially private mean estimation [55, 52, 56].

Private linear regression. The goal of private linear regression is to learn a linear predictor of a response variable y from a set of examples \{x_{i},y_{i}\}_{i=1}^{n} while guaranteeing the privacy of the examples. Again, the work on private linear regression can be divided into two categories. In the deterministic setting, where the data is given without any probabilistic assumptions, significant advances in DP linear regression have been made [68, 50, 60, 17, 8, 73, 32, 59, 72, 63]. In the randomized setting, where each example \{\mathbf{x}_{i},y_{i}\} is drawn i.i.d. from a distribution, [21] proposes an exponential-time algorithm that achieves asymptotic consistency. [14] provides an efficient and minimax optimal algorithm under sub-Gaussian design and nearly identity covariance assumptions. Very recently, [56] gives the first exponential-time algorithm that achieves the minimax risk for general covariance matrices under sub-Gaussian and hypercontractive assumptions.

Private PCA without spectral gap. There is a long line of work on private PCA [35, 36, 34, 10, 15, 48, 24, 5]. We explain the most closely related ones in Section 2.3, providing an interpretation for the case where the covariance matrix has a spectral gap.

When there is no spectral gap, one can still learn a principal component. However, since the principal component is not unique, the error is typically measured by how much of the variance is captured in the estimated direction: 1-\hat{v}^{\top}\Sigma\hat{v}/\|\Sigma\|. [15] introduces an exponential mechanism (from [57]) which samples an estimate from the distribution f_{\widehat{\Sigma}}(\hat{v})=(1/C)\exp\{((\varepsilon n)/c^{2})\hat{v}^{\top}\widehat{\Sigma}\hat{v}\}, where C is a normalization constant ensuring that the density integrates to one. This achieves the stronger pure DP, i.e., (\varepsilon,0)-DP, but is computationally expensive; [15] does not provide a tractable implementation, and [48] provides a polynomial-time implementation with time complexity at least cubic in d. This achieves an error rate of 1-\hat{v}^{\top}\Sigma\hat{v}/\|\Sigma\|=\tilde{O}(d^{2}/(\varepsilon n)) in [15, Theorem 7] when samples are drawn from the Gaussian {\cal N}(0,\Sigma), which, when there is a spectral gap, translates into

sin(v^,v1)2\displaystyle\sin(\hat{v},v_{1})^{2} =\displaystyle= O~(κd2εn),\displaystyle\tilde{O}\Big{(}\frac{\kappa d^{2}}{\varepsilon n}\Big{)}\;, (17)

with high probability. Closest to our setting is the analysis in [56, Corollary 6.5], which proposes an exponential mechanism achieving 1-\hat{v}^{\top}\Sigma\hat{v}/\|\Sigma\|=\tilde{O}(\sqrt{d/n}+(d+\log(1/\delta))/(\varepsilon n)) with high probability under (\varepsilon,\delta)-DP and Gaussian samples, but this algorithm is computationally intractable. This rate is shown to be tight when there is no spectral gap. When there is a spectral gap, it translates into

sin(v^,v1)2\displaystyle\sin(\hat{v},v_{1})^{2} =\displaystyle= O~(κ(dn+d+log(1/δ)εn)).\displaystyle\tilde{O}\Big{(}\kappa\Big{(}\sqrt{\frac{d}{n}}+\frac{d+\log(1/\delta)}{\varepsilon n}\Big{)}\Big{)}\;. (18)

As these algorithms and the corresponding analyses are tailored for gap-free cases, they have better dependence on κ\kappa and worse dependence on d/nd/n and d/εnd/\varepsilon n, compared to the proposed DP-PCA and its error rate in Corollary 5.2.

Appendix B Preliminary on differential privacy

Lemma B.1 (Stability-based histogram [49, Lemma 2.3]).

For every K\in\mathbb{N}\cup\{\infty\}, domain \Omega, collection of disjoint bins B_{1},\ldots,B_{K} defined on \Omega, n\in\mathbb{N}, \varepsilon\geq 0, \delta\in(0,1/n), \beta>0, and \alpha\in(0,1), there exists an (\varepsilon,\delta)-differentially private algorithm M:\Omega^{n}\to\mathbb{R}^{K} such that for any set of data points X_{1},\ldots,X_{n}\in\Omega^{n}, if

  1. \hat{p}_{k}=\frac{1}{n}\sum_{X_{i}\in B_{k}}1,

  2. (\tilde{p}_{1},\ldots,\tilde{p}_{K})\leftarrow M(X_{1},\ldots,X_{n}), and

  3. n\geq\min\left\{\frac{8}{\varepsilon\beta}\log(2K/\alpha),\frac{8}{\varepsilon\beta}\log(4/\alpha\delta)\right\},

then

\mathbb{P}(|\tilde{p}_{k}-\hat{p}_{k}|\leq\beta)\geq 1-\alpha\;.
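For concreteness, the following Python sketch illustrates a stability-based histogram in the spirit of Lemma B.1: Laplace noise is added only to the occupied bins, and noisy counts below a data-independent threshold are suppressed. The noise scale and threshold are illustrative choices, not the exact constants of [49].

```python
import numpy as np

def stable_histogram(samples, bin_of, epsilon, delta, rng=None):
    """Stability-based DP histogram sketch (illustrative constants).

    samples : list or array of data points
    bin_of  : function mapping a data point to a hashable bin label
    Returns a dict mapping bin labels to privatized empirical frequencies.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(samples)
    # Empirical frequencies over the occupied bins only.
    counts = {}
    for x in samples:
        b = bin_of(x)
        counts[b] = counts.get(b, 0) + 1
    # Add Laplace noise to occupied bins and suppress small noisy counts;
    # the threshold makes bins touched by a single point unlikely to appear.
    threshold = 1.0 / n + 2.0 * np.log(2.0 / delta) / (epsilon * n)
    private_hist = {}
    for b, c in counts.items():
        noisy = c / n + rng.laplace(scale=2.0 / (epsilon * n))
        if noisy > threshold:
            private_hist[b] = noisy
    return private_hist
```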

Since we focus on one-pass algorithms where a data point is only accessed once, we need a basic parallel composition of DP.

Lemma B.2 (Parallel composition [58]).

Consider a sequence of interactive queries {qk}k=1K\{q_{k}\}_{k=1}^{K} each operating on a subset SkS_{k} of the database and each satisfying (ε,δ)(\varepsilon,\delta)-DP. If SkS_{k}’s are disjoint then the composition (q1(S1),q2(S2),,qK(SK))(q_{1}(S_{1}),q_{2}(S_{2}),\ldots,q_{K}(S_{K})) is (ε,δ)(\varepsilon,\delta)-DP.

We also utilize the following serial composition theorem.

Lemma B.3 (Serial composition [23]).

If a database is accessed with an (ε1,δ1)(\varepsilon_{1},\delta_{1})-DP mechanism and then with an (ε2,δ2)(\varepsilon_{2},\delta_{2})-DP mechanism, then the end-to-end privacy guarantee is (ε1+ε2,δ1+δ2)(\varepsilon_{1}+\varepsilon_{2},\delta_{1}+\delta_{2})-DP.

When we apply the private histogram learner to each coordinate, we require the advanced composition theorem from [42].

Lemma B.4 (Advanced composition [42]).

For ε0.9\varepsilon\leq 0.9, an end-to-end guarantee of (ε,δ)(\varepsilon,\delta)-differential privacy is satisfied if a database is accessed kk times, each with a (ε/(22klog(2/δ)),δ/(2k))(\varepsilon/(2\sqrt{2k\log(2/\delta)}),\delta/(2k))-differential private mechanism.
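As a concrete illustration of Lemma B.4, the following snippet evaluates the per-access privacy parameters when an overall (\varepsilon,\delta) budget is split across k accesses; it directly implements the formula in the lemma (valid for \varepsilon\leq 0.9).

```python
import math

def per_access_budget(epsilon, delta, k):
    """Per-mechanism (eps_i, delta_i) such that k-fold composition satisfies
    (epsilon, delta)-DP by Lemma B.4; assumes epsilon <= 0.9."""
    eps_i = epsilon / (2.0 * math.sqrt(2.0 * k * math.log(2.0 / delta)))
    delta_i = delta / (2.0 * k)
    return eps_i, delta_i

# Example: Algorithm 5 below calls the histogram learner once per coordinate,
# i.e., k = d accesses.
print(per_access_budget(epsilon=0.5, delta=1e-6, k=100))
```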

Appendix C Lower bounds

When privacy is not required, we know from Theorem 2.2 that under Assumptions A.1–A.3 we can achieve an error rate of \tilde{O}(\kappa\sqrt{V/n}). In the regime of V=O(d) and \kappa=O(1), n=O(d) samples are enough to achieve an arbitrarily small error. The next lower bound shows that we need n=\Omega(d^{2}) samples when (\varepsilon=O(1),0)-DP is required, showing that private PCA is significantly more challenging than non-private PCA when assuming only the support and moment bounds of Assumptions A.1–A.3. We provide a proof in Appendix C.3.

Theorem C.1 (Lower bound without Assumption A.4).

Let ε\mathcal{M}_{\varepsilon} be a class of (ε,0)(\varepsilon,0)-DP estimators that map nn i.i.d. samples to an estimate v^d\hat{v}\in{\mathbb{R}}^{d}. A set of distributions satisfying Assumptions A.1A.3 with M=O(dlogn)M=O(d\log n) and V=O(d)V=O(d) is denoted by 𝒫~(λ1,λ2)\tilde{\mathcal{P}}_{(\lambda_{1},\lambda_{2})}. There exists a universal constant C>0C>0 such that

infv^εsupP𝒫~(λ1,λ2)𝔼SPn[sin(v^(S),v1)]Cmin(κd2εnλ2λ1,λ2λ1).\displaystyle\inf_{\hat{v}\in\mathcal{M}_{\varepsilon}}\sup_{P\in\tilde{\mathcal{P}}_{(\lambda_{1},\lambda_{2})}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1})\right]\;\geq\;C\min\left(\frac{\kappa d^{2}}{\varepsilon n}\sqrt{\frac{\lambda_{2}}{\lambda_{1}}},\sqrt{\frac{\lambda_{2}}{\lambda_{1}}}\right)\;. (19)

We next provide the proofs of all the lower bounds.

C.1 Proof of Theorem 5.3 on the lower bound for Gaussian case

Our proof is based on the following differentially private Fano's method [3, Corollary 4].

Theorem C.2 (DP Fano’s method [3, Corollary 4]).

Let \mathcal{P} denote the family of distributions of interest and \theta:\mathcal{P}\rightarrow\Theta denote the population parameter. Our goal is to estimate \theta from i.i.d. samples x_{1},x_{2},\ldots,x_{n}\sim P\in\mathcal{P}. Let \hat{\theta}_{\varepsilon} be an (\varepsilon,0)-DP estimator. Let \rho:\Theta\times\Theta\rightarrow{\mathbb{R}}^{+} be a pseudo-metric on the parameter space \Theta. Let \mathcal{V} be an index set with finite cardinality. Define \mathcal{P}_{\mathcal{V}}=\{P_{v}:v\in\mathcal{V}\}\subset\mathcal{P} to be an indexed family of probability measures on the measurable space (\mathcal{X},\mathcal{A}). If for any v\neq v^{\prime}\in\mathcal{V},

  1. \rho(\theta(P_{v}),\theta(P_{v^{\prime}}))\geq\tau,

  2. D_{\rm KL}\left(P_{v},P_{v^{\prime}}\right)\leq\beta,

  3. D_{\rm TV}\left(P_{v},P_{v^{\prime}}\right)\leq\phi,

then

infθ^εmaxP𝒫𝔼SPn[ρ(θ^ε(S),θ(P))]max(τ2(1nβ+log(2)log(|𝒱|)),0.4τmin(1,log(|𝒱|)nϕε)).\displaystyle\inf_{\hat{\theta}_{\varepsilon}}\max_{P\in\mathcal{P}}\mathbb{E}_{S\sim P^{n}}\left[\rho(\hat{\theta}_{\varepsilon}(S),\theta(P))\right]\geq\max\left(\frac{\tau}{2}\left(1-\frac{n\beta+\log(2)}{\log(|\mathcal{V}|)}\right),0.4\tau\min\left(1,\frac{\log(|\mathcal{V}|)}{n\phi\varepsilon}\right)\right)\;.

For our problem, we are interested in the Gaussian family \mathcal{P}_{\Sigma} and the metric \rho(u,v)=\sin(u,v). By Theorem C.2, it suffices to construct such an indexed set \mathcal{V} and indexed distribution family \mathcal{P}_{\mathcal{V}}. We use the same construction as in [69, Theorem 2.1], introduced to prove a lower bound for the (non-private) sparse PCA problem. The construction is given by the following lemma.

Lemma C.3 ([69, Lemma 3.1.2]).

Let d>5d>5. For α(0,1]\alpha\in(0,1], there exists 𝒱α𝕊2d1\mathcal{V}_{\alpha}\subset\mathbb{S}_{2}^{d-1} and an absolute constant c1>0c_{1}>0 such that for every vv𝒱αv\neq v^{\prime}\in\mathcal{V}_{\alpha}, α/2vv22α\alpha/\sqrt{2}\leq\|v-v^{\prime}\|_{2}\leq\sqrt{2}\alpha and log(|𝒱α|)c1d\log(|\mathcal{V}_{\alpha}|)\geq c_{1}d.

Fix \alpha\in(0,1]. For each v\in\mathcal{V}_{\alpha}, we define \Sigma_{v}=(\lambda_{1}-\lambda_{2})vv^{\top}+\lambda_{2}\mathbf{I}_{d} and P_{v}={\cal N}(0,\Sigma_{v}). It is easy to see that \Sigma_{v} has eigenvalues \lambda_{1}>\lambda_{2}=\cdots=\lambda_{d} and that the top eigenvector of \Sigma_{v} is v. Using Lemma F.4, we know for any v\neq v^{\prime}\in\mathcal{V}_{\alpha},

α2ρ(v,v)=1v,v2α.\displaystyle\frac{\alpha}{\sqrt{2}}\leq\rho(v,v^{\prime})=\sqrt{1-\left\langle v,v^{\prime}\right\rangle^{2}}\leq\alpha\;. (20)

Using [69, Lemma 3.1.3], we know

DKL(Pv,Pv)=(λ1λ2)2λ1λ2(1v,v2)(λ1λ2)2α2λ1λ2.\displaystyle D_{\rm KL}\left(P_{v},P_{v^{\prime}}\right)=\frac{(\lambda_{1}-\lambda_{2})^{2}}{\lambda_{1}\lambda_{2}}(1-\left\langle v,v^{\prime}\right\rangle^{2})\leq\frac{(\lambda_{1}-\lambda_{2})^{2}\alpha^{2}}{\lambda_{1}\lambda_{2}}\;. (21)

Using Pinsker’s inequality, we have

DTV(Pv,Pv)DKL(Pv,Pv)2α(λ1λ2)22λ1λ2.\displaystyle D_{\rm TV}\left(P_{v},P_{v^{\prime}}\right)\leq\sqrt{\frac{D_{\rm KL}\left(P_{v},P_{v^{\prime}}\right)}{2}}\leq\alpha\sqrt{\frac{(\lambda_{1}-\lambda_{2})^{2}}{2\lambda_{1}\lambda_{2}}}\;. (22)

Now we set

α:=min(1,dc1λ1λ22n(λ1λ2)2,c1dnε2λ1λ2(λ1λ2)2)\displaystyle\alpha:=\min\left(1,\sqrt{\frac{dc_{1}\lambda_{1}\lambda_{2}}{2n(\lambda_{1}-\lambda_{2})^{2}}},\frac{c_{1}d}{n\varepsilon}\sqrt{\frac{2\lambda_{1}\lambda_{2}}{(\lambda_{1}-\lambda_{2})^{2}}}\right) (23)

It follows from Theorem C.2 and d>8d>8 that there exists a constant CC such that

infv^supP𝒫Σ𝔼SPn[sin(v^(S),v1(Σ))]Cmin((dn+dεn)λ1λ2(λ1λ2)2,1).\displaystyle\inf_{\hat{v}}\sup_{P\in\mathcal{P}_{\Sigma}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1}(\Sigma))\right]\geq C\min\left(\left(\sqrt{\frac{d}{n}}+\frac{d}{\varepsilon n}\right)\sqrt{\frac{\lambda_{1}\lambda_{2}}{(\lambda_{1}-\lambda_{2})^{2}}},1\right)\;. (24)

C.2 Proof of Theorem 5.4

We first construct an indexed set 𝒱\mathcal{V} and indexed distribution family 𝒫𝒱\mathcal{P}_{\mathcal{V}} such that xixix_{i}x_{i}^{\top} satisfies A.1, A.2 and A.3 in Assumption 1. Our construction is defined as follows.

By [3, Lemma 6], there exists a finite set \mathcal{V}\subset\mathbb{S}_{2}^{d-1}, with cardinality |\mathcal{V}|\geq 2^{d}, such that for any v\neq v^{\prime}\in\mathcal{V}, \|v-v^{\prime}\|\geq 1/2.

Let f_{(0,\mathbf{I}_{d})} denote the density function of {\cal N}(0,\mathbf{I}_{d}). Let Q_{v} be the uniform distribution on the two point masses \{\pm\alpha^{-\frac{1}{4}}v\}. Let Q_{0} be the Gaussian distribution {\cal N}(0,\mathbf{I}_{d}). For \alpha\in(0,1], we construct P_{v}:=(1-\alpha)Q_{0}+\alpha Q_{v}. It is easy to see that P_{v} is a distribution over {\mathbb{R}}^{d} with the following density function.

Pv(x)={α2, if x=α14v,α2, if x=α14v,(1α)f(0,𝐈d)(x) otherwise .\displaystyle P_{v}(x)=\begin{cases}\frac{\alpha}{2},&\text{ if }x=-\alpha^{-\frac{1}{4}}v\;,\\ \frac{\alpha}{2},&\text{ if }x=\alpha^{-\frac{1}{4}}v\;,\\ (1-\alpha)f_{(0,\mathbf{I}_{d})}(x)&\text{ otherwise }\end{cases}\;. (25)

The mean of PvP_{v} is 0. The covariance of PvP_{v} is Σv=(1α)𝐈d+αvv\Sigma_{v}=(1-\alpha)\mathbf{I}_{d}+\sqrt{\alpha}vv^{\top}. The top eigenvalue is λ1=1α+α\lambda_{1}=1-\alpha+\sqrt{\alpha}, the top eigenvector is vv, and the second eigenvalue is λ2=1α\lambda_{2}=1-\alpha. And κ=O(α1/2)\kappa=O(\alpha^{-1/2}).

If x=α1/4vx=\alpha^{-1/4}v, then xxΣv2=O(α1/2)\|xx^{\top}-\Sigma_{v}\|_{2}=O(\alpha^{-1/2}). If x𝒩(0,𝐈d)x\sim{\cal N}(0,\mathbf{I}_{d}), we know xxΣv2=O(d)\|xx^{\top}-\Sigma_{v}\|_{2}=O(d). This implies PvP_{v} satisfies A.2 in Assumption 1 with M=O((d+α1/2)log(n))M=O((d+\alpha^{-1/2})\log(n)) for nn i.i.d. samples.

It is easy to see that 𝔼[(xxΣv)(xxΣv)]2=O(d)\|\mathbb{E}[(xx^{\top}-\Sigma_{v})(xx^{\top}-\Sigma_{v})^{\top}]\|_{2}=O(d). This means PvP_{v} satisfies A.3 in Assumption 1 with V=O(d)V=O(d).

By the fact that 𝔼[x,u2]=O(1)\mathbb{E}[\left\langle x,u\right\rangle^{2}]=O(1) and 𝔼[x,u4]=O(1)\mathbb{E}[\left\langle x,u\right\rangle^{4}]=O(1) for any unit vector uu, we have γ2=𝔼[(xxΣv)uu(xxΣv)]2=O(1)\gamma^{2}=\|\mathbb{E}[(xx^{\top}-\Sigma_{v})uu^{\top}(xx^{\top}-\Sigma_{v})^{\top}]\|_{2}=O(1) for any unit vector uu.

Our proof technique is based on the following lemma.

Lemma C.4 ([6, Theorem 3]).

Fix \alpha\in(0,1]. Define P_{v}=(1-\alpha)Q_{0}+\alpha Q_{v} for v\in\mathcal{V} such that \rho(\theta(P_{v}),\theta(P_{v^{\prime}}))\geq 2t. Let \hat{\theta} be an (\varepsilon,\delta)-differentially private estimator. Then,

\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}P_{v}\left(\rho\left(\hat{\theta},\theta(P_{v})\right)\geq t\right)\geq\frac{(|\mathcal{V}|-1)\cdot\left(\frac{1}{2}e^{-\varepsilon\lceil n\alpha\rceil}-\delta\,\frac{1-e^{-\varepsilon\lceil n\alpha\rceil}}{1-e^{-\varepsilon}}\right)}{1+(|\mathcal{V}|-1)\cdot e^{-\varepsilon\lceil n\alpha\rceil}}\;. (26)

Set ρ(θ(Pv),θ(Pv))=sin(v,v)/κ\rho(\theta(P_{v}),\theta(P_{v^{\prime}}))=\sin(v,v^{\prime})/\kappa. By Lemma F.4, ρ(θ(Pv),θ(Pv))vv/κ=Ω(α)\rho(\theta(P_{v}),\theta(P_{v^{\prime}}))\geq\|v-v^{\prime}\|/\kappa=\Omega(\sqrt{\alpha}).

Lemma C.4 implies

supP𝒫~𝔼SPn[sin(v^(S),v1(Σ))]\displaystyle\sup_{P\in\tilde{\mathcal{P}}}\mathbb{E}_{S\sim P^{n}}[\sin(\hat{v}(S),v_{1}(\Sigma))] 1|𝒱|v𝒱𝔼SPvn[sin(v^(S),v1(Σv))]\displaystyle\geq\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}\mathbb{E}_{S\sim P_{v}^{n}}[\sin(\hat{v}(S),v_{1}(\Sigma_{v}))] (27)
\geq\kappa t\frac{1}{|\mathcal{V}|}\sum_{v\in\mathcal{V}}P_{v}\left(\frac{\sin(\hat{v}(S),v_{1}(\Sigma_{v}))}{\kappa}\geq t\right) (28)
κt(2d1)(12eεnαδ1eε)1+(2d1)eεnα,\displaystyle\gtrsim\kappa t\frac{(2^{d}-1)\cdot\left(\frac{1}{2}e^{-\varepsilon\lceil n\alpha\rceil}-\frac{\delta}{1-e^{-\varepsilon}}\right)}{1+(2^{d}-1)e^{-\varepsilon\lceil n\alpha\rceil}}\;, (29)

For d2d\geq 2, we know 2d1ed/22^{d}-1\geq e^{d/2}. We choose

α=min{1nε(d2ε),1nεlog(1eε4δeε),1}.\displaystyle\alpha=\min\left\{\frac{1}{n\varepsilon}\left(\frac{d}{2}-\varepsilon\right),\frac{1}{n\varepsilon}\log\left(\frac{1-e^{-\varepsilon}}{4\delta e^{\varepsilon}}\right),1\right\}\;. (30)

This implies

12eεnαδ1eε14eε(nα+1)>0.\displaystyle\frac{1}{2}e^{-\varepsilon\lceil n\alpha\rceil}-\frac{\delta}{1-e^{-\varepsilon}}\geq\frac{1}{4}e^{-\varepsilon(n\alpha+1)}>0\;. (31)

So there exists a constant C such that

infv^supP𝒫~𝔼SPn[sin(v^(S),v1(Σ))]\displaystyle\inf_{\hat{v}}\sup_{P\in\tilde{\mathcal{P}}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1}(\Sigma))\right] Cκα14ed/2eε(nα+1)1+ed/2eε(nα+1)\displaystyle\geq C\kappa\sqrt{\alpha}\frac{\frac{1}{4}e^{d/2}e^{-\varepsilon(n\alpha+1)}}{1+e^{d/2}e^{-\varepsilon(n\alpha+1)}} (32)
κmin(1,dlog((1eε)/δ)nε).\displaystyle\gtrsim\kappa\min\left(1,\sqrt{\frac{d\wedge\log\left(\left(1-e^{-\varepsilon}\right)/\delta\right)}{n\varepsilon}}\right)\;. (33)

C.3 Proof of Theorem C.1

Similar to the proof of Theorem 5.3, we use DP Fano’s method in Theorem C.2. It suffices to construct an indexed set 𝒱\mathcal{V} and indexed distribution family 𝒫𝒱\mathcal{P}_{\mathcal{V}} such that xixix_{i}x_{i}^{\top} satisfies A.1, A.2 and A.3 in Assumption 1. Our construction is defined as follows.

Let λ1>λ2>0\lambda_{1}>\lambda_{2}>0. By Lemma C.3, there exists a finite set 𝒱α𝕊2d1\mathcal{V}_{\alpha}\subset\mathbb{S}_{2}^{d-1}, with cardinality |𝒱α|=2Ω(d)|\mathcal{V}_{\alpha}|=2^{\Omega(d)}, such that for any vv𝒱αv\neq v^{\prime}\in\mathcal{V}_{\alpha}, α/2vv2\alpha/\sqrt{2}\leq\|v-v^{\prime}\|\leq\sqrt{2}, where α:=λ2/λ1\alpha:=\sqrt{\lambda_{2}/\lambda_{1}}.

Let f_{(0,S)} denote the density function of {\cal N}(0,S). We construct P_{v} over {\mathbb{R}}^{d} for v\in\mathcal{V}_{\alpha} with the following density function.

P_{v}(x)=\begin{cases}\frac{1-\lambda_{2}/\lambda_{1}}{2d},&\text{ if }x=-\sqrt{d\lambda_{1}}v\;,\\ \frac{1-\lambda_{2}/\lambda_{1}}{2d},&\text{ if }x=\sqrt{d\lambda_{1}}v\;,\\ \left(1-\frac{1-\lambda_{2}/\lambda_{1}}{d}\right)f_{(0,\frac{\lambda_{2}}{1-\frac{1-\lambda_{2}/\lambda_{1}}{d}}\mathbf{I}_{d})}(x)&\text{ otherwise }\end{cases}\;. (34)

The mean of PvP_{v} is 0. The covariance of PvP_{v} is Σv:=(λ1λ2)vv+λ2𝐈d\Sigma_{v}:=(\lambda_{1}-\lambda_{2})vv^{\top}+\lambda_{2}\mathbf{I}_{d}. It is easy to see that the top eigenvalue is λ1\lambda_{1}, the top eigenvector is vv, and the second eigenvalue is λ2\lambda_{2}.

If x=\sqrt{d\lambda_{1}}v, then \|xx^{\top}-\Sigma_{v}\|_{2}=\|(d\lambda_{1}-\lambda_{1}+\lambda_{2})vv^{\top}-\lambda_{2}\mathbf{I}_{d}\|_{2}=O(d\lambda_{1}). If x\sim{\cal N}(0,\frac{\lambda_{2}}{1-\frac{1-\lambda_{2}/\lambda_{1}}{d}}\mathbf{I}_{d}), by the fact that \frac{\lambda_{2}}{1-\frac{1-\lambda_{2}/\lambda_{1}}{d}}\leq\lambda_{1}, we know \|xx^{\top}-\Sigma_{v}\|_{2}\leq O(d\lambda_{1}). This implies P_{v} satisfies A.2 in Assumption 1 with M=O(d\log(n)) for n i.i.d. samples.

Similarly, \|\mathbb{E}[(xx^{\top}-\Sigma_{v})(xx^{\top}-\Sigma_{v})^{\top}]\|_{2}\leq\|d(\lambda_{1}^{2}-\lambda_{1}\lambda_{2})vv^{\top}+d\lambda_{2}\lambda_{1}\mathbf{I}_{d}+3\Sigma_{v}\Sigma_{v}^{\top}\|_{2}=O(d\lambda_{1}^{2}). This means P_{v} satisfies A.3 in Assumption 1 with V=O(d).

For vv𝒱αv\neq v^{\prime}\in\mathcal{V}_{\alpha}, we have DTV(Pv,Pv)=(1λ2/λ1)/dD_{\rm TV}(P_{v},P_{v^{\prime}})=(1-\lambda_{2}/\lambda_{1})/d. By Lemma F.4, sin(v,v)vv/2(λ2/λ1)/2\sin(v,v^{\prime})\geq\|v-v^{\prime}\|/\sqrt{2}\geq(\sqrt{\lambda_{2}/\lambda_{1}})/2.

By Theorem C.2, there exists a constant CC such that

infv^supP𝒫Σ𝔼SPn[sin(v^(S),v1(Σ))]Cmin(λ2λ1,d2nελ1λ2(λ1λ2)2).\displaystyle\inf_{\hat{v}}\sup_{P\in\mathcal{P}_{\Sigma}}\mathbb{E}_{S\sim P^{n}}\left[\sin(\hat{v}(S),v_{1}(\Sigma))\right]\geq C\min\left(\sqrt{\frac{\lambda_{2}}{\lambda_{1}}},\frac{d^{2}}{n\varepsilon}\sqrt{\frac{\lambda_{1}\lambda_{2}}{(\lambda_{1}-\lambda_{2})^{2}}}\right)\;. (35)

Appendix D The analysis of Private Oja’s Algorithm

We analyze Private Oja’s Algorithm in Algorithm 2.

D.1 Proof of privacy in Lemma 3.1

We use the following Theorem D.1 to prove our privacy guarantees.

Theorem D.1 (Privacy amplification by shuffling [30, Theorem 3.8]).

For any domain 𝒟\mathcal{D}, let (i):𝒮(1)××𝒮(i1)×𝒟𝒮(i)\mathcal{R}^{(i)}:\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(i-1)}\times\mathcal{D}\rightarrow\mathcal{S}^{(i)} for i[n]i\in[n] (where 𝒮(i)\mathcal{S}^{(i)} is the range space of (i)\mathcal{R}^{(i)}) be a sequence of algorithms such that (i)(z1:i1,)\mathcal{R}^{(i)}(z_{1:i-1},\cdot) is an (ε0,δ0)(\varepsilon_{0},\delta_{0})-DP local randomizer for all values of auxiliary inputs z1:i1𝒮(1)××𝒮(i1)z_{1:i-1}\in\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(i-1)}. Let 𝒜S:𝒟n𝒮(1)××𝒮(n)\mathcal{A}_{S}:\mathcal{D}^{n}\rightarrow\mathcal{S}^{(1)}\times\cdots\times\mathcal{S}^{(n)} be the algorithm that given a dataset x1:n𝒟nx_{1:n}\in\mathcal{D}^{n}, samples a uniform random permutation π\pi over [n][n], then sequentially computes zi=(i)(z1:i1,xπ(i))z_{i}=\mathcal{R}^{(i)}(z_{1:i-1},x_{\pi(i)}) for i[n]i\in[n] and outputs z1:nz_{1:n}. Then for any δ[0,1]\delta\in[0,1] such that ε0log(n16log(2/δ))\varepsilon_{0}\leq\log\left(\frac{n}{16\log(2/\delta)}\right), 𝒜s\mathcal{A}_{s} is (ε,δ+O(eεδ0n))(\varepsilon,\delta+O(e^{\varepsilon}\delta_{0}n))-DP, where

ε=O((1eε0)(eε0log(1/δ)n+eε0n)).\displaystyle\varepsilon=O\left((1-e^{-\varepsilon_{0}})\left(\frac{\sqrt{e^{\varepsilon_{0}}\log(1/\delta)}}{\sqrt{n}}+\frac{e^{\varepsilon_{0}}}{n}\right)\right)\;. (36)
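To make the dependence on \varepsilon_{0} and n in Eq. (36) concrete, the following sketch evaluates its right-hand side; the absolute constant hidden in the O(\cdot) is not specified by Theorem D.1, so it is exposed as an argument here.

```python
import math

def shuffled_epsilon(eps0, delta, n, c=1.0):
    """Central epsilon after shuffling n (eps0, delta0)-DP local randomizers,
    following Eq. (36) up to the unspecified constant c."""
    return c * (1.0 - math.exp(-eps0)) * (
        math.sqrt(math.exp(eps0) * math.log(1.0 / delta)) / math.sqrt(n)
        + math.exp(eps0) / n
    )

# With eps0 of order one, the central epsilon decays as 1/sqrt(n).
print(shuffled_epsilon(eps0=0.5, delta=1e-6, n=10**5))
```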

Let (t)(wt1,Aπ(t)):=wt\mathcal{R}^{(t)}(w_{t-1},A_{\pi(t)}):=w_{t}. Let ε0=2log(1.25/δ0)α\varepsilon_{0}=\frac{\sqrt{2\log(1.25/\delta_{0})}}{\alpha}. We show (t)(wt1,)\mathcal{R}^{(t)}(w_{t-1},\cdot) is an (ε0,δ0)(\varepsilon_{0},\delta_{0})-DP local randomizer.

If there is no noise in each update step, the update rule is

wt\displaystyle w_{t}^{\prime} wt1+ηtclipβ(Atwt1),\displaystyle\leftarrow w_{t-1}+\eta_{t}{\rm clip}_{\beta}\left(A_{t}w_{t-1}\right)\;, (37)
w_{t}\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\| (38)

The sensitivity of w_{t}^{\prime} with respect to replacing A_{t} is 2\beta\eta_{t}. By the Gaussian mechanism of Lemma 2.4 and the post-processing property of differential privacy, we know that w_{t} is an (\varepsilon_{0},\delta_{0})-DP local randomizer.
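The following Python sketch spells out this locally-randomized step of Algorithm 2: the Oja direction A_{t}w_{t-1} is clipped to norm \beta, Gaussian noise calibrated to the sensitivity 2\beta\eta_{t} at level (\varepsilon_{0},\delta_{0}) is added, and the iterate is re-normalized. The function name and interface are ours, for illustration only.

```python
import numpy as np

def private_oja_step(w_prev, A_t, eta_t, beta, eps0, delta0, rng):
    """One (eps0, delta0)-DP local-randomizer step of private Oja's algorithm.

    w_prev : current unit-norm iterate, shape (d,)
    A_t    : the sample matrix x_t x_t^T, shape (d, d)
    beta   : clipping threshold on ||A_t w_prev||
    """
    g = A_t @ w_prev
    norm = np.linalg.norm(g)
    if norm > beta:                       # clip_beta(A_t w_{t-1})
        g = g * (beta / norm)
    # Gaussian mechanism: replacing A_t changes w' by at most 2 * beta * eta_t
    # in l2 norm; this is the standard noise calibration for (eps0, delta0)-DP.
    sigma = 2.0 * beta * eta_t * np.sqrt(2.0 * np.log(1.25 / delta0)) / eps0
    w = w_prev + eta_t * g + rng.normal(scale=sigma, size=w_prev.shape)
    return w / np.linalg.norm(w)
```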

Assume that ε0=2log(1.25/δ0)α12\varepsilon_{0}=\frac{\sqrt{2\log(1.25/\delta_{0})}}{\alpha}\leq\frac{1}{2}. By Theorem D.1, for δ^[0,1]\hat{\delta}\in[0,1] such that ε0log(n16log(2/δ^))\varepsilon_{0}\leq\log\left(\frac{n}{16\log(2/\hat{\delta})}\right), Algorithm 2 is (ε^,δ^+O(eε^δ0n))(\hat{\varepsilon},\hat{\delta}+O(e^{\hat{\varepsilon}}\delta_{0}n))-DP and for some constant c1>0c_{1}>0,

ε^\displaystyle\hat{\varepsilon} c1((1eε0)(eε0log(1/δ^)n+eε0n))\displaystyle\leq c_{1}\left((1-e^{-\varepsilon_{0}})\left(\frac{\sqrt{e^{\varepsilon_{0}}\log(1/\hat{\delta})}}{\sqrt{n}}+\frac{e^{\varepsilon_{0}}}{n}\right)\right) (39)
\leq c_{1}\left((e^{0.5\varepsilon_{0}}-e^{-0.5\varepsilon_{0}})\frac{\sqrt{\log(1/\hat{\delta})}}{\sqrt{n}}+\frac{e^{\varepsilon_{0}}-1}{n}\right) (40)
\leq c_{1}\left(((1+\varepsilon_{0})-(1-\varepsilon_{0}/2))\frac{\sqrt{\log(1/\hat{\delta})}}{\sqrt{n}}+\frac{1+2\varepsilon_{0}-1}{n}\right) (41)
=c_{1}\varepsilon_{0}\left(\frac{3}{2}\sqrt{\frac{\log(1/\hat{\delta})}{n}}+\frac{2}{n}\right) (42)
c2log(1/δ0)αlog(1/δ^)n,\displaystyle\leq c_{2}\frac{\sqrt{\log(1/\delta_{0})}}{\alpha}\sqrt{\frac{\log(1/\hat{\delta})}{n}}\;, (43)

for some absolute constant c2>0c_{2}>0.

Set δ^=δ/2\hat{\delta}=\delta/2, δ0=c3δ/(eε^n)\delta_{0}=c_{3}\delta/(e^{\hat{\varepsilon}}n) for some c3>0c_{3}>0 and α=Clog(n/δ)/(εn)\alpha=C^{\prime}\log(n/\delta)/(\varepsilon\sqrt{n}). We have

ε^\displaystyle\hat{\varepsilon} c2log(eε^n/(c3δ))αlog(2/δ)n\displaystyle\leq c_{2}\frac{\sqrt{\log(e^{\hat{\varepsilon}}n/(c_{3}\delta))}}{\alpha}\sqrt{\frac{\log(2/\delta)}{n}} (44)
=log(eε^n/(c3δ))log(2/δ)Clog(n/δ)ε.\displaystyle=\frac{\sqrt{\log(e^{\hat{\varepsilon}}n/(c_{3}\delta))\log(2/\delta)}}{C^{\prime}\log(n/\delta)}\cdot\varepsilon. (45)

For any ε1\varepsilon\leq 1, by Eq. (45), there exists some sufficiently large C>0C^{\prime}>0 such that ε^ε\hat{\varepsilon}\leq\varepsilon.

Recall that we assume \varepsilon_{0}=\frac{\sqrt{2\log(1.25/\delta_{0})}}{\alpha}\leq\frac{1}{2}. This means \varepsilon=O\left(\sqrt{\frac{\log(n/\delta)}{n}}\right).

D.2 Proof of clipping in Lemma 3.2

Let zt=Atwt1z_{t}=A_{t}w_{t-1}. Let μt:=𝔼[zt]=Σwt1\mu_{t}:=\mathbb{E}[z_{t}]=\Sigma w_{t-1}. By Lemma 2.1, we know for any v=1\|v\|=1, with probability 1ζ1-\zeta,

|v(ztμt)|Kγλ1loga(1/ζ).\displaystyle|v^{\top}(z_{t}-\mu_{t})|\leq K\gamma\lambda_{1}\log^{a}(1/\zeta)\;. (46)

Applying union bound over all basis vectors v{e1,,ed}v\in\{e_{1},\ldots,e_{d}\} and all samples, we know with probability 1ζ1-\zeta, for all j[d]j\in[d] and t[n]t\in[n]

|zt,j|Kγλ1loga(nd/ζ)+λ1.\displaystyle|z_{t,j}|\leq K\gamma\lambda_{1}\log^{a}(nd/\zeta)+\lambda_{1}\;. (47)

This implies that with probability 1ζ1-\zeta, for all t[n]t\in[n], we have

zt(Kγloga(nd/ζ)+1)λ1d.\displaystyle\|z_{t}\|\leq(K\gamma\log^{a}(nd/\zeta)+1)\lambda_{1}\sqrt{d}\;. (48)

D.3 Proof of utility in Theorem 3.3

Lemma 3.2 implies that with probability 1O(ζ)1-O(\zeta), Algorithm 2 does not have any clipping. Under this event, the update rule becomes

wt\displaystyle w_{t}^{\prime} wt1+ηt(At+2αβGt)wt1\displaystyle\leftarrow w_{t-1}+\eta_{t}\left(A_{t}+2\alpha\beta G_{t}\right)w_{t-1} (49)
wt\displaystyle w_{t} wt/wt,\displaystyle\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|\;, (50)

where \beta=(K\gamma\log^{a}(nd/\zeta)+1)\lambda_{1}\sqrt{d} and each entry of G_{t}\in{\mathbb{R}}^{d\times d} is i.i.d. sampled from the standard Gaussian {\cal N}(0,1). This follows from the fact that \|w_{t-1}\|=1 and G_{t}w_{t-1}\sim{\cal N}(0,{\bf I}_{d}).

Let B_{t}=A_{t}+2\alpha\beta G_{t}. We show that B_{t} satisfies the three conditions in Theorem 2.2 ([39, Theorem 4.12]). It is easy to see that \mathbb{E}[B_{t}]=\Sigma from Assumption A.1. Next, we upper bound \max\left\{\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2},\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\right\}. We have

𝔼[(BtΣ)(BtΣ)]2\displaystyle\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2}
=\displaystyle= 𝔼[(At+2αβGtΣ)(At+2αβGtΣ)]2\displaystyle\;\left\|\mathbb{E}[(A_{t}+2\alpha\beta G_{t}-\Sigma)(A_{t}+2\alpha\beta G_{t}-\Sigma)^{\top}]\right\|_{2}
\displaystyle\leq 𝔼[(AtΣ)(AtΣ)]2+4α2β2𝔼[GtGt]2\displaystyle\;\left\|\mathbb{E}[(A_{t}-\Sigma)(A_{t}-\Sigma)^{\top}]\right\|_{2}+4\alpha^{2}\beta^{2}\|\mathbb{E}[G_{t}G_{t}^{\top}]\|_{2}
\displaystyle\leq Vλ12+4α2β2C2d,\displaystyle\;V\lambda_{1}^{2}+4\alpha^{2}\beta^{2}C_{2}d\;, (51)

where the last inequality follows from Lemma F.3 and C2>0C_{2}>0 is an absolute constant. Let V~:=Vλ12+4α2β2C2d\widetilde{V}:=V\lambda_{1}^{2}+4\alpha^{2}\beta^{2}C_{2}d. Similarly, we can show that 𝔼[(BtΣ)(BtΣ)]2V~\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\leq\widetilde{V}.

By Lemma F.2, we know with probability 1ζ1-\zeta, for all t[T]t\in[T],

BtΣ2\displaystyle\left\|B_{t}-\Sigma\right\|_{2}
=\displaystyle= At+2αβGtΣ2\displaystyle\left\|A_{t}+2\alpha\beta G_{t}-\Sigma\right\|_{2}
\displaystyle\leq AtΣ2+2αβGt2\displaystyle\left\|A_{t}-\Sigma\right\|_{2}+2\alpha\beta\|G_{t}\|_{2}
\displaystyle\leq Mλ1+2C3αβ(d+log(n/ζ)).\displaystyle M\lambda_{1}+2C_{3}\alpha\beta\left(\sqrt{d}+\sqrt{\log(n/\zeta)}\right)\;.

Let M~:=Mλ1+2C3αβ(d+log(n/ζ))\widetilde{M}:=M\lambda_{1}+2C_{3}\alpha\beta\left(\sqrt{d}+\sqrt{\log(n/\zeta)}\right).

Under the event that BtΣ2M~\left\|B_{t}-\Sigma\right\|_{2}\leq\widetilde{M} for all t[n]t\in[n], we apply Theorem 2.2 with a learning rate ηt=h(λ1λ2)(ξ+t)\eta_{t}=\frac{h}{(\lambda_{1}-\lambda_{2})(\xi+t)} where

ξ=20max(M~h(λ1λ2),(V~+λ12)h2(λ1λ2)2log(1+ζ100)).\displaystyle\xi=20\max\left(\frac{\widetilde{M}h}{(\lambda_{1}-\lambda_{2})},\frac{\left(\widetilde{V}+\lambda_{1}^{2}\right)h^{2}}{(\lambda_{1}-\lambda_{2})^{2}\log(1+\frac{\zeta}{100})}\right)\;. (52)

Then Theorem 2.2 implies that with probability 1ζ1-\zeta,

sin2(wn,v1)Clog(1/ζ)ζ2(d(ξn)2h+h2V~(2h1)(λ1λ2)2n),\displaystyle\sin^{2}\left(w_{n},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(d\left(\frac{\xi}{n}\right)^{2h}+\frac{h^{2}\widetilde{V}}{(2h-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}n}\right)\;, (53)

for some positive constant CC.

Set α=Clog(n/δ)εn\alpha=\frac{C^{\prime}\log(n/\delta)}{\varepsilon\sqrt{n}}, the above bound implies

sin2(wn,v1)Clog(1/ζ)ζ2(h2Vλ12(2h1)(λ1λ2)2n+(Kγloga(nd/ζ)+1)2λ12log2(n/δ)d2h2(2h1)(λ1λ2)2ε2n2+d(ξ~)h),\displaystyle\sin^{2}\left(w_{n},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(\frac{h^{2}V\lambda_{1}^{2}}{(2h-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}n}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{2}\lambda_{1}^{2}\log^{2}(n/\delta)d^{2}h^{2}}{(2h-1)(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{2}}+d\left(\tilde{\xi}\right)^{h}\right)\;, (54)

where ξ~=(ξ/n)2\tilde{\xi}=(\xi/n)^{2}, and

ξ~:=max\displaystyle\tilde{\xi}:=\max (M2λ12h2(λ1λ2)2n2+(Kγloga(nd/ζ)+1)2λ12log3(n/δ)h2d2(λ1λ2)2ε2n3,\displaystyle\left(\frac{M^{2}\lambda_{1}^{2}h^{2}}{(\lambda_{1}-\lambda_{2})^{2}n^{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{2}\lambda_{1}^{2}\log^{3}(n/\delta)h^{2}d^{2}}{(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{3}},\right.
V2λ14h4(λ1λ2)4log2(1+ζ100)n2+(Kγloga(nd/ζ)+1)4λ14log4(n/δ)h4d4(λ1λ2)4log2(1+ζ100)ε4n4\displaystyle\left.\frac{V^{2}\lambda_{1}^{4}h^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})n^{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{4}\lambda_{1}^{4}\log^{4}(n/\delta)h^{4}d^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})\varepsilon^{4}n^{4}}\right.
+λ14h4(λ1λ2)4log2(1+ζ100)n2).\displaystyle\left.+\frac{\lambda_{1}^{4}h^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})n^{2}}\right)\;. (55)

For ζ=O(1)\zeta=O(1) and K=O(1)K=O(1), selecting h=clognh=c\log n, and assuming

n\displaystyle n\geq C(Mλ1log(n)λ1λ2+(Kγloga(nd/ζ)+1)2/3λ12/3log(n/δ)log2/3(n)d2/3(λ1λ2)2/3ε2/3\displaystyle C\left(\frac{M\lambda_{1}\log(n)}{\lambda_{1}-\lambda_{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)^{2/3}\lambda_{1}^{2/3}\log(n/\delta)\log^{2/3}(n)d^{2/3}}{(\lambda_{1}-\lambda_{2})^{2/3}\varepsilon^{2/3}}\right.
+Vλ12(log(n))2(λ1λ2)2+(Kγloga(nd/ζ)+1)λ1log(n/δ)log(n)d(λ1λ2)ε+λ12log2(n)(λ1λ2)2),\displaystyle\left.+\frac{V\lambda_{1}^{2}(\log(n))^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{(K\gamma\log^{a}(nd/\zeta)+1)\lambda_{1}\log(n/\delta)\log(n)d}{(\lambda_{1}-\lambda_{2})\varepsilon}+\frac{\lambda_{1}^{2}\log^{2}(n)}{(\lambda_{1}-\lambda_{2})^{2}}\right)\;, (56)

with large enough positive constants c and C, we have \tilde{\xi}\leq 1 and d\tilde{\xi}^{h}\leq 1/n^{2}. Hence it is sufficient to have

n=O~(λ12(λ1λ2)2+Mλ1λ1λ2+Vλ12(λ1λ2)2+d(γ+1)λ1log(1/δ)(λ1λ2)ε),n=\tilde{O}\Big{(}\,\frac{\lambda_{1}^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{M\lambda_{1}}{\lambda_{1}-\lambda_{2}}+\frac{V\lambda_{1}^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{d\,(\gamma+1)\lambda_{1}\,\log(1/\delta)}{(\lambda_{1}-\lambda_{2})\varepsilon}\,\Big{)}\;,

with a large enough constant.

Appendix E The analysis of DP-PCA

We provide the proofs for Theorem 5.1, Theorem 6.1, and Lemma 6.2, which guarantee the privacy and utility of DP-PCA.

E.1 Proof of Theorem 5.1 on the privacy and utility of DP-PCA

From Theorem 6.1, we know that Alg. 4 returns \hat{\Lambda} satisfying 2\hat{\Lambda}\geq\lambda_{1}^{2}\|H_{u}\|_{2} with high probability. Then, from Lemma 6.2, we know that with high probability Alg. 5 returns an unbiased estimate of the gradient mean with added Gaussian noise. In this case, the update rule becomes

wt\displaystyle w_{t}^{\prime} wt1+ηt(1Bi=1BAB(t1)+i+βtGt)wt1\displaystyle\leftarrow w_{t-1}+\eta_{t}\left(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}\right)w_{t-1} (57)
wt\displaystyle w_{t} wt/wt,\displaystyle\leftarrow w_{t}^{\prime}/\|w_{t}^{\prime}\|\;, (58)

where \beta_{t}=\frac{8K\sqrt{2\hat{\Lambda}_{t}}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}, \hat{\Lambda}_{t} denotes the estimated top eigenvalue of the covariance of the gradients at the t-th iteration, and each entry of G_{t}\in{\mathbb{R}}^{d\times d} is i.i.d. sampled from the standard Gaussian {\cal N}(0,1). This follows from the fact that \|w_{t-1}\|=1 and G_{t}w_{t-1}\sim{\cal N}(0,{\bf I}_{d}).
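Read as pseudocode, one iteration of DP-PCA takes the form of the following Python sketch, where private_top_eigenvalue and private_mean_estimation stand in for Algorithms 4 and 5 (sketched in Appendices E.2 and E.3 below); the function names and the even privacy-budget split between the two subroutines are our illustrative choices, not part of the algorithm's specification.

```python
import numpy as np

def dp_pca_step(w_prev, batch_gradients, eps, delta, eta_t, zeta):
    """One DP-PCA iteration (illustrative sketch).

    batch_gradients : array of shape (B, d) whose rows are g_i = A_i w_prev.
    """
    # Algorithm 4: private estimate of the scale of the gradient fluctuations,
    # used to calibrate clipping and noise inside Algorithm 5.
    lam_hat = private_top_eigenvalue(batch_gradients, eps / 2, delta / 2, zeta)
    # Algorithm 5: private mean of the minibatch gradients; with high
    # probability this equals the empirical mean plus beta_t * N(0, I_d).
    mu_hat = private_mean_estimation(batch_gradients, eps / 2, delta / 2,
                                     zeta, lam_hat)
    w = w_prev + eta_t * mu_hat
    return w / np.linalg.norm(w)
```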

Let \beta:=\frac{16K\gamma\lambda_{1}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B} so that \beta\geq\beta_{t}, which follows from the fact that \hat{\Lambda}\leq\sqrt{2}\lambda_{1}^{2}\|H_{u}\|_{2}\leq\sqrt{2}\lambda_{1}^{2}\gamma^{2} (Theorem 6.1 and Assumption A.4). Let B_{t}=(1/B)\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}. We show that B_{t} satisfies the three conditions in Theorem 2.2 ([39, Theorem 4.12]). It is easy to see that \mathbb{E}[B_{t}]=\Sigma from Assumption A.1. Next, we upper bound \max\left\{\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2},\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\right\}. We have

𝔼[(BtΣ)(BtΣ)]2\displaystyle\left\|\mathbb{E}\left[(B_{t}-\Sigma)(B_{t}-\Sigma)^{\top}\right]\right\|_{2}
=\displaystyle= 𝔼[(1Bi=1BAB(t1)+i+βtGtΣ)(1Bi=1BAB(t1)+i+βtGtΣ)]2\displaystyle\;\left\|\mathbb{E}[(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}-\Sigma)(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}-\Sigma)^{\top}]\right\|_{2}
\displaystyle\leq 𝔼[(1Bi=1BAB(t1)+iΣ)(1Bi=1BAB(t1)+iΣ)]2+β2𝔼[GtGt]2\displaystyle\;\left\|\mathbb{E}[(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}-\Sigma)(\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}-\Sigma)^{\top}]\right\|_{2}+\beta^{2}\|\mathbb{E}[G_{t}G_{t}^{\top}]\|_{2}
=\displaystyle= Vλ12/B+β2𝔼[GtGt]2\displaystyle\;V\lambda_{1}^{2}/B+\beta^{2}\|\mathbb{E}[G_{t}G_{t}^{\top}]\|_{2}
\displaystyle\leq Vλ12/B+β2C2d,\displaystyle\;V\lambda_{1}^{2}/B+\beta^{2}C_{2}d\;, (59)

where the last inequality follows from Lemma F.3 and C2>0C_{2}>0 is an absolute constant. Let V~:=Vλ12/B+β2C2d\widetilde{V}:=V\lambda_{1}^{2}/B+\beta^{2}C_{2}d. Similarly, we can show that 𝔼[(BtΣ)(BtΣ)]2V~\left\|\mathbb{E}\left[(B_{t}-\Sigma)^{\top}(B_{t}-\Sigma)\right]\right\|_{2}\leq\widetilde{V}. By Lemma F.5 and Lemma F.2, we know with probability 1ζ1-\zeta, for all t[T]t\in[T],

BtΣ2\displaystyle\left\|B_{t}-\Sigma\right\|_{2}
=\displaystyle= 1Bi=1BAB(t1)+i+βtGtΣ2\displaystyle\left\|\frac{1}{B}\sum_{i=1}^{B}A_{B(t-1)+i}+\beta_{t}G_{t}-\Sigma\right\|_{2}
\displaystyle\leq C3(Mλ1log(dT/ζ)B+Vλ12log(dT/ζ)B+β(d+log(T/ζ))).\displaystyle C_{3}\left(\frac{M\lambda_{1}\log(dT/\zeta)}{B}+\sqrt{\frac{V\lambda_{1}^{2}\log(dT/\zeta)}{B}}+\beta\left(\sqrt{d}+\sqrt{\log(T/\zeta)}\right)\right)\;.

Let M~:=C3(Mλ1log(dT/ζ)B+Vλ12log(dT/ζ)B+β(d+log(T/ζ)))\widetilde{M}:=C_{3}\left(\frac{M\lambda_{1}\log(dT/\zeta)}{B}+\sqrt{\frac{V\lambda_{1}^{2}\log(dT/\zeta)}{B}}+\beta\left(\sqrt{d}+\sqrt{\log(T/\zeta)}\right)\right). Under the event that BtΣ2M~\left\|B_{t}-\Sigma\right\|_{2}\leq\widetilde{M} for all t[T]t\in[T], we apply Theorem 2.2 with a learning rate ηt=α(λ1λ2)(ξ+t)\eta_{t}=\frac{\alpha}{(\lambda_{1}-\lambda_{2})(\xi+t)} where

ξ=20max(M~α(λ1λ2),(V~+λ12)α2(λ1λ2)2log(1+ζ100)).\displaystyle\xi=20\max\left(\frac{\widetilde{M}\alpha}{(\lambda_{1}-\lambda_{2})},\frac{\left(\widetilde{V}+\lambda_{1}^{2}\right)\alpha^{2}}{(\lambda_{1}-\lambda_{2})^{2}\log(1+\frac{\zeta}{100})}\right)\;. (60)

Then Theorem 2.2 implies that with probability 1ζ1-\zeta,

sin2(wT,v1)Clog(1/ζ)ζ2(d(ξT)2α+α2V~(2α1)(λ1λ2)2T),\displaystyle\sin^{2}\left(w_{T},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(d\left(\frac{\xi}{T}\right)^{2\alpha}+\frac{\alpha^{2}\widetilde{V}}{(2\alpha-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}T}\right)\;, (61)

for some positive constant CC. Using n=BTn=BT and Eq. (59), the above bound implies

sin2(wT,v1)Clog(1/ζ)ζ2(α2Vλ12(2α1)(λ1λ2)2n+K2γ2λ12log2a(nd/(Tζ))log(1/δ)d2α2T(2α1)(λ1λ2)2ε2n2+d(ξ~)α).\displaystyle\sin^{2}\left(w_{T},v_{1}\right)\leq\frac{C\log(1/\zeta)}{\zeta^{2}}\left(\frac{\alpha^{2}V\lambda_{1}^{2}}{(2\alpha-1)\left(\lambda_{1}-\lambda_{2}\right)^{2}n}+\frac{K^{2}\gamma^{2}\lambda_{1}^{2}\log^{2a}(nd/(T\zeta))\log(1/\delta)d^{2}\alpha^{2}T}{(2\alpha-1)(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{2}}+d\left(\tilde{\xi}\right)^{\alpha}\right)\;. (62)

where ξ~=(ξ/T)2\tilde{\xi}=(\xi/T)^{2}, and

ξ~:=max\displaystyle\tilde{\xi}:=\max (M2λ12α2log2(dT/ζ)(λ1λ2)2n2+Vλ12log(dT/ζ)α2(λ1λ2)2nT+K2γ2λ12log2a(nd/(Tζ))log(1/δ)log(T/ζ)α2d2(λ1λ2)2ε2n2,\displaystyle\left(\frac{M^{2}\lambda_{1}^{2}\alpha^{2}\log^{2}(dT/\zeta)}{(\lambda_{1}-\lambda_{2})^{2}n^{2}}+\frac{V\lambda_{1}^{2}\log(dT/\zeta)\alpha^{2}}{(\lambda_{1}-\lambda_{2})^{2}nT}+\frac{K^{2}\gamma^{2}\lambda_{1}^{2}\log^{2a}(nd/(T\zeta))\log(1/\delta)\log(T/\zeta)\alpha^{2}d^{2}}{(\lambda_{1}-\lambda_{2})^{2}\varepsilon^{2}n^{2}},\right.
V2λ14α4(λ1λ2)4log2(1+ζ100)n2+K4γ4λ14log4a(nd/(Tζ))log2(1/δ)α4d4T2(λ1λ2)4log2(1+ζ100)ε4n4\displaystyle\left.\frac{V^{2}\lambda_{1}^{4}\alpha^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})n^{2}}+\frac{K^{4}\gamma^{4}\lambda_{1}^{4}\log^{4a}(nd/(T\zeta))\log^{2}(1/\delta)\alpha^{4}d^{4}T^{2}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})\varepsilon^{4}n^{4}}\right.
+λ14α4(λ1λ2)4log2(1+ζ100)T2).\displaystyle\left.+\frac{\lambda_{1}^{4}\alpha^{4}}{(\lambda_{1}-\lambda_{2})^{4}\log^{2}(1+\frac{\zeta}{100})T^{2}}\right)\;. (63)

For ζ=O(1)\zeta=O(1) and K=O(1)K=O(1), selecting α=clogn\alpha=c\log n, T=c(logn)2T=c^{\prime}(\log n)^{2}, and assuming lognλ12/(λ1λ2)2\log n\geq\lambda_{1}^{2}/(\lambda_{1}-\lambda_{2})^{2} and

n\displaystyle n\geq C(Mλ1log(n)log(dlog(n))λ1λ2+Vλ12log(dT)(λ1λ2)+γλ1loga(nd/log(n))log(1/δ)log(log(n))log(n)d(λ1λ2)ε\displaystyle C\left(\frac{M\lambda_{1}\log(n)\log(d\log(n))}{\lambda_{1}-\lambda_{2}}+\frac{\sqrt{V\lambda_{1}^{2}\log(dT)}}{(\lambda_{1}-\lambda_{2})}+\frac{\gamma\lambda_{1}\log^{a}(nd/\log(n))\sqrt{\log(1/\delta)\log(\log(n))}\log(n)d}{(\lambda_{1}-\lambda_{2})\varepsilon}\right.
+Vλ12(log(n))2(λ1λ2)2+γλ1loga(nd/log(n))log(1/δ)(log(n))2d(λ1λ2)ε),\displaystyle\left.+\frac{V\lambda_{1}^{2}(\log(n))^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{\gamma\lambda_{1}\log^{a}(nd/\log(n))\sqrt{\log(1/\delta)}(\log(n))^{2}d}{(\lambda_{1}-\lambda_{2})\varepsilon}\right)\;, (64)

with large enough positive constants cc, cc^{\prime}, and CC, we have ξ~1\tilde{\xi}\leq 1 and dξ~α1/n2d\tilde{\xi}^{\alpha}\leq 1/n^{2}. Hence it is sufficient to have

n=O~(exp(λ12/(λ1λ2)2)+Mλ1λ1λ2+Vλ12(λ1λ2)2+dγλ1log(1/δ)(λ1λ2)ε),n=\tilde{O}\Big{(}\,\exp(\lambda_{1}^{2}/(\lambda_{1}-\lambda_{2})^{2})+\frac{M\lambda_{1}}{\lambda_{1}-\lambda_{2}}+\frac{V\lambda_{1}^{2}}{(\lambda_{1}-\lambda_{2})^{2}}+\frac{d\,\gamma\,\lambda_{1}\sqrt{\log(1/\delta)}}{(\lambda_{1}-\lambda_{2})\varepsilon}\,\Big{)}\;,

with a large enough constant.

E.2 Algorithm and proof of Theorem 6.1 on top eigenvalue estimation

Input: S=\{g_{i}\}_{i=1}^{B}, privacy parameters (\varepsilon,\delta), failure probability \zeta

  1. Let \tilde{g}_{i}\leftarrow g_{2i}-g_{2i-1} for i=1,2,\ldots,\lfloor B/2\rfloor, and let \tilde{S}=\{\tilde{g}_{i}\}_{i=1}^{\lfloor B/2\rfloor}
  2. Partition \tilde{S} into k={C_{1}\log(1/(\delta\zeta))}/{\varepsilon} disjoint subsets, each of size b=\lfloor B/(2k)\rfloor, and denote each subset by G_{j}\in{\mathbb{R}}^{d\times b}
  3. Let \lambda_{1}^{(j)} be the top eigenvalue of (1/b)G_{j}G_{j}^{\top} for all j\in[k]
  4. Partition [0,\infty) into \Omega\leftarrow\left\{\ldots,\left[2^{-2/4},2^{-1/4}\right),\left[2^{-1/4},1\right),\left[1,2^{1/4}\right),\left[2^{1/4},2^{2/4}\right),\ldots\right\}\cup\{[0,0]\}
  5. Run the (\varepsilon,\delta)-DP histogram learner of Lemma B.1 on \{\lambda_{1}^{(j)}\}_{j=1}^{k} over \Omega
  6. If all the bins are empty, return \perp
  7. Let [l,r] be a non-empty bin that contains the maximum number of points in the DP histogram
  8. Return \hat{\Lambda}=l

Algorithm 4: Private Top Eigenvalue Estimation

Taking the difference of consecutive gradients ensures that \tilde{g}_{i} is zero mean, so that we can directly use the top eigenvalue of (1/b)G_{j}G_{j}^{\top} for j\in[k]. We compute a histogram over those k top eigenvalues. This histogram is privatized by adding noise only to the occupied bins and thresholding small entries of the histogram to zero. The choice k=\Omega(\log(1/\zeta)/\varepsilon) ensures that the most occupied bin does not change after adding the DP noise to the histogram, and k=\Omega(\log(1/\delta)/\varepsilon) is necessary for handling an unbounded number of bins. We emphasize that we do not require any upper or lower bound on the eigenvalue, thanks to the private histogram learner from [12, 49] that gracefully handles an unbounded number of bins.
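A minimal Python sketch of Algorithm 4 following the description above, reusing the stable_histogram sketch from Appendix B; the constant C1 and the handling of edge cases are illustrative assumptions, not part of the algorithm's specification.

```python
import numpy as np

def private_top_eigenvalue(gradients, eps, delta, zeta, C1=4.0, rng=None):
    """Sketch of Algorithm 4: subsample-and-aggregate top eigenvalues of the
    (zero-mean) gradient differences, then release one via a DP histogram.

    gradients : array of shape (B, d).
    """
    rng = np.random.default_rng() if rng is None else rng
    B = gradients.shape[0]
    # Pair consecutive gradients so that the differences are zero mean.
    diffs = gradients[1::2][: B // 2] - gradients[0::2][: B // 2]
    k = max(1, int(C1 * np.log(1.0 / (delta * zeta)) / eps))
    b = max(1, len(diffs) // k)
    # Top eigenvalue of (1/b) G_j G_j^T on each of the k disjoint subsets.
    tops = []
    for j in range(k):
        Gj = diffs[j * b:(j + 1) * b]
        if len(Gj) == 0:
            continue
        tops.append(np.linalg.eigvalsh(Gj.T @ Gj / len(Gj))[-1])
    # Geometric bins with ratio 2^{1/4}; return the left end point of the
    # most occupied bin of the privatized histogram.
    bin_of = lambda lam: int(np.floor(4.0 * np.log2(max(lam, 1e-300))))
    hist = stable_histogram(tops, bin_of, eps, delta, rng)
    if not hist:
        return None                       # all bins empty
    return 2.0 ** (max(hist, key=hist.get) / 4.0)
```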

The privacy guarantee follows from the privacy guarantee of the histogram learner provided in Lemma B.1.

For the utility analysis, we follow the analysis of [46, Theorem 3.1]. The main difference is that we prove a smaller sample complexity since we only need the top eigenvalue, and we analyze a more general distribution family. The random vector \tilde{g}_{i} is zero mean with covariance 2\lambda_{1}^{2}H_{u}\in{\mathbb{R}}^{d\times d}, where H_{u}=\mathbb{E}[(A_{i}-\Sigma)uu^{\top}(A_{i}-\Sigma)^{\top}]/\lambda_{1}^{2}, and \tilde{g}_{i} satisfies, with probability 1-\zeta,

|g~i,v| 2Kλ1Hu2loga(1/ζ),\displaystyle|\left\langle\tilde{g}_{i},v\right\rangle|\;\leq\;2K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(1/\zeta)\;, (65)

which follows from Lemma 2.1. Applying union bound over all basis vectors v{e1,,ed}v\in\{e_{1},\ldots,e_{d}\}, we know with probability 1ζ1-\zeta,

g~i 2Kλ1dHu2loga(d/ζ).\displaystyle\|\tilde{g}_{i}\|\;\leq\;2K\lambda_{1}\sqrt{d\|H_{u}\|_{2}}\log^{a}(d/\zeta)\;.

We next show that conditioned on event ={g~i2Kλ1dHu2loga(d/ζ)}\mathcal{E}=\{\|\tilde{g}_{i}\|\leq 2K\lambda_{1}\sqrt{d\|H_{u}\|_{2}}\log^{a}(d/\zeta)\}, the covariance 𝔼[g~ig~i|]\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}|\mathcal{E}] is close to the true covariance 𝔼[g~ig~i]=2λ12Hu\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}]=2\lambda_{1}^{2}H_{u}. Note that

𝔼[g~ig~i|]\displaystyle\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}|\mathcal{E}] =𝔼[g~ig~i𝕀{g~i2Kλ1dHu2loga(d/ζ)}]()\displaystyle\;=\;\frac{\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}{\mathbb{I}}\{\|\tilde{g}_{i}\|\leq 2K\lambda_{1}\sqrt{d\|H_{u}\|_{2}}\log^{a}(d/\zeta)\}]}{{\mathbb{P}}(\mathcal{E})}
𝔼[g~ig~i]()2λ12Hu1ζ.\displaystyle\;\preceq\;\frac{\mathbb{E}[\tilde{g}_{i}\tilde{g}_{i}^{\top}]}{{\mathbb{P}}(\mathcal{E})}\;\preceq\;\frac{2\lambda_{1}^{2}H_{u}}{1-\zeta}\;. (66)

We next show the empirical covariance (1/b)i=1bg~ig~i({1}/{b})\sum_{i=1}^{b}\tilde{g}_{i}\tilde{g}_{i}^{\top} concentrates around 2λ12Hu2\lambda_{1}^{2}H_{u}. First of all, using union bound on Eq. (65), we have with probability 1ζ1-\zeta, for all i[b]i\in[b] and j[d]j\in[d],

|g~ij|2Kλ1Hu2loga(bd/ζ).\displaystyle|\tilde{g}_{ij}|\leq 2K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(bd/\zeta)\;.

Under the event that |\tilde{g}_{ij}|\leq 2K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(bd/\zeta) for all i\in[b] and j\in[d], [70, Corollary 6.20] together with Eq. (66) implies

(1bi=1bg~ig~i2λ12Hu2α)2dexp(bα28K2λ12Hu2log2a(bdζ)d(2λ12Hu2/(1ζ)+α)).\displaystyle{\mathbb{P}}\left(\left\|\frac{1}{b}\sum_{i=1}^{b}\tilde{g}_{i}\tilde{g}_{i}^{\top}-2\lambda_{1}^{2}H_{u}\right\|_{2}\geq\alpha\right)\leq 2d\exp\left(-\frac{b\alpha^{2}}{8K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}\log^{2a}(\frac{bd}{\zeta})d(2\lambda_{1}^{2}\|H_{u}\|_{2}/(1-\zeta)+\alpha)}\right)\;.

The above bound implies that with probability 1ζ1-\zeta,

1bi=1bg~ig~iλ122Hu2=O(Kλ12Hu2loga(bd/ζ)dlog(d/ζ)b+K2λ12Hu2log2a(bd/ζ)dlog(d/ζ)b).\displaystyle\left\|\frac{1}{b}\sum_{i=1}^{b}\tilde{g}_{i}\tilde{g}_{i}^{\top}-\lambda_{1}^{2}2H_{u}\right\|_{2}\;=\;O\Big{(}\,K\lambda_{1}^{2}\|H_{u}\|_{2}\log^{a}(bd/\zeta)\sqrt{\frac{d\log(d/\zeta)}{b}}+K^{2}\lambda_{1}^{2}\|H_{u}\|_{2}\log^{2a}(bd/\zeta)\frac{d\log(d/\zeta)}{b}\,\Big{)}\;.

This means that if b=\Omega(K^{2}d\log(dk/\zeta)\log^{2a}(bdk/\zeta)), then with probability 1-\zeta, for all j\in[k], 2^{-1/8}\cdot 2\lambda_{1}^{2}\|H_{u}\|_{2}\leq\lambda_{1}^{(j)}\leq 2^{1/8}\cdot 2\lambda_{1}^{2}\|H_{u}\|_{2}. This means all of the \lambda_{1}^{(j)} lie within an interval of multiplicative width 2^{1/4}. Thus, at most two consecutive buckets are filled with \lambda_{1}^{(j)}. By the private histogram guarantee of Lemma B.1, if k\geq\log(1/(\delta\zeta))/\varepsilon, one of those two bins is released. The resulting total multiplicative error is bounded by 2^{1/2}.

E.3 Algorithm and proof of Lemma 6.2 on DP mean estimation

Input: S=\{g_{i}\}_{i=1}^{B}, privacy parameters (\varepsilon,\delta), target error \alpha, failure probability \zeta, approximate top eigenvalue \hat{\Lambda}

  1. Let \tau=2^{1/4}K\sqrt{\hat{\Lambda}}\log^{a}(25)
  2. For j=1,2,\ldots,d do
     (a) Run the (\frac{\varepsilon}{4\sqrt{2d\log(4/\delta)}},\frac{\delta}{4d})-DP histogram learner of Lemma B.1 on \{g_{ij}\}_{i\in[B]} over \Omega=\{\cdots,(-2\tau,-\tau],(-\tau,0],(0,\tau],(\tau,2\tau],(2\tau,3\tau],\cdots\}
     (b) Let [l,h] be the bucket that contains the maximum number of points in the private histogram, and set \bar{g}_{j}\leftarrow l
     (c) Truncate the j-th coordinate of each gradient in \{g_{i}\}_{i\in[B]} to the interval [\bar{g}_{j}-3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta),\;\bar{g}_{j}+3K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)]
  3. Let \tilde{g}_{i} denote the truncated version of g_{i}
  4. Compute the empirical mean of the truncated gradients, \tilde{\mu}=(1/B)\sum_{i=1}^{B}\tilde{g}_{i}, and add Gaussian noise: \hat{\mu}=\tilde{\mu}+{\cal N}\left(0,\left(\frac{12K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)\sqrt{2d\log(2.5/\delta)}}{\varepsilon B}\right)^{2}\mathbf{I}_{d}\right)
  5. Return \hat{\mu}

Algorithm 5: Private Mean Estimation [49, 43]

The histogram learner is called d times, each with an (\varepsilon/(4\sqrt{2d\log(4/\delta)}),\delta/(4d))-DP guarantee, and the end-to-end privacy guarantee of (\varepsilon/2,\delta/2) follows from Lemma B.4 for \varepsilon\in(0,0.9). The sensitivity of the clipped mean estimate is \Delta=6K\sqrt{\hat{\Lambda}}\log^{a}(Bd/\zeta)\sqrt{d}/B. The Gaussian mechanism with covariance (2\Delta\sqrt{2\log(2.5/\delta)}/\varepsilon)^{2}{\bf I}_{d} satisfies (\varepsilon/2,\delta/2)-DP from Lemma 2.4 for \varepsilon\in(0,1). Putting these two together with the serial composition of Lemma B.3, we get the desired privacy guarantee.
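A matching Python sketch of Algorithm 5, with the same structure as above: a coordinate-wise DP histogram locates the mean, gradients are truncated around the located bin, and the truncated mean is released with Gaussian noise. It reuses stable_histogram and per_access_budget from Appendix B; the constants K and a are treated as inputs, and the sketch is illustrative rather than a faithful implementation.

```python
import numpy as np

def private_mean_estimation(gradients, eps, delta, zeta, lam_hat,
                            K=1.0, a=1.0, rng=None):
    """Sketch of Algorithm 5: locate, truncate, then privately average.

    gradients : array of shape (B, d); lam_hat : output of Algorithm 4.
    """
    rng = np.random.default_rng() if rng is None else rng
    B, d = gradients.shape
    tau = 2.0 ** 0.25 * K * np.sqrt(lam_hat) * np.log(25.0) ** a
    radius = 3.0 * K * np.sqrt(lam_hat) * np.log(B * d / zeta) ** a
    # Budget for the d coordinate-wise histograms via advanced composition
    # (Lemma B.4); this reproduces the per-coordinate budget of Algorithm 5.
    eps_hist, delta_hist = per_access_budget(eps / 2.0, delta / 2.0, d)
    bin_of = lambda x: int(np.floor(x / tau))
    clipped = np.empty_like(gradients)
    for j in range(d):
        hist = stable_histogram(gradients[:, j], bin_of, eps_hist, delta_hist, rng)
        g_bar = (max(hist, key=hist.get) if hist else 0) * tau  # left end point
        clipped[:, j] = np.clip(gradients[:, j], g_bar - radius, g_bar + radius)
    # Gaussian mechanism on the truncated mean: per-coordinate range 2*radius
    # gives l2-sensitivity 2*radius*sqrt(d)/B when one sample is replaced.
    sigma = (2.0 * radius * np.sqrt(d) * np.sqrt(2.0 * np.log(2.5 / delta))
             / ((eps / 2.0) * B))
    return clipped.mean(axis=0) + rng.normal(scale=sigma, size=d)
```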

The proof of utility follows similarly to [55, Lemma D.2]. Let I_{l}=(l\tau,(l+1)\tau]. Denote the population probability of the j-th coordinate in I_{l} by h_{j,l}={\mathbb{P}}(g_{ij}\in I_{l}), the empirical probability by \hat{h}_{j,l}=\frac{1}{B}\sum_{i=1}^{B}{\mathbb{I}}(g_{ij}\in I_{l}), and the released private empirical probability by \tilde{h}_{j,l}.

Fix j\in[d]. Let I_{k} be the bin that contains \mu_{j}. Then we know [\mu_{j}-K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25),\mu_{j}+K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25)]\subseteq[\mu_{j}-\tau,\mu_{j}+\tau]\subset(I_{k-1}\cup I_{k}\cup I_{k+1}). By Lemma 2.1, we know {\mathbb{P}}(|g_{ij}-\mu_{j}|\geq\tau)\leq{\mathbb{P}}(|g_{ij}-\mu_{j}|\geq K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25))\leq 0.04. This means h_{j,k-1}+h_{j,k}+h_{j,k+1}\geq 0.96 and \max(h_{j,k-1},h_{j,k},h_{j,k+1})\geq 0.32.

By the Dvoretzky-Kiefer-Wolfowitz inequality and a union bound over j\in[d], we have that with probability 1-\zeta, \max_{j,l}|h_{j,l}-\hat{h}_{j,l}|\leq\sqrt{\log(d/\zeta)/B}. Using Lemma B.1, if B=\Omega((\sqrt{d\log(1/\delta)}/\varepsilon)\log(d/(\zeta\delta))), then with probability 1-\zeta we have \max_{j,l}|\tilde{h}_{j,l}-\hat{h}_{j,l}|\leq 0.005. Thus, under our assumption on B, with probability 1-\zeta, \max_{j,l}|\tilde{h}_{j,l}-h_{j,l}|\leq 0.01. Then we have \max(h_{j,k-1},h_{j,k},h_{j,k+1})-0.01\geq 0.31\geq 0.04+0.01\geq\max_{l\neq k-1,k,k+1}h_{j,l}+0.01. This implies that with probability 1-\zeta, the algorithm must pick one of the bins I_{k-1},I_{k},I_{k+1}. This means |\bar{g}_{j}-\mu_{j}|\leq 2\tau\leq 2^{1.5}K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(25). By the tail bound of Lemma 2.1, we know for all j\in[d] and i\in[B], |g_{ij}-\bar{g}_{j}|\leq|g_{ij}-\mu_{j}|+|\bar{g}_{j}-\mu_{j}|\leq 3K\lambda_{1}\sqrt{\|H_{u}\|_{2}}\log^{a}(Bd/\zeta). This completes our proof.

Appendix F Technical lemmas

Lemma F.1.

Let xd𝒩(0,Σ)x\in{\mathbb{R}}^{d}\sim{\cal N}(0,\Sigma). Then there exists universal constant CC such that with probability 1ζ1-\zeta,

x2CTr(Σ)log(1/ζ).\displaystyle\|x\|^{2}\leq C\operatorname{Tr}(\Sigma)\log(1/\zeta)\;. (67)
Proof.

Let x~:=Σ1/2x\tilde{x}:=\Sigma^{-1/2}x. Then x~\tilde{x} is also a Gaussian with x~𝒩(0,𝐈d)\tilde{x}\sim{\cal N}(0,\mathbf{I}_{d}). By Hanson-Wright inequality ( [67, Theorem 6.2.1]), there exists universal constant c>0c>0 such that with probability 1ζ1-\zeta,

x2=x~Σx~Tr(Σ)+c(Σ𝐅+Σ2)log(2/ζ)CTr(Σ)log(1/ζ).\displaystyle\|x\|^{2}=\tilde{x}^{\top}\Sigma\tilde{x}\leq\operatorname{Tr}(\Sigma)+c(\|\Sigma\|_{\mathbf{F}}+\|\Sigma\|_{2})\log(2/\zeta)\leq C\operatorname{Tr}(\Sigma)\log(1/\zeta)\;. (68)

Lemma F.2 ([67, Theorem 4.4.5]).

Let Gd×dG\in{\mathbb{R}}^{d\times d} be a random matrix where each entry GijG_{ij} is i.i.d. sampled from standard Gaussian 𝒩(0,1){\cal N}(0,1). Then there exists universal constant C>0C>0 such that with probability 12et21-2e^{-t^{2}}, G2C(d+t)\|G\|_{2}\leq C(\sqrt{d}+t) for t>0t>0.

Lemma F.3.

Let Gd×dG\in{\mathbb{R}}^{d\times d} be a random matrix where each entry GijG_{ij} is i.i.d. sampled from standard Gaussian 𝒩(0,1){\cal N}(0,1). Then we have 𝔼[GG]2C2d\|\mathbb{E}[GG^{\top}]\|_{2}\leq C_{2}d and 𝔼[GG]2C2d\|\mathbb{E}[G^{\top}G]\|_{2}\leq C_{2}d.

Proof.

By Lemma F.2, there exists a universal constant C_{1}>0 such that

(GC1(d+s))es2,s>0.\displaystyle{\mathbb{P}}\left(\|G\|\geq C_{1}(\sqrt{d}+s)\right)\leq e^{-s^{2}},\;\;\;\forall s>0\;. (69)

Then

𝔼[GG]2\displaystyle\|\mathbb{E}[GG^{\top}]\|_{2} 𝔼[GG2]\displaystyle\leq\mathbb{E}[\|GG^{\top}\|_{2}] (70)
𝔼[G22]\displaystyle\leq\mathbb{E}[\|G\|_{2}^{2}] (71)
=\int_{0}^{\infty}2r{\mathbb{P}}(\|G\|_{2}\geq r)dr\leq C_{1}d+C_{3}\int_{\sqrt{d}}^{\infty}2re^{-\frac{(r-\sqrt{d})^{2}}{2}}dr (72)
\leq C_{1}d+C_{3}(\sqrt{2\pi d}+2)\leq C_{2}d\;, (73)

where C2C_{2} is an absolute constant. The proof for the second claim follows similarly. ∎

Lemma F.4.

Let x,y𝕊2d1x,y\in\mathbb{S}_{2}^{d-1}. Then

1x,y2xy2.\displaystyle 1-\left\langle x,y\right\rangle^{2}\leq\|x-y\|^{2}\;. (74)

If \|x-y\|\leq\sqrt{2}, then

1x,y212xy2.\displaystyle 1-\left\langle x,y\right\rangle^{2}\geq\frac{1}{2}\|x-y\|^{2}\;. (75)

The following lemma follows from the matrix Bernstein inequality [65].

Lemma F.5.

Under A.1, A.2, and A.3, in Assumption 1, with probability 1ζ1-\zeta,

1Bi[B]AiΣ2=O(λ12Vlog(d/ζ)B+λ1Mlog(d/ζ)B).\displaystyle\Big{\|}\frac{1}{B}\sum_{i\in[B]}A_{i}-\Sigma\Big{\|}_{2}\;=\;O\Big{(}\,\sqrt{\frac{\lambda_{1}^{2}V\log(d/\zeta)}{B}}+\frac{\lambda_{1}M\log(d/\zeta)}{B}\,\Big{)}\;. (76)