DP-PCA: Statistically Optimal and
Differentially Private PCA
Abstract
We study the canonical statistical task of computing the principal component from i.i.d. data in dimensions under -differential privacy. Although extensively studied in literature, existing solutions fall short on two key aspects: () even for Gaussian data, existing private algorithms require the number of samples to scale super-linearly with , i.e., , to obtain non-trivial results while non-private PCA requires only , and () existing techniques suffer from a non-vanishing error even when the randomness in each data point is arbitrarily small. We propose DP-PCA, which is a single-pass algorithm that overcomes both limitations. It is based on a private minibatch gradient ascent method that relies on private mean estimation, which adds minimal noise required to ensure privacy by adapting to the variance of a given minibatch of gradients. For sub-Gaussian data, we provide nearly optimal statistical error rates even for . Furthermore, we provide a lower bound showing that sub-Gaussian style assumption is necessary in obtaining the optimal error rate.
1 Introduction
Principal Component Analysis (PCA) is a fundamental statistical tool with multiple applications including dimensionality reduction, data visualization, and noise reduction. Naturally, it is a key part of most standard data analysis and ML pipelines. However, when applied to data collected from numerous individuals, such as the U.S. Census data, outcome of PCA might reveal highly sensitive personal information. We investigate the design of privacy preserving PCA algorithms and the involved privacy/utility tradeoffs, for computing the first principal component, that should serve as the building block of more general rank- PCA.
Differential privacy (DP) is a widely accepted mathematical notion of privacy introduced in [22], which is a standard in releasing the U.S. Census data [2] and also deployed in commercial systems [64, 26, 28]. A query to a database is said to be -differentialy private if a strong adversary who knows all other entries but one cannot infer that one entry from the query output, with high confidence. The parameters and restricts the confidence as measured by the Type-I and II errors [42]. Smaller values of and imply stronger privacy and plausible deniability for the participants.
For non-private PCA with i.i.d. samples in dimensions, the popular Oja’s algorithm (provided in Algorithm 1) achieves the optimal error of , where the error is measured by the sine function of the angle between the estimate, , and the principal component, , [39]. For differentially private PCA, there is a natural fundamental question: what is the extra cost we pay in the error rate for ensuring -DP?
We introduce a novel approach we call DP-PCA (Algorithm 3) and show that it achieves an error bounded by for sub-Gaussian-like data defined in Assumption 1, which is a broad class of light-tailed distributions that includes Gaussian data as a special case. The second term characterizes the cost of privacy and this is tight; we prove a nearly matching information theoretic lower bound showing that no -DP algorithm can achieve a smaller error. This significantly improves upon a long line of existing private algorithms for PCA, e.g., [15, 10, 36, 34, 24]. These existing algorithms are analyzed for fixed and non-stochastic data and achieve sub-optimal error rates of even in the stochastic setting we consider.
A remaining question is whether the sub-Gaussian-like assumption, namely Assumption A.4, is necessary or if it is an artifact of our analysis and our algorithm. It turns out that such an assumption on the lightness of the tail is critical; we prove an algorithmic independent and information theoretic lower bound (Theorem 5.4) to show that, without such an assumption, the cost of privacy is lower bounded by . This proves a separation of the error depending on the lightness of the tail.
We start with the formal description of the stochastic setting in Section 2 and present Oja’s algorithm for non-private PCA. Our first attempt in making this algorithm private in Section 3 already achieves near-optimal error, if the data is strictly from a Gaussian distribution. However, there are two remaining challenges that we describe in detail in Section 4: the excessive number of iterations of Private Oja’s Algorithm (Algorithm 2) prevents using typical values of used in practice, and for general sub-Gaussian-like distributions, the error does not vanish even when the noise in the data (as measured by a certain fourth moment of a function of the data) vanishes. The first challenge is due to the analysis that requires amplification by shuffling [25] that is restrictive. The second is due to its reliance on gradient norm clipping [1] that does not adapt to the variance of the current gradients. This drives the design of DP-PCA in Section 5 that critically relies on two techniques to overcome each challenge, respectively. First, minibatch SGD (instead of single sample SGD) significantly reduces the number iterations, thus obviating the need for amplification by shuffling. Next, private mean estimation (instead of gradient norm clipping and noise adding) adapts to the stochasticity of the problem and adds the minimal noise necessary to achieve privacy. The main idea of this variance adaptive stochastic gradient update is explained in detail in Section 6, along with a sketch of a proof.
Notations. For a vector , we use to denote the Euclidean norm. For a matrix , we use to denote the spectral norm. We use to denote identity matrix. For , let . Let denote the unit -sphere of , i.e., . hides logarithmic factors in , , and the failure probability .
2 Problem formulation and background on DP
Typical PCA assumes i.i.d. data from a distribution and finds the first eigenvector of . Our approach allows for a more general class of data that recovers the standard case when .
Assumption 1 (-model).
Let be a sequence of (not necessarily symmetric) matrices sampled independently from the same distribution that satisfy the following with PSD matrices and , and positive scalar parameters , , , and :
-
A.1.
Let , for a symmetric positive semidefinite (PSD) matrix , denote the -th largest eigenvalue of , and ,
-
A.2.
almost surely,
-
A.3.
,
-
A.4.
, where . We denote .
The first three assumptions are required for PCA even if privacy is not needed. The last assumption provides a Gaussian-like tail bound that determines how much noise we need to introduce in the algorithm for -DP. The following lemma is useful in the analyses.
2.1 Oja’s algorithm
In a non-private setting, the following streaming algorithm introduced in [61] achieves optimal sample complexity as analyzed in [39]. It is a projected stochastic gradient ascent on the objective defined on the empirical covariance: .
Central to our analysis is the following error bound on Oja’s Algorithm from [39].
2.2 Background on Differential Privacy
Differential privacy (DP), introduced in [22], is a de facto mathematical measure for privacy leakage of a database accessed via queries. It ensures that even an adversary who knows all other entries cannot identify with a high confidence whether a person of interest participated in a database or not.
Definition 2.3 (Differential privacy [22]).
Given two multisets and , we say the pair is neighboring if . We say a stochastic query over a dataset satisfies -differential privacy for some and if for all neighboring and all subset of the range of .
Small values of and ensures that the adversary cannot identify any single data point with high confidence, thus providing plausible deniability. We provide useful DP lemmas in Appendix B. Within our stochastic gradient descent approach to PCA, we rely on the Gaussian mechanism to privatize each update. The sensitivity of a query is defined as .
Lemma 2.4 (Gaussian mechanism [23]).
For a query with sensitivity , , and , the Gaussian mechanism outputs and achieves -DP.
2.3 Comparisons with existing results in private PCA
We briefly discuss the most closely related work and provide more previous work in Appendix A. Most existing results assume a fixed data under a deterministic setting where each sample has a bounded norm, , and the goal is to find the top eigenvector of for the empirical mean . For the purpose of comparisons, consider Gaussian with for all with probability . The first line of approaches in [10, 15, 24] is a Gaussian mechanism that outputs , where is a symmetric matrix with i.i.d. Gaussian entries with a variance to ensure -DP. The tightest result in [24, Theorem 7] achieves
(3) |
with high probability, under a strong assumption that the spectral gap is very large: . In a typical scenario with , this requires a large sample size of . Since this Gaussian mechanism does not exploit the statistical properties of i.i.d. samples, the second term in this upper bound is larger by a factor of compared to the proposed DP-PCA (Corollary 5.2). The error rate of Eq. (3) is also achieved in [36, 34] by adding Gaussian noise to the standard power method for computing the principal components. When the spectral gap, , is smaller, it is possible to trade-off the dependence in and the sampling ratio , which we do not address in this work but is surveyed in Appendix A.
3 First attempt: making Oja’s Algorithm private
Following the standard recipe in training with DP-SGD, e.g., [1], we introduce Private Oja’s Algorithm in Algorithm 2. At each gradient update, we first apply gradient norm clipping to limit the contribution of a single data point and next add an appropriately chosen Gaussian noise from Lemma 2.4 to achieve -DP, end-to-end. The choice of clipping threshold ensures that, with high probability under our assumption, we do not clip any gradients. The choice of noise multiplier ensures -DP.
One caveat in streaming algorithms is that we access data times, each with a private mechanism, but accessing only a single data point at a time. To prevent excessive privacy loss due to such a large number of data accesses, we apply a random shuffling in line 2 Algorithm 2, in order to benefit from a standard amplification by shuffling [25, 30]. This gives an amplified privacy guarantee that allows us to add a small noise proportional to . Without the shuffle amplification, we will instead need a larger noise scaling as , resulting in a suboptimal utility guarantee. However, this comes with a restriction that the amplification holds only for small values of . Our first contribution in the proposed DP-PCA (Algorithm 3) is to expand this range to , which includes the practical regime of interest .
Lemma 3.1 (Privacy).
If and the noise multiplier is chosen to be , then Algorithm 2 is -DP.
Under Assumption 1, we select gradient norm clipping threshold such that no gradient exceeds .
Lemma 3.2 (Gradient clipping).
Let for some constant . Then with probability , for any fixed independent of , for all .
We provide proofs of both lemmas and the next theorem in Appendix D. When no clipping is applied, we can use the standard analysis of Oja’s Algorithm from [39] to prove the following utility guarantee.
Theorem 3.3 (Utility).
Given i.i.d. samples satisfying Assumption 1 with parameters , if
(4) |
with a large enough constant, then there exists a positive universal constant and a choice of learning rate that depends on , , , , , , , , , such that Algorithm 2 with a choice of outputs that achieves with probability ,
(5) |
where hides poly-logarithmic factors in , , , and and polynomial factors in .
The first term in Eq. (5) matches the non-private error rate for Oja’s algorithm in Eq. (2) with and , and the second term is the price we pay for ensuring -DP.
Remark 3.4.
Comparing to a lower bound in Theorem 5.3, this is already near optimal. However, for general distributions satisfying Assumption 1, Algorithm 2 (in particular the second term in Eq. (5)) can be significantly sub-optimal. We explain this second weakness of Private Oja’s Algorithm in the following section (the first weakness is the restriction on ).
4 Two remaining challenges
We explain the two remaining challenges in Private Oja’s Algorithm and propose techniques to overcomes these challenges that lead to the design of DP-PCA (Algorithm 3).
First challenge: restricted range of . This is due to the large number, , of iterations that necessitates the use of the amplification by shuffling in Theorem D.1. We reduce the number of iterations with minibatch SGD. For and , we repeat
(7) |
where and the minibatch size is . Since the dataset is accessed only times, the end-to-end privacy is analyzed with the serial composition (Lemma B.3) instead of the amplification by shuffling. This ensures -DP for any , resolving the first challenge, and still achieves the utility guarantee of Eq. (5).
Second challenge: excessive noise for privacy. This is best explained with an example.
Example 4.1 (Signal and noise separation).
Consider a setting with and where with probability half and otherwise for a unit norm vector and . We want to find the principal component of , which is . This construction decomposes the signal and the noise. For , the signal component is determined by that is deterministic due to the sign cancelling. The noise component is which is random. We can control the Signal-to-Noise Ratio (SNR), , by changing , and we are particularly interested in the regime where is small. As we are interested in , this satisfies Assumption 1 with , , , , , and . Substituting this into Eq. (5), Private Oja’s Algorithm achieves
(8) |
where we are interested in .
This is problematic since the second term, due to the DP noise, does not vanish as the randomness in the data decreases. We do not observe this for Gaussian data where signal and noise scale proportionally as shown below. We reduce the noise we add for privacy, by switching from a simple norm clipping, that adds noise proportional to the norm of the gradients, to private estimation, that only requires the noise to scale as the range of the gradients, i.e. the maximum distance between two gradients in the minibatch. The toy example above showcases that the range can be arbitrarily smaller than the maximum norm (Fig. 1). We want to emphasize that although the idea of using private estimation within an optimization has been conceptually proposed in abstract settings, e.g., in [44], DP-PCA is the first setting where such separation between the norm and the range of the gradients holds under any statistical model, and hence the long line of recent advances in private estimation provides significant gain over the simple DP-SGD [1].


5 Differentially Private Principal Component Analysis (DP-PCA)
Combining the two ideas of minibatch SGD and private mean estimation, we propose DP-SGD. We use minibatch SGD of minibatch size to allow for larger range of . We use Private Mean Estimation to add an appropriate level of noise chosen adaptively according to Private Eigenvalue Estimation. We describe details of both sub-routines in Section 6.
We show an upper bound on the error achieved by DP-PCA under an appropriate choice of the learning rate. We provide a complete proof in Appendix E.1 that includes the explicit choice of the learning rate in Eq. (60), and a proof sketch is provided in Section 6.1.
Theorem 5.1.
For , DP-PCA guarantees -DP for all , , , and . Given i.i.d. samples satisfying Assumption 1 with parameters , if
(9) |
with a large enough constant and , then there exists a positive universal constant and a choice of learning rate that depends on , , , , , , , , , such that steps of DP-PCA in Algorithm 3 with choices of and , outputs such that with probability ,
(10) |
where hides poly-logarithmic factors in , , , and and polynomial factors in .
We further interpret this analysis and show that DP-PCA is nearly optimal when the data is from a Gaussian distribution by comparing against a lower bound (Theorem 5.3); and DP-PCA significantly improves upon the private Oja’s algorithm under Example 4.1. We discuss the necessity of some of the assumptions at the end of this section, including how to agnostically find the appropriate learning rate scheduling.
Near-optimality of DP-PCA under Gaussian distributions. Consider the case of i.i.d. samples from a Gaussian distribution from Remark 3.4.
Corollary 5.2 (Upper bound; Gaussian distribution).
Under the hypotheses of Theorem 5.1 and with Gaussian random vectors ’s, after steps, DP-PCA outputs that achieves, with probability ,
(11) |
We prove a nearly matching lower bound, up to factors of and . One caveat is that the lower bound assumes pure-DP with . We do not yet have a lower bound technique for approximate DP that is tight, and all known approximate DP lower bounds have gaps to achievable upper bounds in its dependence in , e.g., [6, 56]. We provide a proof in Appendix C.1.
Theorem 5.3 (Lower bound; Gaussian distribution).
Let be a class of -DP estimators that map i.i.d. samples to an estimate . A set of Gaussian distributions with as the first and second eigenvalues of the covariance matrix is denoted by . There exists a universal constant such that
(12) |
Comparisons with private Oja’s algorithm. We demonstrate that DP-PCA can significantly improve upon Private Oja’s Algorithm with Example 4.1, where DP-PCA achieves an error bound of . As the noise power decreases DP-PCA achieves a vanishing error, whereas Private Oja’s Algorithm has a non-vanishing error in Eq. (8). This follows from the fact that the second term in the error bound in Eq. (10) scales as , which can be made arbitrarily smaller than the second term in Eq. (5) that scales as . Further, the error bound for DP-PCA holds for any , whereas Private Oja’s Algorithm requires significantly smaller .
Remarks on the assumptions of Theorem 5.1. We have an exponential dependence of the sample complexity in the spectral gap, . This ensures we have a large enough to reduce the non-dominant second term in Eq. (2), in balancing the learning rate and (which is explicitly shown in Eqs. 62 and (63) in the Appendix). It is possible to get rid of this exponential dependence at the cost of an extra term of in the error rate in Eq. (10), by selecting a slightly larger . A Gaussian-like tail bound in Assumption A.4 is necessary to get the desired upper bound scaling as in Eq. 11, for example. The next lower bound shows that without such assumptions on the tail, the error due to privacy scales as . We believe that the dependence in is loose, and it might be possible to get a tighter lower bound using [45]. We provide a proof and other lower bounds in Appendix C.
Theorem 5.4 (Lower bound without Assumption A.4).
Currently, DP-PCA requires choices of the learning rates, , that depend on possibly unknown quantities. Since we can privately evaluate the quality of our solution, one can instead run multiple instances of DP-PCA with varying and find the best choice of and . Let denote the resulting solution for one instance of . We first set a target error . For each round , we will run algorithm for and , and compute each privately, each with privacy budget . We terminate the algorithm once there there is a satisfies . It is clear that this search meta-algorithm terminate in logarithmic round, and the total sample complexity only blows up by a poly-log factor.
6 Private mean estimation for the minibatch stochastic gradients
DP-PCA critically relies on private mean estimation to reduce variance of the noise required to achieve -DP. We follow a common recipe from [49, 43, 47, 9, 16]. First, we privately find an approximate range of the gradients in the minibatch (Alg. 4). Next, we apply the Gaussian mechanism to the truncated gradients where the truncation is tailored to the estimated range (Alg. 5).
Step 1: estimating the range. We need to find an approximate range of the minibatch of gradients in order to adaptively truncate the gradients and bound the sensitivity. Inspired by a private preconditioning mechanism designed for mean estimation with unknown covariance from [46], we propose to use privately estimated top eigenvalue of the covariance matrix of the gradients. For details on the version of the histogram learner we use in Alg. 4 in Appendix E.2, we refer to [55, Lemma D.1]. Unlike the private preconditioning of [46] that estimates all eigenvalues and requires samples, we only require the top eigenvalue and hence the next theorem shows that we only need .
Theorem 6.1.
Algorithm 4 is -DP. Let for some fixed vector , where satisfies A.1 and A.4 in Assumption 1 such that the mean is and the covariance is . With a large enough sample size scaling as
(14) |
Algorithm 4 outputs achieving with probability , where the pair parametrizes the tail of the distribution in A.4 and hides logarithmic factors in , and .
We provide a proof in Appendix E.2. There are other ways to privately estimate the range. Some approaches require known bounds such as for all [49], and other agnostic approaches are more involved such as instance optimal universal estimators of [18].
Step 2: Gaussian mechanism for mean estimation. Once we have a good estimate of the top eigenvalue from the previous section, we use it to select the bin size of the private histogram and compute the truncated empirical mean. Since truncated empirical mean has a bounded sensitivity, we can use Gaussian mechanism to achieve DP. The algorithm is now standard in DP mean estimation, e.g., [49, 43]. However, the analysis is slightly different since our assumptions on ’s are different. For completeness, we provide the Algorithm 5 in Appendix E.3.
The next lemma shows that the Private Mean Estimation is -DP, and with high probability clipping does not apply to any of the gradients. The returned private mean, therefore, is distributed as a spherical Gaussian centered at the empirical mean of the gradients. This result requires that we have a good estimate of the top eigenvalue from Alg. 4 such that . This analysis implies that we get an unbiased estimate of the gradient mean (which is critical in the analysis) with noise scaling as , where (which is critical in getting the tight sample complexity in the second term of the final utility guarantee in Eq. (10)). We provide a proof in Appendix E.3.
Lemma 6.2.
6.1 Proof sketch of Theorem 5.1
We choose such that we access the dataset only times. Hence we do not need to rely on amplification by shuffling. To add Gaussian noise that scales as the standard deviation of the gradients in each minibatch (as opposed to potentially excessively large mean of the gradients), DP-PCA adopts techniques from recent advances in private mean estimation. Namely, we first get a private and accurate estimate of the range from Theorem 6.1. Using this estimate, , Private Mean Estimation returns an unbiased estimate of the empirical mean of the gradients, as long as no truncation has been applied as ensured by Lemma 6.2. This gives
(15) |
for and . Using rotation invariance of spherical Gaussian random vectors and the fact that , we can reformulate it as
(16) |
This process can be analyzed with Theorem 2.2 with substituting .
7 Discussion
Under the canonical task of computing the principal component from i.i.d. samples, we show the first result achieving a near-optimal error rate. This critically relies on two ideas: minibatch SGD and private mean estimation. In particular, private mean estimation plays a critical role in the case when the range of the gradients is significantly smaller than the norm; we achieve an optimal error rate that cannot be achieved with the standard recipe of gradient clipping or even with a more sophisticated adaptive clipping [4].
Assumption A.4 can be relaxed to heavy-tail bounds with bounded -th moment on , in which case we expect the second term in Eq. (10) to scale as , drawing analogy from a similar trend in a computationally inefficient DP-PCA without spectral gap [56, Corollary 6.10]. When a fraction of data is corrupted, recent advances in [74, 51, 40] provide optimal algorithms for PCA. However, existing approach of [56] for robust and private PCA is computationally intractable. Borrowing ideas from robust and private mean estimation in [55], one can design an efficient algorithm, but at the cost of sub-optimal sample complexity. It is an interesting direction to design an optimal and robust version of DP-PCA. Our lower bounds are loose in its dependence in . Recently, a promising lower bound technique has been introduced in [45] that might close this gap.
Acknowledgement
This work is supported in part by Google faculty research award and NSF grants CNS-2002664, IIS-1929955, DMS-2134012, CCF-2019844 as a part of NSF Institute for Foundations of Machine Learning (IFML), CNS-2112471 as a part of NSF AI Institute for Future Edge Networks and Distributed Intelligence (AI-EDGE).
References
- [1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC conference on computer and communications security, pages 308–318, 2016.
- [2] John M Abowd. The us census bureau adopts differential privacy. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 2867–2867, 2018.
- [3] Jayadev Acharya, Ziteng Sun, and Huanyu Zhang. Differentially private assouad, fano, and le cam. In Algorithmic Learning Theory, pages 48–78. PMLR, 2021.
- [4] Galen Andrew, Om Thakkar, Brendan McMahan, and Swaroop Ramaswamy. Differentially private learning with adaptive clipping. Advances in Neural Information Processing Systems, 34, 2021.
- [5] Maria-Florina Balcan, Simon Shaolei Du, Yining Wang, and Adams Wei Yu. An improved gap-dependency analysis of the noisy power method. In Conference on Learning Theory, pages 284–309. PMLR, 2016.
- [6] Rina Foygel Barber and John C Duchi. Privacy and statistical risk: Formalisms and minimax bounds. arXiv preprint arXiv:1412.4451, 2014.
- [7] Raef Bassily, Vitaly Feldman, Kunal Talwar, and Abhradeep Guha Thakurta. Private stochastic convex optimization with optimal rates. Advances in Neural Information Processing Systems, 32, 2019.
- [8] Raef Bassily, Adam Smith, and Abhradeep Thakurta. Private empirical risk minimization: Efficient algorithms and tight error bounds. In 2014 IEEE 55th Annual Symposium on Foundations of Computer Science, pages 464–473. IEEE, 2014.
- [9] Sourav Biswas, Yihe Dong, Gautam Kamath, and Jonathan Ullman. Coinpress: Practical private mean and covariance estimation. arXiv preprint arXiv:2006.06618, 2020.
- [10] Avrim Blum, Cynthia Dwork, Frank McSherry, and Kobbi Nissim. Practical privacy: the sulq framework. In Proceedings of the twenty-fourth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, pages 128–138, 2005.
- [11] Gavin Brown, Marco Gaboardi, Adam Smith, Jonathan Ullman, and Lydia Zakynthinou. Covariance-aware private mean estimation without private covariance estimation. Advances in Neural Information Processing Systems, 34, 2021.
- [12] Mark Bun, Kobbi Nissim, and Uri Stemmer. Simultaneous private learning of multiple concepts. J. Mach. Learn. Res., 20:94–1, 2019.
- [13] Mark Bun and Thomas Steinke. Average-case averages: Private algorithms for smooth sensitivity and mean estimation. Advances in Neural Information Processing Systems, 32, 2019.
- [14] T Tony Cai, Yichen Wang, and Linjun Zhang. The cost of privacy: Optimal rates of convergence for parameter estimation with differential privacy. arXiv preprint arXiv:1902.04495, 2019.
- [15] Kamalika Chaudhuri, Anand D Sarwate, and Kaushik Sinha. A near-optimal algorithm for differentially-private principal components. The Journal of Machine Learning Research, 14(1):2905–2943, 2013.
- [16] Christian Covington, Xi He, James Honaker, and Gautam Kamath. Unbiased statistical estimation and valid confidence intervals under differential privacy. arXiv preprint arXiv:2110.14465, 2021.
- [17] Christos Dimitrakakis, Blaine Nelson, Aikaterini Mitrokotsa, and Benjamin IP Rubinstein. Robust and private bayesian inference. In International Conference on Algorithmic Learning Theory, pages 291–305. Springer, 2014.
- [18] Wei Dong and Ke Yi. Universal private estimators. arXiv preprint arXiv:2111.02598, 2021.
- [19] John Duchi and Ryan Rogers. Lower bounds for locally private estimation via communication complexity. In Conference on Learning Theory, pages 1161–1191. PMLR, 2019.
- [20] John C Duchi, Michael I Jordan, and Martin J Wainwright. Minimax optimal procedures for locally private estimation. Journal of the American Statistical Association, 113(521):182–201, 2018.
- [21] Cynthia Dwork and Jing Lei. Differential privacy and robust statistics. In Proceedings of the forty-first annual ACM symposium on Theory of computing, pages 371–380, 2009.
- [22] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284. Springer, 2006.
- [23] Cynthia Dwork and Aaron Roth. The algorithmic foundations of differential privacy. Foundations and Trends in Theoretical Computer Science, 9(3-4):211–407, 2014.
- [24] Cynthia Dwork, Kunal Talwar, Abhradeep Thakurta, and Li Zhang. Analyze gauss: optimal bounds for privacy-preserving principal component analysis. In Proceedings of the forty-sixth annual ACM symposium on Theory of computing, pages 11–20, 2014.
- [25] Úlfar Erlingsson, Vitaly Feldman, Ilya Mironov, Ananth Raghunathan, Kunal Talwar, and Abhradeep Thakurta. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 2468–2479. SIAM, 2019.
- [26] Úlfar Erlingsson, Vasyl Pihur, and Aleksandra Korolova. Rappor: Randomized aggregatable privacy-preserving ordinal response. In Proceedings of the 2014 ACM SIGSAC conference on computer and communications security, pages 1054–1067, 2014.
- [27] Hossein Esfandiari, Vahab Mirrokni, and Shyam Narayanan. Tight and robust private mean estimation with few users. arXiv preprint arXiv:2110.11876, 2021.
- [28] Giulia Fanti, Vasyl Pihur, and Úlfar Erlingsson. Building a rappor with the unknown: Privacy-preserving learning of associations and data dictionaries. Proceedings on Privacy Enhancing Technologies, 2016(3):41–61, 2016.
- [29] Vitaly Feldman, Tomer Koren, and Kunal Talwar. Private stochastic convex optimization: optimal rates in linear time. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, pages 439–449, 2020.
- [30] Vitaly Feldman, Audra McMillan, and Kunal Talwar. Hiding among the clones: A simple and nearly optimal analysis of privacy amplification by shuffling. In 2021 IEEE 62nd Annual Symposium on Foundations of Computer Science (FOCS), pages 954–964. IEEE, 2022.
- [31] Vitaly Feldman and Thomas Steinke. Calibrating noise to variance in adaptive data analysis. In Conference On Learning Theory, pages 535–544. PMLR, 2018.
- [32] James Foulds, Joseph Geumlek, Max Welling, and Kamalika Chaudhuri. On the theory and practice of privacy-preserving bayesian data analysis. arXiv preprint arXiv:1603.07294, 2016.
- [33] Marco Gaboardi, Ryan Rogers, and Or Sheffet. Locally private mean estimation: -test and tight confidence intervals. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 2545–2554. PMLR, 2019.
- [34] Moritz Hardt and Eric Price. The noisy power method: A meta algorithm with applications. Advances in neural information processing systems, 27, 2014.
- [35] Moritz Hardt and Aaron Roth. Beating randomized response on incoherent matrices. In Proceedings of the forty-fourth annual ACM symposium on Theory of computing, pages 1255–1268, 2012.
- [36] Moritz Hardt and Aaron Roth. Beyond worst-case analysis in private singular vector computation. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing, pages 331–340, 2013.
- [37] Samuel B Hopkins, Gautam Kamath, and Mahbod Majid. Efficient mean estimation with pure differential privacy via a sum-of-squares exponential mechanism. arXiv preprint arXiv:2111.12981, 2021.
- [38] Lijie Hu, Shuo Ni, Hanshen Xiao, and Di Wang. High dimensional differentially private stochastic optimization with heavy-tailed data. arXiv preprint arXiv:2107.11136, 2021.
- [39] Prateek Jain, Chi Jin, Sham M Kakade, Praneeth Netrapalli, and Aaron Sidford. Streaming pca: Matching matrix bernstein and near-optimal finite sample guarantees for oja’s algorithm. In Conference on learning theory, pages 1147–1164. PMLR, 2016.
- [40] Arun Jambulapati, Jerry Li, and Kevin Tian. Robust sub-gaussian principal component analysis and width-independent schatten packing. Advances in Neural Information Processing Systems, 33, 2020.
- [41] Matthew Joseph, Janardhan Kulkarni, Jieming Mao, and Steven Z Wu. Locally private gaussian estimation. Advances in Neural Information Processing Systems, 32, 2019.
- [42] Peter Kairouz, Sewoong Oh, and Pramod Viswanath. The composition theorem for differential privacy. In International conference on machine learning, pages 1376–1385. PMLR, 2015.
- [43] Gautam Kamath, Jerry Li, Vikrant Singhal, and Jonathan Ullman. Privately learning high-dimensional distributions. In Conference on Learning Theory, pages 1853–1902. PMLR, 2019.
- [44] Gautam Kamath, Xingtu Liu, and Huanyu Zhang. Improved rates for differentially private stochastic convex optimization with heavy-tailed data. arXiv preprint arXiv:2106.01336, 2021.
- [45] Gautam Kamath, Argyris Mouzakis, and Vikrant Singhal. New lower bounds for private estimation and a generalized fingerprinting lemma. arXiv preprint arXiv:2205.08532, 2022.
- [46] Gautam Kamath, Argyris Mouzakis, Vikrant Singhal, Thomas Steinke, and Jonathan Ullman. A private and computationally-efficient estimator for unbounded gaussians. arXiv preprint arXiv:2111.04609, 2021.
- [47] Gautam Kamath, Vikrant Singhal, and Jonathan Ullman. Private mean estimation of heavy-tailed distributions. arXiv preprint arXiv:2002.09464, 2020.
- [48] Michael Kapralov and Kunal Talwar. On differentially private low rank approximation. In Proceedings of the twenty-fourth annual ACM-SIAM symposium on Discrete algorithms, pages 1395–1414. SIAM, 2013.
- [49] Vishesh Karwa and Salil Vadhan. Finite sample differentially private confidence intervals. arXiv preprint arXiv:1711.03908, 2017.
- [50] Daniel Kifer, Adam Smith, and Abhradeep Thakurta. Private convex empirical risk minimization and high-dimensional regression. In Conference on Learning Theory, pages 25–1. JMLR Workshop and Conference Proceedings, 2012.
- [51] Weihao Kong, Raghav Somani, Sham Kakade, and Sewoong Oh. Robust meta-learning for mixed linear regression with small batches. Advances in Neural Information Processing Systems, 33, 2020.
- [52] Pravesh K Kothari, Pasin Manurangsi, and Ameya Velingker. Private robust estimation by stabilizing convex relaxations. arXiv preprint arXiv:2112.03548, 2021.
- [53] Janardhan Kulkarni, Yin Tat Lee, and Daogao Liu. Private non-smooth empirical risk minimization and stochastic convex optimization in subquadratic steps. arXiv preprint arXiv:2103.15352, 2021.
- [54] Mengchu Li, Thomas B Berrett, and Yi Yu. On robustness and local differential privacy. arXiv preprint arXiv:2201.00751, 2022.
- [55] Xiyang Liu, Weihao Kong, Sham Kakade, and Sewoong Oh. Robust and differentially private mean estimation. Advances in Neural Information Processing Systems, 34, 2021.
- [56] Xiyang Liu, Weihao Kong, and Sewoong Oh. Differential privacy and robust statistics in high dimensions. arXiv preprint arXiv:2111.06578, 2021.
- [57] Frank McSherry and Kunal Talwar. Mechanism design via differential privacy. In 48th Annual IEEE Symposium on Foundations of Computer Science (FOCS’07), pages 94–103. IEEE, 2007.
- [58] Frank D McSherry. Privacy integrated queries: an extensible platform for privacy-preserving data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of data, pages 19–30, 2009.
- [59] Kentaro Minami, HItomi Arai, Issei Sato, and Hiroshi Nakagawa. Differential privacy without sensitivity. In Advances in Neural Information Processing Systems, pages 956–964, 2016.
- [60] Darakhshan J Mir. Differential privacy: an exploration of the privacy-utility landscape. Rutgers The State University of New Jersey-New Brunswick, 2013.
- [61] Erkki Oja. Simplified neuron model as a principal component analyzer. Journal of mathematical biology, 15(3):267–273, 1982.
- [62] Phillippe Rigollet and Jan-Christian Hütter. High dimensional statistics. Lecture notes for course 18S997, 813(814):46, 2015.
- [63] Or Sheffet. Old techniques in differentially private linear regression. In Algorithmic Learning Theory, pages 789–827. PMLR, 2019.
- [64] Jun Tang, Aleksandra Korolova, Xiaolong Bai, Xueqiang Wang, and Xiaofeng Wang. Privacy loss in apple’s implementation of differential privacy on macos 10.12. arXiv preprint arXiv:1709.02753, 2017.
- [65] Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
- [66] Christos Tzamos, Emmanouil-Vasileios Vlatakis-Gkaragkounis, and Ilias Zadik. Optimal private median estimation under minimal distributional assumptions. Advances in Neural Information Processing Systems, 33:3301–3311, 2020.
- [67] Roman Vershynin. High-dimensional probability: An introduction with applications in data science, volume 47. Cambridge university press, 2018.
- [68] Duy Vu and Aleksandra Slavkovic. Differential privacy for clinical trial data: Preliminary evaluations. In 2009 IEEE International Conference on Data Mining Workshops, pages 138–143. IEEE, 2009.
- [69] Vincent Vu and Jing Lei. Minimax rates of estimation for sparse pca in high dimensions. In Artificial intelligence and statistics, pages 1278–1286. PMLR, 2012.
- [70] Martin J Wainwright. High-dimensional statistics: A non-asymptotic viewpoint, volume 48. Cambridge University Press, 2019.
- [71] Di Wang, Hanshen Xiao, Srinivas Devadas, and Jinhui Xu. On differentially private stochastic convex optimization with heavy-tailed data. In International Conference on Machine Learning, pages 10081–10091. PMLR, 2020.
- [72] Yu-Xiang Wang. Revisiting differentially private linear regression: optimal and adaptive prediction & estimation in unbounded domain. arXiv preprint arXiv:1803.02596, 2018.
- [73] Yu-Xiang Wang, Stephen Fienberg, and Alex Smola. Privacy for free: Posterior sampling and stochastic gradient monte carlo. In International Conference on Machine Learning, pages 2493–2502. PMLR, 2015.
- [74] Huan Xu, Constantine Caramanis, and Shie Mannor. Principal component analysis with contaminated data: The high dimensional case. arXiv preprint arXiv:1002.4658, 2010.
Appendix
Appendix A Related work
Our work builds upon a series of advances in private SGD [44, 8, 7, 29, 53, 71, 38] to make advance in understanding the tradeoff of privacy and sample complexity for PCA. Such tradeoffs have been studied extensively in canonical statistical estimation problems of mean (and covariance) estimation and linear regression.
Private mean estimation. As one of the most fundamental problem in private data analysis, mean estimation is initially studied under the bounded support assumptions, and the optimal error rate is now well understood. More recently, [6] considered the private mean estimation problem for -th moment bounded distributions where the support of the data is unbounded and provided minimax error bound in various settings. [49] studied private mean estimation from Gaussian sample, and obtained an optimal error rate. There has been a lot of recent interests on private mean estimation under various assumptions, including mean and covariance joint estimation [43, 9], heavy-tailed mean estimation [47], mean estimation for general distributions [31, 66], distribution adaptive mean estimation [13], estimation for unbounded distribution parameters [46], mean estimation under pure differential privacy [37], local differential privacy [19, 20, 33, 41, 54], user-level differential privacy [27, 27], Mahalanobis distance[11, 56] and robust and differentially private mean estimation [55, 52, 56].
Private linear regression The goal of private linear regression is to learn a linear predictor of response variable from a set of examples while guarantee the privacy of the examples. Again, the work on private linear regression can be divided into two categories: deterministic and randomized. In the deterministic setting where the data is deterministically given without any probabilistic assumptions, significant advances in DP linear regression has been made [68, 50, 60, 17, 8, 73, 32, 59, 72, 63]. In the randomized settings where each example is drawn i.i.d. from a distribution, [21] proposes an exponential time algorithm that achieves asymptotic consistency. [14] provides an efficient and minimax optimal algorithm under sub-Gaussian design and nearly identity covariance assumptions. Very recently, [56] for the first time gives an exponential time algorithm that achieves minimax risk for general covariance matrix under sub-Gaussian and hypercontractive assumptions.
Private PCA without spectral gap. There is a long line of work in Private PCA [35, 36, 34, 10, 15, 48, 24, 5]. We explain the closely related ones in Section 2.3, providing interpretation when the covariance matrix has a spectral gap.
When there is no spectral gap, one can still learn a principal component. However, since the principal component is not unique, the error is typically measured in how much of the variance is captured in the estimated direction: . [15] introduces an exponential mechanism (from [57]) which samples an estimate from a distribution , where is a normalization constant to ensure that the pdf integrates to one. This achieves a stronger pure DP, i.e., -DP, but is computationally expensive; [15] does not provide a tractable implementation and [48] provides a polynomial time implementation with time complexity at least cubic in . This achieves an error rate of in [15, Theorem 7] when samples are from Gaussian , which, when there is a spectral gap, translates into
(17) |
with high probability. Closest to our setting is the analyses in [56, Corollary 6.5] that proposed an exponential mechanism that achieves with high probability under -DP and Gaussian samples, but this algorithm is computationally intractable. This is shown to be tight when there is no spectral gap. When there is a spectral gap, this translates into
(18) |
As these algorithms and the corresponding analyses are tailored for gap-free cases, they have better dependence on and worse dependence on and , compared to the proposed DP-PCA and its error rate in Corollary 5.2.
Appendix B Preliminary on differential privacy
Lemma B.1 (Stability-based histogram [49, Lemma 2.3]).
For every , domain , for every collection of disjoint bins defined on , , , and there exists an -differentially private algorithm such that for any set of data
-
1.
-
2.
and
-
3.
then,
Since we focus on one-pass algorithms where a data point is only accessed once, we need a basic parallel composition of DP.
Lemma B.2 (Parallel composition [58]).
Consider a sequence of interactive queries each operating on a subset of the database and each satisfying -DP. If ’s are disjoint then the composition is -DP.
We also utilize the following serial composition theorem.
Lemma B.3 (Serial composition [23]).
If a database is accessed with an -DP mechanism and then with an -DP mechanism, then the end-to-end privacy guarantee is -DP.
When we apply private histogram learner to each coordinate, we require more advanced composition theorem from [42].
Lemma B.4 (Advanced composition [42]).
For , an end-to-end guarantee of -differential privacy is satisfied if a database is accessed times, each with a -differential private mechanism.
Appendix C Lower bounds
When privacy is not required, we know from Theorem 2.2 that under Assumptions A.1-A.3, we can achieve an error rate of . In the regime of and , samples are enough to achieve an arbitrarily small error. The next lower bounds shows that we need samples when -DP is required, showing that private PCA is significantly more challenging than a non-private PCA, when assuming only the support and moment bounds of assumptions A.1-A.3. We provide a proof in Appendix C.3.
Theorem C.1 (Lower bound without Assumption A.4).
We next provide the proofs of all the lower bounds.
C.1 Proof of Theorem 5.3 on the lower bound for Gaussian case
Our proof is based on following differentially private Fano’s method [3, Corollary 4].
Theorem C.2 (DP Fano’s method [3, Corollary 4]).
Let denote family of distributions of interest and denote the population parameter. Our goal is to estimate from i.i.d. samples . Let be an -DP estimator. Let be a pseudo-metric on parameter space . Let be an index set with finite cardinality. Define be an indexed family of probability measures on measurable set . If for any ,
-
1.
,
-
2.
,
-
3.
,
then
For our problem, we are interested in Gaussian and metric . Using Theorem C.2, it suffices to construct such indexed set and the indexed distribution family . We use the same construction as in [69, Theorem 2.1] introduced to prove a lower bound for the (non-private) sparse PCA problem. The construction is given by the following lemma.
Lemma C.3 ([69, Lemma 3.1.2]).
Let . For , there exists and an absolute constant such that for every , and .
Fix . For each , we define and . It is easy to see that has eigenvalues . The top eigenvector of is . Using Lemma F.4, we know for any ,
(20) |
Now we set
(23) |
It follows from Theorem C.2 and that there exists a constant such that
(24) |
C.2 Proof of Theorem 5.4
We first construct an indexed set and indexed distribution family such that satisfies A.1, A.2 and A.3 in Assumption 1. Our construction is defined as follows.
By [3, Lemma 6] , there exists a finite set , with cardinality , such that for any , .
Let denotes the density function of . Let be a uniform distribution on two point masses . Let be Gaussian distribution . For , we construct . It is easy to see that is a distribution over with the following density function.
(25) |
The mean of is . The covariance of is . The top eigenvalue is , the top eigenvector is , and the second eigenvalue is . And .
By the fact that and for any unit vector , we have for any unit vector .
Our proof technique is based on following lemma.
Lemma C.4 ([6, Theorem 3]).
Fix . Define for such that such that . Let be a differentially private estimator. Then,
(26) |
Set . By Lemma F.4, .
Lemma C.4 implies
(27) | ||||
(28) | ||||
(29) |
For , we know . We choose
(30) |
This implies
(31) |
So we have there exists a constant such that
(32) | ||||
(33) |
C.3 Proof of Theorem C.1
Similar to the proof of Theorem 5.3, we use DP Fano’s method in Theorem C.2. It suffices to construct an indexed set and indexed distribution family such that satisfies A.1, A.2 and A.3 in Assumption 1. Our construction is defined as follows.
Let . By Lemma C.3, there exists a finite set , with cardinality , such that for any , , where .
Let denotes the density function of . We construct over for with the following density function.
(34) |
The mean of is . The covariance of is . It is easy to see that the top eigenvalue is , the top eigenvector is , and the second eigenvalue is .
If , then . If , by the fact that , we know . This implies satisfies A.2 in Assumption 1 with for i.i.d. samples.
For , we have . By Lemma F.4, .
By Theorem C.2, there exists a constant such that
(35) |
Appendix D The analysis of Private Oja’s Algorithm
We analyze Private Oja’s Algorithm in Algorithm 2.
D.1 Proof of privacy in Lemma 3.1
We use following Theorem D.1 to prove our privacy guarantees.
Theorem D.1 (Privacy amplification by shuffling [30, Theorem 3.8]).
For any domain , let for (where is the range space of ) be a sequence of algorithms such that is an -DP local randomizer for all values of auxiliary inputs . Let be the algorithm that given a dataset , samples a uniform random permutation over , then sequentially computes for and outputs . Then for any such that , is -DP, where
(36) |
Let . Let . We show is an -DP local randomizer.
If there is no noise in each update step, the update rule is
(37) | ||||
(38) |
The sensitivity of is with respect to a difference in . By Gaussian mechanism in Lemma 2.4 and post processing property of local differential privacy, we know is -DP local randomizer.
Assume that . By Theorem D.1, for such that , Algorithm 2 is -DP and for some constant ,
(39) | ||||
(40) | ||||
(41) | ||||
(42) | ||||
(43) |
for some absolute constant .
Set , for some and . We have
(44) | ||||
(45) |
For any , by Eq. (45), there exists some sufficiently large such that .
Recall that we assume . This means .
D.2 Proof of clipping in Lemma 3.2
Let . Let . By Lemma 2.1, we know for any , with probability ,
(46) |
Applying union bound over all basis vectors and all samples, we know with probability , for all and
(47) |
This implies that with probability , for all , we have
(48) |
D.3 Proof of utility in Theorem 3.3
Lemma 3.2 implies that with probability , Algorithm 2 does not have any clipping. Under this event, the update rule becomes
(49) | ||||
(50) |
where and each entry in is i.i.d. sampled from standard Gaussian . This follows form the fact that and .
Let . We show satisfies the three conditions in Theorem 2.2 ([39, Theorem 4.12]). It is easy to see that from Assumption A.1. Next, we show upper bound of . We have
(51) |
where the last inequality follows from Lemma F.3 and is an absolute constant. Let . Similarly, we can show that .
Under the event that for all , we apply Theorem 2.2 with a learning rate where
(52) |
Then Theorem 2.2 implies that with probability ,
(53) |
for some positive constant .
Set , the above bound implies
(54) |
where , and
(55) |
For and , selecting , and assuming
(56) |
with large enough positive constants , and , we have and . Hence it is sufficient to have
with a large enough constant.
Appendix E The analysis of DP-PCA
We provides the proofs for Theorem 5.1, Theorem 6.1, and Lemma 6.2 that guarantees the privacy and utility of DP-PCA.
E.1 Proof of Theorem 5.1 on the privacy and utility of DP-PCA
From Theorem 6.1 we know that Alg. 4 returns satisfying with high probability. Then, from Lemma 6.2, we know that with high probability Alg 5 returns an unbiased estimate of the gradient mean with added Gaussian noise. Under this case, the update rule becomes
(57) | ||||
(58) |
where , denote the estimated eigenvalue of covariance of the gradients at -th iteration, and each entry in is i.i.d. sampled from standard Gaussian . This follows form the fact that and .
Let such that , which follows from the fact that (Theorem 6.1 and Assumption A.4). Let . We show satisfies the three conditions in Theorem 2.2 ([39, Theorem 4.12]). It is easy to see that from Assumption A.1. Next, we show upper bound of . We have
(59) |
where the last inequality follows from Lemma F.3 and is an absolute constant. Let . Similarly, we can show that . By Lemma F.5 and Lemma F.2, we know with probability , for all ,
Let . Under the event that for all , we apply Theorem 2.2 with a learning rate where
(60) |
Then Theorem 2.2 implies that with probability ,
(61) |
for some positive constant . Using and Eq. (59), the above bound implies
(62) |
where , and
(63) |
For and , selecting , , and assuming and
(64) |
with large enough positive constants , , and , we have and . Hence it is sufficient to have
with a large enough constant.
E.2 Algorithm and proof of Theorem 6.1 on top eigenvalue estimation
Taking the difference ensures that is zero mean, such that we can directly use the top eigenvalue of for . We compute a histogram over those top eigenvalues. This histogram is privatized by adding noise only to the occupied bins and thresholding small entries of the histogram to be zero. The choice ensures that the most occupied bin does not change after adding the DP noise to the histograms, and is necessary for handling unbounded number of bins. We emphasize that we do not require any upper and lower bounds on the eigenvalue, thanks to the private histogram learner from [12, 49] that gracefully handles unbounded number of bins.
The privacy guarantee follows from the privacy guarantee of the histogram learner provided in Lemma B.1.
For utility analysis, we follow the analysis of [46, Theorem 3.1]. The main difference is that we prove a smaller sample complexity sine we only need the top eigenvalue, and we analyze a more general distribution family. The random vector is zero mean with covariance , where , and satisfies with probability ,
(65) |
which follows from Lemma 2.1. Applying union bound over all basis vectors , we know with probability ,
We next show that conditioned on event , the covariance is close to the true covariance . Note that
(66) |
We next show the empirical covariance concentrates around . First of all, using union bound on Eq. (65), we have with probability , for all and ,
Under the event that for all , , [70, Corrollary 6.20] together with Eq. (66) implies
The above bound implies that with probability ,
This means if , then with probability , for all , . This means all of must be within interval. Thus, at most two consecutive buckets are filled with . By private histogram from Lemma B.1, if , one of those two bins are released. The resulting total multiplicative error is bounded by .
E.3 Algorithm and proof of Lemma 6.2 on DP mean estimation
The histogram learner is called times, each with -DP guarantee, and the end-to-end privacy guarantee is from Lemma B.4 for . The sensitivity of the clipped mean estimate is . Gaussian mechanism with covariance satisfy -DP from Lemma 2.4 for . Putting these two together, with serial composition of Lemma B.3, we get the desired privacy guarantee.
The proof of utility follows similarly as [55, Lemma D.2]. Let . Denote the population probability of -th coordinate at as . Denote the empirical probability as . Denote the private empirical probability being released as .
Fix . Let be the bin that contains the . Then we know . By Lemma 2.1, we know . This means and .
By Dvoretzky-Kiefer-Wolfowitz inequality and an union bound over , we have that with probability , . Using Lemma B.1, if , with probability , we have . Thus, with our assumption on , we can make sure with probability , . Then we have . This implies with probability , the algorithm must pick one of the bins from . This means . By tail bound of Lemma 2.1, we know for all and , . This completes our proof.
Appendix F Technical lemmas
Lemma F.1.
Let . Then there exists universal constant such that with probability ,
(67) |
Proof.
Let . Then is also a Gaussian with . By Hanson-Wright inequality ( [67, Theorem 6.2.1]), there exists universal constant such that with probability ,
(68) |
∎
Lemma F.2 ([67, Theorem 4.4.5]).
Let be a random matrix where each entry is i.i.d. sampled from standard Gaussian . Then there exists universal constant such that with probability , for .
Lemma F.3.
Let be a random matrix where each entry is i.i.d. sampled from standard Gaussian . Then we have and .
Proof.
By Lemma F.2, there exists universal constant such that
(69) |
Then
(70) | ||||
(71) | ||||
(72) | ||||
(73) |
where is an absolute constant. The proof for the second claim follows similarly. ∎
Lemma F.4.
Let . Then
(74) |
If , then
(75) |
The following lemma follows from matrix Bernstein inequality [65].