FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data

Shuting Shen, Junwei Lu, and Xihong Lin
Shuting Shen is PhD student ([email protected]) and Junwei Lu ([email protected]) is Assistant Professor at the Department of Biostatistics at Harvard T.H. Chan School of Public Health. Xihong Lin is Professor of Biostatistics at Harvard T.H. Chan School of Public Health and Professor of Statistics at Harvard University ([email protected]). This work was supported by the National Institutes of Health grants R35-CA197449, U01-HG009088, U01HG012064, U19-CA203654, and P30 ES000002.

Abstract

Principal component analysis (PCA) is one of the most popular methods for dimension reduction. In light of the rapidly growing large-scale data in federated ecosystems, the traditional PCA method is often not applicable due to privacy protection considerations and large computational burden. Algorithms were proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension $d$ and the sample size $n$ are ultra-large, by simultaneously performing parallel computing along $d$ and distributed computing along $n$ . Specifically, we utilize $L$ parallel copies of $p$ -dimensional fast sketches to divide the computing burden along $d$ and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI enjoys the same non-asymptotic error rate as the traditional PCA when $Lp\geq d$ . We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as $Lp$ increases. We perform extensive simulations to show that FADI substantially outperforms the existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.

Keyword: Computational efficiency; Distributed computing; Efficient communication; Fast PCA; Large-scale inference; Federated learning; Random matrices; Random sketches.

1 Introduction

As one of the most popular methods for dimension reduction, principal component analysis (PCA) finds applications in a broad spectrum of scientific fields including network studies [3], statistical genetics [35] and finance [31]. Methodologically, parameter estimation in many statistical models is based on PCA, such as spectral clustering in graphical models [2], missing data imputation through low-rank matrix completion [23], and clustering with subsequent k-means refinement in Gaussian mixture models [12]. When it comes to real data analysis, however, several shortcomings of the traditional PCA method hinder its application to large-scale datasets. First, the high dimensionality and large sample size of modern big data can render the PCA computation infeasible in practice. For instance, PCA is commonly used for controlling for ancestry confounding in Genome-Wide Association Studies (GWAS) [33], yet biomedical databases, such as the UK Biobank [39], often contain hundreds of thousands to millions of Single Nucleotide Polymorphisms (SNPs) and subjects, which entails more scalable algorithms to handle the intensive computation of PCA. Second, large-scale datasets in many applications are stored in federated ecosystems, where data cannot leave individual warehouses due to privacy protection considerations [8, 14, 15, 29, 34]. This calls for federated learning methods [26, 30] that provide efficient and privacy-protected strategies for joint analysis across data warehouses without the need to exchange individual-level data.

The burgeoning popularity of large-scale data necessitates the development of fast algorithms that can cope with both high dimensionality and massiveness efficiently and distributively. Indeed, efforts have been made in recent years on developing fast PCA and distributed PCA algorithms. The existing fast PCA algorithms use the full-sample data and apply random projection to speed up PCA calculations [11, 20], while the existing distributed PCA algorithms apply the traditional PCA method to the split data and aggregate the results [18, 28].

Specifically, fast PCA algorithms utilize the fact that the column space of a low-rank matrix can be represented by a small set of columns and use random projection to approximate the original high-dimensional matrix [4]. For instance, Halko et al., [20] proposed to estimate the $K$ leading eigenvectors of a $d\times d$ matrix ( $K\ll d$ ) using Gaussian random sketches, which decreases the computation time by a factor of $O(d)$ at the cost of increasing the statistical error by a factorial power of $d$ . Chen et al., [11] modified Halko et al., [20]’s method by repeating the fast sketching multiple times and showed the consistency of the algorithm using the average of i.i.d. random sketches when the number of sketches goes to infinity. However, they did not study the trade-off between computation complexities and error rates in finite samples, and hence did not recommend the number of fast sketches that optimizes both the computational efficiency and the statistical accuracy. As the fast PCA methods use the full data, they have two major limitations. First, they are often not scalable to large sample sizes $n$ . Second, they are not applicable to federated data when data in different sites cannot be shared.

The existing distributed PCA algorithms reduce the PCA computational burden by partitioning the full data “horizontally” or “vertically” [18, 27, 28]. The horizontal partition splits the data over the sample size $n$ , whereas the vertical partition splits the data over the dimension $d$ . Horizontal partition is useful when the sample size $n$ is large or when the data are federated in multiple sites. For example, Fan et al., [18] considered the horizontally distributed PCA where they estimated the $K$ leading eigenvectors of the $d\times d$ population covariance matrix by applying traditional PCA to each data split and aggregating the PCA results across different datasets. They showed when the number of data splits is not too large, the error rate of their algorithm is of the same order as the traditional PCA. Since they used the traditional PCA algorithm for each data partition, the computational complexity is at least of order $O(d^{3})$ , which will be computationally difficult when $d$ is large, e.g., in GWAS, $d$ is hundreds of thousands to millions. Kargupta et al., [28] considered vertical partition and developed a method that collects local principal components (PCs) and then reconstructs global PCs by linear transformations. However, there is no theoretical guarantee on the error rate compared with the traditional full sample PCA, and the method may fail when variables are correlated.

Apart from the aforementioned PCA applications in parameter estimation, inference also constitutes an important part of PCA methods. For example, when studying the ancestry groups of whole genome data under the mixed membership models, while the estimation error rate guarantees the overall misclustering rate for all subjects, one may be interested in testing whether two individuals of interest share the same ancestry membership profile and assessing the associated statistical uncertainty [16]. Furthermore, despite the rich literature depicting the asymptotic distribution of traditional PCA estimators under different statistical models [16, 32, 41], distributional characterization of fast PCA methods and distributed PCA methods are not well-studied. For instance, Yang et al., [44] characterized the convergence of fast sketching estimators in probability but gave no inferential results. Halko et al., [20] provided error bound for the fast PCA algorithm, but there is no characterization of the asymptotic distribution and hence no evaluation of the testing efficiency. Fan et al., [18] derived the non-asymptotic error rate of the distributed PC estimator but did not provide distributional guarantees, and inference based upon their estimator is computationally intensive when the dimension $d$ is large.

In summary, the existing fast PCA algorithms accelerate computation along $d$ by fast sketching, but cannot handle distributed computing along $n$ . The existing distributed PCA methods mainly focus on dividing the computing burden along $n$ , while distributed computing along $d$ is complicated by variable correlation and lacks theoretical guarantees. It remains an open question how to develop fast and distributed PCA algorithms that can handle both large $d$ and $n$ simultaneously, while achieving the same asymptotic efficiency as the traditional PCA.

In view of the gaps in existing literature, we propose in this paper a scalable and computationally efficient FAst DIstributed (FADI) PCA method applicable to federated data that could be large in both $d$ and $n$ . More specifically, to obtain the $K$ -leading PCs of a $d\times d$ matrix $\mathbf{M}$ from its estimator $\widehat{\mathbf{M}}$ , we take the divide-and-conquer strategy to break down the computation complexities along the dimension $d$ : we generate the $p$ -dimensional fast sketch $\widehat{\mathbf{Y}}=\widehat{\mathbf{M}}\bm{\Omega}$ and perform SVD on $\widehat{\mathbf{Y}}$ instead of $\widehat{\mathbf{M}}$ to expedite the PCA computation, where $\bm{\Omega}\in\mathbb{R}^{d\times p}$ is a Gaussian test matrix with $K\leq p\ll d$ ; meanwhile, to adjust for the additional variability induced by random approximation, we repeat the fast sketching for $L$ times in parallel, and then aggregate the SVD results across data splits to restore statistical accuracy. When the data are distributively stored, the federated structure of $\widehat{\mathbf{Y}}$ also enables its easy implementation without the need of sharing individual-level data, which in turn facilitates distributing the computing burden along $n$ among the split samples, as opposed to the existing fast PCA methods that are not scalable to large $n$ . We will show that FADI has computational complexities of smaller magnitudes than existing methods (see Table 3), while achieving the same asymptotic efficiency as the traditional PCA. Moreover, we establish FADI under general frameworks that cover multiple statistical models. We list below four statistical problems as illustrative applications of FADI, where we will define $\mathbf{M}$ and $\widehat{\mathbf{M}}$ in each setting:

(1)

Spiked covariance model: let $\bm{X}_{1},\ldots,\bm{X}_{n}\in\mathbb{R}^{d}$ be i.i.d. random vectors with spiked covariance $\bm{\Sigma}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}+\sigma^{2}\mathbf{I}$ , where $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ is the rank- $K$ spiked component of interest. Define $\widehat{\mathbf{M}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}-\widehat{\sigma}^{2}\mathbf{I}$ to be the estimator for $\mathbf{M}$ , where $\widehat{\sigma}^{2}$ is a consistent estimator for $\sigma^{2}$ . We assume that the data are split along the sample size $n$ and stored on $m$ servers.
(2)

Degree-corrected mixed membership (DCMM) model: let $\mathbf{X}$ be the adjacency matrix for an undirected graph of $d$ nodes, where the connection probabilities between nodes are determined by their membership assignments to $K$ communities and node-associated degrees. Consider the data $\widehat{\mathbf{M}}=\mathbf{X}$ to be split along $d$ on $m$ servers, and we aim to infer the membership profiles of nodes by recovering the $K$ -leading eigenspace of the marginal connection probability matrix $\mathbf{M}={\mathbb{E}}(\mathbf{X})$ using the data $\widehat{\mathbf{M}}$ .
(3)

Gaussian mixture models (GMM): let $\bm{W}_{1},\ldots,\bm{W}_{d}\in\mathbb{R}^{n}$ be independent random vectors drawn from $K$ Gaussian distributions with different means and identity covariance matrix. We are interested in clustering the samples by estimating the eigenspace of $\mathbf{M}=[\mathbf{M}_{jj^{\prime}}]=[{\mathbb{E}}(\bm{W}_{j})^{\top}{\mathbb{E}}(\bm{W}_{j^{\prime}})]$ , whose estimator is given by $\widehat{\mathbf{M}}=[\widehat{\mathbf{M}}_{jj^{\prime}}]=[\bm{W}_{j}^{\top}\bm{W}_{j^{\prime}}]-n\mathbf{I}$ . Assume the data are distributively stored on $m$ servers along the dimension $n$ .
(4)

Incomplete matrix inference: we have a low-rank matrix $\mathbf{M}$ of interest, and we observe $\widehat{\mathbf{M}}$ as a perturbed version of $\mathbf{M}$ with missing entries. Assume $\widehat{\mathbf{M}}$ to be vertically split along $d$ on $m$ servers, and we aim to infer the eigenspace of $\mathbf{M}$ through $\widehat{\mathbf{M}}$ .

We will elaborate on the above examples in Section 2. We consider distributed settings for all the four problems, where the data are split along $n$ for the spiked covariance model and the GMM, and along $d$ for the DCMM model and the incomplete matrix inference model given that $d$ coincides with $n$ for those two. We will establish in Section 4.1 a general non-asymptotic error bound applicable to multiple statistical models as well as case-specific error rates for each example, and show that the non-asymptotic error rate of FADI is of the same order as the traditional PCA as long as the sketching dimension $p$ and the number of fast sketches $L$ are sufficiently large. Inferentially, we provide distributional characterizations of FADI under different regimes of the fast sketching parameters. We observe a phase-transition phenomenon where the asymptotic covariance matrix takes on two different forms as $Lp$ increases. When $Lp\gg d$ , the FADI estimator converges in distribution to a multivariate Gaussian, and the asymptotic relative efficiency (ARE) between FADI and the traditional PCA is 1 (see Figure 1). On the other hand, when $Lp\ll d$ , FADI has higher computational efficiency and still enjoys asymptotic normality under certain models, but will have a larger asymptotic variance.

Refer to caption — Figure 1: Asymptotic relative efficiency (ARE) between the FADI estimator and the traditional PCA estimator under Example 1 and Example 3, where the ARE is measured by $\det(\widehat{\bm{\Sigma}}^{\rm FADI})^{1/K}\cdot\det(\widehat{\bm{\Sigma}}^{\rm PCA})^{-1/K}$ with $\widehat{\bm{\Sigma}}^{\rm FADI}$ and $\widehat{\bm{\Sigma}}^{\rm PCA}$ being the empirical covariance matrices for the FADI and traditional PCA estimators [36].

Related Papers on Inferential Analysis of PCA

There has been a great amount of literature depicting the asymptotic distribution of traditional PCA estimators. Anderson, [5] characterized the asymptotic normality of eigenvectors and eigenvalues for traditional PCA on the sample covariance matrix with fixed dimension. Paul, [32] and Wang and Fan, [41] extended the analysis to the high-dimensional regime and established distributional results under the spiked covariance model. Similar efforts were made by Johnstone, [25] and Baik et al., [6], where they studied the limiting distribution of the largest empirical eigenvalue when both the dimension and the sample size go to infinity. Apart from inference on the sample covariance matrix of i.i.d. data, previous works also made progress in inferential analyses for a variety of statistical models including the DCMM model [16], the matrix completion problem [13], and high-dimensional data with heteroskedastic noise and missingness under the spiked covariance model [43]. Specifically, Fan et al., [16] employed statistics based on principal eigenspace estimators of the adjacency matrix to perform inference on whether two given nodes share the same membership profile under the DCMM model. Chen et al., [13] constructed entry-wise confidence intervals (CIs) for a low-rank matrix with missing data and Gaussian noise based on debiased convex/nonconvex PC estimators. A similar missing data inference problem was conducted in Yan et al., [43], where they adopted a refined spectral method with imputed diagonal for CI construction of the underlying spiked covariance matrix of corrupted samples with missing data.

The aforementioned works were all based upon the traditional PCA approach and considered no distributed data setting, and hence will suffer from low computational efficiency when the data are high-dimensional or distributively stored across different sites. Our paper fills the gap in the literature and provides general inferential results on the fast sketching method with high computational efficiency adapted to high-dimensional federated data.

Our Contributions

We summarize the major contributions of our paper as follows.

First, the existing PCA methods either handle high dimensions $d$ or large sample sizes $n$ , but not both. Specifically, fast PCA [20] handles large $d$ but has elevated error rates and is difficult to apply when $n$ is large. Distributed PCA [18] handles large $n$ but is not scalable to large $d$ , as it applies traditional PCA to each data split. FADI overcomes the limitations of these methods by providing scalable PCA when both $d$ and $n$ are large or data are federated. Due to the fact that variables are usually dependent, it is challenging to achieve parallel computing along $d$ and distributed computing along $n$ simultaneously. To address this challenge, FADI splits the data along $n$ and untangles the variable dependency along $d$ by dividing the high-dimensional data into $L$ copies of $p$ -dimensional fast sketches. Namely, for each split dataset, FADI performs multiple parallel fast sketchings instead of the traditional PCA, and then aggregates the PC results distributively over the split samples. We establish theoretical error bounds to show that FADI is as accurate as the traditional PCA so long as $Lp\gtrsim d$ .

Second, we provide distributional characterizations for inferential analyses and show a phase-transition phenomenon. We provide distributional guarantees on the FADI estimator to facilitate inference, which is absent in previous literature on fast PCA methods and distributed PCA methods. More specifically, we depict the trade-off between computational complexity and testing efficiency by studying FADI’s asymptotic distribution under the regimes $Lp\ll d$ and $Lp\gg d$ respectively. We show that the same asymptotic efficiency as the traditional PCA can be achieved at $Lp\gg d$ with a compromise on computational efficiency, while faster inferential procedures can be performed at $Lp\ll d$ with suboptimal testing efficiency. We further validate the distributional phase transition via numerical experiments.

Third, we propose FADI under a general framework applicable to multiple statistical models under mild assumptions, including the four examples discussed earlier in this section. We provide a comprehensive investigation of FADI’s performance both methodologically and theoretically under the general framework, and illustrate the results with the aforementioned statistical models. In comparison, the existing distributed methods mainly focus on estimating the covariance structure of independent samples [18].

Paper Organization

The rest of the paper is organized as follows. Section 2 introduces the problem setting and provides an overview of FADI and its intuition. Section 3 discusses FADI’s implementation details, as well as the computational complexity of FADI and its modifications when $K$ is unknown. Section 4 presents the theoretical results of the statistical error and asymptotic normality of the FADI estimator. Section 5 shows the numerical evaluation of FADI and comparison with several existing methods. The application of FADI to the 1000 Genomes Data is given in Section 6.

Notation

We use $\mathbf{1}_{d}\in\mathbb{R}^{d}$ to denote the vector of length $d$ with all entries equal to 1, and denote by $\{\mathbf{e}_{i}\}_{i=1}^{d}$ the canonical basis of $\mathbb{R}^{d}$ . For a matrix $\mathbf{A}=[\mathbf{A}_{ij}]\in\mathbb{R}^{m\times n}$ , we use $\sigma_{i}(\mathbf{A})$ (respectively $\lambda_{i}(\mathbf{A})$ ) to represent the $i$ -th largest singular value (respectively eigenvalue) of $\mathbf{A}$ , and $\sigma_{\max}(\mathbf{A})$ or $\sigma_{\min}(\mathbf{A})$ (respectively $\lambda_{\max}(\mathbf{A})$ or $\lambda_{\min}(\mathbf{A})$ ) stands for the largest or smallest singular value (respectively eigenvalue) of $\mathbf{A}$ . If $\mathbf{A}$ has the singular value decomposition (SVD) $\mathbf{A}=\mathbf{U}\mathbf{\Lambda}\mathbf{V}^{\top}=\sum_{j=1}^{K}\sigma_{j}\mathbf{u}_{j}\mathbf{v}_{j}^{\top}$ , then we denote by $\mathbf{A}^{\dagger}=\mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{U}^{\top}$ the pseudo-inverse of $\mathbf{A}$ , $\mathbf{P}_{\mathbf{A}}=\mathbf{A}\mathbf{A}^{\dagger}$ the projection matrix onto the column space of $\mathbf{A}$ , and $\operatorname{sgn}(\mathbf{A})=\sum_{\sigma_{j}>0}\mathbf{u}_{j}\mathbf{v}_{j}^{\top}$ the matrix signum. If $\mathbf{A}$ is positive definite with eigen-decomposition $\mathbf{A}=\mathbf{U}\mathbf{D}\mathbf{U}^{\top}$ , we define $\mathbf{A}^{1/2}=\mathbf{U}\mathbf{D}^{1/2}\mathbf{U}^{\top}$ and $\mathbf{A}^{-1/2}=\mathbf{U}\mathbf{D}^{-1/2}\mathbf{U}^{\top}$ . We denote by $\otimes$ the Kronecker product. For two orthonormal matrices $\mathbf{V},\mathbf{U}\in\mathbb{R}^{n_{1}\times n_{2}}$ with $n_{1}>n_{2}$ , we measure the distance between their column spaces by $\rho(\mathbf{U},\mathbf{V})=\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\text{F}}$ . For a vector $\mathbf{v}$ , we use $\|\mathbf{v}\|_{2}$ to denote the vector $\ell_{2}$ -norm, and $\|\mathbf{v}\|_{\infty}$ to denote the vector $\ell_{\infty}$ -norm. For a matrix $\mathbf{A}=[A_{ij}]$ , we denote by $\|\mathbf{A}\|_{2}$ the matrix spectral norm, $\|\mathbf{A}\|_{\text{F}}$ the Frobenius norm, $\|\mathbf{A}\|_{2,\infty}=\sup_{\|\mathbf{x}\|_{2}=1}\|\mathbf{A}\mathbf{x}\|_{\infty}=\max_{i}\|\mathbf{A}^{\top}\mathbf{e}_{i}\|_{2}$ the 2-to- $\infty$ norm and $\|\mathbf{A}\|_{\max}=\max_{i,j}|A_{ij}|$ the matrix max norm. For an integer $n$ , define $[n]=\{1,2,\ldots,n\}$ . For two positive sequences $x_{n}$ and $y_{n}$ , we say $x_{n}\lesssim y_{n}$ or $x_{n}=O(y_{n})$ if $x_{n}\leq Cy_{n}$ for $C>0$ that does not depend on $n$ . We say $x_{n}\asymp y_{n}$ if $x_{n}\lesssim y_{n}$ and $y_{n}\lesssim x_{n}$ . If $\lim_{n\rightarrow\infty}x_{n}/y_{n}=0$ , we say $x_{n}=o(y_{n})$ or $x_{n}\ll y_{n}$ . Let $\mathbb{I}\{\cdot\}$ be an indicator function, which takes 1 if the statement inside $\{\cdot\}$ is true and 0 otherwise. Throughout the paper, we use $c$ and $C$ to represent generic constants and their values might change from place to place.

2 Preliminaries and Problem Setup

We aim to estimate the eigenspace of the rank- $K$ symmetric matrix $\mathbf{M}\in\mathbb{R}^{d\times d}$ , whose eigen-decomposition is given by $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ ,¹¹1When $\mathbf{M}$ is asymmetric, we can deploy the “symmetric dilation” trick and take ${\mathcal{S}}(\mathbf{M})=\begin{pmatrix}\mathbf{0}&\mathbf{M}\\ \mathbf{M}^{\top}&\mathbf{0}\end{pmatrix}$ to fit it into the setting. where $\mathbf{\Lambda}=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{K})$ , $|\lambda_{1}|\geq|\lambda_{2}|\geq\ldots\geq|\lambda_{K}|>0$ and $\mathbf{V}$ is the stacking of the $K$ leading eigenvectors. We denote by $\Delta=|\lambda_{K}|$ the eigengap of $\mathbf{M}$ , and assume without loss of generality that $\lambda_{1}>0$ . $\widehat{\mathbf{M}}$ is a corrupted version of $\mathbf{M}$ obtained from observed data, with $\mathbf{E}=\widehat{\mathbf{M}}-\mathbf{M}$ representing the error matrix. Our goal is to estimate the column space of $\mathbf{V}$ from $\widehat{\mathbf{M}}$ distributively and scalably. The following four examples provide concrete statistical setups for the above problem.

Example 1 (Spiked Covariance Model [25]).

Let $\bm{X}_{1},\ldots,\bm{X}_{n}\in\mathbb{R}^{d}$ be i.i.d. sub-Gaussian random vectors with ${\mathbb{E}}(\bm{X}_{i})=\mathbf{0}$ and ${\mathbb{E}}(\bm{X}_{i}\bm{X}_{i}^{\top})=\mathbf{\Sigma}$ .²²2We assume $\{\bm{X}_{i}\}_{i=1}^{n}$ are i.i.d. for the simplicity of presentation. We will generalize the theoretical results to non-i.i.d. and heterogeneous data in Section 4.1. We assume the following decomposition for the covariance matrix: $\mathbf{\Sigma}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}+\sigma^{2}\mathbf{I}_{d}$ , where $\mathbf{V}\in\mathbb{R}^{d\times K}$ is the stacked $K$ leading eigenvectors and $\mathbf{\Lambda}=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{K})$ with $\lambda_{1}\geq\ldots\geq\lambda_{K}>0$ . Assume that the data are split along the sample size $n$ and stored on $m$ different sites. Denote by $\{\bm{X}_{i}^{(s)}\}_{i=1}^{n_{s}}$ the sample split of size $n_{s}$ on the $s$ -th site, and by $\mathbf{X}^{(s)}=(\bm{X}_{1}^{(s)},\ldots,\bm{X}_{n_{s}}^{(s)})^{\top}$ the corresponding data matrix split ( $s=1,\ldots,m$ and $\sum_{s=1}^{m}n_{s}=n$ ). Denote by $\mathbf{X}=(\bm{X}_{1},\ldots,\bm{X}_{n})^{\top}$ the full $n\times d$ data matrix. Then $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ , and $\widehat{\mathbf{M}}=\widehat{\mathbf{\Sigma}}-\widehat{\sigma}^{2}\mathbf{I}_{d}$ , where $\widehat{\mathbf{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}$ is the sample covariance matrix and $\widehat{\sigma}^{2}$ is a consistent estimator for $\sigma^{2}$ .

Example 2 (Degree-Corrected Mixed Membership (DCMM) Model [16]).

Let $\mathbf{X}\in\mathbb{R}^{d\times d}$ be a symmetric adjacency matrix for an undirected graph of $d$ nodes, where $\mathbf{X}_{ij}=1$ if nodes $i,j\in[d]$ are connected and $\mathbf{X}_{ij}=0$ otherwise. Assume $\mathbf{X}_{ij}$ ’s are independent for $i\leq j$ and ${\mathbb{E}}(\mathbf{X})=\mathbf{\Theta}\mathbf{\Pi}\mathbf{P}\mathbf{\Pi}^{\top}\mathbf{\Theta}$ , where $\mathbf{\Theta}=\operatorname{diag}(\theta_{1},\ldots,\theta_{d})$ stands for the degree heterogeneity matrix, $\mathbf{\Pi}=(\bm{\pi}_{1},\ldots,\bm{\pi}_{d})^{\top}\in\mathbb{R}^{d\times K}$ is the stacked community assignment probability vectors and $\mathbf{P}\in\mathbb{R}^{K\times K}$ is a symmetric rank- $K$ matrix with constant entries $\mathbf{P}_{kk^{\prime}}\in(0,1)$ for $k,k^{\prime}\in[K]$ . Then $\mathbf{M}={\mathbb{E}}(\mathbf{X})=\mathbf{\Theta}\mathbf{\Pi}\mathbf{P}\mathbf{\Pi}^{\top}\mathbf{\Theta}$ and $\widehat{\mathbf{M}}=\mathbf{X}$ .³³3In the case where self-loops are absent, $\mathbf{X}$ will be replaced by $\mathbf{X}^{\prime}=\mathbf{X}-\operatorname{diag}(\mathbf{X})$ and $\mathbf{E}$ will be replaced by $\mathbf{E}^{\prime}=\mathbf{E}-\operatorname{diag}(\mathbf{X})$ . Our theoretical results hold for both cases. The goal is to infer the community membership profiles $\mathbf{\Pi}$ . Recall $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ . Since $\mathbf{V}$ and $\mathbf{\Theta}\mathbf{\Pi}$ share the same column space, we can make inference on $\mathbf{\Pi}$ through $\mathbf{V}$ . ⁴⁴4To address the degree heterogeneity, one can perform the SCORE normalization to cancel out $\mathbf{\Theta}$ [24]. In this paper, we assume that there exist constants $C\geq c>0$ such that $\sigma_{K}(\mathbf{\Pi})\geq c\sqrt{d/K}$ , $c\leq\lambda_{K}(\mathbf{P})\leq\lambda_{1}(\mathbf{P})\leq CK$ and $\max_{i}\theta_{i}\leq C\min_{i}\theta_{i}$ , where we define $\theta=\max_{i}\theta_{i}^{2}$ as the rate of signal strength. We assume that the adjacency matrix is distributed across $m$ sites, where on the $s$ -th site we observe the connectivity matrix $\mathbf{X}^{(s)}\in\mathbb{R}^{d\times d_{s}}$ and $\mathbf{X}=(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$ .

Example 3 (Gaussian Mixture Models (GMM) [12]).

Let $\bm{W}_{1},\ldots,\bm{W}_{d}\in\mathbb{R}^{n}$ be independent samples with $\bm{W}_{j}$ $(j\in[d])$ generated from one of $K$ Gaussian distributions with means $\bm{\theta}_{k}\in\mathbb{R}^{n}$ ( $k=1,\cdots,K$ ). More specifically, for $j\in[d]$ , $\bm{W}_{j}$ is associated with a membership label $k_{j}\in[K]$ , and $\bm{W}_{j}\sim{\mathcal{N}}(\sum_{k=1}^{K}\bm{\theta}_{k}\mathbb{I}\{k_{j}=k\},\mathbf{I}_{n})$ . Our goal is to recover the unknown membership labels $k_{j}$ ’s. Denote $\mathbf{X}=(\bm{W}_{1},\ldots,\bm{W}_{d})=(\bm{X}_{1},\ldots,\bm{X}_{n})^{\top}$ , where $\bm{X}_{i}$ is the $i$ -th row of $\mathbf{X}$ . Without loss of generality, we order $\bm{W}_{j}$ ’s such that ${\mathbb{E}}(\mathbf{X})=\mathbf{\Theta}\mathbf{F}^{\top}$ , where

\mathbf{\Theta}=(\bm{\theta}_{1},\ldots,\bm{\theta}_{K})\in\mathbb{R}^{n\times K},\quad\mathbf{F}=\operatorname{diag}(\mathbf{1}_{d_{1}},\ldots,\mathbf{1}_{d_{K}})\in\mathbb{R}^{d\times K},

with $d_{k}$ denoting the number of samples drawn from the Gaussian distribution with mean $\bm{\theta}_{k}$ . Then we define $\mathbf{M}={\mathbb{E}}[\mathbf{X}^{\top}\mathbf{X}]-n\mathbf{I}_{d}=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}$ and $\widehat{\mathbf{M}}=\mathbf{X}^{\top}\mathbf{X}-n\mathbf{I}_{d}$ . Recall $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ . Since $\mathbf{V}$ and $\mathbf{F}$ share the same column space, we can recover the memberships from $\mathbf{V}$ . We consider the regime where $n>d$ . Besides, we assume that there exists a constant $C>0$ such that $\max_{k}d_{k}\leq C\min_{k}d_{k}$ and $\sigma_{1}(\mathbf{\Theta})\leq C\sigma_{K}(\mathbf{\Theta})$ . We consider the distributed setting where the data are split along the dimension $n$ and distributively stored on $m$ sites. Denote by $\mathbf{X}^{(s)}=(\bm{X}_{1}^{(s)},\ldots,\bm{X}_{n_{s}}^{(s)})^{\top}$ the data split on the $s$ -th site of size $n_{s}$ ( $s\in[m]$ ).

Example 4 (Incomplete Matrix Inference [13]).

Assume that $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ is a symmetric rank- $K$ matrix, and ${\mathcal{S}}\subseteq[d]\times[d]$ is a subset of indices. We only observe the perturbed entries of $\mathbf{M}$ in the subset ${\mathcal{S}}$ . Specifically, for $i\leq j$ , we denote $\delta_{ij}=\delta_{ji}=\mathbb{I}\{(i,j)\in{\mathcal{S}}\}$ , and $\delta_{ij}\overset{\text{i.i.d}}{\sim}\operatorname{Bernoulli}(\theta)$ is an indicator for whether the $(i,j)$ th entry is missing. Then for $i,j\in[d]$ , the observation for $\mathbf{M}_{ij}$ is $\mathbf{X}_{ij}=(\mathbf{M}_{ij}+\varepsilon_{ij})\delta_{ij}$ , where $\varepsilon_{ij}=\varepsilon_{ji}$ are i.i.d. random variables satisfying ${\mathbb{E}}(\varepsilon_{ij})=0$ , ${\mathbb{E}}(\varepsilon_{ij}^{2})=\sigma^{2}$ and $\sup_{i\leq j}|\varepsilon_{ij}|\lesssim\sigma\log d$ .⁵⁵5We can generalize the results to sub-Gaussian error $\varepsilon_{ij}$ ’s with variance proxy $\sigma^{2}$ by taking the truncated error $\varepsilon_{ij}^{t}=\varepsilon_{ij}\mathbb{I}\{|\varepsilon_{ij}|\leq 4\sigma\sqrt{\log d}\}$ , and by the maximal inequality for sub-Gaussian random variables we know that with probability at least $1-O(d^{-6})$ , $\varepsilon_{ij}=\varepsilon_{ij}^{t},\forall i,j\in[d]$ , and the theorems can be generalized with minor modifications. Then to adjust for scaling, we define the observed data as $\widehat{\mathbf{M}}=[\widehat{\mathbf{M}}_{ij}]=\widehat{\theta}^{-1}[\mathbf{X}_{ij}]$ , where $\widehat{\theta}=2|{\mathcal{S}}|/\big{(}d(d+1)\big{)}$ .⁶⁶6In practice, we can estimate $\mathbf{V}$ by $\mathbf{X}$ rather than by $\widehat{\mathbf{M}}=\widehat{\theta}^{-1}\mathbf{X}$ , since the two matrices share exactly the same eigenvectors. However, we need the factor $\widehat{\theta}^{-1}$ to preserve correct scaling for the estimation of eigenvalues as well as the follow-up matrix completion. Please see Theorem 4.3 and Corollary 4.9 for more details. Consider the distributed setting where the data are split along $d$ on $m$ servers, where $\mathbf{X}^{(s)}\in\mathbb{R}^{d\times d_{s}}$ stands for the observations on the $s$ -th server and $\widehat{\mathbf{M}}=\widehat{\theta}^{-1}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$ . The goal is to infer $\mathbf{V}$ from $\widehat{\mathbf{M}}$ in the presence of missing data.

Table 1 provides the complexities of FADI for the four problems and suggested choice of parameters for optimal error rates. We will further discuss the computational complexities in detail in Section 3.4.

	Complexity	$p$	$L$
Spiked covariance model	$O(dnp/m+dKpL\log d)$	$K\vee\log d$	$d/p$
DCMM model	$O(d^{2}p/m+dKpL\log d)$	$\sqrt{d}$	$\sqrt{d}$
Gaussian mixture models	$O(dnp/m+dKpL\log d)$	$K\vee\log d$	$d/p$
Incomplete matrix inference	$O(d^{2}p/m+dKpL\log d)$	$\sqrt{d}$	$\sqrt{d}$

Table 1: Computational complexities and parameter choice of FADI for PCA estimation under different models, where

K

is the rank of

\mathbf{M}

d

is the dimension of

\mathbf{M}

n

is the sample size,

m

is the number of data splits,

p

is the fast sketching dimension and

L

is the number of repeated sketches.

3 Method

In this section, we present the FADI algorithm and its application to different examples. We then provide the computational complexities of FADI and compare it with the existing methods. We also discuss how to estimate the rank $K$ when it is unknown.

3.1 Fast Distributed PCA (FADI): Overview and Intuition

For a given matrix $\widehat{\mathbf{M}}\in\mathbb{R}^{d\times d}$ , the computational cost of the traditional PCA on $\widehat{\mathbf{M}}$ is $O(d^{3})$ . In the case where $\widehat{\mathbf{M}}$ is computed from observed data, e.g., the sample covariance matrix $\widehat{\bm{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}$ , extra computational burden comes from calculating $\widehat{\mathbf{M}}$ , e.g., $O(nd^{2})$ flops for computing the sample covariance matrix. Hence performing traditional PCA for large-scale data with high dimensions and huge sample sizes can be considerably expensive.

To reduce the computational cost when $d$ is large, the most straightforward idea is to reduce the data dimension. One popular method for dimension reduction is random sketching [20]. For instance, for a low-rank matrix $\mathbf{M}$ of rank $K$ , its column space can be represented by a low-dimensional fast sketch $\mathbf{M}\bm{\Omega}\in\mathbb{R}^{d\times p}$ , where $\mathbf{\Omega}\in\mathbb{R}^{d\times p}$ is a random Gaussian matrix with $K<p\ll d$ . In practice, $\mathbf{M}$ is usually replaced by an almost low-rank corrupted matrix $\widehat{\mathbf{M}}$ calculated from observed data. Traditional fast PCA methods then consider performing random sketching on $\widehat{\mathbf{M}}$ instead, and use the full sample to obtain the fast sketch $\widehat{\mathbf{Y}}=\widehat{\mathbf{M}}\mathbf{\Omega}\approx\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}$ that almost maintains the same left singular space as $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ . It is hence reasonable to estimate $\mathbf{V}$ by performing SVD on the $d\times p$ matrix $\widehat{\mathbf{Y}}$ that has a much smaller computational cost than directly performing PCA on $\widehat{\mathbf{M}}$ . However, one major drawback of this approach is that information might be lost due to fast sketching. Furthermore, the method is not scalable when $n$ is large or the data are federated. This motivates us to propose FADI, where we repeat the fast sketching multiple times and aggregate the results to reduce the statistical error. Besides, instead of performing the fast sketching on the full sample, we apply multiple sketches to each split sample, and then aggregate the PC results across the data splits.

Specifically, assume the data are stored across $m$ sites, and we have the decomposition $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}$ , where $\widehat{\mathbf{M}}^{(s)}$ is the component that can be computed locally on the $s$ -th machine ( $s\in[m]$ ). Then instead of applying random sketching directly to $\widehat{\mathbf{M}}$ , FADI computes in parallel the local fast sketching for each component $\widehat{\mathbf{M}}^{(s)}$ and aggregates the results across $m$ sites, which will reduce the cost of computing $\widehat{\mathbf{M}}\bm{\Omega}$ by a factor of $1/m$ . Note that this representation of $\widehat{\mathbf{M}}$ is legitimate in many models. Taking Example 1 for instance, define $\widehat{\mathbf{M}}^{(s)}=\frac{1}{n}(\mathbf{X}^{(s)\top}\mathbf{X}^{(s)})-(\widehat{\sigma}^{2}/m)\mathbf{I}_{d}$ , and we have $\widehat{\mathbf{M}}=\widehat{\bm{\Sigma}}-\widehat{\sigma}^{2}\mathbf{I}_{d}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}$ . We will verify the decomposition for Examples 2 - 4 in Section 3.3.

We will see in Section 4.1 that when the number of repeated fast sketches is sufficiently large, FADI enjoys the same error rate as the traditional PCA. From this perspective, FADI can be viewed as a “vertically” distributed PCA method as it allocates the computational burden along the dimension $d$ to several machines using low-dimensional sketches while maintaining high statistical accuracy through the aggregation of local PCs. FADI overcomes the difficulties of vertical splitting caused by the correlation between variables.

3.2 General Algorithmic Framework

Recall we aim to estimate the $K$ leading eigenvectors $\mathbf{V}$ of a rank- $K$ matrix $\mathbf{M}$ from its estimator $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}$ . Figure 2 illustrates the fast distributed PCA (FADI) algorithm:

In Step 0, we perform preliminary processing on the raw data to produce $\{\widehat{\mathbf{M}}^{(s)}\}_{s=1}^{m}$ . We will elaborate on the case-specific preprocessing in Section 3.3.

In Step 1, we calculate the distributed fast sketch $\widehat{\mathbf{Y}}=\widehat{\mathbf{M}}\mathbf{\Omega}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$ , where $\mathbf{\Omega}$ is a $d\times p$ standard Gaussian test matrix and $K<p\ll d$ . To reduce the statistical error, we repeat the fast sketching $L$ times and aggregate the results from the $L$ copies of $\widehat{\mathbf{Y}}$ . Specifically, we generate $L$ i.i.d. Gaussian test matrices $\{\bm{\Omega}^{(\ell)}\}_{\ell=1}^{L}$ , and for each $\ell\in[L]$ , we apply $\bm{\Omega}^{(\ell)}$ distributively to $\widehat{\mathbf{M}}^{(s)}$ for each $s\in[m]$ and obtain the $\ell$ -th fast sketch of $\widehat{\mathbf{M}}^{(s)}$ as $\widehat{\mathbf{Y}}^{(s,\ell)}=\widehat{\mathbf{M}}^{(s)}\bm{\Omega}^{(\ell)}$ . We send $\widehat{\mathbf{Y}}^{(s,\ell)}$ ( $s=1,\cdots,m$ ) to the $\ell$ -th parallel server for aggregation.

In Step 2, on the $\ell$ -th server, the random sketches $\widehat{\mathbf{Y}}^{(s,\ell)}$ ( $s=1,\cdots,m$ ) from the $m$ split datasets corresponding to the $\ell$ -th Gaussian test matrix $\mathbf{\Omega}^{(\ell)}$ will be collected and added up to get the $\ell$ -th fast sketch: $\widehat{\mathbf{Y}}^{(\ell)}=\sum_{s=1}^{m}\widehat{\mathbf{Y}}^{(s,\ell)}$ $(\ell\in[L])$ . We next compute in parallel the top $K$ left singular vectors $\widehat{\mathbf{V}}^{(\ell)}$ of $\widehat{\mathbf{Y}}^{(\ell)}$ and send the $\widehat{\mathbf{V}}^{(\ell)}$ ’s to the central processor for aggregation.

In Step 3, on the central processor, calculate $\widetilde{\mathbf{\Sigma}}=\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\ell}$ , where $\mathbf{P}_{\ell}=\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}$ is the projection matrix of $\widehat{\mathbf{V}}^{(\ell)}$ . We next calculate the $K$ leading eigenvectors $\widetilde{\mathbf{V}}$ of $\widetilde{\mathbf{\Sigma}}$ , which will serve as the final estimator of $\mathbf{V}$ .

To further improve the computational efficiency, we might conduct another fast sketching in Step 3 to compute $\widetilde{\mathbf{V}}$ . More specifically, we apply the power method [20] to $\widetilde{\bm{\Sigma}}$ by calculating $\widetilde{\mathbf{Y}}=\widetilde{\mathbf{\Sigma}}^{q}\mathbf{\Omega}^{\rm F}=\left(\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\right)^{q}\mathbf{\Omega}^{\rm F}$ for $q\geq 1$ , where $\mathbf{\Omega}^{\rm F}\in\mathbb{R}^{d\times p^{\prime}}$ is a Gaussian test matrix with dimension $p^{\prime}$ that can be set different from $p$ for optimal efficiency. Here, $\widetilde{\mathbf{Y}}$ can be calculated iteratively: $\widetilde{\mathbf{Y}}_{(i)}=\frac{1}{L}\sum_{\ell=1}^{L}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\widetilde{\mathbf{Y}}_{(i-1)}\right)$ for $i=1,\ldots,q$ , where $\widetilde{\mathbf{Y}}_{(0)}=\mathbf{\Omega}^{\rm F}$ and $\widetilde{\mathbf{Y}}=\widetilde{\mathbf{Y}}_{(q)}$ . We denote by $\widetilde{\mathbf{V}}^{\text{F}}$ the leading $K$ left singular vectors of $\widetilde{\mathbf{Y}}$ . We will show in Section 4 that when $q$ is properly large, the distance between $\widetilde{\mathbf{V}}$ and $\widetilde{\mathbf{V}}^{\text{F}}$ will be negligible.

Remark 1.

We refer to Theorem 4.1 for the choice of $p$ and $L$ . In general, taking $p=2K$ is sufficient. For now, we assume $K$ is known, and the scenarios where $K$ is unknown will be discussed in Section 3.5.

3.3 Case-Specific Processing of Raw Data

In this section, we discuss the calculation of $\widehat{\mathbf{M}}$ in Step 0 of FADI specifically for each example.

Example 1: Recall that in Step 0 of FADI, to obtain $\widehat{\mathbf{M}}$ , we need a consistent estimator of the residual variance $\sigma^{2}$ . Denote by $S=\{i_{1},i_{2},\ldots,i_{K^{\prime}}\}\subseteq[d]$ an arbitrary index set of size $K^{\prime}\geq K+1$ . Then we estimate $\sigma^{2}$ by $\widehat{\sigma}^{2}=\sigma_{\min}(\widehat{\bm{\Sigma}}_{S})$ , where $\widehat{\bm{\Sigma}}_{S}$ is a $K^{\prime}\times K^{\prime}$ principal submatrix of $\widehat{\bm{\Sigma}}$ computed using only data columns in the set $S$ . Due to the additive structure of the sample covariance matrix, $\widehat{\bm{\Sigma}}_{S}$ can be easily computed distributively (see Figure 9 in Appendix E for reference). Then for $s\in[m]$ , we have $\widehat{\mathbf{M}}^{(s)}=\frac{1}{n}(\mathbf{X}^{(s)\top}\mathbf{X}^{(s)})-(\widehat{\sigma}^{2}/m)\mathbf{I}_{d}$ . Note that since computing $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}=\frac{1}{n}\mathbf{X}^{(s)\top}(\mathbf{X}^{(s)}\bm{\Omega})-m^{-1}\widehat{\sigma}^{2}\bm{\Omega}$ is much faster than first computing $\widehat{\mathbf{M}}^{(s)}$ then computing $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$ , we will calculate $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$ by calculating $\mathbf{X}^{(s)}\bm{\Omega}$ first rather than directly computing $\widehat{\mathbf{M}}^{(s)}$ .

Example 2: Recall that the adjacency matrix is stored distributively on $m$ sites, and for the $s$ -th site we observe the connectivity matrix $\mathbf{X}^{(s)}$ . Then for $s\in[m]$ , define $\widehat{\mathbf{M}}^{(s)}=(\mathbf{e}_{s}^{\top}\otimes\mathbf{I}_{d})\operatorname{diag}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$ , where $\{\mathbf{e}_{s}\}_{s=1}^{m}\subseteq\mathbb{R}^{m}$ is the canonical basis for $\mathbb{R}^{m}$ . Namely, $\widehat{\mathbf{M}}^{(s)}$ is the $s$ -th observation $\mathbf{X}^{(s)}$ augmented by zeros, and $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}=(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})=\mathbf{X}$ . No preliminary computation is needed.

Example 3: Recall that the data $\{\bm{W}_{j}\}_{j=1}^{d}\subseteq\mathbb{R}^{n}$ are vertically distributed across $m$ sites, and $\{\mathbf{X}^{(s)}\}_{s=1}^{m}$ are the corresponding data splits. For the $s$ -th site, we have $\widehat{\mathbf{M}}^{(s)}=\mathbf{X}^{(s)\top}\mathbf{X}^{(s)}-(n/m)\mathbf{I}_{d}$ , and for $\ell\in[L]$ , we compute $\widehat{\mathbf{Y}}^{(s,\ell)}$ by $\mathbf{X}^{(s)\top}(\mathbf{X}^{(s)}\bm{\Omega}^{(\ell)})-(n/m)\bm{\Omega}^{(\ell)}$ .

Example 4: Recall that we observe the split data $\{\mathbf{X}^{(s)}\}_{s=1}^{m}$ with missing entries on $m$ servers. Define $\widehat{\mathbf{M}}^{(s)}=\widehat{\theta}^{-1}(\mathbf{e}_{s}^{\top}\otimes\mathbf{I}_{d})\operatorname{diag}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$ for the $s$ -th server, where $\widehat{\theta}=2|{\mathcal{S}}|/\big{(}d(d+1)\big{)}$ , then we have $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}=\widehat{\theta}^{-1}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$ .

3.4 Computational Complexity

In this section, we provide the computational complexity of FADI for each example given in Section 2. The complexity of each step is listed in Table 2.

	Example 1	Example 2	Example 3	Example 4
Step 0	$\widehat{\bm{\Sigma}}_{S}:O(\frac{K^{2}n}{m}+K^{2}m)$ $\widehat{\sigma}^{2}:O(K^{3})$	N/A	O(1)	$O(\frac{d^{2}}{m})$
Step 1	$\widehat{\mathbf{Y}}^{(s,\ell)}:O(\frac{dnp}{m})$	$\widehat{\mathbf{Y}}^{(s,\ell)}:O(\frac{d^{2}p}{m})$	$\widehat{\mathbf{Y}}^{(s,\ell)}:O(\frac{dnp}{m})$	$\widehat{\mathbf{Y}}^{(s,\ell)}:O(\frac{d^{2}p}{m})$
Step 2	$\widehat{\mathbf{Y}}^{(\ell)}:O(mdp)$ $\widehat{\mathbf{V}}^{(\ell)}:O(dp^{2})$	$\widehat{\mathbf{Y}}^{(\ell)}:O(mdp)$ $\widehat{\mathbf{V}}^{(\ell)}:O(dp^{2})$	$\widehat{\mathbf{Y}}^{(\ell)}:O(mdp)$ $\widehat{\mathbf{V}}^{(\ell)}:O(dp^{2})$	$\widehat{\mathbf{Y}}^{(\ell)}:O(mdp)$ $\widehat{\mathbf{V}}^{(\ell)}:O(dp^{2})$
Step 3	$\widetilde{\mathbf{V}}:O(d^{2}pL+d^{3})$	N/A	$\widetilde{\mathbf{V}}:O(d^{2}pL+d^{3})$	N/A
Step 3	$\widetilde{\mathbf{V}}^{\text{F}}:O(dKp^{\prime}Lq+dp^{\prime 2})$
Total	$O(\frac{dnp}{m}+dKp^{\prime}Lq)$	$O(\frac{d^{2}p}{m}+dKp^{\prime}Lq)$	$O(\frac{dnp}{m}+dKp^{\prime}Lq)$	$O(\frac{d^{2}p}{m}+dKp^{\prime}Lq)$

Table 2: Computational costs for Examples 1-4. For the simplicity of presentation, we assume

\max_{s\in[m]}n_{s}\asymp n/m

for Examples 1 and 3 and

\max_{s\in[m]}d_{s}\asymp d/m

for Examples 2 and 4. In Step 3, the calculation of

\widetilde{\mathbf{V}}

involves computing

\widetilde{\bm{\Sigma}}

O(d^{2}pL)

flops and SVD on

\widetilde{\bm{\Sigma}}

O(d^{3})

flops, while computing

\widetilde{\mathbf{V}}^{\rm F}

involves computing

\widetilde{\bm{\Sigma}}^{q}\bm{\Omega}^{\rm F}

O(dKp^{\prime}Lq)

flops and SVD on

\widetilde{\bm{\Sigma}}^{q}\bm{\Omega}^{\rm F}

O(dp^{\prime 2})

flops. We recommend

\widetilde{\mathbf{V}}^{\rm F}

instead of

\widetilde{\mathbf{V}}

in practice. The total complexity in the last line refers to the total computational cost for

\widetilde{\mathbf{V}}^{\rm F}

When $m$ can be customized, we recommend taking $m\asymp n/d$ for Examples 1 and 3, and $m\asymp\sqrt{d}$ for Examples 2 and 4 for optimal efficiency. For Examples 1 and 3, when $p\asymp(K\vee\log d)$ , $L\asymp d/p$ , $p^{\prime}\asymp K$ and $q\asymp\log d$ , the total computational cost will be $O\big{(}dn(K\vee\log d)/m+d^{2}K\log d\big{)}$ . For Examples 2 and 4, direct SVD on $\widetilde{\bm{\Sigma}}$ will induce computational cost of order $d^{3}$ and we only suggest $\widetilde{\mathbf{V}}^{\text{F}}$ as the eigenspace estimator. If we take $p\asymp\sqrt{d}$ , $L\asymp d/p$ , $p^{\prime}\asymp K$ and $q\asymp\log d$ , the total computational cost will be $O(d^{5/2}/m+K^{2}d^{3/2}\log d)$ . Inference on eigenspace will require the calculation of the asymptotic covariance, whose formula and computational costs will be discussed in Sections 4.3 and 4.4.

Method	Error Rate	Computational Complexity
FADI	$O(\sqrt{{Kr}/{n}})$	$O\left(dn(K\vee\log d)/m+d^{2}K\log d\right)$
Traditional PCA	$O(\sqrt{{Kr}/{n}})$	$O(d^{2}n+d^{3})$
Fast PCA	$O(\sqrt{{Kdr}/{n}})$	$O(dnK+d^{2}K)$
Distributed PCA	$O(\sqrt{{Kr}/{n}})$	$O(d^{2}n/m+d^{3})$

Table 3: Error rates and computational complexities for FADI, traditional PCA, fast PCA (one sketching) [20] and distributed PCA [18] for Example 1, where the error rate is evaluated by

\big{(}{\mathbb{E}}|\rho(\,\cdot\,,\mathbf{V})|^{2}\big{)}^{1/2}

. Here

r=\operatorname{tr}(\mathbf{\Sigma})/\|\mathbf{\Sigma}\|_{2}

is the effective rank of the covariance matrix and

m

is the number of sites. For FADI, we take

p\asymp(K\vee\log d)

L\asymp d/p

p^{\prime}\asymp K

and

q\asymp\log d

For a comparison of FADI with the existing works, we provide in Table 3 the theoretical error rates and the computational complexities of FADI against different PCA methods under Example 1 (please refer to Therem 4.1 for the error rates of FADI). We choose Example 1 for illustration, as the existing distributed PCA methods mainly consider this setting [18]. The results show that under the distributed setting, FADI has a much lower computational complexity than the other three methods, while enjoying the same error rate as the traditional full-sample PCA. In comparison, the distributed PCA method in [18] is slowed down significantly by applying traditional PCA to each data split. The fast PCA algorithm in [20] has suboptimal computational complexity and theoretical error rate due to their downstream projection that hinders aggregation.

3.5 Estimation of the Rank $K$

FADI requires inputting the rank $K$ of the matrix $\mathbf{M}$ . In practice, if we are only interested in estimating the leading PCs, the exact value of $K$ is not needed as long as the fast sketching dimensions, $p$ and $p^{\prime}$ , are sufficiently larger than $K$ . Yet knowing the exact value of $K$ will improve the computational efficiency as well as facilitate inference on PCs. In fact, the estimation of $K$ can be incorporated into Step 2 and Step 3 of FADI. Specifically, for the $\ell$ -th parallel server ( $\ell\in[L]$ ), after performing the SVD $\widehat{\mathbf{Y}}^{(\ell)}=\widehat{\mathbf{V}}_{p}^{(\ell)}\widehat{\mathbf{\Lambda}}_{p}^{(\ell)}\widehat{\mathbf{U}}_{p}^{(\ell)\top}$ , we estimate $K$ by

\widehat{K}^{(\ell)}=\min\{k<p:\sigma_{k+1}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})\leq\sqrt{p}\mu_{0}\},

where $\mu_{0}>0$ is a user-specified parameter (we refer to Theorem 4.3 for the choice of $\mu_{0}$ ). Then send all the left singular vectors $\widehat{\mathbf{V}}_{p}^{(\ell)}$ and $\widehat{K}^{(\ell)},\ell\in[L]$ to the central processor. Finally, on the central processor, take $\widehat{K}=\lceil\operatorname{median}\big{\{}\widehat{K}^{(1)},\widehat{K}^{(2)},\ldots,\widehat{K}^{(L)}\big{\}}\rceil$ as the estimator for $K$ , and obtain $\widetilde{\mathbf{V}}_{\widehat{K}}$ (respectively $\widetilde{\mathbf{V}}_{\widehat{K}}^{\text{F}}$ ) by performing PCA (respectively powered fast sketching) on the aggregated average of $\{\widehat{\mathbf{V}}_{\widehat{K}}^{(\ell)}\}_{\ell\in[L]}$ and taking the $\widehat{K}$ leading PCs, where $\widehat{\mathbf{V}}_{\widehat{K}}^{(\ell)}$ is the $\widehat{K}$ leading PCs of $\widehat{\mathbf{Y}}^{(\ell)}$ . We will show in Theorem 4.3 that $\widehat{K}$ is a consistent estimator of $K$ .

4 Theory

In this section, we will establish a theoretical upper bound for the error rate of FADI in Section 4.1, and characterize the asymptotic distribution of the FADI estimator in Section 4.3 and Section 4.4 to facilitate inference.

4.1 Theoretical Bound on Error Rates

We need the following condition to guarantee that the error term converges at a proper rate.

Assumption 1 (Convergence of $\|\mathbf{E}\|_{2}$ ).

Recall that ${\mathbf{E}}=\widehat{\mathbf{M}}-\mathbf{M}$ is the error matrix. Assume that $\|\mathbf{E}\|_{2}$ is sub-exponential, and there exists a rate $r_{1}(d)$ such that

\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}=\sup_{q\geq 1}q^{-1}\left({\mathbb{E}}\|\mathbf{E}\|_{2}^{q}\right)^{1/q}\lesssim r_{1}(d).

Remark 2.

By standard probability theory [40], we know that there exists a constant $c_{e}>0$ such that for any $t>0$ we have $\mathbb{P}(\|\mathbf{E}\|_{2}\geq t)\leq\exp\left(-c_{e}t/r_{1}(d)\right)$ and $\|\mathbf{E}\|_{2}=O_{P}\left(r_{1}(d)\right)$ .

We will conduct a variance-bias decomposition on the error rate $\rho(\widetilde{\mathbf{V}},\mathbf{V})$ . To facilitate the discussion, we introduce the intermediate matrix ${\mathbf{\Sigma}}^{\prime}={\mathbb{E}}_{\mathbf{\Omega}}\big{(}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\big{)}$ , where the expectation is taken with respect to $\mathbf{\Omega}$ . Let $\mathbf{V}^{\prime}$ be the top $K$ eigenvectors of ${\mathbf{\Sigma}}^{\prime}$ . Note that both ${\mathbf{\Sigma}}^{\prime}$ and $\mathbf{V}^{\prime}$ are random depending on $\widehat{\mathbf{M}}$ . For the FADI PC estimator $\widetilde{\mathbf{V}}$ , we have the following “variance-bias” decomposition of the error rate:

\rho(\widetilde{\mathbf{V}},\mathbf{V})\leq\underbrace{\rho(\widetilde{\mathbf{V}},{\mathbf{V}}^{\prime})}_{\text{variance}}+\underbrace{\rho(\mathbf{V}^{\prime},{\mathbf{V}})}_{\text{bias}}.

Conditional on all the available data, the first term characterizes the statistical randomness of $\widetilde{\mathbf{V}}$ due to fast sketching, whereas the second bias term is deterministic and depends on all the information provided by the data. Intuitively, since $\widetilde{\mathbf{\Sigma}}=\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}$ converges to the conditional expectation ${\mathbf{\Sigma}}^{\prime}$ , $\widetilde{\mathbf{V}}$ will also converge to $\mathbf{V}^{\prime}$ . Hence the first variance term goes to 0 asymptotically. As for the second bias term, let $\widehat{\mathbf{V}}$ be the $K$ leading eigenvectors of $\widehat{\mathbf{M}}$ , then we further break the bias term into two components: $\rho(\mathbf{V}^{\prime},{\mathbf{V}})\leq\rho(\widehat{\mathbf{V}},{\mathbf{V}})+\rho(\mathbf{V}^{\prime},\widehat{\mathbf{V}})$ . We can see that the first term is the error rate for the traditional PCA, whereas the second term is the bias caused by fast sketching. We can show that the second term is 0 with high probability and is hence negligible compared to the first term (see Lemma B.1 in Appendix B.1 for details), and the bias of the FADI estimator is of the same order as the error rate of the traditional PCA. In other words, the bias of the FADI estimator mainly comes from $\widehat{\mathbf{V}}$ , which is due to the information we can get from the available data. The following theorem gives the overall error rate of the FADI PC estimator. Its proof is given in Appendix B.2.

Theorem 4.1.

Under Assumption 1, if $p\geq\max(2K,K+7)$ and $(\log d)^{-1}\sqrt{p/d}\Delta/r_{1}(d)$ $\geq C$ for some large enough constant $C>0$ , we have

\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}},\mathbf{V})|^{2}\right)^{1/2}\lesssim{\frac{\sqrt{K}}{\Delta}r_{1}(d)}+{\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}.

(1)

Furthermore, recall $\widetilde{\mathbf{V}}^{\rm F}$ is the $K$ leading left singular vectors of $\widetilde{\mathbf{\Sigma}}^{q}\mathbf{\Omega}^{\rm F}$ for some power $q\geq 1$ , where $\mathbf{\Omega}^{\rm F}\in\mathbb{R}^{d\times p^{\prime}}$ is a random Gaussian matrix and $p^{\prime}\geq\max(2K,K+7)$ , then under Assumption 1 and the conditions that $p\geq\max(2K,K+8q-1)$ and $(\log d)^{-1}\sqrt{p/d}\Delta/r_{1}(d)\geq C$ , there exists some constant $\eta>0$ such that

\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}

\displaystyle\!\lesssim\!{\frac{\sqrt{K}}{\Delta}r_{1}(d)}\!+\!{\sqrt{\!\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}\!+\!\sqrt{\!\frac{Kd}{p^{\prime}}}\!\!\left(\!\eta q^{2}\!\sqrt{\frac{d}{\Delta^{2}p}}r_{1}(d)\!\right)^{q}\!\!\!.

(2)

Remark 3.

On the RHS of (1), the first term is the bias term, while the second term is the variance term. We can see that when the number of sketches $L$ reaches the order $d/p$ , the variance term will be of the same order as the bias term, which is the same as the error rate of the traditional PCA method. As for (2), the first term and the second term on the RHS are the same as the bias and the variance terms in (1), while the third term comes from the additional fast sketching. In fact, if we properly choose $q=\lceil\big{(}\log\big{(}\sqrt{p/d}\Delta/r_{1}(d)\big{)}\big{)}^{-1}\log d\rceil+1\leq\log d$ , the third term in (2) will be negligible. Theorem 4.1 also indicates that $p$ only needs to be of order $K\vee\log d$ , which significantly reduces the communication costs from $O(d^{2})$ to $O\left(d(K\vee\log d)\right)$ for each server.

Based upon Theorem 4.1, we provide the case-specific error rate for each example given in Section 2 in the following corollary. Please refer to Appendix B.3 for the proof.

Corollary 4.2.

For Examples 1 – 4, we have the following error bounds for each case under corresponding regularity conditions.

•

Example 1: Define $\kappa_{1}=(\lambda_{1}+\sigma^{2})/\Delta$ , then under the conditions that $p^{\prime}\geq\max(2K,K+7)$ , $p\geq\max(2K,K+8\log d-1)$ , $q=\lceil\log d\rceil$ and $n\geq C(rd/{p})\kappa_{1}^{2}\log^{4}d$ for some large enough constant $C>0$ , it holds that

\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}\lesssim\kappa_{1}\sqrt{\frac{Kr}{n}}+\kappa_{1}\sqrt{\frac{Kdr}{npL}},

(3)

where $r=\operatorname{tr}(\mathbf{\Sigma})/\|\mathbf{\Sigma}\|_{2}$ is the effective rank.

•

Example 2: Suppose $\theta{\geq}K^{2}d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ . If we take $p^{\prime}\geq\max(2K,K+7)$ , $p\gtrsim\sqrt{d}$ and $q=\lceil\log d\rceil$ , it holds that

\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}\lesssim K\sqrt{\frac{K}{d\theta}}+K\sqrt{\frac{K}{pL\theta}}.

(4)

•

Example 3: Under the conditions that $\Delta_{0}^{2}\geq CK(\log d)^{2}\max\big{(}d(\log d)^{2}/p,\,\sqrt{{n}/{p}}\big{)}$ for some large enough constant $C>0$ , where $\Delta_{0}=\|\mathbf{\Theta}\|_{2}$ , if we take $p^{\prime}\geq\max(2K,K+7)$ , $p\geq\max(2K,K+8\log d-1)$ and $q=\lceil\log d\rceil$ , it holds that

\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}

\displaystyle\!\!\!\lesssim\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right)\!+\!\sqrt{\frac{d}{pL}}\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right).

(5)

•

Example 4: Define $\kappa_{2}=|\lambda_{1}|/\Delta$ . Suppose $\theta\geq d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ , $\sigma/\Delta\ll d^{-1}\sqrt{p\theta}$ , $\|\mathbf{V}\|_{2,\infty}\leq\sqrt{\mu K/d}$ for some $\mu\geq 1$ and $\kappa_{2}\mu K\ll d^{1/4}$ , if we take $p^{\prime}\geq\max(2K,K+7)$ , $p\gtrsim\sqrt{d}$ and $q=\lceil\log d\rceil$ , it holds that

\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}

\displaystyle\!\!\lesssim\sqrt{K}\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right)\!\!+\!\sqrt{\frac{Kd}{pL}}\!\!\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right).

(6)

Remark 4.

We can generalize the results of Example 1 to the heterogeneous residual variance model for non-i.i.d. data, under which $\{\bm{X}_{i}\}_{i=1}^{n}\subseteq\mathbb{R}^{d}$ are centered random vectors with covariance matrices satisfying $\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}{\mathbb{E}}(\bm{X}_{i}\bm{X}_{i}^{\top})=\bm{\Sigma}=\mathbf{D}+\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ , where $\mathbf{D}=\operatorname{diag}(\sigma_{1}^{2},\ldots,\sigma_{d}^{2})$ and $\lambda_{1}\|\mathbf{V}\|_{2,\infty}^{2}/\Delta=o(1)$ . Then we have $\widehat{\mathbf{M}}=\widehat{\bm{\Sigma}}-\operatorname{diag}(\widehat{\bm{\Sigma}})$ , where $\widehat{\mathbf{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}$ , $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ and $\|\mathbf{E}\|_{2}\leq 2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}+\|\operatorname{diag}(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})\|_{2}\leq 2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}+\lambda_{1}\|\mathbf{V}\|_{2,\infty}^{2}$ . Then by plugging in $r_{1}(d)=\lambda_{1}\|\mathbf{V}\|_{2,\infty}^{2}+\|\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}\|_{\psi_{1}}$ , we have the error bound under the heterogeneous scenario. While the first term is deterministic, the second term depends on the dependence structure of the sample. Many studies depicted the convergence of the sample covariance matrix for non-i.i.d. data [7, 17].

For Example 1, when $Lp\gtrsim d$ , our error rate in (3) is optimal [18]. Under the distributed data setting, we require the total sample size $n$ to be larger than $rd/p$ , while Fan et al., [18]’s distributed PCA requires $n/m>r$ , where $n/m$ is the sample size for each data split. Compared with [18], our method has theoretical guarantees regardless of the number of data splits, but our scaling condition $n\gtrsim rd/p$ has an extra factor of $d/p$ in exchange for reduced computation cost. As for Example 2, our estimation rate in (4) matches the inferential results in [16]. Please also refer to Section 4.3 for a detailed comparison with the method in [16] in terms of the limiting distributions. For Example 3, our estimation rate in (5) is the same as in [12]. For Example 4, our error rate in (6) matches the results in [12]. When the rank $K$ is unknown and estimated by FADI, the following theorem shows that under appropriate conditions, our estimator $\widehat{K}$ presented in Section 3.5 recovers the true $K$ with high probability.

Theorem 4.3.

Under Assumption 1, define $\eta_{0}=480c_{e}^{-1}\sqrt{d/(\Delta^{2}p)}r_{1}(d)\log d$ , where $c_{e}>0$ is the constant defined in Remark 2. When $d\geq 2$ , $2K\leq p\ll d(\log d)^{-2}$ and $\eta_{0}\leq(32\log d)^{-{2}/{(p-K+1)}}$ , if we choose $\mu_{0}$ such that $\Delta\eta_{0}/24\leq\mu_{0}\leq\Delta\sqrt{\eta_{0}}/12$ , then with probability at least $1-O(d^{-(L\wedge 20)/2})$ , $\widehat{K}=K$ .

We defer the proof to Appendix B.4. We provide case-specific choices of the thresholding parameter $\mu_{0}$ in the following corollary, whose proof can be found in Appendix B.5.

Corollary 4.4.

For Examples 1 to 4, we specify the choice of $\mu_{0}$ under certain regularity conditions.

•

Example 1: Under the conditions that $2K\leq p\ll(\log d)^{-2}d$ , $n\gg\kappa_{1}^{2}rd/p(\log d)^{4}$ , $(\lambda_{1}+\sigma^{2})\ll\left(\sqrt{np}/(d\log d)\right)^{1/4}$ and $\Delta\gg\left({\sigma^{-2}(np)^{-1/2}}d\log d\right)^{1/3}$ , if we take $\mu_{0}=\left(d(np)^{-1/2}\log d\right)^{3/4}/12$ , with probability at least $1-O\left(d^{-(L\wedge 20)/2}\right)$ , we have $\widehat{K}=K$ .
•

Example 2: Define $\widehat{\theta}=d^{-2}\sum_{i\leq j}\widehat{\mathbf{M}}_{ij}$ , then under the condition that $\theta\geq K^{2}d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ and $\sqrt{d}\lesssim p\ll(\log d)^{-2}d$ , if we take $\mu_{0}=(\widehat{\theta}/p)^{1/2}d\log d/12$ , with probability at least $1-O\left(d^{-(L\wedge 20)/2}\right)$ , we have $\widehat{K}=K$ .
•

Example 3: Under the conditions that $2K\leq p\ll(\log d)^{-2}d$ and ${K(\log d)^{3}}\sqrt{n/p}\ll\Delta_{0}^{2}\ll{nK/d}(\log d)^{2}$ , if we take $\mu_{0}=d(\log d)^{2}\sqrt{n/p}/12$ , with probability at least $1-O\left(d^{-(L\wedge 20)/2}\right)$ , we have $\widehat{K}=K$ .
•

Example 4: When $\theta\geq d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ , $\|\mathbf{V}\|_{2,\infty}\leq\sqrt{\mu K/d}$ for some $\mu\geq 1$ , $\kappa_{2}^{2}\mu^{2}K\ll(\log d)^{2}$ , $\sqrt{d}\lesssim p\ll(\log d)^{-2}d$ and $(p\theta)^{-1/4}\sqrt{d\sigma/\Delta}\log d=o(1)$ , if we take $\mu_{0}=d\widehat{\sigma}_{0}\log d(p\widehat{\theta})^{-1/2}/12$ , where $\widehat{\sigma}_{0}=\big{(}\sum_{(i,j)\in{\mathcal{S}}}(\widehat{\theta}\widehat{\mathbf{M}}_{ij})^{2}/|{\mathcal{S}}|\big{)}^{1/2}$ , then with probability at least $1-O\left(d^{-(L\wedge 20)/2}\right)$ , we have $\widehat{K}=K$ .

Remark 5.

For Example 3, we impose the upper bound on $\Delta_{0}$ because in practice the eigengap $\Delta$ is unknown, and estimation of $\Delta$ requires knowledge of $K$ . Imposing the upper bound on $\Delta_{0}$ makes the term in $\mu_{0}$ involving knowledge of $\Delta$ vanish and enables the estimation of $K$ from observed data.

4.2 Inferential Results on the Asymptotic Distribution: Intuition and Assumptions

In Section 4.1, we discuss the theoretical upper bound for the error rate and present the bias-variance decomposition for the FADI estimator $\widetilde{\mathbf{V}}^{\text{F}}$ . From (2), we can see that when $Lp\gg d$ , the bias term will be the leading term, and the dominating error comes from $\rho(\widehat{\mathbf{V}},\mathbf{V})$ , whereas when $Lp\ll d$ , the variance term will be the leading term and the main error derives from $\rho(\widetilde{\mathbf{V}}^{\text{F}},\widehat{\mathbf{V}})$ . This offers insight into conducting inferential analysis on the estimator and implies a possible phase transition in the asymptotic distribution. Before moving on to further discussions, we state the following assumption to ensure that the bias of $\widehat{\mathbf{M}}$ is negligible.

Assumption 2 (Statistical Rate for the Biased Error Term).

For the error matrix $\mathbf{E}$ we have the decomposition $\mathbf{E}=\mathbf{E}_{0}+\mathbf{E}_{b}$ , where ${\mathbb{E}}(\mathbf{E}_{0})=\mathbf{0}$ and $\mathbf{E}_{b}$ is the biased error term satisfying $\lim_{d\rightarrow\infty}\mathbb{P}\big{(}\|\mathbf{E}_{b}\|_{2}{\leq}r_{2}(d)\big{)}=1$ with $r_{2}(d)=o\big{(}r_{1}(d)\big{)}$ .

In fact, we will later show in Section 4.3 and Section 4.4 that the leading term for the distance between $\widetilde{\mathbf{V}}^{\text{F}}$ and $\mathbf{V}$ takes on two different forms under the two regimes:

\begin{array}[]{ll}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\approx\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}&,\quad\text{ if }Lp\gg d;\\ \widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\approx\mathbf{P}_{\perp}\mathbf{E}_{0}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}L^{-1}&,\quad\text{ if }Lp\ll d,\end{array}

where $\mathbf{H}$ is some orthogonal matrix aligning $\widetilde{\mathbf{V}}^{\text{F}}$ with $\mathbf{V}$ , $\mathbf{P}_{\perp}=\mathbf{I}-\mathbf{V}\mathbf{V}^{\top}$ is the projection matrix onto the linear space perpendicular to $\mathbf{V}$ , $\mathbf{\Omega}=(\mathbf{\Omega}^{(1)}/\sqrt{p},\ldots,\mathbf{\Omega}^{(L)}/\sqrt{p})\in\mathbb{R}^{d\times Lp}$ and $\mathbf{B}_{\mathbf{\Omega}}=(\mathbf{B}^{(1)\top},\ldots,\mathbf{B}^{(L)\top})^{\top}$ with $\mathbf{B}^{(\ell)}=(\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}\in\mathbb{R}^{p\times K}$ for $\ell=1,\ldots,L$ . To get an intuitive understanding on the form of the leading error term, let’s start with the regime $Lp\gg d$ where $\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})\approx\rho(\widehat{\mathbf{V}},\mathbf{V})$ and consider the case where $\{|\lambda_{k}|\}_{k=1}^{K}$ are well-separated such that $\mathbf{H}\approx\mathbf{I}_{K}$ . Following basic algebra, we have

	$\displaystyle\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}$	$\displaystyle\approx\widehat{\mathbf{V}}-\mathbf{V}\approx\mathbf{P}_{\perp}(\widehat{\mathbf{V}}-\mathbf{V})=\mathbf{P}_{\perp}(\widehat{\mathbf{M}}\widehat{\mathbf{V}}\widehat{\mathbf{\Lambda}}^{-1}-\mathbf{M}\mathbf{V}\mathbf{\Lambda}^{-1})$
		$\displaystyle\approx\mathbf{P}_{\perp}(\widehat{\mathbf{M}}-\mathbf{M})\mathbf{V}\mathbf{\Lambda}^{-1}=\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1},$

where $\widehat{\mathbf{\Lambda}}$ is the $K$ -leading eigenvalues of $\widehat{\mathbf{M}}$ corresponding to $\widehat{\mathbf{V}}$ , and the second approximation is due to the fact that $\widehat{\mathbf{V}}$ and $\mathbf{V}$ are fairly close and $\mathbf{P}_{\mathbf{V}}(\widehat{\mathbf{V}}-\mathbf{V})$ will be negligible.

Now we turn to the scenario $Lp\ll d$ , where the error mainly comes from $\widetilde{\mathbf{V}}^{\text{F}}-\widehat{\mathbf{V}}$ . For a given $\ell\in[L]$ , denote $\mathbf{Y}^{(\ell)}=\mathbf{M}\bm{\Omega}^{(\ell)}=\mathbf{V}\mathbf{\Lambda}\widetilde{\bm{\Omega}}^{(\ell)}$ , where $\widetilde{\bm{\Omega}}^{(\ell)}=\mathbf{V}^{\top}\bm{\Omega}^{(\ell)}$ is also a Gaussian test matrix. Intuitively, $p^{-1}\widetilde{\mathbf{\Omega}}^{(\ell)}\widetilde{\mathbf{\Omega}}^{(\ell)\top}\approx\mathbf{I}_{K}$ when $p$ is much larger than $K$ . Hence $\widetilde{\mathbf{\Omega}}^{(\ell)}$ acts like an orthonormal matrix scaled by $\sqrt{p}$ , and the rank- $K$ truncated SVD for $\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}$ and $\mathbf{Y}^{(\ell)}/\sqrt{p}$ will approximately be $\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{\Lambda}}(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})$ and $\mathbf{V}\mathbf{\Lambda}(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})$ respectively. Then following similar arguments as when $Lp\gg d$ , we have

	$\displaystyle\widehat{\mathbf{V}}^{(\ell)}-\mathbf{V}\approx\mathbf{P}_{\perp}\left((\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\widehat{\mathbf{\Lambda}}^{-1}-(\mathbf{Y}^{(\ell)}/\sqrt{p})(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\mathbf{\Lambda}^{-1}\right)$
	$\displaystyle\quad\approx\mathbf{P}_{\perp}\left(\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-\mathbf{Y}^{(\ell)}/\sqrt{p}\right)(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\mathbf{\Lambda}^{-1}\approx\mathbf{P}_{\perp}\mathbf{E}_{0}(\bm{\Omega}^{(\ell)}/\sqrt{p})\mathbf{B}^{(\ell)},$

where the last approximation is because when $\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p}$ is almost orthonormal we have $\mathbf{B}^{(\ell)}=(\mathbf{\Lambda}\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\dagger}\approx(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\mathbf{\Lambda}^{-1}$ . Then aggregating the results over $\ell\in[L]$ we have

\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\approx\frac{1}{L}\sum_{\ell=1}^{L}\left\{\widehat{\mathbf{V}}^{(\ell)}-\mathbf{V}\right\}\approx\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\mathbf{E}_{0}(\bm{\Omega}^{(\ell)}/\sqrt{p})\mathbf{B}^{(\ell)}=\mathbf{P}_{\perp}\mathbf{E}_{0}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}L^{-1}.

It is worth noting that

\frac{1}{L}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}\approx\frac{1}{L}\left(\sum_{\ell=1}^{L}(\bm{\Omega}/\sqrt{p})(\bm{\Omega}/\sqrt{p})^{\top}\right)\mathbf{V}\mathbf{\Lambda}^{-1}\rightarrow\mathbf{V}\mathbf{\Lambda}^{-1},

(7)

when $Lp\gg d$ , which demonstrates the consistency of the leading term across different regimes of $Lp$ . To unify the notations, we denote the leading term for $\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}$ by

\mathcal{V}(\mathbf{E}_{0})=\left\{\begin{array}[]{ll}\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}&,\quad\text{ if }Lp\gg d;\\ \mathbf{P}_{\perp}\mathbf{E}_{0}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}L^{-1}&,\quad\text{ if }Lp\ll d.\end{array}\right.

Before we formally present the theorems, we introduce the following extra regularity conditions necessary for studying the asymptotic features of the eigenspace estimator.

Assumption 3 (Incoherence Condition).

For the eigenspace of the true matrix $\mathbf{M}$ , we assume

\|\mathbf{V}\|_{2,\infty}\leq\sqrt{{\mu K}/{d}},

where $\mu\geq 1$ may change with $d$ .

Assumption 4 (Statistical Rates for Eigenspace Convergence).

For the unbiased error term $\mathbf{E}_{0}$ and the traditional PCA estimator $\widehat{\mathbf{V}}$ , we have the following statistical rates

\lim_{d\rightarrow\infty}\mathbb{P}\big{(}\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}{\leq}r_{3}(d)\big{)}=1,\,\,\lim_{d\rightarrow\infty}\mathbb{P}\big{(}\|\mathbf{E}_{0}(\mathbf{I}_{d}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top})\mathbf{V}\|_{2,\infty}{\leq}r_{4}(d)\big{)}=1.

Assumption 5 (Central Limit Theorem).

For the leading term $\mathcal{V}(\mathbf{E}_{0})$ and any $j\in[d]$ , it holds that

\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),

where $\bm{\Sigma}_{j}=\operatorname*{\rm Cov}(\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}|\bm{\Omega})$ when $Lp\ll d$ and $\bm{\Sigma}_{j}=\operatorname*{\rm Cov}(\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j})$ when $Lp\gg d$ .

Assumption 3 is the incoherence condition [10] to guarantee that the information of the eigenspace is uniformly spread. In Assumption 4 , $r_{3}(d)$ bounds the row-wise estimation error for the eigenspace, while $r_{4}(d)$ characterizes the row-wise convergence rate of the residual error term projected onto the spaces spanned by $\widehat{\mathbf{V}}_{\perp}$ and $\mathbf{V}$ consecutively, i.e., $\|\mathbf{E}_{0}(\mathbf{I}_{d}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top})\mathbf{V}\|_{2,\infty}=\|\mathbf{E}_{0}\mathbf{P}_{\widehat{\mathbf{V}}_{\perp}}\mathbf{P}_{\mathbf{V}}\|_{2,\infty}$ . Assumption 5 states that the leading term satisfies the central limit theorem (CLT). These assumptions are for the general framework and will be translated into case-specific conditions for concrete examples. With the above assumptions in place, we are ready to present the formal inferential results.

4.3 Inference When $Lp\gg d$

Recall that $\widetilde{\mathbf{V}}$ is the $K$ leading eigenvectors of the matrix $\widetilde{\mathbf{\Sigma}}=\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}$ , and $\widetilde{\mathbf{V}}^{\text{F}}$ is the $K$ leading left singular vectors of the matrix $\widetilde{\mathbf{Y}}=\widetilde{\bm{\Sigma}}^{q}\mathbf{\Omega}^{\text{F}}$ . We define $\mathbf{H}=\mathbf{H}_{2}\mathbf{H}_{1}\mathbf{H}_{0}$ to be the alignment matrix between $\widetilde{\mathbf{V}}^{\text{F}}$ and $\mathbf{V}$ , where $\mathbf{H}_{2}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\widetilde{\mathbf{V}})$ , $\mathbf{H}_{1}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\top}\widehat{\mathbf{V}})$ and $\mathbf{H}_{0}=\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}{\mathbf{V}})$ . The follow theorem provides the distributional guarantee of FADI when $Lp\gg d$ .

Theorem 4.5.

When $Lp\gg d$ , under Assumptions 1 - 5, recall $\bm{\Sigma}_{j}=\operatorname*{\rm Cov}\big{(}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\big{)}$ for $j\in[d]$ . Define $r(d)=\Delta^{-1}\big{(}\sqrt{\frac{Kd}{pL}}r_{1}(d)+r_{3}(d)r_{1}(d)\!+\!\sqrt{\frac{\mu K}{d\Delta^{2}}}r_{1}(d)^{2}\!+\!r_{2}(d)\!+\!r_{4}(d)\big{)}$ , and assume that there exists a statistical rate $\eta_{1}(d)$ such that

\min_{j\in[d]}\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\gtrsim\eta_{1}(d)\quad\text{and}\quad\eta_{1}(d)^{-1/2}r(d)=o(1).

If $\Delta^{-1}r_{1}(d)(\log d)^{2}\sqrt{d/p}=o(1)$ and we take

q\geq 2+\log(Ld)/\log\log d,\quad p^{\prime}\geq\max(2K,K+7)\quad\text{and}\quad p\geq\max(2K,K+8q-1),

we have

\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(8)

Remark 6.

Please refer to Appendix B.9 for the proof of Theorem 4.5. Here $\eta_{1}(d)$ guarantees that the asymptotic covariance of the leading term is positive definite, and the rate $r(d)$ bounds the remainder term stemming from fast sketching approximation and eigenspace misalignment. When the rate $\eta_{1}(d)$ is not too small relative to $r(d)$ , Theorem 4.5 guarantees the distributional convergence of the FADI estimator. We will see in the concrete examples that the asymptotic covariance of the FADI estimator under the regime $Lp\gg d$ is the same as that of the traditional PCA estimator. In other words, we can increase the number of repeated sketches in exchange for the same testing efficiency as the traditional PCA.

We present the corollaries of Theorem 4.5 for Examples 1 to 4 as follows.

4.3.1 Spiked Covariance Model

Recall the set $S$ of size $K^{\prime}$ defined in Section 3.3 for estimating $\widehat{\sigma}^{2}$ . We denote by $\bm{\Sigma}_{S}$ the population covariance matrix corresponding to $\widehat{\bm{\Sigma}}_{S}$ and define $\widetilde{\sigma}_{1}=\|\mathbf{\Sigma}_{S}\|_{2}$ . Denote by $\delta=\lambda_{K}(\bm{\Sigma}_{S})-\sigma^{2}$ the eigengap of $\bm{\Sigma}_{S}$ . We have the following corollary of Theorem 4.5 for Example 1.

Corollary 4.6.

Assume that $\{\bm{X}_{i}\}_{i=1}^{n}$ are i.i.d. multivariate Gaussian. If we take $K^{\prime}=K+1$ , $p^{\prime}\geq\max(2K,K+7)$ , $q\geq 2+\log(Ld)/\log\log d$ and $p\geq\max(2K,K+8q-1)$ , then when $Lp\gg{Kdr}\kappa_{1}^{2}{\lambda_{1}}/{\sigma^{2}}$ , under Assumption 3 and the conditions that

n\gg\max\Big{(}\kappa_{1}^{4}(\log d)^{4}r^{2}{\lambda_{1}}/{\sigma^{2}},\big{(}\kappa_{1}{\lambda_{1}}/{\sigma^{2}}\big{)}^{6}\Big{)}\,\,\text{and}\,\,K\ll\min\left(\big{(}{\widetilde{\sigma}_{1}}/{\delta}\big{)}^{-2}\kappa_{1}r,\mu^{-{2}/{3}}\kappa^{-{4}/{3}}_{1}d^{{2}/{3}}\right),

we have that (8) holds. Furthermore, we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d],

(9)

where $\widetilde{\bm{\Sigma}}_{j}=\frac{\sigma^{2}}{n}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\bm{\Sigma}\mathbf{V}\mathbf{\Lambda}^{-1}$ is a simplification of $\bm{\Sigma}_{j}$ under Example 1. Besides, if we define $\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F}$ and estimate $\widetilde{\bm{\Sigma}}_{j}$ by $\widehat{\bm{\Sigma}}_{j}=\frac{1}{n}(\widehat{\sigma}^{2}\widetilde{\mathbf{\Lambda}}^{-1}+\widehat{\sigma}^{4}\widetilde{\mathbf{\Lambda}}^{-2})$ , then we have

{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(10)

Remark 7.

Please refer to Appendix B.10 for the proof. We compute $\widetilde{\mathbf{\Lambda}}$ distributively across the $m$ data splits, and the cost for computing $\widehat{\bm{\Sigma}}_{j}$ is $O(ndK/m)$ . We recommend taking $p=\lceil\sqrt{d}\rceil$ , $L=\lceil\kappa_{1}^{2}Kd^{3/2}\log d\rceil$ and $q=\lceil\log d\rceil\gg 2+\log(Ld)/\log\log d$ for optimal computational efficiency, where the total computation cost will be $O(K^{3}d^{5/2}(\log d)^{2})$ . Our asymptotic covariance matrix is the same as that of the traditional PCA estimator under the incoherence condition [5, 32, 41]. Specifically, Wang and Fan, [41] studied the asymptotic distribution of the traditional PCA estimator by assuming that the spiked eigenvalues are well-separated and diverging to infinity, which is not required by our paper. Our scaling conditions are stronger than the estimation results in Corollary 4.2 to cancel out the additional randomness induced by fast sketching and allow for efficient inference.

4.3.2 Degree-Corrected Mixed Membership Models

Corollary 4.7.

When $\theta\geq K^{2}d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ and $K=o(d^{1/32})$ , if we take $p\gtrsim\sqrt{d}$ , $p^{\prime}\geq\max(2K,K+7)$ , $L\gg K^{5}d^{2}/p$ and $q\geq 2+\log(Ld)/\log\log d$ , then (8) holds. Furthermore, if we denote $\widetilde{\bm{\Sigma}}_{j}=\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\operatorname{diag}\big{(}\left[\mathbf{M}_{jj^{\prime}}(1-\mathbf{M}_{jj^{\prime}})\right]_{j^{\prime}\in[d]}\big{)}\mathbf{V}\mathbf{\Lambda}^{-1}$ , we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(11)

Besides, define $\widetilde{\mathbf{M}}=(\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{{\rm F}\top})\widehat{\mathbf{M}}(\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{{\rm F}\top})$ and $\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F}$ , then if we estimate $\widetilde{\bm{\Sigma}}_{j}$ by $\widehat{\bm{\Sigma}}_{j}=\widetilde{\mathbf{\Lambda}}^{-1}\widetilde{\mathbf{V}}^{{\rm F}\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{jj^{\prime}}(1-\widetilde{\mathbf{M}}_{jj^{\prime}})]_{j^{\prime}\in[d]}\big{)}\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{\Lambda}}^{-1}$ , we have

{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(12)

Remark 8.

The proof is deferred to Appendix B.11. We can obtain $\widetilde{\mathbf{\Lambda}}$ by computing $\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{X}^{(s)}$ in parallel for $s\in[m]$ , and the computational cost for $\widehat{\bm{\Sigma}}_{j}$ is $O(d^{2}K/m)$ . To achieve the optimal computational efficiency, we would take $p=\lceil\sqrt{d}\rceil$ and $L=\lceil K^{5}d^{3/2}\log d\rceil$ . Hence taking $q=\lceil\log d\rceil$ is sufficient, and the total computational cost will be $O(K^{7}d^{5/2}(\log d)^{2})$ . Inferential analyses on the membership profiles has received attention in previous works [16, 37]. Fan et al., [16] studied the asymptotic normality of the spectral estimator under the DCMM model with complicated assumptions on the eigen-structure (see Conditions 1, 3, 6, 7 in their paper). In comparison, we only impose non-singularity conditions on the membership profiles, but have a stronger scaling condition on the signal strength to facilitate the divide-and-conquer process. Our asymptotic covariance is almost the same as Fan et al., [16]’s, suggesting the same level of asymptotic efficiency.

4.3.3 Gaussian Mixure Models

Denote by $\mu_{\theta}=\Delta_{0}^{-1}\sqrt{n/K}\|\mathbf{\Theta}\|_{2,\infty}$ the incoherence parameter for the Gaussian means. Then we have the following corollary for Example 3.

Corollary 4.8.

When $Lp\gg d$ , If we take $q\geq 2+\log(Ld)/\log\log d$ , $p\geq\max(2K,K+8q-1)$ and $p^{\prime}\geq\max(2K,K+7)$ , under the conditions that

K=o(d),\quad n\gg d^{2},\quad K\sqrt{n}(\log d)^{2}\ll\Delta_{0}^{2}\ll\frac{n^{4/3}}{\mu_{\theta}^{2}d}\quad\text{and}\quad L\gg\frac{Kd^{2}}{p},

we have that (8) holds. Furthermore, if we denote $\widetilde{\bm{\Sigma}}_{j}=\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\big{\{}\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d}\big{\}}\mathbf{V}\mathbf{\Lambda}^{-1}$ , we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(13)

If we define $\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F}$ and estimate $\widetilde{\bm{\Sigma}}_{j}$ by $\widehat{\bm{\Sigma}}_{j}=\widetilde{\mathbf{\Lambda}}^{-1}+n\widetilde{\mathbf{\Lambda}}^{-2}$ , we have

{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(14)

Remark 9.

Please refer to Appendix B.12 for the proof. We impose the upper bound on $\Delta_{0}$ to guarantee that the leading term satisfies the CLT. The distributive computation cost of $\widehat{\bm{\Sigma}}_{j}$ is $O(ndK/m)$ . We recommend taking $p=\lceil\sqrt{d}\rceil$ , $L=\lceil Kd^{3/2}\log d\rceil$ and $q=\lceil\log d\rceil$ , with total complexity of $O(K^{3}d^{5/2}(\log d)^{2})$ . In Corollary 4.8, the scaling condition for $n$ is $n\gg d^{2}$ compared to $n>d$ in Corollary 4.2, where the extra factor $d$ is to guarantee fast enough convergence rate of the remainder term for inference. It can be verified that the Cramér-Rao lower bound for unbiased estimators of $\mathbf{V}^{\top}\mathbf{e}_{j}$ is $\mathbf{\Lambda}^{-1}$ , and thus we can also see from (13) that when $\Delta_{0}$ is large enough, the asymptotic efficiency of $\widetilde{\mathbf{V}}^{\text{F}}$ is 1 under the regime $Lp\gg d$ .

4.3.4 Incomplete Matrix Inference

Corollary 4.9.

When $Lp\gg\kappa_{2}^{2}Kd^{2}$ and $\theta\geq d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ , if we take $p^{\prime}\geq\max(2K,K+7)$ , $p\gtrsim\sqrt{d}$ and $q\geq 2+\log(Ld)/\log\log d$ , then under Assumption 3 and the conditions that

\kappa_{2}^{6}K^{3}\mu^{3}=o(d^{1/2})\quad\text{and}\quad\sigma/\Delta\ll\sqrt{\theta/d}\cdot\min\left(\big{(}\kappa_{2}^{2}\sqrt{\mu K}+\kappa_{2}\sqrt{K\log d}\big{)}^{-1},\sqrt{p/d}\right),

we have that (8) holds. Furthermore, if we denote $\widetilde{\bm{\Sigma}}_{j}=\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\operatorname{diag}\big{(}[\mathbf{M}_{jj^{\prime}}^{2}(1-\theta)/\theta+\sigma^{2}/\theta]_{j^{\prime}=1}^{d}\big{)}\mathbf{V}\mathbf{\Lambda}^{-1}$ , we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(15)

Define $\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F}$ and $\widetilde{\mathbf{M}}=\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{\Lambda}}\widetilde{\mathbf{V}}^{{\rm F}\top}$ . If we estimate $\sigma^{2}$ by $\widehat{\sigma}^{2}=\sum_{(i,i^{\prime})\in{\mathcal{S}}}(\widehat{\theta}\widehat{\mathbf{M}}_{ii^{\prime}}-\widetilde{\mathbf{M}}_{ii^{\prime}})^{2}/|{\mathcal{S}}|$ and $\widetilde{\bm{\Sigma}}_{j}$ by $\widehat{\bm{\Sigma}}_{j}=\widetilde{\mathbf{\Lambda}}^{-1}\widetilde{\mathbf{V}}^{{\rm F}\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{jj^{\prime}}^{2}(1-\widehat{\theta})/\widehat{\theta}+\widehat{\sigma}^{2}/\widehat{\theta}]_{j^{\prime}=1}^{d}\big{)}\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{\Lambda}}^{-1}$ , we have

{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(16)

Remark 10.

Please see Appendix B.13 for the proof of Corollary 4.9. We compute $\widetilde{\mathbf{\Lambda}}$ by calculating $\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{X}^{(s)}$ in parallel, and then $\widetilde{\mathbf{\Lambda}}$ can be communicated across servers at low cost for computing $\widehat{\sigma}^{2}$ . The total computational cost for calculating $\widehat{\bm{\Sigma}}_{j}$ is $O(d^{2}K/m)$ . We recommend taking $p=\lceil\sqrt{d}\rceil$ , $L=\lceil\kappa_{2}^{2}Kd^{3/2}\log d\rceil$ and $q=\lceil\log d\rceil$ , and the total computational cost will be $O(K^{3}d^{5/2}(\log d)^{2})$ . Chen et al., [13] studied the incomplete matrix inference problem through penalized optimization, and their testing efficiency is the same as ours.

4.4 Inference When $Lp\ll d$

Similar as when $Lp\gg d$ , we first redefine the alignment matrix between $\widetilde{\mathbf{V}}^{\text{F}}$ and $\mathbf{V}$ as $\mathbf{H}=\mathbf{H}_{1}\mathbf{H}_{0}$ , where $\mathbf{H}_{1}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\widetilde{\mathbf{V}})$ and $\mathbf{H}_{0}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\top}\mathbf{V})$ . Then we have the following theorem characterizing the limiting distribution for $\widetilde{\mathbf{V}}^{\text{F}}$ .

Theorem 4.10.

For the case when $Lp\ll d$ , under Assumptions 1, 2, 3 and 5, for $j\in[d]$ , recall $\mathbf{\Sigma}_{j}=\operatorname{Cov}(\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}|\mathbf{\Omega})$ and assume that there exists a statistical rate $\eta_{2}(d)$ such that

\lim_{d\rightarrow\infty}\!\!\mathbb{P}_{\mathbf{\Omega}}\Big{(}\!\min_{j\in[d]}\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}{\geq}\eta_{2}(d)\!\!\Big{)}\!=1,\,\frac{d^{2}r_{1}(d)^{4}(\log d)^{4}}{p^{2}\Delta^{4}\big{(}\eta_{2}(d)\!\!\wedge\!\!(\log d)^{-1}\big{)}}=o(1)\,\text{and}\,\frac{dr_{2}(d)^{2}}{Lp\Delta^{2}\eta_{2}(d)}=o(1).

Then if we take $K(\log d)^{2}\ll p\asymp p^{\prime}\lesssim d/(\log d)^{2}$ and $q\geq\log d$ we have

\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(17)

Remark 11.

Theorem 4.10 states that under proper scaling conditions, the FADI estimator still enjoys asymptotic normality even when the aggregated sketching dimension $Lp$ is much smaller than $d$ . The rate $\eta_{2}(d)$ is usually at least of order $(d/\lambda_{1}^{2}Lp)\lambda_{\min}(\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{e}_{j}))$ . In comparison, the rate $\eta_{1}(d)$ in Theorem 4.5 is usually of order $\lambda_{1}^{-2}\lambda_{\min}(\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{e}_{j}))$ , suggesting a larger variance and lower testing efficiency of FADI at $Lp\ll d$ than at $Lp\gg d$ . The proof is deferred to Appendix B.6.

The following corollaries of Theorem 4.10 provide case-specific distributional guarantee for Examples 1 and 3 under the regime $Lp\ll d$ .

4.4.1 Spiked Covariance Model

Corollary 4.11.

Assume $\{\bm{X}_{i}\}_{i=1}^{n}$ are i.i.d. multivariate Gaussian. When $Lp\ll{\lambda_{1}^{-2}\Delta^{2}d}$ , if we take $K^{\prime}=K+1$ , $K(\log d)^{2}\ll p\asymp p^{\prime}\lesssim d/(\log d)^{2}$ and $q\geq\log d$ , under Assumption 3 and the conditions that

\quad{n}\gg\max\Big{(}\frac{\kappa_{1}^{4}\lambda_{1}^{2}dr^{2}L}{p\sigma^{4}},\frac{\lambda_{1}^{2}\widetilde{\sigma}_{1}^{6}K^{2}}{\Delta^{2}\delta^{4}\sigma^{4}}\Big{)}(\log d)^{4}\quad\text{and}\quad\frac{K\lambda_{1}^{2}}{\Delta^{2}}\sqrt{\frac{\mu}{d}}=o(1),

we have that (17) holds. Furthermore, if we define $\widetilde{\bm{\Sigma}}_{j}=\frac{\sigma^{2}}{nL^{2}}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\bm{\Sigma}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}$ , we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(18)

Besides, if we further assume ${\sigma^{-2}}{\lambda_{1}\kappa_{1}^{4}}\sqrt{{d^{2}r}/{(np^{2}L)}}=o(1)$ and estimate $\widetilde{\bm{\Sigma}}_{j}$ by $\widehat{\bm{\Sigma}}_{j}=\frac{\widehat{\sigma}^{2}}{nL^{2}}\widehat{\mathbf{B}}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\widehat{\bm{\Sigma}}\mathbf{\Omega}\widehat{\mathbf{B}}_{\mathbf{\Omega}}$ , where $\widehat{\mathbf{B}}_{\mathbf{\Omega}}=(\widehat{\mathbf{B}}^{(1)\top},\ldots,\widehat{\mathbf{B}}^{(L)\top})^{\top}$ with $\widehat{\mathbf{B}}^{(\ell)}=(\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})^{\dagger}$ for $\ell\in[L]$ , we have

{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(19)

Remark 12.

Please refer to Appendix B.7 for the proof. For the computation of $\widehat{\bm{\Sigma}}_{j}$ , apart from $\widehat{\mathbf{V}}^{(\ell)}$ , the $\ell$ -th machine on layer 2 (see Figure 2) will send $\mathbf{\Omega}^{(\ell)}$ and $\widehat{\mathbf{Y}}^{(\ell)}$ to the central processor, and the total communication cost for each server is $O(dp)$ . On the central processor, the total computational cost of $\mathbf{B}_{\mathbf{\Omega}}$ will be $O(dpKL)$ . Then we will compute $\mathbf{\Omega}^{\top}\widehat{\bm{\Sigma}}\mathbf{\Omega}=\frac{1}{\sqrt{p}}\mathbf{\Omega}^{\top}(\widehat{\mathbf{Y}}^{(1)},\ldots,\widehat{\mathbf{Y}}^{(L)})+\widehat{\sigma}^{2}\bm{\Omega}^{\top}\bm{\Omega}$ with total computational cost of $O\big{(}d(Lp)^{2}\big{)}=o(d^{3})$ . Compared to Corollary 4.6, Corollary 4.11 has stronger scaling conditions on the sample size $n$ to compensate for the extra variability due to less fast sketches. As indicated by (7), the asymptotic covariance matrix of Corollary 4.12 is consistent with Corollary 4.8.

4.4.2 Gaussian Mixture Models

Corollary 4.12.

When $Lp\ll d$ , if we take $K(\log d)^{2}\ll p\asymp p^{\prime}\lesssim d/(\log d)^{2}$ and $q\geq\log d$ , we have that (17) holds under the conditions that

\sqrt{\frac{K}{d}}\log d=O(1),\quad n\gg\frac{d^{3}L}{p},\quad\text{and}\quad K(\log d)^{2}\sqrt{\frac{dnL}{p}}\ll\Delta_{0}^{2}\ll\min\left(n,\frac{n^{4/3}}{\mu_{\theta}^{2}d}\right).

Furthermore, if we define $\widetilde{\bm{\Sigma}}_{j}=L^{-2}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{(}\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d}\Big{)}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}},$ then we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(20)

If we further assume $d^{4}\Delta_{0}^{2}\ll{KLp^{2}n^{2}}$ and estimate $\widetilde{\bm{\Sigma}}_{j}$ by $\widehat{\bm{\Sigma}}_{j}=\frac{1}{L^{2}}\widehat{\mathbf{B}}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{(}\widehat{\mathbf{M}}+n\mathbf{I}_{d}\Big{)}\mathbf{\Omega}\widehat{\mathbf{B}}_{\mathbf{\Omega}}$ , where $\widehat{\mathbf{B}}_{\mathbf{\Omega}}=(\widehat{\mathbf{B}}^{(1)\top},\ldots,\widehat{\mathbf{B}}^{(L)\top})^{\top}$ with $\widehat{\mathbf{B}}^{(\ell)}=(\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})^{\dagger}$ for $\ell\in[L]$ , we have

{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d].

(21)

Remark 13.

The proof of Corollary 4.12 is deferred to Appendix B.8. Computation of $\widehat{\bm{\Sigma}}_{j}$ is very similar to Example 1 as described in Remark 12, and the total computational cost is $O(d(Lp)^{2})=o(d^{3})$ . The stronger scaling conditions are the trade-off for higher computational efficiency with less fast sketches.

We do not have distributional results for Examples 2 and 4 under the regime $Lp\ll d$ . An intuitive explanation would be that the information contained in each entry is independent for Example 2 and Example 4, and when $Lp\ll d$ , too much information will be lost from the $d\times d$ graph or matrix. In comparison, we can still recover information from Examples 1 and 3 under the regime $Lp\ll d$ due to the correlation structure of the matrix.

5 Numerical Results

We conduct extensive simulation studies to assess the performance of FADI under each example given in Section 2 and compare it with several existing methods. We provide in this section the representative results for Examples 1 and 2. The results for Examples 3 and 4 are given in Appendix A.

5.1 Example 1: Spiked Covariance Model

We generate $\{\bm{X}_{i}\}_{i=1}^{n}$ i.i.d. from ${\mathcal{N}}(\mathbf{0},\mathbf{\Sigma})$ , where $\bm{\Sigma}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}+\sigma^{2}\mathbf{I}_{d}$ . We consider $K=3$ , $n=20000$ and set $d=500,1000,2000$ respectively to study the asymptotic properties of the FADI estimator under different settings. To ensure the incoherence condition is satisfied, we set $\mathbf{V}$ to be the left singular vectors of a $d\times K$ i.i.d. Gaussian matrix. We take $\mathbf{\Lambda}=\operatorname{diag}(6,4,2)$ and $\sigma^{2}=1$ . For the estimation of $\sigma^{2}$ in Step 0, we set $K^{\prime}=6$ . We split the data into $m=20$ subsamples, and set $p=p^{\prime}=12$ and $q=7$ in Step 3 to compute $\widetilde{\mathbf{V}}^{\text{F}}$ . We set $L$ at a range of values by taking the ratio $Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\}$ for each setting and compute the asymptotic covariance via Corollary 4.6 and Corollary 4.11 correspondingly. We define $\widetilde{\mathbf{v}}=\widehat{\bm{\Sigma}}_{1}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{1}$ , where $\mathbf{H}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{V})$ , and calculate the coverage probability by empirically evaluating $\mathbb{P}\big{(}\|\widetilde{\mathbf{v}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}$ with $\chi_{3}^{2}(0.95)$ being the 0.95 quantile of the Chi-squared distribution with degrees of freedom equal to 3. Results under different settings are shown in Figure 3. Figure 3(a) shows that as $Lp/d$ increases, the error rate of FADI converges to that of the traditional PCA. From Figure 3(b) we can see that when $Lp/d$ is approaching 1 from the left, the computational efficiency drops due to the cost of computing $\widehat{\bm{\Sigma}}_{1}$ . For Figure 3(c), convergence towards the nominal 95% level can be observed when $Lp/d$ is much smaller or much larger than 1, while the valley at $Lp/d$ around 1 is consistent with the theoretical conditions on $Lp/d$ in Section 4 and implies a possible phase-transition phenomenon on the distributional convergence of FADI. Note that the empirical coverage is closer to the nominal level 0.95 at $d=2000$ than at $d\in\{500,1000\}$ , which might be caused by the vanishing of some error terms for approximation of the asymptotic covariance matrix as $d$ grows larger. The good Gaussian approximation of $\widetilde{\mathbf{v}}_{1}$ is further validated by Figure 3(d), where $\widetilde{\mathbf{v}}_{1}$ is the first entry of $\widetilde{\mathbf{v}}$ . Based upon the low computational efficiency and poor empirical coverage at $Lp/d$ around 1, we recommend conducting inference based on FADI at regimes $Lp\gg d$ and $Lp\ll d$ only. In particular, we suggest the regime $Lp\gg d$ if priority is given to higher testing efficiency, and the regime $Lp\ll d$ if one needs valid inference with faster computation. We also compare FADI with the distributed PCA in [18]. Results over 100 Monte Carlos are given in Table 4. We can see that FADI outperforms both distributed PCA and the traditional PCA under the distributed setting.

Parameters				Error rate			Running time (seconds)
$d$	$n$	$m$	$L$	FADI	Traditional	Distributed	FADI	Traditional	Distributed
400	30000	15	40	0.068	0.065	0.065	0.07	4.53	0.59
400	60000	30	40	0.048	0.046	0.046	0.05	8.84	0.60
400	100000	50	40	0.037	0.036	0.036	0.05	14.84	0.62
800	100000	50	80	0.052	0.050	0.050	0.10	55.76	3.66
800	5000	50	80	0.230	0.220	0.230	0.05	3.76	2.56
800	25000	50	80	0.106	0.103	0.103	0.07	15.07	2.82
800	50000	50	80	0.073	0.070	0.070	0.07	28.68	3.23
1600	30000	15	160	0.134	0.130	0.130	0.31	80.72	27.02
1600	60000	30	160	0.095	0.092	0.092	0.35	150.75	27.29
1600	100000	50	160	0.074	0.071	0.071	0.34	243.83	27.38

Table 4: Comparison of the empirical error rates (of

\rho(\cdot,\mathbf{V})

) and the running times (in seconds) between FADI, traditional full sample PCA and distributed PCA [18] under different settings of

d,n

and

m

\mathbf{\Sigma}=\operatorname{diag}(50,25,12.5,1,\ldots,1)

. For FADI,

p=p^{\prime}=12

K=3

K^{\prime}=4

\Delta=11.5

and

q=7

in all settings.

5.2 Example 2: Degree-Corrected Mixed Membership Models

We consider the mixed membership model without degree heterogeneity for the simulation, i.e., $\mathbf{\Theta}=\sqrt{\theta}\mathbf{I}_{d}$ , and $\mathbf{M}=\theta\mathbf{\Pi}\mathbf{P}\mathbf{\Pi}^{\top}$ . For two preselected nodes $j,j^{\prime}\in[d]$ , we test ${\rm H}_{0}:\bm{\pi}_{j}=\bm{\pi}_{j^{\prime}}$ vs. ${\rm H}_{1}:\bm{\pi}_{j}\neq\bm{\pi}_{j^{\prime}}$ by testing whether $\mathbf{V}^{\top}(\mathbf{e}_{j}-\mathbf{e}_{j^{\prime}})=0$ . To simulate the data, we set $\theta=0.9$ , $K=3$ , and set the membership profiles $\mathbf{\Pi}$ and the connection probability matrix $\mathbf{P}$ to be

\bm{\pi}_{j}=\left\{\begin{array}[]{ll}(1,0,0)^{\top}&\text{ if }1\leq j\leq\lfloor d/6\rfloor\\ (0,1,0)^{\top}&\text{ if }\lfloor d/6\rfloor<j\leq\lfloor d/3\rfloor\\ (0,0,1)^{\top}&\text{ if }\lfloor d/3\rfloor<j\leq\lfloor d/2\rfloor\\ (0.6,0.2,0.2)^{\top}&\text{ if }\lfloor d/2\rfloor<j\leq\lfloor 5d/8\rfloor\\ (0.2,0.6,0.2)^{\top}&\text{ if }\lfloor 5d/8\rfloor<j\leq\lfloor 3d/4\rfloor\\ (0.2,0.2,0.6)^{\top}&\text{ if }\lfloor 3d/4\rfloor<j\leq\lfloor 7d/8\rfloor\\ (1/3,1/3,1/3)^{\top}&\text{ if }\lfloor 7d/8\rfloor<j\leq\lfloor d\rfloor\end{array}\right.,\quad\mathbf{P}=\begin{pmatrix}1&0.2&0.1\\ 0.2&1&0.2\\ 0.1&0.2&1\end{pmatrix}.

We test the performance of FADI under $d\in\{500,1000,2000\}$ respectively, and under each setting of $d$ , we take $m=10$ , $p=p^{\prime}=12$ , $q=7$ and set $L$ by the ratio $Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\}$ . For each setting, we conduct 300 independent Monte Carlo simulations. To perform the test, with minor modifications of Corollary 4.7, we can show that

\widetilde{\bm{\Sigma}}_{j,j^{\prime}}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}(\mathbf{e}_{j}-\mathbf{e}_{j^{\prime}})\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),

(22)

where the asymptotic covariance is defined as $\widetilde{\bm{\Sigma}}_{j,j^{\prime}}=\widetilde{\bm{\Sigma}}_{j}+\widetilde{\bm{\Sigma}}_{j^{\prime}}$ and can be consistently estimated by $\widehat{\bm{\Sigma}}_{j,j^{\prime}}=\widehat{\bm{\Sigma}}_{j}+\widehat{\bm{\Sigma}}_{j^{\prime}}$ . We first preselect two nodes, which we denote by $j$ and $j^{\prime}$ , with membership profiles both equal to $(0.6,0.2,0.2)^{\top}$ and calculate the empirical coverage probability of $\mathbb{P}\big{(}\|\widetilde{\bm{d}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}$ , where $\widetilde{\bm{d}}=\widehat{\bm{\Sigma}}_{j,j^{\prime}}^{-1/2}\widetilde{\mathbf{V}}^{\text{F}\top}(\mathbf{e}_{j}-\mathbf{e}_{j^{\prime}})$ . We also evaluate the power of the test by choosing two nodes with different membership profiles equal to $(0.6,0.2,0.2)^{\top}$ and $(1/3,1/3,1/3)^{\top}$ respectively, which we denote by $j$ and $k$ . We empirically calculate the power $\mathbb{P}\big{(}\|\widetilde{\bm{d}^{\prime}}\|_{2}^{2}\geq\chi_{3}^{2}(0.95)\big{)}$ , where $\widetilde{\bm{d}}^{\prime}=\widehat{\bm{\Sigma}}_{j,k}^{-1/2}\widetilde{\mathbf{V}}^{\text{F}\top}(\mathbf{e}_{j}-\mathbf{e}_{k})$ . Under the regime $Lp/d<1$ , we calculate the asymptotic covariance referring to Theorem 4.10 by

\widehat{\bm{\Sigma}}_{j,j^{\prime}}=L^{-2}\widehat{\mathbf{B}}_{\bm{\Omega}}^{\top}\bm{\Omega}{\top}\operatorname{diag}\left([\widetilde{\mathbf{M}}_{jk}(1-\widetilde{\mathbf{M}}_{jk})+\widetilde{\mathbf{M}}_{j^{\prime}k}(1-\widetilde{\mathbf{M}}_{j^{\prime}k})]_{k=1}^{d}\right)\bm{\Omega}\widehat{\mathbf{B}}_{\bm{\Omega}},

where $\widehat{\mathbf{B}}_{\mathbf{\Omega}}=(\widehat{\mathbf{B}}^{(1)\top},\ldots,\widehat{\mathbf{B}}^{(L)\top})^{\top}$ with $\widehat{\mathbf{B}}^{(\ell)}=(\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})^{\dagger}\in\mathbb{R}^{p\times K}$ for $\ell=1,\ldots,L$ . We also apply k-means to $\widetilde{\mathbf{V}}^{\rm F}$ to differentiate different membership profiles and compare the misclustering rate with the traditional PCA. The results of different settings are shown in Figure 4. We can see from Figure 4(d) that under the regime $Lp/d<1$ , the empirical coverage probability is zero under all settings, which validates the necessity of $Lp/d\gg 1$ for performance guarantee. Figure 4(f) demonstrates the asymptotic normality of $\widetilde{\bm{d}}_{1}$ at $Lp/d=10$ and poor Gaussian approximation of FADI at $Lp/d=0.2$ , where $\widetilde{\bm{d}}_{1}$ is the first entry of $\widetilde{\bm{d}}$ .

We also compare FADI with the SIMPLE method [16] on the membership profile inference under the DCMM model. The SIMPLE method conducted inference directly on the traditional PCA estimator $\widehat{\mathbf{V}}$ and adopted a one-step correction to the empirical eigenvalues for calculating the asymptotic covariance matrix. We compare the inferential performance of FADI at $Lp/d=10$ with the SIMPLE method (under 100 independent Monte Carlos), and summarize the results in Table 5, where the running time includes both the PCA procedure and the computation time of $\widehat{\bm{\Sigma}}_{j,j^{\prime}}$ . Compared to the SIMPLE method, our method has a similar coverage probability and power but is computationally more efficient.

Parameters			Coverage probability		Power		Running time (seconds)
$d$	$p$	$L$	FADI	SIMPLE	FADI	SIMPLE	FADI	SIMPLE
500	12	417	0.91	0.92	0.87	0.88	0.21	0.73
1000	12	833	0.94	0.94	1.00	1.00	0.69	6.77
2000	12	1667	0.95	0.98	1.00	1.00	2.61	59.42

Table 5: Comparison of the coverage probability, power and running time (in seconds) between FADI and SIMPLE [16] under different settings of

d

. In all settings, we take

m=10

p=p^{\prime}=12

q=7

and set

Lp/d=10

for FADI.

6 Application to the 1000 Genomes Data

In this section, we apply FADI and the existing methods to the 1000 Genomes Data [1]. We use phase 3 of the 1000 Genomes Data and focus on common variants with minor allele frequencies larger than or equal to 0.05. There are 2504 subjects in total, and 168,047 independent variants after the linkage disequilibrium (LD) pruning. As we are interested in the ancestry principal components to capture population structure, the sample size $n$ is the number of independent variants after LD pruning ( $n=168,047$ ), and the dimension $d$ is the number of subjects ( $d=2504$ ) [33]. The data were collected from 7 super populations: (1) AFR: African; (2) AMR: Ad Mixed American; (3) EAS: East Asian; (4) EUR: European; (5) SAS: South Asian; (6) PUR: Puerto Rican and (7) FIN: Finnish; and 26 sub-populations.

6.1 Estimation of Principal Eigenspace

For the estimation of the principal components, we assume that the data follow the spiked covariance model specified in Example 1. We perform FADI with $K^{\prime}=27$ , $p=50$ , $p^{\prime}=100$ , $q=3$ , $m=100$ and $L=80$ , where we choose $p$ and $L$ according to Table 1. For the estimation of the number of spikes, we take the thresholding parameter $\mu_{0}=\left(d(np)^{-1/2}\log d\right)^{3/4}/12$ . The estimated number of spikes from FADI is $\widehat{K}=26$ , which is close to 25, the number of self-reported ethnicity groups minus 1, i.e., $K=26-1$ . The results of the 4 leading PCs are shown in Figure 5, where a clear separation can be observed among different super-populations. Figure 10 and Figure 11 in the appendix show a good alignment between the PC results calculated by the traditional PCA and FADI. We compare the computational times of different methods for analyzing the 1000 Genomes Data. FADI takes 5.6 seconds at $q=3$ , whereas the traditional PCA method takes 595.4 seconds and the distributed PCA method [18] takes 120.2 seconds. These results show that FADI greatly outperforms the existing PCA methods in terms of computational time.

6.2 Inference on Ancestry Membership Profiles

We also generate an undirected graph from the 1000 Genomes Data. To increase the randomness for better fitting of the model setting in Example 2, we sample 1000 out of the total 168047 variants for generating the graph. More specifically, we treat each subject as a node, and for each given pair of subjects $(i,j)$ , we define a genetic similarity score $s_{ij}=\sum_{k=1}^{1000}\mathbb{I}\left\{x_{ik}=x_{jk}\right\}$ , where $x_{ik}$ refers to the genotype of the $k$ -th variant for subject $i$ . We denote by $s^{0.95}$ the 0.95 quantile of $\{s_{ij}\}_{i<j}$ . Subjects $i$ and $j$ are connected if and only if $s_{ij}>s^{0.95}$ . Denote by $\mathbf{A}$ the adjacency matrix (allowing no self-loops). We include only four super populations: AFR, EAS, EUR and SAS, with 2058 subjects in total. We are interested in testing whether two given subjects $i$ and $j$ belong to the same super population, i.e., ${\rm H}_{0}:\mathbf{V}_{i}=\mathbf{V}_{j}$ vs. ${\rm H}_{1}:\mathbf{V}_{i}\neq\mathbf{V}_{j}$ . We divide the adjacency matrix equally into $m=10$ splits, and perform FADI with $p=50$ , $p^{\prime}=50$ , $q=3$ and $L=1000$ . The rank estimator from FADI is $\widehat{K}=4$ by setting $\mu_{0}=({\widehat{\theta}}/{p})^{1/2}d\log d/12$ , where $\widehat{\theta}$ is the average degree estimator defined in Section 3.3. We can see the estimated rank is consistent with the number of super populations. We apply K-means clustering to the FADI estimator $\widetilde{\mathbf{V}}_{\widehat{K}}^{\text{F}}$ , and calculate the misclustering rate by treating the self-reported ancestry group as the ground truth. The misclustering rate of FADI is 0.135, with computation time of 3.7 seconds. In comparison, the misclustering rate for the traditional PCA method is 0.134 with computation time of 26.5 seconds, and the correlation between the top four PCs for the traditional PCA and FADI are 0.997, 0.994, 0.994 and 0.996 respectively.

To conduct pairwise inference on the ancestry membership profiles, we preselect 16 subjects, with 4 subjects from each super population. We apply Bonferroni correction to correct for the multiple comparison issue and set the level at $0.05\times\binom{16}{2}^{-1}=4.17\times 10^{-4}$ . We estimate the asymptotic covariance matrix by Corollary 4.7 and correct $\widetilde{\mathbf{M}}$ by setting entries larger than 1 to 1 and entries smaller than 0 to 0. The pairwise p-values are summarized in Figure 6. The computational time for computing the covariance matrix is 0.31 seconds. We can see that most of the comparison results are consistent with the true ancestry groups, while the inconsistency could be due to the mixed memberships of certain subjects and the unaccounted sub-population structures.

7 Discussion

In this paper, we develop a FAst DIstributed PCA algorithm FADI that can deal with high-dimensional PC calculations with low computational cost and high accuracy. The algorithm is applicable to multiple statistical models and is friendly for distributed computing. The main idea is to apply distributed-friendly random sketches so as to reduce the data dimension, and aggregate the results from multiple sketches to improve the statistical accuracy and accommodate federated data. We conduct theoretical analysis as well as simulation studies to demonstrate that FADI enjoys the same non-asymptotic error rate as the traditional full sample PCA while significantly reducing the computational time compared to existing methods. We also establish distributional guarantee for the FADI estimator and perform numerical experiments to validate the potential phase-transition phenomenon in distributional convergence.

Fast PCA algorithms using random sketches usually require the data to have certain “almost low-rank” structures, without which the approximation might not be accurate [20]. It is of future research interest to investigate whether the proposed FADI approach can be extended to non-low-rank settings. In Step 3 of FADI, we aggregate local estimators by taking a simple average over the projection matrices. It would be of future research interest to explore the performance of other weighted averages and investigate the best convex combination to reduce the statistical error.

References

1000 Genomes Project Consortium, [2015] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526(7571):68.
Abbe, [2018] Abbe, E. (2018). Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86.
Abbe et al., [2020] Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics, 48(3):1452–1474.
Achlioptas, [2003] Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687. Special issue on PODS 2001 (Santa Barbara, CA).
Anderson, [1963] Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34:122–148.
Baik et al., [2005] Baik, J., Arous, G. B., and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697.
Banna et al., [2016] Banna, M., Merlevède, F., and Youssef, P. (2016). Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications, 05(02):1650006.
Belbin et al., [2021] Belbin, G. M., Cullina, S., Wenric, S., Soper, E. R., Glicksberg, B. S., Torre, D., Moscati, A., Wojcik, G. L., Shemirani, R., Beckmann, N. D., Cohain, A., Sorokin, E. P., Park, D. S., Ambite, J.-L., Ellis, S., Auton, A., Bottinger, E. P., Cho, J. H., Loos, R. J. F., Abul-Husn, N. S., Zaitlen, N. A., Gignoux, C. R., and Kenny, E. E. (2021). Toward a fine-scale population health monitoring system. Cell, 184(8):2068–2083.e11. PMID: 33861964.
Bernstein, [1924] Bernstein, S. (1924). On a modification of chebyshev’s inequality and of the error formula of laplace. Annals of Science Institute SAV. Ukraine, Sect. Math, I.
Candès and Tao, [2010] Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
Chen et al., [2016] Chen, T., Chang, D. D., Huang, S., Chen, H., Lin, C., and Wang, W. (2016). Integrating multiple random sketches for singular value decomposition. arXiv preprint arXiv:1608.08285.
Chen et al., [2021] Chen, Y., Chi, Y., Fan, J., and Ma, C. (2021). Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566–806.
Chen et al., [2019] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–22937.
Dey et al., [2022] Dey, R., Zhou, W., Kiiskinen, T., Havulinna, A., Elliott, A., Karjalainen, J., Kurki, M., Qin, A., Lee, S., Palotie, A., et al. (2022). Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nature Communications, 13(1):1–13.
Dhruva et al., [2020] Dhruva, S. S., Ross, J. S., Akar, J. G., et al. (2020). Aggregating multiple real-world data sources using a patient-centered health-data-sharing platform. npj Digital Medicine, 3(1):60.
Fan et al., [2022] Fan, J., Fan, Y., Han, X., and Lv, J. (2022). Simple: Statistical inference on membership profiles in large networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(2):630–653.
Fan et al., [2013] Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B., 75(4):603–680.
Fan et al., [2019] Fan, J., Wang, D., Wang, K., and Zhu, Z. (2019). Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009–3031.
Franklin, [2012] Franklin, J. N. (2012). Matrix theory. Courier Corporation.
Halko et al., [2011] Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
Hoeffding, [1994] Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pages 409–426. Springer.
Jensen, [1906] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30(1):175–193.
Jiang et al., [2016] Jiang, B., Ma, S., Causey, J., Qiao, L., Hardin, M. P., Bitts, I., Johnson, D., Zhang, S., and Huang, X. (2016). Sparrec: An effective matrix completion framework of missing data imputation for gwas. Scientific Reports, 6(1):35534.
Jin, [2015] Jin, J. (2015). Fast community detection by score. Annals of Statistics, 43(1):57–89.
Johnstone, [2001] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327.
Jordan et al., [2019] Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.
Kannan et al., [2014] Kannan, R., Vempala, S., and Woodruff, D. (2014). Principal component analysis and higher correlations for distributed data. In Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 1040–1057, Barcelona, Spain. PMLR.
Kargupta et al., [2001] Kargupta, H., Huang, W., Sivakumar, K., and Johnson, E. (2001). Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 3(4):422–448.
Klarin et al., [2018] Klarin, D., Damrauer, S. M., Cho, K., Sun, Y. V., Teslovich, T. M., Honerlaw, J., Gagnon, D. R., DuVall, S. L., Li, J., Peloso, G. M., et al. (2018). Genetics of blood lipids among ~300,000 multi-ethnic participants of the million veteran program. Nature genetics, 50(11):1514–1523. PMCID: PMC6521726.
Li et al., [2020] Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2020). Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60.
Pasini, [2017] Pasini, G. (2017). Principal component analysis for stock portfolio management. International Journal of Pure and Applied Mathematics, 115:153–167.
Paul, [2007] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.
Price et al., [2006] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909.
Pulley et al., [2010] Pulley, J., Clayton, E., Bernard, G. R., Roden, D. M., and Masys, D. R. (2010). Principles of human subjects protections applied in an opt-out, de-identified biobank. Clinical and Translational Science, 3(1):42–48. PMCID: PMC3075971.
Reich et al., [2008] Reich, D., Price, A. L., and Patterson, N. (2008). Principal component analysis of genetic data. Nature Genetics, 40(5):491–492.
Serfling, [2009] Serfling, R. J. (2009). Approximation theorems of mathematical statistics. John Wiley & Sons.
Shen and Lu, [2020] Shen, S. and Lu, J. (2020). Combinatorial-probabilistic trade-off: Community properties test in the stochastic block models. arXiv preprint arXiv:2010.15063.
Stewart, [1977] Stewart, G. W. (1977). On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662.
Sudlow et al., [2015] Sudlow, C., Gallacher, J., Allen, N., et al. (2015). Uk biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779–e1001779.
Vershynin, [2012] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices, page 210–268. Cambridge University Press.
Wang and Fan, [2017] Wang, W. and Fan, J. (2017). Asymptotics of empirical eigen-structure for high dimensional spiked covariance. Annals of Statistics, 45(3):1342–1374.
Wedin, [1972] Wedin, P.-A. (1972). Perturbation bounds in connection with singular value decomposition. Nordisk Tidskr. Informationsbehandling (BIT), 12:99–111.
Yan et al., [2021] Yan, Y., Chen, Y., and Fan, J. (2021). Inference for heteroskedastic pca with missing data. arXiv preprint arXiv:2107.12365.
Yang et al., [2021] Yang, F., Liu, S., Dobriban, E., and Woodruff, D. P. (2021). How to reduce dimension with pca and random projections? IEEE Transactions on Information Theory, 67:8154–8189.
Yu et al., [2015] Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315–323.

Supplementary Materials to

“FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data”

This file contains the supplementary materials to the paper “FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data”. In Appendix A we provide numerical results for Example 3 and Example 4 along with some additional simulation results for Example 1 under the genetic setting. In Appendix B, we present the proofs for the main theorems, propositions and corollaries given in Section 4 of the main paper. In Appendix C we give the proofs of some technical lemmas useful for the proofs of the main theorems. In Appendix D, we present the modified version of Wedin’s theorem, which is used in several proofs. Appendix E provides the supplementary figures deferred from the main paper.

A Additional Simulation Results

In this section we present the simulation results for Example 3 and Example 4, and we provide some additional simulation results for Example 1 to evaluate the performance of FADI under the genetic settings.

A.1 Example 3: Gaussian Mixture Models

Under this setting, we take $K=3$ , fix the Gaussian vector dimension at $n=20000$ and set $\Delta_{0}^{2}=n^{2/3}$ . Then we generate the Gaussian means by $\bm{\theta}_{k}\overset{\text{i.i.d.}}{\sim}N\left(\mathbf{0},\frac{\Delta^{2}_{0}}{2n}\mathbf{I}_{n}\right)$ , $k\in[K]$ . We set the sample size at $d=500,1000,2000$ respectively and generate independent Gaussian samples $\{\bm{W}_{i}\}_{i=1}^{d}\in\mathbb{R}^{n}$ from a mixture of Gaussian with means $\bm{\theta}_{k},k\in[K]$ to study the performance of FADI under different settings. We assign each cluster $k\in[K]$ with $d/K$ Gaussian samples. We divide the data vertically along $n$ into $m=20$ splits, set $p=p^{\prime}=12$ and $q=7$ for the final powered fast sketching. We take the ratio $Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\}$ for each setting and compute the asymptotic covariance via Corollary 4.8 and Corollary 4.12 under different regimes of $Lp$ . We define $\widetilde{\mathbf{v}}=\widehat{\bm{\Sigma}}_{1}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{1}$ where $\widehat{\bm{\Sigma}}_{1}$ is the asymptotic covariance for the first row of $\widetilde{\mathbf{V}}^{\text{F}}$ and $\mathbf{H}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{V})$ is the alignment matrix, and calculate the empirical coverage probability by empirically evaluating $\mathbb{P}\big{(}\|\widetilde{\mathbf{v}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}$ , where $\chi_{3}^{2}(0.95)$ is the 0.95 quantile of the Chi-squared distribution with degrees of freedom equal to 3. We perform 300 Monte Carlo simulations and the results under different settings are shown in Figure 7. We can see that the error rate of FADI gets closer to that of the traditional PCA estimator as $Lp/d$ increases while FADI greatly outperforms the traditional PCA in terms of running time under different settings. Note that here $d$ is the sample size, and the decreasing of error rates with increasing $d$ and fixed $n$ (at the same $Lp/d$ ratio) is consistent with Corollary 4.2. Similar to Example 1 in Section 5.1, we can see from Figure 7(b) the running time is large due to the calculation of $\widehat{\bm{\Sigma}}_{1}$ at $Lp/d$ approaching 1 from the left, and we do not recommend inference at this regime. Validation of the inferential properties are shown in Figure 7(c) and Figure 7(d).

A.2 Example 4: Incomplete Matrix Inference

For the true matrix $\mathbf{M}$ , we consider $K=3$ , take $\mathbf{V}$ to be the $K$ left singular vectors of a pregenerated $d\times K$ i.i.d. Gaussian matrix, and take $\mathbf{\Lambda}=\operatorname{diag}(6,4,2)$ . We consider the distributed setting $m=10$ , and set the dimension at $d\in\{500,1000,2000\}$ respectively, and set $\theta=0.4$ and $\sigma=8/d$ for each setting. Then we generate the entry-wise noise by $\varepsilon_{ij}\overset{\text{i.i.d.}}{\sim}{\mathcal{N}}(0,\sigma^{2})$ for $i\leq j$ , and subsample non-zero entries of $\mathbf{M}$ with probability $\theta=0.4$ . Under each setting, we perform FADI at $p=p^{\prime}=12$ , $q=7$ and $Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\}$ for the computation of $\widetilde{\mathbf{V}}^{\text{F}}$ . Define $\widetilde{\mathbf{v}}=\widehat{\bm{\Sigma}}_{1}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{1}$ with $\widehat{\bm{\Sigma}}_{1}$ being the asymptotic covariance for $\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{e}_{1}$ defined in Corollary 4.9 and $\mathbf{H}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{V})$ , and empirically calculate the coverage probability, i.e., $\mathbb{P}\big{(}\|\widetilde{\mathbf{v}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}$ . Similar as in Section 5.2, for the regime $Lp<d$ , we refer to Theorem 4.10 and calculate $\widehat{\bm{\Sigma}}_{1}$ by

\widehat{\bm{\Sigma}}_{1}=L^{-2}\widehat{\mathbf{B}}_{\bm{\Omega}}^{\top}\bm{\Omega}^{\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{1j}^{2}(1-\widehat{\theta})/\widehat{\theta}+\widehat{\sigma}^{2}/\widehat{\theta}]_{j=1}^{d}\big{)}\bm{\Omega}\widehat{\mathbf{B}}_{\bm{\Omega}}.

Results over 300 Monte Carlo simulations are provided in Figure 8. Figure 8(a) illustrates that the error rate of FADI is almost the same as the traditional PCA as $Lp/d$ gets larger, and Figure 8(b) shows that the computational efficiency of FADI greatly outperforms the traditional PCA for large dimension $d$ . We can observe from Figure 8(c) that the confidence interval performs poorly at $Lp/d<1$ with the coverage probability equal to 1, which is consistent with the theoretical conditions in Corollary 4.9 for distributional convergence. Figure 8(d) shows the good Gaussian approximation of FADI at $Lp/d=10$ , and the results at $Lp/d=0.2$ is consistent with Figure 8(c).

A.3 Additional Results for Example 1 in the Genetic Setting

Section 5.1 compares FADI with several existing methods under a relatively large eigengap. In practice, the eigengap of the population covariance matrix may not be large. To assess different methods in a more realistic scenario, we imitate the setting of the 1000 Genomes Data, where we take the number of spikes $K=20$ , $\sigma^{2}=0.4$ and the eigengap to be $\Delta=0.2$ . We generate the data by $\{\bm{X}_{i}\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{\Sigma})$ , where

\mathbf{\Sigma}=\operatorname{diag}(2.4,1.2,\underbrace{0.6,\ldots,0.6}_{K-2},0.4\ldots,0.4).

The dimension is $d=2504$ and the sample size is $n=160,000$ . Error rates and running times using different algorithms are compared under different number of splits $m$ for the sample size $n$ . For FADI, we take $L=75$ , $p=p^{\prime}=40$ and $q=7$ .

Table 6 shows that the number of sample splits $m$ has little impact on the error rate of FADI as expected, while the error rate of Fan et al., [18]’s distributed PCA increases as $m$ increases. FADI is much faster than the other two methods in all the practical settings when the eigengap is small. This suggests that in practical problems where the sample size is large and the eigengap is small, FADI not only enjoys much higher computational efficiency compared to the existing methods, but also gives stable estimation for different sample splits along the sample size $n$ . Although the settings of small eigengap are of major interest in this section, we still conduct simulations where the eigengap increases gradually to see how it affects the performance of FADI. Table 7 shows that as the eigengap gets larger, the error rate of FADI gets closer to that of the traditional full sample PCA, whereas the error rate ratio of distributed PCA to FADI gets below 1, but are still above 0.9 when the eigengap is larger than 1. As to the running time, FADI outperforms the other two methods in all the settings. In summary, when the eigengap grows larger, the performance of the three algorithms becomes similar to what we see in Section 5.1.

	FADI	Traditional PCA	Distributed PCA	$m$
Error Rate	2.296	1.811 (0.79)	2.629 (1.15)	10
	2.294	1.811 (0.79)	3.412 (1.49)	20
	2.294	1.811 (0.79)	3.955 (1.72)	40
	2.294	1.811 (0.79)	4.215 (1.84)	80
Running Time	5.76	983.86 (170.8)	189.76 (32.9)	10
	3.82	992.09 (259.8)	144.18 (37.8)	20
	2.86	972.47 (339.5)	119.29 (41.6)	40
	2.37	968.43 (408.5)	99.39 (41.9)	80

Table 6: Comparison of the error rates and running times (in seconds) among FADI, full sample PCA and distributed PCA [18], using different numbers of sample splits

m

in the genetic setting. Values in the parentheses represent the error rate ratios or the computational time ratios of each method with respect to FADI.

	FADI	Traditional PCA	Distributed PCA	Eigengap
Error Rate	1.28	1.06 (0.82)	1.57 (1.22)	0.4
	0.77	0.65 (0.85)	0.71 (0.92)	0.8
	0.48	0.42 (0.88)	0.43 (0.90)	1.6
	0.31	0.29 (0.92)	0.29 (0.93)	3.2
Running Time	2.76	925.15 (334.7)	115.29 (41.7)	0.4
	2.77	916.52 (331.4)	114.76 (41.5)	0.8
	2.69	922.85 (342.7)	114.75 (42.6)	1.6
	2.77	919.20 (332.2)	115.26 (41.7)	3.2

Table 7: Comparison of the error rates and running times (in seconds) among FADI, full sample PCA and distributed PCA [18] for different eigengaps

\Delta

in the genetic setting. The number of sample splits

m

is 40 for FADI and distributed PCA. The settings of the other parameters are the same as those in Table 6.

B Proof of Main Theoretical Results

In this section we provide proofs of the theoretical results in Section 4. For the inferential results, we will present proofs of the theorems under the regime $Lp\ll d$ first, which takes into consideration the extra variability caused by the fast sketching, and then give proofs of the theorems under the regime $Lp\gg d$ where the fast sketching randomness is negligible.

B.1 Unbiasedness of Fast Sketching With Respect to $\widehat{\mathbf{M}}$

We show by the following Lemma B.1 that the fast sketching is unbiased with respect to $\widehat{\mathbf{M}}$ under proper conditions.

Lemma B.1.

Let $\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}$ be the eigen-decomposition of $\widehat{\mathbf{M}}$ , and let $\widehat{\mathbf{V}}=(\widehat{\mathbf{v}}_{1},\ldots,\widehat{\mathbf{v}}_{K})$ be the stacked $K$ leading eigenvectors of $\widehat{\mathbf{M}}$ corresponding to the eigenvalues with largest magnitudes. When $\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}<1/2$ , we have that $\operatorname{Col}(\mathbf{V}^{\prime})=\operatorname{Col}(\widehat{\mathbf{V}})$ , where $\operatorname{Col}(\cdot)$ denotes the column space of the matrix.

Proof.

We will first show that $\widehat{\mathbf{V}}_{d}^{\top}{\mathbf{\Sigma}}^{\prime}\widehat{\mathbf{V}}_{d}$ is diagonal. For any $j\in[d]$ , we let $\mathbf{D}_{j}=\mathbf{I}_{d}-2\mathbf{e}_{j}\mathbf{e}_{j}^{\top}$ , and recall we denote the eigen-decomposition of $\widehat{\mathbf{M}}$ by $\widehat{\mathbf{M}}=\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}$ . Then conditional on $\widehat{\mathbf{M}}$ we have

	$\displaystyle\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{Y}}^{(\ell)}\widehat{\mathbf{Y}}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}=\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}\mathbf{\Omega}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}$
	$\displaystyle\quad=\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}(\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)})(\mathbf{\Omega}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j})\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\overset{\rm d}{=}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}\mathbf{\Omega}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}=\widehat{\mathbf{Y}}^{(\ell)}\widehat{\mathbf{Y}}^{(\ell)\top},$

where the second equality is due to the fact that diagonal matrices are commutative, and the last but one equivalence in distribution is due to the fact that $\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}\overset{\rm d}{=}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}$ . Also we know the top $K$ eigenvectors of $\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{Y}}^{(\ell)}\widehat{\mathbf{Y}}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}$ are $\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}^{(\ell)}$ , and thus $\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}^{(\ell)}\overset{\rm d}{=}\widehat{\mathbf{V}}^{(\ell)}$ . Hence we have

	$\displaystyle\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}=\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}$
	$\displaystyle\quad=\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}=\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\bm{\Sigma}^{\prime}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}.$

The above equation holds for any $j\in[d]$ , which suggests that $\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}$ is diagonal and that $\mathbf{\Sigma}^{\prime}$ and $\widehat{\mathbf{M}}$ share the same set of eigenvectors.

Now under the condition that $\left\|\mathbf{\Sigma}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\right\|_{2}<1/2$ , for any $j\in[K],$ we denote by $\widehat{\mathbf{v}}_{j}$ the $j$ -th column of $\widehat{\mathbf{V}}$ , and we have

\left\|\mathbf{\Sigma}^{\prime}\widehat{\mathbf{v}}_{j}\right\|_{2}=\left\|\left(\mathbf{\Sigma}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}+\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\right)\widehat{\mathbf{v}}_{j}\right\|_{2}\geq 1-\left\|\mathbf{\Sigma}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\right\|_{2}>1-\frac{1}{2}=\frac{1}{2}.

In other words, the corresponding eigenvalue of $\widehat{\mathbf{v}}_{j}$ in $\mathbf{\Sigma}^{\prime}$ is larger than $1/2$ . On the other hand, by Weyl’s inequality [19], the rest of the $d-K$ eigenvalues of ${\mathbf{\Sigma}}^{\prime}$ should be less than 1/2. Therefore, $\widehat{\mathbf{V}}$ are still the leading $K$ eigenvectors for ${\mathbf{\Sigma}}^{\prime}$ , and thus $\operatorname{Col}(\mathbf{V}^{\prime})=\operatorname{Col}(\widehat{\mathbf{V}})$ . ∎

Recall in Section 4 we discuss that the bias term has the following decomposition $\rho(\mathbf{V}^{\prime},{\mathbf{V}})\leq\rho(\widehat{\mathbf{V}},{\mathbf{V}})+\rho(\mathbf{V}^{\prime},\widehat{\mathbf{V}})$ . Lemma B.1 shows that as long as $\mathbf{\Sigma}^{\prime}$ and $\mathbf{V}\mathbf{V}^{\top}$ are not too far apart, $\mathbf{V}^{\prime}$ and $\widehat{\mathbf{V}}$ will share the same column space. In fact, Lemma B.4 in Section B.2 will show that the probability that ${\mathbf{\Sigma}}^{\prime}$ and $\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}$ are not sufficiently close converges to 0, and $\rho(\mathbf{V}^{\prime},{\mathbf{V}})=\rho(\widehat{\mathbf{V}},{\mathbf{V}})$ with high probability. With the help of Lemma B.1, we present the proof of the main error bound results in the following section.

B.2 Proof of Theorem 4.1

Recall the problem setting in Section 2. It is not hard to see that we can write $\mathbf{\Lambda}=\mathbf{P}_{0}\mathbf{\Lambda}^{0}$ , where $\mathbf{\Lambda}^{0}=\operatorname{diag}(|\lambda_{1}|,\ldots,|\lambda_{K}|)$ and $\mathbf{P}_{0}=\operatorname{diag}\big{(}[\operatorname{sgn}(\lambda_{k})]_{k=1}^{K}\big{)}$ . Then $\mathbf{M}=(\mathbf{V}\mathbf{P}_{0})\mathbf{\Lambda}^{0}\mathbf{V}^{\top}$ is the SVD of $\mathbf{M}$ .

We begin with bounding $\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|^{2}_{\rm F}\right)^{1/2}$ . Before delving into the detailed proof, the following two lemmas provide some important properties of the random Gaussian matrix.

Lemma B.2.

Let $\mathbf{\Omega}\in\mathbb{R}^{d\times p}$ be a random matrix with i.i.d. standard Gaussian entries, where $p\leq d$ . For a random variable, recall that we define the $\psi_{1}$ norm to be $\|\cdot\|_{\psi_{1}}=\sup_{p\geq 1}({\mathbb{E}}|\cdot|^{p})^{1/p}/p$ . Then we have the following bound on the $\psi_{1}$ norm of the matrix $\mathbf{\Omega}/\sqrt{p}$ :

\|\|\mathbf{\Omega}/\sqrt{p}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d/p}.

(B.23)

Lemma B.3.

Let $\bm{\Omega}\in\mathbb{R}^{K\times p}$ denote a random matrix with i.i.d. Gaussian entries, where $p\geq 2K$ . For any integer $a$ such that $1\leq a\leq(p-K+1)/2$ , there exists a constant $C>0$ such that

{\mathbb{E}}\left(\left(\sigma_{\min}(\bm{\Omega}/\sqrt{p})\right)^{-a}\right)\leq C^{a}.

(B.24)

The following lemma shows that $\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}$ and $\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}$ are bounded by a small constant with high probability.

Lemma B.4.

If Assumption 1 holds and $p\geq\max(2K,K+3)$ , there exists a constant $c_{0}>0$ such that for any $\varepsilon>0$ , we have

\max\left\{\mathbb{P}\Big{(}\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon\Big{)},\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq\varepsilon\right)\right\}\lesssim\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{r_{1}(d)}\right).

The proof of Lemma B.2, Lemma B.3 and Lemma B.4 are deferred to Appendix C. Now we can start with the proof. We first decompose the bias term into two parts,

\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}},\mathbf{V})|^{2}\right)^{1/2}\leq\underbrace{\Big{(}{\mathbb{E}}|\rho(\widetilde{\mathbf{V}},\mathbf{V}^{\prime})|^{2}\Big{)}^{1/2}}_{\text{I}}\!\!\!+\underbrace{\Big{(}{\mathbb{E}}|\rho(\mathbf{V}^{\prime},\mathbf{V})|^{2}\Big{)}^{1/2}}_{\text{II}}.

(B.25)

Term I can be regarded as the variance term, whereas term II is the bias term. We will consider the bias term first.

B.2.1 Control of the Bias Term

We can see that term II can be further decomposed into two terms

\left({\mathbb{E}}|\rho(\mathbf{V}^{\prime},\mathbf{V})|^{2}\right)^{1/2}\leq\left({\mathbb{E}}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}^{2}\right)^{1/2}+\left({\mathbb{E}}\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\rm F}^{2}\right)^{1/2}.

(B.26)

We can bound both terms separately. First note that $\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}\leq\sqrt{2K}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\leq\sqrt{2K}$ . Thus we have,

	$\displaystyle\left({\mathbb{E}}\\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{\rm F}^{2}\right)^{1/2}\leq\left({\mathbb{E}}\\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{\rm F}^{2}\mathbb{I}\big{\{}\\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}\geq 1/2\big{\}}\right)^{1/2}$
	$\displaystyle\quad\quad+\left({\mathbb{E}}\\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{\rm F}^{2}\mathbb{I}{\big{\{}\\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}<1/2\big{\}}}\right)^{1/2}$
	$\displaystyle\lesssim 0+\sqrt{K}\left(\mathbb{P}\left(\\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}\geq 1/2\right)\right)^{1/2}\lesssim\sqrt{K}\exp\left(-\frac{c_{0}}{4}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right),$

where the last but one inequality follows from Lemma B.1, and the last inequality is a result of Lemma B.4. As for the second term on the RHS of (B.26), by Davis-Kahan’s Theorem [45], we have

	$\displaystyle\left({\mathbb{E}}\\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\\|_{\rm F}^{2}\right)^{1/2}$	$\displaystyle\lesssim\frac{\sqrt{K}}{\Delta}\left({\mathbb{E}}\\|\widehat{\mathbf{M}}-\mathbf{M}\\|_{2}^{2}\right)^{1/2}=\frac{\sqrt{K}}{\Delta}\left({\mathbb{E}}\\|\mathbf{E}\\|_{2}^{2}\right)^{1/2}$
		$\displaystyle\leq\frac{\sqrt{K}}{\Delta}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\lesssim\frac{\sqrt{K}}{\Delta}r_{1}(d).$

Therefore, the bound for the bias term is

\text{II}\lesssim\sqrt{K}\exp\left(-\frac{c_{0}}{4}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)+\frac{\sqrt{K}}{\Delta}r_{1}(d).

B.2.2 Control of the Variance Term

Now we move on to control the variance term. Suppose that $\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}<1/4$ . Then by Weyl’s inequality [19] we have that $\sigma_{K}(\mathbf{\Sigma}^{\prime})>1-1/4=3/4$ and $\sigma_{K+1}(\mathbf{\Sigma}^{\prime})<1/4$ . Thus by Davis-Kahan theorem [45]

	$\displaystyle\left({\mathbb{E}}\left(\\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\\|_{\rm F}^{2}\mathbb{I}\left\{\left\\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{2}<1/4\right\}\right)\right)^{1/2}\!\!\!$
	$\displaystyle\lesssim\left({\mathbb{E}}\left(\frac{\\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\\|_{\rm F}^{2}}{\left(\sigma_{K}({\mathbf{\Sigma}}^{\prime})-\sigma_{K+1}({\mathbf{\Sigma}}^{\prime})\right)^{2}}\mathbb{I}\left\{\left\\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{2}<1/4\right\}\right)\right)^{1/2}$
	$\displaystyle\quad\lesssim\left({\mathbb{E}}\left(\\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\\|_{\rm F}^{2}\mathbb{I}{\left\{\left\\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{2}<1/4\right\}}\right)\right)^{1/2}\leq\underbrace{\left({\mathbb{E}}\\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\\|_{\rm F}^{2}\right)^{1/2}}_{\text{III}}.$

We will bound term III later. Also similar as previously, note that $\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\|_{\rm F}\leq\sqrt{2K}$ . Thus by Lemma B.4,

	$\displaystyle\left({\mathbb{E}}\left(\\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\\|_{\rm F}^{2}\mathbb{I}{\left\{\left\\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{2}\geq\frac{1}{4}\right\}}\right)\right)^{{1}/{2}}\!\!\!\lesssim\sqrt{K}\left(\mathbb{P}\left(\left\\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{2}\geq\frac{1}{4}\right)\right)^{{1}/{2}}$
	$\displaystyle\quad\leq\sqrt{K}\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right).$

Therefore, we have

\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\|_{\rm F}^{2}\right)^{1/2}\lesssim\sqrt{K}\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)+\underbrace{\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\right)^{1/2}}_{\text{III}}.

Now we move on to bound term III.

	$\displaystyle\left({\mathbb{E}}\\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\\|_{\rm F}^{2}\right)^{1/2}$	$\displaystyle=\left({\mathbb{E}}\left\\|\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-{\mathbb{E}}\left(\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}\|\widehat{\mathbf{M}}\right)\right\\|_{\rm F}^{2}\right)^{1/2}$
		$\displaystyle=\bigg{(}{\mathbb{E}}\bigg{(}{\mathbb{E}}\bigg{(}\bigg{\\|}\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-{\mathbb{E}}\left(\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}\|\widehat{\mathbf{M}}\right)\bigg{\\|}_{\rm F}^{2}\bigg{\|}\widehat{\mathbf{M}}\bigg{)}\bigg{)}\bigg{)}^{1/2}$
		$\displaystyle=\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-{\mathbb{E}}\left(\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}\|\widehat{\mathbf{M}}\right)\right\\|_{\rm F}^{2}\right)^{1/2}$
		$\displaystyle\leq\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{\rm F}^{2}\right)^{1/2}+\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\\|\mathbf{V}\mathbf{V}^{\top}-{\mathbf{\Sigma}}^{\prime}\right\\|_{\rm F}^{2}\right)^{1/2}.$

where the last but one equality is due to the independence of estimators from different sketches conditional on $\widehat{\mathbf{M}}$ . By Jensen’s inequality [22], we have

\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\mathbf{V}\mathbf{V}^{\top}-{\mathbf{\Sigma}}^{\prime}\right\|_{\rm F}^{2}\right)^{1/2}\leq\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\|_{\rm F}^{2}\right)^{1/2}.

Thus we have

\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\right)^{1/2}\lesssim\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\|_{\rm F}^{2}\right)^{1/2},

(B.27)

Before bounding the RHS, let’s consider the matrix ${\mathbf{Y}}^{(\ell)}:=\mathbf{V}\mathbf{P}_{0}\mathbf{\Lambda}^{0}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}$ . If $\widetilde{\mathbf{\Omega}}^{(\ell)}:=\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}\in\mathbb{R}^{K\times p}$ does not have full row rank, then the entries will be restricted to a linear space with dimension less than $K\times p$ . Since $\widetilde{\mathbf{\Omega}}^{(\ell)}$ is a $K\times p$ standard Gaussian matrix, the probability that $\widetilde{\mathbf{\Omega}}^{(\ell)}$ has full row rank is 1. And thus with probability 1, the matrix ${\mathbf{Y}}^{(\ell)}$ is of rank $K$ , and $\mathbf{V}$ and the top $K$ left singular vectors of ${\mathbf{Y}}^{(\ell)}/\sqrt{p}$ span the same column space. In other words, if we let $\bm{\Gamma}_{K}^{(\ell)}$ be the left singular vectors of ${\mathbf{Y}}^{(\ell)}/\sqrt{p}$ , then $\bm{\Gamma}_{K}^{(\ell)}\bm{\Gamma}_{K}^{(\ell)\top}=\mathbf{V}\mathbf{V}^{\top}$ .

Now consider the $K$ -th singular value of ${\mathbf{Y}}^{(\ell)}/\sqrt{p}$ , we let $\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\mathbf{V}_{\widetilde{\mathbf{\Omega}}}^{\top}$ be the SVD of $\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}$ , and we have

	$\displaystyle\sigma_{K}\left({\mathbf{Y}}^{(\ell)}/\sqrt{p}\right)$	$\displaystyle=\sigma_{K}\left(\mathbf{V}\mathbf{P}_{0}\mathbf{\Lambda}^{0}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)=\sigma_{K}\left(\mathbf{\Lambda}^{0}\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\right)$
		$\displaystyle=\min_{\\|\bm{x}\\|_{2}=1}\\|\mathbf{\Lambda}^{0}\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}\\|_{2}\overset{(i)}{\geq}\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\min_{\\|\bm{v}_{1}\\|_{2}=1}\left\\|\mathbf{\Lambda}^{0}\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\bm{v}_{1}\right\\|_{2}$
		$\displaystyle\overset{(ii)}{\geq}\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\min_{\\|\bm{v}_{2}\\|_{2}=1}\left\\|\mathbf{\Lambda}^{0}\bm{v}_{2}\right\\|_{2}\geq\Delta\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right),$

where $\bm{v}_{1}=\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}/\|\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}\|_{2}$ , and $\bm{v}_{2}=\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\bm{v}_{1}$ . Inequality (i) follows because

\|\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}\|_{2}\geq\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\|\bm{x}\|_{2}=\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right),

and inequality (ii) is because $\|\bm{v}_{2}\|_{2}=\|\bm{v}_{1}\|_{2}=1$ .

Now by Wedin’s Theorem [42] we have the following bound on the RHS of (B.27),

	$\displaystyle\frac{1}{\sqrt{L}}\left({\mathbb{E}}\big{\|}\rho(\widehat{\mathbf{V}}^{(\ell)},\mathbf{V})\big{\|}^{2}\right)^{1/2}\!\!\!\lesssim\frac{\sqrt{K}}{\sqrt{L}}\left({\mathbb{E}}\left\\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\\|_{2}^{2}/\left(\Delta\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\right)^{2}\right)^{1/2}$
	$\displaystyle\quad\leq\frac{\sqrt{K}}{\Delta\sqrt{L}}\left({\mathbb{E}}\left\\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\\|_{2}^{4}\right)^{1/4}\left({\mathbb{E}}\left(\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\right)^{-4}\right)^{1/4}$
	$\displaystyle\quad\lesssim\frac{\sqrt{K}}{\Delta\sqrt{L}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\cdot\\|\\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\|_{2}\\|_{\psi_{1}}\lesssim\sqrt{\frac{Kd}{\Delta^{2}pL}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\lesssim\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d),$

where the last but one inequality is due to Lemma B.3. Therefore, we have the final error rate for the estimator $\widetilde{\mathbf{V}}$ :

\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}\|_{\rm F}^{2}\right)^{1/2}\lesssim\underbrace{\sqrt{K}\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)+\frac{\sqrt{K}}{\Delta}r_{1}(d)}_{\text{bias}}+\underbrace{\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}_{\text{variance}}.

Now consider the function $g(x):=\exp(a_{0}\sqrt{\frac{p}{d}}x)/(\sqrt{d}x^{2})$ , where $a_{0}>0$ is a fixed constant. We have

\frac{d\log g(x)}{dx}=a_{0}\sqrt{\frac{p}{d}}-\frac{2}{x}>0,\quad\text{for }x\geq\frac{2}{a_{0}}\sqrt{\frac{d}{p}}.

Thus $g(x)$ is increasing on $x\geq 2\sqrt{d/p}/a_{0}$ , and if we take $x\geq C\sqrt{\frac{d}{p}}\log d$ for some large enough constant $C>0$ , we have that $g(x)\geq 1$ . Then by plugging in $x=\Delta/r_{1}(d)$ and taking $a_{0}=c_{0}/8$ , under the condition that $(\log d)^{-1}\sqrt{p/d}\Delta/r_{1}(d)\geq C$ for some large enough constant $C>0$ , we have that

\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)\lesssim\frac{1}{\sqrt{d}}\left(\frac{r_{1}(d)}{\Delta}\right)^{2}=o\left(\frac{r_{1}(d)}{\Delta}\right),

and the error rate simplifies to

\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}\|_{\rm F}^{2}\right)^{1/2}\lesssim\underbrace{\frac{\sqrt{K}}{\Delta}r_{1}(d)}_{\text{bias}}+\underbrace{\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}_{\text{variance}}.

Now we move on to bound $\left({\mathbb{E}}\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\mathbf{V}\mathbf{V}^{\top}\|^{2}_{\rm F}\right)^{1/2}$ . Since $\|\cdot\|_{2}^{2q}$ is convex, by Jensen’s inequality [22], under the condition that $p\geq\max(2K,8q+K-1)$ we have that there exists some constant $\eta$ such that

	$\displaystyle{\mathbb{E}}\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}$	$\displaystyle\leq\frac{1}{L}\sum_{\ell=1}^{L}{\mathbb{E}}\\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}={\mathbb{E}}\\|\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}$
		$\displaystyle\leq{\mathbb{E}}\bigg{(}\left\\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\\|_{2}^{2q}\Big{/}\left(\Delta\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\right)^{2q}\bigg{)}$
		$\displaystyle\leq\frac{1}{\Delta^{2q}}\left({\mathbb{E}}\left\\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\\|_{2}^{4q}\right)^{1/2}\left({\mathbb{E}}\left(\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\right)^{-4q}\right)^{1/2}$
		$\displaystyle\lesssim\left(\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\right)^{2q}.$

Thus by Markov’s inequality, we also have

	$\displaystyle\mathbb{P}\left(\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\geq\frac{1}{2}\right)$	$\displaystyle=\mathbb{P}\left(\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}\geq\frac{1}{2^{2q}}\right)\leq 2^{2q}{\mathbb{E}}\big{(}\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}\big{)}$
		$\displaystyle\lesssim\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\right)^{2q}.$

Since $\widetilde{\bm{\Sigma}}$ is the summation of positive semi-definite matrices by construction, $\widetilde{\bm{\Sigma}}$ is also positive semi-definite. By Weyl’s inequality [19], we know that $\sigma_{K}(\widetilde{\mathbf{\Sigma}})\geq 1-\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}$ and $\sigma_{K+1}(\widetilde{\mathbf{\Sigma}})\leq\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}$ .

Now if we denote the SVD of $\widetilde{\mathbf{\Sigma}}^{q}$ by $\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}+\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}$ , then with probability 1, $\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\bm{\Omega}^{\text{F}}$ and $\widetilde{\mathbf{V}}$ share the same column space. By the relationship $\sigma_{k}(\widetilde{\mathbf{\Sigma}}^{q})=\sigma_{k}^{q}(\widetilde{\mathbf{\Sigma}})$ for $k\in[d]$ and Davis-Kahan’s Theorem [45], we have

	$\displaystyle{\mathbb{E}}\left(\\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|^{2}_{\rm F}\,\|\widetilde{\mathbf{\Sigma}}\right)$	$\displaystyle\lesssim{\mathbb{E}}\left(K\\|\widetilde{\mathbf{\Sigma}}^{q}\mathbf{\Omega}^{\rm F}-\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\rm F}\\|_{2}^{2}/\sigma^{2}_{\min}(\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\rm F})\,\|\widetilde{\mathbf{\Sigma}}\right)$
		$\displaystyle\lesssim\left(\frac{\sqrt{K}}{\sigma_{K}^{q}(\widetilde{\mathbf{\Sigma}})}\\|\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}\\|_{2}\cdot\\|\\|\mathbf{\Omega}^{\rm F}/\sqrt{p^{\prime}}\\|_{2}\\|_{\psi_{1}}\right)^{2}$
		$\displaystyle\lesssim\frac{Kd}{p^{\prime}}\frac{\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}}{\left(1-\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\right)^{2q}}.$

Therefore we have,

	$\displaystyle\left({\mathbb{E}}\\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|^{2}_{\rm F}\right)^{1/2}\lesssim\left({\mathbb{E}}\\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|^{2}_{\rm F}\mathbb{I}\big{\{}\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\leq{1}/{2}\big{\}}\right)^{1/2}$
	$\displaystyle\quad+\left({\mathbb{E}}\\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|^{2}_{\rm F}\mathbb{I}\big{\{}\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}>{1}/{2}\big{\}}\right)^{1/2}$
	$\displaystyle\lesssim 2^{q}\sqrt{\frac{Kd}{p^{\prime}}}\left({\mathbb{E}}\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{2q}\right)^{1/2}+\sqrt{K}\left\{\mathbb{P}\left(\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\geq\frac{1}{2}\right)\right\}^{1/2}$
	$\displaystyle\lesssim\sqrt{\frac{Kd}{p^{\prime}}}\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\right)^{q}+\sqrt{K}\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\right)^{q}$
	$\displaystyle\lesssim\sqrt{\frac{Kd}{p^{\prime}}}\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\\|\\|\mathbf{E}\\|_{2}\\|_{\psi_{1}}\right)^{q},$

where the last but one inequality is by Markov’s inequality, i.e.,

\mathbb{P}\left(\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\frac{1}{2}\right)\leq 2^{2q}{\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}\lesssim\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{2q}.

Thus by previous results and triangle inequality we have

	$\displaystyle\left({\mathbb{E}}\big{\|}\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})\big{\|}^{2}\right)^{1/2}\lesssim\left({\mathbb{E}}\\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|^{2}_{\rm F}\right)^{1/2}+\left({\mathbb{E}}\\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\\|^{2}_{\rm F}\right)^{1/2}$
	$\displaystyle\quad\lesssim{\frac{\sqrt{K}}{\Delta}r_{1}(d)}\!+\!{\!\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}\!+\!\sqrt{\frac{Kd}{p^{\prime}}}\left(\!\!2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}r_{1}(d)\!\!\!\right)^{q}.$

B.3 Proof of Corollary 4.2

The case-specific error rates can be calculated by computing $r_{1}(d)$ and studying the proper value of $q$ for each example.

$\bullet$ Example 1: we know that $\mathbf{E}=\widehat{\bm{\Sigma}}-\bm{\Sigma}+(\sigma^{2}-\widehat{\sigma}^{2})\mathbf{I}$ . Now consider the $K^{\prime}\times K^{\prime}$ submatrix of $\bm{\Sigma}$ corresponding to the the index set $S$ , which we denote by $\mathbf{\Sigma}_{S}=\bm{\Sigma}_{[S,S]}$ . We have $\mathbf{\Sigma}_{S}=\sigma^{2}\mathbf{I}_{K^{\prime}}+(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}$ , where $(\mathbf{V})_{[S,:]}$ is the submatrix of $\mathbf{V}$ composed of the rows in $S$ . Then since $(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}\succeq\mathbf{0}$ and $\operatorname{rank}\big{(}(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}\big{)}\leq K$ , we know that $\sigma_{\min}(\mathbf{\Sigma}_{S})=\sigma^{2}$ . By Weyl’s inequality [19], we know $|\sigma^{2}-\widehat{\sigma}^{2}|\leq\|\widehat{\mathbf{\Sigma}}_{S}-\mathbf{\Sigma}_{S}\|_{2}\leq\|\widehat{\mathbf{\Sigma}}-\mathbf{\Sigma}\|_{2}$ . Thus we have $\|\mathbf{E}\|_{2}\leq\|\widehat{\mathbf{\Sigma}}-\mathbf{\Sigma}\|_{2}+|\sigma^{2}-\widehat{\sigma}^{2}|\leq 2\|\widehat{\mathbf{\Sigma}}-\mathbf{\Sigma}\|_{2}$ . Then by Lemma 3 in Fan et al., [18], we have that there exists some constant $c\geq 1$ such that for any $t\geq 0$ , we have

\displaystyle\mathbb{P}(\|\mathbf{E}\|_{2}\geq t)\leq\mathbb{P}(2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}\geq t)\leq\exp(-\frac{t}{2c(\lambda_{1}+\sigma^{2})\sqrt{r/n}}),

where $r=\operatorname{tr}(\bm{\Sigma})/\|\bm{\Sigma}\|_{2}$ is the effective rank of $\bm{\Sigma}$ . Thus we can see that $\|\mathbf{E}\|_{2}$ is sub-exponential with

\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\|\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}\|_{\psi_{1}}\lesssim(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}},

and hence we can take $r_{1}(d)=(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}$ . When $n\geq C(dr/{p})\kappa_{1}^{2}(\log d)^{4}$ , by Theorem 4.1 we have

\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})|^{2}\right)^{1/2}\lesssim\kappa_{1}\sqrt{\frac{Kr}{n}}+\kappa_{1}\sqrt{\frac{Kdr}{npL}}+\sqrt{\frac{Kd}{p^{\prime}}}\left(\eta q^{2}\kappa_{1}\sqrt{\frac{dr}{np}}\right)^{q},

where the third term will be dominated by the first bias term when taking $q=\log d$ , and hence (3) holds.

$\bullet$ Example 2: Under the problem settings we know that $\mathbf{E}=\widehat{\mathbf{M}}-\mathbf{M}=\mathbf{X}-{\mathbb{E}}\mathbf{X}$ . For the eigenvalues of $\mathbf{M}$ , under the given conditions we know that

\sigma_{K}(\mathbf{M})\gtrsim\theta\sigma_{K}(\mathbf{P})\sigma_{K}^{2}(\mathbf{\Pi})\gtrsim d\theta/K,\quad\sigma_{1}(\mathbf{M})\lesssim\theta\sigma_{1}(\mathbf{P})\sigma_{1}^{2}(\mathbf{\Pi})\lesssim Kd\theta\|\mathbf{\Pi}\|_{2,\infty}^{2}\leq Kd\theta,

where the last inequality is because for $i\in[d]$ , we have that

\|\bm{\pi}_{i}\|_{2}=\big{(}\sum_{k=1}^{K}\bm{\pi}_{i}(k)^{2}\big{)}^{1/2}\leq\big{(}\sum_{k=1}^{K}\bm{\pi}_{i}(k)\big{)}^{1/2}=1\quad\text{and}\quad\|\mathbf{\Pi}\|_{2,\infty}\leq 1.

Thus we know that $\Delta\gtrsim d\theta/K$ .

We then bound the entries of $\mathbf{M}$ . We know $\mathbf{M}_{ij}=\theta_{i}\theta_{j}\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})\mathbf{P}_{kk^{\prime}}$ , and thus we have that

	$\displaystyle\mathbf{M}_{ij}\geq\theta_{i}\theta_{j}\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})\min_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})$
	$\displaystyle=\theta_{i}\theta_{j}\min_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})=\theta_{i}\theta_{j}\min_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}});$
	$\displaystyle\mathbf{M}_{ij}\leq\theta_{i}\theta_{j}\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})\max_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})$
	$\displaystyle=\theta_{i}\theta_{j}\max_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})=\theta_{i}\theta_{j}\max_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}}).$

Thus we can see that $\mathbf{M}_{ij}\asymp\theta$ , $\max_{ij}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim\theta$ and $\max_{i}\sum_{j}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim d\theta$ . By Theorem 3.1.4 in [12], we know that there exists some constant $c>0$ such that for any $t>0$ ,

\mathbb{P}\{\|\mathbf{E}\|_{2}\geq 4\sqrt{d\theta}+t\}\leq d\exp\left(-t^{2}/c\right).

Also, since for $t\geq 5\sqrt{d\theta}$ , there exists a constant $c>0$ such that $\mathbb{P}(\|\mathbf{E}\|_{2}\geq t)\leq\exp(-t^{2}/c)$ , we have that $\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d\theta}$ , and hence we can take $r_{1}(d)=\sqrt{d\theta}$ . Besides, $\sqrt{p/d}\Delta/r_{1}(d)=\sqrt{p\theta}/K\gtrsim d^{\epsilon/2}$ , and hence by Theorem 4.1 we have

\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})|^{2}\right)^{1/2}\lesssim K\sqrt{\frac{K}{d\theta}}+K\sqrt{\frac{K}{pL\theta}}+\sqrt{\frac{Kd}{p^{\prime}}}\left(\eta q^{2}\frac{K}{\sqrt{p\theta}}\right)^{q}.

When

q=\log d\gg 1+2\epsilon^{-1}>\frac{\log\left(\sqrt{d/p^{\prime}}\sqrt{d\theta}/K\right)}{\log\left(\sqrt{p/d}\sqrt{d\theta}/K\right)},

the third term is negligible and (4) holds.

Remark 14.

It’s worth noting that here in Example 2 $\|\mathbf{E}\|_{2}$ converges faster than sub-Exponential random variables and $\|\mathbf{E}\|_{2}\lesssim\sqrt{d\theta}$ with probability at least $1-d^{-10}$ , which we will take into account in later proofs.

Remark 15.

Under the case where no self-loops are present, $\mathbf{E}$ is replaced by $\mathbf{E}^{\prime}=\mathbf{E}-\operatorname{diag}(\mathbf{X})=\mathbf{E}-\operatorname{diag}(\mathbf{E})-\operatorname{diag}(\mathbf{M})$ . With similar arguments we can show that

\|\|\mathbf{E}^{\prime}\|_{2}\|_{\psi_{1}}\lesssim\|\|\mathbf{E}-\operatorname{diag}(\mathbf{E})\|_{2}\|_{\psi_{1}}+\|\operatorname{diag}(\mathbf{M})\|_{2}\lesssim\sqrt{d\theta}+\theta\lesssim\sqrt{d\theta},

\text{and}\quad\|\mathbf{E}^{\prime}\|_{2}\lesssim\|\mathbf{E}-\operatorname{diag}(\mathbf{E})\|_{2}+\|\operatorname{diag}(\mathbf{M})\|_{2}\lesssim\sqrt{d\theta}+\theta\lesssim\sqrt{d\theta},

with probability at least $1-d^{-10}$ , and hence (4) also holds for the no-self-loops case.

$\bullet$ Example 3: From the problem setting we know that we can represent $\bm{W}_{j}$ as $\bm{W}_{j}=\sum_{k=1}^{K}\mathbb{I}\{k_{j}=k\}\bm{\theta}_{k}+\bm{Z}_{j}$ , where $\bm{Z}_{j}\overset{\text{i.i.d}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{n})$ , $j\in[d]$ . Denote $\mathbf{Z}=(\bm{Z}_{1},\ldots,\bm{Z}_{d})$ , then it can be seen that ${\mathbb{E}}(\mathbf{X}^{\top}\mathbf{X})={\mathbb{E}}(\mathbf{X})^{\top}{\mathbb{E}}(\mathbf{X})+{\mathbb{E}}(\mathbf{Z}^{\top}\mathbf{Z})=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d}$ , and we can write

\mathbf{E}=\mathbf{X}^{\top}\mathbf{X}-{\mathbb{E}}(\mathbf{X}^{\top}\mathbf{X})=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}+\mathbf{Z}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+\mathbf{Z}^{\top}\mathbf{Z}-n\mathbf{I}_{d},

then we know that $\|\mathbf{E}\|_{2}\leq 2\|\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}\|_{2}+n\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}$ . We consider $\|\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}\|_{2}$ first. We know that $\widetilde{\mathbf{Z}}:=\mathbf{\Theta}^{\top}\mathbf{Z}=\mathbf{\Theta}^{\top}(\bm{Z}_{1},\ldots,\bm{Z}_{d})=(\widetilde{\bm{Z}}_{1},\ldots,\widetilde{\bm{Z}}_{d})\in\mathbb{R}^{K\times d}$ , where $\widetilde{\bm{Z}}_{j}\overset{\text{i.i.d}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{\Theta}^{\top}\mathbf{\Theta})$ . Under the given conditions we know that $\|\mathbf{\Theta}^{\top}\mathbf{\Theta}\|_{2}\leq\Delta_{0}^{2}$ . Since $(\mathbf{\Theta}^{\top}\mathbf{\Theta})^{-1/2}\widetilde{\mathbf{Z}}$ is a $K\times d$ i.i.d. Gaussian matrix, by Lemma B.2, we have that

\|\|\widetilde{\mathbf{Z}}\|_{2}\|_{\psi_{1}}\lesssim\|(\mathbf{\Theta}^{\top}\mathbf{\Theta})^{1/2}\|_{2}\|\|(\mathbf{\Theta}^{\top}\mathbf{\Theta})^{-1/2}\widetilde{\mathbf{Z}}\|_{2}\|_{\psi_{1}}\lesssim\Delta_{0}\sqrt{d}.

As for $\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}$ , when $n>d$ , by Lemma 3 in Fan et al., [18] we know that $\|\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d/n}$ , and hence in summary we have

\displaystyle\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\|\mathbf{F}\|_{2}\|\|\widetilde{\mathbf{Z}}\|_{2}\|_{\psi_{1}}+n\|\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}\|_{\psi_{1}}\lesssim\Delta_{0}d/\sqrt{K}+\sqrt{nd},

and we can take $r_{1}(d)=\Delta_{0}d/\sqrt{K}+\sqrt{nd}$ . We know that $\Delta=\sigma_{\min}(\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top})\gtrsim d\Delta_{0}^{2}/K$ , and thus under the condition that $\Delta_{0}^{2}\geq CK(\log d)^{2}\left(d(\log d)^{2}/p\vee\sqrt{n/p}\right)$ for some large enough constant $C>0$ , by Theorem 4.1 we have that

	$\displaystyle\left({\mathbb{E}}\|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})\|^{2}\right)^{1/2}$	$\displaystyle\!\!\!\lesssim\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right)\!+\!\sqrt{\frac{d}{pL}}\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right)$
		$\displaystyle+\!\!\sqrt{\frac{Kd}{p^{\prime}}}\left(\!\eta q^{2}\!\left(\!\!\!\sqrt{\frac{dK}{p\Delta_{0}^{2}}}+\!\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{n}{p}}\right)\!\!\right)^{q}.$

Now for the third term to be dominated by the bias term, we can take

q=\log d\geq\frac{\log\left(d/\sqrt{pp^{\prime}}\right)}{\log\log d}+1\geq\frac{\log\left(d/\sqrt{pp^{\prime}}\right)}{\log\left(\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)}+1,

and hence (5) holds.

Remark 16.

In fact we can derive a slightly sharper tail bound for the convergence rate of $\|\mathbf{E}\|_{2}$ . More specifically, for any $t\geq\Delta_{0}\sqrt{d}$ , by Lemma 3 in Fan et al., [18] there exists some constant $c\geq 1$ such that

	$\displaystyle\mathbb{P}\big{(}\\|\widetilde{\mathbf{Z}}\\|_{2}\geq t\big{)}=\mathbb{P}\big{(}\\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}\\|_{2}\geq t^{2}\big{)}=\mathbb{P}\big{(}d\\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}/d-\mathbf{\Theta}^{\top}\mathbf{\Theta}+\mathbf{\Theta}^{\top}\mathbf{\Theta}\\|_{2}\geq t^{2}\big{)}$
	$\displaystyle\leq\mathbb{P}\big{(}d\\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}/d-\mathbf{\Theta}^{\top}\mathbf{\Theta}\\|_{2}\geq t^{2}-d\\|\mathbf{\Theta}^{\top}\mathbf{\Theta}\\|_{2}\big{)}\leq\mathbb{P}\big{(}\\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}/d-\mathbf{\Theta}^{\top}\mathbf{\Theta}\\|_{2}\geq t^{2}/d-\Delta_{0}^{2}\big{)}$
	$\displaystyle\leq\exp\Big{(}-\frac{t^{2}/d-\Delta_{0}^{2}}{c\Delta_{0}^{2}\sqrt{K/d}}\Big{)},$

which indicates that $\|\widetilde{\mathbf{Z}}\|_{2}\lesssim\Delta_{0}\sqrt{d}$ with probability at least $1-d^{-10}$ . Hence under the condition that $\sqrt{K/d}\log d=O(1)$ , with probability at least $1-O(d^{-10})$ we have that $\|\mathbf{E}\|_{2}\lesssim d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d$ , which will be used as the statistical rate of $\|\mathbf{E}\|_{2}$ in later proofs.

$\bullet$ Example 4: We define $\bar{\mathcal{E}}=[\varepsilon_{ij}]$ , then $\widehat{\mathbf{M}}=(1/\widehat{\theta})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}})$ , where $\mathcal{P}_{{\mathcal{S}}}$ is the projection onto the subspace of matrices with non-zero entries only in ${\mathcal{S}}$ . Since $\widehat{\mathbf{M}}$ and $\widehat{\mathbf{M}}^{\prime}:=(\widehat{\theta}/\theta)\widehat{\mathbf{M}}=({1}/{\theta})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}})$ differ only by a positive factor, $\widehat{\mathbf{M}}$ and $\widehat{\mathbf{M}}^{\prime}$ share exactly the same sequence of eigenvectors and $\widetilde{\mathbf{V}}^{\text{F}}$ can be viewed as the output by applying FADI to $\widehat{\mathbf{M}}^{\prime}$ . Thus we will establish the results for $\widehat{\mathbf{M}}^{\prime}$ instead, and abuse the notation by denoting $\mathbf{E}:=\widehat{\mathbf{M}}^{\prime}-\mathbf{M}$ . We first study the order of $\|\mathbf{M}\|_{\max}$ . When $\|\mathbf{V}\|_{2,\infty}\leq\sqrt{\mu K/d}$ for some rate $\mu\geq 1$ (that may change with $d$ ), for any $i,j\in[d]$ , we have that

|\mathbf{M}_{ij}|=|\mathbf{e}_{i}^{\top}\mathbf{V}\mathbf{\Lambda}(\mathbf{e}_{j}^{\top}\mathbf{V})^{\top}|\leq\|\mathbf{\Lambda}\|_{2}\|\mathbf{e}_{i}^{\top}\mathbf{V}\|_{2}\|\mathbf{e}_{j}^{\top}\mathbf{V}\|_{2}\leq|\lambda_{1}|\|\mathbf{V}\|_{2,\infty}^{2}\leq\frac{|\lambda_{1}|\mu K}{d}.

Thus we have $\|\mathbf{M}\|_{\max}=O({|\lambda_{1}|\mu K}/{d})$ . Also, we can write $\mathbf{E}=\mathbf{E}_{1}+\mathbf{E}_{2}$ , where $(\mathbf{E}_{1})_{ij}=\mathbf{M}_{ij}(\delta_{ij}-\theta)/\theta,(\mathbf{E}_{2})_{ij}=\varepsilon_{ij}\delta_{ij}/\theta$ , and for $i\leq j$

\operatorname{Var}\big{(}(\mathbf{E}_{1})_{ij}\big{)}=\mathbf{M}_{ij}^{2}(1-\theta)/\theta\leq\|\mathbf{M}\|_{\max}^{2}/\theta=O\Big{(}\frac{(\lambda_{1}\mu K)^{2}}{d^{2}\theta}\Big{)},\quad\operatorname{Var}\big{(}(\mathbf{E}_{2})_{ij}\big{)}=\sigma^{2}/\theta.

It is not hard to see that $\operatorname*{\rm Cov}((\mathbf{E}_{1})_{ij},(\mathbf{E}_{2})_{ij})=0$ . Also, by the setting of Example 4 we have that $|(\mathbf{E}_{1})_{ij}|\leq\|\mathbf{M}\|_{\max}/\theta=O(\frac{|\lambda_{1}|\mu K}{d\theta})$ , and there exists a constant $C>0$ independent of $d$ such that $|(\mathbf{E}_{2})_{ij}|\leq C\sigma\log d/\theta$ for all $i\leq j$ . Then we will study $\|\mathbf{E}_{1}\|_{2}$ and $\|\mathbf{E}_{2}\|_{2}$ separately. We denote $\nu_{1}=d\|\mathbf{M}\|_{\max}^{2}/\theta$ and $\nu_{2}=d\sigma^{2}/\theta$ . Under the condition that $\theta\geq d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ , by Theorem 3.1.4 in Chen et al., [12], there exists constant $c>0$ such that for any $t\geq 4$ we have

	$\displaystyle\mathbb{P}\Big{(}\frac{\\|\mathbf{E}_{1}\\|_{2}}{2\sqrt{\nu_{1}}}\geq t\Big{)}$	$\displaystyle\leq\mathbb{P}\big{(}\\|\mathbf{E}_{1}\\|_{2}/\sqrt{\nu_{1}}\geq 4+t\big{)}=\mathbb{P}\big{(}\\|\mathbf{E}_{1}\\|_{2}\geq 4\sqrt{\nu_{1}}+t\sqrt{\nu_{1}}\big{)}$
		$\displaystyle\leq d\exp\Big{(}-\frac{t^{2}d\\|\mathbf{M}\\|_{\max}^{2}/\theta}{c\\|\mathbf{M}\\|_{\max}^{2}/\theta^{2}}\Big{)}=\exp(-d\theta t^{2}/c+\log d)$
		$\displaystyle\leq\exp(-\frac{d\theta t^{2}}{2c})\leq\exp(-t^{2}).$

Very similarly for $\|\mathbf{E}_{2}\|_{2}$ , there exists $c^{\prime}>0$ such that for any $t\geq 4$ , we have

	$\displaystyle\mathbb{P}\Big{(}\frac{\\|\mathbf{E}_{2}\\|_{2}}{2\sqrt{\nu_{2}}}\geq t\Big{)}\leq\mathbb{P}\big{(}\\|\mathbf{E}_{2}\\|_{2}\geq 4\sqrt{\nu_{2}}+t\sqrt{\nu_{2}}\big{)}\leq d\exp\Big{(}-\frac{t^{2}d\sigma^{2}/\theta}{c^{\prime}\sigma^{2}(\log d)^{2}/\theta^{2}}\Big{)}$
	$\displaystyle=\exp\Big{(}-\frac{d\theta t^{2}}{c^{\prime}(\log d)^{2}}+\log d\Big{)}\leq\exp\Big{(}-\frac{d\theta t^{2}}{2c^{\prime}(\log d)^{2}}\Big{)}\leq\exp(-t^{2}).$

Thus we can see that

\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\leq\|\|\mathbf{E}_{1}\|_{2}\|_{\psi_{1}}+\|\|\mathbf{E}_{2}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{\nu_{1}}+\sqrt{\nu_{2}}\lesssim\frac{|\lambda_{1}|\mu K}{\sqrt{d\theta}}+\sqrt{\frac{d\sigma^{2}}{\theta}}.

By Theorem 4.1, under the condition that $p=\Omega(\sqrt{d})$ , $\sigma/\Delta\ll(\log d)^{-2}d^{-1}\sqrt{p\theta}$ and $\kappa_{2}\mu K\ll d^{1/4}$ , it holds that

	$\displaystyle\left({\mathbb{E}}\|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})\|_{2}^{2}\right)^{1/2}$	$\displaystyle\!\!\lesssim\sqrt{K}\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right)\!\!+\!K\sqrt{\frac{d}{pL}}\!\!\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right)$
		$\displaystyle+\sqrt{\frac{Kd}{p^{\prime}}}\left(\eta q^{2}\left(\frac{\kappa_{2}\mu K}{\sqrt{p\theta}}+\sqrt{\frac{d^{2}\sigma^{2}}{p\Delta^{2}\theta}}\right)\right)^{q}.$

Furthermore, the third term vanishes when $q=\log d$ and (6) holds.

Remark 17.

Here we can also obtain a statistical rate sharper than subexponential rate for $\|\mathbf{E}\|_{2}$ that would be used in later proofs. Combining the above results for any $t\geq 16\max(\sqrt{\nu_{1}},\sqrt{\nu_{2}})$ we have

	$\displaystyle\mathbb{P}\big{(}\\|\mathbf{E}\\|_{2}\geq t\big{)}$	$\displaystyle\leq\mathbb{P}\big{(}\\|\mathbf{E}_{1}\\|_{2}\geq t/2\big{)}+\mathbb{P}\big{(}\\|\mathbf{E}_{2}\\|_{2}\geq t/2\big{)}\leq\exp(-\frac{d\theta t^{2}}{32c\nu_{1}})+\exp\Big{(}\!\!-\!\frac{d\theta t^{2}}{32c^{\prime}(\log d)^{2}\nu_{2}}\Big{)}$
		$\displaystyle=\exp\Big{(}-\frac{d^{2}\theta^{2}t^{2}}{C_{1}(\lambda_{1}\mu K)^{2}}\Big{)}+\exp\Big{(}-\frac{\theta^{2}t^{2}}{C_{2}(\log d)^{2}\sigma^{2}}\Big{)},$

where $C_{1},C_{2}>0$ are constants. Thus $\|\mathbf{E}\|_{2}\lesssim\frac{|\lambda_{1}|\mu K}{\sqrt{d\theta}}+\sqrt{\frac{d\sigma^{2}}{\theta}}$ with probability at least $1-d^{-10}$ .

B.4 Proof of Theorem 4.3

We first bound the recovery probability of $\widehat{K}^{(\ell)}$ for each $\ell\in[L]$ . Recall that $\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}=\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}+{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}$ , where $\widetilde{\mathbf{\Omega}}^{(\ell)}=\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}$ .

For the residual term ${\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}$ , by Lemma 3 in [18], under the condition that $\sqrt{p/d}\log d=o(1)$ , with probability at least $1-d^{-10}$ we have $\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq 2\sqrt{\frac{d}{p}}$ . Denote by $\mathcal{A}_{\mathbf{E}}$ the event $\big{\{}\|{\mathbf{E}}\|_{2}\leq 10c_{e}^{-1}r_{1}(d)\log d\big{\}}$ , where $c_{e}>0$ is the constant defined in Remark 2. Then conditional on $\mathcal{A}_{\mathbf{E}}$ , we have that $\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq 20c_{e}^{-1}\sqrt{\frac{d}{p}}r_{1}(d)\log d$ with probability at least $1-d^{-10}$ for each $\ell\in[L]$ . Recall $\eta_{0}=480c_{e}^{-1}\sqrt{\frac{d}{\Delta^{2}p}}r_{1}(d)\log d$ . From Proposition 10.4 in [20], we know that when $p\geq 2K$ ,

\mathbb{P}\left(\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\leq\frac{1}{6}\sqrt{\eta_{0}}\right)\leq\mathbb{P}\left(\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\leq\frac{p-K+1}{{\rm e}p}\sqrt{\eta_{0}}\right)\leq\eta_{0}^{\frac{p-K+1}{2}}.

Therefore, with probability at least $1-\eta_{0}^{{(p-K+1)}/2}$ ,

\sigma_{\min}\big{(}\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\geq\Delta\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}>\Delta\sqrt{\eta_{0}}/6\geq 2\mu_{0}.

By Weyl’s inequality [19], we know that conditional on $\mathcal{A}_{\mathbf{E}}$ , with probability at least $1-d^{-10}$ , $\sigma_{K+1}(\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})\leq\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq 20c_{e}^{-1}\sqrt{\frac{d}{p}}r_{1}(d)=\Delta\eta_{0}/24\leq\mu_{0}$ for large enough $d$ , which indicates that $\sigma_{k+1}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})<\sqrt{p}\mu_{0}$ for any $k\geq K$ . For $k\leq K-1$ , under the same event we have

	$\displaystyle\sigma_{k+1}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})\geq\sigma_{K}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})\geq\sigma_{\min}\big{(}\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}\big{)}-2\\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}\\|_{2}$
	$\displaystyle\quad>\sqrt{p}(\Delta\sqrt{\eta_{0}}/6-\Delta\eta_{0}/12)\geq\sqrt{p}(\Delta\sqrt{\eta_{0}}/6-\Delta\sqrt{\eta_{0}}/12)=\Delta\sqrt{p\eta_{0}}/12\geq\sqrt{p}\mu_{0}.$

Then we have

	$\displaystyle\mathbb{P}\left(\widehat{K}^{(\ell)}=K\,\,\big{\|}\,\mathcal{A}_{\mathbf{E}}\right)\geq\mathbb{P}\left(\!\!\sigma_{K}\!\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}-\sigma_{p}\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}>\sqrt{p}\mu_{0},\,\,\sigma_{K+1}\!\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}-\sigma_{p}\!\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}\leq\sqrt{p}\mu_{0}\,\,\Big{\|}\,\mathcal{A}_{\mathbf{E}}\right)$
	$\displaystyle\quad\geq\mathbb{P}\left(\sigma_{\min}\big{(}\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\geq\Delta\sqrt{\eta_{0}}/6,\quad\\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\|_{2}\leq\Delta\eta_{0}/24\,\Big{\|}\,\mathcal{A}_{\mathbf{E}}\right)$
	$\displaystyle\quad\geq 1-d^{-10}-\eta_{0}^{\frac{p-K+1}{2}}.$

We know that conditional on ${\mathbf{E}}$ , $\mathbb{I}\{\widehat{K}^{(\ell)}\neq K\,|\,\mathcal{A}_{\mathbf{E}}\}$ are i.i.d. Bernoulli variables with expectation $p_{K}:=\mathbb{P}(\widehat{K}^{(\ell)}\neq K\,|\,\mathcal{A}_{\mathbf{E}})\leq d^{-10}+\eta_{0}^{\frac{p-K+1}{2}}\leq 1/4$ and variance $p_{K}(1-p_{K})\leq p_{K}$ . Since the estimators $\{\widehat{K}^{(\ell)}\}_{\ell=1}^{L}$ are all integers, we know that if $\widehat{K}\neq K$ , at least half of $\{\widehat{K}^{(\ell)}\}_{\ell=1}^{L}$ are not equal to $K$ . Then by Hoeffding’s inequality, we have

	$\displaystyle\mathbb{P}(\widehat{K}\!\neq\!K)$	$\displaystyle\leq\mathbb{P}\!\left(\sum_{\ell=1}^{L}\mathbb{I}\left\{\widehat{K}^{(\ell)}\neq K\right\}\!-\!p_{K}L\!\geq\!\frac{L}{4}\!\right)\!=\!{\mathbb{E}}_{{\mathbf{E}}}\!\left(\!\mathbb{P}\bigg{(}\!\sum_{\ell=1}^{L}\mathbb{I}\left\{\widehat{K}^{(\ell)}\!\neq\!K\right\}\!-\!p_{K}L\geq\frac{L}{4}\,\big{\|}\,{\mathbf{E}}\bigg{)}\!\right)$
		$\displaystyle\leq\mathbb{P}(\mathcal{A}_{\mathbf{E}})\exp\left\{-(L/4)^{2}/(2Lp_{K})\right\}+1-\mathbb{P}(\mathcal{A}_{\mathbf{E}})$
		$\displaystyle\leq\exp\left\{-L\big{/}\big{(}32d^{-10}+32\eta_{0}^{\frac{p-K+1}{2}}\big{)}\right\}+O(d^{-10}).$

We know that $32d^{-10}\leq(\log d)^{-1}$ for $d\geq 2$ , and under the condition that $\eta_{0}\leq(32\log d)^{-\frac{2}{p-K+1}}$ we have $\mathbb{P}(\widehat{K}\neq K)\leq\exp(-L\log d/2)+O(d^{-10})\lesssim d^{-(L\wedge 20)/2}$ .

B.5 Proof of Corollary 4.4

$\bullet$ Example 1: From the proof of Corollary 4.2 we know that we can take $r_{1}(d)=(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\log d$ . Then by plugging in each term we know that under the condition that $(\lambda_{1}+\sigma^{2})\left(d(np)^{-1/2}\log d\right)^{1/4}=o(1)$ and $\Delta\gg\left({\sigma^{-2}(np)^{-1/2}}d\log d\right)^{1/3}$ , we have $\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12$ . Besides, under the condition that $\kappa_{1}\sqrt{dr/(np)}(\log d)^{2}=o(1)$ , we also have $\eta_{0}\leq(32\log d)^{-\frac{2}{p-K+1}}$ . Thus the conditions for Theorem 4.3 are satisfied and we have $\widehat{K}=K$ with probaility at least $1-O(d^{-(L\wedge 20)/2})$ .

$\bullet$ Example 2: We know from the proof of Corollary 4.2 that $\Delta\gtrsim d\theta/K$ . Also from Remark 14 we know that $\|\mathbf{E}\|_{2}\lesssim\sqrt{d\theta}$ with probability at least $1-d^{-10}$ , and thus we have $\eta_{0}\asymp\sqrt{d/(\Delta^{2}p)}\sqrt{d\theta}\lesssim K/\sqrt{p\theta}\asymp 1/\sqrt{d^{\epsilon-1/2}p}$ , $\Delta\eta_{0}\asymp d\sqrt{\theta/p}$ and $\Delta\sqrt{\eta_{0}}\gtrsim d\theta^{3/4}p^{-1/4}K^{-1/2}$ . Also recall from the proof of Corollary 4.2 that ${\mathbb{E}}(\widehat{\mathbf{M}}_{ij})=\mathbf{M}_{ij}\asymp\theta$ for any $i,j\in[d]$ , and hence $d^{-2}\sum_{i\leq j}\mathbf{M}_{ij}\asymp\theta$ . By Hoeffding’s inequality [21], we have that

\mathbb{P}\left(\frac{2}{d(d-1)}\left|\sum_{i\leq j}\widehat{\mathbf{M}}_{ij}-\sum_{i\leq j}\mathbf{M}_{ij}\right|\geq\frac{\sqrt{11\log d}}{d}\right)\lesssim\exp\left(-11d(d-1)\log d/d^{2}\right)\lesssim d^{-10}.

Thus we can see with probability at least $1-O(d^{-10})$ , $|\widehat{\theta}-d^{-2}\sum_{i\leq j}\mathbf{M}_{ij}|\lesssim\frac{\sqrt{\log d}}{d}$ and $\widehat{\theta}\asymp\theta$ , and in turn $\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12$ . Thus by Theorem 4.3 the claim follows.

$\bullet$ Example 3: We know from the proof of Corollary 4.2 and Remark 16 that $\Delta\gtrsim d\Delta_{0}^{2}/K$ and $\|\mathbf{E}\|_{2}\lesssim d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d$ with probability at least $1-d^{-10}$ . Thus we have $\eta_{0}\asymp\sqrt{d/(\Delta^{2}p)}\big{(}d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d\big{)}$ . Under the condition that $\sqrt{K(\log d)^{3}}\left(n/p\right)^{1/4}\ll\Delta_{0}\ll\sqrt{nK/d}\log d$ , we know that $d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d\lesssim\sqrt{dn}\log d$ , $\Delta\eta_{0}\asymp d\sqrt{n/p}\log d$ and $\sqrt{\eta_{0}}\log d=o(1)$ , and thus $\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12$ . By Theorem 4.3 the claim follows.

$\bullet$ Example 4: By Hoeffding’s inequality [21], with probability at least $1-d^{-10}$ we have that $|\widehat{\theta}-\theta|/\widehat{\theta}\leq C\sqrt{\log d}/d\theta$ . As for $\widehat{\sigma}_{0}^{2}$ , we have

\displaystyle\widehat{\sigma}_{0}^{2}=\frac{1}{|{\mathcal{S}}|}\sum_{(i,j)\in{\mathcal{S}}}(\widehat{\theta}\widehat{\mathbf{M}}_{ij})^{2}=\frac{1}{|{\mathcal{S}}|}\left(\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}+2\sum_{(i,j)\in{\mathcal{S}}}\mathbf{M}_{ij}\varepsilon_{ij}+\sum_{(i,j)\in{\mathcal{S}}}\varepsilon_{ij}^{2}\right).

We consider the latter two terms first. We know that $|\varepsilon_{ij}|\leq C\sigma\log d$ for some constant $C>0$ and $|\mathbf{M}_{ij}|\leq|\lambda_{1}|\mu K/d$ , for any $i\leq j$ . Denote by $\widetilde{\sigma}=(|\lambda_{1}|\mu K/d)\vee\sigma$ , then we have

\operatorname{{\rm Var}}({\mathrm{M}}_{ij}\varepsilon_{ij})\leq(\frac{|\lambda_{1}|\mu K}{d})^{2}\sigma^{2}\leq\widetilde{\sigma}^{4},\quad|{\mathrm{M}}_{ij}\varepsilon_{ij}|\leq\frac{|\lambda_{1}|\mu K}{d}C\sigma\log d\leq C\widetilde{\sigma}^{2}\log d,\quad\forall i\leq j,

and

\operatorname{{\rm Var}}(\varepsilon_{ij}^{2})\leq C^{4}\sigma^{4}(\log d)^{4}\leq C^{4}\widetilde{\sigma}^{4}(\log d)^{4},\quad|\varepsilon_{ij}^{2}|\leq C^{2}\sigma^{2}(\log d)^{2}\leq C^{2}\widetilde{\sigma}^{2}(\log d)^{2},\quad\forall i\leq j.

Thus by Bernstein inequality [9], conditional on ${\mathcal{S}}$ , with probability at least $1-2d^{-10}$ we have that there exists a constant $C^{\prime}>0$ independent of ${\mathcal{S}}$ such that

\left|\frac{1}{|{\mathcal{S}}|}\sum_{(i,j)\in{\mathcal{S}}}\mathbf{M}_{ij}\varepsilon_{ij}\right|\leq C^{\prime}\left(\frac{\widetilde{\sigma}^{2}\sqrt{\log d}}{\sqrt{|{\mathcal{S}}|}}+\frac{\widetilde{\sigma}^{2}(\log d)^{2}}{|{\mathcal{S}}|}\right),

(B.28)

and

\left|\frac{1}{|{\mathcal{S}}|}\sum_{(i,j)\in{\mathcal{S}}}\varepsilon_{ij}^{2}-\sigma^{2}\right|\leq C^{\prime}\left(\frac{\widetilde{\sigma}^{2}(\log d)^{5/2}}{\sqrt{|{\mathcal{S}}|}}+\frac{\widetilde{\sigma}^{2}(\log d)^{3}}{|{\mathcal{S}}|}\right).

(B.29)

Now we consider the first term. Since $\delta_{ij}$ ’s are i.i.d. Bernoulli random variables with expectation $\theta$ , we have

\operatorname{{\rm Var}}(\mathbf{M}_{ij}^{2}\delta_{ij})\leq\theta\widetilde{\sigma}^{4},\quad|\mathbf{M}_{ij}^{2}\delta_{ij}|\leq\widetilde{\sigma}^{2},\quad i\leq j.

Also, we know that $\sum_{i\leq j}\mathbf{M}_{ij}^{2}\geq\|\mathbf{M}\|_{\text{F}}^{2}/2\geq K\Delta^{2}/2$ and $\sum_{i\leq j}\mathbf{M}_{ij}^{2}\leq\|\mathbf{M}\|_{\text{F}}^{2}\leq K\lambda_{1}^{2}$ , and hence $K\Delta^{2}\theta/2\leq{\mathbb{E}}\Big{(}\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}\Big{)}\leq K\lambda_{1}^{2}\theta$ . Then by Bernstein inequality [9] with probability at least $1-d^{-10}$ , it holds that

\left|\Big{(}\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}\Big{)}-{\mathbb{E}}\Big{(}\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}\Big{)}\right|\lesssim d^{2}\left(\frac{\widetilde{\sigma}^{2}\sqrt{\theta\log d}}{d}+\frac{\widetilde{\sigma}^{2}\log d}{d^{2}}\right)=\widetilde{\sigma}^{2}(d\sqrt{\theta\log d}+\log d).

(B.30)

Thus combining (B.28), (B.29) and (B.30) with the fact that $|{\mathcal{S}}|\asymp d^{2}\theta$ with probability at least $1-d^{-10}$ , under the condition that $\kappa_{2}^{2}\mu^{2}K\ll(\log d)^{2}$ , with probability at least $1-O(d^{-10})$ we have

\widetilde{\sigma}\ll\left(\frac{\Delta\sqrt{K}\log d}{d}\vee\sigma\right)+o(\widetilde{\sigma})\lesssim\widehat{\sigma}_{0}\log d\lesssim\left(\frac{|\lambda_{1}|\sqrt{K}\log d}{d}\vee\sigma\right)+o(\widetilde{\sigma})\lesssim\widetilde{\sigma}\log d.

From the proof of Corollary 4.2 and Remark 17, we know that with probability at least $1-d^{-10}$ ,

\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}\lesssim\left\|\frac{\widehat{\theta}}{\theta}\widehat{\mathbf{M}}-\widehat{\mathbf{M}}\right\|_{2}+\left\|\frac{\widehat{\theta}}{\theta}\widehat{\mathbf{M}}-\mathbf{M}\right\|_{2}\lesssim\frac{|\lambda_{1}|\sqrt{\log d}}{d\theta}+\frac{|\lambda_{1}|\mu K}{\sqrt{d\theta}}+\sqrt{\frac{d\sigma^{2}}{\theta}}\lesssim\sqrt{\frac{d\widetilde{\sigma}^{2}}{\theta}},

and hence $\eta_{0}\asymp d\widetilde{\sigma}(\Delta\sqrt{p\theta})^{-1}$ and $\Delta\eta_{0}\asymp d\widetilde{\sigma}/\sqrt{p\theta}$ .

Under the condition that $(p\theta)^{-1/4}\sqrt{d\sigma/\Delta}\log d=o(1)$ , with probability at least $1-O(d^{-10})$ we have $\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12$ . Thus by Theorem 4.3 the claim follows.

B.6 Proof of Theorem 4.10

We first decompose $\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}=\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}+\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}$ , and we consider the term $\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}$ first.

By Lemma 8 in Fan et al., [18], we have that $\|\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}-\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\|_{2}\lesssim\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\|\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\|_{2}$ . Note that in Lemma 8 of Fan et al., [18], the norm is Frobenius norm rather than operator norm, and the modification from Frobenius norm to operator norm is trivial and hence omitted. We first study the leading term $\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}$ .

For a given $\ell\in[L]$ , we know that $\widehat{\mathbf{V}}^{(\ell)}$ is the top $K$ left singular vectors of $\widehat{\mathbf{Y}}^{(\ell)}=\widehat{\mathbf{M}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p}+\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}={\mathbf{Y}}^{(\ell)}+\mathbf{\mathcal{E}}^{(\ell)}$ , where

\mathbf{Y}^{(\ell)}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\quad\text{and}\quad\mathbf{\mathcal{E}}^{(\ell)}=\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}.

By the “symmetric dilation” trick, we denote

{\mathcal{S}}(\widehat{\mathbf{Y}}^{(\ell)})=\begin{pmatrix}\mathbf{0}&\widehat{\mathbf{Y}}^{(\ell)}\\ \widehat{\mathbf{Y}}^{(\ell)\top}&\mathbf{0}\end{pmatrix},\quad{\mathcal{S}}({\mathbf{Y}}^{(\ell)})=\begin{pmatrix}\mathbf{0}&{\mathbf{Y}}^{(\ell)}\\ {\mathbf{Y}}^{(\ell)\top}&\mathbf{0}\end{pmatrix},

\text{and}\quad{\mathcal{S}}(\mathbf{\mathcal{E}}^{(\ell)})={\mathcal{S}}(\widehat{\mathbf{Y}}^{(\ell)})-{\mathcal{S}}({\mathbf{Y}}^{(\ell)})=\begin{pmatrix}\mathbf{0}&\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\ \mathbf{\Omega}^{(\ell)\top}\mathbf{E}/\sqrt{p}&\mathbf{0}\end{pmatrix}.

We let $\mathbf{\Gamma}_{K}^{(\ell)}{\mathbf{\Lambda}}_{K}^{(\ell)}{\mathbf{U}}_{K}^{(\ell)\top}$ be the SVD of ${\mathbf{Y}}^{(\ell)}$ , and we know that with probability 1 we have $\mathbf{\Gamma}_{K}^{(\ell)}=\mathbf{V}\mathbf{O}_{\mathbf{\Omega}^{(\ell)}}$ , where $\mathbf{O}_{\mathbf{\Omega}^{(\ell)}}$ is an orthonormal matrix depending on $\mathbf{\Omega}^{(\ell)}$ . It is not hard to verify that the eigen-decomposition of ${\mathcal{S}}({\mathbf{Y}}^{(\ell)})$ is:

{\mathcal{S}}({\mathbf{Y}}^{(\ell)})=\frac{1}{\sqrt{2}}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}\cdot\begin{pmatrix}{\mathbf{\Lambda}}_{K}^{(\ell)}&\mathbf{0}\\ \mathbf{0}&-{\mathbf{\Lambda}}_{K}^{(\ell)}\end{pmatrix}\cdot\frac{1}{\sqrt{2}}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top},

where ${\mathbf{\Lambda}}_{K}^{(\ell)}=\operatorname{diag}(\lambda_{1}^{(\ell)},\ldots,\lambda_{K}^{(\ell)})$ . First we study the eigengap $\sigma_{\min}({\mathbf{\Lambda}}_{K}^{(\ell)})=\lambda_{K}^{(\ell)}$ . Recall $\widetilde{\mathbf{\Omega}}^{(\ell)}=\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}\in\mathbb{R}^{K\times p}$ , and it can be seen that the entries of $\widetilde{\mathbf{\Omega}}^{(\ell)}$ are i.i.d. standard Gaussian. By Lemma 3 in Fan et al., [18], we know that with probability at least $1-d^{-10}$ , we have that $\|\widetilde{\mathbf{\Omega}}^{(\ell)}\widetilde{\mathbf{\Omega}}^{(\ell)\top}/p-\mathbf{I}_{K}\|_{2}\lesssim\sqrt{\frac{K}{p}}\log d$ , and thus $\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\geq 1-O(\sqrt{\frac{K}{p}}\log d)$ with probability at least $1-d^{-10}$ . Thus under the condition that $\sqrt{\frac{K}{p}}\log d=o(1)$ , under the same high probability event we have that $\sigma_{\min}({\mathbf{\Lambda}}_{K}^{(\ell)})\geq\Delta/2$ . Now we let $\widehat{\mathbf{U}}_{K}^{(\ell)}$ be the top $K$ right singular vectors of $\widehat{\mathbf{Y}}^{(\ell)}$ . For $j\in[K]$ we define

\mathbf{G}_{j}^{(\ell)}\!\!=\!\frac{1}{2}\!\!\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ -\mathbf{U}_{K}^{(\ell)}\end{pmatrix}\!\!(-\mathbf{\Lambda}_{K}^{(\ell)}-\lambda_{j}^{(\ell)}\mathbf{I}_{K})^{-1}\!\!\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ -\mathbf{U}_{K}^{(\ell)}\end{pmatrix}^{\!\!\!\top}\!\!\!-\frac{1}{\lambda_{j}^{(\ell)}}\bigg{\{}\mathbf{I}_{K}-\frac{1}{2}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}\!\!\!\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\!\!\!\top}\!\!\bigg{\}}.

Then we have $\|\mathbf{G}_{j}^{(\ell)}\|_{2}\leq 1/\lambda_{K}^{(\ell)}\leq 2/\Delta$ with probability at least $1-d^{-10}$ . Correspondingly we define the linear mapping

f:\mathbb{R}^{(d+p)\times K}\rightarrow\mathbb{R}^{(d+p)\times K},\quad\left(\mathbf{w}_{1},\cdots,\mathbf{w}_{K}\right)\mapsto\left(-\mathbf{G}_{1}^{(\ell)}\mathbf{w}_{1},\cdots,-\mathbf{G}_{K}^{(\ell)}{\mathbf{w}}_{K}\right),

and denote $\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}=\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ \mathbf{U}_{K}^{(\ell)}\end{pmatrix}$ . By Lemma 8 in Fan et al., [18], under the condition that $\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\|_{2}/\Delta=o(1)$ we have

	$\displaystyle\bigg{\\|}\begin{pmatrix}\widehat{\mathbf{V}}^{(\ell)}\\ \widehat{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}(\widehat{\mathbf{V}}^{(\ell)\top},\widehat{\mathbf{U}}_{K}^{(\ell)\top})-\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)\top}-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)\top}-\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}^{\top}\bigg{\\|}_{2}$
	$\displaystyle\leq\bigg{\\|}\begin{pmatrix}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}\mathbf{\Gamma}_{K}^{(\ell)\top}&\quad\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{U}}_{K}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}\mathbf{U}_{K}^{(\ell)\top}\\ \widehat{\mathbf{U}}_{K}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{U}_{K}^{(\ell)}\mathbf{\Gamma}_{K}^{(\ell)\top}&\quad\widehat{\mathbf{U}}_{K}^{(\ell)}\widehat{\mathbf{U}}_{K}^{(\ell)\top}-\mathbf{U}_{K}^{(\ell)}\mathbf{U}_{K}^{(\ell)\top}\end{pmatrix}$
	$\displaystyle\quad-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)\top}-\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}^{\top}\bigg{\\|}_{2}\lesssim\\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\\|_{2}^{2}/\Delta^{2}.$

By taking the upper left block of the matrix, we have

	$\displaystyle\big{\\|}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}\mathbf{\Gamma}_{K}^{(\ell)\top}-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\big{\\|}_{2}$
	$\displaystyle=\big{\\|}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\big{\\|}_{2}$
	$\displaystyle\lesssim\\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\\|_{2}^{2}/\Delta^{2}.$

Now for $j\in[K]$ , we study $\mathbf{P}_{\perp}(\mathbf{G}_{j}^{(\ell)})_{[1:d,:]}$ . Since $\mathbf{\Gamma}_{K}^{(\ell)}=\mathbf{V}\mathbf{O}_{\mathbf{\Omega}^{(\ell)}}$ , we have $\mathbf{P}_{\perp}\mathbf{\Gamma}_{K}^{(\ell)}=\mathbf{0}$ . Therefore we have,

	$\displaystyle\mathbf{P}_{\perp}\mathbf{\Gamma}_{K}^{(\ell)}(-\mathbf{\Lambda}_{K}^{(\ell)}-\lambda_{j}^{(\ell)}\mathbf{I}_{K})^{-1}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ -\mathbf{U}_{K}^{(\ell)}\end{pmatrix}^{\top}=\mathbf{0},\quad\text{and}$
	$\displaystyle\mathbf{P}_{\perp}\bigg{\{}\mathbf{I}_{d+p}-\frac{1}{2}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}\!\!\!\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top}\!\bigg{\}}_{[1:d,:]}$
	$\displaystyle=(\mathbf{P}_{\perp},\mathbf{0})-\frac{1}{2}\mathbf{P}_{\perp}\mathbf{\Gamma}_{K}^{(\ell)}(\mathbf{I}_{d},\mathbf{I}_{d})\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top}$
	$\displaystyle=(\mathbf{P}_{\perp},\mathbf{0})+\mathbf{0}=(\mathbf{P}_{\perp},\mathbf{0}),$

and as a result we have

\mathbf{P}_{\perp}(\mathbf{G}_{j})_{[1:d,:]}=\frac{1}{2}\cdot\mathbf{0}-\frac{1}{\lambda_{j}^{(\ell)}}\left\{(\mathbf{P}_{\perp},\mathbf{0})-\mathbf{0}\right\}=-\frac{1}{\lambda_{j}^{(\ell)}}(\mathbf{P}_{\perp},\mathbf{0}).

Thus in turn,

	$\displaystyle\mathbf{P}_{\perp}\Big{(}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}+\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\Big{)}=\mathbf{P}_{\perp}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}$
	$\displaystyle\quad=(\mathbf{P}_{\perp},\mathbf{0})\begin{pmatrix}\mathbf{0}&\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\ \mathbf{\Omega}^{(\ell)\top}\mathbf{E}/\sqrt{p}&\mathbf{0}\end{pmatrix}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ \mathbf{U}_{K}^{(\ell)}\end{pmatrix}(\mathbf{\Lambda}_{K}^{(\ell)})^{-1}\mathbf{\Gamma}_{K}^{(\ell)\top}$
	$\displaystyle\quad=\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})\mathbf{U}_{K}^{(\ell)}(\mathbf{\Lambda}_{K}^{(\ell)})^{-1}\mathbf{\Gamma}_{K}^{(\ell)\top}=\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})(\mathbf{Y}^{(\ell)})^{\dagger}.$

For a given $\ell\in[L]$ , under the condition that $\sqrt{p/d}\log d=O(1)$ , by Lemma 3 in Fan et al., [18] we have that with probability at least $1-d^{-10}$ , $\|\mathbf{\Omega}^{(\ell)}\|_{2}\lesssim\sqrt{d}$ . Combined with previous results on the eigengap $\sigma_{\min}(\mathbf{\Lambda}_{K}^{(\ell)})$ , we have that with probability $1-O(d^{-9})$ , for a fixed constant $C>0$

\|\mathbf{\Omega}^{(\ell)}\|_{2}\leq C\sqrt{d},\quad\sigma_{\min}(\mathbf{\Lambda}_{K}^{(\ell)})\geq\Delta/2,\quad\forall\ell\in[L].

Besides, under Assumption 1, we have that $\|\mathbf{E}\|_{2}\lesssim r_{1}(d)\log d$ with probability at least $1-d^{-10}$ , and in turn by Wedin’s Theorem [42], with high probability for all $\ell\in[L]$ we have that

\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\lesssim\|\mathcal{E}^{(\ell)}\|_{2}/\sigma_{\min}(\mathbf{\Lambda}_{K}^{(\ell)})\lesssim\|\mathbf{E}\|_{2}\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}/\Delta\lesssim\frac{r_{1}(d)}{\Delta}\log d\sqrt{\frac{d}{p}},

and thus $\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}=O_{P}\big{(}r_{1}(d)\log d\sqrt{d/p}/\Delta\big{)}$ . Besides, we have

	$\displaystyle\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}$
	$\displaystyle\quad=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\Big{(}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}+\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\Big{)}\mathbf{V}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})$
	$\displaystyle\quad=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})(\mathbf{Y}^{(\ell)})^{\dagger}\mathbf{V}+R_{1}(\widetilde{\mathbf{\Sigma}})=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})\mathbf{B}^{(\ell)\top}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})$
	$\displaystyle\quad=\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}}),$

where $\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})$ is the residual matrix with $\|\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\|_{2}=O_{P}(\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\|_{2}^{2}/\Delta^{2})$ . Now we study the matrix $\mathbf{B}^{(\ell)}=(\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}$ . From previous results we know that with probability at least $1-d^{-9}$ , $1/2\leq\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq\sigma_{\max}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq 3/2$ for any $\ell\in[L]$ , and in turn $\frac{2}{3|\lambda_{1}|}\leq\sigma_{\min}(\mathbf{B}^{(\ell)})\leq\sigma_{\max}(\mathbf{B}^{(\ell)})\leq\frac{2}{\Delta},\quad\forall\ell\in[L]$ . Now for any vector $\mathbf{y}\in\mathbb{R}^{K}$ such that $\|\mathbf{y}\|_{2}=1$ , with probability $1-O(d^{-9})$ we have that

	$\displaystyle\\|\mathbf{B}_{\mathbf{\Omega}}\mathbf{y}\\|_{2}$	$\displaystyle=\\|(\mathbf{y}^{\top}\mathbf{B}^{(1)\top},\ldots,\mathbf{y}^{\top}\mathbf{B}^{(L)\top})^{\top}\\|_{2}=\Big{(}\sum_{\ell=1}^{L}\\|\mathbf{B}^{(\ell)}\mathbf{y}\\|_{2}^{2}\Big{)}^{1/2},$
	$\displaystyle\\|\mathbf{B}_{\mathbf{\Omega}}\\|_{2}$	$\displaystyle=\max_{\\|\mathbf{y}\\|_{2}=1}\\|\mathbf{B}_{\mathbf{\Omega}}\mathbf{y}\\|_{2}=\max_{\\|\mathbf{y}\\|_{2}=1}\Big{(}\sum_{\ell=1}^{L}\\|\mathbf{B}^{(\ell)}\mathbf{y}\\|_{2}^{2}\Big{)}^{1/2}\leq\Big{(}\sum_{\ell=1}^{L}\\|\mathbf{B}^{(\ell)}\\|_{2}^{2}\Big{)}^{1/2}\leq\frac{2\sqrt{L}}{\Delta},$
	$\displaystyle\sigma_{\min}\left(\mathbf{B}_{\mathbf{\Omega}}\right)$	$\displaystyle=\min_{\\|\mathbf{y}\\|_{2}=1}\\|\mathbf{B}_{\mathbf{\Omega}}\mathbf{y}\\|_{2}=\min_{\\|\mathbf{y}\\|_{2}=1}\Big{(}\sum_{\ell=1}^{L}\\|\mathbf{B}^{(\ell)}\mathbf{y}\\|_{2}^{2}\Big{)}^{1/2}\geq\Big{(}\sum_{\ell=1}^{L}\sigma_{\min}^{2}(\mathbf{B}^{(\ell)})\Big{)}^{1/2}\geq\frac{2\sqrt{L}}{3\|\lambda_{1}\|}.$

Now since we know that the entries of $\sqrt{p}\mathbf{\Omega}$ are i.i.d. standard Gaussian, similar as before, under the condition that $Lp\ll d$ , by Lemma 3 in Fan et al., [18] we have with high probability that $\frac{1}{2}\sqrt{\frac{d}{p}}\leq\sigma_{\min}(\mathbf{\Omega})\leq\sigma_{\max}(\mathbf{\Omega})\leq\frac{3}{2}\sqrt{\frac{d}{p}}$ . Therefore, we have the following upper bound on the norm of the leading term

	$\displaystyle\\|\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\\|_{2}$	$\displaystyle\lesssim\\|\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}\\|_{2}+\\|\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\\|_{2}\leq\frac{1}{L}\\|\mathbf{E}\\|_{2}\\|\mathbf{\Omega}\\|_{2}\\|\mathbf{B}_{\mathbf{\Omega}}\\|_{2}+\\|\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\\|_{2}$
		$\displaystyle=O_{P}\Big{(}\sqrt{\frac{d}{Lp}}\frac{r_{1}(d)\log d}{\Delta}+r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.$

Thus we have the following decomposition

	$\displaystyle\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}$	$\displaystyle=\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}+\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})$
		$\displaystyle=\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})+\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})$
		$\displaystyle=\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{b}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})+\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}}),$

where $\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})$ is a residual matrix with

	$\displaystyle\\|\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})\\|_{2}$	$\displaystyle=O_{P}(\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\\|\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\\|_{2})$
		$\displaystyle=O_{P}\Big{(}\frac{r_{1}(d)^{2}(\log d)^{2}{d}}{\sqrt{L}p\Delta^{2}}\Big{)}+o_{P}\Big{(}r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.$

Thus

\|\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\|_{2}=O_{P}\Big{(}r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.

Next we consider the term $\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}$ . We denote the SVD of $\widetilde{\bm{\Sigma}}^{q}$ by $\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}+\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}$ , and by Weyl’s inequality [19], we know that $\|\widetilde{\mathbf{\Lambda}}_{\perp}\|_{2}\leq\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}=O_{P}\big{(}r_{1}(d)\log d\sqrt{d/p}/\Delta\big{)}$ and $\sigma_{K}(\widetilde{\mathbf{\Lambda}}_{K})\geq 1-\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq 1-O_{P}(r_{1}(d)\log d\sqrt{d/p}/\Delta)$ . Thus under the condition that $r_{1}(d)\log d\sqrt{d/p}/\Delta=o(1)$ , for large enough $d$ with high probability we have

\|\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\|_{2}\leq(r_{1}(d)\log d\sqrt{d/p}/\Delta)^{q}\quad\text{and}\quad\sigma_{K}(\widetilde{\mathbf{\Lambda}}_{K}^{q})\geq(1-O(r_{1}(d)\log d\sqrt{d/p}/\Delta))^{q}\geq(1/2)^{q}.

Similar as before, we know that with probability 1 the left singular vector space of $\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\text{F}}=\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{\Omega}}^{\text{F}}$ and the column space of $\widetilde{\mathbf{V}}$ are the same, where $\widetilde{\mathbf{\Omega}}^{\text{F}}:=\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\text{F}}\in\mathbb{R}^{K\times p^{\prime}}$ is still a Gaussian test matrix with i.i.d. entries. By Lemma 3 in Fan et al., [18], we have with probability at least $1-d^{-10}$ , $\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}})\geq 1-O(\sqrt{\frac{K}{p^{\prime}}}\log d)$ . When $\sqrt{\frac{K}{p^{\prime}}}\log d=o(1)$ , by Wedin’s Theorem [42], there exists a constant $\eta>0$ such that with high probability we have

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}\\|_{2}$	$\displaystyle=\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{1}-\widetilde{\mathbf{V}}\\|_{2}\lesssim\\|\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}}\\|_{2}/\sigma_{K}(\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}})$
		$\displaystyle\leq\frac{\\|\widetilde{\mathbf{\Lambda}}_{\perp}\\|_{2}^{q}\\|{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}}\\|_{2}}{\sigma_{K}(\widetilde{\mathbf{\Lambda}}_{K}^{q})\sigma_{K}(\widetilde{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}})}\lesssim\left(\frac{2\eta r_{1}(d)\log d\sqrt{d/p}}{\Delta}\right)^{q}\sqrt{\frac{d}{p^{\prime}}}.$

Denote $r^{\prime}:=2\eta r_{1}(d)\log d\sqrt{d/p}/\Delta=o\big{(}(\log d)^{-1/4}\big{)}$ . Then it can be seen that when

q\geq\log d\gg 2+\frac{\log d}{\log\log d}\geq 2+\frac{\log\sqrt{d/p^{\prime}}}{\log(1/r^{\prime})},

we have that $(r^{\prime})^{q}\sqrt{d/p^{\prime}}=o\big{(}(r^{\prime})^{2}\big{)}$ and $\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}\|_{2}=O_{P}\Big{(}r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}$ .

Now for a given $j\in[d]$ , recall that with high probability $\sigma_{\min}(\mathbf{\Sigma}_{j})=\Omega\big{(}\eta_{2}(d)\Big{)}$ . Therefore, under the condition that ${d^{2}r_{1}(d)^{4}(\log d)^{4}}\big{(}{p^{2}\Delta^{4}\eta_{2}(d)}\big{)}^{-1}=o(1)$ and ${dr_{2}(d)^{2}}\big{(}{Lp\Delta^{2}\eta_{2}(d)}\big{)}^{-1}=o(1)$ , we have with probability $1-O(d^{-9})$ , $\|\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{b}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}\|_{2}=O_{P}\Big{(}\sqrt{\frac{d}{\Delta^{2}Lp}}r_{2}(d)\Big{)}=o_{P}\big{(}(\sigma_{\min}(\mathbf{\Sigma}_{j}))^{1/2}\big{)}$ , and $\|\mathbf{R}_{0}(\widetilde{\bm{\Sigma}})+\mathbf{R}_{1}(\widetilde{\bm{\Sigma}})\|_{2}=o_{P}\big{(}(\sigma_{\min}(\mathbf{\Sigma}_{j}))^{1/2}\big{)}$ . Then under Assumption 5, we have

	$\displaystyle\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}+\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})^{\top}\mathbf{e}_{j}$
	$\displaystyle=\mathbf{\Sigma}_{j}^{-1/2}(\frac{1}{L}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\!+\!\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\!-\!\widetilde{\mathbf{V}}\mathbf{H}_{0}\!+\!\mathbf{R}_{0}(\widetilde{\bm{\Sigma}})\!+\!\mathbf{R}_{1}(\widetilde{\bm{\Sigma}})\!+\!\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{b}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}})^{\top}\mathbf{e}_{j}$
	$\displaystyle=\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}+o_{P}(1)\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).$

B.7 Proof of Corollary 4.11

To prove Corollary 4.11, it suffices for us to show that Assumptions 1, 2 and 5 are met. From the proof of Corollary 4.2, we know that Assumption 1 is satisfied. We move on to show that Assumption 2 is met. Define $\mathbf{V}_{d}=(\mathbf{V},\mathbf{V}^{\perp})$ as the stacking of eigenvectors for the covariance matrix $\mathbf{\Sigma}$ . Note that $\mathbf{V}^{\perp}$ is not identifiable under the spiked covariance model and is unique up to orthogonal transformation. Let $\bm{Z}_{i}=\mathbf{V}_{d}^{\top}\bm{X}_{i}$ , and $\bm{Z}_{i}\sim{\mathcal{N}}(\mathbf{0},\mathbf{\Lambda}_{d})$ , where $\mathbf{\Lambda}_{d}=\operatorname{diag}(\mathbf{\Lambda}+\sigma^{2}\mathbf{I}_{K},\sigma^{2}\mathbf{I}_{d-K})$ . We let $\mathbf{\Gamma}_{S}=(\mathbf{u}_{1},\ldots,\mathbf{u}_{K+1})$ be the stacking of eigenvectors for the matrix $\bm{\Sigma}_{S}$ , and let $\widetilde{\sigma}_{1}\geq\widetilde{\sigma}_{2}\geq\ldots\geq\widetilde{\sigma}_{K+1}$ be the $K+1$ eigenvalues of $\bm{\Sigma}_{S}$ . Correspondingly, let $\widehat{\sigma}_{1}\geq\ldots\geq\widehat{\sigma}_{K+1}=\widehat{\sigma}^{2}$ be the eigenvalues of the sample covariance matrix $\widehat{\bm{\Sigma}}_{S}$ . Since $\bm{\Sigma}_{S}=(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}+\sigma^{2}\mathbf{I}_{K+1}$ , we know that $\widetilde{\sigma}_{K+1}=\sigma^{2}$ and $\delta=\widetilde{\sigma}_{K}-\widetilde{\sigma}_{K+1}\geq\Delta\sigma_{\min}^{2}\big{(}(\mathbf{V})_{[S,:]}\big{)}$ . We define $\widetilde{\mathbf{c}}=(\mathbf{V}^{\perp}_{[S,:]})^{\top}\mathbf{u}_{K+1}$ , and denote $\widetilde{{\mathbf{c}}}_{0}=(\mathbf{0},\mathbf{I}_{d-K})^{\top}\widetilde{{\mathbf{c}}}\in\mathbb{R}^{d}$ . Then by the proof of Lemma 6.2 in Wang and Fan, [41], we know that

\widehat{\sigma}^{2}-\sigma^{2}=\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}+\frac{1}{n}O_{P}\left(M_{K+1}-\sigma^{2}W_{K+1}\right),

where $M_{K+1}=\sum_{k\leq K}f_{k}^{2}\left(\widetilde{\sigma}_{k}+(\widehat{\sigma}_{k}-\widetilde{\sigma}_{k})\right),W_{K+1}=\sum_{k\leq K}f_{k}^{2}$ and $f_{k}$ is the $(K+1)$ -th element of the $k$ -th eigenvector of $\mathbf{\Gamma}_{S}^{\top}\widehat{\bm{\Sigma}}_{S}\mathbf{\Gamma}_{S}$ multiplied by $\sqrt{n}$ for $k\leq K$ . We let $\mathbf{f}=(f_{1},\ldots,f_{K})^{\top}/\sqrt{n}$ . By Wedin’s Theorem [42] and Lemma 3 in Fan et al., [18], we have that with probability at least $1-d^{-10}$ , $|\widehat{\sigma}_{k}-\widetilde{\sigma}_{k}|\leq\|\widehat{\bm{\Sigma}}_{S}-\bm{\Sigma}_{S}\|_{2}\lesssim\widetilde{\sigma}_{1}\log d\sqrt{\frac{K}{n}}$ for $k\leq K$ . If we denote by $\mathbf{F}_{S}:=(\mathbf{I}_{K},\mathbf{0})^{\top}$ the stacked top $K$ eigenvectors of $\mathbf{\Gamma}_{S}^{\top}\bm{\Sigma}_{S}\mathbf{\Gamma}_{S}$ , and by $\widehat{\mathbf{F}}_{S}$ the stacked top $K$ eigenvectors of $\mathbf{\Gamma}_{S}^{\top}\widehat{\bm{\Sigma}}_{S}\mathbf{\Gamma}_{S}$ , then we know that $\mathbf{f}$ is the $(K+1)$ -th row of $\widehat{\mathbf{F}}_{S}$ . By Davis-Kahan’s Theorem [45], we also know that there exists an orthonormal matrix $\mathbf{O}_{S}\in\mathbb{R}^{K\times K}$ such that $\|\mathbf{f}\|_{2}=\|\mathbf{O}_{S}^{\top}\mathbf{f}-\mathbf{0}\|_{2}\leq\|\widehat{\mathbf{F}}_{S}\mathbf{O}_{S}-\mathbf{F}_{S}\|_{2}\lesssim\frac{\widetilde{\sigma}_{1}\log d}{\delta}\sqrt{\frac{K}{n}}$ , and thus

W_{K+1}=\sum_{k\leq K}f_{k}^{2}=n\|\mathbf{f}\|_{2}^{2}\lesssim\frac{\widetilde{\sigma}_{1}^{2}K(\log d)^{2}}{\delta^{2}},

\text{and}\quad M_{K+1}\leq\widetilde{\sigma}_{1}\sum_{k\leq K}f_{k}^{2}+(\sum_{k\leq K}f_{k}^{2})\|\widehat{\bm{\Sigma}}_{S}-\bm{\Sigma}_{S}\|_{2}\lesssim\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}}(\log d)^{2}.

Thus we can write $\widehat{\sigma}^{2}-\sigma^{2}=\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}+O_{P}\big{(}\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}\big{)}$ .

Now we take $\mathbf{E}_{0}=\widehat{\mathbf{\Sigma}}-\mathbf{\Sigma}-(\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0})\mathbf{I}_{d}$ , and from previous results we know that with high probability $\|\mathbf{E}_{b}\|_{2}=\|\mathbf{E}-\mathbf{E}_{0}\|_{2}\lesssim\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}$ , such that we have $r_{2}(d)\asymp\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}=o\left(r_{1}(d)\right)$ and Assumption 2 is satisfied.

Now we move on to study the statistical rate $\eta_{2}(d)$ . For any $j\in[d]$ , we first study the covariance of $\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}$ . We denote $\widetilde{\mathbf{E}}=\bm{Z}_{1}\bm{Z}_{1}^{\top}-\mathbf{\Lambda}_{d}$ , then it’s not hard to verify that $\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}_{st},\widetilde{\mathbf{E}}_{gh})=\lambda_{s}(\bm{\Sigma})\lambda_{t}(\bm{\Sigma})(\mathbb{I}\{s=g,t=h\}+\mathbb{I}\{s=h,t=g\})$ . Since $\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}$ and $\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}$ share the same eigenvalues, we can study the covariance of $\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}$ instead. Then $\operatorname{Cov}(\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ can be calculated as following

	$\displaystyle\operatorname{Cov}\Big{\{}\mathbf{V}_{d}^{\top}\big{(}\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}-\mathbf{\Sigma}\big{)}\mathbf{V}^{\perp}(\mathbf{V}^{\perp})^{\top}\mathbf{e}_{j}-\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}\big{)}\mathbf{V}_{d}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{\}}$
	$\displaystyle=\operatorname{Cov}\Big{\{}\mathbf{V}_{d}^{\top}\big{(}\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}-\mathbf{\Sigma}\big{)}\mathbf{V}_{d}(\mathbf{0},\mathbf{I}_{d-K})^{\top}\widetilde{\mathbf{e}}-\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}\big{)}\widetilde{\mathbf{e}}_{0}\Big{\}}$
	$\displaystyle=\operatorname{Cov}\Big{\{}\big{(}\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d}\big{)}\widetilde{\mathbf{e}}_{0}-\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}\big{)}\widetilde{\mathbf{e}}_{0}\Big{\}},$

where $\widetilde{\mathbf{e}}=(\mathbf{V}^{\perp})^{\top}\mathbf{e}_{j}$ and $\widetilde{\mathbf{e}}_{0}=(\mathbf{0},\mathbf{I}_{d-K})^{\top}\widetilde{\mathbf{e}}$ . Then we have

	$\displaystyle\operatorname{Cov}(\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})=\frac{1}{n}\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0}-\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0}\widetilde{\mathbf{e}}_{0})$
	$\displaystyle\quad=\frac{1}{n}\Big{\{}\operatorname{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0})+\operatorname{Var}\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0}\big{)}\widetilde{\mathbf{e}}_{0}\widetilde{\mathbf{e}}_{0}^{\top}-\operatorname{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0},\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0})\widetilde{\mathbf{e}}_{0}^{\top}-\widetilde{\mathbf{e}}_{0}\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0},\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0})^{\top}\Big{\}}$
	$\displaystyle\quad=\frac{1}{n}\{\\|\widetilde{\mathbf{e}}_{0}\\|_{2}^{2}\sigma^{2}\mathbf{\Lambda}_{d}+3\sigma^{4}\widetilde{\mathbf{e}}_{0}\widetilde{\mathbf{e}}_{0}^{\top}-2\sigma^{4}\langle\widetilde{\mathbf{c}},\widetilde{\mathbf{e}}\rangle(\widetilde{\mathbf{c}}_{0}\widetilde{\mathbf{e}}_{0}^{\top}+\widetilde{\mathbf{e}}_{0}\widetilde{\mathbf{c}}_{0}^{\top})\}.$

Thus it can be seen that the covariance matrix is block-diagonal:

\operatorname{Cov}(\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})=\frac{1}{n}\begin{pmatrix}\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\sigma^{2}(\mathbf{\Lambda}+\sigma^{2}\mathbf{I}_{K})&\mathbf{0}\\ \mathbf{0}&\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\sigma^{4}(\mathbf{I}_{d-K}+3\mathbf{\tau}_{1}\mathbf{\tau}_{1}^{\top}-2\rho\widetilde{\mathbf{c}}\mathbf{\tau}_{1}^{\top}-2\rho\mathbf{\tau}_{1}\widetilde{\mathbf{c}}^{\top})\end{pmatrix},

where $\mathbf{\tau}_{1}=\widetilde{\mathbf{e}}/\|\widetilde{\mathbf{e}}\|_{2}$ and $\rho=\langle\widetilde{{\mathbf{c}}},\mathbf{\tau}_{1}\rangle$ . Then following basic algebra, we can write $\operatorname{Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ as:

\frac{1}{n}\!\!\left\{\!\sigma^{2}\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\bm{\Sigma}\!+\!3\sigma^{4}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\!\!-\!2\sigma^{4}\rho\|\widetilde{\mathbf{e}}_{0}\|_{2}\big{[}\!(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\!\mathbf{P}_{\perp}\!\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\!\top}\!(\mathbf{P}_{\perp})_{[S,:]}\big{]}\!\!\right\}.

To study $\eta_{2}(d)$ , we will first define $\bm{\Sigma}_{j}^{\prime}$ as following

\bm{\Sigma}_{j}^{\prime}=\frac{1}{nL^{2}}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{\{}\sigma^{2}\bm{\Sigma}+3\sigma^{4}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}-2\sigma^{4}\rho\|\widetilde{\mathbf{e}}_{0}\|_{2}\big{(}(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}+\mathbf{e}_{j}\mathbf{u}_{K+1}^{\top}(\mathbf{I}_{d})_{[S,:]}\big{)}\Big{\}}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}.

We know that $\|\widetilde{\mathbf{e}}_{0}\|^{2}_{2}=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}=1-O(\mu K/d)$ , thus we have

	$\displaystyle\Big{\\|}\sigma^{2}\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2}\bm{\Sigma}-\sigma^{2}\bm{\Sigma}\Big{\\|}_{2}\leq O\big{(}\frac{\mu K\sigma^{2}}{d}(\sigma^{2}+\lambda_{1})\big{)},$
	$\displaystyle\\|3\sigma^{4}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-3\sigma^{4}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\\|_{2}\leq 3\sigma^{4}\\|(\mathbf{P}_{\perp}\mathbf{e}_{j}-\mathbf{e}_{j})\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\\|_{2}+3\sigma^{4}\\|\mathbf{e}_{j}(\mathbf{P}_{\perp}\mathbf{e}_{j}-\mathbf{e}_{j})^{\top}\\|_{2}\lesssim\sigma^{4}\sqrt{\frac{\mu K}{d}},$
	$\displaystyle\\|(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\\|_{2}\leq\\|[(\mathbf{P}_{\perp})_{[:,S]}-(\mathbf{I}_{d})_{[:,S]}]\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\\|_{2}$
	$\displaystyle\quad+\\|(\mathbf{I}_{d})_{[:,S]}]\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}(\mathbf{P}_{\perp}-\mathbf{I}_{d})\\|_{2}\lesssim K\sqrt{\frac{\mu}{d}}+\sqrt{\frac{\mu K}{d}}\lesssim K\sqrt{\frac{\mu}{d}},$
	$\displaystyle 2\sigma^{4}\rho\\|\widetilde{\mathbf{e}}_{0}\\|_{2}\Big{\\|}\!\big{[}(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\!\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\top}\!(\mathbf{P}_{\perp})_{[S,:]}\big{]}\!\!-\!\big{[}\!(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\!\!\!+\!\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\top}\!(\mathbf{I}_{d})_{[S,:]}\big{]}\!\Big{\\|}_{2}$
	$\displaystyle\quad\lesssim K\sigma^{4}\sqrt{\frac{\mu}{d}},$

and in summary we have $\|\bm{\Sigma}_{j}-{\bm{\Sigma}}_{j}^{\prime}\|_{2}=O_{P}\Big{(}\frac{Kd\sigma^{4}}{n\Delta^{2}Lp}\sqrt{\frac{\mu}{d}}\Big{)}=O_{P}\Big{(}\frac{K\lambda_{1}^{2}}{\Delta^{2}}\sqrt{\frac{\mu}{d}}\Big{)}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}=o_{P}\big{(}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}\big{)}$ . Now we study $\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\|_{2}$ . Since the entries of $\sqrt{p}\mathbf{\Omega}$ are i.i.d. standard Gaussian, by Lemma 3 in Fan et al., [18], we know that with probability $1-O(d^{-9})$ , we have

\|\mathbf{\Omega}\|_{2,\infty}\lesssim\sqrt{L},\quad\text{and}\quad\|\mathbf{\Omega}_{[S,:]}\|_{2}\lesssim\sqrt{L}.

Therefore, under the condition that $\frac{\lambda_{1}^{2}Lp}{\Delta^{2}d}=o(1)$ we have

	$\displaystyle\\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\\|_{2}$	$\displaystyle=\sigma^{4}\Big{\\|}\frac{1}{nL^{2}}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{(}3\mathbf{e}_{j}\mathbf{e}_{j}^{\top}-2\rho\big{(}(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}+\mathbf{e}_{j}\mathbf{u}_{K+1}^{\top}(\mathbf{I}_{d})_{[S,:]}\big{)}\Big{)}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}\Big{\\|}_{2}$
		$\displaystyle\lesssim\frac{\sigma^{4}}{nL^{2}}\\|\mathbf{B}_{\mathbf{\Omega}}\\|_{2}^{2}\\|\mathbf{\Omega}\\|_{2,\infty}\big{(}\\|\mathbf{\Omega}\\|_{2,\infty}+\\|\mathbf{\Omega}_{[S,:]}\\|_{2}\big{)}=O_{P}\big{(}\frac{\sigma^{4}}{n\Delta^{2}}\big{)}=o_{P}(\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}).$

As for $\widetilde{\bm{\Sigma}}_{j}$ , by Lemma 3 in Fan et al., [18] with high probability we have that $\sigma_{K}(\bm{\Omega}^{\top}\mathbf{V})\gtrsim\sqrt{L}$ and in turn

\displaystyle\sigma_{K}(\widetilde{\bm{\Sigma}}_{j})

\displaystyle\gtrsim\frac{\sigma^{2}}{nL^{2}}\big{(}\sigma_{K}(\mathbf{B}_{\bm{\Omega}})\big{)}^{2}\left(\big{(}\sigma_{K}(\bm{\Omega}^{\top}\mathbf{V}\big{)}^{2}\Delta+\big{(}\sigma_{K}(\bm{\Omega}\big{)}^{2}\sigma^{2}\right)\gtrsim\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+\frac{\sigma^{2}\Delta}{n\lambda_{1}^{2}}.

Therefore, combining the previous results, we have that by Weyl’s inequality [19], with high probability

	$\displaystyle\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}$	$\displaystyle\geq\lambda_{K}\big{(}\widetilde{\bm{\Sigma}}_{j}\big{)}-\\|\bm{\Sigma}_{j}-\bm{\Sigma}_{j}^{\prime}\\|_{2}-\\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\\|_{2}$
		$\displaystyle\gtrsim\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+\frac{\sigma^{2}\Delta}{n\lambda_{1}^{2}}-o(\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}})\gtrsim\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+\frac{\sigma^{2}\Delta}{n\lambda_{1}^{2}}.$

Thus we know $\eta_{2}(d)\asymp d\sigma^{4}/(nLp\lambda_{1}^{2})+\sigma^{2}\Delta/(n\lambda_{1}^{2})$ .

Recall from the proof of Corollary 4.2 with probability $1-O(d^{-10})$ we have $\|\mathbf{E}_{0}\|_{2}\lesssim(\lambda_{1}+\sigma^{2})\log d\sqrt{\frac{r}{n}}$ . Also recall that $r_{2}(d)\asymp\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}$ . Therefore, under the condition that

n\gg\frac{\kappa_{1}^{4}\lambda_{1}dr^{2}(\log d)^{4}}{p\sigma^{2}}\left(\kappa_{1}\frac{d}{p}\wedge\frac{\lambda_{1}}{\sigma^{2}}L\right)\quad\text{and}\quad\frac{\widetilde{\sigma}_{1}^{6}K^{2}}{\delta^{4}\sigma^{4}n}(\log d)^{4}\ll(\frac{\Delta}{\lambda_{1}})^{2},

we have ${d^{2}r_{1}(d)^{4}(\log d)^{4}}\big{(}{p^{2}\Delta^{4}\eta_{2}(d)}\big{)}^{-1}=o(1)$ and ${dr_{2}(d)^{2}}\big{(}{Lp\Delta^{2}\eta_{2}(d)}\big{)}^{-1}=o(1)$ .

Now we need to verify Assumption 5. It can be seen that the randomness of the leading term comes from $\bm{\Omega}$ and $\mathbf{E}_{0}$ both. We will first establish the results conditional on $\bm{\Omega}$ . In fact, we will first show a more general CLT that will also cover the case of the leading term under the regime $Lp\gg d$ . More specifically, we will show that for any matrix $\mathbf{A}\in\mathbb{R}^{d\times K}$ that satisfies the following two conditions: (1) $\sigma_{\max}(\mathbf{A})/\sigma_{\min}(\mathbf{A})\leq C|\lambda_{1}|/\Delta$ ; (2) $\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq cn^{-1}\sigma^{4}\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}$ , where $C,c>0$ are fixed constants irrelevant to $\mathbf{A}$ and we abuse the notation by denoting $\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ , it holds that

\mathbf{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).

(B.31)

Now for any matrix $\mathbf{A}\in\mathbb{R}^{d\times K}$ satisfying the aforementioned conditions, to show that $\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}$ is asymptotically normal, we only need to show that ${\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1)$ for any ${\mathbf{a}}\in\mathbb{R}^{K}$ with $\|{\mathbf{a}}\|_{2}=1$ . We can write

	$\displaystyle{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}=\frac{1}{n}\sum_{i=1}^{n}{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\{\bm{X}_{i}\bm{X}_{i}^{\top}-\bm{\Sigma}-\widetilde{\mathbf{c}}_{0}^{\top}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{\mathbf{c}}_{0}\mathbf{I}_{d}\}\mathbf{P}_{\perp}\mathbf{e}_{j}$
	$\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\Big{\{}{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\bm{X}_{i}\bm{X}_{i}^{\top}-\bm{\Sigma})\mathbf{P}_{\perp}\mathbf{e}_{j}-\widetilde{\mathbf{c}}_{0}^{\top}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{\mathbf{c}}_{0}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j})\Big{\}}.$

We let $x_{i}={\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\bm{X}_{i}\bm{X}_{i}^{\top}-\bm{\Sigma})\mathbf{P}_{\perp}\mathbf{e}_{j}$ and $y_{i}=\widetilde{\mathbf{c}}_{0}^{\top}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{\mathbf{c}}_{0}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j})$ . For $\bm{\Sigma}_{j}$ , we have that $\|\bm{\Sigma}_{j}^{-1/2}\|_{2}\leq\sigma_{\min}(\bm{\Sigma}_{j})^{-1/2}\leq\sqrt{n}/\big{(}\sigma^{2}\sigma_{\min}(\mathbf{A})\big{)}$ . Then we have

	$\displaystyle{\mathbb{E}}\|x_{i}\|^{3}\lesssim{\mathbb{E}}\|{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\bm{X}_{i}\bm{X}_{i}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j}\|^{3}\leq\sqrt{{\mathbb{E}}\|{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\bm{X}_{i}\|^{6}{\mathbb{E}}\|\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\bm{X}_{i}\|^{6}}$
	$\displaystyle\quad\lesssim\\|\bm{\Sigma}_{j}^{-1/2}\\|_{2}^{3}\sqrt{(\lambda_{1}+\sigma^{2})^{3}\sigma^{6}\\|\mathbf{A}\\|_{2}^{6}},$
	$\displaystyle{\mathbb{E}}\|y_{i}\|^{3}\lesssim({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j})^{3}{\mathbb{E}}\|\widetilde{\mathbf{c}}_{0}^{\top}\bm{Z}_{i}\bm{Z}_{i}^{\top}\widetilde{\mathbf{c}}_{0}\|^{3}\leq\\|\bm{\Sigma}_{j}^{-1/2}\mathbf{A}\\|_{2}^{3}{\mathbb{E}}\|\widetilde{{\mathbf{c}}}_{0}^{\top}\bm{Z}_{i}\|^{6}$
	$\displaystyle\quad\lesssim\\|\bm{\Sigma}_{j}^{-1/2}\\|_{2}^{3}(\lambda_{1}+\sigma^{2})^{3}\\|\mathbf{A}\\|_{2}^{3},$
	$\displaystyle{\mathbb{E}}\|x_{i}-y_{i}\|^{3}\lesssim{\mathbb{E}}\|x_{i}\|^{3}+{\mathbb{E}}\|y_{i}\|^{3}\lesssim\\|\bm{\Sigma}_{j}^{-1/2}\\|_{2}^{3}\Big{(}\sqrt{(\lambda_{1}+\sigma^{2})^{3}\sigma^{6}\\|\mathbf{A}\\|_{2}^{6}}+(\lambda_{1}+\sigma^{2})^{3}\\|\mathbf{A}\\|_{2}^{3}\Big{)}$
	$\displaystyle\lesssim n^{3/2}(\lambda_{1}+\sigma^{2})^{3}\\|\mathbf{A}\\|_{2}^{3}/\big{(}\sigma^{2}\sigma_{\min}(\mathbf{A})\big{)}^{3}.$

Thus

\frac{\sum_{i=1}^{n}{\mathbb{E}}|x_{i}-y_{i}|^{3}}{\operatorname{Var}\Big{\{}\sum_{i=1}^{n}(x_{i}-y_{i})\Big{\}}^{3/2}}\lesssim\frac{n(\lambda_{1}+\sigma^{2})^{3}\|\mathbf{A}\|_{2}^{3}}{n^{3/2}\sigma^{6}\sigma_{\min}(\mathbf{A})^{3}}\lesssim\frac{(\lambda_{1}+\sigma^{2})^{3}\lambda_{1}^{3}}{\sqrt{n}\sigma^{6}\Delta^{3}}=o(1).

Thus the Lyapunov’s condition is met and (B.31) holds. Then we take $\mathbf{A}=\bm{\Omega}\mathbf{B}_{\bm{\Omega}}$ , and define the following event

	$\displaystyle\mathcal{A}_{\bm{\Omega}}$	$\displaystyle=\bigg{\{}1/2\leq\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq\sigma_{\max}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq 3/2,\quad\forall\ell\in[L]\bigg{\}}$
		$\displaystyle\quad\cap\bigg{\{}\frac{1}{2}\sqrt{\frac{d}{p}}\leq\sigma_{\min}(\mathbf{\Omega})\leq\sigma_{\max}(\mathbf{\Omega})\leq\frac{3}{2}\sqrt{\frac{d}{p}},\quad\forall\ell\in[L]\bigg{\}}.$

Then from previous results we know that $\mathbb{P}((\mathcal{A}_{\bm{\Omega}})^{c})=o(1)$ , and under the event $\mathcal{A}_{\bm{\Omega}}$ we have

\frac{\sigma_{\max}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}{\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}\leq 9\lambda_{1}/\Delta,\quad\lambda_{K}(\bm{\Sigma}_{j})\geq\frac{\sigma^{4}}{2n}\big{(}\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})\big{)}^{2}.

Thus from the above proof, for any vector $\mathbf{t}\in\mathbb{R}^{K}$ , we have $\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{A}_{\bm{\Omega}}\big{)}-\Phi(\mathbf{t})=o(1)$ , where $\Phi(\cdot)$ is the CDF for ${\mathcal{N}}(0,\mathbf{I}_{K})$ . Then we have

	$\displaystyle\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\big{)}={\mathbb{E}}\Big{(}\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\|\mathbf{\Omega}\big{)}\Big{)}$
	$\displaystyle\quad=\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\|\mathbf{\Omega}\in\mathcal{A}_{\bm{\Omega}}\big{)}\mathbb{P}(\mathcal{A}_{\bm{\Omega}})+\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\|\mathbf{\Omega}\in\mathcal{A}_{\bm{\Omega}}^{c}\big{)}\mathbb{P}(\mathcal{A}_{\bm{\Omega}}^{c})$
	$\displaystyle\quad=\big{(}\Phi(\mathbf{t})+o(1)\big{)}\big{(}1-o(1)\big{)}+o(1)=\Phi(\mathbf{t})+o(1).$

Hence we have that Assumption 5 holds and (18) follows. Next we need to show that the result also holds for $\widetilde{\bm{\Sigma}}_{j}$ . From previous discussion we already know that $\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ , then by Lemma 13 in Chen et al., [13] we have that $\|\widetilde{\bm{\Sigma}}_{j}^{-1/2}\bm{\Sigma}_{j}^{1/2}-\mathbf{I}_{d}\|_{2}=O_{P}\big{(}\|\widetilde{\bm{\Sigma}}_{j}^{-1/2}\|_{2}\|\bm{\Sigma}_{j}^{1/2}-\widetilde{\bm{\Sigma}}_{j}^{1/2}\|_{2}\big{)}=O_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})^{-1}\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}\big{)}=o_{P}(1)$ . Then by Slutsky’s Theorem, we have

\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=(\widetilde{\bm{\Sigma}}_{j}^{-1/2}\bm{\Sigma}_{j}^{1/2})\bm{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,\mathbf{I}_{K}).

Finally, we move on to verify the validity of the estimator $\widehat{\bm{\Sigma}}_{j}$ for the asymptotic covariance matrix. From Lemma 7 in Fan et al., [18], it can be seen that with probability $1-o(1)$ , $\mathbf{H}$ is orthonormal. When $\mathbf{H}$ is orthonormal, by Slutsky’s Theorem we have that

\mathbf{H}\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=\mathbf{H}\widetilde{\bm{\Sigma}}_{j}^{-1/2}\mathbf{H}^{\top}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),

where it can be seen that $\mathbf{H}\widetilde{\bm{\Sigma}}_{j}^{-1/2}\mathbf{H}^{\top}=(\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top})^{-1/2}$ . Therefore, it suffices to show that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ , and the results will hold by Slutsky’s Theorem. Recall from the proof of Corollary 4.2, we have the following bounds

\|\bm{\Sigma}-\widehat{\bm{\Sigma}}\|_{2}=O_{P}\Big{(}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)},\quad|\widehat{\sigma}^{2}-\sigma^{2}|=O_{P}(\widetilde{\sigma}_{1}\sqrt{\frac{K}{n}}),

We will bound the components of $\|\widehat{\bm{\Sigma}}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}$ respectively. We have

	$\displaystyle\\|\sigma^{2}\bm{\Sigma}-\widehat{\sigma}^{2}\widehat{\bm{\Sigma}}\\|_{2}\lesssim\|\widehat{\sigma}^{2}-\sigma^{2}\|\\|\bm{\Sigma}\\|_{2}+\sigma^{2}\\|\bm{\Sigma}-\widehat{\bm{\Sigma}}\\|_{2}=O_{P}\Big{(}\widetilde{\sigma}_{1}(\lambda_{1}+\sigma^{2})\sqrt{\frac{K}{n}}\Big{)}$
	$\displaystyle\quad+O_{P}\Big{(}\sigma^{2}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}=O_{P}\Big{(}\sigma^{2}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)},$

Also, from proof of Theorem 4.10, we have that with high probability

\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}\lesssim\|\mathbf{E}_{0}\|_{2}\|\mathbf{\Omega}\|_{2}\|\mathbf{B}_{\mathbf{\Omega}}\|_{2}/L=O_{P}(\kappa_{1}\sqrt{\frac{dr}{npL}}),

and $\|\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\|_{2}=O_{P}\Big{(}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}$ , where $\widehat{\bm{\Sigma}}^{\text{tr}}=\widehat{\bm{\Sigma}}-\widehat{\sigma}^{2}\mathbf{I}_{d}$ .Then with high probability, for all $\ell\in[L]$ we have that

	$\displaystyle\big{\\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top})\mathbf{\Omega}^{(\ell)}/\sqrt{p}\big{\\|}_{2}\lesssim\sqrt{\frac{d}{p}}\big{(}\\|\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\\|_{2}+\lambda_{1}\\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\\|_{2}\big{)}$
	$\displaystyle=O_{P}\left(\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}L}}\right)=o_{P}(\Delta),$

and thus by Theorem 3.3 in Stewart, [38], with high probability for all $\ell\in[L]$ we have that

	$\displaystyle\\|\widehat{\mathbf{B}}^{(\ell)}-\mathbf{B}^{(\ell)}\mathbf{H}^{\top}\\|_{2}=\big{\\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\bm{\Sigma}}^{\text{tr}}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}-(\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}\big{\\|}_{2}$
	$\displaystyle=O_{P}\bigg{(}\Delta^{-2}\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}L}}\bigg{)},$

and in turn we have $\|\widehat{\mathbf{B}}_{\mathbf{\Omega}}-\mathbf{B}_{\mathbf{\Omega}}\mathbf{H}^{\top}\|_{2}=O_{P}\bigg{(}\Delta^{-2}\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}L}}\bigg{)}\sqrt{L}=O_{P}\bigg{(}\Delta^{-2}\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}}}\bigg{)}$ .

Thus combining the above results, under the condition that $\frac{\lambda_{1}\kappa_{1}^{4}}{\sigma^{2}}\sqrt{\frac{d^{2}r}{np^{2}L}}=o(1)$ , following basic algebra we have

	$\displaystyle\\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\\|_{2}\!\lesssim O_{P}\Big{(}\sigma^{2}(\lambda_{1}\!+\!\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}\frac{d}{nLp\Delta^{2}}\!+\!O_{P}\bigg{(}\frac{d\sqrt{L}}{nL^{2}p\Delta^{3}}\sigma^{2}(\sigma^{2}\!+\!\lambda_{1})\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}}}\bigg{)}$
	$\displaystyle=O_{P}\Big{(}\frac{\lambda_{1}^{2}}{\Delta^{2}\sigma^{2}}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+O_{P}\Big{(}\frac{\lambda_{1}\kappa_{1}^{4}}{\sigma^{2}}\sqrt{\frac{d^{2}r}{np^{2}L}}\Big{)}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.$

Therefore, by Slutsky’s Theorem, under the event $\mathcal{B}:=\{\mathbf{H}\text{ is orthonormal}\}$ , for any vector $\mathbf{t}\in\mathbb{R}^{K}$ , we have that $\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{B})-\Phi(\mathbf{t})=o(1)$ , and thus

	$\displaystyle\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t})=\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\|\mathcal{B})\mathbb{P}(\mathcal{B})$
	$\displaystyle\quad+\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\|\mathcal{B}^{c})\mathbb{P}(\mathcal{B}^{c})$
	$\displaystyle=\mathbb{P}\big{(}\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\|\mathcal{B}\big{)}\big{(}1-o(1)\big{)}+o(1)=\Phi(\mathbf{t})+o(1).$

Hence the claim follows.

B.8 Proof of Corollary 4.12

We will verify that Assumptions 1, 2, 3 and 5 hold. First, it is not hard to see that there exists some orthonormal matrix $\mathbf{O}\in\mathbb{R}^{K\times K}$ such that $\mathbf{V}=\mathbf{F}\mathbf{C}^{-1}\mathbf{O}$ , where $\mathbf{C}=\operatorname{diag}(\sqrt{d_{1}},\ldots,\sqrt{d_{K}})$ . From the problem setting of Example 3 we also know that there exists a constant $C>0$ such that

C^{-1}K\max_{k}d_{k}\leq K\min_{k}d_{k}\leq d\leq K\max_{k}d_{k},\quad d_{1}\asymp\ldots\asymp d_{K}\asymp d/K,

and thus that $\sqrt{d/K}\lesssim\sigma_{K}(\mathbf{C})\leq\|\mathbf{C}\|_{2}\lesssim\sqrt{d/K}$ . Then $\|\mathbf{V}\|_{2,\infty}\lesssim\sqrt{\frac{K}{d}}\|\mathbf{F}\|_{2,\infty}=\sqrt{\frac{K}{d}}$ . Thus Assumption 3 holds with $\mu=O(1)$ .

From the proof of Corollary 4.2 we know that Assumption 1 is satisfied. Besides, recall from Remark 16, under the condition that $\sqrt{K/d}\log d=O(1)$ , with probability at least $1-d^{-10}$ we have that $\|\mathbf{E}\|_{2}\lesssim d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d:=r_{1}^{\prime}(d)$ , which is sharper than $r_{1}(d)\log d$ . Since $\mathbf{E}_{b}=0$ , we have $r_{2}(d)=0$ and Assumption 2 holds trivially. Now we move on to study the minimum covariance eigenvalue rate $\eta_{2}(d)$ . From the proof of Corollary 4.2, we know that

\mathbf{E}=\mathbf{E}_{0}=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}+\mathbf{Z}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+\mathbf{Z}^{\top}\mathbf{Z}-n\mathbf{I}_{d}=\sum_{i=1}^{n}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}},

where $\mathbf{Q}_{i}=\mathbf{F}\mathbf{\Theta}_{i.}\in\mathbb{R}^{d}$ with $\mathbf{\Theta}_{i.}$ being the $i$ -th row of $\mathbf{\Theta}$ , $\mathbf{Z}_{i.}$ is the $i$ -th row of $\mathbf{Z}$ and $\mathbf{Z}_{i.}\overset{\text{i.i.d}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{d})$ . Then for $j\in[d]$ , we have

	$\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$	$\displaystyle=\operatorname*{\rm Cov}\Big{(}\sum_{i=1}^{n}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}$
		$\displaystyle=\sum_{i=1}^{n}\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}$
		$\displaystyle=\sum_{i=1}^{n}\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)},$

where the last equality is due to the fact that $\mathbf{P}_{\perp}\mathbf{Q}_{i}=\mathbf{P}_{\perp}\mathbf{F}\mathbf{\Theta}_{i.}=\mathbf{0}$ . Now for $i\in[n]$ , we calculate $\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}$ . Following basic algebra, we have that

	$\displaystyle\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}\!+\!\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}\!-\!\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}$
	$\displaystyle=\!{\mathbb{E}}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}\!+\!\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\big{\{}\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}\!+\!\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}\big{\}}\Big{)}\!-\!\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}$
	$\displaystyle=\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2}(\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}+\mathbf{I}_{d})+2\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}$
	$\displaystyle=\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2}(\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}+\mathbf{I}_{d})+\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp},$

and thus

	$\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$	$\displaystyle=\sum_{i=1}^{n}\left(\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2}(\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}\!+\!\mathbf{I}_{d})\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\right)$
		$\displaystyle=\!\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2}(\sum_{i=1}^{n}\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}\!+\!n\mathbf{I}_{d})\!+\!n\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}$
		$\displaystyle=\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2}(\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d})+\!n\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}.$

Then since $\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}=1-K/d=1-o(1)$ , we have that $\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\gtrsim n$ , and hence we have $\eta_{2}(d)\asymp dn/(\lambda_{1}^{2}Lp)$ . Then under the condition that $n\gg d^{3}L/p$ and $\Delta_{0}^{2}\gg K(\log d)^{2}\sqrt{dnL/p}$ , we have that

\frac{r_{1}^{\prime}(d)^{4}}{\eta_{2}(d)}\lesssim\frac{\lambda_{1}^{2}Lp}{d}\left(\frac{d^{4}\Delta_{0}^{4}}{K^{2}n}+d^{2}n(\log d)^{4}\right)\ll\frac{\lambda_{1}^{2}p^{2}d^{2}\Delta_{0}^{4}}{K^{2}d^{2}}\asymp\frac{p^{2}\Delta^{4}}{d^{2}},\quad\frac{d^{2}r_{1}^{\prime}(d)^{4}}{p^{2}\Delta^{4}\eta_{2}(d)}=o(1).

Now we move on to check Assumption 5. Similar as in the proof of Corollary 4.11, we will first show the results conditional on $\bm{\Omega}$ by establishing a more general CLT . More specifically, we will show that for any ${\mathbf{a}}\in\mathbb{R}^{K}$ with $\|{\mathbf{a}}\|_{2}=1$ , and $\mathbf{A}\in\mathbb{R}^{d\times K}$ such that $\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq cn\sigma_{\min}(\mathbf{A})^{2}$ and $\sigma_{\max}(\mathbf{A})/\sigma_{\min}(\mathbf{A})\leq C$ , where $C,c>0$ are constants irrelevant to $\mathbf{A}$ and we abuse the notation by denoting $\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ , we have ${\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1)$ . Define $\mathbf{Q}=\mathbf{\Theta}\mathbf{F}^{\top}$ . We know that

\|\mathbf{Q}\|_{2,\infty}=\max_{i\in[n]}\|\mathbf{Q}_{i}\|_{2}\leq\|\mathbf{F}\|_{2}\|\mathbf{\Theta}\|_{2,\infty}\lesssim\mu_{\theta}\Delta_{0}\sqrt{\frac{d}{n}},\,\text{and}

{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}=\sum_{i=1}^{n}\big{\{}{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d})\mathbf{P}_{\perp}\mathbf{e}_{j}\big{\}},

and we denote

x_{i}={\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j},\quad y_{i}={\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d})\mathbf{P}_{\perp}\mathbf{e}_{j}.

Then we have

	$\displaystyle{\mathbb{E}}\|x_{i}+y_{i}\|^{3}$	$\displaystyle\lesssim{\mathbb{E}}\|x_{i}\|^{3}+{\mathbb{E}}\|y_{i}\|^{3}\lesssim\\|\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{Q}_{i}\\|_{2}^{3}+\\|\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\\|_{2}^{3}$
		$\displaystyle\leq\\|\bm{\Sigma}_{j}^{-1/2}\\|_{2}^{3}\\|\mathbf{A}\\|_{2}^{3}(\\|\mathbf{Q}\\|_{2,\infty}^{3}+1)\lesssim n^{-3/2}\big{\{}\frac{\\|\mathbf{A}\\|_{2}}{\sigma_{\min}(\mathbf{A})}\big{\}}^{3}\big{\{}\mu_{\theta}^{3}\Delta_{0}^{3}\big{(}\frac{d}{n}\big{)}^{3/2}+1\big{\}}$
		$\displaystyle\lesssim n^{-3/2}\mu_{\theta}^{3}\Delta_{0}^{3}\big{(}\frac{d}{n}\big{)}^{3/2}+n^{-3/2}.$

Then

\displaystyle\frac{\sum_{i=1}^{n}{\mathbb{E}}|x_{i}+y_{i}|^{3}}{\operatorname{Var}\big{\{}\sum_{i=1}^{n}(x_{i}+y_{i})\big{\}}^{3/2}}

\displaystyle=\sum_{i=1}^{n}{\mathbb{E}}|x_{i}+y_{i}|^{3}\lesssim n^{-2}\mu_{\theta}^{3}\Delta_{0}^{3}d^{3/2}+n^{-1/2}.

Then under the condition that $\Delta_{0}^{2}\ll n^{4/3}/(\mu_{\theta}^{2}d)$ , we have that

n^{-2}\mu_{\theta}^{3}\Delta_{0}^{3}d^{3/2}=o(1)\quad\text{and}\quad\left({\sum_{i=1}^{n}{\mathbb{E}}|x_{i}+y_{i}|^{3}}\right){\operatorname{Var}\big{(}\sum_{i=1}^{n}(x_{i}+y_{i})\big{)}^{-3/2}}=o(1).

Thus the Lyapunov’s condition is met and the CLT holds. Also recall from previous arguments, there exists a fixed constant $C>0$ such that with high probability we have

\frac{\sigma_{\max}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}{\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}\leq 9\left(\frac{\sigma_{1}(\mathbf{\Theta})}{\sigma_{K}(\mathbf{\Theta})}\right)^{2}\leq C,\quad\lambda_{K}(\bm{\Sigma}_{j})\geq\frac{n}{2}\big{(}\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})\big{)}^{2}.

Then by taking $\mathbf{A}=\bm{\Omega}\mathbf{B}_{\bm{\Omega}}$ and following similar steps as in the proof of Corollary 4.11, we know that Assumption 5 is satisfied. Then by Theorem 4.10, (17) holds.

We move on to prove (20). It suffices to show that $\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ . When $\Delta_{0}^{2}\ll n$ and $K\ll d$ , with high probability we have

	$\displaystyle\\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\\|_{2}\lesssim\frac{d}{L\Delta^{2}p}\big{\{}(n+\Delta)(1\!-\!\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\\|_{2}^{2})+n\\|\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\\|_{2}\big{\}}$
	$\displaystyle\quad+\frac{n}{L\Delta^{2}}\\|\mathbf{\Omega}\\|_{2,\infty}^{2}\lesssim\frac{d}{L\Delta^{2}p}\Big{(}\frac{Kn}{d}+\Delta_{0}^{2}+n\sqrt{\frac{K}{d}}\Big{)}+\frac{n}{\Delta^{2}}=o(\frac{dn}{L\lambda_{1}^{2}p})=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.$

Thus (20) holds.

Last we verify the validity of $\widehat{\bm{\Sigma}}_{j}$ . Similar as in the proof of Corollary 4.11, it suffices to show that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ . Recall with high probability $\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}\lesssim r_{1}^{\prime}(d)=d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d$ .

Also, from the proof of Theorem 4.10, we have that

\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=\frac{1}{L}O_{P}(\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}\|\mathbf{\Omega}\|_{2}\|\mathbf{B}_{\mathbf{\Omega}}\|_{2})=O_{P}\Big{(}\sqrt{\frac{d}{pL}}\frac{r_{1}^{\prime}(d)}{\Delta}\Big{)},

Then with high probability, for all $\ell\in[L]$ we have that

	$\displaystyle\big{\\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}-\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top})\mathbf{\Omega}^{(\ell)}/\sqrt{p}\big{\\|}_{2}\lesssim\sqrt{\frac{d}{p}}\big{(}\\|\widehat{\mathbf{M}}-\mathbf{M}\\|_{2}+\lambda_{1}\\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\\|_{2}\big{)}$
	$\displaystyle=O_{P}\left(\sqrt{\frac{d^{2}}{p^{2}L}}r_{1}^{\prime}(d)\right)=o_{P}(\Delta),$

and thus by Theorem 3.3 in Stewart, [38], we have that

\displaystyle\|\widehat{\mathbf{B}}^{(\ell)}-\mathbf{B}^{(\ell)}\mathbf{H}^{\top}\|_{2}=\big{\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}-(\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}\big{\|}_{2}=O_{P}\left(\sqrt{\frac{d^{2}}{p^{2}L}}\frac{r_{1}^{\prime}(d)}{\Delta^{2}}\right),

and in turn we have $\|\widehat{\mathbf{B}}_{\mathbf{\Omega}}-\mathbf{B}_{\mathbf{\Omega}}\mathbf{H}^{\top}\|_{2}=O_{P}\left(\sqrt{\frac{d^{2}}{p^{2}L}}\frac{r_{1}^{\prime}(d)}{\Delta^{2}}\right)\sqrt{L}=O_{P}\left(\frac{dr_{1}^{\prime}(d)}{p\Delta^{2}}\right)$ .

Therefore, under the condition that $\Delta_{0}^{2}\ll{KLp^{2}n^{2}}/{d^{4}}$ , we have

	$\displaystyle\\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\\|_{2}\lesssim\frac{d}{L\Delta^{2}p}\\|\widehat{\mathbf{M}}-\mathbf{M}\\|_{2}+(n+\lambda_{1})\frac{d}{pL\Delta}O_{P}\left(\frac{d}{p\sqrt{L}}\frac{r_{1}^{\prime}(d)}{\Delta^{2}}\right)$
	$\displaystyle=o_{P}\big{(}\frac{dn}{L\Delta^{2}p}\big{)}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.$

Thus the claim follows.

B.9 Proof of Theorem 4.5

We will first decompose $\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}=(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0})+(\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}-\widehat{\mathbf{V}}\mathbf{H}_{0})+(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})$ . We will show that when $L$ is sufficiently large the first two terms are negligible, and we will consider the third term $\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}$ first. We will first study $\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}\|_{2,\infty}$ by conducting decomposition of the error term. For the convenience of notations, we let $\mathbf{P}=\mathbf{V}^{\top}\mathbf{V}$ for short. If we define $\widehat{\mathbf{H}}_{0}=\widehat{\mathbf{V}}^{\top}\mathbf{V}$ , we can decompose

	$\displaystyle\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}$
	$\displaystyle=\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\!-\!\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}\!+\!\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{H}}_{0}\!-\!\widehat{\mathbf{H}}_{0})\!+\!(\mathbf{P}\widehat{\mathbf{V}}\mathbf{H}_{0}\!-\!\mathbf{V}).$

Under the condition that $\|\mathbf{E}\|_{2}/\Delta=O_{P}\big{(}r_{1}(d)/\Delta\big{)}=o_{P}(1)$ , we have that $\mathbf{H}_{0}$ is a full-rank orthonormal matrix with probability $1-o(1)$ . Then we have with probability $1-o(1)$ that

	$\displaystyle\\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\\|_{2,\infty}=\\|(\mathbf{I}-\mathbf{V}\mathbf{V}^{\top})(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\\|_{2,\infty}$
	$\displaystyle\leq\\|(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\\|_{2,\infty}+\\|\mathbf{V}\mathbf{V}^{\top}(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\\|_{2,\infty}$
	$\displaystyle\leq\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty}\\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\\|_{2}+\\|\mathbf{V}\\|_{2,\infty}\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2}\\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\\|_{2}$
	$\displaystyle\lesssim\Big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}\frac{\\|\mathbf{E}\\|_{2}}{\Delta}\Big{)}\\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\\|_{2}.$

From Lemma 7 in Fan et al., [18], we know that $\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\|_{2}\lesssim\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2}\lesssim(\|\mathbf{E}\|_{2}/\Delta)^{2}=O_{P}(r_{1}(d)^{2}/\Delta^{2})$ , and thus we have

\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\|_{2,\infty}=O_{P}\left(\Big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}\frac{r_{1}(d)}{\Delta}\Big{)}r_{1}(d)^{2}/\Delta^{2}\right).

We move on to bound $\|\mathbf{P}\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty}$ ,

	$\displaystyle\\|\mathbf{P}\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty}$	$\displaystyle=\\|\mathbf{V}(\widehat{\mathbf{H}}_{0}^{\top}\mathbf{H}_{0}-\mathbf{I}_{K})\\|_{2,\infty}\leq\\|\mathbf{V}\\|_{2,\infty}\\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\\|_{2}$
		$\displaystyle=O_{P}\left(\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}\right).$

Finally, we consider the term $\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}$ . We can decompose

	$\displaystyle\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}=\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}\mathbf{\Lambda}^{-1}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}$
	$\displaystyle=\mathbf{P}_{\perp}\big{(}\mathbf{E}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{E}_{0}\mathbf{V}+\widehat{\mathbf{V}}(\mathbf{\Lambda}-\widehat{\mathbf{\Lambda}})\widehat{\mathbf{H}}_{0}+\widehat{\mathbf{V}}(\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}-\mathbf{\Lambda}\widehat{\mathbf{H}}_{0})\big{)}\mathbf{\Lambda}^{-1}.$

We bound the three terms separately, with high probability

	$\displaystyle\\|\!\mathbf{P}_{\perp}\!\big{(}\mathbf{E}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\!-\!\mathbf{E}_{0}\mathbf{V}\big{)}\!\mathbf{\Lambda}^{-1}\!\\|_{2,\infty}\!\!\leq\!\\|\!\mathbf{P}_{\perp}\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\!-\!\mathbf{V}\!)\mathbf{\Lambda}^{-1}\!\\|_{2,\infty}\!\!+\!\\|\!\mathbf{P}_{\perp}\mathbf{E}_{b}\!\widehat{\mathbf{V}}\!\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\!\\|_{2,\infty}$
	$\displaystyle\quad\leq\\|\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V})\mathbf{\Lambda}^{-1}\\|_{2,\infty}+\\|\mathbf{V}\mathbf{V}^{\top}\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V})\mathbf{\Lambda}^{-1}\\|_{2,\infty}+\\|\mathbf{E}_{b}\\|_{2}/\Delta$
	$\displaystyle\quad\leq\\|\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V})\\|_{2,\infty}/\Delta+\\|\mathbf{V}\\|_{2,\infty}\\|\mathbf{E}_{0}\\|_{2}\\|\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V}\\|_{2}/\Delta+r_{2}(d)/\Delta$
	$\displaystyle\quad=O_{P}\left(r_{4}(d,\mathbf{\Lambda})/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}+r_{2}(d)/\Delta\right).$

As for $\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{\Lambda}}-\widehat{\mathbf{\Lambda}})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}$ , we have

	$\displaystyle\\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}(\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\\|_{2,\infty}\leq\\|(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\\|_{2,\infty}$
	$\displaystyle\quad\quad+\\|\mathbf{V}\mathbf{V}^{\top}(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\\|_{2,\infty}$
	$\displaystyle\quad\leq\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty}\\|\mathbf{E}_{0}\\|_{2}/\Delta+\\|\mathbf{V}\\|_{2,\infty}\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2}\\|\mathbf{E}_{0}\\|_{2}/\Delta$
	$\displaystyle\quad=O_{P}\bigg{\{}r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}\bigg{\}},$

and finally

	$\displaystyle\\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}(\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda})\mathbf{\Lambda}^{-1}\\|_{2,\infty}\leq\\|(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda})\mathbf{\Lambda}^{-1}\\|_{2,\infty}$
	$\displaystyle\quad\quad+\\|\mathbf{V}\mathbf{V}^{\top}(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda})\mathbf{\Lambda}^{-1}\\|_{2,\infty}$
	$\displaystyle\quad=O_{P}\left(\big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}r_{1}(d)/\Delta\big{)}\\|\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}\\|_{2}/\Delta\right)$
	$\displaystyle\quad=O_{P}\left(\big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}r_{1}(d)/\Delta\big{)}r_{1}(d)/\Delta\right),$

where the last inequality is due to the fact that

	$\displaystyle\\|\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}\\|_{2}$	$\displaystyle=\\|\mathbf{\Lambda}\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}-\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\\|_{2}$
		$\displaystyle=\\|\mathbf{\Lambda}\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}-\widehat{\mathbf{V}}^{\top}\mathbf{M}\mathbf{V}\mathbf{V}^{\top}\\|_{2}$
		$\displaystyle\leq\\|\mathbf{\Lambda}\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}-\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{M}}\mathbf{V}\mathbf{V}^{\top}\\|_{2}+\\|\widehat{\mathbf{V}}^{\top}\mathbf{E}\mathbf{V}\mathbf{V}^{\top}\\|_{2}$
		$\displaystyle=\\|(\mathbf{\Lambda}-\widehat{\mathbf{\Lambda}})\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}\\|_{2}+\\|\widehat{\mathbf{V}}^{\top}\mathbf{E}\mathbf{V}\mathbf{V}^{\top}\\|_{2}\leq 2\\|\mathbf{E}\\|_{2}.$

Thus in summary, we have

\displaystyle\|\widehat{\mathbf{V}}\mathbf{H}_{0}\!\!-\!\!\mathbf{V}\!\!-\!\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}\|_{2,\infty}\!=\!O_{P}\bigg{\{}\!\!\frac{r_{3}(d)r_{1}(d)}{\Delta}\!+\!\!\sqrt{\frac{\mu K}{d}}\!\frac{r_{1}(d)^{2}}{\Delta^{2}}\!+\!\frac{r_{2}(d)\!+\!r_{4}(d)}{\Delta}\!\!\bigg{\}},

Now we move on to bound $\|\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}-\widehat{\mathbf{V}}\mathbf{H}_{0}\|_{2}$ . By Theorem 4.1, we know that

	$\displaystyle\\|\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}-\widehat{\mathbf{V}}\mathbf{H}_{0}\\|_{2}$	$\displaystyle\leq\\|\widetilde{\mathbf{V}}\mathbf{H}_{1}-\widehat{\mathbf{V}}\\|_{2}\lesssim\\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}$
		$\displaystyle\leq\\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}\\|_{2}+\\|{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}$
		$\displaystyle\leq\\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}\\|_{\text{F}}+\\|{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}$
		$\displaystyle=O_{P}\Big{(}\frac{1}{\sqrt{d}}\frac{r_{1}(d)^{2}}{\Delta^{2}}+\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)\Big{)}.$

Finally, we consider $\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}\|_{2}$ . From the proof of Theorem 4.1, we know that

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}\\|_{2}\leq\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{2}-\widetilde{\mathbf{V}}\\|_{2}\lesssim\\|\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|_{2}$
	$\displaystyle\quad=O_{P}\big{(}{\mathbb{E}}(\\|\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\\|_{2}^{2}\|\widetilde{\bm{\Sigma}})^{1/2}\big{)}\lesssim\sqrt{\frac{d}{p^{\prime}}}\frac{\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}^{q}}{\left(1-\\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\right)^{q}}.$

From the proof of Theorem 4.1, we know that with probability converging to 1, there exists some constant $\eta>0$ such that $\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\leq\eta r_{1}(d)\log d\sqrt{d/p}/\Delta=o(1)$ , and thus that

\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{2}-\widetilde{\mathbf{V}}\|_{2}=O_{P}\left(\sqrt{\frac{d}{p^{\prime}}}\left(2\eta r_{1}(d)\log d\sqrt{\frac{d}{\Delta^{2}p}}\right)^{q}\right).

When we choose $q$ to be large enough, i.e.,

q\geq 2+\frac{\log(Ld)}{\log\log d}\gg 1+\frac{\log\left(\log d\sqrt{Ld/(Kp^{\prime})}\right)}{\log\left((2\eta\log d)^{-1}\Delta/r_{1}(d)\sqrt{p/d}\right)},

we have $\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{2}-\widetilde{\mathbf{V}}\|_{2}=O_{P}(\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d))$ . Therefore, if we denote

r(d):=\Delta^{-1}\Big{(}\sqrt{\frac{Kd}{pL}}r_{1}(d)+r_{3}(d)r_{1}(d)\!+\!\sqrt{\frac{\mu K}{d\Delta^{2}}}r_{1}(d)^{2}\!+\!r_{2}(d)\!+\!r_{4}(d)\Big{)},

we can write

\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}=\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}+\mathbf{R}(d),

where $\|\mathbf{R}(d)\|_{2,\infty}=O_{P}\big{(}r(d)\big{)}$ . Then under the condition that $\eta_{1}(d)^{-1/2}r(d)=o(1)$ , we have that $\|\mathbf{R}(d)\|_{2,\infty}=o_{P}\Big{(}\sigma_{\min}\big{(}\bm{\Sigma}_{j})\big{)}^{1/2}\Big{)}$ . Thus by Assumption 5,

\bm{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=\bm{\Sigma}_{j}^{-1/2}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})+o_{P}(1)\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).

B.10 Proof of Corollary 4.6

We define $\mathbf{E}_{0}$ and $\mathbf{E}_{b}$ the same as in the proof of Corollary 4.11. Then Assumptions 1 and 2 are satisfied as been proven for Corollary 4.11. As for Assumption 5, we have shown that under the condition that ${\kappa_{1}^{3}(\lambda_{1}/\sigma^{2})^{3}}=o(\sqrt{n})$ , the results (B.31) holds for any matrix $\mathbf{A}\in\mathbb{R}^{d\times K}$ such that $\sigma_{\max}(\mathbf{A})/\sigma_{\min}(\mathbf{A})\leq C|\lambda_{1}|/\Delta$ and $\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq cn^{-1}\sigma^{4}\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}$ in the proof of Corollary 4.11. Under the regime $Lp\gg d$ , the leading term $\mathcal{V}(\mathbf{E}_{0})=\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}$ , and by taking $\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}$ , it can be seen that

\sigma_{\max}(\mathbf{V}\mathbf{\Lambda}^{-1})/\sigma_{\min}(\mathbf{V}\mathbf{\Lambda}^{-1})=\sigma_{\max}(\mathbf{\Lambda})/\sigma_{\min}(\mathbf{\Lambda})\leq|\lambda_{1}|/\Delta,

and if we can show that $\eta_{1}(d)\geq(2n)^{-1}\lambda_{1}^{-2}\sigma^{4}$ , we have $\lambda_{K}(\bm{\Sigma}_{j})\geq\eta_{1}(d)=(2n)^{-1}\sigma^{4}\big{(}\sigma_{\min}(\mathbf{V}\mathbf{\Lambda}^{-1})\big{)}^{2}$ and Assumption 5 is satisfied. Thus we only need to verify Assumption 4 and the conditions for $\eta_{1}(d)$ . Recall from the proof of Corollary 4.11 we have the following rates

r_{1}(d)=(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}},\quad r_{2}(d)\asymp\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2},

and we can further derive that the following bounds hold with high probability

	$\displaystyle\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}\leq\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2}\lesssim\\|\mathbf{E}_{0}\\|_{2}/\Delta\lesssim r_{1}(d)\log d/\Delta;$
	$\displaystyle\\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\\|_{2,\infty}\lesssim\\|\mathbf{E}_{0}\\|_{2}\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2}\lesssim r_{1}(d)^{2}(\log d)^{2}/\Delta.$

Thus we know $r_{3}(d)\asymp\kappa_{1}\log d\sqrt{{r}/{n}}$ and $r_{4}(d)\asymp r_{1}(d)^{2}(\log d)^{2}/\Delta=\kappa_{1}(\lambda_{1}+\sigma^{2})(\log d)^{2}r/n$ .

From the proof of Corollary 4.11, we know that $\bm{\Sigma}_{j}=n^{-1}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\bm{\Sigma}_{j}^{0}\mathbf{V}\mathbf{\Lambda}^{-1}$ , where

\mathbf{\Sigma}_{j}^{0}\!=\!\Big{\{}\!\sigma^{2}\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}\bm{\Sigma}\!+3\sigma^{4}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\!\!-\!2\sigma^{4}\rho\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}\!\big{[}(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\!\mathbf{P}_{\perp}\!\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\top}\!(\mathbf{P}_{\perp})_{[S,:]}\big{]}\!\!\Big{\}}.

Similar as in the proof of Corollary 4.11, we will first define $\bm{\Sigma}_{j}^{\prime}$ as following

\bm{\Sigma}_{j}^{\prime}=\frac{1}{n}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\Big{\{}\sigma^{2}\bm{\Sigma}+3\sigma^{4}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}-2\sigma^{4}\rho\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}\big{(}(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}+\mathbf{e}_{j}\mathbf{u}_{K+1}^{\top}(\mathbf{I}_{d})_{[S,:]}\big{)}\Big{\}}\mathbf{V}\mathbf{\Lambda}^{-1}.

Then following similar arguments as in the proof of Corollary 4.11, we have that

\|\bm{\Sigma}_{j}-{\bm{\Sigma}}_{j}^{\prime}\|_{2}=O\Big{(}\frac{K\sigma^{4}}{n\Delta^{2}}\sqrt{\frac{\mu}{d}}\Big{)}=O\Big{(}\frac{K\lambda_{1}^{2}}{\Delta^{2}}\sqrt{\frac{\mu}{d}}\Big{)}\frac{\sigma^{4}}{n\lambda_{1}^{2}}=o\big{(}\frac{\sigma^{4}}{n\lambda_{1}^{2}}\big{)}.

Besides, under the condition that $\mu^{2}\kappa_{1}^{4}K^{3}\ll d^{2}$ we have

\displaystyle\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\|_{2}\lesssim\frac{\sqrt{K}\sigma^{4}}{n\Delta^{2}}\|\mathbf{V}\|_{2,\infty}^{2}\lesssim\frac{\mu K\sqrt{K}\sigma^{4}}{dn\Delta^{2}}=O\left(\frac{\mu\kappa_{1}^{2}K\sqrt{K}}{d}\right)\frac{\sigma^{4}}{n\lambda_{1}^{2}}=o\Big{(}\frac{\sigma^{4}}{n\lambda_{1}^{2}}\Big{)}.

Then we know that $\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\geq\frac{\sigma^{4}}{2n\lambda_{1}^{2}}+\frac{\sigma^{2}}{2n\lambda_{1}}$ and we can take $\eta_{1}(d)=\frac{\sigma^{4}}{2n\lambda_{1}^{2}}+\frac{\sigma^{2}}{2n\lambda_{1}}$ . Thus Assumption 5 holds. Then by plugging in the above rates, we can derive the rate $r(d)$ as

	$\displaystyle r(d)$	$\displaystyle=\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}+\big{(}r_{2}(d)+r_{4}(d)\big{)}/\Delta$
		$\displaystyle\lesssim\kappa_{1}\sqrt{\frac{Kdr}{npL}}+\frac{\kappa_{1}^{2}(\log d)^{2}r}{n}+\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n\Delta}(\log d)^{2}.$

Then under the condition that $L\gg\frac{Kdr}{p}\kappa_{1}^{2}(\frac{\lambda_{1}}{\sigma^{2}})$ , $n\gg\kappa_{1}^{4}(\log d)^{4}r^{2}(\frac{\lambda_{1}}{\sigma^{2}})$ and $K(\frac{\widetilde{\sigma}_{1}}{\delta})^{2}\ll\kappa_{1}r$ , we have $\eta_{1}(d)^{-1/2}r(d)=o(1)$ , and hence the condition for $\eta_{1}(d)$ is satisfied and (8) holds. Also recall from the above proof that $\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}=o\big{(}\lambda_{K}(\bm{\Sigma}_{j})\big{)}$ , and (9) holds.

Now we verify the validity of $\widehat{\bm{\Sigma}}_{j}$ . Similar as in the proof of Corollary 4.11, it suffices to show that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ , and the results will hold by Slutsky’s Theorem. From proof of Corollary 4.11, we have

\|\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\|_{2}=O_{P}\Big{(}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}.

Also, we know that with high probability

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\\|_{2}$	$\displaystyle=\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\\|_{2}\lesssim\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)\log d+r_{1}(d)\log d/\Delta$
		$\displaystyle\lesssim r_{1}(d)\log d/\Delta\lesssim\kappa_{1}\log d\sqrt{\frac{r}{n}}.$

Then we have

	$\displaystyle\\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\\|_{2}\leq\\|\widetilde{\mathbf{V}}^{\text{F}\top}(\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}+\\|(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}$
	$\displaystyle\quad+\\|\mathbf{H}\mathbf{V}^{\top}(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})\\|_{2}=O_{P}\Big{(}\lambda_{1}\kappa_{1}\log d\sqrt{\frac{r}{n}}\Big{)}.$

Then if we denote $\mathbf{D}_{\mathbf{\Lambda}}=(\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}$ , we have that $\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}(\kappa_{1}^{2}\log d\sqrt{\frac{r}{n}})=o_{P}(1)$ , and thus we have

	$\displaystyle\\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\\|_{2}$	$\displaystyle=\\|(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}+\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}-(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}\\|_{2}$
		$\displaystyle=\Big{\\|}\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\big{[}(\mathbf{I}_{K}+\mathbf{D}_{\mathbf{\Lambda}})^{-1}-\mathbf{I}_{K}\big{]}\Big{\\|}_{2}\leq\\|\mathbf{\Lambda}^{-1}\\|_{2}\big{\\|}\sum_{i=1}^{\infty}(-\mathbf{D}_{\mathbf{\Lambda}})^{i}\big{\\|}_{2}$
		$\displaystyle=O_{P}\left(\kappa_{1}^{2}\log d\sqrt{\frac{r}{n}}\right)\Delta^{-1},$

and furthermore, we have

\displaystyle\|\widetilde{\mathbf{\Lambda}}^{-2}-\mathbf{H}\mathbf{\Lambda}^{-2}\mathbf{H}^{\top}\|_{2}

\displaystyle\lesssim\|\mathbf{\Lambda}^{-1}\|_{2}\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}=O_{P}\left(\kappa_{1}^{2}\log d\sqrt{\frac{r}{n}}\right)\Delta^{-2}.

Then following basic algebra, under the condition that $n\gg\kappa_{1}^{4}(\log d)^{4}r^{2}(\lambda_{1}/\sigma^{2})^{2}$ we have

	$\displaystyle\\|\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}-\widehat{\bm{\Sigma}}_{j}\\|_{2}=\frac{1}{n}\\|\mathbf{H}(\sigma^{2}\mathbf{\Lambda}^{-1}+\sigma^{4}\mathbf{\Lambda}^{-2})\mathbf{H}^{\top}-(\widehat{\sigma}^{2}\widetilde{\mathbf{\Lambda}}^{-1}+\widehat{\sigma}^{4}\widetilde{\mathbf{\Lambda}}^{-2})\\|_{2}$
	$\displaystyle\leq\frac{1}{n}\left(\\|\sigma^{2}\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}-\widehat{\sigma}^{2}\widetilde{\mathbf{\Lambda}}^{-1}\\|_{2}+\\|\sigma^{4}\mathbf{H}\mathbf{\Lambda}^{-2}\mathbf{H}^{\top}-\widehat{\sigma}^{4}\widetilde{\mathbf{\Lambda}}^{-2}\\|_{2}\right)$
	$\displaystyle=O_{P}\left(\kappa_{1}^{2}\log d\frac{\sigma^{2}}{n\Delta}\sqrt{\frac{r}{n}}\right)+O_{P}\left(\frac{\widetilde{\sigma}_{1}}{n\Delta}\sqrt{\frac{K}{n}}\right)+O_{P}\left(\kappa_{1}^{2}\log d\frac{\sigma^{4}}{n\Delta^{2}}\sqrt{\frac{r}{n}}\right)+O_{P}\left(\frac{\widetilde{\sigma}_{1}\sigma^{2}}{n\Delta^{2}}\sqrt{\frac{K}{n}}\right)$
	$\displaystyle=O_{P}\Big{(}\kappa_{1}^{2}\log d\big{(}\frac{\Delta}{\sigma^{2}}\big{)}\sqrt{\frac{r}{n}}\Big{)}\frac{\sigma^{4}}{n\Delta^{2}}=O_{P}\Big{(}\kappa_{1}^{2}\log d\big{(}\frac{\lambda_{1}}{\sigma^{2}}\big{)}\big{(}\frac{\lambda_{1}}{\Delta}\big{)}\sqrt{\frac{r}{n}}\Big{)}\frac{\sigma^{4}}{n\lambda_{1}^{2}}$
	$\displaystyle=O_{P}\Big{(}\kappa_{1}^{3}\log d\big{(}\frac{\lambda_{1}}{\sigma^{2}}\big{)}\sqrt{\frac{r}{n}}\Big{)}\frac{\sigma^{4}}{n\lambda_{1}^{2}}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.$

Therefore, by Slutsky’s Theorem, the claim follows.

B.11 Proof of Corollary 4.7

The proof for the case where no self-loops are present is almost identical to the case where there are self-loops except for some modifications. We will first prove the results for the case when self-loops are present, then in the end we will discuss how to modify the proof for the case where self-loops are absent.

We only need to verify that Assumptions 1 to 5 hold. Recall from the proof of Corollary 4.2 that we have $\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim r_{1}(d)=\sqrt{d\theta}$ , and thus we know that Assumption 1 is satisfied. Also Assumption 2 holds trivially due to the unbiasedness of $\mathbf{E}$ . We will then verify Assumption 3 holds under the model. We know that $\mathbf{\Theta}\mathbf{\Pi}$ and $\mathbf{V}$ share the same column space, and thus there exists a non-singular matrix $\mathbf{C}\in\mathbb{R}^{K\times K}$ such that $\mathbf{\Theta}\mathbf{\Pi}=\mathbf{V}\mathbf{C}$ and $\mathbf{V}=\mathbf{\Theta}\mathbf{\Pi}\mathbf{C}^{-1}$ . Then we can see that $\sigma_{\min}(\mathbf{C})=\sigma_{\min}(\mathbf{\Theta}\mathbf{\Pi})\gtrsim\sqrt{d\theta/K}$ , and $\|\mathbf{C}^{-1}\|_{2}\lesssim\sqrt{K/d\theta}$ . Hence we have $\|\mathbf{V}\|_{2,\infty}\leq\|\mathbf{\Theta}\mathbf{\Pi}\|_{2,\infty}\|\mathbf{C}^{-1}\|_{2}\lesssim\sqrt{\theta}\sqrt{K/d\theta}=\sqrt{K/d}$ . Thus we can see that Assumption 3 is satisfied with $\mu=O(1)$ .

Now we move on to verify Assumption 4. Recall from the proof of Corollary 4.2 that $\Delta\gtrsim d\theta/K$ , $\|\mathbf{M}\|_{2}\lesssim Kd\theta$ , $\mathbf{M}_{ij}\asymp\theta$ and $\max_{ij}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim\theta$ . By Theorem 4.2.1 in Chen et al., [12], we have that with probability $1-O(d^{-5})$ ,

\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}},\quad r_{3}(d)\asymp\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}},

and by the proof of Theorem 4.2.1 in [12], we further have that with probability $1-O(d^{-7})$ ,

	$\displaystyle\\|\mathbf{E}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})\!-\!\mathbf{V})\\|_{2,\infty}\lesssim\!\frac{K\!\sqrt{K\theta\log d}}{d\theta}\\|\mathbf{E}\\|_{2}\!+\!r_{3}(d)(\log d\!\!+\!\!\sqrt{d\theta})$
	$\displaystyle\lesssim r_{3}(d)(\log d+\sqrt{d\theta})+K\sqrt{K\log d/d}$
	$\displaystyle\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}},r_{4}(d)\asymp\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}}.$

Thus Assumption 4 is met and now we move on to study the order of $\eta_{1}(d)$ . Before we continue with the proof, we state the following elementary lemma that helps study the operator norm of a covariance matrix.

Lemma B.5.

$\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{d}$ are two random vectors, then we have

\|\operatorname*{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\|_{2}=\|\operatorname*{\rm Cov}(\mathbf{x}_{2},\mathbf{x}_{1})\|_{2}\leq\sqrt{\|\operatorname*{\rm Cov}(\mathbf{x}_{1})\|_{2}\|\operatorname*{\rm Cov}(\mathbf{x}_{2})\|_{2}},

and

\|\operatorname*{\rm Cov}(\mathbf{x}_{1}+\mathbf{x}_{2})\|_{2}\leq 2\|\operatorname*{\rm Cov}(\mathbf{x}_{1})\|_{2}+2\|\operatorname*{\rm Cov}(\mathbf{x}_{2})\|_{2}.

The proof of Lemma B.5 can be found in Appendix C.4. With the help of Lemma B.5, we first decompose $\mathbf{E}=\mathbf{E}_{1}+\mathbf{E}_{2}$ , where $\mathbf{E}_{1}=[\mathbf{E}_{ij}\mathbb{I}\{i\leq j\}]$ is composed of the diagonal and upper triangular entries of $\mathbf{E}$ and $\mathbf{E}_{2}=[\mathbf{E}_{ij}\mathbb{I}\{i>j\}]$ is composed of the off-diagonal lower triangular entries of $\mathbf{E}$ . Then it can be seen that both $\mathbf{E}_{1}$ and $\mathbf{E}_{2}$ have independent entries. Now for $j\in[d]$ , we can write

\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}=\mathbf{E}\mathbf{e}_{j}-\mathbf{E}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}=\mathbf{E}\mathbf{e}_{j}-(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}).

Then we study the covariance of the three terms separately. We have

	$\displaystyle\operatorname{\rm Cov}(\mathbf{E}\mathbf{e}_{j})=\operatorname{\rm Cov}(\mathbf{E}_{.j})=\operatorname{diag}\big{(}\mathbf{M}_{1j}(1-\mathbf{M}_{1j}),\ldots,\mathbf{M}_{dj}(1-\mathbf{M}_{dj})\big{)};$
	$\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})=\operatorname{diag}\Big{(}\big{[}\sum_{k=1}^{d}\mathbf{M}_{ik}(1-\mathbf{M}_{ik})(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\mathbb{I}\{i\leq k\}\big{]}_{i=1}^{d}\Big{)};$
	$\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})=\operatorname{diag}\Big{(}\big{[}\sum_{k=1}^{d}\mathbf{M}_{ik}(1-\mathbf{M}_{ik})(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\mathbb{I}\{i>k\}\big{]}_{i=1}^{d}\Big{)}.$

Then we have $\theta\lesssim\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\big{)}\leq\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\leq\max_{ij}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim\theta$ and

	$\displaystyle\\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\\|_{2}\leq\max_{i\in[d]}\sum_{k=1}^{d}\mathbf{M}_{ik}(1-\mathbf{M}_{ik})(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\mathbb{I}\{i\leq k\}$
	$\displaystyle\leq\max_{ik}{\mathbb{E}}(\mathbf{E}_{ik})^{2}\sum_{k=1}^{d}(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\lesssim\theta\\|\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j}\\|_{2}^{2}\leq\theta\\|\mathbf{V}\\|_{2,\infty}^{2}\leq\frac{\theta K}{d},$

and very similarly we also have $\|\operatorname*{\rm Cov}(\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\|_{2}\lesssim\theta K/d$ . Thus by Lemma B.5, we know that $\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\|_{2}\lesssim\theta K/d$ and

\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j},\mathbf{E}\mathbf{e}_{j})\|_{2}\lesssim\sqrt{\theta^{2}K/d}=\theta\sqrt{K/d}.

Therefore, we can write

	$\displaystyle\\|\operatorname{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\\|_{2}\leq 2\\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j},\mathbf{E}\mathbf{e}_{j})\\|_{2}$
	$\displaystyle\quad+\\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\\|_{2}\lesssim\theta\sqrt{K/d}.$

Thus we have $\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\big{)}-\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\gtrsim\theta$ , and we have $\eta_{1}(d)\asymp\lambda_{1}^{-2}\theta$ . Therefore, when $\theta=K^{2}d^{-1/2+\epsilon}$ for some constant $\epsilon>0$ , $p=\Omega(\sqrt{d})$ and $L\gg K^{5}d^{2}/p$ , $K=o(d^{1/18})$ , we have that

	$\displaystyle r(d)$	$\displaystyle=\Delta^{-1}\Big{(}\sqrt{\frac{Kd}{pL}}r_{1}(d)+r_{3}(d)r_{1}(d)+\sqrt{\frac{\mu K}{d\Delta^{2}}}r_{1}(d)^{2}+r_{2}(d)+r_{4}(d)\Big{)}$
		$\displaystyle\lesssim\frac{K^{4}\sqrt{K}+K^{2}\sqrt{K\log d}}{d^{3/2}\theta}+K\sqrt{\frac{K}{\theta pL}}\ll\frac{1}{Kd\sqrt{\theta}}\lesssim\eta_{1}(d)^{1/2}.$

Thus $\eta_{1}(d)^{-1/2}r(d)=o(1)$ and the condition for the asymptotic covariance matrix is satisfied. Now we need to verify Assumption 5, and similar as in the proof of Corollary 4.11, we can verify the following more general result.

Given $j\in[d]$ , for any matrix $\mathbf{A}\in\mathbb{R}^{d\times K}$ that satisfies the following two conditions: (1) $\|\mathbf{A}\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\leq C\sqrt{\lambda_{1}^{2}\mu K/(d\Delta^{2})}$ ; (2) $\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\geq c\theta\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}$ , where $\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ and $C,c>0$ are fixed constants independent of $\mathbf{A}$ , it holds that

\mathbf{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).

(B.32)

It can be checked from the previous proof that $\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}$ satisfies the two conditions. To show (B.32), we need to show that ${\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1)$ for any ${\mathbf{a}}\in\mathbb{R}^{K},\|{\mathbf{a}}\|_{2}=1$ . We will first study the entries of $\mathbf{P}_{\perp}\mathbf{e}_{j}$ and $\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}$ . It holds that

	$\displaystyle\|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}\|=\|\big{(}(\mathbf{I}_{d}-\mathbf{V}\mathbf{V}^{\top})\mathbf{e}_{j}\big{)}_{j}\|\leq 1+\\|\mathbf{V}\\|_{2,\infty}^{2}=1+o(1);$
	$\displaystyle\max_{i\neq j}\|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|=\max_{i\neq j}\|\mathbf{e}_{i}^{\top}\mathbf{e}_{j}-\mathbf{e}_{i}^{\top}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}\|\leq 0+\\|\mathbf{V}\\|_{2,\infty}^{2}=\frac{K}{d};$
	$\displaystyle\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}\leq\\|\mathbf{A}\\|_{2,\infty}\\|\bm{\Sigma}_{j}^{-1/2}\\|_{2}\lesssim\theta^{-1/2}\\|\mathbf{A}\\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\lesssim K^{2}\sqrt{\frac{K}{d\theta}}.$

Then we know that

	$\displaystyle{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}=\sum_{ik}\mathbf{E}_{ik}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}=\sum_{i=1}^{d}\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}$
	$\displaystyle\quad+\sum_{i<k}\mathbf{E}_{ik}\big{[}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big{]}.$

Then for the diagonal entries we have

	$\displaystyle\sum_{i=1}^{d}{\mathbb{E}}\|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|^{3}$
	$\displaystyle={\mathbb{E}}\|\mathbf{E}_{jj}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{j}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}\|^{3}+\sum_{i\neq j}{\mathbb{E}}\|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|^{3}$
	$\displaystyle\quad\lesssim\theta\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}+d\theta\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}\max_{i\neq j}\|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|^{3}\lesssim\frac{K^{6}}{d}\sqrt{\frac{K^{3}}{d\theta}},$

and for the off-diagonal entries, when $K=o(d^{1/26})$ it holds that

	$\displaystyle\sum_{i<k}{\mathbb{E}}\Big{\|}\mathbf{E}_{ik}\big{[}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big{]}\Big{\|}^{3}\lesssim d\theta\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}$
	$\displaystyle\quad+d^{2}\theta\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}\big{(}\frac{K}{d}\big{)}^{3}\lesssim K^{6}\sqrt{\frac{K^{3}}{d\theta}}=o(1).$

Moreover, since $\operatorname{Var}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})=1$ , by the Lyapunov’s condition and plugging in $\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}$ , Assumption 5 is met and (8) follows.

Now we only need to verify that the result also holds when replacing $\bm{\Sigma}_{j}$ by $\widetilde{\bm{\Sigma}}_{j}$ . From previous discussion we learnt that

	$\displaystyle\\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\\|_{2}\leq\\|\mathbf{V}\mathbf{\Lambda}^{-1}\\|_{2}^{2}\\|\operatorname{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\\|_{2}$
	$\displaystyle\leq\frac{K^{2}}{d^{2}\theta}\sqrt{\frac{K}{d}}\lesssim K^{4}\sqrt{\frac{K}{d}}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})=o(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})).$

Then by Slutsky’s Theorem, (11) holds.

Now we verify the validity of $\widehat{\bm{\Sigma}}_{j}$ . Similar as in the proof of Corollary 4.6, $\mathbf{H}$ is orthonormal with probability $1-o(1)$ , and we will start by showing that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ . From previous discussion we have the following bounds

\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}=O_{P}(\sqrt{d\theta}),\quad\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=O_{P}(\frac{K}{\sqrt{d\theta}}),

and

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\\|_{2,\infty}$	$\displaystyle\leq\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widehat{\mathbf{V}}\mathbf{H}_{0}\\|_{2}+\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty}=o_{P}(\frac{1}{Kd\sqrt{\theta}})$
		$\displaystyle\quad+O_{P}(\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}})=O_{P}(\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}}).$

With the help of the above results, we will study the components of $\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}$ separately. In the following proof, we will base the discussion on the event that $\mathbf{H}$ is orthonormal. We first study $\widetilde{\mathbf{M}}=(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})\widehat{\mathbf{M}}(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})=\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}$ . We have that

	$\displaystyle\\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\\|_{2}\leq\\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\\|_{2}$
	$\displaystyle\quad+\\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})\\|_{2}+\\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{M}\mathbf{V}\\|_{2}$
	$\displaystyle\leq\\|\widehat{\mathbf{M}}-\mathbf{M}\\|_{2}+2\\|\mathbf{M}\\|_{2}\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\\|_{2}=O_{P}(K^{2}\sqrt{d\theta}).$

Then for $i,k\in[d]$ , we have

	$\displaystyle\|\widetilde{\mathbf{M}}_{ik}-\mathbf{M}_{ik}\|=\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}-\mathbf{M}_{ik}\|$
	$\displaystyle\leq\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\|+\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\|$
	$\displaystyle\quad+\|(\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{k}\|.$

It is not hard to see that

	$\displaystyle\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\|\lesssim\\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\\|_{2}\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\\|_{2,\infty}^{2}$
	$\displaystyle=O_{P}(K^{2}\sqrt{d\theta}\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\\|_{2,\infty}^{2})=O_{P}\left(K^{3}\sqrt{\frac{\theta}{d}}\right),$
	$\displaystyle\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\|+\|(\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{k}\|$
	$\displaystyle=O_{P}(Kd\theta\\|\mathbf{V}\\|_{2,\infty}\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty})=O_{P}\left(K^{3}(K^{2}+\sqrt{\log d})\sqrt{\frac{\theta}{d}}\right),$

and in turn we have the upper bound

	$\displaystyle\|\widetilde{\mathbf{M}}_{ik}-\mathbf{M}_{ik}\|=O_{P}\left(K^{3}\sqrt{\frac{\theta}{d}}\right)+O_{P}\left(K^{3}(K^{2}+\sqrt{\log d})\sqrt{\frac{\theta}{d}}\right)$
	$\displaystyle=O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\theta=o_{P}(\theta)=o_{P}(\mathbf{M}_{ik}).$

Thus we have

\|\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}(1-\widetilde{\mathbf{M}}_{ij})]_{i=1}^{d}\big{)}-\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\|_{2}=O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\theta\Big{)}.

Then we move on to study $\widetilde{\mathbf{\Lambda}}$ . We have

	$\displaystyle\\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\\|_{2}\leq\\|\widetilde{\mathbf{V}}^{\text{F}\top}(\widehat{\mathbf{M}}-\mathbf{M})\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}+\\|(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}$
	$\displaystyle\quad+\\|\mathbf{H}\mathbf{V}^{\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})\\|_{2}=O_{P}(\sqrt{d\theta})+O_{P}(K^{2}\sqrt{d\theta})=O_{P}(K^{2}\sqrt{d\theta}).$

	$\displaystyle\\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\\|_{2}$	$\displaystyle=\\|(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}+\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}-(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}\\|_{2}$
		$\displaystyle=\Big{\\|}\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\big{[}(\mathbf{I}_{K}+\mathbf{D}_{\mathbf{\Lambda}})^{-1}-\mathbf{I}_{K}\big{]}\Big{\\|}\leq\\|\mathbf{\Lambda}^{-1}\\|_{2}\big{\\|}\sum_{i=1}^{\infty}(-\mathbf{D}_{\mathbf{\Lambda}})^{i}\big{\\|}_{2}$
		$\displaystyle=O_{P}\big{(}{K^{4}}/{(d\theta)^{3/2}}\big{)}.$

Thus, following basic algebra we have the following bounds

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}(1-\widetilde{\mathbf{M}}_{ij})]_{i=1}^{d}\big{)}\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{H}\mathbf{V}^{\top}\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\mathbf{V}\mathbf{H}^{\top}\\|_{2}$
	$\displaystyle\leq\\|\widetilde{\mathbf{V}}^{\text{F}\top}\Big{(}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}(1-\widetilde{\mathbf{M}}_{ij})]_{i=1}^{d}\big{)}-\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\Big{)}\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}$
	$\displaystyle\quad+2\\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\\|_{2}\\|\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\\|_{2}=O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\theta,$

and further, under the condition that $K=o(d^{1/32})$ , we have

	$\displaystyle\\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\\|_{2}$	$\displaystyle\lesssim O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\theta\Big{)}\\|\widetilde{\mathbf{\Lambda}}^{-1}\\|_{2}^{2}+\theta\\|\mathbf{\Lambda}^{-1}\\|_{2}\\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\\|_{2}$
		$\displaystyle=O_{P}\Big{(}\frac{K^{7}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\frac{1}{K^{2}d^{2}\theta}+O_{P}(\frac{K^{7}}{\sqrt{d\theta}})\frac{1}{K^{2}d^{2}\theta}$
		$\displaystyle=O_{P}\Big{(}\frac{K^{7}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\frac{1}{K^{2}d^{2}\theta}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.$

Thus with similar arguments as in the proof of Corollary 4.6, the claim follows.

Remark 18.

The inferential results also hold for the case where self-loops are absent. Recall that under the no-self-loop case, the observed matrix is

\widehat{\mathbf{M}}=\mathbf{X}-\operatorname{diag}(\mathbf{X})=\mathbf{M}+\mathbf{E}-\operatorname{diag}(\mathbf{M}+\mathbf{E})=\mathbf{M}+\mathbf{E}-\operatorname{diag}(\mathbf{E})-\operatorname{diag}(\mathbf{M}),

where $\mathbf{E}=\mathbf{X}-\mathbf{M}$ is the error matrix between the adjacency matrix with self-loops and its expectation. We define $\widehat{\mathbf{M}}^{\prime}=\mathbf{M}+\mathbf{E}-\operatorname{diag}(\mathbf{E})$ and denote by $\widehat{\mathbf{V}}^{\prime}$ its $K$ leading eigenvectors. By Weyl’s inequality [19] we know that with probability at least $1-d^{-10}$ we have that $\sigma_{K}(\widehat{\mathbf{M}}^{\prime})-\sigma_{K+1}(\widehat{\mathbf{M}}^{\prime})\geq\Delta-O(\sqrt{d\theta})\gtrsim d\theta/K$ , and hence by Davis-Kahan’s Theorem [45] we have

\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2}\leq\|\operatorname{diag}(\mathbf{M})\|_{2}/\big{(}\sigma_{K}(\widehat{\mathbf{M}}^{\prime})-\sigma_{K+1}(\widehat{\mathbf{M}}^{\prime})\big{)}\lesssim K/d,

with probability at least $1-d^{-10}$ . The verification of Assumptions 1, 3 and 5 when self-loops are present can also be applied to the no-self-loop case. For Assumption 2, we can take $\mathbf{E}_{0}=\mathbf{E}-\operatorname{diag}(\mathbf{E})$ and $\mathbf{E}_{b}=-\operatorname{diag}(\mathbf{M})$ . Then $r_{2}(d)=\|\operatorname{diag}(\mathbf{M})\|_{2}\lesssim\theta=o(r_{1}(d))$ and Assumption 2 is satisfied. As for Assumption 4, by Lemma 7 in Fan et al., [18], we have

\|\operatorname{sgn}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\widehat{\mathbf{V}}^{\prime\top}\mathbf{V}\|_{2}\lesssim\|\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2}\lesssim\frac{K^{2}}{d\theta},

\|\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime})-\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime}\|_{2}\lesssim\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2}^{2}\lesssim\frac{K^{2}}{d^{2}}.

With similar arguments as in the self-loop case, for $\widehat{\mathbf{V}}^{\prime}$ with high probability we have

\|\widehat{\mathbf{V}}^{\prime}\operatorname{sgn}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}},

\|\mathbf{E}_{0}(\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})\!-\!\mathbf{V})\|_{2,\infty}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}}.

Then for $\widehat{\mathbf{V}}$ , with high probability we have that

	$\displaystyle\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}\leq\\|\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}+\\|\widehat{\mathbf{V}}\big{(}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\widehat{\mathbf{V}}^{\top}\mathbf{V}\big{)}-\mathbf{V}\\|_{2,\infty}$
	$\displaystyle\leq\\|\widehat{\mathbf{V}}\big{(}\widehat{\mathbf{V}}^{\top}(\mathbf{I}_{d}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top})\mathbf{V}\big{)}-\mathbf{V}\\|_{2,\infty}+\\|\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}+O(\frac{K^{2}}{d\theta})\\|\widehat{\mathbf{V}}\\|_{2,\infty}$
	$\displaystyle\leq\\|\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime})\!-\!\widehat{\mathbf{V}}^{\prime}\\|_{2,\infty}\!+\!\\|\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})\!-\!\mathbf{V}\\|_{2,\infty}\!+\!O(\frac{K^{2}}{d\theta})\\|\widehat{\mathbf{V}}\\|_{2,\infty}\!+\!\\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\!-\!\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\\|_{2}\\|\widehat{\mathbf{V}}\\|_{2,\infty}$
	$\displaystyle\leq O(\frac{K^{2}}{d\theta})\!\!\left(\\|\mathbf{V}\\|_{2,\infty}\!+\!\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}\!\right)\!+\!\\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\!-\!\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\\|_{2}\!+\!\\|\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})\!-\!\mathbf{V}\\|_{2,\infty},$

where in the last two inequalities we use the fact that

\|(\mathbf{I}_{d}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top})\widehat{\mathbf{V}}^{\prime}\|_{2}=\|(\mathbf{I}_{d}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top})\widehat{\mathbf{V}}\|_{2}=\|\widehat{\mathbf{V}}_{\perp}^{\top}\widehat{\mathbf{V}}^{\prime}\|_{2}=\|\widehat{\mathbf{V}}_{\perp}^{\prime\top}\widehat{\mathbf{V}}\|_{2}=\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2},

with $\widehat{\mathbf{V}}_{\perp}$ and $\widehat{\mathbf{V}}^{\prime}_{\perp}$ being the orthogonal complement of $\widehat{\mathbf{V}}$ and $\widehat{\mathbf{V}}^{\prime}$ respectively. Since $K^{2}/(d\theta)=o(1)$ , for large enough $d$ we further get

	$\displaystyle\frac{1}{2}\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}\leq\left(1-O\left({K^{2}}/{(d\theta)}\right)\right)\\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}$
	$\displaystyle\quad\leq O(\frac{K^{2}}{d\theta})\\|\mathbf{V}\\|_{2,\infty}+O(\frac{K}{d})+\\|\widehat{\mathbf{V}}^{\prime}\operatorname{sgn}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V}\\|_{2,\infty}+O(\frac{K^{2}}{d\theta})\\|\widehat{\mathbf{V}}^{\prime}\\|_{2,\infty}$
	$\displaystyle\quad\lesssim\frac{K^{2}}{d\theta}\sqrt{\frac{K}{d}}+\frac{K}{d}+\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}}.$

Hence $r_{3}(d)=K\sqrt{K}(K^{2}+\sqrt{\log d})/(d\sqrt{\theta})$ . We also have

	$\displaystyle\\|\mathbf{E}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\\|_{2,\infty}$	$\displaystyle\lesssim\\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\\|_{2,\infty}+\frac{r_{2}(d)r_{1}(d)}{\Delta}$
		$\displaystyle\lesssim\\|\mathbf{E}_{0}(\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V})\!\\|_{2,\infty}+\frac{r_{2}(d)r_{1}(d)}{\Delta}$
		$\displaystyle\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}}+K\sqrt{\frac{\theta}{d}}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}},$

and hence we can take $r_{4}(d)=K\sqrt{K}(K^{2}+\sqrt{\log d})/\sqrt{d}$ . Now to get a sharper rate for $r(d)$ , we take into consideration the diagonal structure of $\mathbf{E}_{b}$ and derive the following bound

	$\displaystyle\\|\!\mathbf{P}_{\perp}\mathbf{E}_{b}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\!\\|_{2,\infty}$	$\displaystyle\leq\\|\mathbf{V}\mathbf{V}^{\top}\mathbf{E}_{b}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\!\\|_{2,\infty}+\\|\mathbf{E}_{b}\widehat{\mathbf{V}}\!\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\\|_{2,\infty}$
		$\displaystyle\leq\frac{r_{2}(d)\\|\mathbf{V}\\|_{2,\infty}}{\Delta}+\frac{\\|\operatorname{diag}(\mathbf{M})\widehat{\mathbf{V}}\\|_{2,\infty}}{\Delta}$
		$\displaystyle\lesssim\frac{K}{d}\sqrt{\frac{K}{d}}+\frac{\\|\operatorname{diag}(\mathbf{M})\\|_{2}\\|\widehat{\mathbf{V}}\\|_{2,\infty}}{\Delta}\lesssim\frac{K}{d}\sqrt{\frac{K}{d}}.$

Then from the proof of Theorem 4.5 we have that

r(d)\lesssim\frac{K^{4}\sqrt{K}+K^{2}\sqrt{K\log d}}{d^{3/2}\theta}+K\sqrt{\frac{K}{\theta pL}}+\frac{K}{d}\sqrt{\frac{K}{d}}\ll\frac{1}{Kd\sqrt{\theta}},

and we are only left to verify the minimum eigenvalue condition of $\bm{\Sigma}_{j}$ by showing that the order of $\eta_{1}(d)$ is the same as when there are self-loops. With the same arguments, we know that

\|\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{e}_{j})\|\leq O\left(K^{4}\sqrt{\frac{K}{d}}\right)\frac{1}{K^{2}d^{2}\theta}.

Besides, we also have

	$\displaystyle\\|\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{e}_{j})-\widetilde{\bm{\Sigma}}_{j}\\|_{2}=\big{\\|}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\big{(}\mathbf{M}_{jj}(1-\mathbf{M}_{jj})\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\big{)}\mathbf{V}\mathbf{\Lambda}^{-1}\big{\\|}_{2}$
	$\displaystyle\lesssim\mathbf{M}_{jj}\\|\mathbf{\Lambda}^{-1}\\|_{2}^{2}\\|\mathbf{V}\\|_{2,\infty}^{2}\lesssim\frac{K^{2}}{d^{2}\theta}\frac{K}{d}=O(\frac{K^{5}}{d})\frac{1}{K^{2}d^{2}\theta}=o\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.$

Thus we also have $\|\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{P}_{\perp}\mathbf{e}_{j})-\widetilde{\bm{\Sigma}}_{j}\|_{2}=o\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ , and thereby

\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}=\lambda_{K}\big{(}\widetilde{\bm{\Sigma}}_{j}\big{)}\big{(}1+o(1)\big{)}\gtrsim\frac{\theta}{K^{2}d^{2}\theta^{2}}.

Thus we still have $\eta_{1}(d)=\lambda_{1}^{-2}\theta$ for the case where self-loops are absent. The condition for $\eta_{1}(d)$ also holds for the no-self-loop case and both (8) and (11) hold. The verification of (12) is almost identical to the self-loop case and is hence omitted.

B.12 Proof of Corollary 4.8

From the proof of Corollary 4.12, we have verified Assumptions 1-3. It can be checked that $\mathbf{V}\mathbf{\Lambda}^{-1}$ satisfies the two conditions for the general CLT results in the proof of Corollary 4.12, then under the condition that $\Delta_{0}^{2}\ll n^{4/3}/(\mu_{\theta}^{2}d)$ , Assumption 5 is also satisfied.

Now we move on to check the conditions for $\eta_{1}(d)$ . Recall from the proof of Corollary 4.12, we have

\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d})+\!n\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}.

Then we have

\displaystyle\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}\lesssim\frac{K}{d\Delta^{2}}(n+\Delta)\lesssim O\left(\frac{K}{dn}(n+\Delta)\right)\frac{n}{\lambda_{1}^{2}}=o\left(\frac{n}{\lambda_{1}^{2}}\right).

Besides, it can be seen that $\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\geq n/\lambda_{1}^{2}+1/\lambda_{1}$ , and hence we can take $\eta_{1}(d)=\lambda_{1}^{-2}n/2+\lambda_{1}^{-1}/2$ . Next we move on to verify the statistical rates $r_{3}(d)$ and $r_{4}(d)$ . By Davis-Kahan’s Theorem [45], we have that with high probability

\displaystyle\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\leq\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2}\lesssim\|\mathbf{E}\|_{2}/\Delta\lesssim r_{1}^{\prime}(d)/\Delta,

where $r_{1}^{\prime}(d)=d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d$ as defined in the proof of Corollary 4.12, and thus we know that $r_{3}(d)\asymp r_{1}^{\prime}(d)/\Delta$ . Besides, with high probability we have

\displaystyle\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2,\infty}\leq\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2}\lesssim r_{1}^{\prime}(d)^{2}/\Delta,

and we have $r_{4}(d)\asymp r_{1}^{\prime}(d)^{2}/\Delta$ . Thus Assumption 4 is satisfied. Then we have

	$\displaystyle r(d)$	$\displaystyle=\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{K}{d}}r_{1}(d)^{2}/\Delta^{2}+\big{(}r_{2}(d)+r_{4}(d)\big{)}/\Delta$
		$\displaystyle\lesssim\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{1}^{\prime}(d)^{2}/\Delta^{2}\lesssim\frac{K}{\Delta_{0}^{2}}+\frac{K^{2}n(\log d)^{2}}{d\Delta_{0}^{4}}+\sqrt{\frac{Kd}{pL}}\Big{(}\frac{\sqrt{K}}{\Delta_{0}}+\frac{K\sqrt{n}}{\sqrt{d}\Delta_{0}^{2}}\Big{)}.$

Therefore, under the conditions that $\Delta_{0}^{2}\gg K\sqrt{n}(\log d)^{2}$ , $n\gg d^{2}$ and $L\gg Kd^{2}/p$ , we have $\eta_{1}(d)^{-1/2}r(d)=o(1)$ . Thus by Theorem 4.5, (8) holds. As for (13), from the above arguments we have $\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}=o\big{(}\lambda_{K}(\bm{\Sigma}_{j})\big{)}$ , and hence (13) holds.

Now we need to check the validity of $\widehat{\bm{\Sigma}}_{j}$ . Similar as before, it suffices for us to prove that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ . From Corollary 4.8, we have that $\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}=O_{P}(d\Delta_{0}/\sqrt{K}+\sqrt{dn})$ and $\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=O_{P}\big{(}K\sqrt{\frac{n}{d}}/\Delta_{0}^{2}\big{)}$ . Then we have

	$\displaystyle\\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\\|_{2}\leq\\|\widetilde{\mathbf{V}}^{\text{F}\top}(\widehat{\mathbf{M}}-\mathbf{M})\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}+\\|(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\\|_{2}$
	$\displaystyle\quad+\\|\mathbf{H}\mathbf{V}^{\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})\\|_{2}=O_{P}\Big{(}d\Delta_{0}/\sqrt{K}+\sqrt{dn}\Big{)}.$

Then if we denote $\mathbf{D}_{\mathbf{\Lambda}}=(\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}$ , we have that

\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}\left(K\sqrt{\frac{n}{d}}\Delta_{0}^{-2}\right)=o_{P}(1),

and thus we have

\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}\lesssim\|\mathbf{\Lambda}^{-1}\|_{2}\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}\left(K\sqrt{\frac{n}{d}}\Delta_{0}^{-2}\right)\Delta^{-1}=o_{P}(n/\lambda_{1}^{2})=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)},

and furthermore, we have

\displaystyle n\|\widetilde{\mathbf{\Lambda}}^{-2}\!-\!\mathbf{H}\mathbf{\Lambda}^{-2}\mathbf{H}^{\top}\|_{2}

\displaystyle\!\lesssim\!n\|\mathbf{\Lambda}^{-1}\|_{2}\!\|\widetilde{\mathbf{\Lambda}}^{-1}\!-\!\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\!\|_{2}\!=\!O_{P}\big{(}K\sqrt{\frac{n}{d}}/\Delta_{0}^{2}\big{)}n\Delta^{-2}\!=\!o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Combining the above results, we have $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ , and hence (13) holds with $\widetilde{\bm{\Sigma}}_{j}$ replaced by $\widehat{\bm{\Sigma}}_{j}$ .

B.13 Proof of Corollary 4.9

Recall that $\widehat{\mathbf{M}}=({1}/{\widehat{\theta}})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}})$ and $\widehat{\mathbf{M}}^{\prime}=({1}/{\theta})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}})$ share exactly the same sequence of eigenvectors, and we can treat $\widetilde{\mathbf{V}}^{\text{F}}$ as the FADI estimator applied to $\widehat{\mathbf{M}}^{\prime}$ . We will abuse the notation and denote $\mathbf{E}:=\widehat{\mathbf{M}}^{\prime}-\mathbf{M}$ .

To show that (8) holds, we need to verify that Assumptions 1 to 5 hold and the minimum eigenvalue conditions hold for the asymptotic covariance matrix. We know from Corollary 4.2 that Assumption 1 and Assumption 2 are satisfied, and that $r_{1}(d)=|\lambda_{1}|\mu K/\sqrt{d\theta}+\sqrt{d\sigma^{2}/\theta}$ and $r_{2}(d)=0$ . Define $\widetilde{\sigma}=(|\lambda_{1}|\mu K/d)\vee\sigma$ , we have from the proof of Corollary 4.2 that $\operatorname{Var}(\mathbf{E}_{ij})\asymp\widetilde{\sigma}^{2}/\theta$ and $|\mathbf{E}_{ij}|=O(\widetilde{\sigma}\log d/\theta)$ for $i,j\in[d]$ . From Theorem 4.2.1 in Chen et al., [12], we have that with probability $1-O(d^{-5})$

\displaystyle\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\lesssim\frac{\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}}{\Delta},

and thus we know $r_{3}(d)\asymp\big{(}\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}\big{)}/{\Delta}$ . Besides, by the proof of Theorem 4.2.1 in Chen et al., [12], with probability $1-O(d^{-7})$ , we have

	$\displaystyle\big{\\|}\mathbf{E}\big{(}\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})\!-\!\!\mathbf{V}\big{)}\!\big{\\|}_{2,\infty}\!$	$\displaystyle\lesssim\!\frac{\sqrt{dK}\widetilde{\sigma}^{2}}{\Delta{\theta}}\big{(}\!\sqrt{\log d}\!+\!\!\sqrt{\mu}\big{)}\!\!+\!\widetilde{\sigma}\sqrt{\frac{d}{\theta}}r_{3}(d)\!+\!\frac{\widetilde{\sigma}}{\Delta}\!\sqrt{K\frac{\log d}{\theta}}\\|\mathbf{E}\\|_{2}$
		$\displaystyle\quad\lesssim\frac{\sqrt{d}\widetilde{\sigma}^{2}}{\Delta{\theta}}\big{(}\sqrt{K\log d}+\kappa_{2}\sqrt{\mu K}\big{)},$

and thus $r_{4}(d)\asymp\frac{\sqrt{d}\widetilde{\sigma}^{2}}{\Delta{\theta}}\big{(}\sqrt{K\log d}+\kappa_{2}\sqrt{\mu K}\big{)}$ . Therefore, Assumption 4 is met and we have

	$\displaystyle r(d)$	$\displaystyle=\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}+\big{(}r_{2}(d)+r_{4}(d)\big{)}/\Delta$
		$\displaystyle\lesssim\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\left(\left(\frac{\kappa_{2}\sqrt{\mu K}+\sqrt{K\log d}}{\Delta}\right)\frac{\widetilde{\sigma}}{\sqrt{\theta}}+\sqrt{\frac{Kd}{pL}}\right).$

Now we will study the statistical rate $\eta_{1}(d)$ . We know that $\mathbf{E}_{ij}=\mathbf{E}_{ji}$ are i.i.d. across $i\leq j$ and $\operatorname{Var}(\mathbf{E}_{ij})\asymp\widetilde{\sigma}^{2}/\theta$ , then by Lemma B.5, with almost identical arguments as in the proof of Corollary 4.7, for $j\in[d]$ we have that $\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\lesssim\widetilde{\sigma}^{2}/\theta\sqrt{\mu K/d}$ , and thus $\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\gtrsim\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\big{)}\gtrsim\widetilde{\sigma}^{2}/\theta$ and we have $\eta_{1}(d)\asymp\lambda_{1}^{-2}\theta^{-1}\widetilde{\sigma}^{2}$ . Therefore, under the condition that $L\gg\kappa_{2}^{2}Kd^{2}/p$ and $\widetilde{\sigma}/\Delta\sqrt{d/\theta}\ll\min\left(\big{(}\kappa_{2}^{2}\sqrt{\mu K}+\kappa_{2}\sqrt{K\log d}\big{)}^{-1},\sqrt{p/d}\right)$ , we have that $\eta_{1}(d)^{-1/2}r(d)=o(1)$ .

Now we move on to verify Assumption 5. More specifically, we will show that the following results hold:

Given $j\in[d]$ , for any matrix $\mathbf{A}\in\mathbb{R}^{d\times K}$ that satisfies the following two conditions: (1) $\|\mathbf{A}\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\leq C\sqrt{\lambda_{1}^{2}\mu K/(d\Delta^{2})}$ ; (2) $\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\geq c\widetilde{\sigma}^{2}\theta^{-1}\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}$ , where $\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ and $C,c>0$ are fixed constants independent of $\mathbf{A}$ , it holds that

\mathbf{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).

(B.33)

To prove (B.33), it suffices to show that ${\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1)$ for any ${\mathbf{a}}\in\mathbb{R}^{K},\|{\mathbf{a}}\|_{2}=1$ . We will first study $\mathbf{P}_{\perp}\mathbf{e}_{j}$ , $\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}$ and $\max_{ik}{\mathbb{E}}|\mathbf{E}_{ik}|^{3}$ . It holds that

	$\displaystyle\|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}\|=\|\big{(}(\mathbf{I}_{d}-\mathbf{V}\mathbf{V}^{\top})\mathbf{e}_{j}\big{)}_{j}\|\leq 1+\\|\mathbf{V}\\|_{2,\infty}^{2}=1+o(1);$
	$\displaystyle\max_{i\neq j}\|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|=\max_{i\neq j}\|\mathbf{e}_{i}^{\top}\mathbf{e}_{j}-\mathbf{e}_{i}^{\top}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}\|\leq 0+\\|\mathbf{V}\\|_{2,\infty}^{2}=\frac{\mu K}{d};$
	$\displaystyle\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}\leq\\|\mathbf{A}\\|_{2,\infty}\\|\bm{\Sigma}_{j}^{-1/2}\\|_{2}\lesssim(\widetilde{\sigma}^{2}/{\theta})^{-1/2}\\|\mathbf{A}\\|_{2,\infty}/{\sigma}_{\min}(\mathbf{A})\lesssim\kappa_{2}\sqrt{\frac{\mu K}{d}}\frac{\sqrt{\theta}}{\widetilde{\sigma}};$
	$\displaystyle\max_{ik}{\mathbb{E}}\|\mathbf{E}_{ik}\|^{3}\lesssim\frac{\\|\mathbf{M}\\|_{\max}^{3}}{\theta^{3}}\theta+\frac{{\sigma}^{3}(\log d)^{3}}{\theta^{3}}\theta\lesssim\frac{\widetilde{\sigma}^{3}}{\theta^{2}}(\log d)^{3}.$

Then we know that

	$\displaystyle{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}=\sum_{ik}\mathbf{E}_{ik}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}=\sum_{i=1}^{d}\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}$
	$\displaystyle\quad+\sum_{i<k}\mathbf{E}_{ik}\big{[}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big{]}.$

Then for the diagonal entries we have

	$\displaystyle\sum_{i=1}^{d}{\mathbb{E}}\|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|^{3}$
	$\displaystyle={\mathbb{E}}\|\mathbf{E}_{jj}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{j}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}\|^{3}+\sum_{i\neq j}{\mathbb{E}}\|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|^{3}$
	$\displaystyle\lesssim{\mathbb{E}}\|\mathbf{E}_{jj}\|^{3}\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}+d\max_{i}{\mathbb{E}}\|\mathbf{E}_{ii}\|^{3}\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}\max_{i\neq j}\|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\|^{3}$
	$\displaystyle\lesssim\frac{\kappa_{2}^{3}K\mu}{d}\sqrt{\frac{\mu K}{d\theta}}(\log d)^{3},$

and for the off-diagonal entries, under the condition $\kappa_{2}^{6}K^{3}\mu^{3}=o(d^{1/2})$ it holds that

	$\displaystyle\sum_{i<k}{\mathbb{E}}\Big{\|}\mathbf{E}_{ik}\big{[}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big{]}\Big{\|}^{3}\lesssim d\frac{\widetilde{\sigma}^{3}}{\theta^{2}}\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}(\log d)^{3}$
	$\displaystyle\quad+d^{2}\frac{\widetilde{\sigma}^{3}}{\theta^{2}}(\log d)^{3}\\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\\|_{\infty}^{3}\big{(}\frac{\mu K}{d}\big{)}^{3}\lesssim{\kappa_{2}^{3}K\mu}\sqrt{\frac{\mu K}{d\theta}}(\log d)^{3}=o(1).$

Moreover, since $\operatorname{Var}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})=1$ , by the Lyapunov’s condition, (B.33) holds and Assumption 5 is satisfied by plugging in $\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}$ . By Theorem 4.5, we have that (8) follows.

To show that (15) holds we need to show that $\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}=o(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j}))$ . From previous discussion we learnt that

	$\displaystyle\\|\widetilde{\bm{\Sigma}}_{i}-\bm{\Sigma}_{j}\\|_{2}\leq\\|\mathbf{V}\mathbf{\Lambda}^{-1}\\|_{2}^{2}\\|\operatorname{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\\|_{2}$
	$\displaystyle\leq\frac{1}{\Delta^{2}}\sqrt{\frac{\mu K}{d}}\frac{\widetilde{\sigma}^{2}}{\theta}\lesssim\kappa_{2}^{2}\sqrt{\frac{\mu K}{d}}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})=o(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})).$

Then by Slutsky’s Theorem, (15) holds.

Last we verify that the distributional convergence still holds when we plug in the estimator $\widehat{\bm{\Sigma}}_{j}$ . Similar as in the previous proof, it suffices for us to prove that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}$ . In the following proof, we will base the discussion on the event that $\mathbf{H}$ is orthonormal. We will first bound $\|\widetilde{\mathbf{M}}-\mathbf{M}\|_{\max}$ . From previous discussion we have the following bounds

\|\widehat{\mathbf{M}}^{\prime}-\mathbf{M}\|_{2}=O_{P}(\sqrt{d\widetilde{\sigma}^{2}/\theta}),\quad\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=O_{P}(\frac{1}{\Delta}\sqrt{d\widetilde{\sigma}^{2}/\theta}),

and

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\\|_{2,\infty}\leq\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widehat{\mathbf{V}}\mathbf{H}_{0}\\|_{2}+\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty}=o_{P}(\frac{\widetilde{\sigma}}{\|\lambda_{1}\|\sqrt{\theta}})$
	$\displaystyle\quad+O_{P}(\frac{\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}}{\Delta})=O_{P}(\frac{\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}}{\Delta}).$

Now we can study $\widetilde{\mathbf{M}}=(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})\widehat{\mathbf{M}}(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})=\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}(\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}$ . Recall by Hoeffding’s inequality [21], with probability $1-O(d^{-10})$ we have that $|\widehat{\theta}-\theta|\lesssim\frac{\sqrt{\log d}}{d}$ and $|{\mathcal{S}}|=\Omega(d^{2}\theta)$ , and we have that

	$\displaystyle\\|\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\\|_{2}\leq\\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\\|_{2}$
	$\displaystyle\quad+\\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})\\|_{2}+\\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{M}\mathbf{V}\\|_{2}+O_{P}\Big{(}\frac{\sqrt{\log d}}{d{\theta}}\|\lambda_{1}\|\Big{)}$
	$\displaystyle\lesssim\\|\widehat{\mathbf{M}}^{\prime}-\mathbf{M}\\|_{2}+2\\|\mathbf{M}\\|_{2}\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\\|_{2}+O_{P}\Big{(}\frac{\sqrt{\log d}}{d{\theta}}\|\lambda_{1}\|\Big{)}=O_{P}(\kappa_{2}\sqrt{d\widetilde{\sigma}^{2}/\theta}).$

Then for any $i,k\in[d]$ , we have

	$\displaystyle\|\widetilde{\mathbf{M}}_{ik}\!-\!\mathbf{M}_{ik}\|=\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}-\mathbf{M}_{ik}\|$
	$\displaystyle\leq\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\|+\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\|$
	$\displaystyle\quad+\|(\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{k}\|=O_{P}(\kappa_{2}\sqrt{d\widetilde{\sigma}^{2}/\theta}\\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\\|_{2,\infty}^{2})$
	$\displaystyle\quad+O_{P}(\|\lambda_{1}\|\\|\mathbf{V}\\|_{2,\infty}\\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\\|_{2,\infty})=O_{P}(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}})\frac{\|\lambda_{1}\|\mu K}{d}=O_{P}\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\right)\widetilde{\sigma},$

and in turn we have

\displaystyle|\widetilde{\mathbf{M}}_{ik}^{2}\!-\!\mathbf{M}_{ik}^{2}|\lesssim\frac{|\lambda_{1}|\mu K}{d}|\widetilde{\mathbf{M}}_{ik}\!-\!\mathbf{M}_{ik}|=O_{P}\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\Big{(}\frac{|\lambda_{1}|\mu K}{d}\Big{)}^{2},\quad\forall i,k\in[d].

Now we move on to bound the error of $\widehat{\sigma}^{2}$ . We know from the setting of Example 4 that $\varepsilon_{ik}$ ’s are sub-Gaussian with variance proxy of order $O(\sigma^{2}(\log d)^{2})$ , and thus

	$\displaystyle\|\widehat{\sigma}^{2}\!-\!\sigma^{2}\|$	$\displaystyle\!=\!\Big{\|}\!\!\sum_{(i,k)\in{\mathcal{S}}}\!\!\!(\mathbf{M}_{ik}\!+\!\varepsilon_{ik}\!-\!\widetilde{\mathbf{M}}_{ik})^{2}/\|{\mathcal{S}}\|\!-\!\sigma^{2}\Big{\|}\!=\!\Big{\|}\!\!\sum_{(i,k)\in{\mathcal{S}}}\!\!\!(\mathbf{M}_{ik}\!+\!\varepsilon_{ik}\!-\!\mathbf{M}_{ik}\!+\!\mathbf{M}_{ik}\!-\!\widetilde{\mathbf{M}}_{ik})^{2}/\|{\mathcal{S}}\|\!-\!\sigma^{2}\Big{\|}$
		$\displaystyle\lesssim\Big{\|}\frac{1}{\|{\mathcal{S}}\|}\sum_{(i,k)\in{\mathcal{S}}}\varepsilon_{ik}^{2}-\sigma^{2}\Big{\|}+\\|\widetilde{\mathbf{M}}-\mathbf{M}\\|_{\max}^{2}=O_{P}\big{(}\frac{\sigma^{2}(\log d)^{2}}{\sqrt{\|{\mathcal{S}}\|}}\big{)}+O_{P}\Big{(}\frac{\kappa_{2}^{2}\mu^{2}K^{2}}{d\theta}\Big{)}\widetilde{\sigma}^{2}$
		$\displaystyle=O_{P}\Big{(}\frac{(\log d)^{2}}{d\sqrt{\theta}}\Big{)}\sigma^{2}+O_{P}\Big{(}\frac{\kappa_{2}^{2}\mu^{2}K^{2}}{d\theta}\Big{)}\widetilde{\sigma}^{2}.$

Then for any $i\in[d]$ , we have that

	$\displaystyle\Big{\|}\frac{\widetilde{\mathbf{M}}_{ij}^{2}(1-\widehat{\theta})}{\widehat{\theta}}\!+\!\frac{\widehat{\sigma}^{2}}{\widehat{\theta}}\!-\!\frac{\mathbf{M}_{ij}^{2}(1\!-\!\theta)}{\theta}\!-\!\frac{\sigma^{2}}{\theta}\Big{\|}\!\lesssim\!\|\widetilde{\mathbf{M}}_{ij}\|^{2}\Big{\|}\frac{1}{\widehat{\theta}}\!-\!\frac{1}{\theta}\Big{\|}\!+\!\frac{\|\widetilde{\mathbf{M}}_{ij}^{2}\!-\!\mathbf{M}_{ij}^{2}\|}{\theta}\!+\!\widehat{\sigma}^{2}\Big{\|}\frac{1}{\widehat{\theta}}\!-\!\frac{1}{\theta}\Big{\|}\!+\!\frac{\|\widehat{\sigma}^{2}\!-\!\sigma^{2}\|}{\theta}$
	$\displaystyle=O_{P}\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\theta^{-1}\Big{(}\frac{\|\lambda_{1}\|\mu K}{d}\Big{)}^{2}+O_{P}\Big{(}\frac{(\log d)^{2}}{d\sqrt{\theta}}+\frac{\kappa_{2}^{2}\mu^{2}K^{2}}{d\theta}\Big{)}\frac{\widetilde{\sigma}^{2}}{\theta}=O_{P}\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\frac{\widetilde{\sigma}^{2}}{\theta},$

and thus we have that

\|\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}^{2}(1-\widehat{\theta})/\widehat{\theta}+\widehat{\sigma}^{2}/\widehat{\theta}]_{i=1}^{d}\big{)}-\operatorname{diag}\big{(}[\mathbf{M}_{ij}^{2}(1-\theta)/\theta+\sigma^{2}/\theta]_{i=1}^{d}\big{)}\|_{2}=O_{P}\Big{(}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{\widetilde{\sigma}^{2}}{\theta}.

Also, we have shown that

\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\|_{2}=\|\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\|_{2}=O_{P}\Big{(}\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\Delta,

then we have $\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}=O_{P}\Big{(}\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{1}{\Delta}$ , and hence

	$\displaystyle\\|\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\\|_{2}\leq\\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\\|_{2}+\\|\mathbf{\Lambda}^{-1}\\|_{2}\\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\\|_{2}$
	$\displaystyle=O_{P}\Big{(}\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{1}{\Delta}+O_{P}\Big{(}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{1}{\Delta}=O_{P}\Big{(}\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{1}{\Delta}.$

Then following basic algebra we have that with high probability

\displaystyle\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}\lesssim O_{P}\Big{(}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{\widetilde{\sigma}^{2}}{\Delta^{2}\theta}+O_{P}\Big{(}\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{\widetilde{\sigma}^{2}}{\Delta^{2}\theta}=O_{P}\Big{(}\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big{)}\frac{\widetilde{\sigma}^{2}}{\Delta^{2}\theta}.

Then under the condition that $\kappa_{2}^{3}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}=o(1)$ , we have that

\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=O_{P}(\kappa_{2}^{3}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}})\frac{\widetilde{\sigma}^{2}}{\lambda_{1}^{2}\theta}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

C Proof of Technical Lemmas

In this section, we provide proofs of the technical lemmas used in the proofs of the main theorems.

C.1 Proof of Lemma B.2

It can be easily seen that

\|\mathbf{\Omega}/\sqrt{p}\|_{2}=(\|\mathbf{\Omega}\mathbf{\Omega}^{\top}/p\|_{2})^{1/2}=\left((d/p)\|\mathbf{\Omega}^{\top}\mathbf{\Omega}/d\|_{2}\right)^{1/2}.

By Lemma 3 in [18], we know that $\|\|\mathbf{\Omega}^{\top}\mathbf{\Omega}/d-\mathbf{I}_{p}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{p/d}$ , and thus $\|\|\mathbf{\Omega}^{\top}\mathbf{\Omega}/d\|_{2}\|_{\psi_{1}}\lesssim 1+\sqrt{p/d}=O(1)$ . Therefore, we have $\|\|\mathbf{\Omega}\mathbf{\Omega}^{\top}/p\|_{2}\|_{\psi_{1}}\lesssim d/p$ . By Jensen’s inequality, we in turn get $\|\|\mathbf{\Omega}/\sqrt{p}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d/p}$ .

C.2 Proof of Lemma B.3

By Proposition 10.4 in [20], we know that for any $t\geq 1$ , we have

\mathbb{P}\left(\left\|\bm{\Omega}^{\dagger}\right\|_{2}\geq\frac{{\rm e}\sqrt{p}}{p-K+1}\cdot t\right)\leq t^{-(p-K+1)}.

(C.34)

Since $p\geq 2K$ , there exists a constant $c$ such that $\frac{{\rm e}p}{p-K+1}\leq c$ , and thus

\mathbb{P}\left(\sqrt{p}\left\|\bm{\Omega}^{\dagger}\right\|_{2}\geq ct\right)\leq t^{-(p-K+1)}.

(C.35)

Therefore, we have

	$\displaystyle{\mathbb{E}}\left(\left(\sigma_{\min}(\bm{\Omega}/\sqrt{p})\right)^{-a}\right)={\mathbb{E}}\left(\left\\|\sqrt{p}\bm{\Omega}^{\dagger}\right\\|_{2}^{a}\right)=\int_{u\geq 0}\mathbb{P}\left(\left\\|\sqrt{p}\bm{\Omega}^{\dagger}\right\\|_{2}^{a}\geq u\right)du$
	$\displaystyle\quad=\int_{0\leq u\leq c^{a}}\mathbb{P}\left(\left\\|\sqrt{p}\bm{\Omega}^{\dagger}\right\\|_{2}^{a}\geq u\right)du+\int_{u\geq c^{a}}\mathbb{P}\left(\left\\|\sqrt{p}\bm{\Omega}^{\dagger}\right\\|_{2}^{a}\geq u\right)du$
	$\displaystyle\quad\leq c^{a}+\int_{u\geq c^{a}}\!\!\!\mathbb{P}\left(\left\\|\sqrt{p}\bm{\Omega}^{\dagger}\right\\|_{2}\geq u^{1/a}\right)du\leq c^{a}+\int_{u\geq c^{a}}\!\!\!\left(u^{1/a}/c\right)^{-(p-K+1)}du$
	$\displaystyle\quad=c^{a}\left(1+\frac{1}{(p-K+1)/a-1}\right).$

Since $1+\frac{1}{(p-K+1)/a-1}\leq 2$ , the claim follows.

C.3 Proof of Lemma B.4

We first consider the probability $\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon\right)$ . Recall the matrix ${\mathbf{Y}}^{(\ell)}:=\mathbf{V}\mathbf{P}_{0}\mathbf{\Lambda}^{0}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}$ . Now by Jensen’s inequality and Wedin’s Theorem [42], we have

	$\displaystyle\\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}=\\|\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\|\widehat{\mathbf{M}}\right)-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\leq\mathbb{E}\left(\left\\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\\|_{2}\Big{\|}\widehat{\mathbf{M}}\right)$
	$\displaystyle\quad\lesssim{\mathbb{E}}\left(\\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\\|_{2}/\sigma_{K}\left({\mathbf{Y}}^{(\ell)}/\sqrt{p}\right)\|\widehat{\mathbf{M}}\right)\leq\frac{\\|\mathbf{E}\\|_{2}}{\Delta}{\mathbb{E}}\left(\frac{\\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\|_{2}}{\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)}\quad\bigg{\|}\widehat{\mathbf{M}}\right)$
	$\displaystyle\quad=\frac{\\|\mathbf{E}\\|_{2}}{\Delta}{\mathbb{E}}\left(\frac{\\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\|_{2}}{\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)}\right)\leq\frac{\\|\mathbf{E}\\|_{2}}{\Delta}{\mathbb{E}}\left(\\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\|_{2}^{2}\right)^{1/2}{\mathbb{E}}\left(\left(\sigma_{\min}(\bm{\Omega}^{(\ell)}/\sqrt{p})\right)^{-2}\right)^{1/2}$
	$\displaystyle\quad\lesssim\frac{\\|\mathbf{E}\\|_{2}}{\Delta}\\|\\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\|_{2}\\|_{\psi_{1}}\lesssim\frac{\\|\mathbf{E}\\|_{2}}{\Delta}\sqrt{d/p},$

where the last but one inequality is due to Lemma B.3 under the condition that $p\geq\max(2K,K+3)$ , and the last inequality is due to Lemma B.2. Therefore, by Assumption 1, there exist constants $c_{0},c_{0}^{\prime}>0$ such that

\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon\right)\leq\mathbb{P}\left(\frac{\|\mathbf{E}\|_{2}}{\Delta}\sqrt{d/p}\geq c_{0}^{\prime}\varepsilon\right)\leq\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{r_{1}(d)}\right).

Similarly, we consider the probability $\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq\varepsilon\right)$ . By Assumption 1, there exist constants $c_{0}^{\prime\prime},c_{0}^{\prime\prime\prime}>0$ such that

	$\displaystyle\mathbb{P}\left(\\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\\|_{2}\geq\varepsilon\right)\leq\mathbb{P}\left(\\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\geq\varepsilon/2\right)+\mathbb{P}\left(\\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\\|_{2}\geq\varepsilon/2\right)$
	$\displaystyle\quad\leq\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{2r_{1}(d)}\right)+\mathbb{P}\left(\frac{\\|\mathbf{E}\\|_{2}}{\Delta}\geq c_{0}^{\prime\prime\prime}\varepsilon\right)\leq\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{2r_{1}(d)}\right)+\exp\left(-\frac{c_{0}^{\prime\prime}\Delta\varepsilon}{r_{1}(d)}\right)$
	$\displaystyle\quad\lesssim\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{2r_{1}(d)}\right).$

Therefore, the claim follows.

C.4 Proof of Lemma B.5

We know that $\operatorname*{\rm Cov}(\mathbf{x}_{1}+\mathbf{x}_{2})=\operatorname*{\rm Cov}(\mathbf{x}_{1})+\operatorname*{\rm Cov}(\mathbf{x}_{2})+\operatorname*{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})+\operatorname*{\rm Cov}(\mathbf{x}_{2},\mathbf{x}_{1})$ , where $\operatorname*{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})={\mathbb{E}}(\mathbf{x}_{1}-{\mathbb{E}}\mathbf{x}_{1})(\mathbf{x}_{2}-{\mathbb{E}}\mathbf{x}_{2})^{\top}$ , and

\|\operatorname*{\rm Cov}(\mathbf{x}_{i})\|_{2}=\max_{\|\mathbf{v}\|_{2}=1}\mathbf{v}^{\top}\operatorname*{\rm Cov}(\mathbf{x}_{i})\mathbf{v}=\max_{\|\mathbf{v}\|_{2}=1}\operatorname{Var}\big{(}\mathbf{v}^{\top}\mathbf{x}_{i}\big{)},

for $i=1,2$ . Therefore, we have

	$\displaystyle\\|\operatorname*{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\\|_{2}$	$\displaystyle=\max_{\\|\mathbf{v}\\|_{2}=1,\\|\mathbf{u}\\|_{2}=1}\mathbf{v}^{\top}\operatorname{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\mathbf{u}=\max_{\\|\mathbf{v}\\|_{2}=1,\\|\mathbf{u}\\|_{2}=1}\operatorname{\rm Cov}(\mathbf{v}^{\top}\mathbf{x}_{1},\mathbf{u}^{\top}\mathbf{x}_{2})$
		$\displaystyle\leq\max_{\\|\mathbf{v}\\|_{2}=1,\\|\mathbf{u}\\|_{2}=1}\sqrt{\operatorname{Var}(\mathbf{v}^{\top}\mathbf{x}_{1})}\sqrt{\operatorname{Var}(\mathbf{v}^{\top}\mathbf{x}_{2})}=\sqrt{\\|\operatorname{\rm Cov}(\mathbf{x}_{1})\\|_{2}\\|\operatorname{\rm Cov}(\mathbf{x}_{2})\\|_{2}}$
		$\displaystyle\leq\frac{1}{2}\\|\operatorname{\rm Cov}(\mathbf{x}_{1})\\|_{2}+\frac{1}{2}\\|\operatorname{\rm Cov}(\mathbf{x}_{2})\\|_{2}.$

Thus we have

	$\displaystyle\\|\operatorname*{\rm Cov}(\mathbf{x}_{1}+\mathbf{x}_{2})\\|_{2}$	$\displaystyle\leq\\|\operatorname{\rm Cov}(\mathbf{x}_{1})\\|_{2}+\\|\operatorname{\rm Cov}(\mathbf{x}_{2})\\|_{2}+\\|\operatorname{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\\|_{2}+\\|\operatorname{\rm Cov}(\mathbf{x}_{2},\mathbf{x}_{1})\\|_{2}$
		$\displaystyle\leq 2\\|\operatorname{\rm Cov}(\mathbf{x}_{1})\\|_{2}+2\\|\operatorname{\rm Cov}(\mathbf{x}_{2})\\|_{2}.$

D Wedin’s Theorem

Lemma D.1 (Modified Wedin’s Theorem).

Let $\mathbf{M}^{\star}$ and $\mathbf{M}=\mathbf{M}^{\star}+\mathbf{E}$ be two matrices in $\mathbb{R}^{n_{1}\times n_{2}}$ (without loss of generality, we assume $n_{1}\leq n_{2}$ ), whose SVDs are given respectively by

\mathbf{M}^{\star}=\sum_{i=1}^{n_{1}}\sigma_{i}^{\star}\mathbf{u}_{i}^{\star}\mathbf{v}_{i}^{\star\top}=\left[\begin{array}[]{ll}\mathbf{U}^{\star}&\mathbf{U}_{\perp}^{\star}\end{array}\right]\left[\begin{array}[]{ccc}\bm{\Sigma}^{\star}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\bm{\Sigma}_{\perp}^{\star}&\mathbf{0}\end{array}\right]\left[\begin{array}[]{c}\mathbf{V}^{\star\top}\\ \mathbf{V}_{\perp}^{\star\top}\end{array}\right],

\mathbf{M}=\sum_{i=1}^{n_{1}}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top}=\left[\begin{array}[]{ll}\mathbf{U}&\mathbf{U}_{\perp}\end{array}\right]\left[\begin{array}[]{ccc}\bm{\Sigma}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\bm{\Sigma}_{\perp}&\mathbf{0}\end{array}\right]\left[\begin{array}[]{l}\mathbf{V}^{\top}\\ \mathbf{V}_{\perp}^{\top}\end{array}\right].

Here, $\sigma_{1}\geq\cdots\geq\sigma_{n_{1}}$ (resp. $\sigma_{1}^{\star}\geq\cdots\geq\sigma_{n_{1}}^{\star}$ ) stand for the singular values of $\mathbf{M}$ (resp. $\mathbf{M}^{\star}$ ) arranged in descending order, $\mathbf{u}_{i}$ (resp. $\left.\mathbf{u}_{i}^{\star}\right)$ denotes the left singular vector associated with the singular value $\sigma_{i}$ (resp. $\sigma_{i}^{\star}$ ), and $\mathbf{v}_{i}$ (resp. $\mathbf{v}_{i}^{\star}$ ) represents the right singular vector associated with $\sigma_{i}$ (resp. $\sigma_{i}^{\star}$ ). $\mathbf{U}$ and $\mathbf{U}^{\star}$ stand for the top $r$ eigenvectors of $\mathbf{M}$ and $\mathbf{M}^{\star}$ respectively. Then,

\max\left\{\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{U}^{\star}\mathbf{U}^{\star\top}\|_{2},\|\mathbf{V}\mathbf{V}^{\top}-\mathbf{V}^{\star}\mathbf{V}^{\star\top}\|_{2}\right\}\lesssim\frac{2\|\mathbf{E}\|}{\sigma_{r}^{\star}-\sigma_{r+1}^{\star}},

(D.36)

and

\max\left\{\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{U}^{\star}\mathbf{U}^{\star\top}\|_{\rm F},\|\mathbf{V}\mathbf{V}^{\top}-\mathbf{V}^{\star}\mathbf{V}^{\star\top}\|_{\rm F}\right\}\lesssim\frac{2\sqrt{r}\|\mathbf{E}\|}{\sigma_{r}^{\star}-\sigma_{r+1}^{\star}}.

(D.37)

Proof.

By Wedin’s Theorem [42], if $\|\mathbf{E}\|_{2}<(1-1/\sqrt{2})\left(\sigma_{r}^{\star}-\sigma_{r+1}^{\star}\right)$ , (D.36) and (D.37) are true. When $\|\mathbf{E}\|_{2}\geq(1-1/\sqrt{2})\left(\sigma_{r}^{\star}-\sigma_{r+1}^{\star}\right)$ , the RHS of (D.36) are larger than or equal to $2-\sqrt{2}$ , whereas the LHS are bounded by 1. Thus (D.36) follows trivially, and so is (D.37). ∎

E Supplementary Figures

We provide in this section additional figures deferred from the main paper.

(a) Error Rate	(b) Running Time

(c) Coverage Probability	(d) Q-Q Plot

(a) Error Rate	(b) Misclustering Rate	(c) Running Time

(d) Coverage Probability	(e) Power	(f) Q-Q Plot


(a) PC 1 versus PC 2	(b) PC 1 versus PC 3	(c) PC 1 versus PC 4

(d) PC 2 versus PC 3	(e) PC 2 versus PC 4	(f) PC 3 versus PC 4

(a) Error Rate	(b) Running Time

(c) Coverage Probability	(d) Q-Q Plot

(a) Error Rate	(b) Running Time

(c) Coverage Probability	(d) Q-Q Plot

(a) Example 1: Spiked Covariance Model	(b) Example 3: Gaussian Mixture Models

FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data

Abstract

1 Introduction

Related Papers on Inferential Analysis of PCA

Our Contributions

Paper Organization

Notation

2 Preliminaries and Problem Setup

Example 1 (Spiked Covariance Model [25]).

Example 2 (Degree-Corrected Mixed Membership (DCMM) Model [16]).

Example 3 (Gaussian Mixture Models (GMM) [12]).

Example 4 (Incomplete Matrix Inference [13]).

3 Method

3.1 Fast Distributed PCA (FADI): Overview and Intuition

3.2 General Algorithmic Framework

Remark 1.

3.3 Case-Specific Processing of Raw Data

3.4 Computational Complexity

3.5 Estimation of the Rank KK

4 Theory

4.1 Theoretical Bound on Error Rates

Assumption 1 (Convergence of ‖𝐄‖2\|\mathbf{E}\|_{2}).

Remark 2.

Theorem 4.1.

Remark 3.

Corollary 4.2.

Remark 4.

Theorem 4.3.

Corollary 4.4.

Remark 5.

4.2 Inferential Results on the Asymptotic Distribution: Intuition and Assumptions

Assumption 2 (Statistical Rate for the Biased Error Term).

Assumption 3 (Incoherence Condition).

Assumption 4 (Statistical Rates for Eigenspace Convergence).

Assumption 5 (Central Limit Theorem).

4.3 Inference When L​p≫dLp\gg d

Theorem 4.5.

Remark 6.

4.3.1 Spiked Covariance Model

Corollary 4.6.

Remark 7.

4.3.2 Degree-Corrected Mixed Membership Models

Corollary 4.7.

Remark 8.

4.3.3 Gaussian Mixure Models

Corollary 4.8.

Remark 9.

4.3.4 Incomplete Matrix Inference

Corollary 4.9.

Remark 10.

4.4 Inference When L​p≪dLp\ll d

Theorem 4.10.

Remark 11.

4.4.1 Spiked Covariance Model

Corollary 4.11.

Remark 12.

4.4.2 Gaussian Mixture Models

Corollary 4.12.

Remark 13.

5 Numerical Results

5.1 Example 1: Spiked Covariance Model

5.2 Example 2: Degree-Corrected Mixed Membership Models

6 Application to the 1000 Genomes Data

6.1 Estimation of Principal Eigenspace

6.2 Inference on Ancestry Membership Profiles

7 Discussion

References

A Additional Simulation Results

A.1 Example 3: Gaussian Mixture Models

A.2 Example 4: Incomplete Matrix Inference

A.3 Additional Results for Example 1 in the Genetic Setting

B Proof of Main Theoretical Results

B.1 Unbiasedness of Fast Sketching With Respect to 𝐌^\widehat{\mathbf{M}}

Lemma B.1.

Proof.

B.2 Proof of Theorem 4.1

Lemma B.2.

Lemma B.3.

Lemma B.4.

B.2.1 Control of the Bias Term

3.5 Estimation of the Rank $K$

Assumption 1 (Convergence of $\|\mathbf{E}\|_{2}$ ).

4.3 Inference When $Lp\gg d$

4.4 Inference When $Lp\ll d$

B.1 Unbiasedness of Fast Sketching With Respect to $\widehat{\mathbf{M}}$