
FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data

Shuting Shen, Junwei Lu, and Xihong Lin
Shuting Shen is a PhD student ([email protected]) and Junwei Lu ([email protected]) is an Assistant Professor at the Department of Biostatistics at Harvard T.H. Chan School of Public Health. Xihong Lin is Professor of Biostatistics at Harvard T.H. Chan School of Public Health and Professor of Statistics at Harvard University ([email protected]). This work was supported by the National Institutes of Health grants R35-CA197449, U01-HG009088, U01HG012064, U19-CA203654, and P30 ES000002.
Abstract

Principal component analysis (PCA) is one of the most popular methods for dimension reduction. In light of the rapidly growing large-scale data in federated ecosystems, the traditional PCA method is often not applicable due to privacy protection considerations and large computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension $d$ and the sample size $n$ are ultra-large, by simultaneously performing parallel computing along $d$ and distributed computing along $n$. Specifically, we utilize $L$ parallel copies of $p$-dimensional fast sketches to divide the computing burden along $d$ and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI enjoys the same non-asymptotic error rate as the traditional PCA when $Lp\geq d$. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as $Lp$ increases. We perform extensive simulations to show that FADI substantially outperforms the existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.

Keywords: Computational efficiency; Distributed computing; Efficient communication; Fast PCA; Large-scale inference; Federated learning; Random matrices; Random sketches.

1 Introduction

As one of the most popular methods for dimension reduction, principal component analysis (PCA) finds applications in a broad spectrum of scientific fields including network studies [3], statistical genetics [35] and finance [31]. Methodologically, parameter estimation in many statistical models is based on PCA, such as spectral clustering in graphical models [2], missing data imputation through low-rank matrix completion [23], and clustering with subsequent k-means refinement in Gaussian mixture models [12]. When it comes to real data analysis, however, several shortcomings of the traditional PCA method hinder its application to large-scale datasets. First, the high dimensionality and large sample size of modern big data can render the PCA computation infeasible in practice. For instance, PCA is commonly used for controlling for ancestry confounding in Genome-Wide Association Studies (GWAS) [33], yet biomedical databases, such as the UK Biobank [39], often contain hundreds of thousands to millions of Single Nucleotide Polymorphisms (SNPs) and subjects, which entails more scalable algorithms to handle the intensive computation of PCA. Second, large-scale datasets in many applications are stored in federated ecosystems, where data cannot leave individual warehouses due to privacy protection considerations [8, 14, 15, 29, 34]. This calls for federated learning methods [26, 30] that provide efficient and privacy-protected strategies for joint analysis across data warehouses without the need to exchange individual-level data.

The burgeoning popularity of large-scale data necessitates the development of fast algorithms that can cope with both high dimensionality and massiveness efficiently and distributively. Indeed, efforts have been made in recent years on developing fast PCA and distributed PCA algorithms. The existing fast PCA algorithms use the full-sample data and apply random projection to speed up PCA calculations [11, 20], while the existing distributed PCA algorithms apply the traditional PCA method to the split data and aggregate the results [18, 28].

Specifically, fast PCA algorithms utilize the fact that the column space of a low-rank matrix can be represented by a small set of columns and use random projection to approximate the original high-dimensional matrix [4]. For instance, Halko et al. [20] proposed to estimate the $K$ leading eigenvectors of a $d\times d$ matrix ($K\ll d$) using Gaussian random sketches, which decreases the computation time by a factor of $O(d)$ at the cost of inflating the statistical error by a power of $d$. Chen et al. [11] modified Halko et al. [20]'s method by repeating the fast sketching multiple times and showed the consistency of the algorithm using the average of i.i.d. random sketches when the number of sketches goes to infinity. However, they did not study the trade-off between computational complexity and error rates in finite samples, and hence did not recommend the number of fast sketches that optimizes both the computational efficiency and the statistical accuracy. As the fast PCA methods use the full data, they have two major limitations. First, they are often not scalable to large sample sizes $n$. Second, they are not applicable to federated data when data in different sites cannot be shared.

The existing distributed PCA algorithms reduce the PCA computational burden by partitioning the full data “horizontally” or “vertically” [18, 27, 28]. The horizontal partition splits the data over the sample size $n$, whereas the vertical partition splits the data over the dimension $d$. Horizontal partition is useful when the sample size $n$ is large or when the data are federated in multiple sites. For example, Fan et al. [18] considered horizontally distributed PCA, where they estimated the $K$ leading eigenvectors of the $d\times d$ population covariance matrix by applying traditional PCA to each data split and aggregating the PCA results across different datasets. They showed that when the number of data splits is not too large, the error rate of their algorithm is of the same order as the traditional PCA. Since they used the traditional PCA algorithm for each data partition, the computational complexity is at least of order $O(d^{3})$, which is computationally difficult when $d$ is large, e.g., in GWAS, where $d$ ranges from hundreds of thousands to millions. Kargupta et al. [28] considered vertical partition and developed a method that collects local principal components (PCs) and then reconstructs global PCs by linear transformations. However, there is no theoretical guarantee on the error rate compared with the traditional full-sample PCA, and the method may fail when variables are correlated.

Apart from the aforementioned PCA applications in parameter estimation, inference also constitutes an important part of PCA methods. For example, when studying the ancestry groups of whole genome data under the mixed membership models, while the estimation error rate guarantees the overall misclustering rate for all subjects, one may be interested in testing whether two individuals of interest share the same ancestry membership profile and assessing the associated statistical uncertainty [16]. Furthermore, despite the rich literature depicting the asymptotic distribution of traditional PCA estimators under different statistical models [16, 32, 41], distributional characterizations of fast PCA methods and distributed PCA methods are not well studied. For instance, Yang et al. [44] characterized the convergence of fast sketching estimators in probability but gave no inferential results. Halko et al. [20] provided an error bound for the fast PCA algorithm, but there is no characterization of the asymptotic distribution and hence no evaluation of the testing efficiency. Fan et al. [18] derived the non-asymptotic error rate of the distributed PC estimator but did not provide distributional guarantees, and inference based upon their estimator is computationally intensive when the dimension $d$ is large.

In summary, the existing fast PCA algorithms accelerate computation along $d$ by fast sketching, but cannot handle distributed computing along $n$. The existing distributed PCA methods mainly focus on dividing the computing burden along $n$, while distributed computing along $d$ is complicated by variable correlation and lacks theoretical guarantees. It remains an open question how to develop fast and distributed PCA algorithms that can handle both large $d$ and $n$ simultaneously, while achieving the same asymptotic efficiency as the traditional PCA.

In view of the gaps in the existing literature, we propose in this paper a scalable and computationally efficient FAst DIstributed (FADI) PCA method applicable to federated data that could be large in both $d$ and $n$. More specifically, to obtain the $K$ leading PCs of a $d\times d$ matrix $\mathbf{M}$ from its estimator $\widehat{\mathbf{M}}$, we take the divide-and-conquer strategy to break down the computational complexity along the dimension $d$: we generate the $p$-dimensional fast sketch $\widehat{\mathbf{Y}}=\widehat{\mathbf{M}}\bm{\Omega}$ and perform SVD on $\widehat{\mathbf{Y}}$ instead of $\widehat{\mathbf{M}}$ to expedite the PCA computation, where $\bm{\Omega}\in\mathbb{R}^{d\times p}$ is a Gaussian test matrix with $K\leq p\ll d$; meanwhile, to adjust for the additional variability induced by random approximation, we repeat the fast sketching $L$ times in parallel, and then aggregate the SVD results across data splits to restore statistical accuracy. When the data are distributively stored, the federated structure of $\widehat{\mathbf{Y}}$ also enables its easy implementation without the need to share individual-level data, which in turn facilitates distributing the computing burden along $n$ among the split samples, as opposed to the existing fast PCA methods that are not scalable to large $n$. We will show that FADI has computational complexities of smaller magnitudes than existing methods (see Table 3), while achieving the same asymptotic efficiency as the traditional PCA. Moreover, we establish FADI under a general framework that covers multiple statistical models. We list below four statistical problems as illustrative applications of FADI, where we define $\mathbf{M}$ and $\widehat{\mathbf{M}}$ in each setting:

(1) Spiked covariance model: let $\bm{X}_{1},\ldots,\bm{X}_{n}\in\mathbb{R}^{d}$ be i.i.d. random vectors with spiked covariance $\bm{\Sigma}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}+\sigma^{2}\mathbf{I}$, where $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ is the rank-$K$ spiked component of interest. Define $\widehat{\mathbf{M}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}-\widehat{\sigma}^{2}\mathbf{I}$ to be the estimator for $\mathbf{M}$, where $\widehat{\sigma}^{2}$ is a consistent estimator for $\sigma^{2}$. We assume that the data are split along the sample size $n$ and stored on $m$ servers.

(2) Degree-corrected mixed membership (DCMM) model: let $\mathbf{X}$ be the adjacency matrix for an undirected graph of $d$ nodes, where the connection probabilities between nodes are determined by their membership assignments to $K$ communities and node-associated degrees. Consider the data $\widehat{\mathbf{M}}=\mathbf{X}$ to be split along $d$ on $m$ servers, and we aim to infer the membership profiles of the nodes by recovering the $K$-leading eigenspace of the marginal connection probability matrix $\mathbf{M}={\mathbb{E}}(\mathbf{X})$ using the data $\widehat{\mathbf{M}}$.

(3) Gaussian mixture models (GMM): let $\bm{W}_{1},\ldots,\bm{W}_{d}\in\mathbb{R}^{n}$ be independent random vectors drawn from $K$ Gaussian distributions with different means and identity covariance matrix. We are interested in clustering the samples by estimating the eigenspace of $\mathbf{M}=[\mathbf{M}_{jj^{\prime}}]=[{\mathbb{E}}(\bm{W}_{j})^{\top}{\mathbb{E}}(\bm{W}_{j^{\prime}})]$, whose estimator is given by $\widehat{\mathbf{M}}=[\widehat{\mathbf{M}}_{jj^{\prime}}]=[\bm{W}_{j}^{\top}\bm{W}_{j^{\prime}}]-n\mathbf{I}$. Assume the data are distributively stored on $m$ servers along the dimension $n$.

(4) Incomplete matrix inference: we have a low-rank matrix $\mathbf{M}$ of interest, and we observe $\widehat{\mathbf{M}}$ as a perturbed version of $\mathbf{M}$ with missing entries. Assume $\widehat{\mathbf{M}}$ to be vertically split along $d$ on $m$ servers, and we aim to infer the eigenspace of $\mathbf{M}$ through $\widehat{\mathbf{M}}$.

We will elaborate on the above examples in Section 2. We consider distributed settings for all four problems, where the data are split along $n$ for the spiked covariance model and the GMM, and along $d$ for the DCMM model and the incomplete matrix inference model, given that $d$ coincides with $n$ for those two. We will establish in Section 4.1 a general non-asymptotic error bound applicable to multiple statistical models as well as case-specific error rates for each example, and show that the non-asymptotic error rate of FADI is of the same order as the traditional PCA as long as the sketching dimension $p$ and the number of fast sketches $L$ are sufficiently large. Inferentially, we provide distributional characterizations of FADI under different regimes of the fast sketching parameters. We observe a phase-transition phenomenon where the asymptotic covariance matrix takes on two different forms as $Lp$ increases. When $Lp\gg d$, the FADI estimator converges in distribution to a multivariate Gaussian, and the asymptotic relative efficiency (ARE) between FADI and the traditional PCA is 1 (see Figure 1). On the other hand, when $Lp\ll d$, FADI has higher computational efficiency and still enjoys asymptotic normality under certain models, but has a larger asymptotic variance.

Figure 1: Asymptotic relative efficiency (ARE) between the FADI estimator and the traditional PCA estimator under (a) Example 1: spiked covariance model and (b) Example 3: Gaussian mixture models, where the ARE is measured by $\det(\widehat{\bm{\Sigma}}^{\rm FADI})^{1/K}\cdot\det(\widehat{\bm{\Sigma}}^{\rm PCA})^{-1/K}$, with $\widehat{\bm{\Sigma}}^{\rm FADI}$ and $\widehat{\bm{\Sigma}}^{\rm PCA}$ being the empirical covariance matrices of the FADI and traditional PCA estimators [36].

Related Papers on Inferential Analysis of PCA

There has been a great amount of literature depicting the asymptotic distribution of traditional PCA estimators. Anderson [5] characterized the asymptotic normality of eigenvectors and eigenvalues for traditional PCA on the sample covariance matrix with fixed dimension. Paul [32] and Wang and Fan [41] extended the analysis to the high-dimensional regime and established distributional results under the spiked covariance model. Similar efforts were made by Johnstone [25] and Baik et al. [6], who studied the limiting distribution of the largest empirical eigenvalue when both the dimension and the sample size go to infinity. Apart from inference on the sample covariance matrix of i.i.d. data, previous works also made progress in inferential analyses for a variety of statistical models, including the DCMM model [16], the matrix completion problem [13], and high-dimensional data with heteroskedastic noise and missingness under the spiked covariance model [43]. Specifically, Fan et al. [16] employed statistics based on principal eigenspace estimators of the adjacency matrix to perform inference on whether two given nodes share the same membership profile under the DCMM model. Chen et al. [13] constructed entry-wise confidence intervals (CIs) for a low-rank matrix with missing data and Gaussian noise based on debiased convex/nonconvex PC estimators. A similar missing-data inference problem was studied in Yan et al. [43], who adopted a refined spectral method with imputed diagonal for CI construction of the underlying spiked covariance matrix of corrupted samples with missing data.

The aforementioned works were all based upon the traditional PCA approach and considered no distributed data setting, and hence will suffer from low computational efficiency when the data are high-dimensional or distributively stored across different sites. Our paper fills the gap in the literature and provides general inferential results on the fast sketching method with high computational efficiency adapted to high-dimensional federated data.

Our Contributions

We summarize the major contributions of our paper as follows.

First, the existing PCA methods either handle high dimensions $d$ or large sample sizes $n$, but not both. Specifically, fast PCA [20] handles large $d$ but has elevated error rates and is difficult to apply when $n$ is large. Distributed PCA [18] handles large $n$ but is not scalable to large $d$, as it applies traditional PCA to each data split. FADI overcomes the limitations of these methods by providing scalable PCA when both $d$ and $n$ are large or the data are federated. Because the variables are usually dependent, it is challenging to achieve parallel computing along $d$ and distributed computing along $n$ simultaneously. To address this challenge, FADI splits the data along $n$ and untangles the variable dependency along $d$ by dividing the high-dimensional data into $L$ copies of $p$-dimensional fast sketches. Namely, for each split dataset, FADI performs multiple parallel fast sketchings instead of the traditional PCA, and then aggregates the PC results distributively over the split samples. We establish theoretical error bounds showing that FADI is as accurate as the traditional PCA as long as $Lp\gtrsim d$.

Second, we provide distributional characterizations for inferential analyses and show a phase-transition phenomenon. We provide distributional guarantees on the FADI estimator to facilitate inference, which are absent in the previous literature on fast PCA methods and distributed PCA methods. More specifically, we depict the trade-off between computational complexity and testing efficiency by studying FADI's asymptotic distribution under the regimes $Lp\ll d$ and $Lp\gg d$, respectively. We show that the same asymptotic efficiency as the traditional PCA can be achieved when $Lp\gg d$ with a compromise on computational efficiency, while faster inferential procedures can be performed when $Lp\ll d$ with suboptimal testing efficiency. We further validate the distributional phase transition via numerical experiments.

Third, we propose FADI under a general framework applicable to multiple statistical models under mild assumptions, including the four examples discussed earlier in this section. We provide a comprehensive investigation of FADI’s performance both methodologically and theoretically under the general framework, and illustrate the results with the aforementioned statistical models. In comparison, the existing distributed methods mainly focus on estimating the covariance structure of independent samples [18].

Paper Organization

The rest of the paper is organized as follows. Section 2 introduces the problem setting and provides an overview of FADI and its intuition. Section 3 discusses FADI’s implementation details, as well as the computational complexity of FADI and its modifications when KK is unknown. Section 4 presents the theoretical results of the statistical error and asymptotic normality of the FADI estimator. Section 5 shows the numerical evaluation of FADI and comparison with several existing methods. The application of FADI to the 1000 Genomes Data is given in Section 6.

Notation

We use $\mathbf{1}_{d}\in\mathbb{R}^{d}$ to denote the vector of length $d$ with all entries equal to 1, and denote by $\{\mathbf{e}_{i}\}_{i=1}^{d}$ the canonical basis of $\mathbb{R}^{d}$. For a matrix $\mathbf{A}=[\mathbf{A}_{ij}]\in\mathbb{R}^{m\times n}$, we use $\sigma_{i}(\mathbf{A})$ (respectively $\lambda_{i}(\mathbf{A})$) to represent the $i$-th largest singular value (respectively eigenvalue) of $\mathbf{A}$, and $\sigma_{\max}(\mathbf{A})$ or $\sigma_{\min}(\mathbf{A})$ (respectively $\lambda_{\max}(\mathbf{A})$ or $\lambda_{\min}(\mathbf{A})$) stands for the largest or smallest singular value (respectively eigenvalue) of $\mathbf{A}$. If $\mathbf{A}$ has the singular value decomposition (SVD) $\mathbf{A}=\mathbf{U}\mathbf{\Lambda}\mathbf{V}^{\top}=\sum_{j=1}^{K}\sigma_{j}\mathbf{u}_{j}\mathbf{v}_{j}^{\top}$, then we denote by $\mathbf{A}^{\dagger}=\mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{U}^{\top}$ the pseudo-inverse of $\mathbf{A}$, by $\mathbf{P}_{\mathbf{A}}=\mathbf{A}\mathbf{A}^{\dagger}$ the projection matrix onto the column space of $\mathbf{A}$, and by $\operatorname{sgn}(\mathbf{A})=\sum_{\sigma_{j}>0}\mathbf{u}_{j}\mathbf{v}_{j}^{\top}$ the matrix signum. If $\mathbf{A}$ is positive definite with eigen-decomposition $\mathbf{A}=\mathbf{U}\mathbf{D}\mathbf{U}^{\top}$, we define $\mathbf{A}^{1/2}=\mathbf{U}\mathbf{D}^{1/2}\mathbf{U}^{\top}$ and $\mathbf{A}^{-1/2}=\mathbf{U}\mathbf{D}^{-1/2}\mathbf{U}^{\top}$. We denote by $\otimes$ the Kronecker product. For two orthonormal matrices $\mathbf{V},\mathbf{U}\in\mathbb{R}^{n_{1}\times n_{2}}$ with $n_{1}>n_{2}$, we measure the distance between their column spaces by $\rho(\mathbf{U},\mathbf{V})=\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\text{F}}$. For a vector $\mathbf{v}$, we use $\|\mathbf{v}\|_{2}$ to denote the vector $\ell_{2}$-norm and $\|\mathbf{v}\|_{\infty}$ to denote the vector $\ell_{\infty}$-norm. For a matrix $\mathbf{A}=[A_{ij}]$, we denote by $\|\mathbf{A}\|_{2}$ the matrix spectral norm, $\|\mathbf{A}\|_{\text{F}}$ the Frobenius norm, $\|\mathbf{A}\|_{2,\infty}=\sup_{\|\mathbf{x}\|_{2}=1}\|\mathbf{A}\mathbf{x}\|_{\infty}=\max_{i}\|\mathbf{A}^{\top}\mathbf{e}_{i}\|_{2}$ the 2-to-$\infty$ norm, and $\|\mathbf{A}\|_{\max}=\max_{i,j}|A_{ij}|$ the matrix max norm. For an integer $n$, define $[n]=\{1,2,\ldots,n\}$. For two positive sequences $x_{n}$ and $y_{n}$, we say $x_{n}\lesssim y_{n}$ or $x_{n}=O(y_{n})$ if $x_{n}\leq Cy_{n}$ for some $C>0$ that does not depend on $n$. We say $x_{n}\asymp y_{n}$ if $x_{n}\lesssim y_{n}$ and $y_{n}\lesssim x_{n}$. If $\lim_{n\rightarrow\infty}x_{n}/y_{n}=0$, we say $x_{n}=o(y_{n})$ or $x_{n}\ll y_{n}$. Let $\mathbb{I}\{\cdot\}$ be an indicator function, which takes the value 1 if the statement inside $\{\cdot\}$ is true and 0 otherwise. Throughout the paper, we use $c$ and $C$ to represent generic constants whose values might change from place to place.
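For concreteness, here is a minimal numerical sketch (assuming NumPy; the data are arbitrary) of the subspace distance $\rho(\mathbf{U},\mathbf{V})$ used throughout the paper, which simply compares the two projection matrices in Frobenius norm:

```python
import numpy as np

def rho(U, V):
    """Subspace distance rho(U, V) = ||U U^T - V V^T||_F for orthonormal U, V."""
    return np.linalg.norm(U @ U.T - V @ V.T, "fro")

# Usage: two orthonormal bases of the same column space are at distance ~0.
rng = np.random.default_rng(0)
U = np.linalg.qr(rng.standard_normal((10, 3)))[0]
Q = np.linalg.qr(rng.standard_normal((3, 3)))[0]   # rotation within the span
print(rho(U, U @ Q))                                # ~0 up to floating-point error
```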

2 Preliminaries and Problem Setup

We aim to estimate the eigenspace of the rank-$K$ symmetric matrix $\mathbf{M}\in\mathbb{R}^{d\times d}$, whose eigen-decomposition is given by $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$, where $\mathbf{\Lambda}=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{K})$ with $|\lambda_{1}|\geq|\lambda_{2}|\geq\ldots\geq|\lambda_{K}|>0$, and $\mathbf{V}$ is the stacking of the $K$ leading eigenvectors. (When $\mathbf{M}$ is asymmetric, we can deploy the “symmetric dilation” trick and take ${\mathcal{S}}(\mathbf{M})=\begin{pmatrix}\mathbf{0}&\mathbf{M}\\ \mathbf{M}^{\top}&\mathbf{0}\end{pmatrix}$ to fit it into this setting.) We denote by $\Delta=|\lambda_{K}|$ the eigengap of $\mathbf{M}$, and assume without loss of generality that $\lambda_{1}>0$. The matrix $\widehat{\mathbf{M}}$ is a corrupted version of $\mathbf{M}$ obtained from observed data, with $\mathbf{E}=\widehat{\mathbf{M}}-\mathbf{M}$ representing the error matrix. Our goal is to estimate the column space of $\mathbf{V}$ from $\widehat{\mathbf{M}}$ distributively and scalably. The following four examples provide concrete statistical setups for this problem.

Example 1 (Spiked Covariance Model [25]).

Let $\bm{X}_{1},\ldots,\bm{X}_{n}\in\mathbb{R}^{d}$ be i.i.d. sub-Gaussian random vectors with ${\mathbb{E}}(\bm{X}_{i})=\mathbf{0}$ and ${\mathbb{E}}(\bm{X}_{i}\bm{X}_{i}^{\top})=\mathbf{\Sigma}$. (We assume $\{\bm{X}_{i}\}_{i=1}^{n}$ are i.i.d. for simplicity of presentation; we will generalize the theoretical results to non-i.i.d. and heterogeneous data in Section 4.1.) We assume the following decomposition for the covariance matrix: $\mathbf{\Sigma}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}+\sigma^{2}\mathbf{I}_{d}$, where $\mathbf{V}\in\mathbb{R}^{d\times K}$ is the matrix of stacked $K$ leading eigenvectors and $\mathbf{\Lambda}=\operatorname{diag}(\lambda_{1},\ldots,\lambda_{K})$ with $\lambda_{1}\geq\ldots\geq\lambda_{K}>0$. Assume that the data are split along the sample size $n$ and stored on $m$ different sites. Denote by $\{\bm{X}_{i}^{(s)}\}_{i=1}^{n_{s}}$ the sample split of size $n_{s}$ on the $s$-th site, and by $\mathbf{X}^{(s)}=(\bm{X}_{1}^{(s)},\ldots,\bm{X}_{n_{s}}^{(s)})^{\top}$ the corresponding data matrix split ($s=1,\ldots,m$ and $\sum_{s=1}^{m}n_{s}=n$). Denote by $\mathbf{X}=(\bm{X}_{1},\ldots,\bm{X}_{n})^{\top}$ the full $n\times d$ data matrix. Then $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ and $\widehat{\mathbf{M}}=\widehat{\mathbf{\Sigma}}-\widehat{\sigma}^{2}\mathbf{I}_{d}$, where $\widehat{\mathbf{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}$ is the sample covariance matrix and $\widehat{\sigma}^{2}$ is a consistent estimator for $\sigma^{2}$.

Example 2 (Degree-Corrected Mixed Membership (DCMM) Model [16]).

Let $\mathbf{X}\in\mathbb{R}^{d\times d}$ be a symmetric adjacency matrix for an undirected graph of $d$ nodes, where $\mathbf{X}_{ij}=1$ if nodes $i,j\in[d]$ are connected and $\mathbf{X}_{ij}=0$ otherwise. Assume the $\mathbf{X}_{ij}$'s are independent for $i\leq j$ and ${\mathbb{E}}(\mathbf{X})=\mathbf{\Theta}\mathbf{\Pi}\mathbf{P}\mathbf{\Pi}^{\top}\mathbf{\Theta}$, where $\mathbf{\Theta}=\operatorname{diag}(\theta_{1},\ldots,\theta_{d})$ is the degree heterogeneity matrix, $\mathbf{\Pi}=(\bm{\pi}_{1},\ldots,\bm{\pi}_{d})^{\top}\in\mathbb{R}^{d\times K}$ is the matrix of stacked community assignment probability vectors, and $\mathbf{P}\in\mathbb{R}^{K\times K}$ is a symmetric rank-$K$ matrix with constant entries $\mathbf{P}_{kk^{\prime}}\in(0,1)$ for $k,k^{\prime}\in[K]$. Then $\mathbf{M}={\mathbb{E}}(\mathbf{X})=\mathbf{\Theta}\mathbf{\Pi}\mathbf{P}\mathbf{\Pi}^{\top}\mathbf{\Theta}$ and $\widehat{\mathbf{M}}=\mathbf{X}$. (In the case where self-loops are absent, $\mathbf{X}$ is replaced by $\mathbf{X}^{\prime}=\mathbf{X}-\operatorname{diag}(\mathbf{X})$ and $\mathbf{E}$ is replaced by $\mathbf{E}^{\prime}=\mathbf{E}-\operatorname{diag}(\mathbf{X})$; our theoretical results hold for both cases.) The goal is to infer the community membership profiles $\mathbf{\Pi}$. Recall $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$. Since $\mathbf{V}$ and $\mathbf{\Theta}\mathbf{\Pi}$ share the same column space, we can make inference on $\mathbf{\Pi}$ through $\mathbf{V}$. (To address the degree heterogeneity, one can perform the SCORE normalization to cancel out $\mathbf{\Theta}$ [24].) In this paper, we assume that there exist constants $C\geq c>0$ such that $\sigma_{K}(\mathbf{\Pi})\geq c\sqrt{d/K}$, $c\leq\lambda_{K}(\mathbf{P})\leq\lambda_{1}(\mathbf{P})\leq CK$ and $\max_{i}\theta_{i}\leq C\min_{i}\theta_{i}$, where we define $\theta=\max_{i}\theta_{i}^{2}$ as the rate of signal strength. We assume that the adjacency matrix is distributed across $m$ sites, where on the $s$-th site we observe the connectivity matrix $\mathbf{X}^{(s)}\in\mathbb{R}^{d\times d_{s}}$ and $\mathbf{X}=(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$.

Example 3 (Gaussian Mixture Models (GMM) [12]).

Let $\bm{W}_{1},\ldots,\bm{W}_{d}\in\mathbb{R}^{n}$ be independent samples with $\bm{W}_{j}$ ($j\in[d]$) generated from one of $K$ Gaussian distributions with means $\bm{\theta}_{k}\in\mathbb{R}^{n}$ ($k=1,\ldots,K$). More specifically, for $j\in[d]$, $\bm{W}_{j}$ is associated with a membership label $k_{j}\in[K]$, and $\bm{W}_{j}\sim{\mathcal{N}}(\sum_{k=1}^{K}\bm{\theta}_{k}\mathbb{I}\{k_{j}=k\},\mathbf{I}_{n})$. Our goal is to recover the unknown membership labels $k_{j}$'s. Denote $\mathbf{X}=(\bm{W}_{1},\ldots,\bm{W}_{d})=(\bm{X}_{1},\ldots,\bm{X}_{n})^{\top}$, where $\bm{X}_{i}$ is the $i$-th row of $\mathbf{X}$. Without loss of generality, we order the $\bm{W}_{j}$'s such that ${\mathbb{E}}(\mathbf{X})=\mathbf{\Theta}\mathbf{F}^{\top}$, where

$$\mathbf{\Theta}=(\bm{\theta}_{1},\ldots,\bm{\theta}_{K})\in\mathbb{R}^{n\times K},\quad\mathbf{F}=\operatorname{diag}(\mathbf{1}_{d_{1}},\ldots,\mathbf{1}_{d_{K}})\in\mathbb{R}^{d\times K},$$

with $d_{k}$ denoting the number of samples drawn from the Gaussian distribution with mean $\bm{\theta}_{k}$. Then we define $\mathbf{M}={\mathbb{E}}[\mathbf{X}^{\top}\mathbf{X}]-n\mathbf{I}_{d}=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}$ and $\widehat{\mathbf{M}}=\mathbf{X}^{\top}\mathbf{X}-n\mathbf{I}_{d}$. Recall $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$. Since $\mathbf{V}$ and $\mathbf{F}$ share the same column space, we can recover the memberships from $\mathbf{V}$. We consider the regime where $n>d$. Besides, we assume that there exists a constant $C>0$ such that $\max_{k}d_{k}\leq C\min_{k}d_{k}$ and $\sigma_{1}(\mathbf{\Theta})\leq C\sigma_{K}(\mathbf{\Theta})$. We consider the distributed setting where the data are split along the dimension $n$ and distributively stored on $m$ sites. Denote by $\mathbf{X}^{(s)}=(\bm{X}_{1}^{(s)},\ldots,\bm{X}_{n_{s}}^{(s)})^{\top}$ the data split on the $s$-th site of size $n_{s}$ ($s\in[m]$).

Example 4 (Incomplete Matrix Inference [13]).

Assume that $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$ is a symmetric rank-$K$ matrix, and ${\mathcal{S}}\subseteq[d]\times[d]$ is a subset of indices. We only observe the perturbed entries of $\mathbf{M}$ in the subset ${\mathcal{S}}$. Specifically, for $i\leq j$, we denote $\delta_{ij}=\delta_{ji}=\mathbb{I}\{(i,j)\in{\mathcal{S}}\}$, where $\delta_{ij}\overset{\text{i.i.d.}}{\sim}\operatorname{Bernoulli}(\theta)$ is an indicator for whether the $(i,j)$-th entry is observed. Then for $i,j\in[d]$, the observation for $\mathbf{M}_{ij}$ is $\mathbf{X}_{ij}=(\mathbf{M}_{ij}+\varepsilon_{ij})\delta_{ij}$, where the $\varepsilon_{ij}=\varepsilon_{ji}$ are i.i.d. random variables satisfying ${\mathbb{E}}(\varepsilon_{ij})=0$, ${\mathbb{E}}(\varepsilon_{ij}^{2})=\sigma^{2}$ and $\sup_{i\leq j}|\varepsilon_{ij}|\lesssim\sigma\log d$. (We can generalize the results to sub-Gaussian errors $\varepsilon_{ij}$ with variance proxy $\sigma^{2}$ by taking the truncated error $\varepsilon_{ij}^{t}=\varepsilon_{ij}\mathbb{I}\{|\varepsilon_{ij}|\leq 4\sigma\sqrt{\log d}\}$; by the maximal inequality for sub-Gaussian random variables, with probability at least $1-O(d^{-6})$ we have $\varepsilon_{ij}=\varepsilon_{ij}^{t}$ for all $i,j\in[d]$, and the theorems can be generalized with minor modifications.) Then, to adjust for scaling, we define the observed data as $\widehat{\mathbf{M}}=[\widehat{\mathbf{M}}_{ij}]=\widehat{\theta}^{-1}[\mathbf{X}_{ij}]$, where $\widehat{\theta}=2|{\mathcal{S}}|/\big{(}d(d+1)\big{)}$. (In practice, we can estimate $\mathbf{V}$ from $\mathbf{X}$ rather than from $\widehat{\mathbf{M}}=\widehat{\theta}^{-1}\mathbf{X}$, since the two matrices share exactly the same eigenvectors. However, we need the factor $\widehat{\theta}^{-1}$ to preserve correct scaling for the estimation of eigenvalues as well as the follow-up matrix completion; see Theorem 4.3 and Corollary 4.9 for more details.) Consider the distributed setting where the data are split along $d$ on $m$ servers, where $\mathbf{X}^{(s)}\in\mathbb{R}^{d\times d_{s}}$ stands for the observations on the $s$-th server and $\widehat{\mathbf{M}}=\widehat{\theta}^{-1}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$. The goal is to infer $\mathbf{V}$ from $\widehat{\mathbf{M}}$ in the presence of missing data.
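As a small illustration of this observation model, the sketch below generates a synthetic rank-3 matrix, masks and perturbs it, and rescales by the estimated observation probability (assuming NumPy; all parameter values are arbitrary and for illustration only):

```python
import numpy as np

rng = np.random.default_rng(3)
d, sigma, theta = 80, 0.5, 0.3

# Rank-3 ground truth M = V Lambda V^T.
V = np.linalg.qr(rng.standard_normal((d, 3)))[0]
M = V @ np.diag([30.0, 20.0, 10.0]) @ V.T

# Symmetric noise and symmetric Bernoulli(theta) observation mask.
E = rng.normal(0.0, sigma, (d, d)); E = np.triu(E) + np.triu(E, 1).T
delta = rng.binomial(1, theta, (d, d)); delta = np.triu(delta) + np.triu(delta, 1).T
X = (M + E) * delta

# Rescale by the estimated observation probability theta_hat = |S| / (d(d+1)/2).
theta_hat = (np.triu(delta) != 0).sum() / (d * (d + 1) / 2)
M_hat = X / theta_hat
print(np.linalg.norm(M_hat - M, 2) / np.linalg.norm(M, 2))  # moderate relative spectral error
```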

Table 1 provides the complexities of FADI for the four problems and suggested choice of parameters for optimal error rates. We will further discuss the computational complexities in detail in Section 3.4.

Model | Complexity | $p$ | $L$
Spiked covariance model | $O(dnp/m+dKpL\log d)$ | $K\vee\log d$ | $d/p$
DCMM model | $O(d^{2}p/m+dKpL\log d)$ | $\sqrt{d}$ | $\sqrt{d}$
Gaussian mixture models | $O(dnp/m+dKpL\log d)$ | $K\vee\log d$ | $d/p$
Incomplete matrix inference | $O(d^{2}p/m+dKpL\log d)$ | $\sqrt{d}$ | $\sqrt{d}$
Table 1: Computational complexities and parameter choices of FADI for PCA estimation under different models, where $K$ is the rank of $\mathbf{M}$, $d$ is the dimension of $\mathbf{M}$, $n$ is the sample size, $m$ is the number of data splits, $p$ is the fast sketching dimension and $L$ is the number of repeated sketches.

3 Method

In this section, we present the FADI algorithm and its application to different examples. We then provide the computational complexities of FADI and compare it with the existing methods. We also discuss how to estimate the rank $K$ when it is unknown.

3.1 Fast Distributed PCA (FADI): Overview and Intuition

For a given matrix $\widehat{\mathbf{M}}\in\mathbb{R}^{d\times d}$, the computational cost of the traditional PCA on $\widehat{\mathbf{M}}$ is $O(d^{3})$. In the case where $\widehat{\mathbf{M}}$ is computed from observed data, e.g., the sample covariance matrix $\widehat{\bm{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}$, extra computational burden comes from calculating $\widehat{\mathbf{M}}$, e.g., $O(nd^{2})$ flops for computing the sample covariance matrix. Hence performing traditional PCA on large-scale data with high dimensions and huge sample sizes can be considerably expensive.

To reduce the computational cost when $d$ is large, the most straightforward idea is to reduce the data dimension. One popular method for dimension reduction is random sketching [20]. For instance, for a low-rank matrix $\mathbf{M}$ of rank $K$, its column space can be represented by a low-dimensional fast sketch $\mathbf{M}\bm{\Omega}\in\mathbb{R}^{d\times p}$, where $\bm{\Omega}\in\mathbb{R}^{d\times p}$ is a random Gaussian matrix with $K<p\ll d$. In practice, $\mathbf{M}$ is usually replaced by an almost low-rank corrupted matrix $\widehat{\mathbf{M}}$ calculated from observed data. Traditional fast PCA methods then perform random sketching on $\widehat{\mathbf{M}}$ instead, and use the full sample to obtain the fast sketch $\widehat{\mathbf{Y}}=\widehat{\mathbf{M}}\bm{\Omega}\approx\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\bm{\Omega}$, which approximately preserves the left singular space of $\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}$. It is hence reasonable to estimate $\mathbf{V}$ by performing SVD on the $d\times p$ matrix $\widehat{\mathbf{Y}}$, which has a much smaller computational cost than directly performing PCA on $\widehat{\mathbf{M}}$. However, one major drawback of this approach is that information might be lost due to fast sketching. Furthermore, the method is not scalable when $n$ is large or the data are federated. This motivates us to propose FADI, where we repeat the fast sketching multiple times and aggregate the results to reduce the statistical error. Besides, instead of performing the fast sketching on the full sample, we apply multiple sketches to each split sample, and then aggregate the PC results across the data splits.

Specifically, assume the data are stored across $m$ sites, and we have the decomposition $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}$, where $\widehat{\mathbf{M}}^{(s)}$ is the component that can be computed locally on the $s$-th machine ($s\in[m]$). Then, instead of applying random sketching directly to $\widehat{\mathbf{M}}$, FADI computes in parallel the local fast sketch of each component $\widehat{\mathbf{M}}^{(s)}$ and aggregates the results across the $m$ sites, which reduces the cost of computing $\widehat{\mathbf{M}}\bm{\Omega}$ by a factor of $1/m$. Note that this representation of $\widehat{\mathbf{M}}$ is legitimate in many models. Taking Example 1 for instance, define $\widehat{\mathbf{M}}^{(s)}=\frac{1}{n}(\mathbf{X}^{(s)\top}\mathbf{X}^{(s)})-(\widehat{\sigma}^{2}/m)\mathbf{I}_{d}$, and we have $\widehat{\mathbf{M}}=\widehat{\bm{\Sigma}}-\widehat{\sigma}^{2}\mathbf{I}_{d}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}$. We will verify the decomposition for Examples 2-4 in Section 3.3.
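As a minimal numerical sketch of this decomposition under Example 1 (assuming NumPy and synthetic data; the split sizes and $\widehat{\sigma}^{2}$ value are purely illustrative), the snippet below checks that the locally computable components add up to $\widehat{\mathbf{M}}$ and that each site can form its $d\times p$ sketch $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$ directly from $\mathbf{X}^{(s)}$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m, p = 600, 50, 3, 10
sigma2_hat = 1.0                       # assume a consistent estimate of sigma^2 is available

X = rng.standard_normal((n, d))        # synthetic full data, split row-wise across m sites
splits = np.array_split(X, m)
Omega = rng.standard_normal((d, p))    # Gaussian test matrix

# Local components: M_hat^(s) = (1/n) X^(s)^T X^(s) - (sigma2_hat/m) I_d.
M_hat_s = [Xs.T @ Xs / n - (sigma2_hat / m) * np.eye(d) for Xs in splits]
M_hat = X.T @ X / n - sigma2_hat * np.eye(d)
print(np.allclose(sum(M_hat_s), M_hat))         # True: additive decomposition holds

# Each site only forms its d x p sketch (1/n) X^(s)^T (X^(s) Omega) - (sigma2_hat/m) Omega,
# avoiding the d x d matrix; the sketches aggregate to M_hat @ Omega.
Y_s = [Xs.T @ (Xs @ Omega) / n - (sigma2_hat / m) * Omega for Xs in splits]
print(np.allclose(sum(Y_s), M_hat @ Omega))     # True
```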

We will see in Section 4.1 that when the number of repeated fast sketches is sufficiently large, FADI enjoys the same error rate as the traditional PCA. From this perspective, FADI can be viewed as a “vertically” distributed PCA method, as it allocates the computational burden along the dimension $d$ to several machines using low-dimensional sketches while maintaining high statistical accuracy through the aggregation of local PCs. FADI overcomes the difficulties of vertical splitting caused by the correlation between variables.

3.2 General Algorithmic Framework

Recall we aim to estimate the $K$ leading eigenvectors $\mathbf{V}$ of a rank-$K$ matrix $\mathbf{M}$ from its estimator $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}$. Figure 2 illustrates the fast distributed PCA (FADI) algorithm:

Figure 2: Illustration of FADI. Here $\{\mathbf{X}^{(s)}\}_{s=1}^{m}$ are the raw data stored distributively on $m$ sites, and $\widehat{\mathbf{M}}^{(s)}$ is the $s$-th component of $\widehat{\mathbf{M}}$ that can be calculated from $\mathbf{X}^{(s)}$. $\widehat{\mathbf{Y}}^{(\ell)}=\sum_{s\in[m]}\widehat{\mathbf{Y}}^{(s,\ell)}$ ($\ell\in[L]$) is the $\ell$-th copy of the fast sketch, obtained by aggregating the fast sketches calculated distributively for each data split.

In Step 0, we perform preliminary processing on the raw data to produce $\{\widehat{\mathbf{M}}^{(s)}\}_{s=1}^{m}$. We will elaborate on the case-specific preprocessing in Section 3.3.

In Step 1, we calculate the distributed fast sketch $\widehat{\mathbf{Y}}=\widehat{\mathbf{M}}\bm{\Omega}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$, where $\bm{\Omega}$ is a $d\times p$ standard Gaussian test matrix and $K<p\ll d$. To reduce the statistical error, we repeat the fast sketching $L$ times and aggregate the results from the $L$ copies of $\widehat{\mathbf{Y}}$. Specifically, we generate $L$ i.i.d. Gaussian test matrices $\{\bm{\Omega}^{(\ell)}\}_{\ell=1}^{L}$, and for each $\ell\in[L]$, we apply $\bm{\Omega}^{(\ell)}$ distributively to $\widehat{\mathbf{M}}^{(s)}$ for each $s\in[m]$ and obtain the $\ell$-th fast sketch of $\widehat{\mathbf{M}}^{(s)}$ as $\widehat{\mathbf{Y}}^{(s,\ell)}=\widehat{\mathbf{M}}^{(s)}\bm{\Omega}^{(\ell)}$. We send $\widehat{\mathbf{Y}}^{(s,\ell)}$ ($s=1,\ldots,m$) to the $\ell$-th parallel server for aggregation.

In Step 2, on the $\ell$-th server, the random sketches $\widehat{\mathbf{Y}}^{(s,\ell)}$ ($s=1,\ldots,m$) from the $m$ split datasets corresponding to the $\ell$-th Gaussian test matrix $\bm{\Omega}^{(\ell)}$ are collected and added up to get the $\ell$-th fast sketch: $\widehat{\mathbf{Y}}^{(\ell)}=\sum_{s=1}^{m}\widehat{\mathbf{Y}}^{(s,\ell)}$ ($\ell\in[L]$). We next compute in parallel the top $K$ left singular vectors $\widehat{\mathbf{V}}^{(\ell)}$ of $\widehat{\mathbf{Y}}^{(\ell)}$ and send the $\widehat{\mathbf{V}}^{(\ell)}$'s to the central processor for aggregation.

In Step 3, on the central processor, we calculate $\widetilde{\mathbf{\Sigma}}=\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\ell}$, where $\mathbf{P}_{\ell}=\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}$ is the projection matrix of $\widehat{\mathbf{V}}^{(\ell)}$. We next calculate the $K$ leading eigenvectors $\widetilde{\mathbf{V}}$ of $\widetilde{\mathbf{\Sigma}}$, which serve as the final estimator of $\mathbf{V}$.

To further improve the computational efficiency, we may conduct another fast sketching in Step 3 to compute $\widetilde{\mathbf{V}}$. More specifically, we apply the power method [20] to $\widetilde{\bm{\Sigma}}$ by calculating $\widetilde{\mathbf{Y}}=\widetilde{\mathbf{\Sigma}}^{q}\bm{\Omega}^{\rm F}=\big{(}\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\big{)}^{q}\bm{\Omega}^{\rm F}$ for $q\geq 1$, where $\bm{\Omega}^{\rm F}\in\mathbb{R}^{d\times p^{\prime}}$ is a Gaussian test matrix whose dimension $p^{\prime}$ can be set different from $p$ for optimal efficiency. Here, $\widetilde{\mathbf{Y}}$ can be calculated iteratively: $\widetilde{\mathbf{Y}}_{(i)}=\frac{1}{L}\sum_{\ell=1}^{L}\big{(}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\widetilde{\mathbf{Y}}_{(i-1)}\big{)}$ for $i=1,\ldots,q$, where $\widetilde{\mathbf{Y}}_{(0)}=\bm{\Omega}^{\rm F}$ and $\widetilde{\mathbf{Y}}=\widetilde{\mathbf{Y}}_{(q)}$. We denote by $\widetilde{\mathbf{V}}^{\text{F}}$ the leading $K$ left singular vectors of $\widetilde{\mathbf{Y}}$. We will show in Section 4 that when $q$ is suitably large, the distance between $\widetilde{\mathbf{V}}$ and $\widetilde{\mathbf{V}}^{\text{F}}$ is negligible.
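To make Steps 1-3 concrete, here is a minimal single-machine sketch of the FADI pipeline (assuming NumPy; in an actual deployment the loop over $\ell$ runs on separate servers and the sum over sites is an aggregation step, and the synthetic data and parameter values below are purely illustrative):

```python
import numpy as np

def fadi(M_hat_parts, K, p=None, L=None, p_prime=None, q=None, rng=None):
    """Minimal FADI sketch: M_hat_parts is the list of locally computable
    components M_hat^(1), ..., M_hat^(m) summing to the d x d matrix M_hat."""
    rng = rng or np.random.default_rng(0)
    d = M_hat_parts[0].shape[0]
    p = p or max(2 * K, K + 7)
    L = L or int(np.ceil(d / p))
    p_prime = p_prime or max(2 * K, K + 7)
    q = q or int(np.ceil(np.log(d)))

    # Steps 1-2: for each of the L sketches, aggregate the local d x p sketches
    # Y^(s,l) = M_hat^(s) Omega^(l) and keep the top-K left singular vectors.
    V_list = []
    for _ in range(L):
        Omega = rng.standard_normal((d, p))
        Y = sum(Ms @ Omega for Ms in M_hat_parts)   # summed across the m sites
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        V_list.append(U[:, :K])

    # Step 3 (power-method variant): Y_tilde = Sigma_tilde^q Omega^F, computed
    # without ever forming the d x d matrix Sigma_tilde = (1/L) sum_l V^(l) V^(l)^T.
    Y_tilde = rng.standard_normal((d, p_prime))
    for _ in range(q):
        Y_tilde = sum(V @ (V.T @ Y_tilde) for V in V_list) / L
    U, _, _ = np.linalg.svd(Y_tilde, full_matrices=False)
    return U[:, :K]                                  # FADI estimator V_tilde^F

# Illustrative use on Example 1 with synthetic data split over m sites.
rng = np.random.default_rng(1)
n, d, m, K, sigma2 = 2000, 100, 4, 3, 1.0
V_true = np.linalg.qr(rng.standard_normal((d, K)))[0]
X = rng.standard_normal((n, K)) * np.sqrt(10.0) @ V_true.T + rng.standard_normal((n, d))
parts = [Xs.T @ Xs / n - (sigma2 / m) * np.eye(d) for Xs in np.array_split(X, m)]
V_fadi = fadi(parts, K, rng=rng)
# Subspace distance rho; small relative to its maximum sqrt(2K) ~ 2.45.
print(np.linalg.norm(V_fadi @ V_fadi.T - V_true @ V_true.T, "fro"))
```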

Remark 1.

We refer to Theorem 4.1 for the choice of $p$ and $L$. In general, taking $p=2K$ is sufficient. For now, we assume $K$ is known, and the scenario where $K$ is unknown will be discussed in Section 3.5.

3.3 Case-Specific Processing of Raw Data

In this section, we discuss the calculation of $\widehat{\mathbf{M}}$ in Step 0 of FADI specifically for each example.

Example 1: Recall that in Step 0 of FADI, to obtain $\widehat{\mathbf{M}}$, we need a consistent estimator of the residual variance $\sigma^{2}$. Denote by $S=\{i_{1},i_{2},\ldots,i_{K^{\prime}}\}\subseteq[d]$ an arbitrary index set of size $K^{\prime}\geq K+1$. Then we estimate $\sigma^{2}$ by $\widehat{\sigma}^{2}=\sigma_{\min}(\widehat{\bm{\Sigma}}_{S})$, where $\widehat{\bm{\Sigma}}_{S}$ is the $K^{\prime}\times K^{\prime}$ principal submatrix of $\widehat{\bm{\Sigma}}$ computed using only the data columns in the set $S$. Due to the additive structure of the sample covariance matrix, $\widehat{\bm{\Sigma}}_{S}$ can easily be computed distributively (see Figure 9 in Appendix E for reference). Then for $s\in[m]$, we have $\widehat{\mathbf{M}}^{(s)}=\frac{1}{n}(\mathbf{X}^{(s)\top}\mathbf{X}^{(s)})-(\widehat{\sigma}^{2}/m)\mathbf{I}_{d}$. Since computing $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}=\frac{1}{n}\mathbf{X}^{(s)\top}(\mathbf{X}^{(s)}\bm{\Omega})-m^{-1}\widehat{\sigma}^{2}\bm{\Omega}$ is much faster than first forming $\widehat{\mathbf{M}}^{(s)}$ and then computing $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$, we calculate $\widehat{\mathbf{M}}^{(s)}\bm{\Omega}$ by computing $\mathbf{X}^{(s)}\bm{\Omega}$ first rather than forming $\widehat{\mathbf{M}}^{(s)}$ directly.
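A minimal sketch of this distributed residual-variance estimate (assuming NumPy; the per-site loop stands in for the computation done locally at each site, and the synthetic data are illustrative only):

```python
import numpy as np

def estimate_sigma2(splits, S):
    """Estimate sigma^2 as the smallest eigenvalue of the K' x K' principal
    submatrix of the sample covariance restricted to the index set S.
    Each site only contributes the K' x K' matrix X_S^(s)^T X_S^(s)."""
    n = sum(Xs.shape[0] for Xs in splits)
    Sigma_S = sum(Xs[:, S].T @ Xs[:, S] for Xs in splits) / n
    return np.linalg.eigvalsh(Sigma_S).min()

# Usage with synthetic split data: K = 3, so |S| >= K + 1 columns are needed.
rng = np.random.default_rng(2)
splits = [rng.standard_normal((500, 50)) for _ in range(4)]   # m = 4 sites, d = 50
print(estimate_sigma2(splits, S=[0, 1, 2, 3]))                # close to sigma^2 = 1
```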

Example 2: Recall that the adjacency matrix is stored distributively on $m$ sites, and on the $s$-th site we observe the connectivity matrix $\mathbf{X}^{(s)}$. Then for $s\in[m]$, define $\widehat{\mathbf{M}}^{(s)}=(\mathbf{e}_{s}^{\top}\otimes\mathbf{I}_{d})\operatorname{diag}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$, where $\{\mathbf{e}_{s}\}_{s=1}^{m}\subseteq\mathbb{R}^{m}$ is the canonical basis of $\mathbb{R}^{m}$. Namely, $\widehat{\mathbf{M}}^{(s)}$ is the $s$-th observation $\mathbf{X}^{(s)}$ augmented by zeros, and $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}=(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})=\mathbf{X}$. No preliminary computation is needed.

Example 3: Recall that the data $\{\bm{W}_{j}\}_{j=1}^{d}\subseteq\mathbb{R}^{n}$ are vertically distributed across $m$ sites, and $\{\mathbf{X}^{(s)}\}_{s=1}^{m}$ are the corresponding data splits. For the $s$-th site, we have $\widehat{\mathbf{M}}^{(s)}=\mathbf{X}^{(s)\top}\mathbf{X}^{(s)}-(n/m)\mathbf{I}_{d}$, and for $\ell\in[L]$, we compute $\widehat{\mathbf{Y}}^{(s,\ell)}$ as $\mathbf{X}^{(s)\top}(\mathbf{X}^{(s)}\bm{\Omega}^{(\ell)})-(n/m)\bm{\Omega}^{(\ell)}$.

Example 4: Recall that we observe the split data $\{\mathbf{X}^{(s)}\}_{s=1}^{m}$ with missing entries on $m$ servers. Define $\widehat{\mathbf{M}}^{(s)}=\widehat{\theta}^{-1}(\mathbf{e}_{s}^{\top}\otimes\mathbf{I}_{d})\operatorname{diag}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$ for the $s$-th server, where $\widehat{\theta}=2|{\mathcal{S}}|/\big{(}d(d+1)\big{)}$; then we have $\widehat{\mathbf{M}}=\sum_{s=1}^{m}\widehat{\mathbf{M}}^{(s)}=\widehat{\theta}^{-1}(\mathbf{X}^{(1)},\ldots,\mathbf{X}^{(m)})$.

3.4 Computational Complexity

In this section, we provide the computational complexity of FADI for each example given in Section 2. The complexity of each step is listed in Table 2.

Step | Example 1 | Example 2 | Example 3 | Example 4
Step 0 | $\widehat{\bm{\Sigma}}_{S}$: $O(K^{2}n/m+K^{2}m)$; $\widehat{\sigma}^{2}$: $O(K^{3})$ | N/A | $O(1)$ | $O(d^{2}/m)$
Step 1 | $\widehat{\mathbf{Y}}^{(s,\ell)}$: $O(dnp/m)$ | $\widehat{\mathbf{Y}}^{(s,\ell)}$: $O(d^{2}p/m)$ | $\widehat{\mathbf{Y}}^{(s,\ell)}$: $O(dnp/m)$ | $\widehat{\mathbf{Y}}^{(s,\ell)}$: $O(d^{2}p/m)$
Step 2 | $\widehat{\mathbf{Y}}^{(\ell)}$: $O(mdp)$; $\widehat{\mathbf{V}}^{(\ell)}$: $O(dp^{2})$ (same for all four examples)
Step 3 | $\widetilde{\mathbf{V}}$: $O(d^{2}pL+d^{3})$ | N/A | $\widetilde{\mathbf{V}}$: $O(d^{2}pL+d^{3})$ | N/A
       | $\widetilde{\mathbf{V}}^{\text{F}}$: $O(dKp^{\prime}Lq+dp^{\prime 2})$ (same for all four examples)
Total | $O(dnp/m+dKp^{\prime}Lq)$ | $O(d^{2}p/m+dKp^{\prime}Lq)$ | $O(dnp/m+dKp^{\prime}Lq)$ | $O(d^{2}p/m+dKp^{\prime}Lq)$
Table 2: Computational costs for Examples 1-4. For simplicity of presentation, we assume $\max_{s\in[m]}n_{s}\asymp n/m$ for Examples 1 and 3 and $\max_{s\in[m]}d_{s}\asymp d/m$ for Examples 2 and 4. In Step 3, the calculation of $\widetilde{\mathbf{V}}$ involves computing $\widetilde{\bm{\Sigma}}$ at $O(d^{2}pL)$ flops and its SVD at $O(d^{3})$ flops, while computing $\widetilde{\mathbf{V}}^{\rm F}$ involves computing $\widetilde{\bm{\Sigma}}^{q}\bm{\Omega}^{\rm F}$ at $O(dKp^{\prime}Lq)$ flops and its SVD at $O(dp^{\prime 2})$ flops. We recommend $\widetilde{\mathbf{V}}^{\rm F}$ instead of $\widetilde{\mathbf{V}}$ in practice. The total complexity in the last row refers to the total computational cost for $\widetilde{\mathbf{V}}^{\rm F}$.

When $m$ can be customized, we recommend taking $m\asymp n/d$ for Examples 1 and 3, and $m\asymp\sqrt{d}$ for Examples 2 and 4 for optimal efficiency. For Examples 1 and 3, when $p\asymp(K\vee\log d)$, $L\asymp d/p$, $p^{\prime}\asymp K$ and $q\asymp\log d$, the total computational cost is $O\big{(}dn(K\vee\log d)/m+d^{2}K\log d\big{)}$. For Examples 2 and 4, direct SVD on $\widetilde{\bm{\Sigma}}$ would incur a computational cost of order $d^{3}$, and we only suggest $\widetilde{\mathbf{V}}^{\text{F}}$ as the eigenspace estimator. If we take $p\asymp\sqrt{d}$, $L\asymp d/p$, $p^{\prime}\asymp K$ and $q\asymp\log d$, the total computational cost is $O(d^{5/2}/m+K^{2}d^{3/2}\log d)$. Inference on the eigenspace requires the calculation of the asymptotic covariance, whose formula and computational costs are discussed in Sections 4.3 and 4.4.

Method | Error Rate | Computational Complexity
FADI | $O(\sqrt{Kr/n})$ | $O\left(dn(K\vee\log d)/m+d^{2}K\log d\right)$
Traditional PCA | $O(\sqrt{Kr/n})$ | $O(d^{2}n+d^{3})$
Fast PCA | $O(\sqrt{Kdr/n})$ | $O(dnK+d^{2}K)$
Distributed PCA | $O(\sqrt{Kr/n})$ | $O(d^{2}n/m+d^{3})$
Table 3: Error rates and computational complexities of FADI, traditional PCA, fast PCA (one sketch) [20] and distributed PCA [18] under Example 1, where the error rate is evaluated by $\big{(}{\mathbb{E}}|\rho(\,\cdot\,,\mathbf{V})|^{2}\big{)}^{1/2}$. Here $r=\operatorname{tr}(\mathbf{\Sigma})/\|\mathbf{\Sigma}\|_{2}$ is the effective rank of the covariance matrix and $m$ is the number of sites. For FADI, we take $p\asymp(K\vee\log d)$, $L\asymp d/p$, $p^{\prime}\asymp K$ and $q\asymp\log d$.

For a comparison of FADI with the existing works, we provide in Table 3 the theoretical error rates and the computational complexities of FADI against different PCA methods under Example 1 (please refer to Theorem 4.1 for the error rates of FADI). We choose Example 1 for illustration, as the existing distributed PCA methods mainly consider this setting [18]. The results show that under the distributed setting, FADI has a much lower computational complexity than the other three methods, while enjoying the same error rate as the traditional full-sample PCA. In comparison, the distributed PCA method in [18] is slowed down significantly by applying traditional PCA to each data split. The fast PCA algorithm in [20] has suboptimal computational complexity and theoretical error rate due to its downstream projection that hinders aggregation.

3.5 Estimation of the Rank $K$

FADI requires inputting the rank $K$ of the matrix $\mathbf{M}$. In practice, if we are only interested in estimating the leading PCs, the exact value of $K$ is not needed as long as the fast sketching dimensions $p$ and $p^{\prime}$ are sufficiently larger than $K$. Yet knowing the exact value of $K$ improves the computational efficiency and facilitates inference on the PCs. In fact, the estimation of $K$ can be incorporated into Step 2 and Step 3 of FADI. Specifically, on the $\ell$-th parallel server ($\ell\in[L]$), after performing the SVD $\widehat{\mathbf{Y}}^{(\ell)}=\widehat{\mathbf{V}}_{p}^{(\ell)}\widehat{\mathbf{\Lambda}}_{p}^{(\ell)}\widehat{\mathbf{U}}_{p}^{(\ell)\top}$, we estimate $K$ by

$$\widehat{K}^{(\ell)}=\min\{k<p:\sigma_{k+1}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})\leq\sqrt{p}\,\mu_{0}\},$$

where $\mu_{0}>0$ is a user-specified parameter (we refer to Theorem 4.3 for the choice of $\mu_{0}$). We then send all the left singular vectors $\widehat{\mathbf{V}}_{p}^{(\ell)}$ and $\widehat{K}^{(\ell)}$, $\ell\in[L]$, to the central processor. Finally, on the central processor, we take $\widehat{K}=\lceil\operatorname{median}\{\widehat{K}^{(1)},\widehat{K}^{(2)},\ldots,\widehat{K}^{(L)}\}\rceil$ as the estimator for $K$, and obtain $\widetilde{\mathbf{V}}_{\widehat{K}}$ (respectively $\widetilde{\mathbf{V}}_{\widehat{K}}^{\text{F}}$) by performing PCA (respectively powered fast sketching) on the aggregated average of $\{\widehat{\mathbf{V}}_{\widehat{K}}^{(\ell)}\}_{\ell\in[L]}$ and taking the $\widehat{K}$ leading PCs, where $\widehat{\mathbf{V}}_{\widehat{K}}^{(\ell)}$ denotes the $\widehat{K}$ leading PCs of $\widehat{\mathbf{Y}}^{(\ell)}$. We will show in Theorem 4.3 that $\widehat{K}$ is a consistent estimator of $K$.
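A minimal sketch of this per-server rank estimate and its central aggregation (assuming NumPy; `mu0` stands for the user-specified threshold $\mu_{0}$):

```python
import numpy as np

def estimate_K_local(Y, mu0):
    """Per-server rank estimate from the d x p sketch Y = Y_hat^(l):
    the smallest k < p with sigma_{k+1}(Y) - sigma_p(Y) <= sqrt(p) * mu0."""
    p = Y.shape[1]
    s = np.linalg.svd(Y, compute_uv=False)         # singular values, descending
    for k in range(1, p):
        if s[k] - s[p - 1] <= np.sqrt(p) * mu0:    # s[k] is sigma_{k+1}(Y)
            return k
    return p - 1                                    # fallback if no k qualifies

def aggregate_K(K_locals):
    """Central aggregation: the rounded-up median of the per-server estimates."""
    return int(np.ceil(np.median(K_locals)))
```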

4 Theory

In this section, we will establish a theoretical upper bound for the error rate of FADI in Section 4.1, and characterize the asymptotic distribution of the FADI estimator in Section 4.3 and Section 4.4 to facilitate inference.

4.1 Theoretical Bound on Error Rates

We need the following condition to guarantee that the error term converges at a proper rate.

Assumption 1 (Convergence of $\|\mathbf{E}\|_{2}$).

Recall that $\mathbf{E}=\widehat{\mathbf{M}}-\mathbf{M}$ is the error matrix. Assume that $\|\mathbf{E}\|_{2}$ is sub-exponential and that there exists a rate $r_{1}(d)$ such that

$$\big{\|}\|\mathbf{E}\|_{2}\big{\|}_{\psi_{1}}=\sup_{q\geq 1}q^{-1}\left({\mathbb{E}}\|\mathbf{E}\|_{2}^{q}\right)^{1/q}\lesssim r_{1}(d).$$
Remark 2.

By standard probability theory [40], we know that there exists a constant $c_{e}>0$ such that for any $t>0$ we have $\mathbb{P}(\|\mathbf{E}\|_{2}\geq t)\leq\exp\left(-c_{e}t/r_{1}(d)\right)$, and hence $\|\mathbf{E}\|_{2}=O_{P}\left(r_{1}(d)\right)$.

We will conduct a variance-bias decomposition of the error rate $\rho(\widetilde{\mathbf{V}},\mathbf{V})$. To facilitate the discussion, we introduce the intermediate matrix $\mathbf{\Sigma}^{\prime}={\mathbb{E}}_{\mathbf{\Omega}}\big{(}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\big{)}$, where the expectation is taken with respect to $\bm{\Omega}$. Let $\mathbf{V}^{\prime}$ be the top $K$ eigenvectors of $\mathbf{\Sigma}^{\prime}$. Note that both $\mathbf{\Sigma}^{\prime}$ and $\mathbf{V}^{\prime}$ are random and depend on $\widehat{\mathbf{M}}$. For the FADI PC estimator $\widetilde{\mathbf{V}}$, we have the following “variance-bias” decomposition of the error rate:

$$\rho(\widetilde{\mathbf{V}},\mathbf{V})\leq\underbrace{\rho(\widetilde{\mathbf{V}},\mathbf{V}^{\prime})}_{\text{variance}}+\underbrace{\rho(\mathbf{V}^{\prime},\mathbf{V})}_{\text{bias}}.$$

Conditional on all the available data, the first term characterizes the statistical randomness of $\widetilde{\mathbf{V}}$ due to fast sketching, whereas the second bias term is deterministic and depends on all the information provided by the data. Intuitively, since $\widetilde{\mathbf{\Sigma}}=\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}$ converges to the conditional expectation $\mathbf{\Sigma}^{\prime}$, $\widetilde{\mathbf{V}}$ also converges to $\mathbf{V}^{\prime}$, so the first variance term vanishes asymptotically. As for the second bias term, let $\widehat{\mathbf{V}}$ be the $K$ leading eigenvectors of $\widehat{\mathbf{M}}$; then we further break the bias term into two components: $\rho(\mathbf{V}^{\prime},\mathbf{V})\leq\rho(\widehat{\mathbf{V}},\mathbf{V})+\rho(\mathbf{V}^{\prime},\widehat{\mathbf{V}})$. The first component is the error rate of the traditional PCA, whereas the second component is the bias caused by fast sketching. We can show that the second component is 0 with high probability and is hence negligible compared to the first (see Lemma B.1 in Appendix B.1 for details), so the bias of the FADI estimator is of the same order as the error rate of the traditional PCA. In other words, the bias of the FADI estimator mainly comes from $\widehat{\mathbf{V}}$, which is determined by the information we can get from the available data. The following theorem gives the overall error rate of the FADI PC estimator; its proof is given in Appendix B.2.

Theorem 4.1.

Under Assumption 1, if pmax(2K,K+7)p\geq\max(2K,K+7) and (logd)1p/dΔ/r1(d)C(\log d)^{-1}\sqrt{p/d}\Delta/r_{1}(d)\geq C for some large enough constant C>0C>0, we have

(𝔼|ρ(𝐕~,𝐕)|2)1/2KΔr1(d)+KdΔ2pLr1(d).\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}},\mathbf{V})|^{2}\right)^{1/2}\lesssim{\frac{\sqrt{K}}{\Delta}r_{1}(d)}+{\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}. (1)

Furthermore, recall 𝐕~F\widetilde{\mathbf{V}}^{\rm F} is the KK leading left singular vectors of 𝚺~q𝛀F\widetilde{\mathbf{\Sigma}}^{q}\mathbf{\Omega}^{\rm F} for some power q1q\geq 1, where 𝛀Fd×p\mathbf{\Omega}^{\rm F}\in\mathbb{R}^{d\times p^{\prime}} is a random Gaussian matrix and pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), then under Assumption 1 and the conditions that pmax(2K,K+8q1)p\geq\max(2K,K+8q-1) and (logd)1p/dΔ/r1(d)C(\log d)^{-1}\sqrt{p/d}\Delta/r_{1}(d)\geq C, there exists some constant η>0\eta>0 such that

(𝔼|ρ(𝐕~F,𝐕)|2)1/2\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2} KΔr1(d)+KdΔ2pLr1(d)+Kdp(ηq2dΔ2pr1(d))q.\displaystyle\!\lesssim\!{\frac{\sqrt{K}}{\Delta}r_{1}(d)}\!+\!{\sqrt{\!\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}\!+\!\sqrt{\!\frac{Kd}{p^{\prime}}}\!\!\left(\!\eta q^{2}\!\sqrt{\frac{d}{\Delta^{2}p}}r_{1}(d)\!\right)^{q}\!\!\!. (2)
Remark 3.

On the RHS of (1), the first term is the bias term, while the second term is the variance term. We can see that when the number of sketches LL reaches the order d/pd/p, the variance term will be of the same order as the bias term, which is the same as the error rate of the traditional PCA method. As for (2), the first term and the second term on the RHS are the same as the bias and the variance terms in (1), while the third term comes from the additional fast sketching. In fact, if we properly choose q=(log(p/dΔ/r1(d)))1logd+1logdq=\lceil\big{(}\log\big{(}\sqrt{p/d}\Delta/r_{1}(d)\big{)}\big{)}^{-1}\log d\rceil+1\leq\log d, the third term in (2) will be negligible. Theorem 4.1 also indicates that pp only needs to be of order KlogdK\vee\log d, which significantly reduces the communication costs from O(d2)O(d^{2}) to O(d(Klogd))O\left(d(K\vee\log d)\right) for each server.
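As a companion to the sketch above, the extra fast-sketching step that yields 𝐕~F\widetilde{\mathbf{V}}^{\rm F} can be written as below (again only a hedged sketch with our own naming, consuming the list of 𝐕^()\widehat{\mathbf{V}}^{(\ell)} produced earlier); the product 𝚺~q𝛀F\widetilde{\mathbf{\Sigma}}^{q}\mathbf{\Omega}^{\rm F} is accumulated implicitly so that the d×dd\times d matrix 𝚺~\widetilde{\mathbf{\Sigma}} never needs to be formed.

```python
import numpy as np

def fadi_power_sketch(V_hats, K, p_prime, q, seed=1):
    """Top-K left singular vectors of Sigma_tilde^q @ Omega_F, where
    Sigma_tilde = (1/L) sum_l V_hat_l V_hat_l^T is applied implicitly."""
    rng = np.random.default_rng(seed)
    d, L = V_hats[0].shape[0], len(V_hats)
    Y = rng.standard_normal((d, p_prime))          # Omega_F, d x p'
    for _ in range(q):                             # apply Sigma_tilde q times
        Y = sum(V @ (V.T @ Y) for V in V_hats) / L
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :K]                                # V_tilde_F
```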

Based upon Theorem 4.1, we provide the case-specific error rate for each example given in Section 2 in the following corollary. Please refer to Appendix B.3 for the proof.

Corollary 4.2.

For Examples 14, we have the following error bounds for each case under corresponding regularity conditions.

  • Example 1: Define κ1=(λ1+σ2)/Δ\kappa_{1}=(\lambda_{1}+\sigma^{2})/\Delta, then under the conditions that pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), pmax(2K,K+8logd1)p\geq\max(2K,K+8\log d-1), q=logdq=\lceil\log d\rceil and nC(rd/p)κ12log4dn\geq C(rd/{p})\kappa_{1}^{2}\log^{4}d for some large enough constant C>0C>0, it holds that

    (𝔼|ρ(𝐕~F,𝐕)|2)1/2κ1Krn+κ1KdrnpL,\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}\lesssim\kappa_{1}\sqrt{\frac{Kr}{n}}+\kappa_{1}\sqrt{\frac{Kdr}{npL}}, (3)

    where r=tr(𝚺)/𝚺2r=\operatorname{tr}(\mathbf{\Sigma})/\|\mathbf{\Sigma}\|_{2} is the effective rank.

  • Example 2: Suppose θK2d1/2+ϵ\theta{\geq}K^{2}d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0. If we take pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), pdp\gtrsim\sqrt{d} and q=logdq=\lceil\log d\rceil, it holds that

    (𝔼|ρ(𝐕~F,𝐕)|2)1/2KKdθ+KKpLθ.\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2}\lesssim K\sqrt{\frac{K}{d\theta}}+K\sqrt{\frac{K}{pL\theta}}. (4)
  • Example 3: Under the conditions that Δ02CK(logd)2max(d(logd)2/p,n/p)\Delta_{0}^{2}\geq CK(\log d)^{2}\max\big{(}d(\log d)^{2}/p,\,\sqrt{{n}/{p}}\big{)} for some large enough constant C>0C>0, where Δ0=𝚯2\Delta_{0}=\|\mathbf{\Theta}\|_{2}, if we take pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), pmax(2K,K+8logd1)p\geq\max(2K,K+8\log d-1) and q=logdq=\lceil\log d\rceil, it holds that

    (𝔼|ρ(𝐕~F,𝐕)|2)1/2\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2} (KΔ0+KΔ02Knd)+dpL(KΔ0+KΔ02Knd).\displaystyle\!\!\!\lesssim\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right)\!+\!\sqrt{\frac{d}{pL}}\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right). (5)
  • Example 4: Define κ2=|λ1|/Δ\kappa_{2}=|\lambda_{1}|/\Delta. Suppose θd1/2+ϵ\theta\geq d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0, σ/Δd1pθ\sigma/\Delta\ll d^{-1}\sqrt{p\theta}, 𝐕2,μK/d\|\mathbf{V}\|_{2,\infty}\leq\sqrt{\mu K/d} for some μ1\mu\geq 1 and κ2μKd1/4\kappa_{2}\mu K\ll d^{1/4}, if we take pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), pdp\gtrsim\sqrt{d} and q=logdq=\lceil\log d\rceil, it holds that

    (𝔼|ρ(𝐕~F,𝐕)|2)1/2\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V})|^{2}\right)^{1/2} K(κ2μKdθ+dσ2Δ2θ)+KdpL(κ2μKdθ+dσ2Δ2θ).\displaystyle\!\!\lesssim\sqrt{K}\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right)\!\!+\!\sqrt{\frac{Kd}{pL}}\!\!\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right). (6)
Remark 4.

We can generalize the results of Example 1 to the heterogeneous residual variance model for non-i.i.d. data, under which {𝑿i}i=1nd\{\bm{X}_{i}\}_{i=1}^{n}\subseteq\mathbb{R}^{d} are centered random vectors with covariance matrices satisfying limn1ni=1n𝔼(𝑿i𝑿i)=𝚺=𝐃+𝐕𝚲𝐕\lim_{n\rightarrow\infty}\frac{1}{n}\sum_{i=1}^{n}{\mathbb{E}}(\bm{X}_{i}\bm{X}_{i}^{\top})=\bm{\Sigma}=\mathbf{D}+\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}, where 𝐃=diag(σ12,,σd2)\mathbf{D}=\operatorname{diag}(\sigma_{1}^{2},\ldots,\sigma_{d}^{2}) and λ1𝐕2,2/Δ=o(1)\lambda_{1}\|\mathbf{V}\|_{2,\infty}^{2}/\Delta=o(1). Then we have 𝐌^=𝚺^diag(𝚺^)\widehat{\mathbf{M}}=\widehat{\bm{\Sigma}}-\operatorname{diag}(\widehat{\bm{\Sigma}}), where 𝚺^=1ni=1n𝑿i𝑿i\widehat{\mathbf{\Sigma}}=\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}, 𝐌=𝐕𝚲𝐕\mathbf{M}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top} and 𝐄22𝚺^𝚺2+diag(𝐕𝚲𝐕)22𝚺^𝚺2+λ1𝐕2,2\|\mathbf{E}\|_{2}\leq 2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}+\|\operatorname{diag}(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})\|_{2}\leq 2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}+\lambda_{1}\|\mathbf{V}\|_{2,\infty}^{2}. Then by plugging in r1(d)=λ1𝐕2,2+𝚺^𝚺2ψ1r_{1}(d)=\lambda_{1}\|\mathbf{V}\|_{2,\infty}^{2}+\|\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}\|_{\psi_{1}}, we have the error bound under the heterogeneous scenario. While the first term is deterministic, the second term depends on the dependence structure of the sample. Many studies depicted the convergence of the sample covariance matrix for non-i.i.d. data [7, 17].
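As a small illustration of the construction in Remark 4, the diagonal-deleted sample covariance can be formed as follows (a sketch assuming the samples are already centered; names are ours):

```python
import numpy as np

def diagonal_deleted_covariance(X):
    """M_hat = Sigma_hat - diag(Sigma_hat) for a centered sample matrix X (n x d)."""
    n = X.shape[0]
    Sigma_hat = X.T @ X / n        # sample covariance
    M_hat = Sigma_hat.copy()
    np.fill_diagonal(M_hat, 0.0)   # remove the heterogeneous diagonal D
    return M_hat
```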

For Example 1, when LpdLp\gtrsim d, our error rate in (3) is optimal [18]. Under the distributed data setting, we require the total sample size nn to be larger than rd/prd/p, while Fan et al., [18]’s distributed PCA requires n/m>rn/m>r, where n/mn/m is the sample size for each data split. Compared with [18], our method has theoretical guarantees regardless of the number of data splits, but our scaling condition nrd/pn\gtrsim rd/p has an extra factor of d/pd/p in exchange for reduced computation cost. As for Example 2, our estimation rate in (4) matches the inferential results in [16]. Please also refer to Section 4.3 for a detailed comparison with the method in [16] in terms of the limiting distributions. For Example 3, our estimation rate in (5) is the same as in [12]. For Example 4, our error rate in (6) matches the results in [12]. When the rank KK is unknown and estimated by FADI, the following theorem shows that under appropriate conditions, our estimator K^\widehat{K} presented in Section 3.5 recovers the true KK with high probability.

Theorem 4.3.

Under Assumption 1, define η0=480ce1d/(Δ2p)r1(d)logd\eta_{0}=480c_{e}^{-1}\sqrt{d/(\Delta^{2}p)}r_{1}(d)\log d, where ce>0c_{e}>0 is the constant defined in Remark 2. When d2d\geq 2, 2Kpd(logd)22K\leq p\ll d(\log d)^{-2} and η0(32logd)2/(pK+1)\eta_{0}\leq(32\log d)^{-{2}/{(p-K+1)}}, if we choose μ0\mu_{0} such that Δη0/24μ0Δη0/12\Delta\eta_{0}/24\leq\mu_{0}\leq\Delta\sqrt{\eta_{0}}/12, then with probability at least 1O(d(L20)/2)1-O(d^{-(L\wedge 20)/2}), K^=K\widehat{K}=K.

We defer the proof to Appendix B.4. We provide case-specific choices of the thresholding parameter μ0\mu_{0} in the following corollary, whose proof can be found in Appendix B.5.

Corollary 4.4.

For Examples 1 to 4, we specify the choice of μ0\mu_{0} under certain regularity conditions.

  • Example 1: Under the conditions that 2Kp(logd)2d2K\leq p\ll(\log d)^{-2}d, nκ12rd/p(logd)4n\gg\kappa_{1}^{2}rd/p(\log d)^{4}, (λ1+σ2)(np/(dlogd))1/4(\lambda_{1}+\sigma^{2})\ll\left(\sqrt{np}/(d\log d)\right)^{1/4} and Δ(σ2(np)1/2dlogd)1/3\Delta\gg\left({\sigma^{-2}(np)^{-1/2}}d\log d\right)^{1/3}, if we take μ0=(d(np)1/2logd)3/4/12\mu_{0}=\left(d(np)^{-1/2}\log d\right)^{3/4}/12, with probability at least 1O(d(L20)/2)1-O\left(d^{-(L\wedge 20)/2}\right), we have K^=K\widehat{K}=K.

  • Example 2: Define θ^=d2ij𝐌^ij\widehat{\theta}=d^{-2}\sum_{i\leq j}\widehat{\mathbf{M}}_{ij}, then under the condition that θK2d1/2+ϵ\theta\geq K^{2}d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0 and dp(logd)2d\sqrt{d}\lesssim p\ll(\log d)^{-2}d, if we take μ0=(θ^/p)1/2dlogd/12\mu_{0}=(\widehat{\theta}/p)^{1/2}d\log d/12, with probability at least 1O(d(L20)/2)1-O\left(d^{-(L\wedge 20)/2}\right), we have K^=K\widehat{K}=K.

  • Example 3: Under the conditions that 2Kp(logd)2d2K\leq p\ll(\log d)^{-2}d and K(logd)3n/pΔ02nK/d(logd)2{K(\log d)^{3}}\sqrt{n/p}\ll\Delta_{0}^{2}\ll{nK/d}(\log d)^{2}, if we take μ0=d(logd)2n/p/12\mu_{0}=d(\log d)^{2}\sqrt{n/p}/12, with probability at least 1O(d(L20)/2)1-O\left(d^{-(L\wedge 20)/2}\right), we have K^=K\widehat{K}=K.

  • Example 4: When θd1/2+ϵ\theta\geq d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0, 𝐕2,μK/d\|\mathbf{V}\|_{2,\infty}\leq\sqrt{\mu K/d} for some μ1\mu\geq 1, κ22μ2K(logd)2\kappa_{2}^{2}\mu^{2}K\ll(\log d)^{2}, dp(logd)2d\sqrt{d}\lesssim p\ll(\log d)^{-2}d and (pθ)1/4dσ/Δlogd=o(1)(p\theta)^{-1/4}\sqrt{d\sigma/\Delta}\log d=o(1), if we take μ0=dσ^0logd(pθ^)1/2/12\mu_{0}=d\widehat{\sigma}_{0}\log d(p\widehat{\theta})^{-1/2}/12, where σ^0=((i,j)𝒮(θ^𝐌^ij)2/|𝒮|)1/2\widehat{\sigma}_{0}=\big{(}\sum_{(i,j)\in{\mathcal{S}}}(\widehat{\theta}\widehat{\mathbf{M}}_{ij})^{2}/|{\mathcal{S}}|\big{)}^{1/2}, then with probability at least 1O(d(L20)/2)1-O\left(d^{-(L\wedge 20)/2}\right), we have K^=K\widehat{K}=K.

Remark 5.

For Example 3, we impose the upper bound on Δ0\Delta_{0} because in practice the eigengap Δ\Delta is unknown, and estimation of Δ\Delta requires knowledge of KK. Imposing the upper bound on Δ0\Delta_{0} makes the term in μ0\mu_{0} involving knowledge of Δ\Delta vanish and enables the estimation of KK from observed data.

4.2 Inferential Results on the Asymptotic Distribution: Intuition and Assumptions

In Section 4.1, we discuss the theoretical upper bound for the error rate and present the bias-variance decomposition for the FADI estimator 𝐕~F\widetilde{\mathbf{V}}^{\text{F}}. From (2), we can see that when LpdLp\gg d, the bias term will be the leading term, and the dominating error comes from ρ(𝐕^,𝐕)\rho(\widehat{\mathbf{V}},\mathbf{V}), whereas when LpdLp\ll d, the variance term will be the leading term and the main error derives from ρ(𝐕~F,𝐕^)\rho(\widetilde{\mathbf{V}}^{\text{F}},\widehat{\mathbf{V}}). This offers insight into conducting inferential analysis on the estimator and implies a possible phase transition in the asymptotic distribution. Before moving on to further discussions, we state the following assumption to ensure that the bias of 𝐌^\widehat{\mathbf{M}} is negligible.

Assumption 2 (Statistical Rate for the Biased Error Term).

For the error matrix 𝐄\mathbf{E} we have the decomposition 𝐄=𝐄0+𝐄b\mathbf{E}=\mathbf{E}_{0}+\mathbf{E}_{b}, where 𝔼(𝐄0)=𝟎{\mathbb{E}}(\mathbf{E}_{0})=\mathbf{0} and 𝐄b\mathbf{E}_{b} is the biased error term satisfying limd(𝐄b2r2(d))=1\lim_{d\rightarrow\infty}\mathbb{P}\big{(}\|\mathbf{E}_{b}\|_{2}{\leq}r_{2}(d)\big{)}=1 with r2(d)=o(r1(d))r_{2}(d)=o\big{(}r_{1}(d)\big{)}.

In fact, we will later show in Section 4.3 and Section 4.4 that the leading term for the distance between 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} and 𝐕\mathbf{V} takes on two different forms under the two regimes:

𝐕~F𝐇𝐕𝐏𝐄0𝐕𝚲1, if Lpd;𝐕~F𝐇𝐕𝐏𝐄0𝛀𝐁𝛀L1, if Lpd,\begin{array}[]{ll}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\approx\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}&,\quad\text{ if }Lp\gg d;\\ \widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\approx\mathbf{P}_{\perp}\mathbf{E}_{0}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}L^{-1}&,\quad\text{ if }Lp\ll d,\end{array}

where 𝐇\mathbf{H} is some orthogonal matrix aligning 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} with 𝐕\mathbf{V}, 𝐏=𝐈𝐕𝐕\mathbf{P}_{\perp}=\mathbf{I}-\mathbf{V}\mathbf{V}^{\top} is the projection matrix onto the linear space perpendicular to 𝐕\mathbf{V}, 𝛀=(𝛀(1)/p,,𝛀(L)/p)d×Lp\mathbf{\Omega}=(\mathbf{\Omega}^{(1)}/\sqrt{p},\ldots,\mathbf{\Omega}^{(L)}/\sqrt{p})\in\mathbb{R}^{d\times Lp} and 𝐁𝛀=(𝐁(1),,𝐁(L))\mathbf{B}_{\mathbf{\Omega}}=(\mathbf{B}^{(1)\top},\ldots,\mathbf{B}^{(L)\top})^{\top} with 𝐁()=(𝚲𝐕𝛀()/p)p×K\mathbf{B}^{(\ell)}=(\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}\in\mathbb{R}^{p\times K} for =1,,L\ell=1,\ldots,L. To get an intuitive understanding on the form of the leading error term, let’s start with the regime LpdLp\gg d where ρ(𝐕~F,𝐕)ρ(𝐕^,𝐕)\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})\approx\rho(\widehat{\mathbf{V}},\mathbf{V}) and consider the case where {|λk|}k=1K\{|\lambda_{k}|\}_{k=1}^{K} are well-separated such that 𝐇𝐈K\mathbf{H}\approx\mathbf{I}_{K}. Following basic algebra, we have

𝐕~F𝐕\displaystyle\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V} 𝐕^𝐕𝐏(𝐕^𝐕)=𝐏(𝐌^𝐕^𝚲^1𝐌𝐕𝚲1)\displaystyle\approx\widehat{\mathbf{V}}-\mathbf{V}\approx\mathbf{P}_{\perp}(\widehat{\mathbf{V}}-\mathbf{V})=\mathbf{P}_{\perp}(\widehat{\mathbf{M}}\widehat{\mathbf{V}}\widehat{\mathbf{\Lambda}}^{-1}-\mathbf{M}\mathbf{V}\mathbf{\Lambda}^{-1})
𝐏(𝐌^𝐌)𝐕𝚲1=𝐏𝐄0𝐕𝚲1,\displaystyle\approx\mathbf{P}_{\perp}(\widehat{\mathbf{M}}-\mathbf{M})\mathbf{V}\mathbf{\Lambda}^{-1}=\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1},

where 𝚲^\widehat{\mathbf{\Lambda}} is the KK-leading eigenvalues of 𝐌^\widehat{\mathbf{M}} corresponding to 𝐕^\widehat{\mathbf{V}}, and the second approximation is due to the fact that 𝐕^\widehat{\mathbf{V}} and 𝐕\mathbf{V} are fairly close and 𝐏𝐕(𝐕^𝐕)\mathbf{P}_{\mathbf{V}}(\widehat{\mathbf{V}}-\mathbf{V}) will be negligible.

Now we turn to the scenario LpdLp\ll d, where the error mainly comes from 𝐕~F𝐕^\widetilde{\mathbf{V}}^{\text{F}}-\widehat{\mathbf{V}}. For a given [L]\ell\in[L], denote 𝐘()=𝐌𝛀()=𝐕𝚲𝛀~()\mathbf{Y}^{(\ell)}=\mathbf{M}\bm{\Omega}^{(\ell)}=\mathbf{V}\mathbf{\Lambda}\widetilde{\bm{\Omega}}^{(\ell)}, where 𝛀~()=𝐕𝛀()\widetilde{\bm{\Omega}}^{(\ell)}=\mathbf{V}^{\top}\bm{\Omega}^{(\ell)} is also a Gaussian test matrix. Intuitively, p1𝛀~()𝛀~()𝐈Kp^{-1}\widetilde{\mathbf{\Omega}}^{(\ell)}\widetilde{\mathbf{\Omega}}^{(\ell)\top}\approx\mathbf{I}_{K} when pp is much larger than KK. Hence 𝛀~()\widetilde{\mathbf{\Omega}}^{(\ell)} acts like an orthonormal matrix scaled by p\sqrt{p}, and the rank-KK truncated SVD for 𝐘^()/p\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p} and 𝐘()/p\mathbf{Y}^{(\ell)}/\sqrt{p} will approximately be 𝐕^()𝚲^(𝛀~()/p)\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{\Lambda}}(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p}) and 𝐕𝚲(𝛀~()/p)\mathbf{V}\mathbf{\Lambda}(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p}) respectively. Then following similar arguments as when LpdLp\gg d, we have

𝐕^()𝐕𝐏((𝐘^()/p)(𝛀~()/p)𝚲^1(𝐘()/p)(𝛀~()/p)𝚲1)\displaystyle\widehat{\mathbf{V}}^{(\ell)}-\mathbf{V}\approx\mathbf{P}_{\perp}\left((\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\widehat{\mathbf{\Lambda}}^{-1}-(\mathbf{Y}^{(\ell)}/\sqrt{p})(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\mathbf{\Lambda}^{-1}\right)
𝐏(𝐘^()/p𝐘()/p)(𝛀~()/p)𝚲1𝐏𝐄0(𝛀()/p)𝐁(),\displaystyle\quad\approx\mathbf{P}_{\perp}\left(\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-\mathbf{Y}^{(\ell)}/\sqrt{p}\right)(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\mathbf{\Lambda}^{-1}\approx\mathbf{P}_{\perp}\mathbf{E}_{0}(\bm{\Omega}^{(\ell)}/\sqrt{p})\mathbf{B}^{(\ell)},

where the last approximation is because when 𝛀~()/p\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p} is almost orthonormal we have 𝐁()=(𝚲𝛀~()/p)(𝛀~()/p)𝚲1\mathbf{B}^{(\ell)}=(\mathbf{\Lambda}\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\dagger}\approx(\widetilde{\bm{\Omega}}^{(\ell)}/\sqrt{p})^{\top}\mathbf{\Lambda}^{-1}. Then aggregating the results over [L]\ell\in[L] we have

𝐕~F𝐕1L=1L{𝐕^()𝐕}1L=1L𝐏𝐄0(𝛀()/p)𝐁()=𝐏𝐄0𝛀𝐁𝛀L1.\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\approx\frac{1}{L}\sum_{\ell=1}^{L}\left\{\widehat{\mathbf{V}}^{(\ell)}-\mathbf{V}\right\}\approx\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\mathbf{E}_{0}(\bm{\Omega}^{(\ell)}/\sqrt{p})\mathbf{B}^{(\ell)}=\mathbf{P}_{\perp}\mathbf{E}_{0}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}L^{-1}.

It is worth noting that

1L𝛀𝐁𝛀1L(=1L(𝛀/p)(𝛀/p))𝐕𝚲1𝐕𝚲1,\frac{1}{L}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}\approx\frac{1}{L}\left(\sum_{\ell=1}^{L}(\bm{\Omega}/\sqrt{p})(\bm{\Omega}/\sqrt{p})^{\top}\right)\mathbf{V}\mathbf{\Lambda}^{-1}\rightarrow\mathbf{V}\mathbf{\Lambda}^{-1}, (7)

when LpdLp\gg d, which demonstrates the consistency of the leading term across different regimes of LpLp. To unify the notations, we denote the leading term for 𝐕~F𝐇𝐕\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V} by

𝒱(𝐄0)={𝐏𝐄0𝐕𝚲1, if Lpd;𝐏𝐄0𝛀𝐁𝛀L1, if Lpd.\mathcal{V}(\mathbf{E}_{0})=\left\{\begin{array}[]{ll}\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}&,\quad\text{ if }Lp\gg d;\\ \mathbf{P}_{\perp}\mathbf{E}_{0}\bm{\Omega}\mathbf{B}_{\bm{\Omega}}L^{-1}&,\quad\text{ if }Lp\ll d.\end{array}\right.
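The consistency claim in (7) can also be checked numerically with a few lines of code (toy dimensions and seed are our own choices; this is only a sanity-check sketch, not part of the theory). The relative error printed below shrinks as Lp/dLp/d and p/Kp/K grow.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, p, L = 100, 3, 50, 200                       # Lp = 10000 >> d (toy sizes)
V, _ = np.linalg.qr(rng.standard_normal((d, K)))   # orthonormal eigenvector matrix
Lam = np.diag([6.0, 4.0, 2.0])

avg = np.zeros((d, K))
for _ in range(L):
    Omega_l = rng.standard_normal((d, p))
    B_l = np.linalg.pinv(Lam @ V.T @ Omega_l / np.sqrt(p))   # B^(l), p x K
    avg += (Omega_l / np.sqrt(p)) @ B_l / L                  # accumulates Omega B_Omega / L

target = V @ np.linalg.inv(Lam)                              # V Lambda^{-1}
print(np.linalg.norm(avg - target) / np.linalg.norm(target)) # small relative error
```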

Before we formally present the theorems, we introduce the following extra regularity conditions necessary for studying the asymptotic features of the eigenspace estimator.

Assumption 3 (Incoherence Condition).

For the eigenspace of the true matrix 𝐌\mathbf{M}, we assume

𝐕2,μK/d,\|\mathbf{V}\|_{2,\infty}\leq\sqrt{{\mu K}/{d}},

where μ1\mu\geq 1 may change with dd.

Assumption 4 (Statistical Rates for Eigenspace Convergence).

For the unbiased error term 𝐄0\mathbf{E}_{0} and the traditional PCA estimator 𝐕^\widehat{\mathbf{V}}, we have the following statistical rates

limd(𝐕^sgn(𝐕^𝐕)𝐕2,r3(d))=1,limd(𝐄0(𝐈d𝐕^𝐕^)𝐕2,r4(d))=1.\lim_{d\rightarrow\infty}\mathbb{P}\big{(}\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}{\leq}r_{3}(d)\big{)}=1,\,\,\lim_{d\rightarrow\infty}\mathbb{P}\big{(}\|\mathbf{E}_{0}(\mathbf{I}_{d}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top})\mathbf{V}\|_{2,\infty}{\leq}r_{4}(d)\big{)}=1.
Assumption 5 (Central Limit Theorem).

For the leading term 𝒱(𝐄0)\mathcal{V}(\mathbf{E}_{0}) and any j[d]j\in[d], it holds that

𝚺j1/2𝒱(𝐄0)𝐞j𝑑𝒩(𝟎,𝐈K),\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),

where 𝚺j=Cov(𝒱(𝐄0)𝐞j|𝛀)\bm{\Sigma}_{j}=\operatorname*{\rm Cov}(\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}|\bm{\Omega}) when LpdLp\ll d and 𝚺j=Cov(𝒱(𝐄0)𝐞j)\bm{\Sigma}_{j}=\operatorname*{\rm Cov}(\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}) when LpdLp\gg d.

Assumption 3 is the incoherence condition [10], which guarantees that the information of the eigenspace is spread uniformly across coordinates. In Assumption 4, r3(d)r_{3}(d) bounds the row-wise estimation error for the eigenspace, while r4(d)r_{4}(d) characterizes the row-wise convergence rate of the residual error term projected onto the spaces spanned by 𝐕^\widehat{\mathbf{V}}_{\perp} and 𝐕\mathbf{V} consecutively, i.e., 𝐄0(𝐈d𝐕^𝐕^)𝐕2,=𝐄0𝐏𝐕^𝐏𝐕2,\|\mathbf{E}_{0}(\mathbf{I}_{d}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top})\mathbf{V}\|_{2,\infty}=\|\mathbf{E}_{0}\mathbf{P}_{\widehat{\mathbf{V}}_{\perp}}\mathbf{P}_{\mathbf{V}}\|_{2,\infty}. Assumption 5 states that the leading term satisfies the central limit theorem (CLT). These assumptions are stated for the general framework and will be translated into case-specific conditions for the concrete examples. With the above assumptions in place, we are ready to present the formal inferential results.
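Throughout, sgn()\operatorname{sgn}(\cdot) denotes the matrix sign function; assuming the convention standard in the eigenvector perturbation literature (sgn(𝐌)=𝐔𝐖\operatorname{sgn}(\mathbf{M})=\mathbf{U}\mathbf{W}^{\top} for the singular value decomposition 𝐌=𝐔𝚺𝐖\mathbf{M}=\mathbf{U}\mathbf{\Sigma}\mathbf{W}^{\top}), alignment matrices such as sgn(𝐕^𝐕)\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V}) in Assumption 4 can be computed as in the short sketch below; this is our own reading of the notation, not code from the paper.

```python
import numpy as np

def matrix_sgn(M):
    """Matrix sign function: sgn(M) = U @ W.T for the SVD M = U diag(s) W.T,
    i.e., the closest orthogonal matrix to M (assumed convention)."""
    U, _, Wt = np.linalg.svd(M, full_matrices=False)
    return U @ Wt

# For example, the alignment of V_hat (d x K) with V (d x K) would be
#   H0 = matrix_sgn(V_hat.T @ V)
```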

4.3 Inference When LpdLp\gg d

Recall that 𝐕~\widetilde{\mathbf{V}} is the KK leading eigenvectors of the matrix 𝚺~=1L=1L𝐕^()𝐕^()\widetilde{\mathbf{\Sigma}}=\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}, and 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} is the KK leading left singular vectors of the matrix 𝐘~=𝚺~q𝛀F\widetilde{\mathbf{Y}}=\widetilde{\bm{\Sigma}}^{q}\mathbf{\Omega}^{\text{F}}. We define 𝐇=𝐇2𝐇1𝐇0\mathbf{H}=\mathbf{H}_{2}\mathbf{H}_{1}\mathbf{H}_{0} to be the alignment matrix between 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} and 𝐕\mathbf{V}, where 𝐇2=sgn(𝐕~F𝐕~)\mathbf{H}_{2}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\widetilde{\mathbf{V}}), 𝐇1=sgn(𝐕~𝐕^)\mathbf{H}_{1}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\top}\widehat{\mathbf{V}}) and 𝐇0=sgn(𝐕^𝐕)\mathbf{H}_{0}=\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}{\mathbf{V}}). The following theorem provides the distributional guarantee of FADI when LpdLp\gg d.

Theorem 4.5.

When LpdLp\gg d, under Assumptions 1 - 5, recall 𝚺j=Cov(𝒱(𝐄0)𝐞j)\bm{\Sigma}_{j}=\operatorname*{\rm Cov}\big{(}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\big{)} for j[d]j\in[d]. Define r(d)=Δ1(KdpLr1(d)+r3(d)r1(d)+μKdΔ2r1(d)2+r2(d)+r4(d))r(d)=\Delta^{-1}\big{(}\sqrt{\frac{Kd}{pL}}r_{1}(d)+r_{3}(d)r_{1}(d)\!+\!\sqrt{\frac{\mu K}{d\Delta^{2}}}r_{1}(d)^{2}\!+\!r_{2}(d)\!+\!r_{4}(d)\big{)}, and assume that there exists a statistical rate η1(d)\eta_{1}(d) such that

minj[d]λK(𝚺j)η1(d)andη1(d)1/2r(d)=o(1).\min_{j\in[d]}\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\gtrsim\eta_{1}(d)\quad\text{and}\quad\eta_{1}(d)^{-1/2}r(d)=o(1).

If Δ1r1(d)(logd)2d/p=o(1)\Delta^{-1}r_{1}(d)(\log d)^{2}\sqrt{d/p}=o(1) and we take

q2+log(Ld)/loglogd,pmax(2K,K+7)andpmax(2K,K+8q1),q\geq 2+\log(Ld)/\log\log d,\quad p^{\prime}\geq\max(2K,K+7)\quad\text{and}\quad p\geq\max(2K,K+8q-1),

we have

𝚺j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (8)
Remark 6.

Please refer to Appendix B.9 for the proof of Theorem 4.5. Here η1(d)\eta_{1}(d) guarantees that the asymptotic covariance of the leading term is positive definite, and the rate r(d)r(d) bounds the remainder term stemming from fast sketching approximation and eigenspace misalignment. When the rate η1(d)\eta_{1}(d) is not too small relative to r(d)r(d), Theorem 4.5 guarantees the distributional convergence of the FADI estimator. We will see in the concrete examples that the asymptotic covariance of the FADI estimator under the regime LpdLp\gg d is the same as that of the traditional PCA estimator. In other words, we can increase the number of repeated sketches in exchange for the same testing efficiency as the traditional PCA.

We present the corollaries of Theorem 4.5 for Examples 1 to 4 as follows.

4.3.1 Spiked Covariance Model

Recall the set SS of size KK^{\prime} defined in Section 3.3 for estimating σ^2\widehat{\sigma}^{2}. We denote by 𝚺S\bm{\Sigma}_{S} the population covariance matrix corresponding to 𝚺^S\widehat{\bm{\Sigma}}_{S} and define σ~1=𝚺S2\widetilde{\sigma}_{1}=\|\mathbf{\Sigma}_{S}\|_{2}. Denote by δ=λK(𝚺S)σ2\delta=\lambda_{K}(\bm{\Sigma}_{S})-\sigma^{2} the eigengap of 𝚺S\bm{\Sigma}_{S}. We have the following corollary of Theorem 4.5 for Example 1.

Corollary 4.6.

Assume that {𝐗i}i=1n\{\bm{X}_{i}\}_{i=1}^{n} are i.i.d. multivariate Gaussian. If we take K=K+1K^{\prime}=K+1, pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), q2+log(Ld)/loglogdq\geq 2+\log(Ld)/\log\log d and pmax(2K,K+8q1)p\geq\max(2K,K+8q-1), then when LpKdrκ12λ1/σ2Lp\gg{Kdr}\kappa_{1}^{2}{\lambda_{1}}/{\sigma^{2}}, under Assumption 3 and the conditions that

nmax(κ14(logd)4r2λ1/σ2,(κ1λ1/σ2)6)andKmin((σ~1/δ)2κ1r,μ2/3κ14/3d2/3),n\gg\max\Big{(}\kappa_{1}^{4}(\log d)^{4}r^{2}{\lambda_{1}}/{\sigma^{2}},\big{(}\kappa_{1}{\lambda_{1}}/{\sigma^{2}}\big{)}^{6}\Big{)}\,\,\text{and}\,\,K\ll\min\left(\big{(}{\widetilde{\sigma}_{1}}/{\delta}\big{)}^{-2}\kappa_{1}r,\mu^{-{2}/{3}}\kappa^{-{4}/{3}}_{1}d^{{2}/{3}}\right),

we have that (8) holds. Furthermore, we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d],\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d], (9)

where 𝚺~j=σ2n𝚲1𝐕𝚺𝐕𝚲1\widetilde{\bm{\Sigma}}_{j}=\frac{\sigma^{2}}{n}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\bm{\Sigma}\mathbf{V}\mathbf{\Lambda}^{-1} is a simplification of 𝚺j\bm{\Sigma}_{j} under Example 1. Besides, if we define 𝚲~=𝐕~F𝐌^𝐕~F\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F} and estimate 𝚺~j\widetilde{\bm{\Sigma}}_{j} by 𝚺^j=1n(σ^2𝚲~1+σ^4𝚲~2)\widehat{\bm{\Sigma}}_{j}=\frac{1}{n}(\widehat{\sigma}^{2}\widetilde{\mathbf{\Lambda}}^{-1}+\widehat{\sigma}^{4}\widetilde{\mathbf{\Lambda}}^{-2}), then we have

𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (10)
Remark 7.

Please refer to Appendix B.10 for the proof. We compute 𝚲~\widetilde{\mathbf{\Lambda}} distributively across the mm data splits, and the cost for computing 𝚺^j\widehat{\bm{\Sigma}}_{j} is O(ndK/m)O(ndK/m). We recommend taking p=dp=\lceil\sqrt{d}\rceil, L=κ12Kd3/2logdL=\lceil\kappa_{1}^{2}Kd^{3/2}\log d\rceil and q=logd2+log(Ld)/loglogdq=\lceil\log d\rceil\gg 2+\log(Ld)/\log\log d for optimal computational efficiency, where the total computation cost will be O(K3d5/2(logd)2)O(K^{3}d^{5/2}(\log d)^{2}). Our asymptotic covariance matrix is the same as that of the traditional PCA estimator under the incoherence condition [5, 32, 41]. Specifically, Wang and Fan, [41] studied the asymptotic distribution of the traditional PCA estimator by assuming that the spiked eigenvalues are well-separated and diverging to infinity, which is not required by our paper. Our scaling conditions are stronger than the estimation results in Corollary 4.2 to cancel out the additional randomness induced by fast sketching and allow for efficient inference.
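To illustrate how the plug-in covariance estimate of Corollary 4.6 would be used for coordinate-wise inference, we give a hedged sketch below with our own names; M_hat, sigma2_hat and n are assumed to be available from the preceding steps, and the Wald statistic is compared to a chi-square quantile with KK degrees of freedom.

```python
import numpy as np
from scipy.stats import chi2

def plugin_cov_spiked(V_tilde_F, M_hat, sigma2_hat, n):
    """Sigma_hat_j = (1/n)(sigma^2 Lam_tilde^{-1} + sigma^4 Lam_tilde^{-2}),
    with Lam_tilde = V_tilde_F' M_hat V_tilde_F (same for every row j)."""
    Lam_tilde = V_tilde_F.T @ M_hat @ V_tilde_F
    Lam_inv = np.linalg.inv(Lam_tilde)
    return (sigma2_hat * Lam_inv + sigma2_hat ** 2 * Lam_inv @ Lam_inv) / n

def wald_stat(row_est, row_truth, Sigma_hat_j):
    """||Sigma_hat_j^{-1/2}(row_est - row_truth)||^2; compare to chi2.ppf(0.95, K)."""
    diff = np.asarray(row_est) - np.asarray(row_truth)
    return diff @ np.linalg.solve(Sigma_hat_j, diff)

# e.g. coverage check for row j:
#   wald_stat(V_tilde_F[j], (V @ H.T)[j], S_j) <= chi2.ppf(0.95, df=K)
```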

4.3.2 Degree-Corrected Mixed Membership Models

Corollary 4.7.

When θK2d1/2+ϵ\theta\geq K^{2}d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0 and K=o(d1/32)K=o(d^{1/32}), if we take pdp\gtrsim\sqrt{d}, pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), LK5d2/pL\gg K^{5}d^{2}/p and q2+log(Ld)/loglogdq\geq 2+\log(Ld)/\log\log d, then (8) holds. Furthermore, if we denote 𝚺~j=𝚲1𝐕diag([𝐌jj(1𝐌jj)]j[d])𝐕𝚲1\widetilde{\bm{\Sigma}}_{j}=\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\operatorname{diag}\big{(}\left[\mathbf{M}_{jj^{\prime}}(1-\mathbf{M}_{jj^{\prime}})\right]_{j^{\prime}\in[d]}\big{)}\mathbf{V}\mathbf{\Lambda}^{-1}, we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (11)

Besides, define 𝐌~=(𝐕~F𝐕~F)𝐌^(𝐕~F𝐕~F)\widetilde{\mathbf{M}}=(\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{{\rm F}\top})\widehat{\mathbf{M}}(\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{{\rm F}\top}) and 𝚲~=𝐕~F𝐌^𝐕~F\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F}, then if we estimate 𝚺~j\widetilde{\bm{\Sigma}}_{j} by 𝚺^j=𝚲~1𝐕~Fdiag([𝐌~jj(1𝐌~jj)]j[d])𝐕~F𝚲~1\widehat{\bm{\Sigma}}_{j}=\widetilde{\mathbf{\Lambda}}^{-1}\widetilde{\mathbf{V}}^{{\rm F}\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{jj^{\prime}}(1-\widetilde{\mathbf{M}}_{jj^{\prime}})]_{j^{\prime}\in[d]}\big{)}\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{\Lambda}}^{-1}, we have

𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (12)
Remark 8.

The proof is deferred to Appendix B.11. We can obtain 𝚲~\widetilde{\mathbf{\Lambda}} by computing 𝐕~F𝐗(s)\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{X}^{(s)} in parallel for s[m]s\in[m], and the computational cost for 𝚺^j\widehat{\bm{\Sigma}}_{j} is O(d2K/m)O(d^{2}K/m). To achieve the optimal computational efficiency, we would take p=dp=\lceil\sqrt{d}\rceil and L=K5d3/2logdL=\lceil K^{5}d^{3/2}\log d\rceil. Hence taking q=logdq=\lceil\log d\rceil is sufficient, and the total computational cost will be O(K7d5/2(logd)2)O(K^{7}d^{5/2}(\log d)^{2}). Inferential analyses of the membership profiles have received attention in previous works [16, 37]. Fan et al., [16] studied the asymptotic normality of the spectral estimator under the DCMM model with complicated assumptions on the eigen-structure (see Conditions 1, 3, 6, 7 in their paper). In comparison, we only impose non-singularity conditions on the membership profiles, but have a stronger scaling condition on the signal strength to facilitate the divide-and-conquer process. Our asymptotic covariance is almost the same as Fan et al., [16]’s, suggesting the same level of asymptotic efficiency.

4.3.3 Gaussian Mixture Models

Denote by μθ=Δ01n/K𝚯2,\mu_{\theta}=\Delta_{0}^{-1}\sqrt{n/K}\|\mathbf{\Theta}\|_{2,\infty} the incoherence parameter for the Gaussian means. Then we have the following corollary for Example 3.

Corollary 4.8.

When LpdLp\gg d, if we take q2+log(Ld)/loglogdq\geq 2+\log(Ld)/\log\log d, pmax(2K,K+8q1)p\geq\max(2K,K+8q-1) and pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), under the conditions that

K=o(d),nd2,Kn(logd)2Δ02n4/3μθ2dandLKd2p,K=o(d),\quad n\gg d^{2},\quad K\sqrt{n}(\log d)^{2}\ll\Delta_{0}^{2}\ll\frac{n^{4/3}}{\mu_{\theta}^{2}d}\quad\text{and}\quad L\gg\frac{Kd^{2}}{p},

we have that (8) holds. Furthermore, if we denote 𝚺~j=𝚲1𝐕{𝐅𝚯𝚯𝐅+n𝐈d}𝐕𝚲1\widetilde{\bm{\Sigma}}_{j}=\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\big{\{}\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d}\big{\}}\mathbf{V}\mathbf{\Lambda}^{-1}, we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (13)

If we define 𝚲~=𝐕~F𝐌^𝐕~F\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F} and estimate 𝚺~j\widetilde{\bm{\Sigma}}_{j} by 𝚺^j=𝚲~1+n𝚲~2\widehat{\bm{\Sigma}}_{j}=\widetilde{\mathbf{\Lambda}}^{-1}+n\widetilde{\mathbf{\Lambda}}^{-2}, we have

𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (14)
Remark 9.

Please refer to Appendix B.12 for the proof. We impose the upper bound on Δ0\Delta_{0} to guarantee that the leading term satisfies the CLT. The distributed computation cost of 𝚺^j\widehat{\bm{\Sigma}}_{j} is O(ndK/m)O(ndK/m). We recommend taking p=dp=\lceil\sqrt{d}\rceil, L=Kd3/2logdL=\lceil Kd^{3/2}\log d\rceil and q=logdq=\lceil\log d\rceil, with total complexity of O(K3d5/2(logd)2)O(K^{3}d^{5/2}(\log d)^{2}). In Corollary 4.8, the scaling condition for nn is nd2n\gg d^{2} compared to n>dn>d in Corollary 4.2, where the extra factor dd is to guarantee a fast enough convergence rate of the remainder term for inference. It can be verified that the Cramér-Rao lower bound for unbiased estimators of 𝐕𝐞j\mathbf{V}^{\top}\mathbf{e}_{j} is 𝚲1\mathbf{\Lambda}^{-1}, and thus we can also see from (13) that when Δ0\Delta_{0} is large enough, the asymptotic efficiency of 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} is 1 under the regime LpdLp\gg d.

4.3.4 Incomplete Matrix Inference

Corollary 4.9.

When Lpκ22Kd2Lp\gg\kappa_{2}^{2}Kd^{2} and θd1/2+ϵ\theta\geq d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0, if we take pmax(2K,K+7)p^{\prime}\geq\max(2K,K+7), pdp\gtrsim\sqrt{d} and q2+log(Ld)/loglogdq\geq 2+\log(Ld)/\log\log d, then under Assumption 3 and the conditions that

κ26K3μ3=o(d1/2)andσ/Δθ/dmin((κ22μK+κ2Klogd)1,p/d),\kappa_{2}^{6}K^{3}\mu^{3}=o(d^{1/2})\quad\text{and}\quad\sigma/\Delta\ll\sqrt{\theta/d}\cdot\min\left(\big{(}\kappa_{2}^{2}\sqrt{\mu K}+\kappa_{2}\sqrt{K\log d}\big{)}^{-1},\sqrt{p/d}\right),

we have that (8) holds. Furthermore, if we denote 𝚺~j=𝚲1𝐕diag([𝐌jj2(1θ)/θ+σ2/θ]j=1d)𝐕𝚲1\widetilde{\bm{\Sigma}}_{j}=\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\operatorname{diag}\big{(}[\mathbf{M}_{jj^{\prime}}^{2}(1-\theta)/\theta+\sigma^{2}/\theta]_{j^{\prime}=1}^{d}\big{)}\mathbf{V}\mathbf{\Lambda}^{-1}, we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (15)

Define 𝚲~=𝐕~F𝐌^𝐕~F\widetilde{\mathbf{\Lambda}}=\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\rm F} and 𝐌~=𝐕~F𝚲~𝐕~F\widetilde{\mathbf{M}}=\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{\Lambda}}\widetilde{\mathbf{V}}^{{\rm F}\top}. If we estimate σ2\sigma^{2} by σ^2=(i,i)𝒮(θ^𝐌^ii𝐌~ii)2/|𝒮|\widehat{\sigma}^{2}=\sum_{(i,i^{\prime})\in{\mathcal{S}}}(\widehat{\theta}\widehat{\mathbf{M}}_{ii^{\prime}}-\widetilde{\mathbf{M}}_{ii^{\prime}})^{2}/|{\mathcal{S}}| and 𝚺~j\widetilde{\bm{\Sigma}}_{j} by 𝚺^j=𝚲~1𝐕~Fdiag([𝐌~jj2(1θ^)/θ^+σ^2/θ^]j=1d)𝐕~F𝚲~1\widehat{\bm{\Sigma}}_{j}=\widetilde{\mathbf{\Lambda}}^{-1}\widetilde{\mathbf{V}}^{{\rm F}\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{jj^{\prime}}^{2}(1-\widehat{\theta})/\widehat{\theta}+\widehat{\sigma}^{2}/\widehat{\theta}]_{j^{\prime}=1}^{d}\big{)}\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{\Lambda}}^{-1}, we have

𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (16)
Remark 10.

Please see Appendix B.13 for the proof of Corollary 4.9. We compute 𝚲~\widetilde{\mathbf{\Lambda}} by calculating 𝐕~F𝐗(s)\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{X}^{(s)} in parallel, and then 𝚲~\widetilde{\mathbf{\Lambda}} can be communicated across servers at low cost for computing σ^2\widehat{\sigma}^{2}. The total computational cost for calculating 𝚺^j\widehat{\bm{\Sigma}}_{j} is O(d2K/m)O(d^{2}K/m). We recommend taking p=dp=\lceil\sqrt{d}\rceil, L=κ22Kd3/2logdL=\lceil\kappa_{2}^{2}Kd^{3/2}\log d\rceil and q=logdq=\lceil\log d\rceil, and the total computational cost will be O(K3d5/2(logd)2)O(K^{3}d^{5/2}(\log d)^{2}). Chen et al., [13] studied the incomplete matrix inference problem through penalized optimization, and their testing efficiency is the same as ours.

4.4 Inference When LpdLp\ll d

As in the case when LpdLp\gg d, we first redefine the alignment matrix between 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} and 𝐕\mathbf{V} as 𝐇=𝐇1𝐇0\mathbf{H}=\mathbf{H}_{1}\mathbf{H}_{0}, where 𝐇1=sgn(𝐕~F𝐕~)\mathbf{H}_{1}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\widetilde{\mathbf{V}}) and 𝐇0=sgn(𝐕~𝐕)\mathbf{H}_{0}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\top}\mathbf{V}). Then we have the following theorem characterizing the limiting distribution of 𝐕~F\widetilde{\mathbf{V}}^{\text{F}}.

Theorem 4.10.

For the case when LpdLp\ll d, under Assumptions 1, 2, 3 and 5, for j[d]j\in[d], recall 𝚺j=Cov(𝒱(𝐄0)𝐞j|𝛀)\mathbf{\Sigma}_{j}=\operatorname{Cov}(\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}|\mathbf{\Omega}) and assume that there exists a statistical rate η2(d)\eta_{2}(d) such that

limd𝛀(minj[d]λK(𝚺j)η2(d))=1,d2r1(d)4(logd)4p2Δ4(η2(d)(logd)1)=o(1)anddr2(d)2LpΔ2η2(d)=o(1).\lim_{d\rightarrow\infty}\!\!\mathbb{P}_{\mathbf{\Omega}}\Big{(}\!\min_{j\in[d]}\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}{\geq}\eta_{2}(d)\!\!\Big{)}\!=1,\,\frac{d^{2}r_{1}(d)^{4}(\log d)^{4}}{p^{2}\Delta^{4}\big{(}\eta_{2}(d)\!\!\wedge\!\!(\log d)^{-1}\big{)}}=o(1)\,\text{and}\,\frac{dr_{2}(d)^{2}}{Lp\Delta^{2}\eta_{2}(d)}=o(1).

Then if we take K(logd)2ppd/(logd)2K(\log d)^{2}\ll p\asymp p^{\prime}\lesssim d/(\log d)^{2} and qlogdq\geq\log d we have

𝚺j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (17)
Remark 11.

Theorem 4.10 states that under proper scaling conditions, the FADI estimator still enjoys asymptotic normality even when the aggregated sketching dimension LpLp is much smaller than dd. The rate η2(d)\eta_{2}(d) is usually at least of order (d/λ12Lp)λmin(Cov(𝐄0𝐞j))(d/\lambda_{1}^{2}Lp)\lambda_{\min}(\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{e}_{j})). In comparison, the rate η1(d)\eta_{1}(d) in Theorem 4.5 is usually of order λ12λmin(Cov(𝐄0𝐞j))\lambda_{1}^{-2}\lambda_{\min}(\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{e}_{j})), suggesting a larger variance and lower testing efficiency of FADI at LpdLp\ll d than at LpdLp\gg d. The proof is deferred to Appendix B.6.

The following corollaries of Theorem 4.10 provide case-specific distributional guarantees for Examples 1 and 3 under the regime LpdLp\ll d.

4.4.1 Spiked Covariance Model

Corollary 4.11.

Assume {𝐗i}i=1n\{\bm{X}_{i}\}_{i=1}^{n} are i.i.d. multivariate Gaussian. When Lpλ12Δ2dLp\ll{\lambda_{1}^{-2}\Delta^{2}d}, if we take K=K+1K^{\prime}=K+1, K(logd)2ppd/(logd)2K(\log d)^{2}\ll p\asymp p^{\prime}\lesssim d/(\log d)^{2} and qlogdq\geq\log d, under Assumption 3 and the conditions that

nmax(κ14λ12dr2Lpσ4,λ12σ~16K2Δ2δ4σ4)(logd)4andKλ12Δ2μd=o(1),\quad{n}\gg\max\Big{(}\frac{\kappa_{1}^{4}\lambda_{1}^{2}dr^{2}L}{p\sigma^{4}},\frac{\lambda_{1}^{2}\widetilde{\sigma}_{1}^{6}K^{2}}{\Delta^{2}\delta^{4}\sigma^{4}}\Big{)}(\log d)^{4}\quad\text{and}\quad\frac{K\lambda_{1}^{2}}{\Delta^{2}}\sqrt{\frac{\mu}{d}}=o(1),

we have that (17) holds. Furthermore, if we define 𝚺~j=σ2nL2𝐁𝛀𝛀𝚺𝛀𝐁𝛀\widetilde{\bm{\Sigma}}_{j}=\frac{\sigma^{2}}{nL^{2}}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\bm{\Sigma}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}, we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (18)

Besides, if we further assume σ2λ1κ14d2r/(np2L)=o(1){\sigma^{-2}}{\lambda_{1}\kappa_{1}^{4}}\sqrt{{d^{2}r}/{(np^{2}L)}}=o(1) and estimate 𝚺~j\widetilde{\bm{\Sigma}}_{j} by 𝚺^j=σ^2nL2𝐁^𝛀𝛀𝚺^𝛀𝐁^𝛀\widehat{\bm{\Sigma}}_{j}=\frac{\widehat{\sigma}^{2}}{nL^{2}}\widehat{\mathbf{B}}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\widehat{\bm{\Sigma}}\mathbf{\Omega}\widehat{\mathbf{B}}_{\mathbf{\Omega}}, where 𝐁^𝛀=(𝐁^(1),,𝐁^(L))\widehat{\mathbf{B}}_{\mathbf{\Omega}}=(\widehat{\mathbf{B}}^{(1)\top},\ldots,\widehat{\mathbf{B}}^{(L)\top})^{\top} with 𝐁^()=(𝐕~F𝐘^()/p)\widehat{\mathbf{B}}^{(\ell)}=(\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})^{\dagger} for [L]\ell\in[L], we have

𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (19)
Remark 12.

Please refer to Appendix B.7 for the proof. For the computation of 𝚺^j\widehat{\bm{\Sigma}}_{j}, apart from 𝐕^()\widehat{\mathbf{V}}^{(\ell)}, the \ell-th machine on layer 2 (see Figure 2) will send 𝛀()\mathbf{\Omega}^{(\ell)} and 𝐘^()\widehat{\mathbf{Y}}^{(\ell)} to the central processor, and the total communication cost for each server is O(dp)O(dp). On the central processor, the total computational cost of 𝐁𝛀\mathbf{B}_{\mathbf{\Omega}} will be O(dpKL)O(dpKL). Then we will compute 𝛀𝚺^𝛀=1p𝛀(𝐘^(1),,𝐘^(L))+σ^2𝛀𝛀\mathbf{\Omega}^{\top}\widehat{\bm{\Sigma}}\mathbf{\Omega}=\frac{1}{\sqrt{p}}\mathbf{\Omega}^{\top}(\widehat{\mathbf{Y}}^{(1)},\ldots,\widehat{\mathbf{Y}}^{(L)})+\widehat{\sigma}^{2}\bm{\Omega}^{\top}\bm{\Omega} with total computational cost of O(d(Lp)2)=o(d3)O\big{(}d(Lp)^{2}\big{)}=o(d^{3}). Compared to Corollary 4.6, Corollary 4.11 has stronger scaling conditions on the sample size nn to compensate for the extra variability due to fewer fast sketches. As indicated by (7), the asymptotic covariance matrices under the two regimes LpdLp\ll d and LpdLp\gg d are consistent with each other.
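For completeness, the assembly of 𝐁^𝛀\widehat{\mathbf{B}}_{\mathbf{\Omega}} and the covariance estimate 𝚺^j\widehat{\bm{\Sigma}}_{j} in the regime LpdLp\ll d (Corollary 4.11) could look as follows. This is only a sketch under our own naming, with Omega_list and Y_hat_list holding the per-server 𝛀()\mathbf{\Omega}^{(\ell)} and 𝐘^()=𝐌^𝛀()\widehat{\mathbf{Y}}^{(\ell)}=\widehat{\mathbf{M}}\mathbf{\Omega}^{(\ell)}.

```python
import numpy as np

def plugin_cov_spiked_smallLp(V_tilde_F, Omega_list, Y_hat_list, sigma2_hat, n):
    """Sigma_hat_j = (sigma^2 / (n L^2)) B_hat' Omega' Sigma_hat Omega B_hat, using
    Omega' Sigma_hat Omega = Omega'(Y_hat_1,...,Y_hat_L)/sqrt(p) + sigma^2 Omega'Omega."""
    L, p = len(Omega_list), Omega_list[0].shape[1]
    Omega = np.hstack([O / np.sqrt(p) for O in Omega_list])   # d x (Lp), scaled blocks
    Y_all = np.hstack(Y_hat_list)                             # d x (Lp)
    B_hat = np.vstack([np.linalg.pinv(V_tilde_F.T @ Y / np.sqrt(p))
                       for Y in Y_hat_list])                  # (Lp) x K
    mid = Omega.T @ Y_all / np.sqrt(p) + sigma2_hat * (Omega.T @ Omega)
    return (sigma2_hat / (n * L ** 2)) * (B_hat.T @ mid @ B_hat)
```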

4.4.2 Gaussian Mixture Models

Corollary 4.12.

When LpdLp\ll d, if we take K(logd)2ppd/(logd)2K(\log d)^{2}\ll p\asymp p^{\prime}\lesssim d/(\log d)^{2} and qlogdq\geq\log d, we have that (17) holds under the conditions that

Kdlogd=O(1),nd3Lp,andK(logd)2dnLpΔ02min(n,n4/3μθ2d).\sqrt{\frac{K}{d}}\log d=O(1),\quad n\gg\frac{d^{3}L}{p},\quad\text{and}\quad K(\log d)^{2}\sqrt{\frac{dnL}{p}}\ll\Delta_{0}^{2}\ll\min\left(n,\frac{n^{4/3}}{\mu_{\theta}^{2}d}\right).

Furthermore, if we define 𝚺~j=L2𝐁𝛀𝛀(𝐅𝚯𝚯𝐅+n𝐈d)𝛀𝐁𝛀,\widetilde{\bm{\Sigma}}_{j}=L^{-2}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{(}\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d}\Big{)}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}, then we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (20)

If we further assume d4Δ02KLp2n2d^{4}\Delta_{0}^{2}\ll{KLp^{2}n^{2}} and estimate 𝚺~j\widetilde{\bm{\Sigma}}_{j} by 𝚺^j=1L2𝐁^𝛀𝛀(𝐌^+n𝐈d)𝛀𝐁^𝛀\widehat{\bm{\Sigma}}_{j}=\frac{1}{L^{2}}\widehat{\mathbf{B}}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{(}\widehat{\mathbf{M}}+n\mathbf{I}_{d}\Big{)}\mathbf{\Omega}\widehat{\mathbf{B}}_{\mathbf{\Omega}}, where 𝐁^𝛀=(𝐁^(1),,𝐁^(L))\widehat{\mathbf{B}}_{\mathbf{\Omega}}=(\widehat{\mathbf{B}}^{(1)\top},\ldots,\widehat{\mathbf{B}}^{(L)\top})^{\top} with 𝐁^()=(𝐕~F𝐘^()/p)\widehat{\mathbf{B}}^{(\ell)}=(\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})^{\dagger} for [L]\ell\in[L], we have

𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),j[d].{\widehat{\bm{\Sigma}}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\rm F}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),\quad\forall j\in[d]. (21)
Remark 13.

The proof of Corollary 4.12 is deferred to Appendix B.8. Computation of 𝚺^j\widehat{\bm{\Sigma}}_{j} is very similar to that for Example 1 as described in Remark 12, and the total computational cost is O(d(Lp)2)=o(d3)O(d(Lp)^{2})=o(d^{3}). The stronger scaling conditions are the trade-off for higher computational efficiency with fewer fast sketches.

We do not have distributional results for Examples 2 and 4 under the regime LpdLp\ll d. An intuitive explanation is that in Examples 2 and 4 each entry carries an independent piece of information, so when LpdLp\ll d too much information is lost from the d×dd\times d graph or matrix. In contrast, the correlation structure of the matrix in Examples 1 and 3 still allows the information to be recovered under the regime LpdLp\ll d.

5 Numerical Results

We conduct extensive simulation studies to assess the performance of FADI under each example given in Section 2 and compare it with several existing methods. We provide in this section the representative results for Examples 1 and 2. The results for Examples 3 and 4 are given in Appendix A.

5.1 Example 1: Spiked Covariance Model

We generate {𝑿i}i=1n\{\bm{X}_{i}\}_{i=1}^{n} i.i.d. from 𝒩(𝟎,𝚺){\mathcal{N}}(\mathbf{0},\mathbf{\Sigma}), where 𝚺=𝐕𝚲𝐕+σ2𝐈d\bm{\Sigma}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}+\sigma^{2}\mathbf{I}_{d}. We consider K=3K=3, n=20000n=20000 and set d=500,1000,2000d=500,1000,2000 respectively to study the asymptotic properties of the FADI estimator under different settings. To ensure the incoherence condition is satisfied, we set 𝐕\mathbf{V} to be the left singular vectors of a d×Kd\times K i.i.d. Gaussian matrix. We take 𝚲=diag(6,4,2)\mathbf{\Lambda}=\operatorname{diag}(6,4,2) and σ2=1\sigma^{2}=1. For the estimation of σ2\sigma^{2} in Step 0, we set K=6K^{\prime}=6. We split the data into m=20m=20 subsamples, and set p=p=12p=p^{\prime}=12 and q=7q=7 in Step 3 to compute 𝐕~F\widetilde{\mathbf{V}}^{\text{F}}. We set LL at a range of values by taking the ratio Lp/d{0.2,0.6,0.9,1,1.2,2,5,10}Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\} for each setting and compute the asymptotic covariance via Corollary 4.6 and Corollary 4.11 correspondingly. We define 𝐯~=𝚺^11/2(𝐕~F𝐕𝐇)𝐞1\widetilde{\mathbf{v}}=\widehat{\bm{\Sigma}}_{1}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{1}, where 𝐇=sgn(𝐕~F𝐕)\mathbf{H}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{V}), and calculate the coverage probability by empirically evaluating (𝐯~22χ32(0.95))\mathbb{P}\big{(}\|\widetilde{\mathbf{v}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)} with χ32(0.95)\chi_{3}^{2}(0.95) being the 0.95 quantile of the Chi-squared distribution with 3 degrees of freedom. Results under different settings are shown in Figure 3. Figure 3(a) shows that as Lp/dLp/d increases, the error rate of FADI converges to that of the traditional PCA. From Figure 3(b) we can see that when Lp/dLp/d is approaching 1 from the left, the computational efficiency drops due to the cost of computing 𝚺^1\widehat{\bm{\Sigma}}_{1}. For Figure 3(c), convergence towards the nominal 95% level can be observed when Lp/dLp/d is much smaller or much larger than 1, while the valley at Lp/dLp/d around 1 is consistent with the theoretical conditions on Lp/dLp/d in Section 4 and implies a possible phase-transition phenomenon in the distributional convergence of FADI. Note that the empirical coverage is closer to the nominal level 0.95 at d=2000d=2000 than at d{500,1000}d\in\{500,1000\}, which might be because some error terms in the approximation of the asymptotic covariance matrix vanish as dd grows larger. The good Gaussian approximation of 𝐯~1\widetilde{\mathbf{v}}_{1} is further validated by Figure 3(d), where 𝐯~1\widetilde{\mathbf{v}}_{1} is the first entry of 𝐯~\widetilde{\mathbf{v}}. Based upon the low computational efficiency and poor empirical coverage at Lp/dLp/d around 1, we recommend conducting inference based on FADI at regimes LpdLp\gg d and LpdLp\ll d only. In particular, we suggest the regime LpdLp\gg d if priority is given to higher testing efficiency, and the regime LpdLp\ll d if one needs valid inference with faster computation. We also compare FADI with the distributed PCA in [18]. Results over 100 Monte Carlos are given in Table 4. We can see that FADI achieves error rates comparable to both the distributed PCA and the traditional PCA under the distributed setting while requiring substantially less running time.
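For readers who wish to reproduce a scaled-down version of this experiment, the data-generating step can be sketched as follows (toy sizes, seed and names are our own choices; the FADI estimator and its covariance estimate are then computed as described in Sections 3 and 4.3):

```python
import numpy as np

rng = np.random.default_rng(2024)
d, K, n, m, sigma2 = 500, 3, 2000, 20, 1.0          # scaled-down toy sizes (ours)
Lam = np.diag([6.0, 4.0, 2.0])
V, _ = np.linalg.qr(rng.standard_normal((d, K)))    # incoherent V from a Gaussian matrix

# X_i ~ N(0, V Lam V' + sigma2 I_d): low-rank signal plus isotropic noise
X = rng.standard_normal((n, K)) @ np.sqrt(Lam) @ V.T \
    + np.sqrt(sigma2) * rng.standard_normal((n, d))

splits = np.array_split(X, m)                       # m data splits for the distributed step
```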

Figure 3: Performance of FADI under different settings for Example 1 (with 300 Monte Carlos). (a) Empirical error rates of ρ(𝐕~F,𝐕)\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V}), where the grey dashed lines represent the error rates for the traditional PCA estimator 𝐕^\widehat{\mathbf{V}}; (b) Running time (in seconds) under different settings (including the computation time of 𝚺^1\widehat{\bm{\Sigma}}_{1}). For the traditional PCA, the running time is 4.86 seconds at d=500d=500, 20.95 seconds at d=1000d=1000 and 99.23 seconds at d=2000d=2000; (c) Empirical coverage probability, where the grey dashed line represents the theoretical rate at 0.95; (d) Q-Q plot for 𝐯~1\widetilde{\mathbf{v}}_{1} at Lp/d{0.2,10}Lp/d\in\{0.2,10\}.
d | n | m | L | Error rate (FADI / Traditional / Distributed) | Running time in seconds (FADI / Traditional / Distributed)
400 30000 15 40 0.068 0.065 0.065 0.07 4.53 0.59
400 60000 30 40 0.048 0.046 0.046 0.05 8.84 0.60
400 100000 50 40 0.037 0.036 0.036 0.05 14.84 0.62
800 100000 50 80 0.052 0.050 0.050 0.10 55.76 3.66
800 5000 50 80 0.230 0.220 0.230 0.05 3.76 2.56
800 25000 50 80 0.106 0.103 0.103 0.07 15.07 2.82
800 50000 50 80 0.073 0.070 0.070 0.07 28.68 3.23
1600 30000 15 160 0.134 0.130 0.130 0.31 80.72 27.02
1600 60000 30 160 0.095 0.092 0.092 0.35 150.75 27.29
1600 100000 50 160 0.074 0.071 0.071 0.34 243.83 27.38
Table 4: Comparison of the empirical error rates (of ρ(,𝐕)\rho(\cdot,\mathbf{V})) and the running times (in seconds) between FADI, traditional full sample PCA and distributed PCA [18] under different settings of d,nd,n and mm at 𝚺=diag(50,25,12.5,1,,1)\mathbf{\Sigma}=\operatorname{diag}(50,25,12.5,1,\ldots,1). For FADI, p=p=12p=p^{\prime}=12, K=3K=3, K=4K^{\prime}=4, Δ=11.5\Delta=11.5 and q=7q=7 in all settings.

5.2 Example 2: Degree-Corrected Mixed Membership Models

We consider the mixed membership model without degree heterogeneity for the simulation, i.e., 𝚯=θ𝐈d\mathbf{\Theta}=\sqrt{\theta}\mathbf{I}_{d}, and 𝐌=θ𝚷𝐏𝚷\mathbf{M}=\theta\mathbf{\Pi}\mathbf{P}\mathbf{\Pi}^{\top}. For two preselected nodes j,j[d]j,j^{\prime}\in[d], we test H0:𝝅j=𝝅j{\rm H}_{0}:\bm{\pi}_{j}=\bm{\pi}_{j^{\prime}} vs. H1:𝝅j𝝅j{\rm H}_{1}:\bm{\pi}_{j}\neq\bm{\pi}_{j^{\prime}} by testing whether 𝐕(𝐞j𝐞j)=0\mathbf{V}^{\top}(\mathbf{e}_{j}-\mathbf{e}_{j^{\prime}})=0. To simulate the data, we set θ=0.9\theta=0.9, K=3K=3, and set the membership profiles 𝚷\mathbf{\Pi} and the connection probability matrix 𝐏\mathbf{P} to be

𝝅j={(1,0,0) if 1jd/6(0,1,0) if d/6<jd/3(0,0,1) if d/3<jd/2(0.6,0.2,0.2) if d/2<j5d/8(0.2,0.6,0.2) if 5d/8<j3d/4(0.2,0.2,0.6) if 3d/4<j7d/8(1/3,1/3,1/3) if 7d/8<jd,𝐏=(10.20.10.210.20.10.21).\bm{\pi}_{j}=\left\{\begin{array}[]{ll}(1,0,0)^{\top}&\text{ if }1\leq j\leq\lfloor d/6\rfloor\\ (0,1,0)^{\top}&\text{ if }\lfloor d/6\rfloor<j\leq\lfloor d/3\rfloor\\ (0,0,1)^{\top}&\text{ if }\lfloor d/3\rfloor<j\leq\lfloor d/2\rfloor\\ (0.6,0.2,0.2)^{\top}&\text{ if }\lfloor d/2\rfloor<j\leq\lfloor 5d/8\rfloor\\ (0.2,0.6,0.2)^{\top}&\text{ if }\lfloor 5d/8\rfloor<j\leq\lfloor 3d/4\rfloor\\ (0.2,0.2,0.6)^{\top}&\text{ if }\lfloor 3d/4\rfloor<j\leq\lfloor 7d/8\rfloor\\ (1/3,1/3,1/3)^{\top}&\text{ if }\lfloor 7d/8\rfloor<j\leq\lfloor d\rfloor\end{array}\right.,\quad\mathbf{P}=\begin{pmatrix}1&0.2&0.1\\ 0.2&1&0.2\\ 0.1&0.2&1\end{pmatrix}.

We test the performance of FADI under d{500,1000,2000}d\in\{500,1000,2000\} respectively, and under each setting of dd, we take m=10m=10, p=p=12p=p^{\prime}=12, q=7q=7 and set LL by the ratio Lp/d{0.2,0.6,0.9,1,1.2,2,5,10}Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\}. For each setting, we conduct 300 independent Monte Carlo simulations. To perform the test, with minor modifications of Corollary 4.7, we can show that

𝚺~j,j1/2(𝐕~F𝐇𝐕)(𝐞j𝐞j)𝑑𝒩(𝟎,𝐈K),\widetilde{\bm{\Sigma}}_{j,j^{\prime}}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}(\mathbf{e}_{j}-\mathbf{e}_{j^{\prime}})\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}), (22)

where the asymptotic covariance is defined as 𝚺~j,j=𝚺~j+𝚺~j\widetilde{\bm{\Sigma}}_{j,j^{\prime}}=\widetilde{\bm{\Sigma}}_{j}+\widetilde{\bm{\Sigma}}_{j^{\prime}} and can be consistently estimated by 𝚺^j,j=𝚺^j+𝚺^j\widehat{\bm{\Sigma}}_{j,j^{\prime}}=\widehat{\bm{\Sigma}}_{j}+\widehat{\bm{\Sigma}}_{j^{\prime}}. We first preselect two nodes, which we denote by jj and jj^{\prime}, with membership profiles both equal to (0.6,0.2,0.2)(0.6,0.2,0.2)^{\top} and calculate the empirical coverage probability of (𝒅~22χ32(0.95))\mathbb{P}\big{(}\|\widetilde{\bm{d}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}, where 𝒅~=𝚺^j,j1/2𝐕~F(𝐞j𝐞j)\widetilde{\bm{d}}=\widehat{\bm{\Sigma}}_{j,j^{\prime}}^{-1/2}\widetilde{\mathbf{V}}^{\text{F}\top}(\mathbf{e}_{j}-\mathbf{e}_{j^{\prime}}). We also evaluate the power of the test by choosing two nodes with different membership profiles equal to (0.6,0.2,0.2)(0.6,0.2,0.2)^{\top} and (1/3,1/3,1/3)(1/3,1/3,1/3)^{\top} respectively, which we denote by jj and kk. We empirically calculate the power (𝒅~22χ32(0.95))\mathbb{P}\big{(}\|\widetilde{\bm{d}^{\prime}}\|_{2}^{2}\geq\chi_{3}^{2}(0.95)\big{)}, where 𝒅~=𝚺^j,k1/2𝐕~F(𝐞j𝐞k)\widetilde{\bm{d}}^{\prime}=\widehat{\bm{\Sigma}}_{j,k}^{-1/2}\widetilde{\mathbf{V}}^{\text{F}\top}(\mathbf{e}_{j}-\mathbf{e}_{k}). Under the regime Lp/d<1Lp/d<1, we calculate the asymptotic covariance referring to Theorem 4.10 by

\widehat{\bm{\Sigma}}_{j,j^{\prime}}=L^{-2}\widehat{\mathbf{B}}_{\bm{\Omega}}^{\top}\bm{\Omega}^{\top}\operatorname{diag}\left([\widetilde{\mathbf{M}}_{jk}(1-\widetilde{\mathbf{M}}_{jk})+\widetilde{\mathbf{M}}_{j^{\prime}k}(1-\widetilde{\mathbf{M}}_{j^{\prime}k})]_{k=1}^{d}\right)\bm{\Omega}\widehat{\mathbf{B}}_{\bm{\Omega}},

where 𝐁^𝛀=(𝐁^(1),,𝐁^(L))\widehat{\mathbf{B}}_{\mathbf{\Omega}}=(\widehat{\mathbf{B}}^{(1)\top},\ldots,\widehat{\mathbf{B}}^{(L)\top})^{\top} with 𝐁^()=(𝐕~F𝐘^()/p)p×K\widehat{\mathbf{B}}^{(\ell)}=(\widetilde{\mathbf{V}}^{{\rm F}\top}\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})^{\dagger}\in\mathbb{R}^{p\times K} for =1,,L\ell=1,\ldots,L. We also apply k-means to 𝐕~F\widetilde{\mathbf{V}}^{\rm F} to differentiate different membership profiles and compare the misclustering rate with the traditional PCA. The results of different settings are shown in Figure 4. We can see from Figure 4(d) that under the regime Lp/d<1Lp/d<1, the empirical coverage probability is zero under all settings, which validates the necessity of Lp/d1Lp/d\gg 1 for performance guarantee. Figure 4(f) demonstrates the asymptotic normality of 𝒅~1\widetilde{\bm{d}}_{1} at Lp/d=10Lp/d=10 and poor Gaussian approximation of FADI at Lp/d=0.2Lp/d=0.2, where 𝒅~1\widetilde{\bm{d}}_{1} is the first entry of 𝒅~\widetilde{\bm{d}}.
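The test itself reduces to a KK-dimensional Wald-type chi-square test; a minimal sketch is given below (our own naming; Sigma_hat_jj denotes the estimated covariance 𝚺^j,j\widehat{\bm{\Sigma}}_{j,j^{\prime}} from either regime).

```python
import numpy as np
from scipy.stats import chi2

def membership_test(V_tilde_F, Sigma_hat_jj, j, jprime, level=0.95):
    """Test H0: pi_j = pi_j' by checking whether
    ||Sigma_hat_jj^{-1/2} V_tilde_F'(e_j - e_j')||^2 exceeds the chi2(K) quantile."""
    K = V_tilde_F.shape[1]
    diff = V_tilde_F[j, :] - V_tilde_F[jprime, :]
    stat = diff @ np.linalg.solve(Sigma_hat_jj, diff)
    return {"stat": stat, "reject": stat > chi2.ppf(level, df=K)}
```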

We also compare FADI with the SIMPLE method [16] on the membership profile inference under the DCMM model. The SIMPLE method conducted inference directly on the traditional PCA estimator 𝐕^\widehat{\mathbf{V}} and adopted a one-step correction to the empirical eigenvalues for calculating the asymptotic covariance matrix. We compare the inferential performance of FADI at Lp/d=10Lp/d=10 with the SIMPLE method (under 100 independent Monte Carlos), and summarize the results in Table 5, where the running time includes both the PCA procedure and the computation time of 𝚺^j,j\widehat{\bm{\Sigma}}_{j,j^{\prime}}. Compared to the SIMPLE method, our method has a similar coverage probability and power but is computationally more efficient.

d | p | L | Coverage probability (FADI / SIMPLE) | Power (FADI / SIMPLE) | Running time in seconds (FADI / SIMPLE)
500 12 417 0.91 0.92 0.87 0.88 0.21 0.73
1000 12 833 0.94 0.94 1.00 1.00 0.69 6.77
2000 12 1667 0.95 0.98 1.00 1.00 2.61 59.42
Table 5: Comparison of the coverage probability, power and running time (in seconds) between FADI and SIMPLE [16] under different settings of dd. In all settings, we take m=10m=10, p=p=12p=p^{\prime}=12, q=7q=7 and set Lp/d=10Lp/d=10 for FADI.
Figure 4: Performance of FADI under different settings for Example 2. (a) Empirical error rates of ρ(𝐕~F,𝐕)\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V}); (b) Misclustering rate for 𝐕~F\widetilde{\mathbf{V}}^{\rm F} by K-means with grey dashed lines representing the misclustering rates for the traditional PCA estimator 𝐕^\widehat{\mathbf{V}}; (c) Running time (in seconds) under different settings (including computing 𝚺^j,j\widehat{\bm{\Sigma}}_{j,j^{\prime}}). For the traditional PCA, the running time is 0.43 seconds at d=500d=500, 3.77 seconds at d=1000d=1000 and 32.62 seconds at d=2000d=2000; (d) Empirical coverage probability (11- Type I error); (e) Power of the test; (f) Q-Q plot for 𝒅~1\widetilde{\bm{d}}_{1} at Lp/d{0.2,10}Lp/d\in\{0.2,10\}.

6 Application to the 1000 Genomes Data

In this section, we apply FADI and the existing methods to the 1000 Genomes Data [1]. We use phase 3 of the 1000 Genomes Data and focus on common variants with minor allele frequencies larger than or equal to 0.05. There are 2504 subjects in total, and 168,047 independent variants after the linkage disequilibrium (LD) pruning. As we are interested in the ancestry principal components to capture population structure, the sample size nn is the number of independent variants after LD pruning (n=168,047n=168,047), and the dimension dd is the number of subjects (d=2504d=2504) [33]. The data were collected from 7 super populations: (1) AFR: African; (2) AMR: Ad Mixed American; (3) EAS: East Asian; (4) EUR: European; (5) SAS: South Asian; (6) PUR: Puerto Rican and (7) FIN: Finnish; and 26 sub-populations.

6.1 Estimation of Principal Eigenspace

For the estimation of the principal components, we assume that the data follow the spiked covariance model specified in Example 1. We perform FADI with K=27K^{\prime}=27, p=50p=50, p=100p^{\prime}=100, q=3q=3, m=100m=100 and L=80L=80, where we choose pp and LL according to Table 1. For the estimation of the number of spikes, we take the thresholding parameter μ0=(d(np)1/2logd)3/4/12\mu_{0}=\left(d(np)^{-1/2}\log d\right)^{3/4}/12. The estimated number of spikes from FADI is K^=26\widehat{K}=26, which is close to 25, the number of self-reported ethnicity groups minus 1, i.e., K=261K=26-1. The results of the 4 leading PCs are shown in Figure 5, where a clear separation can be observed among different super-populations. Figure 10 and Figure 11 in the appendix show a good alignment between the PC results calculated by the traditional PCA and FADI. We compare the computational times of different methods for analyzing the 1000 Genomes Data. FADI takes 5.6 seconds at q=3q=3, whereas the traditional PCA method takes 595.4 seconds and the distributed PCA method [18] takes 120.2 seconds. These results show that FADI greatly outperforms the existing PCA methods in terms of computational time.
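To make the procedure concrete, the following single-machine NumPy sketch mirrors the aggregation and powered fast sketching steps described above. It is a simplified illustration that assumes the d-by-d matrix M_hat (here, the subject-by-subject sample covariance of the standardized genotypes) fits in memory; in the actual analysis the sketches are computed distributively across the sample splits, and the function and variable names are illustrative.

```python
import numpy as np

def fadi_pca(M_hat, K, p, p_prime, q, L, seed=0):
    """Simplified single-machine sketch of FADI: aggregate L fast sketches and
    extract the leading eigenspace by powered fast sketching."""
    rng = np.random.default_rng(seed)
    d = M_hat.shape[0]
    Sigma_tilde = np.zeros((d, d))
    for _ in range(L):
        Omega = rng.standard_normal((d, p))
        Y_hat = M_hat @ Omega / np.sqrt(p)                 # d x p fast sketch
        U = np.linalg.svd(Y_hat, full_matrices=False)[0]
        V_hat = U[:, :K]                                   # top-K left singular vectors
        Sigma_tilde += V_hat @ V_hat.T / L                 # average the projection matrices
    Omega_F = rng.standard_normal((d, p_prime))
    Y_F = np.linalg.matrix_power(Sigma_tilde, q) @ Omega_F # powered fast sketch
    return np.linalg.svd(Y_F, full_matrices=False)[0][:, :K]  # FADI estimator, d x K

# illustrative usage: V_F = fadi_pca(M_hat, K=4, p=50, p_prime=100, q=3, L=80)
```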

Figure 5: The top 4 principal components of the 1000 Genomes Data. Panels: (a) PC 1 versus PC 2; (b) PC 1 versus PC 3; (c) PC 1 versus PC 4; (d) PC 2 versus PC 3; (e) PC 2 versus PC 4; (f) PC 3 versus PC 4. For the first two PCs, PC 1 separates the African (AFR) super-population from the others, whereas PC 2 separates East Asian (EAS) from the others. As for PC 3 and PC 4, South Asian (SAS) and Ad Mixed American (AMR) are well separated from the rest of the super-populations by PC 3, while PC 4 presents some additional separation.

6.2 Inference on Ancestry Membership Profiles

We also generate an undirected graph from the 1000 Genomes Data. To introduce additional randomness so that the data better fit the model setting in Example 2, we sample 1000 out of the total 168,047 variants for generating the graph. More specifically, we treat each subject as a node, and for each given pair of subjects (i,j)(i,j), we define a genetic similarity score sij=k=11000𝕀{xik=xjk}s_{ij}=\sum_{k=1}^{1000}\mathbb{I}\left\{x_{ik}=x_{jk}\right\}, where xikx_{ik} refers to the genotype of the kk-th variant for subject ii. We denote by s0.95s^{0.95} the 0.95 quantile of {sij}i<j\{s_{ij}\}_{i<j}. Subjects ii and jj are connected if and only if sij>s0.95s_{ij}>s^{0.95}. Denote by 𝐀\mathbf{A} the adjacency matrix (with no self-loops). We include only four super populations: AFR, EAS, EUR and SAS, with 2058 subjects in total. We are interested in testing whether two given subjects ii and jj belong to the same super population, i.e., H0:𝐕i=𝐕j{\rm H}_{0}:\mathbf{V}_{i}=\mathbf{V}_{j} vs. H1:𝐕i𝐕j{\rm H}_{1}:\mathbf{V}_{i}\neq\mathbf{V}_{j}. We divide the adjacency matrix equally into m=10m=10 splits, and perform FADI with p=50p=50, p=50p^{\prime}=50, q=3q=3 and L=1000L=1000. The rank estimator from FADI is K^=4\widehat{K}=4 with μ0=(θ^/p)1/2dlogd/12\mu_{0}=({\widehat{\theta}}/{p})^{1/2}d\log d/12, where θ^\widehat{\theta} is the average degree estimator defined in Section 3.3; the estimated rank is consistent with the number of super populations. We apply K-means clustering to the FADI estimator 𝐕~K^F\widetilde{\mathbf{V}}_{\widehat{K}}^{\text{F}}, and calculate the misclustering rate by treating the self-reported ancestry group as the ground truth. The misclustering rate of FADI is 0.135, with a computation time of 3.7 seconds. In comparison, the misclustering rate for the traditional PCA method is 0.134 with a computation time of 26.5 seconds, and the correlations between the top four PCs from the traditional PCA and FADI are 0.997, 0.994, 0.994 and 0.996, respectively.
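The graph construction just described can be summarized by the following minimal sketch, where X is a subjects-by-variants genotype matrix restricted to the 1000 sampled variants; the function and variable names are illustrative.

```python
import numpy as np

def genetic_similarity_graph(X, q=0.95):
    """Adjacency matrix from pairwise genotype-matching counts s_ij,
    connecting subject pairs whose similarity exceeds the q-th quantile."""
    n_subj, n_var = X.shape
    S = np.zeros((n_subj, n_subj), dtype=int)
    for k in range(n_var):                       # s_ij = #{k : x_ik == x_jk}
        S += (X[:, [k]] == X[:, k][None, :])
    iu = np.triu_indices(n_subj, k=1)
    s_q = np.quantile(S[iu], q)                  # 0.95 quantile of {s_ij}_{i<j}
    A = (S > s_q).astype(int)
    np.fill_diagonal(A, 0)                       # no self-loops
    return A
```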

To conduct pairwise inference on the ancestry membership profiles, we preselect 16 subjects, with 4 subjects from each super population. We apply a Bonferroni correction for multiple comparisons and set the significance level at 0.05×(162)1=4.17×1040.05\times\binom{16}{2}^{-1}=4.17\times 10^{-4}. We estimate the asymptotic covariance matrix by Corollary 4.7 and truncate 𝐌~\widetilde{\mathbf{M}} by setting entries larger than 1 to 1 and entries smaller than 0 to 0. The pairwise p-values are summarized in Figure 6, and the computational time for the covariance matrix is 0.31 seconds. Most of the comparison results are consistent with the true ancestry groups; the inconsistencies could be due to the mixed memberships of certain subjects and the unaccounted-for sub-population structures.
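For completeness, the chi-squared test statistic underlying Figure 6 can be computed as in the sketch below; Sigma_hat_ij stands for the estimated asymptotic covariance of the difference between the i-th and j-th rows of the FADI estimator, the variable names are illustrative, and the Bonferroni-adjusted level is applied as described above.

```python
import numpy as np
from scipy.stats import chi2

def pairwise_pvalue(V_tilde_F, Sigma_hat_ij, i, j):
    """p-value for H0: V_i = V_j based on d_tilde = Sigma^{-1/2} (V_i - V_j)."""
    K_hat = V_tilde_F.shape[1]
    diff = V_tilde_F[i] - V_tilde_F[j]
    w, Q = np.linalg.eigh(Sigma_hat_ij)              # symmetric inverse square root
    d_tilde = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T @ diff
    return chi2.sf(d_tilde @ d_tilde, df=K_hat)

# reject H0 when the p-value falls below the Bonferroni-adjusted level 0.05 / C(16, 2)
```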

Figure 6: p-values for pairwise comparisons among 16 preselected subjects. For each subject pair (i,j)(i,j), the p-value is defined as (χK^2>𝒅~22)\mathbb{P}\big{(}\chi_{\widehat{K}}^{2}>\|\widetilde{\bm{d}}\|_{2}^{2}\big{)}, where χK^2\chi_{\widehat{K}}^{2} denotes a Chi-squared random variable with degrees of freedom equal to K^\widehat{K}, and 𝒅~=𝚺^i,j1/2𝐕~K^F(𝐞i𝐞j)\widetilde{\bm{d}}=\widehat{\bm{\Sigma}}_{i,j}^{-1/2}\widetilde{\mathbf{V}}^{\rm F}_{\widehat{K}}(\mathbf{e}_{i}-\mathbf{e}_{j}) with 𝚺^i,j\widehat{\bm{\Sigma}}_{i,j} being the asymptotic covariance matrix defined in Section 5.2.

7 Discussion

In this paper, we develop a FAst DIstributed PCA algorithm, FADI, that computes high-dimensional principal components with low computational cost and high accuracy. The algorithm is applicable to multiple statistical models and is well suited to distributed computing. The main idea is to apply distributed-friendly random sketches to reduce the data dimension, and to aggregate the results from multiple sketches to improve the statistical accuracy and accommodate federated data. We conduct theoretical analysis as well as simulation studies to demonstrate that FADI enjoys the same non-asymptotic error rate as the traditional full sample PCA while significantly reducing the computational time compared to existing methods. We also establish distributional guarantees for the FADI estimator and perform numerical experiments to validate the potential phase-transition phenomenon in distributional convergence.

Fast PCA algorithms using random sketches usually require the data to have certain “almost low-rank” structures, without which the approximation might not be accurate [20]. It is of future research interest to investigate whether the proposed FADI approach can be extended to non-low-rank settings. In Step 3 of FADI, we aggregate the local estimators by taking a simple average over the projection matrices. Another direction for future research is to explore the performance of other weighted averages and to identify the convex combination that best reduces the statistical error.

References

  • 1000 Genomes Project Consortium, [2015] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526(7571):68.
  • Abbe, [2018] Abbe, E. (2018). Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86.
  • Abbe et al., [2020] Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics, 48(3):1452–1474.
  • Achlioptas, [2003] Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687. Special issue on PODS 2001 (Santa Barbara, CA).
  • Anderson, [1963] Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34:122–148.
  • Baik et al., [2005] Baik, J., Arous, G. B., and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697.
  • Banna et al., [2016] Banna, M., Merlevède, F., and Youssef, P. (2016). Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications, 05(02):1650006.
  • Belbin et al., [2021] Belbin, G. M., Cullina, S., Wenric, S., Soper, E. R., Glicksberg, B. S., Torre, D., Moscati, A., Wojcik, G. L., Shemirani, R., Beckmann, N. D., Cohain, A., Sorokin, E. P., Park, D. S., Ambite, J.-L., Ellis, S., Auton, A., Bottinger, E. P., Cho, J. H., Loos, R. J. F., Abul-Husn, N. S., Zaitlen, N. A., Gignoux, C. R., and Kenny, E. E. (2021). Toward a fine-scale population health monitoring system. Cell, 184(8):2068–2083.e11. PMID: 33861964.
  • Bernstein, [1924] Bernstein, S. (1924). On a modification of Chebyshev’s inequality and of the error formula of Laplace. Annals of Science Institute SAV. Ukraine, Sect. Math, I.
  • Candès and Tao, [2010] Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
  • Chen et al., [2016] Chen, T., Chang, D. D., Huang, S., Chen, H., Lin, C., and Wang, W. (2016). Integrating multiple random sketches for singular value decomposition. arXiv preprint arXiv:1608.08285.
  • Chen et al., [2021] Chen, Y., Chi, Y., Fan, J., and Ma, C. (2021). Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566–806.
  • Chen et al., [2019] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–22937.
  • Dey et al., [2022] Dey, R., Zhou, W., Kiiskinen, T., Havulinna, A., Elliott, A., Karjalainen, J., Kurki, M., Qin, A., Lee, S., Palotie, A., et al. (2022). Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nature Communications, 13(1):1–13.
  • Dhruva et al., [2020] Dhruva, S. S., Ross, J. S., Akar, J. G., et al. (2020). Aggregating multiple real-world data sources using a patient-centered health-data-sharing platform. npj Digital Medicine, 3(1):60.
  • Fan et al., [2022] Fan, J., Fan, Y., Han, X., and Lv, J. (2022). Simple: Statistical inference on membership profiles in large networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(2):630–653.
  • Fan et al., [2013] Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society. Series B., 75(4):603–680.
  • Fan et al., [2019] Fan, J., Wang, D., Wang, K., and Zhu, Z. (2019). Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009–3031.
  • Franklin, [2012] Franklin, J. N. (2012). Matrix theory. Courier Corporation.
  • Halko et al., [2011] Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
  • Hoeffding, [1994] Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pages 409–426. Springer.
  • Jensen, [1906] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta mathematica, 30(1):175–193.
  • Jiang et al., [2016] Jiang, B., Ma, S., Causey, J., Qiao, L., Hardin, M. P., Bitts, I., Johnson, D., Zhang, S., and Huang, X. (2016). Sparrec: An effective matrix completion framework of missing data imputation for gwas. Scientific Reports, 6(1):35534.
  • Jin, [2015] Jin, J. (2015). Fast community detection by score. Annals of Statistics, 43(1):57–89.
  • Johnstone, [2001] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327.
  • Jordan et al., [2019] Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.
  • Kannan et al., [2014] Kannan, R., Vempala, S., and Woodruff, D. (2014). Principal component analysis and higher correlations for distributed data. In Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 1040–1057, Barcelona, Spain. PMLR.
  • Kargupta et al., [2001] Kargupta, H., Huang, W., Sivakumar, K., and Johnson, E. (2001). Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 3(4):422–448.
  • Klarin et al., [2018] Klarin, D., Damrauer, S. M., Cho, K., Sun, Y. V., Teslovich, T. M., Honerlaw, J., Gagnon, D. R., DuVall, S. L., Li, J., Peloso, G. M., et al. (2018). Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nature Genetics, 50(11):1514–1523. PMCID: PMC6521726.
  • Li et al., [2020] Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2020). Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60.
  • Pasini, [2017] Pasini, G. (2017). Principal component analysis for stock portfolio management. International Journal of Pure and Applied Mathematics, 115:153–167.
  • Paul, [2007] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.
  • Price et al., [2006] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909.
  • Pulley et al., [2010] Pulley, J., Clayton, E., Bernard, G. R., Roden, D. M., and Masys, D. R. (2010). Principles of human subjects protections applied in an opt-out, de-identified biobank. Clinical and Translational Science, 3(1):42–48. PMCID: PMC3075971.
  • Reich et al., [2008] Reich, D., Price, A. L., and Patterson, N. (2008). Principal component analysis of genetic data. Nature Genetics, 40(5):491–492.
  • Serfling, [2009] Serfling, R. J. (2009). Approximation theorems of mathematical statistics. John Wiley & Sons.
  • Shen and Lu, [2020] Shen, S. and Lu, J. (2020). Combinatorial-probabilistic trade-off: Community properties test in the stochastic block models. arXiv preprint arXiv:2010.15063.
  • Stewart, [1977] Stewart, G. W. (1977). On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662.
  • Sudlow et al., [2015] Sudlow, C., Gallacher, J., Allen, N., et al. (2015). UK Biobank: an open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.
  • Vershynin, [2012] Vershynin, R. (2012). Introduction to the non-asymptotic analysis of random matrices, page 210–268. Cambridge University Press.
  • Wang and Fan, [2017] Wang, W. and Fan, J. (2017). Asymptotics of empirical eigen-structure for high dimensional spiked covariance. Annals of Statistics, 45(3):1342–1374.
  • Wedin, [1972] Wedin, P.-A. (1972). Perturbation bounds in connection with singular value decomposition. Nordisk Tidskr. Informationsbehandling (BIT), 12:99–111.
  • Yan et al., [2021] Yan, Y., Chen, Y., and Fan, J. (2021). Inference for heteroskedastic pca with missing data. arXiv preprint arXiv:2107.12365.
  • Yang et al., [2021] Yang, F., Liu, S., Dobriban, E., and Woodruff, D. P. (2021). How to reduce dimension with pca and random projections? IEEE Transactions on Information Theory, 67:8154–8189.
  • Yu et al., [2015] Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315–323.

Supplementary Materials to

“FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data”

This file contains the supplementary materials to the paper “FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data”. In Appendix A we provide numerical results for Example 3 and Example 4 along with some additional simulation results for Example 1 under the genetic setting. In Appendix B, we present the proofs for the main theorems, propositions and corollaries given in Section 4 of the main paper. In Appendix C we give the proofs of some technical lemmas useful for the proofs of the main theorems. In Appendix D, we present the modified version of Wedin’s theorem, which is used in several proofs. Appendix E provides the supplementary figures deferred from the main paper.

A Additional Simulation Results

In this section we present the simulation results for Example 3 and Example 4, and we provide some additional simulation results for Example 1 to evaluate the performance of FADI under the genetic settings.

A.1 Example 3: Gaussian Mixture Models

Under this setting, we take K=3K=3, fix the Gaussian vector dimension at n=20000n=20000 and set Δ02=n2/3\Delta_{0}^{2}=n^{2/3}. Then we generate the Gaussian means by 𝜽ki.i.d.N(𝟎,Δ022n𝐈n)\bm{\theta}_{k}\overset{\text{i.i.d.}}{\sim}N\left(\mathbf{0},\frac{\Delta^{2}_{0}}{2n}\mathbf{I}_{n}\right), k[K]k\in[K]. We set the sample size at d=500,1000,2000d=500,1000,2000, respectively, and generate independent Gaussian samples {𝑾i}i=1dn\{\bm{W}_{i}\}_{i=1}^{d}\in\mathbb{R}^{n} from a mixture of Gaussians with means 𝜽k,k[K]\bm{\theta}_{k},k\in[K] to study the performance of FADI under different settings. We assign d/Kd/K Gaussian samples to each cluster k[K]k\in[K]. We divide the data vertically along nn into m=20m=20 splits, and set p=p=12p=p^{\prime}=12 and q=7q=7 for the final powered fast sketching. We take the ratio Lp/d{0.2,0.6,0.9,1,1.2,2,5,10}Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\} for each setting and compute the asymptotic covariance via Corollary 4.8 and Corollary 4.12 under different regimes of LpLp. We define 𝐯~=𝚺^11/2(𝐕~F𝐕𝐇)𝐞1\widetilde{\mathbf{v}}=\widehat{\bm{\Sigma}}_{1}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{1}, where 𝚺^1\widehat{\bm{\Sigma}}_{1} is the asymptotic covariance for the first row of 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} and 𝐇=sgn(𝐕~F𝐕)\mathbf{H}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{V}) is the alignment matrix, and calculate the empirical coverage probability by evaluating (𝐯~22χ32(0.95))\mathbb{P}\big{(}\|\widetilde{\mathbf{v}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}, where χ32(0.95)\chi_{3}^{2}(0.95) is the 0.95 quantile of the Chi-squared distribution with degrees of freedom equal to 3. We perform 300 Monte Carlo simulations, and the results under different settings are shown in Figure 7. We can see that the error rate of FADI gets closer to that of the traditional PCA estimator as Lp/dLp/d increases, while FADI greatly outperforms the traditional PCA in terms of running time under all settings. Note that here dd is the sample size, and the decrease in error rates with increasing dd and fixed nn (at the same Lp/dLp/d ratio) is consistent with Corollary 4.2. Similar to Example 1 in Section 5.1, we can see from Figure 7(b) that the running time is large due to the calculation of 𝚺^1\widehat{\bm{\Sigma}}_{1} as Lp/dLp/d approaches 1 from the left, and we do not recommend conducting inference in this regime. Validation of the inferential properties is shown in Figure 7(c) and Figure 7(d).
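The data-generating mechanism of this example can be reproduced with the short sketch below. The unit-variance Gaussian noise around each cluster mean is our reading of the mixture specification and is stated here as an assumption; the function and variable names are illustrative.

```python
import numpy as np

def simulate_gmm(d, n=20000, K=3, seed=1):
    """Example 3 data: K cluster means with per-coordinate variance Delta_0^2/(2n)
    and d/K samples per cluster (unit-variance noise assumed for illustration)."""
    rng = np.random.default_rng(seed)
    Delta0_sq = n ** (2.0 / 3.0)
    theta = rng.standard_normal((K, n)) * np.sqrt(Delta0_sq / (2 * n))  # cluster means
    labels = np.repeat(np.arange(K), d // K)
    W = theta[labels] + rng.standard_normal((d, n))                     # mixture samples
    return W, labels
```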

  (a) Error Rate  (b) Running Time
Refer to caption Refer to caption
   (c) Coverage Probability  (d) Q-Q Plot
Refer to caption Refer to caption
Figure 7: Performance of FADI under different settings for Example 3. (a) Empirical error rates of ρ(𝐕~F,𝐕)\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V}), where the grey dashed lines represent the error rates for the traditional PCA estimator 𝐕^\widehat{\mathbf{V}}; (b) Running time (in seconds) under different settings (including the runtime for computing 𝚺^1\widehat{\bm{\Sigma}}_{1}). For the traditional PCA, the running time is 5.43 seconds at d=500d=500, 23.32 seconds at d=1000d=1000 and 105.58 seconds at d=2000d=2000; (c) Empirical coverage probability, where the grey dashed line represents the theoretical rate at 0.95; (d) Q-Q plot for 𝐯~1\widetilde{\mathbf{v}}_{1} at Lp/d{0.2,10}Lp/d\in\{0.2,10\}.

A.2 Example 4: Incomplete Matrix Inference

For the true matrix 𝐌\mathbf{M}, we consider K=3K=3, take 𝐕\mathbf{V} to be the KK left singular vectors of a pregenerated d×Kd\times K i.i.d. Gaussian matrix, and take 𝚲=diag(6,4,2)\mathbf{\Lambda}=\operatorname{diag}(6,4,2). We consider the distributed setting m=10m=10, set the dimension at d{500,1000,2000}d\in\{500,1000,2000\}, and set θ=0.4\theta=0.4 and σ=8/d\sigma=8/d for each setting. Then we generate the entry-wise noise by εiji.i.d.𝒩(0,σ2)\varepsilon_{ij}\overset{\text{i.i.d.}}{\sim}{\mathcal{N}}(0,\sigma^{2}) for iji\leq j, and subsample non-zero entries of 𝐌\mathbf{M} with probability θ=0.4\theta=0.4. Under each setting, we perform FADI at p=p=12p=p^{\prime}=12, q=7q=7 and Lp/d{0.2,0.6,0.9,1,1.2,2,5,10}Lp/d\in\{0.2,0.6,0.9,1,1.2,2,5,10\} for the computation of 𝐕~F\widetilde{\mathbf{V}}^{\text{F}}. Define 𝐯~=𝚺^11/2(𝐕~F𝐕𝐇)𝐞1\widetilde{\mathbf{v}}=\widehat{\bm{\Sigma}}_{1}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{1} with 𝚺^1\widehat{\bm{\Sigma}}_{1} being the asymptotic covariance for 𝐕~F𝐞1\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{e}_{1} defined in Corollary 4.9 and 𝐇=sgn(𝐕~F𝐕)\mathbf{H}=\operatorname{sgn}(\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{V}), and empirically calculate the coverage probability, i.e., (𝐯~22χ32(0.95))\mathbb{P}\big{(}\|\widetilde{\mathbf{v}}\|_{2}^{2}\leq\chi_{3}^{2}(0.95)\big{)}. As in Section 5.2, for the regime Lp<dLp<d, we refer to Theorem 4.10 and calculate 𝚺^1\widehat{\bm{\Sigma}}_{1} by

𝚺^1=L2𝐁^𝛀𝛀diag([𝐌~1j2(1θ^)/θ^+σ^2/θ^]j=1d)𝛀𝐁^𝛀.\widehat{\bm{\Sigma}}_{1}=L^{-2}\widehat{\mathbf{B}}_{\bm{\Omega}}^{\top}\bm{\Omega}^{\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{1j}^{2}(1-\widehat{\theta})/\widehat{\theta}+\widehat{\sigma}^{2}/\widehat{\theta}]_{j=1}^{d}\big{)}\bm{\Omega}\widehat{\mathbf{B}}_{\bm{\Omega}}.
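The displayed estimator can be evaluated with a few lines of NumPy, as in the sketch below, where Omega horizontally stacks the L sketching matrices into a d-by-(Lp) array and B_Omega is the stacked (Lp)-by-K matrix of pseudoinverse blocks; the variable names are illustrative.

```python
import numpy as np

def sigma_hat_1(B_Omega, Omega, M_tilde_row1, theta_hat, sigma2_hat, L):
    """Plug-in covariance L^{-2} B^T Omega^T diag(w) Omega B with
    w_j = M_tilde_{1j}^2 (1 - theta_hat)/theta_hat + sigma2_hat/theta_hat."""
    w = M_tilde_row1 ** 2 * (1 - theta_hat) / theta_hat + sigma2_hat / theta_hat
    OtDO = (Omega.T * w) @ Omega          # Omega^T diag(w) Omega, (L*p) x (L*p)
    return B_Omega.T @ OtDO @ B_Omega / L ** 2
```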

Results over 300 Monte Carlo simulations are provided in Figure 8. Figure 8(a) illustrates that the error rate of FADI becomes almost the same as that of the traditional PCA as Lp/dLp/d gets larger, and Figure 8(b) shows that FADI greatly outperforms the traditional PCA in computational efficiency for large dimension dd. We can observe from Figure 8(c) that the confidence interval performs poorly at Lp/d<1Lp/d<1, with the coverage probability equal to 1, which is consistent with the theoretical conditions in Corollary 4.9 for distributional convergence. Figure 8(d) shows the good Gaussian approximation of FADI at Lp/d=10Lp/d=10, and the result at Lp/d=0.2Lp/d=0.2 is consistent with Figure 8(c).

Figure 8: Performance of FADI under different settings for Example 4. (a) Empirical error rates of ρ(𝐕~F,𝐕)\rho(\widetilde{\mathbf{V}}^{\rm F},\mathbf{V}) with traditional PCA error rates as the reference; (b) Running time (in seconds) under different settings (including the computational time of 𝚺^1\widehat{\bm{\Sigma}}_{1}). For the traditional PCA, the running time is 0.42 seconds at d=500d=500, 3.48 seconds at d=1000d=1000 and 30.62 seconds at d=2000d=2000; (c) Empirical coverage probability; (d) Q-Q plot for 𝐯~1\widetilde{\mathbf{v}}_{1} at Lp/d=10Lp/d=10.

A.3 Additional Results for Example 1 in the Genetic Setting

Section 5.1 compares FADI with several existing methods under a relatively large eigengap. In practice, the eigengap of the population covariance matrix may not be large. To assess different methods in a more realistic scenario, we imitate the setting of the 1000 Genomes Data, where we take the number of spikes K=20K=20, σ2=0.4\sigma^{2}=0.4 and the eigengap to be Δ=0.2\Delta=0.2. We generate the data by {𝑿i}i=1ni.i.d.𝒩(𝟎,𝚺)\{\bm{X}_{i}\}_{i=1}^{n}\overset{\text{i.i.d.}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{\Sigma}), where

𝚺=diag(2.4,1.2,0.6,,0.6K2,0.4,,0.4).\mathbf{\Sigma}=\operatorname{diag}(2.4,1.2,\underbrace{0.6,\ldots,0.6}_{K-2},0.4,\ldots,0.4).

The dimension is d=2504d=2504 and the sample size is n=160,000n=160,000. Error rates and running times of the different algorithms are compared under different numbers of splits mm of the sample size nn. For FADI, we take L=75L=75, p=p=40p=p^{\prime}=40 and q=7q=7.
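A scaled-down sketch of this simulation design is given below (with a smaller n than in the actual experiment, to keep the example lightweight). The error metric is the Frobenius distance between the estimated and true projection matrices used throughout the paper; full-sample PCA is used here only as a reference estimator, and the names are illustrative.

```python
import numpy as np

def simulate_and_evaluate(d=2504, n=20000, K=20, seed=2):
    """Draw from the diagonal spiked covariance diag(2.4, 1.2, 0.6 x (K-2), 0.4, ...)
    and report the projection error of full-sample PCA as a reference."""
    rng = np.random.default_rng(seed)
    diag = np.concatenate(([2.4, 1.2], np.full(K - 2, 0.6), np.full(d - K, 0.4)))
    X = rng.standard_normal((n, d)) * np.sqrt(diag)          # rows X_i ~ N(0, diag)
    V = np.eye(d)[:, :K]                                     # true top-K eigenvectors
    S_hat = X.T @ X / n                                      # sample covariance
    V_hat = np.linalg.eigh(S_hat)[1][:, ::-1][:, :K]         # leading K eigenvectors
    return np.linalg.norm(V_hat @ V_hat.T - V @ V.T, 'fro')  # projection distance
```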

Table 6 shows that the number of sample splits mm has little impact on the error rate of FADI, as expected, while the error rate of the distributed PCA of Fan et al. [18] increases as mm increases. FADI is much faster than the other two methods in all the practical settings when the eigengap is small. This suggests that in practical problems where the sample size is large and the eigengap is small, FADI not only enjoys much higher computational efficiency than the existing methods, but also gives stable estimation across different splits of the sample size nn. Although the settings with a small eigengap are of major interest in this section, we also conduct simulations where the eigengap increases gradually to see how it affects the performance of FADI. Table 7 shows that as the eigengap gets larger, the error rate of FADI gets closer to that of the traditional full sample PCA, whereas the error rate ratio of distributed PCA to FADI drops below 1 but remains above 0.9 when the eigengap is larger than 1. As for the running time, FADI outperforms the other two methods in all the settings. In summary, when the eigengap grows larger, the performance of the three algorithms becomes similar to what we observe in Section 5.1.

FADI Traditional PCA Distributed PCA mm
Error Rate 2.296 1.811 (0.79) 2.629 (1.15) 10
2.294 1.811 (0.79) 3.412 (1.49) 20
2.294 1.811 (0.79) 3.955 (1.72) 40
2.294 1.811 (0.79) 4.215 (1.84) 80
Running Time 5.76 983.86 (170.8) 189.76 (32.9) 10
3.82 992.09 (259.8) 144.18 (37.8) 20
2.86 972.47 (339.5) 119.29 (41.6) 40
2.37 968.43 (408.5) 99.39 (41.9) 80
Table 6: Comparison of the error rates and running times (in seconds) among FADI, full sample PCA and distributed PCA [18], using different numbers of sample splits mm in the genetic setting. Values in the parentheses represent the error rate ratios or the computational time ratios of each method with respect to FADI.
FADI Traditional PCA Distributed PCA Eigengap
Error Rate 1.28 1.06 (0.82) 1.57 (1.22) 0.4
0.77 0.65 (0.85) 0.71 (0.92) 0.8
0.48 0.42 (0.88) 0.43 (0.90) 1.6
0.31 0.29 (0.92) 0.29 (0.93) 3.2
Running Time 2.76 925.15 (334.7) 115.29 (41.7) 0.4
2.77 916.52 (331.4) 114.76 (41.5) 0.8
2.69 922.85 (342.7) 114.75 (42.6) 1.6
2.77 919.20 (332.2) 115.26 (41.7) 3.2
Table 7: Comparison of the error rates and running times (in seconds) among FADI, full sample PCA and distributed PCA [18] for different eigengaps Δ\Delta in the genetic setting. The number of sample splits mm is 40 for FADI and distributed PCA. The settings of the other parameters are the same as those in Table 6.

B Proof of Main Theoretical Results

In this section we provide proofs of the theoretical results in Section 4. For the inferential results, we will present proofs of the theorems under the regime LpdLp\ll d first, which takes into consideration the extra variability caused by the fast sketching, and then give proofs of the theorems under the regime LpdLp\gg d where the fast sketching randomness is negligible.

B.1 Unbiasedness of Fast Sketching With Respect to 𝐌^\widehat{\mathbf{M}}

We show by the following Lemma B.1 that the fast sketching is unbiased with respect to 𝐌^\widehat{\mathbf{M}} under proper conditions.

Lemma B.1.

Let 𝐕^d𝚲^d𝐕^d\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top} be the eigen-decomposition of 𝐌^\widehat{\mathbf{M}}, and let 𝐕^=(𝐯^1,,𝐯^K)\widehat{\mathbf{V}}=(\widehat{\mathbf{v}}_{1},\ldots,\widehat{\mathbf{v}}_{K}) be the stacked KK leading eigenvectors of 𝐌^\widehat{\mathbf{M}} corresponding to the eigenvalues with largest magnitudes. When 𝚺𝐕^𝐕^2<1/2\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}<1/2, we have that Col(𝐕)=Col(𝐕^)\operatorname{Col}(\mathbf{V}^{\prime})=\operatorname{Col}(\widehat{\mathbf{V}}), where Col()\operatorname{Col}(\cdot) denotes the column space of the matrix.

Proof.

We will first show that 𝐕^d𝚺𝐕^d\widehat{\mathbf{V}}_{d}^{\top}{\mathbf{\Sigma}}^{\prime}\widehat{\mathbf{V}}_{d} is diagonal. For any j[d]j\in[d], we let 𝐃j=𝐈d2𝐞j𝐞j\mathbf{D}_{j}=\mathbf{I}_{d}-2\mathbf{e}_{j}\mathbf{e}_{j}^{\top}, and recall we denote the eigen-decomposition of 𝐌^\widehat{\mathbf{M}} by 𝐌^=𝐕^d𝚲^d𝐕^d\widehat{\mathbf{M}}=\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}. Then conditional on 𝐌^\widehat{\mathbf{M}} we have

𝐕^d𝐃j𝐕^d𝐘^()𝐘^()𝐕^d𝐃j𝐕^d=𝐕^d𝐃j𝐕^d𝐕^d𝚲^d𝐕^d𝛀()𝛀()𝐕^d𝚲^d𝐕^d𝐕^d𝐃j𝐕^d\displaystyle\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{Y}}^{(\ell)}\widehat{\mathbf{Y}}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}=\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}\mathbf{\Omega}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}
=𝐕^d𝚲^d(𝐃j𝐕^d𝛀())(𝛀()𝐕^d𝐃j)𝚲^d𝐕^d=d𝐕^d𝚲^d𝐕^d𝛀()𝛀()𝐕^d𝚲^d𝐕^d=𝐘^()𝐘^(),\displaystyle\quad=\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}(\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)})(\mathbf{\Omega}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j})\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\overset{\rm d}{=}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}\mathbf{\Omega}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\widehat{\mathbf{\Lambda}}_{d}\widehat{\mathbf{V}}_{d}^{\top}=\widehat{\mathbf{Y}}^{(\ell)}\widehat{\mathbf{Y}}^{(\ell)\top},

where the second equality is due to the fact that diagonal matrices are commutative, and the last but one equivalence in distribution is due to the fact that 𝐃j𝐕^d𝛀()=d𝐕^d𝛀()\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}\overset{\rm d}{=}\widehat{\mathbf{V}}_{d}^{\top}\mathbf{\Omega}^{(\ell)}. Also we know the top KK eigenvectors of 𝐕^d𝐃j𝐕^d𝐘^()𝐘^()𝐕^d𝐃j𝐕^d\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{Y}}^{(\ell)}\widehat{\mathbf{Y}}^{(\ell)\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top} are 𝐕^d𝐃j𝐕^d𝐕^()\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}^{(\ell)}, and thus 𝐕^d𝐃j𝐕^d𝐕^()=d𝐕^()\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}^{(\ell)}\overset{\rm d}{=}\widehat{\mathbf{V}}^{(\ell)}. Hence we have

𝐕^d𝔼(𝐕^()𝐕^()|𝐌^)𝐕^d=𝐕^d𝐕^d𝐃j𝐕^d𝔼(𝐕^()𝐕^()|𝐌^)𝐕^d𝐃j𝐕^d𝐕^d\displaystyle\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}=\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\widehat{\mathbf{V}}_{d}
=𝐃j𝐕^d𝔼(𝐕^()𝐕^()|𝐌^)𝐕^d𝐃j=𝐃j𝐕^d𝚺𝐕^d𝐃j.\displaystyle\quad=\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}=\mathbf{D}_{j}\widehat{\mathbf{V}}_{d}^{\top}\bm{\Sigma}^{\prime}\widehat{\mathbf{V}}_{d}\mathbf{D}_{j}.

The above equation holds for any j[d]j\in[d], which implies that 𝐕^d𝔼(𝐕^()𝐕^()|𝐌^)𝐕^d\widehat{\mathbf{V}}_{d}^{\top}\mathbb{E}\left(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}|\widehat{\mathbf{M}}\right)\widehat{\mathbf{V}}_{d} is diagonal and that 𝚺\mathbf{\Sigma}^{\prime} and 𝐌^\widehat{\mathbf{M}} share the same set of eigenvectors.

Now under the condition that 𝚺𝐕^𝐕^2<1/2\left\|\mathbf{\Sigma}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\right\|_{2}<1/2, for any j[K],j\in[K], we denote by 𝐯^j\widehat{\mathbf{v}}_{j} the jj-th column of 𝐕^\widehat{\mathbf{V}}, and we have

𝚺𝐯^j2=(𝚺𝐕^𝐕^+𝐕^𝐕^)𝐯^j21𝚺𝐕^𝐕^2>112=12.\left\|\mathbf{\Sigma}^{\prime}\widehat{\mathbf{v}}_{j}\right\|_{2}=\left\|\left(\mathbf{\Sigma}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}+\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\right)\widehat{\mathbf{v}}_{j}\right\|_{2}\geq 1-\left\|\mathbf{\Sigma}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\right\|_{2}>1-\frac{1}{2}=\frac{1}{2}.

In other words, the corresponding eigenvalue of 𝐯^j\widehat{\mathbf{v}}_{j} in 𝚺\mathbf{\Sigma}^{\prime} is larger than 1/21/2. On the other hand, by Weyl’s inequality [19], the rest of the dKd-K eigenvalues of 𝚺{\mathbf{\Sigma}}^{\prime} should be less than 1/2. Therefore, 𝐕^\widehat{\mathbf{V}} are still the leading KK eigenvectors for 𝚺{\mathbf{\Sigma}}^{\prime}, and thus Col(𝐕)=Col(𝐕^)\operatorname{Col}(\mathbf{V}^{\prime})=\operatorname{Col}(\widehat{\mathbf{V}}). ∎

Recall in Section 4 we discuss that the bias term has the following decomposition ρ(𝐕,𝐕)ρ(𝐕^,𝐕)+ρ(𝐕,𝐕^)\rho(\mathbf{V}^{\prime},{\mathbf{V}})\leq\rho(\widehat{\mathbf{V}},{\mathbf{V}})+\rho(\mathbf{V}^{\prime},\widehat{\mathbf{V}}). Lemma B.1 shows that as long as 𝚺\mathbf{\Sigma}^{\prime} and 𝐕𝐕\mathbf{V}\mathbf{V}^{\top} are not too far apart, 𝐕\mathbf{V}^{\prime} and 𝐕^\widehat{\mathbf{V}} will share the same column space. In fact, Lemma B.4 in Section B.2 will show that the probability that 𝚺{\mathbf{\Sigma}}^{\prime} and 𝐕^𝐕^\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top} are not sufficiently close converges to 0, and ρ(𝐕,𝐕)=ρ(𝐕^,𝐕)\rho(\mathbf{V}^{\prime},{\mathbf{V}})=\rho(\widehat{\mathbf{V}},{\mathbf{V}}) with high probability. With the help of Lemma B.1, we present the proof of the main error bound results in the following section.

B.2 Proof of Theorem 4.1

Recall the problem setting in Section 2. It is not hard to see that we can write 𝚲=𝐏0𝚲0\mathbf{\Lambda}=\mathbf{P}_{0}\mathbf{\Lambda}^{0}, where 𝚲0=diag(|λ1|,,|λK|)\mathbf{\Lambda}^{0}=\operatorname{diag}(|\lambda_{1}|,\ldots,|\lambda_{K}|) and 𝐏0=diag([sgn(λk)]k=1K)\mathbf{P}_{0}=\operatorname{diag}\big{(}[\operatorname{sgn}(\lambda_{k})]_{k=1}^{K}\big{)}. Then 𝐌=(𝐕𝐏0)𝚲0𝐕\mathbf{M}=(\mathbf{V}\mathbf{P}_{0})\mathbf{\Lambda}^{0}\mathbf{V}^{\top} is the SVD of 𝐌\mathbf{M}.

We begin with bounding (𝔼𝐕~𝐕~𝐕𝐕F2)1/2\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|^{2}_{\rm F}\right)^{1/2}. Before delving into the detailed proof, the following two lemmas provide some important properties of the random Gaussian matrix.

Lemma B.2.

Let 𝛀d×p\mathbf{\Omega}\in\mathbb{R}^{d\times p} be a random matrix with i.i.d. standard Gaussian entries, where pdp\leq d. For a random variable, recall that we define the ψ1\psi_{1} norm to be ψ1=supp1(𝔼||p)1/p/p\|\cdot\|_{\psi_{1}}=\sup_{p\geq 1}({\mathbb{E}}|\cdot|^{p})^{1/p}/p. Then we have the following bound on the ψ1\psi_{1} norm of the matrix 𝛀/p\mathbf{\Omega}/\sqrt{p}:

𝛀/p2ψ1d/p.\|\|\mathbf{\Omega}/\sqrt{p}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d/p}. (B.23)
Lemma B.3.

Let 𝛀K×p\bm{\Omega}\in\mathbb{R}^{K\times p} denote a random matrix with i.i.d. Gaussian entries, where p2Kp\geq 2K. For any integer aa such that 1a(pK+1)/21\leq a\leq(p-K+1)/2, there exists a constant C>0C>0 such that

𝔼((σmin(𝛀/p))a)Ca.{\mathbb{E}}\left(\left(\sigma_{\min}(\bm{\Omega}/\sqrt{p})\right)^{-a}\right)\leq C^{a}. (B.24)

The following lemma shows that 𝚺𝐕𝐕2\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2} and 𝚺𝐕^𝐕^2\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2} are bounded by a small constant with high probability.

Lemma B.4.

If Assumption 1 holds and pmax(2K,K+3)p\geq\max(2K,K+3), there exists a constant c0>0c_{0}>0 such that for any ε>0\varepsilon>0, we have

max{(𝚺𝐕𝐕2ε),(𝚺𝐕^𝐕^2ε)}exp(c0pdΔεr1(d)).\max\left\{\mathbb{P}\Big{(}\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon\Big{)},\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq\varepsilon\right)\right\}\lesssim\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{r_{1}(d)}\right).

The proofs of Lemma B.2, Lemma B.3 and Lemma B.4 are deferred to Appendix C. We are now ready to prove the theorem. We first decompose the estimation error into two parts,

(𝔼|ρ(𝐕~,𝐕)|2)1/2(𝔼|ρ(𝐕~,𝐕)|2)1/2I+(𝔼|ρ(𝐕,𝐕)|2)1/2II.\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}},\mathbf{V})|^{2}\right)^{1/2}\leq\underbrace{\Big{(}{\mathbb{E}}|\rho(\widetilde{\mathbf{V}},\mathbf{V}^{\prime})|^{2}\Big{)}^{1/2}}_{\text{I}}\!\!\!+\underbrace{\Big{(}{\mathbb{E}}|\rho(\mathbf{V}^{\prime},\mathbf{V})|^{2}\Big{)}^{1/2}}_{\text{II}}. (B.25)

Term I can be regarded as the variance term, whereas term II is the bias term. We will consider the bias term first.

B.2.1 Control of the Bias Term

We can see that term II can be further decomposed into two terms

(𝔼|ρ(𝐕,𝐕)|2)1/2(𝔼𝐕𝐕𝐕^𝐕^F2)1/2+(𝔼𝐕^𝐕^𝐕𝐕F2)1/2.\left({\mathbb{E}}|\rho(\mathbf{V}^{\prime},\mathbf{V})|^{2}\right)^{1/2}\leq\left({\mathbb{E}}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}^{2}\right)^{1/2}+\left({\mathbb{E}}\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\rm F}^{2}\right)^{1/2}. (B.26)

We can bound both terms separately. First note that 𝐕𝐕𝐕^𝐕^F2K𝐕𝐕𝐕^𝐕^22K\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}\leq\sqrt{2K}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\leq\sqrt{2K}. Thus we have,

(𝔼𝐕𝐕𝐕^𝐕^F2)1/2(𝔼𝐕𝐕𝐕^𝐕^F2𝕀{𝚺𝐕^𝐕^21/2})1/2\displaystyle\left({\mathbb{E}}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}^{2}\right)^{1/2}\leq\left({\mathbb{E}}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}^{2}\mathbb{I}\big{\{}\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq 1/2\big{\}}\right)^{1/2}
+(𝔼𝐕𝐕𝐕^𝐕^F2𝕀{𝚺𝐕^𝐕^2<1/2})1/2\displaystyle\quad\quad+\left({\mathbb{E}}\|\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{\rm F}^{2}\mathbb{I}{\big{\{}\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}<1/2\big{\}}}\right)^{1/2}
0+K((𝚺𝐕^𝐕^21/2))1/2Kexp(c04pdΔr1(d)),\displaystyle\lesssim 0+\sqrt{K}\left(\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq 1/2\right)\right)^{1/2}\lesssim\sqrt{K}\exp\left(-\frac{c_{0}}{4}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right),

where the last but one inequality follows from Lemma B.1, and the last inequality is a result of Lemma B.4. As for the second term on the RHS of (B.26), by Davis-Kahan’s Theorem [45], we have

(𝔼𝐕^𝐕^𝐕𝐕F2)1/2\displaystyle\left({\mathbb{E}}\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\rm F}^{2}\right)^{1/2} KΔ(𝔼𝐌^𝐌22)1/2=KΔ(𝔼𝐄22)1/2\displaystyle\lesssim\frac{\sqrt{K}}{\Delta}\left({\mathbb{E}}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}^{2}\right)^{1/2}=\frac{\sqrt{K}}{\Delta}\left({\mathbb{E}}\|\mathbf{E}\|_{2}^{2}\right)^{1/2}
KΔ𝐄2ψ1KΔr1(d).\displaystyle\leq\frac{\sqrt{K}}{\Delta}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\frac{\sqrt{K}}{\Delta}r_{1}(d).

Therefore, the bound for the bias term is

IIKexp(c04pdΔr1(d))+KΔr1(d).\text{II}\lesssim\sqrt{K}\exp\left(-\frac{c_{0}}{4}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)+\frac{\sqrt{K}}{\Delta}r_{1}(d).

B.2.2 Control of the Variance Term

Now we move on to control the variance term. Suppose that 𝚺𝐕𝐕2<1/4\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}<1/4. Then by Weyl’s inequality [19] we have that σK(𝚺)>11/4=3/4\sigma_{K}(\mathbf{\Sigma}^{\prime})>1-1/4=3/4 and σK+1(𝚺)<1/4\sigma_{K+1}(\mathbf{\Sigma}^{\prime})<1/4. Thus by the Davis-Kahan theorem [45],

(𝔼(𝐕~𝐕~𝐕𝐕F2𝕀{𝚺𝐕𝐕2<1/4}))1/2\displaystyle\left({\mathbb{E}}\left(\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\|_{\rm F}^{2}\mathbb{I}\left\{\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}<1/4\right\}\right)\right)^{1/2}\!\!\!
(𝔼(𝚺~𝚺F2(σK(𝚺)σK+1(𝚺))2𝕀{𝚺𝐕𝐕2<1/4}))1/2\displaystyle\lesssim\left({\mathbb{E}}\left(\frac{\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}}{\left(\sigma_{K}({\mathbf{\Sigma}}^{\prime})-\sigma_{K+1}({\mathbf{\Sigma}}^{\prime})\right)^{2}}\mathbb{I}\left\{\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}<1/4\right\}\right)\right)^{1/2}
(𝔼(𝚺~𝚺F2𝕀{𝚺𝐕𝐕2<1/4}))1/2(𝔼𝚺~𝚺F2)1/2III.\displaystyle\quad\lesssim\left({\mathbb{E}}\left(\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\mathbb{I}{\left\{\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}<1/4\right\}}\right)\right)^{1/2}\leq\underbrace{\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\right)^{1/2}}_{\text{III}}.

We will bound term III later. Also, similarly to before, note that 𝐕~𝐕~𝐕𝐕F2K\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\|_{\rm F}\leq\sqrt{2K}. Thus by Lemma B.4,

(𝔼(𝐕~𝐕~𝐕𝐕F2𝕀{𝚺𝐕𝐕214}))1/2K((𝚺𝐕𝐕214))1/2\displaystyle\left({\mathbb{E}}\left(\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\|_{\rm F}^{2}\mathbb{I}{\left\{\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}\geq\frac{1}{4}\right\}}\right)\right)^{{1}/{2}}\!\!\!\lesssim\sqrt{K}\left(\mathbb{P}\left(\left\|\mathbf{\Sigma}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}\geq\frac{1}{4}\right)\right)^{{1}/{2}}
Kexp(c08pdΔr1(d)).\displaystyle\quad\leq\sqrt{K}\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right).

Therefore, we have

(𝔼𝐕~𝐕~𝐕𝐕F2)1/2Kexp(c08pdΔr1(d))+(𝔼𝚺~𝚺F2)1/2III.\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\!\!-\!\!\mathbf{V}^{\prime}\mathbf{V}^{\prime\top}\|_{\rm F}^{2}\right)^{1/2}\lesssim\sqrt{K}\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)+\underbrace{\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\right)^{1/2}}_{\text{III}}.

Now we move on to bound term III.

(𝔼𝚺~𝚺F2)1/2\displaystyle\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\right)^{1/2} =(𝔼1L=1L𝐕^()𝐕^()𝔼(𝐕^(1)𝐕^(1)|𝐌^)F2)1/2\displaystyle=\left({\mathbb{E}}\left\|\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-{\mathbb{E}}\left(\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}|\widehat{\mathbf{M}}\right)\right\|_{\rm F}^{2}\right)^{1/2}
=(𝔼(𝔼(1L=1L𝐕^()𝐕^()𝔼(𝐕^(1)𝐕^(1)|𝐌^)F2|𝐌^)))1/2\displaystyle=\bigg{(}{\mathbb{E}}\bigg{(}{\mathbb{E}}\bigg{(}\bigg{\|}\frac{1}{L}\sum_{\ell=1}^{L}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-{\mathbb{E}}\left(\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}|\widehat{\mathbf{M}}\right)\bigg{\|}_{\rm F}^{2}\bigg{|}\widehat{\mathbf{M}}\bigg{)}\bigg{)}\bigg{)}^{1/2}
=1L(𝔼𝐕^()𝐕^()𝔼(𝐕^(1)𝐕^(1)|𝐌^)F2)1/2\displaystyle=\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-{\mathbb{E}}\left(\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}|\widehat{\mathbf{M}}\right)\right\|_{\rm F}^{2}\right)^{1/2}
1L(𝔼𝐕^()𝐕^()𝐕𝐕F2)1/2+1L(𝔼𝐕𝐕𝚺F2)1/2.\displaystyle\leq\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\|_{\rm F}^{2}\right)^{1/2}+\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\mathbf{V}\mathbf{V}^{\top}-{\mathbf{\Sigma}}^{\prime}\right\|_{\rm F}^{2}\right)^{1/2}.

where the last but one equality is due to the independence of estimators from different sketches conditional on 𝐌^\widehat{\mathbf{M}}. By Jensen’s inequality [22], we have

1L(𝔼𝐕𝐕𝚺F2)1/21L(𝔼𝐕^()𝐕^()𝐕𝐕F2)1/2.\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\mathbf{V}\mathbf{V}^{\top}-{\mathbf{\Sigma}}^{\prime}\right\|_{\rm F}^{2}\right)^{1/2}\leq\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\|_{\rm F}^{2}\right)^{1/2}.

Thus we have

(𝔼𝚺~𝚺F2)1/21L(𝔼𝐕^()𝐕^()𝐕𝐕F2)1/2,\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-{\mathbf{\Sigma}}^{\prime}\|_{\rm F}^{2}\right)^{1/2}\lesssim\frac{1}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\|_{\rm F}^{2}\right)^{1/2}, (B.27)

Before bounding the RHS, let us consider the matrix 𝐘():=𝐕𝐏0𝚲0𝐕𝛀(){\mathbf{Y}}^{(\ell)}:=\mathbf{V}\mathbf{P}_{0}\mathbf{\Lambda}^{0}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}. If 𝛀~():=𝐕𝛀()K×p\widetilde{\mathbf{\Omega}}^{(\ell)}:=\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}\in\mathbb{R}^{K\times p} does not have full row rank, then its entries are restricted to a linear subspace of dimension less than K×pK\times p, which has Lebesgue measure zero. Since 𝛀~()\widetilde{\mathbf{\Omega}}^{(\ell)} is a K×pK\times p standard Gaussian matrix, it has full row rank with probability 1. Thus with probability 1, the matrix 𝐘(){\mathbf{Y}}^{(\ell)} is of rank KK, and 𝐕\mathbf{V} and the top KK left singular vectors of 𝐘()/p{\mathbf{Y}}^{(\ell)}/\sqrt{p} span the same column space. In other words, if we let 𝚪K()\bm{\Gamma}_{K}^{(\ell)} be the top KK left singular vectors of 𝐘()/p{\mathbf{Y}}^{(\ell)}/\sqrt{p}, then 𝚪K()𝚪K()=𝐕𝐕\bm{\Gamma}_{K}^{(\ell)}\bm{\Gamma}_{K}^{(\ell)\top}=\mathbf{V}\mathbf{V}^{\top}.

Now consider the KK-th singular value of 𝐘()/p{\mathbf{Y}}^{(\ell)}/\sqrt{p}. Let 𝐔𝛀~𝐃𝛀~𝐕𝛀~\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\mathbf{V}_{\widetilde{\mathbf{\Omega}}}^{\top} be the SVD of 𝛀~()/p\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}. Then we have

σK(𝐘()/p)\displaystyle\sigma_{K}\left({\mathbf{Y}}^{(\ell)}/\sqrt{p}\right) =σK(𝐕𝐏0𝚲0𝛀~()/p)=σK(𝚲0𝐔𝛀~𝐃𝛀~)\displaystyle=\sigma_{K}\left(\mathbf{V}\mathbf{P}_{0}\mathbf{\Lambda}^{0}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)=\sigma_{K}\left(\mathbf{\Lambda}^{0}\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\right)
=min𝒙2=1𝚲0𝐔𝛀~𝐃𝛀~𝒙2(i)σmin(𝛀~()/p)min𝒗12=1𝚲0𝐔𝛀~𝒗12\displaystyle=\min_{\|\bm{x}\|_{2}=1}\|\mathbf{\Lambda}^{0}\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}\|_{2}\overset{(i)}{\geq}\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\min_{\|\bm{v}_{1}\|_{2}=1}\left\|\mathbf{\Lambda}^{0}\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\bm{v}_{1}\right\|_{2}
(ii)σmin(𝛀~()/p)min𝒗22=1𝚲0𝒗22Δσmin(𝛀~()/p),\displaystyle\overset{(ii)}{\geq}\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\min_{\|\bm{v}_{2}\|_{2}=1}\left\|\mathbf{\Lambda}^{0}\bm{v}_{2}\right\|_{2}\geq\Delta\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right),

where 𝒗1=𝐃𝛀~𝒙/𝐃𝛀~𝒙2\bm{v}_{1}=\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}/\|\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}\|_{2}, and 𝒗2=𝐔𝛀~𝒗1\bm{v}_{2}=\mathbf{U}_{\widetilde{\mathbf{\Omega}}}\bm{v}_{1}. Inequality (i) follows because

𝐃𝛀~𝒙2σmin(𝛀~()/p)𝒙2=σmin(𝛀~()/p),\|\mathbf{D}_{\widetilde{\mathbf{\Omega}}}\bm{x}\|_{2}\geq\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\|\bm{x}\|_{2}=\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right),

and inequality (ii) is because 𝒗22=𝒗12=1\|\bm{v}_{2}\|_{2}=\|\bm{v}_{1}\|_{2}=1.

Now by Wedin’s Theorem [42] we have the following bound on the RHS of (B.27),

1L(𝔼|ρ(𝐕^(),𝐕)|2)1/2KL(𝔼𝐘^()/p𝐘()/p22/(Δσmin(𝛀~()/p))2)1/2\displaystyle\frac{1}{\sqrt{L}}\left({\mathbb{E}}\big{|}\rho(\widehat{\mathbf{V}}^{(\ell)},\mathbf{V})\big{|}^{2}\right)^{1/2}\!\!\!\lesssim\frac{\sqrt{K}}{\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\|_{2}^{2}/\left(\Delta\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\right)^{2}\right)^{1/2}
KΔL(𝔼𝐘^()/p𝐘()/p24)1/4(𝔼(σmin(𝛀~()/p))4)1/4\displaystyle\quad\leq\frac{\sqrt{K}}{\Delta\sqrt{L}}\left({\mathbb{E}}\left\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\|_{2}^{4}\right)^{1/4}\left({\mathbb{E}}\left(\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\right)^{-4}\right)^{1/4}
KΔL𝐄2ψ1𝛀()/p2ψ1KdΔ2pL𝐄2ψ1KdΔ2pLr1(d),\displaystyle\quad\lesssim\frac{\sqrt{K}}{\Delta\sqrt{L}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\cdot\|\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{\frac{Kd}{\Delta^{2}pL}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d),

where the last but one inequality is due to Lemma B.3. Therefore, we have the final error rate for the estimator 𝐕~\widetilde{\mathbf{V}}:

(𝔼𝐕~𝐕~𝐕𝐕F2)1/2Kexp(c08pdΔr1(d))+KΔr1(d)bias+KdΔ2pLr1(d)variance.\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\rm F}^{2}\right)^{1/2}\lesssim\underbrace{\sqrt{K}\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)+\frac{\sqrt{K}}{\Delta}r_{1}(d)}_{\text{bias}}+\underbrace{\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}_{\text{variance}}.

Now consider the function g(x):=exp(a0pdx)/(dx2)g(x):=\exp(a_{0}\sqrt{\frac{p}{d}}x)/(\sqrt{d}x^{2}), where a0>0a_{0}>0 is a fixed constant. We have

dlogg(x)dx=a0pd2x>0,for x2a0dp.\frac{d\log g(x)}{dx}=a_{0}\sqrt{\frac{p}{d}}-\frac{2}{x}>0,\quad\text{for }x\geq\frac{2}{a_{0}}\sqrt{\frac{d}{p}}.

Thus g(x)g(x) is increasing on x2d/p/a0x\geq 2\sqrt{d/p}/a_{0}, and if we take xCdplogdx\geq C\sqrt{\frac{d}{p}}\log d for some large enough constant C>0C>0, we have that g(x)1g(x)\geq 1. Then by plugging in x=Δ/r1(d)x=\Delta/r_{1}(d) and taking a0=c0/8a_{0}=c_{0}/8, under the condition that (logd)1p/dΔ/r1(d)C(\log d)^{-1}\sqrt{p/d}\Delta/r_{1}(d)\geq C for some large enough constant C>0C>0, we have that

exp(c08pdΔr1(d))1d(r1(d)Δ)2=o(r1(d)Δ),\exp\left(-\frac{c_{0}}{8}\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)\lesssim\frac{1}{\sqrt{d}}\left(\frac{r_{1}(d)}{\Delta}\right)^{2}=o\left(\frac{r_{1}(d)}{\Delta}\right),

and the error rate simplifies to

(𝔼𝐕~𝐕~𝐕𝐕F2)1/2KΔr1(d)bias+KdΔ2pLr1(d)variance.\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{\rm F}^{2}\right)^{1/2}\lesssim\underbrace{\frac{\sqrt{K}}{\Delta}r_{1}(d)}_{\text{bias}}+\underbrace{\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}_{\text{variance}}.

Now we move on to bound (𝔼𝐕~F𝐕~F𝐕𝐕2F)1/2\left({\mathbb{E}}\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\mathbf{V}\mathbf{V}^{\top}\|^{2}_{\rm F}\right)^{1/2}. Since 22q\|\cdot\|_{2}^{2q} is convex, by Jensen’s inequality [22], under the condition that pmax(2K,8q+K1)p\geq\max(2K,8q+K-1) we have that there exists some constant η\eta such that

𝔼𝚺~𝐕𝐕22q\displaystyle{\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q} 1L=1L𝔼𝐕^()𝐕^()𝐕𝐕22q=𝔼𝐕^(1)𝐕^(1)𝐕𝐕22q\displaystyle\leq\frac{1}{L}\sum_{\ell=1}^{L}{\mathbb{E}}\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}={\mathbb{E}}\|\widehat{\mathbf{V}}^{(1)}\widehat{\mathbf{V}}^{(1)\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}
𝔼(𝐘^()/p𝐘()/p22q/(Δσmin(𝛀~()/p))2q)\displaystyle\leq{\mathbb{E}}\bigg{(}\left\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\|_{2}^{2q}\Big{/}\left(\Delta\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\right)^{2q}\bigg{)}
1Δ2q(𝔼𝐘^()/p𝐘()/p24q)1/2(𝔼(σmin(𝛀~()/p))4q)1/2\displaystyle\leq\frac{1}{\Delta^{2q}}\left({\mathbb{E}}\left\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\right\|_{2}^{4q}\right)^{1/2}\left({\mathbb{E}}\left(\sigma_{\min}\left(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\right)\right)^{-4q}\right)^{1/2}
(ηq2dΔ2p𝐄2ψ1)2q.\displaystyle\lesssim\left(\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{2q}.

Thus by Markov’s inequality, we also have

(𝚺~𝐕𝐕212)\displaystyle\mathbb{P}\left(\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\frac{1}{2}\right) =(𝚺~𝐕𝐕22q122q)22q𝔼(𝚺~𝐕𝐕22q)\displaystyle=\mathbb{P}\left(\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}\geq\frac{1}{2^{2q}}\right)\leq 2^{2q}{\mathbb{E}}\big{(}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}\big{)}
(2ηq2dΔ2p𝐄2ψ1)2q.\displaystyle\lesssim\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{2q}.

Since 𝚺~\widetilde{\bm{\Sigma}} is the summation of positive semi-definite matrices by construction, 𝚺~\widetilde{\bm{\Sigma}} is also positive semi-definite. By Weyl’s inequality [19], we know that σK(𝚺~)1𝚺~𝐕𝐕2\sigma_{K}(\widetilde{\mathbf{\Sigma}})\geq 1-\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2} and σK+1(𝚺~)𝚺~𝐕𝐕2\sigma_{K+1}(\widetilde{\mathbf{\Sigma}})\leq\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}.

Now if we denote the SVD of 𝚺~q\widetilde{\mathbf{\Sigma}}^{q} by 𝐕~𝚲~Kq𝐕~+𝐕~𝚲~q𝐕~\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}+\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}, then with probability 1, 𝐕~𝚲~Kq𝐕~𝛀F\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\bm{\Omega}^{\text{F}} and 𝐕~\widetilde{\mathbf{V}} share the same column space. By the relationship σk(𝚺~q)=σkq(𝚺~)\sigma_{k}(\widetilde{\mathbf{\Sigma}}^{q})=\sigma_{k}^{q}(\widetilde{\mathbf{\Sigma}}) for k[d]k\in[d] and Davis-Kahan’s Theorem [45], we have

𝔼(𝐕~F𝐕~F𝐕~𝐕~2F|𝚺~)\displaystyle{\mathbb{E}}\left(\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|^{2}_{\rm F}\,|\widetilde{\mathbf{\Sigma}}\right) 𝔼(K𝚺~q𝛀F𝐕~𝚲~Kq𝐕~𝛀F22/σ2min(𝐕~𝚲~Kq𝐕~𝛀F)|𝚺~)\displaystyle\lesssim{\mathbb{E}}\left(K\|\widetilde{\mathbf{\Sigma}}^{q}\mathbf{\Omega}^{\rm F}-\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\rm F}\|_{2}^{2}/\sigma^{2}_{\min}(\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\rm F})\,|\widetilde{\mathbf{\Sigma}}\right)
(KσKq(𝚺~)𝐕~𝚲~q𝐕~2𝛀F/p2ψ1)2\displaystyle\lesssim\left(\frac{\sqrt{K}}{\sigma_{K}^{q}(\widetilde{\mathbf{\Sigma}})}\|\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}\|_{2}\cdot\|\|\mathbf{\Omega}^{\rm F}/\sqrt{p^{\prime}}\|_{2}\|_{\psi_{1}}\right)^{2}
Kdp𝚺~𝐕𝐕22q(1𝚺~𝐕𝐕2)2q.\displaystyle\lesssim\frac{Kd}{p^{\prime}}\frac{\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}}{\left(1-\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\right)^{2q}}.

Therefore we have,

(𝔼𝐕~F𝐕~F𝐕~𝐕~2F)1/2(𝔼𝐕~F𝐕~F𝐕~𝐕~2F𝕀{𝚺~𝐕𝐕21/2})1/2\displaystyle\left({\mathbb{E}}\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|^{2}_{\rm F}\right)^{1/2}\lesssim\left({\mathbb{E}}\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|^{2}_{\rm F}\mathbb{I}\big{\{}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\leq{1}/{2}\big{\}}\right)^{1/2}
+(𝔼𝐕~F𝐕~F𝐕~𝐕~2F𝕀{𝚺~𝐕𝐕2>1/2})1/2\displaystyle\quad+\left({\mathbb{E}}\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|^{2}_{\rm F}\mathbb{I}\big{\{}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}>{1}/{2}\big{\}}\right)^{1/2}
2qKdp(𝔼𝚺~𝐕𝐕22q)1/2+K{(𝚺~𝐕𝐕212)}1/2\displaystyle\lesssim 2^{q}\sqrt{\frac{Kd}{p^{\prime}}}\left({\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}\right)^{1/2}+\sqrt{K}\left\{\mathbb{P}\left(\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\frac{1}{2}\right)\right\}^{1/2}
Kdp(2ηq2dΔ2p𝐄2ψ1)q+K(2ηq2dΔ2p𝐄2ψ1)q\displaystyle\lesssim\sqrt{\frac{Kd}{p^{\prime}}}\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{q}+\sqrt{K}\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{q}
Kdp(2ηq2dΔ2p𝐄2ψ1)q,\displaystyle\lesssim\sqrt{\frac{Kd}{p^{\prime}}}\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{q},

where the second-to-last inequality follows from Markov’s inequality, i.e.,

(𝚺~𝐕𝐕212)22q𝔼𝚺~𝐕𝐕22q(2ηq2dΔ2p𝐄2ψ1)2q.\mathbb{P}\left(\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\frac{1}{2}\right)\leq 2^{2q}{\mathbb{E}}\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2q}\lesssim\left(2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\right)^{2q}.

Thus by previous results and triangle inequality we have

(𝔼|ρ(𝐕~F,𝐕)|2)1/2(𝔼𝐕~F𝐕~F𝐕~𝐕~2F)1/2+(𝔼𝐕~𝐕~𝐕𝐕2F)1/2\displaystyle\left({\mathbb{E}}\big{|}\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})\big{|}^{2}\right)^{1/2}\lesssim\left({\mathbb{E}}\|\widetilde{\mathbf{V}}^{\rm F}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|^{2}_{\rm F}\right)^{1/2}+\left({\mathbb{E}}\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|^{2}_{\rm F}\right)^{1/2}
KΔr1(d)+KdΔ2pLr1(d)+Kdp(2ηq2dΔ2pr1(d))q.\displaystyle\quad\lesssim{\frac{\sqrt{K}}{\Delta}r_{1}(d)}\!+\!{\!\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)}\!+\!\sqrt{\frac{Kd}{p^{\prime}}}\left(\!\!2\eta q^{2}\sqrt{\frac{d}{\Delta^{2}p}}r_{1}(d)\!\!\!\right)^{q}.
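Before turning to the examples, the bound above can be checked numerically. The following is a minimal sketch (Python/NumPy) of the estimator as it appears in this proof: the top-$K$ left singular vectors $\widehat{\mathbf{V}}^{(\ell)}$ of each sketch $\widehat{\mathbf{M}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}$, the aggregated projection $\widetilde{\bm{\Sigma}}=L^{-1}\sum_{\ell}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}$, and $\widetilde{\mathbf{V}}^{\text{F}}$ extracted from $\widetilde{\bm{\Sigma}}^{q}\mathbf{\Omega}^{\text{F}}$. The dimensions, eigenvalues, and noise level are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions and signal strengths (assumptions for this sketch only).
d, K, p, p_prime, L = 400, 3, 40, 40, 20
q = int(np.log(d))

# Low-rank target M = V Lambda V^T observed with symmetric noise.
V, _ = np.linalg.qr(rng.standard_normal((d, K)))
Lam = np.diag([30.0, 25.0, 20.0])
G = rng.standard_normal((d, d))
M_hat = V @ Lam @ V.T + (G + G.T) / np.sqrt(2 * d)

# Step 1: L parallel p-dimensional Gaussian sketches; average the K-dim projections.
Sigma_tilde = np.zeros((d, d))
for _ in range(L):
    Omega_l = rng.standard_normal((d, p))
    U_l, _, _ = np.linalg.svd(M_hat @ Omega_l / np.sqrt(p), full_matrices=False)
    V_l = U_l[:, :K]
    Sigma_tilde += V_l @ V_l.T / L

# Step 2: power iteration on Sigma_tilde with a fresh p'-dimensional sketch Omega^F.
Omega_F = rng.standard_normal((d, p_prime))
U_F, _, _ = np.linalg.svd(np.linalg.matrix_power(Sigma_tilde, q) @ Omega_F,
                          full_matrices=False)
V_fadi = U_F[:, :K]

# Projection-distance error, compared with the full eigen-decomposition of M_hat.
rho_fadi = np.linalg.norm(V_fadi @ V_fadi.T - V @ V.T, "fro")
V_pca = np.linalg.eigh(M_hat)[1][:, -K:]
rho_pca = np.linalg.norm(V_pca @ V_pca.T - V @ V.T, "fro")
print(f"FADI error: {rho_fadi:.3f}, full PCA error: {rho_pca:.3f}")
```

In this toy configuration, with $Lp\geq d$, the two printed errors should be of the same order, in line with the rate derived above.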

B.3 Proof of Corollary 4.2

The case-specific error rates can be calculated by computing r1(d)r_{1}(d) and studying the proper value of qq for each example.

\bullet Example 1: we know that $\mathbf{E}=\widehat{\bm{\Sigma}}-\bm{\Sigma}+(\sigma^{2}-\widehat{\sigma}^{2})\mathbf{I}$. Now consider the $K^{\prime}\times K^{\prime}$ submatrix of $\bm{\Sigma}$ corresponding to the index set $S$, which we denote by $\mathbf{\Sigma}_{S}=\bm{\Sigma}_{[S,S]}$. We have $\mathbf{\Sigma}_{S}=\sigma^{2}\mathbf{I}_{K^{\prime}}+(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}$, where $(\mathbf{V})_{[S,:]}$ is the submatrix of $\mathbf{V}$ composed of the rows in $S$. Then since $(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}\succeq\mathbf{0}$ and $\operatorname{rank}\big((\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}\big)\leq K$, we know that $\sigma_{\min}(\mathbf{\Sigma}_{S})=\sigma^{2}$. By Weyl’s inequality [19], we know $|\sigma^{2}-\widehat{\sigma}^{2}|\leq\|\widehat{\mathbf{\Sigma}}_{S}-\mathbf{\Sigma}_{S}\|_{2}\leq\|\widehat{\mathbf{\Sigma}}-\mathbf{\Sigma}\|_{2}$. Thus we have $\|\mathbf{E}\|_{2}\leq\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}+|\sigma^{2}-\widehat{\sigma}^{2}|\leq 2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}$. Then by Lemma 3 in Fan et al. [18], there exists some constant $c\geq 1$ such that for any $t\geq 0$,

(𝐄2t)(2𝚺^𝚺2t)exp(t2c(λ1+σ2)r/n),\displaystyle\mathbb{P}(\|\mathbf{E}\|_{2}\geq t)\leq\mathbb{P}(2\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}\geq t)\leq\exp(-\frac{t}{2c(\lambda_{1}+\sigma^{2})\sqrt{r/n}}),

where r=tr(𝚺)/𝚺2r=\operatorname{tr}(\bm{\Sigma})/\|\bm{\Sigma}\|_{2} is the effective rank of 𝚺\bm{\Sigma}. Thus we can see that 𝐄2\|\mathbf{E}\|_{2} is sub-exponential with

𝐄2ψ1𝚺^𝚺2ψ1(λ1+σ2)rn,\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\|\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{2}\|_{\psi_{1}}\lesssim(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}},

and hence we can take r1(d)=(λ1+σ2)rnr_{1}(d)=(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}. When nC(dr/p)κ12(logd)4n\geq C(dr/{p})\kappa_{1}^{2}(\log d)^{4}, by Theorem 4.1 we have

(𝔼|ρ(𝐕~F,𝐕)|2)1/2κ1Krn+κ1KdrnpL+Kdp(ηq2κ1drnp)q,\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})|^{2}\right)^{1/2}\lesssim\kappa_{1}\sqrt{\frac{Kr}{n}}+\kappa_{1}\sqrt{\frac{Kdr}{npL}}+\sqrt{\frac{Kd}{p^{\prime}}}\left(\eta q^{2}\kappa_{1}\sqrt{\frac{dr}{np}}\right)^{q},

where the third term will be dominated by the first bias term when taking q=logdq=\log d, and hence (3) holds.
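As a small numerical sanity check of the variance-estimation step used in this example, the sketch below (Python/NumPy) verifies that the smallest eigenvalue of a $(K+1)\times(K+1)$ principal submatrix of the sample covariance recovers $\sigma^{2}$, since the low-rank part restricted to $K+1$ coordinates has rank at most $K$. The model parameters and the particular choice of the index set $S$ are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative spiked-covariance parameters (assumptions for this sketch only).
d, K, n, sigma2 = 200, 3, 5000, 1.0
V, _ = np.linalg.qr(rng.standard_normal((d, K)))
Lam = np.diag([8.0, 6.0, 4.0])
Sigma = V @ Lam @ V.T + sigma2 * np.eye(d)

# Draw n samples and form the sample covariance matrix.
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
Sigma_hat = X.T @ X / n

# The low-rank part of Sigma restricted to K+1 coordinates has rank at most K,
# so the smallest eigenvalue of Sigma_S equals sigma^2; by Weyl's inequality the
# sample version differs from sigma^2 by at most ||Sigma_hat_S - Sigma_S||_2.
S = np.arange(K + 1)  # illustrative choice of the index set S
sigma2_hat = np.linalg.eigvalsh(Sigma_hat[np.ix_(S, S)]).min()
print(f"true sigma^2 = {sigma2}, estimate = {sigma2_hat:.3f}")
```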

\bullet Example 2: Under the problem settings we know that 𝐄=𝐌^𝐌=𝐗𝔼𝐗\mathbf{E}=\widehat{\mathbf{M}}-\mathbf{M}=\mathbf{X}-{\mathbb{E}}\mathbf{X}. For the eigenvalues of 𝐌\mathbf{M}, under the given conditions we know that

σK(𝐌)θσK(𝐏)σK2(𝚷)dθ/K,σ1(𝐌)θσ1(𝐏)σ12(𝚷)Kdθ𝚷2,2Kdθ,\sigma_{K}(\mathbf{M})\gtrsim\theta\sigma_{K}(\mathbf{P})\sigma_{K}^{2}(\mathbf{\Pi})\gtrsim d\theta/K,\quad\sigma_{1}(\mathbf{M})\lesssim\theta\sigma_{1}(\mathbf{P})\sigma_{1}^{2}(\mathbf{\Pi})\lesssim Kd\theta\|\mathbf{\Pi}\|_{2,\infty}^{2}\leq Kd\theta,

where the last inequality is because for i[d]i\in[d], we have that

𝝅i2=(k=1K𝝅i(k)2)1/2(k=1K𝝅i(k))1/2=1and𝚷2,1.\|\bm{\pi}_{i}\|_{2}=\big{(}\sum_{k=1}^{K}\bm{\pi}_{i}(k)^{2}\big{)}^{1/2}\leq\big{(}\sum_{k=1}^{K}\bm{\pi}_{i}(k)\big{)}^{1/2}=1\quad\text{and}\quad\|\mathbf{\Pi}\|_{2,\infty}\leq 1.

Thus we know that Δdθ/K\Delta\gtrsim d\theta/K.

We then bound the entries of 𝐌\mathbf{M}. We know 𝐌ij=θiθjk=1Kk=1K𝝅i(k)𝝅j(k)𝐏kk\mathbf{M}_{ij}=\theta_{i}\theta_{j}\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})\mathbf{P}_{kk^{\prime}}, and thus we have that

𝐌ijθiθjk=1Kk=1K𝝅i(k)𝝅j(k)minkk(𝐏kk)\displaystyle\mathbf{M}_{ij}\geq\theta_{i}\theta_{j}\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})\min_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})
=θiθjminkk(𝐏kk)k=1Kk=1K𝝅i(k)𝝅j(k)=θiθjminkk(𝐏kk);\displaystyle=\theta_{i}\theta_{j}\min_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})=\theta_{i}\theta_{j}\min_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}});
𝐌ijθiθjk=1Kk=1K𝝅i(k)𝝅j(k)maxkk(𝐏kk)\displaystyle\mathbf{M}_{ij}\leq\theta_{i}\theta_{j}\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})\max_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})
=θiθjmaxkk(𝐏kk)k=1Kk=1K𝝅i(k)𝝅j(k)=θiθjmaxkk(𝐏kk).\displaystyle=\theta_{i}\theta_{j}\max_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}})\sum_{k=1}^{K}\sum_{k^{\prime}=1}^{K}\bm{\pi}_{i}(k)\bm{\pi}_{j}(k^{\prime})=\theta_{i}\theta_{j}\max_{kk^{\prime}}(\mathbf{P}_{kk^{\prime}}).

Thus we can see that 𝐌ijθ\mathbf{M}_{ij}\asymp\theta, maxij𝔼(𝐄ij2)θ\max_{ij}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim\theta and maxij𝔼(𝐄ij2)dθ\max_{i}\sum_{j}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim d\theta. By Theorem 3.1.4 in [12], we know that there exists some constant c>0c>0 such that for any t>0t>0,

{𝐄24dθ+t}dexp(t2/c).\mathbb{P}\{\|\mathbf{E}\|_{2}\geq 4\sqrt{d\theta}+t\}\leq d\exp\left(-t^{2}/c\right).

Also, since for t5dθt\geq 5\sqrt{d\theta}, there exists a constant c>0c>0 such that (𝐄2t)exp(t2/c)\mathbb{P}(\|\mathbf{E}\|_{2}\geq t)\leq\exp(-t^{2}/c), we have that 𝐄2ψ1dθ\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d\theta}, and hence we can take r1(d)=dθr_{1}(d)=\sqrt{d\theta}. Besides, p/dΔ/r1(d)=pθ/Kdϵ/2\sqrt{p/d}\Delta/r_{1}(d)=\sqrt{p\theta}/K\gtrsim d^{\epsilon/2}, and hence by Theorem 4.1 we have

(𝔼|ρ(𝐕~F,𝐕)|2)1/2KKdθ+KKpLθ+Kdp(ηq2Kpθ)q.\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})|^{2}\right)^{1/2}\lesssim K\sqrt{\frac{K}{d\theta}}+K\sqrt{\frac{K}{pL\theta}}+\sqrt{\frac{Kd}{p^{\prime}}}\left(\eta q^{2}\frac{K}{\sqrt{p\theta}}\right)^{q}.

When

q=logd1+2ϵ1>log(d/pdθ/K)log(p/ddθ/K),q=\log d\gg 1+2\epsilon^{-1}>\frac{\log\left(\sqrt{d/p^{\prime}}\sqrt{d\theta}/K\right)}{\log\left(\sqrt{p/d}\sqrt{d\theta}/K\right)},

the third term is negligible and (4) holds.

Remark 14.

It is worth noting that in Example 2, $\|\mathbf{E}\|_{2}$ actually concentrates faster than a generic sub-exponential random variable: $\|\mathbf{E}\|_{2}\lesssim\sqrt{d\theta}$ with probability at least $1-d^{-10}$, which we will take into account in later proofs.

Remark 15.

In the case where no self-loops are present, $\mathbf{E}$ is replaced by $\mathbf{E}^{\prime}=\mathbf{E}-\operatorname{diag}(\mathbf{X})=\mathbf{E}-\operatorname{diag}(\mathbf{E})-\operatorname{diag}(\mathbf{M})$. By similar arguments we can show that

𝐄2ψ1𝐄diag(𝐄)2ψ1+diag(𝐌)2dθ+θdθ,\|\|\mathbf{E}^{\prime}\|_{2}\|_{\psi_{1}}\lesssim\|\|\mathbf{E}-\operatorname{diag}(\mathbf{E})\|_{2}\|_{\psi_{1}}+\|\operatorname{diag}(\mathbf{M})\|_{2}\lesssim\sqrt{d\theta}+\theta\lesssim\sqrt{d\theta},
and𝐄2𝐄diag(𝐄)2+diag(𝐌)2dθ+θdθ,\text{and}\quad\|\mathbf{E}^{\prime}\|_{2}\lesssim\|\mathbf{E}-\operatorname{diag}(\mathbf{E})\|_{2}+\|\operatorname{diag}(\mathbf{M})\|_{2}\lesssim\sqrt{d\theta}+\theta\lesssim\sqrt{d\theta},

with probability at least 1d101-d^{-10}, and hence (4) also holds for the no-self-loops case.

\bullet Example 3: From the problem setting we know that we can represent 𝑾j\bm{W}_{j} as 𝑾j=k=1K𝕀{kj=k}𝜽k+𝒁j\bm{W}_{j}=\sum_{k=1}^{K}\mathbb{I}\{k_{j}=k\}\bm{\theta}_{k}+\bm{Z}_{j}, where 𝒁ji.i.d𝒩(𝟎,𝐈n)\bm{Z}_{j}\overset{\text{i.i.d}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{n}), j[d]j\in[d]. Denote 𝐙=(𝒁1,,𝒁d)\mathbf{Z}=(\bm{Z}_{1},\ldots,\bm{Z}_{d}), then it can be seen that 𝔼(𝐗𝐗)=𝔼(𝐗)𝔼(𝐗)+𝔼(𝐙𝐙)=𝐅𝚯𝚯𝐅+n𝐈d{\mathbb{E}}(\mathbf{X}^{\top}\mathbf{X})={\mathbb{E}}(\mathbf{X})^{\top}{\mathbb{E}}(\mathbf{X})+{\mathbb{E}}(\mathbf{Z}^{\top}\mathbf{Z})=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d}, and we can write

𝐄=𝐗𝐗𝔼(𝐗𝐗)=𝐅𝚯𝐙+𝐙𝚯𝐅+𝐙𝐙n𝐈d,\mathbf{E}=\mathbf{X}^{\top}\mathbf{X}-{\mathbb{E}}(\mathbf{X}^{\top}\mathbf{X})=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}+\mathbf{Z}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+\mathbf{Z}^{\top}\mathbf{Z}-n\mathbf{I}_{d},

then we know that 𝐄22𝐅𝚯𝐙2+n𝐙𝐙/n𝐈d2\|\mathbf{E}\|_{2}\leq 2\|\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}\|_{2}+n\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}. We consider 𝐅𝚯𝐙2\|\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}\|_{2} first. We know that 𝐙~:=𝚯𝐙=𝚯(𝒁1,,𝒁d)=(𝒁~1,,𝒁~d)K×d\widetilde{\mathbf{Z}}:=\mathbf{\Theta}^{\top}\mathbf{Z}=\mathbf{\Theta}^{\top}(\bm{Z}_{1},\ldots,\bm{Z}_{d})=(\widetilde{\bm{Z}}_{1},\ldots,\widetilde{\bm{Z}}_{d})\in\mathbb{R}^{K\times d}, where 𝒁~ji.i.d𝒩(𝟎,𝚯𝚯)\widetilde{\bm{Z}}_{j}\overset{\text{i.i.d}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{\Theta}^{\top}\mathbf{\Theta}). Under the given conditions we know that 𝚯𝚯2Δ02\|\mathbf{\Theta}^{\top}\mathbf{\Theta}\|_{2}\leq\Delta_{0}^{2}. Since (𝚯𝚯)1/2𝐙~(\mathbf{\Theta}^{\top}\mathbf{\Theta})^{-1/2}\widetilde{\mathbf{Z}} is a K×dK\times d i.i.d. Gaussian matrix, by Lemma B.2, we have that

𝐙~2ψ1(𝚯𝚯)1/22(𝚯𝚯)1/2𝐙~2ψ1Δ0d.\|\|\widetilde{\mathbf{Z}}\|_{2}\|_{\psi_{1}}\lesssim\|(\mathbf{\Theta}^{\top}\mathbf{\Theta})^{1/2}\|_{2}\|\|(\mathbf{\Theta}^{\top}\mathbf{\Theta})^{-1/2}\widetilde{\mathbf{Z}}\|_{2}\|_{\psi_{1}}\lesssim\Delta_{0}\sqrt{d}.

As for 𝐙𝐙/n𝐈d2\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}, when n>dn>d, by Lemma 3 in Fan et al., [18] we know that 𝐙𝐙/n𝐈d2ψ1d/n\|\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{d/n}, and hence in summary we have

𝐄2ψ1𝐅2𝐙~2ψ1+n𝐙𝐙/n𝐈d2ψ1Δ0d/K+nd,\displaystyle\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim\|\mathbf{F}\|_{2}\|\|\widetilde{\mathbf{Z}}\|_{2}\|_{\psi_{1}}+n\|\|\mathbf{Z}^{\top}\mathbf{Z}/n-\mathbf{I}_{d}\|_{2}\|_{\psi_{1}}\lesssim\Delta_{0}d/\sqrt{K}+\sqrt{nd},

and we can take r1(d)=Δ0d/K+ndr_{1}(d)=\Delta_{0}d/\sqrt{K}+\sqrt{nd}. We know that Δ=σmin(𝐅𝚯𝚯𝐅)dΔ02/K\Delta=\sigma_{\min}(\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top})\gtrsim d\Delta_{0}^{2}/K, and thus under the condition that Δ02CK(logd)2(d(logd)2/pn/p)\Delta_{0}^{2}\geq CK(\log d)^{2}\left(d(\log d)^{2}/p\vee\sqrt{n/p}\right) for some large enough constant C>0C>0, by Theorem 4.1 we have that

(𝔼|ρ(𝐕~F,𝐕)|2)1/2\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})|^{2}\right)^{1/2} (KΔ0+KΔ02Knd)+dpL(KΔ0+KΔ02Knd)\displaystyle\!\!\!\lesssim\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right)\!+\!\sqrt{\frac{d}{pL}}\left(\frac{K}{\Delta_{0}}\!+\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{Kn}{d}}\right)
+Kdp(ηq2(dKpΔ02+KΔ02np))q.\displaystyle+\!\!\sqrt{\frac{Kd}{p^{\prime}}}\left(\!\eta q^{2}\!\left(\!\!\!\sqrt{\frac{dK}{p\Delta_{0}^{2}}}+\!\!\frac{K}{\Delta_{0}^{2}}\sqrt{\frac{n}{p}}\right)\!\!\right)^{q}.

Now for the third term to be dominated by the bias term, we can take

q=logdlog(d/pp)loglogd+1log(d/pp)log(pdΔr1(d))+1,q=\log d\geq\frac{\log\left(d/\sqrt{pp^{\prime}}\right)}{\log\log d}+1\geq\frac{\log\left(d/\sqrt{pp^{\prime}}\right)}{\log\left(\sqrt{\frac{p}{d}}\frac{\Delta}{r_{1}(d)}\right)}+1,

and hence (5) holds.

Remark 16.

In fact we can derive a slightly sharper tail bound for the convergence rate of 𝐄2\|\mathbf{E}\|_{2}. More specifically, for any tΔ0dt\geq\Delta_{0}\sqrt{d}, by Lemma 3 in Fan et al., [18] there exists some constant c1c\geq 1 such that

(𝐙~2t)=(𝐙~𝐙~2t2)=(d𝐙~𝐙~/d𝚯𝚯+𝚯𝚯2t2)\displaystyle\mathbb{P}\big{(}\|\widetilde{\mathbf{Z}}\|_{2}\geq t\big{)}=\mathbb{P}\big{(}\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}\|_{2}\geq t^{2}\big{)}=\mathbb{P}\big{(}d\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}/d-\mathbf{\Theta}^{\top}\mathbf{\Theta}+\mathbf{\Theta}^{\top}\mathbf{\Theta}\|_{2}\geq t^{2}\big{)}
(d𝐙~𝐙~/d𝚯𝚯2t2d𝚯𝚯2)(𝐙~𝐙~/d𝚯𝚯2t2/dΔ02)\displaystyle\leq\mathbb{P}\big{(}d\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}/d-\mathbf{\Theta}^{\top}\mathbf{\Theta}\|_{2}\geq t^{2}-d\|\mathbf{\Theta}^{\top}\mathbf{\Theta}\|_{2}\big{)}\leq\mathbb{P}\big{(}\|\widetilde{\mathbf{Z}}\widetilde{\mathbf{Z}}^{\top}/d-\mathbf{\Theta}^{\top}\mathbf{\Theta}\|_{2}\geq t^{2}/d-\Delta_{0}^{2}\big{)}
exp(t2/dΔ02cΔ02K/d),\displaystyle\leq\exp\Big{(}-\frac{t^{2}/d-\Delta_{0}^{2}}{c\Delta_{0}^{2}\sqrt{K/d}}\Big{)},

which indicates that 𝐙~2Δ0d\|\widetilde{\mathbf{Z}}\|_{2}\lesssim\Delta_{0}\sqrt{d} with probability at least 1d101-d^{-10}. Hence under the condition that K/dlogd=O(1)\sqrt{K/d}\log d=O(1), with probability at least 1O(d10)1-O(d^{-10}) we have that 𝐄2dΔ0/K+dnlogd\|\mathbf{E}\|_{2}\lesssim d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d, which will be used as the statistical rate of 𝐄2\|\mathbf{E}\|_{2} in later proofs.

\bullet Example 4: We define ¯=[εij]\bar{\mathcal{E}}=[\varepsilon_{ij}], then 𝐌^=(1/θ^)𝒫𝒮(𝐌+¯)\widehat{\mathbf{M}}=(1/\widehat{\theta})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}}), where 𝒫𝒮\mathcal{P}_{{\mathcal{S}}} is the projection onto the subspace of matrices with non-zero entries only in 𝒮{\mathcal{S}}. Since 𝐌^\widehat{\mathbf{M}} and 𝐌^:=(θ^/θ)𝐌^=(1/θ)𝒫𝒮(𝐌+¯)\widehat{\mathbf{M}}^{\prime}:=(\widehat{\theta}/\theta)\widehat{\mathbf{M}}=({1}/{\theta})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}}) differ only by a positive factor, 𝐌^\widehat{\mathbf{M}} and 𝐌^\widehat{\mathbf{M}}^{\prime} share exactly the same sequence of eigenvectors and 𝐕~F\widetilde{\mathbf{V}}^{\text{F}} can be viewed as the output by applying FADI to 𝐌^\widehat{\mathbf{M}}^{\prime}. Thus we will establish the results for 𝐌^\widehat{\mathbf{M}}^{\prime} instead, and abuse the notation by denoting 𝐄:=𝐌^𝐌\mathbf{E}:=\widehat{\mathbf{M}}^{\prime}-\mathbf{M}. We first study the order of 𝐌max\|\mathbf{M}\|_{\max}. When 𝐕2,μK/d\|\mathbf{V}\|_{2,\infty}\leq\sqrt{\mu K/d} for some rate μ1\mu\geq 1 (that may change with dd), for any i,j[d]i,j\in[d], we have that

|𝐌ij|=|𝐞i𝐕𝚲(𝐞j𝐕)|𝚲2𝐞i𝐕2𝐞j𝐕2|λ1|𝐕2,2|λ1|μKd.|\mathbf{M}_{ij}|=|\mathbf{e}_{i}^{\top}\mathbf{V}\mathbf{\Lambda}(\mathbf{e}_{j}^{\top}\mathbf{V})^{\top}|\leq\|\mathbf{\Lambda}\|_{2}\|\mathbf{e}_{i}^{\top}\mathbf{V}\|_{2}\|\mathbf{e}_{j}^{\top}\mathbf{V}\|_{2}\leq|\lambda_{1}|\|\mathbf{V}\|_{2,\infty}^{2}\leq\frac{|\lambda_{1}|\mu K}{d}.

Thus we have 𝐌max=O(|λ1|μK/d)\|\mathbf{M}\|_{\max}=O({|\lambda_{1}|\mu K}/{d}). Also, we can write 𝐄=𝐄1+𝐄2\mathbf{E}=\mathbf{E}_{1}+\mathbf{E}_{2}, where (𝐄1)ij=𝐌ij(δijθ)/θ,(𝐄2)ij=εijδij/θ(\mathbf{E}_{1})_{ij}=\mathbf{M}_{ij}(\delta_{ij}-\theta)/\theta,(\mathbf{E}_{2})_{ij}=\varepsilon_{ij}\delta_{ij}/\theta, and for iji\leq j

Var((𝐄1)ij)=𝐌ij2(1θ)/θ𝐌max2/θ=O((λ1μK)2d2θ),Var((𝐄2)ij)=σ2/θ.\operatorname{Var}\big{(}(\mathbf{E}_{1})_{ij}\big{)}=\mathbf{M}_{ij}^{2}(1-\theta)/\theta\leq\|\mathbf{M}\|_{\max}^{2}/\theta=O\Big{(}\frac{(\lambda_{1}\mu K)^{2}}{d^{2}\theta}\Big{)},\quad\operatorname{Var}\big{(}(\mathbf{E}_{2})_{ij}\big{)}=\sigma^{2}/\theta.

It is not hard to see that Cov((𝐄1)ij,(𝐄2)ij)=0\operatorname*{\rm Cov}((\mathbf{E}_{1})_{ij},(\mathbf{E}_{2})_{ij})=0. Also, by the setting of Example 4 we have that |(𝐄1)ij|𝐌max/θ=O(|λ1|μKdθ)|(\mathbf{E}_{1})_{ij}|\leq\|\mathbf{M}\|_{\max}/\theta=O(\frac{|\lambda_{1}|\mu K}{d\theta}), and there exists a constant C>0C>0 independent of dd such that |(𝐄2)ij|Cσlogd/θ|(\mathbf{E}_{2})_{ij}|\leq C\sigma\log d/\theta for all iji\leq j. Then we will study 𝐄12\|\mathbf{E}_{1}\|_{2} and 𝐄22\|\mathbf{E}_{2}\|_{2} separately. We denote ν1=d𝐌max2/θ\nu_{1}=d\|\mathbf{M}\|_{\max}^{2}/\theta and ν2=dσ2/θ\nu_{2}=d\sigma^{2}/\theta. Under the condition that θd1/2+ϵ\theta\geq d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0, by Theorem 3.1.4 in Chen et al., [12], there exists constant c>0c>0 such that for any t4t\geq 4 we have

(𝐄122ν1t)\displaystyle\mathbb{P}\Big{(}\frac{\|\mathbf{E}_{1}\|_{2}}{2\sqrt{\nu_{1}}}\geq t\Big{)} (𝐄12/ν14+t)=(𝐄124ν1+tν1)\displaystyle\leq\mathbb{P}\big{(}\|\mathbf{E}_{1}\|_{2}/\sqrt{\nu_{1}}\geq 4+t\big{)}=\mathbb{P}\big{(}\|\mathbf{E}_{1}\|_{2}\geq 4\sqrt{\nu_{1}}+t\sqrt{\nu_{1}}\big{)}
dexp(t2d𝐌max2/θc𝐌max2/θ2)=exp(dθt2/c+logd)\displaystyle\leq d\exp\Big{(}-\frac{t^{2}d\|\mathbf{M}\|_{\max}^{2}/\theta}{c\|\mathbf{M}\|_{\max}^{2}/\theta^{2}}\Big{)}=\exp(-d\theta t^{2}/c+\log d)
exp(dθt22c)exp(t2).\displaystyle\leq\exp(-\frac{d\theta t^{2}}{2c})\leq\exp(-t^{2}).

Very similarly for 𝐄22\|\mathbf{E}_{2}\|_{2}, there exists c>0c^{\prime}>0 such that for any t4t\geq 4, we have

(𝐄222ν2t)(𝐄224ν2+tν2)dexp(t2dσ2/θcσ2(logd)2/θ2)\displaystyle\mathbb{P}\Big{(}\frac{\|\mathbf{E}_{2}\|_{2}}{2\sqrt{\nu_{2}}}\geq t\Big{)}\leq\mathbb{P}\big{(}\|\mathbf{E}_{2}\|_{2}\geq 4\sqrt{\nu_{2}}+t\sqrt{\nu_{2}}\big{)}\leq d\exp\Big{(}-\frac{t^{2}d\sigma^{2}/\theta}{c^{\prime}\sigma^{2}(\log d)^{2}/\theta^{2}}\Big{)}
=exp(dθt2c(logd)2+logd)exp(dθt22c(logd)2)exp(t2).\displaystyle=\exp\Big{(}-\frac{d\theta t^{2}}{c^{\prime}(\log d)^{2}}+\log d\Big{)}\leq\exp\Big{(}-\frac{d\theta t^{2}}{2c^{\prime}(\log d)^{2}}\Big{)}\leq\exp(-t^{2}).

Thus we can see that

𝐄2ψ1𝐄12ψ1+𝐄22ψ1ν1+ν2|λ1|μKdθ+dσ2θ.\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\leq\|\|\mathbf{E}_{1}\|_{2}\|_{\psi_{1}}+\|\|\mathbf{E}_{2}\|_{2}\|_{\psi_{1}}\lesssim\sqrt{\nu_{1}}+\sqrt{\nu_{2}}\lesssim\frac{|\lambda_{1}|\mu K}{\sqrt{d\theta}}+\sqrt{\frac{d\sigma^{2}}{\theta}}.

By Theorem 4.1, under the condition that p=Ω(d)p=\Omega(\sqrt{d}), σ/Δ(logd)2d1pθ\sigma/\Delta\ll(\log d)^{-2}d^{-1}\sqrt{p\theta} and κ2μKd1/4\kappa_{2}\mu K\ll d^{1/4}, it holds that

(𝔼|ρ(𝐕~F,𝐕)|22)1/2\displaystyle\left({\mathbb{E}}|\rho(\widetilde{\mathbf{V}}^{\text{F}},\mathbf{V})|_{2}^{2}\right)^{1/2} K(κ2μKdθ+dσ2Δ2θ)+KdpL(κ2μKdθ+dσ2Δ2θ)\displaystyle\!\!\lesssim\sqrt{K}\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right)\!\!+\!K\sqrt{\frac{d}{pL}}\!\!\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\!+\!\sqrt{\frac{d\sigma^{2}}{\Delta^{2}\theta}}\right)
+Kdp(ηq2(κ2μKpθ+d2σ2pΔ2θ))q.\displaystyle+\sqrt{\frac{Kd}{p^{\prime}}}\left(\eta q^{2}\left(\frac{\kappa_{2}\mu K}{\sqrt{p\theta}}+\sqrt{\frac{d^{2}\sigma^{2}}{p\Delta^{2}\theta}}\right)\right)^{q}.

Furthermore, the third term vanishes when q=logdq=\log d and (6) holds.

Remark 17.

Here we can also obtain a statistical rate for $\|\mathbf{E}\|_{2}$ sharper than the sub-exponential rate, which will be used in later proofs. Combining the above results, for any $t\geq 16\max(\sqrt{\nu_{1}},\sqrt{\nu_{2}})$ we have

(𝐄2t)\displaystyle\mathbb{P}\big{(}\|\mathbf{E}\|_{2}\geq t\big{)} (𝐄12t/2)+(𝐄22t/2)exp(dθt232cν1)+exp(dθt232c(logd)2ν2)\displaystyle\leq\mathbb{P}\big{(}\|\mathbf{E}_{1}\|_{2}\geq t/2\big{)}+\mathbb{P}\big{(}\|\mathbf{E}_{2}\|_{2}\geq t/2\big{)}\leq\exp(-\frac{d\theta t^{2}}{32c\nu_{1}})+\exp\Big{(}\!\!-\!\frac{d\theta t^{2}}{32c^{\prime}(\log d)^{2}\nu_{2}}\Big{)}
=exp(d2θ2t2C1(λ1μK)2)+exp(θ2t2C2(logd)2σ2),\displaystyle=\exp\Big{(}-\frac{d^{2}\theta^{2}t^{2}}{C_{1}(\lambda_{1}\mu K)^{2}}\Big{)}+\exp\Big{(}-\frac{\theta^{2}t^{2}}{C_{2}(\log d)^{2}\sigma^{2}}\Big{)},

where C1,C2>0C_{1},C_{2}>0 are constants. Thus 𝐄2|λ1|μKdθ+dσ2θ\|\mathbf{E}\|_{2}\lesssim\frac{|\lambda_{1}|\mu K}{\sqrt{d\theta}}+\sqrt{\frac{d\sigma^{2}}{\theta}} with probability at least 1d101-d^{-10}.
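As an illustration of the observation model in Example 4 and of the spectral rate derived above, the following minimal sketch (Python/NumPy; all parameter values are illustrative assumptions, and Gaussian noise is used purely for convenience) forms the rescaled observed matrix $\widehat{\mathbf{M}}=(1/\widehat{\theta})\mathcal{P}_{\mathcal{S}}(\mathbf{M}+\bar{\mathcal{E}})$ from partially observed noisy entries and compares $\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}$ with the order $|\lambda_{1}|\mu K/\sqrt{d\theta}+\sqrt{d\sigma^{2}/\theta}$ (the incoherence factor $\mu$ is dropped in the printed value).

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative matrix-completion parameters (assumptions for this sketch only).
d, K, theta, sigma = 500, 2, 0.3, 0.1
V, _ = np.linalg.qr(rng.standard_normal((d, K)))
Lam = np.diag([50.0, 40.0])
M = V @ Lam @ V.T

# Observe each entry of the symmetric matrix independently with probability theta.
mask = np.triu(rng.random((d, d)) < theta)
mask = mask | mask.T
eps = rng.normal(0.0, sigma, size=(d, d))
eps = np.triu(eps) + np.triu(eps, 1).T
theta_hat = mask[np.triu_indices(d)].mean()       # plug-in estimate of theta
M_hat = mask * (M + eps) / theta_hat              # (1/theta_hat) * P_S(M + noise)

# Spectral error of the rescaled observed matrix versus the order of the bound
# |lambda_1| mu K / sqrt(d theta) + sqrt(d sigma^2 / theta), with mu dropped here.
err = np.linalg.norm(M_hat - M, 2)
bound = Lam[0, 0] * K / np.sqrt(d * theta) + np.sqrt(d * sigma**2 / theta)
print(f"||M_hat - M||_2 = {err:.2f}, order of bound = {bound:.2f}")
```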

B.4 Proof of Theorem 4.3

We first bound the recovery probability of K^()\widehat{K}^{(\ell)} for each [L]\ell\in[L]. Recall that 𝐘^()/p=𝐕𝚲𝛀~()/p+𝐄𝛀()/p\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}=\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}+{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}, where 𝛀~()=𝐕𝛀()\widetilde{\mathbf{\Omega}}^{(\ell)}=\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}.

For the residual term 𝐄𝛀()/p{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}, by Lemma 3 in [18], under the condition that p/dlogd=o(1)\sqrt{p/d}\log d=o(1), with probability at least 1d101-d^{-10} we have 𝛀()/p22dp\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq 2\sqrt{\frac{d}{p}}. Denote by 𝒜𝐄\mathcal{A}_{\mathbf{E}} the event {𝐄210ce1r1(d)logd}\big{\{}\|{\mathbf{E}}\|_{2}\leq 10c_{e}^{-1}r_{1}(d)\log d\big{\}}, where ce>0c_{e}>0 is the constant defined in Remark 2. Then conditional on 𝒜𝐄\mathcal{A}_{\mathbf{E}}, we have that 𝐄𝛀()/p220ce1dpr1(d)logd\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq 20c_{e}^{-1}\sqrt{\frac{d}{p}}r_{1}(d)\log d with probability at least 1d101-d^{-10} for each [L]\ell\in[L]. Recall η0=480ce1dΔ2pr1(d)logd\eta_{0}=480c_{e}^{-1}\sqrt{\frac{d}{\Delta^{2}p}}r_{1}(d)\log d. From Proposition 10.4 in [20], we know that when p2Kp\geq 2K,

(σmin(𝛀~()/p)16η0)(σmin(𝛀~()/p)pK+1epη0)η0pK+12.\mathbb{P}\left(\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\leq\frac{1}{6}\sqrt{\eta_{0}}\right)\leq\mathbb{P}\left(\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\leq\frac{p-K+1}{{\rm e}p}\sqrt{\eta_{0}}\right)\leq\eta_{0}^{\frac{p-K+1}{2}}.

Therefore, with probability at least 1η0(pK+1)/21-\eta_{0}^{{(p-K+1)}/2},

σmin(𝐕𝚲𝛀~()/p)Δσmin(𝛀~()/p)>Δη0/62μ0.\sigma_{\min}\big{(}\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\geq\Delta\sigma_{\min}\big{(}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}>\Delta\sqrt{\eta_{0}}/6\geq 2\mu_{0}.

By Weyl’s inequality [19], we know that conditional on $\mathcal{A}_{\mathbf{E}}$, with probability at least $1-d^{-10}$, $\sigma_{K+1}(\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p})\leq\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq 20c_{e}^{-1}\sqrt{\frac{d}{p}}r_{1}(d)\log d=\Delta\eta_{0}/24\leq\mu_{0}$ for large enough $d$, which indicates that $\sigma_{k+1}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})<\sqrt{p}\mu_{0}$ for any $k\geq K$. For $k\leq K-1$, under the same event we have

σk+1(𝐘^())σp(𝐘^())σK(𝐘^())σp(𝐘^())σmin(𝐕𝚲𝛀~())2𝐄𝛀()2\displaystyle\sigma_{k+1}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})\geq\sigma_{K}(\widehat{\mathbf{Y}}^{(\ell)})-\sigma_{p}(\widehat{\mathbf{Y}}^{(\ell)})\geq\sigma_{\min}\big{(}\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}\big{)}-2\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}\|_{2}
>p(Δη0/6Δη0/12)p(Δη0/6Δη0/12)=Δpη0/12pμ0.\displaystyle\quad>\sqrt{p}(\Delta\sqrt{\eta_{0}}/6-\Delta\eta_{0}/12)\geq\sqrt{p}(\Delta\sqrt{\eta_{0}}/6-\Delta\sqrt{\eta_{0}}/12)=\Delta\sqrt{p\eta_{0}}/12\geq\sqrt{p}\mu_{0}.

Then we have

(K^()=K|𝒜𝐄)(σK(𝐘^())σp(𝐘^())>pμ0,σK+1(𝐘^())σp(𝐘^())pμ0|𝒜𝐄)\displaystyle\mathbb{P}\left(\widehat{K}^{(\ell)}=K\,\,\big{|}\,\mathcal{A}_{\mathbf{E}}\right)\geq\mathbb{P}\left(\!\!\sigma_{K}\!\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}-\sigma_{p}\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}>\sqrt{p}\mu_{0},\,\,\sigma_{K+1}\!\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}-\sigma_{p}\!\big{(}\!{\widehat{\mathbf{Y}}^{(\ell)}}\!\big{)}\leq\sqrt{p}\mu_{0}\,\,\Big{|}\,\mathcal{A}_{\mathbf{E}}\right)
(σmin(𝐕𝚲𝛀~()/p)Δη0/6,𝐄𝛀()/p2Δη0/24|𝒜𝐄)\displaystyle\quad\geq\mathbb{P}\left(\sigma_{\min}\big{(}\mathbf{V}\mathbf{\Lambda}\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big{)}\geq\Delta\sqrt{\eta_{0}}/6,\quad\|{\mathbf{E}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\leq\Delta\eta_{0}/24\,\Big{|}\,\mathcal{A}_{\mathbf{E}}\right)
1d10η0pK+12.\displaystyle\quad\geq 1-d^{-10}-\eta_{0}^{\frac{p-K+1}{2}}.

We know that conditional on 𝐄{\mathbf{E}}, 𝕀{K^()K|𝒜𝐄}\mathbb{I}\{\widehat{K}^{(\ell)}\neq K\,|\,\mathcal{A}_{\mathbf{E}}\} are i.i.d. Bernoulli variables with expectation pK:=(K^()K|𝒜𝐄)d10+η0pK+121/4p_{K}:=\mathbb{P}(\widehat{K}^{(\ell)}\neq K\,|\,\mathcal{A}_{\mathbf{E}})\leq d^{-10}+\eta_{0}^{\frac{p-K+1}{2}}\leq 1/4 and variance pK(1pK)pKp_{K}(1-p_{K})\leq p_{K}. Since the estimators {K^()}=1L\{\widehat{K}^{(\ell)}\}_{\ell=1}^{L} are all integers, we know that if K^K\widehat{K}\neq K, at least half of {K^()}=1L\{\widehat{K}^{(\ell)}\}_{\ell=1}^{L} are not equal to KK. Then by Hoeffding’s inequality, we have

(K^K)\displaystyle\mathbb{P}(\widehat{K}\!\neq\!K) (=1L𝕀{K^()K}pKLL4)=𝔼𝐄((=1L𝕀{K^()K}pKLL4|𝐄))\displaystyle\leq\mathbb{P}\!\left(\sum_{\ell=1}^{L}\mathbb{I}\left\{\widehat{K}^{(\ell)}\neq K\right\}\!-\!p_{K}L\!\geq\!\frac{L}{4}\!\right)\!=\!{\mathbb{E}}_{{\mathbf{E}}}\!\left(\!\mathbb{P}\bigg{(}\!\sum_{\ell=1}^{L}\mathbb{I}\left\{\widehat{K}^{(\ell)}\!\neq\!K\right\}\!-\!p_{K}L\geq\frac{L}{4}\,\big{|}\,{\mathbf{E}}\bigg{)}\!\right)
(𝒜𝐄)exp{(L/4)2/(2LpK)}+1(𝒜𝐄)\displaystyle\leq\mathbb{P}(\mathcal{A}_{\mathbf{E}})\exp\left\{-(L/4)^{2}/(2Lp_{K})\right\}+1-\mathbb{P}(\mathcal{A}_{\mathbf{E}})
exp{L/(32d10+32η0pK+12)}+O(d10).\displaystyle\leq\exp\left\{-L\big{/}\big{(}32d^{-10}+32\eta_{0}^{\frac{p-K+1}{2}}\big{)}\right\}+O(d^{-10}).

We know that 32d10(logd)132d^{-10}\leq(\log d)^{-1} for d2d\geq 2, and under the condition that η0(32logd)2pK+1\eta_{0}\leq(32\log d)^{-\frac{2}{p-K+1}} we have (K^K)exp(Llogd/2)+O(d10)d(L20)/2\mathbb{P}(\widehat{K}\neq K)\leq\exp(-L\log d/2)+O(d^{-10})\lesssim d^{-(L\wedge 20)/2}.
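A small simulation can illustrate the voting scheme analyzed above. The sketch below (Python/NumPy) estimates $K$ on each sketch by counting the singular values of $\widehat{\mathbf{Y}}^{(\ell)}$ whose gap to the smallest singular value exceeds $\sqrt{p}\mu_{0}$, and then aggregates the $L$ estimates by majority vote, consistent with the argument above. The threshold $\mu_{0}$ and the model parameters are illustrative assumptions chosen so that the signal and noise singular-value gaps are well separated.

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(3)

# Illustrative setting (assumptions for this sketch only); mu0 is chosen by hand.
d, K, p, L, mu0 = 300, 4, 30, 15, 15.0
V, _ = np.linalg.qr(rng.standard_normal((d, K)))
Lam = np.diag([100.0, 90.0, 80.0, 70.0])
G = rng.standard_normal((d, d))
M_hat = V @ Lam @ V.T + (G + G.T) / np.sqrt(2 * d)

# Per-sketch estimate: count singular values of Y^(l) = M_hat Omega^(l) whose gap
# to the smallest singular value exceeds sqrt(p) * mu0; aggregate by majority vote.
K_hats = []
for _ in range(L):
    s = np.linalg.svd(M_hat @ rng.standard_normal((d, p)), compute_uv=False)
    K_hats.append(int(np.sum(s - s[-1] > np.sqrt(p) * mu0)))

K_hat = Counter(K_hats).most_common(1)[0][0]
print(f"per-sketch estimates: {K_hats}, majority vote K_hat = {K_hat} (true K = {K})")
```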

B.5 Proof of Corollary 4.4

\bullet Example 1: From the proof of Corollary 4.2 we know that we can take $r_{1}(d)=(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\log d$. Then by plugging in each term we know that under the condition that $(\lambda_{1}+\sigma^{2})\left(d(np)^{-1/2}\log d\right)^{1/4}=o(1)$ and $\Delta\gg\left({\sigma^{-2}(np)^{-1/2}}d\log d\right)^{1/3}$, we have $\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12$. Besides, under the condition that $\kappa_{1}\sqrt{dr/(np)}(\log d)^{2}=o(1)$, we also have $\eta_{0}\leq(32\log d)^{-\frac{2}{p-K+1}}$. Thus the conditions for Theorem 4.3 are satisfied and we have $\widehat{K}=K$ with probability at least $1-O(d^{-(L\wedge 20)/2})$.

\bullet Example 2: We know from the proof of Corollary 4.2 that Δdθ/K\Delta\gtrsim d\theta/K. Also from Remark 14 we know that 𝐄2dθ\|\mathbf{E}\|_{2}\lesssim\sqrt{d\theta} with probability at least 1d101-d^{-10}, and thus we have η0d/(Δ2p)dθK/pθ1/dϵ1/2p\eta_{0}\asymp\sqrt{d/(\Delta^{2}p)}\sqrt{d\theta}\lesssim K/\sqrt{p\theta}\asymp 1/\sqrt{d^{\epsilon-1/2}p}, Δη0dθ/p\Delta\eta_{0}\asymp d\sqrt{\theta/p} and Δη0dθ3/4p1/4K1/2\Delta\sqrt{\eta_{0}}\gtrsim d\theta^{3/4}p^{-1/4}K^{-1/2}. Also recall from the proof of Corollary 4.2 that 𝔼(𝐌^ij)=𝐌ijθ{\mathbb{E}}(\widehat{\mathbf{M}}_{ij})=\mathbf{M}_{ij}\asymp\theta for any i,j[d]i,j\in[d], and hence d2ij𝐌ijθd^{-2}\sum_{i\leq j}\mathbf{M}_{ij}\asymp\theta. By Hoeffding’s inequality [21], we have that

(2d(d1)|ij𝐌^ijij𝐌ij|11logdd)exp(11d(d1)logd/d2)d10.\mathbb{P}\left(\frac{2}{d(d-1)}\left|\sum_{i\leq j}\widehat{\mathbf{M}}_{ij}-\sum_{i\leq j}\mathbf{M}_{ij}\right|\geq\frac{\sqrt{11\log d}}{d}\right)\lesssim\exp\left(-11d(d-1)\log d/d^{2}\right)\lesssim d^{-10}.

Thus we can see with probability at least 1O(d10)1-O(d^{-10}), |θ^d2ij𝐌ij|logdd|\widehat{\theta}-d^{-2}\sum_{i\leq j}\mathbf{M}_{ij}|\lesssim\frac{\sqrt{\log d}}{d} and θ^θ\widehat{\theta}\asymp\theta, and in turn Δη0/24μ0Δη0/12\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12. Thus by Theorem 4.3 the claim follows.

\bullet Example 3: We know from the proof of Corollary 4.2 and Remark 16 that ΔdΔ02/K\Delta\gtrsim d\Delta_{0}^{2}/K and 𝐄2dΔ0/K+dnlogd\|\mathbf{E}\|_{2}\lesssim d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d with probability at least 1d101-d^{-10}. Thus we have η0d/(Δ2p)(dΔ0/K+dnlogd)\eta_{0}\asymp\sqrt{d/(\Delta^{2}p)}\big{(}d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d\big{)}. Under the condition that K(logd)3(n/p)1/4Δ0nK/dlogd\sqrt{K(\log d)^{3}}\left(n/p\right)^{1/4}\ll\Delta_{0}\ll\sqrt{nK/d}\log d, we know that dΔ0/K+dnlogddnlogdd\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d\lesssim\sqrt{dn}\log d, Δη0dn/plogd\Delta\eta_{0}\asymp d\sqrt{n/p}\log d and η0logd=o(1)\sqrt{\eta_{0}}\log d=o(1), and thus Δη0/24μ0Δη0/12\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12. By Theorem 4.3 the claim follows.

\bullet Example 4: By Hoeffding’s inequality [21], with probability at least 1d101-d^{-10} we have that |θ^θ|/θ^Clogd/dθ|\widehat{\theta}-\theta|/\widehat{\theta}\leq C\sqrt{\log d}/d\theta. As for σ^02\widehat{\sigma}_{0}^{2}, we have

σ^02=1|𝒮|(i,j)𝒮(θ^𝐌^ij)2=1|𝒮|(ijδij𝐌ij2+2(i,j)𝒮𝐌ijεij+(i,j)𝒮εij2).\displaystyle\widehat{\sigma}_{0}^{2}=\frac{1}{|{\mathcal{S}}|}\sum_{(i,j)\in{\mathcal{S}}}(\widehat{\theta}\widehat{\mathbf{M}}_{ij})^{2}=\frac{1}{|{\mathcal{S}}|}\left(\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}+2\sum_{(i,j)\in{\mathcal{S}}}\mathbf{M}_{ij}\varepsilon_{ij}+\sum_{(i,j)\in{\mathcal{S}}}\varepsilon_{ij}^{2}\right).

We consider the latter two terms first. We know that |εij|Cσlogd|\varepsilon_{ij}|\leq C\sigma\log d for some constant C>0C>0 and |𝐌ij||λ1|μK/d|\mathbf{M}_{ij}|\leq|\lambda_{1}|\mu K/d, for any iji\leq j. Denote by σ~=(|λ1|μK/d)σ\widetilde{\sigma}=(|\lambda_{1}|\mu K/d)\vee\sigma, then we have

Var(Mijεij)(|λ1|μKd)2σ2σ~4,|Mijεij||λ1|μKdCσlogdCσ~2logd,ij,\operatorname{{\rm Var}}({\mathrm{M}}_{ij}\varepsilon_{ij})\leq(\frac{|\lambda_{1}|\mu K}{d})^{2}\sigma^{2}\leq\widetilde{\sigma}^{4},\quad|{\mathrm{M}}_{ij}\varepsilon_{ij}|\leq\frac{|\lambda_{1}|\mu K}{d}C\sigma\log d\leq C\widetilde{\sigma}^{2}\log d,\quad\forall i\leq j,

and

Var(εij2)C4σ4(logd)4C4σ~4(logd)4,|εij2|C2σ2(logd)2C2σ~2(logd)2,ij.\operatorname{{\rm Var}}(\varepsilon_{ij}^{2})\leq C^{4}\sigma^{4}(\log d)^{4}\leq C^{4}\widetilde{\sigma}^{4}(\log d)^{4},\quad|\varepsilon_{ij}^{2}|\leq C^{2}\sigma^{2}(\log d)^{2}\leq C^{2}\widetilde{\sigma}^{2}(\log d)^{2},\quad\forall i\leq j.

Thus by Bernstein inequality [9], conditional on 𝒮{\mathcal{S}}, with probability at least 12d101-2d^{-10} we have that there exists a constant C>0C^{\prime}>0 independent of 𝒮{\mathcal{S}} such that

|1|𝒮|(i,j)𝒮𝐌ijεij|C(σ~2logd|𝒮|+σ~2(logd)2|𝒮|),\left|\frac{1}{|{\mathcal{S}}|}\sum_{(i,j)\in{\mathcal{S}}}\mathbf{M}_{ij}\varepsilon_{ij}\right|\leq C^{\prime}\left(\frac{\widetilde{\sigma}^{2}\sqrt{\log d}}{\sqrt{|{\mathcal{S}}|}}+\frac{\widetilde{\sigma}^{2}(\log d)^{2}}{|{\mathcal{S}}|}\right), (B.28)

and

|1|𝒮|(i,j)𝒮εij2σ2|C(σ~2(logd)5/2|𝒮|+σ~2(logd)3|𝒮|).\left|\frac{1}{|{\mathcal{S}}|}\sum_{(i,j)\in{\mathcal{S}}}\varepsilon_{ij}^{2}-\sigma^{2}\right|\leq C^{\prime}\left(\frac{\widetilde{\sigma}^{2}(\log d)^{5/2}}{\sqrt{|{\mathcal{S}}|}}+\frac{\widetilde{\sigma}^{2}(\log d)^{3}}{|{\mathcal{S}}|}\right). (B.29)

Now we consider the first term. Since δij\delta_{ij}’s are i.i.d. Bernoulli random variables with expectation θ\theta, we have

Var(𝐌ij2δij)θσ~4,|𝐌ij2δij|σ~2,ij.\operatorname{{\rm Var}}(\mathbf{M}_{ij}^{2}\delta_{ij})\leq\theta\widetilde{\sigma}^{4},\quad|\mathbf{M}_{ij}^{2}\delta_{ij}|\leq\widetilde{\sigma}^{2},\quad i\leq j.

Also, we know that ij𝐌ij2𝐌F2/2KΔ2/2\sum_{i\leq j}\mathbf{M}_{ij}^{2}\geq\|\mathbf{M}\|_{\text{F}}^{2}/2\geq K\Delta^{2}/2 and ij𝐌ij2𝐌F2Kλ12\sum_{i\leq j}\mathbf{M}_{ij}^{2}\leq\|\mathbf{M}\|_{\text{F}}^{2}\leq K\lambda_{1}^{2}, and hence KΔ2θ/2𝔼(ijδij𝐌ij2)Kλ12θK\Delta^{2}\theta/2\leq{\mathbb{E}}\Big{(}\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}\Big{)}\leq K\lambda_{1}^{2}\theta. Then by Bernstein inequality [9] with probability at least 1d101-d^{-10}, it holds that

|(ijδij𝐌ij2)𝔼(ijδij𝐌ij2)|d2(σ~2θlogdd+σ~2logdd2)=σ~2(dθlogd+logd).\left|\Big{(}\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}\Big{)}-{\mathbb{E}}\Big{(}\sum_{i\leq j}\delta_{ij}\mathbf{M}_{ij}^{2}\Big{)}\right|\lesssim d^{2}\left(\frac{\widetilde{\sigma}^{2}\sqrt{\theta\log d}}{d}+\frac{\widetilde{\sigma}^{2}\log d}{d^{2}}\right)=\widetilde{\sigma}^{2}(d\sqrt{\theta\log d}+\log d). (B.30)

Thus combining (B.28), (B.29) and (B.30) with the fact that |𝒮|d2θ|{\mathcal{S}}|\asymp d^{2}\theta with probability at least 1d101-d^{-10}, under the condition that κ22μ2K(logd)2\kappa_{2}^{2}\mu^{2}K\ll(\log d)^{2}, with probability at least 1O(d10)1-O(d^{-10}) we have

σ~(ΔKlogddσ)+o(σ~)σ^0logd(|λ1|Klogddσ)+o(σ~)σ~logd.\widetilde{\sigma}\ll\left(\frac{\Delta\sqrt{K}\log d}{d}\vee\sigma\right)+o(\widetilde{\sigma})\lesssim\widehat{\sigma}_{0}\log d\lesssim\left(\frac{|\lambda_{1}|\sqrt{K}\log d}{d}\vee\sigma\right)+o(\widetilde{\sigma})\lesssim\widetilde{\sigma}\log d.

From the proof of Corollary 4.2 and Remark 17, we know that with probability at least 1d101-d^{-10},

𝐌^𝐌2θ^θ𝐌^𝐌^2+θ^θ𝐌^𝐌2|λ1|logddθ+|λ1|μKdθ+dσ2θdσ~2θ,\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}\lesssim\left\|\frac{\widehat{\theta}}{\theta}\widehat{\mathbf{M}}-\widehat{\mathbf{M}}\right\|_{2}+\left\|\frac{\widehat{\theta}}{\theta}\widehat{\mathbf{M}}-\mathbf{M}\right\|_{2}\lesssim\frac{|\lambda_{1}|\sqrt{\log d}}{d\theta}+\frac{|\lambda_{1}|\mu K}{\sqrt{d\theta}}+\sqrt{\frac{d\sigma^{2}}{\theta}}\lesssim\sqrt{\frac{d\widetilde{\sigma}^{2}}{\theta}},

and hence η0dσ~(Δpθ)1\eta_{0}\asymp d\widetilde{\sigma}(\Delta\sqrt{p\theta})^{-1} and Δη0dσ~/pθ\Delta\eta_{0}\asymp d\widetilde{\sigma}/\sqrt{p\theta}.

Under the condition that (pθ)1/4dσ/Δlogd=o(1)(p\theta)^{-1/4}\sqrt{d\sigma/\Delta}\log d=o(1), with probability at least 1O(d10)1-O(d^{-10}) we have Δη0/24μ0Δη0/12\Delta\eta_{0}/24\ll\mu_{0}\ll\Delta\sqrt{\eta_{0}}/12. Thus by Theorem 4.3 the claim follows.

B.6 Proof of Theorem 4.10

We first decompose 𝐕~F𝐇𝐕=𝐕~F𝐇𝐕~𝐇0+𝐕~𝐇0𝐕\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}=\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}+\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}, and we consider the term 𝐕~𝐇0𝐕\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V} first.

By Lemma 8 in Fan et al., [18], we have that 𝐕~𝐇0𝐕𝐏(𝚺~𝐕𝐕)𝐕2𝚺~𝐕𝐕2𝐏(𝚺~𝐕𝐕)𝐕2\|\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}-\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\|_{2}\lesssim\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\|\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\|_{2}. Note that in Lemma 8 of Fan et al., [18], the norm is Frobenius norm rather than operator norm, and the modification from Frobenius norm to operator norm is trivial and hence omitted. We first study the leading term 𝐏(𝚺~𝐕𝐕)𝐕=1L=1L𝐏(𝐕^()𝐕^()𝐕𝐕)𝐕\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}.

For a given [L]\ell\in[L], we know that 𝐕^()\widehat{\mathbf{V}}^{(\ell)} is the top KK left singular vectors of 𝐘^()=𝐌^𝛀()/p=𝐕𝚲𝐕𝛀()/p+𝐄𝛀()/p=𝐘()+()\widehat{\mathbf{Y}}^{(\ell)}=\widehat{\mathbf{M}}\mathbf{\Omega}^{(\ell)}/\sqrt{p}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p}+\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}={\mathbf{Y}}^{(\ell)}+\mathbf{\mathcal{E}}^{(\ell)}, where

𝐘()=𝐕𝚲𝐕𝛀()/pand()=𝐄𝛀()/p.\mathbf{Y}^{(\ell)}=\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\quad\text{and}\quad\mathbf{\mathcal{E}}^{(\ell)}=\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}.

By the “symmetric dilation” trick, we denote

𝒮(𝐘^())=(𝟎𝐘^()𝐘^()𝟎),𝒮(𝐘())=(𝟎𝐘()𝐘()𝟎),{\mathcal{S}}(\widehat{\mathbf{Y}}^{(\ell)})=\begin{pmatrix}\mathbf{0}&\widehat{\mathbf{Y}}^{(\ell)}\\ \widehat{\mathbf{Y}}^{(\ell)\top}&\mathbf{0}\end{pmatrix},\quad{\mathcal{S}}({\mathbf{Y}}^{(\ell)})=\begin{pmatrix}\mathbf{0}&{\mathbf{Y}}^{(\ell)}\\ {\mathbf{Y}}^{(\ell)\top}&\mathbf{0}\end{pmatrix},
and𝒮(())=𝒮(𝐘^())𝒮(𝐘())=(𝟎𝐄𝛀()/p𝛀()𝐄/p𝟎).\text{and}\quad{\mathcal{S}}(\mathbf{\mathcal{E}}^{(\ell)})={\mathcal{S}}(\widehat{\mathbf{Y}}^{(\ell)})-{\mathcal{S}}({\mathbf{Y}}^{(\ell)})=\begin{pmatrix}\mathbf{0}&\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\ \mathbf{\Omega}^{(\ell)\top}\mathbf{E}/\sqrt{p}&\mathbf{0}\end{pmatrix}.
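As a quick check of the symmetric dilation trick, the following minimal sketch (Python/NumPy; the dimensions are illustrative assumptions) verifies that the nonzero eigenvalues of $\mathcal{S}(\mathbf{Y})$ are $\pm\sigma_{i}(\mathbf{Y})$ with eigenvectors $(\mathbf{u}_{i}^{\top},\pm\mathbf{v}_{i}^{\top})^{\top}/\sqrt{2}$, which is the structure exploited in the eigen-decomposition of $\mathcal{S}(\mathbf{Y}^{(\ell)})$ displayed next.

```python
import numpy as np

rng = np.random.default_rng(4)

# Small rectangular example (dimensions are illustrative assumptions).
d, p = 6, 4
Y = rng.standard_normal((d, p))
U, s, Vt = np.linalg.svd(Y, full_matrices=False)

# Symmetric dilation S(Y) = [[0, Y], [Y^T, 0]].
S = np.block([[np.zeros((d, d)), Y], [Y.T, np.zeros((p, p))]])

# Its nonzero eigenvalues are +/- sigma_i(Y), with eigenvectors (u_i; +/- v_i)/sqrt(2).
for i in range(p):
    w_plus = np.concatenate([U[:, i], Vt[i]]) / np.sqrt(2)
    w_minus = np.concatenate([U[:, i], -Vt[i]]) / np.sqrt(2)
    assert np.allclose(S @ w_plus, s[i] * w_plus)
    assert np.allclose(S @ w_minus, -s[i] * w_minus)
print("symmetric dilation eigenstructure verified")
```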

We let 𝚪K()𝚲K()𝐔K()\mathbf{\Gamma}_{K}^{(\ell)}{\mathbf{\Lambda}}_{K}^{(\ell)}{\mathbf{U}}_{K}^{(\ell)\top} be the SVD of 𝐘(){\mathbf{Y}}^{(\ell)}, and we know that with probability 1 we have 𝚪K()=𝐕𝐎𝛀()\mathbf{\Gamma}_{K}^{(\ell)}=\mathbf{V}\mathbf{O}_{\mathbf{\Omega}^{(\ell)}}, where 𝐎𝛀()\mathbf{O}_{\mathbf{\Omega}^{(\ell)}} is an orthonormal matrix depending on 𝛀()\mathbf{\Omega}^{(\ell)}. It is not hard to verify that the eigen-decomposition of 𝒮(𝐘()){\mathcal{S}}({\mathbf{Y}}^{(\ell)}) is:

𝒮(𝐘())=12(𝚪K()𝚪K()𝐔K()𝐔K())(𝚲K()𝟎𝟎𝚲K())12(𝚪K()𝚪K()𝐔K()𝐔K()),{\mathcal{S}}({\mathbf{Y}}^{(\ell)})=\frac{1}{\sqrt{2}}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}\cdot\begin{pmatrix}{\mathbf{\Lambda}}_{K}^{(\ell)}&\mathbf{0}\\ \mathbf{0}&-{\mathbf{\Lambda}}_{K}^{(\ell)}\end{pmatrix}\cdot\frac{1}{\sqrt{2}}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top},

where 𝚲K()=diag(λ1(),,λK()){\mathbf{\Lambda}}_{K}^{(\ell)}=\operatorname{diag}(\lambda_{1}^{(\ell)},\ldots,\lambda_{K}^{(\ell)}). First we study the eigengap σmin(𝚲K())=λK()\sigma_{\min}({\mathbf{\Lambda}}_{K}^{(\ell)})=\lambda_{K}^{(\ell)}. Recall 𝛀~()=𝐕𝛀()K×p\widetilde{\mathbf{\Omega}}^{(\ell)}=\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}\in\mathbb{R}^{K\times p}, and it can be seen that the entries of 𝛀~()\widetilde{\mathbf{\Omega}}^{(\ell)} are i.i.d. standard Gaussian. By Lemma 3 in Fan et al., [18], we know that with probability at least 1d101-d^{-10}, we have that 𝛀~()𝛀~()/p𝐈K2Kplogd\|\widetilde{\mathbf{\Omega}}^{(\ell)}\widetilde{\mathbf{\Omega}}^{(\ell)\top}/p-\mathbf{I}_{K}\|_{2}\lesssim\sqrt{\frac{K}{p}}\log d, and thus σmin(𝛀~()/p)1O(Kplogd)\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\geq 1-O(\sqrt{\frac{K}{p}}\log d) with probability at least 1d101-d^{-10}. Thus under the condition that Kplogd=o(1)\sqrt{\frac{K}{p}}\log d=o(1), under the same high probability event we have that σmin(𝚲K())Δ/2\sigma_{\min}({\mathbf{\Lambda}}_{K}^{(\ell)})\geq\Delta/2. Now we let 𝐔^K()\widehat{\mathbf{U}}_{K}^{(\ell)} be the top KK right singular vectors of 𝐘^()\widehat{\mathbf{Y}}^{(\ell)}. For j[K]j\in[K] we define

\mathbf{G}_{j}^{(\ell)}=\frac{1}{2}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ -\mathbf{U}_{K}^{(\ell)}\end{pmatrix}(-\mathbf{\Lambda}_{K}^{(\ell)}-\lambda_{j}^{(\ell)}\mathbf{I}_{K})^{-1}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ -\mathbf{U}_{K}^{(\ell)}\end{pmatrix}^{\top}-\frac{1}{\lambda_{j}^{(\ell)}}\bigg\{\mathbf{I}_{d+p}-\frac{1}{2}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top}\bigg\}.

Then we have 𝐆j()21/λK()2/Δ\|\mathbf{G}_{j}^{(\ell)}\|_{2}\leq 1/\lambda_{K}^{(\ell)}\leq 2/\Delta with probability at least 1d101-d^{-10}. Correspondingly we define the linear mapping

f:(d+p)×K(d+p)×K,(𝐰1,,𝐰K)(𝐆1()𝐰1,,𝐆K()𝐰K),f:\mathbb{R}^{(d+p)\times K}\rightarrow\mathbb{R}^{(d+p)\times K},\quad\left(\mathbf{w}_{1},\cdots,\mathbf{w}_{K}\right)\mapsto\left(-\mathbf{G}_{1}^{(\ell)}\mathbf{w}_{1},\cdots,-\mathbf{G}_{K}^{(\ell)}{\mathbf{w}}_{K}\right),

and denote 𝚪~K()=(𝚪K()𝐔K())\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}=\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ \mathbf{U}_{K}^{(\ell)}\end{pmatrix}. By Lemma 8 in Fan et al., [18], under the condition that 𝒮(())2/Δ=o(1)\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\|_{2}/\Delta=o(1) we have

(𝐕^()𝐔^K())(𝐕^(),𝐔^K())𝚪~K()𝚪~K()f(𝒮(())𝚪~K())𝚪~K()𝚪~K()f(𝒮(())𝚪~K())2\displaystyle\bigg{\|}\begin{pmatrix}\widehat{\mathbf{V}}^{(\ell)}\\ \widehat{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}(\widehat{\mathbf{V}}^{(\ell)\top},\widehat{\mathbf{U}}_{K}^{(\ell)\top})-\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)\top}-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)\top}-\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}^{\top}\bigg{\|}_{2}
(𝐕^()𝐕^()𝚪K()𝚪K()𝐕^()𝐔^K()𝚪K()𝐔K()𝐔^K()𝐕^()𝐔K()𝚪K()𝐔^K()𝐔^K()𝐔K()𝐔K())\displaystyle\leq\bigg{\|}\begin{pmatrix}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}\mathbf{\Gamma}_{K}^{(\ell)\top}&\quad\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{U}}_{K}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}\mathbf{U}_{K}^{(\ell)\top}\\ \widehat{\mathbf{U}}_{K}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{U}_{K}^{(\ell)}\mathbf{\Gamma}_{K}^{(\ell)\top}&\quad\widehat{\mathbf{U}}_{K}^{(\ell)}\widehat{\mathbf{U}}_{K}^{(\ell)\top}-\mathbf{U}_{K}^{(\ell)}\mathbf{U}_{K}^{(\ell)\top}\end{pmatrix}
f(𝒮(())𝚪~K())𝚪~K()𝚪~K()f(𝒮(())𝚪~K())2𝒮(())22/Δ2.\displaystyle\quad-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)\top}-\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}^{\top}\bigg{\|}_{2}\lesssim\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\|_{2}^{2}/\Delta^{2}.

By taking the upper left block of the matrix, we have

𝐕^()𝐕^()𝚪K()𝚪K()f(𝒮(())𝚪~K())[1:d,:]𝚪K()𝚪K()f(𝒮(())𝚪~K())[1:d,:]2\displaystyle\big{\|}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}\mathbf{\Gamma}_{K}^{(\ell)\top}-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\big{\|}_{2}
=𝐕^()𝐕^()𝐕𝐕f(𝒮(())𝚪~K())[1:d,:]𝚪K()𝚪K()f(𝒮(())𝚪~K())[1:d,:]2\displaystyle=\big{\|}\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}-f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}-\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\big{\|}_{2}
𝒮(())22/Δ2.\displaystyle\lesssim\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\|_{2}^{2}/\Delta^{2}.

Now for j[K]j\in[K], we study 𝐏(𝐆j())[1:d,:]\mathbf{P}_{\perp}(\mathbf{G}_{j}^{(\ell)})_{[1:d,:]}. Since 𝚪K()=𝐕𝐎𝛀()\mathbf{\Gamma}_{K}^{(\ell)}=\mathbf{V}\mathbf{O}_{\mathbf{\Omega}^{(\ell)}}, we have 𝐏𝚪K()=𝟎\mathbf{P}_{\perp}\mathbf{\Gamma}_{K}^{(\ell)}=\mathbf{0}. Therefore we have,

𝐏𝚪K()(𝚲K()λj()𝐈K)1(𝚪K()𝐔K())=𝟎,and\displaystyle\mathbf{P}_{\perp}\mathbf{\Gamma}_{K}^{(\ell)}(-\mathbf{\Lambda}_{K}^{(\ell)}-\lambda_{j}^{(\ell)}\mathbf{I}_{K})^{-1}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ -\mathbf{U}_{K}^{(\ell)}\end{pmatrix}^{\top}=\mathbf{0},\quad\text{and}
𝐏{𝐈d+p12(𝚪K()𝚪K()𝐔K()𝐔K())(𝚪K()𝚪K()𝐔K()𝐔K())}[1:d,:]\displaystyle\mathbf{P}_{\perp}\bigg{\{}\mathbf{I}_{d+p}-\frac{1}{2}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}\!\!\!\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top}\!\bigg{\}}_{[1:d,:]}
\displaystyle=(\mathbf{P}_{\perp},\mathbf{0})-\frac{1}{2}\mathbf{P}_{\perp}\mathbf{\Gamma}_{K}^{(\ell)}(\mathbf{I}_{K},\mathbf{I}_{K})\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}&\mathbf{\Gamma}_{K}^{(\ell)}\\ {\mathbf{U}}_{K}^{(\ell)}&-{\mathbf{U}}_{K}^{(\ell)}\end{pmatrix}^{\top}
=(𝐏,𝟎)+𝟎=(𝐏,𝟎),\displaystyle=(\mathbf{P}_{\perp},\mathbf{0})+\mathbf{0}=(\mathbf{P}_{\perp},\mathbf{0}),

and as a result we have

𝐏(𝐆j)[1:d,:]=12𝟎1λj(){(𝐏,𝟎)𝟎}=1λj()(𝐏,𝟎).\mathbf{P}_{\perp}(\mathbf{G}_{j})_{[1:d,:]}=\frac{1}{2}\cdot\mathbf{0}-\frac{1}{\lambda_{j}^{(\ell)}}\left\{(\mathbf{P}_{\perp},\mathbf{0})-\mathbf{0}\right\}=-\frac{1}{\lambda_{j}^{(\ell)}}(\mathbf{P}_{\perp},\mathbf{0}).

Thus in turn,

𝐏(f(𝒮(())𝚪~K())[1:d,:]𝚪K()+𝚪K()f(𝒮(())𝚪~K())[1:d,:])=𝐏f(𝒮(())𝚪~K())[1:d,:]𝚪K()\displaystyle\mathbf{P}_{\perp}\Big{(}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}+\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\Big{)}=\mathbf{P}_{\perp}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}
=(𝐏,𝟎)(𝟎𝐄𝛀()/p𝛀()𝐄/p𝟎)(𝚪K()𝐔K())(𝚲K())1𝚪K()\displaystyle\quad=(\mathbf{P}_{\perp},\mathbf{0})\begin{pmatrix}\mathbf{0}&\mathbf{E}\mathbf{\Omega}^{(\ell)}/\sqrt{p}\\ \mathbf{\Omega}^{(\ell)\top}\mathbf{E}/\sqrt{p}&\mathbf{0}\end{pmatrix}\begin{pmatrix}\mathbf{\Gamma}_{K}^{(\ell)}\\ \mathbf{U}_{K}^{(\ell)}\end{pmatrix}(\mathbf{\Lambda}_{K}^{(\ell)})^{-1}\mathbf{\Gamma}_{K}^{(\ell)\top}
=𝐏𝐄(𝛀()/p)𝐔K()(𝚲K())1𝚪K()=𝐏𝐄(𝛀()/p)(𝐘()).\displaystyle\quad=\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})\mathbf{U}_{K}^{(\ell)}(\mathbf{\Lambda}_{K}^{(\ell)})^{-1}\mathbf{\Gamma}_{K}^{(\ell)\top}=\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})(\mathbf{Y}^{(\ell)})^{\dagger}.

For a given [L]\ell\in[L], under the condition that p/dlogd=O(1)\sqrt{p/d}\log d=O(1), by Lemma 3 in Fan et al., [18] we have that with probability at least 1d101-d^{-10}, 𝛀()2d\|\mathbf{\Omega}^{(\ell)}\|_{2}\lesssim\sqrt{d}. Combined with previous results on the eigengap σmin(𝚲K())\sigma_{\min}(\mathbf{\Lambda}_{K}^{(\ell)}), we have that with probability 1O(d9)1-O(d^{-9}), for a fixed constant C>0C>0

𝛀()2Cd,σmin(𝚲K())Δ/2,[L].\|\mathbf{\Omega}^{(\ell)}\|_{2}\leq C\sqrt{d},\quad\sigma_{\min}(\mathbf{\Lambda}_{K}^{(\ell)})\geq\Delta/2,\quad\forall\ell\in[L].

Besides, under Assumption 1, we have that 𝐄2r1(d)logd\|\mathbf{E}\|_{2}\lesssim r_{1}(d)\log d with probability at least 1d101-d^{-10}, and in turn by Wedin’s Theorem [42], with high probability for all [L]\ell\in[L] we have that

𝐕^()𝐕^()𝐕𝐕2()2/σmin(𝚲K())𝐄2𝛀()/p2/Δr1(d)Δlogddp,\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\lesssim\|\mathcal{E}^{(\ell)}\|_{2}/\sigma_{\min}(\mathbf{\Lambda}_{K}^{(\ell)})\lesssim\|\mathbf{E}\|_{2}\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}/\Delta\lesssim\frac{r_{1}(d)}{\Delta}\log d\sqrt{\frac{d}{p}},

and thus 𝚺~𝐕𝐕2=OP(r1(d)logdd/p/Δ)\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}=O_{P}\big{(}r_{1}(d)\log d\sqrt{d/p}/\Delta\big{)}. Besides, we have

𝐏(𝚺~𝐕𝐕)𝐕=1L=1L𝐏(𝐕^()𝐕^()𝐕𝐕)𝐕\displaystyle\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}
=1L=1L𝐏(f(𝒮(())𝚪~K())[1:d,:]𝚪K()+𝚪K()f(𝒮(())𝚪~K())[1:d,:])𝐕+𝐑1(𝚺~)\displaystyle\quad=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\Big{(}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}\mathbf{\Gamma}_{K}^{(\ell)\top}+\mathbf{\Gamma}_{K}^{(\ell)}f\big{(}{\mathcal{S}}(\mathcal{E}^{(\ell)})\widetilde{\mathbf{\Gamma}}_{K}^{(\ell)}\big{)}_{[1:d,:]}^{\top}\Big{)}\mathbf{V}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})
=1L=1L𝐏𝐄(𝛀()/p)(𝐘())𝐕+R1(𝚺~)=1L=1L𝐏𝐄(𝛀()/p)𝐁()+𝐑1(𝚺~)\displaystyle\quad=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})(\mathbf{Y}^{(\ell)})^{\dagger}\mathbf{V}+R_{1}(\widetilde{\mathbf{\Sigma}})=\frac{1}{L}\sum_{\ell=1}^{L}\mathbf{P}_{\perp}\mathbf{E}(\mathbf{\Omega}^{(\ell)}/\sqrt{p})\mathbf{B}^{(\ell)\top}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})
=1L𝐏𝐄𝛀𝐁𝛀+𝐑1(𝚺~),\displaystyle\quad=\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}}),

where 𝐑1(𝚺~)\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}}) is the residual matrix with 𝐑1(𝚺~)2=OP(𝒮(())22/Δ2)\|\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\|_{2}=O_{P}(\|{\mathcal{S}}(\mathcal{E}^{(\ell)})\|_{2}^{2}/\Delta^{2}). Now we study the matrix 𝐁()=(𝚲𝐕𝛀()/p)\mathbf{B}^{(\ell)}=(\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}. From previous results we know that with probability at least 1d91-d^{-9}, 1/2σmin(𝛀~()/p)σmax(𝛀~()/p)3/21/2\leq\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq\sigma_{\max}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq 3/2 for any [L]\ell\in[L], and in turn 23|λ1|σmin(𝐁())σmax(𝐁())2Δ,[L]\frac{2}{3|\lambda_{1}|}\leq\sigma_{\min}(\mathbf{B}^{(\ell)})\leq\sigma_{\max}(\mathbf{B}^{(\ell)})\leq\frac{2}{\Delta},\quad\forall\ell\in[L]. Now for any vector 𝐲K\mathbf{y}\in\mathbb{R}^{K} such that 𝐲2=1\|\mathbf{y}\|_{2}=1, with probability 1O(d9)1-O(d^{-9}) we have that

𝐁𝛀𝐲2\displaystyle\|\mathbf{B}_{\mathbf{\Omega}}\mathbf{y}\|_{2} =(𝐲𝐁(1),,𝐲𝐁(L))2=(=1L𝐁()𝐲22)1/2,\displaystyle=\|(\mathbf{y}^{\top}\mathbf{B}^{(1)\top},\ldots,\mathbf{y}^{\top}\mathbf{B}^{(L)\top})^{\top}\|_{2}=\Big{(}\sum_{\ell=1}^{L}\|\mathbf{B}^{(\ell)}\mathbf{y}\|_{2}^{2}\Big{)}^{1/2},
𝐁𝛀2\displaystyle\|\mathbf{B}_{\mathbf{\Omega}}\|_{2} =max𝐲2=1𝐁𝛀𝐲2=max𝐲2=1(=1L𝐁()𝐲22)1/2(=1L𝐁()22)1/22LΔ,\displaystyle=\max_{\|\mathbf{y}\|_{2}=1}\|\mathbf{B}_{\mathbf{\Omega}}\mathbf{y}\|_{2}=\max_{\|\mathbf{y}\|_{2}=1}\Big{(}\sum_{\ell=1}^{L}\|\mathbf{B}^{(\ell)}\mathbf{y}\|_{2}^{2}\Big{)}^{1/2}\leq\Big{(}\sum_{\ell=1}^{L}\|\mathbf{B}^{(\ell)}\|_{2}^{2}\Big{)}^{1/2}\leq\frac{2\sqrt{L}}{\Delta},
σmin(𝐁𝛀)\displaystyle\sigma_{\min}\left(\mathbf{B}_{\mathbf{\Omega}}\right) =min𝐲2=1𝐁𝛀𝐲2=min𝐲2=1(=1L𝐁()𝐲22)1/2(=1Lσmin2(𝐁()))1/22L3|λ1|.\displaystyle=\min_{\|\mathbf{y}\|_{2}=1}\|\mathbf{B}_{\mathbf{\Omega}}\mathbf{y}\|_{2}=\min_{\|\mathbf{y}\|_{2}=1}\Big{(}\sum_{\ell=1}^{L}\|\mathbf{B}^{(\ell)}\mathbf{y}\|_{2}^{2}\Big{)}^{1/2}\geq\Big{(}\sum_{\ell=1}^{L}\sigma_{\min}^{2}(\mathbf{B}^{(\ell)})\Big{)}^{1/2}\geq\frac{2\sqrt{L}}{3|\lambda_{1}|}.

Now, since the entries of $\sqrt{p}\mathbf{\Omega}$ are i.i.d. standard Gaussian, similarly to before, under the condition that $Lp\ll d$, by Lemma 3 in Fan et al. [18] we have with high probability that $\frac{1}{2}\sqrt{\frac{d}{p}}\leq\sigma_{\min}(\mathbf{\Omega})\leq\sigma_{\max}(\mathbf{\Omega})\leq\frac{3}{2}\sqrt{\frac{d}{p}}$. Therefore, we have the following upper bound on the norm of the leading term

𝐏(𝚺~𝐕𝐕)𝐕2\displaystyle\|\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\|_{2} 1L𝐏𝐄𝛀𝐁𝛀2+𝐑1(𝚺~)21L𝐄2𝛀2𝐁𝛀2+𝐑1(𝚺~)2\displaystyle\lesssim\|\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}\|_{2}+\|\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\|_{2}\leq\frac{1}{L}\|\mathbf{E}\|_{2}\|\mathbf{\Omega}\|_{2}\|\mathbf{B}_{\mathbf{\Omega}}\|_{2}+\|\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\|_{2}
=OP(dLpr1(d)logdΔ+r1(d)2(logd)2dpΔ2).\displaystyle=O_{P}\Big{(}\sqrt{\frac{d}{Lp}}\frac{r_{1}(d)\log d}{\Delta}+r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.

Thus we have the following decomposition

𝐕~𝐇0𝐕\displaystyle\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V} =𝐏(𝚺~𝐕𝐕)𝐕+𝐑0(𝚺~)\displaystyle=\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}+\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})
=1L𝐏𝐄𝛀𝐁𝛀+𝐑1(𝚺~)+𝐑0(𝚺~)\displaystyle=\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})+\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})
=1L𝐏𝐄0𝛀𝐁𝛀+1L𝐏𝐄b𝛀𝐁𝛀+𝐑1(𝚺~)+𝐑0(𝚺~),\displaystyle=\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{b}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})+\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}}),

where 𝐑0(𝚺~)\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}}) is a residual matrix with

𝐑0(𝚺~)2\displaystyle\|\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})\|_{2} =OP(𝚺~𝐕𝐕2𝐏(𝚺~𝐕𝐕)𝐕2)\displaystyle=O_{P}(\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\|\mathbf{P}_{\perp}(\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top})\mathbf{V}\|_{2})
=OP(r1(d)2(logd)2dLpΔ2)+oP(r1(d)2(logd)2dpΔ2).\displaystyle=O_{P}\Big{(}\frac{r_{1}(d)^{2}(\log d)^{2}{d}}{\sqrt{L}p\Delta^{2}}\Big{)}+o_{P}\Big{(}r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.

Thus

𝐑0(𝚺~)+𝐑1(𝚺~)2=OP(r1(d)2(logd)2dpΔ2).\|\mathbf{R}_{0}(\widetilde{\mathbf{\Sigma}})+\mathbf{R}_{1}(\widetilde{\mathbf{\Sigma}})\|_{2}=O_{P}\Big{(}r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.

Next we consider the term 𝐕~F𝐇𝐕~𝐇0\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}. We denote the SVD of 𝚺~q\widetilde{\bm{\Sigma}}^{q} by 𝐕~𝚲~Kq𝐕~+𝐕~𝚲~q𝐕~\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}+\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}, and by Weyl’s inequality [19], we know that 𝚲~2𝚺~𝐕𝐕2=OP(r1(d)logdd/p/Δ)\|\widetilde{\mathbf{\Lambda}}_{\perp}\|_{2}\leq\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}=O_{P}\big{(}r_{1}(d)\log d\sqrt{d/p}/\Delta\big{)} and σK(𝚲~K)1𝚺~𝐕𝐕21OP(r1(d)logdd/p/Δ)\sigma_{K}(\widetilde{\mathbf{\Lambda}}_{K})\geq 1-\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq 1-O_{P}(r_{1}(d)\log d\sqrt{d/p}/\Delta). Thus under the condition that r1(d)logdd/p/Δ=o(1)r_{1}(d)\log d\sqrt{d/p}/\Delta=o(1), for large enough dd with high probability we have

𝚲~q2(r1(d)logdd/p/Δ)qandσK(𝚲~Kq)(1O(r1(d)logdd/p/Δ))q(1/2)q.\|\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\|_{2}\leq(r_{1}(d)\log d\sqrt{d/p}/\Delta)^{q}\quad\text{and}\quad\sigma_{K}(\widetilde{\mathbf{\Lambda}}_{K}^{q})\geq(1-O(r_{1}(d)\log d\sqrt{d/p}/\Delta))^{q}\geq(1/2)^{q}.

As before, we know that with probability 1 the left singular vector space of $\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\text{F}}=\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{\Omega}}^{\text{F}}$ and the column space of $\widetilde{\mathbf{V}}$ are the same, where $\widetilde{\mathbf{\Omega}}^{\text{F}}:=\widetilde{\mathbf{V}}^{\top}\mathbf{\Omega}^{\text{F}}\in\mathbb{R}^{K\times p^{\prime}}$ is still a Gaussian test matrix with i.i.d. entries. By Lemma 3 in Fan et al. [18], we have with probability at least $1-d^{-10}$, $\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}})\geq 1-O(\sqrt{\frac{K}{p^{\prime}}}\log d)$. When $\sqrt{\frac{K}{p^{\prime}}}\log d=o(1)$, by Wedin’s Theorem [42], there exists a constant $\eta>0$ such that with high probability we have

𝐕~F𝐇𝐕~𝐇02\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}\|_{2} =𝐕~F𝐇1𝐕~2𝐕~𝚲~q𝐕~𝛀F/p2/σK(𝐕~𝚲~Kq𝛀~F/p)\displaystyle=\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{1}-\widetilde{\mathbf{V}}\|_{2}\lesssim\|\widetilde{\mathbf{V}}_{\perp}\widetilde{\mathbf{\Lambda}}_{\perp}^{q}\widetilde{\mathbf{V}}_{\perp}^{\top}{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}}\|_{2}/\sigma_{K}(\widetilde{\mathbf{V}}\widetilde{\mathbf{\Lambda}}_{K}^{q}\widetilde{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}})
𝚲~2q𝛀F/p2σK(𝚲~Kq)σK(𝛀~F/p)(2ηr1(d)logdd/pΔ)qdp.\displaystyle\leq\frac{\|\widetilde{\mathbf{\Lambda}}_{\perp}\|_{2}^{q}\|{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}}\|_{2}}{\sigma_{K}(\widetilde{\mathbf{\Lambda}}_{K}^{q})\sigma_{K}(\widetilde{\mathbf{\Omega}}^{\text{F}}/\sqrt{p^{\prime}})}\lesssim\left(\frac{2\eta r_{1}(d)\log d\sqrt{d/p}}{\Delta}\right)^{q}\sqrt{\frac{d}{p^{\prime}}}.

Denote r:=2ηr1(d)logdd/p/Δ=o((logd)1/4)r^{\prime}:=2\eta r_{1}(d)\log d\sqrt{d/p}/\Delta=o\big{(}(\log d)^{-1/4}\big{)}. Then it can be seen that when

qlogd2+logdloglogd2+logd/plog(1/r),q\geq\log d\gg 2+\frac{\log d}{\log\log d}\geq 2+\frac{\log\sqrt{d/p^{\prime}}}{\log(1/r^{\prime})},

we have that (r)qd/p=o((r)2)(r^{\prime})^{q}\sqrt{d/p^{\prime}}=o\big{(}(r^{\prime})^{2}\big{)} and 𝐕~F𝐇𝐕~𝐇02=OP(r1(d)2(logd)2dpΔ2)\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}\|_{2}=O_{P}\Big{(}r_{1}(d)^{2}(\log d)^{2}\frac{d}{p\Delta^{2}}\Big{)}.
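For intuition on the requirement q ≳ log d above, the following minimal numpy sketch (an illustration only, with hypothetical sizes d, K and p′, not the paper's implementation) applies q steps of power iteration to a symmetric matrix whose trailing spectrum is bounded away from its top-K eigenvalues, and prints the distance between the sketched subspace and the top-K eigenspace; the error decays geometrically in q, which is exactly the mechanism used in the display above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, p_prime = 400, 3, 20                 # hypothetical sizes for illustration

# A symmetric surrogate for the input matrix: top-K eigenvalues near 1, the rest small.
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
eigvals = np.concatenate([np.ones(K), 0.3 * rng.uniform(size=d - K)])
Sigma_tilde = (U * eigvals) @ U.T
V_tilde = U[:, :K]                         # target top-K eigenspace

Omega_F = rng.standard_normal((d, p_prime)) / np.sqrt(p_prime)   # Gaussian test matrix

for q in (1, 2, 4, int(np.log(d)) + 1):
    Y = np.linalg.matrix_power(Sigma_tilde, q) @ Omega_F
    Uy, _, _ = np.linalg.svd(Y, full_matrices=False)
    V_F = Uy[:, :K]                        # top-K left singular vectors of the sketch
    err = np.linalg.norm(V_F @ V_F.T - V_tilde @ V_tilde.T, 2)
    print(f"q = {q:2d}   projection error = {err:.2e}")
```

With the assumed spectrum, each additional power multiplies the error by roughly the eigenvalue ratio of the trailing to the leading block, so a logarithmic number of powers drives the sketching error below the statistical error.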

Now for a given j[d]j\in[d], recall that with high probability σmin(𝚺j)=Ω(η2(d))\sigma_{\min}(\mathbf{\Sigma}_{j})=\Omega\big{(}\eta_{2}(d)\Big{)}. Therefore, under the condition that d2r1(d)4(logd)4(p2Δ4η2(d))1=o(1){d^{2}r_{1}(d)^{4}(\log d)^{4}}\big{(}{p^{2}\Delta^{4}\eta_{2}(d)}\big{)}^{-1}=o(1) and dr2(d)2(LpΔ2η2(d))1=o(1){dr_{2}(d)^{2}}\big{(}{Lp\Delta^{2}\eta_{2}(d)}\big{)}^{-1}=o(1), we have with probability 1O(d9)1-O(d^{-9}), 1L𝐏𝐄b𝛀𝐁𝛀2=OP(dΔ2Lpr2(d))=oP((σmin(𝚺j))1/2)\|\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{b}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}}\|_{2}=O_{P}\Big{(}\sqrt{\frac{d}{\Delta^{2}Lp}}r_{2}(d)\Big{)}=o_{P}\big{(}(\sigma_{\min}(\mathbf{\Sigma}_{j}))^{1/2}\big{)}, and 𝐑0(𝚺~)+𝐑1(𝚺~)2=oP((σmin(𝚺j))1/2)\|\mathbf{R}_{0}(\widetilde{\bm{\Sigma}})+\mathbf{R}_{1}(\widetilde{\bm{\Sigma}})\|_{2}=o_{P}\big{(}(\sigma_{\min}(\mathbf{\Sigma}_{j}))^{1/2}\big{)}. Then under Assumption 5, we have

𝚺j1/2(𝐕~F𝐇𝐕)𝐞j=𝚺j1/2(𝐕~F𝐇𝐕~𝐇0+𝐕~𝐇0𝐕)𝐞j\displaystyle\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{0}+\widetilde{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})^{\top}\mathbf{e}_{j}
=𝚺j1/2(1L𝐁𝛀𝛀𝐄0𝐏𝐞j)+𝚺j1/2(𝐕~F𝐇𝐕~𝐇0+𝐑0(𝚺~)+𝐑1(𝚺~)+1L𝐏𝐄b𝛀𝐁𝛀)𝐞j\displaystyle=\mathbf{\Sigma}_{j}^{-1/2}(\frac{1}{L}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\!+\!\mathbf{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\!-\!\widetilde{\mathbf{V}}\mathbf{H}_{0}\!+\!\mathbf{R}_{0}(\widetilde{\bm{\Sigma}})\!+\!\mathbf{R}_{1}(\widetilde{\bm{\Sigma}})\!+\!\frac{1}{L}\mathbf{P}_{\perp}\mathbf{E}_{b}\mathbf{\Omega}\mathbf{B}_{\bm{\Omega}})^{\top}\mathbf{e}_{j}
=𝚺j1/2𝒱(𝐄0)𝐞j+oP(1)𝑑𝒩(𝟎,𝐈K).\displaystyle=\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}+o_{P}(1)\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).

B.7 Proof of Corollary 4.11

To prove Corollary 4.11, it suffices for us to show that Assumptions 1, 2 and 5 are met. From the proof of Corollary 4.2, we know that Assumption 1 is satisfied. We move on to show that Assumption 2 is met. Define 𝐕d=(𝐕,𝐕)\mathbf{V}_{d}=(\mathbf{V},\mathbf{V}^{\perp}) as the stacking of eigenvectors for the covariance matrix 𝚺\mathbf{\Sigma}. Note that 𝐕\mathbf{V}^{\perp} is not identifiable under the spiked covariance model and is unique up to orthogonal transformation. Let 𝒁i=𝐕d𝑿i\bm{Z}_{i}=\mathbf{V}_{d}^{\top}\bm{X}_{i}, and 𝒁i𝒩(𝟎,𝚲d)\bm{Z}_{i}\sim{\mathcal{N}}(\mathbf{0},\mathbf{\Lambda}_{d}), where 𝚲d=diag(𝚲+σ2𝐈K,σ2𝐈dK)\mathbf{\Lambda}_{d}=\operatorname{diag}(\mathbf{\Lambda}+\sigma^{2}\mathbf{I}_{K},\sigma^{2}\mathbf{I}_{d-K}). We let 𝚪S=(𝐮1,,𝐮K+1)\mathbf{\Gamma}_{S}=(\mathbf{u}_{1},\ldots,\mathbf{u}_{K+1}) be the stacking of eigenvectors for the matrix 𝚺S\bm{\Sigma}_{S}, and let σ~1σ~2σ~K+1\widetilde{\sigma}_{1}\geq\widetilde{\sigma}_{2}\geq\ldots\geq\widetilde{\sigma}_{K+1} be the K+1K+1 eigenvalues of 𝚺S\bm{\Sigma}_{S}. Correspondingly, let σ^1σ^K+1=σ^2\widehat{\sigma}_{1}\geq\ldots\geq\widehat{\sigma}_{K+1}=\widehat{\sigma}^{2} be the eigenvalues of the sample covariance matrix 𝚺^S\widehat{\bm{\Sigma}}_{S}. Since 𝚺S=(𝐕)[S,:]𝚲(𝐕)[S,:]+σ2𝐈K+1\bm{\Sigma}_{S}=(\mathbf{V})_{[S,:]}\mathbf{\Lambda}(\mathbf{V})_{[S,:]}^{\top}+\sigma^{2}\mathbf{I}_{K+1}, we know that σ~K+1=σ2\widetilde{\sigma}_{K+1}=\sigma^{2} and δ=σ~Kσ~K+1Δσmin2((𝐕)[S,:])\delta=\widetilde{\sigma}_{K}-\widetilde{\sigma}_{K+1}\geq\Delta\sigma_{\min}^{2}\big{(}(\mathbf{V})_{[S,:]}\big{)}. We define 𝐜~=(𝐕[S,:])𝐮K+1\widetilde{\mathbf{c}}=(\mathbf{V}^{\perp}_{[S,:]})^{\top}\mathbf{u}_{K+1}, and denote 𝐜~0=(𝟎,𝐈dK)𝐜~d\widetilde{{\mathbf{c}}}_{0}=(\mathbf{0},\mathbf{I}_{d-K})^{\top}\widetilde{{\mathbf{c}}}\in\mathbb{R}^{d}. Then by the proof of Lemma 6.2 in Wang and Fan, [41], we know that

σ^2σ2=𝐜~0(1ni=1n𝒁i𝒁i𝚲d)𝐜~0+1nOP(MK+1σ2WK+1),\widehat{\sigma}^{2}-\sigma^{2}=\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}+\frac{1}{n}O_{P}\left(M_{K+1}-\sigma^{2}W_{K+1}\right),

where MK+1=kKfk2(σ~k+(σ^kσ~k)),WK+1=kKfk2M_{K+1}=\sum_{k\leq K}f_{k}^{2}\left(\widetilde{\sigma}_{k}+(\widehat{\sigma}_{k}-\widetilde{\sigma}_{k})\right),W_{K+1}=\sum_{k\leq K}f_{k}^{2} and fkf_{k} is the (K+1)(K+1)-th element of the kk-th eigenvector of 𝚪S𝚺^S𝚪S\mathbf{\Gamma}_{S}^{\top}\widehat{\bm{\Sigma}}_{S}\mathbf{\Gamma}_{S} multiplied by n\sqrt{n} for kKk\leq K. We let 𝐟=(f1,,fK)/n\mathbf{f}=(f_{1},\ldots,f_{K})^{\top}/\sqrt{n}. By Wedin’s Theorem [42] and Lemma 3 in Fan et al., [18], we have that with probability at least 1d101-d^{-10}, |σ^kσ~k|𝚺^S𝚺S2σ~1logdKn|\widehat{\sigma}_{k}-\widetilde{\sigma}_{k}|\leq\|\widehat{\bm{\Sigma}}_{S}-\bm{\Sigma}_{S}\|_{2}\lesssim\widetilde{\sigma}_{1}\log d\sqrt{\frac{K}{n}} for kKk\leq K. If we denote by 𝐅S:=(𝐈K,𝟎)\mathbf{F}_{S}:=(\mathbf{I}_{K},\mathbf{0})^{\top} the stacked top KK eigenvectors of 𝚪S𝚺S𝚪S\mathbf{\Gamma}_{S}^{\top}\bm{\Sigma}_{S}\mathbf{\Gamma}_{S}, and by 𝐅^S\widehat{\mathbf{F}}_{S} the stacked top KK eigenvectors of 𝚪S𝚺^S𝚪S\mathbf{\Gamma}_{S}^{\top}\widehat{\bm{\Sigma}}_{S}\mathbf{\Gamma}_{S}, then we know that 𝐟\mathbf{f} is the (K+1)(K+1)-th row of 𝐅^S\widehat{\mathbf{F}}_{S}. By Davis-Kahan’s Theorem [45], we also know that there exists an orthonormal matrix 𝐎SK×K\mathbf{O}_{S}\in\mathbb{R}^{K\times K} such that 𝐟2=𝐎S𝐟𝟎2𝐅^S𝐎S𝐅S2σ~1logdδKn\|\mathbf{f}\|_{2}=\|\mathbf{O}_{S}^{\top}\mathbf{f}-\mathbf{0}\|_{2}\leq\|\widehat{\mathbf{F}}_{S}\mathbf{O}_{S}-\mathbf{F}_{S}\|_{2}\lesssim\frac{\widetilde{\sigma}_{1}\log d}{\delta}\sqrt{\frac{K}{n}}, and thus

WK+1=kKfk2=n𝐟22σ~12K(logd)2δ2,W_{K+1}=\sum_{k\leq K}f_{k}^{2}=n\|\mathbf{f}\|_{2}^{2}\lesssim\frac{\widetilde{\sigma}_{1}^{2}K(\log d)^{2}}{\delta^{2}},
andMK+1σ~1kKfk2+(kKfk2)𝚺^S𝚺S2σ~13Kδ2(logd)2.\text{and}\quad M_{K+1}\leq\widetilde{\sigma}_{1}\sum_{k\leq K}f_{k}^{2}+(\sum_{k\leq K}f_{k}^{2})\|\widehat{\bm{\Sigma}}_{S}-\bm{\Sigma}_{S}\|_{2}\lesssim\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}}(\log d)^{2}.

Thus we can write σ^2σ2=𝐜~0(1ni=1n𝒁i𝒁i𝚲d)𝐜~0+OP(σ~13Kδ2n(logd)2)\widehat{\sigma}^{2}-\sigma^{2}=\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}+O_{P}\big{(}\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}\big{)}.

Now we take 𝐄0=𝚺^𝚺(𝐜~0(1ni=1n𝒁i𝒁i𝚲d)𝐜~0)𝐈d\mathbf{E}_{0}=\widehat{\mathbf{\Sigma}}-\mathbf{\Sigma}-(\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0})\mathbf{I}_{d}, and from previous results we know that with high probability 𝐄b2=𝐄𝐄02σ~13Kδ2n(logd)2\|\mathbf{E}_{b}\|_{2}=\|\mathbf{E}-\mathbf{E}_{0}\|_{2}\lesssim\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}, so that r2(d)σ~13Kδ2n(logd)2=o(r1(d))r_{2}(d)\asymp\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}=o\left(r_{1}(d)\right) and Assumption 2 is satisfied.
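As a numerical sanity check of this construction of the noise-variance estimator (a sketch under assumed toy parameters only; the subset S, the eigenvalues, and the sample sizes below are hypothetical), the snippet simulates the spiked covariance model, takes σ̂² to be the smallest eigenvalue of the sample covariance restricted to a (K+1)-coordinate subset S, and reports |σ̂²−σ²|, which shrinks as n grows, consistent with the rates above.

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, sigma2 = 200, 3, 1.0                     # toy sizes (assumed for illustration)
lam = np.array([30.0, 20.0, 10.0])             # spiked eigenvalues in Lambda

V, _ = np.linalg.qr(rng.standard_normal((d, K)))
S = np.arange(K + 1)                           # a fixed (K+1)-coordinate subset

for n in (500, 5000, 50000):
    # X_i = V Lambda^{1/2} f_i + sigma * eps_i has covariance V Lambda V^T + sigma^2 I_d.
    X = rng.standard_normal((n, K)) * np.sqrt(lam) @ V.T \
        + np.sqrt(sigma2) * rng.standard_normal((n, d))
    Sigma_hat_S = X[:, S].T @ X[:, S] / n      # sample covariance restricted to S
    sigma2_hat = np.linalg.eigvalsh(Sigma_hat_S)[0]   # smallest eigenvalue is sigma2_hat
    print(f"n = {n:6d}   |sigma2_hat - sigma2| = {abs(sigma2_hat - sigma2):.4f}")
```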

Now we move on to study the statistical rate η2(d)\eta_{2}(d). For any j[d]j\in[d], we first study the covariance of 𝐄0𝐏𝐞j\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}. We denote 𝐄~=𝒁1𝒁1𝚲d\widetilde{\mathbf{E}}=\bm{Z}_{1}\bm{Z}_{1}^{\top}-\mathbf{\Lambda}_{d}; it is then straightforward to verify that Cov(𝐄~st,𝐄~gh)=λs(𝚺)λt(𝚺)(𝕀{s=g,t=h}+𝕀{s=h,t=g})\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}_{st},\widetilde{\mathbf{E}}_{gh})=\lambda_{s}(\bm{\Sigma})\lambda_{t}(\bm{\Sigma})(\mathbb{I}\{s=g,t=h\}+\mathbb{I}\{s=h,t=g\}). Since the covariance matrices of 𝐄0𝐏𝐞j\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j} and 𝐕d𝐄0𝐏𝐞j\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j} share the same eigenvalues, we can study the covariance of 𝐕d𝐄0𝐏𝐞j\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j} instead. Then Cov(𝐕d𝐄0𝐏𝐞j)\operatorname{Cov}(\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}) can be calculated as follows

Cov{𝐕d(1ni=1n𝑿i𝑿i𝚺)𝐕(𝐕)𝐞j(𝐜~0(1ni=1n𝒁i𝒁i𝚲d)𝐜~0)𝐕d𝐏𝐞j}\displaystyle\operatorname{Cov}\Big{\{}\mathbf{V}_{d}^{\top}\big{(}\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}-\mathbf{\Sigma}\big{)}\mathbf{V}^{\perp}(\mathbf{V}^{\perp})^{\top}\mathbf{e}_{j}-\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}\big{)}\mathbf{V}_{d}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{\}}
=Cov{𝐕d(1ni=1n𝑿i𝑿i𝚺)𝐕d(𝟎,𝐈dK)𝐞~(𝐜~0(1ni=1n𝒁i𝒁i𝚲d)𝐜~0)𝐞~0}\displaystyle=\operatorname{Cov}\Big{\{}\mathbf{V}_{d}^{\top}\big{(}\frac{1}{n}\sum_{i=1}^{n}\bm{X}_{i}\bm{X}_{i}^{\top}-\mathbf{\Sigma}\big{)}\mathbf{V}_{d}(\mathbf{0},\mathbf{I}_{d-K})^{\top}\widetilde{\mathbf{e}}-\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}\big{)}\widetilde{\mathbf{e}}_{0}\Big{\}}
=Cov{(1ni=1n𝒁i𝒁i𝚲d)𝐞~0(𝐜~0(1ni=1n𝒁i𝒁i𝚲d)𝐜~0)𝐞~0},\displaystyle=\operatorname{Cov}\Big{\{}\big{(}\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d}\big{)}\widetilde{\mathbf{e}}_{0}-\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}(\frac{1}{n}\sum_{i=1}^{n}\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{{\mathbf{c}}}_{0}\big{)}\widetilde{\mathbf{e}}_{0}\Big{\}},

where 𝐞~=(𝐕)𝐞j\widetilde{\mathbf{e}}=(\mathbf{V}^{\perp})^{\top}\mathbf{e}_{j} and 𝐞~0=(𝟎,𝐈dK)𝐞~\widetilde{\mathbf{e}}_{0}=(\mathbf{0},\mathbf{I}_{d-K})^{\top}\widetilde{\mathbf{e}}. Then we have

Cov(𝐕d𝐄0𝐏𝐞j)=1nCov(𝐄~𝐞~0𝐜~0𝐄~𝐜~0𝐞~0)\displaystyle\operatorname{Cov}(\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})=\frac{1}{n}\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0}-\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0}\widetilde{\mathbf{e}}_{0})
=1n{Cov(𝐄~𝐞~0)+Var(𝐜~0𝐄~𝐜~0)𝐞~0𝐞~0Cov(𝐄~𝐞~0,𝐜~0𝐄~𝐜~0)𝐞~0𝐞~0Cov(𝐄~𝐞~0,𝐜~0𝐄~𝐜~0)}\displaystyle\quad=\frac{1}{n}\Big{\{}\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0})+\operatorname{Var}\big{(}\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0}\big{)}\widetilde{\mathbf{e}}_{0}\widetilde{\mathbf{e}}_{0}^{\top}-\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0},\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0})\widetilde{\mathbf{e}}_{0}^{\top}-\widetilde{\mathbf{e}}_{0}\operatorname*{\rm Cov}(\widetilde{\mathbf{E}}\widetilde{\mathbf{e}}_{0},\widetilde{{\mathbf{c}}}_{0}^{\top}\widetilde{\mathbf{E}}\widetilde{{\mathbf{c}}}_{0})^{\top}\Big{\}}
=1n{𝐞~022σ2𝚲d+3σ4𝐞~0𝐞~02σ4𝐜~,𝐞~(𝐜~0𝐞~0+𝐞~0𝐜~0)}.\displaystyle\quad=\frac{1}{n}\{\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\sigma^{2}\mathbf{\Lambda}_{d}+3\sigma^{4}\widetilde{\mathbf{e}}_{0}\widetilde{\mathbf{e}}_{0}^{\top}-2\sigma^{4}\langle\widetilde{\mathbf{c}},\widetilde{\mathbf{e}}\rangle(\widetilde{\mathbf{c}}_{0}\widetilde{\mathbf{e}}_{0}^{\top}+\widetilde{\mathbf{e}}_{0}\widetilde{\mathbf{c}}_{0}^{\top})\}.

Thus it can be seen that the covariance matrix is block-diagonal:

Cov(𝐕d𝐄0𝐏𝐞j)=1n(𝐞~022σ2(𝚲+σ2𝐈K)𝟎𝟎𝐞~022σ4(𝐈dK+3τ1τ12ρ𝐜~τ12ρτ1𝐜~)),\operatorname{Cov}(\mathbf{V}_{d}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})=\frac{1}{n}\begin{pmatrix}\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\sigma^{2}(\mathbf{\Lambda}+\sigma^{2}\mathbf{I}_{K})&\mathbf{0}\\ \mathbf{0}&\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\sigma^{4}(\mathbf{I}_{d-K}+3\mathbf{\tau}_{1}\mathbf{\tau}_{1}^{\top}-2\rho\widetilde{\mathbf{c}}\mathbf{\tau}_{1}^{\top}-2\rho\mathbf{\tau}_{1}\widetilde{\mathbf{c}}^{\top})\end{pmatrix},

where τ1=𝐞~/𝐞~2\mathbf{\tau}_{1}=\widetilde{\mathbf{e}}/\|\widetilde{\mathbf{e}}\|_{2} and ρ=𝐜~,τ1\rho=\langle\widetilde{{\mathbf{c}}},\mathbf{\tau}_{1}\rangle. Then following basic algebra, we can write Cov(𝐄0𝐏𝐞j)\operatorname{Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}) as:

1n{σ2𝐞~022𝚺+3σ4𝐏𝐞j𝐞j𝐏2σ4ρ𝐞~02[(𝐏)[:,S]𝐮K+1𝐞j𝐏+𝐏𝐞j(𝐮K+1)(𝐏)[S,:]]}.\frac{1}{n}\!\!\left\{\!\sigma^{2}\|\widetilde{\mathbf{e}}_{0}\|_{2}^{2}\bm{\Sigma}\!+\!3\sigma^{4}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\!\!-\!2\sigma^{4}\rho\|\widetilde{\mathbf{e}}_{0}\|_{2}\big{[}\!(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\!\mathbf{P}_{\perp}\!\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\!\top}\!(\mathbf{P}_{\perp})_{[S,:]}\big{]}\!\!\right\}.

To study η2(d)\eta_{2}(d), we will first define 𝚺j\bm{\Sigma}_{j}^{\prime} as follows

𝚺j=1nL2𝐁𝛀𝛀{σ2𝚺+3σ4𝐞j𝐞j2σ4ρ𝐞~02((𝐈d)[:,S]𝐮K+1𝐞j+𝐞j𝐮K+1(𝐈d)[S,:])}𝛀𝐁𝛀.\bm{\Sigma}_{j}^{\prime}=\frac{1}{nL^{2}}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{\{}\sigma^{2}\bm{\Sigma}+3\sigma^{4}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}-2\sigma^{4}\rho\|\widetilde{\mathbf{e}}_{0}\|_{2}\big{(}(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}+\mathbf{e}_{j}\mathbf{u}_{K+1}^{\top}(\mathbf{I}_{d})_{[S,:]}\big{)}\Big{\}}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}.

We know that 𝐞~022=𝐏𝐞j22=1O(μK/d)\|\widetilde{\mathbf{e}}_{0}\|^{2}_{2}=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}=1-O(\mu K/d), thus we have

σ2𝐏𝐞j22𝚺σ2𝚺2O(μKσ2d(σ2+λ1)),\displaystyle\Big{\|}\sigma^{2}\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}\bm{\Sigma}-\sigma^{2}\bm{\Sigma}\Big{\|}_{2}\leq O\big{(}\frac{\mu K\sigma^{2}}{d}(\sigma^{2}+\lambda_{1})\big{)},
3σ4𝐏𝐞j𝐞j𝐏3σ4𝐞j𝐞j23σ4(𝐏𝐞j𝐞j)𝐞j𝐏2+3σ4𝐞j(𝐏𝐞j𝐞j)2σ4μKd,\displaystyle\|3\sigma^{4}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-3\sigma^{4}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\|_{2}\leq 3\sigma^{4}\|(\mathbf{P}_{\perp}\mathbf{e}_{j}-\mathbf{e}_{j})\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\|_{2}+3\sigma^{4}\|\mathbf{e}_{j}(\mathbf{P}_{\perp}\mathbf{e}_{j}-\mathbf{e}_{j})^{\top}\|_{2}\lesssim\sigma^{4}\sqrt{\frac{\mu K}{d}},
(𝐏)[:,S]𝐮K+1𝐞j𝐏(𝐈d)[:,S]𝐮K+1𝐞j2[(𝐏)[:,S](𝐈d)[:,S]]𝐮K+1𝐞j𝐏2\displaystyle\|(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\|_{2}\leq\|[(\mathbf{P}_{\perp})_{[:,S]}-(\mathbf{I}_{d})_{[:,S]}]\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\|_{2}
+(𝐈d)[:,S]𝐮K+1𝐞j(𝐏𝐈d)2Kμd+μKdKμd,\quad+\|(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}(\mathbf{P}_{\perp}-\mathbf{I}_{d})\|_{2}\lesssim K\sqrt{\frac{\mu}{d}}+\sqrt{\frac{\mu K}{d}}\lesssim K\sqrt{\frac{\mu}{d}},
2σ4ρ𝐞~02[(𝐏)[:,S]𝐮K+1𝐞j𝐏+𝐏𝐞j(𝐮K+1)(𝐏)[S,:]][(𝐈d)[:,S]𝐮K+1𝐞j+𝐞j(𝐮K+1)(𝐈d)[S,:]]2\displaystyle 2\sigma^{4}\rho\|\widetilde{\mathbf{e}}_{0}\|_{2}\Big{\|}\!\big{[}(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\!\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\top}\!(\mathbf{P}_{\perp})_{[S,:]}\big{]}\!\!-\!\big{[}\!(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\!\!\!+\!\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\top}\!(\mathbf{I}_{d})_{[S,:]}\big{]}\!\Big{\|}_{2}
Kσ4μd,\displaystyle\quad\lesssim K\sigma^{4}\sqrt{\frac{\mu}{d}},

and in summary we have 𝚺j𝚺j2=OP(Kdσ4nΔ2Lpμd)=OP(Kλ12Δ2μd)dσ4nLpλ12=oP(dσ4nLpλ12)\|\bm{\Sigma}_{j}-{\bm{\Sigma}}_{j}^{\prime}\|_{2}=O_{P}\Big{(}\frac{Kd\sigma^{4}}{n\Delta^{2}Lp}\sqrt{\frac{\mu}{d}}\Big{)}=O_{P}\Big{(}\frac{K\lambda_{1}^{2}}{\Delta^{2}}\sqrt{\frac{\mu}{d}}\Big{)}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}=o_{P}\big{(}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}\big{)}. Now we study 𝚺j𝚺~j2\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\|_{2}. Since the entries of p𝛀\sqrt{p}\mathbf{\Omega} are i.i.d. standard Gaussian, by Lemma 3 in Fan et al., [18], we know that with probability 1O(d9)1-O(d^{-9}), we have

𝛀2,L,and𝛀[S,:]2L.\|\mathbf{\Omega}\|_{2,\infty}\lesssim\sqrt{L},\quad\text{and}\quad\|\mathbf{\Omega}_{[S,:]}\|_{2}\lesssim\sqrt{L}.

Therefore, under the condition that λ12LpΔ2d=o(1)\frac{\lambda_{1}^{2}Lp}{\Delta^{2}d}=o(1) we have

𝚺j𝚺~j2\displaystyle\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\|_{2} =σ41nL2𝐁𝛀𝛀(3𝐞j𝐞j2ρ((𝐈d)[:,S]𝐮K+1𝐞j+𝐞j𝐮K+1(𝐈d)[S,:]))𝛀𝐁𝛀2\displaystyle=\sigma^{4}\Big{\|}\frac{1}{nL^{2}}\mathbf{B}_{\mathbf{\Omega}}^{\top}\mathbf{\Omega}^{\top}\Big{(}3\mathbf{e}_{j}\mathbf{e}_{j}^{\top}-2\rho\big{(}(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}+\mathbf{e}_{j}\mathbf{u}_{K+1}^{\top}(\mathbf{I}_{d})_{[S,:]}\big{)}\Big{)}\mathbf{\Omega}\mathbf{B}_{\mathbf{\Omega}}\Big{\|}_{2}
σ4nL2𝐁𝛀22𝛀2,(𝛀2,+𝛀[S,:]2)=OP(σ4nΔ2)=oP(dσ4nLpλ12).\displaystyle\lesssim\frac{\sigma^{4}}{nL^{2}}\|\mathbf{B}_{\mathbf{\Omega}}\|_{2}^{2}\|\mathbf{\Omega}\|_{2,\infty}\big{(}\|\mathbf{\Omega}\|_{2,\infty}+\|\mathbf{\Omega}_{[S,:]}\|_{2}\big{)}=O_{P}\big{(}\frac{\sigma^{4}}{n\Delta^{2}}\big{)}=o_{P}(\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}).
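The Gaussian norm bounds used in the last two displays are straightforward to check numerically. The sketch below (with hypothetical d, p, L and K) draws Ω with i.i.d. N(0,1/p) entries, as in the construction of the fast sketches, and compares the maximum row norm ‖Ω‖_{2,∞} and the spectral norm of the (K+1)-row submatrix Ω_{[S,:]} against √L.

```python
import numpy as np

rng = np.random.default_rng(2)
d, p, L, K = 500, 30, 20, 3                    # hypothetical sizes for illustration

# Omega stacks L sketches of size d x p; the entries of sqrt(p)*Omega are i.i.d. N(0,1).
Omega = rng.standard_normal((d, L * p)) / np.sqrt(p)
S = np.arange(K + 1)

max_row_norm = np.linalg.norm(Omega, axis=1).max()    # ||Omega||_{2,infty}
sub_spec_norm = np.linalg.norm(Omega[S, :], 2)        # ||Omega_{[S,:]}||_2

print(f"max row norm      = {max_row_norm:.2f}   (sqrt(L) = {np.sqrt(L):.2f})")
print(f"||Omega_[S,:]||_2 = {sub_spec_norm:.2f}   (sqrt(L) = {np.sqrt(L):.2f})")
```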

As for 𝚺~j\widetilde{\bm{\Sigma}}_{j}, by Lemma 3 in Fan et al., [18] with high probability we have that σK(𝛀𝐕)L\sigma_{K}(\bm{\Omega}^{\top}\mathbf{V})\gtrsim\sqrt{L} and in turn

σK(𝚺~j)\displaystyle\sigma_{K}(\widetilde{\bm{\Sigma}}_{j}) σ2nL2(σK(𝐁𝛀))2(σK(𝛀𝐕)2Δ+σK(𝛀)2σ2)dσ4nLpλ12+σ2Δnλ12.\gtrsim\frac{\sigma^{2}}{nL^{2}}\big{(}\sigma_{K}(\mathbf{B}_{\bm{\Omega}})\big{)}^{2}\left(\sigma_{K}(\bm{\Omega}^{\top}\mathbf{V})^{2}\Delta+\sigma_{K}(\bm{\Omega})^{2}\sigma^{2}\right)\gtrsim\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+\frac{\sigma^{2}\Delta}{n\lambda_{1}^{2}}.

Therefore, combining the previous results, we have that by Weyl’s inequality [19], with high probability

λK(𝚺j)\displaystyle\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)} λK(𝚺~j)𝚺j𝚺j2𝚺j𝚺~j2\displaystyle\geq\lambda_{K}\big{(}\widetilde{\bm{\Sigma}}_{j}\big{)}-\|\bm{\Sigma}_{j}-\bm{\Sigma}_{j}^{\prime}\|_{2}-\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\|_{2}
dσ4nLpλ12+σ2Δnλ12o(dσ4nLpλ12)dσ4nLpλ12+σ2Δnλ12.\displaystyle\gtrsim\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+\frac{\sigma^{2}\Delta}{n\lambda_{1}^{2}}-o(\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}})\gtrsim\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+\frac{\sigma^{2}\Delta}{n\lambda_{1}^{2}}.

Thus we know η2(d)dσ4/(nLpλ12)+σ2Δ/(nλ12)\eta_{2}(d)\asymp d\sigma^{4}/(nLp\lambda_{1}^{2})+\sigma^{2}\Delta/(n\lambda_{1}^{2}).

Recall from the proof of Corollary 4.2 that with probability 1O(d10)1-O(d^{-10}) we have 𝐄02(λ1+σ2)logdrn\|\mathbf{E}_{0}\|_{2}\lesssim(\lambda_{1}+\sigma^{2})\log d\sqrt{\frac{r}{n}}. Also recall that r2(d)σ~13Kδ2n(logd)2r_{2}(d)\asymp\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2}. Therefore, under the condition that

nκ14λ1dr2(logd)4pσ2(κ1dpλ1σ2L)andσ~16K2δ4σ4n(logd)4(Δλ1)2,n\gg\frac{\kappa_{1}^{4}\lambda_{1}dr^{2}(\log d)^{4}}{p\sigma^{2}}\left(\kappa_{1}\frac{d}{p}\wedge\frac{\lambda_{1}}{\sigma^{2}}L\right)\quad\text{and}\quad\frac{\widetilde{\sigma}_{1}^{6}K^{2}}{\delta^{4}\sigma^{4}n}(\log d)^{4}\ll(\frac{\Delta}{\lambda_{1}})^{2},

we have d2r1(d)4(logd)4(p2Δ4η2(d))1=o(1){d^{2}r_{1}(d)^{4}(\log d)^{4}}\big{(}{p^{2}\Delta^{4}\eta_{2}(d)}\big{)}^{-1}=o(1) and dr2(d)2(LpΔ2η2(d))1=o(1){dr_{2}(d)^{2}}\big{(}{Lp\Delta^{2}\eta_{2}(d)}\big{)}^{-1}=o(1).

Now we need to verify Assumption 5. The randomness of the leading term comes from both 𝛀\bm{\Omega} and 𝐄0\mathbf{E}_{0}, so we first establish the result conditional on 𝛀\bm{\Omega}. In fact, we will prove a more general CLT that also covers the leading term under the regime LpdLp\gg d. More specifically, we will show that for any matrix 𝐀d×K\mathbf{A}\in\mathbb{R}^{d\times K} that satisfies the following two conditions: (1) σmax(𝐀)/σmin(𝐀)C|λ1|/Δ\sigma_{\max}(\mathbf{A})/\sigma_{\min}(\mathbf{A})\leq C|\lambda_{1}|/\Delta; (2) λK(Cov(𝐀𝐄0𝐏𝐞j))cn1σ4(σmin(𝐀))2\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq cn^{-1}\sigma^{4}\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}, where C,c>0C,c>0 are fixed constants independent of 𝐀\mathbf{A} and we abuse notation by denoting 𝚺j:=Cov(𝐀𝐄0𝐏𝐞j)\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}), it holds that

𝚺j1/2𝐀𝐄0𝐏𝐞j𝑑𝒩(𝟎,𝐈K).\mathbf{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}). (B.31)

Now for any matrix 𝐀d×K\mathbf{A}\in\mathbb{R}^{d\times K} satisfying the aforementioned conditions, to show that 𝐀𝐄0𝐏𝐞j\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j} is asymptotically normal, we only need to show that 𝐚𝚺j1/2𝐀𝐄0𝐏𝐞j𝑑𝒩(0,1){\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1) for any 𝐚K{\mathbf{a}}\in\mathbb{R}^{K} with 𝐚2=1\|{\mathbf{a}}\|_{2}=1. We can write

𝐚𝚺j1/2𝐀𝐄0𝐏𝐞j=1ni=1n𝐚𝚺j1/2𝐀{𝑿i𝑿i𝚺𝐜~0(𝒁i𝒁i𝚲d)𝐜~0𝐈d}𝐏𝐞j\displaystyle{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}=\frac{1}{n}\sum_{i=1}^{n}{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\{\bm{X}_{i}\bm{X}_{i}^{\top}-\bm{\Sigma}-\widetilde{\mathbf{c}}_{0}^{\top}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{\mathbf{c}}_{0}\mathbf{I}_{d}\}\mathbf{P}_{\perp}\mathbf{e}_{j}
=1ni=1n{𝐚𝚺j1/2𝐀(𝑿i𝑿i𝚺)𝐏𝐞j𝐜~0(𝒁i𝒁i𝚲d)𝐜~0(𝐚𝚺j1/2𝐀𝐏𝐞j)}.\displaystyle=\frac{1}{n}\sum_{i=1}^{n}\Big{\{}{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\bm{X}_{i}\bm{X}_{i}^{\top}-\bm{\Sigma})\mathbf{P}_{\perp}\mathbf{e}_{j}-\widetilde{\mathbf{c}}_{0}^{\top}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{\mathbf{c}}_{0}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j})\Big{\}}.

We let xi=𝐚𝚺j1/2𝐀(𝑿i𝑿i𝚺)𝐏𝐞jx_{i}={\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\bm{X}_{i}\bm{X}_{i}^{\top}-\bm{\Sigma})\mathbf{P}_{\perp}\mathbf{e}_{j} and yi=𝐜~0(𝒁i𝒁i𝚲d)𝐜~0(𝐚𝚺j1/2𝐀𝐏𝐞j)y_{i}=\widetilde{\mathbf{c}}_{0}^{\top}(\bm{Z}_{i}\bm{Z}_{i}^{\top}-\mathbf{\Lambda}_{d})\widetilde{\mathbf{c}}_{0}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j}). For 𝚺j\bm{\Sigma}_{j}, we have that 𝚺j1/22σmin(𝚺j)1/2n/(σ2σmin(𝐀))\|\bm{\Sigma}_{j}^{-1/2}\|_{2}\leq\sigma_{\min}(\bm{\Sigma}_{j})^{-1/2}\leq\sqrt{n}/\big{(}\sigma^{2}\sigma_{\min}(\mathbf{A})\big{)}. Then we have

𝔼|xi|3𝔼|𝐚𝚺j1/2𝐀𝑿i𝑿i𝐏𝐞j|3𝔼|𝐚𝚺j1/2𝐀𝑿i|6𝔼|𝐞j𝐏𝑿i|6\displaystyle{\mathbb{E}}|x_{i}|^{3}\lesssim{\mathbb{E}}|{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\bm{X}_{i}\bm{X}_{i}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j}|^{3}\leq\sqrt{{\mathbb{E}}|{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\bm{X}_{i}|^{6}{\mathbb{E}}|\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\bm{X}_{i}|^{6}}
𝚺j1/223(λ1+σ2)3σ6𝐀26,\displaystyle\quad\lesssim\|\bm{\Sigma}_{j}^{-1/2}\|_{2}^{3}\sqrt{(\lambda_{1}+\sigma^{2})^{3}\sigma^{6}\|\mathbf{A}\|_{2}^{6}},
𝔼|yi|3(𝐚𝚺j1/2𝐀𝐏𝐞j)3𝔼|𝐜~0𝒁i𝒁i𝐜~0|3𝚺j1/2𝐀23𝔼|𝐜~0𝒁i|6\displaystyle{\mathbb{E}}|y_{i}|^{3}\lesssim({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j})^{3}{\mathbb{E}}|\widetilde{\mathbf{c}}_{0}^{\top}\bm{Z}_{i}\bm{Z}_{i}^{\top}\widetilde{\mathbf{c}}_{0}|^{3}\leq\|\bm{\Sigma}_{j}^{-1/2}\mathbf{A}\|_{2}^{3}{\mathbb{E}}|\widetilde{{\mathbf{c}}}_{0}^{\top}\bm{Z}_{i}|^{6}
𝚺j1/223(λ1+σ2)3𝐀23,\displaystyle\quad\lesssim\|\bm{\Sigma}_{j}^{-1/2}\|_{2}^{3}(\lambda_{1}+\sigma^{2})^{3}\|\mathbf{A}\|_{2}^{3},
𝔼|xiyi|3𝔼|xi|3+𝔼|yi|3𝚺j1/223((λ1+σ2)3σ6𝐀26+(λ1+σ2)3𝐀23)\displaystyle{\mathbb{E}}|x_{i}-y_{i}|^{3}\lesssim{\mathbb{E}}|x_{i}|^{3}+{\mathbb{E}}|y_{i}|^{3}\lesssim\|\bm{\Sigma}_{j}^{-1/2}\|_{2}^{3}\Big{(}\sqrt{(\lambda_{1}+\sigma^{2})^{3}\sigma^{6}\|\mathbf{A}\|_{2}^{6}}+(\lambda_{1}+\sigma^{2})^{3}\|\mathbf{A}\|_{2}^{3}\Big{)}
n3/2(λ1+σ2)3𝐀23/(σ2σmin(𝐀))3.\displaystyle\lesssim n^{3/2}(\lambda_{1}+\sigma^{2})^{3}\|\mathbf{A}\|_{2}^{3}/\big{(}\sigma^{2}\sigma_{\min}(\mathbf{A})\big{)}^{3}.

Thus

i=1n𝔼|xiyi|3Var{i=1n(xiyi)}3/2n(λ1+σ2)3𝐀23n3/2σ6σmin(𝐀)3(λ1+σ2)3λ13nσ6Δ3=o(1).\frac{\sum_{i=1}^{n}{\mathbb{E}}|x_{i}-y_{i}|^{3}}{\operatorname{Var}\Big{\{}\sum_{i=1}^{n}(x_{i}-y_{i})\Big{\}}^{3/2}}\lesssim\frac{n(\lambda_{1}+\sigma^{2})^{3}\|\mathbf{A}\|_{2}^{3}}{n^{3/2}\sigma^{6}\sigma_{\min}(\mathbf{A})^{3}}\lesssim\frac{(\lambda_{1}+\sigma^{2})^{3}\lambda_{1}^{3}}{\sqrt{n}\sigma^{6}\Delta^{3}}=o(1).

Thus Lyapunov’s condition is met and (B.31) holds. We then take 𝐀=𝛀𝐁𝛀\mathbf{A}=\bm{\Omega}\mathbf{B}_{\bm{\Omega}} and define the following event

𝒜𝛀\displaystyle\mathcal{A}_{\bm{\Omega}} ={1/2σmin(𝛀~()/p)σmax(𝛀~()/p)3/2,[L]}\displaystyle=\bigg{\{}1/2\leq\sigma_{\min}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq\sigma_{\max}(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p})\leq 3/2,\quad\forall\ell\in[L]\bigg{\}}
{12dpσmin(𝛀)σmax(𝛀)32dp,[L]}.\displaystyle\quad\cap\bigg{\{}\frac{1}{2}\sqrt{\frac{d}{p}}\leq\sigma_{\min}(\mathbf{\Omega})\leq\sigma_{\max}(\mathbf{\Omega})\leq\frac{3}{2}\sqrt{\frac{d}{p}},\quad\forall\ell\in[L]\bigg{\}}.

Then from previous results we know that ((𝒜𝛀)c)=o(1)\mathbb{P}((\mathcal{A}_{\bm{\Omega}})^{c})=o(1), and under the event 𝒜𝛀\mathcal{A}_{\bm{\Omega}} we have

σmax(𝛀𝐁𝛀)σmin(𝛀𝐁𝛀)9λ1/Δ,λK(𝚺j)σ42n(σmin(𝛀𝐁𝛀))2.\frac{\sigma_{\max}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}{\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}\leq 9\lambda_{1}/\Delta,\quad\lambda_{K}(\bm{\Sigma}_{j})\geq\frac{\sigma^{4}}{2n}\big{(}\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})\big{)}^{2}.

Thus from the above proof, for any vector 𝐭K\mathbf{t}\in\mathbb{R}^{K}, we have (𝚺j1/2𝒱(𝐄0)𝐞j𝐭|𝒜𝛀)Φ(𝐭)=o(1)\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{A}_{\bm{\Omega}}\big{)}-\Phi(\mathbf{t})=o(1), where Φ()\Phi(\cdot) is the CDF for 𝒩(0,𝐈K){\mathcal{N}}(0,\mathbf{I}_{K}). Then we have

(𝚺j1/2𝒱(𝐄0)𝐞j𝐭)=𝔼((𝚺j1/2𝒱(𝐄0)𝐞j𝐭|𝛀))\displaystyle\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}\big{)}={\mathbb{E}}\Big{(}\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathbf{\Omega}\big{)}\Big{)}
=(𝚺j1/2𝒱(𝐄0)𝐞j𝐭|𝛀𝒜𝛀)(𝒜𝛀)+(𝚺j1/2𝒱(𝐄0)𝐞j𝐭|𝛀𝒜𝛀c)(𝒜𝛀c)\displaystyle\quad=\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathbf{\Omega}\in\mathcal{A}_{\bm{\Omega}}\big{)}\mathbb{P}(\mathcal{A}_{\bm{\Omega}})+\mathbb{P}\big{(}\mathbf{\Sigma}_{j}^{-1/2}\mathcal{V}(\mathbf{E}_{0})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathbf{\Omega}\in\mathcal{A}_{\bm{\Omega}}^{c}\big{)}\mathbb{P}(\mathcal{A}_{\bm{\Omega}}^{c})
=(Φ(𝐭)+o(1))(1o(1))+o(1)=Φ(𝐭)+o(1).\displaystyle\quad=\big{(}\Phi(\mathbf{t})+o(1)\big{)}\big{(}1-o(1)\big{)}+o(1)=\Phi(\mathbf{t})+o(1).

Hence we have that Assumption 5 holds and (18) follows. Next we need to show that the result also holds for 𝚺~j\widetilde{\bm{\Sigma}}_{j}. From the previous discussion we already know that 𝚺j𝚺~j2=oP(λK(𝚺~j))\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}; then by Lemma 13 in Chen et al., [13] we have that 𝚺~j1/2𝚺j1/2𝐈K2=OP(𝚺~j1/22𝚺j1/2𝚺~j1/22)=OP(λK(𝚺~j)1𝚺j𝚺~j2)=oP(1)\|\widetilde{\bm{\Sigma}}_{j}^{-1/2}\bm{\Sigma}_{j}^{1/2}-\mathbf{I}_{K}\|_{2}=O_{P}\big{(}\|\widetilde{\bm{\Sigma}}_{j}^{-1/2}\|_{2}\|\bm{\Sigma}_{j}^{1/2}-\widetilde{\bm{\Sigma}}_{j}^{1/2}\|_{2}\big{)}=O_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})^{-1}\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}\big{)}=o_{P}(1). Then by Slutsky’s Theorem, we have

𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j=(𝚺~j1/2𝚺j1/2)𝚺j1/2(𝐕~F𝐇𝐕)𝐞j𝑑𝒩(0,𝐈K).\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=(\widetilde{\bm{\Sigma}}_{j}^{-1/2}\bm{\Sigma}_{j}^{1/2})\bm{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,\mathbf{I}_{K}).

Finally, we move on to verify the validity of the estimator 𝚺^j\widehat{\bm{\Sigma}}_{j} for the asymptotic covariance matrix. From Lemma 7 in Fan et al., [18], it can be seen that with probability 1o(1)1-o(1), 𝐇\mathbf{H} is orthonormal. When 𝐇\mathbf{H} is orthonormal, by Slutsky’s Theorem we have that

𝐇𝚺~j1/2(𝐕~F𝐇𝐕)𝐞j=𝐇𝚺~j1/2𝐇(𝐕~F𝐕𝐇)𝐞j𝑑𝒩(𝟎,𝐈K),\mathbf{H}\widetilde{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=\mathbf{H}\widetilde{\bm{\Sigma}}_{j}^{-1/2}\mathbf{H}^{\top}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}),

where it can be seen that 𝐇𝚺~j1/2𝐇=(𝐇𝚺~j𝐇)1/2\mathbf{H}\widetilde{\bm{\Sigma}}_{j}^{-1/2}\mathbf{H}^{\top}=(\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top})^{-1/2}. Therefore, it suffices to show that 𝚺^j𝐇𝚺~j𝐇2=oP(λK(𝚺~j))\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}, and the results will hold by Slutsky’s Theorem. Recall from the proof of Corollary 4.2, we have the following bounds

𝚺𝚺^2=OP((λ1+σ2)rn),|σ^2σ2|=OP(σ~1Kn),\|\bm{\Sigma}-\widehat{\bm{\Sigma}}\|_{2}=O_{P}\Big{(}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)},\quad|\widehat{\sigma}^{2}-\sigma^{2}|=O_{P}(\widetilde{\sigma}_{1}\sqrt{\frac{K}{n}}),

We will bound the components of 𝚺^j𝚺~j2\|\widehat{\bm{\Sigma}}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2} respectively. We have

σ2𝚺σ^2𝚺^2|σ^2σ2|𝚺2+σ2𝚺𝚺^2=OP(σ~1(λ1+σ2)Kn)\displaystyle\|\sigma^{2}\bm{\Sigma}-\widehat{\sigma}^{2}\widehat{\bm{\Sigma}}\|_{2}\lesssim|\widehat{\sigma}^{2}-\sigma^{2}|\|\bm{\Sigma}\|_{2}+\sigma^{2}\|\bm{\Sigma}-\widehat{\bm{\Sigma}}\|_{2}=O_{P}\Big{(}\widetilde{\sigma}_{1}(\lambda_{1}+\sigma^{2})\sqrt{\frac{K}{n}}\Big{)}
+OP(σ2(λ1+σ2)rn)=OP(σ2(λ1+σ2)rn),\displaystyle\quad+O_{P}\Big{(}\sigma^{2}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}=O_{P}\Big{(}\sigma^{2}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)},

Also, from the proof of Theorem 4.10, we have that with high probability

𝐕~F𝐇𝐕2=𝐕~F𝐕𝐇2𝐄02𝛀2𝐁𝛀2/L=OP(κ1drnpL),\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}\lesssim\|\mathbf{E}_{0}\|_{2}\|\mathbf{\Omega}\|_{2}\|\mathbf{B}_{\mathbf{\Omega}}\|_{2}/L=O_{P}(\kappa_{1}\sqrt{\frac{dr}{npL}}),

and 𝚺^tr𝐕𝚲𝐕2=OP((λ1+σ2)rn)\|\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\|_{2}=O_{P}\Big{(}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}, where 𝚺^tr=𝚺^σ^2𝐈d\widehat{\bm{\Sigma}}^{\text{tr}}=\widehat{\bm{\Sigma}}-\widehat{\sigma}^{2}\mathbf{I}_{d}. Then with high probability, for all [L]\ell\in[L] we have that

(𝐕~F𝚺^tr𝐇𝚲𝐕)𝛀()/p2dp(𝚺^tr𝐕𝚲𝐕2+λ1𝐕~F𝐕𝐇2)\displaystyle\big{\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top})\mathbf{\Omega}^{(\ell)}/\sqrt{p}\big{\|}_{2}\lesssim\sqrt{\frac{d}{p}}\big{(}\|\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\|_{2}+\lambda_{1}\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}\big{)}
=OP(κ1λ1d2rnp2L)=oP(Δ),\displaystyle=O_{P}\left(\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}L}}\right)=o_{P}(\Delta),

and thus by Theorem 3.3 in Stewart, [38], with high probability for all [L]\ell\in[L] we have that

𝐁^()𝐁()𝐇2=(𝐕~F𝚺^tr𝛀()/p)(𝐇𝚲𝐕𝛀()/p)2\displaystyle\|\widehat{\mathbf{B}}^{(\ell)}-\mathbf{B}^{(\ell)}\mathbf{H}^{\top}\|_{2}=\big{\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\bm{\Sigma}}^{\text{tr}}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}-(\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}\big{\|}_{2}
=OP(Δ2κ1λ1d2rnp2L),\displaystyle=O_{P}\bigg{(}\Delta^{-2}\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}L}}\bigg{)},

and in turn we have 𝐁^𝛀𝐁𝛀𝐇2=OP(Δ2κ1λ1d2rnp2L)L=OP(Δ2κ1λ1d2rnp2)\|\widehat{\mathbf{B}}_{\mathbf{\Omega}}-\mathbf{B}_{\mathbf{\Omega}}\mathbf{H}^{\top}\|_{2}=O_{P}\bigg{(}\Delta^{-2}\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}L}}\bigg{)}\sqrt{L}=O_{P}\bigg{(}\Delta^{-2}\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}}}\bigg{)}.

Thus combining the above results, under the condition that λ1κ14σ2d2rnp2L=o(1)\frac{\lambda_{1}\kappa_{1}^{4}}{\sigma^{2}}\sqrt{\frac{d^{2}r}{np^{2}L}}=o(1), following basic algebra we have

𝚺^j𝐇𝚺~j𝐇2OP(σ2(λ1+σ2)rn)dnLpΔ2+OP(dLnL2pΔ3σ2(σ2+λ1)κ1λ1d2rnp2)\displaystyle\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}\!\lesssim O_{P}\Big{(}\sigma^{2}(\lambda_{1}\!+\!\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}\frac{d}{nLp\Delta^{2}}\!+\!O_{P}\bigg{(}\frac{d\sqrt{L}}{nL^{2}p\Delta^{3}}\sigma^{2}(\sigma^{2}\!+\!\lambda_{1})\kappa_{1}\lambda_{1}\sqrt{\frac{d^{2}r}{np^{2}}}\bigg{)}
=OP(λ12Δ2σ2(λ1+σ2)rn)dσ4nLpλ12+OP(λ1κ14σ2d2rnp2L)dσ4nLpλ12=oP(λK(𝚺~j)).\displaystyle=O_{P}\Big{(}\frac{\lambda_{1}^{2}}{\Delta^{2}\sigma^{2}}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}+O_{P}\Big{(}\frac{\lambda_{1}\kappa_{1}^{4}}{\sigma^{2}}\sqrt{\frac{d^{2}r}{np^{2}L}}\Big{)}\frac{d\sigma^{4}}{nLp\lambda_{1}^{2}}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Therefore, by Slutsky’s Theorem, under the event :={𝐇 is orthonormal}\mathcal{B}:=\{\mathbf{H}\text{ is orthonormal}\}, for any vector 𝐭K\mathbf{t}\in\mathbb{R}^{K}, we have that (𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝐭|)Φ(𝐭)=o(1)\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{B})-\Phi(\mathbf{t})=o(1), and thus

(𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝐭)=(𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝐭|)()\displaystyle\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t})=\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{B})\mathbb{P}(\mathcal{B})
+(𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝐭|c)(c)\displaystyle\quad+\mathbb{P}(\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{B}^{c})\mathbb{P}(\mathcal{B}^{c})
=(𝚺^j1/2(𝐕~F𝐕𝐇)𝐞j𝐭|)(1o(1))+o(1)=Φ(𝐭)+o(1).\displaystyle=\mathbb{P}\big{(}\widehat{\bm{\Sigma}}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{e}_{j}\leq\mathbf{t}|\mathcal{B}\big{)}\big{(}1-o(1)\big{)}+o(1)=\Phi(\mathbf{t})+o(1).

Hence the claim follows.
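For illustration, the Monte Carlo sketch below (toy dimensions only; the fixed matrix A is an arbitrary well-conditioned surrogate for ΩB_Ω and is assumed purely for demonstration, not taken from the paper) reproduces the CLT (B.31) behind this proof. It forms the bias-corrected error matrix E₀ = (Σ̂−Σ) − (w⊤(Σ̂−Σ)w)I_d with w = (P⊥)_{[:,S]}u_{K+1}, exactly as in the construction above, whitens the replicated statistics A⊤E₀P⊥e_j by their empirical covariance, and checks that the standardized first coordinate behaves like a standard normal.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, n, reps = 60, 2, 1000, 2000              # toy sizes (assumed for illustration)
sigma2, lam = 1.0, np.array([15.0, 8.0])

V, _ = np.linalg.qr(rng.standard_normal((d, K)))
Sigma = (V * lam) @ V.T + sigma2 * np.eye(d)
P_perp = np.eye(d) - V @ V.T

# Debiasing direction w = (P_perp)_{[:,S]} u_{K+1}, with u_{K+1} the bottom
# eigenvector of Sigma_S, so that E_0 = (Sigma_hat - Sigma) - (w' (Sigma_hat - Sigma) w) I_d.
S = np.arange(K + 1)
u = np.linalg.eigh(Sigma[np.ix_(S, S)])[1][:, 0]
w = P_perp[:, S] @ u

j = d - 1
A = rng.standard_normal((d, K))                # fixed well-conditioned surrogate for Omega B_Omega

T = np.empty((reps, K))
for r in range(reps):
    X = rng.standard_normal((n, K)) * np.sqrt(lam) @ V.T \
        + np.sqrt(sigma2) * rng.standard_normal((n, d))
    E = X.T @ X / n - Sigma                    # sample covariance error
    E0 = E - (w @ E @ w) * np.eye(d)           # bias-corrected error matrix
    T[r] = A.T @ E0 @ P_perp[:, j]

C = np.linalg.cholesky(np.cov(T, rowvar=False))   # Monte Carlo estimate of Sigma_j
T_std = T @ np.linalg.inv(C).T                    # whitened statistics
cover = np.mean(np.abs(T_std[:, 0]) <= 1.96)
print(f"mean = {T_std[:, 0].mean():+.3f}, var = {T_std[:, 0].var():.3f}, "
      f"P(|T| <= 1.96) = {cover:.3f}  (normal target: 0.950)")
```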

B.8 Proof of Corollary 4.12

We will verify that Assumptions 1, 2, 3 and 5 hold. First, it is not hard to see that there exists some orthonormal matrix 𝐎K×K\mathbf{O}\in\mathbb{R}^{K\times K} such that 𝐕=𝐅𝐂1𝐎\mathbf{V}=\mathbf{F}\mathbf{C}^{-1}\mathbf{O}, where 𝐂=diag(d1,,dK)\mathbf{C}=\operatorname{diag}(\sqrt{d_{1}},\ldots,\sqrt{d_{K}}). From the problem setting of Example 3 we also know that there exists a constant C>0C>0 such that

C1KmaxkdkKminkdkdKmaxkdk,d1dKd/K,C^{-1}K\max_{k}d_{k}\leq K\min_{k}d_{k}\leq d\leq K\max_{k}d_{k},\quad d_{1}\asymp\ldots\asymp d_{K}\asymp d/K,

and thus that d/KσK(𝐂)𝐂2d/K\sqrt{d/K}\lesssim\sigma_{K}(\mathbf{C})\leq\|\mathbf{C}\|_{2}\lesssim\sqrt{d/K}. Then 𝐕2,Kd𝐅2,=Kd\|\mathbf{V}\|_{2,\infty}\lesssim\sqrt{\frac{K}{d}}\|\mathbf{F}\|_{2,\infty}=\sqrt{\frac{K}{d}}. Thus Assumption 3 holds with μ=O(1)\mu=O(1).

From the proof of Corollary 4.2 we know that Assumption 1 is satisfied. Besides, recall from Remark 16, under the condition that K/dlogd=O(1)\sqrt{K/d}\log d=O(1), with probability at least 1d101-d^{-10} we have that 𝐄2dΔ0/K+dnlogd:=r1(d)\|\mathbf{E}\|_{2}\lesssim d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d:=r_{1}^{\prime}(d), which is sharper than r1(d)logdr_{1}(d)\log d. Since 𝐄b=0\mathbf{E}_{b}=0, we have r2(d)=0r_{2}(d)=0 and Assumption 2 holds trivially. Now we move on to study the minimum covariance eigenvalue rate η2(d)\eta_{2}(d). From the proof of Corollary 4.2, we know that

𝐄=𝐄0=𝐅𝚯𝐙+𝐙𝚯𝐅+𝐙𝐙n𝐈d=i=1n{𝐐i𝐙i.+𝐙i.𝐐i+𝐙i.𝐙i.𝐈d},\mathbf{E}=\mathbf{E}_{0}=\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{Z}+\mathbf{Z}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+\mathbf{Z}^{\top}\mathbf{Z}-n\mathbf{I}_{d}=\sum_{i=1}^{n}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}},

where 𝐐i=𝐅𝚯i.d\mathbf{Q}_{i}=\mathbf{F}\mathbf{\Theta}_{i.}\in\mathbb{R}^{d} with 𝚯i.\mathbf{\Theta}_{i.} being the ii-th row of 𝚯\mathbf{\Theta}, 𝐙i.\mathbf{Z}_{i.} is the ii-th row of 𝐙\mathbf{Z} and 𝐙i.i.i.d𝒩(𝟎,𝐈d)\mathbf{Z}_{i.}\overset{\text{i.i.d}}{\sim}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{d}). Then for j[d]j\in[d], we have

Cov(𝐄0𝐏𝐞j)\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}) =Cov(i=1n{𝐐i𝐙i.+𝐙i.𝐐i+𝐙i.𝐙i.𝐈d}𝐏𝐞j)\displaystyle=\operatorname*{\rm Cov}\Big{(}\sum_{i=1}^{n}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}
=i=1nCov({𝐐i𝐙i.+𝐙i.𝐐i+𝐙i.𝐙i.𝐈d}𝐏𝐞j)\displaystyle=\sum_{i=1}^{n}\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}
=i=1nCov({𝐐i𝐙i.+𝐙i.𝐙i.𝐈d}𝐏𝐞j),\displaystyle=\sum_{i=1}^{n}\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)},

where the last equality is due to the fact that 𝐏𝐐i=𝐏𝐅𝚯i.=𝟎\mathbf{P}_{\perp}\mathbf{Q}_{i}=\mathbf{P}_{\perp}\mathbf{F}\mathbf{\Theta}_{i.}=\mathbf{0}. Now for i[n]i\in[n], we calculate Cov({𝐐i𝐙i.+𝐙i.𝐙i.𝐈d}𝐏𝐞j)\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}. Following basic algebra, we have that

Cov({𝐐i𝐙i.+𝐙i.𝐙i.𝐈d}𝐏𝐞j)\displaystyle\operatorname*{\rm Cov}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}\!+\!\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}\!-\!\mathbf{I}_{d}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\Big{)}
=𝔼({𝐐i𝐙i.+𝐙i.𝐙i.}𝐏𝐞j𝐞j𝐏{𝐙i.𝐐i+𝐙i.𝐙i.})𝐏𝐞j𝐞j𝐏\displaystyle=\!{\mathbb{E}}\Big{(}\big{\{}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}\!+\!\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}\big{\}}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\big{\{}\mathbf{Z}_{i.}\mathbf{Q}_{i}^{\top}\!+\!\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}\big{\}}\Big{)}\!-\!\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}
=𝐏𝐞j22(𝐐i𝐐i+𝐈d)+2𝐏𝐞j𝐞j𝐏𝐏𝐞j𝐞j𝐏\displaystyle=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}+\mathbf{I}_{d})+2\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}
=𝐏𝐞j22(𝐐i𝐐i+𝐈d)+𝐏𝐞j𝐞j𝐏,\displaystyle=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}+\mathbf{I}_{d})+\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp},

and thus

Cov(𝐄0𝐏𝐞j)\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}) =i=1n(𝐏𝐞j22(𝐐i𝐐i+𝐈d)+𝐏𝐞j𝐞j𝐏)\displaystyle=\sum_{i=1}^{n}\left(\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}\!+\!\mathbf{I}_{d})\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\right)
=𝐏𝐞j22(i=1n𝐐i𝐐i+n𝐈d)+n𝐏𝐞j𝐞j𝐏\displaystyle=\!\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\sum_{i=1}^{n}\mathbf{Q}_{i}\mathbf{Q}_{i}^{\top}\!+\!n\mathbf{I}_{d})\!+\!n\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}
=𝐏𝐞j22(𝐅𝚯𝚯𝐅+n𝐈d)+n𝐏𝐞j𝐞j𝐏.\displaystyle=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d})+\!n\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}.

Then since 𝐏𝐞j2=1K/d=1o(1)\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}=1-K/d=1-o(1), we have that λd(Cov(𝐄0𝐏𝐞j))n\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\gtrsim n, and hence we have η2(d)dn/(λ12Lp)\eta_{2}(d)\asymp dn/(\lambda_{1}^{2}Lp). Then under the condition that nd3L/pn\gg d^{3}L/p and Δ02K(logd)2dnL/p\Delta_{0}^{2}\gg K(\log d)^{2}\sqrt{dnL/p}, we have that

r1(d)4η2(d)λ12Lpd(d4Δ04K2n+d2n(logd)4)λ12p2d2Δ04K2d2p2Δ4d2,d2r1(d)4p2Δ4η2(d)=o(1).\frac{r_{1}^{\prime}(d)^{4}}{\eta_{2}(d)}\lesssim\frac{\lambda_{1}^{2}Lp}{d}\left(\frac{d^{4}\Delta_{0}^{4}}{K^{2}n}+d^{2}n(\log d)^{4}\right)\ll\frac{\lambda_{1}^{2}p^{2}d^{2}\Delta_{0}^{4}}{K^{2}d^{2}}\asymp\frac{p^{2}\Delta^{4}}{d^{2}},\quad\frac{d^{2}r_{1}^{\prime}(d)^{4}}{p^{2}\Delta^{4}\eta_{2}(d)}=o(1).
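Since the covariance formula for Cov(E₀P⊥e_j) derived above is exact once P⊥Q_i = 0, it can be verified directly by simulation. The sketch below (toy n, d, K and a fixed hypothetical Θ, for illustration only) generates replicates of E₀P⊥e_j from the decomposition above and compares their empirical covariance with ‖P⊥e_j‖₂²(FΘ⊤ΘF⊤ + nI_d) + nP⊥e_je_j⊤P⊥; the relative error shrinks as the number of replications grows.

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, n, reps = 30, 3, 40, 20000               # toy sizes (assumed for illustration)

labels = np.arange(d) % K                      # balanced membership, for illustration
F = np.eye(K)[labels]                          # d x K membership matrix
Theta = 2.0 * rng.standard_normal((n, K))      # a fixed hypothetical Theta
V, _ = np.linalg.qr(F)                         # orthonormal basis of col(F)
P_perp = np.eye(d) - V @ V.T
j = 0
v = P_perp[:, j]

samples = np.empty((reps, d))
for r in range(reps):
    Z = rng.standard_normal((n, d))
    E0 = F @ Theta.T @ Z + Z.T @ Theta @ F.T + Z.T @ Z - n * np.eye(d)
    samples[r] = E0 @ v

emp_cov = np.cov(samples, rowvar=False)
theory = (v @ v) * (F @ Theta.T @ Theta @ F.T + n * np.eye(d)) + n * np.outer(v, v)
rel_err = np.linalg.norm(emp_cov - theory, 2) / np.linalg.norm(theory, 2)
print(f"relative spectral-norm error of the covariance formula: {rel_err:.3f}")
```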

Now we move on to check Assumption 5. As in the proof of Corollary 4.11, we will first show the results conditional on 𝛀\bm{\Omega} by establishing a more general CLT. More specifically, we will show that for any 𝐚K{\mathbf{a}}\in\mathbb{R}^{K} with 𝐚2=1\|{\mathbf{a}}\|_{2}=1, and 𝐀d×K\mathbf{A}\in\mathbb{R}^{d\times K} such that λK(Cov(𝐀𝐄0𝐏𝐞j))cnσmin(𝐀)2\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq cn\sigma_{\min}(\mathbf{A})^{2} and σmax(𝐀)/σmin(𝐀)C\sigma_{\max}(\mathbf{A})/\sigma_{\min}(\mathbf{A})\leq C, where C,c>0C,c>0 are constants independent of 𝐀\mathbf{A} and we abuse notation by denoting 𝚺j:=Cov(𝐀𝐄0𝐏𝐞j)\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}), we have 𝐚𝚺j1/2𝐀𝐄0𝐏𝐞j𝑑𝒩(0,1){\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1). Define 𝐐=𝚯𝐅\mathbf{Q}=\mathbf{\Theta}\mathbf{F}^{\top}. We know that

𝐐2,=maxi[n]𝐐i2𝐅2𝚯2,μθΔ0dn,and\|\mathbf{Q}\|_{2,\infty}=\max_{i\in[n]}\|\mathbf{Q}_{i}\|_{2}\leq\|\mathbf{F}\|_{2}\|\mathbf{\Theta}\|_{2,\infty}\lesssim\mu_{\theta}\Delta_{0}\sqrt{\frac{d}{n}},\,\text{and}
𝐚𝚺j1/2𝐀𝐄0𝐏𝐞j=i=1n{𝐚𝚺j1/2𝐀(𝐐i𝐙i.+𝐙i.𝐙i.𝐈d)𝐏𝐞j},{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}=\sum_{i=1}^{n}\big{\{}{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}+\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d})\mathbf{P}_{\perp}\mathbf{e}_{j}\big{\}},

and we denote

xi=𝐚𝚺j1/2𝐀𝐐i𝐙i.𝐏𝐞j,yi=𝐚𝚺j1/2𝐀(𝐙i.𝐙i.𝐈d)𝐏𝐞j.x_{i}={\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{Q}_{i}\mathbf{Z}_{i.}^{\top}\mathbf{P}_{\perp}\mathbf{e}_{j},\quad y_{i}={\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}(\mathbf{Z}_{i.}\mathbf{Z}_{i.}^{\top}-\mathbf{I}_{d})\mathbf{P}_{\perp}\mathbf{e}_{j}.

Then we have

𝔼|xi+yi|3\displaystyle{\mathbb{E}}|x_{i}+y_{i}|^{3} 𝔼|xi|3+𝔼|yi|3𝚺j1/2𝐀𝐐i23+𝚺j1/2𝐀23\displaystyle\lesssim{\mathbb{E}}|x_{i}|^{3}+{\mathbb{E}}|y_{i}|^{3}\lesssim\|\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{Q}_{i}\|_{2}^{3}+\|\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\|_{2}^{3}
𝚺j1/223𝐀23(𝐐2,3+1)n3/2{𝐀2σmin(𝐀)}3{μθ3Δ03(dn)3/2+1}\displaystyle\leq\|\bm{\Sigma}_{j}^{-1/2}\|_{2}^{3}\|\mathbf{A}\|_{2}^{3}(\|\mathbf{Q}\|_{2,\infty}^{3}+1)\lesssim n^{-3/2}\big{\{}\frac{\|\mathbf{A}\|_{2}}{\sigma_{\min}(\mathbf{A})}\big{\}}^{3}\big{\{}\mu_{\theta}^{3}\Delta_{0}^{3}\big{(}\frac{d}{n}\big{)}^{3/2}+1\big{\}}
n3/2μθ3Δ03(dn)3/2+n3/2.\displaystyle\lesssim n^{-3/2}\mu_{\theta}^{3}\Delta_{0}^{3}\big{(}\frac{d}{n}\big{)}^{3/2}+n^{-3/2}.

Then

i=1n𝔼|xi+yi|3Var{i=1n(xi+yi)}3/2\displaystyle\frac{\sum_{i=1}^{n}{\mathbb{E}}|x_{i}+y_{i}|^{3}}{\operatorname{Var}\big{\{}\sum_{i=1}^{n}(x_{i}+y_{i})\big{\}}^{3/2}} =i=1n𝔼|xi+yi|3n2μθ3Δ03d3/2+n1/2.\displaystyle=\sum_{i=1}^{n}{\mathbb{E}}|x_{i}+y_{i}|^{3}\lesssim n^{-2}\mu_{\theta}^{3}\Delta_{0}^{3}d^{3/2}+n^{-1/2}.

Then under the condition that Δ02n4/3/(μθ2d)\Delta_{0}^{2}\ll n^{4/3}/(\mu_{\theta}^{2}d), we have that

n2μθ3Δ03d3/2=o(1)and(i=1n𝔼|xi+yi|3)Var(i=1n(xi+yi))3/2=o(1).n^{-2}\mu_{\theta}^{3}\Delta_{0}^{3}d^{3/2}=o(1)\quad\text{and}\quad\left({\sum_{i=1}^{n}{\mathbb{E}}|x_{i}+y_{i}|^{3}}\right){\operatorname{Var}\big{(}\sum_{i=1}^{n}(x_{i}+y_{i})\big{)}^{-3/2}}=o(1).

Thus Lyapunov’s condition is met and the CLT holds. Also, recall from previous arguments that there exists a fixed constant C>0C>0 such that with high probability we have

σmax(𝛀𝐁𝛀)σmin(𝛀𝐁𝛀)9(σ1(𝚯)σK(𝚯))2C,λK(𝚺j)n2(σmin(𝛀𝐁𝛀))2.\frac{\sigma_{\max}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}{\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})}\leq 9\left(\frac{\sigma_{1}(\mathbf{\Theta})}{\sigma_{K}(\mathbf{\Theta})}\right)^{2}\leq C,\quad\lambda_{K}(\bm{\Sigma}_{j})\geq\frac{n}{2}\big{(}\sigma_{\min}(\bm{\Omega}\mathbf{B}_{\bm{\Omega}})\big{)}^{2}.

Then by taking 𝐀=𝛀𝐁𝛀\mathbf{A}=\bm{\Omega}\mathbf{B}_{\bm{\Omega}} and following similar steps as in the proof of Corollary 4.11, we know that Assumption 5 is satisfied. Then by Theorem 4.10, (17) holds.

We move on to prove (20). It suffices to show that 𝚺j𝚺~j2=oP(λK(𝚺~j))\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}. When Δ02n\Delta_{0}^{2}\ll n and KdK\ll d, with high probability we have

𝚺j𝚺~j2dLΔ2p{(n+Δ)(1𝐏𝐞j22)+n𝐏𝐞j𝐞j𝐏𝐞j𝐞j2}\displaystyle\|\bm{\Sigma}_{j}-\widetilde{\bm{\Sigma}}_{j}\|_{2}\lesssim\frac{d}{L\Delta^{2}p}\big{\{}(n+\Delta)(1\!-\!\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2})+n\|\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}-\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\|_{2}\big{\}}
+nLΔ2𝛀2,2dLΔ2p(Knd+Δ02+nKd)+nΔ2=o(dnLλ12p)=oP(λK(𝚺~j)).\displaystyle\quad+\frac{n}{L\Delta^{2}}\|\mathbf{\Omega}\|_{2,\infty}^{2}\lesssim\frac{d}{L\Delta^{2}p}\Big{(}\frac{Kn}{d}+\Delta_{0}^{2}+n\sqrt{\frac{K}{d}}\Big{)}+\frac{n}{\Delta^{2}}=o(\frac{dn}{L\lambda_{1}^{2}p})=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Thus (20) holds.

Lastly, we verify the validity of 𝚺^j\widehat{\bm{\Sigma}}_{j}. As in the proof of Corollary 4.11, it suffices to show that 𝚺^j𝐇𝚺~j𝐇2=oP(λK(𝚺~j))\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}. Recall that with high probability 𝐌^𝐌2r1(d)=dΔ0/K+dnlogd\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}\lesssim r_{1}^{\prime}(d)=d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d.

Also, from the proof of Theorem 4.10, we have that

𝐕~F𝐇𝐕2=𝐕~F𝐕𝐇2=1LOP(𝐌^𝐌2𝛀2𝐁𝛀2)=OP(dpLr1(d)Δ),\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=\frac{1}{L}O_{P}(\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}\|\mathbf{\Omega}\|_{2}\|\mathbf{B}_{\mathbf{\Omega}}\|_{2})=O_{P}\Big{(}\sqrt{\frac{d}{pL}}\frac{r_{1}^{\prime}(d)}{\Delta}\Big{)},

Then with high probability, for all [L]\ell\in[L] we have that

(𝐕~F𝐌^𝐇𝚲𝐕)𝛀()/p2dp(𝐌^𝐌2+λ1𝐕~F𝐕𝐇2)\displaystyle\big{\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}-\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top})\mathbf{\Omega}^{(\ell)}/\sqrt{p}\big{\|}_{2}\lesssim\sqrt{\frac{d}{p}}\big{(}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}+\lambda_{1}\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}\big{)}
=OP(d2p2Lr1(d))=oP(Δ),\displaystyle=O_{P}\left(\sqrt{\frac{d^{2}}{p^{2}L}}r_{1}^{\prime}(d)\right)=o_{P}(\Delta),

and thus by Theorem 3.3 in Stewart, [38], we have that

𝐁^()𝐁()𝐇2=(𝐕~F𝐌^𝛀()/p)(𝐇𝚲𝐕𝛀()/p)2=OP(d2p2Lr1(d)Δ2),\displaystyle\|\widehat{\mathbf{B}}^{(\ell)}-\mathbf{B}^{(\ell)}\mathbf{H}^{\top}\|_{2}=\big{\|}(\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}-(\mathbf{H}\mathbf{\Lambda}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}/\sqrt{p})^{\dagger}\big{\|}_{2}=O_{P}\left(\sqrt{\frac{d^{2}}{p^{2}L}}\frac{r_{1}^{\prime}(d)}{\Delta^{2}}\right),

and in turn we have 𝐁^𝛀𝐁𝛀𝐇2=OP(d2p2Lr1(d)Δ2)L=OP(dr1(d)pΔ2)\|\widehat{\mathbf{B}}_{\mathbf{\Omega}}-\mathbf{B}_{\mathbf{\Omega}}\mathbf{H}^{\top}\|_{2}=O_{P}\left(\sqrt{\frac{d^{2}}{p^{2}L}}\frac{r_{1}^{\prime}(d)}{\Delta^{2}}\right)\sqrt{L}=O_{P}\left(\frac{dr_{1}^{\prime}(d)}{p\Delta^{2}}\right).

Therefore, under the condition that Δ02KLp2n2/d4\Delta_{0}^{2}\ll{KLp^{2}n^{2}}/{d^{4}}, we have

𝚺^j𝐇𝚺~j𝐇2dLΔ2p𝐌^𝐌2+(n+λ1)dpLΔOP(dpLr1(d)Δ2)\displaystyle\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}\lesssim\frac{d}{L\Delta^{2}p}\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}+(n+\lambda_{1})\frac{d}{pL\Delta}O_{P}\left(\frac{d}{p\sqrt{L}}\frac{r_{1}^{\prime}(d)}{\Delta^{2}}\right)
=oP(dnLΔ2p)=oP(λK(𝚺~j)).\displaystyle=o_{P}\big{(}\frac{dn}{L\Delta^{2}p}\big{)}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Thus the claim follows.

B.9 Proof of Theorem 4.5

We first decompose 𝐕~F𝐇𝐕=(𝐕~F𝐇𝐕~𝐇1𝐇0)+(𝐕~𝐇1𝐇0𝐕^𝐇0)+(𝐕^𝐇0𝐕)\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}=(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0})+(\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}-\widehat{\mathbf{V}}\mathbf{H}_{0})+(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}). We will show that when LL is sufficiently large the first two terms are negligible, so we consider the third term 𝐕^𝐇0𝐕\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V} first, starting with a bound on 𝐕^𝐇0𝐕𝐏𝐄0𝐕𝚲12,\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}\|_{2,\infty} obtained by decomposing the error term. For notational convenience, we write 𝐏=𝐕𝐕\mathbf{P}=\mathbf{V}\mathbf{V}^{\top}. If we define 𝐇^0=𝐕^𝐕\widehat{\mathbf{H}}_{0}=\widehat{\mathbf{V}}^{\top}\mathbf{V}, we can decompose

𝐕^𝐇0𝐕𝐏𝐄0𝐕𝚲1\displaystyle\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}
=𝐏𝐕^𝐇^0𝐏𝐄0𝐕𝚲1+𝐏𝐕^(𝐇0𝐇^0)+(𝐏𝐕^𝐇0𝐕).\displaystyle=\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\!-\!\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}\!+\!\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{H}}_{0}\!-\!\widehat{\mathbf{H}}_{0})\!+\!(\mathbf{P}\widehat{\mathbf{V}}\mathbf{H}_{0}\!-\!\mathbf{V}).

Under the condition that 𝐄2/Δ=OP(r1(d)/Δ)=oP(1)\|\mathbf{E}\|_{2}/\Delta=O_{P}\big{(}r_{1}(d)/\Delta\big{)}=o_{P}(1), we have that 𝐇0\mathbf{H}_{0} is a full-rank orthonormal matrix with probability 1o(1)1-o(1). Then we have with probability 1o(1)1-o(1) that

𝐏𝐕^(𝐇0𝐇^0)2,=(𝐈𝐕𝐕)(𝐕^𝐇0𝐕)𝐇0(𝐇0𝐇^0)2,\displaystyle\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\|_{2,\infty}=\|(\mathbf{I}-\mathbf{V}\mathbf{V}^{\top})(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\|_{2,\infty}
(𝐕^𝐇0𝐕)𝐇0(𝐇0𝐇^0)2,+𝐕𝐕(𝐕^𝐇0𝐕)𝐇0(𝐇0𝐇^0)2,\displaystyle\leq\|(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\|_{2,\infty}+\|\mathbf{V}\mathbf{V}^{\top}(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\|_{2,\infty}
𝐕^𝐇0𝐕2,𝐇0𝐇^02+𝐕2,𝐕^𝐇0𝐕2𝐇0𝐇^02\displaystyle\leq\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty}\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\|_{2}+\|\mathbf{V}\|_{2,\infty}\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2}\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\|_{2}
(r3(d)+μKd𝐄2Δ)𝐇0𝐇^02.\displaystyle\lesssim\Big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}\frac{\|\mathbf{E}\|_{2}}{\Delta}\Big{)}\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\|_{2}.

From Lemma 7 in Fan et al., [18], we know that 𝐇0𝐇^02𝐕^𝐕^𝐕𝐕22(𝐄2/Δ)2=OP(r1(d)2/Δ2)\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\|_{2}\lesssim\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2}\lesssim(\|\mathbf{E}\|_{2}/\Delta)^{2}=O_{P}(r_{1}(d)^{2}/\Delta^{2}), and thus we have

𝐏𝐕^(𝐇0𝐇^0)2,=OP((r3(d)+μKdr1(d)Δ)r1(d)2/Δ2).\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0})\|_{2,\infty}=O_{P}\left(\Big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}\frac{r_{1}(d)}{\Delta}\Big{)}r_{1}(d)^{2}/\Delta^{2}\right).

We move on to bound 𝐏𝐕^𝐇0𝐕2,\|\mathbf{P}\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty},

𝐏𝐕^𝐇0𝐕2,\displaystyle\|\mathbf{P}\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty} =𝐕(𝐇^0𝐇0𝐈K)2,𝐕2,𝐇0𝐇^02\displaystyle=\|\mathbf{V}(\widehat{\mathbf{H}}_{0}^{\top}\mathbf{H}_{0}-\mathbf{I}_{K})\|_{2,\infty}\leq\|\mathbf{V}\|_{2,\infty}\|{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\|_{2}
=OP(μKdr1(d)2/Δ2).\displaystyle=O_{P}\left(\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}\right).

Finally, we consider the term 𝐏𝐕^𝐇^0𝐏𝐄0𝐕𝚲1\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}. We can decompose

𝐏𝐕^𝐇^0𝐏𝐄0𝐕𝚲1=𝐏𝐕^𝐇^0𝚲𝚲1𝐏𝐄0𝐕𝚲1\displaystyle\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}=\mathbf{P}_{\perp}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}\mathbf{\Lambda}^{-1}-\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}
=𝐏(𝐄𝐕^𝐇^0𝐄0𝐕+𝐕^(𝚲𝚲^)𝐇^0+𝐕^(𝐇^0𝚲𝚲𝐇^0))𝚲1.\displaystyle=\mathbf{P}_{\perp}\big{(}\mathbf{E}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{E}_{0}\mathbf{V}+\widehat{\mathbf{V}}(\mathbf{\Lambda}-\widehat{\mathbf{\Lambda}})\widehat{\mathbf{H}}_{0}+\widehat{\mathbf{V}}(\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}-\mathbf{\Lambda}\widehat{\mathbf{H}}_{0})\big{)}\mathbf{\Lambda}^{-1}.

We bound the three terms separately. With high probability,

𝐏(𝐄𝐕^𝐇^0𝐄0𝐕)𝚲12,𝐏𝐄0(𝐕^𝐇^0𝐕)𝚲12,+𝐏𝐄b𝐕^𝐇^0𝚲12,\displaystyle\|\!\mathbf{P}_{\perp}\!\big{(}\mathbf{E}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\!-\!\mathbf{E}_{0}\mathbf{V}\big{)}\!\mathbf{\Lambda}^{-1}\!\|_{2,\infty}\!\!\leq\!\|\!\mathbf{P}_{\perp}\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\!-\!\mathbf{V}\!)\mathbf{\Lambda}^{-1}\!\|_{2,\infty}\!\!+\!\|\!\mathbf{P}_{\perp}\mathbf{E}_{b}\!\widehat{\mathbf{V}}\!\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\!\|_{2,\infty}
𝐄0(𝐕^𝐇^0𝐕)𝚲12,+𝐕𝐕𝐄0(𝐕^𝐇^0𝐕)𝚲12,+𝐄b2/Δ\displaystyle\quad\leq\|\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V})\mathbf{\Lambda}^{-1}\|_{2,\infty}+\|\mathbf{V}\mathbf{V}^{\top}\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V})\mathbf{\Lambda}^{-1}\|_{2,\infty}+\|\mathbf{E}_{b}\|_{2}/\Delta
𝐄0(𝐕^𝐇^0𝐕)2,/Δ+𝐕2,𝐄02𝐕^𝐇^0𝐕2/Δ+r2(d)/Δ\displaystyle\quad\leq\|\mathbf{E}_{0}(\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V})\|_{2,\infty}/\Delta+\|\mathbf{V}\|_{2,\infty}\|\mathbf{E}_{0}\|_{2}\|\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}-\mathbf{V}\|_{2}/\Delta+r_{2}(d)/\Delta
=OP(r4(d,𝚲)/Δ+μKdr1(d)2/Δ2+r2(d)/Δ).\displaystyle\quad=O_{P}\left(r_{4}(d,\mathbf{\Lambda})/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}+r_{2}(d)/\Delta\right).

As for 𝐏𝐕^(𝚲𝚲^)𝐇^0𝚲1\mathbf{P}_{\perp}\widehat{\mathbf{V}}({\mathbf{\Lambda}}-\widehat{\mathbf{\Lambda}})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}, we have

𝐏𝐕^(𝚲^𝚲)𝐇^0𝚲12,(𝐕^𝐇0𝐕)𝐇0(𝚲^𝚲)𝐇^0𝚲12,\displaystyle\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}(\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\|_{2,\infty}\leq\|(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\|_{2,\infty}
+𝐕𝐕(𝐕^𝐇0𝐕)𝐇0(𝚲^𝚲)𝐇^0𝚲12,\displaystyle\quad\quad+\|\mathbf{V}\mathbf{V}^{\top}(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\widehat{\mathbf{\Lambda}}-\mathbf{\Lambda})\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\|_{2,\infty}
𝐕^𝐇0𝐕2,𝐄02/Δ+𝐕2,𝐕^𝐇0𝐕2𝐄02/Δ\displaystyle\quad\leq\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty}\|\mathbf{E}_{0}\|_{2}/\Delta+\|\mathbf{V}\|_{2,\infty}\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2}\|\mathbf{E}_{0}\|_{2}/\Delta
=OP{r3(d)r1(d)/Δ+μKdr1(d)2/Δ2},\displaystyle\quad=O_{P}\bigg{\{}r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}\bigg{\}},

and finally

𝐏𝐕^(𝚲𝐇^0𝐇^0𝚲)𝚲12,(𝐕^𝐇0𝐕)𝐇0(𝚲𝐇^0𝐇^0𝚲)𝚲12,\displaystyle\|\mathbf{P}_{\perp}\widehat{\mathbf{V}}(\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda})\mathbf{\Lambda}^{-1}\|_{2,\infty}\leq\|(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda})\mathbf{\Lambda}^{-1}\|_{2,\infty}
+𝐕𝐕(𝐕^𝐇0𝐕)𝐇0(𝚲𝐇^0𝐇^0𝚲)𝚲12,\displaystyle\quad\quad+\|\mathbf{V}\mathbf{V}^{\top}(\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V})\mathbf{H}_{0}^{\top}(\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda})\mathbf{\Lambda}^{-1}\|_{2,\infty}
=OP((r3(d)+μKdr1(d)/Δ)𝚲𝐇^0𝐇^0𝚲2/Δ)\displaystyle\quad=O_{P}\left(\big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}r_{1}(d)/\Delta\big{)}\|\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}\|_{2}/\Delta\right)
=OP((r3(d)+μKdr1(d)/Δ)r1(d)/Δ),\displaystyle\quad=O_{P}\left(\big{(}r_{3}(d)+\sqrt{\frac{\mu K}{d}}r_{1}(d)/\Delta\big{)}r_{1}(d)/\Delta\right),

where the last inequality is due to the fact that

𝚲𝐇^0𝐇^0𝚲2\displaystyle\|\mathbf{\Lambda}\widehat{\mathbf{H}}_{0}-\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}\|_{2} =𝚲𝐕^𝐕𝐕𝐕^𝐕𝚲𝐕2\displaystyle=\|\mathbf{\Lambda}\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}-\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\|_{2}
=𝚲𝐕^𝐕𝐕𝐕^𝐌𝐕𝐕2\displaystyle=\|\mathbf{\Lambda}\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}-\widehat{\mathbf{V}}^{\top}\mathbf{M}\mathbf{V}\mathbf{V}^{\top}\|_{2}
𝚲𝐕^𝐕𝐕𝐕^𝐌^𝐕𝐕2+𝐕^𝐄𝐕𝐕2\displaystyle\leq\|\mathbf{\Lambda}\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}-\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{M}}\mathbf{V}\mathbf{V}^{\top}\|_{2}+\|\widehat{\mathbf{V}}^{\top}\mathbf{E}\mathbf{V}\mathbf{V}^{\top}\|_{2}
=(𝚲𝚲^)𝐕^𝐕𝐕2+𝐕^𝐄𝐕𝐕22𝐄2.\displaystyle=\|(\mathbf{\Lambda}-\widehat{\mathbf{\Lambda}})\widehat{\mathbf{V}}^{\top}\mathbf{V}\mathbf{V}^{\top}\|_{2}+\|\widehat{\mathbf{V}}^{\top}\mathbf{E}\mathbf{V}\mathbf{V}^{\top}\|_{2}\leq 2\|\mathbf{E}\|_{2}.

Thus in summary, we have

𝐕^𝐇0𝐕𝐏𝐄0𝐕𝚲12,=OP{r3(d)r1(d)Δ+μKdr1(d)2Δ2+r2(d)+r4(d)Δ},\displaystyle\|\widehat{\mathbf{V}}\mathbf{H}_{0}\!\!-\!\!\mathbf{V}\!\!-\!\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}\|_{2,\infty}\!=\!O_{P}\bigg{\{}\!\!\frac{r_{3}(d)r_{1}(d)}{\Delta}\!+\!\!\sqrt{\frac{\mu K}{d}}\!\frac{r_{1}(d)^{2}}{\Delta^{2}}\!+\!\frac{r_{2}(d)\!+\!r_{4}(d)}{\Delta}\!\!\bigg{\}},

Now we move on to bound 𝐕~𝐇1𝐇0𝐕^𝐇02\|\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}-\widehat{\mathbf{V}}\mathbf{H}_{0}\|_{2}. By Theorem 4.1, we know that

𝐕~𝐇1𝐇0𝐕^𝐇02\displaystyle\|\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}-\widehat{\mathbf{V}}\mathbf{H}_{0}\|_{2} 𝐕~𝐇1𝐕^2𝐕~𝐕~𝐕^𝐕^2\displaystyle\leq\|\widetilde{\mathbf{V}}\mathbf{H}_{1}-\widehat{\mathbf{V}}\|_{2}\lesssim\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}
𝐕~𝐕~𝐕𝐕2+𝐕𝐕𝐕^𝐕^2\displaystyle\leq\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}\|_{2}+\|{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}
𝐕~𝐕~𝐕𝐕F+𝐕𝐕𝐕^𝐕^2\displaystyle\leq\|\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}-{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}\|_{\text{F}}+\|{\mathbf{V}}^{\prime}{\mathbf{V}}^{\prime\top}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}
=OP(1dr1(d)2Δ2+KdΔ2pLr1(d)).\displaystyle=O_{P}\Big{(}\frac{1}{\sqrt{d}}\frac{r_{1}(d)^{2}}{\Delta^{2}}+\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)\Big{)}.

Finally, we consider 𝐕~F𝐇𝐕~𝐇1𝐇02\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}\|_{2}. From the proof of Theorem 4.1, we know that

𝐕~F𝐇𝐕~𝐇1𝐇02𝐕~F𝐇2𝐕~2𝐕~F𝐕~F𝐕~𝐕~2\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widetilde{\mathbf{V}}\mathbf{H}_{1}\mathbf{H}_{0}\|_{2}\leq\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{2}-\widetilde{\mathbf{V}}\|_{2}\lesssim\|\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|_{2}
=OP(𝔼(𝐕~F𝐕~F𝐕~𝐕~22|𝚺~)1/2)dp𝚺~𝐕𝐕2q(1𝚺~𝐕𝐕2)q.\displaystyle\quad=O_{P}\big{(}{\mathbb{E}}(\|\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top}-\widetilde{\mathbf{V}}\widetilde{\mathbf{V}}^{\top}\|_{2}^{2}|\widetilde{\bm{\Sigma}})^{1/2}\big{)}\lesssim\sqrt{\frac{d}{p^{\prime}}}\frac{\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{q}}{\left(1-\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\right)^{q}}.

From the proof of Theorem 4.1, we know that with probability converging to 1, there exists some constant η>0\eta>0 such that 𝚺~𝐕𝐕2ηr1(d)logdd/p/Δ=o(1)\|\widetilde{\mathbf{\Sigma}}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\leq\eta r_{1}(d)\log d\sqrt{d/p}/\Delta=o(1), and thus that

𝐕~F𝐇2𝐕~2=OP(dp(2ηr1(d)logddΔ2p)q).\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{2}-\widetilde{\mathbf{V}}\|_{2}=O_{P}\left(\sqrt{\frac{d}{p^{\prime}}}\left(2\eta r_{1}(d)\log d\sqrt{\frac{d}{\Delta^{2}p}}\right)^{q}\right).

When we choose qq to be large enough, i.e.,

q2+log(Ld)loglogd1+log(logdLd/(Kp))log((2ηlogd)1Δ/r1(d)p/d),q\geq 2+\frac{\log(Ld)}{\log\log d}\gg 1+\frac{\log\left(\log d\sqrt{Ld/(Kp^{\prime})}\right)}{\log\left((2\eta\log d)^{-1}\Delta/r_{1}(d)\sqrt{p/d}\right)},

we have 𝐕~F𝐇2𝐕~2=OP(KdΔ2pLr1(d))\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}_{2}-\widetilde{\mathbf{V}}\|_{2}=O_{P}(\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)). Therefore, if we denote

r(d):=Δ1(KdpLr1(d)+r3(d)r1(d)+μKdΔ2r1(d)2+r2(d)+r4(d)),r(d):=\Delta^{-1}\Big{(}\sqrt{\frac{Kd}{pL}}r_{1}(d)+r_{3}(d)r_{1}(d)\!+\!\sqrt{\frac{\mu K}{d\Delta^{2}}}r_{1}(d)^{2}\!+\!r_{2}(d)\!+\!r_{4}(d)\Big{)},

we can write

𝐕~F𝐇𝐕=𝐏𝐄0𝐕𝚲1+𝐑(d),\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}=\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}+\mathbf{R}(d),

where 𝐑(d)2,=OP(r(d))\|\mathbf{R}(d)\|_{2,\infty}=O_{P}\big{(}r(d)\big{)}. Then under the condition that η1(d)1/2r(d)=o(1)\eta_{1}(d)^{-1/2}r(d)=o(1), we have 𝐑(d)2,=oP(σmin(𝚺j)1/2)\|\mathbf{R}(d)\|_{2,\infty}=o_{P}\Big{(}\sigma_{\min}\big{(}\bm{\Sigma}_{j}\big{)}^{1/2}\Big{)}. Thus by Assumption 5,

𝚺j1/2(𝐕~F𝐇𝐕)𝐞j=𝚺j1/2(𝚲1𝐕𝐄0𝐏𝐞j)+oP(1)𝑑𝒩(𝟎,𝐈K).\bm{\Sigma}_{j}^{-1/2}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{e}_{j}=\bm{\Sigma}_{j}^{-1/2}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})+o_{P}(1)\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}).
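
This asymptotic normality is what enables entry-wise inference on the eigenvector rows. As an illustration only (not part of the proof), the following Python sketch shows how the result could be turned into a per-row test statistic; the alignment matrix `H`, the plug-in covariance estimate `Sigma_hat_j`, and the FADI output `V_tilde_F` are hypothetical inputs assumed to be available (e.g., from the corollaries that follow).

```python
import numpy as np
from scipy.stats import chi2

def row_chi2_pvalue(V_tilde_F, H, Sigma_hat_j, j, v0_row):
    """Test H0: the j-th row of V equals v0_row (a length-K vector).

    Uses Sigma_j^{-1/2} (V_tilde_F H - V)^T e_j -> N(0, I_K): under H0 the
    quadratic form z^T Sigma_hat_j^{-1} z is approximately chi^2 with K df,
    where z = (V_tilde_F H)[j, :] - v0_row.
    """
    K = V_tilde_F.shape[1]
    z = (V_tilde_F @ H)[j, :] - v0_row            # aligned residual for row j
    stat = float(z @ np.linalg.solve(Sigma_hat_j, z))
    return stat, chi2.sf(stat, df=K)              # (statistic, approximate p-value)
```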

B.10 Proof of Corollary 4.6

We define 𝐄0\mathbf{E}_{0} and 𝐄b\mathbf{E}_{b} the same as in the proof of Corollary 4.11. Then Assumptions 1 and 2 are satisfied, as was proven for Corollary 4.11. As for Assumption 5, we have shown in the proof of Corollary 4.11 that under the condition that κ13(λ1/σ2)3=o(n){\kappa_{1}^{3}(\lambda_{1}/\sigma^{2})^{3}}=o(\sqrt{n}), the result (B.31) holds for any matrix 𝐀d×K\mathbf{A}\in\mathbb{R}^{d\times K} such that σmax(𝐀)/σmin(𝐀)C|λ1|/Δ\sigma_{\max}(\mathbf{A})/\sigma_{\min}(\mathbf{A})\leq C|\lambda_{1}|/\Delta and λK(Cov(𝐀𝐄0𝐏𝐞j))cn1σ4(σmin(𝐀))2\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq cn^{-1}\sigma^{4}\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}. Under the regime LpdLp\gg d, the leading term is 𝒱(𝐄0)=𝐏𝐄0𝐕𝚲1\mathcal{V}(\mathbf{E}_{0})=\mathbf{P}_{\perp}\mathbf{E}_{0}\mathbf{V}\mathbf{\Lambda}^{-1}, and by taking 𝐀=𝐕𝚲1\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}, it can be seen that

σmax(𝐕𝚲1)/σmin(𝐕𝚲1)=σmax(𝚲)/σmin(𝚲)|λ1|/Δ,\sigma_{\max}(\mathbf{V}\mathbf{\Lambda}^{-1})/\sigma_{\min}(\mathbf{V}\mathbf{\Lambda}^{-1})=\sigma_{\max}(\mathbf{\Lambda})/\sigma_{\min}(\mathbf{\Lambda})\leq|\lambda_{1}|/\Delta,

and if we can show that η1(d)(2n)1λ12σ4\eta_{1}(d)\geq(2n)^{-1}\lambda_{1}^{-2}\sigma^{4}, we have λK(𝚺j)η1(d)=(2n)1σ4(σmin(𝐕𝚲1))2\lambda_{K}(\bm{\Sigma}_{j})\geq\eta_{1}(d)=(2n)^{-1}\sigma^{4}\big{(}\sigma_{\min}(\mathbf{V}\mathbf{\Lambda}^{-1})\big{)}^{2} and Assumption 5 is satisfied. Thus we only need to verify Assumption 4 and the conditions for η1(d)\eta_{1}(d). Recall from the proof of Corollary 4.11 we have the following rates

r1(d)=(λ1+σ2)rn,r2(d)σ~13Kδ2n(logd)2,r_{1}(d)=(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}},\quad r_{2}(d)\asymp\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n}(\log d)^{2},

and we can further derive that the following bounds hold with high probability

𝐕^sgn(𝐕^𝐕)𝐕2,𝐕^sgn(𝐕^𝐕)𝐕2𝐄02/Δr1(d)logd/Δ;\displaystyle\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\leq\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2}\lesssim\|\mathbf{E}_{0}\|_{2}/\Delta\lesssim r_{1}(d)\log d/\Delta;
𝐄0(𝐕^(𝐕^𝐕)𝐕)2,𝐄02𝐕^sgn(𝐕^𝐕)𝐕2r1(d)2(logd)2/Δ.\displaystyle\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2,\infty}\lesssim\|\mathbf{E}_{0}\|_{2}\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2}\lesssim r_{1}(d)^{2}(\log d)^{2}/\Delta.

Thus we know r3(d)κ1logdr/nr_{3}(d)\asymp\kappa_{1}\log d\sqrt{{r}/{n}} and r4(d)r1(d)2(logd)2/Δ=κ1(λ1+σ2)(logd)2r/nr_{4}(d)\asymp r_{1}(d)^{2}(\log d)^{2}/\Delta=\kappa_{1}(\lambda_{1}+\sigma^{2})(\log d)^{2}r/n.

From the proof of Corollary 4.11, we know that 𝚺j=n1𝚲1𝐕𝚺j0𝐕𝚲1\bm{\Sigma}_{j}=n^{-1}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\bm{\Sigma}_{j}^{0}\mathbf{V}\mathbf{\Lambda}^{-1}, where

𝚺j0={σ2𝐏𝐞j22𝚺+3σ4𝐏𝐞j𝐞j𝐏2σ4ρ𝐏𝐞j2[(𝐏)[:,S]𝐮K+1𝐞j𝐏+𝐏𝐞j(𝐮K+1)(𝐏)[S,:]]}.\mathbf{\Sigma}_{j}^{0}\!=\!\Big{\{}\!\sigma^{2}\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}\bm{\Sigma}\!+3\sigma^{4}\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}\!\!-\!2\sigma^{4}\rho\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}\!\big{[}(\mathbf{P}_{\perp})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}\!\mathbf{P}_{\perp}\!\!+\!\mathbf{P}_{\perp}\mathbf{e}_{j}(\mathbf{u}_{K+1})^{\top}\!(\mathbf{P}_{\perp})_{[S,:]}\big{]}\!\!\Big{\}}.

Similarly to the proof of Corollary 4.11, we first define 𝚺j\bm{\Sigma}_{j}^{\prime} as follows:

𝚺j=1n𝚲1𝐕{σ2𝚺+3σ4𝐞j𝐞j2σ4ρ𝐏𝐞j2((𝐈d)[:,S]𝐮K+1𝐞j+𝐞j𝐮K+1(𝐈d)[S,:])}𝐕𝚲1.\bm{\Sigma}_{j}^{\prime}=\frac{1}{n}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\Big{\{}\sigma^{2}\bm{\Sigma}+3\sigma^{4}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}-2\sigma^{4}\rho\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}\big{(}(\mathbf{I}_{d})_{[:,S]}\mathbf{u}_{K+1}\mathbf{e}_{j}^{\top}+\mathbf{e}_{j}\mathbf{u}_{K+1}^{\top}(\mathbf{I}_{d})_{[S,:]}\big{)}\Big{\}}\mathbf{V}\mathbf{\Lambda}^{-1}.

Then, following arguments similar to those in the proof of Corollary 4.11, we have that

𝚺j𝚺j2=O(Kσ4nΔ2μd)=O(Kλ12Δ2μd)σ4nλ12=o(σ4nλ12).\|\bm{\Sigma}_{j}-{\bm{\Sigma}}_{j}^{\prime}\|_{2}=O\Big{(}\frac{K\sigma^{4}}{n\Delta^{2}}\sqrt{\frac{\mu}{d}}\Big{)}=O\Big{(}\frac{K\lambda_{1}^{2}}{\Delta^{2}}\sqrt{\frac{\mu}{d}}\Big{)}\frac{\sigma^{4}}{n\lambda_{1}^{2}}=o\big{(}\frac{\sigma^{4}}{n\lambda_{1}^{2}}\big{)}.

Besides, under the condition that μ2κ14K3d2\mu^{2}\kappa_{1}^{4}K^{3}\ll d^{2} we have

𝚺j𝚺~j2Kσ4nΔ2𝐕2,2μKKσ4dnΔ2=O(μκ12KKd)σ4nλ12=o(σ4nλ12).\displaystyle\|\bm{\Sigma}_{j}^{\prime}-\widetilde{\bm{\Sigma}}_{j}\|_{2}\lesssim\frac{\sqrt{K}\sigma^{4}}{n\Delta^{2}}\|\mathbf{V}\|_{2,\infty}^{2}\lesssim\frac{\mu K\sqrt{K}\sigma^{4}}{dn\Delta^{2}}=O\left(\frac{\mu\kappa_{1}^{2}K\sqrt{K}}{d}\right)\frac{\sigma^{4}}{n\lambda_{1}^{2}}=o\Big{(}\frac{\sigma^{4}}{n\lambda_{1}^{2}}\Big{)}.

Then we know that λK(𝚺j)σ42nλ12+σ22nλ1\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\geq\frac{\sigma^{4}}{2n\lambda_{1}^{2}}+\frac{\sigma^{2}}{2n\lambda_{1}} and we can take η1(d)=σ42nλ12+σ22nλ1\eta_{1}(d)=\frac{\sigma^{4}}{2n\lambda_{1}^{2}}+\frac{\sigma^{2}}{2n\lambda_{1}}. Thus Assumption 5 holds. Then by plugging in the above rates, we can derive the rate r(d)r(d) as

r(d)\displaystyle r(d) =KdpLr1(d)Δ+r3(d)r1(d)/Δ+μKdr1(d)2/Δ2+(r2(d)+r4(d))/Δ\displaystyle=\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}+\big{(}r_{2}(d)+r_{4}(d)\big{)}/\Delta
κ1KdrnpL+κ12(logd)2rn+σ~13Kδ2nΔ(logd)2.\displaystyle\lesssim\kappa_{1}\sqrt{\frac{Kdr}{npL}}+\frac{\kappa_{1}^{2}(\log d)^{2}r}{n}+\frac{\widetilde{\sigma}_{1}^{3}K}{\delta^{2}n\Delta}(\log d)^{2}.

Then under the condition that LKdrpκ12(λ1σ2)L\gg\frac{Kdr}{p}\kappa_{1}^{2}(\frac{\lambda_{1}}{\sigma^{2}}), nκ14(logd)4r2(λ1σ2)n\gg\kappa_{1}^{4}(\log d)^{4}r^{2}(\frac{\lambda_{1}}{\sigma^{2}}) and K(σ~1δ)2κ1rK(\frac{\widetilde{\sigma}_{1}}{\delta})^{2}\ll\kappa_{1}r, we have η1(d)1/2r(d)=o(1)\eta_{1}(d)^{-1/2}r(d)=o(1), and hence the condition for η1(d)\eta_{1}(d) is satisfied and (8) holds. Also recall from the above proof that 𝚺~j𝚺j2=o(λK(𝚺j))\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}=o\big{(}\lambda_{K}(\bm{\Sigma}_{j})\big{)}, and (9) holds.

Now we verify the validity of 𝚺^j\widehat{\bm{\Sigma}}_{j}. As in the proof of Corollary 4.11, it suffices to show that 𝚺^j𝐇𝚺~j𝐇2=oP(λK(𝚺~j))\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}, and the result will then follow by Slutsky’s Theorem. From the proof of Corollary 4.11, we have

𝚺^tr𝐕𝚲𝐕2=OP((λ1+σ2)rn).\|\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top}\|_{2}=O_{P}\Big{(}(\lambda_{1}+\sigma^{2})\sqrt{\frac{r}{n}}\Big{)}.

Also, we know that with high probability

𝐕~F𝐕𝐇2\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2} =𝐕~F𝐇𝐕2KdΔ2pLr1(d)logd+r1(d)logd/Δ\displaystyle=\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}\lesssim\sqrt{\frac{Kd}{\Delta^{2}pL}}r_{1}(d)\log d+r_{1}(d)\log d/\Delta
r1(d)logd/Δκ1logdrn.\displaystyle\lesssim r_{1}(d)\log d/\Delta\lesssim\kappa_{1}\log d\sqrt{\frac{r}{n}}.

Then we have

𝚲~𝐇𝚲𝐇2𝐕~F(𝚺^tr𝐕𝚲𝐕)𝐕~F2+(𝐕~F𝐕𝐇)(𝐕𝚲𝐕)𝐕~F2\displaystyle\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\|_{2}\leq\|\widetilde{\mathbf{V}}^{\text{F}\top}(\widehat{\bm{\Sigma}}^{\text{tr}}-\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})\widetilde{\mathbf{V}}^{\text{F}}\|_{2}+\|(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})\widetilde{\mathbf{V}}^{\text{F}}\|_{2}
+𝐇𝐕(𝐕𝚲𝐕)(𝐕~F𝐕𝐇)2=OP(λ1κ1logdrn).\displaystyle\quad+\|\mathbf{H}\mathbf{V}^{\top}(\mathbf{V}\mathbf{\Lambda}\mathbf{V}^{\top})(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})\|_{2}=O_{P}\Big{(}\lambda_{1}\kappa_{1}\log d\sqrt{\frac{r}{n}}\Big{)}.

Then if we denote 𝐃𝚲=(𝚲~𝐇𝚲𝐇)𝐇𝚲1𝐇\mathbf{D}_{\mathbf{\Lambda}}=(\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}, we have that 𝐃𝚲2=OP(κ12logdrn)=oP(1)\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}(\kappa_{1}^{2}\log d\sqrt{\frac{r}{n}})=o_{P}(1), and thus we have

𝚲~1𝐇𝚲1𝐇2\displaystyle\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2} =(𝐇𝚲𝐇+𝚲~𝐇𝚲𝐇)1(𝐇𝚲𝐇)12\displaystyle=\|(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}+\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}-(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}\|_{2}
=𝐇𝚲1𝐇[(𝐈K+𝐃𝚲)1𝐈K]2𝚲12i=1(𝐃𝚲)i2\displaystyle=\Big{\|}\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\big{[}(\mathbf{I}_{K}+\mathbf{D}_{\mathbf{\Lambda}})^{-1}-\mathbf{I}_{K}\big{]}\Big{\|}_{2}\leq\|\mathbf{\Lambda}^{-1}\|_{2}\big{\|}\sum_{i=1}^{\infty}(-\mathbf{D}_{\mathbf{\Lambda}})^{i}\big{\|}_{2}
=OP(κ12logdrn)Δ1,\displaystyle=O_{P}\left(\kappa_{1}^{2}\log d\sqrt{\frac{r}{n}}\right)\Delta^{-1},

and furthermore, we have

𝚲~2𝐇𝚲2𝐇2\displaystyle\|\widetilde{\mathbf{\Lambda}}^{-2}-\mathbf{H}\mathbf{\Lambda}^{-2}\mathbf{H}^{\top}\|_{2} 𝚲12𝚲~1𝐇𝚲1𝐇2=OP(κ12logdrn)Δ2.\displaystyle\lesssim\|\mathbf{\Lambda}^{-1}\|_{2}\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}=O_{P}\left(\kappa_{1}^{2}\log d\sqrt{\frac{r}{n}}\right)\Delta^{-2}.
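
The two displays above are the standard Neumann-series perturbation argument for the inverse: whenever the operator norm of D_Lambda is below one, (I_K + D_Lambda)^{-1} - I_K equals the series sum over i >= 1 of (-D_Lambda)^i, whose operator norm is at most ||D_Lambda||_2/(1 - ||D_Lambda||_2). A small numerical sanity check of this identity (illustration only, on a randomly generated matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
D = rng.standard_normal((K, K))
D *= 0.1 / np.linalg.norm(D, 2)                    # rescale so that ||D||_2 = 0.1 < 1

lhs = np.linalg.inv(np.eye(K) + D) - np.eye(K)     # (I + D)^{-1} - I
neumann = sum(np.linalg.matrix_power(-D, i) for i in range(1, 80))  # truncated series

assert np.allclose(lhs, neumann, atol=1e-12)
# operator-norm bound: ||(I + D)^{-1} - I||_2 <= ||D||_2 / (1 - ||D||_2)
nD = np.linalg.norm(D, 2)
assert np.linalg.norm(lhs, 2) <= nD / (1 - nD) + 1e-12
```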

Then following basic algebra, under the condition that nκ14(logd)4r2(λ1/σ2)2n\gg\kappa_{1}^{4}(\log d)^{4}r^{2}(\lambda_{1}/\sigma^{2})^{2} we have

𝐇𝚺~j𝐇𝚺^j2=1n𝐇(σ2𝚲1+σ4𝚲2)𝐇(σ^2𝚲~1+σ^4𝚲~2)2\displaystyle\|\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}-\widehat{\bm{\Sigma}}_{j}\|_{2}=\frac{1}{n}\|\mathbf{H}(\sigma^{2}\mathbf{\Lambda}^{-1}+\sigma^{4}\mathbf{\Lambda}^{-2})\mathbf{H}^{\top}-(\widehat{\sigma}^{2}\widetilde{\mathbf{\Lambda}}^{-1}+\widehat{\sigma}^{4}\widetilde{\mathbf{\Lambda}}^{-2})\|_{2}
1n(σ2𝐇𝚲1𝐇σ^2𝚲~12+σ4𝐇𝚲2𝐇σ^4𝚲~22)\displaystyle\leq\frac{1}{n}\left(\|\sigma^{2}\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}-\widehat{\sigma}^{2}\widetilde{\mathbf{\Lambda}}^{-1}\|_{2}+\|\sigma^{4}\mathbf{H}\mathbf{\Lambda}^{-2}\mathbf{H}^{\top}-\widehat{\sigma}^{4}\widetilde{\mathbf{\Lambda}}^{-2}\|_{2}\right)
=OP(κ12logdσ2nΔrn)+OP(σ~1nΔKn)+OP(κ12logdσ4nΔ2rn)+OP(σ~1σ2nΔ2Kn)\displaystyle=O_{P}\left(\kappa_{1}^{2}\log d\frac{\sigma^{2}}{n\Delta}\sqrt{\frac{r}{n}}\right)+O_{P}\left(\frac{\widetilde{\sigma}_{1}}{n\Delta}\sqrt{\frac{K}{n}}\right)+O_{P}\left(\kappa_{1}^{2}\log d\frac{\sigma^{4}}{n\Delta^{2}}\sqrt{\frac{r}{n}}\right)+O_{P}\left(\frac{\widetilde{\sigma}_{1}\sigma^{2}}{n\Delta^{2}}\sqrt{\frac{K}{n}}\right)
=OP(κ12logd(Δσ2)rn)σ4nΔ2=OP(κ12logd(λ1σ2)(λ1Δ)rn)σ4nλ12\displaystyle=O_{P}\Big{(}\kappa_{1}^{2}\log d\big{(}\frac{\Delta}{\sigma^{2}}\big{)}\sqrt{\frac{r}{n}}\Big{)}\frac{\sigma^{4}}{n\Delta^{2}}=O_{P}\Big{(}\kappa_{1}^{2}\log d\big{(}\frac{\lambda_{1}}{\sigma^{2}}\big{)}\big{(}\frac{\lambda_{1}}{\Delta}\big{)}\sqrt{\frac{r}{n}}\Big{)}\frac{\sigma^{4}}{n\lambda_{1}^{2}}
=OP(κ13logd(λ1σ2)rn)σ4nλ12=oP(λK(𝚺~j)).\displaystyle=O_{P}\Big{(}\kappa_{1}^{3}\log d\big{(}\frac{\lambda_{1}}{\sigma^{2}}\big{)}\sqrt{\frac{r}{n}}\Big{)}\frac{\sigma^{4}}{n\lambda_{1}^{2}}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Therefore, by Slutsky’s Theorem, the claim follows.
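
For concreteness, the plug-in covariance compared against above, Σ̂_j = n^{-1}(σ̂² Λ̃^{-1} + σ̂⁴ Λ̃^{-2}), can be assembled directly from the FADI outputs. The following is a minimal sketch (not the authors' implementation), assuming hypothetical inputs `Lambda_tilde` (the K estimated eigenvalues) and `sigma2_hat` (an estimate of the noise variance σ²).

```python
import numpy as np

def plug_in_Sigma_j(Lambda_tilde, sigma2_hat, n):
    """Plug-in estimate Sigma_hat_j = (sigma^2 * Lambda^{-1} + sigma^4 * Lambda^{-2}) / n.

    Lambda_tilde : array of shape (K,), estimated spiked eigenvalues.
    sigma2_hat   : scalar estimate of the noise variance sigma^2.
    n            : sample size.
    """
    inv_L = 1.0 / Lambda_tilde
    diag = sigma2_hat * inv_L + sigma2_hat**2 * inv_L**2
    return np.diag(diag) / n
```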

B.11 Proof of Corollary 4.7

The proof for the case without self-loops is almost identical to the case with self-loops, up to a few modifications. We first prove the results for the case where self-loops are present, and at the end we discuss how to modify the proof for the case where self-loops are absent.

We only need to verify that Assumptions 1 to 5 hold. Recall from the proof of Corollary 4.2 that we have 𝐄2ψ1r1(d)=dθ\|\|\mathbf{E}\|_{2}\|_{\psi_{1}}\lesssim r_{1}(d)=\sqrt{d\theta}, and thus we know that Assumption 1 is satisfied. Also Assumption 2 holds trivially due to the unbiasedness of 𝐄\mathbf{E}. We will then verify Assumption 3 holds under the model. We know that 𝚯𝚷\mathbf{\Theta}\mathbf{\Pi} and 𝐕\mathbf{V} share the same column space, and thus there exists a non-singular matrix 𝐂K×K\mathbf{C}\in\mathbb{R}^{K\times K} such that 𝚯𝚷=𝐕𝐂\mathbf{\Theta}\mathbf{\Pi}=\mathbf{V}\mathbf{C} and 𝐕=𝚯𝚷𝐂1\mathbf{V}=\mathbf{\Theta}\mathbf{\Pi}\mathbf{C}^{-1}. Then we can see that σmin(𝐂)=σmin(𝚯𝚷)dθ/K\sigma_{\min}(\mathbf{C})=\sigma_{\min}(\mathbf{\Theta}\mathbf{\Pi})\gtrsim\sqrt{d\theta/K}, and 𝐂12K/dθ\|\mathbf{C}^{-1}\|_{2}\lesssim\sqrt{K/d\theta}. Hence we have 𝐕2,𝚯𝚷2,𝐂12θK/dθ=K/d\|\mathbf{V}\|_{2,\infty}\leq\|\mathbf{\Theta}\mathbf{\Pi}\|_{2,\infty}\|\mathbf{C}^{-1}\|_{2}\lesssim\sqrt{\theta}\sqrt{K/d\theta}=\sqrt{K/d}. Thus we can see that Assumption 3 is satisfied with μ=O(1)\mu=O(1).
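
The incoherence bound derived above, ‖V‖_{2,∞} ≲ √(K/d), can also be checked numerically: since 𝚯𝚷 and 𝐕 share the same column space, the row norms of any orthonormal basis of that column space give ‖V‖_{2,∞}. The sketch below assumes a toy configuration with hard community memberships and homogeneous degree parameters (hypothetical values; illustration only).

```python
import numpy as np

rng = np.random.default_rng(1)
d, K, theta = 2000, 4, 0.05                          # hypothetical sizes and sparsity
Pi = np.zeros((d, K))
Pi[np.arange(d), rng.integers(0, K, size=d)] = 1.0   # hard community memberships
TP = np.sqrt(theta) * Pi                             # Theta @ Pi with Theta = sqrt(theta) * I (assumed)

# Row norms of an orthonormal basis of col span(Theta Pi) = col span(V) are
# invariant to rotations, so they give ||V||_{2,infty}.
Q, _ = np.linalg.qr(TP)
max_row_norm = np.linalg.norm(Q, axis=1).max()
mu = d / K * max_row_norm**2                         # incoherence parameter
print(f"max row norm = {max_row_norm:.4f}, sqrt(K/d) = {np.sqrt(K / d):.4f}, mu = {mu:.2f}")
```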

Now we move on to verify Assumption 4. Recall from the proof of Corollary 4.2 that Δdθ/K\Delta\gtrsim d\theta/K, 𝐌2Kdθ\|\mathbf{M}\|_{2}\lesssim Kd\theta, 𝐌ijθ\mathbf{M}_{ij}\asymp\theta and maxij𝔼(𝐄ij2)θ\max_{ij}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim\theta. By Theorem 4.2.1 in Chen et al., [12], we have that with probability 1O(d5)1-O(d^{-5}),

𝐕^sgn(𝐕^𝐕)𝐕2,K3K+KKlogddθ,r3(d)K3K+KKlogddθ,\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}},\quad r_{3}(d)\asymp\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}},

and by the proof of Theorem 4.2.1 in [12], we further have that with probability 1O(d7)1-O(d^{-7}),

𝐄(𝐕^(𝐕^𝐕)𝐕)2,KKθlogddθ𝐄2+r3(d)(logd+dθ)\displaystyle\|\mathbf{E}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})\!-\!\mathbf{V})\|_{2,\infty}\lesssim\!\frac{K\!\sqrt{K\theta\log d}}{d\theta}\|\mathbf{E}\|_{2}\!+\!r_{3}(d)(\log d\!\!+\!\!\sqrt{d\theta})
r3(d)(logd+dθ)+KKlogd/d\displaystyle\lesssim r_{3}(d)(\log d+\sqrt{d\theta})+K\sqrt{K\log d/d}
K3K+KKlogdd,r4(d)K3K+KKlogdd.\displaystyle\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}},r_{4}(d)\asymp\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}}.

Thus Assumption 4 is met, and we now move on to study the order of η1(d)\eta_{1}(d). Before continuing with the proof, we state the following elementary lemma, which helps bound the operator norm of a covariance matrix.

Lemma B.5.

Let 𝐱1,𝐱2d\mathbf{x}_{1},\mathbf{x}_{2}\in\mathbb{R}^{d} be two random vectors. Then we have

Cov(𝐱1,𝐱2)2=Cov(𝐱2,𝐱1)2Cov(𝐱1)2Cov(𝐱2)2,\|\operatorname*{\rm Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\|_{2}=\|\operatorname*{\rm Cov}(\mathbf{x}_{2},\mathbf{x}_{1})\|_{2}\leq\sqrt{\|\operatorname*{\rm Cov}(\mathbf{x}_{1})\|_{2}\|\operatorname*{\rm Cov}(\mathbf{x}_{2})\|_{2}},

and

Cov(𝐱1+𝐱2)22Cov(𝐱1)2+2Cov(𝐱2)2.\|\operatorname*{\rm Cov}(\mathbf{x}_{1}+\mathbf{x}_{2})\|_{2}\leq 2\|\operatorname*{\rm Cov}(\mathbf{x}_{1})\|_{2}+2\|\operatorname*{\rm Cov}(\mathbf{x}_{2})\|_{2}.
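
As a quick numerical sanity check of Lemma B.5 (illustration only; the formal proof is in Appendix C.4), both inequalities can be verified on simulated correlated vectors, since the empirical joint covariance of (x1, x2) is positive semi-definite and the same algebra applies to it.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n = 10, 20_000
A = rng.standard_normal((d, d))
x1 = rng.standard_normal((n, d))
x2 = x1 @ A.T + 0.5 * rng.standard_normal((n, d))         # x2 correlated with x1

op = lambda M: np.linalg.norm(M, 2)                        # spectral norm
C11, C22 = np.cov(x1, rowvar=False), np.cov(x2, rowvar=False)
C12 = (x1 - x1.mean(0)).T @ (x2 - x2.mean(0)) / (n - 1)    # empirical Cov(x1, x2)
Csum = np.cov(x1 + x2, rowvar=False)

assert op(C12) <= np.sqrt(op(C11) * op(C22)) + 1e-8        # first inequality of Lemma B.5
assert op(Csum) <= 2 * op(C11) + 2 * op(C22) + 1e-8        # second inequality of Lemma B.5
```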

The proof of Lemma B.5 can be found in Appendix C.4. With the help of Lemma B.5, we first decompose 𝐄=𝐄1+𝐄2\mathbf{E}=\mathbf{E}_{1}+\mathbf{E}_{2}, where 𝐄1=[𝐄ij𝕀{ij}]\mathbf{E}_{1}=[\mathbf{E}_{ij}\mathbb{I}\{i\leq j\}] is composed of the diagonal and upper triangular entries of 𝐄\mathbf{E} and 𝐄2=[𝐄ij𝕀{i>j}]\mathbf{E}_{2}=[\mathbf{E}_{ij}\mathbb{I}\{i>j\}] is composed of the off-diagonal lower triangular entries of 𝐄\mathbf{E}. Then it can be seen that both 𝐄1\mathbf{E}_{1} and 𝐄2\mathbf{E}_{2} have independent entries. Now for j[d]j\in[d], we can write

𝐄𝐏𝐞j=𝐄𝐞j𝐄𝐕𝐕𝐞j=𝐄𝐞j(𝐄1𝐕𝐕𝐞j+𝐄2𝐕𝐕𝐞j).\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}=\mathbf{E}\mathbf{e}_{j}-\mathbf{E}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}=\mathbf{E}\mathbf{e}_{j}-(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}).

Then we study the covariance of the three terms separately. We have

Cov(𝐄𝐞j)=Cov(𝐄.j)=diag(𝐌1j(1𝐌1j),,𝐌dj(1𝐌dj));\displaystyle\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})=\operatorname*{\rm Cov}(\mathbf{E}_{.j})=\operatorname{diag}\big{(}\mathbf{M}_{1j}(1-\mathbf{M}_{1j}),\ldots,\mathbf{M}_{dj}(1-\mathbf{M}_{dj})\big{)};
Cov(𝐄1𝐕𝐕𝐞j)=diag([k=1d𝐌ik(1𝐌ik)(𝐏𝐕𝐞j)k2𝕀{ik}]i=1d);\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})=\operatorname{diag}\Big{(}\big{[}\sum_{k=1}^{d}\mathbf{M}_{ik}(1-\mathbf{M}_{ik})(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\mathbb{I}\{i\leq k\}\big{]}_{i=1}^{d}\Big{)};
Cov(𝐄2𝐕𝐕𝐞j)=diag([k=1d𝐌ik(1𝐌ik)(𝐏𝐕𝐞j)k2𝕀{i>k}]i=1d).\displaystyle\operatorname*{\rm Cov}(\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})=\operatorname{diag}\Big{(}\big{[}\sum_{k=1}^{d}\mathbf{M}_{ik}(1-\mathbf{M}_{ik})(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\mathbb{I}\{i>k\}\big{]}_{i=1}^{d}\Big{)}.

Then we have θλd(Cov(𝐄𝐞j))Cov(𝐄𝐞j)2maxij𝔼(𝐄ij2)θ\theta\lesssim\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\big{)}\leq\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\leq\max_{ij}{\mathbb{E}}(\mathbf{E}_{ij}^{2})\lesssim\theta and

Cov(𝐄1𝐕𝐕𝐞j)2maxi[d]k=1d𝐌ik(1𝐌ik)(𝐏𝐕𝐞j)k2𝕀{ik}\displaystyle\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\|_{2}\leq\max_{i\in[d]}\sum_{k=1}^{d}\mathbf{M}_{ik}(1-\mathbf{M}_{ik})(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\mathbb{I}\{i\leq k\}
maxik𝔼(𝐄ik)2k=1d(𝐏𝐕𝐞j)k2θ𝐏𝐕𝐞j22θ𝐕2,2θKd,\displaystyle\leq\max_{ik}{\mathbb{E}}(\mathbf{E}_{ik})^{2}\sum_{k=1}^{d}(\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j})_{k}^{2}\lesssim\theta\|\mathbf{P}_{\mathbf{V}}\mathbf{e}_{j}\|_{2}^{2}\leq\theta\|\mathbf{V}\|_{2,\infty}^{2}\leq\frac{\theta K}{d},

and very similarly we also have Cov(𝐄2𝐕𝐕𝐞j)2θK/d\|\operatorname*{\rm Cov}(\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\|_{2}\lesssim\theta K/d. Thus by Lemma B.5, we know that Cov(𝐄1𝐕𝐕𝐞j+𝐄2𝐕𝐕𝐞j)2θK/d\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\|_{2}\lesssim\theta K/d and

Cov(𝐄1𝐕𝐕𝐞j+𝐄2𝐕𝐕𝐞j,𝐄𝐞j)2θ2K/d=θK/d.\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j},\mathbf{E}\mathbf{e}_{j})\|_{2}\lesssim\sqrt{\theta^{2}K/d}=\theta\sqrt{K/d}.

Therefore, we can write

Cov(𝐄𝐏𝐞j)Cov(𝐄𝐞j)22Cov(𝐄1𝐕𝐕𝐞j+𝐄2𝐕𝐕𝐞j,𝐄𝐞j)2\displaystyle\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\leq 2\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j},\mathbf{E}\mathbf{e}_{j})\|_{2}
+Cov(𝐄1𝐕𝐕𝐞j+𝐄2𝐕𝐕𝐞j)2θK/d.\displaystyle\quad+\|\operatorname*{\rm Cov}(\mathbf{E}_{1}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}+\mathbf{E}_{2}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j})\|_{2}\lesssim\theta\sqrt{K/d}.

Thus we have λd(Cov(𝐄𝐏𝐞j))λd(Cov(𝐄𝐞j))Cov(𝐄𝐏𝐞j)Cov(𝐄𝐞j)2θ\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}\geq\lambda_{d}\big{(}\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\big{)}-\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\gtrsim\theta, and we have η1(d)λ12θ\eta_{1}(d)\asymp\lambda_{1}^{-2}\theta. Therefore, when θ=K2d1/2+ϵ\theta=K^{2}d^{-1/2+\epsilon} for some constant ϵ>0\epsilon>0, p=Ω(d)p=\Omega(\sqrt{d}) and LK5d2/pL\gg K^{5}d^{2}/p, K=o(d1/18)K=o(d^{1/18}), we have that

r(d)\displaystyle r(d) =Δ1(KdpLr1(d)+r3(d)r1(d)+μKdΔ2r1(d)2+r2(d)+r4(d))\displaystyle=\Delta^{-1}\Big{(}\sqrt{\frac{Kd}{pL}}r_{1}(d)+r_{3}(d)r_{1}(d)+\sqrt{\frac{\mu K}{d\Delta^{2}}}r_{1}(d)^{2}+r_{2}(d)+r_{4}(d)\Big{)}
K4K+K2Klogdd3/2θ+KKθpL1Kdθη1(d)1/2.\displaystyle\lesssim\frac{K^{4}\sqrt{K}+K^{2}\sqrt{K\log d}}{d^{3/2}\theta}+K\sqrt{\frac{K}{\theta pL}}\ll\frac{1}{Kd\sqrt{\theta}}\lesssim\eta_{1}(d)^{1/2}.

Thus η1(d)1/2r(d)=o(1)\eta_{1}(d)^{-1/2}r(d)=o(1) and the condition for the asymptotic covariance matrix is satisfied. Now we need to verify Assumption 5; similarly to the proof of Corollary 4.11, we can verify the following more general result.

Given j[d]j\in[d], for any matrix 𝐀d×K\mathbf{A}\in\mathbb{R}^{d\times K} that satisfies the following two conditions: (1)𝐀2,/σmin(𝐀)Cλ12μK/(dΔ2)\|\mathbf{A}\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\leq C\sqrt{\lambda_{1}^{2}\mu K/(d\Delta^{2})}; (2) λK(𝚺j)cθ(σmin(𝐀))2\lambda_{K}\big{(}\bm{\Sigma}_{j}\big{)}\geq c\theta\big{(}\sigma_{\min}(\mathbf{A})\big{)}^{2}, where 𝚺j:=Cov(𝐀𝐄0𝐏𝐞j)\bm{\Sigma}_{j}:=\operatorname*{\rm Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}) and C,c>0C,c>0 are fixed constants independent of 𝐀\mathbf{A}, it holds that

𝚺j1/2𝐀𝐄0𝐏𝐞j𝑑𝒩(𝟎,𝐈K).\mathbf{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}). (B.32)

It can be checked from the previous proof that 𝐀=𝐕𝚲1\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1} satisfies the two conditions. To show (B.32), we need to show that 𝐚𝚺j1/2𝐀𝐄𝐏𝐞j𝑑𝒩(0,1){\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1) for any 𝐚K,𝐚2=1{\mathbf{a}}\in\mathbb{R}^{K},\|{\mathbf{a}}\|_{2}=1. We will first study the entries of 𝐏𝐞j\mathbf{P}_{\perp}\mathbf{e}_{j} and 𝐀𝚺j1/2𝐚\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}. It holds that

|(𝐏𝐞j)j|=|((𝐈d𝐕𝐕)𝐞j)j|1+𝐕2,2=1+o(1);\displaystyle|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}|=|\big{(}(\mathbf{I}_{d}-\mathbf{V}\mathbf{V}^{\top})\mathbf{e}_{j}\big{)}_{j}|\leq 1+\|\mathbf{V}\|_{2,\infty}^{2}=1+o(1);
maxij|(𝐏𝐞j)i|=maxij|𝐞i𝐞j𝐞i𝐕𝐕𝐞j|0+𝐕2,2=Kd;\displaystyle\max_{i\neq j}|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|=\max_{i\neq j}|\mathbf{e}_{i}^{\top}\mathbf{e}_{j}-\mathbf{e}_{i}^{\top}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}|\leq 0+\|\mathbf{V}\|_{2,\infty}^{2}=\frac{K}{d};
𝐀𝚺j1/2𝐚𝐀2,𝚺j1/22θ1/2𝐀2,/σmin(𝐀)K2Kdθ.\displaystyle\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}\leq\|\mathbf{A}\|_{2,\infty}\|\bm{\Sigma}_{j}^{-1/2}\|_{2}\lesssim\theta^{-1/2}\|\mathbf{A}\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\lesssim K^{2}\sqrt{\frac{K}{d\theta}}.

Then we know that

𝐚𝚺j1/2𝐀𝐄𝐏𝐞j=ik𝐄ik(𝐀𝚺j1/2𝐚)i(𝐏𝐞j)k=i=1d𝐄ii(𝐀𝚺j1/2𝐚)i(𝐏𝐞j)i\displaystyle{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}=\sum_{ik}\mathbf{E}_{ik}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}=\sum_{i=1}^{d}\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}
+i<k𝐄ik[(𝐀𝚺j1/2𝐚)i(𝐏𝐞j)k+(𝐀𝚺j1/2𝐚)k(𝐏𝐞j)i].\displaystyle\quad+\sum_{i<k}\mathbf{E}_{ik}\big{[}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big{]}.

Then for the diagonal entries we have

i=1d𝔼|𝐄ii(𝐀𝚺j1/2𝐚)i(𝐏𝐞j)i|3\displaystyle\sum_{i=1}^{d}{\mathbb{E}}|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|^{3}
=𝔼|𝐄jj(𝐀𝚺j1/2𝐚)j(𝐏𝐞j)j|3+ij𝔼|𝐄ii(𝐀𝚺j1/2𝐚)i(𝐏𝐞j)i|3\displaystyle={\mathbb{E}}|\mathbf{E}_{jj}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{j}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}|^{3}+\sum_{i\neq j}{\mathbb{E}}|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|^{3}
θ𝐀𝚺j1/2𝐚3+dθ𝐀𝚺j1/2𝐚3maxij|(𝐏𝐞j)i|3K6dK3dθ,\displaystyle\quad\lesssim\theta\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}+d\theta\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}\max_{i\neq j}|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|^{3}\lesssim\frac{K^{6}}{d}\sqrt{\frac{K^{3}}{d\theta}},

and for the off-diagonal entries, when K=o(d1/26)K=o(d^{1/26}) it holds that

i<k𝔼|𝐄ik[(𝐀𝚺j1/2𝐚)i(𝐏𝐞j)k+(𝐀𝚺j1/2𝐚)k(𝐏𝐞j)i]|3dθ𝐀𝚺j1/2𝐚3\displaystyle\sum_{i<k}{\mathbb{E}}\Big{|}\mathbf{E}_{ik}\big{[}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big{]}\Big{|}^{3}\lesssim d\theta\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}
+d2θ𝐀𝚺j1/2𝐚3(Kd)3K6K3dθ=o(1).\displaystyle\quad+d^{2}\theta\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}\big{(}\frac{K}{d}\big{)}^{3}\lesssim K^{6}\sqrt{\frac{K^{3}}{d\theta}}=o(1).

Moreover, since Var(𝐚𝚺j1/2𝐀𝐄𝐏𝐞j)=1\operatorname{Var}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})=1, the Lyapunov condition is satisfied; plugging in 𝐀=𝐕𝚲1\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}, Assumption 5 is met and (8) follows.
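
The Lyapunov bound above can also be examined empirically. The sketch below simulates a toy stochastic block model with self-loops (assumed within/between-block probabilities; illustration only) and checks that the linear form aᵀAᵀE P⊥ e_j, with A = VΛ^{-1}, is approximately Gaussian after standardization.

```python
import numpy as np

rng = np.random.default_rng(3)
d, K, theta, reps = 300, 3, 0.2, 1000                       # hypothetical parameters
z = rng.integers(0, K, size=d)
M = np.where(z[:, None] == z[None, :], 2 * theta, theta)    # assumed SBM edge probabilities

lam, U = np.linalg.eigh(M)
idx = np.argsort(-np.abs(lam))[:K]
V, Lam = U[:, idx], lam[idx]
A = V / Lam                                                 # A = V Lambda^{-1}
P_perp_e1 = np.eye(d)[:, 0] - V @ V[0, :]                   # P_perp e_1
a = np.ones(K) / np.sqrt(K)                                 # fixed unit vector

stats = np.empty(reps)
for r in range(reps):
    upper = np.triu((rng.random((d, d)) < M).astype(float))
    X = upper + upper.T - np.diag(np.diag(upper))           # symmetric adjacency with self-loops
    E = X - M
    stats[r] = a @ (A.T @ (E @ P_perp_e1))

stats = (stats - stats.mean()) / stats.std()
print("skewness ~ 0:", np.mean(stats**3), " excess kurtosis ~ 0:", np.mean(stats**4) - 3)
```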

Now we only need to verify that the result also holds when replacing 𝚺j\bm{\Sigma}_{j} by 𝚺~j\widetilde{\bm{\Sigma}}_{j}. From the previous discussion we know that

𝚺~j𝚺j2𝐕𝚲122Cov(𝐄𝐏𝐞j)Cov(𝐄𝐞j)2\displaystyle\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}\leq\|\mathbf{V}\mathbf{\Lambda}^{-1}\|_{2}^{2}\|\operatorname*{\rm Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}
K2d2θKdK4KdλK(𝚺~j)=o(λK(𝚺~j)).\displaystyle\leq\frac{K^{2}}{d^{2}\theta}\sqrt{\frac{K}{d}}\lesssim K^{4}\sqrt{\frac{K}{d}}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})=o(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})).

Then by Slutsky’s Theorem, (11) holds.

Now we verify the validity of 𝚺^j\widehat{\bm{\Sigma}}_{j}. As in the proof of Corollary 4.6, 𝐇\mathbf{H} is orthonormal with probability 1o(1)1-o(1), and we start by showing that 𝚺^j𝐇𝚺~j𝐇2=oP(λK(𝚺~j))\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}. From the previous discussion we have the following bounds

𝐌^𝐌2=OP(dθ),𝐕~F𝐕𝐇2=𝐕~F𝐇𝐕2=OP(Kdθ),\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}=O_{P}(\sqrt{d\theta}),\quad\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=O_{P}(\frac{K}{\sqrt{d\theta}}),

and

𝐕~F𝐇𝐕2,\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2,\infty} 𝐕~F𝐇𝐕^𝐇02+𝐕^𝐇0𝐕2,=oP(1Kdθ)\displaystyle\leq\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widehat{\mathbf{V}}\mathbf{H}_{0}\|_{2}+\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty}=o_{P}(\frac{1}{Kd\sqrt{\theta}})
+OP(K3K+KKlogddθ)=OP(K3K+KKlogddθ).\displaystyle\quad+O_{P}(\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}})=O_{P}(\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}}).

With the help of the above results, we will study the components of 𝚺^j𝐇𝚺~j𝐇\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top} separately. In the following proof, we will base the discussion on the event that 𝐇\mathbf{H} is orthonormal. We first study 𝐌~=(𝐕~F𝐕~F)𝐌^(𝐕~F𝐕~F)=𝐕~F𝐇(𝐇𝐕~F𝐌^𝐕~F𝐇)𝐇𝐕~F\widetilde{\mathbf{M}}=(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})\widehat{\mathbf{M}}(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})=\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}. We have that

𝐇𝐕~F𝐌^𝐕~F𝐇𝚲2𝐇𝐕~F𝐌^𝐕~F𝐇𝐇𝐕~F𝐌𝐕~F𝐇2\displaystyle\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\|_{2}\leq\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\|_{2}
+𝐇𝐕~F𝐌(𝐕~F𝐇𝐕)2+(𝐕~F𝐇𝐕)𝐌𝐕2\displaystyle\quad+\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})\|_{2}+\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{M}\mathbf{V}\|_{2}
𝐌^𝐌2+2𝐌2𝐕~F𝐇𝐕2=OP(K2dθ).\displaystyle\leq\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}+2\|\mathbf{M}\|_{2}\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=O_{P}(K^{2}\sqrt{d\theta}).

Then for i,k[d]i,k\in[d], we have

|𝐌~ik𝐌ik|=|(𝐕~F𝐇)i(𝐇𝐕~F𝐌^𝐕~F𝐇)(𝐕~F𝐇)k𝐌ik|\displaystyle|\widetilde{\mathbf{M}}_{ik}-\mathbf{M}_{ik}|=|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}-\mathbf{M}_{ik}|
|(𝐕~F𝐇)i(𝐇𝐕~F𝐌^𝐕~F𝐇𝚲)(𝐕~F𝐇)k|+|(𝐕~F𝐇𝐕)i𝚲(𝐕~F𝐇)k|\displaystyle\leq|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}|+|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}|
+|(𝐕)i𝚲(𝐕~F𝐇𝐕)k|.\displaystyle\quad+|(\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{k}|.

It is not hard to see that

|(𝐕~F𝐇)i(𝐇𝐕~F𝐌^𝐕~F𝐇𝚲)(𝐕~F𝐇)k|𝐇𝐕~F𝐌^𝐕~F𝐇𝚲2𝐕~F𝐇2,2\displaystyle|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}(\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda})(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}|\lesssim\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\|_{2}\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\|_{2,\infty}^{2}
=OP(K2dθ𝐕~F𝐇2,2)=OP(K3θd),\displaystyle=O_{P}(K^{2}\sqrt{d\theta}\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\|_{2,\infty}^{2})=O_{P}\left(K^{3}\sqrt{\frac{\theta}{d}}\right),
|(𝐕~F𝐇𝐕)i𝚲(𝐕~F𝐇)k|+|(𝐕)i𝚲(𝐕~F𝐇𝐕)k|\displaystyle|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}|+|(\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{k}|
=OP(Kdθ𝐕2,𝐕^𝐇0𝐕2,)=OP(K3(K2+logd)θd),\displaystyle=O_{P}(Kd\theta\|\mathbf{V}\|_{2,\infty}\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty})=O_{P}\left(K^{3}(K^{2}+\sqrt{\log d})\sqrt{\frac{\theta}{d}}\right),

and in turn we have the upper bound

|𝐌~ik𝐌ik|=OP(K3θd)+OP(K3(K2+logd)θd)\displaystyle|\widetilde{\mathbf{M}}_{ik}-\mathbf{M}_{ik}|=O_{P}\left(K^{3}\sqrt{\frac{\theta}{d}}\right)+O_{P}\left(K^{3}(K^{2}+\sqrt{\log d})\sqrt{\frac{\theta}{d}}\right)
=OP(K3(K2+logd)dθ)θ=oP(θ)=oP(𝐌ik).\displaystyle=O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\theta=o_{P}(\theta)=o_{P}(\mathbf{M}_{ik}).

Thus we have

diag([𝐌~ij(1𝐌~ij)]i=1d)diag([𝐌ij(1𝐌ij)]i=1d)2=OP(K3(K2+logd)dθθ).\|\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}(1-\widetilde{\mathbf{M}}_{ij})]_{i=1}^{d}\big{)}-\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\|_{2}=O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\theta\Big{)}.

Then we move on to study 𝚲~\widetilde{\mathbf{\Lambda}}. We have

𝚲~𝐇𝚲𝐇2𝐕~F(𝐌^𝐌)𝐕~F2+(𝐕~F𝐕𝐇)𝐌𝐕~F2\displaystyle\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\|_{2}\leq\|\widetilde{\mathbf{V}}^{\text{F}\top}(\widehat{\mathbf{M}}-\mathbf{M})\widetilde{\mathbf{V}}^{\text{F}}\|_{2}+\|(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\|_{2}
+𝐇𝐕𝐌(𝐕~F𝐕𝐇)2=OP(dθ)+OP(K2dθ)=OP(K2dθ).\displaystyle\quad+\|\mathbf{H}\mathbf{V}^{\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})\|_{2}=O_{P}(\sqrt{d\theta})+O_{P}(K^{2}\sqrt{d\theta})=O_{P}(K^{2}\sqrt{d\theta}).

Then if we denote 𝐃𝚲=(𝚲~𝐇𝚲𝐇)𝐇𝚲1𝐇\mathbf{D}_{\mathbf{\Lambda}}=(\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}, we have that 𝐃𝚲2=OP(K3/dθ)=oP(1)\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}(K^{3}/\sqrt{d\theta})=o_{P}(1), and thus we have

𝚲~1𝐇𝚲1𝐇2\displaystyle\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2} =(𝐇𝚲𝐇+𝚲~𝐇𝚲𝐇)1(𝐇𝚲𝐇)12\displaystyle=\|(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}+\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}-(\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})^{-1}\|_{2}
=𝐇𝚲1𝐇[(𝐈K+𝐃𝚲)1𝐈K]𝚲12i=1(𝐃𝚲)i2\displaystyle=\Big{\|}\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\big{[}(\mathbf{I}_{K}+\mathbf{D}_{\mathbf{\Lambda}})^{-1}-\mathbf{I}_{K}\big{]}\Big{\|}\leq\|\mathbf{\Lambda}^{-1}\|_{2}\big{\|}\sum_{i=1}^{\infty}(-\mathbf{D}_{\mathbf{\Lambda}})^{i}\big{\|}_{2}
=OP(K4/(dθ)3/2).\displaystyle=O_{P}\big{(}{K^{4}}/{(d\theta)^{3/2}}\big{)}.

Thus, following basic algebra we have the following bounds

𝐕~Fdiag([𝐌~ij(1𝐌~ij)]i=1d)𝐕~F𝐇𝐕diag([𝐌ij(1𝐌ij)]i=1d)𝐕𝐇2\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}\top}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}(1-\widetilde{\mathbf{M}}_{ij})]_{i=1}^{d}\big{)}\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{H}\mathbf{V}^{\top}\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\mathbf{V}\mathbf{H}^{\top}\|_{2}
𝐕~F(diag([𝐌~ij(1𝐌~ij)]i=1d)diag([𝐌ij(1𝐌ij)]i=1d))𝐕~F2\displaystyle\leq\|\widetilde{\mathbf{V}}^{\text{F}\top}\Big{(}\operatorname{diag}\big{(}[\widetilde{\mathbf{M}}_{ij}(1-\widetilde{\mathbf{M}}_{ij})]_{i=1}^{d}\big{)}-\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\Big{)}\widetilde{\mathbf{V}}^{\text{F}}\|_{2}
+2𝐕~F𝐕𝐇2diag([𝐌ij(1𝐌ij)]i=1d)2=OP(K3(K2+logd)dθ)θ,\displaystyle\quad+2\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}\|\operatorname{diag}\big{(}[\mathbf{M}_{ij}(1-\mathbf{M}_{ij})]_{i=1}^{d}\big{)}\|_{2}=O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\theta,

and further, under the condition that K=o(d1/32)K=o(d^{1/32}), we have

𝚺^j𝐇𝚺~j𝐇2\displaystyle\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2} OP(K3(K2+logd)dθθ)𝚲~122+θ𝚲12𝚲~1𝐇𝚲1𝐇2\displaystyle\lesssim O_{P}\Big{(}\frac{K^{3}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\theta\Big{)}\|\widetilde{\mathbf{\Lambda}}^{-1}\|_{2}^{2}+\theta\|\mathbf{\Lambda}^{-1}\|_{2}\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}
=OP(K7(K2+logd)dθ)1K2d2θ+OP(K7dθ)1K2d2θ\displaystyle=O_{P}\Big{(}\frac{K^{7}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\frac{1}{K^{2}d^{2}\theta}+O_{P}(\frac{K^{7}}{\sqrt{d\theta}})\frac{1}{K^{2}d^{2}\theta}
=OP(K7(K2+logd)dθ)1K2d2θ=oP(λK(𝚺~j)).\displaystyle=O_{P}\Big{(}\frac{K^{7}(K^{2}+\sqrt{\log d})}{\sqrt{d\theta}}\Big{)}\frac{1}{K^{2}d^{2}\theta}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Thus, by arguments similar to those in the proof of Corollary 4.6, the claim follows.

Remark 18.

The inferential results also hold for the case where self-loops are absent. Recall that under the no-self-loop case, the observed matrix is

𝐌^=𝐗diag(𝐗)=𝐌+𝐄diag(𝐌+𝐄)=𝐌+𝐄diag(𝐄)diag(𝐌),\widehat{\mathbf{M}}=\mathbf{X}-\operatorname{diag}(\mathbf{X})=\mathbf{M}+\mathbf{E}-\operatorname{diag}(\mathbf{M}+\mathbf{E})=\mathbf{M}+\mathbf{E}-\operatorname{diag}(\mathbf{E})-\operatorname{diag}(\mathbf{M}),

where 𝐄=𝐗𝐌\mathbf{E}=\mathbf{X}-\mathbf{M} is the error matrix between the adjacency matrix with self-loops and its expectation. We define 𝐌^=𝐌+𝐄diag(𝐄)\widehat{\mathbf{M}}^{\prime}=\mathbf{M}+\mathbf{E}-\operatorname{diag}(\mathbf{E}) and denote by 𝐕^\widehat{\mathbf{V}}^{\prime} its KK leading eigenvectors. By Weyl’s inequality [19] we know that with probability at least 1d101-d^{-10} we have that σK(𝐌^)σK+1(𝐌^)ΔO(dθ)dθ/K\sigma_{K}(\widehat{\mathbf{M}}^{\prime})-\sigma_{K+1}(\widehat{\mathbf{M}}^{\prime})\geq\Delta-O(\sqrt{d\theta})\gtrsim d\theta/K, and hence by Davis-Kahan’s Theorem [45] we have

𝐕^𝐕^𝐕^𝐕^2diag(𝐌)2/(σK(𝐌^)σK+1(𝐌^))K/d,\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2}\leq\|\operatorname{diag}(\mathbf{M})\|_{2}/\big{(}\sigma_{K}(\widehat{\mathbf{M}}^{\prime})-\sigma_{K+1}(\widehat{\mathbf{M}}^{\prime})\big{)}\lesssim K/d,

with probability at least 1d101-d^{-10}. The verification of Assumptions 1, 3 and 5 when self-loops are present can also be applied to the no-self-loop case. For Assumption 2, we can take 𝐄0=𝐄diag(𝐄)\mathbf{E}_{0}=\mathbf{E}-\operatorname{diag}(\mathbf{E}) and 𝐄b=diag(𝐌)\mathbf{E}_{b}=-\operatorname{diag}(\mathbf{M}). Then r2(d)=diag(𝐌)2θ=o(r1(d))r_{2}(d)=\|\operatorname{diag}(\mathbf{M})\|_{2}\lesssim\theta=o(r_{1}(d)) and Assumption 2 is satisfied. As for Assumption 4, by Lemma 7 in Fan et al., [18], we have

sgn(𝐕^𝐕)𝐕^𝐕2𝐕^𝐕^𝐕𝐕22K2dθ,\|\operatorname{sgn}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\widehat{\mathbf{V}}^{\prime\top}\mathbf{V}\|_{2}\lesssim\|\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}^{2}\lesssim\frac{K^{2}}{d\theta},
sgn(𝐕^𝐕^)𝐕^𝐕^2𝐕^𝐕^𝐕^𝐕^22K2d2.\|\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime})-\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime}\|_{2}\lesssim\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2}^{2}\lesssim\frac{K^{2}}{d^{2}}.

With similar arguments as in the self-loop case, for 𝐕^\widehat{\mathbf{V}}^{\prime} with high probability we have

𝐕^sgn(𝐕^𝐕)𝐕2,K3K+KKlogddθ,\|\widehat{\mathbf{V}}^{\prime}\operatorname{sgn}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}},
𝐄0(𝐕^(𝐕^𝐕)𝐕)2,K3K+KKlogdd.\|\mathbf{E}_{0}(\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})\!-\!\mathbf{V})\|_{2,\infty}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}}.

Then for 𝐕^\widehat{\mathbf{V}}, with high probability we have that

𝐕^sgn(𝐕^𝐕)𝐕2,𝐕^(𝐕^𝐕)𝐕2,+𝐕^(sgn(𝐕^𝐕)𝐕^𝐕)𝐕2,\displaystyle\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\leq\|\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}+\|\widehat{\mathbf{V}}\big{(}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\widehat{\mathbf{V}}^{\top}\mathbf{V}\big{)}-\mathbf{V}\|_{2,\infty}
𝐕^(𝐕^(𝐈d𝐕^𝐕^)𝐕)𝐕2,+𝐕^(𝐕^𝐕^𝐕^𝐕)𝐕2,+O(K2dθ)𝐕^2,\displaystyle\leq\|\widehat{\mathbf{V}}\big{(}\widehat{\mathbf{V}}^{\top}(\mathbf{I}_{d}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top})\mathbf{V}\big{)}-\mathbf{V}\|_{2,\infty}+\|\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}+O(\frac{K^{2}}{d\theta})\|\widehat{\mathbf{V}}\|_{2,\infty}
𝐕^(𝐕^𝐕^)𝐕^2,+𝐕^(𝐕^𝐕)𝐕2,+O(K2dθ)𝐕^2,+𝐕^𝐕^𝐕^𝐕^2𝐕^2,\displaystyle\leq\|\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\widehat{\mathbf{V}}^{\prime})\!-\!\widehat{\mathbf{V}}^{\prime}\|_{2,\infty}\!+\!\|\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})\!-\!\mathbf{V}\|_{2,\infty}\!+\!O(\frac{K^{2}}{d\theta})\|\widehat{\mathbf{V}}\|_{2,\infty}\!+\!\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\!-\!\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2}\|\widehat{\mathbf{V}}\|_{2,\infty}
O(K2dθ)(𝐕2,+𝐕^sgn(𝐕^𝐕)𝐕2,)+𝐕^𝐕^𝐕^𝐕^2+𝐕^(𝐕^𝐕)𝐕2,,\displaystyle\leq O(\frac{K^{2}}{d\theta})\!\!\left(\|\mathbf{V}\|_{2,\infty}\!+\!\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\!\right)\!+\!\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\!-\!\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2}\!+\!\|\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})\!-\!\mathbf{V}\|_{2,\infty},

where in the last two inequalities we use the fact that

(𝐈d𝐕^𝐕^)𝐕^2=(𝐈d𝐕^𝐕^)𝐕^2=𝐕^𝐕^2=𝐕^𝐕^2=𝐕^𝐕^𝐕^𝐕^2,\|(\mathbf{I}_{d}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top})\widehat{\mathbf{V}}^{\prime}\|_{2}=\|(\mathbf{I}_{d}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top})\widehat{\mathbf{V}}\|_{2}=\|\widehat{\mathbf{V}}_{\perp}^{\top}\widehat{\mathbf{V}}^{\prime}\|_{2}=\|\widehat{\mathbf{V}}_{\perp}^{\prime\top}\widehat{\mathbf{V}}\|_{2}=\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\widehat{\mathbf{V}}^{\prime}\widehat{\mathbf{V}}^{\prime\top}\|_{2},

with 𝐕^\widehat{\mathbf{V}}_{\perp} and 𝐕^\widehat{\mathbf{V}}^{\prime}_{\perp} being the orthogonal complement of 𝐕^\widehat{\mathbf{V}} and 𝐕^\widehat{\mathbf{V}}^{\prime} respectively. Since K2/(dθ)=o(1)K^{2}/(d\theta)=o(1), for large enough dd we further get

12𝐕^sgn(𝐕^𝐕)𝐕2,(1O(K2/(dθ)))𝐕^sgn(𝐕^𝐕)𝐕2,\displaystyle\frac{1}{2}\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\leq\left(1-O\left({K^{2}}/{(d\theta)}\right)\right)\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}
O(K2dθ)𝐕2,+O(Kd)+𝐕^sgn(𝐕^𝐕)𝐕2,+O(K2dθ)𝐕^2,\displaystyle\quad\leq O(\frac{K^{2}}{d\theta})\|\mathbf{V}\|_{2,\infty}+O(\frac{K}{d})+\|\widehat{\mathbf{V}}^{\prime}\operatorname{sgn}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}+O(\frac{K^{2}}{d\theta})\|\widehat{\mathbf{V}}^{\prime}\|_{2,\infty}
K2dθKd+Kd+K3K+KKlogddθK3K+KKlogddθ.\displaystyle\quad\lesssim\frac{K^{2}}{d\theta}\sqrt{\frac{K}{d}}+\frac{K}{d}+\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{d\sqrt{\theta}}.

Hence r3(d)=KK(K2+logd)/(dθ)r_{3}(d)=K\sqrt{K}(K^{2}+\sqrt{\log d})/(d\sqrt{\theta}). We also have

𝐄(𝐕^(𝐕^𝐕)𝐕)2,\displaystyle\|\mathbf{E}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2,\infty} 𝐄0(𝐕^(𝐕^𝐕)𝐕)2,+r2(d)r1(d)Δ\displaystyle\lesssim\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2,\infty}+\frac{r_{2}(d)r_{1}(d)}{\Delta}
𝐄0(𝐕^(𝐕^𝐕)𝐕)2,+r2(d)r1(d)Δ\displaystyle\lesssim\|\mathbf{E}_{0}(\widehat{\mathbf{V}}^{\prime}(\widehat{\mathbf{V}}^{\prime\top}\mathbf{V})-\mathbf{V})\!\|_{2,\infty}+\frac{r_{2}(d)r_{1}(d)}{\Delta}
K3K+KKlogdd+KθdK3K+KKlogdd,\displaystyle\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}}+K\sqrt{\frac{\theta}{d}}\lesssim\frac{K^{3}\sqrt{K}+K\sqrt{K\log d}}{\sqrt{d}},

and hence we can take r4(d)=KK(K2+logd)/dr_{4}(d)=K\sqrt{K}(K^{2}+\sqrt{\log d})/\sqrt{d}. Now to get a sharper rate for r(d)r(d), we take into consideration the diagonal structure of 𝐄b\mathbf{E}_{b} and derive the following bound

𝐏𝐄b𝐕^𝐇^0𝚲12,\displaystyle\|\!\mathbf{P}_{\perp}\mathbf{E}_{b}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\!\|_{2,\infty} 𝐕𝐕𝐄b𝐕^𝐇^0𝚲12,+𝐄b𝐕^𝐇^0𝚲12,\displaystyle\leq\|\mathbf{V}\mathbf{V}^{\top}\mathbf{E}_{b}\widehat{\mathbf{V}}\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\!\|_{2,\infty}+\|\mathbf{E}_{b}\widehat{\mathbf{V}}\!\widehat{\mathbf{H}}_{0}\mathbf{\Lambda}^{-1}\|_{2,\infty}
r2(d)𝐕2,Δ+diag(𝐌)𝐕^2,Δ\displaystyle\leq\frac{r_{2}(d)\|\mathbf{V}\|_{2,\infty}}{\Delta}+\frac{\|\operatorname{diag}(\mathbf{M})\widehat{\mathbf{V}}\|_{2,\infty}}{\Delta}
KdKd+diag(𝐌)2𝐕^2,ΔKdKd.\displaystyle\lesssim\frac{K}{d}\sqrt{\frac{K}{d}}+\frac{\|\operatorname{diag}(\mathbf{M})\|_{2}\|\widehat{\mathbf{V}}\|_{2,\infty}}{\Delta}\lesssim\frac{K}{d}\sqrt{\frac{K}{d}}.

Then from the proof of Theorem 4.5 we have that

r(d)K4K+K2Klogdd3/2θ+KKθpL+KdKd1Kdθ,r(d)\lesssim\frac{K^{4}\sqrt{K}+K^{2}\sqrt{K\log d}}{d^{3/2}\theta}+K\sqrt{\frac{K}{\theta pL}}+\frac{K}{d}\sqrt{\frac{K}{d}}\ll\frac{1}{Kd\sqrt{\theta}},

and we are only left to verify the minimum eigenvalue condition of 𝚺j\bm{\Sigma}_{j} by showing that the order of η1(d)\eta_{1}(d) is the same as when there are self-loops. With the same arguments, we know that

Cov(𝚲1𝐕𝐄𝐏𝐞j)Cov(𝚲1𝐕𝐄𝐞j)O(K4Kd)1K2d2θ.\|\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{e}_{j})\|\leq O\left(K^{4}\sqrt{\frac{K}{d}}\right)\frac{1}{K^{2}d^{2}\theta}.

Besides, we also have

Cov(𝚲1𝐕𝐄𝐞j)𝚺~j2=𝚲1𝐕(𝐌jj(1𝐌jj)𝐞j𝐞j)𝐕𝚲12\displaystyle\|\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{e}_{j})-\widetilde{\bm{\Sigma}}_{j}\|_{2}=\big{\|}\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\big{(}\mathbf{M}_{jj}(1-\mathbf{M}_{jj})\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\big{)}\mathbf{V}\mathbf{\Lambda}^{-1}\big{\|}_{2}
𝐌jj𝚲122𝐕2,2K2d2θKd=O(K5d)1K2d2θ=o(λK(𝚺~j)).\displaystyle\lesssim\mathbf{M}_{jj}\|\mathbf{\Lambda}^{-1}\|_{2}^{2}\|\mathbf{V}\|_{2,\infty}^{2}\lesssim\frac{K^{2}}{d^{2}\theta}\frac{K}{d}=O(\frac{K^{5}}{d})\frac{1}{K^{2}d^{2}\theta}=o\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Thus we also have Cov(𝚲1𝐕𝐄𝐏𝐞j)𝚺~j2=o(λK(𝚺~j))\|\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{P}_{\perp}\mathbf{e}_{j})-\widetilde{\bm{\Sigma}}_{j}\|_{2}=o\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}, and thereby

λK(Cov(𝚲1𝐕𝐄𝐏𝐞j))=λK(𝚺~j)(1+o(1))θK2d2θ2.\lambda_{K}\big{(}\operatorname*{\rm Cov}(\mathbf{\Lambda}^{-1}\mathbf{V}^{\top}\mathbf{E}^{\prime}\mathbf{P}_{\perp}\mathbf{e}_{j})\big{)}=\lambda_{K}\big{(}\widetilde{\bm{\Sigma}}_{j}\big{)}\big{(}1+o(1)\big{)}\gtrsim\frac{\theta}{K^{2}d^{2}\theta^{2}}.

Thus we still have η1(d)=λ12θ\eta_{1}(d)=\lambda_{1}^{-2}\theta for the case where self-loops are absent. The condition for η1(d)\eta_{1}(d) also holds for the no-self-loop case and both (8) and (11) hold. The verification of (12) is almost identical to the self-loop case and is hence omitted.
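
The no-self-loop modification rests on the fact that adding back diag(M) perturbs the leading eigenspace by only O(K/d). A toy numerical illustration of this insensitivity (assumed SBM parameters; not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(4)
d, K, theta = 800, 3, 0.3                                   # hypothetical parameters
z = rng.integers(0, K, size=d)
M = np.where(z[:, None] == z[None, :], 2 * theta, theta)    # assumed SBM edge probabilities

def top_k_eigvecs(S, k):
    lam, U = np.linalg.eigh(S)
    return U[:, np.argsort(-np.abs(lam))[:k]]

upper = np.triu((rng.random((d, d)) < M).astype(float))
X = upper + upper.T - np.diag(np.diag(upper))               # adjacency with self-loops
M_hat = X - np.diag(np.diag(X))                             # observed matrix without self-loops
M_hat_p = M_hat + np.diag(np.diag(M))                       # M_hat' = M_hat + diag(M)

V_hat = top_k_eigvecs(M_hat, K)
V_hat_p = top_k_eigvecs(M_hat_p, K)
dist = np.linalg.norm(V_hat @ V_hat.T - V_hat_p @ V_hat_p.T, 2)
print(f"projection distance = {dist:.2e}  vs  2K/d = {2 * K / d:.2e}")
```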

B.12 Proof of Corollary 4.8

Assumptions 1-3 have already been verified in the proof of Corollary 4.12. It can be checked that 𝐕𝚲1\mathbf{V}\mathbf{\Lambda}^{-1} satisfies the two conditions for the general CLT results in the proof of Corollary 4.12; then, under the condition that Δ02n4/3/(μθ2d)\Delta_{0}^{2}\ll n^{4/3}/(\mu_{\theta}^{2}d), Assumption 5 is also satisfied.

Now we move on to check the conditions for η1(d)\eta_{1}(d). Recall from the proof of Corollary 4.12 that we have

Cov(𝐄0𝐏𝐞j)=𝐏𝐞j22(𝐅𝚯𝚯𝐅+n𝐈d)+n𝐏𝐞j𝐞j𝐏.\operatorname*{\rm Cov}(\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})=\|\mathbf{P}_{\perp}\mathbf{e}_{j}\|_{2}^{2}(\mathbf{F}\mathbf{\Theta}^{\top}\mathbf{\Theta}\mathbf{F}^{\top}+n\mathbf{I}_{d})+\!n\mathbf{P}_{\perp}\mathbf{e}_{j}\mathbf{e}_{j}^{\top}\mathbf{P}_{\perp}.

Then we have

𝚺~j𝚺j2KdΔ2(n+Δ)O(Kdn(n+Δ))nλ12=o(nλ12).\displaystyle\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}\lesssim\frac{K}{d\Delta^{2}}(n+\Delta)\lesssim O\left(\frac{K}{dn}(n+\Delta)\right)\frac{n}{\lambda_{1}^{2}}=o\left(\frac{n}{\lambda_{1}^{2}}\right).

Besides, it can be seen that λK(𝚺~j)n/λ12+1/λ1\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\geq n/\lambda_{1}^{2}+1/\lambda_{1}, and hence we can take η1(d)=λ12n/2+λ11/2\eta_{1}(d)=\lambda_{1}^{-2}n/2+\lambda_{1}^{-1}/2. Next we move on to verify the statistical rates r3(d)r_{3}(d) and r4(d)r_{4}(d). By Davis-Kahan’s Theorem [45], we have that with high probability

𝐕^sgn(𝐕^𝐕)𝐕2,𝐕^sgn(𝐕^𝐕)𝐕2𝐄2/Δr1(d)/Δ,\displaystyle\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\leq\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2}\lesssim\|\mathbf{E}\|_{2}/\Delta\lesssim r_{1}^{\prime}(d)/\Delta,

where r1(d)=dΔ0/K+dnlogdr_{1}^{\prime}(d)=d\Delta_{0}/\sqrt{K}+\sqrt{dn}\log d as defined in the proof of Corollary 4.12, and thus we know that r3(d)r1(d)/Δr_{3}(d)\asymp r_{1}^{\prime}(d)/\Delta. Besides, with high probability we have

𝐄0(𝐕^(𝐕^𝐕)𝐕)2,𝐄0(𝐕^(𝐕^𝐕)𝐕)2r1(d)2/Δ,\displaystyle\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2,\infty}\leq\|\mathbf{E}_{0}(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V})\|_{2}\lesssim r_{1}^{\prime}(d)^{2}/\Delta,

and we have r4(d)r1(d)2/Δr_{4}(d)\asymp r_{1}^{\prime}(d)^{2}/\Delta. Thus Assumption 4 is satisfied. Then we have

r(d)\displaystyle r(d) =KdpLr1(d)Δ+r3(d)r1(d)/Δ+Kdr1(d)2/Δ2+(r2(d)+r4(d))/Δ\displaystyle=\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{K}{d}}r_{1}(d)^{2}/\Delta^{2}+\big{(}r_{2}(d)+r_{4}(d)\big{)}/\Delta
KdpLr1(d)Δ+r1(d)2/Δ2KΔ02+K2n(logd)2dΔ04+KdpL(KΔ0+KndΔ02).\displaystyle\lesssim\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{1}^{\prime}(d)^{2}/\Delta^{2}\lesssim\frac{K}{\Delta_{0}^{2}}+\frac{K^{2}n(\log d)^{2}}{d\Delta_{0}^{4}}+\sqrt{\frac{Kd}{pL}}\Big{(}\frac{\sqrt{K}}{\Delta_{0}}+\frac{K\sqrt{n}}{\sqrt{d}\Delta_{0}^{2}}\Big{)}.

Therefore, under the conditions that Δ02Kn(logd)2\Delta_{0}^{2}\gg K\sqrt{n}(\log d)^{2}, nd2n\gg d^{2} and LKd2/pL\gg Kd^{2}/p, we have η1(d)1/2r(d)=o(1)\eta_{1}(d)^{-1/2}r(d)=o(1). Thus by Theorem 4.5, (8) holds. As for (13), from the above arguments we have 𝚺~j𝚺j2=o(λK(𝚺j))\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}=o\big{(}\lambda_{K}(\bm{\Sigma}_{j})\big{)}, and hence (13) holds.

Now we need to check the validity of 𝚺^j\widehat{\bm{\Sigma}}_{j}. As before, it suffices to prove that 𝚺^j𝐇𝚺~j𝐇2=oP(λK(𝚺~j))\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}. From Corollary 4.8, we have that 𝐌^𝐌2=OP(dΔ0/K+dn)\|\widehat{\mathbf{M}}-\mathbf{M}\|_{2}=O_{P}(d\Delta_{0}/\sqrt{K}+\sqrt{dn}) and 𝐕~F𝐇𝐕2=𝐕~F𝐕𝐇2=OP(Knd/Δ02)\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=O_{P}\big{(}K\sqrt{\frac{n}{d}}/\Delta_{0}^{2}\big{)}. Then we have

𝚲~𝐇𝚲𝐇2𝐕~F(𝐌^𝐌)𝐕~F2+(𝐕~F𝐕𝐇)𝐌𝐕~F2\displaystyle\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\|_{2}\leq\|\widetilde{\mathbf{V}}^{\text{F}\top}(\widehat{\mathbf{M}}-\mathbf{M})\widetilde{\mathbf{V}}^{\text{F}}\|_{2}+\|(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})^{\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\|_{2}
+𝐇𝐕𝐌(𝐕~F𝐕𝐇)2=OP(dΔ0/K+dn).\displaystyle\quad+\|\mathbf{H}\mathbf{V}^{\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top})\|_{2}=O_{P}\Big{(}d\Delta_{0}/\sqrt{K}+\sqrt{dn}\Big{)}.

Then if we denote 𝐃𝚲=(𝚲~𝐇𝚲𝐇)𝐇𝚲1𝐇\mathbf{D}_{\mathbf{\Lambda}}=(\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top})\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}, we have that

𝐃𝚲2=OP(KndΔ02)=oP(1),\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}\left(K\sqrt{\frac{n}{d}}\Delta_{0}^{-2}\right)=o_{P}(1),

and thus we have

𝚲~1𝐇𝚲1𝐇2𝚲12𝐃𝚲2=OP(KndΔ02)Δ1=oP(n/λ12)=oP(λK(𝚺~j)),\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}\lesssim\|\mathbf{\Lambda}^{-1}\|_{2}\|\mathbf{D}_{\mathbf{\Lambda}}\|_{2}=O_{P}\left(K\sqrt{\frac{n}{d}}\Delta_{0}^{-2}\right)\Delta^{-1}=o_{P}(n/\lambda_{1}^{2})=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)},

and furthermore, we have

n𝚲~2𝐇𝚲2𝐇2\displaystyle n\|\widetilde{\mathbf{\Lambda}}^{-2}\!-\!\mathbf{H}\mathbf{\Lambda}^{-2}\mathbf{H}^{\top}\|_{2} n𝚲12𝚲~1𝐇𝚲1𝐇2=OP(Knd/Δ02)nΔ2=oP(λK(𝚺~j)).\displaystyle\!\lesssim\!n\|\mathbf{\Lambda}^{-1}\|_{2}\!\|\widetilde{\mathbf{\Lambda}}^{-1}\!-\!\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\!\|_{2}\!=\!O_{P}\big{(}K\sqrt{\frac{n}{d}}/\Delta_{0}^{2}\big{)}n\Delta^{-2}\!=\!o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}.

Combining the above results, we have 𝚺^j𝐇𝚺~j𝐇2=oP(λK(𝚺~j))\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big{(}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big{)}, and hence (13) holds with 𝚺~j\widetilde{\bm{\Sigma}}_{j} replaced by 𝚺^j\widehat{\bm{\Sigma}}_{j}.

B.13 Proof of Corollary 4.9

Recall that $\widehat{\mathbf{M}}=(1/\widehat{\theta})\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}})$ and $\widehat{\mathbf{M}}^{\prime}=(1/\theta)\mathcal{P}_{{\mathcal{S}}}(\mathbf{M}+\bar{\mathcal{E}})$ share exactly the same sequence of eigenvectors, so we can treat $\widetilde{\mathbf{V}}^{\text{F}}$ as the FADI estimator applied to $\widehat{\mathbf{M}}^{\prime}$. With slight abuse of notation, we denote $\mathbf{E}:=\widehat{\mathbf{M}}^{\prime}-\mathbf{M}$.

To show that (8) holds, we need to verify that Assumptions 1 to 5 hold and that the minimum eigenvalue conditions hold for the asymptotic covariance matrix. We know from Corollary 4.2 that Assumption 1 and Assumption 2 are satisfied, with $r_{1}(d)=|\lambda_{1}|\mu K/\sqrt{d\theta}+\sqrt{d\sigma^{2}/\theta}$ and $r_{2}(d)=0$. Defining $\widetilde{\sigma}=(|\lambda_{1}|\mu K/d)\vee\sigma$, we have from the proof of Corollary 4.2 that $\operatorname{Var}(\mathbf{E}_{ij})\asymp\widetilde{\sigma}^{2}/\theta$ and $|\mathbf{E}_{ij}|=O(\widetilde{\sigma}\log d/\theta)$ for $i,j\in[d]$. From Theorem 4.2.1 in Chen et al. [12], we have that with probability $1-O(d^{-5})$,

\|\widehat{\mathbf{V}}\operatorname{sgn}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\|_{2,\infty}\lesssim\frac{\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}}{\Delta},

and thus we know $r_{3}(d)\asymp\big(\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}\big)/\Delta$. Besides, by the proof of Theorem 4.2.1 in Chen et al. [12], with probability $1-O(d^{-7})$, we have

\displaystyle\big\|\mathbf{E}\big(\widehat{\mathbf{V}}(\widehat{\mathbf{V}}^{\top}\mathbf{V})-\mathbf{V}\big)\big\|_{2,\infty}\lesssim\frac{\sqrt{dK}\widetilde{\sigma}^{2}}{\Delta\theta}\big(\sqrt{\log d}+\sqrt{\mu}\big)+\widetilde{\sigma}\sqrt{\frac{d}{\theta}}r_{3}(d)+\frac{\widetilde{\sigma}}{\Delta}\sqrt{K\frac{\log d}{\theta}}\|\mathbf{E}\|_{2}
\displaystyle\quad\lesssim\frac{\sqrt{d}\widetilde{\sigma}^{2}}{\Delta\theta}\big(\sqrt{K\log d}+\kappa_{2}\sqrt{\mu K}\big),

and thus $r_{4}(d)\asymp\frac{\sqrt{d}\widetilde{\sigma}^{2}}{\Delta\theta}\big(\sqrt{K\log d}+\kappa_{2}\sqrt{\mu K}\big)$. Therefore, Assumption 4 is met and we have

\displaystyle r(d)=\sqrt{\frac{Kd}{pL}}\frac{r_{1}(d)}{\Delta}+r_{3}(d)r_{1}(d)/\Delta+\sqrt{\frac{\mu K}{d}}r_{1}(d)^{2}/\Delta^{2}+\big(r_{2}(d)+r_{4}(d)\big)/\Delta
\displaystyle\lesssim\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\left(\left(\frac{\kappa_{2}\sqrt{\mu K}+\sqrt{K\log d}}{\Delta}\right)\frac{\widetilde{\sigma}}{\sqrt{\theta}}+\sqrt{\frac{Kd}{pL}}\right).

Now we study the statistical rate $\eta_{1}(d)$. The entries $\mathbf{E}_{ij}=\mathbf{E}_{ji}$ are i.i.d. across $i\leq j$ with $\operatorname{Var}(\mathbf{E}_{ij})\asymp\widetilde{\sigma}^{2}/\theta$. Then by Lemma B.5, with almost identical arguments as in the proof of Corollary 4.7, for $j\in[d]$ we have $\|\operatorname{Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname{Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}\lesssim(\widetilde{\sigma}^{2}/\theta)\sqrt{\mu K/d}$, and thus $\lambda_{d}\big(\operatorname{Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})\big)\gtrsim\lambda_{d}\big(\operatorname{Cov}(\mathbf{E}\mathbf{e}_{j})\big)\gtrsim\widetilde{\sigma}^{2}/\theta$, so that $\eta_{1}(d)\asymp\lambda_{1}^{-2}\theta^{-1}\widetilde{\sigma}^{2}$. Therefore, under the conditions that $L\gg\kappa_{2}^{2}Kd^{2}/p$ and $(\widetilde{\sigma}/\Delta)\sqrt{d/\theta}\ll\min\big((\kappa_{2}^{2}\sqrt{\mu K}+\kappa_{2}\sqrt{K\log d})^{-1},\sqrt{p/d}\big)$, we have that $\eta_{1}(d)^{-1/2}r(d)=o(1)$.

Now we move on to verify Assumption 5. Specifically, we will show that the following result holds:

Given $j\in[d]$, for any matrix $\mathbf{A}\in\mathbb{R}^{d\times K}$ that satisfies the following two conditions: (1) $\|\mathbf{A}\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\leq C\sqrt{\lambda_{1}^{2}\mu K/(d\Delta^{2})}$; (2) $\lambda_{K}\big(\bm{\Sigma}_{j}\big)\geq c\widetilde{\sigma}^{2}\theta^{-1}\big(\sigma_{\min}(\mathbf{A})\big)^{2}$, where $\bm{\Sigma}_{j}:=\operatorname{Cov}(\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j})$ and $C,c>0$ are fixed constants independent of $\mathbf{A}$, it holds that

\mathbf{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}_{0}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(\mathbf{0},\mathbf{I}_{K}). (B.33)

To prove (B.33), it suffices to show that ${\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}\overset{d}{\rightarrow}{\mathcal{N}}(0,1)$ for any ${\mathbf{a}}\in\mathbb{R}^{K}$ with $\|{\mathbf{a}}\|_{2}=1$. We will first study $\mathbf{P}_{\perp}\mathbf{e}_{j}$, $\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}$ and $\max_{ik}{\mathbb{E}}|\mathbf{E}_{ik}|^{3}$. It holds that

\displaystyle|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}|=\big|\big((\mathbf{I}_{d}-\mathbf{V}\mathbf{V}^{\top})\mathbf{e}_{j}\big)_{j}\big|\leq 1+\|\mathbf{V}\|_{2,\infty}^{2}=1+o(1);
\displaystyle\max_{i\neq j}|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|=\max_{i\neq j}|\mathbf{e}_{i}^{\top}\mathbf{e}_{j}-\mathbf{e}_{i}^{\top}\mathbf{V}\mathbf{V}^{\top}\mathbf{e}_{j}|\leq 0+\|\mathbf{V}\|_{2,\infty}^{2}=\frac{\mu K}{d};
\displaystyle\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}\leq\|\mathbf{A}\|_{2,\infty}\|\bm{\Sigma}_{j}^{-1/2}\|_{2}\lesssim(\widetilde{\sigma}^{2}/\theta)^{-1/2}\|\mathbf{A}\|_{2,\infty}/\sigma_{\min}(\mathbf{A})\lesssim\kappa_{2}\sqrt{\frac{\mu K}{d}}\frac{\sqrt{\theta}}{\widetilde{\sigma}};
\displaystyle\max_{ik}{\mathbb{E}}|\mathbf{E}_{ik}|^{3}\lesssim\frac{\|\mathbf{M}\|_{\max}^{3}}{\theta^{3}}\theta+\frac{\sigma^{3}(\log d)^{3}}{\theta^{3}}\theta\lesssim\frac{\widetilde{\sigma}^{3}}{\theta^{2}}(\log d)^{3}.

Then we know that

\displaystyle{\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}=\sum_{ik}\mathbf{E}_{ik}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}=\sum_{i=1}^{d}\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}
\displaystyle\quad+\sum_{i<k}\mathbf{E}_{ik}\big[(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big].

Then for the diagonal entries we have

\displaystyle\sum_{i=1}^{d}{\mathbb{E}}|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|^{3}
\displaystyle={\mathbb{E}}|\mathbf{E}_{jj}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{j}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{j}|^{3}+\sum_{i\neq j}{\mathbb{E}}|\mathbf{E}_{ii}(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|^{3}
\displaystyle\lesssim{\mathbb{E}}|\mathbf{E}_{jj}|^{3}\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}+d\max_{i}{\mathbb{E}}|\mathbf{E}_{ii}|^{3}\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}\max_{i\neq j}|(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}|^{3}
\displaystyle\lesssim\frac{\kappa_{2}^{3}K\mu}{d}\sqrt{\frac{\mu K}{d\theta}}(\log d)^{3},

and for the off-diagonal entries, under the condition $\kappa_{2}^{6}K^{3}\mu^{3}=o(d^{1/2})$ it holds that

\displaystyle\sum_{i<k}{\mathbb{E}}\Big|\mathbf{E}_{ik}\big[(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{i}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{k}+(\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}})_{k}(\mathbf{P}_{\perp}\mathbf{e}_{j})_{i}\big]\Big|^{3}\lesssim d\frac{\widetilde{\sigma}^{3}}{\theta^{2}}\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}(\log d)^{3}
\displaystyle\quad+d^{2}\frac{\widetilde{\sigma}^{3}}{\theta^{2}}(\log d)^{3}\|\mathbf{A}\bm{\Sigma}_{j}^{-1/2}{\mathbf{a}}\|_{\infty}^{3}\Big(\frac{\mu K}{d}\Big)^{3}\lesssim\kappa_{2}^{3}K\mu\sqrt{\frac{\mu K}{d\theta}}(\log d)^{3}=o(1).

Moreover, since $\operatorname{Var}({\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})=1$, Lyapunov's condition is satisfied, so (B.33) holds and Assumption 5 is verified by plugging in $\mathbf{A}=\mathbf{V}\mathbf{\Lambda}^{-1}$. Then (8) follows by Theorem 4.5.
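As a sanity check on the Lyapunov-type central limit theorem used above, the following Python sketch simulates a toy analogue of the linear statistic ${\mathbf{a}}^{\top}\bm{\Sigma}_{j}^{-1/2}\mathbf{A}^{\top}\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j}$ for a symmetric noise matrix whose entries are revealed independently with probability $\theta$. The dimensions, the choice $\mathbf{A}=\mathbf{V}$, and the Gaussian noise are illustrative assumptions rather than the exact setting of Example 4, and the standardization by $\bm{\Sigma}_{j}^{-1/2}$ is replaced by an empirical one.

import numpy as np

rng = np.random.default_rng(0)
d, K, theta, n_rep = 400, 3, 0.3, 2000

# Toy low-rank structure: V is an orthonormal d x K basis and P_perp = I - V V^T.
V, _ = np.linalg.qr(rng.standard_normal((d, K)))
P_perp = np.eye(d) - V @ V.T
A = V.copy()                  # illustrative choice of the d x K matrix A
a = np.ones(K) / np.sqrt(K)   # unit vector a
w = P_perp[:, 0]              # P_perp e_j with j fixed to the first coordinate

stats = np.empty(n_rep)
for r in range(n_rep):
    # Symmetric noise with entries revealed independently with probability theta.
    noise = rng.standard_normal((d, d))
    mask = rng.random((d, d)) < theta
    E = np.triu(noise * mask / theta)
    E = E + np.triu(E, 1).T   # symmetrize
    stats[r] = a @ (A.T @ E @ w)

# Empirical standardization in place of Sigma_j^{-1/2}; compare with N(0, 1).
z = (stats - stats.mean()) / stats.std()
print("P(|Z| > 1.96) over replications:", np.mean(np.abs(z) > 1.96).round(3), "(nominal 0.05)")

The empirical rejection rate should be close to the nominal 5% level, consistent with the limit in (B.33).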

To show that (15) holds, we need to show that $\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}=o(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j}))$. From the previous discussion, we have

\displaystyle\|\widetilde{\bm{\Sigma}}_{j}-\bm{\Sigma}_{j}\|_{2}\leq\|\mathbf{V}\mathbf{\Lambda}^{-1}\|_{2}^{2}\|\operatorname{Cov}(\mathbf{E}\mathbf{P}_{\perp}\mathbf{e}_{j})-\operatorname{Cov}(\mathbf{E}\mathbf{e}_{j})\|_{2}
\displaystyle\lesssim\frac{1}{\Delta^{2}}\sqrt{\frac{\mu K}{d}}\frac{\widetilde{\sigma}^{2}}{\theta}\lesssim\kappa_{2}^{2}\sqrt{\frac{\mu K}{d}}\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})=o(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})).

Then by Slutsky’s Theorem, (15) holds.

Lastly, we verify that the distributional convergence still holds when we plug in the estimator $\widehat{\bm{\Sigma}}_{j}$. As in the previous proof, it suffices to prove that $\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=o_{P}\big(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big)$. In the following, we condition on the event that $\mathbf{H}$ is orthonormal. We first bound $\|\widetilde{\mathbf{M}}-\mathbf{M}\|_{\max}$. From the previous discussion, we have the following bounds:

\|\widehat{\mathbf{M}}^{\prime}-\mathbf{M}\|_{2}=O_{P}(\sqrt{d\widetilde{\sigma}^{2}/\theta}),\quad\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}=\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}=O_{P}\Big(\frac{1}{\Delta}\sqrt{d\widetilde{\sigma}^{2}/\theta}\Big),

and

\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2,\infty}\leq\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\widehat{\mathbf{V}}\mathbf{H}_{0}\|_{2}+\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty}=o_{P}\Big(\frac{\widetilde{\sigma}}{|\lambda_{1}|\sqrt{\theta}}\Big)
\displaystyle\quad+O_{P}\Big(\frac{\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}}{\Delta}\Big)=O_{P}\Big(\frac{\kappa_{2}\widetilde{\sigma}\sqrt{\mu K/\theta}+\widetilde{\sigma}\sqrt{K\log d/\theta}}{\Delta}\Big).

Now we can study $\widetilde{\mathbf{M}}=(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})\widehat{\mathbf{M}}(\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{V}}^{\text{F}\top})=\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\big(\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\big)\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}$. Recall that by Hoeffding's inequality [21], with probability $1-O(d^{-10})$ we have $|\widehat{\theta}-\theta|\lesssim\frac{\sqrt{\log d}}{d}$ and $|{\mathcal{S}}|=\Omega(d^{2}\theta)$. Then we have that

\displaystyle\Big\|\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\Big\|_{2}\leq\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\|_{2}
\displaystyle\quad+\|\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\mathbf{M}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})\|_{2}+\|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})^{\top}\mathbf{M}\mathbf{V}\|_{2}+O_{P}\Big(\frac{\sqrt{\log d}}{d\theta}|\lambda_{1}|\Big)
\displaystyle\lesssim\|\widehat{\mathbf{M}}^{\prime}-\mathbf{M}\|_{2}+2\|\mathbf{M}\|_{2}\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V}\|_{2}+O_{P}\Big(\frac{\sqrt{\log d}}{d\theta}|\lambda_{1}|\Big)=O_{P}(\kappa_{2}\sqrt{d\widetilde{\sigma}^{2}/\theta}).

Then for any $i,k\in[d]$, we have

\displaystyle|\widetilde{\mathbf{M}}_{ik}-\mathbf{M}_{ik}|=\Big|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}\Big(\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\Big)(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}-\mathbf{M}_{ik}\Big|
\displaystyle\leq\Big|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{i}^{\top}\Big(\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\Big)(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}\Big|+|(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H})_{k}|
\displaystyle\quad+|(\mathbf{V})_{i}\mathbf{\Lambda}(\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{V})_{k}|=O_{P}(\kappa_{2}\sqrt{d\widetilde{\sigma}^{2}/\theta}\,\|\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}\|_{2,\infty}^{2})
\displaystyle\quad+O_{P}(|\lambda_{1}|\|\mathbf{V}\|_{2,\infty}\|\widehat{\mathbf{V}}\mathbf{H}_{0}-\mathbf{V}\|_{2,\infty})=O_{P}\Big(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{|\lambda_{1}|\mu K}{d}=O_{P}\left(\frac{\kappa_{2}\mu K}{\sqrt{d\theta}}\right)\widetilde{\sigma},

and in turn we have

\displaystyle|\widetilde{\mathbf{M}}_{ik}^{2}-\mathbf{M}_{ik}^{2}|\lesssim\frac{|\lambda_{1}|\mu K}{d}|\widetilde{\mathbf{M}}_{ik}-\mathbf{M}_{ik}|=O_{P}\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\Big(\frac{|\lambda_{1}|\mu K}{d}\Big)^{2},\quad\forall i,k\in[d].

Now we move on to bound the error of $\widehat{\sigma}^{2}$. We know from the setting of Example 4 that the $\varepsilon_{ik}$'s are sub-Gaussian with variance proxy of order $O(\sigma^{2}(\log d)^{2})$, and thus

\displaystyle|\widehat{\sigma}^{2}-\sigma^{2}|=\Big|\sum_{(i,k)\in{\mathcal{S}}}(\mathbf{M}_{ik}+\varepsilon_{ik}-\widetilde{\mathbf{M}}_{ik})^{2}/|{\mathcal{S}}|-\sigma^{2}\Big|=\Big|\sum_{(i,k)\in{\mathcal{S}}}(\mathbf{M}_{ik}+\varepsilon_{ik}-\mathbf{M}_{ik}+\mathbf{M}_{ik}-\widetilde{\mathbf{M}}_{ik})^{2}/|{\mathcal{S}}|-\sigma^{2}\Big|
\displaystyle\lesssim\Big|\frac{1}{|{\mathcal{S}}|}\sum_{(i,k)\in{\mathcal{S}}}\varepsilon_{ik}^{2}-\sigma^{2}\Big|+\|\widetilde{\mathbf{M}}-\mathbf{M}\|_{\max}^{2}=O_{P}\Big(\frac{\sigma^{2}(\log d)^{2}}{\sqrt{|{\mathcal{S}}|}}\Big)+O_{P}\Big(\frac{\kappa_{2}^{2}\mu^{2}K^{2}}{d\theta}\Big)\widetilde{\sigma}^{2}
\displaystyle=O_{P}\Big(\frac{(\log d)^{2}}{d\sqrt{\theta}}\Big)\sigma^{2}+O_{P}\Big(\frac{\kappa_{2}^{2}\mu^{2}K^{2}}{d\theta}\Big)\widetilde{\sigma}^{2}.

Then for any $i\in[d]$, we have that

\displaystyle\Big|\frac{\widetilde{\mathbf{M}}_{ij}^{2}(1-\widehat{\theta})}{\widehat{\theta}}+\frac{\widehat{\sigma}^{2}}{\widehat{\theta}}-\frac{\mathbf{M}_{ij}^{2}(1-\theta)}{\theta}-\frac{\sigma^{2}}{\theta}\Big|\lesssim|\widetilde{\mathbf{M}}_{ij}|^{2}\Big|\frac{1}{\widehat{\theta}}-\frac{1}{\theta}\Big|+\frac{|\widetilde{\mathbf{M}}_{ij}^{2}-\mathbf{M}_{ij}^{2}|}{\theta}+\widehat{\sigma}^{2}\Big|\frac{1}{\widehat{\theta}}-\frac{1}{\theta}\Big|+\frac{|\widehat{\sigma}^{2}-\sigma^{2}|}{\theta}
\displaystyle=O_{P}\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\theta^{-1}\Big(\frac{|\lambda_{1}|\mu K}{d}\Big)^{2}+O_{P}\Big(\frac{(\log d)^{2}}{d\sqrt{\theta}}+\frac{\kappa_{2}^{2}\mu^{2}K^{2}}{d\theta}\Big)\frac{\widetilde{\sigma}^{2}}{\theta}=O_{P}\left(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\right)\frac{\widetilde{\sigma}^{2}}{\theta},

and thus we have that

\|\operatorname{diag}\big([\widetilde{\mathbf{M}}_{ij}^{2}(1-\widehat{\theta})/\widehat{\theta}+\widehat{\sigma}^{2}/\widehat{\theta}]_{i=1}^{d}\big)-\operatorname{diag}\big([\mathbf{M}_{ij}^{2}(1-\theta)/\theta+\sigma^{2}/\theta]_{i=1}^{d}\big)\|_{2}=O_{P}\Big(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{\widetilde{\sigma}^{2}}{\theta}.

Also, we have shown that

\|\widetilde{\mathbf{\Lambda}}-\mathbf{H}\mathbf{\Lambda}\mathbf{H}^{\top}\|_{2}=\Big\|\frac{\theta}{\widehat{\theta}}\mathbf{H}^{\top}\widetilde{\mathbf{V}}^{\text{F}\top}\widehat{\mathbf{M}}^{\prime}\widetilde{\mathbf{V}}^{\text{F}}\mathbf{H}-\mathbf{\Lambda}\Big\|_{2}=O_{P}\Big(\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\Delta,

then we have $\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}=O_{P}\Big(\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{1}{\Delta}$, and hence

\displaystyle\|\widetilde{\mathbf{V}}^{\text{F}}\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{V}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}\leq\|\widetilde{\mathbf{\Lambda}}^{-1}-\mathbf{H}\mathbf{\Lambda}^{-1}\mathbf{H}^{\top}\|_{2}+\|\mathbf{\Lambda}^{-1}\|_{2}\|\widetilde{\mathbf{V}}^{\text{F}}-\mathbf{V}\mathbf{H}^{\top}\|_{2}
\displaystyle=O_{P}\Big(\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{1}{\Delta}+O_{P}\Big(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{1}{\Delta}=O_{P}\Big(\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{1}{\Delta}.

Then by basic algebra, we have that with high probability

\displaystyle\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}\lesssim O_{P}\Big(\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{\widetilde{\sigma}^{2}}{\Delta^{2}\theta}+O_{P}\Big(\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{\widetilde{\sigma}^{2}}{\Delta^{2}\theta}=O_{P}\Big(\kappa_{2}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{\widetilde{\sigma}^{2}}{\Delta^{2}\theta}.

Then under the condition that $\kappa_{2}^{3}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}=o(1)$, we have that

\|\widehat{\bm{\Sigma}}_{j}-\mathbf{H}\widetilde{\bm{\Sigma}}_{j}\mathbf{H}^{\top}\|_{2}=O_{P}\Big(\kappa_{2}^{3}\frac{\sqrt{d}\widetilde{\sigma}}{\Delta\sqrt{\theta}}\Big)\frac{\widetilde{\sigma}^{2}}{\lambda_{1}^{2}\theta}=o_{P}\big(\lambda_{K}(\widetilde{\bm{\Sigma}}_{j})\big).

C Proof of Technical Lemmas

In this section, we provide proofs of the technical lemmas used in the proofs of the main theorems.

C.1 Proof of Lemma B.2

It can be easily seen that

\|\mathbf{\Omega}/\sqrt{p}\|_{2}=(\|\mathbf{\Omega}\mathbf{\Omega}^{\top}/p\|_{2})^{1/2}=\left((d/p)\|\mathbf{\Omega}^{\top}\mathbf{\Omega}/d\|_{2}\right)^{1/2}.

By Lemma 3 in [18], we know that $\big\|\|\mathbf{\Omega}^{\top}\mathbf{\Omega}/d-\mathbf{I}_{p}\|_{2}\big\|_{\psi_{1}}\lesssim\sqrt{p/d}$, and thus $\big\|\|\mathbf{\Omega}^{\top}\mathbf{\Omega}/d\|_{2}\big\|_{\psi_{1}}\lesssim 1+\sqrt{p/d}=O(1)$. Therefore, we have $\big\|\|\mathbf{\Omega}\mathbf{\Omega}^{\top}/p\|_{2}\big\|_{\psi_{1}}\lesssim d/p$. By Jensen's inequality, we in turn get $\big\|\|\mathbf{\Omega}/\sqrt{p}\|_{2}\big\|_{\psi_{1}}\lesssim\sqrt{d/p}$.
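For intuition, the following Python sketch gives a minimal Monte Carlo check, under the assumption of i.i.d. standard Gaussian entries and with illustrative values of $d$ and $p$, that $\|\mathbf{\Omega}/\sqrt{p}\|_{2}$ is of order $\sqrt{d/p}$, as asserted by Lemma B.2 up to constants.

import numpy as np

rng = np.random.default_rng(1)
p, n_rep = 50, 20

# Monte Carlo check that ||Omega / sqrt(p)||_2 grows like sqrt(d/p) for a
# d x p matrix Omega with i.i.d. standard Gaussian entries.
for d in (500, 2000, 8000):
    norms = [np.linalg.norm(rng.standard_normal((d, p)) / np.sqrt(p), ord=2)
             for _ in range(n_rep)]
    print(f"d = {d:5d}:  avg spectral norm = {np.mean(norms):6.2f},  sqrt(d/p) = {np.sqrt(d / p):6.2f}")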

C.2 Proof of Lemma B.3

By Proposition 10.4 in [20], for any $t\geq 1$ we have

\mathbb{P}\left(\left\|\bm{\Omega}^{\dagger}\right\|_{2}\geq\frac{{\rm e}\sqrt{p}}{p-K+1}\cdot t\right)\leq t^{-(p-K+1)}. (C.34)

Since $p\geq 2K$, there exists a constant $c$ such that $\frac{{\rm e}p}{p-K+1}\leq c$, and thus

\mathbb{P}\left(\sqrt{p}\left\|\bm{\Omega}^{\dagger}\right\|_{2}\geq ct\right)\leq t^{-(p-K+1)}. (C.35)

Therefore, we have

\displaystyle{\mathbb{E}}\left(\left(\sigma_{\min}(\bm{\Omega}/\sqrt{p})\right)^{-a}\right)={\mathbb{E}}\left(\left\|\sqrt{p}\bm{\Omega}^{\dagger}\right\|_{2}^{a}\right)=\int_{u\geq 0}\mathbb{P}\left(\left\|\sqrt{p}\bm{\Omega}^{\dagger}\right\|_{2}^{a}\geq u\right)du
\displaystyle=\int_{0\leq u\leq c^{a}}\mathbb{P}\left(\left\|\sqrt{p}\bm{\Omega}^{\dagger}\right\|_{2}^{a}\geq u\right)du+\int_{u\geq c^{a}}\mathbb{P}\left(\left\|\sqrt{p}\bm{\Omega}^{\dagger}\right\|_{2}^{a}\geq u\right)du
\displaystyle\leq c^{a}+\int_{u\geq c^{a}}\mathbb{P}\left(\left\|\sqrt{p}\bm{\Omega}^{\dagger}\right\|_{2}\geq u^{1/a}\right)du\leq c^{a}+\int_{u\geq c^{a}}\left(u^{1/a}/c\right)^{-(p-K+1)}du
\displaystyle=c^{a}\left(1+\frac{1}{(p-K+1)/a-1}\right).

Since $1+\frac{1}{(p-K+1)/a-1}\leq 2$, the claim follows.
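A minimal numerical sketch of the moment bound follows, assuming for illustration only that $\bm{\Omega}$ stands for a $K\times p$ matrix with i.i.d. standard Gaussian entries, and taking $a=2$ and $p\geq\max(2K,K+3)$.

import numpy as np

rng = np.random.default_rng(2)
K, p, a, n_rep = 5, 15, 2, 5000   # p >= max(2K, K + 3)

# Monte Carlo estimate of E[ sigma_min(Omega / sqrt(p))^{-a} ]; the lemma
# asserts that this moment is bounded by a constant depending only on a.
vals = np.empty(n_rep)
for i in range(n_rep):
    Omega = rng.standard_normal((K, p))
    vals[i] = np.linalg.svd(Omega / np.sqrt(p), compute_uv=False)[-1] ** (-a)

print("Monte Carlo estimate of the inverse moment:", vals.mean().round(2))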

C.3 Proof of Lemma B.4

We first consider the probability $\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon\right)$. Recall the matrix ${\mathbf{Y}}^{(\ell)}:=\mathbf{V}\mathbf{P}_{0}\mathbf{\Lambda}^{0}\mathbf{V}^{\top}\mathbf{\Omega}^{(\ell)}$. Now by Jensen's inequality and Wedin's Theorem [42], we have

\displaystyle\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}=\big\|\mathbb{E}\big(\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}\,\big|\,\widehat{\mathbf{M}}\big)-\mathbf{V}\mathbf{V}^{\top}\big\|_{2}\leq\mathbb{E}\left(\left\|\widehat{\mathbf{V}}^{(\ell)}\widehat{\mathbf{V}}^{(\ell)\top}-\mathbf{V}\mathbf{V}^{\top}\right\|_{2}\Big|\,\widehat{\mathbf{M}}\right)
\displaystyle\lesssim{\mathbb{E}}\left(\|\widehat{\mathbf{Y}}^{(\ell)}/\sqrt{p}-{\mathbf{Y}}^{(\ell)}/\sqrt{p}\|_{2}/\sigma_{K}\big({\mathbf{Y}}^{(\ell)}/\sqrt{p}\big)\,\Big|\,\widehat{\mathbf{M}}\right)\leq\frac{\|\mathbf{E}\|_{2}}{\Delta}{\mathbb{E}}\left(\frac{\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}}{\sigma_{\min}\big(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big)}\,\Big|\,\widehat{\mathbf{M}}\right)
\displaystyle=\frac{\|\mathbf{E}\|_{2}}{\Delta}{\mathbb{E}}\left(\frac{\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}}{\sigma_{\min}\big(\widetilde{\mathbf{\Omega}}^{(\ell)}/\sqrt{p}\big)}\right)\leq\frac{\|\mathbf{E}\|_{2}}{\Delta}{\mathbb{E}}\left(\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}^{2}\right)^{1/2}{\mathbb{E}}\left(\big(\sigma_{\min}(\bm{\Omega}^{(\ell)}/\sqrt{p})\big)^{-2}\right)^{1/2}
\displaystyle\lesssim\frac{\|\mathbf{E}\|_{2}}{\Delta}\big\|\|\mathbf{\Omega}^{(\ell)}/\sqrt{p}\|_{2}\big\|_{\psi_{1}}\lesssim\frac{\|\mathbf{E}\|_{2}}{\Delta}\sqrt{d/p},

where the last but one inequality is due to Lemma B.3 under the condition that $p\geq\max(2K,K+3)$, and the last inequality is due to Lemma B.2. Therefore, by Assumption 1, there exist constants $c_{0},c_{0}^{\prime}>0$ such that

\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon\right)\leq\mathbb{P}\left(\frac{\|\mathbf{E}\|_{2}}{\Delta}\sqrt{d/p}\geq c_{0}^{\prime}\varepsilon\right)\leq\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{r_{1}(d)}\right).

Similarly, we consider the probability $\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq\varepsilon\right)$. By Assumption 1, there exist constants $c_{0}^{\prime\prime},c_{0}^{\prime\prime\prime}>0$ such that

\displaystyle\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}\|_{2}\geq\varepsilon\right)\leq\mathbb{P}\left(\|{\mathbf{\Sigma}}^{\prime}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon/2\right)+\mathbb{P}\left(\|\widehat{\mathbf{V}}\widehat{\mathbf{V}}^{\top}-\mathbf{V}\mathbf{V}^{\top}\|_{2}\geq\varepsilon/2\right)
\displaystyle\leq\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{2r_{1}(d)}\right)+\mathbb{P}\left(\frac{\|\mathbf{E}\|_{2}}{\Delta}\geq c_{0}^{\prime\prime\prime}\varepsilon\right)\leq\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{2r_{1}(d)}\right)+\exp\left(-\frac{c_{0}^{\prime\prime}\Delta\varepsilon}{r_{1}(d)}\right)
\displaystyle\lesssim\exp\left(-c_{0}\sqrt{\frac{p}{d}}\frac{\Delta\varepsilon}{2r_{1}(d)}\right).

Therefore, the claim follows.

C.4 Proof of Lemma B.5

We know that $\operatorname{Cov}(\mathbf{x}_{1}+\mathbf{x}_{2})=\operatorname{Cov}(\mathbf{x}_{1})+\operatorname{Cov}(\mathbf{x}_{2})+\operatorname{Cov}(\mathbf{x}_{1},\mathbf{x}_{2})+\operatorname{Cov}(\mathbf{x}_{2},\mathbf{x}_{1})$, where $\operatorname{Cov}(\mathbf{x}_{1},\mathbf{x}_{2})={\mathbb{E}}(\mathbf{x}_{1}-{\mathbb{E}}\mathbf{x}_{1})(\mathbf{x}_{2}-{\mathbb{E}}\mathbf{x}_{2})^{\top}$, and

\|\operatorname{Cov}(\mathbf{x}_{i})\|_{2}=\max_{\|\mathbf{v}\|_{2}=1}\mathbf{v}^{\top}\operatorname{Cov}(\mathbf{x}_{i})\mathbf{v}=\max_{\|\mathbf{v}\|_{2}=1}\operatorname{Var}\big(\mathbf{v}^{\top}\mathbf{x}_{i}\big),

for $i=1,2$. Therefore, we have

\displaystyle\|\operatorname{Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\|_{2}=\max_{\|\mathbf{v}\|_{2}=1,\|\mathbf{u}\|_{2}=1}\mathbf{v}^{\top}\operatorname{Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\mathbf{u}=\max_{\|\mathbf{v}\|_{2}=1,\|\mathbf{u}\|_{2}=1}\operatorname{Cov}(\mathbf{v}^{\top}\mathbf{x}_{1},\mathbf{u}^{\top}\mathbf{x}_{2})
\displaystyle\leq\max_{\|\mathbf{v}\|_{2}=1,\|\mathbf{u}\|_{2}=1}\sqrt{\operatorname{Var}(\mathbf{v}^{\top}\mathbf{x}_{1})}\sqrt{\operatorname{Var}(\mathbf{u}^{\top}\mathbf{x}_{2})}=\sqrt{\|\operatorname{Cov}(\mathbf{x}_{1})\|_{2}\|\operatorname{Cov}(\mathbf{x}_{2})\|_{2}}
\displaystyle\leq\frac{1}{2}\|\operatorname{Cov}(\mathbf{x}_{1})\|_{2}+\frac{1}{2}\|\operatorname{Cov}(\mathbf{x}_{2})\|_{2}.

Thus we have

\displaystyle\|\operatorname{Cov}(\mathbf{x}_{1}+\mathbf{x}_{2})\|_{2}\leq\|\operatorname{Cov}(\mathbf{x}_{1})\|_{2}+\|\operatorname{Cov}(\mathbf{x}_{2})\|_{2}+\|\operatorname{Cov}(\mathbf{x}_{1},\mathbf{x}_{2})\|_{2}+\|\operatorname{Cov}(\mathbf{x}_{2},\mathbf{x}_{1})\|_{2}
\displaystyle\leq 2\|\operatorname{Cov}(\mathbf{x}_{1})\|_{2}+2\|\operatorname{Cov}(\mathbf{x}_{2})\|_{2}.
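The inequality can also be checked numerically; the following Python sketch compares the two sides on sample covariance matrices of two correlated random vectors, where the particular joint distribution is an illustrative assumption.

import numpy as np

rng = np.random.default_rng(3)
d, n = 20, 50000

# Draw two correlated d-dimensional samples x1, x2 and check the bound
# ||Cov(x1 + x2)||_2 <= 2 ||Cov(x1)||_2 + 2 ||Cov(x2)||_2 on sample covariances.
B = rng.standard_normal((d, d))
z = rng.standard_normal((n, d))
x1 = z @ B.T
x2 = 0.5 * z @ np.abs(B).T + rng.standard_normal((n, d))   # correlated with x1

def cov_norm(x):
    return np.linalg.norm(np.cov(x, rowvar=False), ord=2)

lhs = cov_norm(x1 + x2)
rhs = 2 * cov_norm(x1) + 2 * cov_norm(x2)
print(f"||Cov(x1+x2)||_2 = {lhs:.2f}  <=  {rhs:.2f} = 2||Cov(x1)||_2 + 2||Cov(x2)||_2")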

D Wedin’s Theorem

Lemma D.1 (Modified Wedin’s Theorem).

Let $\mathbf{M}^{\star}$ and $\mathbf{M}=\mathbf{M}^{\star}+\mathbf{E}$ be two matrices in $\mathbb{R}^{n_{1}\times n_{2}}$ (without loss of generality, we assume $n_{1}\leq n_{2}$), whose SVDs are given respectively by

\mathbf{M}^{\star}=\sum_{i=1}^{n_{1}}\sigma_{i}^{\star}\mathbf{u}_{i}^{\star}\mathbf{v}_{i}^{\star\top}=\left[\begin{array}{ll}\mathbf{U}^{\star}&\mathbf{U}_{\perp}^{\star}\end{array}\right]\left[\begin{array}{ccc}\bm{\Sigma}^{\star}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\bm{\Sigma}_{\perp}^{\star}&\mathbf{0}\end{array}\right]\left[\begin{array}{c}\mathbf{V}^{\star\top}\\ \mathbf{V}_{\perp}^{\star\top}\end{array}\right],
\mathbf{M}=\sum_{i=1}^{n_{1}}\sigma_{i}\mathbf{u}_{i}\mathbf{v}_{i}^{\top}=\left[\begin{array}{ll}\mathbf{U}&\mathbf{U}_{\perp}\end{array}\right]\left[\begin{array}{ccc}\bm{\Sigma}&\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\bm{\Sigma}_{\perp}&\mathbf{0}\end{array}\right]\left[\begin{array}{l}\mathbf{V}^{\top}\\ \mathbf{V}_{\perp}^{\top}\end{array}\right].

Here, $\sigma_{1}\geq\cdots\geq\sigma_{n_{1}}$ (resp. $\sigma_{1}^{\star}\geq\cdots\geq\sigma_{n_{1}}^{\star}$) stand for the singular values of $\mathbf{M}$ (resp. $\mathbf{M}^{\star}$) arranged in descending order, $\mathbf{u}_{i}$ (resp. $\mathbf{u}_{i}^{\star}$) denotes the left singular vector associated with the singular value $\sigma_{i}$ (resp. $\sigma_{i}^{\star}$), and $\mathbf{v}_{i}$ (resp. $\mathbf{v}_{i}^{\star}$) represents the right singular vector associated with $\sigma_{i}$ (resp. $\sigma_{i}^{\star}$). $\mathbf{U}$ and $\mathbf{U}^{\star}$ stand for the top $r$ left singular vectors of $\mathbf{M}$ and $\mathbf{M}^{\star}$, respectively. Then,

\max\left\{\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{U}^{\star}\mathbf{U}^{\star\top}\|_{2},\|\mathbf{V}\mathbf{V}^{\top}-\mathbf{V}^{\star}\mathbf{V}^{\star\top}\|_{2}\right\}\lesssim\frac{2\|\mathbf{E}\|}{\sigma_{r}^{\star}-\sigma_{r+1}^{\star}}, (D.36)

and

\max\left\{\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{U}^{\star}\mathbf{U}^{\star\top}\|_{\rm F},\|\mathbf{V}\mathbf{V}^{\top}-\mathbf{V}^{\star}\mathbf{V}^{\star\top}\|_{\rm F}\right\}\lesssim\frac{2\sqrt{r}\|\mathbf{E}\|}{\sigma_{r}^{\star}-\sigma_{r+1}^{\star}}. (D.37)
Proof.

By Wedin's Theorem [42], if $\|\mathbf{E}\|_{2}<(1-1/\sqrt{2})\left(\sigma_{r}^{\star}-\sigma_{r+1}^{\star}\right)$, then (D.36) and (D.37) hold. When $\|\mathbf{E}\|_{2}\geq(1-1/\sqrt{2})\left(\sigma_{r}^{\star}-\sigma_{r+1}^{\star}\right)$, the right-hand side of (D.36) is at least $2-\sqrt{2}$, whereas the left-hand side is bounded by 1. Thus (D.36) follows trivially, and so does (D.37). ∎
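For illustration, the Python sketch below generates a rank-$r$ matrix plus a small perturbation and compares the subspace error $\|\mathbf{U}\mathbf{U}^{\top}-\mathbf{U}^{\star}\mathbf{U}^{\star\top}\|_{2}$ with the quantity $2\|\mathbf{E}\|_{2}/(\sigma_{r}^{\star}-\sigma_{r+1}^{\star})$ appearing in (D.36); the specific dimensions and noise level are illustrative assumptions, and the bound in (D.36) holds only up to constants.

import numpy as np

rng = np.random.default_rng(4)
n1, n2, r = 80, 120, 5

# Rank-r signal plus small perturbation.
U_star = np.linalg.qr(rng.standard_normal((n1, r)))[0]
V_star = np.linalg.qr(rng.standard_normal((n2, r)))[0]
M_star = U_star @ np.diag(np.linspace(10.0, 5.0, r)) @ V_star.T
E = 0.1 * rng.standard_normal((n1, n2))
M = M_star + E

# Compare the projector distance with 2 ||E||_2 / (sigma_r* - sigma_{r+1}*),
# where sigma_{r+1}* = 0 since M_star has exact rank r.
U = np.linalg.svd(M)[0][:, :r]
err = np.linalg.norm(U @ U.T - U_star @ U_star.T, ord=2)
bound = 2 * np.linalg.norm(E, ord=2) / 5.0
print(f"subspace error {err:.3f}  vs  Wedin-type quantity {bound:.3f}")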

E Supplementary Figures

We provide in this section additional figures deferred from the main paper.

Figure 9: Illustration of Step 0 for Example 1. $\widehat{\bm{\Sigma}}_{S}^{(s)}=\mathbf{X}_{[:,S]}^{(s)\top}\mathbf{X}_{[:,S]}^{(s)}$ is calculated from the data columns in the set $S$ for the $s$-th split ($s\in[m]$), and $\widehat{\bm{\Sigma}}_{S}=n^{-1}\sum_{s\in[m]}\widehat{\bm{\Sigma}}_{S}^{(s)}$.
Figure 10: (a) Correlations between the 25 leading PCs calculated by FADI and by full sample PCA on the 1000 Genomes Data; (b) top 25 eigenvalues of the sample covariance matrix of the 1000 Genomes Data. For the 15 leading PCs, the results calculated by FADI are highly correlated with those from the traditional full sample PCA, whereas the correlations drop afterward. This can be attributed to the fact that the top 15 eigenvalues of the sample covariance matrix of the 1000 Genomes Data are well separated, while the eigengaps become smaller after the 15th eigenvalue.
Figure 11: Comparison of the top 12 PCs of the 1000 Genomes Data calculated by the traditional full sample PCA and by FADI.