FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data
Abstract
Principal component analysis (PCA) is one of the most popular methods for dimension reduction. In light of the rapidly growing large-scale data in federated ecosystems, the traditional PCA method is often not applicable due to privacy protection considerations and the large computational burden. Algorithms have been proposed to lower the computational cost, but few can handle both high dimensionality and massive sample size under the distributed setting. In this paper, we propose the FAst DIstributed (FADI) PCA method for federated data when both the dimension $p$ and the sample size $n$ are ultra-large, by simultaneously performing parallel computing along $p$ and distributed computing along $n$. Specifically, we utilize $L$ parallel copies of $p \times d$ fast sketches to divide the computing burden along $p$, and aggregate the results distributively along the split samples. We present FADI under a general framework applicable to multiple statistical problems, and establish comprehensive theoretical results under the general framework. We show that FADI enjoys the same non-asymptotic error rate as the traditional PCA when $Ld \gtrsim p$. We also derive inferential results that characterize the asymptotic distribution of FADI, and show a phase-transition phenomenon as $Ld/p$ increases. We perform extensive simulations to show that FADI substantially outperforms the existing methods in computational efficiency while preserving accuracy, and validate the distributional phase-transition phenomenon through numerical experiments. We apply FADI to the 1000 Genomes data to study the population structure.
Keywords: Computational efficiency; Distributed computing; Efficient communication; Fast PCA; Large-scale inference; Federated learning; Random matrices; Random sketches.
1 Introduction
As one of the most popular methods for dimension reduction, principal component analysis (PCA) finds applications in a broad spectrum of scientific fields including network studies [3], statistical genetics [35] and finance [31]. Methodologically, parameter estimation in many statistical models is based on PCA, such as spectral clustering in graphical models [2], missing data imputation through low-rank matrix completion [23], and clustering with subsequent k-means refinement in Gaussian mixture models [12]. When it comes to real data analysis, however, several shortcomings of the traditional PCA method hinder its application to large-scale datasets. First, the high dimensionality and large sample size of modern big data can render the PCA computation infeasible in practice. For instance, PCA is commonly used for controlling for ancestry confounding in Genome-Wide Association Studies (GWAS) [33], yet biomedical databases, such as the UK Biobank [39], often contain hundreds of thousands to millions of Single Nucleotide Polymorphisms (SNPs) and subjects, which necessitates more scalable algorithms to handle the intensive computation of PCA. Second, large-scale datasets in many applications are stored in federated ecosystems, where data cannot leave individual warehouses due to privacy protection considerations [8, 14, 15, 29, 34]. This calls for federated learning methods [26, 30] that provide efficient and privacy-protected strategies for joint analysis across data warehouses without the need to exchange individual-level data.
The burgeoning popularity of large-scale data necessitates the development of fast algorithms that can cope with both high dimensionality and massiveness efficiently and distributively. Indeed, efforts have been made in recent years on developing fast PCA and distributed PCA algorithms. The existing fast PCA algorithms use the full-sample data and apply random projection to speed up PCA calculations [11, 20], while the existing distributed PCA algorithms apply the traditional PCA method to the split data and aggregate the results [18, 28].
Specifically, fast PCA algorithms utilize the fact that the column space of a low-rank matrix can be represented by a small set of columns and use random projection to approximate the original high-dimensional matrix [4]. For instance, Halko et al. [20] proposed to estimate the leading eigenvectors of a $p \times p$ matrix using $p \times d$ Gaussian random sketches with $d \ll p$, which substantially decreases the computation time at the cost of an inflated statistical error. Chen et al. [11] modified Halko et al. [20]'s method by repeating the fast sketching multiple times and showed the consistency of the algorithm using the average of i.i.d. random sketches when the number of sketches goes to infinity. However, they did not study the trade-off between computation complexities and error rates in finite samples, and hence did not recommend the number of fast sketches that optimizes both the computational efficiency and the statistical accuracy. As the fast PCA methods use the full data, they have two major limitations. First, they are often not scalable to large sample sizes $n$. Second, they are not applicable to federated data when data in different sites cannot be shared.
The existing distributed PCA algorithms reduce the PCA computational burden by partitioning the full data "horizontally" or "vertically" [18, 27, 28]. The horizontal partition splits the data over the sample size $n$, whereas the vertical partition splits the data over the dimension $p$. Horizontal partition is useful when the sample size is large or when the data are federated in multiple sites. For example, Fan et al. [18] considered the horizontally distributed PCA where they estimated the leading eigenvectors of the population covariance matrix by applying traditional PCA to each data split and aggregating the PCA results across different datasets. They showed that when the number of data splits is not too large, the error rate of their algorithm is of the same order as the traditional PCA. Since they used the traditional PCA algorithm for each data partition, the computational complexity scales at least quadratically in the dimension $p$ for each data split, which is computationally difficult when $p$ is large, e.g., in GWAS, $p$ ranges from hundreds of thousands to millions. Kargupta et al. [28] considered vertical partition and developed a method that collects local principal components (PCs) and then reconstructs global PCs by linear transformations. However, there is no theoretical guarantee on the error rate compared with the traditional full-sample PCA, and the method may fail when variables are correlated.
Apart from the aforementioned PCA applications in parameter estimation, inference also constitutes an important part of PCA methods. For example, when studying the ancestry groups of whole genome data under the mixed membership models, while the estimation error rate guarantees the overall misclustering rate for all subjects, one may be interested in testing whether two individuals of interest share the same ancestry membership profile and assessing the associated statistical uncertainty [16]. Furthermore, despite the rich literature depicting the asymptotic distribution of traditional PCA estimators under different statistical models [16, 32, 41], distributional characterizations of fast PCA methods and distributed PCA methods are not well-studied. For instance, Yang et al. [44] characterized the convergence of fast sketching estimators in probability but gave no inferential results. Halko et al. [20] provided error bounds for the fast PCA algorithm, but there is no characterization of the asymptotic distribution and hence no evaluation of the testing efficiency. Fan et al. [18] derived the non-asymptotic error rate of the distributed PC estimator but did not provide distributional guarantees, and inference based upon their estimator is computationally intensive when the dimension $p$ is large.
In summary, the existing fast PCA algorithms accelerate computation along $p$ by fast sketching, but cannot handle distributed computing along $n$. The existing distributed PCA methods mainly focus on dividing the computing burden along $n$, while distributed computing along $p$ is complicated by variable correlation and lacks theoretical guarantees. It remains an open question how to develop fast and distributed PCA algorithms that can handle both large $n$ and large $p$ simultaneously, while achieving the same asymptotic efficiency as the traditional PCA.
In view of the gaps in existing literature, we propose in this paper a scalable and computationally efficient FAst DIstributed (FADI) PCA method applicable to federated data that could be large in both $n$ and $p$. More specifically, to obtain the $K$-leading PCs of a $p \times p$ matrix $M$ from its estimator $\widehat{M}$, we take the divide-and-conquer strategy to break down the computation complexities along the dimension $p$: we generate the $p \times d$ fast sketch $\widehat{M}\Omega$ and perform SVD on it instead of on $\widehat{M}$ to expedite the PCA computation, where $\Omega$ is a $p \times d$ Gaussian test matrix with $d \ll p$; meanwhile, to adjust for the additional variability induced by the random approximation, we repeat the fast sketching $L$ times in parallel, and then aggregate the SVD results across data splits to restore statistical accuracy. When the data are distributively stored, the federated structure of $\widehat{M}$ also enables easy implementation without the need of sharing individual-level data, which in turn facilitates distributing the computing burden along $n$ among the split samples, as opposed to the existing fast PCA methods that are not scalable to large $n$. We will show that FADI has computational complexities of smaller magnitudes than existing methods (see Table 3), while achieving the same asymptotic efficiency as the traditional PCA. Moreover, we establish FADI under a general framework that covers multiple statistical models. We list below four statistical problems as illustrative applications of FADI, where we will define $M$ and $\widehat{M}$ in each setting:
(1) Spiked covariance model: let $X_1, \ldots, X_n \in \mathbb{R}^p$ be i.i.d. random vectors with spiked covariance $\Sigma = M + \sigma^2 I_p$, where $M$ is the rank-$K$ spiked component of interest. Define $\widehat{M} = \widehat{\Sigma} - \widehat{\sigma}^2 I_p$ to be the estimator for $M$, where $\widehat{\Sigma}$ is the sample covariance matrix and $\widehat{\sigma}^2$ is a consistent estimator for $\sigma^2$. We assume that the data are split along the sample size $n$ and stored on $m$ servers.

(2) Degree-corrected mixed membership (DCMM) model: let $A$ be the adjacency matrix for an undirected graph of $p$ nodes, where the connection probabilities between nodes are determined by their membership assignments to $K$ communities and node-associated degrees. Consider the data to be split along $p$ on $m$ servers, and we aim to infer the membership profiles of the nodes by recovering the $K$-leading eigenspace of the marginal connection probability matrix $M = \mathbb{E}(A)$ using the data $\widehat{M} = A$.

(3) Gaussian mixture models (GMM): let $X_1, \ldots, X_n \in \mathbb{R}^p$ be independent random vectors drawn from $K$ Gaussian distributions with different means and identity covariance matrix. We are interested in clustering the samples by estimating the $K$-leading eigenspace of the low-rank mean Gram matrix $M = \mathbb{E}(X)\,\mathbb{E}(X)^\top$, whose estimator is given by $\widehat{M} = XX^\top$. Assume the data are distributively stored on $m$ servers along the dimension $p$.

(4) Incomplete matrix inference: we have a low-rank $p \times p$ matrix $M$ of interest, and we observe $\widehat{M}$ as a perturbed version of $M$ with missing entries. Assume $\widehat{M}$ to be vertically split along $p$ on $m$ servers, and we aim to infer the eigenspace of $M$ through $\widehat{M}$.
We will elaborate on the above examples in Section 2. We consider distributed settings for all four problems, where the data are split along the sample size $n$ for the spiked covariance model, along the feature dimension for the GMM, and along $p$ for the DCMM model and the incomplete matrix inference model, whose matrix dimension coincides with $p$. We will establish in Section 4.1 a general non-asymptotic error bound applicable to multiple statistical models as well as case-specific error rates for each example, and show that the non-asymptotic error rate of FADI is of the same order as the traditional PCA as long as the sketching dimension $d$ and the number of fast sketches $L$ are sufficiently large. Inferentially, we provide distributional characterizations of FADI under different regimes of the fast sketching parameters. We observe a phase-transition phenomenon where the asymptotic covariance matrix takes on two different forms as $Ld/p$ increases. When $Ld \gg p$, the FADI estimator converges in distribution to a multivariate Gaussian, and the asymptotic relative efficiency (ARE) between FADI and the traditional PCA is 1 (see Figure 1). On the other hand, when $Ld \ll p$, FADI has higher computational efficiency and still enjoys asymptotic normality under certain models, but will have a larger asymptotic variance.
[Figure 1: (a) Example 1: Spiked Covariance Model; (b) Example 3: Gaussian Mixture Models.]
Related Papers on Inferential Analysis of PCA
There has been a great amount of literature depicting the asymptotic distribution of traditional PCA estimators. Anderson [5] characterized the asymptotic normality of eigenvectors and eigenvalues for traditional PCA on the sample covariance matrix with fixed dimension. Paul [32] and Wang and Fan [41] extended the analysis to the high-dimensional regime and established distributional results under the spiked covariance model. Similar efforts were made by Johnstone [25] and Baik et al. [6], where they studied the limiting distribution of the largest empirical eigenvalue when both the dimension and the sample size go to infinity. Apart from inference on the sample covariance matrix of i.i.d. data, previous works also made progress in inferential analyses for a variety of statistical models including the DCMM model [16], the matrix completion problem [13], and high-dimensional data with heteroskedastic noise and missingness under the spiked covariance model [43]. Specifically, Fan et al. [16] employed statistics based on principal eigenspace estimators of the adjacency matrix to perform inference on whether two given nodes share the same membership profile under the DCMM model. Chen et al. [13] constructed entry-wise confidence intervals (CIs) for a low-rank matrix with missing data and Gaussian noise based on debiased convex/nonconvex PC estimators. A similar missing data inference problem was studied in Yan et al. [43], where they adopted a refined spectral method with imputed diagonal for CI construction of the underlying spiked covariance matrix of corrupted samples with missing data.
The aforementioned works were all based upon the traditional PCA approach and considered no distributed data setting, and hence will suffer from low computational efficiency when the data are high-dimensional or distributively stored across different sites. Our paper fills the gap in the literature and provides general inferential results on the fast sketching method with high computational efficiency adapted to high-dimensional federated data.
Our Contributions
We summarize the major contributions of our paper as follows.
First, the existing PCA methods either handle high dimensions $p$ or large sample sizes $n$, but not both. Specifically, fast PCA [20] handles large $p$ but has elevated error rates and is difficult to apply when $n$ is large. Distributed PCA [18] handles large $n$ but is not scalable to large $p$, as it applies traditional PCA to each data split. FADI overcomes the limitations of these methods by providing scalable PCA when both $p$ and $n$ are large or the data are federated. Due to the fact that variables are usually dependent, it is challenging to achieve parallel computing along $p$ and distributed computing along $n$ simultaneously. To address this challenge, FADI splits the data along $n$ and untangles the variable dependency along $p$ by dividing the high-dimensional data into $L$ copies of $d$-dimensional fast sketches. Namely, for each split dataset, FADI performs multiple parallel fast sketchings instead of the traditional PCA, and then aggregates the PC results distributively over the split samples. We establish theoretical error bounds to show that FADI is as accurate as the traditional PCA so long as $Ld \gtrsim p$.
Second, we provide distributional characterizations for inferential analyses and show a phase-transition phenomenon. We provide distributional guarantees on the FADI estimator to facilitate inference, which are absent in the previous literature on fast PCA methods and distributed PCA methods. More specifically, we depict the trade-off between computational complexity and testing efficiency by studying FADI's asymptotic distribution under the regimes $Ld \gg p$ and $Ld \ll p$, respectively. We show that the same asymptotic efficiency as the traditional PCA can be achieved at $Ld \gg p$ with a compromise on computational efficiency, while faster inferential procedures can be performed at $Ld \ll p$ with suboptimal testing efficiency. We further validate the distributional phase transition via numerical experiments.
Third, we propose FADI under a general framework applicable to multiple statistical models under mild assumptions, including the four examples discussed earlier in this section. We provide a comprehensive investigation of FADI’s performance both methodologically and theoretically under the general framework, and illustrate the results with the aforementioned statistical models. In comparison, the existing distributed methods mainly focus on estimating the covariance structure of independent samples [18].
Paper Organization
The rest of the paper is organized as follows. Section 2 introduces the problem setting and provides an overview of FADI and its intuition. Section 3 discusses FADI’s implementation details, as well as the computational complexity of FADI and its modifications when is unknown. Section 4 presents the theoretical results of the statistical error and asymptotic normality of the FADI estimator. Section 5 shows the numerical evaluation of FADI and comparison with several existing methods. The application of FADI to the 1000 Genomes Data is given in Section 6.
Notation
We use $\mathbf{1}_q$ to denote the vector of length $q$ with all entries equal to 1, and denote by $\{e_j\}_{j=1}^p$ the canonical basis of $\mathbb{R}^p$. For a matrix $A$, we use $\sigma_j(A)$ (respectively $\lambda_j(A)$) to represent the $j$-th largest singular value (respectively eigenvalue) of $A$, and $\sigma_{\max}(A)$ or $\sigma_{\min}(A)$ (respectively $\lambda_{\max}(A)$ or $\lambda_{\min}(A)$) stands for the largest or smallest singular value (respectively eigenvalue) of $A$. If $A$ has the singular value decomposition (SVD) $A = U\Sigma W^\top$, then we denote by $A^{\dagger} = W\Sigma^{-1}U^\top$ the pseudo-inverse of $A$, by $P_A = UU^\top$ the projection matrix onto the column space of $A$, and by $\mathrm{sgn}(A) = UW^\top$ the matrix signum. If $A$ is positive definite with eigen-decomposition $A = U\Lambda U^\top$, we define $A^{1/2} = U\Lambda^{1/2}U^\top$ and $A^{-1/2} = U\Lambda^{-1/2}U^\top$. We denote by $\otimes$ the Kronecker product. For two orthonormal matrices $U_1$ and $U_2$ of the same size, we measure the distance between their column spaces by the distance between the corresponding projection matrices, $\|U_1U_1^\top - U_2U_2^\top\|$. For a vector $v$, we use $\|v\|_2$ to denote the vector $\ell_2$-norm and $\|v\|_{\infty}$ to denote the vector $\ell_{\infty}$-norm. For a matrix $A$, we denote by $\|A\|$ the matrix spectral norm, $\|A\|_{\mathrm{F}}$ the Frobenius norm, $\|A\|_{2\to\infty}$ the 2-to-$\infty$ norm and $\|A\|_{\max}$ the matrix max norm. For an integer $N$, define $[N] = \{1, \ldots, N\}$. For two positive sequences $a_N$ and $b_N$, we say $a_N \lesssim b_N$ or $a_N = O(b_N)$ if $a_N \le C b_N$ for some constant $C$ that does not depend on $N$. We say $a_N \asymp b_N$ if $a_N \lesssim b_N$ and $b_N \lesssim a_N$. If $a_N/b_N \to 0$, we say $a_N = o(b_N)$ or $a_N \ll b_N$. Let $\mathbb{1}\{\cdot\}$ be an indicator function, which takes 1 if the statement inside is true and 0 otherwise. Throughout the paper, we use $C$, $C_1$, $c$ and $c_1$ to represent generic constants, and their values might change from place to place.
2 Preliminaries and Problem Setup
We aim to estimate the eigenspace of the rank-$K$ symmetric matrix $M \in \mathbb{R}^{p \times p}$, whose eigen-decomposition is given by $M = V\Lambda V^\top$, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$ collects the nonzero eigenvalues and $V \in \mathbb{R}^{p \times K}$ is the stacking of the $K$ leading eigenvectors. (When $M$ is asymmetric, we can deploy the "symmetric dilation" trick and consider $\bigl(\begin{smallmatrix} 0 & M \\ M^\top & 0 \end{smallmatrix}\bigr)$ to fit it into the setting.) We denote by $\Delta$ the eigengap of $M$, and assume without loss of generality that the eigenvalues are ordered by magnitude. $\widehat{M}$ is a corrupted version of $M$ obtained from observed data, with $E = \widehat{M} - M$ representing the error matrix. Our goal is to estimate the column space of $V$ from $\widehat{M}$ distributively and scalably. The following four examples provide concrete statistical setups for the above problem.
Example 1 (Spiked Covariance Model [25]).
Let $X_1, \ldots, X_n \in \mathbb{R}^p$ be i.i.d. sub-Gaussian random vectors with mean zero and covariance $\Sigma$. (We assume the $X_i$'s are i.i.d. for simplicity of presentation; we will generalize the theoretical results to non-i.i.d. and heterogeneous data in Section 4.1.) We assume the following decomposition for the covariance matrix: $\Sigma = V\Lambda V^\top + \sigma^2 I_p$, where $V \in \mathbb{R}^{p \times K}$ is the stacked $K$ leading eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_K)$ with $\lambda_1 \ge \cdots \ge \lambda_K > 0$. Assume that the data are split along the sample size and stored on $m$ different sites. Denote by $X^{(j)} \in \mathbb{R}^{n_j \times p}$ the data matrix split of sample size $n_j$ on the $j$-th site ($j \in [m]$ and $\sum_{j=1}^m n_j = n$), and denote by $X = (X_1, \ldots, X_n)^\top \in \mathbb{R}^{n \times p}$ the full data matrix. Then $M = V\Lambda V^\top$ and $\widehat{M} = \widehat{\Sigma} - \widehat{\sigma}^2 I_p$, where $\widehat{\Sigma} = X^\top X / n$ is the sample covariance matrix and $\widehat{\sigma}^2$ is a consistent estimator for $\sigma^2$.
Example 2 (Degree-Corrected Mixed Membership (DCMM) Model [16]).
Let $A \in \{0,1\}^{p \times p}$ be a symmetric adjacency matrix for an undirected graph of $p$ nodes, where $A_{ij} = 1$ if nodes $i$ and $j$ are connected and $A_{ij} = 0$ otherwise. Assume the $A_{ij}$'s are independent for $i \le j$ and $\mathbb{E}(A) = \Theta \Pi P \Pi^\top \Theta$, where $\Theta$ stands for the diagonal degree heterogeneity matrix, $\Pi = (\pi_1, \ldots, \pi_p)^\top$ is the matrix of stacked community assignment probability vectors and $P$ is a symmetric rank-$K$ matrix with constant entries. Then $M = \mathbb{E}(A)$ and $\widehat{M} = A$. (In the case where self-loops are absent, $M$ and $\widehat{M}$ will be replaced by their off-diagonal counterparts; our theoretical results hold for both cases.) The goal is to infer the community membership profiles $\pi_1, \ldots, \pi_p$. Recall the eigen-decomposition $M = V\Lambda V^\top$. Since $V$ and $\Theta\Pi$ share the same column space, we can make inference on the membership profiles through $V$. (To address the degree heterogeneity, one can perform the SCORE normalization to cancel out $\Theta$ [24].) In this paper, we assume that there exist constants bounding the degree heterogeneity parameters and the spectrum of $P$ from above and below, up to a common scaling that we refer to as the rate of signal strength. We assume that the adjacency matrix is distributed across $m$ sites, where on the $j$-th site we observe the connectivity matrix $A^{(j)}$ formed by $p_j$ rows of $A$, with $\sum_{j=1}^m p_j = p$.
Example 3 (Gaussian Mixture Models (GMM) [12]).
Let $X_1, \ldots, X_n \in \mathbb{R}^p$ be independent samples with $X_i$ generated from one of $K$ Gaussian distributions with means $\mu_1, \ldots, \mu_K$ and identity covariance. More specifically, for $i \in [n]$, $X_i$ is associated with a membership label $g_i \in [K]$, and $X_i \sim N(\mu_{g_i}, I_p)$. Our goal is to recover the unknown membership labels $g_i$'s. Denote $X = (X_1, \ldots, X_n)^\top$, where $X_i^\top$ is the $i$-th row of $X$. Without loss of generality, we order the samples so that observations from the same component are contiguous, with $n_k$ denoting the number of samples drawn from the Gaussian distribution with mean $\mu_k$. Then we define $M = \mathbb{E}(X)\,\mathbb{E}(X)^\top$ and $\widehat{M} = XX^\top$. Recall the eigen-decomposition $M = V\Lambda V^\top$. Since $V$ and the membership indicator matrix share the same column space, we can recover the memberships from $V$. We consider the regime where both $n$ and $p$ are large. Besides, we assume that there exists a constant $c > 0$ such that the cluster sizes are balanced and the mean vectors are well-conditioned up to $c$. We consider the distributed setting where the data are split along the dimension $p$ and distributively stored on $m$ sites. Denote by $X^{(j)} \in \mathbb{R}^{n \times p_j}$ the data split on the $j$-th site ($\sum_{j=1}^m p_j = p$).
Example 4 (Incomplete Matrix Inference [13]).
Assume that $M \in \mathbb{R}^{p \times p}$ is a symmetric rank-$K$ matrix, and $\mathcal{S} \subseteq [p] \times [p]$ is a subset of indices. We only observe the perturbed entries of $M$ in the subset $\mathcal{S}$. Specifically, for $1 \le i \le j \le p$, we denote $\delta_{ij} = \mathbb{1}\{(i,j) \in \mathcal{S}\}$, so that $\delta_{ij}$ indicates whether the $(i,j)$-th entry is observed or missing. Then for $i \le j$, the observation for $M_{ij}$ is $Y_{ij} = \delta_{ij}(M_{ij} + \epsilon_{ij})$, where the $\epsilon_{ij}$'s are i.i.d. mean-zero random variables with variance $\sigma^2$, and the $\delta_{ij}$'s are i.i.d. Bernoulli with observation probability $\theta$. (We can generalize the results to sub-Gaussian errors $\epsilon_{ij}$ with variance proxy $\sigma^2$ by a truncation argument together with the maximal inequality for sub-Gaussian random variables, and the theorems can be generalized with minor modifications.) Then to adjust for scaling, we define the observed data as $\widehat{M} = \theta^{-1} Y$, where $Y = (Y_{ij})$. (In practice, we can estimate the eigenvectors from $Y$ rather than from $\theta^{-1}Y$, since the two matrices share exactly the same eigenvectors; however, we need the factor $\theta^{-1}$ to preserve correct scaling for the estimation of eigenvalues as well as the follow-up matrix completion. Please see Theorem 4.3 and Corollary 4.9 for more details.) Consider the distributed setting where the data are split along $p$ on $m$ servers, where $Y^{(j)}$ stands for the observations on the $j$-th server and $\sum_{j=1}^m p_j = p$. The goal is to infer the eigenspace $V$ from $\widehat{M}$ in the presence of missing data.
Table 1 provides the complexities of FADI for the four problems and suggested choice of parameters for optimal error rates. We will further discuss the computational complexities in detail in Section 3.4.
Complexity | |||
Spiked covariance model | |||
DCMM model | |||
Gaussian mixture models | |||
Incomplete matrix inference |
3 Method
In this section, we present the FADI algorithm and its application to different examples. We then provide the computational complexities of FADI and compare it with the existing methods. We also discuss how to estimate the rank when it is unknown.
3.1 Fast Distributed PCA (FADI): Overview and Intuition
For a given $p \times p$ matrix $\widehat{M}$, the computational cost of the traditional PCA on $\widehat{M}$ is $O(p^3)$. In the case where $\widehat{M}$ is computed from observed data, e.g., the sample covariance matrix, extra computational burden comes from calculating $\widehat{M}$, e.g., $O(np^2)$ flops for computing the sample covariance matrix. Hence performing traditional PCA for large-scale data with high dimensions and huge sample sizes can be considerably expensive.
To reduce the computational cost when $p$ is large, the most straightforward idea is to reduce the data dimension. One popular method for dimension reduction is random sketching [20]. For instance, for a low-rank matrix $M$ of rank $K$, its column space can be represented by the low-dimensional fast sketch $M\Omega$, where $\Omega \in \mathbb{R}^{p \times d}$ is a random Gaussian matrix with $K \le d \ll p$. In practice, $M$ is usually replaced by an almost low-rank corrupted matrix $\widehat{M}$ calculated from observed data. Traditional fast PCA methods then consider performing random sketching on $\widehat{M}$ instead, and use the full sample to obtain the fast sketch $\widehat{M}\Omega$ that almost maintains the same left singular space as $\widehat{M}$. It is hence reasonable to estimate $V$ by performing SVD on the $p \times d$ matrix $\widehat{M}\Omega$, which has a much smaller computational cost than directly performing PCA on $\widehat{M}$. However, one major drawback of this approach is that information might be lost due to fast sketching. Furthermore, the method is not scalable when $n$ is large or the data are federated. This motivates us to propose FADI, where we repeat the fast sketching multiple times and aggregate the results to reduce the statistical error. Besides, instead of performing the fast sketching on the full sample, we apply multiple sketches to each split sample, and then aggregate the PC results across the data splits.
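To make the single-sketch idea concrete, the following is a minimal NumPy sketch of randomized sketching for a nearly low-rank symmetric matrix. It is an illustrative toy under our own assumptions (function names, matrix sizes, and the noise level are ours, not the paper's implementation).

```python
import numpy as np

def single_sketch_pca(M_hat, K, d, rng=None):
    """Estimate the K leading eigenvectors of a (nearly) rank-K symmetric
    matrix M_hat from one d-dimensional Gaussian sketch M_hat @ Omega."""
    rng = np.random.default_rng(rng)
    p = M_hat.shape[0]
    Omega = rng.standard_normal((p, d))        # Gaussian test matrix, K <= d << p
    Y = M_hat @ Omega                          # p x d fast sketch
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :K]                            # top-K left singular vectors of the sketch

# toy example: rank-3 signal plus symmetric noise
rng = np.random.default_rng(0)
p, K, d = 300, 3, 20
V = np.linalg.qr(rng.standard_normal((p, K)))[0]
M = V @ np.diag([10.0, 8.0, 6.0]) @ V.T
noise = 0.1 * rng.standard_normal((p, p))
M_hat = M + (noise + noise.T) / 2
V_sketch = single_sketch_pca(M_hat, K, d, rng=1)
# projection (subspace) distance between the sketch-based estimate and the truth
print(np.linalg.norm(V_sketch @ V_sketch.T - V @ V.T, 2))
```

The printed projection distance illustrates the extra error a single sketch incurs relative to the underlying eigenspace, which is exactly the loss that the repeated sketching in FADI is designed to reduce.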
Specifically, assume the data are stored across $m$ sites, and we have the decomposition $\widehat{M} = \sum_{j=1}^m \widehat{M}_j$, where $\widehat{M}_j$ is the component that can be computed locally on the $j$-th machine ($j \in [m]$). Then instead of applying random sketching directly to $\widehat{M}$, FADI computes in parallel the local fast sketching $\widehat{M}_j\Omega$ for each component and aggregates the results across sites, which reduces the per-machine cost of computing the sketch by a factor of $m$. Note that this representation of $\widehat{M}$ is legitimate in many models. Taking Example 1 for instance, define $\widehat{M}_j = (X^{(j)})^\top X^{(j)}/n - \widehat{\sigma}^2 I_p/m$, and we have $\widehat{M} = \sum_{j=1}^m \widehat{M}_j$. We will verify the decomposition for Examples 2–4 in Section 3.3.
We will see in Section 4.1 that when the number of repeated fast sketches $L$ is sufficiently large, FADI enjoys the same error rate as the traditional PCA. From this perspective, FADI can be viewed as a "vertically" distributed PCA method, as it allocates the computational burden along the dimension $p$ to several machines using low-dimensional sketches while maintaining high statistical accuracy through the aggregation of local PCs. FADI overcomes the difficulties of vertical splitting caused by the correlation between variables.
3.2 General Algorithmic Framework
Recall that we aim to estimate the $K$ leading eigenvectors of a rank-$K$ matrix $M$ from its estimator $\widehat{M}$. Figure 2 illustrates the fast distributed PCA (FADI) algorithm:
[Figure 2: Illustration of the FADI algorithm.]
In Step 0, we perform preliminary processing on the raw data to produce the local components $\widehat{M}_j$. We will elaborate on the case-specific preprocessing in Section 3.3.
In Step 1, we calculate the distributed fast sketches $\widehat{M}_j\Omega_\ell$, where each $\Omega_\ell$ is a $p \times d$ standard Gaussian test matrix with $K \le d \ll p$. To reduce the statistical error, we repeat the fast sketching $L$ times and aggregate the results from the $L$ copies. Specifically, we generate i.i.d. Gaussian test matrices $\Omega_1, \ldots, \Omega_L$, and for each $\ell \in [L]$, we apply $\Omega_\ell$ distributively to $\widehat{M}_j$ for each $j \in [m]$ and obtain the $\ell$-th fast sketch of $\widehat{M}_j$ as $\widehat{Y}_\ell^{(j)} = \widehat{M}_j\Omega_\ell$. We send $\widehat{Y}_\ell^{(j)}$ ($j \in [m]$) to the $\ell$-th parallel server for aggregation.
In Step 2, on the $\ell$-th server, the random sketches $\widehat{Y}_\ell^{(j)}$ ($j \in [m]$) from the split datasets corresponding to the $\ell$-th Gaussian test matrix are collected and added up to get the $\ell$-th fast sketch: $\widehat{Y}_\ell = \sum_{j=1}^m \widehat{Y}_\ell^{(j)} = \widehat{M}\Omega_\ell$. We next compute in parallel the top $K$ left singular vectors $\widehat{V}_\ell$ of $\widehat{Y}_\ell$ and send the $\widehat{V}_\ell$'s to the central processor for aggregation.
In Step 3, on the central processor, we calculate $\widetilde{\Sigma} = \frac{1}{L}\sum_{\ell=1}^L \widehat{V}_\ell\widehat{V}_\ell^\top$, where $\widehat{V}_\ell\widehat{V}_\ell^\top$ is the projection matrix of $\widehat{V}_\ell$. We next calculate the $K$ leading eigenvectors $\widetilde{V}$ of $\widetilde{\Sigma}$, which will serve as the final estimator of $V$.
To further improve the computational efficiency, we might conduct another fast sketching in Step 3 to compute the leading eigenspace of $\widetilde{\Sigma}$. More specifically, we apply the power method [20] to $\widetilde{\Sigma}$ by calculating $\widetilde{Y} = \widetilde{\Sigma}^{t}\Omega'$ for some power $t \ge 1$, where $\Omega'$ is a $p \times d'$ Gaussian test matrix whose dimension $d'$ can be set different from $d$ for optimal efficiency. Here, $\widetilde{Y}$ can be calculated iteratively: $\widetilde{Y}_s = \frac{1}{L}\sum_{\ell=1}^L \widehat{V}_\ell(\widehat{V}_\ell^\top\widetilde{Y}_{s-1})$ for $s \in [t]$, where $\widetilde{Y}_0 = \Omega'$ and $\widetilde{Y}_t = \widetilde{Y}$. We denote by $\widetilde{V}'$ the $K$ leading left singular vectors of $\widetilde{Y}$. We will show in Section 4 that when $t$ is properly large, the distance between $\widetilde{V}'$ and $\widetilde{V}$ will be negligible.
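The following is a compact, single-process NumPy sketch of Steps 1–3, assuming the local components $\widehat{M}_j$ are available as a list of matrices; the parallel servers and the communication are only emulated by loops, and the function signature and optional power-method arguments are our own illustrative choices rather than the paper's reference implementation.

```python
import numpy as np

def fadi(M_parts, K, d, L, rng=None, power=None, d_prime=None):
    """Minimal FADI sketch. M_parts is a list of local components with
    M_hat = sum(M_parts) (p x p). Returns the estimated K leading eigenvectors."""
    rng = np.random.default_rng(rng)
    p = M_parts[0].shape[0]
    V_list = []
    for _ in range(L):                                    # Step 1: L parallel Gaussian sketches
        Omega = rng.standard_normal((p, d))
        Y = sum(Mj @ Omega for Mj in M_parts)             # Step 2: aggregate split sketches, Y = M_hat @ Omega
        U, _, _ = np.linalg.svd(Y, full_matrices=False)
        V_list.append(U[:, :K])                           # top-K left singular vectors of the sketch
    if power is None:
        Sigma_tilde = sum(V @ V.T for V in V_list) / L    # Step 3: average of projection matrices
        evals, evecs = np.linalg.eigh(Sigma_tilde)
        return evecs[:, np.argsort(evals)[::-1][:K]]
    # optional Step 3 variant: powered fast sketching, avoiding the p x p eigendecomposition
    Yt = rng.standard_normal((p, d_prime or d))
    for _ in range(power):
        Yt = sum(V @ (V.T @ Yt) for V in V_list) / L      # apply Sigma_tilde iteratively
    U, _, _ = np.linalg.svd(Yt, full_matrices=False)
    return U[:, :K]
```

In a real federated deployment, each iteration of the outer loop would run on a separate server and only the $p \times d$ sketches and $p \times K$ singular-vector matrices would be communicated, never the individual-level data.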
3.3 Case-Specific Processing of Raw Data
In this section, we discuss the calculation of the local components $\widehat{M}_j$ in Step 0 of FADI specifically for each example.
Example 1: Recall that in Step 0 of FADI, to obtain $\widehat{M}$, we need a consistent estimator $\widehat{\sigma}^2$ of the residual variance $\sigma^2$. Denote by $J \subseteq [p]$ an arbitrary index set whose size is larger than $K$ but much smaller than $p$. Then we estimate $\sigma^2$ from $\widehat{\Sigma}_J$, a principal submatrix of $\widehat{\Sigma}$ computed using only the data columns in the set $J$. Due to the additive structure of the sample covariance matrix, $\widehat{\Sigma}_J$ can be easily computed distributively (see Figure 9 in Appendix E for reference). Then for $j \in [m]$, we have $\widehat{M}_j = (X^{(j)})^\top X^{(j)}/n - \widehat{\sigma}^2 I_p/m$. Note that since computing $(X^{(j)})^\top\{X^{(j)}\Omega_\ell\}$ is much faster than first computing $(X^{(j)})^\top X^{(j)}$ and then multiplying it by $\Omega_\ell$, we calculate $\widehat{M}_j\Omega_\ell$ by computing $X^{(j)}\Omega_\ell$ first rather than directly forming $\widehat{M}_j$.
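As an illustration of Step 0 and Step 1 for the spiked covariance model, here is a NumPy sketch that avoids forming the $p \times p$ covariance matrix. The particular $\sigma^2$ estimator used below — the $(K+1)$-th largest eigenvalue of the principal submatrix on the index set — is one natural choice consistent with the description above, and is an assumption of ours rather than necessarily the paper's exact estimator.

```python
import numpy as np

def estimate_sigma2(X_splits, idx, K):
    """Estimate sigma^2 from the principal submatrix of the sample covariance on the
    index set idx (|idx| > K), aggregated over the sample splits.
    Uses the (K+1)-th largest eigenvalue as an illustrative estimator."""
    n_total = sum(X.shape[0] for X in X_splits)
    S = sum(X[:, idx].T @ X[:, idx] for X in X_splits) / n_total
    eigvals = np.sort(np.linalg.eigvalsh(S))[::-1]
    return eigvals[K]

def local_sketch_spiked(X_j, Omega, n_total, sigma2_hat, m):
    """Local contribution M_hat_j @ Omega for one data split X_j (n_j x p), computed as
    X_j^T (X_j Omega) / n - sigma2_hat * Omega / m, never forming the p x p matrix."""
    return X_j.T @ (X_j @ Omega) / n_total - sigma2_hat * Omega / m
```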
Example 2: Recall that the adjacency matrix is stored distributively on $m$ sites, and for the $j$-th site we observe the connectivity matrix $A^{(j)}$ consisting of the rows of $A$ with indices in $S_j$. Then for $j \in [m]$, define $\widehat{M}_j = \sum_{i \in S_j} e_i a_i^\top$, where $a_i^\top$ is the $i$-th row of $A$ and $e_i$ is the $i$-th canonical basis vector of $\mathbb{R}^p$. Namely, $\widehat{M}_j$ is the $j$-th block of observed rows augmented by zero rows, and $\widehat{M} = \sum_{j=1}^m \widehat{M}_j = A$. No preliminary computation is needed.
Example 3: Recall that the data are vertically distributed across $m$ sites, and $X^{(1)}, \ldots, X^{(m)}$ are the corresponding data splits. For the $j$-th site, we have $\widehat{M}_j = X^{(j)}(X^{(j)})^\top$, and for $\ell \in [L]$, we compute $\widehat{M}_j\Omega_\ell$ by $X^{(j)}\{(X^{(j)})^\top\Omega_\ell\}$.
Example 4: Recall that we observe the split data $Y^{(1)}, \ldots, Y^{(m)}$ with missing entries on $m$ servers. Define $\widehat{M}_j = \theta^{-1}\sum_{i \in S_j} e_i y_i^\top$ for the $j$-th server, where $y_i^\top$ is the $i$-th row of $Y$; then we have $\widehat{M} = \sum_{j=1}^m \widehat{M}_j = \theta^{-1}Y$.
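For the vertically split examples (Examples 2 and 4), the local sketch can be formed directly from the block of rows held by a server. The sketch below assumes each server holds a block of rows of the $p \times p$ matrix and uses a generic scale factor (1 for Example 2, $1/\theta$ for Example 4); it illustrates the zero-padding representation above rather than the paper's code.

```python
import numpy as np

def local_sketch_rows(rows, row_idx, Omega, scale=1.0):
    """Local contribution M_hat_j @ Omega when server j holds the rows `row_idx`
    of the p x p data matrix (adjacency rows in Example 2, observed/zero-filled
    entries in Example 4). Equivalent to padding the block with zero rows, but
    computed without ever materializing the full p x p matrix."""
    p, d = Omega.shape
    Y_j = np.zeros((p, d))
    Y_j[row_idx, :] = scale * (rows @ Omega)   # scale = 1 for Example 2, 1/theta for Example 4
    return Y_j
```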
3.4 Computational Complexity
In this section, we provide the computational complexity of FADI for each example given in Section 2. The complexity of each step is listed in Table 2.
Example 1 | Example 2 | Example 3 | Example 4 | |
Step 0 | N/A | O(1) | ||
Step 1 | ||||
Step 2 | ||||
Step 3 | N/A | N/A | ||
Total |
When the data splits can be customized, we recommend different allocations for Examples 1 and 3 than for Examples 2 and 4 for optimal efficiency. For Examples 1 and 3, with the parameter choices suggested in Table 1, the total computational cost is as reported in Table 2. For Examples 2 and 4, a direct SVD of $\widetilde{\Sigma}$ in Step 3 would be too expensive, and we only suggest the powered fast-sketching estimator $\widetilde{V}'$ as the eigenspace estimator; with suitable choices of $d$, $d'$, $L$, and $t$, the total computational cost is again as reported in Table 2. Inference on the eigenspace will require the calculation of the asymptotic covariance, whose formula and computational costs will be discussed in Sections 4.3 and 4.4.
Method | Error Rate | Computational Complexity |
FADI | ||
Traditional PCA | ||
Fast PCA | ||
Distributed PCA |
For a comparison of FADI with the existing works, we provide in Table 3 the theoretical error rates and the computational complexities of FADI against different PCA methods under Example 1 (please refer to Theorem 4.1 for the error rates of FADI). We choose Example 1 for illustration, as the existing distributed PCA methods mainly consider this setting [18]. The results show that under the distributed setting, FADI has a much lower computational complexity than the other three methods, while enjoying the same error rate as the traditional full-sample PCA. In comparison, the distributed PCA method in [18] is slowed down significantly by applying traditional PCA to each data split. The fast PCA algorithm in [20] has suboptimal computational complexity and theoretical error rate due to its downstream projection step that hinders aggregation.
3.5 Estimation of the Rank
FADI requires inputting the rank $K$ of the matrix $M$. In practice, if we are only interested in estimating the leading PCs, the exact value of $K$ is not needed as long as the fast sketching dimensions $d$ and $d'$ are sufficiently larger than $K$. Yet knowing the exact value of $K$ will improve the computational efficiency as well as facilitate inference on the PCs. In fact, the estimation of $K$ can be incorporated into Step 2 and Step 3 of FADI. Specifically, for the $\ell$-th parallel server ($\ell \in [L]$), after performing the SVD of $\widehat{Y}_\ell$, we estimate $K$ by thresholding the singular values of $\widehat{Y}_\ell$, i.e., $\widehat{K}_\ell = \#\{k: \sigma_k(\widehat{Y}_\ell) \ge \tau\}$,
where $\tau$ is a user-specified parameter (we refer to Theorem 4.3 for the choice of $\tau$). We then send all the left singular vectors and $\widehat{K}_\ell$ to the central processor. Finally, on the central processor, we combine the $\widehat{K}_\ell$'s into the estimator $\widehat{K}$ for $K$, and obtain $\widetilde{V}$ (respectively $\widetilde{V}'$) by performing PCA (respectively powered fast sketching) on the aggregated average of the projection matrices of the $\widehat{K}$ leading left singular vectors of the $\widehat{Y}_\ell$'s and taking the $\widehat{K}$ leading PCs. We will show in Theorem 4.3 that $\widehat{K}$ is a consistent estimator of $K$.
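A minimal sketch of this rank-selection step is given below; the per-sketch thresholding follows the description above, while the rule for combining the per-server estimates (a majority vote) is an illustrative assumption of ours.

```python
import numpy as np

def estimate_rank(Y_sketches, tau):
    """Estimate K by thresholding the singular values of each aggregated sketch Y_l
    at tau, then combining the per-sketch estimates (majority vote shown here)."""
    K_hats = []
    for Y in Y_sketches:
        s = np.linalg.svd(Y, compute_uv=False)
        K_hats.append(int(np.sum(s >= tau)))
    values, counts = np.unique(K_hats, return_counts=True)
    return int(values[np.argmax(counts)])
```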
4 Theory
In this section, we will establish a theoretical upper bound for the error rate of FADI in Section 4.1, and characterize the asymptotic distribution of the FADI estimator in Section 4.3 and Section 4.4 to facilitate inference.
4.1 Theoretical Bound on Error Rates
We need the following condition to guarantee that the error term converges at a proper rate.
Assumption 1 (Convergence of ).
Recall that $E = \widehat{M} - M$ is the error matrix. Assume that $\|E\|$ is sub-exponential, and that there exists a rate $\varepsilon(p)$ controlling $\|E\|$, in the sense that $\|E\| = O_{\mathbb{P}}\{\varepsilon(p)\}$ with sub-exponential tails.
Remark 2.
By standard probability theory [40], Assumption 1 implies that there exists a constant $C$ such that the expectation and higher-order moments of $\|E\|$ are bounded by the corresponding powers of $C\varepsilon(p)$.
We will conduct a variance-bias decomposition on the error rate of $\widetilde{V}$. To facilitate the discussion, we introduce the intermediate matrix $\Sigma^{*} = \mathbb{E}(\widetilde{\Sigma} \mid \widehat{M})$, where the expectation is taken with respect to the Gaussian test matrices. Let $V^{*}$ be the top $K$ eigenvectors of $\Sigma^{*}$. Note that both $\Sigma^{*}$ and $V^{*}$ are random, depending on $\widehat{M}$. For the FADI PC estimator $\widetilde{V}$, we have the following "variance-bias" decomposition of the error rate:
$$\big\|\widetilde{V}\widetilde{V}^\top - VV^\top\big\| \;\le\; \underbrace{\big\|\widetilde{V}\widetilde{V}^\top - V^{*}V^{*\top}\big\|}_{\text{variance}} \;+\; \underbrace{\big\|V^{*}V^{*\top} - VV^\top\big\|}_{\text{bias}}.$$
Conditional on all the available data, the first term characterizes the statistical randomness of $\widetilde{V}$ due to fast sketching, whereas the second bias term is deterministic given the data and depends on all the information provided by the data. Intuitively, since $\widetilde{\Sigma}$ converges to the conditional expectation $\Sigma^{*}$ as $L$ grows, $\widetilde{V}$ will also converge to $V^{*}$. Hence the first variance term goes to 0 asymptotically. As for the second bias term, let $\widehat{V}$ be the $K$ leading eigenvectors of $\widehat{M}$; then we further break the bias term into two components: $\|V^{*}V^{*\top} - VV^\top\| \le \|\widehat{V}\widehat{V}^\top - VV^\top\| + \|V^{*}V^{*\top} - \widehat{V}\widehat{V}^\top\|$. We can see that the first term is the error rate of the traditional PCA, whereas the second term is the bias caused by fast sketching. We can show that the second term is 0 with high probability and is hence negligible compared to the first term (see Lemma B.1 in Appendix B.1 for details), so the bias of the FADI estimator is of the same order as the error rate of the traditional PCA. In other words, the bias of the FADI estimator mainly comes from the error matrix $E$, which reflects the information we can get from the available data. The following theorem gives the overall error rate of the FADI PC estimator. Its proof is given in Appendix B.2.
Theorem 4.1.
Under Assumption 1, if $d$, $L$, and the signal strength $\Delta/\varepsilon(p)$ satisfy the scaling conditions of the theorem with a large enough constant, we have the error bound
(1)
Furthermore, recall that $\widetilde{V}'$ is the $K$ leading left singular vectors of $\widetilde{Y} = \widetilde{\Sigma}^{t}\Omega'$ for some power $t$, where $\Omega'$ is a $p \times d'$ random Gaussian matrix with $d' \ge K$; then under Assumption 1 and additional conditions on $d'$ and $t$, there exists some constant such that
(2)
Remark 3.
On the RHS of (1), the first term is the bias term, while the second term is the variance term. We can see that when the number of sketches $L$ reaches the order $p/d$, the variance term will be of the same order as the bias term, which is in turn of the same order as the error rate of the traditional PCA method. As for (2), the first and second terms on the RHS are the same as the bias and variance terms in (1), while the third term comes from the additional fast sketching. In fact, if we properly choose $d'$ and $t$, the third term in (2) will be negligible. Theorem 4.1 also indicates that the sketching dimension $d$ only needs to be of order $K$, which significantly reduces the communication costs for each server.
Based upon Theorem 4.1, we provide the case-specific error rate for each example given in Section 2 in the following corollary. Please refer to Appendix B.3 for the proof.
Corollary 4.2.
For Examples 1 – 4, we have the following error bounds for each case under corresponding regularity conditions.
- Example 1: Under scaling conditions on $n$, $p$, $d$, and $L$ involving a large enough constant, the error bound (3) holds, where $r$ denotes the effective rank of $\Sigma$.
- Example 2: Suppose the signal strength is bounded below by a constant. If we take $d$, $L$, and the power $t$ sufficiently large, the error bound (4) holds.
- Example 3: Under a signal-strength condition involving a large enough constant, if we take $d$, $L$, and $t$ sufficiently large, the error bound (5) holds.
- Example 4: Suppose the observation probability, the noise level, and the incoherence satisfy suitable regularity conditions. If we take $d$, $L$, and $t$ sufficiently large, the error bound (6) holds.
Remark 4.
We can generalize the results of Example 1 to the heterogeneous residual variance model for non-i.i.d. data, under which are centered random vectors with covariance matrices satisfying , where and . Then we have , where , and . Then by plugging in , we have the error bound under the heterogeneous scenario. While the first term is deterministic, the second term depends on the dependence structure of the sample. Many studies depicted the convergence of the sample covariance matrix for non-i.i.d. data [7, 17].
For Example 1, under suitable conditions, our error rate in (3) is optimal [18]. Under the distributed data setting, we impose a condition on the total sample size $n$, while Fan et al. [18]'s distributed PCA imposes the corresponding requirement on the sample size of each data split. Compared with [18], our method has theoretical guarantees regardless of the number of data splits, but our scaling condition has an extra factor in exchange for reduced computation cost. As for Example 2, our estimation rate in (4) matches the inferential results in [16]. Please also refer to Section 4.3 for a detailed comparison with the method in [16] in terms of the limiting distributions. For Example 3, our estimation rate in (5) is the same as in [12]. For Example 4, our error rate in (6) matches the results in [13]. When the rank $K$ is unknown and estimated by FADI, the following theorem shows that under appropriate conditions, our estimator presented in Section 3.5 recovers the true $K$ with high probability.
Theorem 4.3.
We defer the proof to Appendix B.4. We provide case-specific choices of the thresholding parameter $\tau$ in the following corollary, whose proof can be found in Appendix B.5.
Corollary 4.4.
For Examples 1 to 4, we specify the choice of $\tau$ under certain regularity conditions.
- Example 1: Under suitable scaling conditions on $n$, $p$, $d$, and $L$, an explicit choice of $\tau$ guarantees that $\widehat{K} = K$ with high probability.
- Example 2: Under a signal-strength condition, a choice of $\tau$ based on the average degree guarantees that $\widehat{K} = K$ with high probability.
- Example 3: Under suitable conditions on $d$ and the signal strength, a choice of $\tau$ computable from the observed data guarantees that $\widehat{K} = K$ with high probability.
- Example 4: Under conditions on the observation probability, the noise level, and the incoherence, an explicit choice of $\tau$ guarantees that $\widehat{K} = K$ with high probability.
Remark 5.
For Example 3, we impose an upper bound on $d$ because in practice the eigengap is unknown, and it would otherwise enter the choice of $\tau$. Imposing the upper bound on $d$ makes the term in $\tau$ involving the unknown eigengap vanish and enables the estimation of $K$ from the observed data.
4.2 Inferential Results on the Asymptotic Distribution: Intuition and Assumptions
In Section 4.1, we discussed the theoretical upper bound for the error rate and presented the bias-variance decomposition for the FADI estimator $\widetilde{V}$. From (2), we can see that when $Ld \gg p$, the bias term will be the leading term and the dominating error comes from the statistical error $E$, whereas when $Ld \ll p$, the variance term will be the leading term and the main error derives from the fast sketching. This offers insight into conducting inferential analysis on the estimator and implies a possible phase transition in the asymptotic distribution. Before moving on to further discussions, we state the following assumption to ensure that the biased component of the error matrix is negligible.
Assumption 2 (Statistical Rate for the Biased Error Term).
For the error matrix we have the decomposition $E = E^{(0)} + E^{(b)}$, where $E^{(0)}$ is the unbiased error term with $\mathbb{E}(E^{(0)}) = 0$, and $E^{(b)}$ is the biased error term whose magnitude is negligible relative to the leading error terms.
In fact, we will later show in Section 4.3 and Section 4.4 that the leading term for the distance between $\widetilde{V}$ and $V$ takes on two different forms under the two regimes:
where is some orthogonal matrix aligning with , is the projection matrix onto the linear space perpendicular to , and with for . To get an intuitive understanding on the form of the leading error term, let’s start with the regime where and consider the case where are well-separated such that . Following basic algebra, we have
where is the -leading eigenvalues of corresponding to , and the second approximation is due to the fact that and are fairly close and will be negligible.
Now we turn to the scenario , where the error mainly comes from . For a given , denote , where is also a Gaussian test matrix. Intuitively, when is much larger than . Hence acts like an orthonormal matrix scaled by , and the rank- truncated SVD for and will approximately be and respectively. Then following similar arguments as when , we have
where the last approximation is because when is almost orthonormal we have . Then aggregating the results over we have
It is worth noting that the two forms of the leading term agree, as shown in (7), at the boundary between the two regimes, which demonstrates the consistency of the leading term across different regimes of $Ld/p$. To unify the notation, we use a common expression for the leading term in both regimes in what follows.
Before we formally present the theorems, we introduce the following extra regularity conditions necessary for studying the asymptotic features of the eigenspace estimator.
Assumption 3 (Incoherence Condition).
For the eigenspace $V$ of the true matrix $M$, we assume the incoherence bound $\|V\|_{2\to\infty} \le \sqrt{\mu K/p}$, where the incoherence parameter $\mu$ may change with $p$.
Assumption 4 (Statistical Rates for Eigenspace Convergence).
For the unbiased error term $E^{(0)}$ and the traditional PCA estimator $\widehat{V}$, we have the following row-wise statistical rates.
Assumption 5 (Central Limit Theorem).
For the leading term and any $j \in [p]$, it holds that the standardized $j$-th row converges in distribution to a multivariate Gaussian, where the asymptotic covariance takes one form when $Ld \gg p$ and another form when $Ld \ll p$.
Assumption 3 is the incoherence condition [10] to guarantee that the information of the eigenspace is uniformly spread across the coordinates. In Assumption 4, the first rate bounds the row-wise estimation error for the eigenspace, while the second rate characterizes the row-wise convergence of the residual error term projected onto the spaces spanned by $V$ and $\widehat{V}$ consecutively. Assumption 5 states that the leading term satisfies the central limit theorem (CLT). These assumptions are for the general framework and will be translated into case-specific conditions for the concrete examples. With the above assumptions in place, we are ready to present the formal inferential results.
4.3 Inference When $Ld \gg p$
Recall that $\widetilde{V}$ is the $K$ leading eigenvectors of the matrix $\widetilde{\Sigma}$, and $\widetilde{V}'$ is the $K$ leading left singular vectors of the matrix $\widetilde{Y}$. We define $H$ to be the alignment matrix between $\widetilde{V}$ and $V$. The following theorem provides the distributional guarantee of FADI when $Ld \gg p$.
Theorem 4.5.
Remark 6.
Please refer to Appendix B.9 for the proof of Theorem 4.5. Here the positive-definiteness condition guarantees that the asymptotic covariance of the leading term is well-behaved, and the stated rate bounds the remainder term stemming from the fast sketching approximation and the eigenspace misalignment. When the leading-order signal is not too small relative to this remainder, Theorem 4.5 guarantees the distributional convergence of the FADI estimator. We will see in the concrete examples that the asymptotic covariance of the FADI estimator under the regime $Ld \gg p$ is the same as that of the traditional PCA estimator. In other words, we can increase the number of repeated sketches $L$ in exchange for the same testing efficiency as the traditional PCA.
4.3.1 Spiked Covariance Model
Recall the index set $J$ defined in Section 3.3 for estimating $\sigma^2$. We denote by $\Sigma_J$ the population covariance matrix corresponding to $J$ and denote by $\Delta_J$ its eigengap. We have the following corollary of Theorem 4.5 for Example 1.
Corollary 4.6.
Remark 7.
Please refer to Appendix B.10 for the proof. We compute the asymptotic covariance estimator distributively across the data splits at low additional cost, and we recommend choosing $d$, $L$, and $m$ accordingly for optimal computational efficiency. Our asymptotic covariance matrix is the same as that of the traditional PCA estimator under the incoherence condition [5, 32, 41]. Specifically, Wang and Fan [41] studied the asymptotic distribution of the traditional PCA estimator by assuming that the spiked eigenvalues are well-separated and diverging to infinity, which is not required by our paper. Our scaling conditions are stronger than those for the estimation results in Corollary 4.2, so as to cancel out the additional randomness induced by fast sketching and allow for efficient inference.
4.3.2 Degree-Corrected Mixed Membership Models
Corollary 4.7.
When for some constant and , if we take , , and , then (8) holds. Furthermore, if we denote , we have
(11) |
Besides, define and , then if we estimate by , we have
(12) |
Remark 8.
The proof is deferred to Appendix B.11. We can obtain the asymptotic covariance estimator by computing its components in parallel across the data splits, with $d$ and $L$ chosen for optimal computational efficiency. Inferential analyses on the membership profiles have received attention in previous works [16, 37]. Fan et al. [16] studied the asymptotic normality of the spectral estimator under the DCMM model with complicated assumptions on the eigen-structure (see Conditions 1, 3, 6, 7 in their paper). In comparison, we only impose non-singularity conditions on the membership profiles, but have a stronger scaling condition on the signal strength to facilitate the divide-and-conquer process. Our asymptotic covariance is almost the same as Fan et al. [16]'s, suggesting the same level of asymptotic efficiency.
4.3.3 Gaussian Mixture Models
We denote the incoherence parameter for the Gaussian means analogously to Assumption 3. Then we have the following corollary for Example 3.
Corollary 4.8.
When , If we take , and , under the conditions that
we have that (8) holds. Furthermore, if we denote , we have
(13) |
If we define and estimate by , we have
(14) |
Remark 9.
Please refer to Appendix B.12 for the proof. We impose the stated upper bound to guarantee that the leading term satisfies the CLT. The asymptotic covariance can be computed distributively, and we recommend choosing $d$, $L$, and $m$ for optimal total complexity. In Corollary 4.8, the scaling condition is stronger than that in Corollary 4.2, where the extra factor is to guarantee a fast enough convergence rate of the remainder term for inference. It can be verified that the asymptotic covariance in (13) matches the Cramér-Rao lower bound for unbiased estimators, and thus we can also see from (13) that when the sample size is large enough, the asymptotic efficiency of the FADI estimator is 1 under the regime $Ld \gg p$.
4.3.4 Incomplete Matrix Inference
Corollary 4.9.
Remark 10.
Please see Appendix B.13 for the proof of Corollary 4.9. We compute the asymptotic covariance estimator by calculating its components in parallel, and the components can then be communicated across servers at low cost; we recommend choosing $d$, $L$, and $t$ as in Section 3.4 for overall computational efficiency. Chen et al. [13] studied the incomplete matrix inference problem through penalized optimization, and their testing efficiency is the same as ours.
4.4 Inference When $Ld \ll p$
Similarly to the case $Ld \gg p$, we first redefine the alignment matrix between $\widetilde{V}$ and $V$. Then we have the following theorem characterizing the limiting distribution of $\widetilde{V}$.
Theorem 4.10.
Remark 11.
Theorem 4.10 states that under proper scaling conditions, the FADI estimator still enjoys asymptotic normality even when the aggregated sketching dimension $Ld$ is much smaller than $p$. The asymptotic variance in this regime is usually of a larger order than the corresponding rate in Theorem 4.5, suggesting a larger variance and lower testing efficiency of FADI at $Ld \ll p$ than at $Ld \gg p$. The proof is deferred to Appendix B.6.
The following corollaries of Theorem 4.10 provide case-specific distributional guarantees for Examples 1 and 3 under the regime $Ld \ll p$.
4.4.1 Spiked Covariance Model
Corollary 4.11.
Remark 12.
Please refer to Appendix B.7 for the proof. For the computation of the asymptotic covariance, apart from $\widehat{V}_\ell$, the $\ell$-th machine on layer 2 (see Figure 2) will send additional low-dimensional summary statistics to the central processor, so the communication cost for each server remains low. Compared to Corollary 4.6, Corollary 4.11 has stronger scaling conditions on the sample size to compensate for the extra variability due to fewer fast sketches. As indicated by (7), the asymptotic covariance matrix of Corollary 4.12 is consistent with Corollary 4.8.
4.4.2 Gaussian Mixture Models
Corollary 4.12.
When , if we take and , we have that (17) holds under the conditions that
Furthermore, if we define then we have
(20) |
If we further assume and estimate by , where with for , we have
(21) |
Remark 13.
We do not have distributional results for Examples 2 and 4 under the regime $Ld \ll p$. An intuitive explanation is that the information contained in each entry is independent for Examples 2 and 4, and when $Ld \ll p$, too much information will be lost from the graph or matrix. In comparison, we can still recover information for Examples 1 and 3 under the regime $Ld \ll p$ due to the correlation structure of the matrix.
5 Numerical Results
We conduct extensive simulation studies to assess the performance of FADI under each example given in Section 2 and compare it with several existing methods. We provide in this section the representative results for Examples 1 and 2. The results for Examples 3 and 4 are given in Appendix A.
5.1 Example 1: Spiked Covariance Model
We generate $X_i$ i.i.d. from $N(0, \Sigma)$ with $\Sigma = V\Lambda V^\top + \sigma^2 I_p$ and $K = 3$. We consider several combinations of $p$ and $n$ (see Table 4) to study the asymptotic properties of the FADI estimator under different settings. To ensure the incoherence condition is satisfied, we set $V$ to be the left singular vectors of a $p \times K$ i.i.d. Gaussian matrix. We specify the spiked eigenvalues and the residual variance, and for the estimation of $\sigma^2$ in Step 0 we fix the size of the index set $J$. We split the data into $m$ subsamples, and set $d'$ and $t$ in Step 3 to compute $\widetilde{V}'$. We set $L$ at a range of values indexed by the ratio $Ld/p$ for each setting, and compute the asymptotic covariance via Corollary 4.6 and Corollary 4.11 correspondingly. We construct a studentized chi-squared-type statistic for a given row of the PC estimator, and calculate the coverage probability by empirically evaluating how often the statistic falls below the 0.95 quantile of the chi-squared distribution with 3 degrees of freedom. Results under different settings are shown in Figure 3. Figure 3(a) shows that as $L$ increases, the error rate of FADI converges to that of the traditional PCA. From Figure 3(b) we can see that when $Ld/p$ approaches 1 from the left, the computational efficiency drops due to the increased cost of computing the sketches. For Figure 3(c), convergence towards the nominal 95% level can be observed when $Ld/p$ is much smaller or much larger than 1, while the valley at around 1 is consistent with the theoretical conditions in Section 4 and implies a possible phase-transition phenomenon in the distributional convergence of FADI. Note that the empirical coverage is closer to the nominal level 0.95 at $Ld \gg p$ than at $Ld \ll p$, which might be caused by the vanishing of some error terms in the approximation of the asymptotic covariance matrix as $Ld/p$ grows larger. The good Gaussian approximation is further validated by Figure 3(d), which shows the Q-Q plot of the standardized first entry of the estimated PC row. Based upon the low computational efficiency and poor empirical coverage at $Ld/p$ around 1, we recommend conducting inference based on FADI in the regimes $Ld \gg p$ and $Ld \ll p$ only. In particular, we suggest the regime $Ld \gg p$ if priority is given to higher testing efficiency, and the regime $Ld \ll p$ if one needs valid inference with faster computation. We also compare FADI with the distributed PCA in [18]. Results over 100 Monte Carlo replications are given in Table 4. We can see that FADI outperforms both the distributed PCA and the traditional PCA under the distributed setting.
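For concreteness, the following self-contained NumPy sketch reproduces the flavor of this experiment on a small scale: it simulates the spiked covariance model, runs an in-memory version of FADI, and compares the projection distance with the full-sample PCA. All sizes and eigenvalues below are illustrative rather than the paper's settings, and the residual variance is treated as known to keep the sketch short.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, K, m, d, L = 300, 20000, 3, 10, 30, 20      # illustrative sizes only
V = np.linalg.qr(rng.standard_normal((p, K)))[0]  # incoherent eigenvectors
lam = np.array([20.0, 15.0, 10.0])
Sigma_half = V * np.sqrt(lam)                     # Cov(X_i) = V diag(lam) V^T + I_p
X = rng.standard_normal((n, K)) @ Sigma_half.T + rng.standard_normal((n, p))
splits = np.array_split(X, m)                     # horizontal (sample) splits

sigma2_hat = 1.0                                  # residual variance assumed known here
proj = np.zeros((p, p))
for _ in range(L):                                # L fast sketches, aggregated over splits
    Omega = rng.standard_normal((p, d))
    Y = sum(Xj.T @ (Xj @ Omega) for Xj in splits) / n - sigma2_hat * Omega
    U = np.linalg.svd(Y, full_matrices=False)[0][:, :K]
    proj += U @ U.T / L
V_fadi = np.linalg.eigh(proj)[1][:, ::-1][:, :K]

S = X.T @ X / n                                   # full-sample PCA as the benchmark
V_pca = np.linalg.eigh(S)[1][:, ::-1][:, :K]
dist = lambda A, B: np.linalg.norm(A @ A.T - B @ B.T, 2)
print("FADI error:", dist(V_fadi, V), " full PCA error:", dist(V_pca, V))
```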
[Figure 3: (a) error rate; (b) running time; (c) coverage probability; (d) Q-Q plot.]
Parameters | Error rate | Running time (seconds) | |||||||
FADI | Traditional | Distributed | FADI | Traditional | Distributed | ||||
400 | 30000 | 15 | 40 | 0.068 | 0.065 | 0.065 | 0.07 | 4.53 | 0.59 |
400 | 60000 | 30 | 40 | 0.048 | 0.046 | 0.046 | 0.05 | 8.84 | 0.60 |
400 | 100000 | 50 | 40 | 0.037 | 0.036 | 0.036 | 0.05 | 14.84 | 0.62 |
800 | 100000 | 50 | 80 | 0.052 | 0.050 | 0.050 | 0.10 | 55.76 | 3.66 |
800 | 5000 | 50 | 80 | 0.230 | 0.220 | 0.230 | 0.05 | 3.76 | 2.56 |
800 | 25000 | 50 | 80 | 0.106 | 0.103 | 0.103 | 0.07 | 15.07 | 2.82 |
800 | 50000 | 50 | 80 | 0.073 | 0.070 | 0.070 | 0.07 | 28.68 | 3.23 |
1600 | 30000 | 15 | 160 | 0.134 | 0.130 | 0.130 | 0.31 | 80.72 | 27.02 |
1600 | 60000 | 30 | 160 | 0.095 | 0.092 | 0.092 | 0.35 | 150.75 | 27.29 |
1600 | 100000 | 50 | 160 | 0.074 | 0.071 | 0.071 | 0.34 | 243.83 | 27.38 |
5.2 Example 2: Degree-Corrected Mixed Membership Models
We consider the mixed membership model without degree heterogeneity for the simulation, i.e., the degree parameters are all equal. For two preselected nodes $i$ and $j$, we test $H_0: \pi_i = \pi_j$ vs. $H_1: \pi_i \neq \pi_j$ by testing whether the corresponding rows of $V$ are equal. To simulate the data, we set $K = 3$ and specify the membership profiles and the connection probability matrix accordingly.
We test the performance of FADI under $p = 500$, $1000$ and $2000$, respectively; under each setting of $p$, we fix the sketching dimension $d$ and the number of data splits $m$, and set $L$ through the ratio $Ld/p$. For each setting, we conduct 300 independent Monte Carlo simulations. To perform the test, with minor modifications of Corollary 4.7, we can show that the properly standardized difference between the two estimated rows is asymptotically multivariate Gaussian under $H_0$, as stated in (22), where the asymptotic covariance can be consistently estimated by a plug-in estimator. We first preselect two nodes, denoted by $i_1$ and $i_2$, with identical membership profiles, and calculate the empirical coverage probability of the resulting 95% confidence region. We also evaluate the power of the test by choosing two nodes with different membership profiles, denoted by $i_3$ and $i_4$, and empirically calculating the rejection rate at the 5% level. Under the regime $Ld \ll p$, we calculate the asymptotic covariance referring to Theorem 4.10 by the corresponding plug-in formula. We also apply k-means to the FADI estimator to differentiate the membership profiles and compare the misclustering rate with the traditional PCA. The results under different settings are shown in Figure 4. We can see from Figure 4(d) that under the regime $Ld \ll p$, the empirical coverage probability is zero under all settings, which validates the necessity of $Ld \gg p$ for the performance guarantee. Figure 4(f) demonstrates the asymptotic normality of the test statistic at $Ld \gg p$ and the poor Gaussian approximation of FADI at $Ld \ll p$, based on the first entry of the standardized row difference.
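A sketch of this pairwise test is given below: it assumes the asymptotic normality of the row difference stated above, so that the studentized quadratic form is approximately chi-squared with $K$ degrees of freedom under $H_0$. The covariance estimate `Sigma_hat_ij` is taken as given (e.g., the plug-in estimator discussed above); its explicit formula is not reproduced here.

```python
import numpy as np
from scipy.stats import chi2

def membership_test(V_tilde, Sigma_hat_ij, i, j, alpha=0.05):
    """Test H0: nodes i and j share the same membership profile, using the
    difference of the i-th and j-th rows of the FADI eigenvector estimate."""
    diff = V_tilde[i] - V_tilde[j]                       # length-K row difference
    stat = diff @ np.linalg.solve(Sigma_hat_ij, diff)    # studentized quadratic form
    K = V_tilde.shape[1]
    pval = chi2.sf(stat, df=K)                           # approximate chi-squared_K null
    return stat, pval, pval < alpha
```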
We also compare FADI with the SIMPLE method [16] for membership profile inference under the DCMM model. The SIMPLE method conducted inference directly on the traditional PCA estimator and adopted a one-step correction to the empirical eigenvalues for calculating the asymptotic covariance matrix. We compare the inferential performance of FADI at $Ld \gg p$ with the SIMPLE method (over 100 independent Monte Carlo replications), and summarize the results in Table 5, where the running time includes both the PCA procedure and the computation of the asymptotic covariance. Compared to the SIMPLE method, our method has similar coverage probability and power but is computationally more efficient.
Parameters | Coverage probability | Power | Running time (seconds) | |||||
FADI | SIMPLE | FADI | SIMPLE | FADI | SIMPLE | |||
500 | 12 | 417 | 0.91 | 0.92 | 0.87 | 0.88 | 0.21 | 0.73 |
1000 | 12 | 833 | 0.94 | 0.94 | 1.00 | 1.00 | 0.69 | 6.77 |
2000 | 12 | 1667 | 0.95 | 0.98 | 1.00 | 1.00 | 2.61 | 59.42 |
[Figure 4: (a) error rate; (b) misclustering rate; (c) running time; (d) coverage probability; (e) power; (f) Q-Q plot.]
6 Application to the 1000 Genomes Data
In this section, we apply FADI and the existing methods to the 1000 Genomes Data [1]. We use phase 3 of the 1000 Genomes Data and focus on common variants with minor allele frequencies larger than or equal to 0.05. There are 2504 subjects in total, and 168,047 independent variants after the linkage disequilibrium (LD) pruning. As we are interested in the ancestry principal components to capture population structure, the sample size is the number of independent variants after LD pruning ($n = 168{,}047$), and the dimension is the number of subjects ($p = 2504$) [33]. The data were collected from 7 super populations: (1) AFR: African; (2) AMR: Ad Mixed American; (3) EAS: East Asian; (4) EUR: European; (5) SAS: South Asian; (6) PUR: Puerto Rican and (7) FIN: Finnish; and 26 sub-populations.
6.1 Estimation of Principal Eigenspace
For the estimation of the principal components, we assume that the data follow the spiked covariance model specified in Example 1. We perform FADI with prespecified values of $d$, $L$, $m$, $d'$ and $t$, where we choose $d$ and $L$ according to Table 1. For the estimation of the number of spikes, we use the thresholding parameter $\tau$ described in Section 3.5. The estimated number of spikes from FADI is close to 25, the number of self-reported ethnicity groups (26 sub-populations) minus 1. The results of the 4 leading PCs are shown in Figure 5, where a clear separation can be observed among different super-populations. Figure 10 and Figure 11 in the appendix show a good alignment between the PC results calculated by the traditional PCA and FADI. We compare the computational times of different methods for analyzing the 1000 Genomes Data. FADI takes 5.6 seconds, whereas the traditional PCA method takes 595.4 seconds and the distributed PCA method [18] takes 120.2 seconds. These results show that FADI greatly outperforms the existing PCA methods in terms of computational time.
[Figure 5: pairwise scatter plots of the 4 leading PCs — (a) PC 1 versus PC 2; (b) PC 1 versus PC 3; (c) PC 1 versus PC 4; (d) PC 2 versus PC 3; (e) PC 2 versus PC 4; (f) PC 3 versus PC 4.]
6.2 Inference on Ancestry Membership Profiles
We also generate an undirected graph from the 1000 Genomes Data. To introduce more randomness so that the data better fit the model setting in Example 2, we sample 1,000 of the 168,047 variants to generate the graph. More specifically, we treat each subject as a node, and for each given pair of subjects , we define a genetic similarity score , where refers to the genotype of the -th variant for subject . We denote by the 0.95 quantile of . Subjects and are connected if and only if . Denote by the adjacency matrix (allowing no self-loops). We include only four super populations: AFR, EAS, EUR and SAS, with 2,058 subjects in total. We are interested in testing whether two given subjects and belong to the same super population, i.e., vs. . We divide the adjacency matrix equally into splits, and perform FADI with , , and . The rank estimator from FADI is by setting , where is the average degree estimator defined in Section 3.3. We can see that the estimated rank is consistent with the number of super populations. We apply k-means clustering to the FADI estimator , and calculate the misclustering rate by treating the self-reported ancestry group as the ground truth. The misclustering rate of FADI is 0.135, with a computation time of 3.7 seconds.
In comparison, the misclustering rate for the traditional PCA method is 0.134 with a computation time of 26.5 seconds, and the correlations between the top four PCs from the traditional PCA and FADI are 0.997, 0.994, 0.994 and 0.996, respectively.
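For illustration, a hypothetical sketch of the graph construction described above is given below. The similarity used here (the fraction of matching genotype calls) is a stand-in for the genetic similarity score defined in the paper, whose exact form is elided above.

```python
import numpy as np

def genotype_adjacency(G, quantile=0.95):
    """Build an undirected adjacency matrix from a genotype matrix G
    (subjects x variants, entries in {0, 1, 2}). Two subjects are connected
    when their similarity exceeds the given quantile of all pairwise
    similarities. The similarity below (fraction of matching genotype calls)
    is an illustrative stand-in for the score defined in the paper."""
    n, _ = G.shape
    sim = np.zeros((n, n))
    for i in range(n):
        sim[i] = (G == G[i]).mean(axis=1)   # fraction of matching calls
    iu = np.triu_indices(n, k=1)
    threshold = np.quantile(sim[iu], quantile)
    A = (sim > threshold).astype(int)
    np.fill_diagonal(A, 0)                  # no self-loops
    return A
```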
To conduct pairwise inference on the ancestry membership profiles, we preselect 16 subjects, with 4 subjects from each super population. We apply the Bonferroni correction to adjust for multiple comparisons and set the level at . We estimate the asymptotic covariance matrix by Corollary 4.7 and correct by setting entries larger than 1 to 1 and entries smaller than 0 to 0. The pairwise p-values are summarized in Figure 6. The computational time for computing the covariance matrix is 0.31 seconds. We can see that most of the comparison results are consistent with the true ancestry groups, while the inconsistencies could be due to the mixed memberships of certain subjects and unaccounted-for sub-population structures.

7 Discussion
In this paper, we develop a FAst DIstributed PCA algorithm, FADI, that can handle high-dimensional PC calculations with low computational cost and high accuracy. The algorithm is applicable to multiple statistical models and is amenable to distributed computing. The main idea is to apply distributed-friendly random sketches to reduce the data dimension, and to aggregate the results from multiple sketches to improve the statistical accuracy and accommodate federated data. We conduct theoretical analysis as well as simulation studies to demonstrate that FADI enjoys the same non-asymptotic error rate as the traditional full-sample PCA while significantly reducing the computational time compared to existing methods. We also establish distributional guarantees for the FADI estimator and perform numerical experiments to validate the potential phase-transition phenomenon in distributional convergence.
Fast PCA algorithms using random sketches usually require the data to have certain “almost low-rank” structures, without which the approximation might not be accurate [20]. It is of future research interest to investigate whether the proposed FADI approach can be extended to non-low-rank settings. In Step 3 of FADI, we aggregate local estimators by taking a simple average over the projection matrices. It would also be of interest to explore the performance of other weighted averages and to investigate the best convex combination for reducing the statistical error.
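To make the sketch-and-aggregate idea above concrete, below is a minimal single-machine sketch in Python. It is an illustration under simplifying assumptions (a single symmetric input matrix, an i.i.d. Gaussian sketching matrix, and illustrative choices of the sketch dimension and number of sketches), not the full federated procedure with sample splitting and distributed aggregation described in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def fadi_sketch_and_aggregate(M, K, d_tilde, L):
    """Estimate the top-K eigenspace of a symmetric matrix M by averaging
    rank-K projection matrices obtained from L independent Gaussian sketches."""
    p = M.shape[0]
    P_bar = np.zeros((p, p))
    for _ in range(L):
        Omega = rng.standard_normal((p, d_tilde)) / np.sqrt(d_tilde)  # random sketch
        Y = M @ Omega                                   # p x d_tilde sketched matrix
        U = np.linalg.svd(Y, full_matrices=False)[0]
        U_K = U[:, :K]                                  # local top-K left singular vectors
        P_bar += U_K @ U_K.T / L                        # average projection matrices
    # Final PCs: top-K eigenvectors of the averaged projection matrix.
    eigvecs = np.linalg.eigh(P_bar)[1]
    return eigvecs[:, ::-1][:, :K]

# Toy example: a rank-3 signal plus noise.
p, K = 200, 3
B = rng.standard_normal((p, K))
M = B @ B.T + 0.1 * rng.standard_normal((p, p))
M = (M + M.T) / 2
V_hat = fadi_sketch_and_aggregate(M, K, d_tilde=30, L=20)
print(V_hat.shape)  # (200, 3)
```

Averaging the rank-K projection matrices across sketches, rather than the eigenvectors themselves, avoids the sign and rotation ambiguity of the individual eigenvector estimates.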
References
- [1] 1000 Genomes Project Consortium (2015). A global reference for human genetic variation. Nature, 526(7571):68.
- [2] Abbe, E. (2018). Community detection and stochastic block models: Recent developments. Journal of Machine Learning Research, 18(177):1–86.
- [3] Abbe, E., Fan, J., Wang, K., and Zhong, Y. (2020). Entrywise eigenvector analysis of random matrices with low expected rank. Annals of Statistics, 48(3):1452–1474.
- [4] Achlioptas, D. (2003). Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences, 66(4):671–687. Special issue on PODS 2001 (Santa Barbara, CA).
- [5] Anderson, T. W. (1963). Asymptotic theory for principal component analysis. Annals of Mathematical Statistics, 34:122–148.
- [6] Baik, J., Arous, G. B., and Péché, S. (2005). Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices. Annals of Probability, 33(5):1643–1697.
- [7] Banna, M., Merlevède, F., and Youssef, P. (2016). Bernstein-type inequality for a class of dependent random matrices. Random Matrices: Theory and Applications, 05(02):1650006.
- [8] Belbin, G. M., Cullina, S., Wenric, S., Soper, E. R., Glicksberg, B. S., Torre, D., Moscati, A., Wojcik, G. L., Shemirani, R., Beckmann, N. D., Cohain, A., Sorokin, E. P., Park, D. S., Ambite, J.-L., Ellis, S., Auton, A., Bottinger, E. P., Cho, J. H., Loos, R. J. F., Abul-Husn, N. S., Zaitlen, N. A., Gignoux, C. R., and Kenny, E. E. (2021). Toward a fine-scale population health monitoring system. Cell, 184(8):2068–2083.e11. PMID: 33861964.
- [9] Bernstein, S. (1924). On a modification of Chebyshev's inequality and of the error formula of Laplace. Annals of Science Institute SAV. Ukraine, Sect. Math, I.
- [10] Candès, E. J. and Tao, T. (2010). The power of convex relaxation: Near-optimal matrix completion. IEEE Transactions on Information Theory, 56(5):2053–2080.
- [11] Chen, T., Chang, D. D., Huang, S., Chen, H., Lin, C., and Wang, W. (2016). Integrating multiple random sketches for singular value decomposition. arXiv preprint arXiv:1608.08285.
- [12] Chen, Y., Chi, Y., Fan, J., and Ma, C. (2021). Spectral methods for data science: A statistical perspective. Foundations and Trends® in Machine Learning, 14(5):566–806.
- [13] Chen, Y., Fan, J., Ma, C., and Yan, Y. (2019). Inference and uncertainty quantification for noisy matrix completion. Proceedings of the National Academy of Sciences, 116(46):22931–22937.
- [14] Dey, R., Zhou, W., Kiiskinen, T., Havulinna, A., Elliott, A., Karjalainen, J., Kurki, M., Qin, A., Lee, S., Palotie, A., et al. (2022). Efficient and accurate frailty model approach for genome-wide survival association analysis in large-scale biobanks. Nature Communications, 13(1):1–13.
- [15] Dhruva, S. S., Ross, J. S., Akar, J. G., et al. (2020). Aggregating multiple real-world data sources using a patient-centered health-data-sharing platform. npj Digital Medicine, 3(1):60.
- [16] Fan, J., Fan, Y., Han, X., and Lv, J. (2022). SIMPLE: Statistical inference on membership profiles in large networks. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 84(2):630–653.
- [17] Fan, J., Liao, Y., and Mincheva, M. (2013). Large covariance estimation by thresholding principal orthogonal complements. Journal of the Royal Statistical Society: Series B, 75(4):603–680.
- [18] Fan, J., Wang, D., Wang, K., and Zhu, Z. (2019). Distributed estimation of principal eigenspaces. Annals of Statistics, 47(6):3009–3031.
- [19] Franklin, J. N. (2012). Matrix Theory. Courier Corporation.
- [20] Halko, N., Martinsson, P. G., and Tropp, J. A. (2011). Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2):217–288.
- [21] Hoeffding, W. (1994). Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pages 409–426. Springer.
- [22] Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30(1):175–193.
- [23] Jiang, B., Ma, S., Causey, J., Qiao, L., Hardin, M. P., Bitts, I., Johnson, D., Zhang, S., and Huang, X. (2016). SparRec: An effective matrix completion framework of missing data imputation for GWAS. Scientific Reports, 6(1):35534.
- [24] Jin, J. (2015). Fast community detection by SCORE. Annals of Statistics, 43(1):57–89.
- [25] Johnstone, I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Annals of Statistics, 29(2):295–327.
- [26] Jordan, M. I., Lee, J. D., and Yang, Y. (2019). Communication-efficient distributed statistical inference. Journal of the American Statistical Association, 114(526):668–681.
- [27] Kannan, R., Vempala, S., and Woodruff, D. (2014). Principal component analysis and higher correlations for distributed data. In Proceedings of the 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 1040–1057, Barcelona, Spain. PMLR.
- [28] Kargupta, H., Huang, W., Sivakumar, K., and Johnson, E. (2001). Distributed clustering using collective principal component analysis. Knowledge and Information Systems, 3(4):422–448.
- [29] Klarin, D., Damrauer, S. M., Cho, K., Sun, Y. V., Teslovich, T. M., Honerlaw, J., Gagnon, D. R., DuVall, S. L., Li, J., Peloso, G. M., et al. (2018). Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nature Genetics, 50(11):1514–1523. PMCID: PMC6521726.
- [30] Li, T., Sahu, A. K., Talwalkar, A., and Smith, V. (2020). Federated learning: Challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60.
- [31] Pasini, G. (2017). Principal component analysis for stock portfolio management. International Journal of Pure and Applied Mathematics, 115:153–167.
- [32] Paul, D. (2007). Asymptotics of sample eigenstructure for a large dimensional spiked covariance model. Statistica Sinica, pages 1617–1642.
- [33] Price, A. L., Patterson, N. J., Plenge, R. M., Weinblatt, M. E., Shadick, N. A., and Reich, D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genetics, 38(8):904–909.
- [34] Pulley, J., Clayton, E., Bernard, G. R., Roden, D. M., and Masys, D. R. (2010). Principles of human subjects protections applied in an opt-out, de-identified biobank. Clinical and Translational Science, 3(1):42–48. PMCID: PMC3075971.
- [35] Reich, D., Price, A. L., and Patterson, N. (2008). Principal component analysis of genetic data. Nature Genetics, 40(5):491–492.
- [36] Serfling, R. J. (2009). Approximation Theorems of Mathematical Statistics. John Wiley & Sons.
- [37] Shen, S. and Lu, J. (2020). Combinatorial-probabilistic trade-off: Community properties test in the stochastic block models. arXiv preprint arXiv:2010.15063.
- [38] Stewart, G. W. (1977). On the perturbation of pseudo-inverses, projections and linear least squares problems. SIAM Review, 19(4):634–662.
- [39] Sudlow, C., Gallacher, J., Allen, N., et al. (2015). UK Biobank: An open access resource for identifying the causes of a wide range of complex diseases of middle and old age. PLoS Medicine, 12(3):e1001779.
- [40] Vershynin, R. (2012). Introduction to the Non-Asymptotic Analysis of Random Matrices, pages 210–268. Cambridge University Press.
- [41] Wang, W. and Fan, J. (2017). Asymptotics of empirical eigen-structure for high dimensional spiked covariance. Annals of Statistics, 45(3):1342–1374.
- [42] Wedin, P.-A. (1972). Perturbation bounds in connection with singular value decomposition. Nordisk Tidskr. Informationsbehandling (BIT), 12:99–111.
- [43] Yan, Y., Chen, Y., and Fan, J. (2021). Inference for heteroskedastic PCA with missing data. arXiv preprint arXiv:2107.12365.
- [44] Yang, F., Liu, S., Dobriban, E., and Woodruff, D. P. (2021). How to reduce dimension with PCA and random projections? IEEE Transactions on Information Theory, 67:8154–8189.
- [45] Yu, Y., Wang, T., and Samworth, R. J. (2015). A useful variant of the Davis-Kahan theorem for statisticians. Biometrika, 102(2):315–323.
Supplementary Materials to
“FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data”
This file contains the supplementary materials to the paper “FADI: Fast Distributed Principal Component Analysis With High Accuracy for Large-Scale Federated Data”. In Appendix A we provide numerical results for Example 3 and Example 4 along with some additional simulation results for Example 1 under the genetic setting. In Appendix B, we present the proofs for the main theorems, propositions and corollaries given in Section 4 of the main paper. In Appendix C we give the proofs of some technical lemmas useful for the proofs of the main theorems. In Appendix D, we present the modified version of Wedin’s theorem, which is used in several proofs. Appendix E provides the supplementary figures deferred from the main paper.
A Additional Simulation Results
In this section we present the simulation results for Example 3 and Example 4, and we provide some additional simulation results for Example 1 to evaluate the performance of FADI under the genetic settings.
A.1 Example 3: Gaussian Mixture Models
Under this setting, we take , fix the Gaussian vector dimension at and set . Then we generate the Gaussian means by , . We set the sample size at respectively and generate independent Gaussian samples from a mixture of Gaussians with means to study the performance of FADI under different settings. We assign Gaussian samples to each cluster. We divide the data vertically along into splits, and set and for the final powered fast sketching. We take the ratio for each setting and compute the asymptotic covariance via Corollary 4.8 and Corollary 4.12 under different regimes of . We define , where is the asymptotic covariance for the first row of and is the alignment matrix, and calculate the empirical coverage probability by empirically evaluating , where is the 0.95 quantile of the chi-squared distribution with 3 degrees of freedom. We perform 300 Monte Carlo simulations, and the results under different settings are shown in Figure 7. We can see that the error rate of FADI gets closer to that of the traditional PCA estimator as increases, while FADI greatly outperforms the traditional PCA in terms of running time under different settings. Note that here is the sample size, and the decrease in error rates with increasing and fixed (at the same ratio) is consistent with Corollary 4.2. Similar to Example 1 in Section 5.1, we can see from Figure 7(b) that the running time is large due to the calculation of as approaches 1 from the left, and we do not recommend inference in this regime. Validation of the inferential properties is shown in Figure 7(c) and Figure 7(d).
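As a schematic of the data generation in this example, the following sketch draws equal-sized Gaussian clusters and splits the resulting data matrix vertically into column blocks, mimicking federated storage along the dimension. The cluster means, noise level, dimensions and number of splits are illustrative stand-ins for the elided settings above.

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_mixture_splits(n_per_cluster, means, sigma, m):
    """Generate equal-sized Gaussian clusters around the given means and split
    the resulting data matrix vertically (along the dimension) into m column
    blocks. All inputs here are illustrative choices."""
    X = np.vstack([mu + sigma * rng.standard_normal((n_per_cluster, means.shape[1]))
                   for mu in means])
    return np.array_split(X, m, axis=1)     # list of vertical blocks

means = rng.standard_normal((3, 60))        # 3 clusters in dimension 60
blocks = gaussian_mixture_splits(500, means, sigma=1.0, m=4)
print([b.shape for b in blocks])
```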
Figure 7 (panels omitted): (a) Error Rate; (b) Running Time; (c) Coverage Probability; (d) Q-Q Plot.
A.2 Example 4: Incomplete Matrix Inference
For the true matrix , we consider , take to be the left singular vectors of a pre-generated i.i.d. Gaussian matrix, and take . We consider the distributed setting , set the dimension at respectively, and set and for each setting. Then we generate the entry-wise noise by for , and subsample non-zero entries of with probability . Under each setting, we perform FADI at , and for the computation of . Define with being the asymptotic covariance for defined in Corollary 4.9 and , and empirically calculate the coverage probability, i.e., . Similarly to Section 5.2, for the regime , we refer to Theorem 4.10 and calculate by
Results over 300 Monte Carlo simulations are provided in Figure 8. Figure 8(a) illustrates that the error rate of FADI is almost the same as that of the traditional PCA as gets larger, and Figure 8(b) shows that FADI greatly outperforms the traditional PCA in computational efficiency for large dimension . We can observe from Figure 8(c) that the confidence interval performs poorly at , with the coverage probability equal to 1, which is consistent with the theoretical conditions in Corollary 4.9 for distributional convergence. Figure 8(d) shows the good Gaussian approximation of FADI at , and the results at are consistent with Figure 8(c).
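A minimal sketch of the observation model in this example is given below: entries of a low-rank matrix are observed independently with a given probability, corrupted by entrywise Gaussian noise, and rescaled by the inverse observation probability. The rank, observation probability and noise level are illustrative stand-ins for the elided settings above.

```python
import numpy as np

rng = np.random.default_rng(2)

def observe_low_rank(M_star, obs_prob, noise_sd):
    """Observe each entry of the true low-rank matrix M_star independently with
    probability obs_prob, add entrywise Gaussian noise, and rescale by
    1/obs_prob so that the observed matrix is unbiased for M_star."""
    mask = rng.random(M_star.shape) < obs_prob
    noise = noise_sd * rng.standard_normal(M_star.shape)
    return mask * (M_star + noise) / obs_prob

d, K = 300, 3
U = np.linalg.qr(rng.standard_normal((d, K)))[0]   # left singular vectors
V = np.linalg.qr(rng.standard_normal((d, K)))[0]
M_star = U @ np.diag([10.0, 8.0, 6.0]) @ V.T       # rank-3 truth
Y = observe_low_rank(M_star, obs_prob=0.3, noise_sd=0.5)
```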
Figure 8 (panels omitted): (a) Error Rate; (b) Running Time; (c) Coverage Probability; (d) Q-Q Plot.
A.3 Additional Results for Example 1 in the Genetic Setting
Section 5.1 compares FADI with several existing methods under a relatively large eigengap. In practice, the eigengap of the population covariance matrix may not be large. To assess different methods in a more realistic scenario, we imitate the setting of the 1000 Genomes Data, where we take the number of spikes , and the eigengap to be . We generate the data by , where
The dimension is and the sample size is . Error rates and running times of the different algorithms are compared under different numbers of splits for the sample size . For FADI, we take , and .
Table 6 shows that, as expected, the number of sample splits has little impact on the error rate of FADI, while the error rate of the distributed PCA of Fan et al. [18] increases as increases. FADI is much faster than the other two methods in all the practical settings when the eigengap is small. This suggests that in practical problems where the sample size is large and the eigengap is small, FADI not only enjoys much higher computational efficiency than the existing methods, but also gives stable estimates across different sample splits along the sample size . Although the settings with a small eigengap are of major interest in this section, we still conduct simulations where the eigengap increases gradually to see how it affects the performance of FADI. Table 7 shows that as the eigengap gets larger, the error rate of FADI gets closer to that of the traditional full-sample PCA, whereas the error rate ratio of the distributed PCA to FADI drops below 1 but remains above 0.9 when the eigengap is larger than 1. As for the running time, FADI outperforms the other two methods in all the settings. In summary, when the eigengap grows larger, the performance of the three algorithms becomes similar to what we see in Section 5.1.
Table 6: Error rates and running times (ratios to FADI in parentheses) under different numbers of sample splits.

| | FADI | Traditional PCA | Distributed PCA | Number of splits |
|---|---|---|---|---|
| Error Rate | 2.296 | 1.811 (0.79) | 2.629 (1.15) | 10 |
| | 2.294 | 1.811 (0.79) | 3.412 (1.49) | 20 |
| | 2.294 | 1.811 (0.79) | 3.955 (1.72) | 40 |
| | 2.294 | 1.811 (0.79) | 4.215 (1.84) | 80 |
| Running Time | 5.76 | 983.86 (170.8) | 189.76 (32.9) | 10 |
| | 3.82 | 992.09 (259.8) | 144.18 (37.8) | 20 |
| | 2.86 | 972.47 (339.5) | 119.29 (41.6) | 40 |
| | 2.37 | 968.43 (408.5) | 99.39 (41.9) | 80 |
Table 7: Error rates and running times (ratios to FADI in parentheses) under different eigengaps.

| | FADI | Traditional PCA | Distributed PCA | Eigengap |
|---|---|---|---|---|
| Error Rate | 1.28 | 1.06 (0.82) | 1.57 (1.22) | 0.4 |
| | 0.77 | 0.65 (0.85) | 0.71 (0.92) | 0.8 |
| | 0.48 | 0.42 (0.88) | 0.43 (0.90) | 1.6 |
| | 0.31 | 0.29 (0.92) | 0.29 (0.93) | 3.2 |
| Running Time | 2.76 | 925.15 (334.7) | 115.29 (41.7) | 0.4 |
| | 2.77 | 916.52 (331.4) | 114.76 (41.5) | 0.8 |
| | 2.69 | 922.85 (342.7) | 114.75 (42.6) | 1.6 |
| | 2.77 | 919.20 (332.2) | 115.26 (41.7) | 3.2 |
B Proof of Main Theoretical Results
In this section we provide proofs of the theoretical results in Section 4. For the inferential results, we will present proofs of the theorems under the regime first, which takes into consideration the extra variability caused by the fast sketching, and then give proofs of the theorems under the regime where the fast sketching randomness is negligible.
B.1 Unbiasedness of Fast Sketching With Respect to
We show by the following Lemma B.1 that the fast sketching is unbiased with respect to under proper conditions.
Lemma B.1.
Let be the eigen-decomposition of , and let be the stacked leading eigenvectors of corresponding to the eigenvalues with largest magnitudes. When , we have that , where denotes the column space of the matrix.
Proof.
We will first show that is diagonal. For any , we let , and recall we denote the eigen-decomposition of by . Then conditional on we have
where the second equality is due to the fact that diagonal matrices are commutative, and the last but one equivalence in distribution is due to the fact that . Also we know the top eigenvectors of are , and thus . Hence we have
The above equation holds for any , which suggests that is diagonal and that and share the same set of eigenvectors.
Now under the condition that , for any we denote by the -th column of , and we have
In other words, the corresponding eigenvalue of in is larger than . On the other hand, by Weyl’s inequality [19], the rest of the eigenvalues of should be less than 1/2. Therefore, are still the leading eigenvectors for , and thus . ∎
Recall in Section 4 we discuss that the bias term has the following decomposition . Lemma B.1 shows that as long as and are not too far apart, and will share the same column space. In fact, Lemma B.4 in Section B.2 will show that the probability that and are not sufficiently close converges to 0, and with high probability. With the help of Lemma B.1, we present the proof of the main error bound results in the following section.
B.2 Proof of Theorem 4.1
Recall the problem setting in Section 2. It is not hard to see that we can write , where and . Then is the SVD of .
We begin with bounding . Before delving into the detailed proof, the following two lemmas provide some important properties of the random Gaussian matrix.
Lemma B.2.
Let be a random matrix with i.i.d. standard Gaussian entries, where . For a random variable, recall that we define the norm to be . Then we have the following bound on the norm of the matrix :
(B.23)
Lemma B.3.
Let denote a random matrix with i.i.d. Gaussian entries, where . For any integer such that , there exists a constant such that
(B.24)
The following lemma shows that and are bounded by a small constant with high probability.
Lemma B.4.
If Assumption 1 holds and , there exists a constant such that for any , we have
The proofs of Lemma B.2, Lemma B.3 and Lemma B.4 are deferred to Appendix C. Now we can start with the proof. We first decompose the bias term into two parts,
(B.25)
Term I can be regarded as the variance term, whereas term II is the bias term. We will consider the bias term first.
B.2.1 Control of the Bias Term
We can see that term II can be further decomposed into two terms
(B.26)
We can bound both terms separately. First note that . Thus we have,
where the last but one inequality follows from Lemma B.1, and the last inequality is a result of Lemma B.4. As for the second term on the RHS of (B.26), by Davis-Kahan’s Theorem [45], we have
Therefore, the bound for the bias term is
B.2.2 Control of the Variance Term
Now we move on to control the variance term. Suppose that . Then by Weyl's inequality [19] we have that and . Thus by Davis-Kahan's Theorem [45]
We will bound term III later. Also, similarly to before, note that . Thus by Lemma B.4,
Therefore, we have
Now we move on to bound term III.
where the last but one equality is due to the independence of estimators from different sketches conditional on . By Jensen’s inequality [22], we have
Thus we have
(B.27)
Before bounding the RHS, let us consider the matrix . If does not have full row rank, then the entries will be restricted to a linear space with dimension less than . Since is a standard Gaussian matrix, the probability that has full row rank is 1. Thus, with probability 1, the matrix is of rank , and and the top left singular vectors of span the same column space. In other words, if we let be the left singular vectors of , then .
Now consider the -th singular value of , we let be the SVD of , and we have
where , and . Inequality (i) follows because
and inequality (ii) is because .
Now by Wedin’s Theorem [42] we have the following bound on the RHS of (B.27),
where the last but one inequality is due to Lemma B.3. Therefore, we have the final error rate for the estimator :
Now consider the function , where is a fixed constant. We have
Thus is increasing on , and if we take for some large enough constant , we have that . Then by plugging in and taking , under the condition that for some large enough constant , we have that
and the error rate simplifies to
Now we move on to bound . Since is convex, by Jensen’s inequality [22], under the condition that we have that there exists some constant such that
Thus by Markov’s inequality, we also have
Since is the summation of positive semi-definite matrices by construction, is also positive semi-definite. By Weyl’s inequality [19], we know that and .
Now if we denote the SVD of by , then with probability 1, and share the same column space. By the relationship for and Davis-Kahan’s Theorem [45], we have
Therefore we have,
where the last but one inequality is by Markov’s inequality, i.e.,
Thus by previous results and triangle inequality we have
B.3 Proof of Corollary 4.2
The case-specific error rates can be calculated by computing and studying the proper value of for each example.
Example 1: We know that . Now consider the submatrix of corresponding to the index set , which we denote by . We have , where is the submatrix of composed of the rows in . Then since and , we know that . By Weyl's inequality [19], we know . Thus we have . Then by Lemma 3 in Fan et al., [18], we have that there exists some constant such that for any , we have
where is the effective rank of . Thus we can see that is sub-exponential with
and hence we can take . When , by Theorem 4.1 we have
where the third term will be dominated by the first bias term when taking , and hence (3) holds.
Example 2: Under the problem settings we know that . For the eigenvalues of , under the given conditions we know that
where the last inequality is because for , we have that
Thus we know that .
We then bound the entries of . We know , and thus we have that
Thus we can see that , and . By Theorem 3.1.4 in [12], we know that there exists some constant such that for any ,
Also, since for , there exists a constant such that , we have that , and hence we can take . Besides, , and hence by Theorem 4.1 we have
When
the third term is negligible and (4) holds.
Remark 14.
It’s worth noting that here in Example 2 converges faster than sub-Exponential random variables and with probability at least , which we will take into account in later proofs.
Remark 15.
Under the case where no self-loops are present, is replaced by . With similar arguments we can show that
with probability at least , and hence (4) also holds for the no-self-loops case.
Example 3: From the problem setting we know that we can represent as , where , . Denote , then it can be seen that , and we can write
then we know that . We consider first. We know that , where . Under the given conditions we know that . Since is an i.i.d. Gaussian matrix, by Lemma B.2, we have that
As for , when , by Lemma 3 in Fan et al., [18] we know that , and hence in summary we have
and we can take . We know that , and thus under the condition that for some large enough constant , by Theorem 4.1 we have that
Remark 16.
In fact we can derive a slightly sharper tail bound for the convergence rate of . More specifically, for any , by Lemma 3 in Fan et al., [18] there exists some constant such that
which indicates that with probability at least . Hence under the condition that , with probability at least we have that , which will be used as the statistical rate of in later proofs.
Example 4: We define , then , where is the projection onto the subspace of matrices with non-zero entries only in . Since and differ only by a positive factor, and share exactly the same sequence of eigenvectors and can be viewed as the output of applying FADI to . Thus we will establish the results for instead, and abuse the notation by denoting . We first study the order of . When for some rate (that may change with ), for any , we have that
Thus we have . Also, we can write , where , and for
It is not hard to see that . Also, by the setting of Example 4 we have that , and there exists a constant independent of such that for all . Then we will study and separately. We denote and . Under the condition that for some constant , by Theorem 3.1.4 in Chen et al., [12], there exists a constant such that for any we have
Very similarly for , there exists such that for any , we have
Thus we can see that
By Theorem 4.1, under the condition that , and , it holds that
Furthermore, the third term vanishes when and (6) holds.
Remark 17.
Here we can also obtain a statistical rate sharper than subexponential rate for that would be used in later proofs. Combining the above results for any we have
where are constants. Thus with probability at least .
B.4 Proof of Theorem 4.3
We first bound the recovery probability of for each . Recall that , where .
For the residual term , by Lemma 3 in [18], under the condition that , with probability at least we have . Denote by the event , where is the constant defined in Remark 2. Then conditional on , we have that with probability at least for each . Recall . From Proposition 10.4 in [20], we know that when ,
Therefore, with probability at least ,
By Weyl’s inequality [19], we know that conditional on , with probability at least , for large enough , which indicates that for any . For , under the same event we have
Then we have
We know that conditional on , are i.i.d. Bernoulli variables with expectation and variance . Since the estimators are all integers, we know that if , at least half of are not equal to . Then by Hoeffding’s inequality, we have
We know that for , and under the condition that we have .
B.5 Proof of Corollary 4.4
Example 1: From the proof of Corollary 4.2 we know that we can take . Then by plugging in each term we know that under the condition that and , we have . Besides, under the condition that , we also have . Thus the conditions for Theorem 4.3 are satisfied and we have with probability at least .
Example 2: We know from the proof of Corollary 4.2 that . Also from Remark 14 we know that with probability at least , and thus we have , and . Also recall from the proof of Corollary 4.2 that for any , and hence . By Hoeffding’s inequality [21], we have that
Thus we can see with probability at least , and , and in turn . Thus by Theorem 4.3 the claim follows.
Example 3: We know from the proof of Corollary 4.2 and Remark 16 that and with probability at least . Thus we have . Under the condition that , we know that , and , and thus . By Theorem 4.3 the claim follows.
Example 4: By Hoeffding’s inequality [21], with probability at least we have that . As for , we have
We consider the latter two terms first. We know that for some constant and , for any . Denote by , then we have
and
Thus by Bernstein's inequality [9], conditional on , with probability at least we have that there exists a constant independent of such that
(B.28)
and
(B.29)
Now we consider the first term. Since ’s are i.i.d. Bernoulli random variables with expectation , we have
Also, we know that and , and hence . Then by Bernstein's inequality [9], with probability at least , it holds that
(B.30)
Thus combining (B.28), (B.29) and (B.30) with the fact that with probability at least , under the condition that , with probability at least we have
From the proof of Corollary 4.2 and Remark 17, we know that with probability at least ,
and hence and .
Under the condition that , with probability at least we have . Thus by Theorem 4.3 the claim follows.
B.6 Proof of Theorem 4.10
We first decompose , and we consider the term first.
By Lemma 8 in Fan et al., [18], we have that . Note that in Lemma 8 of Fan et al., [18], the norm is Frobenius norm rather than operator norm, and the modification from Frobenius norm to operator norm is trivial and hence omitted. We first study the leading term .
For a given , we know that is the top left singular vectors of , where
By the “symmetric dilation” trick, we denote
We let be the SVD of , and we know that with probability 1 we have , where is an orthonormal matrix depending on . It is not hard to verify that the eigen-decomposition of is:
where . First we study the eigengap . Recall , and it can be seen that the entries of are i.i.d. standard Gaussian. By Lemma 3 in Fan et al., [18], we know that with probability at least , we have that , and thus with probability at least . Thus under the condition that , under the same high probability event we have that . Now we let be the top right singular vectors of . For we define
Then we have with probability at least . Correspondingly we define the linear mapping
and denote . By Lemma 8 in Fan et al., [18], under the condition that we have
By taking the upper left block of the matrix, we have
Now for , we study . Since , we have . Therefore we have,
and as a result we have
Thus in turn,
For a given , under the condition that , by Lemma 3 in Fan et al., [18] we have that with probability at least , . Combined with previous results on the eigengap , we have that with probability , for a fixed constant
Besides, under Assumption 1, we have that with probability at least , and in turn by Wedin’s Theorem [42], with high probability for all we have that
and thus . Besides, we have
where is the residual matrix with . Now we study the matrix . From previous results we know that with probability at least , for any , and in turn . Now for any vector such that , with probability we have that
Now since we know that the entries of are i.i.d. standard Gaussian, similar as before, under the condition that , by Lemma 3 in Fan et al., [18] we have with high probability that . Therefore, we have the following upper bound on the norm of the leading term
Thus we have the following decomposition
where is a residual matrix with
Thus
Next we consider the term . We denote the SVD of by , and by Weyl’s inequality [19], we know that and . Thus under the condition that , for large enough with high probability we have
Similarly to before, we know that with probability 1 the left singular vector space of and the column space of are the same, where is still a Gaussian test matrix with i.i.d. entries. By Lemma 3 in Fan et al., [18], we have with probability at least , . When , by Wedin's Theorem [42], there exists a constant such that with high probability we have
Denote . Then it can be seen that when
we have that and .
Now for a given , recall that with high probability . Therefore, under the condition that and , we have with probability , , and . Then under Assumption 5, we have
B.7 Proof of Corollary 4.11
To prove Corollary 4.11, it suffices for us to show that Assumptions 1, 2 and 5 are met. From the proof of Corollary 4.2, we know that Assumption 1 is satisfied. We move on to show that Assumption 2 is met. Define as the stacking of eigenvectors for the covariance matrix . Note that is not identifiable under the spiked covariance model and is unique up to orthogonal transformation. Let , and , where . We let be the stacking of eigenvectors for the matrix , and let be the eigenvalues of . Correspondingly, let be the eigenvalues of the sample covariance matrix . Since , we know that and . We define , and denote . Then by the proof of Lemma 6.2 in Wang and Fan, [41], we know that
where and is the -th element of the -th eigenvector of multiplied by for . We let . By Wedin’s Theorem [42] and Lemma 3 in Fan et al., [18], we have that with probability at least , for . If we denote by the stacked top eigenvectors of , and by the stacked top eigenvectors of , then we know that is the -th row of . By Davis-Kahan’s Theorem [45], we also know that there exists an orthonormal matrix such that , and thus
Thus we can write .
Now we take , and from previous results we know that with high probability , such that we have and Assumption 2 is satisfied.
Now we move on to study the statistical rate . For any , we first study the covariance of . We denote , then it is not hard to verify that . Since and share the same eigenvalues, we can study the covariance of instead. Then can be calculated as follows
where and . Then we have
Thus it can be seen that the covariance matrix is block-diagonal:
where and . Then following basic algebra, we can write as:
To study , we will first define as follows
We know that , thus we have
and in summary we have . Now we study . Since the entries of are i.i.d. standard Gaussian, by Lemma 3 in Fan et al., [18], we know that with probability , we have
Therefore, under the condition that we have
As for , by Lemma 3 in Fan et al., [18] with high probability we have that and in turn
Therefore, combining the previous results, we have that by Weyl’s inequality [19], with high probability
Thus we know .
Recall from the proof of Corollary 4.2 with probability we have . Also recall that . Therefore, under the condition that
we have and .
Now we need to verify Assumption 5. It can be seen that the randomness of the leading term comes from both and . We will first establish the results conditional on . In fact, we will first show a more general CLT that will also cover the case of the leading term under the regime . More specifically, we will show that for any matrix that satisfies the following two conditions: (1) ; (2) , where are fixed constants independent of and we abuse the notation by denoting , it holds that
(B.31)
Now for any matrix satisfying the aforementioned conditions, to show that is asymptotically normal, we only need to show that for any with . We can write
We let and . For , we have that . Then we have
Thus
Thus the Lyapunov’s condition is met and (B.31) holds. Then we take , and define the following event
Then from previous results we know that , and under the event we have
Thus from the above proof, for any vector , we have , where is the CDF for . Then we have
Hence we have that Assumption 5 holds and (18) follows. Next we need to show that the result also holds for . From the previous discussion we already know that ; then by Lemma 13 in Chen et al., [13] we have that . Then by Slutsky's Theorem, we have
Finally, we move on to verify the validity of the estimator for the asymptotic covariance matrix. From Lemma 7 in Fan et al., [18], it can be seen that with probability , is orthonormal. When is orthonormal, by Slutsky’s Theorem we have that
where it can be seen that . Therefore, it suffices to show that , and the results will hold by Slutsky’s Theorem. Recall from the proof of Corollary 4.2, we have the following bounds
We will bound the components of respectively. We have
Also, from proof of Theorem 4.10, we have that with high probability
and , where . Then with high probability, for all we have that
and thus by Theorem 3.3 in Stewart, [38], with high probability for all we have that
and in turn we have .
Thus combining the above results, under the condition that , following basic algebra we have
Therefore, by Slutsky’s Theorem, under the event , for any vector , we have that , and thus
Hence the claim follows.
B.8 Proof of Corollary 4.12
We will verify that Assumptions 1, 2, 3 and 5 hold. First, it is not hard to see that there exists some orthonormal matrix such that , where . From the problem setting of Example 3 we also know that there exists a constant such that
and thus that . Then . Thus Assumption 3 holds with .
From the proof of Corollary 4.2 we know that Assumption 1 is satisfied. Besides, recall from Remark 16, under the condition that , with probability at least we have that , which is sharper than . Since , we have and Assumption 2 holds trivially. Now we move on to study the minimum covariance eigenvalue rate . From the proof of Corollary 4.2, we know that
where with being the -th row of , is the -th row of and . Then for , we have
where the last equality is due to the fact that . Now for , we calculate . Following basic algebra, we have that
and thus
Then since , we have that , and hence we have . Then under the condition that and , we have that
Now we move on to check Assumption 5. Similarly to the proof of Corollary 4.11, we will first show the results conditional on by establishing a more general CLT . More specifically, we will show that for any with , and such that and , where are constants independent of and we abuse the notation by denoting , we have . Define . We know that
and we denote
Then we have
Then
Then under the condition that , we have that
Thus the Lyapunov’s condition is met and the CLT holds. Also recall from previous arguments, there exists a fixed constant such that with high probability we have
Then by taking and following similar steps as in the proof of Corollary 4.11, we know that Assumption 5 is satisfied. Then by Theorem 4.10, (17) holds.
We move on to prove (20). It suffices to show that . When and , with high probability we have
Thus (20) holds.
Lastly, we verify the validity of . Similarly to the proof of Corollary 4.11, it suffices to show that . Recall that with high probability .
Also, from the proof of Theorem 4.10, we have that
Then with high probability, for all we have that
and thus by Theorem 3.3 in Stewart, [38], we have that
and in turn we have .
Therefore, under the condition that , we have
Thus the claim follows.
B.9 Proof of Theorem 4.5
We will first decompose . We will show that when is sufficiently large the first two terms are negligible, and we will consider the third term first. We will first study by conducting decomposition of the error term. For the convenience of notations, we let for short. If we define , we can decompose
Under the condition that , we have that is a full-rank orthonormal matrix with probability . Then we have with probability that
From Lemma 7 in Fan et al., [18], we know that , and thus we have
We move on to bound ,
Finally, we consider the term . We can decompose
We bound the three terms separately, with high probability
As for , we have
and finally
where the last inequality is due to the fact that
Thus in summary, we have
Now we move on to bound . By Theorem 4.1, we know that
Finally, we consider . From the proof of Theorem 4.1, we know that
From the proof of Theorem 4.1, we know that with probability converging to 1, there exists some constant such that , and thus that
When we choose to be large enough, i.e.,
we have . Therefore, if we denote
we can write
where . Then under the condition that , we have that . Thus by Assumption 5,
B.10 Proof of Corollary 4.6
We define and the same as in the proof of Corollary 4.11. Then Assumptions 1 and 2 are satisfied, as proven for Corollary 4.11. As for Assumption 5, we have shown in the proof of Corollary 4.11 that under the condition that , the result (B.31) holds for any matrix such that and . Under the regime , the leading term , and by taking , it can be seen that
and if we can show that , we have and Assumption 5 is satisfied. Thus we only need to verify Assumption 4 and the conditions for . Recall from the proof of Corollary 4.11 we have the following rates
and we can further derive that the following bounds hold with high probability
Thus we know and .
From the proof of Corollary 4.11, we know that , where
Similarly to the proof of Corollary 4.11, we will first define as follows
Then following similar arguments as in the proof of Corollary 4.11, we have that
Besides, under the condition that we have
Then we know that and we can take . Thus Assumption 5 holds. Then by plugging in the above rates, we can derive the rate as
Then under the condition that , and , we have , and hence the condition for is satisfied and (8) holds. Also recall from the above proof that , and (9) holds.
Now we verify the validity of . Similarly to the proof of Corollary 4.11, it suffices to show that , and the results will hold by Slutsky's Theorem. From the proof of Corollary 4.11, we have
Also, we know that with high probability
Then we have
Then if we denote , we have that , and thus we have
and furthermore, we have
Then following basic algebra, under the condition that we have
Therefore, by Slutsky’s Theorem, the claim follows.
B.11 Proof of Corollary 4.7
The proof for the case where no self-loops are present is almost identical to the case where there are self-loops except for some modifications. We will first prove the results for the case when self-loops are present, then in the end we will discuss how to modify the proof for the case where self-loops are absent.
We only need to verify that Assumptions 1 to 5 hold. Recall from the proof of Corollary 4.2 that we have , and thus we know that Assumption 1 is satisfied. Also Assumption 2 holds trivially due to the unbiasedness of . We will then verify Assumption 3 holds under the model. We know that and share the same column space, and thus there exists a non-singular matrix such that and . Then we can see that , and . Hence we have . Thus we can see that Assumption 3 is satisfied with .
Now we move on to verify Assumption 4. Recall from the proof of Corollary 4.2 that , , and . By Theorem 4.2.1 in Chen et al., [12], we have that with probability ,
and by the proof of Theorem 4.2.1 in [12], we further have that with probability ,
Thus Assumption 4 is met and now we move on to study the order of . Before we continue with the proof, we state the following elementary lemma that helps study the operator norm of a covariance matrix.
Lemma B.5.
are two random vectors, then we have
and
The proof of Lemma B.5 can be found in Appendix C.4. With the help of Lemma B.5, we first decompose , where is composed of the diagonal and upper triangular entries of and is composed of the off-diagonal lower triangular entries of . Then it can be seen that both and have independent entries. Now for , we can write
Then we study the covariance of the three terms separately. We have
Then we have and
and very similarly we also have . Thus by Lemma B.5, we know that and
Therefore, we can write
Thus we have , and we have . Therefore, when for some constant , and , , we have that
Thus and the condition for the asymptotic covariance matrix is satisfied. Now we need to verify Assumption 5, and, similarly to the proof of Corollary 4.11, we can verify the following more general result.
Given , for any matrix that satisfies the following two conditions: (1); (2) , where and are fixed constants independent of , it holds that
(B.32)
It can be checked from the previous proof that satisfies the two conditions. To show (B.32), we need to show that for any . We will first study the entries of and . It holds that
Then we know that
Then for the diagonal entries we have
and for the off-diagonal entries, when it holds that
Moreover, since , by the Lyapunov’s condition and plugging in , Assumption 5 is met and (8) follows.
Now we only need to verify that the result also holds when replacing by . From previous discussion we learnt that
Then by Slutsky’s Theorem, (11) holds.
Now we verify the validity of . Similarly to the proof of Corollary 4.6, is orthonormal with probability , and we will start by showing that . From the previous discussion we have the following bounds
and
With the help of the above results, we will study the components of separately. In the following proof, we will base the discussion on the event that is orthonormal. We first study . We have that
Then for , we have
It is not hard to see that
and in turn we have the upper bound
Thus we have
Then we move on to study . We have
Then if we denote , we have that , and thus we have
Thus, following basic algebra we have the following bounds
and further, under the condition that , we have
Thus with similar arguments as in the proof of Corollary 4.6, the claim follows.
Remark 18.
The inferential results also hold for the case where self-loops are absent. Recall that under the no-self-loop case, the observed matrix is
where is the error matrix between the adjacency matrix with self-loops and its expectation. We define and denote by its leading eigenvectors. By Weyl’s inequality [19] we know that with probability at least we have that , and hence by Davis-Kahan’s Theorem [45] we have
with probability at least . The verification of Assumptions 1, 3 and 5 when self-loops are present can also be applied to the no-self-loop case. For Assumption 2, we can take and . Then and Assumption 2 is satisfied. As for Assumption 4, by Lemma 7 in Fan et al., [18], we have
With similar arguments as in the self-loop case, for with high probability we have
Then for , with high probability we have that
where in the last two inequalities we use the fact that
with and being the orthogonal complement of and respectively. Since , for large enough we further get
Hence . We also have
and hence we can take . Now to get a sharper rate for , we take into consideration the diagonal structure of and derive the following bound
Then from the proof of Theorem 4.5 we have that
and we are only left to verify the minimum eigenvalue condition of by showing that the order of is the same as when there are self-loops. With the same arguments, we know that
Besides, we also have
Thus we also have , and thereby
Thus we still have for the case where self-loops are absent. The condition for also holds for the no-self-loop case and both (8) and (11) hold. The verification of (12) is almost identical to the self-loop case and is hence omitted.
B.12 Proof of Corollary 4.8
From the proof of Corollary 4.12, we have verified Assumptions 1-3. It can be checked that satisfies the two conditions for the general CLT results in the proof of Corollary 4.12, then under the condition that , Assumption 5 is also satisfied.
Now we move on to check the conditions for . Recall from the proof of Corollary 4.12, we have
Then we have
Besides, it can be seen that , and hence we can take . Next we move on to verify the statistical rates and . By Davis-Kahan’s Theorem [45], we have that with high probability
where as defined in the proof of Corollary 4.12, and thus we know that . Besides, with high probability we have
and we have . Thus Assumption 4 is satisfied. Then we have
Therefore, under the conditions that , and , we have . Thus by Theorem 4.5, (8) holds. As for (13), from the above arguments we have , and hence (13) holds.
Now we need to check the validity of . Similar as before, it suffices for us to prove that . From Corollary 4.8, we have that and . Then we have
Then if we denote , we have that
and thus we have
and furthermore, we have
Combining the above results, we have , and hence (13) holds with replaced by .
B.13 Proof of Corollary 4.9
Recall that and share exactly the same sequence of eigenvectors, and we can treat as the FADI estimator applied to . We will abuse the notation and denote .
To show that (8) holds, we need to verify that Assumptions 1 to 5 hold and the minimum eigenvalue conditions hold for the asymptotic covariance matrix. We know from Corollary 4.2 that Assumption 1 and Assumption 2 are satisfied, and that and . Define , we have from the proof of Corollary 4.2 that and for . From Theorem 4.2.1 in Chen et al., [12], we have that with probability
and thus we know . Besides, by the proof of Theorem 4.2.1 in Chen et al., [12], with probability , we have
and thus . Therefore, Assumption 4 is met and we have
Now we will study the statistical rate . We know that are i.i.d. across and , then by Lemma B.5, with almost identical arguments as in the proof of Corollary 4.7, for we have that , and thus and we have . Therefore, under the condition that and , we have that .
Now we move on to verify Assumption 5. More specifically, we will show that the following results hold:
Given , for any matrix that satisfies the following two conditions: (1); (2) , where and are fixed constants independent of , it holds that
(B.33)
To prove (B.33), it suffices to show that for any . We will first study , and . It holds that
Then we know that
Then for the diagonal entries we have
and for the off-diagonal entries, under the condition it holds that
Moreover, since , by the Lyapunov’s condition, (B.33) holds and Assumption 5 is satisfied by plugging in . By Theorem 4.5, we have that (8) follows.
To show that (15) holds we need to show that . From previous discussion we learnt that
Then by Slutsky’s Theorem, (15) holds.
Lastly, we verify that the distributional convergence still holds when we plug in the estimator . Similarly to the previous proof, it suffices for us to prove that . In the following proof, we will base the discussion on the event that is orthonormal. We will first bound . From the previous discussion we have the following bounds
and
Now we can study . Recall by Hoeffding’s inequality [21], with probability we have that and , and we have that
Then for any , we have
and in turn we have
Now we move on to bound the error of . We know from the setting of Example 4 that ’s are sub-Gaussian with variance proxy of order , and thus
Then for any , we have that
and thus we have that
Also, we have shown that
then we have , and hence
Then following basic algebra we have that with high probability
Then under the condition that , we have that
C Proof of Technical Lemmas
In this section, we provide proofs of the technical lemmas used in the proofs of the main theorems.
C.1 Proof of Lemma B.2
It can be easily seen that
By Lemma 3 in [18], we know that , and thus . Therefore, we have . By Jensen’s inequality, we in turn get .
C.2 Proof of Lemma B.3
By Proposition 10.4 in [20], we know that for any , we have
(C.34)
Since , there exists a constant such that , and thus
(C.35)
Therefore, we have
Since , the claim follows.
C.3 Proof of Lemma B.4
We first consider the probability . Recall the matrix . Now by Jensen’s inequality and Wedin’s Theorem [42], we have
where the last but one inequality is due to Lemma B.3 under the condition that , and the last inequality is due to Lemma B.2. Therefore, by Assumption 1, there exist constants such that
Similarly, we consider the probability . By Assumption 1, there exist constants such that
Therefore, the claim follows.
C.4 Proof of Lemma B.5
We know that , where , and
for . Therefore, we have
Thus we have
D Wedin’s Theorem
Lemma D.1 (Modified Wedin’s Theorem).
Let and be two matrices in (without loss of generality, we assume ), whose SVDs are given respectively by
Here, (resp. ) stand for the singular values of (resp. ) arranged in descending order, (resp. ) denotes the left singular vector associated with the singular value (resp. ), and (resp. ) represents the right singular vector associated with (resp. ). and stand for the top eigenvectors of and , respectively. Then,
(D.36)
and
(D.37)
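For a rough numerical feel for perturbation bounds of this type, the following sketch compares the operator-norm sin-theta distance between the leading left singular subspaces of a low-rank matrix and of its perturbation with the ratio of the perturbation size to the relevant singular-value gap. It illustrates the quantities involved rather than verifying the exact constants in (D.36) and (D.37); all matrices and noise levels are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

def sin_theta_distance(U, U_tilde):
    """Operator-norm sin-theta distance between the column spaces spanned by
    two orthonormal bases U and U_tilde (difference of projection matrices)."""
    return np.linalg.norm(U @ U.T - U_tilde @ U_tilde.T, 2)

n, p, r = 120, 80, 3
A = rng.standard_normal((n, r)) @ rng.standard_normal((r, p))   # rank-r matrix
E = 0.05 * rng.standard_normal((n, p))                          # small perturbation
U = np.linalg.svd(A, full_matrices=False)[0][:, :r]
U_tilde = np.linalg.svd(A + E, full_matrices=False)[0][:, :r]
s = np.linalg.svd(A, compute_uv=False)
gap = s[r - 1]                     # r-th singular value; the (r+1)-th is zero here
print(sin_theta_distance(U, U_tilde), np.linalg.norm(E, 2) / gap)
```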
E Supplementary Figures
We provide in this section additional figures deferred from the main paper.


