A Scalable Approach to Estimating the Rank of High-Dimensional Data
Abstract
A key challenge to performing effective analyses of high-dimensional data is finding a signal-rich, low-dimensional representation. For linear subspaces, this is generally performed by decomposing a design matrix (via eigenvalue or singular value decomposition) into orthogonal components and then retaining those components with sufficient variation. This is equivalent to estimating the rank of the matrix; deciding which components to retain is generally carried out using heuristic or ad-hoc approaches such as plotting the decreasing sequence of the eigenvalues and looking for the “elbow” in the plot. While these approaches have been shown to be effective, a poorly calibrated or misjudged elbow location can result in an overabundance of noise or an under-abundance of signal in the low-dimensional representation, making subsequent modeling difficult. In this article, we propose a latent-space-construction procedure to estimate the rank of the detectable signal space of a matrix by retaining components whose variation is significantly greater than that of random matrices, whose eigenvalues follow a universal Marčenko–Pastur (MP) distribution.
I Introduction and Background
High-dimensional data analysis, in which the number of variables or features in a data set is larger than the number of samples, is common in signal processing, computer vision, and genomic studies (Yang, 1996; White and Smyth, 2005). For these types of data sets, it is generally assumed that there is a low-dimensional and information-rich subspace within the data set (Hoff, 2007). Subsequent methods prescribe a variety of dimension-reduction techniques to find a linear vector subspace that provides a low-dimensional representation of the high-dimensional data and to retain those components of the data carrying predictive (or otherwise significant) information. A standard construction for this class of problems decomposes a data matrix into a low-rank signal matrix plus a noise matrix. Determining the rank is paramount in dimension-reduction problems to avoid over- or under-fitting subsequent models (Carreira-Perpinán, 1997). However, because dimension estimation is challenging and relatively few effective approaches have been proposed, many models take the rank as an analyst-specified parameter based on ill-defined or ad-hoc methods. For example, a standard method of estimating rank is to retain components that explain 70% of total variance, or to plot the decreasing sequence of the matrix’s eigenvalues (i.e., the scree plot) and have an individual pick the point at which the rate of decrease itself drops significantly (i.e., the “elbow”); the eigenvalue index at which the elbow occurs is then taken as the estimated rank (Yong et al., 2013). An alternative approach (jackstraw) implements a statistical test to identify features with a statistically significant association with any linear combination of principal components (PCs) of interest, based on a null distribution constructed by randomly permuting the data set (Chung and Storey, 2015). Although jackstraw is widely applied in dimension estimation, it requires the number of PCs that capture “systematic variation from latent variables” as an input to the algorithm, and the test results from jackstraw only provide a subset of suggested cutoffs from the input PCs rather than a clear dimension-estimation cutoff (Chung and Storey, 2015). Furthermore, jackstraw is computationally intensive and time-consuming when applied to large data sets.
Several attempts have been made previously to solve the rank-estimation problem. Hoff (2007) proposed a Bayesian rank-estimation procedure based on the signal-plus-Gaussian-noise model, providing prior and posterior distributions of the rank and posterior estimates of the right and left singular vectors of the matrix. Posterior inference can be drawn from the expectation, and the rank can be determined by the posterior mode. The major weakness that prohibits application of Hoff’s method to large data sets (e.g., 1,000 nodes) is the computational complexity of the Markov chain Monte Carlo (MCMC) algorithm (Hoff, 2007). In addition, in the Markov chain, Hoff (2007) sampled eigenvalues and the corresponding eigenvectors from the full conditional distribution instead of the marginal distribution. As a result, the probabilities of sampling nonzero eigenvalues do not mix well across different ranks of the mean matrix, because the eigenvalues and the corresponding eigenvectors are dependent (Hoff, 2007). Recently, Luo and Li (2016) introduced a ladle estimator that combines the decreasing sequence of eigenvalues with the increasing pattern of variability of the eigenvector directions. They observed that the bootstrap variability of the eigenvectors decreases as the eigenvalues get further apart and increases as the eigenvalues get closer, and they proposed the ladle estimator based on this observation. They demonstrated the consistency of the ladle estimator and its superior performance in solving various dimension-reduction problems in a low-dimensional setting. While Hoff’s method and the ladle estimator have been shown to be effective in several settings, if the focus is on large or high-dimensional data, these two methods may not be the best choice for rank estimation.
We propose a procedure to estimate the rank of the detectable signal space of a high-dimensional matrix using the signal-plus-noise model, under the assumption that the eigenvalues of the noise matrix follow the universal behavior described in Aparicio et al. (2018). Under this construction, the expected spectrum of the random noise matrix follows a universal MP distribution, which can be compared to the eigenvalues of the observed matrix (Marčenko and Pastur, 1967). In this context, components carrying detectable signal can be characterized as those whose eigenvalues are significantly larger than expected under the MP distribution. This construction provides a method for estimating the dimension of a data set and extracting those components of the data that may not be attributable to randomness. The major contribution of this article is to present this novel and practical method, which results in improved accuracy, robustness, and computational efficiency, especially for high-dimensional data.
The paper is organized as follows: Section II describes the framework setup and provides a distributional signal-to-noise ratio bound. Section III illustrates a procedure for estimating the detectable signal dimension and demonstrates its performance in relation to other standard approaches, including the ladle and Hoff’s estimators. Section IV presents an application of the dimension-estimation method to scRNA-seq data analysis. Section V concludes with a discussion.
II A bound on the detectable signal dimension
II.1 Signal-Plus-Noise Model
Let the data matrix, with mean-centered columns, consist of a low-rank signal matrix plus a random noise matrix. Without loss of generality, we consider the high-dimensional setting in which the number of features exceeds the number of samples. Following the construction of Livan et al. (2018), assume the matrix is formed by samples of random variables drawn from an unknown joint distribution, with independent and identically distributed (i.i.d.) entries having mean zero and finite variance (Götze et al., 2004). The scaled inner product can be written as
(1)
We can specify finding the rank in terms of the eigenvalue decomposition as follows:
(2)
where the columns of the eigenvector matrix are denoted accordingly, and the diagonal matrix has as its diagonal entries the non-negative eigenvalues sorted in non-increasing order. In the high-dimensional setting the matrix is rank deficient, and the remaining eigenvalues equal zero. Thus, we can rewrite the scaled eigenvalues of the matrix in terms of the following approximation:
(3)
There are two challenges in the simplification provided in Equation 3. First, adding the noise term can result in the underestimation of the signal elements. Second, the eigenvectors of the signal and noise components are not generally well-aligned. As a result, we provide rigorous justification below and in Appendix A. Given that the signal and noise are independent, the cross terms converge to 0 at the stated rate as the sample size grows (see Appendix A). The signal term is a rank-deficient diagonal matrix with non-zero entries corresponding to the rank of the signal space. When the signal's magnitude is large compared to the noise, the corresponding leading eigenvectors are aligned with those of the signal in the first dimensions. The noise can therefore be regarded as a perturbation on the signal, and the two will be approximately identical in the leading eigenvectors. The remaining dimensions have eigenvalues close to zero. At the same time, the scaled noise term is a random matrix converging to the identity (see Appendix B) and can be seen as a rotation on the noise matrix.
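To illustrate this perturbation argument numerically, the following R sketch (an illustration only; the rank-3 signal construction, the signal strength, and the 1/n scaling of the Gram matrix are assumptions made here, not notation fixed by the text) compares the leading scaled eigenvalues of a signal-plus-noise matrix with those of the signal alone.

```r
# Illustration of the perturbation argument: when the signal is strong, the
# leading scaled eigenvalues of S + N track those of S alone.
# Assumptions (illustrative): rank-3 signal, N(0, 1) noise, eigenvalues of
# t(M) %*% M / n obtained from squared singular values.
set.seed(5)
n <- 200; p <- 1000; k <- 3
S <- matrix(rnorm(n * k, sd = 5), n, k) %*% matrix(rnorm(k * p), k, p)
N <- matrix(rnorm(n * p), n, p)

scaled_eigs <- function(M) svd(M, nu = 0, nv = 0)$d^2 / nrow(M)

rbind(signal_plus_noise = scaled_eigs(S + N)[1:5],
      signal_only       = scaled_eigs(S)[1:5])
# The first k = 3 values are nearly identical; beyond them the signal-only
# spectrum drops to ~0 while S + N settles near the noise (MP) level.
```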
In the low-dimensional case, we calculate the scaled inner product analogously. Similarly, the scaled eigenvalues can be rewritten as
(4)
The top eigenvalues of the two scaled Gram matrices are the same up to a constant, and the remaining eigenvalues are equal to zero. We focus only on the consistent eigenvalues up to the minimum of the number of samples and features; the eigenvalues in the null space are not considered in this paper. Similarly, we can demonstrate that the cross terms vanish and that the scaled noise term converges to the identity.
II.2 Empirical Spectral Distribution
By providing an estimate of the noise spectrum, we estimate the amount of detectable signal that can be distinguished from noise. We denote the noise Gram matrix and its non-negative eigenvalues accordingly. The joint density of the eigenvalues for Wishart matrices is as follows:
(5)
where the constants take specific values for real entries and for the standard normal distribution (Götze et al., 2004; Livan et al., 2018). When the matrix is rank deficient, we obtain the anti-Wishart ensemble, in which some eigenvalues equal zero. The joint density function of the anti-Wishart ensemble is similar to Equation 5, with part of the matrix elements being random and the remaining elements non-random; the determination of these non-random elements is related to the first rows of the matrix (Livan et al., 2018). The normalization factor can be calculated by Selberg’s integral as
(6)
where the relevant constants are given in Livan et al. (2018); Forrester and Warnaar (2008); Kumar (2019); Dumitriu and Edelman (2005). The empirical spectral distribution is defined as
(7)
As proven in Marčenko and Pastur (1967), Götze et al. (2004), and Livan et al. (2018), the expected and empirical spectral distributions converge in probability to the MP distribution, with density of the form
\[
\rho_{\mathrm{MP}}(\lambda) \;=\; \frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{2\pi\gamma\sigma^{2}\lambda}\,\mathbb{1}_{[\lambda_{-},\lambda_{+}]}(\lambda) \;+\; \left(1-\frac{1}{\gamma}\right)\delta(\lambda)\,\mathbb{1}_{\{\gamma>1\}}, \qquad (8)
\]
where γ denotes the rectangularity ratio, δ(·) the Dirac delta function, and λ± = σ²(1 ± √γ)² (Livan et al., 2018; Götze et al., 2004; Vivo, 2008). Note that the rectangularity ratio is determined by the observed dimensions and is not fixed. When the scaled Gram matrix is full rank, the second term in the density equals zero according to the indicator function; when it is rank deficient, the second term is non-zero only at eigenvalues equal to zero. Eigenvalues in the null space are not discussed here, as we assume that the signal space is smaller than the minimum of the number of samples and features. As a result, the Dirac delta term equals zero in both the high- and low-dimensional settings, which share the same density form:
\[
\rho_{\mathrm{MP}}(\lambda) \;=\; \frac{\sqrt{(\lambda_{+}-\lambda)(\lambda-\lambda_{-})}}{2\pi\gamma\sigma^{2}\lambda}\,\mathbb{1}_{[\lambda_{-},\lambda_{+}]}(\lambda). \qquad (9)
\]
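As a concrete illustration of Equation 9, the following R sketch compares the empirical spectrum of a pure-noise matrix with the MP density; the standard-normal entries, the 1/n scaling of the Gram matrix, the ratio γ = p/n, and σ² = 1 are assumptions chosen for the example rather than notation fixed by the text.

```r
# Sketch: empirical spectrum of a pure-noise matrix vs. the MP density.
# Assumptions (illustrative): N(0, 1) entries, W = t(N) %*% N / n,
# gamma = p / n, and sigma^2 = 1.
set.seed(1)
n <- 200; p <- 100
N <- matrix(rnorm(n * p), nrow = n, ncol = p)
W <- crossprod(N) / n                                   # p x p scaled Gram matrix
lambda <- eigen(W, symmetric = TRUE, only.values = TRUE)$values

gamma   <- p / n
lam_min <- (1 - sqrt(gamma))^2
lam_max <- (1 + sqrt(gamma))^2
mp_density <- function(x) {
  ifelse(x > lam_min & x < lam_max,
         sqrt((lam_max - x) * (x - lam_min)) / (2 * pi * gamma * x),
         0)
}

hist(lambda, breaks = 30, freq = FALSE,
     main = "Empirical spectrum vs. MP density", xlab = "eigenvalue")
curve(mp_density(x), from = lam_min, to = lam_max, add = TRUE, lwd = 2)
```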
II.3 Signal-to-Noise Ratio Bound
The empirical spectral distribution provides an estimate of the eigenvalues of the matrix in both the high- and low-dimensional cases. We focus on the high-dimensional case in the in-text derivation and provide a distributional upper bound on the maximum eigenvalue of the noise matrix to determine the magnitude of signal that can be distinguished from noise. The signal-to-noise ratio (SNR) bound for the low-dimensional case can be derived similarly.
The proof of the SNR bound can be summarized as follows: (i) Provide the distribution of the diagonal elements of the matrix. (ii) Provide the distribution of the off-diagonal elements of the matrix and show that the off-diagonal elements converge to zero. (iii) Bound the eigenvalues of the matrix by the Gershgorin circle theorem (Gerschgorin, 1931); intuitively, each eigenvalue of a square matrix lies within a circle centered at one of the diagonal elements with a radius equal to the sum of the absolute values of the off-diagonal elements of the corresponding column (Gerschgorin, 1931). (iv) Further bound the maximum eigenvalue of the matrix using order statistics (Fréchet, 1927).
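The logic of steps (iii) and (iv) can be checked numerically. The sketch below, which assumes standard-normal noise and a 1/n-scaled Gram matrix purely for illustration, compares the largest eigenvalue of a noise-only matrix with the Gershgorin-style bound formed from the largest diagonal entry and the largest off-diagonal radius.

```r
# Numerical check of the Gershgorin argument on a noise-only matrix.
# Assumptions (illustrative): N(0, 1) entries and W = t(N) %*% N / n.
set.seed(2)
n <- 500; p <- 100
N <- matrix(rnorm(n * p), nrow = n, ncol = p)
W <- crossprod(N) / n

d      <- diag(W)                      # diagonal entries
radius <- rowSums(abs(W)) - abs(d)     # Gershgorin radii (column sums, by symmetry)
gersh_bound <- max(d + radius)         # every eigenvalue lies in some disc
lam_max <- max(eigen(W, symmetric = TRUE, only.values = TRUE)$values)

c(largest_eigenvalue = lam_max,
  max_diagonal       = max(d),
  max_radius         = max(radius),
  gershgorin_bound   = gersh_bound)    # lam_max never exceeds gersh_bound
```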
Assuming that the entries of the noise matrix are i.i.d. random variables drawn from an unknown joint distribution (without a normality assumption) and that the sample size is sufficiently large, we first note that, by the central limit theorem, the diagonal and off-diagonal entries of the matrix converge in distribution to normal distributions as the number of samples and the number of elements in the upper triangular part of the matrix go to infinity.
Theorem II.1.
Let the noise matrix have i.i.d. elements with mean 0, finite variance, and finite fourth moment. The diagonal entries of the matrix converge in distribution to a normal distribution when the sample size is sufficiently large for the central limit effect to hold.
Proof II.2.
See Appendix C.
Theorem II.3.
Let the noise matrix have i.i.d. elements with mean 0, finite variance, and finite fourth moment. The off-diagonal entries of the matrix converge in distribution to a normal distribution when the sample size is sufficiently large for the central limit effect to hold.
Proof II.4.
See Appendix D.
Consider the special case where the entries follow a standard normal distribution. The variance of the main diagonal elements then takes a simple form, and the distribution of the main diagonal elements is rigorously justified in Appendices B and C. Similarly, the distribution of the off-diagonal elements is rigorously justified in Appendix D. As proved in Appendix B, the first moment of the matrix for the normalized noise matrix is the identity matrix, with the off-diagonal entries converging to zero when the sample size is sufficiently large.
Theorems II.1 and II.3 provide the distributions of the diagonal and off-diagonal entries of the matrix. We use results from order statistics to estimate an upper bound on the maximum diagonal element and on the maximum radius of the Gershgorin circle of the matrix. A similar upper bound can be estimated for the diagonal elements and the maximum Gershgorin radius in the low-dimensional case.
Theorem II.5.
Let the main diagonal elements of the matrix be as given in Theorem II.1. The Fisher–Tippett–Gnedenko theorem shows that, for the maximum diagonal element, we have
(10)
We can set the bound of the maximum diagonal element with the following inequality:
(11)
Proof II.6.
See Appendix E.
Theorem II.7.
Let the off-diagonal elements of the matrix be normally distributed with mean 0 and the variance given in Theorem II.3. Let the sum of the absolute values of the off-diagonal elements of each column of the matrix have the corresponding expectation and variance. The Fisher–Tippett–Gnedenko theorem shows that, for the maximum such sum, we have
(12)
where the constants are defined as above. We can set the bound of the maximum column sum with the following inequality:
(13)
The eigenvalues of the matrix can be bounded column-wise by the Gershgorin circle theorem. With the distributional upper bound on the maximum diagonal element and the distributional upper bound on the maximum radius, we have an upper bound on the largest eigenvalue of the matrix, which lies within the Gershgorin discs centered at the diagonal elements, with radii converging at the stated rate.
Proof II.8.
See Appendix F.
Corollary II.9.
Let the noise matrix have i.i.d. elements following a standard normal distribution with mean 0 and variance 1. The diagonal entries of the matrix converge in distribution to a normal distribution. The Fisher–Tippett–Gnedenko theorem shows that, for the maximum diagonal element, we have
(14)
We can set the bound of the maximum diagonal element by
(15)
The off-diagonal entries of the matrix converge in distribution to a normal distribution with mean zero. Let the sum of the absolute values of the off-diagonal elements of each column of the matrix have the corresponding expectation and variance. The Fisher–Tippett–Gnedenko theorem shows that, for the maximum such sum, we have
(16)
where the constants are defined as above. We can set the bound of the maximum column sum with the following inequality:
(17)
Thus, we can set the bound on the largest eigenvalue of the matrix accordingly.
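As a sanity check of the corollary, the sketch below simulates noise-only matrices with standard-normal entries and compares the empirical moments of the diagonal and off-diagonal entries of the scaled Gram matrix with the Gaussian approximations N(1, 2/n) and N(0, 1/n); these particular variances correspond to the 1/n scaling assumed here for illustration and are not taken from the text.

```r
# Sanity check: for N(0, 1) entries and W = t(N) %*% N / n, the diagonal
# entries are approximately N(1, 2/n) and the off-diagonal entries are
# approximately N(0, 1/n) (scaling assumed for illustration).
set.seed(3)
n <- 400; p <- 20; reps <- 200
diag_vals <- c(); offdiag_vals <- c()
for (r in seq_len(reps)) {
  N <- matrix(rnorm(n * p), nrow = n, ncol = p)
  W <- crossprod(N) / n
  diag_vals    <- c(diag_vals, diag(W))
  offdiag_vals <- c(offdiag_vals, W[upper.tri(W)])
}
c(diag_mean = mean(diag_vals),    diag_var = var(diag_vals),    # ~1 and ~2/n
  off_mean  = mean(offdiag_vals), off_var  = var(offdiag_vals)) # ~0 and ~1/n
```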
III Estimating the detectable signal dimension in practice
III.1 Algorithm
Equation 3 shows that the rank of the signal can be estimated by first estimating the scaled eigenvalues and then finding the zero crossing of the difference between the observed eigenvalues and the expected noise eigenvalues, corresponding to the point where any signal can no longer be distinguished from noise (see Figure 1).
As a more concrete example, consider plotting the scaled eigenvalues of a matrix with 100 samples, 500 features, and a true rank equal to 10, together with random samples from the MP distribution, as in Figure 1. The red line plots the scaled eigenvalues of the matrix, while the blue line plots the scaled eigenvalues of a similarly distributed noise-only matrix. The goal is to extract all the signal components whose eigenvalues are significantly larger than the expected eigenvalues from the MP distribution. The dimension of the detectable signal is the first eigenvalue index at which the corresponding eigenvalue is close to the largest eigenvalue sampled from the MP distribution, meaning it can no longer be distinguished from noise. The algorithm is implemented as follows:
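The following R sketch illustrates the idea just described; it is not the package implementation. It compares the scaled eigenvalues of a simulated signal-plus-noise matrix with those of a same-sized noise-only matrix and reports the last index at which the observed eigenvalue clearly exceeds the noise level. The rank-10 signal construction and the 5% threshold are illustrative assumptions.

```r
# Illustrative sketch of the rank-estimation idea (not the package code):
# compare scaled eigenvalues of X with those of a same-sized noise matrix.
set.seed(4)
n <- 100; p <- 500; k_true <- 10

# Simulated signal-plus-noise matrix (construction chosen for illustration).
S <- matrix(rnorm(n * k_true, sd = 4), n, k_true) %*% matrix(rnorm(k_true * p), k_true, p)
X <- S + matrix(rnorm(n * p), n, p)

scaled_eigs <- function(M) {
  svd(M, nu = 0, nv = 0)$d^2 / nrow(M)   # non-zero eigenvalues of t(M) %*% M / n
}

obs_eigs   <- scaled_eigs(X)
noise_eigs <- scaled_eigs(matrix(rnorm(n * p), n, p))   # MP-distributed reference

# Estimated dimension: last index where the observed eigenvalue is clearly
# above the corresponding noise eigenvalue (simple thresholded difference).
diff_eigs <- obs_eigs - noise_eigs
k_hat <- max(which(diff_eigs > 0.05 * max(diff_eigs)))
k_hat   # close to k_true = 10 for this construction
```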
III.2 Implementation
An implementation of the dimension, ladle, and Hoff’s methods as an open-source package for the R programming environment can be found at https://github.com/WenlanzZ/dimension. We estimate the corrected variance of the MP distribution by the SNR bound to align the tails of the scaled eigenvalues of the observed matrix and the noise. The test for deviation of the scaled eigenvalues can be performed in several different ways. In practice, the univariate Bayesian change point (BCP) analysis proposed by Barry and Hartigan (1993) has been found to be effective and is implemented in the R package bcp (Erdman et al., 2007). Frequently, there are several change points whose posterior probabilities approach the maximum posterior probability. In order to guarantee the inclusion of all important signals, we propose a two-step procedure to estimate the dimension in practice. In the first step, we select candidate dimensions by calculating the BCP of the posterior probabilities a second time (the double posterior). In the second step, among those candidates with the highest double posterior, we pick the largest dimension whose posterior probability is larger than a set multiple of the maximum posterior probability (the multiplier determines the range of approximation to the maximum posterior probability and has a default value).
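A minimal sketch of the two-step change-point selection described above is given below, using the bcp package of Erdman et al. (2007). The use of the eigenvalue differences as input, the handling of missing posterior values, and the 0.90 multiplier are assumptions for illustration rather than the package defaults.

```r
# Sketch of the two-step ("double posterior") selection; illustrative only.
library(bcp)

select_dimension <- function(eig_diff, mult = 0.90) {
  # Step 1: change-point posterior probabilities, then BCP on those
  # probabilities a second time (the double posterior).
  post1 <- bcp(eig_diff)$posterior.prob
  post1[is.na(post1)] <- 0
  post2 <- bcp(post1)$posterior.prob
  post2[is.na(post2)] <- 0

  # Step 2: among candidates with the highest double posterior, take the
  # largest index whose single posterior exceeds mult * max posterior.
  candidates <- which(post2 == max(post2))
  keep <- candidates[post1[candidates] >= mult * max(post1)]
  if (length(keep) == 0) keep <- candidates
  max(keep)
}

# Example with the eigenvalue differences from the earlier sketch:
# select_dimension(diff_eigs)
```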
One limitation observed in simulation studies when implementing the dimension algorithm relates to the R package bcp used for calculating the posterior probabilities of the change points. The performance of bcp is not always stable at the edges of the sequence. Thus, in some extreme cases, a spike occurs at the edge after a long, flat pattern in the sequence of posterior probabilities. To address this problem, we adopted an alarm system that detects flat and spike patterns and trims off the sequence before spikes or after a long, flat pattern.
III.3 Simulation
We compared the performance of the proposed rank-estimation procedure to the ladle and Hoff's methods. Consider two signal-plus-noise matrices with known true signal rank. For a fair comparison, the first matrix setting is the same as in Luo and Li (2016) and the second matrix setting is similar to that of Hoff (2007):
(18a)
(18b)
We simulated 1,000 independent samples from each matrix setting and performed rank estimation with the different methods, allowing each method to run for up to 24 hours for the 1,000 simulations. The percentage of correct estimation and the elapsed time in seconds are reported in Table 1. For example, in the first case of the first matrix setting (sigma separated by commas) with signal rank equal to three, 100 samples, and 10 features, dimension had 94.8% correct estimation with a mean absolute error of 0.056 and an average elapsed time of 0.040 seconds; ladle had 93% correct estimation with a mean absolute error of 0.396 and an average elapsed time of 0.034 seconds; Hoff's method had 78.8% correct estimation with a mean absolute error of 0.367 and an average elapsed time of 0.538 seconds. In the sixth case of the second matrix setting (sigma as a single integer) with signal rank equal to three, 10 samples, and 100 features, dimension had 28.3% correct estimation with a mean absolute error of 0.980 and an average elapsed time of 0.041 seconds; ladle had 0% correct estimation with a mean absolute error of 5.930 and an average elapsed time of 0.044 seconds; Hoff's method had 0% correct estimation with a mean absolute error of 7 and an average elapsed time of 1.094 seconds. In Table 1, when we fixed the number of features and let the number of samples grow, the dimension method achieved the highest accuracy and performed stably for both matrix settings, while the ladle method seemed to work only with its own matrix setting. When we fixed the number of samples and let the number of features grow, dimension excelled in both accuracy and computational efficiency. In terms of computational efficiency, the dimension method is more scalable to large data sets than the other methods. In extreme cases where noise overwhelms the signal and the signal space is too small to be detectable, dimension underestimated the true rank with a mean absolute error of around 6. When the signal is moderate in an extremely high-dimensional setting, dimension achieved an accuracy of 91.1% with a mean absolute error of 0.207 and an average elapsed time of 3.767 seconds, while the other two methods were computationally unable to provide an estimate within the time limit.
Table 1. Rank-estimation performance for the dimension, ladle, and Hoff methods over 1,000 simulations: percentage of correct estimation (acc), mean absolute error (mae), and average elapsed time in seconds (time) for each unknown-matrix setting (true rank k, n samples, p features, signal sigma).
k | n | p | sigma | dimension acc | dimension mae | dimension time | ladle acc | ladle mae | ladle time | Hoff acc | Hoff mae | Hoff time
3 | 100 | 10 | 2, 1, 1 | 94.8 | 0.056 | 0.040 | 93 | 0.396 | 0.034 | 78.8 | 0.367 | 0.538 |
3 | 1000 | 10 | 2, 1, 1 | 100 | 0 | 0.058 | 100 | 0 | 0.696 | 96.3 | 0.044 | 49.747 |
3 | 10000 | 10 | 2, 1, 1 | 100 | 0 | 0.070 | 100 | 0 | 15.442 | - | - | >24h |
3 | 10 | 100 | 2, 1, 1 | 25.7 | 1.175 | 0.041 | 0 | 5.828 | 0.043 | 11.4 | 2.727 | 0.591 |
3 | 10 | 1000 | 2, 1, 1 | 21.3 | 1.420 | 0.059 | 0 | 3.057 | 0.524 | - | - | >24h |
3 | 10 | 100 | 6 | 28.3 | 0.980 | 0.041 | 0 | 5.930 | 0.044 | 0 | 7 | 1.094 |
3 | 10 | 1000 | 6 | 18.2 | 1.696 | 0.062 | 0 | 3.123 | 0.493 | 0 | 7 | 116.187 |
3 | 10 | 10000 | 6 | 15.8 | 1.914 | 0.071 | 0 | 3 | 15.211 | - | - | >24h |
10 | 500 | 100 | 2, 2, 2 | 99.6 | 0.006 | 0.268 | 100 | 0 | 2.006 | 99.9 | 0.001 | 50.730 |
10 | 500 | 100 | 2 | 99.9 | 0.000 | 0.284 | 0 | 70 | 15.933 | 92.2 | 0.127 | 23.333 |
10 | 5000 | 100 | 2 | 100 | 0 | 0.614 | - | - | >24h | - | - | >24h |
10 | 50000 | 100 | 2 | 100 | 0 | 3.902 | - | - | >24h | - | - | >24h |
10 | 100 | 500 | 6 | 0.064 | 4.434 | 0.290 | 0 | 10.008 | 3.871 | 23.4 | 1.693 | 24.214 |
10 | 100 | 5000 | 6 | 0 | 6.684 | 0.646 | - | - | >24h | - | - | >24h |
10 | 100 | 50000 | 6 | 0 | 8.553 | 3.728 | - | - | >24h | - | - | >24h |
10 | 100 | 500 | 10 | 85.9 | 0.250 | 0.288 | 0 | 10.971 | 3.914 | 69.2 | 0.689 | 15.457 |
10 | 100 | 5000 | 10 | 0 | 6.314 | 0.699 | - | - | >24h | - | - | >24h |
10 | 100 | 50000 | 100 | 91.1 | 0.207 | 3.767 | - | - | >24h | - | - | >24h |
IV Case Study: Dimension Estimation in scRNA-seq Analysis
The lung scRNA-seq data can be found on Gene Expression Omnibus (GEO) under accession code GSE136831. The data was examined by Adams et al. (2019) to build a single-cell atlas of Idiopathic Pulmonary Fibrosis (IPF). The data contains the gene expression of 312,928 cells profiled from 32 IPF lung samples, 18 chronic obstructive pulmonary disease (COPD) lung samples, and 29 control donor lung samples (Adams et al., 2019). We applied the dimension method to a subgroup of a control lung sample (Patient ID 001C) with 2,000 highly variable genes and 127 cells. Figure 2 shows that, among all the candidates with the maximum double posterior probability (see Figure 2b), the eighth dimension is the largest dimension with the maximum posterior probability of a change point (see Figure 2a).

We then applied the Louvain clustering method implemented in the R package Seurat to the PCs from Dimensions 1-50, 9-50, and 1-8 (see Figures 3a, 3b, 3c). The Uniform Manifold Approximation and Projection (UMAP) for each plot confirms that the major features of cell-type clustering are associated with Dimensions 1-8 and that no new cluster is formed from Dimensions 9-50. Clustering on the identified signal subspace revealed two additional clusters that were not identified using Dimensions 1-50. To validate the clustering results, we identified marker genes in each cluster via differential expression analysis and further matched the clustering results to known cell types using the identified marker genes (see Figure 3c for cell-type identification). Compared to the original identification published by Adams et al. (2019) (see Figure 3d), we found that most of the cell types were correctly annotated with the small subset using Dimensions 1-8, with the exception of the T cells, which were clustered with the NK cells. However, NK cells and T cells are phenotypically similar, and both derive from lymphoid lineage cells. By contrast, when we included the noise dimensions by using Dimensions 1-50, T cells, fibroblasts, and secretory cells were misclassified as one cell type. Therefore, an accurate estimation of the signal dimension is paramount for cell-type identification to avoid clustering with an overabundance of noise or an omission of signal, either of which leads to misleading conclusions.
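For reference, a minimal Seurat workflow for the clustering comparison above might look like the sketch below; the simulated count matrix, the default clustering resolution, and the preprocessing steps are assumptions rather than the exact settings used in the analysis.

```r
# Minimal Seurat sketch of the clustering comparison (illustrative settings).
# A random count matrix stands in for the GSE136831 subset analyzed above.
library(Seurat)

counts <- matrix(rpois(2000 * 127, lambda = 1), nrow = 2000,
                 dimnames = list(paste0("gene", 1:2000), paste0("cell", 1:127)))

obj <- CreateSeuratObject(counts = counts)
obj <- NormalizeData(obj)
obj <- FindVariableFeatures(obj, nfeatures = 2000)
obj <- ScaleData(obj)
obj <- RunPCA(obj, npcs = 50)

# Cluster and embed using only the estimated signal dimensions (1-8).
obj <- FindNeighbors(obj, dims = 1:8)
obj <- FindClusters(obj)             # Louvain clustering by default
obj <- RunUMAP(obj, dims = 1:8)
DimPlot(obj, reduction = "umap", label = TRUE)

# Repeating with dims = 1:50 or 9:50 reproduces the comparison in Figure 3.
```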
V Conclusion
In this article, we posed the problem of dimension estimation in a signal-plus-noise regime as finding the eigenvalue index at which the difference between the spectrum of a data matrix and that of its inherent noise, defined by a random matrix, is close to zero. Based on this regime, we proposed a procedure to estimate the dimension of the detectable signal, along with a derivation of distributional bounds on the magnitude of signal required to be distinguishable from noise. We implemented the proposed procedure and related methods in the open-source R package at https://github.com/WenlanzZ/dimension. In simulation experiments, we demonstrated the accuracy of the dimension method and its advantages in speed and computational efficiency over the ladle and Hoff's methods; both speed and efficiency are essential for large and high-dimensional data sets. In high-dimensional cases with extremely low signal, the dimension method provided a robust low-rank estimate while the other two methods provided either zero rank or full rank. Our results showed that the dimension method is more accurate in estimating the dimension while remaining computationally scalable compared to current methods in a wide range of scenarios under both high- and low-dimensional settings. Besides performance comparisons on simulated data, we also demonstrated that community detection in the identified signal subspace was beneficial for cell-type identification when analyzing scRNA-seq data sets. In addition to applications to genomic data sets, the dimension-estimation method has more general applications, such as digital imaging, speech modelling, and neural networks.
Acknowledgements.
Appendix A: Proof of Equation 3
Here we provide a detailed proof of Equation 3. Based on the central limit theorem, we show that the cross terms converge to zero. Given the symmetry of the argument, we focus on the proof for one of the cross terms. The main technical challenge comes from the case in which the eigenvectors of the signal and noise are not well-aligned. Assuming an extreme worst case in which the signal is a diagonal matrix with finitely many non-zero singular values (the remaining rows are padded with zeros), we show that the result holds in general for finite signal rank. Let the entries of the noise matrix be i.i.d. with mean zero and finite variance. Without loss of generality, we normalize the entries so that they have mean zero and variance one.
(19)
Thus, the convergence holds at the stated rate as the sample size grows.
Appendix B: Derivation of the moments
Without loss of generality, we standardize the columns of the random noise matrix by subtracting the column-wise means and scaling to unit variance. We first derive the first moment of the matrix as follows:
(20)
where the main diagonal entries satisfy
(21)
and the off-diagonal entries satisfy
(22)
for all pairs of distinct indices. Thus we have the first moment of the matrix as
(23)
We then derive the second moment of the matrix as follows:
(24)
where the expectation of main diagonal elements is
(25)
Note that this is the second moment of the chi-square distribution, and the off-diagonal elements have expectation
(26)
where .
Thus we have the second moment of the matrix as
(27)
Appendix C: Proof of Theorem 1
Consider the noise matrix with i.i.d. entries having mean zero, finite variance, and finite fourth moment. Consider a given column of the matrix; we can then write the corresponding main diagonal element of the matrix as
(28)
where the moments are as derived above. When the sample size is sufficiently large for the Lyapunov central limit effect to hold, we have
(29)
Appendix D: Proof of Theorem 2
Consider the noise matrix with i.i.d. entries having mean zero, finite variance, and finite fourth moment. Consider a pair of distinct row and column indices; we can write the corresponding off-diagonal element of the matrix as
(30)
The product of two independent random variables can be written as
(31)
We rewrite this product as
(32)
where the two auxiliary variables are defined accordingly. Their moments are derived as follows:
(33)
(34)
(35)
(36)
In the next step, we show that the two auxiliary variables are independent. Consider the corresponding Gaussian vector and the two linear functions of it defined above. By a standard result in multivariate probability, two linear functions of a Gaussian vector are independent if and only if their covariance is zero. Here,
(37)
Thus, the two variables are independent, and we can calculate the expectation of their product as
(38)
and the variance as
(39)
Thus, for the off-diagonal entry , the expectation is
(40)
and the variance as
(41)
When the sample size is sufficiently large for the Lyapunov central limit effect to hold, we have
(42)
Appendix E: Proof of Theorem 3
Theorem II.1 indicates that the main diagonal entries of the matrix are i.i.d. normally distributed random variables with the stated mean and variance. The Fisher–Tippett–Gnedenko theorem shows that, for the maximum diagonal element,
(43)
The cumulative distribution function of the Generalized Extreme Value (GEV) distribution is
(44)
where .
We can bound the maximum diagonal element with the following inequality:
(45)
Thus, we have
(46)
Appendix F: Proof of Theorem 4
Theorem II.3 indicates that the off-diagonal entries of the matrix are i.i.d. normally distributed random variables with the stated mean and variance. Given the distribution of the off-diagonal elements, we can calculate the sum of the absolute values of the off-diagonal elements of each column of the matrix. The absolute value of a normally distributed variable follows a half-normal distribution, and the off-diagonal elements are independent, so we can derive the expectation of the sum of the absolute off-diagonal values as follows:
(47)
and the variance as
(48)
The Fisher–Tippett–Gnedenko theorem shows that for the maximum entry of
(49)
The cumulative distribution function of the Generalized Extreme Value (GEV) distribution is
(50)
where the constants are defined as above. We can bound the maximum with the following inequality:
(51)
Furthermore, we can bound the eigenvalues of the matrix column-wise by the Gershgorin circle theorem. It is known that every eigenvalue of the matrix lies within at least one of the Gershgorin discs, where each disc is a closed disc centered at a diagonal element with a radius equal to the corresponding sum of absolute off-diagonal values. With the distributional upper bound on the maximum diagonal element and the distributional upper bound on the maximum radius, we have an upper bound on the largest eigenvalue of the matrix, which lies within the Gershgorin discs centered at the diagonal elements, with radii converging at the stated rate.
References
- Adams et al. (2019) Adams, T. S., Schupp, J. C., Poli, S., Ayaub, E. A., Neumark, N., Ahangari, F., Chu, S. G., Raby, B., DeIuliis, G., Januszyk, M., Duan, Q., Arnett, H. A., Siddiqui, A., Washko, G. R., Homer, R., Yan, X., Rosas, I. O., and Kaminski, N. (2019). “Single cell RNA-seq reveals ectopic and aberrant lung resident cell populations in idiopathic pulmonary fibrosis,” bioRxiv, https://www.biorxiv.org/content/early/2019/09/06/759902, doi:10.1101/759902.
- Aparicio et al. (2018) Aparicio, L., Bordyuh, M., Blumberg, A. J., and Rabadan, R. (2018). “Quasi-universality in single-cell sequencing data,” arXiv preprint arXiv:1810.03602 .
- Barry and Hartigan (1993) Barry, D., and Hartigan, J. A. (1993). “A Bayesian analysis for change point problems,” Journal of the American Statistical Association 88(421), 309–319.
- Carreira-Perpinán (1997) Carreira-Perpinán, M. A. (1997). “A review of dimension reduction techniques,” Department of Computer Science. University of Sheffield. Tech. Rep. CS-96-09 9, 1–69.
- Chung and Storey (2015) Chung, N. C., and Storey, J. D. (2015). “Statistical significance of variables driving systematic variation in high-dimensional data,” Bioinformatics 31(4), 545–554.
- Dumitriu and Edelman (2005) Dumitriu, I., and Edelman, A. (2005). “Eigenvalues of Hermite and Laguerre ensembles: Large beta asymptotics,” in Annales de l’IHP Probabilités et statistiques, Vol. 41, pp. 1083–1099.
- Erdman et al. (2007) Erdman, C., Emerson, J. W. et al. (2007). “bcp: An R package for performing a Bayesian analysis of change point problems,” Journal of Statistical Software 23(3), 1–13.
- Forrester and Warnaar (2008) Forrester, P., and Warnaar, S. (2008). “The importance of the Selberg integral,” Bulletin of the American Mathematical Society 45(4), 489–534.
- Fréchet (1927) Fréchet, M. (1927). “Sur la loi de probabilité de l’écart maximum,” Ann. Soc. Math. Polon. 6, 93–116.
- Gerschgorin (1931) Gerschgorin, S. (1931). “On bounding the eigenvalues of a matrix,” Izv. Akad. Nauk. SSSR Otd Mat. Estest 1, 749–754.
- Götze et al. (2004) Götze, F., Tikhomirov, A. et al. (2004). “Rate of convergence in probability to the Marchenko–Pastur law,” Bernoulli 10(3), 503–548.
- Hoff (2007) Hoff, P. D. (2007). “Model averaging and dimension selection for the singular value decomposition,” Journal of the American Statistical Association 102(478), 674–685.
- Kumar (2019) Kumar, S. (2019). “Recursion for the smallest eigenvalue density of β-Wishart–Laguerre ensemble,” Journal of Statistical Physics 175(1), 126–149.
- Livan et al. (2018) Livan, G., Novaes, M., and Vivo, P. (2018). Introduction to random matrices: theory and practice (Springer).
- Luo and Li (2016) Luo, W., and Li, B. (2016). “Combining eigenvalues and variation of eigenvectors for order determination,” Biometrika 103(4), 875–887, https://doi.org/10.1093/biomet/asw051, doi:10.1093/biomet/asw051.
- Marčenko and Pastur (1967) Marčenko, V. A., and Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices,” Mathematics of the USSR-Sbornik 1(4), 457–483, https://doi.org/10.1070/sm1967v001n04abeh001994.
- Vivo (2008) Vivo, P. (2008). “From Wishart to Jacobi ensembles: Statistical properties and applications,” Ph.D. thesis, Brunel University, School of Information Systems, Computing and Mathematics.
- White and Smyth (2005) White, S., and Smyth, P. (2005). “A spectral clustering approach to finding communities in graphs,” in Proceedings of the 2005 SIAM international conference on data mining, SIAM, pp. 274–285.
- Yang (1996) Yang, B. (1996). “Asymptotic convergence analysis of the projection approximation subspace tracking algorithms,” Signal processing 50(1-2), 123–136.
- Yong et al. (2013) Yong, A. G., Pearce, S. et al. (2013). “A beginner’s guide to factor analysis: Focusing on exploratory factor analysis,” Tutorials in quantitative methods for psychology 9(2), 79–94.