
A Scalable Approach to Estimating the Rank of High-Dimensional Data

Wenlan Zang [email protected]    Jen-hwa Chu Also at: Section of Pulmonary, Critical Care and Sleep Medicine (PCCSM), Yale University, New Haven, CT 06510, USA    Michael J. Kane Department of Biostatistics, Yale University, New Haven, CT 06510, USA
Abstract

A key challenge to performing effective analyses of high-dimensional data is finding a signal-rich, low-dimensional representation. For linear subspaces, this is generally done by decomposing a design matrix (via eigenvalue or singular value decomposition) into orthogonal components and then retaining those components with sufficient variation. This is equivalent to estimating the rank of the matrix; deciding which components to retain is generally carried out using heuristic or ad-hoc approaches, such as plotting the decreasing sequence of the eigenvalues and looking for the “elbow” in the plot. While these approaches have been shown to be effective, a poorly calibrated or misjudged elbow location can result in an overabundance of noise or an under-abundance of signal in the low-dimensional representation, making subsequent modeling difficult. In this article, we propose a latent-space-construction procedure to estimate the rank of the detectable signal space of a matrix by retaining components whose variation is significantly greater than that of random matrices, whose eigenvalues follow a universal Marchenko–Pastur (MP) distribution.

I Introduction and Background

High-dimensional data analysis, in which the number of variables or features in a data set is larger than the number of samples, is common in signal processing, computer vision, and genomic studies (Yang, 1996; White and Smyth, 2005). For these types of data sets, it is generally assumed that there is a low-dimensional and information-rich subspace within the data set (Hoff, 2007). Subsequent methods prescribe a variety of dimension-reduction techniques to find a linear vector subspace that provides a low-dimensional representation of the high-dimensional data, retaining those components of the data that carry predictive (or otherwise significant) information. A standard construction for this class of problems decomposes an $n\times p$ matrix $X$ into a rank-$K$ ($K\ll p$) signal matrix $S$ plus a noise matrix $N$. The determination of $K$ is paramount for dimension-reduction problems to avoid over- or under-fitting subsequent models (Carreira-Perpinán, 1997). However, because dimension estimation is challenging and relatively few effective approaches have been proposed, many models take the rank $K$ as an analyst-specified parameter based on ill-defined or ad-hoc methods. For example, a standard method of estimating the rank $K$ is to retain components that explain 70% of the total variance, or to plot the decreasing sequence of the matrix’s eigenvalues (i.e., the scree plot) and have an analyst pick the point at which the rate of decrease itself drops significantly (i.e., the “elbow”) (Yong et al., 2013). The eigenvalue index at which the elbow occurs is denoted as $K$. An alternative approach (jackstraw) implements a statistical test to identify features with a statistically significant association with any linear combination of principal components (PCs) of interest, based on a null distribution constructed by randomly permuting the data set (Chung and Storey, 2015). Although jackstraw is widely applied in dimension estimation, it requires the number of PCs that capture “systematic variation from latent variables” as an input to the algorithm, and the test results from jackstraw only provide a subset of suggested cutoffs from the input PCs instead of a clear dimension-estimation cutoff (Chung and Storey, 2015). Furthermore, jackstraw is computationally intensive and time-consuming when applied to large data sets.

Several attempts have been made previously to solve the rank estimation problem. Hoff (2007) proposed a Bayesian rank-estimation procedure based on the signal-plus-Gaussian-noise model, providing prior and posterior distributions of the rank $K$ and posterior estimates of the right and left singular vectors of the matrix $X$. Posterior inferences can be drawn from the expectation $E_{X}[p(K|X)]$, and the rank can be determined by the posterior mode of $p(K|X)$. The major weakness that prohibits the application of Hoff’s method to large data sets (e.g., 1,000 nodes) is the computational complexity of the Markov chain Monte Carlo (MCMC) algorithm (Hoff, 2007). In addition, in the Markov chain, Hoff (2007) sampled eigenvalues and the corresponding eigenvectors from the full conditional distribution instead of the marginal distribution. As a result, the probabilities of sampling nonzero eigenvalues do not mix well across different ranks of the mean matrix, as the eigenvalues and the corresponding eigenvectors are dependent (Hoff, 2007). Recently, Luo and Li (2016) introduced a ladle estimator that combines the decreasing sequence of eigenvalues with the increasing pattern of variability of the eigenvector directions. They observed that the bootstrap variability of the eigenvectors decreases as the eigenvalues get further apart and increases as the eigenvalues get closer, and they proposed the ladle estimator based on this observation. They demonstrated the consistency of the ladle estimator and its superior performance in solving various dimension-reduction problems in a low-dimensional setting. While Hoff’s method and the ladle estimator have been shown to be effective in several settings, if the focus is on large or high-dimensional data, then these two methods may not be the best choice for rank estimation.

We propose a procedure to estimate the rank of the detectable signal space of a high-dimensional matrix using the signal-plus-noise model, under the assumption that the eigenvalues of the noise matrix follow the universal behavior described in Aparicio et al. (2018). Under this construction, the expected spectrum of the random matrix $N$ follows a universal MP distribution, which can be compared to the eigenvalues of $X$ (Marčenko and Pastur, 1967). In this context, components carrying detectable signal can be characterized as those whose eigenvalues are significantly larger than expected under the MP distribution. This construction provides a method for estimating the dimension of a data set and extracting those components of the data that may not be attributable to randomness. The major contribution of this paper is to present this novel and practical method, which results in improved accuracy, robustness, and computational efficiency, especially for high-dimensional data.

The paper is organized as follows: Section II describes the framework setup and provides a distributional signal-to-noise ratio bound. Section III illustrates a procedure for estimating the detectable signal dimension and demonstrates its performance in relation to other standard approaches, including the ladle and Hoff’s estimators. Section IV presents an application of the dimension-estimation method to scRNA-seq data analysis. Section V concludes with a discussion.

II A bound on the detectable signal dimension

II.1 Signal-Plus-Noise Model

Let $X_{n\times p}$ be a matrix with mean-centered columns consisting of a low-rank matrix $S$ plus a random matrix $N$. Without loss of generality, we assume $p>n$ for high-dimensional data. Following the construction of Livan et al. (2018), assume the matrix $N$ is formed by $n\times p$ samples from $p$ random variables $[x_{1},...,x_{p}]$ drawn from an unknown joint distribution $p(x)$. Let $N_{ij}$ be independent and identically distributed (i.i.d.) entries with $\E(N_{ij})=0$ and $\E(N_{ij}^{2})=\sigma^{2}$ (Götze et al., 2004). The scaled inner product of $X$ can be written as

\frac{1}{p}XX^{T}=\frac{1}{p}(S+N)(S+N)^{T}=\frac{1}{p}SS^{T}+\frac{1}{p}NN^{T}+\frac{1}{p}SN^{T}+\frac{1}{p}NS^{T}\,. \qquad (1)

We can specify finding the rank of $X$ in terms of the eigenvalue decomposition as follows:

XX^{T}=U_{X}\Lambda_{X}U_{X}^{T}\,, \qquad (2)

where $u_{X_{i}}$ denotes the $i$th column of the eigenvector matrix $U_{X}$ and $\Lambda_{X}$ is a diagonal matrix with diagonal entries $\{\lambda_{X_{1}},...,\lambda_{X_{n}}\}$, the non-negative eigenvalues sorted in non-increasing order. When $n\geq p$, $\Lambda_{X}=diag\{\lambda_{X_{1}},...,\lambda_{X_{p}},0,...,0\}$, where $n-p$ eigenvalues equal zero. Thus, we can rewrite the scaled eigenvalues of the matrix $XX^{T}$ in terms of the following approximation:

\frac{1}{p}\Lambda_{X}(X) = \frac{1}{p}U_{X}^{T}\left(XX^{T}\right)U_{X}
= \frac{1}{p}U_{X}^{T}\left(SS^{T}+NN^{T}+SN^{T}+NS^{T}\right)U_{X}
\approx \frac{1}{p}U_{X}^{T}\left(U_{S}\Lambda_{S}(S)U_{S}^{T}+U_{N}\Lambda_{N}(N)U_{N}^{T}\right)U_{X}
\approx \frac{1}{p}\Lambda_{S}(S)+\frac{1}{p}\Lambda_{N}(N)\,. \qquad (3)

There are two challenges in the simplification provided in Equation 3. First, adding $N$ to $S$ can result in underestimation of the elements of $\Lambda_{S}(S)$ when they are estimated from $X$. Second, the eigenvectors of $S$ and $N$ are not generally well aligned ($U_{N}^{T}U_{S}\not\approx I$). As a result, we provide rigorous justification below and in Appendix A. Given that $S$ and $N$ are independent, $\frac{1}{p}SN^{T}$ and $\frac{1}{p}NS^{T}$ converge to 0 at a rate of $\frac{\Tr(S)}{p}$ as $p\to\infty$ (see Appendix A). $\Lambda_{S}(S)$ is a rank-$K$ diagonal matrix with non-zero entries corresponding to the rank of the signal space. When the signal’s magnitude is large compared to the noise, the corresponding vectors in $U_{X}$ are aligned with $U_{S}$ in the first $K$ dimensions. $U_{X}$ can therefore be regarded as a perturbation of the eigenvectors of $SS^{T}$, and $U_{X}^{T}U_{S}$ will be approximately the identity on the $K$ leading eigenvectors. The remaining $(n-K)$ dimensions have eigenvalues close to zero. At the same time, $W_{p}=\frac{1}{p}NN^{T}$ is a random matrix converging to $I$ (see Appendix B), and $U_{X}^{T}NN^{T}U_{X}$ can be seen as a rotation of the noise matrix, implying that $U_{X}^{T}U_{N}\approx I$.
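As an informal numerical check of this approximation (our illustration, not part of the original derivation), the following R sketch builds a rank-3 signal-plus-noise matrix and compares the leading scaled eigenvalues of $XX^{T}$ with the sum of the separately computed signal and noise spectra; all variable names are illustrative.

```r
set.seed(1)
n <- 100; p <- 2000; K <- 3

S <- matrix(rnorm(n * K, sd = 3), n, K) %*% matrix(rnorm(K * p), K, p)  # rank-K signal
N <- matrix(rnorm(n * p), n, p)                                         # i.i.d. noise, sigma^2 = 1
X <- S + N

lam_X <- eigen(tcrossprod(X) / p, symmetric = TRUE, only.values = TRUE)$values
lam_S <- eigen(tcrossprod(S) / p, symmetric = TRUE, only.values = TRUE)$values
lam_N <- eigen(tcrossprod(N) / p, symmetric = TRUE, only.values = TRUE)$values

# Leading eigenvalues of X are approximately signal plus noise;
# the cross terms (1/p) S N^T and (1/p) N S^T are negligible for large p.
round(cbind(lam_X, lam_S + lam_N)[1:6, ], 2)
```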

When $n\geq p$, we instead work with the scaled inner product $W_{n}=\frac{1}{n}X^{T}X$. Similarly, the scaled eigenvalues of $X^{T}X$ can be rewritten as

\frac{1}{n}\Lambda_{X}(X) \approx \frac{1}{n}\Lambda_{S}(S)+\frac{1}{n}\Lambda_{N}(N)\,. \qquad (4)

The top $p$ eigenvalues of $X^{T}X$ and $XX^{T}$ are the same up to the constant $n$ or $p$, and the remaining $n-p$ eigenvalues are equal to zero for $n\geq p$. We focus only on the consistent eigenvalues up to the minimum of $n$ and $p$; the eigenvalues in the null space are not considered in this paper. Similarly, we can demonstrate that $\frac{1}{n}S^{T}N$ and $\frac{1}{n}N^{T}S$ are approximately 0 and that $\frac{1}{n}N^{T}N$ converges to $I$.

II.2 Empirical Spectral Distribution

By providing an estimate of $\Lambda_{N}(N)$, we estimate the amount of detectable signal that can be distinguished from noise in the region where $\Lambda_{X}(X)-\Lambda_{N}(N)\geq 0$. We denote $W=NN^{T}$ and its $n$ non-negative eigenvalues as $[\lambda_{1},\lambda_{2},...,\lambda_{n}]$. The joint density of the eigenvalues for the Wishart matrices is as follows:

p(\lambda_{1},\lambda_{2},...,\lambda_{n}) = Z^{(L)}_{n,\beta}\,e^{-\beta\sigma\sum_{i=1}^{n}\lambda_{i}}\prod_{i=1}^{n}\lambda_{i}^{\frac{\beta}{2}[(p-n+1)-\frac{2}{\beta}]}\prod_{j<k}|\lambda_{j}-\lambda_{k}|^{\beta}
= Z^{(L)}_{n,\beta=1}\,e^{-\frac{1}{2}\sum_{i=1}^{n}\lambda_{i}}\prod_{i=1}^{n}\lambda_{i}^{\frac{1}{2}(p-n-1)}\prod_{j<k}|\lambda_{j}-\lambda_{k}|\,, \qquad (5)

where $\beta=1$ for real entries and $\sigma=\frac{1}{2}$ for the standard normal distribution (Götze et al., 2004; Livan et al., 2018). When $n\geq p$, we define it as the Anti-Wishart ensemble $\tilde{W}$, for which $n-p$ eigenvalues equal zero. The joint density function of the Anti-Wishart ensemble is similar to Equation 5, with part of the matrix elements being random and the remaining elements non-random. The determination of these non-random elements is related to the first $p$ rows of $\tilde{W}$ (Livan et al., 2018). The normalization factor can be calculated by Selberg’s integral as

Z^{(L)}_{n,\beta=1} = \left(\frac{\beta}{2}\right)^{\gamma}\prod_{j=1}^{n}\frac{\Gamma\big(\frac{\beta}{2}+1\big)}{\Gamma\big(\frac{\beta}{2}j+1\big)\Gamma\big(\frac{\beta}{2}(j-1)+\alpha+1\big)}
= \left(\frac{1}{2}\right)^{\frac{np}{2}}\prod_{j=1}^{n}\frac{\Gamma\big(\frac{3}{2}\big)}{\Gamma\big(\frac{j}{2}+1\big)\Gamma\big(\frac{1}{2}(p-n+j)\big)}\,, \qquad (6)

where $\gamma=n[\alpha+\frac{\beta}{2}(n-1)+1]$ and $\alpha=\frac{\beta}{2}(p-n+1)-1$ (Livan et al., 2018; Forrester and Warnaar, 2008; Kumar, 2019; Dumitriu and Edelman, 2005). The empirical spectral distribution of $W_{p}$ is defined as

F_{n}(y)=\frac{1}{n}\sum_{k=1}^{n}I_{\{\lambda_{k}\leq y\}}\,. \qquad (7)

As proven in Marčenko and Pastur (1967); Götze et al. (2004); Livan et al. (2018), the expected spectral distribution $EF_{n}(y)$ and $F_{n}(y)$ converge to the MP distribution in probability, with density

f_{c}(y)=\frac{1}{2\pi\sigma^{2}cy}\sqrt{(y-c_{-})(c_{+}-y)}\,I_{[c_{-},c_{+}]}(y)+I_{[1,\infty)}(c)(1-c^{-1})\delta(y)\,, \qquad (8)

where the rectangularity ratio satisfies $\lim_{n,p\to\infty}(n/p)\to c\in(0,\infty)$. Note that the ratio $n/p$ approximates $c$ and is not fixed. Here $\delta(y)$ denotes the Dirac delta function and $c_{\pm}=\sigma^{2}(1\pm\sqrt{c})^{2}$ (Livan et al., 2018; Götze et al., 2004; Vivo, 2008). When $p>n$, the rectangularity ratio is $c<1$ and the second term in the density equals zero because of the indicator function. When $n\geq p$, the rectangularity ratio is $c\geq 1$ and the second term of the density is non-zero only when the eigenvalues equal zero. Eigenvalues in the null space are not discussed, as we assume that the signal space is smaller than the minimum of $n$ and $p$. As a result, the Dirac delta term vanishes in both the high- and low-dimensional settings, and both settings share the same density form:

f_{c}(y)=\frac{1}{2\pi\sigma^{2}cy}\sqrt{(y-c_{-})(c_{+}-y)}\,I_{[c_{-},c_{+}]}(y)\,. \qquad (9)
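As a small illustration (assumed code, not taken from the paper), the density in Equation 9 can be evaluated in R and compared with the empirical spectrum of a pure-noise matrix:

```r
# Marchenko-Pastur density of Eq. (9), with rectangularity ratio c and variance sigma2
mp_density <- function(y, c, sigma2 = 1) {
  c_minus <- sigma2 * (1 - sqrt(c))^2
  c_plus  <- sigma2 * (1 + sqrt(c))^2
  ifelse(y > c_minus & y < c_plus,
         sqrt(pmax((y - c_minus) * (c_plus - y), 0)) / (2 * pi * sigma2 * c * y),
         0)
}

set.seed(1)
n <- 100; p <- 1000
N   <- matrix(rnorm(n * p), n, p)
lam <- eigen(tcrossprod(N) / p, symmetric = TRUE, only.values = TRUE)$values

# Empirical spectrum of W_p = NN^T / p against the MP density with c = n/p
hist(lam, breaks = 30, freq = FALSE, main = "Spectrum of W_p vs. MP density")
curve(mp_density(x, c = n / p), add = TRUE, col = "blue", lwd = 2)
```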

II.3 Signal-to-Noise Ratio Bound

The empirical spectral distribution provides an estimate of the eigenvalues $\Lambda_{N}(N)$ of the $W_{p}$ matrix in both the high- and low-dimensional cases. We focus on the high-dimensional case where $p>n$ in the main text and provide a distributional upper bound on the maximum eigenvalue of $\Lambda_{N}(N)$ to determine the magnitude of signal that can be distinguished from noise. The signal-to-noise ratio (SNR) bound for the low-dimensional case can be derived similarly.

The proof of the SNR bound can be summarized as follows: (i) provide the distribution of the diagonal elements of the $W_{p}$ matrix; (ii) provide the distribution of the off-diagonal elements of the $W_{p}$ matrix and show that the off-diagonal elements converge to zero; (iii) bound the eigenvalues of the $W_{p}$ matrix by the Gershgorin circle theorem (Gerschgorin, 1931): intuitively, each eigenvalue of the square matrix $W_{p}$ lies within a circle centered at the $j$th diagonal element with a radius equal to the sum of the absolute values of the off-diagonal elements of the $j$th column; and (iv) further bound the maximum eigenvalue of the $W_{p}$ matrix using order statistics (Fréchet, 1927).

Assuming that the entries of the noise matrix $N$ are i.i.d. random variables drawn from an unknown joint distribution (without a normality assumption) and that $p$ is sufficiently large, we first note that, by the central limit theorem, the diagonal and off-diagonal entries of the $W_{p}$ matrix converge in distribution to normal distributions, respectively, as $n$ and $m$ go to infinity, where $m=n(n-1)/2$ is the number of elements in the upper triangular matrix.

Theorem II.1.

Let $N\in\mathcal{R}^{n\times p}$ have i.i.d. elements with mean 0, variance $\sigma^{2}$, and fourth moment $\gamma^{4}<\infty$. The diagonal entries $\xi_{j}$ $(j=1,...,n)$ of the $W_{p}$ matrix converge in distribution to $\mathcal{N}(\sigma^{2},\frac{\gamma^{4}-\sigma^{4}}{p})$ when $p$ is sufficiently large to have the central limit effect.

Proof II.2.

see Appendix C.

Theorem II.3.

Let $N\in\mathcal{R}^{n\times p}$ have i.i.d. elements with mean 0, variance $\sigma^{2}$, and fourth moment $\gamma^{4}<\infty$. The off-diagonal entries $\varepsilon_{ik}$ ($i,k=1,...,n$ and $i<k$) of the $W_{p}$ matrix converge in distribution to $\mathcal{N}(0,\frac{\gamma^{4}+\sigma^{4}}{4p})$ when $p$ is sufficiently large to have the central limit effect.

Proof II.4.

see Appendix D.

Consider the special case where $N$ follows a standard normal distribution. The variance of the main diagonal elements is $\frac{\gamma^{4}-\sigma^{4}}{p}=\frac{3\sigma^{4}-\sigma^{4}}{p}=\frac{2}{p}$, or similarly, from the main diagonal of the variance of $W_{p}$, $\frac{1}{p}(n+1)=\frac{2}{p}$. Thus, the distribution of the main diagonal elements is $\mathcal{N}(1,\frac{2}{p})$, the justification for which is rigorously provided in Appendices B and C. Similarly, the distribution of the off-diagonal elements is $\mathcal{N}(0,\frac{1}{p})$, which is rigorously justified in Appendix D. As proved in Appendix B, the first moment of the $W_{p}$ matrix for the normalized noise matrix is the identity matrix, with the off-diagonal entries converging to zero at a rate of $\frac{1}{\sqrt{p}}$ when $p$ is sufficiently large.
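A quick Monte Carlo check of these two limiting distributions, assuming standard normal noise (our illustration, not part of the derivation):

```r
set.seed(1)
p <- 5000; reps <- 2000

# xi: a diagonal entry of W_p, i.e. the scaled squared norm of one row of N
xi  <- replicate(reps, sum(rnorm(p)^2) / p)
# eps: an off-diagonal entry of W_p, i.e. the scaled inner product of two rows of N
eps <- replicate(reps, sum(rnorm(p) * rnorm(p)) / p)

c(mean(xi),  p * var(xi))    # approximately 1 and 2, matching N(1, 2/p)
c(mean(eps), p * var(eps))   # approximately 0 and 1, matching N(0, 1/p)
```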

Theorems II.1 and II.3 provide the distributions of the diagonal and off-diagonal entries of the $W_{p}$ matrix. We use results from order statistics to estimate an upper bound on the maximum diagonal element and the maximum radius of the Gershgorin circle of the $W_{p}$ matrix. A similar upper bound can be estimated for the diagonal and the maximum radius of the Gershgorin circle of the $W_{n}$ matrix when $n\geq p$.

Theorem II.5.

Let $\xi_{j}$ be the main diagonal elements of $W_{p}$, where $j=1,...,n$, and let $\xi_{(n)}=\max(\xi_{j})$ be the maximum diagonal element. By the Fisher–Tippett–Gnedenko theorem, we have

\E\{\xi_{(n)}>\alpha_{1}\}=e^{-e^{-\sqrt{p}(\alpha_{1}-\sigma^{2})/\sqrt{\gamma^{4}-\sigma^{4}}}}\,. \qquad (10)

We can bound $\xi_{(n)}$ with the inequality $e^{x}\leq e^{x^{2}}+x$ for $x\in\mathbb{R}$, such that

\E\{\xi_{(n)}>\alpha_{1}\}\leq e^{e^{-\frac{2\sqrt{p}(\alpha_{1}-\sigma^{2})}{\sqrt{\gamma^{4}-\sigma^{4}}}}}-e^{-\frac{\sqrt{p}(\alpha_{1}-\sigma^{2})}{\sqrt{\gamma^{4}-\sigma^{4}}}}\,. \qquad (11)
Proof II.6.

see Appendix E.

Theorem II.7.

Let $\varepsilon_{ik}$ be the off-diagonal elements of $W_{p}$, which are normally distributed with mean 0 and variance $\frac{\gamma^{4}+\sigma^{4}}{4p}$, where $i,k=1,...,n$ and $i<k$. Let $R_{k}=\sum_{i\neq k}|\varepsilon_{ik}|$ be the sum of the absolute values of the off-diagonal elements of the $k$th column of the $W_{p}$ matrix, with expectation $\mu_{GEV}=\sqrt{\frac{(n-1)^{2}(\gamma^{4}+\sigma^{4})}{2p\pi}}$ and variance $\sigma^{2}_{GEV}=\frac{(n-1)(\pi-2)(\gamma^{4}+\sigma^{4})}{4p\pi}$. Let $R_{(n)}=\max(R_{k})$ be the maximum such entry. By the Fisher–Tippett–Gnedenko theorem, we have

\E\{R_{(n)}>\alpha_{2}\}=e^{-e^{-b}}\,, \qquad (12)

where $b=\frac{\alpha_{2}-\mu_{GEV}}{\sigma_{GEV}}$. We can bound $R_{(n)}$ with the inequality $e^{x}\leq e^{x^{2}}+x$ for $x\in\mathbb{R}$, such that

e^{-e^{-b}}\leq e^{e^{-2b}}-e^{-b}\,. \qquad (13)

The eigenvalues of the $W_{p}$ matrix can be bounded by the Gershgorin circle theorem. With the distributional upper bound on $\xi_{k}$ and the distributional upper bound on $R_{k}$, we obtain an upper bound on the largest eigenvalue of the $W_{p}$ matrix, which lies within the Gershgorin disc $D(\xi_{(n)},R_{(n)})$ centered at $\xi_{(n)}$ with radius $R_{(n)}$, with a rate of convergence $\sqrt{\frac{n-1}{p}}$ for $p>n$.

Proof II.8.

see Appendix F.

Corollary II.9.

Let $N\in\mathcal{R}^{n\times p}$ have i.i.d. elements following a standard normal distribution with mean 0 and variance 1. The diagonal entries $\xi_{j}$ $(j=1,...,n)$ of the $W_{p}$ matrix converge in distribution to $\mathcal{N}(1,\frac{2}{p})$. Let $\xi_{(n)}=\max(\xi_{j})$ be the maximum diagonal element. By the Fisher–Tippett–Gnedenko theorem, we have

\E\{\xi_{(n)}>\alpha_{1}\}=e^{-e^{-\frac{\sqrt{2p}(\alpha_{1}-1)}{2}}}\,. \qquad (14)

We can bound $\xi_{(n)}$ by

\E\{\xi_{(n)}>\alpha_{1}\}\leq e^{e^{-\sqrt{2p}(\alpha_{1}-1)}}-e^{-\frac{\sqrt{2p}(\alpha_{1}-1)}{2}}\,. \qquad (15)

The off-diagonal entries $\varepsilon_{ik}$ ($i,k=1,...,n$ and $i<k$) of the $W_{p}$ matrix converge in distribution to $\mathcal{N}(0,\frac{1}{p})$. Let $R_{k}=\sum_{i\neq k}|\varepsilon_{ik}|$ be the sum of the absolute values of the off-diagonal elements of the $k$th column of the $W_{p}$ matrix, with expectation $\mu_{GEV}=\sqrt{\frac{2(n-1)^{2}}{p\pi}}$ and variance $\sigma^{2}_{GEV}=\frac{(n-1)(\pi-2)}{p\pi}$. Let $R_{(n)}=\max(R_{k})$ be the maximum such entry. By the Fisher–Tippett–Gnedenko theorem, we have

\E\{R_{(n)}>\alpha_{2}\}=e^{-e^{-b}}\,, \qquad (16)

where $b=\frac{\alpha_{2}-\mu_{GEV}}{\sigma_{GEV}}$. We can bound $R_{(n)}$ with the inequality $e^{x}\leq e^{x^{2}}+x$ for $x\in\mathbb{R}$, such that

e^{-e^{-b}}\leq e^{e^{-2b}}-e^{-b}\,. \qquad (17)

Thus, we can bound the largest eigenvalue of $W_{p}$ by $D(\xi_{(n)},R_{(n)})$.
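As an informal numerical illustration (ours, not part of the proof), the largest eigenvalue of a pure-noise $W_{p}$ can be compared with the Gershgorin-type bound $\xi_{(n)}+R_{(n)}$ and with the MP upper edge; the Gershgorin bound is conservative but valid:

```r
set.seed(1)
n <- 100; p <- 2000
N  <- matrix(rnorm(n * p), n, p)
Wp <- tcrossprod(N) / p

xi_max  <- max(diag(Wp))                          # largest diagonal element
R_max   <- max(rowSums(abs(Wp)) - abs(diag(Wp)))  # largest Gershgorin radius
lam_max <- max(eigen(Wp, symmetric = TRUE, only.values = TRUE)$values)

c(largest_eigenvalue = lam_max,
  gershgorin_bound   = xi_max + R_max,    # conservative upper bound on lam_max
  mp_upper_edge      = (1 + sqrt(n / p))^2)
```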

III Estimating the detectable signal dimension in practice

III.1 Algorithm

Equation 3 shows that the rank of the signal can be estimated by first estimating $\Lambda_{N}(N)$ and then finding the zero crossing of the difference $\Lambda_{X}(X)-\Lambda_{N}(N)$, corresponding to where any signal in the signal space can no longer be distinguished from noise (see Figure 1).

Figure 1: Scaled eigenvalues of $X_{100\times 500}$ with true rank 10 (red dotted line) compared to random samples from the MP distribution (blue dotted line).

As a more concrete example, Figure 1 plots the scaled eigenvalues of an $X$ matrix with 100 samples, 500 features, and a true rank equal to 10, along with random samples from the MP distribution. The red line plots the scaled eigenvalues of the matrix, while the blue line plots the scaled eigenvalues of a similarly distributed noise-only matrix. The goal is to extract all the signal components up to $K$ whose eigenvalues are significantly larger than the expected eigenvalues from the MP distribution. The dimension of the detectable signal is the first eigenvalue index at which the corresponding eigenvalue is close to the largest eigenvalue sampled from the MP distribution, meaning it cannot be distinguished from noise. The algorithm is implemented as follows:

Algorithm 1 Algorithm for rank estimation
1:procedure Algorithm($X_{n\times p}$)
2:     Calculate the eigenvalue decomposition of the scaled matrix $X$ up to dimension $n^{\prime}$ ($n^{\prime}\ll n$). Denote the eigenvalues as $\Sigma_{x}$.
3:     Draw random samples from the MP distribution with a corrected variance. Denote the eigenvalues as $\Sigma_{n}$.
4:     If $\exists\,\Sigma_{n}>\Sigma_{x}$, set $K=0$.
5:     Calculate the posterior probability of change points for the deviation ($\Delta=\Sigma_{x}-\Sigma_{n}$) with a prior equal to a decreasing linear sequence from 0.9 to 0.
6:     Set $K$ equal to the largest dimension with the maximum posterior probability.
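A minimal R sketch of these steps is shown below. It is illustrative only: the released package additionally applies the SNR-based variance correction, the double-posterior step described in Section III.2, and edge-case handling, and the function and variable names here are ours, not the package's interface. The sketch assumes the bcp package for the Bayesian change point step.

```r
# Minimal sketch of Algorithm 1 (illustrative; not the packaged implementation).
library(bcp)   # Bayesian change point analysis (Erdman et al., 2007)

estimate_rank <- function(X) {
  n <- nrow(X); p <- ncol(X)

  # Step 2: scaled eigenvalues of X
  Sigma_x <- eigen(tcrossprod(X) / p, symmetric = TRUE, only.values = TRUE)$values

  # Step 3: reference spectrum of a pure-noise matrix (a stand-in for MP samples)
  N <- matrix(rnorm(n * p), n, p)
  Sigma_n <- eigen(tcrossprod(N) / p, symmetric = TRUE, only.values = TRUE)$values

  # Step 4: no detectable signal if the noise spectrum dominates everywhere
  if (all(Sigma_n >= Sigma_x)) return(0)

  # Steps 5-6: change points in the deviation; K is the largest dimension
  # attaining the maximum posterior probability.  (The decreasing linear
  # prior from 0.9 to 0 used by the package is omitted here.)
  delta <- Sigma_x - Sigma_n
  post  <- bcp(delta)$posterior.prob
  max(which(post >= max(post, na.rm = TRUE) - 1e-12))
}

# Example: a rank-10 signal-plus-noise matrix similar to Figure 1
set.seed(1)
S <- matrix(rnorm(100 * 10, sd = 4), 100, 10) %*% matrix(rnorm(10 * 500), 10, 500)
X <- scale(S + matrix(rnorm(100 * 500), 100, 500), center = TRUE, scale = FALSE)
estimate_rank(X)
```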

III.2 Implementation

An implementation of the dimension, ladle, and Hoff’s methods as an open-source package for the R programming environment can be found at https://github.com/WenlanzZ/dimension. We estimate the corrected variance of the MP distribution from the SNR bound to align the tails of the scaled eigenvalues of $X$ and the noise. The test for deviation of the scaled eigenvalues can be performed in several different ways. In practice, the univariate Bayesian change point (BCP) analysis proposed by Barry and Hartigan (1993) has been found to be effective and is implemented in the R package bcp (Erdman et al., 2007). Frequently, there are several change points whose posterior probabilities approximate the maximum posterior probability. In order to guarantee the inclusion of all important signals, we propose a two-step procedure to estimate the dimension $K$ in practice. In the first step, we select candidates for $K$ by calculating the BCP on the posterior probabilities a second time (the double posterior). In the second step, among those candidates with the highest double posterior, we pick $K$ equal to the largest dimension whose posterior probability is larger than $\delta$ times the maximum posterior probability ($\delta\in\mathbb{R}$ with $0\leq\delta\leq 1$ determines the range of approximation to the maximum posterior probability; by default $\delta=0.90$).
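Continuing the sketch given after Algorithm 1, the two-step double-posterior selection could be written roughly as follows (again illustrative; the packaged implementation may differ in details such as tie and NA handling):

```r
# Two-step selection of K (sketch).  'post' is the change point posterior
# probability vector from bcp(delta); the bcp package is assumed loaded.
select_k <- function(post, delta = 0.90) {
  post[is.na(post)] <- 0

  # Step 1: run bcp on the posterior probabilities themselves ("double posterior")
  dbl <- bcp(post)$posterior.prob
  dbl[is.na(dbl)] <- 0
  candidates <- which(dbl >= max(dbl) - 1e-12)

  # Step 2: among those candidates, keep the largest dimension whose (single)
  # posterior probability is at least delta times the maximum posterior.
  keep <- candidates[post[candidates] >= delta * max(post)]
  if (length(keep) > 0) max(keep) else max(candidates)
}
```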

One limitation observed in simulation studies when implementing the dimension algorithm relates to the R package bcp used for calculating the posterior probabilities of the change points. The performance of bcp is not always stable at the edges of the sequence; in some extreme cases, a spike occurs at the edge after a long, flat pattern in the sequence of posterior probabilities. To address this problem, we adopted an alarm system that detects flat and spike patterns and trims off the sequence before spikes or after a long, flat pattern.

III.3 Simulation

We compared the performance of the proposed rank estimation procedure to the ladle and Hoff’s methods. Consider two signal-plus-noise matrices whose true signal rank is $K=3$. For a fair comparison, the first matrix setting is the same as in Luo and Li (2016) and the second matrix setting is similar to Hoff (2007):

X_{1}\sim N(0,\Sigma_{X_{1}}),\quad \Sigma_{X_{1}}=M+0.54^{2}I_{p},\quad M=diag(2,1,1)+0_{p,p}, \qquad (18a)
X_{2}=S+\epsilon,\quad S[\,\cdot\,,1:K]\sim N(0,\Sigma),\quad S[\,\cdot\,,(K+1):p]=0,\quad \Sigma=3I_{3}. \qquad (18b)
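For reference, the two settings in Equation 18 can be simulated along the following lines (our sketch; the exact code used to produce Table 1 may differ):

```r
set.seed(1)
K <- 3

# Setting (18a): rows of X1 are N(0, Sigma_X1) with Sigma_X1 = M + 0.54^2 I_p
sim_x1 <- function(n, p) {
  Sigma <- diag(0.54^2, p)
  Sigma[1:K, 1:K] <- Sigma[1:K, 1:K] + diag(c(2, 1, 1))
  matrix(rnorm(n * p), n, p) %*% chol(Sigma)   # z %*% U with t(U) %*% U = Sigma
}

# Setting (18b): first K columns of S are N(0, 3), remaining columns zero,
# plus i.i.d. standard normal noise (the noise distribution is assumed here)
sim_x2 <- function(n, p) {
  S <- matrix(0, n, p)
  S[, 1:K] <- matrix(rnorm(n * K, sd = sqrt(3)), n, K)
  S + matrix(rnorm(n * p), n, p)
}

X1 <- sim_x1(100, 10)   # e.g., the first case in Table 1
X2 <- sim_x2(10, 100)   # e.g., the sixth case in Table 1
```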

We simulated 1,000 independent samples of the $X$ matrix and performed rank estimation with the different methods, allowing each method to run for up to 24 hours across the 1,000 simulations. The percentage of correct estimates and the elapsed time in seconds are reported in Table 1. For example, in the first case of the $X_{1}$ matrix (sigma listed as comma-separated values), with signal rank equal to three, 100 samples, and 10 features, dimension had 94.8% correct estimation with a mean absolute error of 0.056 and an average elapsed time of 0.040 seconds; ladle had 93% correct estimation with a mean absolute error of 0.396 and an average elapsed time of 0.034 seconds; and Hoff’s method had 78.8% correct estimation with a mean absolute error of 0.367 and an average elapsed time of 0.538 seconds. In the sixth case of the $X_{2}$ matrix (sigma listed as a single integer), with signal rank equal to three, 10 samples, and 100 features, dimension had 28.3% correct estimation with a mean absolute error of 0.980 and an average elapsed time of 0.041 seconds; ladle had 0% correct estimation with a mean absolute error of 5.930 and an average elapsed time of 0.044 seconds; and Hoff’s method had 0% correct estimation with a mean absolute error of 7 and an average elapsed time of 1.094 seconds. In Table 1, when we fixed $p$ and let $n$ grow in the $n>p$ setting, the dimension method achieved the highest accuracy and performed stably for both the $X_{1}$ and $X_{2}$ matrices, while the ladle method seemed to work only with its own matrix setting. When we fixed $n$ and let $p$ grow in the $p>n$ setting, dimension excelled in both accuracy and computational efficiency. In terms of computational efficiency, the dimension method is more scalable to large data sets than the other methods. In extreme cases where noise overwhelms the signal and the signal space is too small to be detectable, dimension underestimated the true rank with a mean absolute error of around 6. When the signal is moderate in an extremely high-dimensional setting, dimension achieved an accuracy of 91.1% with a mean absolute error of 0.207 and an average elapsed time of 3.767 seconds, while the other two methods were computationally unable to provide an estimate.

Table 1: Simulation results comparing dimension estimation methods under different settings.
true rank   matrix setting              dimension                  ladle                      Hoff
k      n       p       sigma        acc(%)  mae   time(s)     acc(%)  mae   time(s)     acc(%)  mae   time(s)
3 100 10 2, 1, 1 94.8 0.056 0.040 93 0.396 0.034 78.8 0.367 0.538
3 1000 10 2, 1, 1 100 0 0.058 100 0 0.696 96.3 0.044 49.747
3 10000 10 2, 1, 1 100 0 0.070 100 0 15.442 - - >24h
3 10 100 2, 1, 1 25.7 1.175 0.041 0 5.828 0.043 11.4 2.727 0.591
3 10 1000 2, 1, 1 21.3 1.420 0.059 0 3.057 0.524 - - >24h
3 10 100 6 28.3 0.980 0.041 0 5.930 0.044 0 7 1.094
3 10 1000 6 18.2 1.696 0.062 0 3.123 0.493 0 7 116.187
3 10 10000 6 15.8 1.914 0.071 0 3 15.211 - - >24h
10 500 100 2, 2, 2 99.6 0.006 0.268 100 0 2.006 99.9 0.001 50.730
10 500 100 2 99.9 0.000 0.284 0 70 15.933 92.2 0.127 23.333
10 5000 100 2 100 0 0.614 - - >24h - - >24h
10 50000 100 2 100 0 3.902 - - >24h - - >24h
10 100 500 6 0.064 4.434 0.290 0 10.008 3.871 23.4 1.693 24.214
10 100 5000 6 0 6.684 0.646 - - >24h - - >24h
10 100 50000 6 0 8.553 3.728 - - >24h - - >24h
10 100 500 10 85.9 0.250 0.288 0 10.971 3.914 69.2 0.689 15.457
10 100 5000 10 0 6.314 0.699 - - >24h - - >24h
10 100 50000 100 91.1 0.207 3.767 - - >24h - - >24h

IV Case Study: Dimension Estimation in scRNA-seq Analysis

The lung scRNA-seq data can be found on the Gene Expression Omnibus (GEO) under accession code GSE136831. The data were examined by Adams et al. (2019) to build a single-cell atlas of Idiopathic Pulmonary Fibrosis (IPF). The data contain the gene expression of 312,928 cells profiled from 32 IPF lung samples, 18 chronic obstructive pulmonary disease (COPD) lung samples, and 29 control donor lung samples (Adams et al., 2019). We applied the dimension method to a subgroup of a control lung sample (Patient ID 001C) with 2,000 highly variable genes and 127 cells. Figure 2 shows that, among all the candidates with maximum double posterior probabilities (see Figure 2b), the eighth dimension is the largest dimension with the maximum posterior probability of a change point (see Figure 2a).

Figure 2: (a) The posterior means of the deviation and posterior probability plots. (b) The posterior probability and double posterior probability plots.

We then applied the Louvain clustering method implemented in the R package Seurat to the PCs from Dimensions 1-50, 9-50, and 1-8 (see Figures 3a, 3b, and 3c). The Uniform Manifold Approximation and Projection (UMAP) plots confirm that the major features of cell-type clustering are associated with Dimensions 1-8 and that no new cluster is formed from Dimensions 9-50. Clustering on the identified signal subspace identified two additional clusters that are not identified with Dimensions 1-50. To validate the clustering results, we identified marker genes in each cluster via differential expression analysis and further matched the clustering results to known cell types using the identified marker genes (see Figure 3c for cell-type identification). Compared to the original identification published by Adams et al. (2019) (see Figure 3d), we found that most of the cell types were correctly annotated with the small subset using Dimensions 1-8, with the exception of the T cells, which were clustered with the NK cells. However, NK cells and T cells are phenotypically similar, and both derive from lymphoid lineage cells. By contrast, when we included the noise dimensions by clustering on Dimensions 1-50, T cells, fibroblasts, and secretory cells were misclassified as one cell type. Therefore, an accurate estimate of the signal dimension is paramount for cell-type identification, to avoid clustering with an overabundance of noise or an omission of signal, both of which lead to misleading conclusions.

Figure 3: (a) UMAP visualization and cells labeled by Louvain clustering results with dimensions 1-50, (b) 9-50, and (c) 1-8. (d) Cell-type identification marked by Adams et al. (2019).
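For context, the clustering comparison follows a standard Seurat-style workflow, sketched below under the assumption of the usual Seurat interface and a preprocessed object named lung (a hypothetical name, not from the original analysis); exact arguments may differ across Seurat versions.

```r
library(Seurat)

# 'lung' is assumed to be a preprocessed Seurat object for sample 001C
# (2,000 highly variable genes, scaled, with PCA already computed).
cluster_with_dims <- function(obj, dims) {
  obj <- FindNeighbors(obj, dims = dims)
  obj <- FindClusters(obj)
  RunUMAP(obj, dims = dims)
}

lung_1_50 <- cluster_with_dims(lung, 1:50)   # all 50 PCs (signal + noise)
lung_9_50 <- cluster_with_dims(lung, 9:50)   # estimated noise dimensions only
lung_1_8  <- cluster_with_dims(lung, 1:8)    # estimated signal subspace only

DimPlot(lung_1_8, reduction = "umap", label = TRUE)
```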

V Conclusion

In this paper, we posed the problem of dimension estimation in a signal-plus-noise regime as finding the eigenvalue index at which the difference between the spectrum of a data matrix and that of its inherent noise, defined by a random matrix, is close to zero. Based on this regime, we proposed a procedure to estimate the dimension of the detectable signal, along with a derivation of distributional bounds on the magnitude of signal required to be distinguishable from noise. We implemented the proposed procedure and related methods in the open-source R package at https://github.com/WenlanzZ/dimension. In simulation experiments, we demonstrated the accuracy of the dimension method and its advantages in speed and computational efficiency over the ladle and Hoff’s methods, both of which are essential for large and high-dimensional data sets. In high-dimensional cases with extremely low signal, the dimension method provided a robust low-rank estimate, while the other two methods provided either zero rank or full rank. Our results showed that the dimension method is more accurate in estimating the dimension while remaining computationally scalable compared to current methods across a wide range of scenarios in both high- and low-dimensional settings. Besides making performance comparisons with the simulated data, we also demonstrated that community detection in the identified signal subspace was beneficial for cell-type identification when analyzing scRNA-seq data sets. In addition to applications to genomic data sets, the dimension-estimation method has more general applications, such as digital imaging, speech modelling, and neural networks.

Acknowledgements.

Appendix A Appendix: Proof of Equation 3

Here we provide a detailed proof of Equation 3. Based on the central limit theorem, we show that $\frac{1}{p}SN^{T}\approx 0$ and $\frac{1}{p}NS^{T}\approx 0$. Since $NS^{T}=(SN^{T})^{T}$, we focus on the proof that $\frac{1}{p}SN^{T}\approx 0$. The main technical challenge here comes from the case when the eigenvectors of $S$ and $N$ are not well aligned ($U_{N}^{T}U_{S}\not\approx I$). Assume an extreme worst case in which $S$ is a diagonal matrix with finite singular values $(\Sigma_{s})_{i}$ for $i=1,...,p$ ($p-n$ rows are padded with zeros); we show that in general $\frac{1}{p}SN^{T}\not\approx 0$ for finite $p$. Let $N_{ij}$ be i.i.d. entries of the matrix $N$ with $\E(N_{ij})=0$ and $\E(N_{ij}^{2})=\sigma^{2}$. Without loss of generality, we normalize $N_{ij}$ so that it has mean zero and variance one.

\frac{1}{p}SN^{T}=\frac{1}{p}\begin{bmatrix}S_{11}&0&\cdots&0\\ 0&S_{22}&\cdots&0\\ \vdots&\vdots&\ddots&\vdots\\ 0&0&\cdots&S_{pp}\end{bmatrix}\begin{bmatrix}N_{11}&N_{12}&\cdots&N_{1p}\\ N_{21}&N_{22}&\cdots&N_{2p}\\ \vdots&\vdots&\ddots&\vdots\\ N_{p1}&N_{p2}&\cdots&N_{pp}\end{bmatrix}
=\frac{1}{p}\sum_{i=1}^{p}S_{ii}N_{ii}
=\frac{1}{p}\sum_{i=1}^{p}\left(\Sigma_{s}\right)_{i}N_{ii}
\leq\frac{\Tr S}{p}\,. \qquad (19)

Thus we have $\frac{1}{p}SN^{T}\approx 0$ when $p\to\infty$, with convergence rate $\frac{\Tr S}{p}$.
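As an informal numerical companion to this appendix (not part of the proof), the spectral norm of the scaled cross term can be checked directly for a generic rank-$K$ signal:

```r
set.seed(1)
cross_term_norm <- function(n, p, K = 3) {
  S <- matrix(rnorm(n * K), n, K) %*% matrix(rnorm(K * p), K, p)  # rank-K signal
  N <- matrix(rnorm(n * p), n, p)                                 # i.i.d. noise
  svd(S %*% t(N) / p, nu = 0, nv = 0)$d[1]   # spectral norm of (1/p) S N^T
}

# The scaled cross term shrinks toward zero as p grows (n fixed at 50)
sapply(c(200, 2000, 20000), function(p) cross_term_norm(50, p))
```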

Appendix B Appendix: Derive moments for WW

Without loss of generality, we standardize the columns of the random matrix $N_{n\times p}$ ($p>n$) by subtracting the column-wise means and scaling such that $\sigma^{2}=1$. We first derive the first moment of $W$ as follows:

NN^{T}=\sum_{j=1}^{p}N_{.j}N_{.j}^{T}=\begin{bmatrix}\sum_{j=1}^{p}N_{1j}^{2}&\sum_{j=1}^{p}N_{1j}N_{2j}&\cdots&\sum_{j=1}^{p}N_{1j}N_{nj}\\ \sum_{j=1}^{p}N_{2j}N_{1j}&\sum_{j=1}^{p}N_{2j}^{2}&\cdots&\sum_{j=1}^{p}N_{2j}N_{nj}\\ \vdots&\vdots&\ddots&\vdots\\ \sum_{j=1}^{p}N_{nj}N_{1j}&\sum_{j=1}^{p}N_{nj}N_{2j}&\cdots&\sum_{j=1}^{p}N_{nj}^{2}\end{bmatrix}, \qquad (20)

where for the main diagonal entries

\E\sum_{j=1}^{p}N_{.j}^{2}=p\,, \qquad (21)

for $j=1,2,...,p$, and for the off-diagonal entries we have

\sum_{j=1}^{p}\E N_{ij}N_{kj}=\sum_{j=1}^{p}\E N_{ij}\E N_{kj}=0\,, \qquad (22)

for $j=1,2,...,p$ and $i\neq k$.

Thus we have the first moment of the $W_{p}$ matrix:

\E W_{p}=\E\frac{1}{p}NN^{T}=I\,. \qquad (23)

We then derive the second moment of the $W_{p}$ matrix as follows:

NN^{T}NN^{T}=\begin{bmatrix}\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{1k}N_{ik}N_{ij}N_{1j}&\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{1k}N_{ik}N_{ij}N_{2j}&\cdots&\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{1k}N_{ik}N_{ij}N_{nj}\\ \sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{2k}N_{ik}N_{ij}N_{1j}&\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{2k}N_{ik}N_{ij}N_{2j}&\cdots&\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{2k}N_{ik}N_{ij}N_{nj}\\ \vdots&\vdots&\ddots&\vdots\\ \sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{nk}N_{ik}N_{ij}N_{1j}&\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{nk}N_{ik}N_{ij}N_{2j}&\cdots&\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{nk}N_{ik}N_{ij}N_{nj}\end{bmatrix}, \qquad (24)

where the expectation of the $d_{i}$th main diagonal element is

\E\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{d_{i}k}N_{ik}N_{ij}N_{d_{i}j}
=p\E N_{d_{i}j}^{4}+\sum\limits_{i=d_{i},j\neq k}N_{d_{i}k}N_{d_{i}k}N_{d_{i}j}N_{d_{i}j}+\sum\limits_{i\neq d_{i},j=k}N_{d_{i}k}N_{ik}N_{ik}N_{d_{i}k}
=p\E N_{d_{i}j}^{4}+\sum\limits_{i=d_{i},j\neq k}1+\sum\limits_{i\neq d_{i},j=k}1
=p(p+n+1)\,. \qquad (25)

Note that $\E N_{id_{i}}^{4}$ is the second moment of the chi-square distribution, and the off-diagonal elements have expectation

\E\sum\limits_{j=1}^{p}\sum\limits_{i=1}^{n}\sum\limits_{k=1}^{p}N_{d_{i}k}N_{ik}N_{ij}N_{d_{j}j}=0\,, \qquad (26)

where $d_{i}\neq d_{j}$.

Thus we have the second moment of $W_{p}$:

\E W_{p}^{2}=\E\frac{1}{p^{2}}NN^{T}NN^{T}=\left(1+\frac{1}{p}(n+1)\right)I\,. \qquad (27)
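These two moments are easy to verify by simulation for standard normal noise (our illustration):

```r
set.seed(1)
n <- 20; p <- 5000; reps <- 200

W1 <- matrix(0, n, n); W2 <- matrix(0, n, n)
for (r in 1:reps) {
  N  <- matrix(rnorm(n * p), n, p)
  Wp <- tcrossprod(N) / p
  W1 <- W1 + Wp / reps           # Monte Carlo average of W_p
  W2 <- W2 + Wp %*% Wp / reps    # Monte Carlo average of W_p^2
}

max(abs(W1 - diag(n)))                      # E[W_p] is close to I (Eq. 23)
max(abs(W2 - (1 + (n + 1) / p) * diag(n)))  # E[W_p^2] is close to (1 + (n+1)/p) I (Eq. 27)
```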

Appendix C Appendix: Proof of Theorem 1

Consider an $N_{n\times p}$ matrix with $p>n$. Let $N_{ij}$ be i.i.d. entries of $N$ with $\E(N_{ij})=0$, $\E(N_{ij}^{2})=\sigma^{2}$, and $\E(N_{ij}^{4})=\gamma^{4}<\infty$.

Consider the $j$th column of $N$, for $j=1,2,...,p$. Let $\E N_{ij}^{2}=\sigma^{2}$ and $Z_{i}=N_{ij}^{2}$; then we can write the main diagonal elements of the $W_{p}$ matrix as

\xi_{j}=\frac{1}{p}\left<N_{\cdot j},N_{\cdot j}\right>=\frac{1}{p}\sum_{i=1}^{p}Z_{i}\,, \qquad (28)

where $\E Z_{i}=\sigma^{2}$ and $\Var(Z_{i})=\gamma^{4}-\sigma^{4}$. When $p$ is sufficiently large to have the Lyapunov central limit effect, we have

\xi_{j}\rightsquigarrow\mathcal{N}\left(\sigma^{2},\frac{\gamma^{4}-\sigma^{4}}{p}\right)\,. \qquad (29)

Appendix D Appendix: Proof of Theorem 2

Consider an $N_{n\times p}$ matrix with $p>n$. Let $N_{ij}$ be i.i.d. entries of $N$ with $\E(N_{ij})=0$, $\E(N_{ij}^{2})=\sigma^{2}$, and $\E(N_{ij}^{4})=\gamma^{4}<\infty$.

Consider the $i$th and $k$th rows of $N$, for $1\leq i,k\leq n$ and $i<k$. We can write the off-diagonal element of the $W_{p}$ matrix at the $i$th row and $k$th column as

\varepsilon_{ik}=\frac{1}{p}\sum_{j=1}^{p}N_{ij}N_{kj}\,. \qquad (30)

The product of two independent random variables can be written as

N_{ij}N_{kj}=\frac{1}{4}(N_{ij}+N_{kj})^{2}-\frac{1}{4}(N_{ij}-N_{kj})^{2}\,. \qquad (31)

We rewrite $N_{ij}N_{kj}$ as

N_{ij}N_{kj}=\frac{1}{4}(a^{2}-b^{2})\,, \qquad (32)

where $a=N_{ij}+N_{kj}$ and $b=N_{ij}-N_{kj}$. The moments of $a$ and $b$ are derived as follows:

\E a=\E b=\sum_{j=1}^{n}\mu_{j}=0\,, \qquad (33)
\E a^{2}=\E b^{2}=\sum_{j=1}^{n}\mu_{2j}=2\sigma^{2}\,, \qquad (34)
\E a^{4}=\E b^{4}=\sum_{j=1}^{n}\mu_{4j}+6\sum_{j=1}^{n-1}\sum_{k>j}^{n}\sigma_{j}^{2}\sigma_{k}^{2}=2\gamma^{4}+6\sigma^{4}\,, \qquad (35)
\Var a^{2}=\Var b^{2}=2\gamma^{4}+2\sigma^{4}\,. \qquad (36)

In the next step, we show that $a$ and $b$ are independent. Consider the vector $U=\begin{bmatrix}N_{ij}\\ N_{kj}\end{bmatrix}$, $A=\begin{bmatrix}1&1\end{bmatrix}$, and $B=\begin{bmatrix}1&-1\end{bmatrix}$. By a standard result in multivariate probability, two linear functions $AU$ and $BU$ of a Gaussian vector $U$ are independent if and only if $A\Var(U)B^{T}=0$:

A\Var(U)B^{T}=\begin{bmatrix}1&1\end{bmatrix}\begin{bmatrix}\Var(N_{ij})&\Cov(N_{ij},N_{kj})\\ \Cov(N_{ij},N_{kj})&\Var(N_{kj})\end{bmatrix}\begin{bmatrix}1\\ -1\end{bmatrix}=\Var(N_{ij})-\Var(N_{kj})=0\,. \qquad (37)

Thus, $a$ and $b$ are independent, and we can calculate the expectation of $N_{ij}N_{kj}$ as

\E N_{ij}N_{kj}=\frac{1}{4}\E\left(a^{2}-b^{2}\right)=0\,, \qquad (38)

and the variance as

\Var N_{ij}N_{kj}=\frac{1}{16}\Var\left(a^{2}-b^{2}\right)=\frac{\gamma^{4}+\sigma^{4}}{4}\,. \qquad (39)

Thus, for the off-diagonal entry $\varepsilon_{ik}$, the expectation is

\E\varepsilon_{ik}=\frac{1}{p}\E\sum_{j=1}^{p}N_{ij}N_{kj}=0\,, \qquad (40)

and the variance is

\Var\varepsilon_{ik}=\frac{1}{p^{2}}\Var\sum_{j=1}^{p}N_{ij}N_{kj}=\frac{\gamma^{4}+\sigma^{4}}{4p}\,. \qquad (41)

When $p$ is sufficiently large to have the Lyapunov central limit effect, we have

\varepsilon_{ik}\rightsquigarrow\mathcal{N}\left(0,\frac{\gamma^{4}+\sigma^{4}}{4p}\right)\,. \qquad (42)

Appendix E Appendix: Proof of Theorem 3

Theorem II.1 indicates that the main diagonal entries $\xi_{j}$ $(j=1,...,n)$ of $W_{p}$ are i.i.d. normally distributed random variables with mean $\sigma^{2}$ and variance $\frac{\gamma^{4}-\sigma^{4}}{p}$. The Fisher–Tippett–Gnedenko theorem shows that for the maximum diagonal element

\xi_{(n)}=\max(\xi_{j})\sim\mathbb{GEV}\left(\sigma^{2},\frac{\sqrt{p(\gamma^{4}-\sigma^{4})}}{p},0\right)\,. \qquad (43)

The cumulative distribution function of the Generalized Extreme Value (GEV) distribution is

\Phi(\alpha_{1})=\E\{\xi_{(n)}>\alpha_{1}\}=e^{-e^{-a}}\,, \qquad (44)

where $a={\sqrt{p}(\alpha_{1}-\sigma^{2})}/{\sqrt{\gamma^{4}-\sigma^{4}}}$.

We can bound the maximum diagonal element $\xi_{(n)}$ with the inequality $e^{x}\leq e^{x^{2}}+x$ for $x\in\mathbb{R}$, such that

e^{-e^{-a}}\leq e^{e^{-2a}}-e^{-a}\,. \qquad (45)

Thus, we have

\E\{\xi_{(n)}>\alpha_{1}\}\leq e^{e^{-\frac{2\sqrt{p}(\alpha_{1}-\sigma^{2})}{\sqrt{\gamma^{4}-\sigma^{4}}}}}-e^{-\frac{\sqrt{p}(\alpha_{1}-\sigma^{2})}{\sqrt{\gamma^{4}-\sigma^{4}}}}\,. \qquad (46)

Appendix F Appendix: Proof of Theorem 4

Theorem II.3 indicates that the off-diagonal entries $\varepsilon_{ik}$ ($i,k=1,...,n$ and $i<k$) are i.i.d. normally distributed random variables with mean 0 and variance $\frac{\gamma^{4}+\sigma^{4}}{4p}$. Given the distribution of the off-diagonal elements of the $W_{p}$ matrix, we can calculate the sum of the absolute values of the off-diagonal elements of the $k$th column of the $W_{p}$ matrix as $R_{k}=\sum_{i\neq k}|\varepsilon_{ik}|$. The absolute value of $\varepsilon_{ik}$ follows a half-normal distribution and the off-diagonal elements are independent, so we can derive the expectation of the sum of the absolute off-diagonal values as

\mu_{GEV}=\E R_{k}=\sum_{i\neq k}\E|\varepsilon_{ik}|=(n-1)\sqrt{\frac{2(\gamma^{4}+\sigma^{4})}{4p\pi}}=\sqrt{\frac{(n-1)^{2}(\gamma^{4}+\sigma^{4})}{2p\pi}}\,, \qquad (47)

and the variance as

\sigma^{2}_{GEV}=\Var R_{k}=\sum_{i\neq k}\Var|\varepsilon_{ik}|=\frac{(n-1)(\pi-2)(\gamma^{4}+\sigma^{4})}{4p\pi}\,. \qquad (48)

The Fisher–Tippett–Gnedenko theorem shows that for the maximum entry of $R_{k}$,

R_{(n)}=\max(R_{k})\sim\mathbb{GEV}(\mu_{GEV},\sigma_{GEV},0)\,. \qquad (49)

The cumulative distribution function of the Generalized Extreme Value (GEV) distribution is

\Phi(\alpha_{2})=\E\{R_{(n)}>\alpha_{2}\}=e^{-e^{-b}}\,, \qquad (50)

where $b=\frac{\alpha_{2}-\mu_{GEV}}{\sigma_{GEV}}$. We can bound $R_{(n)}$ with the inequality $e^{x}\leq e^{x^{2}}+x$ for $x\in\mathbb{R}$, such that

e^{-e^{-b}}\leq e^{e^{-2b}}-e^{-b}\,. \qquad (51)

Furthermore, we can bound the eigenvalues of the $W_{p}$ matrix by the Gershgorin circle theorem: every eigenvalue of the $W_{p}$ matrix lies within at least one of the Gershgorin discs $D(\xi_{k},R_{k})$, where $D(\xi_{k},R_{k})\subseteq\mathbb{C}$ is a closed disc centered at the $k$th diagonal element of $W_{p}$ (denoted $\xi_{k}$) with radius $R_{k}$. With the distributional upper bound on $\xi_{k}$ and the distributional upper bound on $R_{k}$, we obtain an upper bound on the largest eigenvalue of $W_{p}$, which lies within the Gershgorin disc $D(\xi_{(n)},R_{(n)})$ centered at $\xi_{(n)}$ with radius $R_{(n)}$, with a rate of convergence $\sqrt{\frac{n-1}{p}}$ for $p>n$.

References

  • Adams et al. (2019) Adams, T. S., Schupp, J. C., Poli, S., Ayaub, E. A., Neumark, N., Ahangari, F., Chu, S. G., Raby, B., DeIuliis, G., Januszyk, M., Duan, Q., Arnett, H. A., Siddiqui, A., Washko, G. R., Homer, R., Yan, X., Rosas, I. O., and Kaminski, N. (2019). “Single cell rna-seq reveals ectopic and aberrant lung resident cell populations in idiopathic pulmonary fibrosis,” bioRxiv https://www.biorxiv.org/content/early/2019/09/06/759902, doi:10.1101/759902.
  • Aparicio et al. (2018) Aparicio, L., Bordyuh, M., Blumberg, A. J., and Rabadan, R. (2018). “Quasi-universality in single-cell sequencing data,” arXiv preprint arXiv:1810.03602 .
  • Barry and Hartigan (1993) Barry, D., and Hartigan, J. A. (1993). “A bayesian analysis for change point problems,” Journal of the American Statistical Association 88(421), 309–319.
  • Carreira-Perpinán (1997) Carreira-Perpinán, M. A. (1997). “A review of dimension reduction techniques,” Department of Computer Science. University of Sheffield. Tech. Rep. CS-96-09 9, 1–69.
  • Chung and Storey (2015) Chung, N. C., and Storey, J. D. (2015). “Statistical significance of variables driving systematic variation in high-dimensional data,” Bioinformatics 31(4), 545–554.
  • Dumitriu and Edelman (2005) Dumitriu, I., and Edelman, A. (2005). “Eigenvalues of hermite and laguerre ensembles: large beta asymptotics,” in Annales de l’IHP Probabilités et statistiques, Vol. 41, pp. 1083–1099.
  • Erdman et al. (2007) Erdman, C., Emerson, J. W. et al. (2007). “bcp: an r package for performing a bayesian analysis of change point problems,” Journal of Statistical Software 23(3), 1–13.
  • Forrester and Warnaar (2008) Forrester, P., and Warnaar, S. (2008). “The importance of the selberg integral,” Bulletin of the American Mathematical Society 45(4), 489–534.
  • Fréchet (1927) Fréchet, M. (1927). “Sur la loi de probabilité de l’écart maximum,” Ann. Soc. Math. Polon. 6, 93–116.
  • Gerschgorin (1931) Gerschgorin, S. (1931). “On bounding the eigenvalues of a matrix,” Izv. Akad. Nauk. SSSR Otd Mat. Estest 1, 749–754.
  • Götze et al. (2004) Götze, F., Tikhomirov, A. et al. (2004). “Rate of convergence in probability to the marchenko-pastur law,” Bernoulli 10(3), 503–548.
  • Hoff (2007) Hoff, P. D. (2007). “Model averaging and dimension selection for the singular value decomposition,” Journal of the American Statistical Association 102(478), 674–685.
  • Kumar (2019) Kumar, S. (2019). “Recursion for the smallest eigenvalue density of β\beta-wishart–laguerre ensemble,” Journal of Statistical Physics 175(1), 126–149.
  • Livan et al. (2018) Livan, G., Novaes, M., and Vivo, P. (2018). Introduction to random matrices: theory and practice (Springer).
  • Luo and Li (2016) Luo, W., and Li, B. (2016). “Combining eigenvalues and variation of eigenvectors for order determination,” Biometrika 103(4), 875–887, doi:10.1093/biomet/asw051.
  • Marčenko and Pastur (1967) Marčenko, V. A., and Pastur, L. A. (1967). “Distribution of eigenvalues for some sets of random matrices,” Mathematics of the USSR-Sbornik 1(4), 457–483, doi:10.1070/SM1967v001n04ABEH001994.
  • Vivo (2008) Vivo, P. (2008). “From wishart to jacobi ensembles: statistical properties and applications,” Ph.D. thesis, Brunel University, School of Information Systems, Computing and Mathematics.
  • White and Smyth (2005) White, S., and Smyth, P. (2005). “A spectral clustering approach to finding communities in graphs,” in Proceedings of the 2005 SIAM international conference on data mining, SIAM, pp. 274–285.
  • Yang (1996) Yang, B. (1996). “Asymptotic convergence analysis of the projection approximation subspace tracking algorithms,” Signal processing 50(1-2), 123–136.
  • Yong et al. (2013) Yong, A. G., Pearce, S. et al. (2013). “A beginner’s guide to factor analysis: Focusing on exploratory factor analysis,” Tutorials in quantitative methods for psychology 9(2), 79–94.