Asymptotic locations of bounded and unbounded eigenvalues of sample correlation matrices of certain factor models
– application to a components retention rule
Abstract
Let the dimension of data and the sample size tend to infinity with their ratio converging to a positive constant. The spectral properties of a sample correlation matrix and a sample covariance matrix are asymptotically equal whenever the population correlation matrix is bounded (El Karoui, 2009). We demonstrate this also for general linear models with an unbounded population correlation matrix, by examining the behavior of the singular values of multiplicatively perturbed matrices. By this, we establish: Given a factor model of an idiosyncratic noise variance and a rank- factor loading matrix whose rows all have common Euclidean norm . Then, the th largest eigenvalues of satisfy almost surely: (1) diverges, (2) for the th largest singular value of , and (3) for . Whenever is much larger than , the broken-stick rule (Frontier, 1976; Jackson, 1993), which estimates by a random partition (Holst, 1980) of , tends to (a.s.). We also provide a natural factor model where the rule tends to the “essential rank” of (a.s.), which is smaller than .
Keywords: unbounded eigenvalues, the largest bounded eigenvalue, high-dimensional modeling, multiplicative matrix perturbation, correlation
1 Introduction
In multivariate analysis (Anderson, 2003; Fujikoshi et al., 2011; Muirhead, 2009), covariance matrices are important objects, for example, in principal component analysis (PCA). Motivated by this, research on sample covariance matrices has a long history and an abundance of results have been established, e.g. various Wishart distributions.
Several well-known methods in multivariate analysis, however, become inefficient or even misleading when the data dimension is large. To deal with such large-dimensional data, a novel approach in asymptotic statistics has been developed where the data dimension is no longer fixed but tends to infinity together with the sample size. The features of this limiting regime are discussed in (Yao et al., 2015, Section 1.1) based on real datasets such as portfolio, climate survey, speech analysis, face recognition, microarrays, signal detection, etc. In this paper, we assume a high-dimensional regime in which the ratio of the dimension to the sample size converges to a positive constant.
One milestone in the study of the spectral properties of sample covariance matrices in this limiting regime was the discovery of the Marčenko-Pastur distribution, together with the determination (Yin et al., 1988; Bai & Yin, 1993) of the limits of the largest and the smallest eigenvalues of sample covariance matrices for i.i.d. data under the finite fourth moment condition. Baik & Silverstein (2006) studied the asymptotic locations of the eigenvalues of the sample covariance matrix of a spiked eigenvalue model (Johnstone, 2001), but there the largest eigenvalues of the population covariance matrix are bounded. For surveys, we refer the reader to (Bai & Silverstein, 2010; Paul & Aue, 2014; Yao et al., 2015).
In some cases, for example, when data are measured in different units, it is more appropriate to utilize sample correlation matrices (Jolliffe, 2002; Jolliffe & Cadima, 2016; Johnson & Wichern, 2007) because sample correlation matrices are invariant under scaling and shifting. Moreover, the assumption of i.i.d. data is often unrealistic; practitioners frequently hope to find an interesting covariance structure. However, the asymptotic spectral properties of sample correlation matrices in the high-dimensional regime have not been investigated as thoroughly as those of sample covariance matrices.
We review typical asymptotic results of the spectral properties of sample correlation matrices here. For i.i.d. case, under the finite fourth moment condition, Jiang (2004) showed that the extreme eigenvalues of the sample correlation matrix converge almost surely to , and Bai & Zhou (2008) showed that the limiting spectral distribution is the standard Marčenko-Pastur distribution under a weaker assumption.
In a class of spiked models, Morales-Jimenez et al. (2021) derived asymptotic first-order and distributional results for spiked eigenvalues and eigenvectors of sample correlation matrices. More specifically, they found that the first-order spectral properties of sample correlation matrices match those of sample covariance matrices, whilst their asymptotic distributions can differ significantly.
El Karoui (2009) revealed that, for unit-variance data of a general linear model, the first-order asymptotic behavior of the spectra of sample correlation matrices is similar to that of sample covariance matrices. This similarity, however, requires the boundedness of the population correlation matrix, a condition not met by some factor models.
The problem is that the boundedness of the population covariance matrices is not always satisfied in econometrics, finance (Chamberlain & Rothschild, 1983; Bai & Ng, 2002), genomics, and stationary long-memory processes. One of the features of the unbounded population covariance matrices is the consistency of the eigenvectors (Yata & Aoshima, 2013; Koltchinskii & Lounici, 2017; Wang & Fan, 2017), although the asymptotic location and the fluctuation of the largest unbounded eigenvalues of are not available yet even for the equi-correlated normal population, which has the simplest unbounded population covariance matrix .
Unbounded covariance/correlation matrices in high-dimensional problems are studied with a spiked model (Yata & Aoshima, 2013), a time-series model (Merlevède et al., 2019) or a factor model (Cai et al., 2020; Wang & Fan, 2017). Following the latter articles for unbounded sample covariance matrices, we study factor models for unbounded sample correlation matrices.
Our target model is a -factor model
with , , being deterministic, a fixed positive integer, and two random matrices. Here we assume that the entries of (factors) are i.i.d. centered random variables with unit variance, and so are the entries of (noises). The entries of are independent from the entries of , but and are not necessarily identically distributed. All rows of are nonzero.
Our fundamental result is Theorem 2.4: for every -factor model, if the entries of the noise matrix have finite fourth moments and is diagonal, or if there exists such that , then the eigenvalues of the sample correlation matrices and those of the sample covariance matrices with unit-variance data are asymptotically equal, which extends the result of El Karoui (2009).
In PCA, methods for estimating the number of factors have been studied, for example in econometrics (Bai & Ng, 2002; Lam & Yao, 2012; Aït-Sahalia & Xiu, 2017).
Meanwhile, rules for the retention of principal components have been proposed throughout the literature (see, e.g., Jackson, 1993). The broken-stick (BS) rule (Frontier, 1976; Jackson, 1993) is a peculiar one among them. The BS rule compares the spectral distribution of , with a distribution of the mean lengths of the subintervals obtained by a random partition of [0, 1] (Holst, 1980). Specifically, for the number of factors or significant principal components, the BS rule provides where is the lowest among for which the th largest eigenvalue of does not exceed the sum . The BS rule depends on the number and growth rates of unbounded eigenvalues of . The idea of the BS rule, which originates from a species occupation model in ecology, has no evident relation to the distribution of the eigenvalues of .
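The rule described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it assumes the comparison sum is the classical broken-stick sum $\sum_{i=k}^{p} 1/i$ for a correlation matrix of order $p$ (Frontier, 1976; Jackson, 1993), and that the reported number is the count of leading eigenvalues that exceed their thresholds. The symbol $p$ and the function names `broken_stick_thresholds` and `broken_stick_rule` are illustrative choices.

```python
import numpy as np


def broken_stick_thresholds(p):
    """Return the vector whose (k-1)-th entry is sum_{i=k}^{p} 1/i."""
    inverse = 1.0 / np.arange(1, p + 1)
    # Reverse cumulative sum: entry k-1 collects 1/k + 1/(k+1) + ... + 1/p.
    return np.cumsum(inverse[::-1])[::-1]


def broken_stick_rule(eigenvalues):
    """Count the leading eigenvalues that exceed the broken-stick thresholds.

    Assumption: the eigenvalues come from a correlation matrix, so that
    they sum to p and the thresholds sum to p as well.
    """
    lam = np.sort(np.asarray(eigenvalues, dtype=float))[::-1]
    thresholds = broken_stick_thresholds(lam.size)
    not_exceeding = np.nonzero(lam <= thresholds)[0]
    # The first index k (0-based) with lam[k] <= thresholds[k] equals the
    # number of eigenvalues retained before the rule stops.
    return int(not_exceeding[0]) if not_exceeding.size else lam.size
```

For an identity correlation matrix every eigenvalue equals 1, which is below the first threshold, so the sketch returns 0 retained components.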
First, we study the asymptotic spectral properties of , through the fundamental theorem of -factor model (Theorem 2.4) and techniques from the random matrix theory, for two illustrative -factor models generalizing the equi-correlated normal population. One of the two is a -factor model such that and the rows of the factor loading matrix have a common length. We call this a constant length factor loading model (CLFM). The other is a -factor model such that the rows of and the diagonal entries of are convergent. This model was introduced in Akama (2023), and is called an asymptotic convergent factor model (ACFM) here. We establish that under certain mild conditions, the BS rule precisely matches the principal rank of the factor loading matrix for a CLFM (see Theorem 2.7), but the essential rank of for an ACFM (see Theorem 2.8).
Then we calculate the BS rule and some modern factor number estimators, such as the adjusted correlation thresholding (Fan et al., 2022) and Bai-Ng’s rule (Bai & Ng, 2002) based on an information criterion, for financial datasets and a biological dataset (Quadeer et al., 2014). The financial datasets were obtained by Fan et al. (2022) from the Fama-French 100 portfolios (Fama & French, 1993) by removing outliers. The biological dataset is a binary multiple sequence alignment (MSA) of HCV genotype 1a (prevalent in North America), publicly available from the Los Alamos National Laboratory database (Kuiken et al., 2004).
The structure of this paper is as follows. Section 2 formally introduces the model and presents the main theorems, with proofs provided in Section A of the supplementary material. Numerical analysis is performed on actual financial and biological datasets in Section 3. Section 4 is the conclusion. Section B of the supplementary material includes several important lemmas derived from general random matrix theories.
2 Models and theoretical results
In this paper we consider the following model of data matrix:
Definition 2.1 (-Factor model).
Let be nonnegative integers. A -factor model is a random matrix in the form
where
-
1.
the deterministic matrix is the theoretical mean, with ;
-
2.
the deterministic matrices and are called the factor loading matrix and noise coefficient matrix, and the rows of are called factor loading vectors;
-
3.
there are two independent sets of i.i.d. random variables and with mean zero and variance one, such that
The two random matrices and are called factor matrix and idiosyncratic noise matrix, respectively.
We often write in a compact form
with
Let be the th column of . Then are i.i.d. random vectors. We recall that the population covariance and correlation matrices are and , respectively, where is the th component of the vector . Note that under Definition 2.1, the population covariance matrix is
and the population correlation matrix is
where
is a diagonal matrix containing the variance of components of . Note that is just the square of Euclidean norm of the th row vector of , and is also the th diagonal element of . So we can write , where for a square matrix , denotes the diagonal matrix with the same diagonal as .
The population covariance matrix and correlation matrix play important roles in multivariate statistics. However they are not always available. In order to estimate them, we define the theoretically-centered sample covariance matrix as
(1) |
Let . Then the theoretically-centered sample correlation matrix is defined as (see e.g. El Karoui (2009))
As the mean vector is not always known either, we sometimes need to replace by the sample mean
and use it to define the data-centered sample covariance matrix
where , and the data-centered sample correlation matrix
where .
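The definitions above can be illustrated numerically. The following Python sketch simulates data from a factor model and computes both versions of the sample correlation matrix. Several things here are assumptions made for illustration only: the additive form "mean + loadings × factors + noise coefficients × noise", Gaussian factors and noises, the normalization of the covariance by the sample size (which does not affect the correlation matrix), and all function and variable names.

```python
import numpy as np


def simulate_factor_model(loadings, noise_coef, n, rng, mean=None):
    """Draw a p x n data matrix: mean + loadings @ factors + noise_coef @ noise.

    Assumption: factors and noises are i.i.d. standard normal here; the
    definition only asks for mean zero and unit variance.
    """
    p, k = loadings.shape
    factors = rng.standard_normal((k, n))
    noise = rng.standard_normal((p, n))
    data = loadings @ factors + noise_coef @ noise
    if mean is not None:
        data = data + np.asarray(mean, dtype=float)[:, None]
    return data


def sample_correlation(data, mean=None):
    """Sample correlation matrix of the rows of a p x n data matrix.

    With mean=None the rows are centered by their sample means
    (data-centered version); otherwise the given theoretical mean is
    subtracted (theoretically-centered version).
    """
    if mean is None:
        center = data.mean(axis=1, keepdims=True)
    else:
        center = np.asarray(mean, dtype=float)[:, None]
    centered = data - center
    cov = centered @ centered.T / data.shape[1]
    scale = np.sqrt(np.diag(cov))
    return cov / np.outer(scale, scale)
```

Because the correlation matrix rescales every row by its own standard deviation, the choice between dividing by the sample size or by the sample size minus one is immaterial for it.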
In this paper we focus on the sample correlation matrices. Note that if one row of is identically , then the corresponding row of is deterministically equal to the mean, and the correlation between this row and any other row is not defined. But it is easy to recognize such a row in the data and to eliminate it before further treatment. So we can assume
-
A1
The rows of are nonzero:
We study the limiting locations of eigenvalues of sample correlation matrices and in the proportional limiting regime
which will be denoted as
for simplicity. In this regime, Theorem 1 in El Karoui (2009) related the spectral properties (limiting spectral distributions and limit locations of individual eigenvalues) of and to those of sample covariance matrices and , on the condition that , the spectral norm of , is uniformly bounded. However, due to the presence of , this condition is not satisfied by some factor models. For example, let us consider an ENP studied in (Fan & Jiang, 2019; Akama & Husnaqilati, 2022; Akama, 2023) and defined in Definition 2.2 below. Then which diverges to as .
Definition 2.2.
An equi-correlated normal population (ENP for short) is a -factor model such that (1) for some ; (2) for some ; and (3) and are independent standard normal random variables. For an ENP, we define .
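The divergence of the spectral norm for an ENP can be checked numerically. The sketch below assumes, as the name suggests, that the population correlation matrix of an ENP has ones on the diagonal and a common value $\rho \in (0, 1)$ off the diagonal; the symbols $p$ and $\rho$ and the function names are illustrative. For such a matrix the largest eigenvalue is $1 + (p - 1)\rho$, which grows linearly in $p$.

```python
import numpy as np


def equicorrelation_matrix(p, rho):
    """p x p matrix with unit diagonal and common off-diagonal entry rho."""
    return (1.0 - rho) * np.eye(p) + rho * np.ones((p, p))


def spectral_norm(matrix):
    """Largest singular value of a matrix."""
    return float(np.linalg.norm(matrix, 2))


if __name__ == "__main__":
    rho = 0.3  # hypothetical common correlation
    for p in (10, 100, 1000):
        # The two printed numbers agree: the norm equals 1 + (p - 1) * rho.
        print(p, spectral_norm(equicorrelation_matrix(p, rho)), 1 + (p - 1) * rho)
```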
In this paper, for the th largest singular value of a matrix , we prove the following Theorem in Section A:
Theorem 2.3.
Let and be complex matrices of order and .
Thanks to this theorem, we generalize Theorem 1 in El Karoui (2009) to general -factor models with possibly unbounded .
Before stating our first main result, we add some additional assumptions.
-
A2
Each random variable has finite fourth moment:
-
A3
One of the following assumptions holds: (a) There exists such that
or (b) is diagonal.
For two nonnegative sequences and , by we mean that there is a positive sequence such that , and for large enough , . For a family of such sequences and , we say that uniformly if there is independent of , with , such that holds for all and for large enough .
Theorem 2.4.
It should be noted that no bounding condition for is required. Therefore, when , the theorem is applicable to a general linear model , where can be unbounded. Furthermore, assuming , we accommodate different distributions for factor and noise components, for example, the factors are allowed to have a heavy-tailed distribution, along with a light-tailed noise.
Even with this general result, it remains difficult to determine the asymptotic locations of the eigenvalues of a sample covariance matrix in general. We restrict ourselves to some particular cases that extend an ENP.
Model example 1: Constant length factor loading model (CLFM).
Definition 2.5 (Constant length factor loading model).
By a constant length factor loading model (CLFM for short), we mean a -factor model with for some , whose factor loading vectors have the same length, i.e., there is a constant independent of and such that the th row vector of has length .
Note that a CLFM automatically satisfies A1 and A3(b). Moreover, the largest eigenvalue of has asymptotic tight order :
(2) |
It is because, by letting be the th column of for , we have
where the first inequality is due to the Courant-Fischer min-max theorem (Horn & Johnson, 2013, Theorem 4.2.6), and the equality between the trace and holds because each row of has length .
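The trace argument can be checked on a small example. The Python sketch below assumes that the matrix in question is the Gram matrix of a $p \times K$ factor loading matrix whose rows all have length $\ell$; the symbols $p$, $K$, $\ell$ and the function names are illustrative. The trace of the Gram matrix is then $p\ell^2$, and since it is the sum of at most $K$ nonnegative eigenvalues, the largest eigenvalue lies between $p\ell^2/K$ and $p\ell^2$.

```python
import numpy as np


def constant_length_loadings(p, k, length, rng):
    """Random p x k matrix whose rows are rescaled to a common Euclidean length."""
    raw = rng.standard_normal((p, k))
    return length * raw / np.linalg.norm(raw, axis=1, keepdims=True)


def largest_gram_eigenvalue(loadings):
    """Largest eigenvalue of loadings.T @ loadings (same as for loadings @ loadings.T)."""
    # eigvalsh returns eigenvalues in ascending order.
    return float(np.linalg.eigvalsh(loadings.T @ loadings)[-1])


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    p, k, length = 200, 3, 0.8  # hypothetical sizes
    top = largest_gram_eigenvalue(constant_length_loadings(p, k, length, rng))
    trace = p * length**2
    print(trace / k <= top <= trace)
```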
Furthermore, we will consider the following assumption:
-
A4
The rank of is , and the nonzero eigenvalues of (if any) tend to infinity: for ,
If a given CLFM is an ENP, then Definition 2.2 implies , which gives by (2). Thus, by Theorem 2.6 (3), both and are almost surely , which is proved in Akama (2023). Theorem 2.6 (4) implies that almost surely.
As a direct application of Theorem 2.6, we establish that the limit of the broken-stick rule and equal to .
Theorem 2.7.
If a CLFM satisfies A2, and if the smallest nonzero eigenvalue (if any) of is eventually larger than , i.e.,
(5) |
then
Here we give a sufficient condition to ensure A4 or (5) above. If the rank of is , then by rearranging the columns, can be written as
where is an matrix and is an matrix. Then, by the eigenvalue interlacing theorem (Horn & Johnson, 2013, Theorem 4.3.28),
Let and
be the cosine of the angle between the two vectors . Theorem 2.3 implies
By (Varah, 1975, Theorem 1),
Therefore, for any positive sequence ,
(6) |
is a sufficient condition for
This condition is noteworthy because the magnitude of can be seen as the overall influence of the th factor on the dataset; whereas the normalized vector indicates the effects of the th factor across the covariates. By Condition (6), if factors exert equally substantial influences on the dataset and their impact distributions are sufficiently independent, then the value of tends to be large.
Model example 2: Asymptotic constant factor loading model (ACFM).
We consider a -factor model satisfying the following
-
A5
There is a sequence of -dimensional row vectors and a sequence of strictly positive numbers such that
and there are and such that
This model was considered in Akama (2023) where the limiting spectral distribution of its sample correlation matrices was derived. The limit of the broken-stick rule and for this model is determined here.
Theorem 2.8.
An equi-correlated normal population (ENP) is a CLFM and an ACFM, but is not a model of Fan et al. (2022); the population covariance matrix of an ENP is for some constant , so an ENP does not satisfy the condition C3 of Section 2 “High-Dimensional Factor Model” of (Fan et al., 2022). Conversely, some models of Fan et al. (2022) are neither a CLFM nor an ACFM, by the definitions of CLFM and ACFM.
3 Broken-stick rule for real datasets
We will check whether the following real datasets are generated by a CLFM or an ACFM, based on Theorems 2.6 and 2.8:
-
1.
the datasets that Fan et al. (2022) obtained by cleaning outliers from the daily excess returns (Jensen, 1968) of the Fama-French 100 portfolios (Fama & French, 1993, 2015); see Prof. French’s data library, http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/.
-
2.
the binary multiple sequence alignment (Quadeer et al., 2014), by courtesy of Prof. Quadeer.
For the correlation matrices of these datasets, we will compute the eight quantities:
In a stock return dataset, (, resp.) denotes the number of companies (trading days, resp.). For , the th row of the data matrix corresponds to the th company and the th column to the th trading day. The factors are those of Fama-French, for instance.
3.1 Stock return time-series
Engle & Kelly (2012) proposed their Dynamic Equicorrelation to forecast economic time-series. In asset pricing and portfolio management, Fama and French designed statistical models, namely, the 3-factor model (Fama & French, 1993) and the 5-factor model (Fama & French, 2015), to describe stock returns.
Fan et al. (2022) computed their estimator ACT and confirmed the three Fama-French factors (Fama & French, 1993), for the following datasets of the daily excess returns of 100 companies French chose.
-
1.
Dot-com bubble & recession (1998-01-02/2007-12-31).
In this case and . (“Before 2007-2008 financial crisis”)
-
2.
After Lehman shock (2010-01-04/2019-04-30).
In this case and . (“After 2007-2008 financial crisis”)
Fama-French Portfolios | Bai-Ng | ||||||
---|---|---|---|---|---|---|---|
1998/2007 (Dot-com bubble & recession) | 2514 | .0398 | .658 | 100 | 2 | 4 | 4 |
2010/2019 (After Lehman shock) | 2346 | .0426 | .806 | 1 | 3 | 5 |


As Figure 1 shows, the equicorrelation coefficient is low during the dot-com bubble, plausibly because of speculation in uncorrelated emerging stocks.
We have two large groups of less speculative companies:
-
•
the group of non-dot-com companies, in the dot-com bubble.
-
•
the totality, in the dot-com recession.
Figure 1 is produced by the combination of a few techniques.
-
1.
GJR GARCH with the correlation model being the Dynamic Equicorrelation (Engle & Kelly, 2012).
This computes the time-series of the equicorrelation coefficient . Our program uses the dccmidas package (Candila, 2021) for R.
-
2.
Fisher’s z-transform .
This bijection between and makes the movement of clearer. By z(ECC) time-series, we mean the time-series of the z-transform of . Fisher’s z-transform is often asymptotically normally distributed. In Figure 1, the z(ECC) time-series are on the upper panels.
-
3.
Hodrick-Prescott filter (Hodrick & Prescott, 1997) to compute the trend and the cyclical component of the z(ECC) time-series.
From a given time-series , Hodrick-Prescott filter of parameter extracts
Here is the second difference of . Then, ’s are weighted averages of and satisfy . is the arithmetical progression obtained from by the least square approach, but is for . is called a trend of the time-series and a cyclical component of .
In Figure 1, the cyclical components are on the lower panels, and the residues (trends) are overlaid on the upper panels.
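The last two steps of the list above can be sketched in Python. This is not the program used for Figure 1: it takes an equicorrelation time-series as given (the GJR GARCH and Dynamic Equicorrelation step is omitted) and uses the textbook forms of both tools, namely Fisher's z-transform as the inverse hyperbolic tangent and the Hodrick-Prescott trend as the minimizer of the squared deviations plus a penalty on squared second differences. The function names and the smoothing parameter value in the usage note are assumptions.

```python
import numpy as np


def fisher_z(r):
    """Fisher's z-transform, 0.5 * log((1 + r) / (1 - r)), for r in (-1, 1)."""
    return np.arctanh(r)


def hodrick_prescott(y, lam):
    """Split a time-series into (trend, cycle) with smoothing parameter lam.

    The trend minimizes sum (y_t - trend_t)^2 + lam * sum (second difference
    of trend)^2, i.e. it solves (I + lam * D^T D) trend = y, where D is the
    second-difference matrix.
    """
    y = np.asarray(y, dtype=float)
    n = y.size
    second_diff = np.zeros((n - 2, n))
    for t in range(n - 2):
        second_diff[t, t:t + 3] = [1.0, -2.0, 1.0]
    system = np.eye(n) + lam * (second_diff.T @ second_diff)
    trend = np.linalg.solve(system, y)
    return trend, y - trend
```

A call such as `hodrick_prescott(fisher_z(ecc), 1600.0)` would then return the trend and the cyclical component of a z(ECC) series stored in a hypothetical array `ecc`. In this formulation the cyclical component always sums to zero, and a series that is exactly linear in time is returned unchanged as its own trend.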
For the time-series of equicorrelation coefficient based on Engle & Kelly (2012), Wang et al. (2020) applied a linear regression analysis and regarded the residue as the equi-correlation of industry return (IEC). They claimed that “We can see that the index displays prominent countercyclical behavior in that it always rises during the recession periods and declines during the expansion periods. This is expected because economic recessions lead to greater comovement among stock returns”. Fisher’s z-transformation emphasizes the movement of the time-series of equicorrelation coefficient (ECC), and Hodrick-Prescott filter improves the linear regression analysis of the ECC time-series.
As Figure 1 shows, the trend of the z(ECC) time-series during the period of the dot-com bubble & recession (1998-01-02/2007-12-31) varies more than that of the z(ECC) time-series after the Lehman shock. One may suppose that investors speculated heavily in seemingly uncorrelated stocks during the dot-com bubble.
3.2 Binary multiple sequence alignment
We consider the binary multiple sequence alignment (MSA for short) of an -residue (site) protein with sequences where and . The dataset was provided by courtesy of Prof. Quadeer. Quadeer et al. “identify groups of coevolving residues within HCV nonstructural protein 3 (NS3) by analyzing diverse sequences of this protein using ideas from random matrix theory and associated methods” (Quadeer et al., 2014, p. 7628), and report that “Sequence analysis reveals three sectors of collectively evolving sites in NS3. …, there remained eigenvalues greater than , presumably representing intrinsic correlations” (Quadeer et al., 2014, p. 7631). They detected signals by randomizing the data.
On the statistical model of Quadeer et al. (2014), Morales-Jimenez, one of the authors of Quadeer et al. (2018), commented “the majority of variables (protein positions in the genome) are essentially independent, and there are just some small groups of variables which are correlated, giving rise to the different spikes. These group of variables can be modeled with equi-correlation, but the size of these groups is modeled as fixed, i.e., not growing with the dimension of the protein. That leads to a non-divergent spiked model, like the one considered in our Stat Sinica paper” (Morales-Jimenez et al., 2021).
Nonetheless, for the dataset of the binary MSA (Quadeer et al., 2014), the broken-stick rule and the adjusted correlation thresholding work well.
Bai-Ng | |||||||
---|---|---|---|---|---|---|---|
MSA | 2815 | 0.1687 | 0.0216 | 475 | 3 | 10 | 4 |
-
•
The number 3 of the sectors is detected by the broken-stick rule .
-
•
The number of eigenvalues greater than is close to . Here is studied with the proportional limiting regime .
To close this section about the broken-stick rule for the stock return datasets and the binary MSA dataset, we observe the following. The Fama-French portfolio dataset for the dot-com bubble & recession (1998-01-02/2007-12-31) and the binary MSA dataset fit a CLFM with but not an ACFM. The Fama-French portfolio dataset after the Lehman shock (2010-01-04/2019-04-30) fits both an ACFM and a CLFM with .
In conclusion, the pair of a CLFM and the BS rule may be more useful than the pair of a model and ACT (Fan et al., 2022) for estimating the number of factors of the binary MSA dataset; the number of factors estimated by our approach is the same as that estimated by Quadeer et al. (2014). On the other hand, ACT may be useful for the Fama-French portfolios, as it can predict the Fama-French factors.
In theory, to distinguish a model CLFM of from an ACFM, one may think of . The diagonal entries are asymptotically equal to the corresponding entries of (their ratios converge almost surely to 1 by Lemma A.1). And the diagonal entries of , are just the Euclidean norm of rows of .
Once we find a theoretically appropriate scaling and shifting of ’s to uncover the concentration of ’s, we could decide whether a given dataset fits a CLFM, an ACFM, or a generalization.
4 Conclusion
We assumed the proportional limiting regime for factor models. We established a general theorem for the asymptotic equispectrality of and the naturally normalized sample covariance matrix for factor models. Then, we derived the following assertions: for the two models introduced (CLFM and ACFM), the limiting largest bounded eigenvalue and the limiting spectral distribution of the sample correlation matrix are scalings of those of of the i.i.d. case, with the scaling constant . Here is “a reduced correlation coefficient”. The largest eigenvalue of divided by the order of converges almost surely to .
Notably, for an ACFM, the BS rule computes not the rank of the factor loading matrix , but the essential rank (Theorem 2.8). In other words, the deterministic decreasing vector of the BS rule examines an intrinsic structure of , for an ACFM. Moreover, is the descending list of the limits of the eigenvalues of in , for a random factor loading matrix where for each ,
-
•
is a random vector of independent Rademacher variables,
-
•
is such that is the length of the th longest subinterval obtained by independent uniformly distributed separators of ;
and are independent. In this case, does not satisfy the condition of an ACFM. A limit theory for the order statistics of the eigenvalues of a sample covariance/correlation matrix is still needed to analyze the BS rule.
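The random partition behind the BS rule can be simulated directly. The Python sketch below cuts the unit interval at $p - 1$ independent uniform points, sorts the resulting $p$ pieces by length, and compares the Monte Carlo means with the classical closed form $\frac{1}{p}\sum_{i=k}^{p} \frac{1}{i}$ for the expected length of the $k$th longest piece (Holst, 1980). The symbol $p$, the function names, and the number of draws are illustrative choices.

```python
import numpy as np


def simulate_ordered_piece_lengths(p, n_draws, rng):
    """Each row: lengths of the p pieces of [0, 1], sorted from longest to shortest."""
    cuts = np.sort(rng.random((n_draws, p - 1)), axis=1)
    edges = np.hstack([np.zeros((n_draws, 1)), cuts, np.ones((n_draws, 1))])
    pieces = np.diff(edges, axis=1)
    return np.sort(pieces, axis=1)[:, ::-1]


def expected_ordered_lengths(p):
    """Closed form (1/p) * sum_{i=k}^{p} 1/i for k = 1, ..., p."""
    inverse = 1.0 / np.arange(1, p + 1)
    return np.cumsum(inverse[::-1])[::-1] / p


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p = 5
    simulated = simulate_ordered_piece_lengths(p, 20000, rng).mean(axis=0)
    # The two rows should agree up to Monte Carlo error.
    print(np.round(simulated, 3))
    print(np.round(expected_ordered_lengths(p), 3))
```

Multiplying the closed-form vector by $p$ gives the thresholds against which the eigenvalues of a correlation matrix of order $p$ are compared.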
References
- Aït-Sahalia & Xiu (2017) Aït-Sahalia, Y. & Xiu, D. (2017), ‘Using principal component analysis to estimate a high dimensional factor model with high-frequency data’, J. Econom. 201(2), 384–399.
- Akama (2023) Akama, Y. (2023), ‘Correlation matrix of equi-correlated normal population: fluctuation of the largest eigenvalue, scaling of the bulk eigenvalues, and stock market’, Int. J. Theor. Appl. Finance 26, 2350006.
- Akama & Husnaqilati (2022) Akama, Y. & Husnaqilati, A. (2022), ‘A dichotomous behavior of Guttman-Kaiser criterion from equi-correlated normal population’, J. Indones. Math. Soc. 28(3), 272–303.
- Anderson (2003) Anderson, T. W. (2003), An introduction to multivariate statistical analysis, 3rd edn, Wiley.
- Bai & Ng (2002) Bai, J. & Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221.
- Bai & Silverstein (2010) Bai, Z. D. & Silverstein, J. W. (2010), Spectral analysis of large dimensional random matrices, 2nd edn, Springer.
- Bai & Yin (1993) Bai, Z. D. & Yin, Y. Q. (1993), ‘Limit of the smallest eigenvalue of a large dimensional sample covariance matrix’, Ann. Probab. 21(3), 1275–1294.
- Bai & Zhou (2008) Bai, Z. & Zhou, W. (2008), ‘Large sample covariance matrices without independence structures in columns’, Stat. Sin. 18(2), 425–442. https://www.jstor.org/stable/24308489
- Baik & Silverstein (2006) Baik, J. & Silverstein, J. W. (2006), ‘Eigenvalues of large sample covariance matrices of spiked population models’, J. Multivar. Anal. 97(6), 1382–1408.
- Cai et al. (2020) Cai, T., Han, X. & Pan, G. (2020), ‘Limiting laws for divergent spiked eigenvalues and largest nonspiked eigenvalue of sample covariance matrices’, Ann. Stat. 48(3), 1255–1280.
- Candila (2021) Candila, V. (2021), Package ‘dccmidas’ (DCC Models with GARCH-MIDAS Specifications in the Univariate Step). R package version 0.1.0. https://CRAN.R-project.org/package=dccmidas
- Chamberlain & Rothschild (1983) Chamberlain, G. & Rothschild, M. (1983), ‘Arbitrage, factor structure, and mean-variance analysis on large asset markets’, Econometrica 51(5), 1281–1304. http://www.jstor.org/stable/1912275
- El Karoui (2009) El Karoui, N. (2009), ‘Concentration of measure and spectra of random matrices: Applications to correlation matrices, elliptical distributions and beyond’, Ann. Appl. Probab. 19(6), 2362–2405. http://www.jstor.org/stable/25662544
- Engle & Kelly (2012) Engle, R. & Kelly, B. (2012), ‘Dynamic equicorrelation’, J. Bus. Econ. Stat. 30(2), 212–228.
- Fama & French (1993) Fama, E. F. & French, K. R. (1993), ‘Common risk factors in the returns on stocks and bonds’, J. Financ. Econ. 33(1), 3–56.
- Fama & French (2015) Fama, E. F. & French, K. R. (2015), ‘A five-factor asset pricing model’, J. Financ. Econ. 116(1), 1–22. https://www.sciencedirect.com/science/article/pii/S0304405X14002323
- Fan et al. (2022) Fan, J., Guo, J. & Zheng, S. (2022), ‘Estimating number of factors by adjusted eigenvalues thresholding’, J. Am. Stat. Assoc. 117(538), 852–861. https://doi.org/10.1080/01621459.2020.1825448
- Fan & Jiang (2019) Fan, J. & Jiang, T. (2019), ‘Largest entries of sample correlation matrices from equi-correlated normal populations’, Ann. Probab. 47(5), 3321–3374.
- Frontier (1976) Frontier, S. (1976), ‘Étude de la décroissance des valeurs propres dans une analyse en composantes principales: Comparaison avec le modèle du bâton brisé’, J. Exp. Mar. Biol. Ecol. 25, 67–75.
- Fujikoshi et al. (2011) Fujikoshi, Y., Ulyanov, V. V. & Shimizu, R. (2011), Multivariate statistics: High-dimensional and large-sample approximations, John Wiley & Sons.
- Hodrick & Prescott (1997) Hodrick, R. J. & Prescott, E. C. (1997), ‘Postwar U.S. business cycles: An empirical investigation’, J. Money Credit Bank. 29(1), 1–16. http://www.jstor.org/stable/2953682
- Holst (1980) Holst, L. (1980), ‘On the lengths of the pieces of a stick broken at random’, J. Appl. Prob. 17, 623–634.
- Horn & Johnson (2013) Horn, R. A. & Johnson, C. R. (2013), Matrix analysis, 2nd edn, Cambridge University Press.
- Jackson (1993) Jackson, D. A. (1993), ‘Stopping rules in principal components analysis: A comparison of heuristical and statistical approaches’, Ecology 74(8), 2204–2214. http://www.jstor.org/stable/1939574
- Jensen (1968) Jensen, M. C. (1968), ‘The performance of mutual funds in the period 1945–1964’, J. Finance 23(2), 389–416. https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1540-6261.1968.tb00815.x
- Jiang (2004) Jiang, T. (2004), ‘The limiting distributions of eigenvalues of sample correlation matrices’, Sankhyā: The Indian Journal of Statistics (2003-2007) 66(1), 35–48. http://www.jstor.org/stable/25053330
- Johnson & Wichern (2007) Johnson, R. A. & Wichern, D. W. (2007), Applied Multivariate Statistical Analysis, 6th edn, Pearson Prentice Hall.
- Johnstone (2001) Johnstone, I. M. (2001), ‘On the distribution of the largest eigenvalue in principal components analysis’, Ann. Stat. 29(2), 295–327.
- Jolliffe (2002) Jolliffe, I. T. (2002), Principal Component Analysis, 2nd edn, Springer. https://link.springer.com/book/10.1007/b98835
- Jolliffe & Cadima (2016) Jolliffe, I. T. & Cadima, J. (2016), ‘Principal component analysis: a review and recent developments’, Philos. Trans. Royal Soc. A 374, 20150202. http://doi.org/10.1098/rsta.2015.0202
- Koltchinskii & Lounici (2017) Koltchinskii, V. & Lounici, K. (2017), ‘Concentration inequalities and moment bounds for sample covariance operators’, Bernoulli 23(1), 110–133. https://doi.org/10.3150/15-BEJ730
- Kuiken et al. (2004) Kuiken, C., Yusim, K., Boykin, L. & Richardson, R. (2004), ‘The Los Alamos hepatitis C sequence database’, Bioinformatics 21(3), 379–384. https://doi.org/10.1093/bioinformatics/bth485
- Lam & Yao (2012) Lam, C. & Yao, Q. (2012), ‘Factor modeling for high-dimensional time series: Inference for the number of factors’, Ann. Stat. 40(2), 694–726. https://doi.org/10.1214/12-AOS970
- Ledoux (2001) Ledoux, M. (2001), The concentration of measure phenomenon, Vol. 89 of Mathematical Surveys and Monographs, American Mathematical Society. https://doi.org/10.1090/surv/089
- Mallows (1991) Mallows, C. (1991), ‘Another comment on O’Cinneide’, Am. Stat. 45(3), 257.
- Merlevède et al. (2019) Merlevède, F., Najim, J. & Tian, P. (2019), ‘Unbounded largest eigenvalue of large sample covariance matrices: Asymptotics, fluctuations and applications’, Linear Algebra Its Appl. 577, 317–359.
- Morales-Jimenez et al. (2021) Morales-Jimenez, D., Johnstone, I. M., McKay, M. R. & Yang, J. (2021), ‘Asymptotics of eigenstructure of sample correlation matrices for high-dimensional spiked models’, Stat. Sin. 31(2), 571.
- Muirhead (2009) Muirhead, R. J. (2009), Aspects of multivariate statistical theory, John Wiley & Sons.
-
Paul & Aue (2014)
Paul, D. & Aue, A. (2014), ‘Random
matrix theory in statistics: A review’, J. Stat. Plan. Inference 150, 1–29.
https://www.sciencedirect.com/science/article/pii/S0378375813002280 - Quadeer et al. (2014) Quadeer, A. A., Louie, R. H., Shekhar, K., Chakraborty, A. K., Hsing, I. & McKay, M. R. (2014), ‘Statistical linkage analysis of substitutions in patient-derived sequences of genotype 1a hepatitis C virus nonstructural protein 3 exposes targets for immunogen design’, J. Virol. 88(13), 7628–7644.
- Quadeer et al. (2018) Quadeer, A. A., Morales-Jimenez, D. & McKay, M. R. (2018), ‘Co-evolution networks of HIV/HCV are modular with direct association to structure and function’, PLoS Computational Biology 14, 1–29.
-
Tomkins (1975)
Tomkins, R. J. (1975), ‘On Conditional
Medians’, Ann. Probab. 3(2), 375–379.
https://doi.org/10.1214/aop/1176996411 - Varah (1975) Varah, J. M. (1975), ‘A lower bound for the smallest singular value of a matrix’, Linear Algebra Its Appl. 11(1), 3–5.
-
Wang & Fan (2017)
Wang, W. & Fan, J. (2017),
‘Asymptotics of empirical eigenstructure for high dimensional spiked
covariance’, Ann. Stat. 45(3), 1342–1374.
https://doi.org/10.1214/16-AOS1487 -
Wang et al. (2020)
Wang, Y., Pan, Z., Wu, C. & Wu, W. (2020), ‘Industry equi-correlation: A powerful predictor of
stock returns’, J. Empir. Finance 59, 1–24.
https://www.sciencedirect.com/science/article/pii/S092753982030044X -
Yao et al. (2015)
Yao, J., Zheng, S. & Bai, Z. D. (2015), Sample covariance matrices and high-dimensional
data analysis, Cambridge University Press.
https://doi.org/10.1017/CBO9781107588080 - Yata & Aoshima (2013) Yata, K. & Aoshima, M. (2013), ‘PCA consistency for the power spiked model in high-dimensional settings’, J. Multivar. Anal. 122, 334–354.
- Yin et al. (1988) Yin, Y. Q., Bai, Z. D. & Krishnaiah, P. R. (1988), ‘On the limit of the largest eigenvalue of the large dimensional sample covariance matrix’, Probab. Theory Relat. Fields 78, 509–521.
Supplementary material for the manuscript “Asymptotic locations of bounded and unbounded eigenvalues of sample correlation matrices of certain factor models - application to a components retention rule”
This supplementary article proves Theorems 2.3, 2.4, 2.6, 2.7, and 2.8 of the main manuscript and collects some useful lemmas. The literature cited in this supplementary material is listed in the References of the main text.
Appendix A Proofs of theorems
A.1 Proof of Theorem 2.3
Proposition B.4 clearly implies the latter inequality
(7)
Now we prove the other inequality . If , then there is nothing to prove. We now assume that . Then it is necessary that and the Hermitian matrix is positive definite. Let be the Moore-Penrose inverse of defined as
By (7) and ,
(8)
Noticing that
we obtain the desired inequality by multiplying on both sides of (8).
A.2 Proof of Theorem 2.4
Note that and the corresponding formula for . By Theorem 2.3, it suffices to prove that all the singular values of and are concentrated near , as follows:
Lemma A.1 (Key).
We now prove this lemma. Recall that is invariant under scaling and shifting of the data. In the proof, we assume without loss of generality that and that the length of each row vector of is : for . Then . Almost sure convergence in the limiting regime is denoted by . The proof is divided into two parts: the first assumes A3(b), and the second assumes A3(a).
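The invariance used in this normalization can be illustrated numerically. The following is a minimal sketch, assuming a p-by-n data matrix whose rows are variables; the helper name `sample_correlation` and the chosen sizes are illustrative and not taken from the manuscript.

```python
import numpy as np

def sample_correlation(X):
    """Sample correlation matrix of a p-by-n data matrix (rows = variables)."""
    Xc = X - X.mean(axis=1, keepdims=True)             # center each row
    U = Xc / np.linalg.norm(Xc, axis=1, keepdims=True)  # unit-length rows
    return U @ U.T

rng = np.random.default_rng(0)
p, n = 5, 40                                  # illustrative sizes
X = rng.standard_normal((p, n))
scale = rng.uniform(0.5, 3.0, size=(p, 1))    # positive row-wise scaling
shift = rng.normal(size=(p, 1))               # row-wise shift

R_original = sample_correlation(X)
R_transformed = sample_correlation(scale * X + shift)

# Largest entrywise difference; it is zero up to rounding error.
invariance_gap = float(np.abs(R_original - R_transformed).max())
print(invariance_gap)
```

Because each row is centered and then rescaled to unit length, any positive row-wise scaling and any row-wise shift cancel, which is why the proof may fix the mean and the row lengths without loss of generality.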
Proof under A3(b).
Let . We prove that . It suffices to confirm
(9)
Let . Because of , we have
(10)
Thus
Here for , . As a result, is at most
(11)
Then, for each possible and , we can prove that each of the four terms of (11) tends to 0 almost surely, as follows:
- (i) is an array of i.i.d. random variables with mean . Use Lemma B.5 with .
- (ii) for is an array of centered i.i.d. random variables, since and are independent and centered. Use Lemma B.5 with .
- (iii)
- (iv)
Hence, (9) is confirmed.
Now we prove that as . It suffices to show
(12)
Let be the sample average of the th row of . Then by ,
(13)
The first term of the right side converges almost surely to 0, by (9).
As for the second term , from the definition of , , and , we get
The first term of the right side converges almost surely to 0, as for each , is an array of centered i.i.d. random variables. The second term of the right side converges almost surely to 0 by Lemma B.5, since is a double array of i.i.d. random variables. Therefore the right side of (13) converges to 0 almost surely. Consequently, (12) is guaranteed. Q.E.D.
Proof under A3(a).
This part is inspired by the proof of Lemma 4 in El Karoui (2009). First, we truncate the random variables . Let
be the matrix with truncated entries, where ‘’ is the indicator function and . Then we re-center the entries of by defining
where
and is the matrix of unity. Let be the entries of .
We will prove that replacing with changes the diagonal entries of and only negligibly. By A3 and (El Karoui 2009, Lemma 2), almost surely for large enough , we have , so almost surely, for large enough , the truncation does not modify and . Also, is invariant under translation, so the centering does not affect . For , because is centered and has a finite fourth moment by A2, we get
Here the right side is . Thus . In the limiting regime ,
Let be the th row of data after truncation and re-centering. As we have assumed without loss of generality, we can write
Let be the th row vector of . Since we assumed the length of each row vector of is 1 at the beginning of this proof, we get and . Hence
Now define the function by
From the proof of (El Karoui 2009, Lemma 4), is convex and -Lipschitz (as ). Let and be the distributions of and , respectively. Conditioning on (i.e., for every fixed ), we apply (Ledoux 2001, Corollary 4.10, pp. 77–78, Eq. (4.10)) to the variables . By , it holds that for all ,
(14)
Here (, , resp.) represents a conditional median (the conditional expectation, the conditional variance, resp.) given . By Tomkins (1975), is measurable, so we can integrate (14) with by . Then, we get, for all ,
Readily,
The Borel-Cantelli lemma yields
(15)
On the other hand, by Mallows (1991), and then by applying (Ledoux 2001, Proposition 1.9) to with inequality (14), we have
(16)
where is an absolute constant independent of . By (10) and the first two arguments (i, ii) just below (11), we can check
Moreover, by Lebesgue’s dominated convergence theorem, the premise implies . Hence
As a result, by the second inequality in (16),
Again by (16),
which, combined with (15), yields
For , we use arguments similar to those of El Karoui (2009), with the above adaptations. We omit the details. The proof is complete. Q.E.D.
A.3 Proof of Theorems 2.6, 2.7 and 2.8
To establish Theorem 2.6, by Theorem 2.3 and , it is enough to determine the asymptotic locations of the largest eigenvalues of the sample covariance matrix of a CLFM.
Theorem A.2.
If a CLFM satisfies A4, then for ,
(17)
(18)
Proof of (17).
Let . For , by the definition (1) and the model setting, we observe
By Corollary B.3,
Dividing the above inequality by ,
(19)
Here by A4. For , (a.s.) by Proposition B.6 (1). Thus (a.s.).
Proof of (18).
Proof of Theorem 2.7.
It is well-known that for any fixed . By Theorem 2.6 along with condition (5), it holds almost surely that and . As a result, for sufficiently large , it follows almost surely that , and . Thus, (a.s.).
We can demonstrate (a.s.) mutatis mutandis. Q.E.D.
Now we prove Theorem 2.8. For an asymptotic CLF, we establish the following result for the largest and the second largest eigenvalues of sample correlation matrices and , and then proceed as in the proof of Theorem 2.7.
Theorem A.3.
Under the same conditions as Theorem 2.8, for , the following hold almost surely:
Appendix B Some useful lemmas
The following inequality of Weyl is well known (Horn & Johnson 2013, Theorem 4.3.1).
Proposition B.1 (Weyl’s inequality).
Let and be Hermitian matrices, and let . Then
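The displayed inequalities are not reproduced above. As a hedged illustration, the following sketch checks numerically one standard form of Weyl's inequality, assuming eigenvalues are sorted in decreasing order with 1-based indices: λ_{i+j-1}(A+B) ≤ λ_i(A) + λ_j(B) whenever i+j-1 ≤ n, and λ_i(A) + λ_j(B) ≤ λ_{i+j-n}(A+B) whenever i+j ≥ n+1. The ordering convention and the helper names are assumptions and may differ from the notation of the manuscript.

```python
import numpy as np

def eig_desc(H):
    """Eigenvalues of a Hermitian matrix in decreasing order."""
    return np.linalg.eigvalsh(H)[::-1]

def random_hermitian(rng, n):
    G = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    return (G + G.conj().T) / 2

def weyl_violations(A, B):
    """Worst violation of the assumed upper and lower Weyl bounds.

    Both returned values are <= 0 (up to rounding) when the bounds hold.
    """
    n = A.shape[0]
    a, b, s = eig_desc(A), eig_desc(B), eig_desc(A + B)
    upper, lower = -np.inf, -np.inf
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            if i + j - 1 <= n:   # lambda_{i+j-1}(A+B) <= lambda_i(A) + lambda_j(B)
                upper = max(upper, s[i + j - 2] - (a[i - 1] + b[j - 1]))
            if i + j >= n + 1:   # lambda_i(A) + lambda_j(B) <= lambda_{i+j-n}(A+B)
                lower = max(lower, (a[i - 1] + b[j - 1]) - s[i + j - n - 1])
    return float(upper), float(lower)

rng = np.random.default_rng(0)
A = random_hermitian(rng, 6)
B = random_hermitian(rng, 6)
weyl_upper, weyl_lower = weyl_violations(A, B)
print(weyl_upper, weyl_lower)
```

A random test of this kind cannot prove the inequality; it only guards against an indexing or ordering mistake when the bounds are transcribed.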
An analogous theorem on the singular values of matrices can be established.
Proposition B.2 (Bai & Silverstein 2010, Theorem A.8).
Let and be two complex matrices. Then for any nonnegative integers ,
The conjugate transpose of a complex matrix is denoted by .
Proof.
Apply the second inequality of Proposition B.1 to Hermitian matrices and . Q.E.D.
Using this proposition, we establish the small-size perturbation inequality for singular values.
Corollary B.3.
Let and be the same as in Proposition B.2. Then for ,
Proof.
In Proposition B.2, taking and replacing by , we get
The other part follows from the above inequality by , noting that . Q.E.D.
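Since the displays of Proposition B.2 and Corollary B.3 are lost in this copy, the sketch below checks the forms they are assumed to take. The first is s_{i+j+1}(A+B) ≤ s_{i+1}(A) + s_{j+1}(B) for nonnegative integers i and j, which is the statement of Theorem A.8 in Bai & Silverstein (2010) as usually quoted. The second is the consequence |s_i(A+B) − s_i(A)| ≤ s_1(B). Singular values are taken in decreasing order; the function names are illustrative.

```python
import numpy as np

def singular_values(M):
    """Singular values of M in decreasing order."""
    return np.linalg.svd(M, compute_uv=False)

def additive_gap(A, B):
    """Max of s_{i+j+1}(A+B) - s_{i+1}(A) - s_{j+1}(B) over valid i, j >= 0."""
    a, b, s = singular_values(A), singular_values(B), singular_values(A + B)
    k = len(s)
    return float(max(s[i + j] - a[i] - b[j]
                     for i in range(k) for j in range(k - i)))

def perturbation_gap(A, B):
    """Max over i of |s_i(A+B) - s_i(A)|, minus s_1(B)."""
    a, b, s = singular_values(A), singular_values(B), singular_values(A + B)
    return float(np.max(np.abs(s - a)) - b[0])

rng = np.random.default_rng(1)
p, n = 4, 7                                   # illustrative sizes
A = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))
B = 0.1 * (rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n)))

gap_add = additive_gap(A, B)     # <= 0 when the additive bound holds
gap_pert = perturbation_gap(A, B)  # <= 0 when the perturbation bound holds
print(gap_add, gap_pert)
```

The second quantity is the one that matters for small perturbations: each singular value of A moves by at most the largest singular value of B.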
Proposition B.4 (Bai & Silverstein 2010, Theorem A.10).
Let and be complex matrices of order and . For any , we have
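The inequality itself is missing from this copy. A minimal numerical sketch follows, assuming it is the multiplicative bound s_{i+j+1}(AB) ≤ s_{i+1}(A) s_{j+1}(B) for i, j ≥ 0, which is how Theorem A.10 of Bai & Silverstein (2010) is usually stated. Singular values beyond the number available are treated as zero; the helper names and sizes are illustrative.

```python
import numpy as np

def padded_singular_values(M, length):
    """Decreasing singular values of M, truncated or zero-padded to `length`."""
    s = np.linalg.svd(M, compute_uv=False)
    out = np.zeros(length)
    out[:len(s)] = s[:length]
    return out

def product_gap(A, B):
    """Max of s_{i+j+1}(AB) - s_{i+1}(A) * s_{j+1}(B) over valid i, j >= 0."""
    k = min(A.shape[0], B.shape[1])        # number of singular values of AB
    a = padded_singular_values(A, k)
    b = padded_singular_values(B, k)
    s = padded_singular_values(A @ B, k)
    return float(max(s[i + j] - a[i] * b[j]
                     for i in range(k) for j in range(k - i)))

rng = np.random.default_rng(3)
p, n, m = 4, 6, 5                          # A is p-by-n, B is n-by-m
A = rng.standard_normal((p, n)) + 1j * rng.standard_normal((p, n))
B = rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m))

gap_prod = product_gap(A, B)               # <= 0 when the bound holds
print(gap_prod)
```

With i = j = 0 this reduces to the familiar submultiplicativity of the spectral norm.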
To argue forms of the law of large numbers in the proportional limiting regime, we make use of the following:
Lemma B.5 (Bai & Yin 1993, Lemma 2).
Suppose that is a double array of i.i.d. random variables. Let , and be constants. Then,
For all square matrices and of order , the characteristic polynomial of is that of . For two matrices and of size and , respectively, the matrix has the same nonzero eigenvalues as .
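The two facts above can be checked numerically through characteristic polynomials. The sketch assumes A is p-by-n and B is n-by-p with p ≤ n, so that the characteristic polynomial of BA equals x^(n-p) times that of AB; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 3, 5                        # illustrative sizes with p <= n
A = rng.standard_normal((p, n))
B = rng.standard_normal((n, p))

# np.poly returns the coefficients of det(xI - M), highest degree first.
poly_AB = np.poly(A @ B)           # degree p
poly_BA = np.poly(B @ A)           # degree n

# det(xI_n - BA) = x^(n-p) det(xI_p - AB): the leading p+1 coefficients
# agree and the trailing n-p coefficients vanish.
leading_match = bool(np.allclose(poly_BA[:p + 1], poly_AB, atol=1e-8))
trailing_zero = bool(np.allclose(poly_BA[p + 1:], 0.0, atol=1e-8))

# Square case: AB and BA have the same characteristic polynomial.
C = rng.standard_normal((p, p))
D = rng.standard_normal((p, p))
square_match = bool(np.allclose(np.poly(C @ D), np.poly(D @ C), atol=1e-8))
print(leading_match, trailing_zero, square_match)
```

Comparing polynomial coefficients avoids having to pair up complex eigenvalues, which is fragile in floating point.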
In the following, the first assertion is due to (Yin et al. 1988, Theorem 3.1) and the second to (Jiang 2004, (2.7)).
Proposition B.6.
Let satisfy the following subarray condition: There is an infinite matrix such that
- all the entries of are centered i.i.d. real random variables having unit variances and the finite fourth moments; and
- for each , the matrix is the top-left submatrix of where .
Moreover, let . Then
(1)
(2)
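The two limits are not legible in this copy. As a hedged illustration, the simulation below assumes that both assertions state convergence of a largest eigenvalue to (1 + √c)², where c is the limit of the ratio of dimension to sample size: the first for the sample covariance matrix, as in the classical result of Yin et al. (1988), and the second for the sample correlation matrix. The Gaussian entries, the sizes, and the variable names are choices made here, not taken from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 200, 800                      # illustrative sizes, ratio c = p/n
c = p / n
X = rng.standard_normal((p, n))      # centered i.i.d. entries, unit variance

# Largest eigenvalue of the (uncentered) sample covariance matrix.
S = X @ X.T / n
lam_cov = float(np.linalg.eigvalsh(S)[-1])

# Largest eigenvalue of the sample correlation matrix (rows = variables).
R = np.corrcoef(X)
lam_cor = float(np.linalg.eigvalsh(R)[-1])

right_edge = float((1 + np.sqrt(c)) ** 2)
print(lam_cov, lam_cor, right_edge)
```

At finite size both eigenvalues sit slightly away from the limiting value, so any comparison should allow a tolerance.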
By the empirical spectral measure of an semi-definite matrix , we mean a probability measure such that
If the empirical spectral measure of converges weakly to a probability measure in a given limiting regime, we call the limiting spectral measure of .
The Marčenko-Pastur probability measure of index and scale parameter has the probability density function
with an additional point mass of value at the origin , where . is called the right-edge.
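For concreteness, the sketch below evaluates the density in its standard form, assuming index c and scale parameter σ²: f(x) = √((b − x)(x − a)) / (2π c σ² x) on [a, b], with a = σ²(1 − √c)² and b = σ²(1 + √c)², and with point mass 1 − 1/c at the origin when c > 1. These symbols are assumptions, since the formula itself is missing from this copy.

```python
import numpy as np

def mp_edges(c, sigma2=1.0):
    """Left and right edges of the Marchenko-Pastur support."""
    return sigma2 * (1 - np.sqrt(c)) ** 2, sigma2 * (1 + np.sqrt(c)) ** 2

def mp_density(x, c, sigma2=1.0):
    """Marchenko-Pastur density (continuous part) at the points x."""
    a, b = mp_edges(c, sigma2)
    x = np.asarray(x, dtype=float)
    inside = (x > a) & (x < b)
    out = np.zeros_like(x)
    xi = x[inside]
    out[inside] = np.sqrt((b - xi) * (xi - a)) / (2 * np.pi * c * sigma2 * xi)
    return out

def midpoint_integral(f, lo, hi, m=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    h = (hi - lo) / m
    x = lo + h * (np.arange(m) + 0.5)
    return float(np.sum(f(x)) * h)

# c < 1: no point mass, so the density integrates to 1 and has mean sigma2.
a, b = mp_edges(0.5, 2.0)
total_mass = midpoint_integral(lambda x: mp_density(x, 0.5, 2.0), a, b)
mean_value = midpoint_integral(lambda x: x * mp_density(x, 0.5, 2.0), a, b)

# c > 1: the continuous part carries mass 1/c; the rest sits at the origin.
a2, b2 = mp_edges(2.0)
continuous_mass = midpoint_integral(lambda x: mp_density(x, 2.0), a2, b2)
print(total_mass, mean_value, continuous_mass)
```

The value b returned by `mp_edges` corresponds to the right-edge mentioned above.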
Lemma B.7.
In every CLFM, if , then all the limiting spectral measures of and are the Marčenko-Pastur probability measure of index and scale parameter .
Proof.
The proof is analogous to that in (Akama 2023, Section 7). Q.E.D.