Asymptotic properties of spiked eigenvalues and eigenvectors of signal-plus-noise matrices with their applications
Abstract
This paper considers a general low-rank signal-plus-noise model in high-dimensional settings.
Specifically, we allow the noise to have a general covariance structure and the signal to be of the same magnitude as the noise.
Our study focuses on the asymptotic properties of the spiked eigenvalues and eigenvectors. As applications, we propose a new criterion for estimating the number of clusters and investigate the properties of spectral clustering.
Xiaoyu Liu1, Yiming Liu1, Guangming Pan2, Lingyue Zhang3, Zhixiang Zhang4
1School of Economics, Jinan University
2School of Physical and Mathematical Sciences, Nanyang Technological University
3School of Mathematical Science, Capital Normal University
4Department of Mathematics, Faculty of Science and Technology, University of Macau
1 Introduction
Consider a signal-plus-noise model of the form
$Y_n = R_n + \Sigma_n^{1/2} X_n$,   (1)
where $R_n$ is a $p \times n$ signal matrix with finite rank, $X_n$ consists of i.i.d. random variables, and $\Sigma_n$ accounts for the covariance structure of the noise.
Such a model is popular in many fields, including machine learning (Yang et al., 2016), matrix denoising (Nadakuditi, 2014) and signal processing (Vallet et al., 2012). When $\Sigma_n$ is an identity matrix, there is a large body of work on the eigenvalues and eigenvectors of such signal-plus-noise type matrices. To name a few, Loubaton and Vallet (2011) derived the almost sure limits of the eigenvalues, Ding (2020) obtained the limits and convergence rates of the leading eigenvalues and eigenvectors, and Bao et al. (2021) derived the distributions of the principal singular vectors and singular subspaces.
When $\Sigma_n$ is set to be a diagonal matrix, Hachem et al. (2013) investigated the limiting behavior of random bilinear forms of the resolvent of the sample covariance matrix under a separable model, which includes the case of diagonal $\Sigma_n$ in (1).
When the signal-to-noise ratio tends to infinity, i.e., when the ratio of the spectral norm of the signal part to that of the noise part diverges, Cape et al. (2019) also considered the asymptotic properties of spiked eigenvectors under Model (1). By imposing Gaussianity on $X_n$, Han et al. (2021) proposed an eigen-selected spectral clustering method with theoretical justifications.
However, the assumptions that $\Sigma_n$ is an identity or diagonal matrix, or that the signal-to-noise ratio tends to infinity, are restrictive and hard to verify in practice. In this paper, we aim to investigate the asymptotic properties of the eigenvalues, as well as both the left and right spiked singular vectors of $Y_n$, under the proportional regime where $p/n \to c \in (0, \infty)$, with mild regularity conditions on $R_n$ and $\Sigma_n$ and mild moment assumptions on the entries of $X_n$. To the best of our knowledge, this is the first systematic study of the eigenvalues and eigenvectors of Model (1) under such mild conditions. Specifically, we consider
$Q_1 = Y_n Y_n^*$,   (2)
and
$Q_2 = Y_n^* Y_n$.   (3)
To obtain the asymptotic properties of the spiked eigenvectors of $Q_1$ and $Q_2$, we analyze quadratic forms involving the resolvents $G_1(z)$ and $G_2(z)$ of the matrices $Q_1$ and $Q_2$, defined as
$G_1(z) = (Q_1 - z I_p)^{-1}$   (4)
and
$G_2(z) = (Q_2 - z I_n)^{-1}$,   (5)
respectively, where $I_p$ and $I_n$ refer to identity matrices of comparable sizes. The study of the spiked eigenvalues leverages the main results developed in Liu et al. (2022).
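To fix ideas, the following sketch builds a small numerical instance of Model (1) together with the matrices (2)-(3) and the resolvents (4)-(5). The dimensions, the rank-one signal, and the single covariance spike are illustrative choices only, not the settings used later in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 100, 200
X = rng.standard_normal((p, n)) / np.sqrt(n)      # i.i.d. entries with variance 1/n
R = np.outer(rng.standard_normal(p), np.ones(n)) / np.sqrt(p * n)  # rank-one signal, bounded norm
Sigma = np.eye(p); Sigma[0, 0] = 3.0              # noise covariance with one spike

Y = R + np.linalg.cholesky(Sigma) @ X             # Model (1)
Q1, Q2 = Y @ Y.T, Y.T @ Y                         # the matrices (2) and (3)

z = complex(1.5, 0.1)                             # spectral parameter in C^+
G1 = np.linalg.inv(Q1 - z * np.eye(p))            # resolvent (4)
G2 = np.linalg.inv(Q2 - z * np.eye(n))            # resolvent (5)
print(np.trace(G1) / p)                           # empirical Stieltjes transform at z
```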
To demonstrate the use of the theoretical results, we consider applications in spectral clustering.
When each column of $R_n$ can only be chosen from a finite number of distinct unknown deterministic vectors, (1) can be regarded as a collection of samples generated from a mixture model.
Thus, in vector form, the $j$-th column of Model (1) can be written as
$y_j = \mu_k + \Sigma_n^{1/2} x_j$,   (6)
where $r_j = \mu_k$ for some $k \in \{1, \dots, K\}$ if $j \in \mathcal{I}_k$. The normalizing constant in the variance of the entries of $X_n$ unifies (6) with Assumption 1 below. Here $\mathcal{I}_k \cap \mathcal{I}_l = \emptyset$ for any $k \neq l$, and $K$ actually refers to the number of different distributions (i.e., clusters) in the mixture model. One should also note that the labels are unknown in clustering problems. Mixture models have been investigated extensively. In statistics, Redner and Walker (1984) considered the clustering problem for the Gaussian mixture model in low-dimensional cases, while Cai et al. (2019) considered the high-dimensional case. Several classical clustering techniques have also been proposed in past decades; see, e.g., MacQueen et al. (1967), Bradley et al. (1999), Kaufman and Rousseeuw (1987), Maimon and Rokach (2005) and Duda and Hart (1973).
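As a concrete illustration of (6), the sketch below draws a two-cluster mixture with hypothetical centers and identity noise covariance; all numerical choices are for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, K = 100, 200, 2
centers = rng.standard_normal((p, K)) / np.sqrt(p)   # K distinct mean vectors
labels = rng.integers(0, K, size=n)                  # cluster labels, unknown in practice

X = rng.standard_normal((p, n)) / np.sqrt(n)         # noise with variance-1/n entries
Y = centers[:, labels] + X                           # columns follow (6) with Sigma = I
```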
In empirical economics, mixture models are used to introduce unobserved heterogeneity. An important example of this setup from the econometrics literature is Keane and Wolpin (1997), which investigated the clustering problem in labor markets. Such models also arise in the analysis of some classes of games with multiple Nash equilibria; see, for example, Berry and Tamer (2006), Chen et al. (2014) and others.
Our main theoretical contribution is to precisely characterize the first-order limits of the eigenvalues and eigenvectors of $Q_1$ and $Q_2$.
Two observations based on our main theoretical results are somewhat surprising, as they exhibit some overlap with findings in the literature, albeit in different scopes of problems.
The first is that the limits of the spiked eigenvalues of $Q_1$ coincide with those of a sample covariance matrix without the signal part whose population covariance matrix is $\Sigma_n + R_n R_n^*$; see the discussion below Theorem 2.
The second is that the spiked right singular vectors of $Y_n$ have an intrinsic block structure if $R_n$ contains a finite number of distinct deterministic columns, even for a moderate signal-to-noise ratio. Our Corollary 1 precisely quantifies the deviation of the right singular vector from a vector with entries having a group structure. This finding is highly relevant to the field of spectral clustering, which has been extensively discussed in the literature. It is worth noting that many existing studies assume strong moment conditions on the noise and consider scenarios where the signal-to-noise ratio tends to infinity.
As applications, we propose a method to estimate the number of clusters by leveraging the asymptotic limits of sample eigenvalues. We also discuss how the theoretical results intuitively explain why the spiked eigenvectors have clustering power in the context of spectral clustering.
The remaining sections are organized as follows. In Section 2 we state the main results on the quantities in (4) and (5) under Model (1). Section 3 covers the applications to clustering and classification. In Section 4, we present numerical results for the applications discussed in Section 3. All proofs are relegated to the Appendix and the supplementary material.
Conventions: We use $C$ to denote a generic large constant independent of $n$, whose value may change from line to line. $\mathbf{1}$ and $I$ refer to a vector with all entries equal to one and an identity matrix of comparable size, respectively. We let $\|\cdot\|$ denote the Euclidean norm of a vector or the spectral norm of a matrix.
2 The main results
In this section, we investigate the limits of the eigenvalues and eigenvectors of $Q_1$ and $Q_2$ defined in (2) and (3), respectively.
We first impose some mild conditions on $X_n$ for establishing the asymptotic limits of the eigenvalues and eigenvectors:
Assumption 1.
We assume that $X_n = (x_{ij})$ is a $p \times n$ matrix whose entries are independent real random variables satisfying
$\mathbb{E}\, x_{ij} = 0$, $\mathbb{E}\, x_{ij}^2 = 1/n$, and $\sup_{i,j} \mathbb{E}\, |\sqrt{n}\, x_{ij}|^4 \le C$.   (7)
We consider the high dimensional setting specified by the following assumption.
Assumption 2.
$c_n := p/n \to c \in (0, \infty)$ as $n \to \infty$.
Note that when $R_n$ has bounded rank, the limiting spectral distribution of $Q_1$ is the same as that of the model with $R_n$ set to a zero matrix. This follows directly from the rank inequality; see Theorem A.43 of Bai and Silverstein (2010). However, to investigate the limiting behavior of the spiked eigenvalues and eigenvectors under Model (1), more assumptions on $R_n$ and $\Sigma_n$ are required.
Assumption 3.
Let $R_n$ be a $p \times n$ matrix with bounded spectral norm and finite rank $K$, and let $\Sigma_n$ be a $p \times p$ symmetric matrix with bounded spectral norm. Denote the singular value decomposition of $R_n$ by $R_n = \sum_{k=1}^{K} d_k u_k v_k^*$ with $d_1 \ge d_2 \ge \cdots \ge d_K > 0$.
Remark 1.
In this paper, we consider the case where the leading eigenvalues of $R_n R_n^*$ are bounded; a similar strategy can be adapted to investigate the case of divergent spikes.
Moreover, one can also allow $K$ to tend to infinity at a slow rate, but we do not pursue this here.
The key technical tool is the deterministic equivalents of the quadratic forms of $G_1(z)$ and $G_2(z)$ in (4) and (5).
We introduce them first, as they require weaker assumptions than the main results on spiked eigenvectors and eigenvalues and may be of independent interest.
For any $z \in \mathbb{C}^+$, let $m(z)$ be the unique solution with positive imaginary part to the equation
$m(z) = -\left( z - c_n \int \frac{t}{1 + t\, m(z)}\, dH_n(t) \right)^{-1},$   (8)
where $H_n$ is the empirical spectral distribution of $\Sigma_n$. Proposition 1 below provides the deterministic equivalents of the quadratic forms of $G_1(z)$ and $G_2(z)$.
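Assuming the Marchenko-Pastur form of (8) reconstructed above, $m(z)$ can be computed by a simple fixed-point iteration; a minimal sketch:

```python
import numpy as np

def stieltjes_mp(z, sigma_eigs, c_n, n_iter=500):
    # Solve (8) by fixed-point iteration under the assumed MP form.
    # sigma_eigs discretize H_n (eigenvalues of Sigma_n); c_n = p / n.
    m = -1.0 / z                                   # initial guess, valid for large |z|
    for _ in range(n_iter):
        integral = np.mean(sigma_eigs / (1.0 + sigma_eigs * m))
        m = -1.0 / (z - c_n * integral)
    return m

# Example: Sigma_n = I and c_n = 0.5, evaluated just above the real axis.
print(stieltjes_mp(complex(4.0, 1e-3), np.ones(300), 0.5))
```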
Proposition 1.
Suppose that Assumptions 1 to 3 are satisfied. Let $\{s_n\}$ and $\{t_n\}$ be sequences of deterministic vectors of unit norm. Then, for any $z$ whose imaginary part is bounded from below by a positive constant, we have
(9)
where
(10)
and
(11)
where
(12)
Remark 2.
The model we study is similar to that in Hachem et al. (2013), and the proof of Proposition 1 leverages the main result therein. The main difference between our Proposition 1 and their results is that we study the case of a general $\Sigma_n$, while they consider a model with a separable variance profile, in which the covariance structure of the noise factors into two diagonal matrices.
There are two features of Proposition 1 that are worth mentioning here. First, the deterministic equivalents of both quadratic forms involve the quantity $m(z)$, which is actually the Stieltjes transform of the generalized Marchenko-Pastur law; see Bai and Silverstein (2010) for instance. This is hidden in Hachem et al. (2013), as their results hold for a general signal matrix instead of one of finite rank. Second, when $R_n$ has columns with the structure specified in (6), which is of statistical interest, especially in the context of spectral clustering, the deterministic equivalent has a block structure, as it takes a block-wise constant form for some constants and some matrix. This can also be inferred from the observation that $V$ can also be written in such a form.
Based on Proposition 1, we now first focus on the eigenvectors corresponding to the spiked eigenvalues of $Q_1$ and $Q_2$. The following assumption is needed.
Assumption 4.
Under the spectral decomposition of $R_n$ specified in Assumption 3, we assume that $K \le C$ for some constant $C$ independent of $p$ and $n$. For $1 \le k \le K$, the singular value $d_k$ satisfies
(13)
where $H$ is the limiting spectral distribution of $\Sigma_n$.
Remark 3.
Assumption 4 is a variant of the condition given in Definition 4.1 of Bai and Yao (2012); it ensures that the $K$ largest eigenvalues of $Q_1$ are simple spiked eigenvalues, and that the gaps between adjacent spiked eigenvalues have a constant lower bound with probability tending to one.
Let $u_k$ and $v_k$ be the eigenvectors associated with the $k$-th largest (spiked) eigenvalue of $Q_1$ and $Q_2$, respectively. The following theorem characterizes the asymptotic behaviour of $u_k$ and $v_k$.
Theorem 1.
Under Assumptions 1 to 4, for any $1 \le k \le K$ and
any sequences of deterministic unit vectors $\{s_n\}$ and $\{t_n\}$, we have
1.
(14)
where the quantities involved are defined as the real solutions to the following equation in $x$:
(15)
2.
(16)
where
(17)
Remark 4.
It is worth mentioning that the first-order behaviour of the left spiked singular vectors of $Y_n$ is the same as that of the corresponding sample covariance matrix; see the main results in Mestre (2008b) and Table 5 below, demonstrated via simulation. However, the behaviour of the right singular vectors is significantly distinct. Specifically, when the entries of the data matrix are Gaussian, the matrix composed of the right eigenvectors of a sample covariance matrix
is asymptotically Haar distributed. This observation contrasts with the second assertion of Theorem 1.
In addition, it is noteworthy that when $\Sigma_n$ is the identity, the model reduces to the one studied in Ding (2020) and Bao et al. (2021).
In these studies, the results on the left and right singular vectors of $Y_n$ are symmetric due to the symmetry of the model structure. However, for a general $\Sigma_n$, we cannot deduce the properties of the right singular vectors solely from those of the left singular vectors, and vice versa. We further discuss the relationship between our results and those of Ding (2020) below Theorem 2.
The asymptotic behaviour of the spiked eigenvalues is also considered, and thus some more notation is required.
Similar to Bai and Yao (2012), for a spiked eigenvalue $\alpha$ outside the support of $H$, we define the function $\psi$ by
(18)
where $m$ is regarded as the function defined in (8) with its domain extended to the real line. As defined in Bai and Yao (2012), a spiked eigenvalue $\alpha$ is called a distant spike if $\psi'(\alpha) > 0$, which coincides with Assumption 4, and a close spike if $\psi'(\alpha) \le 0$. Note that $Q_1$ and $Q_2$ share the same nonzero eigenvalues, which we denote by $\lambda_1 \ge \lambda_2 \ge \cdots$.
Theorem 2.
Suppose that Assumptions 1 to 4 hold. Then, for each distant spiked eigenvalue, the corresponding sample eigenvalue $\lambda_k$ converges almost surely to $\psi$ evaluated at that spike:
(19)
In particular, the above result still holds for distant spiked eigenvalues with multiplicity larger than one. Moreover, for a nonspiked eigenvalue $\lambda_j$, we have
(20)
which holds uniformly, where $\gamma_j$ is the corresponding quantile of the limiting spectral distribution.
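For intuition, the sketch below evaluates the spike map and the distant-spike condition numerically. The closed form used here, $\psi(\alpha) = \alpha + c\,\alpha \int t/(\alpha - t)\,dH(t)$, is the classical Bai-Yao expression and is an assumption for illustration; the exact $\psi$ of (18) is defined through the function $m$ in (8).

```python
import numpy as np

def psi(alpha, pop_eigs, c):
    # Assumed Bai--Yao spike map; pop_eigs discretize the bulk distribution H.
    return alpha + c * alpha * np.mean(pop_eigs / (alpha - pop_eigs))

def is_distant_spike(alpha, pop_eigs, c, eps=1e-4):
    # Distant spike iff psi'(alpha) > 0, checked by a central finite difference.
    d = (psi(alpha + eps, pop_eigs, c) - psi(alpha - eps, pop_eigs, c)) / (2 * eps)
    return d > 0

bulk = np.ones(500)                    # non-spiked population eigenvalues
print(psi(4.0, bulk, 0.5))             # predicted location of the sample spike
print(is_distant_spike(4.0, bulk, 0.5))
```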
Remark 5.
By the main results in Bai and Yao (2012), one can obtain the limits of the spiked eigenvalues of a sample covariance matrix with population covariance $\Sigma_n + R_n R_n^*$. Theorem 2 indicates that
the asymptotic limits of the spiked eigenvalues of $Q_1$ are the same as those. See Table 5 below for an illustration.
Related work.
Ding (2020) investigated the limits of the spiked eigenvalues and eigenvectors of a signal-plus-noise model where $\Sigma_n$ is the identity. Bao et al. (2021) further obtained the fluctuations of quadratic forms of the left and right spiked eigenvectors of a signal-plus-noise model with identity noise covariance. The model we consider includes both of these models as special cases, and our results show that the sample spiked eigenvalues can originate either from the spikes in the signal matrix $R_n$ or from spikes in $\Sigma_n$, the covariance matrix of the noise.
We verify that our main results match the corresponding parts of Ding (2020). As the first $K$ eigenvectors of $Q_1$ are the same as the left singular vectors of $Y_n$, the singular value decomposition of $Y_n$ can be written accordingly, with $u_k$ the eigenvector associated with $\lambda_k$. Theorem 2 then implies the stated limits.
These limits coincide with the quantities defined in (2.6) and (2.9) of Ding (2020), respectively.
One may wonder whether the spiked eigenvalues and eigenvectors of $Q_1$ have the same asymptotic distributions as those of the corresponding sample covariance matrix, given that their first-order limits coincide.
Several recent studies have investigated the latter model, including Jiang and Bai (2021), Zhang et al. (2022) and Bao et al. (2022). Through simulations, we observe different asymptotic variances between the two models, as indicated by Table 5.
The aforementioned theoretical results are all built on $Q_1$ or $Q_2$, which are noncentered covariance matrices. In some situations, the centered versions are also of interest. Specifically, we consider the corresponding centered covariance matrices
and
where the centering is taken with respect to the sample mean. Denote the spectral decomposition of the centered signal part accordingly; the number of spikes here may or may not equal $K$, depending on the case.
Moreover, define the corresponding resolvents of the two centered matrices, respectively:
Based on these notations, we establish the corresponding results for the centered sample covariance matrices:
Proposition 2.
Suppose that Assumptions 1 and 2 are satisfied, and replace $R_n$ in Assumption 3 by its centered counterpart. Then, we have
where
Relying on Proposition 2, we also have the following conclusions for the spiked eigenvalues and the corresponding eigenvectors of the centered matrices.
Theorem 3.
Assume that the conditions of Proposition 2 are satisfied, with the quantities in Assumption 4 replaced by their centered counterparts. By replacing $Q_1$, $Q_2$, and their related symbols with the counterparts of the centered matrices,
the conclusions in Theorems 1 and 2 still hold.
3 Applications
In this section, based on the results in Section 2, we develop some potential applications.
Spectral clustering is used frequently in data science, and the theoretical underpinnings of the method have received extensive interest in recent years; see, e.g., Couillet et al. (2016), Zhou and Amini (2019) and Löffler et al. (2021). This section provides a deeper insight into spectral clustering based on Model (6). Moreover, we also propose a new criterion to estimate the number of clusters.
Recalling (6), for any $j \in \mathcal{I}_k$, we have $y_j = \mu_k + \Sigma_n^{1/2} x_j$. Let the membership vectors be defined entrywise, with entry equal to one if $j \in \mathcal{I}_k$ and zero otherwise. In matrix form, write the signal part as the matrix of cluster centers multiplied by the membership matrix. Notice that the Gram matrix of the signal part is then constant on each cluster-by-cluster block. This block structure (except at the diagonal positions) is similar to that of stochastic block models (SBM), which motivates the use of spectral clustering for high-dimensional data with different means across groups; see the sketch below.
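The following sketch makes the block structure concrete: when the columns of the signal matrix take only $K$ distinct values, its Gram matrix is constant on each cluster-by-cluster block, and the top-$K$ right singular vectors are block-wise constant. All sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, K = 50, 12, 2
mu = rng.normal(size=(p, K))             # K distinct cluster centers
labels = np.repeat(np.arange(K), n // K)
R = mu[:, labels]                        # column j equals mu_{labels[j]}

gram = R.T @ R                           # n x n, constant on each k-l block
print(np.round(gram[:3, :3], 2))         # within-cluster block (constant entries)
print(np.round(gram[:3, -3:], 2))        # between-cluster block (constant entries)

_, _, Vt = np.linalg.svd(R, full_matrices=False)
print(np.round(Vt[:K], 3))               # rows are constant within each cluster
```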
To do the clustering, it is of interest to estimate the number of clusters, i.e., to estimate $K$. There exist plenty of approaches for this. To name a few, Thorndike (1953) proposed the Elbow method, which aims to minimize the within-group sum of squares (WSS); the Silhouette index (Rousseeuw, 1987) measures how similar an object is to its own cluster compared to other clusters and takes values in $[-1, 1]$; Tibshirani et al. (2001) proposed a gap statistic to estimate the number of clusters; etc. These methods either lack theoretical guarantees or have some restrictions in computation or settings. Hence, we propose a theoretically guaranteed and easily implemented approach to estimate the number of clusters. Notice that under Model (6), the number of spiked eigenvalues of $Q_1$ or $Q_2$ is the same as the number of clusters if the mean vectors of the different clusters are linearly independent. The estimation of the number of spikes in different models has been discussed extensively in the literature, mostly under settings more restrictive than ours; see, e.g., Bai et al. (2018).
Motivated by the work of Bai et al. (2018) and Theorem 2, we propose two criteria to estimate the number of clusters. Without loss of generality, we assume $p \le n$. Let
(21)
where the two criteria EDA and EDB differ in the coefficient of the leading term and in the penalty term (see Remark 6).
Remark 6.
The first two main terms aim to capture the differences between eigenvalues, and the third term penalizes the number of unknown parameters in the model. The values of EDA and EDB are expected to reach their minimum at the true number of clusters. From (21), it can be seen that, as the candidate number of clusters increases, the first and second terms decrease while the third term increases. For more discussion about (21) and the case of $p > n$, one may refer to the supplementary material.
We estimate the number of clusters by
(22)
(23)
where $K_{\max}$ is the prespecified largest candidate number of clusters, satisfying $K \le K_{\max}$. Note that under the conditions of Theorem 2, it follows that for each $k$,
(24)
where the functions involved are defined in (18) and (20), respectively. For simplicity, denote the limit of the $k$-th largest sample eigenvalue by $\ell_k$ for $1 \le k \le K_{\max}$. Define two sequences as follows
(25)
We propose two gap conditions for EDA and EDB, respectively, i.e.,
(26)
(27)
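Since the exact terms of (21) are model-specific, the sketch below only illustrates the generic recipe behind (22) and (23): scan candidate values of $k$, score each candidate by an eigenvalue-difference term plus a penalty, and return the minimizer. The AIC-like coefficient 2 (mimicking EDA) and the BIC-like $\log n$ (mimicking EDB), as well as the score itself, are hypothetical stand-ins rather than the actual EDA/EDB definitions.

```python
import numpy as np

def estimate_k(eigvals, k_max, n, criterion="A"):
    # eigvals: eigenvalues of Q1 (or Q2); lam is sorted in decreasing order.
    lam = np.sort(np.asarray(eigvals))[::-1]
    coef = 2.0 if criterion == "A" else np.log(n)  # assumed penalty coefficients
    scores = []
    for k in range(1, k_max + 1):
        gap = lam[k - 1] - lam[k]                  # eigenvalue-difference term
        scores.append(-coef * gap + k)             # hypothetical score in the spirit of (21)
    return int(np.argmin(scores)) + 1
```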
Remark 7.
The gap condition in Bai et al. (2018) was proposed for population covariance matrices whose distant spikes are larger than one and whose other eigenvalues all equal one. Since the model studied in this paper imposes no restriction on the non-spiked eigenvalues, the gap conditions in (26) and (27) are more easily satisfied and have a wider range of applications.
Note that Theorem 2 and (24) were obtained under the assumption that the leading eigenvalues are bounded. Here we also investigate the case where the leading eigenvalues tend to infinity.
Lemma 1.
In the same setup as Theorem 2, instead of assuming that the spikes are bounded, suppose that they diverge as $n \to \infty$. Then, for any $k$, we have
Based on Theorem 2 and Lemma 1, we derive the consistency of as follows.
Theorem 4.
Under conditions of Theorem 2, if the gap condition (26) does not hold, then is not consistent; if the gap condition holds, then is strongly consistent.
In particular, if tends to infinity, then is strongly consistent.
In Bai et al. (2018), BIC is consistent only when the spikes diverge at a sufficiently fast rate, which makes BIC less capable of detecting signals. This is because BIC has a stricter penalty coefficient than the coefficient "2" in AIC. In the EDB construction for selecting the number of clusters, we add a coefficient to the first term so that the spikes need not be very large and only the corresponding gap condition for EDB is required. By a proof strategy analogous to that of Theorem 4, we obtain the consistency of EDB as follows.
Theorem 5.
Under the same setting of Theorem 4,
if the gap condition (27) does not hold, then is not consistent; if (27) holds, then is strongly consistent.
Moreover, if tends to infinity, then is strongly consistent.
Once the estimator of the number of clusters is available, we can conduct spectral clustering. Specifically,
let the eigenvectors corresponding to the first $\hat{K}$ eigenvalues of $Q_2$ be given. We then apply the following K-means optimization to these eigenvectors, i.e.,
(28)
where the minimum is taken over all partitions of the columns into $\hat{K}$ groups. We then return the minimizing partition as the indices for each cluster. From (28), we see that spectral clustering is conducted on the obtained eigenvectors, and hence we look into their properties; a minimal implementation sketch follows.
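A minimal sketch of this pipeline, using the top right singular vectors of the data matrix (equivalently, the leading eigenvectors of $Q_2$) followed by K-means as in (28); scikit-learn's KMeans is one convenient solver for the optimization step.

```python
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(Y, k_hat, seed=0):
    # Embed the n columns of Y via the top-k_hat right singular vectors,
    # then cluster the embedded points with K-means, per (28).
    _, _, Vt = np.linalg.svd(Y, full_matrices=False)
    embedding = Vt[:k_hat].T                        # n x k_hat spectral embedding
    km = KMeans(n_clusters=k_hat, n_init=10, random_state=seed).fit(embedding)
    return km.labels_
```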
Corollary 1.
Under the conditions of Theorem 1, within the set of all deterministic unit vectors, the block-wise constant vector maximizes the non-random term in (16), and
(29)
Moreover, let be the eigenvectors corresponding to the largest eigenvalues of , where .
For any deterministic matrix that contains column vectors of unit length, we have
(30)
where
Remark 8.
From Corollary 1, we see that if the spikes tend to infinity and the cluster sizes are proportional to $n$ with a positive constant independent of $n$, then the right-hand side of (29) converges to zero in probability. Consequently, the sample eigenvector is an asymptotically consistent estimator of its population counterpart. Note that the population counterpart has $K$ distinct values, representing the different means. Hence, under mild conditions, there are $K$ distinct rows in the eigenvector matrix, and one can use them to find the corresponding clusters. When the spikes are bounded, the sample eigenvector is not a consistent estimator of the block-wise constant vector. However, in this case, following the proof of Theorem 2.2 in Jin (2015), an elementary misclustering error rate for spectral clustering can still be obtained, which is a new observation based on the proposed results.
4 Simulation
In this section, we first evaluate the performance of the proposed criteria for estimating the number of clusters, as discussed in Section 3. Denote the sets of under-estimated, exactly estimated and over-estimated models accordingly, i.e.,
The selection percentages corresponding to these three sets are computed over repeated replications. Suppose that the entries of $X_n$ are i.i.d. with one of the following distributions (a sketch of the standardizations appears after this list):
• Standard normal distribution: $N(0, 1)$.
• Standardized $t$ distribution with 8 degrees of freedom.
• Standardized Bernoulli distribution with success probability $1/2$.
• Standardized chi-square distribution with 3 degrees of freedom.
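A sketch of generators for these standardized distributions (each scaled to mean 0 and variance 1); the $t(8)$ and Bernoulli($1/2$) parameters follow the reconstruction above.

```python
import numpy as np

rng = np.random.default_rng(0)

def standardized(dist, size):
    # Draw mean-0, variance-1 entries from the listed distributions.
    if dist == "normal":
        return rng.standard_normal(size)
    if dist == "t8":
        df = 8
        return rng.standard_t(df, size) / np.sqrt(df / (df - 2))
    if dist == "bernoulli":
        p = 0.5
        return (rng.binomial(1, p, size) - p) / np.sqrt(p * (1 - p))
    if dist == "chi2_3":
        df = 3
        return (rng.chisquare(df, size) - df) / np.sqrt(2 * df)
    raise ValueError(dist)
```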
For comparison, three alternative methods are also considered: the Average Silhouette Index (Rousseeuw, 1987), the Gap Statistic (Tibshirani et al., 2001) and BIC with degrees of freedom (David, 2020), denoted by ASI, GS and BICdf, respectively. This section considers the situations with $p \le n$, and the cases with $p > n$ are demonstrated in the supplementary material.
Here we set the sample sizes and the largest number of possible clusters $K_{\max}$. The different cluster means and the covariance matrices are set as follows:
Case 1. Let , , ,
, and , where . Define
where . Therefore, the true number of clusters is .
Case 2. Let , , ,
, where is the identity matrix of size . Then,
where . Therefore, the true number of clusters is .
Case 3. The same setting as in Case 2 above, with in place of , where .
The spikes in the above cases are bounded. We also consider a case in which the spikes diverge at a fast rate.
Case 4. Let , , ,
, and let the sample size of the cluster corresponding to each center be , so that the true number of clusters is . Suppose , where , .
Tables 1 to 4 report the percentages of under-estimation, exact estimation and over-estimation under 1000 replications. From the reported results, we see that the criteria based on EDA and EDB perform better and better as the dimensions become larger. The probabilities of under-estimating the number of clusters increase as the dimension ratio grows; from (25), the larger the ratio is, the harder the gap conditions are to satisfy.
EDB generally outperforms EDA except when the dimensions are large. It can be seen that, as the dimension ratio increases, the probability of under-estimation by EDB becomes larger and is uniformly greater than that of EDA. This is because the coefficient in the penalty term of the EDB criterion differs from that of EDA, so the gap condition of EDB is stronger than that of EDA; that is, (27) is more difficult to satisfy than (26). The criteria based on EDA and EDB show the highest accuracy under the Bernoulli distribution, followed by the normal, while the chi-square distribution, with its relatively heavy right tail, can be destructive to the results.
Table 1: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 1.
[Columns: EDA, EDB, ASI, GS and BICdf, in two setting blocks; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
Table 2: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 2.
[Columns: EDA, EDB, ASI, GS and BICdf, in two setting blocks; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
Table 3: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 3.
[Columns: EDA, EDB, ASI, GS and BICdf, in two setting blocks; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
Table 4: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 4.
[Columns: EDA, EDB, ASI, GS and BICdf, in two setting blocks; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
At the end of this section, we use a simple simulation to demonstrate the matching of the left spiked eigenvectors and spiked eigenvalues between a signal-plus-noise matrix and a sample covariance matrix, as discussed in Remarks 4 and 5. Let the signal matrix have two distinct column vectors, built from the first two right singular vectors of a Gaussian matrix, and let the noise covariance contain a spike. Let Model 1 be the signal-plus-noise model (1), where the noise matrix consists of independent Gaussian entries, and let Model 2 be the pure-noise model whose population covariance combines the noise covariance and the signal Gram matrix, with the same noise as in Model 1. There are three spiked eigenvalues satisfying Assumption 4. Table 5 reports the three largest eigenvalues and the corresponding eigenvector statistics, averaged over 500 replications, generated by Models 1 and 2, respectively.
Table 5: The first three eigenvalues and the corresponding eigenvector statistics generated by Models 1 and 2, averaged over 500 replications each (values in parentheses indicate the standard deviations).

Model 1: eigenvalues 11.122 (0.550), 7.574 (0.633), 5.238 (0.261); eigenvector statistics 0.447 (0.052), 0.040 (0.050), 0.377 (0.053).
Model 2: eigenvalues 11.114 (1.026), 7.583 (0.640), 5.212 (0.432); eigenvector statistics 0.444 (0.089), 0.047 (0.065), 0.371 (0.080).
We observe that the first-order limits are almost the same for the two types of models. Moreover, the fluctuation behaviour is possibly different, as can be inferred from the different standard deviations in Table 5.
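The sketch below reproduces the spirit of the Table 5 experiment under assumed dimensions and an assumed rank-two signal with one covariance spike; the exact matrices used in the paper were not recoverable. Model 1 is the signal-plus-noise matrix and Model 2 is a pure sample covariance matrix with population covariance $\Sigma + RR^*$.

```python
import numpy as np

rng = np.random.default_rng(1)
p = n = 200
R = np.zeros((p, n))
R[0, :] = 1.5 / np.sqrt(n)                     # assumed rank-two signal
R[1, : n // 2] = 2.0 / np.sqrt(n)
Sigma = np.eye(p); Sigma[0, 0] = 4.0           # one spike from the noise covariance
pop = Sigma + R @ R.T                          # population covariance of Model 2

def top_eigs(model, reps=100, k=3):
    out = []
    for _ in range(reps):
        X = rng.standard_normal((p, n)) / np.sqrt(n)
        Y = R + np.linalg.cholesky(Sigma) @ X if model == 1 else np.linalg.cholesky(pop) @ X
        out.append(np.sort(np.linalg.eigvalsh(Y @ Y.T))[::-1][:k])
    out = np.array(out)
    return out.mean(axis=0), out.std(axis=0)

print(top_eigs(1))   # first-order limits match across the two models...
print(top_eigs(2))   # ...while the standard deviations can differ
```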
5 Proofs of main results
In this section, we prove the main results of Sections 2 and 3.
Proposition 1 plays an important role in the proof of Theorem 1. To prove Proposition 1, the following Proposition 3 is required, whose proof is provided in the supplementary material. In what follows, we sometimes omit the subscript "n" and use the conjugate transpose in place of the ordinary transpose; the two coincide in the real case. In addition, the proofs of Theorem 2, Theorem 4 and Corollary 1 are also provided.
Proposition 3.
Under the conditions of Proposition 1, for any deterministic unit vectors and , we have
(31)
where
and .
To give the theoretical justifications, we first introduce a necessary lemma.
Lemma 2.
(Woodbury matrix identity) Suppose that $A$ and $C$ are invertible, and that $C^{-1} + V A^{-1} U$ is invertible; then
$(A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}.$
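A quick numerical check of the identity on random matrices (sizes arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
k, m = 6, 2
A = rng.normal(size=(k, k)) + k * np.eye(k)    # well-conditioned invertible A
C = rng.normal(size=(m, m)) + m * np.eye(m)
U, V = rng.normal(size=(k, m)), rng.normal(size=(m, k))

lhs = np.linalg.inv(A + U @ C @ V)
Ainv, Cinv = np.linalg.inv(A), np.linalg.inv(C)
rhs = Ainv - Ainv @ U @ np.linalg.inv(Cinv + V @ Ainv @ U) @ V @ Ainv
print(np.allclose(lhs, rhs))                   # True
```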
Proposition 2.2 in Hachem et al. (2007) yields , and one can see Hachem et al. (2013) as well. Also, by Lemma 2.3 of Silverstein and Bai (1995), there is . Combining the above facts, we have
Thus, a direct calculation shows that
(36)
Next, let
where in solves the equation
and is the empirical spectral distribution of .
If we denote the right-hand side of (35) by , then (35) can be rewritten as
where .
We also let
By the definition of , this equation can be further written as
(37)
where
We have that , , and . Then it follows that .
With equations (8) and (37) at hand, there is
Similar to (6.2.26) in Bai and Silverstein (2010), we also have
Therefore, there is
Using the same arguments as in (5), it follows that
Then the conclusion follows. ∎
To prove Theorem 1, we also need the separation of the spiked eigenvalues of . Recall that .
Lemma 3.
Under assumptions of Theorem 1, for satisfying for , where are given in (8), we have
where is the -th largest eigenvalue of .
Proof.
Theorem 1 in Liu et al. (2022) has shown that the conclusion holds under the additional assumption that the signal matrix contains a finite number of distinct columns. We use two steps to extend this to a general low-rank signal. The first step is to show that for a general low-rank signal the conclusion holds when the noise is Gaussian, and the second step extends it to general noise satisfying Assumption 1.
We begin with the first step. Assume that the noise is Gaussian, and that the signal has a singular value decomposition with its singular values on the first main diagonal positions. Assume the rank-two case for simplicity; the general case is similar. Let an orthogonal matrix be constructed whose first row has non-zero entries, all equal, on the first block of coordinates, and whose second row has non-zero entries, all equal, on the last block of coordinates.
Further define the transformed model. Then it becomes Model 1 in Liu et al. (2022), i.e., the columns of the signal part contain two distinct vectors. Therefore the conclusion of this lemma holds for the transformed model, and thus also for the original one, since the transformation matrices are orthogonal.
For the second step, we introduce a continuous interpolation matrix between the Gaussian case and the general case, where both endpoints satisfy the moment conditions in Assumption 1; the interpolated matrix satisfies Assumption 1 for every value of the interpolation parameter. Define the interpolated models accordingly, and denote the $k$-th largest singular value of a matrix by $\sigma_k(\cdot)$. For any value of the parameter, we have
(38)
where the constants involved are positive and independent of $n$. The first and third steps use the fact that the relevant spectral norms are bounded, and the second step uses Weyl's inequality.
Now we can conclude the exact separation by the continuity of eigenvalues together with Proposition 1 in Liu et al. (2022). More specifically, Proposition 1 in Liu et al. (2022) implies that there are no eigenvalues of the interpolated matrices in the relevant interval; therefore (38) implies the claimed separation with probability tending to one.
∎
where , encloses the sample eigenvalues of and excludes all other sample eigenvalues. The existence of is guaranteed by the Assumption 4.
By the Cauchy integral formula, we have
(39)
where is any deterministic unit vector, and represents negatively oriented boundary of .
The proof is in the same spirit as that of Proposition 1 in Mestre (2008b). Since our result provides a convergence rate for the error, we use a slightly different argument by considering the second moment of the left-hand term.
Define an event , which holds with probability tending to one for some small positive constant independent of $n$. We have
(40)
where the first step uses Hölder's inequality and the second step follows from (S.72) and (S.73).
The conclusion follows from Chebyshev's inequality.
∎
The above lemma reduces the proof to calculating the deterministic integral
Let , where is introduced in Proposition 1. We find that satisfies the following equation
which parallels equation (24) in Mestre (2008a). Thus, it satisfies all the properties listed in Proposition 2 of Mestre (2008a). Write , where
(41)
(42)
where is a simple closed curve that includes and excludes all the other population eigenvalues of with negative orientation. By a calculation,
The first assertion can be obtained by an argument similar to the one that leads to (40), and the calculation of the deterministic term is exactly the same as (22) in Mestre (2008b), whose quantities line up with ours in this case.
∎
We first consider (19). Denote the support of by . Under Assumption 4, it is easy to obtain that for . By the continuity of , there exists such that
(43)
and (by default, ). Then, we can find
and
such that and are outside . For , define
where is the ESD of nonspikes. Then,
(44)
Observe that
so that the third and the fourth terms on the right-hand side of (44) converge uniformly to zero as $n \to \infty$. The first term on the right-hand side of (44) converges pointwise to the second one, and both are continuous functions of the argument. Since it can be regarded as a monotone sequence of functions,
by Dini's theorem, the convergence is uniform. Thus it converges uniformly on the relevant set. The proof of the uniform convergence of the remaining quantity
is analogous and omitted here. Hence, from Theorem 4.2 of Silverstein and Choi (1995), combining (43) and the uniform convergence, it follows that both endpoints lie outside the support. Then, using Lemma 3,
Next we turn to the second assertion (20).
Theorems 1.1 and 2.1 in Silverstein and Choi (1995) have shown that has a continuous derivative on given by the imaginary part of its Stieltjes transform , so that
Note that for positive , follows from Lemma 3, which indicates that no eigenvalues lie outside the support of the LSD of and , and there exists satisfying
Thus, with probability 1, for
From the definition of the quantile, we have
(46)
When , we can find such that
(47)
Therefore, with probability 1, there exists such that
(48)
which yields . Hence, (20) follows from (46)-(48). The conclusion follows.
∎
We first verify (29). From (16) we find that for any fixed unit vector ,
(56)
where the first step holds by taking .
Note that is a rank-one matrix whose eigenvector associated with the non-zero eigenvalue is .
Then (29) follows by substituting into (56) and using the fact that .
The second statement (30) follows by noting that minimizes and that its minimum value is likewise obtained from (16).
∎
References
Bai, Z., Miao, B., and Pan, G. (2007). On asymptotics of eigenvectors of large sample covariance matrix. The Annals of Probability, 35(4):1532–1572.
Bai, Z. and Silverstein, J. W. (2010). Spectral Analysis of Large Dimensional Random Matrices, volume 20. Springer.
Bai, Z. and Yao, J. (2012). On sample eigenvalues in a generalized spiked population model. Journal of Multivariate Analysis, 106:167–177.
Bai, Z. D., Choi, K. P., and Fujikoshi, Y. (2018). Consistency of AIC and BIC in estimating the number of significant components in high-dimensional principal component analysis. The Annals of Statistics, 46(3):1050–1076.
Bao, Z., Ding, X., Wang, J., and Wang, K. (2022). Statistical inference for principal components of spiked covariance matrices. The Annals of Statistics, 50(2):1144–1169.
Bao, Z., Ding, X., and Wang, K. (2021). Singular vector and singular subspace distribution for the matrix denoising model. The Annals of Statistics, 49(1):370–392.
Berry, S. and Tamer, E. (2006). Identification in models of oligopoly entry. Econometric Society Monographs, 42:46.
Bradley, P. S., Fayyad, U. M., and Mangasarian, O. L. (1999). Mathematical programming for data mining: Formulations and challenges. INFORMS Journal on Computing, 11(3):217–238.
Cai, T. T., Ma, J., and Zhang, L. (2019). CHIME: Clustering of high-dimensional Gaussian mixtures with EM algorithm and its optimality. The Annals of Statistics, 47(3):1234–1267.
Cape, J., Tang, M., and Priebe, C. E. (2019). Signal-plus-noise matrix models: eigenvector deviations and fluctuations. Biometrika, 106(1):243–250.
Chen, X., Ponomareva, M., and Tamer, E. (2014). Likelihood inference in some finite mixture models. Journal of Econometrics, 182(1):87–99.
Couillet, R., Benaych-Georges, F., et al. (2016). Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics, 10(1):1393–1454.
David, P. H. (2020). Degrees of freedom and model selection for k-means. Computational Statistics and Data Analysis, 149:1–14.
Ding, X. (2020). High dimensional deformed rectangular matrices with applications in matrix denoising. Bernoulli, 26(1):387–417.
Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis, volume 3. Wiley, New York.
Hachem, W., Loubaton, P., and Najim, J. (2007). Deterministic equivalents for certain functionals of large random matrices. The Annals of Applied Probability, 17(3):875–930.
Hachem, W., Loubaton, P., Najim, J., and Vallet, P. (2013). On bilinear forms based on the resolvent of large random matrices. Annales de l'IHP Probabilités et Statistiques, 49:36–63.
Han, X., Tong, X., and Fan, Y. (2021). Eigen selection in spectral clustering: A theory-guided practice. Journal of the American Statistical Association, pages 1–13.
Jiang, D. and Bai, Z. (2021). Generalized four moment theorem and an application to CLT for spiked eigenvalues of high-dimensional covariance matrices. Bernoulli, 27(1):274–294.
Jin, J. (2015). Fast community detection by SCORE. The Annals of Statistics, 43(1):57–89.
Kaufman, L. and Rousseeuw, P. (1987). Clustering by means of medoids.
Keane, M. P. and Wolpin, K. I. (1997). The career decisions of young men. Journal of Political Economy, 105(3):473–522.
Liu, Y., Liang, Y.-C., Pan, G., and Zhang, Z. (2022). Random or nonrandom signal in high-dimensional regimes. IEEE Transactions on Information Theory, 69(1):298–315.
Löffler, M., Zhang, A. Y., and Zhou, H. H. (2021). Optimality of spectral clustering in the Gaussian mixture model. The Annals of Statistics, 49(5):2506–2530.
Loubaton, P. and Vallet, P. (2011). Almost sure localization of the eigenvalues in a Gaussian information plus noise model. Application to the spiked models. Electronic Journal of Probability, 16:1934–1959.
MacQueen, J. et al. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 281–297. Oakland, CA, USA.
Maimon, O. and Rokach, L. (2005). Data Mining and Knowledge Discovery Handbook.
Mestre, X. (2008a). Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates. IEEE Transactions on Information Theory, 54(11):5113–5129.
Mestre, X. (2008b). On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices. IEEE Transactions on Signal Processing, 56(11):5353–5368.
Nadakuditi, R. R. (2014). OptShrink: An algorithm for improved low-rank signal matrix denoising by optimal, data-driven singular value shrinkage. IEEE Transactions on Information Theory, 60(5):3002–3018.
Pan, G. (2014). Comparison between two types of large sample covariance matrices. Annales de l'IHP Probabilités et Statistiques, 50:655–677.
Redner, R. A. and Walker, H. F. (1984). Mixture densities, maximum likelihood and the EM algorithm. SIAM Review, 26(2):195–239.
Rousseeuw, P. J. (1987). Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20:53–65.
Silverstein, J. W. and Bai, Z. (1995). On the empirical distribution of eigenvalues of a class of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):175–192.
Silverstein, J. W. and Choi, S.-I. (1995). Analysis of the limiting spectral distribution of large dimensional random matrices. Journal of Multivariate Analysis, 54(2):295–309.
Thorndike, R. L. (1953). Who belongs in the family? Psychometrika, 18(4):267–276.
Tibshirani, R., Walther, G., and Hastie, T. (2001). Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society, Series B (Statistical Methodology), 63(2):411–423.
Vallet, P., Loubaton, P., and Mestre, X. (2012). Improved subspace estimation for multivariate observations of high dimension: the deterministic signals case. IEEE Transactions on Information Theory, 58(2):1043–1068.
Yang, D., Ma, Z., and Buja, A. (2016). Rate optimal denoising of simultaneously sparse and low rank matrices. The Journal of Machine Learning Research, 17(1):3163–3189.
Yin, Y., Bai, Z. D., and Krishnaiah, P. (1988). On the limit of the largest eigenvalue of the large dimensional sample covariance matrix. Probability Theory and Related Fields, 78:509–521.
Zhang, Z., Zheng, S., Pan, G., and Zhong, P.-S. (2022). Asymptotic independence of spiked eigenvalues and linear spectral statistics for large sample covariance matrices. The Annals of Statistics, 50(4):2205–2230.
Zhou, Z. and Amini, A. A. (2019). Analysis of spectral clustering algorithms for community detection: the general bipartite setting. The Journal of Machine Learning Research, 20(1):1774–1820.
Supplementary material on "On the asymptotic properties of spiked eigenvalues and eigenvectors of signal-plus-noise matrices with their applications"
The supplementary material provides an additional application, the criteria of the estimation of the number of clusters when , the remaining proof of the theoretical results and some additional simulation studies.
S.1 The estimation of the number of clusters when $p > n$
In this section, we consider the case when $p > n$, so that $c_n > 1$.
Then the smallest eigenvalues of are zero, that is,
The modified criteria for selecting the true number of clusters under $p > n$ are obtained by replacing the second term in EDA and EDB with its nonzero-eigenvalue counterpart:
where the resulting criteria are called pseudo-EDA and pseudo-EDB, respectively. Analogously to the case where $p \le n$, the modified pseudo-EDA and pseudo-EDB select the number of clusters by
The corresponding gap conditions for pseudo-EDA and pseudo-EDB stay the same as in (26) and (27), respectively. The following theorem shows that they possess properties similar to those of EDA and EDB.
Theorem 6.
Under conditions of Theorems 4 and 5, we have the following consistency results of the estimation criteria and .
(i) Suppose that is bounded. If the gap conditions (26), (27) do not hold, then and are not consistent. If the gap conditions (26) and (27) hold, then and are strongly consistent.
(ii) Suppose that tends to infinity. Then, and are strongly consistent.
Remark 9.
To illustrate EDA and EDB, one can refer to the example below, in which the true number of clusters is two.
Example.
Let the noise covariance be the identity matrix, and suppose the means of the two clusters are given, with an equal number of observations in each cluster. From Theorem 2, the limits of the first four eigenvalues can be obtained as follows,
which means that EDA and EDB cannot underestimate the number of clusters, and the following expressions imply that they do not overestimate it either:
(S.60)
From (S.59) and (S.60), it follows that both EDA and EDB are able to estimate the number of clusters accurately.
S.2.1 Additional simulations
We also consider the consistency properties of pseudo-EDA and pseudo-EDB when and under the following situations:
Case 5. Let , , , , where . Then,
where . Therefore, the true number of clusters is .
Case 6. Let , , ,
. Then, has the same form as above with . Therefore, the true number of clusters is .
Case 7. The same setting as in Case 6 with replaced by , where .
Case 8. The same setting as in Case 4, with in place of .
As can be seen from Table 6, when $p > n$, pseudo-EDA and pseudo-EDB perform better as the sample size increases. As the dimension ratio increases, by (25), the gap conditions of EDA and EDB become harder to satisfy. In particular, the gap condition of EDB is stricter than that of EDA when the dimensions are large. Therefore, the performance of pseudo-EDA is better than that of pseudo-EDB in this regime. The other tables admit similar interpretations.
Table 6: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 5.
[Columns: ASI, GS and BICdf alongside the proposed criteria; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
Table 7: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 6.
[Columns: ASI, GS and BICdf alongside the proposed criteria; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
Table 8: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 7.
[Columns: ASI, GS and BICdf alongside the proposed criteria; rows indexed by the noise distributions, including the Bernoulli case. The numerical entries could not be recovered from the source.]
Table 9: Selection percentages of EDA, EDB, ASI, GS and BICdf in Case 8.
[The numerical entries could not be recovered from the source.]
For any matrix , denote by $\lambda_k(\cdot)$ and $\sigma_k(\cdot)$ its $k$-th largest eigenvalue and singular value, respectively. From the conditions in Theorem 2 and the main result of Yin et al. (1988), with probability 1, as $n \to \infty$, for , there is a constant $C$ such that
(S.61)
Since , it follows that
(S.62)
Dividing both sides of (S.61) by and using (S.62), we complete the proof.
∎
We sketch the proof here; it is quite similar to that of Theorem 4. For , we have
According to the second assertion in Theorem 2, due to bulks in , we also have (51). Thus,
When , for , we have defined in (24).
Hence, if the gap conditions (26) and (27) are satisfied, then
When , by a discussion similar to that of Theorem 4 with in place of , without any gap conditions, we have
Next, consider . Analogously, it yields
which completes the proof.
∎
S.3 Remaining proofs
Below are some required lemmas.
Lemma 5.
For an invertible matrix and vectors , where and
are invertible, we have
Lemma 6.
Let with and , where are i.i.d. satisfying , . Then, there is
Lemma 7.
(Burkholder inequality) Let be a complex martingale difference sequence with respect to the filtration . For every , there exists such that:
For simplicity, we drop the subscript "n".
Let , , , and hence define
Moreover, we also introduce some basic notations and formulas. For invertible matrix and -dimensional vector , there are
(S.65)
(S.66)
Moreover, define
(S.67)
(S.68)
The following lemma is useful for calculating some moment bounds below:
Lemma 8.
For , there are
, and
.
Proof.
We have
where the second step uses the fact that . Therefore . The bound for is checked similarly.
For the last one, there is
∎
Proof of Proposition 3.
We first truncate, recentralize and renormalize the entries of following the steps in Bai et al. (2007).
Select and satisfying
.
Let , and , where . Let and . Write . Then, following the arguments used in the proof of Lemma 4 therein, we can show that
With this truncation and centralization, we have the following simple bound that can be checked using Lemma 6 and will be used frequently later:
(S.69)
With this bound, and
(S.70)
for any deterministic unit norm vector ,
we can obtain the following lemma without difficulty.
Lemma 9.
Let and be the conditional expectation with respect to the $\sigma$-field generated by . Under Assumptions 1 and 3, for , there is
where is a fixed unit vector and the superscript represents the case of being Gaussian, denoted by . Suppose, by singular value decomposition, ; we then have
Letting as , it satisfies the model in Hachem et al. (2013). Hence we have
(S.74)
where .
Moreover, combining Propositions 3.8 and 3.9 in Hachem et al. (2013), the conclusion follows.
Proof of (S.72):
We first write the term in (S.72) as a sum of martingale differences.
Using the two basic matrix identities (S.65) and (S.66), there is
(S.75)
Denote by the conditional expectation with respect to the $\sigma$-field generated by .
With the above expansion,
it is equivalent to obtaining a bound of
We split as:
To obtain the bound for the term involving , we consider the bounds of to , respectively.
For , we can decompose it as the sum of two components:
and . Since , we have
where is the -th coordinate of , and the third line uses Lemmas 8 and 9 and .
Similarly,
(S.76)
(S.77)
(S.78)
Thus, applying the Burkholder inequality in Lemma 7, there is
The term involving can be bounded by an argument similar to that for .
Combining the above discussions we obtain
(S.82)
Now, we consider in (S.75). We split into several components:
We obtain bounds for , and by arguments similar to those leading to the bounds for and . For the terms and , we utilize (S.70).
For we use to further decompose it into four components. For the component without , it can be bounded following an argument similar to the one that leads to the bound for . For the component with one , it can be handled similarly to . For the component involving the quadratic form , we use arguments leading to the bound for and . For , it is similar to , and owing to the presence of , which has a -th moment of , its analysis becomes even simpler. Therefore, we find
(S.83)
Recalling the definition of in (S.75), according to the analysis of ,
we readily obtain that
(S.84)
Combining the bounds in (S.82), (S.83) and (S.84), we can obtain (S.72).
where , and follows a normal distribution with mean and variance .
Define
(S.85)
Write
where
Similar to in (S.75), can be further decomposed as before, and we use the superscripts "1" and "0" to distinguish the general case from the Gaussian case. Since the procedure is similar to before, for simplicity, we list two typical examples to illustrate the proof idea.
For example, consider ,
For , there is
where is defined in (S.68), and in the second step we use . Thus, we have
(S.86)
Similarly, we also have .
For , note that . Then, by the Cauchy-Schwarz inequality, write
(S.87)
Similarly, it is easy to prove that
for by using the Cauchy-Schwarz inequality.
Proof of Proposition 2.
We first consider the bound for .
The strategy is to consider the difference between the quadratic form involving and the one involving , as studied in Proposition 1, and to derive the limits of such difference terms. The approach also applies to deriving the second bound for .
where the first step uses (S.66), the second step uses (S.65), and the third step uses (S.71) and .
Following similar steps, we obtain
(S.90)
Next we verify that is .
By polarization,
(9) still holds with different sequences of deterministic vectors on both sides of . This together with yields
For the term in the denominator, we have
With these two bounds, and by calculating the difference of those two terms in the last step of (S.89) and (S.90), respectively,
we find that
(S.91)
Since by (9), we conclude from this and (S.91) that
Therefore,
This concludes the bound for .
Then we prove the second bound on .
We have
(S.92)
where in the last step we use , which can be checked by following arguments similar to (3.7)-(3.12) of Pan (2014).
We also find
By (11), we have Since
according to (9), we find . Therefore, we obtain that . This, combined with the fact that
concludes the proof.
∎