
Homogeneity and Sub-homogeneity Pursuit: Iterative Complement Clustering PCA


Daning Bi, Le Chang, Yanrong Yang
Research School of Finance, Actuarial Studies & Statistics
The Australian National University
Corresponding author, email: [email protected]
Abstract

Principal component analysis (PCA), the most popular dimension-reduction technique, has been used to analyze high-dimensional data in many areas. It discovers the homogeneity within the data and creates a reduced feature space to capture as much information as possible from the original data. However, in the presence of a group structure of the data, PCA often fails to identify the group-specific pattern, which is known as sub-homogeneity in this study. Group-specific information that is missed can result in an unsatisfactory representation of the data from a particular group. It is important to capture both homogeneity and sub-homogeneity in high-dimensional data analysis, but this poses a great challenge. In this study, we propose a novel iterative complement-clustering principal component analysis (CPCA) to iteratively estimate the homogeneity and sub-homogeneity. A principal component regression based clustering method is also introduced to provide reliable information about clusters. Theoretically, this study shows that our proposed clustering approach can correctly identify the cluster membership under certain conditions. The simulation study and real analysis of the stock return data confirm the superior performance of our proposed methods. Supplementary materials, including R codes, for the simulation and real data analysis are available online.


Keywords: Principal component analysis, principal component regression, high-dimensional data, homogeneity, sub-homogeneity, clustering

1 Introduction

Since its introduction, principal component analysis (PCA) (Jolliffe, 2002; Anderson, 2003) has become one of the most popular statistical tools for data analysis in a wide range of areas. The literature on PCA dates back to the early twentieth century (e.g., see Pearson (1901) and Hotelling (1933)). With the increasing dimension of modern data, however, PCA has been reconsidered and widely discussed for the purpose of analyzing high-dimensional data. Jolliffe and Cadima (2016) and Fan et al. (2018) reviewed recent developments in PCA for statistical analysis of high-dimensional data, including its sparsity and robustness. However, does pooling more data together really benefit statistical analysis? Boivin and Ng (2006) provided a negative answer from a forecasting perspective. This is due to the increased complexity when data from different populations are grouped together, as a proportion of the data can exhibit a different pattern compared with the rest. More specifically, the increased complexity refers to heterogeneity when data are collected from different populations.

In this study, we consider that information from a particular group of data collected from one population can be divided into two categories. One is shared with other groups and forms the homogeneity within the entire data, while the other is group-specific and is the main source of heterogeneity. This type of heterogeneity can be treated as sub-homogeneity, which refers to the homogeneity for a particular group of data.

However, sub-homogeneity may not be identified using traditional estimation methods such as PCA. One reason for this is that the sub-homogeneity for a particular group of data can be relatively small compared with the homogeneity within the entire data because it usually contributes less than the homogeneity to the total variance of all of the data. In such a situation, traditional PCA may regard the sub-homogeneity as negligible compared with homogeneity and ignore this group-specific pattern. Moreover, from the interpretation perspective, principal components that are produced using traditional PCA on all of the data do not target a specific group (e.g., information on which component corresponds to which group of data is not known). In previous studies, the discussion of PCA has mainly focused on the large eigenvalues, which correspond to the homogeneity in this study. For example, Johnstone (2001) studied the asymptotic distribution of the largest eigenvalue of PCA. Recently, Cai et al. (2017) extended the discussion to the asymptotic distribution of the spiked eigenvalues, while Morales-Jimenez et al. (2018) studied the asymptotics for the leading eigenvalues and eigenvectors of the correlation matrix in the class of spiked models. None of them have considered the existence of sub-homogeneity within the data.

To the best of our knowledge, the sub-homogeneity has not been well discussed, but it can be very important in high-dimensional data analysis. The following example of analyzing stock returns from different industries is used to explain the importance of finding sub-homogeneity. As stated in Fama and French (1997), stock returns from various industries can have varying performance over time, although the homogeneity (e.g., market return) can be deemed to be driven by some common economic variables. In a situation in which a vast number of individual stock returns are collected together for statistical analysis, traditional dimension-reduction techniques such as PCA may be able to capture the market effect but can fail to identify the sub-homogeneity within each industry (industry-specific pattern). This is because some industry-specific components may have much smaller variance than the market component and are highly likely to be omitted by PCA. However, these industry-specific components may be very important in capturing the movement of the stock returns within the industry. This loss of information may result in a very poor forecast of stock returns for some companies in which sub-homogeneity exists. In addition, although it would be interesting to study which industry has a larger industry-specific effect on stock returns, traditional PCA performed on the entire data does not allow us to draw such conclusions. Therefore, this study aims at identifying both homogeneity and sub-homogeneity in high-dimensional data analysis.

The sub-homogeneity in several parts of the data can be identified and estimated by dividing the whole data set into several groups. However, in most situations, the group structure is not known in advance. Therefore, a clustering method can be used to group the data at the first stage, and PCA can then be applied to identify the sub-homogeneity within each group. Similarly, Liu et al. (2002) performed PCA in different blocks of variables, while Tan et al. (2015) penalized the variables in each group differently when using the graphical lasso. Both studies used hierarchical clustering to group the variables to take into account the heterogeneity. However, when homogeneity exists across different groups, the sub-homogeneity in each group cannot be successfully identified. This is mainly because the homogeneity shared by most groups can dominate the sub-homogeneity so that the data from different groups all seem to be highly correlated. In this situation, the sub-homogeneity is masked by the homogeneity, and these clustering methods tend to group all individuals (variables) into one cluster. Therefore, the homogeneity must be correctly identified and removed before we can successfully group the individuals and discover the sub-homogeneity of each group.

On the other hand, the estimation of homogeneity can often be problematic in a high-dimensional setting. This is because of the inconsistency of PCA estimates when the number of variables $p$ is comparable with or greater than the sample size $n$, as discussed in Johnstone and Lu (2009). Motivated by the group structure of the data, we suggest first clustering the individuals into groups and then performing a traditional approach such as PCA on the level of groups, followed by a second layer of PCA that extracts common information shared by each group. This approach can improve the estimation accuracy of the homogeneity because the components are now extracted from groups in which the dimension has already been reduced (to the number of variables in a group), while the traditional dimension-reduction method (e.g., PCA) is performed on the full dimension $p$. This can be viewed as an effective way to alleviate the potential problem caused by the curse of dimensionality. Other studies have modified traditional approaches to deal with the curse of dimensionality. For example, Johnstone and Lu (2009) suggest that some initial dimension-reduction work is necessary before using PCA, as long as a sparse basis exists. Further, Zou et al. (2006) introduced sparse PCA, which uses the lasso (elastic net) to modify the principal components so that sparse loadings can be achieved. In addition, a review of some sparse versions of PCA can be found in Tibshirani et al. (2015), while some general discussions about the blessing and curse of dimensionality can be found in Donoho et al. (2000) and Fan et al. (2014).

To conclude, homogeneity must be removed to correctly identify the cluster structure and sub-homogeneity, but the cluster structure of the data is used to accurately find the homogeneity. Therefore, we introduce a novel “iterative complement-clustering principal component analysis” (CPCA) to iteratively estimate the homogeneity and sub-homogeneity. Details of CPCA are provided in Section 4.

The contributions of this study can be summarized as follows. First, we propose CPCA to identify homogeneity and sub-homogeneity and handle the interaction between them when the whole data set exhibits a group structure. Second, our proposed estimation method not only correctly captures the sub-homogeneity, but also provides very reliable cluster information (e.g., which part of the data is from the same group), which can be useful in understanding and explaining the data structure. Third, inspired by Chiou and Li (2007), we develop a leave-one-out principal component regression (PCR) clustering method that can outperform the hierarchical clustering method used in previous studies. In addition, we theoretically illustrate that if the sub-homogeneity from different clusters is distinct, our proposed clustering procedure can effectively separate the variables from different clusters. This is also numerically confirmed by the simulation study and real data analysis. Details of the proposed clustering method are provided in Section 4.

The rest of this study is organized as follows. A low-rank representation of the data that captures both homogeneity and sub-homogeneity is introduced in Section 2, and some related PCA methods are discussed in Section 3. In Section 4, a novel estimation method called CPCA is proposed, followed by a discussion of more details of the algorithm. Section 5 demonstrates and explains the effectiveness of the proposed clustering method. Extensive simulations, along with two applications (PCR and covariance estimation) of our proposed method are provided in Sections 6 and 7. Section 8 analyzes a stock return dataset using our method. Lastly, Section 9 concludes the study.

2 Homogeneity and Sub-homogeneity

Consider the data $\bm{x}_{i}=(x_{1i},\ldots,x_{pi})^{\top}\in\mathbb{R}^{p}$ from the $i^{th}$ observation with $p$ variables. The singular value decomposition (SVD) of the data $\bm{x}_{i}$ can be written as:

\[
\bm{x}_{i} = \sum_{k=1}^{K} g_{ik}\bm{\phi}_{k} + \bm{u}_{i},
\quad\text{where } \bm{u}_{i} = \sum_{k=K+1}^{\min(n,p)} g_{ik}\bm{\phi}_{k}, \quad i=1,\ldots,n, \tag{1}
\]

where $g_{ik}$ is defined as the $k^{th}$ principal component score and $\bm{\phi}_{k}$ denotes the $p\times 1$ eigenvector associated with $g_{ik}$. Traditional PCA summarizes the data using the first $K$ principal components and treats $\bm{u}_{i}$ as noise because $g_{ik}$, $k=K+1,\ldots,\min(n,p)$, has smaller variance. However, under certain conditions, some of the information contained in $\bm{u}_{i}$ can be useful in prediction or forecasting problems, particularly for one or more specific groups of data. Therefore, when the data exhibit a group structure, we propose the following low-rank representation for the data to capture both homogeneity and sub-homogeneity:

\[
\bm{x}_{i} = \sum_{k=1}^{r_{c}} g_{ik}\bm{\phi}_{k} + \bm{u}_{i},
\quad\text{where } \bm{u}_{i} = \sum_{j=1}^{J}\sum_{h=1}^{r_{j}} f^{(j)}_{ih}\bm{\gamma}^{(j)}_{h} + \sum_{j=1}^{J}\bm{I}^{(j)}\bm{\epsilon}^{(j)}_{i}, \quad i=1,\ldots,n, \quad j=1,\ldots,J. \tag{2}
\]

The first line in (2) measures the homogeneity among all variables from all groups, where $g_{ik}$, $k=1,\ldots,r_{c}$, is the $k^{th}$ principal component, which we call the common component, and $\bm{\phi}_{k}$ is its corresponding eigenvector. The first part of $\bm{u}_{i}$ in (2) consists of $J$ cluster-specific components from which the sub-homogeneity of the data is derived. Assuming the $p$ variables can be split into $J$ clusters, $f^{(j)}_{ih}$, $h=1,\ldots,r_{j}$, measures the within-cluster principal components for cluster $j$. The within-cluster eigenvector with dimension $p\times 1$ has the form $\bm{\gamma}^{(j)}_{h}=(\bm{0}^{(1)\top},\ldots,\bm{\eta}^{(j)\top}_{h},\ldots,\bm{0}^{(J)\top})^{\top}$, where $\bm{\eta}^{(j)}_{h}$ is a $p_{j}\times 1$ vector and $p_{j}$ is the number of variables in cluster $j$, so that $\sum_{j=1}^{J}p_{j}=p$. That is, the entries of $\bm{\gamma}^{(j)}_{h}$ for variables that do not belong to cluster $j$ are zero. This implies that after removing the effect of the common principal components, variables from different clusters are uncorrelated (e.g., $\bm{u}_{i}$ has a block-diagonal covariance structure). In the second part of $\bm{u}_{i}$, $\bm{\epsilon}^{(j)}_{i}$ is simply a $p_{j}$-dimensional error with variance ${\sigma^{(j)}}^{2}$, and $\bm{I}^{(j)}$ is a diagonal matrix whose diagonal entries corresponding to variables that do not belong to cluster $j$ are zero.

We further define $\bm{g}_{i}=(g_{i1},\ldots,g_{ir_{c}})^{\top}$ and $\bm{f}^{(j)}_{i}=(f^{(j)}_{i1},\ldots,f^{(j)}_{ir_{j}})^{\top}$, $j=1,\ldots,J$, as the vector forms of the common components and cluster-specific components, and the eigenvector matrices as $\bm{\Phi}_{(p\times r_{c})}=(\bm{\phi}_{1},\ldots,\bm{\phi}_{r_{c}})$ and $\bm{\Gamma}^{(j)}_{(p\times r_{j})}=(\bm{\gamma}^{(j)}_{1},\ldots,\bm{\gamma}^{(j)}_{r_{j}})$, $j=1,\ldots,J$, respectively. Without loss of generality, we assume $E(\bm{g}_{i})=E(\bm{f}^{(j)}_{i})=E(\bm{\epsilon}^{(j)}_{i})=\bm{0}$, and $\bm{g}_{i},\bm{f}^{(j)}_{i},\bm{\epsilon}^{(j)}_{i}$, $j=1,\ldots,J$, are mutually uncorrelated. In addition, we impose the following canonical condition for the identification of both the homogeneity and sub-homogeneity,

\[
\bm{\Phi}^{\top}\bm{\Phi}=I_{r_{c}}, \quad \bm{\Sigma}_{g}\coloneqq\mathrm{cov}(\bm{g}_{i})\text{ is diagonal}, \qquad
\bm{\Gamma}^{(j)\top}\bm{\Gamma}^{(j)}=I_{r_{j}}, \quad \bm{\Sigma}_{f}^{(j)}\coloneqq\mathrm{cov}(\bm{f}^{(j)}_{i})\text{ is diagonal}, \quad j=1,\ldots,J. \tag{3}
\]

Under the data structure given in (2), the population covariance matrix is given by a low-rank plus block-diagonal representation

\[
\bm{\Sigma}=\bm{\Phi}\bm{\Sigma}_{g}\bm{\Phi}^{\top}+\sum_{j=1}^{J}\bm{\Gamma}^{(j)}\bm{\Sigma}_{f}^{(j)}\bm{\Gamma}^{(j)\top}+\sum_{j=1}^{J}\bm{I}^{(j)}{\sigma^{(j)}}^{2}. \tag{4}
\]

If we denote the data by $\bm{X}_{(n\times p)}=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\top}$, and the common components and cluster-specific components in matrix form by $\bm{G}_{(n\times r_{c})}=(\bm{g}_{1},\ldots,\bm{g}_{n})^{\top}$ and $\bm{F}^{(j)}_{(n\times r_{j})}=(\bm{f}^{(j)}_{1},\ldots,\bm{f}^{(j)}_{n})^{\top}$, respectively, data representation (2) can also be written in the following matrix form

\[
\bm{X}=\bm{G}\bm{\Phi}^{\top}+\bm{U},
\quad\text{where } \bm{U}=\sum_{j=1}^{J}\bm{F}^{(j)}\bm{\Gamma}^{(j)\top}+\bm{E}. \tag{5}
\]

In general, there are two goals in using representation (2). In the presence of an unknown cluster structure, the first goal is to cluster the variables (i.e., determine which variables are in the same group). From the interpretation perspective, it is interesting to know how the variables are clustered and which variables belong to the same cluster. The second goal is to correctly estimate both the common components $\bm{g}_{i}$ and the cluster-specific components $\bm{f}^{(j)}_{i}$. These components serve as a low-rank representation of the data and can therefore be used for further applications. One obvious application is to estimate the covariance matrix $\bm{\Sigma}$ and the precision matrix $\bm{\Sigma}^{-1}$ of $\bm{X}$. Using both $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ captures both the low-rank and the block-diagonal representation of $\bm{\Sigma}$, which results in a more efficient estimation compared with using $\bm{g}_{i}$ only. Another important application is PCR. In some situations, it is a cluster-specific component $\bm{f}^{(j)}_{i}$ that contributes most to determining the response in PCR rather than the common component $\bm{g}_{i}$. It is important to identify each cluster-specific component $\bm{f}^{(j)}_{i}$ to explain which cluster of variables has a greater effect in predicting the response. We discuss these two applications further using simulated data in Section 7.

3 Relationship with Existing PCA Methods

Our proposed data representation (2) should be used to perform dimension reduction of the data with a cluster structure because it captures both the common effect (homogeneity) and the cluster-specific effect (sub-homogeneity). Consequently, this method can be viewed as an extension of many other widely used dimension-reduction methods. Three of these methods are discussed below.

  • Case 1: When there is no cluster-specific effect, for example, $\bm{f}^{(j)}_{i}=\bm{0}$, $j=1,\ldots,J$, and ${\sigma^{(j)}}^{2}\equiv\sigma^{2}$, representation (2) simply reduces to the well-known PCA

    \[
    \bm{x}_{i}=\sum_{k=1}^{r_{c}} g_{ik}\bm{\phi}_{k}+\bm{\epsilon}_{i},\quad i=1,\ldots,n, \tag{6}
    \]

    where $g_{ik}$ is the principal component with the $k^{th}$ largest eigenvalue and $\bm{\phi}_{k}$ is the corresponding eigenvector. However, in the presence of a small cluster-specific effect, traditional PCA generally only identifies the large common effect and treats the rest as noise, disregarding the sub-homogeneity within the data.

  • Case 2: When the common components do not exist (e.g., $\bm{g}_{i}=\bm{0}$), representation (2) corresponds to a well-studied block-diagonal covariance structure of the data, as discussed in Liu et al. (2002) and Tan et al. (2015). Thus, representation (2) reduces to

    \[
    \bm{x}_{i}=\sum_{j=1}^{J}\sum_{h=1}^{r_{j}} f^{(j)}_{ih}\bm{\gamma}^{(j)}_{h}+\sum_{j=1}^{J}\bm{I}^{(j)}\bm{\epsilon}^{(j)}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J. \tag{7}
    \]

    In this case, hierarchical clustering is usually used to group the variables with high correlations, and PCA is then performed on each group of variables to estimate the cluster-specific components $\bm{f}^{(j)}_{i}$. In many situations, this cluster structure is masked by a dominant common effect. Ignoring this common effect will result in a non-identifiable cluster structure.

  • Case 3: Fan et al. (2013) and Li et al. (2018) consider the following low-rank representation, in which the errors are allowed to remain cross-sectionally dependent after the common components have been taken out:

    \[
    \bm{x}_{i}=\sum_{k=1}^{r_{c}} g_{ik}\bm{\phi}_{k}+\bm{u}_{i},\quad i=1,\ldots,n, \tag{8}
    \]

    where the covariance of $\bm{u}_{i}$, $\Sigma_{u}$, is assumed to be sparse, and they propose a method called Principal Orthogonal complEment Thresholding (POET) to explore such a high-dimensional structure with sparsity. Li et al. (2018) used weighted PCA to find a more efficient estimator of the common components $\bm{g}_{i}$. Hong et al. (2018) and Deville and Malinvaud (1983) also applied weighted PCA to estimate the covariance matrix when heteroscedastic noise across samples or variables exists. However, in this study, we consider $\bm{u}_{i}$ to follow a cluster structure with a block-diagonal covariance and aim to find its low-rank representation, which we call sub-homogeneity. In addition, we propose iteratively estimating the common components and cluster-specific components.

4 Estimation Methods

Correctly estimating $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ poses many challenges. First, PCA, which is widely used to estimate the common component $\bm{g}_{i}$, often performs very poorly when $p$ is much larger than $n$. The group structure of the variables motivates us to separate the data according to the clusters and then extract the common information from each cluster to determine the common components. This is less influenced by the curse of dimensionality because each cluster of the data has a lower dimension $p_{j}$. However, to accurately identify the clusters, we must remove the common effect and then perform the clustering method on the complement $\left(\bm{x}_{i}-\sum_{k=1}^{r_{c}} g_{ik}\bm{\phi}_{k}\right)$, but the common components and their eigenvectors are not known in advance. This inspires us to propose a novel iterative method, CPCA, to cluster the variables and estimate the components simultaneously. The details of the method are described in Algorithm 1, and a flowchart that summarizes our estimation method is presented in Figure 1.

Algorithm 1 (CPCA)
  1. 1.

    Initial Step:

    1. (a)

      Perform PCA directly on the entire data $\bm{X}$ and select the number of components using a ratio-based estimator defined by Lam and Yao (2012). A brief discussion on selecting the number of components is given in Remark 1. The resulting eigenvectors $\bm{\Phi}_{0}$ and principal components $\bm{G}_{0}$ serve as the initial estimates of $\bm{\Phi}$ and $\bm{G}$. Then, find the initial complement $\bm{X}^{c}_{0}=\bm{X}-\bm{G}_{0}\bm{\Phi}^{\top}_{0}$.

    2. (b)

      Perform hierarchical clustering for $\bm{X}^{c}_{0}$ based on a similarity matrix given by the absolute value of the empirical correlation matrix of $\bm{X}^{c}_{0}$. The obtained clusters $C^{(j)}_{0}$, $j=1,\ldots,J_{0}$, serve as the initial clusters.

  2. 2.

    Iterative Step: for s=1,2,s=1,2,\ldots

    1. (a)

      Cluster the variables into $J_{s-1}$ groups according to $C^{(j)}_{s-1}$ and define the variables from the $j^{th}$ cluster as $\bm{X}^{(j)}_{s}$. Perform PCA on each cluster of variables $\bm{X}^{(j)}_{s}$ and denote the obtained principal components as $\bm{\Psi}^{(j)}_{s}$, $j=1,\ldots,J_{s-1}$. Combine these principal components as $\bm{\Psi}_{s}=(\bm{\Psi}^{(1)}_{s},\ldots,\bm{\Psi}^{(J_{s-1})}_{s})$ and perform a further step of PCA on $\bm{\Psi}_{s}$. Define the resulting principal components as $\bm{G}_{s}$ and their corresponding eigenvectors as $\bm{\Phi}_{s}$. Then, compute the updated complement $\bm{X}^{c}_{s}=\bm{X}-\bm{G}_{s}\bm{\Phi}^{\top}_{s}$. Details on finding the eigenvectors $\bm{\Phi}_{s}$ can be found in Appendix A.

    2. (b)

      Perform leave-one-out clustering for the variables in $\bm{X}^{c}_{s}$ using PCR (more details of this clustering method are discussed in Remark 3):

      1. (i)

        For $k=1,\ldots,p$, leave the $k^{th}$ variable out of $\bm{X}^{c}_{s}$.

      2. (ii)

        Group the remaining variables of $\bm{X}^{c}_{s}$ based on $C^{(j)}_{s-1}$, perform PCA on each cluster again and denote the obtained components as $\bm{F}_{s-1}^{(j)}$, $j=1,\ldots,J_{s-1}$.

      3. (iii)

        Fit $J_{s-1}$ PCRs by using the $k^{th}$ variable of $\bm{X}^{c}_{s}$ as the response and $\bm{F}_{s-1}^{(j)}$ as the predictors, for each $j=1,\ldots,J_{s-1}$, respectively.

      4. (iv)

        Compute the sum of squared residuals (SSR) for each of the $J_{s-1}$ PCR models. Assign the $k^{th}$ variable to the cluster with the minimum SSR. Update the cluster index for the $k^{th}$ variable in $C^{(j)}_{s-1}$.

      5. (v)

        Repeat (i)$\sim$(iv) for each $k$ and denote the updated clusters as $C^{(j)}_{s}$.

    3. (c)

      Repeat (a) and (b) within this step until the clusters converge, and define the final converged clusters as $C^{(j)}_{f}$.

  3. 3.

    Final Step:

    1. (a)

      Repeat (a) in the Iterative Step, but using the clusters $C^{(j)}_{f}$. The final complement is denoted by $\bm{X}^{c}_{f}=\bm{X}-\bm{G}_{f}\bm{\Phi}^{\top}_{f}$.

    2. (b)

      Cluster the variables of $\bm{X}^{c}_{f}$ based on $C^{(j)}_{f}$, perform PCA on each cluster $\bm{X}^{c(j)}_{f}$ and denote the obtained cluster-specific components as $\bm{F}_{f}^{(j)}$, $j=1,\ldots,J_{f}$.

Therefore, CPCA produces the required outputs: the final clusters of the $p$ variables $C^{(j)}_{f}$, $j=1,\ldots,J_{f}$, the final estimate of the common components $\bm{G}_{f}$, and the cluster-specific components $\bm{F}_{f}^{(j)}$, $j=1,\ldots,J_{f}$, along with their eigenvectors. Some details and discussions of CPCA are provided in the following remarks.
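To make the algorithm concrete, the following is a minimal R sketch of Iterative Step (a): within-cluster PCA, a second layer of PCA on the stacked components, and the resulting complement. The helper n_pc() (a rule mapping eigenvalues to a number of components, e.g., the ratio-based estimator of Remark 1) is hypothetical, and recovering the eigenvectors by a least-squares projection of X on G is our simplification of the construction detailed in Appendix A.

    # Minimal sketch of Iterative Step (a); X is n x p, cl is a length-p vector of
    # cluster labels, n_pc() maps a vector of eigenvalues to a number of components.
    iterative_step_a <- function(X, cl, n_pc) {
      # within-cluster PCA: collect the leading components of each cluster
      Psi <- do.call(cbind, lapply(unique(cl), function(j) {
        pcj <- prcomp(X[, cl == j, drop = FALSE], center = FALSE)
        pcj$x[, seq_len(n_pc(pcj$sdev^2)), drop = FALSE]
      }))
      # second layer of PCA on the stacked components extracts the common components G
      pc2 <- prcomp(Psi, center = FALSE)
      G   <- pc2$x[, seq_len(n_pc(pc2$sdev^2)), drop = FALSE]
      # eigenvectors Phi recovered here by least squares (a simplification of Appendix A)
      Phi <- t(solve(crossprod(G), crossprod(G, X)))   # p x r_c
      Xc  <- X - G %*% t(Phi)                          # complement X - G Phi'
      list(G = G, Phi = Phi, Xc = Xc)
    }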

Figure 1: Flowchart of Algorithm 1.
Remark 1

[Selecting the Number of Components]:
In this work, we use a ratio-based estimator (Lam and Yao, 2012) to select the number of common components and estimate the homogeneity. The ratio-based estimator of the number of principal components $r_{c}$ is defined as

\[
\widehat{r}_{c}=\operatorname*{argmin}_{1\leq i\leq R}\frac{\widehat{\lambda}_{i+1}}{\widehat{\lambda}_{i}}, \tag{9}
\]

where $\widehat{\lambda}_{1}\geq\cdots\geq\widehat{\lambda}_{p}$ are the eigenvalues of the sample covariance matrix of the data $\bm{X}$ and $R$, with $r_{c}\leq R\leq p$, is a constant. In practice, many researchers may prefer to select the number of principal components either by setting a threshold on the proportion of total variance explained, which may accidentally include some errors in the principal components, or by choosing the number at which the largest drop of eigenvalues occurs (i.e., based on the scree plot). Nonetheless, when the number of common components is greater than 1 (e.g., $r_{c}>1$), the discrepancy of the eigenvalues (from the sample covariance) among the common components can accidentally be greater than the one between the common components and the cluster-specific components, especially when some groups have a relatively large cluster-specific effect (e.g., variables in one group are on a large scale or highly correlated). In this situation, selecting the number of components using the largest drop in eigenvalues in Initial Step (a) and Iterative Step (a) may lead to underestimating the number of common components, while the ratio-based estimator $\widehat{r}_{c}$ is less affected by the scales of the variables. In addition, to consider a broader class of sub-homogeneity, we also allow the sub-homogeneity within a specific cluster to have different orders of variance. To correctly estimate the number of principal components that may have different orders of variance in the same group, in Iterative Step 2 (a) and (b), we use the iterative estimation procedure proposed by Peng et al. (2020), which is a generalization of the original ratio-based estimator. The exact procedure is lengthy and is not the focus of the present work, and hence is omitted here. More details of the iterative estimation procedure of the ratio-based estimator can be found in Section 2 of Peng et al. (2020).
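A minimal R implementation of this selection rule might look as follows, assuming the eigenvalue-ratio form of Lam and Yao (2012) and a user-chosen cap R:

    # Ratio-based estimator of the number of components: minimise the ratio of
    # consecutive eigenvalues of the sample covariance matrix over 1 <= i <= R.
    ratio_estimator <- function(X, R = floor(min(dim(X)) / 2)) {
      lambda <- eigen(cov(X), symmetric = TRUE, only.values = TRUE)$values
      which.min(lambda[2:(R + 1)] / lambda[1:R])
    }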

Remark 2

[Iterative Step]:
One of the key contributions of our algorithm is iteratively estimating the common components and cluster-specific components. Directly applying PCA to the entire data $\bm{X}$ generally leads to a very poor estimate of the common components because a large $p$ may blur the spike structure of the sample covariance matrix. Motivated by the group structure of the data, we use Iterative Step (a) to first cluster the variables and then perform PCA on the level of clusters, followed by a second layer of PCA that extracts the common information shared by each group. Iterative Step (b) is then implemented to update and improve the cluster information. This is more effective in estimating the common components because PCA is performed in a smaller-$p$ setting, which is also shown numerically in Figure 2. The data $\bm{X}$ used here are generated by simulation Example 2. Figure 2 (a) presents the correlation plot for the original data $\bm{X}$. We observe that the cluster structure is masked by the common effect. Figure 2 (b) demonstrates that after the common effects estimated in the Initial Step are removed, it is still difficult to perceive the cluster structure. However, if we remove the common effects estimated in the Final Step, as in Figure 2 (c), the block-diagonal structure is clear, implying a prominent heterogeneity within the data. Therefore, iteratively estimating the common components is advantageous.

Figure 2: Correlation plots using data generated by simulation Example 2 for (a) the original data $\bm{X}$, (b) the initial complement $\bm{X}^{c}_{0}=\bm{X}-\bm{G}_{0}\bm{\Phi}^{\top}_{0}$ and (c) the final complement $\bm{X}^{c}_{f}=\bm{X}-\bm{G}_{f}\bm{\Phi}^{\top}_{f}$.
Remark 3

[Initial Step (b), Iterative Step (b) and (c)]:
In Initial Step (b), an ordinary hierarchical clustering method using average linkage as the dissimilarity between clusters is performed. As discussed in Bühlmann et al. (2013), we choose the number of clusters using the largest increment in height between merges of the agglomerative procedure, such that $J_{0}=\operatorname*{argmax}_{j}(h_{j+1}-h_{j})$.
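A short R sketch of this initial clustering, using one minus the absolute correlation as one common dissimilarity consistent with the similarity matrix above, is:

    # Initial Step (b): average-linkage hierarchical clustering of the variables of the
    # complement Xc, with J0 chosen at the largest increment in merge heights.
    initial_clusters <- function(Xc) {
      hc  <- hclust(as.dist(1 - abs(cor(Xc))), method = "average")
      h   <- hc$height                 # p - 1 merge heights, increasing
      gap <- which.max(diff(h))        # largest increment h_{j+1} - h_j
      J0  <- length(h) - gap + 1       # cutting between these merges leaves J0 clusters
      cutree(hc, k = J0)
    }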

In Iterative Step (b), we propose a new clustering method called “leave-one-out PCR clustering” (LOO-PCR clustering), which is another key contribution of CPCA. The underlying idea of this new clustering method is that if a variable belongs to a cluster, it should be well predicted by the principal components extracted from that cluster. This is more aligned with our model set-up than hierarchical clustering based on correlations. A similar leave-one-out idea is used for functional clustering, as discussed in Chiou and Li (2007). In addition, hierarchical clustering can perform poorly when the dimension of the variables $p$ is larger than $n$. In step (iv), we assign the $k^{th}$ variable to the cluster that achieves the minimum SSR. However, when the minimum SSR is larger than a threshold $\tau$, we treat the $k^{th}$ variable itself as a cluster, because in such a situation this variable cannot be well predicted by any cluster of variables. Hence, the number of clusters is driven by the data and can vary in each iteration. This suggests that our proposed method is more flexible than hierarchical clustering. We set $\tau=0.95$, and Example 4 in the simulation studies demonstrates that our proposed LOO-PCR clustering, serving as a clustering method in its own right, outperforms hierarchical clustering in a large-$p$, small-$n$ case. The identifiability of LOO-PCR clustering is also demonstrated theoretically in Section 5. A minimal sketch of this step is given below.
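In the sketch, the component-selection rule n_pc() is the same hypothetical helper as before; since the normalization of the SSR against the threshold $\tau$ is not spelled out above, we divide by the total sum of squares of the standardized response (essentially one minus $R^2$), which is our assumption.

    # LOO-PCR clustering (Iterative Step (b)); Xc is the current complement,
    # cl the current integer cluster labels, tau the SSR threshold (0.95).
    loo_pcr_cluster <- function(Xc, cl, n_pc, tau = 0.95) {
      labels  <- sort(unique(cl))
      new_cl  <- cl
      next_id <- max(cl) + 1
      for (k in seq_len(ncol(Xc))) {
        y   <- as.numeric(scale(Xc[, k]))
        ssr <- sapply(labels, function(j) {
          idx <- setdiff(which(cl == j), k)        # leave the k-th variable out
          if (length(idx) < 2) return(Inf)
          pcj <- prcomp(Xc[, idx, drop = FALSE], center = FALSE)
          Fj  <- pcj$x[, seq_len(n_pc(pcj$sdev^2)), drop = FALSE]
          sum(resid(lm(y ~ Fj))^2) / sum(y^2)      # scaled SSR of the PCR fit
        })
        if (min(ssr) > tau) {                      # not explained by any cluster:
          new_cl[k] <- next_id                     # the variable forms its own cluster
          next_id   <- next_id + 1
        } else {
          new_cl[k] <- labels[which.min(ssr)]
        }
      }
      new_cl
    }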

In Iterative Step (c), we stop the iteration when the clusters $C^{(j)}_{s}$ converge. In our algorithm, we adopt the adjusted Rand index (ARI) (Rand, 1971; Hubert and Arabie, 1985) between the clusters $C^{(j)}_{s}$ and those from the previous iteration, $C^{(j)}_{s-1}$, as the stopping criterion. The ARI is a corrected version of the Rand index that measures the similarity between two clusterings of the same data using the proportions of agreement between the two partitions. The correction is achieved by subtracting the expected value of the Rand index from the Rand index itself. The ARI attains a maximum value of 1, and a high value implies high similarity. When the ARI is above a certain threshold $\eta$, we stop the iteration. In this study, we use $\eta=0.95$.
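Using the helpers sketched above (all hypothetical names), the stopping rule can be written as a short loop; the ARI is computed with adjustedRandIndex() from the mclust package:

    library(mclust)
    eta    <- 0.95
    cl_old <- cl_init                                    # clusters from the Initial Step
    repeat {
      step   <- iterative_step_a(X, cl_old, n_pc)        # update G, Phi and the complement
      cl_new <- loo_pcr_cluster(step$Xc, cl_old, n_pc)   # update the clusters
      if (adjustedRandIndex(cl_old, cl_new) > eta) break # stop once the partitions agree
      cl_old <- cl_new
    }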

5 Identifiability of LOO-PCR Clustering Approach

Achieving an accurate sub-homogeneity pursuit relies on the effectiveness of the clustering procedure. In this section, we investigate the proposed LOO-PCR clustering method theoretically. More specifically, we examine the identifiability of cluster membership in our proposed method.

Consider a random variable $X_{mi}$ that belongs to cluster $l$. Based on the variable's structure in cluster $l$, there exists $m\in\{1,2,\ldots,p\}$ such that:

\[
X_{mi}=\sum^{r_{c}}_{k=1}g_{ik}\phi_{mk}+u_{mi}^{(l)},\qquad u_{mi}^{(l)}=\sum^{r_{l}}_{h=1}f_{ih}^{(l)}\gamma_{mh}^{(l)}+\epsilon_{mi}^{(l)}. \tag{10}
\]

We now consider another cluster $d$ with components $g_{ik}$, $k=1,\ldots,r_{c}$, and $f_{ih}^{(d)}$, $h=1,\ldots,r_{d}$. When applying our proposed LOO-PCR clustering method, we regress $u_{mi}^{(l)}$ on $f_{ih}^{(d)}$, $h=1,2,\ldots,r_{d}$, and measure its goodness of fit to determine whether the random variable $X_{mi}$ belongs to cluster $d$, such that:

\[
u_{mi}^{(l)}=\sum^{r_{d}}_{h=1}\beta_{h}f_{ih}^{(d)}+\zeta_{mi}^{(d)}, \tag{11}
\]

where $\beta_{1},\ldots,\beta_{r_{d}}$ are coefficients and $\zeta_{mi}^{(d)}$ is the error. Naturally, we expect that features (e.g., principal components) from cluster $d$ cannot explain $u_{mi}^{(l)}$ sufficiently. Based on (10) and (11), intuitively, a large discrepancy between the principal components $f_{ih}^{(d)}$ from cluster $d$ and the principal components $f_{ih}^{(l)}$ from cluster $l$ will result in a large residual $\zeta_{mi}^{(d)}$ in (11).

The following theorem shows the property of $\widehat{\zeta}_{mi}^{(d)}$, the estimator of the error $\zeta_{mi}^{(d)}$ obtained from our proposed clustering approach.

Theorem 1

For any cluster $d$ and any random variable $X_{mi}$ from another cluster $l$, we have the following evaluation:

\[
\|\widehat{\bm{\zeta}}_{m}^{(d)}\|_{2}=\|\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{x}_{m}\|_{2}+O_{p}\left(\max(\alpha_{np_{d}},\gamma_{nJ})\right)\|\bm{x}_{m}\|_{2}, \tag{12}
\]

where $\|\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\|_{2}:=O_{p}(\alpha_{np_{d}})$, $\|\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\|_{2}:=O_{p}(\gamma_{nJ})$, $\widehat{\bm{\zeta}}_{m}^{(d)}=\left(\widehat{\zeta}_{m1}^{(d)},\widehat{\zeta}_{m2}^{(d)},\ldots,\widehat{\zeta}_{mn}^{(d)}\right)^{\top}$, $\bm{x}_{m}=(X_{m1},X_{m2},\ldots,X_{mn})^{\top}$, $\bm{M}_{\bm{F}^{(d)}}=\bm{I}_{n}-\bm{F}^{(d)}\left(\bm{F}^{(d)\top}\bm{F}^{(d)}\right)^{-1}\bm{F}^{(d)\top}$ and $\bm{M}_{\bm{G}}=\bm{I}_{n}-\bm{G}(\bm{G}^{\top}\bm{G})^{-1}\bm{G}^{\top}$; here, $\bm{F}^{(d)}$ is an $n\times r_{d}$ matrix with $(i,h)$-th element $f_{ih}^{(d)}$, and $\bm{G}$ is an $n\times r_{c}$ matrix with $(i,k)$-th element $g_{ik}$. The matrices $\bm{M}_{\widehat{\bm{F}}^{(d)}}$ and $\bm{M}_{\widehat{\bm{G}}}$ are defined as $\bm{M}_{\bm{F}^{(d)}}$ and $\bm{M}_{\bm{G}}$, but with $\bm{F}^{(d)}$ and $\bm{G}$ replaced by $\widehat{\bm{F}}^{(d)}$ and $\widehat{\bm{G}}$, respectively.

A brief proof of Theorem 1 is provided in Appendix B.
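The leading term in (12) can be computed directly from the score matrices, as the following small R sketch illustrates (F_d and G denote the $n\times r_{d}$ and $n\times r_{c}$ score matrices of cluster $d$ and of the common components):

    # Residual-maker matrices M_F and M_G and the norm ||M_{F^(d)} M_G x_m||_2 in (12).
    proj_resid <- function(A) diag(nrow(A)) - A %*% solve(crossprod(A), t(A))
    residual_norm <- function(x_m, F_d, G) {
      M_G <- proj_resid(G)      # I - G (G'G)^{-1} G'
      M_F <- proj_resid(F_d)    # I - F (F'F)^{-1} F'
      sqrt(sum((M_F %*% M_G %*% x_m)^2))
    }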

Remark 4

It is expected that $\widehat{\bm{\zeta}}^{(d)}_{m}$ will be large when cluster $d$ and cluster $l$ are different. When $l=d$, the first term on the right-hand side of (12) is equal to $\|\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{\zeta}_{m}^{(d)}\|_{2}$. In contrast, when $l\neq d$ (e.g., $\|\bm{F}^{(l)}\|_{2}$ and $\|\bm{F}^{(d)}\|_{2}$ are distinct), the first term on the right-hand side of (12) is equal to $\|\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{F}^{(l)}\bm{\gamma}^{(l)}+\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{\zeta}_{m}^{(l)}\|_{2}$, which is dominated by $\|\bm{F}^{(l)}\|_{2}$. Meanwhile, the second term on the right-hand side of (12) is determined by the estimation accuracy of $\bm{F}^{(d)}$ and $\bm{G}$. It is related to the dimension of cluster $d$ (e.g., $p_{d}$), the sample size $n$ and the total number of clusters $J$. Bai and Ng (2002) provided the rate of convergence for the projection matrices of principal components. Fan et al. (2013) and Fan et al. (2018) also studied the properties of principal components for high-dimensional data. In this study, we do not pursue the exact expressions of $\alpha_{np_{d}}$ and $\gamma_{nJ}$. However, given that the homogeneity $\bm{M}_{\bm{G}}$ and sub-homogeneity $\bm{F}^{(d)}$ can be estimated accurately, the first term on the right-hand side of (12) is expected to dominate $\|\widehat{\bm{\zeta}}_{m}^{(d)}\|_{2}$. This implies that two different clusters $d$ and $l$ are identifiable, given that $\|\bm{F}^{(l)}\|_{2}$ has a higher order than $\|\bm{\zeta}_{m}^{(d)}\|_{2}$. As a general discussion, Theorem 1 shows that when the sub-homogeneity from different clusters is distinct, our clustering procedure can effectively separate the variables from the different clusters.

6 Simulation Studies

In this section, we conduct various simulation studies to investigate the performance of our proposed CPCA compared with traditional PCA under different simulation settings.

6.1 Simulation Settings

We generate the data from representation (2):

\[
\bm{x}_{i}=\sum_{k=1}^{r_{c}} g_{ik}\bm{\phi}_{k}+\sum_{j=1}^{J}\sum_{h=1}^{r_{j}} f^{(j)}_{ih}\bm{\gamma}^{(j)}_{h}+\sum_{j=1}^{J}\bm{I}^{(j)}\bm{\epsilon}^{(j)}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J.
\]

First, we generate $\bm{\Phi}=(\bm{\phi}_{1},\ldots,\bm{\phi}_{r_{c}})$ as a $p\times r_{c}$ orthonormal matrix. Second, a $p_{j}\times r_{j}$ orthonormal matrix $\bm{\Psi}^{(j)}=(\bm{\psi}^{(j)}_{1},\ldots,\bm{\psi}^{(j)}_{r_{j}})$ is generated randomly and independently across the different clusters for $j=1,\ldots,J$, and we let $\bm{\Gamma}^{(j)}_{h}=(\bm{0}^{(1)},\ldots,\bm{\Psi}^{(j)}_{h},\ldots,\bm{0}^{(J)})$. We then generate $g_{ik}$, $i=1,\ldots,n$, $k=1,\ldots,r_{c}$, from $\sqrt{\delta_{k}}W_{ik}$, where the $W_{ik}$ are i.i.d. standard normal random variables and the eigenvalues $\delta_{k}$ satisfy $\delta_{1}>\ldots>\delta_{r_{c}}>0$. Similarly, we generate $f^{(j)}_{ih}$, $i=1,\ldots,n$, $h=1,\ldots,r_{j}$, $j=1,\ldots,J$, from $\sqrt{\lambda^{(j)}_{h}}Z^{(j)}_{ih}$, where the $Z^{(j)}_{ih}$ are again i.i.d. standard normal random variables and the eigenvalues $\lambda^{(j)}_{h}$ satisfy $\lambda^{(j)}_{1}>\ldots>\lambda^{(j)}_{r_{j}}>0$, $j=1,\ldots,J$. Lastly, $\bm{\epsilon}^{(j)}_{i}$ is a $p_{j}\times 1$ vector consisting of $p_{j}$ i.i.d. normal random variables with mean 0 and variance ${\sigma^{(j)}}^{2}$. Data from each cluster are generated according to the above setting and then combined to obtain $\bm{X}$; see the sketch after this paragraph.
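As referenced above, a hedged R sketch of this data-generating process is given below, with dimensions taken from Example 1; for brevity it uses the same pair of variance levels (25, 5) in every cluster rather than the cluster-specific levels described below, and treats the second argument of $\mathcal{N}(125,5)$ as a variance.

    set.seed(1)
    n <- 50; J <- 5; p_j <- 20; r_c <- 3; r_j <- 2; sigma_j <- 0.5
    p <- J * p_j
    Phi   <- qr.Q(qr(matrix(rnorm(p * r_c), p, r_c)))           # p x r_c orthonormal
    delta <- sort(rnorm(r_c, 125, sqrt(5)), decreasing = TRUE)  # common eigenvalues
    G     <- matrix(rnorm(n * r_c), n, r_c) %*% diag(sqrt(delta))
    X     <- G %*% t(Phi)                                       # common part
    for (j in seq_len(J)) {
      cols  <- ((j - 1) * p_j + 1):(j * p_j)
      Psi_j <- qr.Q(qr(matrix(rnorm(p_j * r_j), p_j, r_j)))     # p_j x r_j orthonormal
      lam_j <- sort(rnorm(r_j, c(25, 5), 1), decreasing = TRUE) # cluster eigenvalues
      F_j   <- matrix(rnorm(n * r_j), n, r_j) %*% diag(sqrt(lam_j))
      X[, cols] <- X[, cols] + F_j %*% t(Psi_j) +               # sub-homogeneity + noise
                   matrix(rnorm(n * p_j, 0, sigma_j), n, p_j)
    }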

In our simulation study, we consider four different settings, which are outlined below.

  • Example 1 ($n=50,\ p=100$). In this example, we consider the number of common components $r_{c}=3$ and generate $\delta_{k}$, $k=1,\ldots,r_{c}$, from $\mathcal{N}(125,5)$. We further simulate $W_{ik}$, $k=1,2,\ldots,r_{c}$, $i=1,2,\ldots,2n$, from a standard normal distribution independently so that $\bm{g}_{i}$ can be constructed.

    In terms of the cluster-specific components, we set $J=5$, and for each of the five clusters, we consider $r_{j}\equiv 2$ and generate $\lambda_{h}^{(j)}\sim\mathcal{N}(\theta_{j,h},1)$, where $\theta_{j,h}$ represents the mean level of each component's variance in each cluster. Specifically, we set $\theta_{1,1}=\theta_{1,2}=25$ for the first cluster, $\theta_{5,1}=\theta_{5,2}=5$ for the last cluster, and $\theta_{j,1}=25$ and $\theta_{j,2}=5$ for $j=2,3,4$, i.e., the remaining three clusters. That is, each of the five clusters has the same number of cluster-specific components, while the components from different clusters are allowed to have different levels of variance. Among the five clusters, the components in the first cluster have the largest variances, those in the last cluster have the smallest variances, and the two components in each of the remaining clusters have different levels of variance. Then, by setting $p_{j}\equiv 20$ for each cluster, we end up with a total of $p=100$ variables. According to (2), $Z^{(j)}_{ih}$ for $h=1,2$ and $\bm{\epsilon}^{(j)}_{i}$ are simulated independently for each cluster $j$, with $\sigma^{(j)}=0.5$ for $j=1,2,\ldots,5$, so that $2n$ observations are constructed based on (2). The first $n$ observations serve as the training sample $\bm{X}_{train}=(\bm{x}_{1},\ldots,\bm{x}_{n})$ and the last $n$ observations are designated as the testing sample $\bm{X}_{test}=(\bm{x}_{n+1},\ldots,\bm{x}_{2n})$. In this example, $n=50$.

  • Example 2 ($n=30,\ p=100$). This example is identical to Example 1, except that the sample size is decreased from 50 to 30. We will show that our CPCA is more reliable than other competing methods when the sample size is relatively small.

  • Example 3 ($n=50,\ p=200$). In this example, we consider data consisting of more clusters; that is, we consider $J=10$ clusters compared with five in Example 1, leading to $p=200$ variables given $p_{j}\equiv 20$. We maintain $r_{j}\equiv 2$ but double the number of clusters with exactly the same setting of $\lambda_{h}^{(j)}\sim\mathcal{N}(\theta_{j,h},1)$. In other words, the components in the first two clusters have the largest variances, while those in the last two clusters have the smallest variances. Similarly, the remaining six clusters have two levels of variance on their components, as described in Example 1. Accordingly, $\bm{\epsilon}^{(j)}_{i}$ is simulated independently for each of the 10 clusters, with $\sigma^{(j)}=0.5$ for $j=1,\ldots,10$. The settings for the common components remain the same as those in Example 1.

  • Example 4 ($n=30,\ p=100$). This example is identical to Example 2, but without the common effect, which is a special case of our proposed data representation as in (7). Correspondingly, the response is generated as:

    \[
    y_{i}=\sum_{j=1}^{J}\bm{f}_{i}^{(j)\top}\bm{\beta}_{j}+e_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J. \tag{13}
    \]

    In this example, we do not estimate the common effect, and the clustering method is applied directly to the data. Therefore, the purpose of considering this example is to demonstrate that our proposed LOO-PCR clustering (Iterative Step (b) and (c) in CPCA), serving as a clustering method in its own right, can outperform hierarchical clustering.

For all four examples, we repeat the simulation procedure 100 times and investigate the clustering and recovering accuracy of the following methods:

  • PCA: The classical PCA method is performed directly on the whole data set without clustering variables. In addition, clustering accuracy is not considered for this method. The number of principal components is determined according to the ratio-based estimator (9) in Remark 1.

  • CPCA_I: This procedure is simply CPCA, but without the Iterative Step. First, the complement $\bm{X}^{c}_{0}=\bm{X}-\bm{G}_{0}\bm{\Phi}^{\top}_{0}$ is found, and then the clusters $C^{(j)}_{0}$ are obtained using hierarchical clustering on $\bm{X}^{c}_{0}$. The variables of $\bm{X}^{c}_{0}$ are clustered accordingly, and the cluster-specific components $\bm{F}_{0}^{(j)}$ are finally obtained using PCA on each cluster.

  • CPCA_F: This is exactly the estimation procedure described in CPCA. Different to CPCA_I, the clusters and components are estimated iteratively.

The main purposes of this comparison are to demonstrate that: 1) traditional PCA fails to capture the cluster-specific components while our proposed CPCA can successfully identify and capture the cluster-specific components; 2) our proposed iterative estimation of the clusters and components outperforms the initial estimation.

In this study, the recovering accuracy is measured by the mean squared recovering error (MSRE), defined as:

\[
\text{MSRE}=\frac{1}{np}\left\|\widehat{\bm{X}}_{test}-\bm{X}_{test}\right\|^{2}_{F}, \tag{14}
\]

where $\widehat{\bm{X}}_{test}$ is the recovered testing sample computed using the testing sample and the eigenvectors ($\bm{\Phi},\bm{\Lambda}^{(j)}$) estimated from the training sample $\bm{X}_{train}$. A lower MSRE is preferred because it indicates a better recovery of the data; a sketch of this computation is given below.
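One natural way to compute the recovery and the MSRE in (14) for plain PCA is sketched below (our reading of the recovery step, not necessarily the exact implementation used here); the same projection logic applies to CPCA with the estimated common and cluster-specific eigenvectors.

    # MSRE for PCA: project the testing sample on the eigenvectors estimated from
    # the training sample, reconstruct, and average the squared recovery error.
    msre_pca <- function(X_train, X_test, r_c) {
      Phi_hat <- eigen(cov(X_train), symmetric = TRUE)$vectors[, seq_len(r_c)]
      X_hat   <- X_test %*% Phi_hat %*% t(Phi_hat)   # projection-based recovery
      mean((X_hat - X_test)^2)                       # equals ||X_hat - X_test||_F^2 / (n p)
    }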

6.2 Simulation Results

The boxplots of the total number of principal components selected (No.PCs) and the MSRE for Examples 1, 2 and 3 are presented in Figure 3, with the corresponding averages and standard errors in Table 1. As shown, CPCA_I (blue boxes) selects a consistently higher number of principal components than its iterative counterpart (red boxes) and the true number of principal components (dotted line), which implies the advantage of using the iterative estimation. In the initial clustering, the common effects are estimated without partitioning the variables, while the iterative clustering uses the partition of the variables from the previous step to estimate the common components, resulting in a more accurate estimation. This is also stated in Remark 2.

From the recovering accuracy perspective, it is worth noting from Figure 3 and Table 1 that PCA always performs poorly under this cluster structure of the data, while both CPCA-based methods demonstrate superior performance in recovering because they achieve lower MSRE than PCA. Although both CPCA_I and CPCA_F can achieve similar MSRE, the numbers of PCs considered by CPCA_I are generally greater than those identified by CPCA_F, which also indicates the importance of the iterative partitions. In general, CPCA_F results in more accurate partitions and produces more reliable estimations of the cluster-specific components, especially for those with small variations, which contribute the most to predicting the response according to our simulation settings. This further confirms the advantages of using our proposed method.

The results of our proposed CPCA_I and CPCA_F in Examples 2 and 3 are similar, whereas when the sample size is smaller ($n=30$) or the number of variables and clusters is larger ($p=200$), the performance of traditional PCA is adversely affected. This is not surprising because traditional PCA estimates principal components from the sample covariance matrix, which is less accurate due to the so-called curse of dimensionality, and only the common components can be identified. However, our proposed CPCA is superior in estimating the PCs and recovering the original data because the sub-homogeneity is correctly estimated.

Lastly, we investigate Example 4, which does not take into account the common effect. In this example, CPCA_I and CPCA_F simply reduce to performing PCA on each cluster of variables based on hierarchical clustering and LOO-PCR clustering, respectively. From the last row of Figure 3 and the last column of Table 1, we observe that applying PCA after clustering the variables demonstrates better performance than applying PCA directly to the data. Compared with CPCA_I, CPCA_F achieves a better estimate of the number of PCs, implying that our proposed LOO-PCR clustering outperforms hierarchical clustering.

Figure 3: Boxplots of the following two measurements based on 100 simulations from Examples 1 to 4: (a) No.PCs and (b) MSRE.
Table 1: Averages (standard errors) of the total number of principal components selected (No.PCs) and the MSRE for Examples 1, 2, 3, and 4.
Measure Method Example 1 Example 2 Example 3 Example 4
No.PCs CPCA_I 24.36 16.90 30.96 12.22
(6.16) (5.14) (6.53) (1.88)
CPCA_F 21.22 14.64 27.78 10.12
(7.70) (4.87) (6.89) (1.15)
PCA 5.65 3.00 2.95 6.54
(4.04) (1.32) (0.31) (2.56)
MSRE CPCA_I 0.37 0.45 0.34 0.26
(0.20) (0.22) (0.16) (0.05)
CPCA_F 0.38 0.47 0.33 0.26
(0.17) (0.22) (0.07) (0.04)
PCA 1.32 2.00 1.82 0.54
(0.67) (0.60) (0.17) (0.24)

7 Applications of CPCA

7.1 Principal component regression

As aforementioned, one of the most important uses of the principal components is PCR. We consider a PCR model with a univariate response $y_{i}$ for the $i^{th}$ observation:

\[
y_{i}=\bm{g}^{\top}_{i}\bm{\alpha}+\sum_{j=1}^{J}\bm{f}^{(j)\top}_{i}\bm{\beta}_{j}+e_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J, \tag{15}
\]

where $\bm{\alpha}$ is an $r_{c}\times 1$ vector representing the regression coefficients of the common components $\bm{g}_{i}$, $\bm{\beta}_{j}$ is an $r_{j}\times 1$ vector denoting the regression coefficients of the $j^{th}$ cluster-specific components $\bm{f}^{(j)}_{i}$, and $e_{i}$ is simply the error term with mean 0 and variance $\theta^{2}$.

Therefore, after the common components $\bm{g}_{i}$ and cluster-specific components $\bm{f}^{(j)}_{i}$ are estimated via CPCA, we can fit a PCR to predict the response. However, in many situations, only a few groups are useful in predicting the response. To investigate which clusters of variables have an impact on predicting the response, we apply the group lasso (Yuan and Lin, 2006), using the components produced by CPCA as the covariates, to estimate the regression coefficients in (15),

\[
\left(\widehat{\bm{\alpha}},\widehat{\bm{\beta}}\right)=\operatorname*{argmin}_{\bm{\alpha},\bm{\beta}}\frac{1}{2n}\sum_{i=1}^{n}\left(y_{i}-\bm{g}^{\top}_{i}\bm{\alpha}-\sum_{j=1}^{J}\bm{f}_{i}^{(j)\top}\bm{\beta}_{j}\right)^{2}+\lambda\left(\left\|\bm{\alpha}\right\|_{2}+\sum_{j=1}^{J}\left\|\bm{\beta}_{j}\right\|_{2}\right), \tag{16}
\]

where $\lambda$ is the tuning parameter that controls the sparsity of the regression coefficients. With this group lasso penalty, $\bm{\beta}_{j}$ for some $j$ is shrunk exactly to zero, so that we can identify which clusters of variables are important in predicting the response; a sketch of this fit is given below.
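A sketch of this group-lasso fit is shown below, using the gglasso package as one possible implementation and assuming G_hat (the estimated common components), F_hat (a list of estimated cluster-specific component matrices) and the response y come from the fitted CPCA and model (15).

    library(gglasso)
    # Stack the components and assign a group index to each column:
    # group 1 for the common components, group j + 1 for cluster j.
    Z   <- cbind(G_hat, do.call(cbind, F_hat))
    grp <- c(rep(1, ncol(G_hat)),
             rep(seq_along(F_hat) + 1, sapply(F_hat, ncol)))
    fit <- cv.gglasso(x = Z, y = y, group = grp)   # lambda chosen by cross-validation
    coef(fit, s = "lambda.min")                    # coefficients of zeroed groups drop out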

To better demonstrate the performance of our method, we conduct similar simulation studies as in Section 6. We consider the following three examples:

  • Example 1 ($n=50,\ p=100$). We generate $\bm{X}$ using the same settings as Example 1 in Section 6.1. Then, we simulate the response $y_{i}$ according to (15) by setting the regression coefficients $\bm{\alpha}=(1,1,1)$, $\bm{\beta}=(\bm{\beta}_{1},\ldots,\bm{\beta}_{10})=(0,0,\ldots,0,25,25)$ and standard deviation $\theta=1$. That is, only the common effect and the cluster-specific effects of the last cluster are important in predicting the response. This regression setting is interesting because the cluster-specific components with smaller variance (e.g., the last cluster) are highly likely to be omitted or estimated poorly by traditional PCA, but they can sometimes be very important in predicting the response.

  • Example 2 ($n=30,\ p=100$). This example is identical to Example 1 above, except that the sample size is decreased from 50 to 30.

  • Example 3 ($n=50,\ p=200$). We generate $\bm{X}$ using the same settings as Example 3 in Section 6.1. That is, we consider $J=10$ clusters in this example. In regard to the PCR, we again set the regression coefficients $\bm{\alpha}=(1,1,1)$ and consider $\bm{\beta}=(\bm{\beta}_{1},\ldots,\bm{\beta}_{20})=(0,0,\ldots,0,25,25)$, such that only the common effect and the cluster-specific effects of the last cluster are useful and the rest of the clusters are noise.

For all three examples, we repeat the simulation procedure 100 times and investigate the prediction accuracy of the methods mentioned in Section 6.1. The mean squared prediction error (MSPE) is used to determine whether the selected components can accurately predict the response:

\[
\text{MSPE}=\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{y}_{i,test}-y_{i,test}\right)^{2}, \tag{17}
\]

where $\widehat{y}_{i,test}$ is computed using (15) with the regression coefficients ($\bm{\alpha},\bm{\beta}$) and the eigenvectors ($\bm{\Phi},\bm{\Lambda}^{(j)}$) estimated from the training sample.

Figure 4: Boxplots of the following two measurements based on 100 simulations from Examples 1 to 3: (a) MSPE and (b) $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$.
Table 2: Averages (standard errors) of MSPE and $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{F}^{2}$ for Examples 1, 2, and 3.
Method Example 1 Example 2 Example 3
MSPE CPCA_I 3.27 5.30 4.65
(2.14) (4.21) (4.92)
CPCA_F 2.95 4.90 4.22
(1.92) (4.12) (4.94)
PCA 5.55 6.67 6.32
(2.37) (2.03) (1.43)
CPCA_I_g 2.38 3.66 2.43
(1.27) (2.13) (1.18)
CPCA_F_g 2.30 3.61 2.40
(1.23) (2.10) (1.19)
$\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|_{F}^{2}$ CPCA_I 85.92 108.55 111.71
(14.83) (25.05) (13.57)
CPCA_F 86.41 108.80 111.52
(14.01) (24.71) (13.33)
PCA 93.40 123.95 128.53
(12.83) (31.88) (16.77)
POET 91.28 119.71 125.01
(11.87) (30.49) (16.46)

The first panel of Table 2 and the first column of Figure 4 display the means, standard errors and boxplots of MSPE for all three examples. Note that in this section, CPCA_F and CPCA_F_g represent the ordinary least squares (OLS) and the group lasso regressions, respectively, on the principal components produced by CPCA_F. Similar notation applies to the other CPCA-based methods. When the traditional PCR (OLS) is used, we clearly see that CPCA_F achieves a much lower average MSPE and standard deviation than the other methods, especially when $n$ is small or $p$ is large. Not surprisingly, the traditional PCA performs the worst because it fails to capture any of the sub-homogeneity, which is important in predicting the response under our setting. Moreover, we observe that PCR using the group lasso also outperforms the OLS counterparts, especially for CPCA_I. The group lasso provides substantial reductions in MSPE for CPCA_I because CPCA_I tends to select a large number of principal components, only some of which are useful. Overall, CPCA_F_g achieves the lowest mean and standard error of MSPE among all the methods in all three examples, because it not only accurately extracts both the common components and the group-specific components, but also shrinks the coefficients of the unimportant groups to zero.

7.2 Covariance and Precision Matrix Estimation

Another useful application of CPCA is to estimate $\bm{\Sigma}$ as in (4). Using $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ leads to a better estimate of $\bm{\Sigma}$ compared with using $\bm{g}_{i}$ only, because $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ together capture both the low-rank and the block-diagonal representation of $\bm{\Sigma}$. Therefore, we estimate $\bm{\Sigma}$ using the clusters, the common components, and the cluster-specific components, along with their associated eigenvectors estimated via CPCA. In this section, we numerically illustrate the advantage of using CPCA to estimate the covariance matrix in comparison with other traditional PCA methods. Again, we conduct the three simulation examples of Section 7.1 and use the Euclidean distance between the estimated covariance and the population covariance, $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$, to measure the performance of the different methods. Recall that $\bm{\Sigma}$ is generated from (4):

\[
\bm{\Sigma}=\bm{\Phi}\,\mathrm{cov}\left(\bm{g}_{i}\right)\bm{\Phi}^{\top}+\sum_{j=1}^{J}\bm{\Gamma}^{(j)}\mathrm{cov}(\bm{f}^{(j)}_{i})\bm{\Gamma}^{(j)\top}+\sum_{j=1}^{J}\bm{I}^{(j)}{\sigma^{(j)}}^{2}.
\]
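Given the CPCA output, the plug-in estimate of (4) can be assembled as in the following R sketch; all object names here are assumptions (Phi_hat, the list Gamma_hat of full $p\times r_{j}$ loading matrices that are zero outside their clusters, the score matrices G_hat and F_hat, the 0/1 cluster indicators I_j and the noise variances sigma2_hat).

    # Plug-in covariance estimate: low-rank common part plus block-diagonal cluster parts.
    Sigma_hat <- Phi_hat %*% cov(G_hat) %*% t(Phi_hat)
    for (j in seq_along(Gamma_hat)) {
      Sigma_hat <- Sigma_hat +
        Gamma_hat[[j]] %*% cov(F_hat[[j]]) %*% t(Gamma_hat[[j]]) +
        diag(I_j[[j]]) * sigma2_hat[j]   # I_j[[j]]: indicator of cluster j's variables
    }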

In this section, we also add POET (Fan et al., 2013), another prevalent covariance estimation method mentioned earlier, to our comparison.
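Before turning to the results, we outline how such an estimate can be assembled; the following is a minimal R sketch of the plug-in estimator implied by (4), with the inputs (estimated components, loadings, and residuals from CPCA) treated as given rather than computed here.

```r
# Minimal sketch of the plug-in covariance estimator implied by (4). Assumed
# inputs: Phi (p x r) and G (n x r) for the common part; for each cluster j,
# Gamma_list[[j]] (p_j x r_j), F_list[[j]] (n x r_j), the residual matrix
# U_list[[j]] (n x p_j), and idx[[j]], the column indices of cluster j in X.
estimate_sigma <- function(Phi, G, Gamma_list, F_list, U_list, idx) {
  Sigma_hat <- Phi %*% cov(G) %*% t(Phi)              # low-rank common part
  for (j in seq_along(idx)) {
    pj <- length(idx[[j]])
    block <- Gamma_list[[j]] %*% cov(F_list[[j]]) %*% t(Gamma_list[[j]]) +
      diag(mean(apply(U_list[[j]], 2, var)), pj)      # cluster-specific part + residual variance
    Sigma_hat[idx[[j]], idx[[j]]] <- Sigma_hat[idx[[j]], idx[[j]]] + block
  }
  Sigma_hat
}
```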

The second panel of Table 2 and the second column of Figure 4 present the means, standard errors, and boxplots of $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$ for all three examples, computed using the methods discussed above. From these results, we see that POET performs better than PCA but worse than CPCA_I and CPCA_F, indicating that the sparsity structure imposed by POET can partly, but not very efficiently, capture the sub-homogeneity. Among all methods, CPCA_I and CPCA_F achieve the lowest mean and standard error of $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$ because they best identify both the homogeneity and the sub-homogeneity and hence estimate $\bm{\Sigma}$ most accurately.

8 Real Data Analysis

For further illustration, we analyze a stock return data set using the proposed CPCA and compare it with traditional PCA. In particular, we are interested in the performance of our proposed CPCA on clustering stock returns of companies from various industries. Moreover, we also use CPCA to obtain an estimate of the covariance matrix and use it to construct a minimum variance portfolio (MVP).

8.1 Clustering Stock Returns

The data are collected from the Center for Research in Security Prices (CRSP) and include the daily stock returns of 160 companies from 1st January, 2014 to 31st December, 2014, with 252 trading days. The 160 stocks are selected from eight different industries according to Fama and French’s 48-industry classification (Fama and French, 1997), namely, Candy and Soda, Tobacco Products, Apparel, Aircraft, Shipbuilding and Railroad Equipment, Petroleum and Natural Gas, Measuring and Control Equipment, and Shipping Containers, with 20 stocks from each industry. The data for the first 126 trading days are treated as the training sample, and the rest are the testing sample. Thus, the training data have dimensions $n=126$ and $p=160$. In this example, after removing the common effect from the data, we aim to identify clusters that consist of companies from the same industry. This cluster structure of stock returns is also discussed in Fan et al. (2013).
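For readers who wish to reproduce this split in R, a minimal sketch is given below; the file name is hypothetical and stands for the data set of daily returns provided in the supplementary material.

```r
# Minimal sketch of the train/test split; "stock_returns_2014.rds" is a
# hypothetical file name for the supplementary data set of daily returns
# (252 trading days x 160 stocks).
X_all  <- readRDS("stock_returns_2014.rds")
X      <- X_all[1:126, ]       # training sample: n = 126, p = 160
X_test <- X_all[127:252, ]     # testing sample: remaining 126 trading days
```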

Figure 5: Correlation plots of (a) the original stock data $\bm{X}$ and (b) the complement of the stock data $\bm{X}^{c}_{f}=\bm{X}-\bm{G}_{f}\bm{\Phi}^{\top}_{f}$.
Figure 6: Cluster membership for 160 stocks based on (a) their industry, (b) hierarchical clustering of the original data, (c) initial clusters obtained from CPCA, and (d) final clusters obtained from CPCA. Stocks with the same color are from the same industry.

Figure 5 (a) shows the correlation plot for the original stock data. We can observe a vague cluster structure, but the off-diagonal blocks are clearly non-zero, implying that stocks from the different industries selected in our study tend to be positively correlated. We apply hierarchical clustering directly to the original data and find that most stocks from different industries are clustered in the same group, as displayed in Figure 6 (b). Only stocks from the $5^{\text{th}}$ industry stand out as a separate cluster, because they have a relatively lower correlation with the other stocks. If we treat the industries as the true clusters, hierarchical clustering of the original data yields an ARI of only $0.09$. This indicates that common components may exist and conceal the cluster structure within the data.

Next, we apply our proposed CPCA to the stock data, and one common component is determined using the ratio-based estimator (9). Figure 5 (b) presents the correlation plot of the data after the common effect is removed; the cluster structure is now much more apparent. One can interpret the common effect as the market effect and the remaining components as industry-specific effects. As demonstrated in Figure 6 (c), after the initial step of CPCA, the clusters of stocks from the $1^{\text{st}}$, $2^{\text{nd}}$, and $7^{\text{th}}$ industries are correctly identified, while stocks from the other industries (the $3^{\text{rd}}$, $4^{\text{th}}$, $5^{\text{th}}$, $6^{\text{th}}$, and $8^{\text{th}}$) cannot be correctly separated from two large clusters. The ARI after the initial step of CPCA is about $0.44$, a substantial improvement over $0.09$. Figure 6 (d) depicts the final clusters after the iterative steps of CPCA, where most stocks from the aforementioned two large clusters are correctly separated by industry, at the minor cost of a few misclassifications in the $1^{\text{st}}$ and $7^{\text{th}}$ clusters. The ARI is further improved from $0.44$ to $0.72$. The final clusters mainly capture the industry information, but a few stocks are not clustered into their own industries. This is to be expected because some stocks from the same industry are not highly correlated, as shown in Figure 5 (e.g., some stocks from the first, third, sixth, and seventh industries), which is common in practice due to the diversification of enterprises.
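The effect of removing the common component can be sketched in a few lines of R; the correlation-based distance, average-linkage hierarchical clustering, and the adjustedRandIndex function from the mclust package are illustrative choices that need not coincide with the clustering rule used by CPCA, and X and `industry` are assumed to hold the training returns and the industry labels.

```r
# Minimal sketch: cluster the raw returns, then remove one common (market)
# component and re-cluster the complement. X is the n x p training matrix and
# `industry` a length-p vector of industry labels (both assumed available).
library(mclust)                                  # for adjustedRandIndex

cluster_stocks <- function(M, k = 8) {
  d <- as.dist(1 - cor(M))                       # correlation-based distance between stocks
  cutree(hclust(d, method = "average"), k = k)
}

# (i) Hierarchical clustering of the original data
cl_raw <- cluster_stocks(X)
adjustedRandIndex(cl_raw, industry)              # ARI relative to the industry labels

# (ii) Remove the first principal component and re-cluster the complement
Xs <- scale(X, center = TRUE, scale = FALSE)     # center each stock's returns
pc <- prcomp(Xs, center = FALSE)
Xc <- Xs - pc$x[, 1, drop = FALSE] %*% t(pc$rotation[, 1, drop = FALSE])
cl_c <- cluster_stocks(Xc)
adjustedRandIndex(cl_c, industry)                # ARI typically improves after removal
```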

From the prediction perspective, we compute the MSRE for PCA, CPCA_I, and CPCA_F as $3.56$, $2.23$, and $1.67$ (in units of $10^{-4}$), respectively. In other words, our CPCA not only captures the market effect of stock returns as the homogeneity but also includes some sub-homogeneities, which can be interpreted as industry effects. All of these findings firmly support the use of our proposed CPCA for analyzing these stock return data.

8.2 Constructing Minimum Variance Portfolio

Next, we consider the problem of constructing a global minimum variance portfolio (MVP) from stock returns. The minimum variance portfolio is built by solving the following optimization problem

\displaystyle\min_{\bm{w}}\ \bm{w}^{\top}\bm{\widetilde{\Sigma}}\bm{w}\quad\text{s.t.}\ \bm{w}^{\top}\bm{1}_{p}=1,

where $\bm{w}$ is a $p$-dimensional vector of weights allocated to each stock, $\bm{1}_{p}$ is a $p$-dimensional vector of ones, and $\bm{\widetilde{\Sigma}}$ is an estimate of the $p\times p$ covariance matrix of stock returns. The analytic solution for $\bm{w}$ is

\displaystyle\bm{w}^{\star}=\frac{\widetilde{\bm{\Sigma}}^{-1}\bm{1}_{p}}{\bm{1}_{p}^{\top}\widetilde{\bm{\Sigma}}^{-1}\bm{1}_{p}}, (18)

where we assume that short selling is allowed and there are no transaction costs.
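For concreteness, (18) can be computed as in the following minimal R sketch, where Sigma_hat stands for any covariance estimate (e.g., the one produced by CPCA).

```r
# Minimal sketch of the analytic MVP weights in (18); Sigma_hat is an assumed
# p x p covariance estimate (e.g., from CPCA, PCA, or POET).
mvp_weights <- function(Sigma_hat) {
  ones <- rep(1, ncol(Sigma_hat))
  w <- solve(Sigma_hat, ones)      # computes Sigma^{-1} 1_p without forming the inverse
  w / sum(w)                       # normalization gives w* = Sigma^{-1} 1_p / (1_p' Sigma^{-1} 1_p)
}
```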

Constructing the global MVP is another important application of our proposed CPCA, owing to its ability to correctly identify both the homogeneity and the sub-homogeneity when estimating the covariance matrix of stock returns. The construction of MVPs has been well studied and discussed by Engle et al. (2019), Chen et al. (2019), and Wang et al. (2021), where several estimation approaches for the covariance matrices of stock returns are developed. Our interest in this section is the out-of-sample performance of the global MVP constructed using the covariance matrices estimated by CPCA. Recall that the returns of $160$ stocks from $8$ industries are collected for $252$ trading days; we use rolling windows of $110$ days (5 calendar months) to evaluate the out-of-sample performance of the portfolio. Specifically, for each trading day $i$, we first estimate $\bm{\widetilde{\Sigma}}$ by CPCA using the stock returns in the most recent $110$ trading days before day $i$ (from day $(i-110)$ to day $(i-1)$), and then compute the optimal weights $\bm{w}_{i}^{\star}$ by (18). The return of the global MVP on day $i$ is then computed from $\bm{w}_{i}^{\star}$ and the return of each stock on day $i$. Similar to Wang et al. (2021), we compute the standard deviation (STD), the information ratio (IR), which is the ratio of the mean return to the STD, and the Sharpe ratio (SR), which is the ratio of the mean excess return (the portfolio return minus the risk-free return) to the STD. As discussed by Engle et al. (2019), Chen et al. (2019), and Wang et al. (2021), the STD should be the primary measure when comparing MVPs constructed from different covariance estimates, while the IR and SR are also desirable but less crucial. The numerical results in Table 3 compare the portfolios constructed with different estimates of the covariance matrices. As shown, CPCA with either the initial or the final clusters achieves a slightly better estimate of the covariance matrix, since the corresponding portfolios have the smallest STD. In addition, the global MVP constructed using CPCA with the final clustering index also achieves the highest IR and SR among all methods, which is also desirable. As a result, CPCA works well for constructing the global MVP, since the sub-homogeneity of industry effects is well estimated.
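The rolling-window evaluation just described can be sketched as follows, reusing the mvp_weights function above; estimate_cov is a placeholder for the CPCA (or PCA/POET) covariance estimator, and the risk-free rate is treated as a given constant.

```r
# Minimal sketch of the rolling-window backtest. Assumptions for illustration:
# R is the 252 x 160 matrix of daily returns, estimate_cov is a placeholder for
# the covariance estimator (it must return an invertible estimate, since the
# window length 110 is smaller than p = 160), and rf is the daily risk-free rate.
backtest_mvp <- function(R, estimate_cov, window = 110, rf = 0) {
  n <- nrow(R)
  port_ret <- numeric(n - window)
  for (i in (window + 1):n) {
    Sigma_hat <- estimate_cov(R[(i - window):(i - 1), ])   # past 110 trading days
    w <- mvp_weights(Sigma_hat)                            # weights from (18)
    port_ret[i - window] <- sum(w * R[i, ])                # realized portfolio return on day i
  }
  c(STD = sd(port_ret),                                    # primary measure
    IR  = mean(port_ret) / sd(port_ret),                   # information ratio
    SR  = mean(port_ret - rf) / sd(port_ret))              # Sharpe ratio
}
```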

Table 3: Out-of-sample performance of the minimum variance portfolio (MVP) constructed using the covariance matrices estimated by traditional PCA, CPCA with the initial clustering index, CPCA with the final clustering index, and POET.
Method STD IR SR
CPCA_I 0.008 0.120 0.112
CPCA_F 0.008 0.151 0.143
PCA 0.010 0.107 0.099
POET 0.009 0.148 0.140

9 Discussion

To conclude, we introduced a novel CPCA to study the homogeneity and sub-homogeneity of high-dimensional data collected from different populations, where the sub-homogeneity refers to a group-specific feature of a particular population. Our numerical simulations confirmed that traditional PCA can only extract the homogeneity from the data, whereas CPCA not only provides a more accurate estimate of the common components, but also identifies the group-specific features, even for groups with small variations. Under a PCR setting, the features extracted using CPCA can significantly outperform those selected by classical PCA in terms of prediction, especially when $n$ is small but $p$ is large. Our real analysis of the stock return data also demonstrated that, when using CPCA, we can capture the industry information after the common component (i.e., the market effect) is removed. All of these findings support the use of our proposed CPCA in dimension-reduction problems.

The applications of CPCA are not limited to producing principal components for PCR and revealing the industry structure of stock return data. CPCA can also be applied to any other data set in which a group structure exists but is hidden by the homogeneity. Further, it can be used to estimate the covariance matrix, and its inverse, of a large data set that exhibits a group structure. More applications of CPCA will be explored in our future work.


SUPPLEMENTARY MATERIAL

R-codes for CPCA:

R-codes including R functions to perform the iterative complement-clustering principal component analysis (CPCA) described in the article. Also included are the R-codes for simulation studies and real analysis of the stock return data shown in the article. (.R files)

Data of stock returns:

Data set of daily returns of stocks from different industries in the US (2014). (.rds file)

References

  • Anderson (2003) Anderson, T. (2003), An Introduction to Multivariate Statistical Analysis, 3 edn, Wiley.
  • Bai and Ng (2002) Bai, J. and Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221.
  • Boivin and Ng (2006) Boivin, J. and Ng, S. (2006), ‘Are more data always better for factor analysis?’, Journal of Econometrics 132(1), 169–194.
  • Bühlmann et al. (2013) Bühlmann, P., Rütimann, P., van de Geer, S. and Zhang, C.-H. (2013), ‘Correlated variables in regression: clustering and sparse estimation’, Journal of Statistical Planning and Inference 143(11), 1835–1858.
  • Cai et al. (2017) Cai, T., Han, X. and Pan, G. (2017), ‘Limiting laws for divergent spiked eigenvalues and largest non-spiked eigenvalue of sample covariance matrices’, arXiv preprint arXiv:1711.00217 .
  • Chen et al. (2019) Chen, J., Li, D. and Linton, O. (2019), ‘A new semiparametric estimation approach for large dynamic covariance matrices with multiple conditioning variables’, Journal of Econometrics 212(1), 155–176.
  • Chiou and Li (2007) Chiou, J.-M. and Li, P.-L. (2007), ‘Functional clustering and identifying substructures of longitudinal data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(4), 679–699.
  • Deville and Malinvaud (1983) Deville, J.-C. and Malinvaud, E. (1983), ‘Data analysis in official socio-economic statistics’, Journal of the Royal Statistical Society: Series A (General) 146(4), 335–352.
  • Donoho et al. (2000) Donoho, D. L. et al. (2000), ‘High-dimensional data analysis: The curses and blessings of dimensionality’, AMS math challenges lecture 1(32), 375.
  • Engle et al. (2019) Engle, R. F., Ledoit, O. and Wolf, M. (2019), ‘Large dynamic covariance matrices’, Journal of Business & Economic Statistics 37(2), 363–375.
  • Fama and French (1997) Fama, E. F. and French, K. R. (1997), ‘Industry costs of equity’, Journal of Financial Economics 43(2), 153–193.
  • Fan et al. (2014) Fan, J., Han, F. and Liu, H. (2014), ‘Challenges of big data analysis’, National Science Review 1(2), 293–314.
  • Fan et al. (2013) Fan, J., Liao, Y. and Mincheva, M. (2013), ‘Large covariance estimation by thresholding principal orthogonal complements’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(4), 603–680.
  • Fan et al. (2018) Fan, J., Sun, Q., Zhou, W.-X. and Zhu, Z. (2018), ‘Principal component analysis for big data’, Wiley StatsRef: Statistics Reference Online pp. 1–13.
  • Hong et al. (2018) Hong, D., Fessler, J. A. and Balzano, L. (2018), ‘Optimally weighted pca for high-dimensional heteroscedastic data’, arXiv preprint arXiv:1810.12862 .
  • Hotelling (1933) Hotelling, H. (1933), ‘Analysis of a complex of statistical variables into principal components.’, Journal of Educational Psychology 24(6), 417.
  • Hubert and Arabie (1985) Hubert, L. and Arabie, P. (1985), ‘Comparing partitions’, Journal of Classification 2(1), 193–218.
  • Johnstone (2001) Johnstone, I. M. (2001), ‘On the distribution of the largest eigenvalue in principal components analysis’, The Annals of Statistics 29(2), 295–327.
  • Johnstone and Lu (2009) Johnstone, I. M. and Lu, A. Y. (2009), ‘On consistency and sparsity for principal components analysis in high dimensions’, Journal of the American Statistical Association 104(486), 682–693.
  • Jolliffe (2002) Jolliffe, I. (2002), Principal Component Analysis, Springer Verlag, New York.
  • Jolliffe and Cadima (2016) Jolliffe, I. T. and Cadima, J. (2016), ‘Principal component analysis: a review and recent developments’, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065), 20150202.
  • Lam and Yao (2012) Lam, C. and Yao, Q. (2012), ‘Factor modeling for high-dimensional time series: Inference for the number of factors’, The Annals of Statistics 40(2), 694–726.
  • Li et al. (2018) Li, Q., Cheng, G., Fan, J. and Wang, Y. (2018), ‘Embracing the blessing of dimensionality in factor models’, Journal of the American Statistical Association 113(521), 380–389.
  • Liu et al. (2002) Liu, A., Zhang, Y., Gehan, E. and Clarke, R. (2002), ‘Block principal component analysis with application to gene microarray data classification’, Statistics in Medicine 21(22), 3465–3474.
  • Morales-Jimenez et al. (2018) Morales-Jimenez, D., Johnstone, I. M., McKay, M. R. and Yang, J. (2018), ‘Asymptotics of eigenstructure of sample correlation matrices for high-dimensional spiked models’, arXiv preprint arXiv:1810.10214 .
  • Pearson (1901) Pearson, K. (1901), ‘Liii. on lines and planes of closest fit to systems of points in space’, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572.
  • Peng et al. (2020) Peng, B., Su, L., Westerlund, J. and Yang, Y. (2020), ‘Interactive effects panel data models with general factors and regressors’, Available at SSRN 3617477 .
  • Rand (1971) Rand, W. M. (1971), ‘Objective criteria for the evaluation of clustering methods’, Journal of the American Statistical Association 66(336), 846–850.
  • Tan et al. (2015) Tan, K. M., Witten, D. and Shojaie, A. (2015), ‘The cluster graphical lasso for improved estimation of gaussian graphical models’, Computational Statistics & Data Analysis 85, 23–36.
  • Tibshirani et al. (2015) Tibshirani, R., Wainwright, M. and Hastie, T. (2015), Statistical learning with sparsity: the lasso and generalizations, Chapman and Hall/CRC.
  • Wang et al. (2021) Wang, H., Peng, B., Li, D. and Leng, C. (2021), ‘Nonparametric estimation of large covariance matrices with conditional sparsity’, Journal of Econometrics 223(1), 53–72.
  • Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006), ‘Model selection and estimation in regression with grouped variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67.
  • Zou et al. (2006) Zou, H., Hastie, T. and Tibshirani, R. (2006), ‘Sparse principal component analysis’, Journal of Computational and Graphical Statistics 15(2), 265–286.

Appendix A Estimation of $\bm{G}_{1}$ and $\bm{\Phi_{1}}$ in CPCA Iterative Step (a)

In this part, we describe the estimation of the principal components, along with their eigenvectors, combined from each cluster in Iterative Step (a) of Algorithm 1.

First, we perform PCA within cluster $j$:

\displaystyle\bm{X}^{(j)}=\bm{\Psi}^{(j)}\bm{\Pi}^{(j)\top}+\bm{U}^{(j)},

where $\bm{\Psi}^{(j)}$ is an $n\times r_{j}$ matrix of principal components for the variables in cluster $j$ and $\bm{\Pi}^{(j)}$ is a $p_{j}\times r_{j}$ matrix whose columns are eigenvectors of $\bm{X}^{(j)\top}\bm{X}^{(j)}$. Here, recall that $p_{j}$ denotes the number of variables in cluster $j$, and $r_{j}$ is the number of principal components in cluster $j$. We then combine the principal components $\bm{\Psi}^{(j)}$ from all clusters as $\bm{\Psi}=(\bm{\Psi}^{(1)},\ldots,\bm{\Psi}^{(J_{0})})$ and perform a further step of PCA on $\bm{\Psi}$ to obtain:

\displaystyle\bm{\Psi}=\bm{G}_{1}\bm{H}^{\top}+\bm{V},

where we assume that the first $r_{c}$ principal components summarize the common effects among all clusters. Then, $\bm{G}_{1}$ is an $n\times r_{c}$ matrix of principal components of $\bm{\Psi}$, and $\bm{H}$ is a $(\sum_{j=1}^{J_{0}}r_{j})\times r_{c}$ matrix whose columns are eigenvectors of $\bm{\Psi}^{\top}\bm{\Psi}$. Lastly, we can find the complement for cluster $j$ as:

\displaystyle\bm{X}_{1}^{(j)c}=\bm{X}^{(j)}-\bm{G}_{1}\bm{H}^{\top}\bm{\Pi}^{(j)\star\top},

where $\bm{\Pi}^{(j)\star}=(\bm{0}^{(1)},\bm{0}^{(2)},\ldots,\bm{\Pi}^{(j)},\ldots,\bm{0}^{(J_{0})})$ is a $p_{j}\times(\sum_{j=1}^{J_{0}}r_{j})$ matrix in which $\bm{\Pi}^{(j)}$ is the $p_{j}\times r_{j}$ matrix described above and $\bm{0}^{(l)}$ is a $p_{j}\times r_{l}$ zero matrix for $l=1,2,\ldots,J_{0};\ l\neq j$. Once we obtain the complement from each cluster, we can combine them and write the total complement as:

\displaystyle\bm{X}_{1}^{c}=\bm{X}-\bm{G}_{1}\bm{H}^{\top}\bm{\Pi}^{\top},

where $\bm{\Pi}=(\bm{\Pi}^{(1)\star\top},\bm{\Pi}^{(2)\star\top},\ldots,\bm{\Pi}^{(j)\star\top},\ldots,\bm{\Pi}^{(J_{0})\star\top})^{\top}$ is a $p\times(\sum_{j=1}^{J_{0}}r_{j})$ block-diagonal matrix. Hence, we can finally define $\bm{\Phi_{1}}=\bm{\Pi}\bm{H}$ as the matrix of eigenvectors corresponding to $\bm{G}_{1}$. It is also easy to show that $\bm{\Phi_{1}}^{\top}\bm{\Phi_{1}}=\bm{I}_{r_{c}}$.
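For completeness, the estimation just described can be sketched in R as follows; the inputs (the cluster labels and the numbers of components $r_{j}$ and $r_{c}$) are taken as given, and this sketch is an illustration rather than the implementation in our supplementary materials.

```r
# Minimal sketch of Iterative Step (a). Assumed inputs: X is the n x p data
# matrix, `cluster` a length-p vector of cluster labels, r_j a vector with the
# number of components retained in each cluster, and r_c the number of common
# components.
iterative_step_a <- function(X, cluster, r_j, r_c) {
  labels <- sort(unique(cluster))
  Psi_list <- list(); Pi_list <- list()
  for (j in seq_along(labels)) {
    Xj  <- X[, cluster == labels[j], drop = FALSE]
    pca <- prcomp(Xj, center = FALSE)                        # PCA within cluster j
    Psi_list[[j]] <- pca$x[, 1:r_j[j], drop = FALSE]         # Psi^(j): n x r_j
    Pi_list[[j]]  <- pca$rotation[, 1:r_j[j], drop = FALSE]  # Pi^(j): p_j x r_j
  }
  Psi  <- do.call(cbind, Psi_list)              # combine components across clusters
  pca2 <- prcomp(Psi, center = FALSE)           # further PCA on the combined components
  G1   <- pca2$x[, 1:r_c, drop = FALSE]         # common components G_1: n x r_c
  H    <- pca2$rotation[, 1:r_c, drop = FALSE]  # eigenvectors of Psi' Psi
  # Block-diagonal Pi (p x sum(r_j)) and the eigenvectors Phi_1 = Pi H
  Pi_mat  <- matrix(0, ncol(X), ncol(Psi))
  col_end <- cumsum(r_j)
  for (j in seq_along(labels)) {
    cols <- (col_end[j] - r_j[j] + 1):col_end[j]
    Pi_mat[cluster == labels[j], cols] <- Pi_list[[j]]
  }
  Phi1 <- Pi_mat %*% H
  Xc   <- X - G1 %*% t(Phi1)                    # total complement X_1^c
  list(G1 = G1, Phi1 = Phi1, complement = Xc)
}
```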

Appendix B Proof of Theorem 1

The matrix form of (11) is:

\displaystyle\bm{u}_{m}=\bm{F}^{(d)}\bm{\beta}+\bm{\zeta}^{(d)}_{m}, (19)

where $\bm{u}_{m}=\left(u_{m1}^{(l)},u_{m2}^{(l)},\ldots,u_{mn}^{(l)}\right)^{\top}$ and $\bm{\beta}=(\beta_{1},\beta_{2},\ldots,\beta_{r_{d}})^{\top}$.
Hence, the least-squares estimate of the residual $\bm{\zeta}^{(d)}_{m}$ is:

\displaystyle\widetilde{\bm{\zeta}}^{(d)}_{m}=\bm{u}_{m}-\bm{F}^{(d)}\left(\bm{F}^{(d)\top}\bm{F}^{(d)}\right)^{-1}\bm{F}^{(d)\top}\bm{u}_{m}=:\bm{M}_{\bm{F}^{(d)}}\bm{u}_{m}. (20)

It should be noted that $\bm{u}_{m}$ and $\bm{F}^{(d)}$ are not observed. In our proposed clustering method, we estimate $\bm{u}_{m}$ via PCA, such that:

\displaystyle\widehat{\bm{u}}_{m}=\bm{M}_{\widehat{\bm{G}}}\bm{x}_{m}, (21)

where $\bm{M}_{\widehat{\bm{G}}}=\bm{I}_{n}-\widehat{\bm{G}}(\widehat{\bm{G}}^{\top}\widehat{\bm{G}})^{-1}\widehat{\bm{G}}^{\top}$, with $\widehat{\bm{G}}$ being an estimator of $\bm{G}$, as described in CPCA.
For $\bm{F}^{(d)}$, we also perform PCA on cluster $d$ and denote the estimator by $\widehat{\bm{F}}^{(d)}$. Based on (20) and (21), our estimator of $\bm{\zeta}^{(d)}_{m}$ can be written as:

\begin{aligned}
\widehat{\bm{\zeta}}^{(d)}_{m} &= \bm{M}_{\widehat{\bm{F}}^{(d)}}\widehat{\bm{u}}_{m}=\bm{M}_{\widehat{\bm{F}}^{(d)}}\bm{M}_{\widehat{\bm{G}}}\bm{x}_{m} \\
&= \bm{M}_{\widehat{\bm{F}}^{(d)}}\left(\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\right)\bm{x}_{m}+\bm{M}_{\widehat{\bm{F}}^{(d)}}\bm{M}_{\bm{G}}\bm{x}_{m} \\
&= \left(\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\right)\left(\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\right)\bm{x}_{m}+\bm{M}_{\bm{F}^{(d)}}\left(\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\right)\bm{x}_{m} \\
&\quad+\left(\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\right)\bm{M}_{\bm{G}}\bm{x}_{m}+\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{x}_{m}. \quad (22)
\end{aligned}

Writing $\|\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\|_{2}=O_{p}(\alpha_{np_{d}})$ and $\|\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\|_{2}=O_{p}(\gamma_{nJ})$, we obtain the result of the theorem.