Homogeneity and Sub-homogeneity Pursuit: Iterative Complement Clustering PCA

Daning Bi, Le Chang, Yanrong Yang
Research School of Finance, Actuarial Studies & Statistics
The Australian National University
Corresponding author, email: [email protected]

Abstract

Principal component analysis (PCA), the most popular dimension-reduction technique, has been used to analyze high-dimensional data in many areas. It discovers the homogeneity within the data and creates a reduced feature space to capture as much information as possible from the original data. However, in the presence of a group structure of the data, PCA often fails to identify the group-specific pattern, which is known as sub-homogeneity in this study. Group-specific information that is missed can result in an unsatisfactory representation of the data from a particular group. It is important to capture both homogeneity and sub-homogeneity in high-dimensional data analysis, but this poses a great challenge. In this study, we propose a novel iterative complement-clustering principal component analysis (CPCA) to iteratively estimate the homogeneity and sub-homogeneity. A principal component regression based clustering method is also introduced to provide reliable information about clusters. Theoretically, this study shows that our proposed clustering approach can correctly identify the cluster membership under certain conditions. The simulation study and real analysis of the stock return data confirm the superior performance of our proposed methods. Supplementary materials, including R codes, for the simulation and real data analysis are available online.

Keywords: Principal component analysis, principal component regression, high-dimensional data, homogeneity, sub-homogeneity, clustering

1 Introduction

Since its introduction, principal component analysis (PCA) (Jolliffe, 2002; Anderson, 2003) has become one of the most popular statistical tools for data analysis in a wide range of areas. Literature on PCA dates back to the early twentieth century (e.g., see Pearson (1901) and Hotelling (1933)). However, with an increasing dimension of data, PCA has now been reconsidered and widely discussed for the purpose of analyzing high-dimensional data. Jolliffe and Cadima (2016) and Fan et al. (2018) reviewed recent developments in PCA for statistical analysis on high-dimensional data, including its sparsity and robustness. However, are more data together really benefiting statistical analysis? Boivin and Ng (2006) provided a negative answer at the aspect of forecasting. This is due to an increased complexity when more data from different populations are grouped together, as a proportion of data can exhibit a different pattern compared with the rest. More specifically, increased complexity refers to heterogeneity when more data are collected from different populations.

In this study, we consider that information from a particular group of data collected from one population can be divided into two categories. One is shared with other groups and forms the homogeneity within the entire data, while the other is group-specific and is the main source of heterogeneity. This type of heterogeneity can be treated as sub-homogeneity, which refers to the homogeneity for a particular group of data.

However, sub-homogeneity may not be identified using traditional estimation methods such as PCA. One reason for this is that the sub-homogeneity for a particular group of data can be relatively small compared with the homogeneity within the entire data because it usually contributes less than the homogeneity to the total variance of all of the data. In such a situation, traditional PCA may regard the sub-homogeneity as negligible compared with homogeneity and ignore this group-specific pattern. Moreover, from the interpretation perspective, principal components that are produced using traditional PCA on all of the data do not target a specific group (e.g., information on which component corresponds to which group of data is not known). In previous studies, the discussion of PCA has mainly focused on the large eigenvalues, which correspond to the homogeneity in this study. For example, Johnstone (2001) studied the asymptotic distribution of the largest eigenvalue of PCA. Recently, Cai et al. (2017) extended the discussion to the asymptotic distribution of the spiked eigenvalues, while Morales-Jimenez et al. (2018) studied the asymptotics for the leading eigenvalues and eigenvectors of the correlation matrix in the class of spiked models. None of them have considered the existence of sub-homogeneity within the data.

To the best of our knowledge, the sub-homogeneity has not been well discussed, but it can be very important in high-dimensional data analysis. The following example of analyzing stock returns from different industries is used to explain the importance of finding sub-homogeneity. As stated in Fama and French (1997), stock returns from various industries can have varying performance over time, although the homogeneity (e.g., market return) can be deemed to be driven by some common economic variables. In a situation in which a vast number of individual stock returns are collected together for statistical analysis, traditional dimension-reduction techniques such as PCA may be able to capture the market effect but can fail to identify the sub-homogeneity within each industry (industry-specific pattern). This is because some industry-specific components may have much smaller variance than the market component and are highly likely to be omitted by PCA. However, these industry-specific components may be very important in capturing the movement of the stock returns within the industry. This loss of information may result in a very poor forecast of stock returns for some companies in which sub-homogeneity exists. In addition, although it would be interesting to study which industry has a larger industry-specific effect on stock returns, traditional PCA performed on the entire data does not allow us to draw such conclusions. Therefore, this study aims at identifying both homogeneity and sub-homogeneity in high-dimensional data analysis.

The sub-homogeneity in several parts of the data can be identified and estimated by dividing the whole data set into several groups. However, in most situations, the group structure is not known in advance. Therefore, a clustering method can be used to group the data at the first stage, and PCA can then be applied to identify the sub-homogeneity within each group. Similarly, Liu et al. (2002) performed PCA in different blocks of variables, while Tan et al. (2015) penalized the variables in each group differently when using the graphical lasso. Both studies used hierarchical clustering to group the variables to take into account the heterogeneity. However, when homogeneity exists across different groups, the sub-homogeneity in each group cannot be successfully identified. This is mainly because the homogeneity shared by most groups can dominate the sub-homogeneity so that the data from different groups all seem to be highly correlated. In this situation, the sub-homogeneity is masked by the homogeneity, and these clustering methods tend to group all individuals (variables) into one cluster. Therefore, the homogeneity must be correctly identified and removed before we can successfully group the individuals and discover the sub-homogeneity of each group.

On the other hand, the estimation of homogeneity can often be problematic in a high-dimensional setting. This is because of the inconsistency of PCA estimates when the number of variables $p$ is comparable with or greater than the sample size $n$ , as discussed in Johnstone and Lu (2009). Motivated by the group structure of the data, we suggest first clustering the individuals into groups and then performing a traditional approach such as PCA on the level of groups, followed by a second layer of PCA that extracts common information shared by each group. This approach can improve the estimation accuracy of the homogeneity because the components are now extracted from groups in which the dimension has already been reduced (to the number of variables in a group), while the traditional dimension-reduction method (e.g., PCA) is performed on the full dimension $p$ . This can be viewed as an effective way to alleviate the potential problem caused by the curse of dimensionality. Other studies have modified the traditional approaches to deal with the curse of dimensionality. For example, Johnstone and Lu (2009) suggest that some initial dimension-reduction work is necessary before using PCA, as long as a sparse basis exists. Further, Zou et al. (2006) introduced sparse PCA, which uses the lasso (elastic net) to modify the principal components so that sparse loadings can be achieved. In addition, a review of some sparse versions of PCA can be found in Tibshirani et al. (2015), while some general discussions about the blessing and curse of dimensionality can be found in Donoho et al. (2000) and Fan et al. (2014).

To conclude, homogeneity must be removed to correctly identify the cluster structure and sub-homogeneity, but the cluster structure of the data is used to accurately find the homogeneity. Therefore, we introduce a novel “iterative complement-clustering principal component analysis” (CPCA) to iteratively estimate the homogeneity and sub-homogeneity. Details of CPCA are provided in Section 4.

The contributions of this study can be summarized as follows. First, we propose CPCA to identify homogeneity and sub-homogeneity and handle the interaction between them when the whole data set exhibits a group structure. Second, our proposed estimation method not only correctly captures the sub-homogeneity, but also provides very reliable cluster information (e.g., which part of the data is from the same group), which can be useful in understanding and explaining the data structure. Third, inspired by Chiou and Li (2007), we develop a leave-one-out principal component regression (PCR) clustering method that can outperform the hierarchical clustering method used in previous studies. In addition, we theoretically illustrate that if the sub-homogeneity from different clusters is distinct, our proposed clustering procedure can effectively separate the variables from different clusters. This is also numerically confirmed by the simulation study and real data analysis. Details of the proposed clustering method are provided in Section 4.

The rest of this study is organized as follows. A low-rank representation of the data that captures both homogeneity and sub-homogeneity is introduced in Section 2, and some related PCA methods are discussed in Section 3. In Section 4, a novel estimation method called CPCA is proposed, followed by a discussion of more details of the algorithm. Section 5 demonstrates and explains the effectiveness of the proposed clustering method. Extensive simulations, along with two applications (PCR and covariance estimation) of our proposed method are provided in Sections 6 and 7. Section 8 analyzes a stock return dataset using our method. Lastly, Section 9 concludes the study.

2 Homogeneity and Sub-homogeneity

Considering the data $\bm{x}_{i}=(x_{1i},\ldots,x_{pi})^{\top}\in\mathbb{R}^{p}$ from $i^{th}$ observation with $p$ variables, the singular value decomposition (SVD) of the data $\bm{x}_{i}$ can be written as:

	$\displaystyle\bm{x}_{i}=$	$\displaystyle\sum_{k=1}^{K}{g}_{ik}\bm{\phi}_{k}+\bm{u}_{i},$
	$\displaystyle\text{where }\bm{u}_{i}=$	$\displaystyle\sum_{k=K+1}^{\min{(n,p)}}{g}_{ik}\bm{\phi}_{k}\quad i=1,\ldots,n,$		(1)

where ${g}_{ik}$ is defined as $k^{th}$ principal component score and $\bm{\phi}_{k}$ denotes a $p\times{1}$ eigenvector for ${g}_{ik}$ . Traditional PCA summarizes the data using the first $k$ principal components, and it treats $\bm{u}_{i}$ as noise because ${g}_{ik}$ for $k=m+1,\ldots,\min{(n,p)}$ has smaller variance. However, under certain conditions, some of the information contained in $\bm{u}_{i}$ can be useful in prediction or forecasting problems, particularly for one or more specific groups of data. Therefore, when the data exhibit a group structure, we propose the following low-rank representation for the data to capture both homogeneity and sub-homogeneity:

	$\displaystyle\bm{x}_{i}=$	$\displaystyle\sum_{k=1}^{r_{c}}{g}_{ik}\bm{\phi}_{k}+\bm{u}_{i},$
	$\displaystyle\text{where }\bm{u}_{i}=\sum_{j=1}^{J}\sum_{h=1}^{r_{j}}{f}^{(j)}_{ih}\bm{\gamma}^{(j)}_{h}$	$\displaystyle+\sum_{j=1}^{J}\bm{I}^{(j)}\bm{\epsilon}^{(j)}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J.$		(2)

The first line in (2) measures the homogeneity among all variables from all groups, where ${g}_{ik},k=1,\ldots,r_{c}$ is $k^{th}$ principal component, which we call the common component, and $\bm{\phi}_{k}$ is its corresponding eigenvector. The first part of $\bm{u}_{i}$ in (2) consists of $J$ cluster-specific components from which the sub-homogeneity of the data is derived. Assuming $p$ variables can be split into $J$ clusters, ${f}^{(j)}_{ih},h=1,\ldots,r_{j}$ measures the within-cluster principal components for cluster $j$ . The within-cluster eigenvector with dimension $p\times 1$ has the form of $\bm{\gamma}^{(j)}_{h}=(\bm{0}^{(1)\top},\ldots,\bm{\eta}^{(j)\top}_{h},\ldots,\bm{0}^{(J)\top})^{\top}$ , where $\bm{\eta}^{(j)}_{h}$ defines a $p_{j}\times 1$ vector and $p_{j}$ is the number of variables in cluster $j$ so that $\sum_{j=1}^{J}p_{j}=p$ . That is, the values of $\bm{\gamma}^{(j)}_{h}$ for variables that do not belong to cluster $j$ are zero. This implies that after removing the effect of the common principal components, variables from different clusters are uncorrelated (e.g., $\bm{u}_{i}$ has a block-diagonal covariance structure). In the second part of $\bm{u}_{i}$ , $\bm{\epsilon}^{(j)}_{i}$ is simply a $p_{j}$ -dimensional error that has variance ${\sigma^{(j)}}^{2}$ , and $\bm{I}^{(j)}$ is a diagonal matrix, but the diagonals for variables that do not belong to cluster $j$ are zero.

We further define $\bm{g}_{i}=({g}_{i1},\ldots,{g}_{ir_{c}})^{\top}$ and $\bm{f}^{(j)}_{i}=({f}^{(j)}_{i1},\ldots,{f}^{(j)}_{ir_{j}})^{\top},j=1,\ldots,J$ as the vector forms of the common components and cluster-specific components, and the eigenvector matrices as $\bm{\Phi}_{(p\times r_{c})}=(\bm{\phi}_{1},\ldots,\bm{\phi}_{r_{c}})$ and $\bm{\Gamma}^{(j)}_{(p\times r_{j})}=(\bm{\gamma}^{(j)}_{1},\ldots,\bm{\gamma}^{(j)}_{r_{j}}),j=1,\ldots,J$ , respectively. Without loss of generality, we assume $E(\bm{g}_{i})=E(\bm{f}^{(j)}_{i})=E(\bm{\epsilon}^{(j)}_{i})=\bm{0}$ , and $\bm{g}_{i},\bm{f}^{(j)}_{i},\bm{\epsilon}^{(j)}_{i},j=1,\ldots,J$ are mutually uncorrelated. In addition, we also impose the following canonical condition for the identification of both the homogeneity and sub-homogeneity,

	$\displaystyle\bm{\Phi}^{\top}\bm{\Phi}=I_{r_{c}},$	$\displaystyle\quad\bm{\Sigma}_{g}\coloneqq\text{cov}(\bm{g}_{i})\text{ is diagonal},$
	$\displaystyle{\bm{\Gamma}^{(j)\top}}\bm{\Gamma}^{(j)}=I_{r_{j}},$	$\displaystyle\quad\bm{\Sigma}f^{(j)}\coloneqq\text{cov}(\bm{f}^{(j)}_{i})\text{ is diagonal},\quad j=1,\ldots,J.$		(3)

Under the data structure given in (2), the population covariance matrix is given by a low-rank plus block-diagonal representation

\displaystyle\bm{\Sigma}=\bm{\Phi}\bm{\Sigma}_{g}\bm{\Phi}^{\top}+\sum_{j=1}^{J}\bm{\Gamma}^{(j)}\bm{\Sigma}_{f}^{(j)}{\bm{\Gamma}^{(j)\top}}+\sum_{j=1}^{J}\bm{I}^{(j)}{\sigma^{(j)2}},\quad j=1,\ldots,J.

(4)

If we denote the data $\bm{X}_{(n\times p)}=(\bm{x}_{1},\ldots,\bm{x}_{n})^{\top}$ , the common components and cluster-specific components in a matrix form $\bm{G}_{(n\times{r_{c}})}=(\bm{g}_{1},\ldots,\bm{g}_{n})^{\top}$ and $\bm{F}^{(j)}_{(n\times{r_{j}})}=(\bm{f}^{(j)}_{1},\ldots,\bm{f}^{(j)}_{n})^{\top}$ , respectively, data representation (2) can also be presented in the following matrix form

	$\displaystyle\bm{X}=$	$\displaystyle\bm{G}{\bm{\Phi}}^{\top}+\bm{U},$
	$\displaystyle\text{where }\bm{U}=$	$\displaystyle\sum_{j=1}^{J}\bm{F}^{(j)}{\bm{\Gamma}^{(j)\top}}+\bm{E},\quad j=1,\ldots,J.$		(5)

In general, there are two goals in using representation (2). In the presence of an unknown cluster structure, the first goal is to cluster the variables (i.e., determine which variables are in the same group). From the interpretation perspective, it is interesting to know how variables are clustered and which variables belong to the same cluster. The second goal is to correctly estimate both the common components $\bm{g}_{i}$ and the cluster-specific components $\bm{f}^{(j)}_{i}$ . These components serve as a low-rank representation of the data and can therefore be used for further applications. One obvious application is to estimate the covariance matrix $\bm{\Sigma}$ and the precision matrix $\bm{\Sigma}^{-1}$ of $\bm{X}$ . The use of both $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ captures both the low rank and the block-diagonal representation of $\bm{\Sigma}$ , which results in a more efficient estimation compared with using $\bm{g}_{i}$ only. Another important application is PCR. In some situations, it is a cluster-specific component $\bm{f}^{(j)}_{i}$ that contributes most to determining the response in PCR rather than the common component $\bm{g}_{i}$ . It is important to identify each cluster-specific component $\bm{f}^{(j)}_{i}$ to explain which cluster of variables has a greater effect in predicting the response. We will discuss more about these two applications using simulated data in Section 7.

3 Relationship with Existing PCA Methods

Our proposed data representation (2) should be used to perform dimension reduction of the data with a cluster structure because it captures both the common effect (homogeneity) and the cluster-specific effect (sub-homogeneity). Consequently, this method can be viewed as an extension of many other widely used dimension-reduction methods. Three of these methods are discussed below.

•

Case 1: When there is no cluster-specific effect, for example, $\bm{f}^{(j)}_{i}=\bm{0},j=1,\ldots,J$ and ${\sigma^{(j)2}}\equiv{\sigma}^{2}$ , representation (2) simply reduces to the well-known PCA

$\displaystyle\bm{x}_{i}=\sum_{k=1}^{r_{c}}{g}_{ik}\bm{\phi}_{k}+\bm{\epsilon}_{i},\quad i=1,\ldots,n,$ (6)

where $\bm{g}_{ik}$ can be found by the principal component with $k^{th}$ largest eigenvalue and $\bm{\phi}_{k}$ is the corresponding eigenvector. However, in the presence of a small cluster-specific effect, traditional PCA generally only identifies the large common effect and treats the rest as noise, disregarding the sub-homogeneity within the data.

•

Case 2: When the common components do not exist (e.g., $\bm{g}_{i}=\bm{0}$ ), representation (2) demonstrates a well-studied block-diagonal covariance structure of the data, as discussed in Liu et al. (2002) and Tan et al. (2015). Thus, representation (2) can be reduced to

\displaystyle\bm{x}_{i}=\sum_{j=1}^{J}\sum_{h=1}^{r_{j}}{f}^{(j)}_{ih}\bm{\gamma}^{(j)}_{h}+\sum_{j=1}^{J}\bm{I}^{(j)}\bm{\epsilon}^{(j)}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J.

(7)

In this case, hierarchical clustering is usually used to group the variables with high correlations, and PCA is then performed on each group of variables to estimate the cluster-specific components $\bm{f}^{(j)}_{i}$ . In many situations, this cluster structure is masked by a dominant common effect. Ignoring this common effect will result in a non-identifiable cluster structure.

•

Case 3: Fan et al. (2013) and Li et al. (2018) consider the following low-rank representation, in which the error covariance matrix is assumed to be cross-sectional dependent after the common components have been taken out and

$\displaystyle\bm{x}_{i}=\sum_{k=1}^{r_{c}}{g}_{ik}\bm{\phi}_{k}+\bm{u}_{i},\quad i=1,\ldots,n,$ (8)

where they assume that the covariance of $\bm{u}_{i}$ , $\Sigma_{u}$ , is sparse and propose a method called Principal Orthogonal complEment Thresholding (POET) to explore such a high-dimensional structure with sparsity. Li et al. (2018) used weighted PCA to find a more efficient estimator of common components $\bm{g}_{i}$ . Hong et al. (2018) and Deville and Malinvaud (1983) also applied weighted PCA to estimate the covariance matrix when heteroscedastic noise of samples or variables exists. However, in this study, we consider $\bm{u}_{i}$ following a cluster structure with a block-diagonal covariance and aim to find its low-rank presentation, which we call sub-homogeneity. In addition, we propose iteratively estimating the common components and cluster-specific components.

4 Estimation Methods

Correctly estimating $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ poses many challenges. First, PCA, which is widely used to estimate the common component $\bm{g}_{i}$ , often performs very poorly when $p$ is much larger than $n$ . The group structure of the variables motivates us to separate the data according to the clusters and then extract the common information from each cluster to determine the common components. This is less influenced by the curse of dimensionality because each cluster of the data results in a lower dimension of variable $p_{j}$ . However, to accurately identify the clusters, we must remove the common effect and then perform the clustering method based on the complement $\left(\bm{x}_{i}-\sum_{k=1}^{r_{c}}{g}_{ik}\bm{\phi}_{k}\right)$ , but the common components and its eigenvectors are not known in advance. This inspires us to propose a novel iterative method, CPCA, to cluster the variables and estimate the components simultaneously. The details of the method are described in Algorithm 1 and a flowchart that summarizes our estimation method is presented in Figure 1.

Algorithm 1 (CPCA)

1.
Initial Step:
1. (a)
  
  Perform PCA directly on the entire data $\bm{X}$ and select the number of components using a ratio based estimator defined by Lam and Yao (2012). A brief discussion on selecting the number of components is in Remark 1. The resulting eigenvectors $\bm{\Phi}_{0}$ and principal components $\bm{G}_{0}$ are served as the initial estimates of $\bm{\Phi}$ and $\bm{G}$ . Then, find the initial complement $\bm{X}^{c}_{0}=\bm{X}-\bm{G}_{0}\bm{\Phi}^{\top}_{0}$
2. (b)
  
  Perform hierarchical clustering for $\bm{X}^{c}_{0}$ based on a similarity matrix given by the absolute value of the empirical correlation matrix $\bm{X}^{c}_{0}$ . The obtained clusters $C^{(j)}_{0},j=1,\ldots,J_{0}$ are served as the initial clusters.
2.
Iterative Step: for $s=1,2,\ldots$
1. (a)
  
  Cluster the variables into $J_{s-1}$ groups according to $C^{(j)}_{s-1}$ and define variables from $j^{th}$ cluster as $\bm{X}^{(j)}_{s}$ . Perform PCA on each cluster of variables $\bm{X}^{(j)}_{s}$ and denote the obtained principal components as $\bm{\Psi}^{(j)}_{s},j=1,\ldots,J_{s-1}$ . Combine these principal components as $\bm{\Psi}_{s}=(\bm{\Psi}^{(1)}_{s},\ldots,\bm{\Psi}^{(J_{0})}_{s})$ and perform a further step of PCA on $\bm{\Psi}_{s}$ . Define the principal components as $\bm{G}_{s}$ and their corresponding eigenvectors as $\bm{\Phi}_{s}$ . Then, compute the updated complement $\bm{X}^{c}_{s}=\bm{X}-\bm{G}_{s}\bm{\Phi}^{\top}_{s}$ . Details for finding the eigenvectors $\bm{\Phi}_{1}$ can be found in Appendix A.
2. (b)
  Perform leave-one-out clustering for variables in $\bm{X}^{c}_{s}$ using PCR (more details of this clustering method are discussed in Remark 3):
  1. (i)
    
    For $k,k=1,\ldots,p$ , leave $k^{th}$ variable out of $\bm{X}^{c}_{s}$ .
  2. (ii)
    
    Group the rest of the variables of $\bm{X}^{c}_{s}$ based on $C^{(j)}_{s-1}$ , perform PCA on each cluster again and denote the obtained components as $\bm{F}_{s-1}^{(j)},j=1,\ldots,J_{s-1}$ .
  3. (iii)
    
    Fit $J_{s-1}$ PCRs by using $k^{th}$ variable of $\bm{X}^{c}_{s}$ as the response and $\bm{F}_{s-1}^{(j)}$ as the predictor for each $j=1,\ldots,J_{s-1}$ , respectively.
  4. (iv)
    
    Compute the sum of squared residuals (SSR) for each $J_{s-1}$ PCR model. Assign $k^{th}$ variable to the cluster with the minimum SSR. Update the cluster index for $k^{th}$ variable in $C^{(j)}_{s-1}$ .
  5. (v)
    
    Repeat (i) $\sim$ (iv) for each $k$ and denote the updated clusters as $C^{(j)}_{s}$ .
3. (c)
  
  Repeat (a) and (b) within this step until the clusters converge, and define this final converged cluster as $C^{(j)}_{f}$ .
3.
Final Step:
1. (a)
  
  Repeat (a) in the Iterative Step, but using cluster $C^{(j)}_{f}$ . The final complement is denoted by $\bm{X}^{c}_{f}=\bm{X}-\bm{G}_{f}\bm{\Phi}^{\top}_{f}$ .
2. (b)
  
  Cluster variables of $\bm{X}^{c}_{f}$ based on $C^{(j)}_{f}$ , perform PCA on each cluster $\bm{X}^{c{(j)}}_{f}$ and denote the obtained cluster-specific components as $\bm{F}_{f}^{(j)},j=1,\ldots,J_{f}$ .

Therefore, CPCA produces these required outputs: the final clusters of $p$ variables $C^{(j)}_{f},j=1,\ldots,J_{f}$ , the final estimate of the common components $\bm{G}_{f}$ and the cluster-specific components $\bm{F}_{f}^{(j)},j=1,\ldots,J_{f}$ , along with their eigenvectors. Some details and discussions of CPCA are provided in the following remarks.

Refer to caption — Figure 1: Flowchart of Algorithm 1.

Remark 1

[Selecting the Number of Components]:
In this work, we use a ratio based estimator (Lam and Yao, 2012) to select the number of common components and estimate the homogeneity. The ratio based estimator of the number of principal components $r_{c}$ is defined as

\displaystyle\widehat{r}_{c}=\operatorname*{argmin}_{1\leq i\leq R}\frac{\widehat{\lambda}_{i}}{\widehat{\lambda}_{i+1}},

(9)

where $\widehat{\lambda}_{1}\geq\cdots\geq\widehat{\lambda}_{p}$ are eigenvalues of the sample covariance matrix of data $\bm{X}$ and $r\leq R\leq p$ is a constant. In practice, many researchers may prefer to select the number of principal components by either setting a threshold on the proportion of total variance explained, which may accidentally include some errors into the principal components, or choosing the numbers where the largest drop of eigenvalues exists (i.e. based on the scree plot). Nonetheless, when the number of common components is greater than 1 (e.g., $r_{c}>1$ ), the discrepancy of the eigenvalues (from the sample covariance) among the common components can be accidentally greater than the one between the common components and the cluster-specific component, especially when some groups have a relatively larger cluster-specific effect (e.g., variables in one group are on a large scale or highly correlated). In this situation, selecting the number of components using the largest drop in eigenvalues in Initial Step (a) and Iterative Step (a) may lead to underestimating the number of common components, while the ratio-based estimator $\widehat{r}_{c}$ is less affected by the scales of variables. In addition, to consider a broader class of sub-homogeneities, we also allow for sub-homogeneities within a specific cluster to have different order of variances. To correctly estimate the number of principal components that may have different orders of variances in the same group, in Iterative Step 2 (a) and (b), we use the iterative estimation procedure purposed by Peng et al. (2020), which is a generalization of the original ratio-based estimator. The exact procedure is lengthy and is not the focus of the present work, hence is omitted here. More details of the iterative estimation procedure of the ratio-based estimator can be found in Section 2 of Peng et al. (2020).

Remark 2

[Iterative Step]:
One of the key contributions of our algorithm is iteratively estimating the common components and cluster-specific components. Directly applying PCA on the entire data $\bm{X}$ generally leads to a very poor estimate of the common components because a large $p$ may blur the spike structure of the sample covariance matrix. Motivated by the group structure of the data, we utilize Iterative Step (a) to first cluster variables and then perform PCA on the level of clusters, followed by a second layer of PCA that extracts common information shared by each group. Iterative Step (b) is then implemented to update and improve the clusters information. This is more effective in estimating the common components because PCA is performed in a smaller $p$ case, which is also numerically shown in Figure 2. The data $\bm{X}$ used here are generated by simulation Example 2. Figure 2 (a) presents the correlation plot for the original data $\bm{X}$ . We observe that the cluster structure is masked by the common effect. Figure 2 (b) demonstrates that after the common effects estimated in the Initial Step are removed, it is still difficult to perceive the cluster structure. However, if we remove the common effects estimated in the Final Step, the block-diagonal structure is clear, implying a prominent heterogeneity within the data. Therefore, iteratively estimating the common components is advantageous.

Remark 3

[Initial Step (b), Iterative Step (b) and (c)]:
Initial Step (b), an ordinary hierarchical clustering method using average-linkage as dissimilarity between clusters is performed. As discussed in Bühlmann et al. (2013), we choose the number of clusters using the largest increment in height between iterations proceeded in a agglomerative way, such that $J_{0}=\operatorname*{argmax}_{j}(h_{j+1}-h_{j})$ .

In Iterative Step (b), we propose a new clustering method called “leave-one-out PCR clustering” (LOO-PCR clustering), which is another key contribution of CPCA. The underlying idea of this new clustering method is that if one variable belongs to a cluster, it should be well-predicted by the principal components extracted from that cluster. This is more aligned with our model set-up than hierarchical clustering based on correlations. A similar idea of leaving one out is used for functional clustering, as discussed in (Chiou and Li, 2007). In addition, hierarchical clustering can perform poorly when the dimension of variables $p$ is larger than $n$ . In step (iv), we assign $k^{th}$ variable to the cluster that achieves the minimum SSR. However, when the minimum SSR is larger than a threshold $\tau$ , we treat the $k^{th}$ variable itself as a cluster, because in such a situation, this variable cannot be well-predicted by any clusters of variables. Hence, the number of clusters is driven by the data and can vary in each iteration. This suggests that our proposed method is more flexible than hierarchical clustering. We set $\tau=0.95$ , and Example 4 in the simulation studies demonstrates that our proposed LOO-PCR clustering, served as a clustering method itself, outperforms hierarchical clustering in a large $p$ small $n$ case. Identifiability of LOO-PCR clustering is also demonstrated theoretically in Section 5.

In Iterative Step (c), we stop the iteration when the clusters $C^{(j)}_{s}$ converge. In our algorithm, we adopt the adjusted rand index (ARI) (Rand, 1971; Hubert and Arabie, 1985) between the clusters $C^{(j)}_{s}$ and the one in the previous iteration $C^{(j)}_{s-1}$ as the stopping criterion. The ARI is a corrected version of the rand index that measures the similarity between two clusters of the same data using the proportions of agreements between these two partitions. The correction is achieved by subtracting the rand index by its expected value. The ARI can yield a maximum value of 1, and a high value implies high similarity. When the ARI is above a certain threshold $\eta$ , we stop the iteration. In this study, we use $\eta=0.95$ .

5 Identifiability of LOO-PCR Clustering Approach

Achieving an accurate sub-homogeneity pursuit relies on the effectiveness of the clustering procedure. In this section, we investigate the proposed LOO-PCR clustering method theoretically. More specifically, we examine the identifiability of cluster membership in our proposed method.

Consider one random variable $X_{mi}$ that belongs to cluster $l$ . Based on the variable’s structure in cluster $l$ , there exists $m\in\{1,2,\ldots,p\}$ , such that:

\displaystyle X_{mi}=\sum^{r_{c}}_{k=1}g_{ik}\phi_{mk}+u_{mi}^{(l)},\ \ u_{mi}^{(l)}=\sum^{r_{l}}_{h=1}f_{ih}^{(l)}\gamma_{mh}^{(l)}+\epsilon_{mi}^{(l)}.

(10)

We now consider another cluster $d$ with components $g_{ik},k=1,\ldots,r_{c}$ and $f_{ih}^{(d)},h=1,\ldots,r_{d}$ . When applying our proposed LOO-PCR clustering method, we regress $u_{mi}^{(l)}$ on $f_{ih}^{(d)},h=1,2,\ldots,r_{d}$ and measure its goodness of fit to determine whether the random variable $X_{mi}$ belongs to cluster $d$ , such that:

\displaystyle u_{mi}^{(l)}=\sum^{r_{d}}_{h=1}\beta_{h}f_{ih}^{(d)}+\zeta_{mi}^{(d)},

(11)

where $\beta_{1},\ldots,\beta_{r_{d}}$ are coefficients and $\zeta_{mi}^{(d)}$ is the error. Naturally, we expect that features (e.g., principal components) from cluster $d$ cannot explain $u_{mi}^{(l)}$ sufficiently. Based on (10) and (11), intuitively, a large discrepancy between principal components $f_{ih}^{(d)}$ from cluster $d$ and principal components $f_{ih}^{(l)}$ from cluster $l$ will result in a large residual $\zeta_{mi}^{(d)}$ in (11).

The following theorem will show the property of $\widehat{\zeta}_{mi}^{(d)}$ , which is an estimator of the error $\zeta_{mi}^{(d)}$ from our proposed clustering approach.

Theorem 1

For any cluster $d$ and any random variable $X_{mi}$ from another cluster $l$ , we have the following evaluation:

\displaystyle\|\widehat{\bm{\zeta}}_{m}^{(d)}\|_{2}=\|\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{x}_{m}\|_{2}+O_{p}\left(\max(\alpha_{np_{d}},\gamma_{nJ})\right)\|\bm{x}_{m}\|_{2},

(12)

where $\|\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\|_{2}:=O_{p}(\alpha_{np_{d}})$ , $\|\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\|_{2}:=O_{p}(\gamma_{nJ})$ , $\widehat{\bm{\zeta}}_{m}^{(d)}=\left(\widehat{\zeta}_{m1}^{(d)},\widehat{\zeta}_{m2}^{(d)},\ldots,\widehat{\zeta}_{mn}^{(d)}\right)^{\top}$ , $\bm{x}_{m}=(X_{m1},X_{m2},\ldots,X_{mn})^{\top}$ , $\bm{M}_{\bm{F}^{(d)}}=\bm{I}_{n}-\bm{F}^{(d)}\left(\bm{F}^{(d)\top}\bm{F}^{(d)}\right)^{-1}\bm{F}^{(d)\top}$ and $\bm{M}_{\bm{G}}=\bm{I}_{n}-\bm{G}(\bm{G}^{\top}\bm{G})^{-1}\bm{G}^{\top}$ ; here, $\bm{F}^{(d)}$ is an $n\times r_{d}$ matrix with $(i,h)$ -th element being $f_{ih}^{(d)}$ , and $\bm{G}$ is an $n\times r_{c}$ matrix with $(i,k)$ -th element being $g_{ik}$ . Here, $\bm{M}_{\widehat{\bm{F}}^{(d)}}$ and $\bm{M}_{\widehat{G}}$ are $\bm{M}_{\bm{F}^{(d)}}$ and $\bm{M}_{\bm{G}}$ , but with $\bm{F}^{(d)}$ and $\bm{G}$ replaced by $\widehat{\bm{F}}^{(d)}$ and $\widehat{\bm{G}}$ , respectively.

A brief proof of Theorem 1 is provided in Appendix B.

Remark 4

It is expected that $\widehat{\bm{\zeta}}^{(d)}_{m}$ will be large when cluster $d$ and cluster $l$ are different. When $l=d$ , the first term on the right-hand side of (12) is equal to $\|\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}{\bm{\zeta}}_{m}^{(d)}\|_{2}$ . In contrast, when $l\neq d$ (e.g., $\|{\bm{F}^{(l)}}\|_{2}$ and $\|{\bm{F}^{(d)}}\|_{2}$ are distinct), the first term on the right-hand side of (12) is equal to $\|\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}{\bm{F}^{(l)}}{\bm{\gamma}^{(l)}}+\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}{\bm{\zeta}}_{m}^{(l)}\|_{2}$ , which is dominated by $\|{\bm{F}^{(l)}}\|_{2}$ . Meanwhile, the second term on the right-hand side of (12) is determined by the estimation accuracy of $\bm{F}^{(d)}$ and $\bm{G}$ . It is related to the dimension of cluster $d$ (e.g., $p_{d}$ ), the sample size $n$ and the total number of clusters $J$ . Bai and Ng (2002) provided the rate of convergence for the projection matrices of principal components. Fan et al. (2013) and Fan et al. (2018) also studied the properties of principal components for high-dimensional data. In this study, we do not pursue the exact expression of $\alpha_{np_{d}}$ and $\gamma_{nJ}$ . However, given that the homogeneity $\bm{M}_{\bm{G}}$ and sub-homogeneity $\bm{F}^{(d)}$ can be estimated accurately, the first term on the right-hand side of (12) is expected to dominate $\|\widehat{\bm{\zeta}}_{m}^{(d)}\|_{2}$ . This implies that two different clusters $d$ and $l$ are identifiable, given that $\|{\bm{F}^{(l)}}\|_{2}$ has a higher order than $\|{\bm{\zeta}}_{m}^{(d)}\|_{2}$ . As a general discussion, Theorem 1 shows that when the sub-homogeneity from the different clusters is distinct, our clustering procedure can effectively separate the variables from the different clusters.

6 Simulation Studies

In this section, we conduct various simulation studies to investigate the performance of our proposed CPCA compared with traditional PCA under different simulation settings.

6.1 Simulation Settings

We generate the data from representation (2):

\displaystyle\bm{x}_{i}=\sum_{k=1}^{r_{c}}{g}_{ik}\bm{\phi}_{k}+\sum_{j=1}^{J}\sum_{h=1}^{r_{j}}{f}^{(j)}_{ih}\bm{\gamma}^{(j)}_{h}+\sum_{j=1}^{J}\bm{I}^{(j)}\bm{\epsilon}^{(j)}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J.

First, we generate $\bm{\Phi}=(\bm{\phi}_{1},\ldots,\bm{\phi}_{r_{c}})$ as a $p\times r_{c}$ orthonormal matrix. Second, a $p_{j}\times r_{j}$ orthonormal matrix $\bm{\Psi}^{(j)}=(\bm{\psi}^{(j)}_{1},\ldots,\bm{\psi}^{(j)}_{r_{j}})$ is generated randomly and independently across different clusters for $j=1,\ldots,J$ and let $\bm{\Gamma}^{(j)}_{h}=(\bm{0}^{(1)},\ldots,\bm{\Psi}^{(j)}_{h},\ldots,\bm{0}^{(J)})$ . We then generate $g_{ik},i=1,\ldots,n,k=1,\ldots,r_{c}$ from $\sqrt{\delta_{k}}W_{ik}$ , where $W_{ik}$ are i.i.d. standard normal random variables and the eigenvalues $\delta_{h}$ are defined as $\delta_{1}>\ldots>\delta_{r_{c}}>0$ . Similarly, we generate $f^{(j)}_{ih},i=1,\ldots,n,h=1,\ldots,r_{j},j=1,\ldots,J$ from $\sqrt{\lambda^{(j)}_{h}}Z^{(j)}_{ih}$ , where $Z^{(j)}_{hi}$ are again i.i.d. standard normal random variables and the eigenvalues $\lambda^{(j)}_{h}$ are defined as $\lambda^{(j)}_{1}>\ldots>\lambda^{(j)}_{r_{j}}>0,j=1,\ldots,J$ . Lastly, $\bm{\epsilon}^{(j)}_{i}$ is a $p_{j}\times 1$ vector consisting of $p_{j}$ i.i.d. normal random variables with mean $0$ and variance ${\sigma^{(j)}}^{2}$ . Data from each cluster are generated according to the above setting and then combined to obtain $\bm{X}$ .

In our simulation study, we consider four different settings, which are outlined below.

•

Example 1 ( $n=50,\ p=100$ ). In this example, we consider the number of common components $r_{c}=3$ and generate $\delta_{k}$ , $k=1,\ldots,r_{c}$ , from $\mathcal{N}(125,5)$ . We further simulate $W_{ik}$ , $k=1,2,\ldots,r_{c}$ , $i=1,2,\ldots,2n$ , from a standard normal distribution independently so that $\bm{g}_{i}$ can be constructed.

In terms of the cluster-specific components, we set $J=5$ , and for each of the five clusters, we consider $r_{j}\equiv 2$ and generate $\lambda_{h}^{(j)}\sim\mathcal{N}(\theta_{j,h},1)$ , where $\theta_{j,h}$ represents the mean level of each component’s variance in each cluster. In specific, we set $\theta_{1,1}=\theta_{1,2}=25$ for the first cluster, $\theta_{5,1}=\theta_{5,2}=5$ for the last cluster, while $\theta_{j,1}=25$ and $\theta_{j,2}=5$ for $j=2,3,4$ , i.e., the rest three clusters. That is, each of the five clusters has the same number of cluster-specific components, while the components from different clusters are allowed to have different levels of variances. Among those five clusters, the components in the first cluster have the largest variances, those in the last cluster have the smallest variances, while the two components in the rest clusters are designed to have different levels of variances. Then, by setting $p_{j}\equiv 20$ for each cluster, we end up with a total number of variables $p=100$ . According to (2), $Z^{(j)}_{ih}$ for $h=1,2$ and $\bm{\epsilon}^{(j)}_{i}$ are simulated independently for each cluster $j$ , with $\sigma^{(j)}=0.5$ for $j=1,2,...,5$ so that $2n$ observations are constructed based on (2). Then, the first $n$ observations are served as the training sample $\bm{X}_{train}=(\bm{x}_{1},\ldots,\bm{x}_{n})$ and the last $n$ observations are designated as the testing sample $\bm{X}_{test}=(\bm{x}_{n+1},\ldots,\bm{x}_{2n})$ . In this example, $n=50$ .
•

Example 2 ( $n=30,\ p=100$ ). This example is identical to Example 1, except that the sample size is decreased from 50 to 30. We will show that our CPCA is more reliable than other competing methods when the sample size is relatively small.
•

Example 3 ( $n=50,\ p=200$ ). In this example, we consider data consisting more clusters, that is, we consider $J=10$ clusters compared with five in Example 1, leading to a number of variables $p=200$ given $p_{j}\equiv 20$ . We maintain $r_{j}\equiv 2$ but double the number of clusters with exactly the same setting of $\lambda_{h}^{(j)}\sim\mathcal{N}(\theta_{j,h},1)$ . In other words, components in the first two clusters have the largest variances, while those in the last two clusters have the smallest variances. And similarly, the rest six clusters have two levels of variances on their components as described in Example 1. Accordingly, $\bm{\epsilon}^{(j)}_{i}$ is simulated independently for each of the 10 clusters, with $\sigma^{(j)}=0.5$ for $j=1,\ldots,10$ . The settings for the common components remain the same as those in Example 1.
•

Example 4 ( $n=30,\ p=100$ ). This example is identical to Example 2, but without the common effect, which is a special case of our proposed data representation as in (7). Correspondingly, the response is generated as:

$\displaystyle{y}_{i}=\sum_{j=1}^{J}{\bm{f}_{i}^{(j)\top}}\bm{\beta}_{j}+{e}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J.$ (13)

In this example, we do not estimate the common effect, and the clustering method is directly applied to the data. Therefore, the purpose of considering this example is to demonstrate that our proposed LOO-PCR clustering (Iterative Step (b) and (c) in CPCA), served as a clustering method itself, can outperform the hierarchical clustering.

For all four examples, we repeat the simulation procedure 100 times and investigate the clustering and recovering accuracy of the following methods:

•

PCA: The classical PCA method is performed directly on the whole data set without clustering variables. In addition, clustering accuracy is not considered for this method. The number of principal components is determined according to the ratio-based estimator (9) in Remark 1.
•

CPCA_I: This procedure is simply CPCA, but without the Iterative Step. First, the complement $\bm{X}^{c}_{0}=\bm{X}-\bm{G}_{0}\bm{\Phi}^{\top}_{0}$ is found, and then clusters $C^{(j)}_{0}$ are obtained using hierarchical clustering for $\bm{X}^{c}_{0}$ . Variables of $\bm{X}^{c}_{0}$ are clustered, and the cluster-specific components $\bm{F}_{0}^{(j)}$ are finally obtained using PCA on each cluster.
•

CPCA_F: This is exactly the estimation procedure described in CPCA. Different to CPCA_I, the clusters and components are estimated iteratively.

The main purposes of this comparison are to demonstrate that: 1) traditional PCA fails to capture the cluster-specific components while our proposed CPCA can successfully identify and capture the cluster-specific components; 2) our proposed iterative estimation of the clusters and components outperforms the initial estimation.

In this study, the recovering accuracy is measured by the mean squared recovering error (MSRE). In this study, the MSRE is defined as:

\displaystyle MSRE=\frac{1}{np}\left\|\widehat{\bm{X}}_{test}-\bm{X}_{test}\right\|^{2}_{F},

(14)

where $\widehat{\bm{X}}_{test}$ is the recovered testing sample computed using the testing sample and the eigenvectors ( $\bm{\Phi},\bm{\Lambda}^{(j)}$ ) estimated from the training sample $\bm{X}_{train}$ . We prefer a lower MSRE because it indicates a better recovering of the data.

6.2 Simulation Results

The boxplots of the total number of principal components selected (No.PCs) and MSRE for Example 1,2 and 3 are presented in Figure 3 and Table 1. As shown, CPCA_I (blue boxes) achieves a consistently higher number of principal components than its iterative counterpart (red box) and the true number of principal components (dotted line), which implies the advantages of using the iterative estimation. In the initial clustering, the common effects are estimated without partitioning the variables, while the iterative clustering partitions the variables in the previous step to estimate the common components, resulting in a more accurate estimation. This is also stated in Remark 2.

From the recovering accuracy perspective, it is worth noting from Figure 3 and Table 1 that PCA always performs poorly under this cluster structure of the data, while both CPCA-based methods demonstrate superior performance in recovering because they achieve lower MSRE than PCA. Although both CPCA_I and CPCA_F can achieve similar MSRE, the numbers of PCs considered by CPCA_I are generally greater than those identified by CPCA_F, which also indicates the importance of the iterative partitions. In general, CPCA_F results in more accurate partitions and produces more reliable estimations of the cluster-specific components, especially for those with small variations, which contribute the most to predicting the response according to our simulation settings. This further confirms the advantages of using our proposed method.

The results of our proposed CPCA_I and CPCA_F in Example 2 and 3 are similar, while when the sample size is smaller ( $n=30$ ) or the number of variables and clusters are larger ( $p=200$ ), the performance of traditional PCA is adversely affected. This is not surprising because traditional PCA estimates principal components from the sample covariance matrices which are less accurate due to the so-called ‘Curse of dimensionality’ and only the common components can be identified. However, our proposed CPCA is superior in estimating PCs and recovering the original data as the sub-homogeneity are correctly estimated.

Lastly, we investigate Example 4, which does not take into account the common effect. In this example, CPCA_I and CPCA_F simply reduce to performing PCA on each cluster of variables based on hierarchical clustering and LOO-PCR clustering, respectively. From the last row of Figure 3 and the last column of Table 1, we observe that applying PCA after clustering the variables demonstrates better performance than applying PCA directly to the data. Compared with CPCA_I, CPCA_F achieves a better estimate of the number of PCs, implying that our proposed LOO-PCR clustering outperforms hierarchical clustering.

Table 1: Averages (standard errors) of total number of principal components selected (No.PCs) and MSPE for Example 1, 2, 3, and 4.

	Method	Example 1	Example 2	Example 3	Example 4
No.PCs	CPCA_I	24.36	16.90	30.96	12.22
		(6.16)	(5.14)	(6.53)	(1.88)
	CPCA_F	21.22	14.64	27.78	10.12
		(7.70)	(4.87)	(6.89)	(1.15)
	PCA	5.65	3.00	2.95	6.54
		(4.04)	(1.32)	(0.31)	(2.56)
MSRE	CPCA_I	0.37	0.45	0.34	0.26
		(0.20)	(0.22)	(0.16)	(0.05)
	CPCA_F	0.38	0.47	0.33	0.26
		(0.17)	(0.22)	(0.07)	(0.04)
	PCA	1.32	2.00	1.82	0.54
		(0.67)	(0.60)	(0.17)	(0.24)

7 Applications of CPCA

7.1 Principal component regression

As aforementioned, one of the most important uses of the principal components is PCR. We consider a PCR model with an univariate response $y_{i}$ for $i^{th}$ observation as:

\displaystyle{y}_{i}=\bm{g}^{\top}_{i}\bm{\alpha}+\sum_{j=1}^{J}{\bm{f}^{(j)}}^{\top}_{i}\bm{\beta}_{j}+{e}_{i},\quad i=1,\ldots,n,\quad j=1,\ldots,J.

(15)

where $\bm{\alpha}$ is a $r_{c}\times 1$ vector representing the regression coefficients of the common components $\bm{g}_{i}$ , $\bm{\beta}_{j}$ is a $r_{j}\times 1$ vector denoting the regression coefficients of $j^{th}$ cluster-specific components $\bm{f}^{(j)}_{i}$ and ${e}_{i}$ is simply the error term with mean $0$ and variance $\theta^{2}$ .

Therefore, after the the common components $\bm{g}_{i}$ and cluster-specific components $\bm{f}^{(j)}_{i}$ are estimated via CPCA, we can fit a PCR to predict the response. However, in many situations, only a few groups are useful in predicting the response. To investigate which clusters of variable have an impact on predicting the response, we utilize the group lasso (Yuan and Lin, 2006) using the components produced by CPCA as the coviarates to estimate the regression coefficients as in (15),

\displaystyle\left(\widehat{\bm{\alpha}},\widehat{\bm{\beta}}\right)=\operatorname*{argmin}_{\bm{\alpha},\bm{\beta}}\frac{1}{2n}\sum_{i=1}^{n}\left({y}_{i}-\bm{g}^{\top}_{i}\bm{\alpha}-\sum_{j=1}^{J}{\bm{f}_{i}^{(j)\top}}\bm{\beta}_{j}\right)^{2}+\lambda\left(\left\|\bm{\alpha}\right\|_{2}+\sum_{j=1}^{J}\left\|{\bm{\beta}^{(j)}}\right\|_{2}\right),

(16)

where $\lambda$ is the tuning parameter that controls the sparsity of the regression coefficients. Using this group lasso penalty, $\bm{\beta}_{j}$ for some $j$ will be shrunk to zero exactly so that we can identify which clusters of variables are important in predicting the response.

To better demonstrate the performance of our method, we conduct similar simulation studies as in Section 6. We consider the following three examples:

•

Example 1 ( $n=50,p=100$ ). We generate $\bm{X}$ using the same settings as Example 1 in Section 6.1. Then, we simulate the response $y_{i}$ according to (15) by setting regression coefficients $\bm{\alpha}=(1,1,1)$ , $\bm{\beta}=(\bm{\beta}_{1},\ldots,\bm{\beta}_{10})=(0,0,\ldots,0,25,25)$ and standard deviation $\theta=1$ . That is, only the common effect and the cluster-specific effects for the last cluster are important in predicting the response. This regression setting is interesting because the cluster-specific components with smaller variance (e.g. the last cluster) are highly likely to be omitted or estimated poorly in traditional PCA, but they can sometimes be very important in predicting the response.
•

Example 2 ( $n=30,p=100$ ). This example is identical to Example 1 above, except that the sample size is decreased from 50 to 30.
•

Example 3 ( $n=50,p=200$ ). We generate $\bm{X}$ using the same settings as Example 3 in Section 6.1. That is, we consider $J=10$ clusters in this example. In regard to the PCR, we set regression coefficients $\bm{\alpha}=(1,1,1)$ again and consider $\bm{\beta}=(\bm{\beta}_{1},\ldots,\bm{\beta}_{20})=(0,0,\ldots,0,25,25)$ such that only the common effect and the cluster-specific effects for the last cluster are useful and the rest of clusters are noise.

For all three examples, we repeat the simulation procedure 100 times and investigate prediction accuracy of the methods mentioned in Section 6.1. The MSPE is used to determine whether the selected components can accurately predict the response:

\displaystyle\text{MSPE}=\frac{1}{n}\sum_{i=1}^{n}\left(\widehat{y}_{i,test}-{y}_{i,test}\right)^{2},

(17)

where $\widehat{y}_{i,test}$ is computed using (15) with the regression coefficients ( $\bm{\gamma},\bm{\beta}$ ) and the eigenvectors $\left(\bm{\Phi},\bm{\Lambda}^{(j)}\right)$ estimated from the training sample.

Table 2: Averages (standard errors) of MSPE and

\left\|\widehat{\Sigma}-\Sigma\right\|_{\text{F}}^{2}

for Example 1, 2, and 3.

	Method	Example 1	Example 2	Example 3
MSPE	CPCA_I	3.27	5.30	4.65
		(2.14)	(4.21)	(4.92)
	CPCA_F	2.95	4.90	4.22
		(1.92)	(4.12)	(4.94)
	PCA	5.55	6.67	6.32
		(2.37)	(2.03)	(1.43)
	CPCA_I_g	2.38	3.66	2.43
		(1.27)	(2.13)	(1.18)
	CPCA_F_g	2.30	3.61	2.40
		(1.23)	(2.10)	(1.19)
$\\|\widehat{\Sigma}-\Sigma\\|_{\text{F}}^{2}$	CPCA_I	85.92	108.55	111.71
		(14.83)	(25.05)	(13.57)
	CPCA_F	86.41	108.80	111.52
		(14.01)	(24.71)	(13.33)
	PCA	93.40	123.95	128.53
		(12.83)	(31.88)	(16.77)
	POET	91.28	119.71	125.01
		(11.87)	(30.49)	(16.46)

The first panel of Table 2 and the first column of Figure 4 display mean and standard errors and the boxplot of MSPE for all three examples. Note that in this section, CPCA_F and CPCA_F_g represent the ordinary least squares (OLS) and the group lasso regressed on the principal components produced by CPCA_F, respectively. Similar notations apply to the rest of CPCA-based methods. When the traditional PCR (OLS) is utilized, we clearly see that CPCA_F achieves much lower average MSPE and standard deviation than other methods, especially when $n$ is small or $p$ is large. Not surprisingly, the traditional PCA performs the worst because it fails to capture any of the sub-homogeneity, which is important in predicting the response under our setting. Moreover, we observe that PCR using the group lasso also outperforms those OLS counterparts, especially for CPCA_I. The group lasso can provide substantial reductions in MSPE for CPCA_I because it tends to select a large number of principal components, only some of which are useful. Overall, CPCA_F_g achieves the lowest mean and standard error of MSPE among all the methods in all three examples, because it not only accurately extracts both common components and group-specific components, but also shrinks the coefficients of those unimportant groups down to zero.

7.2 Covariance and Precision Matrix Estimation

Another useful application of CPCA is to estimate $\bm{\Sigma}$ as in (4). Using $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ leads to a better estimation of $\bm{\Sigma}$ compared with using $\bm{g}_{i}$ only, because $\bm{g}_{i}$ and $\bm{f}^{(j)}_{i}$ can capture both the low rank and the block-diagonal representation of $\bm{\Sigma}$ . Therefore, we estimate $\bm{\Sigma}$ using clusters, the common components, and the cluster specific components, along with their associated eigenvectors estimated via CPCA. In this section, we numerically illustrate the advantage of using CPCA to estimate the covariance matrix in comparison with other traditional PCA methods. Again, we conduct three simulation examples mentioned in Section 7.1 and utilize the Euclidean distance (ED) between the estimated covariance and population covariance, $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$ , to measure the performances of different methods. Recall $\bm{\Sigma}$ is generated from (4):

\displaystyle\bm{\Sigma}=\bm{\Phi}\text{cov}\left(\bm{g}_{i}\right)\bm{\Phi}^{\top}+\sum_{j=1}^{J}\bm{\Gamma}^{(j)}\text{cov}(\bm{f}^{(j)}_{i}){\bm{\Gamma}^{(j)}}^{\top}+\sum_{j=1}^{J}\bm{I}^{(j)}{\sigma^{(j)2}}.

In this section, we add another prevalent covariance estimation method POET (Fan et al., 2013) as aforementioned into our comparison.

The second panel of Table 2 and the second column of Figure 4 demonstrate the mean and standard errors and boxplot of $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$ for all three examples, computed using methods discussed before. From these results, we see that POET performs better than PCA but worse than CPCA_I and CPCA_F, indicating that the sparsity structure implemented in POET can partly capture the sub-homogeneity, while not in a very efficient way. Among these methods, both CPCA_I and CPCA_F can achieve the lowest mean and standard error of $\|\widehat{\bm{\Sigma}}-\bm{\Sigma}\|^{2}_{F}$ because it can best identify both homogeneity and sub-homogeneity to accurately estimate $\bm{\Sigma}$ .

8 Real Data Analysis

For further illustration, we analyze a stock return data set using the proposed CPCA and compare it with traditional PCA. In particular, we are interested in the performance of our proposed CPCA on clustering stock returns of companies from various industries. Moreover, we also use CPCA to obtain an estimate of the covariance matrix and use it to construct a minimum variance portfolio (MVP).

8.1 Clustering Stock Returns

The data are collected from the Center for Research in Security Prices (CRSP) and include the daily stock returns of 160 companies from 1st January, 2014 to 31st December, 2014, with 252 trading days. The 160 stocks are selected from eight different industries according to Fama and French’s 48-industry classification (Fama and French, 1997), namely, Candy and Soda, Tobacco Products, Apparel, Aircraft, Shipbuilding and Railroad Equipment, Petroleum and Natural Gas, Measuring and Control Equipment, and Shipping Containers, with 20 stocks from each industry. The data for the first 126 trading days are treated as the training sample, and the rest are the testing sample. Thus, the training data have the dimensions $n=126$ and $p=160$ . In this example, after removing the common effect from the data, we aim to identify the clusters that consist of companies from the same industry. This cluster structure of stock returns is also discussed in Fan et al. (2013).

Figure 5 (a) shows the correlation plot for the original stock data. We can observe a vague cluster structure, but the off-diagonals are clearly non-zero, implying that the stocks from the different industries selected in our study tend to be positively correlated. We apply hierarchical clustering directly to the original data and find that most stocks from different industries are clustered in the same group, as displayed in Figure 6 (b). Only stocks from the $5^{\text{th}}$ industry stand out as another cluster, because they have a relatively lower correlation with other stocks. If we consider the industries as true clusters, the hierarchical clustering of the original data results in an ARI of only $0.09$ . This indicates that common components may exist and conceal the cluster structure within the data.

Next, we apply our proposed CPCA to the stock data, and one common component is determined using the ratio-based estimator (9). Figure 5 (b) presents the correlation plot of the data after the common effect is removed. Comparatively, the cluster structure is more apparent. One can interpret the common effect as the market effect and the rest of the components as industry-specific effects. As demonstrated in Figure 6 (c), after the initial step of CPCA, a few clusters of stocks from the $1^{\text{st}},\ 2^{\text{nd}},\ \text{and}\ 7^{\text{th}}$ industries are correctly identified, while stocks from other industries (the $3^{\text{rd}},\ 4^{\text{th}},\ 5^{\text{th}},\ 6^{\text{th}},\ \text{and}\ 8^{\text{th}}$ ) cannot be correctly separated from two large clusters. The ARI after the initial step of CPCA is about $0.44$ which is a significant improvement to $0.09$ . Figure 6 (d) depicts the final clusters after the iterative steps of CPCA where most stocks from the aforementioned two large clusters can be correctly separated according to industries with a minor cost of a few miss-classifications in the $1^{\text{st}}\ \text{and}\ 7^{\text{th}}$ cluster. The ARI is also further improved from $0.44$ to $0.72$ . The final clusters mainly capture the industry information, but a few stocks are not clustered into their own industries. This is to be expected because some stocks from the same industry are not highly correlated, as shown in Figure 5 (e.g., some stocks from the first, third, sixth and seventh industries), which can be common in reality due to diversification of enterprises.

From the prediction perspective, we compute the MSRE for PCA, CPCA_I, and CPCA_F as $3.56$ , $2.23$ , and $1.67$ (in unit of $10^{-4}$ ), respectively. In other words, our CPCA can not only capture the market effect of stock returns as the homogeneity but also include some sub-homogeneities which can be interpreted as industrial effects. All of these findings firmly support that our proposed CPCA is appropriate for analyzing these stock return data.

8.2 Constructing Minimum Variance Portfolio

Next, we consider the problem of constructing a global minimum variance portfolio (MVP) from stock returns. The minimum variance portfolio is built by solving the following optimization problem

	$\displaystyle\min_{\bm{w}}\$	$\displaystyle\bm{w}^{\top}\bm{\widetilde{\Sigma}}\bm{w}$
	s.t.	$\displaystyle\bm{w}^{\top}\bm{1}_{p}=1,$

where $\bm{w}$ is a $p$ -dimensional vector of weights allocated on each stock, $\bm{1}_{p}$ is a $p$ -dimensional vector of ones and $\bm{\widetilde{\Sigma}}$ is defined as an estimate of the $p$ by $p$ covariance matrix of stock returns. The analytic solution of $\bm{w}$ is

\displaystyle\bm{{w}}^{\star}=\frac{\bm{\widetilde{\Sigma}^{-1}}\bm{1}_{p}}{\bm{1}_{p}^{\top}\bm{\widetilde{\Sigma}^{-1}}\bm{1}_{p}},

(18)

where we assume short sale is allowed and there are no transaction costs.

Constructing the global MVP is another important application of our proposed CPCA due to its ability of correctly identifying both the homogeneity and the sub-homogeneity when estimating the covariance matrices of stock returns. The application of constructing MVP have been well studied and discussed by Engle et al. (2019); Chen et al. (2019) and Wang et al. (2021) where several different estimation approaches for the covariance matrices of stock returns are developed. Our interest in this section is the numerical results of the out-of-sample performance of the global MVP constructed using the covariance matrices estimated by CPCA. Recall that stock returns of $160$ companies from $8$ industries are collected for $252$ trading days, we use rolling windows with the size of $110$ days (5 calendar months) to obtain the out-of-sample performance of the portfolio. In specific, for each trading day $i$ , we firstly estimate $\bm{\widetilde{\Sigma}}$ by CPCA where stock returns in the most recent $110$ tradings days before day $i$ (from day $(i-110)$ to day $(i-1)$ ) are considered, and then compute the optimal weight $\bm{{w}}_{i}^{\star}$ by (18). Consequently, the return of the global MVP on day $i$ can be computed using $\bm{{w}}_{i}^{\star}$ and the return of each stock on day $i$ . Similar to Wang et al. (2021), we compute the standard deviation (STD), the information ratio (IR), which is the ratio of mean return to STD, and the Sharpe ratio (SR), which is the ratio of mean excess return (the portfolio-wide return minus the risk-free return) to STD. As discussed by Engle et al. (2019), Chen et al. (2019) and Wang et al. (2021), the STD should be the primary measure when comparing the MVPs constructed using different estimates of covariance matrices, while the IR and SR are also desirable but less crucial. The numerical results in Table 3 compare the portfolios constructed with different estimates of the covariance matrices. As depicted, CPCA with both the initial and final clusters can achieve a slightly better estimate of the covariance matrix, since the portfolio corresponding to CPCA with has the smallest STD. In addition, the global MVP constructed using CPCA with the final clustering index can also achieve highest IR and SR across all methods, which is also desirable. As a result, CPCA works well when constructing the global MVP since the sub-homogeneity of industrial effects can be well estimated.

Table 3: Out-of-sample performance of the minimum variance portfolio (MVP) constructed using the covariance matrices estimated by traditional PCA, CPCA with initial clustering index, CPCA with final clustering index and POET method.

Method	STD	IR	SR
CPCA_I	0.008	0.120	0.112
CPCA_F	0.008	0.151	0.143
PCA	0.010	0.107	0.099
POET	0.009	0.148	0.140

9 Discussion

To conclude, we introduced a novel CPCA to study the homogeneity and sub-homogeneity of high-dimensional data collected from different populations, where the sub-homogeneity refers to a group-specific feature from a particular population. Our numerical simulations confirmed that traditional PCA can only extract the homogeneity from the data, whereas CPCA not only provides a more accurate estimate of the common components, but also identifies the group-specific features, even for the group with small variations. Under a PCR setting, the features extracted using CPCA can significantly outperform those selected by classical PCA in terms of prediction, especially when $n$ is small but $p$ is large. Our real analysis of the stock return data also demonstrated that, when using CPCA, we can capture the industry information after the common component (i.e., market effect) is removed. All of these findings support the use of our proposed CPCA in dimension-reduction problems.

The applications of CPCA are not limited to producing principal components in PCR and revealing the industry structure from the stock return data. CPCA can also be applied to any other data sets in which a group structure exists but hides in the homogeneity. Further, it can be used to estimate a covariance matrix and its inverse of a large data set that exhibits a group structure. More applications of CPCA will be explored in our future work.

SUPPLEMENTARY MATERIAL

R-codes for CPCA:: R-codes including R functions to perform the iterative complement-clustering principal component analysis (CPCA) described in the article. Also included are the R-codes for simulation studies and real analysis of the stock return data shown in the article. (.R files)
Data of stock returns:: Data set of daily returns of stocks from different industries in the US (2014). (.rds file)

References

(1)
Anderson (2003) Anderson, T. (2003), An Introduction to Multivariate Statistical Analysis, 3 edn, Wiley.
Bai and Ng (2002) Bai, J. and Ng, S. (2002), ‘Determining the number of factors in approximate factor models’, Econometrica 70(1), 191–221.
Boivin and Ng (2006) Boivin, J. and Ng, S. (2006), ‘Are more data always better for factor analysis?’, Journal of Econometrics 132(1), 169–194.
Bühlmann et al. (2013) Bühlmann, P., Rütimann, P., van de Geer, S. and Zhang, C.-H. (2013), ‘Correlated variables in regression: clustering and sparse estimation’, Journal of Statistical Planning and Inference 143(11), 1835–1858.
Cai et al. (2017) Cai, T., Han, X. and Pan, G. (2017), ‘Limiting laws for divergent spiked eigenvalues and largest non-spiked eigenvalue of sample covariance matrices’, arXiv preprint arXiv:1711.00217 .
Chen et al. (2019) Chen, J., Li, D. and Linton, O. (2019), ‘A new semiparametric estimation approach for large dynamic covariance matrices with multiple conditioning variables’, Journal of Econometrics 212(1), 155–176.
https://www.sciencedirect.com/science/article/pii/S0304407619300806
Chiou and Li (2007) Chiou, J.-M. and Li, P.-L. (2007), ‘Functional clustering and identifying substructures of longitudinal data’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 69(4), 679–699.
Deville and Malinvaud (1983) Deville, J.-C. and Malinvaud, E. (1983), ‘Data analysis in official socio-economic statistics’, Journal of the Royal Statistical Society: Series A (General) 146(4), 335–352.
Donoho et al. (2000) Donoho, D. L. et al. (2000), ‘High-dimensional data analysis: The curses and blessings of dimensionality’, AMS math challenges lecture 1(32), 375.
Engle et al. (2019) Engle, R. F., Ledoit, O. and Wolf, M. (2019), ‘Large Dynamic Covariance Matrices’, Journal of Business & Economic Statistics 37(2), 363–375. Publisher: Taylor & Francis _eprint: https://doi.org/10.1080/07350015.2017.1345683.
https://doi.org/10.1080/07350015.2017.1345683
Fama and French (1997) Fama, E. F. and French, K. R. (1997), ‘Industry costs of equity’, Journal of Financial Economics 43(2), 153–193.
Fan et al. (2014) Fan, J., Han, F. and Liu, H. (2014), ‘Challenges of big data analysis’, National Science Review 1(2), 293–314.
Fan et al. (2013) Fan, J., Liao, Y. and Mincheva, M. (2013), ‘Large covariance estimation by thresholding principal orthogonal complements’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 75(4), 603–680.
Fan et al. (2018) Fan, J., Sun, Q., Zhou, W.-X. and Zhu, Z. (2018), ‘Principal component analysis for big data’, Wiley StatsRef: Statistics Reference Online pp. 1–13.
Hong et al. (2018) Hong, D., Fessler, J. A. and Balzano, L. (2018), ‘Optimally weighted pca for high-dimensional heteroscedastic data’, arXiv preprint arXiv:1810.12862 .
Hotelling (1933) Hotelling, H. (1933), ‘Analysis of a complex of statistical variables into principal components.’, Journal of Educational Psychology 24(6), 417.
Hubert and Arabie (1985) Hubert, L. and Arabie, P. (1985), ‘Comparing partitions’, Journal of Classification 2(1), 193–218.
Johnstone (2001) Johnstone, I. M. (2001), ‘On the distribution of the largest eigenvalue in principal components analysis’, The Annals of Statistics 29(2), 295–327.
Johnstone and Lu (2009) Johnstone, I. M. and Lu, A. Y. (2009), ‘On consistency and sparsity for principal components analysis in high dimensions’, Journal of the American Statistical Association 104(486), 682–693.
Jolliffe (2002) Jolliffe, I. (2002), Principal Component Analysis, Springer Verlag, New York.
Jolliffe and Cadima (2016) Jolliffe, I. T. and Cadima, J. (2016), ‘Principal component analysis: a review and recent developments’, Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences 374(2065), 20150202.
Lam and Yao (2012) Lam, C. and Yao, Q. (2012), ‘Factor modeling for high-dimensional time series: Inference for the number of factors’, The Annals of Statistics 40(2), 694–726.
Li et al. (2018) Li, Q., Cheng, G., Fan, J. and Wang, Y. (2018), ‘Embracing the blessing of dimensionality in factor models’, Journal of the American Statistical Association 113(521), 380–389.
Liu et al. (2002) Liu, A., Zhang, Y., Gehan, E. and Clarke, R. (2002), ‘Block principal component analysis with application to gene microarray data classification’, Statistics in Medicine 21(22), 3465–3474.
Morales-Jimenez et al. (2018) Morales-Jimenez, D., Johnstone, I. M., McKay, M. R. and Yang, J. (2018), ‘Asymptotics of eigenstructure of sample correlation matrices for high-dimensional spiked models’, arXiv preprint arXiv:1810.10214 .
Pearson (1901) Pearson, K. (1901), ‘Liii. on lines and planes of closest fit to systems of points in space’, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science 2(11), 559–572.
Peng et al. (2020) Peng, B., Su, L., Westerlund, J. and Yang, Y. (2020), ‘Interactive effects panel data models with general factors and regressors’, Available at SSRN 3617477 .
Rand (1971) Rand, W. M. (1971), ‘Objective criteria for the evaluation of clustering methods’, Journal of the American Statistical Association 66(336), 846–850.
Tan et al. (2015) Tan, K. M., Witten, D. and Shojaie, A. (2015), ‘The cluster graphical lasso for improved estimation of gaussian graphical models’, Computational Statistics & Data Analysis 85, 23–36.
Tibshirani et al. (2015) Tibshirani, R., Wainwright, M. and Hastie, T. (2015), Statistical learning with sparsity: the lasso and generalizations, Chapman and Hall/CRC.
Wang et al. (2021) Wang, H., Peng, B., Li, D. and Leng, C. (2021), ‘Nonparametric estimation of large covariance matrices with conditional sparsity’, Journal of Econometrics 223(1), 53–72.
Yuan and Lin (2006) Yuan, M. and Lin, Y. (2006), ‘Model selection and estimation in regression with grouped variables’, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 68(1), 49–67.
Zou et al. (2006) Zou, H., Hastie, T. and Tibshirani, R. (2006), ‘Sparse principal component analysis’, Journal of Computational and Graphical Statistics 15(2), 265–286.

Appendix A Estimation of $\bm{G}_{1}$ and $\bm{\Phi_{1}}$ in CPCA Iterative Step (a)

In this part, we describe the estimation of the principal components, along with their eigenvectors, combined from each cluster in Iterative Step (a) of Algorithm 1.

First, we perform PCA in cluster $j$ by:

\displaystyle\bm{X}^{(j)}=\bm{\Psi}^{(j)}\bm{\Pi}^{(j)\top}+\bm{U}^{(j)},

where $\bm{\Psi}^{(j)}$ is an $n\times r_{j}$ matrix of principal components for variables in cluster $j$ and $\bm{\Pi}^{(j)}$ is a $p_{j}\times r_{j}$ matrix in which each column represents an eigenvector of $\bm{X}^{(j)\top}\bm{X}^{(j)}$ . Here, recall that $p_{j}$ denotes the number of variables in cluster $j$ , and $r_{j}$ is the number of principal components in cluster $j$ . Then, we can combine the principal components $\bm{\Psi}^{(j)}$ from each cluster as $\bm{\Psi}=(\bm{\Psi}^{(1)},...,\bm{\Psi}^{(J_{0})})$ and perform a further step of PCA on $\bm{\Psi}$ to obtain:

\displaystyle\bm{\Psi}=\bm{G}_{1}\bm{H^{\top}}+\bm{V},

where we assume the first $r_{c}$ principal components can be used to summarize the common effects among all clusters. Then, $\bm{G}_{1}$ is a $n\times r$ matrix of principal components of $\bm{\Psi}$ , and $\bm{H}$ is a $(\sum_{j=1}^{J_{0}}r_{j})\times r_{c}$ matrix in which each column represents an eigenvector of $\bm{\Psi}^{\top}\bm{\Psi}$ . Lastly, we can find the complement for cluster $j$ as:

\displaystyle\bm{X}_{1}^{(j)c}=\bm{X}^{(j)}-\bm{G}_{1}\bm{H^{\top}}\bm{\Pi}^{(j)\star\top},

where $\bm{\Pi}^{(j)\star}=(\bm{0}^{(1)},\bm{0}^{(2)},...,\bm{\Pi}^{(j)},...,\bm{0}^{(J_{0})})$ is a $p_{j}\times(\sum_{j=1}^{J_{0}}r_{j})$ matrix in which $\bm{\Pi}^{(j)}$ is a $p_{j}\times r_{j}$ matrix, as we described earlier, and $\bm{0}^{(l)}$ is a $p_{j}\times r_{l}$ zero matrix for $l=1,2,...,J_{0};\ l\neq j$ . Once we obtain the complement from each cluster, we can combine them and present the total complement using:

\displaystyle\bm{X}_{1}^{c}=\bm{X}-\bm{G}_{1}\bm{H^{\top}}\bm{\Pi^{\top}},

where $\bm{\Pi}=(\bm{\Pi}^{(1)\star\top},\bm{\Pi}^{(2)\star\top},...,\bm{\Pi}^{(j)\star\top},...,\bm{\Pi}^{(J_{0})\star\top})^{\top}$ is a $p\times(\sum_{j=1}^{J_{0}}r_{j})$ block-diagonal matrix. Hence, we can finally define $\bm{\Phi_{1}}=\bm{\Pi}\bm{H}$ as the corresponding eigenvectors for $\bm{G}_{1}$ . It is also easy to show that $\bm{\Phi_{1}}^{\top}\bm{\Phi_{1}}=I_{p}$ .

Appendix B Proof of Theorem 1

The matrix form of (11) is:

\displaystyle\bm{u}_{m}=\bm{F}^{(d)}\bm{\beta}+\bm{\zeta}^{(d)}_{m},

(19)

where $\bm{u}_{m}=\left(u_{m1}^{(l)},u_{m2}^{(l)},\ldots,u_{mn}^{(l)}\right)^{\top}$ and $\bm{\beta}=(\beta_{1},\beta_{2},\ldots,\beta_{r_{d}})$ .
Hence, the least-squares estimation of the residual $\bm{\zeta}^{(d)}_{m}$ is:

\displaystyle\widetilde{\bm{\zeta}}^{(d)}_{m}=\bm{u}_{m}-\bm{F}^{(d)}\left(\bm{F}^{(d)\top}\bm{F}^{(d)}\right)^{-1}\bm{F}^{(d)\top}\bm{u}_{m}=:\bm{M}_{\bm{F}^{(d)}}\bm{u}_{m}.

(20)

It should be noted that $\bm{u}_{m}$ and $\bm{F}^{(d)}$ are not observed. In terms of our proposed clustering method, we estimate $\bm{u}_{m}$ via PCA, such that:

\displaystyle\widehat{\bm{u}}_{m}=\bm{M}_{\widehat{\bm{G}}}\bm{x}_{m},

(21)

where $\bm{M}_{\widehat{\bm{G}}}=\bm{I}_{n}-\widehat{\bm{G}}(\widehat{\bm{G}}^{\top}\widehat{\bm{G}})^{-1}\widehat{\bm{G}}^{\top}$ , with $\widehat{\bm{G}}$ being an estimator for $\bm{G}$ , as described in CPCA.
For $\bm{F}^{(d)}$ , we also perform PCA on cluster $d$ and denote the estimator by $\widehat{\bm{F}}^{(d)}$ . Based on (20) and (21), our estimator of $\bm{\zeta}^{(d)}_{m}$ can be written as:

$\displaystyle\widehat{\bm{\zeta}}^{(d)}_{m}$	$\displaystyle=$	$\displaystyle\bm{M}_{\widehat{\bm{F}}^{(d)}}\widehat{\bm{u}}_{m}=\bm{M}_{\widehat{\bm{F}}^{(d)}}\bm{M}_{\widehat{\bm{G}}}\bm{x}_{m}$	(22)
	$\displaystyle=$	$\displaystyle\bm{M}_{\widehat{\bm{F}}^{(d)}}\left(\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\right)\bm{x}_{m}+\bm{M}_{\widehat{\bm{F}}^{(d)}}\bm{M}_{\bm{G}}\bm{x}_{m}$
	$\displaystyle=$	$\displaystyle\left(\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\right)\left(\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\right)\bm{x}_{m}+\bm{M}_{\bm{F}^{(d)}}\left(\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\right)\bm{x}_{m}$
		$\displaystyle+\left(\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\right)\bm{M}_{\bm{G}}\bm{x}_{m}+\bm{M}_{\bm{F}^{(d)}}\bm{M}_{\bm{G}}\bm{x}_{m}.$

To define $\|\bm{M}_{\widehat{\bm{F}}^{(d)}}-\bm{M}_{\bm{F}^{(d)}}\|_{2}=O_{p}(\alpha_{np_{d}})$ , $\|\bm{M}_{\widehat{\bm{G}}}-\bm{M}_{\bm{G}}\|_{2}=O_{p}(\gamma_{nJ})$ , we obtain the result in this theorem.