Integrative Sparse Partial Least Squares
Abstract
Partial least squares (PLS), as a dimension reduction method, has become increasingly important for its ability to handle problems with a large number of variables. Since noisy variables may weaken model performance, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution is to gather information from multiple comparable studies. Integrative analysis holds an important position among multi-dataset analyses: its main idea is to improve estimation by assembling raw datasets and analyzing them jointly. In this paper, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach consists of two penalties. The first penalty conducts variable selection in the context of integrative analysis; the second, a contrasted penalty, encourages similarity of estimates across datasets and generates more reasonable and accurate results. Computational algorithms are provided. Simulation experiments compare iSPLS with alternative approaches, and its practical utility is shown in the analysis of two TCGA gene expression datasets.
1 Introduction
With the rapid development of technology comes the need to analyze high-dimensional data. Partial least squares (PLS), introduced by Wold et al. (1984), has been successfully used as a dimension reduction method in many research areas, such as chemometrics (Sjöström et al., 1983) and, more recently, genetics (Chun and Keleş, 2009). PLS reduces the variable dimension by constructing new components, which are linear combinations of the original variables. Its stability under collinearity and high dimensionality gives PLS a clear advantage over many other methods. However, in high-dimensional problems, noise accumulation from irrelevant variables has long been recognized (Fan and Lv, 2010). For example, in omics studies, it is widely accepted that only a small fraction of genes are associated with outcomes. To yield more accurate estimates and facilitate interpretation, variable selection needs to be considered. Chun and Keleş (2010) proposed a sparse PLS technique that conducts variable selection and dimension reduction simultaneously by imposing Elastic Net penalization in the PLS optimization.
Another challenge that real data analyses often face is the unsatisfactory performance obtained from a single dataset (Guerra and Goldstein, 2009), especially one with a limited sample size. Recent progress in data collection makes it possible to integrate multiple datasets generated under similar protocols. Methods for analyzing multiple datasets include meta-analysis, integrative analysis, and others. Among them, integrative analysis has proven effective in both theory and practice, with better prediction and variable selection performance than other multi-dataset methods (Liu et al., 2015; Ma et al., 2011), including in particular meta-analysis (Grützmann et al., 2005).
Considering the wide applications of PLS/SPLS to high-dimensional data, we propose an integrative SPLS (iSPLS) method to remedy the aforementioned problems of the conventional SPLS technique caused by a limited sample size. Based on the SPLS technique, our method conducts integrative analysis of multiple independent datasets using penalization to promote certain similarity and sparsity structures across them, and thereby improves the accuracy and reliability of variable selection and loading estimation. Our penalization involves two parts. The first penalty conducts variable selection under the paradigm of integrative analysis (Zhao et al., 2015), where a composite penalty is adopted to identify important variables under both the homogeneity structure and the heterogeneity structure. The intuition for the second penalty comes from empirical data analyses: datasets with comparable designs may exhibit a certain degree of similarity, which can be exploited to further improve the analysis. Our work advances the existing sparse PLS and integrative studies by merging the dimension reduction technique with the integrative analysis paradigm. Furthermore, we accommodate both similarity and difference across multiple datasets through the introduction of a two-part penalization.
The rest of the paper is organized as follows. In Section 2, for completeness, we first briefly review the general principles of PLS and SPLS, and then formulate the iSPLS method and establish its algorithms. Simulation studies and applications to TCGA data are provided in Sections 3 and 4. Section 5 concludes with a discussion. Additional technical details and numerical results are provided in the Appendix.
2 Methods
2.1 Sparse partial least squares
Let $Y \in \mathbb{R}^{n \times q}$ and $X \in \mathbb{R}^{n \times p}$ represent the response matrix and predictor matrix, respectively. PLS assumes that there exist latent components $t_k$, $k = 1, \ldots, K$, which are linear combinations of the predictors, such that $X = T P^{\top} + E$ and $Y = T Q^{\top} + F$, where $T = (t_1, \ldots, t_K)$, $P$ and $Q$ are matrices of coefficients (loadings), and $E$ and $F$ are matrices of random errors.
PLS solves for the direction vectors successively. Specifically, the first direction vector $w_1$ is the solution to the following problem:

$$\max_{w}\; w^{\top} M w, \quad \text{s.t.}\; w^{\top} w = 1, \tag{1}$$

where $M = X^{\top} Y Y^{\top} X$. This problem can be solved via the NIPALS (Wold et al., 1984) or SIMPLS (De Jong, 1993) algorithms, which impose different constraints on subsequent direction vectors. After estimating the $K$ direction vectors, the latent components can be calculated as $T = XW$, where $W = (\hat{w}_1, \ldots, \hat{w}_K)$. The final estimator is $\hat{\beta}_{\mathrm{PLS}} = W \hat{q}$, where $\hat{q}$ is the solution of regressing $Y$ on $T = XW$. Details are available in Ter Braak and de Jong (1998).
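For concreteness, here is a minimal numerical sketch of this step, assuming centered data matrices: the first direction vector maximizing $w^{\top} M w$ over the unit sphere is the leading left singular vector of $X^{\top} Y$, which is equivalent to solving (1). The function name is our own illustration.

```python
# A minimal sketch of the basic PLS computation for multivariate responses,
# assuming centered X (n x p) and Y (n x q): the first direction vector
# maximizing w' M w with M = X'YY'X over the unit sphere is the leading
# left singular vector of X'Y.
import numpy as np

def first_pls_direction(X, Y):
    U, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
    return U[:, 0]

# The first latent component is then t1 = X @ first_pls_direction(X, Y).
```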
In the analysis of high-dimensional data, a variable selection procedure needs to be considered to remove noise. Since noisy variables enter the PLS regression via the direction vectors, one possible remedy is to adopt penalization in the optimization procedure, that is, to impose an $L_1$ constraint on the direction vector in problem (1). The first SPLS direction vector can then be obtained by solving the following problem:
$$\max_{w}\; w^{\top} M w, \quad \text{s.t.}\; w^{\top} w = 1,\; \|w\|_1 \le \lambda, \tag{2}$$

where the tuning parameter $\lambda$ controls the degree of sparsity.
However, Jolliffe et al. (2003) pointed out that this problem is non-convex and that its solution may lack sparsity. Chun and Keleş (2010) therefore developed a generalized form of the SPLS problem (2), given below, which can generate a sufficiently sparse solution.
$$\min_{w, c}\; -\kappa\, w^{\top} M w + (1-\kappa)\,(c - w)^{\top} M (c - w) + \lambda_1 \|c\|_1 + \lambda_2 \|c\|_2^2, \quad \text{s.t.}\; w^{\top} w = 1. \tag{3}$$
In this problem, penalties are imposed on $c$, a surrogate of the direction vector that is kept close to $w$, rather than on the original direction vector. The additional $L_2$ penalty deals with the singularity of $M$ when solving for $c$, and the small $\kappa$ reduces the effect of the concave part. The solution of (3) is obtained by optimizing $w$ and $c$ iteratively.
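To make the alternating structure concrete, below is a hedged Python sketch of the iterative solution of (3) for the first direction vector. It uses the $\lambda_2 \to \infty$ simplification, under which the $c$-update reduces to componentwise soft-thresholding of $M w$ (with scaling absorbed into the threshold `lam1`), and a simplified $w$-update that renormalizes $M c$ (the exact update solves a small Lagrangian system, as in Section 2.3). Function names and the stopping rule are our own.

```python
# A hedged sketch of the alternating SPLS updates for problem (3),
# under the lambda2 -> infinity simplification described in the lead-in.
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def spls_first_direction(X, Y, lam1, n_iter=50, tol=1e-6):
    M = X.T @ Y @ Y.T @ X
    w = np.linalg.svd(X.T @ Y, full_matrices=False)[0][:, 0]  # PLS start
    for _ in range(n_iter):
        c = soft_threshold(M @ w, lam1)      # sparse surrogate update
        if np.allclose(c, 0.0):
            break                            # threshold removed everything
        Mc = M @ c
        w_new = Mc / max(np.linalg.norm(Mc), 1e-12)
        if np.linalg.norm(w_new - w) < tol:
            w = w_new
            break
        w = w_new
    return c / max(np.linalg.norm(c), 1e-12), w
```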
2.2 Integrative sparse partial least squares
2.2.1 Data and Model Settings
In this section, we consider $L$ datasets from independent studies with comparable designs. Below, we develop the integrative sparse partial least squares (iSPLS) method to conduct an integrative analysis of these datasets based on the SPLS technique. Note that in the context of integrative analysis, the datasets do not need to be fully comparable. With matched predictors, we further assume that data preprocessing, including imputation, centralization, and normalization, has been done for each dataset separately.
Following the notation of the existing integrative analysis literature (Huang et al., 2012b; Zhao et al., 2015), we use the superscript $(l)$ to denote the $l$th dataset, which has $n^{(l)}$ observations, for $l = 1, \ldots, L$. As in SPLS for a single dataset, the main interest is in the first direction vector. Denote $w_j^{(l)}$ as the weight of the $j$th variable in the first direction vector of the $l$th dataset, and $c_j^{(l)}$ as the corresponding weight in its surrogate $c^{(l)}$; let $c_j = (c_j^{(1)}, \ldots, c_j^{(L)})^{\top}$ be the “group” of weights of variable $j$ in the $L$ first direction vectors, for $j = 1, \ldots, p$.
2.2.2 iSPLS with contrasted penalization
Following the generalized SPLS formulation (3), we formulate the objective function for estimating the first direction vectors of the $L$ datasets. For $l = 1, \ldots, L$, consider the minimization of the penalized objective function

$$\sum_{l=1}^{L} \frac{1}{(n^{(l)})^2} \left[ -\kappa\, w^{(l)\top} M^{(l)} w^{(l)} + (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \lambda\, \|c^{(l)}\|_2^2 \right] + P_1(c) + P_2(c), \tag{4}$$

subject to $w^{(l)\top} w^{(l)} = 1$, where $M^{(l)} = X^{(l)\top} Y^{(l)} Y^{(l)\top} X^{(l)}$, $c^{(l)}$ is the surrogate direction vector of the $l$th dataset, and $c = (c^{(1)}, \ldots, c^{(L)})$.
In (4), $-\kappa\, w^{(l)\top} M^{(l)} w^{(l)} + (1-\kappa)(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)})$ measures the goodness-of-fit of the $l$th dataset, and $\lambda \|c^{(l)}\|_2^2$ serves the same role as in the SPLS method, dealing with the potential singularity of $M^{(l)}$ when solving for $c^{(l)}$. To eliminate the undue influence of larger datasets, we take a weighted sum with weights given by the reciprocals of the squared sample sizes. As for the penalty functions, $P_1(c)$ conducts variable selection in the context of integrative analysis, whereas $P_2(c)$ accounts for the secondary model similarity structure. Below we provide detailed discussions of these two penalties.
2.2.3 Penalization for variable selection
We first consider the form of $P_1(c)$. With $L$ datasets, the sparsity structures of the $L$ direction vectors need to be considered. Integrative analysis considers two generic sparsity structures (Zhao et al., 2015): the homogeneity structure and the heterogeneity structure. Under the homogeneity structure, $I(c_j^{(l)} \ne 0) = I(c_j^{(l')} \ne 0)$ for any $1 \le l < l' \le L$ and $1 \le j \le p$, which means that the $L$ datasets share the same set of important variables. Under the heterogeneity structure, for some $j$ and $l \ne l'$, it is possible that $c_j^{(l)} \ne 0$ while $c_j^{(l')} = 0$; that is, a variable can be important in some datasets but irrelevant in others.
To achieve variable selection under the two sparsity structures, a composite penalty is used for $P_1(c)$, with the minimax concave penalty (MCP) as the outer penalty, which determines whether a variable is relevant at all. The MCP is defined as $\rho(t; \lambda_1, \gamma) = \lambda_1 \int_0^{|t|} (1 - x/(\gamma \lambda_1))_+\, dx$ (Zhang, 2010), with derivative $\dot{\rho}(t; \lambda_1, \gamma) = \lambda_1 (1 - t/(\gamma \lambda_1))_+$ for $t \ge 0$, where $\lambda_1$ is a penalty parameter, $\gamma$ is a regularization parameter that controls the concavity of $\rho$, and $x_+ = x\, I(x \ge 0)$. The inner penalties take different forms under the two sparsity structures.
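The MCP and its derivative are simple to implement; the following helper, reused by the later sketches, is a direct transcription of the definition above.

```python
# A direct transcription of the MCP rho(t; lambda, gamma) and its
# derivative; gamma > 1 controls the concavity.
import numpy as np

def mcp(t, lam, gamma):
    t = np.abs(t)
    return np.where(t <= gamma * lam,
                    lam * t - t ** 2 / (2.0 * gamma),  # concave region
                    0.5 * gamma * lam ** 2)            # flat region

def mcp_deriv(t, lam, gamma):
    return np.maximum(lam - np.abs(t) / gamma, 0.0)
```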
iSPLS under the homogeneity model
Consider the penalty function

$$P_1(c) = \sum_{j=1}^{p} \rho\left( \|c_j\|_2;\, \sqrt{L}\,\lambda_1, \gamma \right),$$

with regularization parameter $\gamma$ and tuning parameter $\lambda_1$. Here the inner penalty is the $L_2$ norm of $c_j$. Under this form of penalty, all $L$ datasets select the same set of variables. The overall penalty is referred to as the 2-norm group MCP (Huang et al., 2012a; Ma et al., 2011).
iSPLS under the heterogeneity model
Consider the penalty function

$$P_1(c) = \sum_{j=1}^{p} \rho\left( \sum_{l=1}^{L} \rho\left( |c_j^{(l)}|;\, \lambda_2, \gamma_2 \right);\, \lambda_1, \gamma_1 \right),$$

with regularization parameters $\gamma_1$ and $\gamma_2$, and tuning parameters $\lambda_1$ and $\lambda_2$. Here the inner penalty, which also takes the form of MCP, determines the individual importance of a selected variable in each dataset. We refer to this penalty as the composite MCP.
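Collecting the groups $c_j$ as rows of a $p \times L$ matrix, the two penalties can be written compactly as below. This hedged sketch reuses the `mcp` helper above; any $\sqrt{L}$-type rescaling of `lam1` is left to the caller, as its exact placement is an assumption here.

```python
# A hedged sketch of the two selection penalties on the p x L surrogate
# matrix C (rows = variable groups c_j, columns = datasets).
import numpy as np

def penalty_homo(C, lam1, gamma):
    # 2-norm group MCP: outer MCP of the L2 norm of each row of C
    return mcp(np.linalg.norm(C, axis=1), lam1, gamma).sum()

def penalty_hetero(C, lam1, lam2, g1, g2):
    # composite MCP: outer MCP of the summed inner MCPs within each row
    inner = mcp(np.abs(C), lam2, g2).sum(axis=1)
    return mcp(inner, lam1, g1).sum()
```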
2.2.4 Contrasted penalization
The 2-norm group MCP and composite MCP above mainly conduct variable selection, ignoring deeper relationships among datasets. It has been observed in empirical studies that the estimation results of independent studies may exhibit a certain degree of similarity in magnitude or sign (Grützmann et al., 2005; Guerra and Goldstein, 2009). It is thus quite possible that the direction vectors of the $L$ datasets have similar magnitudes or signs when the datasets are generated by studies with similar designs (Guerra and Goldstein, 2009; Shi et al., 2014).
To utilize this similarity information and further improve estimation performance, we propose iSPLS with a contrasted penalty $P_2(c)$, which penalizes the differences between estimates within each group. Specifically, we propose the following two contrasted penalties, suited to different degrees of similarity across the datasets.
Magnitude-based contrasted penalization
When datasets are highly comparable, for example generated under the same study design but independently conducted, it is reasonable to expect the first direction vectors to have similar magnitudes. We propose a penalty that shrinks the differences between weights and thus encourages similarity within groups. Consider the magnitude-based contrasted penalty

$$P_2(c) = \mu \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \left( c_j^{(l)} - c_j^{(l')} \right)^2,$$

where $\mu$ is a tuning parameter. Overall, we refer to this approach as iSPLS-Homo(Hetero)M, with the subscript ‘M’ standing for magnitude. Here we choose the $L_2$ penalty for computational simplicity; it can be replaced by other penalties.
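This penalty is straightforward to compute; the sketch below is a direct transcription on the $p \times L$ surrogate matrix.

```python
# A direct transcription of the magnitude-based contrasted penalty on the
# p x L surrogate matrix C; mu is the tuning parameter from the text.
import numpy as np

def penalty_magnitude(C, mu):
    L = C.shape[1]
    return mu * sum(np.sum((C[:, l] - C[:, m]) ** 2)
                    for l in range(L) for m in range(l + 1, L))
```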
Sign-based contrasted penalization
Under certain scenarios, similarity in magnitudes is overly demanding, and it is more reasonable to expect/encourage the first direction vectors of the $L$ datasets to have similar signs (Fang et al., 2018), a weaker requirement than similarity in magnitudes. Here we propose the following sign-based contrasted penalty:

$$P_2(c) = \mu \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \left( \mathrm{sgn}(c_j^{(l)}) - \mathrm{sgn}(c_j^{(l')}) \right)^2,$$

where $\mu$ is a tuning parameter, and $\mathrm{sgn}(t) = 1$, $0$, or $-1$ if $t > 0$, $t = 0$, or $t < 0$, respectively. We refer to this approach as iSPLS-Homo(Hetero)S, with ‘S’ standing for sign. Note that the sign-based penalty is not continuous, which brings challenges to optimization. We further propose the following smooth approximation to tackle this non-smooth optimization problem:

$$P_2(c) \approx \mu \sum_{j=1}^{p} \sum_{1 \le l < l' \le L} \left( \frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}} \right)^2,$$

where $\tau$ is a small positive constant.
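A transcription of the sign-based penalty via its smooth approximation is given below; `tau` is the small positive constant from the text.

```python
# The sign-based contrasted penalty computed via its smooth approximation.
import numpy as np

def smooth_sign(t, tau):
    return t / np.sqrt(t ** 2 + tau ** 2)

def penalty_sign(C, mu, tau):
    s = smooth_sign(C, tau)  # p x L matrix of approximate signs
    L = C.shape[1]
    return mu * sum(np.sum((s[:, l] - s[:, m]) ** 2)
                    for l in range(L) for m in range(l + 1, L))
```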
Under the ‘regression analysis + variable selection’ framework, contrasted penalization methods similar to ours have been developed (Fang et al., 2018). For the $j$th variable, the contrasted penalty encourages the direction vectors of different datasets to have similar magnitudes/signs, rather than forcing them to be identical. Even under the heterogeneity model, the two contrasted penalties remain sensible; for example, they can encourage similarity within a group by pulling a nonzero loading with a relatively small value towards zero. The degree of similarity is adjusted by the tuning parameter $\mu$. Shrinking the differences between parameter estimates based on magnitude or sign has been considered in the literature (Chiquet et al., 2011; Wang et al., 2016), but remains novel in the context on which we focus.
2.3 Computation
For the methods proposed in Section 2.2, the computational algorithms share the same strategy as the SPLS procedure (Chun and Keleş, 2010): $w^{(l)}$ and $c^{(l)}$ are optimized iteratively for $l = 1, \ldots, L$. With fixed tuning and regularization parameters, the overall procedure (Algorithm 1) is: (1) initialize $c^{(l)}$ for each dataset; (2) (a) with $c^{(l)}$ fixed, update $w^{(l)}$ for each dataset, and (b) with $w^{(l)}$ fixed, update $c$ jointly across datasets; (3) repeat Step 2 until convergence.
In Algorithm 1, the key is Step 2. For Step 2(a), with $c^{(l)}$ fixed, the objective function (4) becomes

$$\min_{w^{(l)}}\; -\kappa\, w^{(l)\top} M^{(l)} w^{(l)} + (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}), \quad \text{s.t.}\; w^{(l)\top} w^{(l)} = 1, \tag{5}$$

which does not involve the group penalties. Thus, we can optimize $w^{(l)}$ in each dataset separately. Up to terms not involving $w^{(l)}$, problem (5) can be written as

$$\min_{w^{(l)}}\; (1 - 2\kappa)\, w^{(l)\top} M^{(l)} w^{(l)} - 2(1-\kappa)\, c^{(l)\top} M^{(l)} w^{(l)}, \quad \text{s.t.}\; w^{(l)\top} w^{(l)} = 1.$$

Then, by the method of Lagrange multipliers, we have

$$\hat{w}^{(l)} = (1-\kappa) \left( (1 - 2\kappa)\, M^{(l)} + \lambda^{*} I \right)^{-1} M^{(l)} c^{(l)},$$

where the multiplier $\lambda^{*}$ is the solution of $\|\hat{w}^{(l)}\|_2 = 1$.
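A hedged numerical sketch of this step is given below. It follows the reconstruction above, assuming $\kappa < 1/2$ and a positive semidefinite $M^{(l)}$, and finds the multiplier by bisection; the function name and tolerances are our own.

```python
# Step 2(a) sketch: solve ((1 - 2*kappa) M + lam* I) w = (1 - kappa) M c,
# choosing the multiplier lam* by bisection so that ||w|| = 1.
import numpy as np

def update_w(M, c, kappa, lo=1e-8, hi=1e8, n_bisect=80):
    A = (1.0 - 2.0 * kappa) * M
    rhs = (1.0 - kappa) * (M @ c)
    I = np.eye(M.shape[0])

    def w_of(lam):
        return np.linalg.solve(A + lam * I, rhs)

    # ||w(lam)|| is decreasing in lam: bisect on ||w(lam)|| - 1 = 0
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if np.linalg.norm(w_of(mid)) > 1.0:
            lo = mid
        else:
            hi = mid
    w = w_of(0.5 * (lo + hi))
    return w / np.linalg.norm(w)  # guard against bisection tolerance
```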
For Step 2(b), solving for $c^{(l)}$ with $w^{(l)}$ fixed, problem (4) becomes

$$\min_{c}\; \sum_{l=1}^{L} \frac{1}{(n^{(l)})^2} \left[ (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \lambda\, \|c^{(l)}\|_2^2 \right] + P_1(c) + P_2(c).$$

The iSPLS algorithms under the homogeneity and heterogeneity models differ here. We adopt the coordinate descent (CD) approach, which minimizes the objective function with respect to one group of coefficients at a time and cycles through all groups, transforming a complicated minimization problem into a series of simple ones. The remainder of this section describes the CD algorithm for the heterogeneity model with the sign-based contrasted penalty; the computational algorithms for the homogeneity model and for the magnitude-based contrasted penalty are described in the Appendix.
2.3.1 iSPLS with the composite MCP
Consider the heterogeneity model with the sign-based contrasted penalty (iSPLS-HeteroS):

$$\min_{c}\; \sum_{l=1}^{L} \frac{1}{(n^{(l)})^2} \left[ (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \lambda\, \|c^{(l)}\|_2^2 \right] + \sum_{j=1}^{p} \rho\left( \sum_{l=1}^{L} \rho(|c_j^{(l)}|; \lambda_2, \gamma_2);\, \lambda_1, \gamma_1 \right) + \mu \sum_{j=1}^{p} \sum_{l < l'} \left( \frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}} \right)^2. \tag{6}$$

For $j = 1, \ldots, p$, with the other group parameter vectors $c_{j'}$ ($j' \ne j$) fixed at their current estimates $\tilde{c}_{j'}$, we minimize the objective function (6) with respect to $c_j$. $\lambda$ here is required to be very large because $M^{(l)}$ is a matrix with a relatively small rank (Chun and Keleş, 2010). With $\lambda \to \infty$, we take a first-order Taylor expansion of the first penalty about the current estimate $\tilde{c}_j$; the problem is then approximately equivalent to minimizing a quadratic function of $c_j$ plus a weighted $L_1$ penalty, with weight $\dot{\rho}\left( \sum_{l'=1}^{L} \rho(|\tilde{c}_j^{(l')}|; \lambda_2, \gamma_2);\, \lambda_1, \gamma_1 \right) \dot{\rho}(|\tilde{c}_j^{(l)}|; \lambda_2, \gamma_2)$ on $|c_j^{(l)}|$.
Thus, $c_j$ can be updated as follows. For $j = 1, \ldots, p$:

1. Initialize $s = 0$ and $\tilde{c}_j^{(0)} = \tilde{c}_j$.
2. Update $\tilde{c}_j^{(s+1)}$: for each $l$, minimize the approximated objective with respect to $c_j^{(l)}$, a soft-thresholding-type operation whose threshold comes from the linearized composite MCP weight above, and whose smooth parts (the goodness-of-fit and the approximate sign penalty) contribute through their derivatives at the current estimates (a sketch is given below).
3. Repeat Step 2 until convergence. The estimate at convergence is $\hat{c}_j$.
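The following heavily hedged sketch implements one plausible version of this update as a proximal coordinate step; the paper's exact closed-form expressions are summarized, not reproduced, above. It reuses the `mcp`/`mcp_deriv` helpers; `C` and `W` are $p \times L$ matrices of surrogates and direction vectors, `Ms[l]` is $M^{(l)}$, and `ns[l]` is $n^{(l)}$. The ridge term and the final rescaling of $c$ are omitted here.

```python
# A hedged proximal coordinate-descent sketch for iSPLS-HeteroS: the smooth
# parts enter via exact partial derivatives, the linearized composite MCP
# via a weighted soft-threshold.
import numpy as np

def cd_update_hetero_sign(C, W, Ms, ns, kappa, lam1, lam2, g1, g2, mu, tau,
                          n_sweeps=20):
    p, L = C.shape
    for _ in range(n_sweeps):
        for l in range(L):
            M, scale = Ms[l], (1.0 - kappa) / ns[l] ** 2
            for j in range(p):
                # weight of the linearized composite MCP at current estimates
                inner = mcp(np.abs(C[j]), lam2, g2).sum()
                wgt = float(mcp_deriv(inner, lam1, g1)
                            * mcp_deriv(abs(C[j, l]), lam2, g2))
                # exact partial derivatives of the smooth terms
                g_fit = 2.0 * scale * (M[j] @ (C[:, l] - W[:, l]))
                s = C[j] / np.sqrt(C[j] ** 2 + tau ** 2)
                ds = tau ** 2 / (C[j, l] ** 2 + tau ** 2) ** 1.5
                g_sign = 2.0 * mu * ds * (L * s[l] - s.sum())
                # proximal step with curvature from the quadratic term
                h = 2.0 * scale * M[j, j] + 1e-12
                z = C[j, l] - (g_fit + g_sign) / h
                C[j, l] = np.sign(z) * max(abs(z) - wgt / h, 0.0)
    return C
```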
Tuning parameter selection
iSPLS-HeteroS involves the regularization parameters $\gamma_1$ and $\gamma_2$. Breheny and Huang (2009) suggested linking them in a manner that ensures the group-level penalty attains its maximum if and only if all of its components are at their maxima. Following published studies, we set $\gamma_2 = 6$; with the link between the inner and outer penalties, $\gamma_1$ is then determined accordingly. iSPLS-HomoS involves only the regularization parameter $\gamma$, which is also set to 6. We use cross-validation to choose the tuning parameters $\lambda_1$ ($\lambda_2$) and $\mu$. Furthermore, iSPLS-HeteroS involves the approximation constant $\tau$. In our study, we fix its value, following the suggestion of setting it to a small positive number (Dicker et al., 2013). The literature suggests that the proposed approach is valid as long as $\tau$ is not too large, so that the approximation can differentiate parameters with different signs.
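For illustration, a minimal cross-validation loop over a $(\lambda, \mu)$ grid might look as follows. Here `fit` and `mspe` are hypothetical callables, not interfaces from the paper: `fit` trains iSPLS on a list of training $(X, Y)$ pairs, and `mspe` scores a fitted model on one held-out pair.

```python
# A minimal sketch of 5-fold cross-validation for the tuning parameters,
# with hypothetical `fit` and `mspe` callables.
import numpy as np
from itertools import product

def select_tuning(Xs, Ys, fit, mspe, lam_grid, mu_grid, n_folds=5):
    best, best_score = None, np.inf
    for lam, mu in product(lam_grid, mu_grid):
        fold_scores = []
        for k in range(n_folds):
            train, test = [], []
            for X, Y in zip(Xs, Ys):  # split every dataset the same way
                held = np.arange(len(X)) % n_folds == k
                train.append((X[~held], Y[~held]))
                test.append((X[held], Y[held]))
            model = fit(train, lam=lam, mu=mu)
            fold_scores.append(np.mean([mspe(model, d) for d in test]))
        if np.mean(fold_scores) < best_score:
            best, best_score = (lam, mu), np.mean(fold_scores)
    return best
```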
3 Simulation
We simulate four independent studies, each with sample size $n = 40$ or 120, and five response variables. For each sample, we simulate 100 predictor variables, jointly normally distributed with marginal means zero and variances one. The predictor variables have an auto-regressive correlation structure, where variables $j$ and $j'$ have correlation coefficient $\rho^{|j - j'|}$, with $\rho = 0.2$ and 0.7 corresponding to weak and strong correlations, respectively. All scenarios follow the model $Y = X\beta + \epsilon$, where $\epsilon$ is normally distributed with mean zero. The coefficient matrix $\beta$ is generated following the data-generating mechanism in Chun and Keleş (2010), and the sparsity structures of the direction vectors are controlled through the placement of its nonzero rows. Within each dataset, the number of variables associated with the responses is set to 10, with nonzero coefficients ranging from 0.5 to 4. We simulate under both the homogeneity and heterogeneity models.
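A sketch of one such simulated dataset is given below. The placement and values of the nonzero coefficients here are illustrative only; the paper follows the exact mechanism of Chun and Keleş (2010).

```python
# Simulated-data sketch: AR(rho)-correlated normal predictors, a sparse
# p x q coefficient matrix with n_active variables, and Y = X B + E.
import numpy as np

def simulate_dataset(n=40, p=100, q=5, rho=0.2, n_active=10, seed=0):
    rng = np.random.default_rng(seed)
    cov = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    B = np.zeros((p, q))
    active = rng.choice(p, size=n_active, replace=False)
    B[active] = rng.uniform(0.5, 4.0, size=(n_active, q))  # range from text
    Y = X @ B + rng.standard_normal((n, q))
    return X, Y, active
```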
Under the homogeneity model, the direction vectors have the same sparsity structure, with similar or different nonzero values, corresponding to Scenarios 1 and 2, respectively. Under the heterogeneity model, two scenarios are considered. In Scenario 3, the four datasets share five important variables in common, and the remaining important variables are dataset-specific; that is, the direction vectors have partially overlapping sparsity structures. In Scenario 4, the direction vectors have random sparsity structures with random overlaps. These four scenarios cover a comprehensive range of degrees of overlap in sparsity structures.
To better gauge the performance of the proposed approach, we also consider the following alternatives: (a) meta-analysis, in which each dataset is analyzed separately using PLS or SPLS and the results are then combined across datasets; and (b) a pooled approach, in which the four datasets are pooled together and analyzed by SPLS as a whole. For all approaches, the tuning parameters are selected via 5-fold cross-validation. To evaluate variable selection accuracy, the averages of sensitivities and specificities are computed across replicates. We also evaluate prediction performance by calculating mean squared prediction errors (MSPE).
Summary statistics based on 50 replicates are presented in Tables 1-4. The simulations indicate that the proposed integrative analysis method outperforms its competitors. More specifically, under the fully overlapping (homogeneity) case with similar magnitudes of nonzero values across datasets (Scenario 1), iSPLS-HomoM has the most competitive performance. For example, in Table 1, with $\rho = 0.2$ and $n = 120$, MSPEs are 49.062 (meta-PLS), 5.686 (meta-SPLS), 1.350 (pooled-SPLS), 2.002 (iSPLS-HomoM), 2.414 (iSPLS-HomoS), 3.368 (iSPLS-HeteroM), and 3.559 (iSPLS-HeteroS). Note that under Scenario 1, the performance of iSPLS-HomoM and iSPLS-HomoS may be slightly inferior to that of pooled-SPLS: with fully comparable datasets, it is sensible to pool all data together, so pooled-SPLS may generate more accurate results. However, when the nonzero values differ considerably across datasets (Scenario 2), as can be seen from Table 2, iSPLS-HomoS outperforms the others, including pooled-SPLS. Under the partially overlapping Scenario 3 (heterogeneity model), iSPLS-HeteroM and iSPLS-HeteroS have better performance; for example, with $\rho = 0.7$ and $n = 40$, they have higher sensitivities (0.821 and 0.825, compared to 0.675, 0.575, 0.800, and 0.800 for the alternatives), smaller MSPEs (24.637 and 23.734, compared to 268.880, 30.928, 84.875, 40.867, and 39.492), and similar specificities. Even under the random-overlap Scenario 4, which is the least favourable to multi-dataset analysis, the proposed integrative analysis still performs reasonably. Thus, our integrative methods have the potential to generate satisfactory results, comparable to or better than meta-analysis, even when the overlapping structure of the multiple datasets is unknown.
Table 1: Simulation results for Scenario 1 (homogeneity model, similar nonzero values); in each cell, mean (SD) over 50 replicates.

| $\rho$ | $n$ | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | meta-PLS | 48.972 (4.676) | 1 (0) | 0 (0) |
| | | meta-SPLS | 24.739 (3.879) | 0.632 (0.135) | 0.873 (0.111) |
| | | pooled-SPLS | 4.377 (2.486) | 0.810 (0.127) | 0.999 (0.003) |
| | | iSPLS-HomoM | 9.452 (4.369) | 0.840 (0.110) | 0.982 (0.018) |
| | | iSPLS-HomoS | 10.151 (4.027) | 0.837 (0.119) | 0.980 (0.022) |
| | | iSPLS-HeteroM | 18.287 (6.022) | 0.845 (0.152) | 0.757 (0.063) |
| | | iSPLS-HeteroS | 15.462 (6.251) | 0.875 (0.143) | 0.743 (0.060) |
| 0.2 | 120 | meta-PLS | 49.062 (4.151) | 1 (0) | 0 (0) |
| | | meta-SPLS | 5.686 (2.056) | 0.799 (0.053) | 0.994 (0.007) |
| | | pooled-SPLS | 1.350 (1.229) | 0.937 (0.025) | 0.999 (0.000) |
| | | iSPLS-HomoM | 2.002 (0.920) | 0.993 (0.008) | 0.956 (0.016) |
| | | iSPLS-HomoS | 2.414 (0.951) | 0.997 (0.008) | 0.929 (0.014) |
| | | iSPLS-HeteroM | 3.368 (1.211) | 0.955 (0.039) | 0.945 (0.019) |
| | | iSPLS-HeteroS | 3.559 (1.297) | 0.982 (0.051) | 0.872 (0.007) |
| 0.7 | 40 | meta-PLS | 106.532 (7.066) | 1 (0) | 0 (0) |
| | | meta-SPLS | 16.212 (4.033) | 0.828 (0.063) | 0.962 (0.011) |
| | | pooled-SPLS | 5.984 (1.939) | 0.893 (0.065) | 0.984 (0.037) |
| | | iSPLS-HomoM | 6.956 (1.885) | 0.967 (0.018) | 0.947 (0.021) |
| | | iSPLS-HomoS | 7.000 (2.067) | 0.967 (0.018) | 0.946 (0.020) |
| | | iSPLS-HeteroM | 13.630 (3.817) | 0.896 (0.109) | 0.946 (0.019) |
| | | iSPLS-HeteroS | 13.855 (3.778) | 0.909 (0.112) | 0.942 (0.020) |
| 0.7 | 120 | meta-PLS | 102.629 (9.225) | 1 (0) | 0 (0) |
| | | meta-SPLS | 4.824 (1.913) | 0.912 (0.049) | 0.985 (0.012) |
| | | pooled-SPLS | 2.454 (1.481) | 0.883 (0.056) | 0.994 (0.023) |
| | | iSPLS-HomoM | 2.292 (0.829) | 0.987 (0.018) | 0.977 (0.014) |
| | | iSPLS-HomoS | 2.356 (0.785) | 0.987 (0.018) | 0.976 (0.014) |
| | | iSPLS-HeteroM | 3.718 (0.995) | 0.988 (0.051) | 0.948 (0.013) |
| | | iSPLS-HeteroS | 3.609 (1.077) | 0.997 (0.035) | 0.942 (0.012) |
Table 2: Simulation results for Scenario 2 (homogeneity model, different nonzero values); in each cell, mean (SD) over 50 replicates.

| $\rho$ | $n$ | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | meta-PLS | 87.769 (14.532) | 1 (0) | 0 (0) |
| | | meta-SPLS | 31.173 (8.422) | 0.532 (0.074) | 0.919 (0.073) |
| | | pooled-SPLS | 33.533 (4.519) | 0.883 (0.095) | 0.976 (0.042) |
| | | iSPLS-HomoM | 17.567 (5.086) | 0.993 (0.025) | 0.681 (0.084) |
| | | iSPLS-HomoS | 16.881 (4.548) | 0.993 (0.025) | 0.681 (0.084) |
| | | iSPLS-HeteroM | 28.803 (6.574) | 0.756 (0.122) | 0.774 (0.057) |
| | | iSPLS-HeteroS | 25.990 (5.446) | 0.819 (0.102) | 0.739 (0.063) |
| 0.2 | 120 | meta-PLS | 85.138 (4.172) | 1 (0) | 0 (0) |
| | | meta-SPLS | 9.015 (1.283) | 0.672 (0.054) | 0.994 (0.005) |
| | | pooled-SPLS | 27.068 (0.867) | 0.993 (0.076) | 1.000 (0.003) |
| | | iSPLS-HomoM | 3.673 (0.552) | 1.000 (0.025) | 0.983 (0.024) |
| | | iSPLS-HomoS | 3.589 (0.649) | 1.000 (0.018) | 0.982 (0.043) |
| | | iSPLS-HeteroM | 6.050 (0.555) | 0.898 (0.040) | 0.956 (0.024) |
| | | iSPLS-HeteroS | 6.674 (0.776) | 0.949 (0.030) | 0.939 (0.032) |
| 0.7 | 40 | meta-PLS | 192.366 (10.990) | 1 (0) | 0 (0) |
| | | meta-SPLS | 28.179 (5.592) | 0.652 (0.078) | 0.981 (0.015) |
| | | pooled-SPLS | 65.284 (5.221) | 0.970 (0.101) | 0.963 (0.018) |
| | | iSPLS-HomoM | 10.186 (4.096) | 0.997 (0.055) | 0.948 (0.023) |
| | | iSPLS-HomoS | 9.909 (4.031) | 0.997 (0.055) | 0.947 (0.022) |
| | | iSPLS-HeteroM | 23.300 (9.108) | 0.741 (0.096) | 0.953 (0.017) |
| | | iSPLS-HeteroS | 22.974 (9.806) | 0.765 (0.095) | 0.950 (0.019) |
| 0.7 | 120 | meta-PLS | 175.348 (8.390) | 1 (0) | 0 (0) |
| | | meta-SPLS | 14.871 (1.943) | 0.745 (0.041) | 0.975 (0.010) |
| | | pooled-SPLS | 61.626 (0.758) | 0.963 (0.059) | 0.986 (0.008) |
| | | iSPLS-HomoM | 5.923 (0.913) | 0.997 (0.035) | 0.972 (0.012) |
| | | iSPLS-HomoS | 5.764 (0.913) | 0.997 (0.035) | 0.971 (0.013) |
| | | iSPLS-HeteroM | 10.742 (1.252) | 0.911 (0.016) | 0.917 (0.012) |
| | | iSPLS-HeteroS | 9.354 (1.267) | 0.946 (0.009) | 0.912 (0.011) |
Table 3: Simulation results for Scenario 3 (heterogeneity model, partially overlapping sparsity structures); in each cell, mean (SD) over 50 replicates.

| $\rho$ | $n$ | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | meta-PLS | 76.919 (11.918) | 1 (0) | 0 (0) |
| | | meta-SPLS | 35.372 (6.920) | 0.551 (0.118) | 0.900 (0.087) |
| | | pooled-SPLS | 54.006 (6.920) | 0.675 (0.289) | 0.730 (0.289) |
| | | iSPLS-HomoM | 28.495 (4.416) | 0.900 (0.057) | 0.589 (0.062) |
| | | iSPLS-HomoS | 27.897 (4.231) | 0.900 (0.069) | 0.589 (0.070) |
| | | iSPLS-HeteroM | 23.201 (5.788) | 0.800 (0.137) | 0.847 (0.042) |
| | | iSPLS-HeteroS | 21.616 (5.632) | 0.800 (0.134) | 0.856 (0.039) |
| 0.2 | 120 | meta-PLS | 84.613 (16.931) | 1 (0) | 0 (0) |
| | | meta-SPLS | 10.995 (2.382) | 0.696 (0.082) | 0.990 (0.010) |
| | | pooled-SPLS | 44.243 (3.532) | 0.600 (0.190) | 0.847 (0.125) |
| | | iSPLS-HomoM | 12.445 (1.995) | 0.902 (0.050) | 0.683 (0.049) |
| | | iSPLS-HomoS | 12.471 (1.993) | 0.908 (0.049) | 0.674 (0.058) |
| | | iSPLS-HeteroM | 8.699 (1.768) | 0.882 (0.050) | 0.926 (0.016) |
| | | iSPLS-HeteroS | 8.467 (1.826) | 0.882 (0.049) | 0.931 (0.015) |
| 0.7 | 40 | meta-PLS | 268.880 (12.323) | 1 (0) | 0 (0) |
| | | meta-SPLS | 30.928 (7.532) | 0.675 (0.084) | 0.939 (0.022) |
| | | pooled-SPLS | 84.875 (7.594) | 0.575 (0.152) | 0.909 (0.073) |
| | | iSPLS-HomoM | 40.867 (6.147) | 0.800 (0.084) | 0.700 (0.152) |
| | | iSPLS-HomoS | 39.492 (5.919) | 0.800 (0.080) | 0.700 (0.171) |
| | | iSPLS-HeteroM | 24.637 (6.887) | 0.821 (0.102) | 0.900 (0.051) |
| | | iSPLS-HeteroS | 23.734 (6.373) | 0.825 (0.111) | 0.911 (0.068) |
| 0.7 | 120 | meta-PLS | 258.583 (8.390) | 1 (0) | 0 (0) |
| | | meta-SPLS | 12.631 (2.791) | 0.900 (0.062) | 0.971 (0.011) |
| | | pooled-SPLS | 73.999 (6.112) | 0.800 (0.138) | 0.772 (0.135) |
| | | iSPLS-HomoM | 20.475 (3.493) | 0.998 (0.010) | 0.364 (0.066) |
| | | iSPLS-HomoS | 20.463 (3.445) | 0.998 (0.010) | 0.364 (0.066) |
| | | iSPLS-HeteroM | 10.228 (2.837) | 0.988 (0.019) | 0.895 (0.022) |
| | | iSPLS-HeteroS | 10.113 (2.818) | 0.988 (0.062) | 0.895 (0.011) |
Table 4: Simulation results for Scenario 4 (heterogeneity model, random sparsity structures); in each cell, mean (SD) over 50 replicates.

| $\rho$ | $n$ | Method | MSPE | Sensitivity | Specificity |
|---|---|---|---|---|---|
| 0.2 | 40 | meta-PLS | 203.530 (25.691) | 1 (0) | 0 (0) |
| | | meta-SPLS | 92.530 (21.452) | 0.580 (0.105) | 0.880 (0.096) |
| | | pooled-SPLS | 174.432 (14.915) | 0.491 (0.300) | 0.608 (0.288) |
| | | iSPLS-HomoM | 100.245 (14.246) | 0.852 (0.067) | 0.404 (0.084) |
| | | iSPLS-HomoS | 98.013 (14.990) | 0.851 (0.064) | 0.403 (0.073) |
| | | iSPLS-HeteroM | 79.508 (19.775) | 0.633 (0.111) | 0.918 (0.026) |
| | | iSPLS-HeteroS | 81.041 (19.177) | 0.626 (0.106) | 0.920 (0.025) |
| 0.2 | 120 | meta-PLS | 233.403 (41.459) | 1 (0) | 0 (0) |
| | | meta-SPLS | 26.850 (5.595) | 0.689 (0.060) | 0.994 (0.067) |
| | | pooled-SPLS | 155.342 (12.988) | 0.500 (0.098) | 0.717 (0.041) |
| | | iSPLS-HomoM | 42.005 (5.538) | 0.914 (0.047) | 0.496 (0.055) |
| | | iSPLS-HomoS | 41.962 (5.566) | 0.913 (0.047) | 0.498 (0.063) |
| | | iSPLS-HeteroM | 24.120 (4.955) | 0.878 (0.076) | 0.925 (0.025) |
| | | iSPLS-HeteroS | 24.177 (4.865) | 0.880 (0.077) | 0.926 (0.018) |
| 0.7 | 40 | meta-PLS | 542.745 (91.125) | 1 (0) | 0 (0) |
| | | meta-SPLS | 69.596 (22.876) | 0.654 (0.089) | 0.974 (0.024) |
| | | pooled-SPLS | 357.967 (34.464) | 0.401 (0.236) | 0.753 (0.219) |
| | | iSPLS-HomoM | 100.322 (17.976) | 0.937 (0.055) | 0.437 (0.067) |
| | | iSPLS-HomoS | 97.774 (20.713) | 0.937 (0.054) | 0.436 (0.069) |
| | | iSPLS-HeteroM | 66.089 (16.784) | 0.904 (0.083) | 0.776 (0.041) |
| | | iSPLS-HeteroS | 64.131 (16.378) | 0.904 (0.089) | 0.771 (0.024) |
| 0.7 | 120 | meta-PLS | 636.340 (73.501) | 1 (0) | 0 (0) |
| | | meta-SPLS | 35.067 (8.960) | 0.872 (0.015) | 0.954 (0.057) |
| | | pooled-SPLS | 331.250 (9.337) | 0.469 (0.110) | 0.773 (0.075) |
| | | iSPLS-HomoM | 56.381 (11.017) | 0.992 (0.047) | 0.465 (0.018) |
| | | iSPLS-HomoS | 56.234 (11.021) | 0.993 (0.047) | 0.461 (0.019) |
| | | iSPLS-HeteroM | 31.622 (8.855) | 0.943 (0.066) | 0.913 (0.016) |
| | | iSPLS-HeteroS | 30.625 (8.501) | 0.943 (0.063) | 0.911 (0.017) |
To sum up, under the homogeneity cases, iSPLS-HomoM and iSPLS-HomoS have the most favourable performance, and under the heterogeneity cases, iSPLS-HeteroM and iSPLS-HeteroS outperform the others. It is also interesting to observe that the relative performance of the contrasted penalties depends on the degree of similarity across datasets. For example, in Table 2, iSPLS-HomoS (iSPLS-HeteroS), with its less stringent penalty, has relatively lower MSPEs than iSPLS-HomoM (iSPLS-HeteroM), while in Table 1 it is the other way around. This comparison supports the rationale of the proposed contrasted penalization.
4 Data analysis
4.1 Analysis of cutaneous melanoma data
We analyze three datasets from the TCGA cutaneous melanoma (SKCM) study, corresponding to different tumor stages, with 70 samples in stage 1, 60 in stage 2, and 110 in stages 3 and 4. Studies have been conducted on Breslow thickness, an important prognostic marker regulated by gene expression. However, most of these studies use samples from all stages together. Exploratory analysis suggests that, beyond similarity, there is also considerable variation across the three stages. The three datasets contain 18,947 gene expression measurements in total. To generate more accurate results with rather limited samples, we base our analysis on the results of Sun et al. (2018), who developed a Community Fusion (CoFu) approach that conducts variable selection while taking into account the network community structure of omics measurements. After the unique identification of genes, matching of gene names with those in the SKCM dataset, supervised screening, network construction, and community identification, a total of 21 communities, with 126 genes, were identified as associated with the response using CoFu and are used here for downstream analysis.
We apply the proposed integrative analysis methods and their competitors, meta-analysis and pooled analysis. The identified variables are found to vary across methods and stages. For example, Figure 1 shows the estimation results for genes in communities 3, 5, and 42, in which each row corresponds to one dataset. For a specific dataset, although the results generated by different methods differ, they share some nonzero loadings in common; the numbers of overlapping genes identified by different methods are summarized in Table 5. The genes identified by iSPLS-Hetero in Figure 1 demonstrate stage-specific features within a community, indicating differences across tumor stages.
[Figure 1: Estimation results for genes in communities 3, 5, and 42 of the SKCM data; each row corresponds to one dataset (tumor stage).]
To evaluate prediction performance and the stability of identification, we first randomly split each dataset into 75% for training and 25% for testing. Estimates obtained from the training set are then used to predict the testing set, with the root mean squared error (RMSE) measuring prediction performance. Furthermore, for each gene, we compute its observed occurrence index (OOI) (Huang and Ma, 2010), that is, its probability of being identified in 100 resamplings. The RMSEs and OOIs of each method, shown in Table 6, suggest the stability of the proposed methods as well as their competitive performance relative to the alternatives.
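A sketch of the OOI computation is given below. Here `select_genes` is a hypothetical wrapper, not an interface from the paper, returning the indices of genes selected on one training split.

```python
# The observed occurrence index (OOI): the proportion of random 75/25
# resamplings in which each gene is selected.
import numpy as np

def occurrence_index(X, Y, select_genes, n_resample=100, seed=0):
    rng = np.random.default_rng(seed)
    n, p = X.shape
    counts = np.zeros(p)
    for _ in range(n_resample):
        train = rng.permutation(n)[: int(0.75 * n)]  # 75% training split
        counts[select_genes(X[train], Y[train])] += 1
    return counts / n_resample
```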
4.2 Analysis of lung cancer data
We collect two lung cancer datasets, on lung adenocarcinoma (LUAD) and lung squamous cell carcinoma (LUSC), with sample sizes 142 and 89, respectively. Studies have analyzed FEV1, a measure of lung function, and its relationship with gene expression using these two datasets, but only separately. Since both adenocarcinoma and squamous cell carcinoma are non-small-cell lung carcinomas, a certain degree of similarity between them can be expected. Considering both their differences and similarities, we apply the proposed integrative methods to these two datasets. Our analysis focuses on 474 genes in 26 communities, identified as associated with the response based on the results of Sun et al. (2018).
We perform the same procedure as described above. The identified variables vary across methods and datasets. To illustrate the estimation results, Figure 2 shows three communities identified by the above methods, from which both the similarities and the differences between the two datasets are evident. Stability and prediction performance are evaluated by computing RMSEs and OOIs over 100 resamplings, following the same procedure as above. The overall results are summarized in Tables 5-6; the iSPLS methods have relatively lower RMSEs and higher OOIs than the other methods.
[Figure 2: Estimation results for three representative communities in the LUAD and LUSC datasets.]
Table 5: Numbers of overlapping genes identified by pairs of methods (diagonal entries: total numbers identified by each method).

| | pooled-SPLS | meta-SPLS | iSPLS-HomoM | iSPLS-HomoS | iSPLS-HeteroM | iSPLS-HeteroS |
|---|---|---|---|---|---|---|
| **SKCM data** | | | | | | |
| pooled-SPLS | 100 | 34 | 20 | 21 | 51 | 53 |
| meta-SPLS | | 107 | 28 | 29 | 71 | 72 |
| iSPLS-HomoM | | | 45 | 45 | 45 | 45 |
| iSPLS-HomoS | | | | 46 | 46 | 46 |
| iSPLS-HeteroM | | | | | 83 | 75 |
| iSPLS-HeteroS | | | | | | 89 |
| **Lung cancer data** | | | | | | |
| pooled-SPLS | 145 | 78 | 37 | 40 | 66 | 51 |
| meta-SPLS | | 92 | 39 | 42 | 76 | 58 |
| iSPLS-HomoM | | | 39 | 39 | 38 | 35 |
| iSPLS-HomoS | | | | 42 | 40 | 36 |
| iSPLS-HeteroM | | | | | 66 | 58 |
| iSPLS-HeteroS | | | | | | 72 |
Table 6: Prediction (RMSE) and stability (median OOI) evaluation based on 100 resamplings.

| | pooled-SPLS | meta-SPLS | iSPLS-HomoM | iSPLS-HomoS | iSPLS-HeteroM | iSPLS-HeteroS |
|---|---|---|---|---|---|---|
| **SKCM data** | | | | | | |
| RMSE | 6.210 | 4.046 | 4.202 | 4.163 | 3.202 | 3.135 |
| OOI (median) | 0.76 | 0.75 | 0.80 | 0.80 | 0.77 | 0.78 |
| **Lung cancer data** | | | | | | |
| RMSE | 32.367 | 27.837 | 22.269 | 20.412 | 21.019 | 20.318 |
| OOI (median) | 0.71 | 0.73 | 0.78 | 0.78 | 0.76 | 0.75 |
5 Discussion
PLS regression has been promoted for ill-conditioned linear regression problems that arise in several disciplines, such as chemistry, economics, medicine, and psychology. In this study, we propose an integrative SPLS (iSPLS) method, which conducts the integrative analysis of multiple independent datasets based on the SPLS technique. This study significantly extends the integrative analysis paradigm by conducting a dimension reduction analysis. An important contribution is that, to promote similarity across datasets more effectively, two contrasted penalties have been developed: under both the homogeneity and heterogeneity models, we develop magnitude-based and sign-based contrasted penalization. We also develop effective computational algorithms for the proposed integrative analysis. For a variety of model settings, simulations demonstrate the satisfactory performance of the proposed iSPLS method. The applications to TCGA data suggest that magnitude-based iSPLS and sign-based iSPLS do not dominate each other and are both needed in practice. The stability and prediction evaluations lend support to the validity of the proposed method.
This study can be extended in multiple directions. Beyond PLS, integrative analysis can be developed based on other dimension reduction techniques, such as CCA and ICA, or based on SPLS-SVD. For selection, the MCP is adopted here and can potentially be replaced with other two-level selection penalties. Moreover, iSPLS is applicable to other modeling frameworks, such as generalized linear models and survival models. In data analysis, both the magnitude-based and sign-based iSPLS have applications well beyond this study.
References
- Breheny and Huang (2009) Breheny, P. and Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2:369–380.
- Chiquet et al. (2011) Chiquet, J., Grandvalet, Y., and Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21:537–553.
- Chun and Keleş (2009) Chun, H. and Keleş, S. (2009). Expression quantitative trait loci mapping with multivariate sparse partial least squares regression. Genetics, 182:79–90.
- Chun and Keleş (2010) Chun, H. and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:3–25.
- De Jong (1993) De Jong, S. (1993). SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18:251–263.
- Dicker et al. (2013) Dicker, L., Huang, B., and Lin, X. (2013). Variable selection and estimation with the seamless-$L_0$ penalty. Statistica Sinica, 23:929–962.
- Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148.
- Fang et al. (2018) Fang, K., Fan, X., Zhang, Q., and Ma, S. (2018). Integrative sparse principal component analysis. Journal of Multivariate Analysis, 166:1–16.
- Grützmann et al. (2005) Grützmann, R., Boriss, H., Ammerpohl, O., Lüttges, J., Kalthoff, H., Schackert, H. K., Klöppel, G., Saeger, H. D., and Pilarsky, C. (2005). Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene, 24:5079–5088.
- Guerra and Goldstein (2009) Guerra, R. and Goldstein, D. R. (2009). Meta-analysis and combining information in genetics and genomics. CRC Press.
- Huang et al. (2012a) Huang, J., Breheny, P., and Ma, S. (2012a). A selective review of group selection in high-dimensional models. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 27:481–499.
- Huang and Ma (2010) Huang, J. and Ma, S. (2010). Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Analysis, 16:176–195.
- Huang et al. (2012b) Huang, Y., Huang, J., Shia, B.-C., and Ma, S. (2012b). Identification of cancer genomic markers via integrative sparse boosting. Biostatistics, 13:509–522.
- Jolliffe et al. (2003) Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547.
- Liu et al. (2015) Liu, J., Huang, J., Zhang, Y., Lan, Q., Rothman, N., Zheng, T., and Ma, S. (2015). Integrative analysis of prognosis data on multiple cancer subtypes. Biometrics, 70:480–488.
- Ma et al. (2011) Ma, S., Huang, J., and Song, X. (2011). Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics, 12:763–775.
- Shi et al. (2014) Shi, X., Liu, J., Huang, J., Zhou, Y., Shia, B., and Ma, S. (2014). Integrative analysis of high-throughput cancer studies with contrasted penalization. Genetic Epidemiology, 38:144–151.
- Sjöström et al. (1983) Sjöström, M., Wold, S., Lindberg, W., Persson, J.-Å., and Martens, H. (1983). A multivariate calibration problem in analytical chemistry solved by partial least squares models in latent variables. Analytica Chimica Acta, 150:61–70.
- Sun et al. (2018) Sun, Y., Jiang, Y., Li, Y., and Ma, S. (2018). Identification of cancer omics commonality and difference via community fusion. Statistics in Medicine, 38:1200–1212.
- Ter Braak and de Jong (1998) Ter Braak, C. J. and de Jong, S. (1998). The objective function of partial least squares regression. Journal of Chemometrics: A Journal of the Chemometrics Society, 12:41–54.
- Wang et al. (2016) Wang, F., Wang, L., and Song, P. X.-K. (2016). Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements. Biometrics, 72:1184–1193.
- Wold et al. (1984) Wold, S., Ruhe, A., Wold, H., and Dunn, III, W. (1984). The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5:735–743.
- Zhang (2010) Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942.
- Zhao et al. (2015) Zhao, Q., Shi, X., Huang, J., Liu, J., Li, Y., and Ma, S. (2015). Integrative analysis of ‘-omics’ data using penalty functions. Wiley Interdisciplinary Reviews Computational Statistics, 7:99–108.
Algorithms
iSPLS with the 2-norm group MCP and magnitude-based contrasted penalty
We adopt a computational algorithm similar to Algorithm 1. The key difference lies in Step 2(b), solving for $c$ with $w^{(l)}$ fixed. Under the homogeneity model with the magnitude-based contrasted penalty (iSPLS-HomoM), we have the problem

$$\min_{c}\; \sum_{l=1}^{L} \frac{1}{(n^{(l)})^2} \left[ (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \lambda\, \|c^{(l)}\|_2^2 \right] + \sum_{j=1}^{p} \rho\left( \|c_j\|_2;\, \sqrt{L}\,\lambda_1, \gamma \right) + \mu \sum_{j=1}^{p} \sum_{l < l'} \left( c_j^{(l)} - c_j^{(l')} \right)^2, \tag{7}$$

which is solved by coordinate descent over the groups $c_j$.
iSPLS with the composite MCP and magnitude-based contrasted penalty
Under the heterogeneity model with the magnitude-based contrasted penalty (iSPLS-HeteroM), the Step 2(b) problem is

$$\min_{c}\; \sum_{l=1}^{L} \frac{1}{(n^{(l)})^2} \left[ (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \lambda\, \|c^{(l)}\|_2^2 \right] + \sum_{j=1}^{p} \rho\left( \sum_{l=1}^{L} \rho(|c_j^{(l)}|; \lambda_2, \gamma_2);\, \lambda_1, \gamma_1 \right) + \mu \sum_{j=1}^{p} \sum_{l < l'} \left( c_j^{(l)} - c_j^{(l')} \right)^2. \tag{9}$$

Taking a first-order Taylor expansion of the first penalty about the current estimates $\tilde{c}_j$, with the other groups fixed, and conducting the same procedure as in Section 2.3.1, the objective function (9) is approximately equivalent to minimizing

$$\sum_{l=1}^{L} \frac{1-\kappa}{(n^{(l)})^2}\, (c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \sum_{j=1}^{p} \sum_{l=1}^{L} \omega_j^{(l)} |c_j^{(l)}| + \mu \sum_{j=1}^{p} \sum_{l < l'} \left( c_j^{(l)} - c_j^{(l')} \right)^2, \tag{10}$$

where $\omega_j^{(l)} = \dot{\rho}\left( \sum_{l'=1}^{L} \rho(|\tilde{c}_j^{(l')}|; \lambda_2, \gamma_2);\, \lambda_1, \gamma_1 \right) \dot{\rho}(|\tilde{c}_j^{(l)}|; \lambda_2, \gamma_2)$.
Thus, $c_j$ can be updated as follows. For $j = 1, \ldots, p$:

1. Initialize $s = 0$ and $\tilde{c}_j^{(0)} = \tilde{c}_j$.
2. Update $\tilde{c}_j^{(s+1)}$: for each $l$, minimize (10) with respect to $c_j^{(l)}$, a soft-thresholding operation with threshold determined by $\omega_j^{(l)}$, in which the quadratic contrasted penalty contributes to both the linear and quadratic coefficients.
3. Repeat Step 2 until convergence. The estimate at convergence is $\hat{c}_j$.
iSPLS with the 2-norm group MCP and sign-based contrasted penalty
Consider the homogeneity model with the sign-based contrasted penalty (iSPLS-HomoS):

$$\min_{c}\; \sum_{l=1}^{L} \frac{1}{(n^{(l)})^2} \left[ (1-\kappa)\,(c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \lambda\, \|c^{(l)}\|_2^2 \right] + \sum_{j=1}^{p} \rho\left( \|c_j\|_2;\, \sqrt{L}\,\lambda_1, \gamma \right) + \mu \sum_{j=1}^{p} \sum_{l < l'} \left( \frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}} \right)^2. \tag{11}$$

For $j = 1, \ldots, p$, following the same procedure as in Section 2.3.1, with the group MCP linearized about the current estimate $\tilde{c}_j$ and $c_{j'} = \tilde{c}_{j'}$ for $j' \ne j$, we have the minimization problem

$$\min_{c_j}\; \sum_{l=1}^{L} \frac{1-\kappa}{(n^{(l)})^2}\, (c^{(l)} - w^{(l)})^{\top} M^{(l)} (c^{(l)} - w^{(l)}) + \dot{\rho}\left( \|\tilde{c}_j\|_2;\, \sqrt{L}\,\lambda_1, \gamma \right) \|c_j\|_2 + \mu \sum_{l < l'} \left( \frac{c_j^{(l)}}{\sqrt{(c_j^{(l)})^2 + \tau^2}} - \frac{c_j^{(l')}}{\sqrt{(c_j^{(l')})^2 + \tau^2}} \right)^2. \tag{12}$$
Thus, $c_j$ can be updated as follows. For $j = 1, \ldots, p$:

1. Initialize $s = 0$ and $\tilde{c}_j^{(0)} = \tilde{c}_j$.
2. Update $\tilde{c}_j^{(s+1)}$ by minimizing (12), a group-thresholding operation in which the threshold is determined by $\dot{\rho}(\|\tilde{c}_j^{(s)}\|_2;\, \sqrt{L}\,\lambda_1, \gamma)$ and the smooth sign-based penalty contributes through its derivatives at the current estimates.
3. Repeat Step 2 until convergence. The estimate at convergence is $\hat{c}_j$.