
Integrative Sparse Partial Least Squares

Weijuan Liang Shuangge Ma Qingzhao Zhang Tingyu Zhu Email: [email protected]
(03 July 2020)
Abstract

Partial least squares, as a dimension reduction method, has become increasingly important for its ability to deal with problems involving a large number of variables. Since noisy variables may weaken the performance of the model, the sparse partial least squares (SPLS) technique has been proposed to identify important variables and generate more interpretable results. However, the small sample size of a single dataset limits the performance of conventional methods. An effective solution is to gather information from multiple comparable studies. Integrative analysis holds an important place among multi-dataset analyses; its main idea is to improve estimation by assembling raw datasets and analyzing them jointly. In this paper, we develop an integrative SPLS (iSPLS) method using penalization based on the SPLS technique. The proposed approach consists of two penalties. The first penalty conducts variable selection under the context of integrative analysis; the second, contrasted penalty encourages the similarity of estimates across datasets and generates more reasonable and accurate results. Computational algorithms are provided. Simulation experiments are conducted to compare iSPLS with alternative approaches. The practical utility of iSPLS is shown in the analysis of two TCGA gene expression datasets.

1 Introduction

With the rapid development of technology comes the need to analyze data with high dimensions. Partial least squares (PLS), introduced by Wold et al. (1984), has been successfully used as a dimension reduction method in many research areas, such as chemometrics (Sjöström et al., 1983) and, more recently, genetics (Chun and Keleş, 2009). PLS reduces the variable dimension by constructing new components, which are linear combinations of the original variables. Its stability under collinearity and high dimensionality gives PLS a clear advantage over many other methods. However, in high-dimensional problems, noise accumulation from irrelevant variables has long been recognized (Fan and Lv, 2010). For example, in omics studies, it is widely accepted that only a small fraction of genes are associated with outcomes. To yield more accurate estimates and facilitate interpretation, variable selection needs to be considered. Chun and Keleş (2010) proposed a sparse PLS (SPLS) technique that conducts variable selection and dimension reduction simultaneously by imposing an Elastic Net penalty in the PLS optimization.

Another challenge that real data analyses often face is the unsatisfactory performance obtained from a single dataset (Guerra and Goldstein, 2009), especially one with a limited sample size. Owing to recent progress in data collection, it is now possible to integrate multiple datasets generated under similar protocols. Methods for analyzing multiple datasets include meta-analysis, integrative analysis, and others. Among them, integrative analysis has been proved effective both in theory and in practice, and has been shown to outperform other multi-dataset methods, including meta-analysis (Grützmann et al., 2005), in prediction and variable selection (Liu et al., 2015; Ma et al., 2011).

Considering the wide application of PLS/SPLS to high-dimensional data, we propose an integrative SPLS (iSPLS) method to remedy the aforementioned problems of the conventional SPLS technique caused by a limited sample size. Built on the SPLS technique, our method conducts the integrative analysis of multiple independent datasets using penalization to promote certain similarities and sparse structures among them, and to further improve the accuracy and reliability of variable selection and loading estimation. Our penalization involves two parts. The first penalty conducts variable selection under the paradigm of integrative analysis (Zhao et al., 2015), where a composite penalty is adopted to identify important variables under both the homogeneity and heterogeneity structures. The second penalty is motivated by empirical data analyses: datasets with comparable designs may exhibit a certain degree of similarity, and exploiting it may further improve analysis results. Our work advances the existing sparse PLS and integrative analysis studies by merging the dimension reduction technique with the integrative analysis paradigm. Furthermore, we accommodate both similarity and difference across multiple datasets through the proposed two-part penalization.

The rest of the paper is organized as follows. In Section 2, for completeness, we first briefly review the general principles of PLS and SPLS, and then formulate the iSPLS method and establish its algorithms. Simulation studies and applications to TCGA data are provided in Sections 3 and 4. Discussion is provided in Section 5. Additional technical details and numerical results are provided in the Appendix.

2 Methods

2.1 Sparse partial least squares

Let Y\in R^{n\times q} and X\in R^{n\times p} represent the response matrix and predictor matrix, respectively. PLS assumes that there exist latent components t_{k}, 1\leq k\leq K, which are linear combinations of the predictors, such that Y=TQ^{\top}+F and X=TP^{\top}+E, where T=(t_{1},\dots,t_{K})_{n\times K}, P\in R^{p\times K} and Q\in R^{q\times K} are matrices of coefficients (loadings), and E\in R^{n\times p} and F\in R^{n\times q} are matrices of random errors.

PLS solves the optimization problem for the direction vectors w_{k} successively. Specifically, w_{k} is the solution to the following problem:

maxwk{wkZZwk},\max\limits_{w_{k}}\{w_{k}^{\top}ZZ^{\top}w_{k}\}, (1)

which can be solved via the NIPALS (Wold et al., 1984) or SIMPLS (De Jong, 1993) algorithms under different constraints, and where Z=X^{\top}Y. After determining the number of direction vectors K, the latent components can be calculated as T=XW, where W=(w_{1},\dots,w_{K}). The final estimator is \hat{\beta}^{PLS}=W\hat{Q}^{\top}, where \hat{Q} is the solution of \min\limits_{Q}\{\left\|Y-TQ^{\top}\right\|^{2}_{2}\}. Details are available in Ter Braak and de Jong (1998).
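To make this construction concrete, the following minimal sketch (not the authors' code; function and variable names are ours) computes the first PLS direction vector as the leading left singular vector of Z=X^{\top}Y, which solves (1) under a unit-norm constraint.

```python
import numpy as np

def first_pls_direction(X, Y):
    """First PLS direction: the unit vector w maximizing w' Z Z' w, with Z = X'Y."""
    Z = X.T @ Y                                   # p x q cross-product matrix
    U, _, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, 0]                                # leading left singular vector of Z

# toy usage: the first latent component is t1 = X w1
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 20))
Y = rng.standard_normal((50, 5))
w1 = first_pls_direction(X, Y)
t1 = X @ w1
```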

In the analysis of high-dimensional data, a variable selection procedure needs to be considered to remove noise. Since noisy variables enter the PLS regression via the direction vectors, one possible approach is to adopt penalization in the optimization procedure, that is, to impose an \mathcal{L}_{1} constraint on the direction vector in problem (1). The first SPLS direction vector can then be obtained by solving the following problem:

maxw{wZZw},s.t.ww=1,|w|λ,\max\limits_{w}\left\{w^{\top}ZZ^{\top}w\right\},\quad\mbox{s.t.}~{}w^{\top}w=1,\left|w\right|\leq\lambda, (2)

where the tuning parameter λ\lambda controls the degree of sparsity.

However, Jolliffe et al. (2003) point out the concavity issue of this problem as well as the lack of sparsity of its solution. Chun and Keleş (2010) then develop a generalized form of the SPLS problem (2) given below, which can generate a sufficiently sparse solution.

minw,c\displaystyle\min\limits_{w,c} {κwZZw+(1κ)(cw)ZZ(cw)+λ1|c|1+λ2c22},\displaystyle\left\{-\kappa w^{\top}ZZ^{\top}w+(1-\kappa)(c-w)^{\top}ZZ^{\top}(c-w)\right.\left.+\lambda_{1}\left|c\right|_{1}+\lambda_{2}\left\|c\right\|_{2}^{2}\right\}, (3)
s.t.ww=1.\displaystyle\mbox{s.t.}~{}w^{\top}w=1.

In this problem, penalties are imposed on c, a surrogate of the direction vector that is kept close to w, rather than on the original direction vector. The additional \mathcal{L}_{2} penalty deals with the singularity of ZZ^{\top} when solving for c, and a small \kappa reduces the effect of the concave part. The solution of (3) is obtained by optimizing w and c iteratively.

2.2 Integrative sparse partial least squares

2.2.1 Data and Model Settings

In this section, we consider the case where LL datasets are from independent studies with comparable designs. Below, we develop an integrative sparse partial least squares (iSPLS) method to conduct an integrative analysis of these LL datasets based on the SPLS technique. Note that in the context of integrative analysis, datasets do not need to be fully comparable. With matched predictors, we further assume that data preprocessing, including imputation, centralization, and normalization, has been done for each dataset separately.

Following the notation in the existing integrative analysis literature (Huang et al., 2012b; Zhao et al., 2015), we use the superscript (l) to denote the lth dataset (Y^{(l)}_{n_{l}\times q},X^{(l)}_{n_{l}\times p}) with n_{l} i.i.d. observations, for l=1,\dots,L. As in SPLS for a single dataset, where the main interest is in the first direction vector, denote w_{j}^{(l)} as the weight of the jth variable in the first direction vector of the lth dataset, and w_{j}=(w_{j}^{(1)},\dots,w_{j}^{(L)})^{\top} as the “group” of weights of variable j in the first direction vectors of the L datasets, for j=1,\dots,p.

2.2.2 iSPLS with contrasted penalization

Following the generalized SPLS formulation in (3), we estimate the first direction vectors of the L datasets by minimizing the following penalized objective function:

l=1L\displaystyle\sum_{l=1}^{L} 12nl2(f(w(l),c(l))+λc(l)22)+pen1(c(1),,c(L))+pen2(c(1),,c(L))\displaystyle\frac{1}{2n_{l}^{2}}\left(f(w^{(l)},c^{(l)})+\lambda\|c^{(l)}\|^{2}_{2}\right)+\text{pen}_{1}\left(c^{(1)},\dots,c^{(L)}\right)+\text{pen}_{2}\left(c^{(1)},\dots,c^{(L)}\right) (4)
s.t.w(l)w(l)=1,\displaystyle\mbox{s.t.}~{}w^{(l)\top}w^{(l)}=1,

where f(w(l),c(l))=κw(l)Z(l)Z(l)w(l)+(1κ)(c(l)w(l))Z(l)Z(l)(c(l)w(l))f(w^{(l)},c^{(l)})=-\kappa w^{(l)\top}Z^{(l)}Z^{(l)\top}w^{(l)}+(1-\kappa)(c^{(l)}-w^{(l)})^{\top}Z^{(l)}Z^{(l)\top}(c^{(l)}-w^{(l)}), c(l)=(c1(l),,cp(l))c^{(l)}=(c_{1}^{(l)},\cdots,c_{p}^{(l)})^{\top}, w(l)=(w1(l),,wp(l))w^{(l)}=(w_{1}^{(l)},\cdots,w_{p}^{(l)})^{\top} and Z(l)=X(l)Y(l)Z^{(l)}=X^{(l)\top}Y^{(l)} .

In (4), f(w^{(l)},c^{(l)}) measures the goodness-of-fit of the lth dataset, and \|c^{(l)}\|^{2}_{2} serves the same role as in the SPLS method, dealing with the potential singularity when solving for c^{(l)}. To limit the influence of larger datasets, we use a weighted sum with weights given by the reciprocals of the squared sample sizes. As for the penalty functions, \text{pen}_{1}(\cdot) conducts variable selection in the context of integrative analysis, whereas \text{pen}_{2}(\cdot) accounts for the secondary model similarity structure. Below we provide detailed discussions of these two penalties.

2.2.3 Penalization for variable selection

We first consider the form of pen1()\text{pen}_{1}(\cdot). With LL datasets, LL sparsity structures of the direction vectors need to be considered. Integrative analysis considers two generic sparsity structures (Zhao et al., 2015), the homogeneity structure and the heterogeneity structure. Under the homogeneity structure, I(wj(1)=0)==I(wj(L)=0)I(w_{j}^{(1)}=0)=\dots=I(w_{j}^{(L)}=0), for any j{1,,p}j\in\{1,\dots,p\}, which means that the LL datasets share the same set of important variables. Under the heterogeneity structure, for some j{1,,p}j\in\{1,\dots,p\}, and l,l{1,,L}l,l^{\prime}\in\{1,\dots,L\}, it is possible that I(wj(l)=0)I(wj(l)=0)I(w_{j}^{(l)}=0)\not=I(w_{j}^{(l^{\prime})}=0), that is, one variable can be important in some datasets but irrelevant in others.

To achieve variable selection under the two sparsity structures, the composite penalty is used for pen1()\text{pen}_{1}(\cdot), with the MCP as the outer penalty, which determines whether a variable is relevant at all. The minimax concave penalty (MCP) is defined by ρ(t;λ,γ)=λ0|t|(1x/(λγ))+𝑑x\rho(t;\lambda,\gamma)=\lambda\int_{0}^{\left|t\right|}(1-x/(\lambda\gamma))_{+}\,dx (Zhang, 2010) and its derivative ρ˙(t;λ,γ)=λ(1|t|/(λγ))+sgn(t)\dot{\rho}(t;\lambda,\gamma)=\lambda(1-\left|t\right|/(\lambda\gamma))_{+}\text{sgn}(t), where λ\lambda is a penalty parameter, γ\gamma is a regularization parameter that controls the concavity of ρ\rho, x+=xI(x>0)x_{+}=xI(x>0), and sgn(t)=1,0,or1\text{sgn}(t)=-1,~{}0,~{}\mbox{or}~{}1 for t<0,=0,or>0t<0,~{}=0,~{}\mbox{or}>0, respectively. The inner penalties have different forms for the two sparsity structures.
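For concreteness, the following sketch evaluates the MCP and its derivative directly from the definitions above (an illustration with our own function names, not the authors' code).

```python
import numpy as np

def mcp(t, lam, gam):
    """MCP: lam*|t| - t^2/(2*gam) if |t| <= lam*gam, else lam^2*gam/2."""
    a = np.abs(t)
    return np.where(a <= lam * gam, lam * a - a ** 2 / (2.0 * gam), 0.5 * lam ** 2 * gam)

def mcp_deriv(t, lam, gam):
    """MCP derivative: lam * (1 - |t|/(lam*gam))_+ * sgn(t)."""
    return lam * np.maximum(1.0 - np.abs(t) / (lam * gam), 0.0) * np.sign(t)
```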

iSPLS under the homogeneity model

Consider the penalty function

\text{pen}_{1}\left(c^{(1)},\dots,c^{(L)}\right)=\sum_{j=1}^{p}\rho\left(\left\|c_{j}\right\|_{2};\mu_{1},a\right),

with regularization parameter aa and tuning parameter μ1\mu_{1}. Here the inner penalty cj2=l=1Lcj(l)2\left\|c_{j}\right\|_{2}=\sqrt{\sum_{l=1}^{L}c_{j}^{(l)2}} is the 2\mathcal{L}_{2} norm of cjc_{j}. Under this form of penalty, all the LL datasets select the same set of variables. The overall penalty is referred to as the 2-norm group MCP (Huang et al., 2012a; Ma et al., 2011).

iSPLS under the heterogeneity model

Consider the penalty function

\text{pen}_{1}\left(c^{(1)},\dots,c^{(L)}\right)=\sum_{j=1}^{p}\rho\left(\sum_{l=1}^{L}\rho(|c_{j}^{(l)}|;\mu_{1},a);1,b\right),

with regularization parameters aa and bb, and tuning parameter μ1\mu_{1}. Here the inner penalty, which also takes the form of MCP, determines the individual importance for a selected variable. We refer to this penalty as the composite MCP.
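As an illustration, the two forms of \text{pen}_{1}(\cdot) can be evaluated as in the sketch below, reusing the mcp() helper sketched in Section 2.2.3; C is a hypothetical p x L array holding c_j^{(l)} in its (j, l) entry.

```python
import numpy as np

def pen1_homogeneity(C, mu1, a):
    """2-norm group MCP: sum_j rho(||c_j||_2; mu1, a), with c_j grouped across datasets."""
    return float(np.sum(mcp(np.linalg.norm(C, axis=1), mu1, a)))

def pen1_heterogeneity(C, mu1, a, b):
    """Composite MCP: sum_j rho( sum_l rho(|c_j^(l)|; mu1, a); 1, b )."""
    inner = np.sum(mcp(np.abs(C), mu1, a), axis=1)   # inner MCP summed over the L datasets
    return float(np.sum(mcp(inner, 1.0, b)))
```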

2.2.4 Contrasted penalization

The 2-norm group MCP and composite MCP above mainly conduct variable selection, but deeper relationships among the datasets are ignored. It has been observed in empirical studies that the estimation results of independent studies may exhibit a certain degree of similarity in their magnitudes or signs, particularly when the datasets are generated by studies with similar designs (Grützmann et al., 2005; Guerra and Goldstein, 2009; Shi et al., 2014).

To utilize the similarity information and further improve estimation performance, we propose iSPLS with contrasted penalty pen2()\text{pen}_{2}(\cdot), which penalizes the difference between estimators within each group. Specifically, we propose the following two kinds of contrasted penalties, depending on the degree of similarity across the datasets.

Magnitude-based contrasted penalization

When datasets are highly comparable, for example, from the same study design but independently conducted, it is reasonable to expect the first direction vectors to have similar magnitudes. We propose a penalty that shrinks the differences between weights and thus encourages similarity within groups. Consider the magnitude-based contrasted penalty

\text{pen}_{2}\left(c^{(1)},\dots,c^{(L)}\right)=\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\left(c_{j}^{(l)}-c_{j}^{(l^{\prime})}\right)^{2},

where \mu_{2}>0 is a tuning parameter. Overall, we refer to this approach as iSPLS-HomoM (iSPLS-HeteroM), with the subscript ‘M’ standing for magnitude. Here, we choose the \mathcal{L}_{2} penalty for computational simplicity, and note that it can be replaced by other penalties.

Sign-based contrasted penalization

Under certain scenarios, requiring similarity in magnitudes is overly demanding, and it is more reasonable to expect/encourage the first direction vectors of the L datasets to have similar signs (Fang et al., 2018), which is a weaker requirement. Here we propose the following sign-based contrasted penalty:

\text{pen}_{2}\left(c^{(1)},\dots,c^{(L)}\right)=\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\left\{\text{sgn}(c_{j}^{(l)})-\text{sgn}(c_{j}^{(l^{\prime})})\right\}^{2},

where μ2>0\mu_{2}>0 is a tuning parameter, and sgn(t)=1,0\text{sgn}(t)=-1,~{}0, or 11 if t<0,=0t<0,~{}=0, or t>0t>0. Note that the sign-based penalty is not continuous, which brings challenges to optimization. We further propose the following approximation to tackle this non-smooth optimization problem:

μ22j=1pll(cj(l)cj(l)2+τ2cj(l)cj(l)2+τ2)2,\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\Bigg{(}\frac{c_{j}^{(l)}}{\sqrt{c_{j}^{(l)2}+\tau^{2}}}-\frac{c_{j}^{(l^{\prime})}}{\sqrt{c_{j}^{(l^{\prime})2}+\tau^{2}}}\Bigg{)}^{2},

where τ>0\tau>0 is a small positive constant.
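The two contrasted penalties, including the smooth surrogate of the sign-based one, can be evaluated as in the sketch below (again with C a hypothetical p x L array of the c_j^{(l)}); the double sum is taken over ordered pairs l' \neq l, as in the displays above.

```python
import numpy as np

def pen2_magnitude(C, mu2):
    """(mu2/2) * sum_j sum_{l' != l} (c_j^(l) - c_j^(l'))^2."""
    diffs = C[:, :, None] - C[:, None, :]         # p x L x L pairwise differences
    return 0.5 * mu2 * float(np.sum(diffs ** 2))

def pen2_sign(C, mu2, tau):
    """Smooth surrogate of the sign-based penalty, with small constant tau > 0."""
    S = C / np.sqrt(C ** 2 + tau ** 2)            # smooth approximation of sgn(c_j^(l))
    diffs = S[:, :, None] - S[:, None, :]
    return 0.5 * mu2 * float(np.sum(diffs ** 2))
```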

Under the ‘regression analysis + variable selection’ framework, contrasted penalization methods similar to the proposed one have been developed (Fang et al., 2018). For the jth variable, the contrasted penalty encourages the direction vectors in different datasets to have similar magnitudes/signs, rather than forcing them to be identical. Even under the heterogeneity model, the two contrasted penalties remain sensible; for example, they can encourage similarity within a group by pulling a nonzero loading with a relatively small value towards zero. The degree of similarity is adjusted by the tuning parameter \mu_{2}. Shrinkage of the differences between parameter estimates based on magnitude or sign has been considered in the literature (Chiquet et al., 2011; Wang et al., 2016), but remains novel in the context on which we focus.

2.3 Computation

For the methods proposed in Section 2.2, the computational algorithms share the same strategy as the SPLS procedure (Chun and Keleş, 2010), where w^{(l)} and c^{(l)} are optimized iteratively for l=1,\dots,L. With fixed tuning and regularization parameters, the steps below are repeated until convergence.

Algorithm 1: Computational Algorithm for iSPLS

1. Initialize. For l=1,\dots,L:
   a. Apply partial least squares regression of Y^{(l)} on X^{(l)}, and obtain the first direction vector w^{(l)}.
   b. Set t=0, c_{[t]}^{(l)}=w_{[t]}^{(l)}=w^{(l)}, and Z^{(l)}=X^{(l)\top}Y^{(l)}.
2. Update:
   a. Optimize (4) over w^{(l)}_{[t]} with c^{(l)}_{[t-1]} fixed.
   b. Optimize (4) over c^{(l)}_{[t]} with w^{(l)}_{[t]} fixed.
3. Repeat Step 2 until convergence. In our simulations, convergence is declared when the \mathcal{L}_{2} norm of the difference between two consecutive estimates falls below a predetermined threshold.
4. Normalize the final c^{(l)}_{[t]} as w^{(l)}=c^{(l)}_{[t]}/\|c^{(l)}_{[t]}\|_{2} for each l=1,\dots,L.
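The sketch below outlines Algorithm 1 in code. The callables update_w and update_c are hypothetical stand-ins for Steps 2(a) and 2(b) (the former has the closed form given below in this section; the latter depends on the chosen penalties). This is an illustration of the iteration flow, not the authors' implementation.

```python
import numpy as np

def ispls_first_directions(X_list, Y_list, update_w, update_c, tol=1e-4, max_iter=100):
    """Iteratively estimate the first direction vectors w^(1), ..., w^(L)."""
    Z_list = [X.T @ Y for X, Y in zip(X_list, Y_list)]
    # Step 1: initialize each c^(l) (and implicitly w^(l)) with the ordinary PLS direction
    c_list = [np.linalg.svd(Z, full_matrices=False)[0][:, 0] for Z in Z_list]
    for _ in range(max_iter):
        c_old = [c.copy() for c in c_list]
        # Step 2(a): update each w^(l) separately, given c^(l)
        w_list = [update_w(Z, c) for Z, c in zip(Z_list, c_list)]
        # Step 2(b): update the surrogates c^(1), ..., c^(L) jointly, given the w^(l)
        c_list = update_c(Z_list, w_list, c_list)
        # Step 3: stop when consecutive estimates are close
        if max(np.linalg.norm(c - co) for c, co in zip(c_list, c_old)) < tol:
            break
    # Step 4: normalize the final surrogates
    return [c / np.linalg.norm(c) for c in c_list]
```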

In Algorithm 1, the key is Step 2. For Step 2(a), with fixed c[t1](l)c^{(l)}_{[t-1]}, the objective function (4) becomes

minw(l)l=1L{κw(l)Z(l)Z(l)w(l)+(1κ)(c[t1](l)w(l))Z(l)Z(l)(c[t1](l)w(l))},\displaystyle\min\limits_{w^{(l)}}\sum^{L}_{l=1}\left\{-\kappa w^{(l)\top}Z^{(l)}Z^{(l)\top}w^{(l)}\right.\left.+(1-\kappa)(c^{(l)}_{[t-1]}-w^{(l)})^{\top}Z^{(l)}Z^{(l)\top}(c^{(l)}_{[t-1]}-w^{(l)})\right\}, (5)

which does not involve the group part. Thus, we can optimize w(l)w^{(l)} in each dataset separately. Problem (5) can be written as

minw(l)\displaystyle\min\limits_{w^{(l)}} Z(l)w(l)κZ(l)c[t1](l)22,s.t.w(l)w(l)=1,forl=1,,L,\displaystyle\left\|Z^{(l)\top}w^{(l)}-\kappa^{\prime}Z^{(l)\top}c^{(l)}_{[t-1]}\right\|_{2}^{2},\quad\mbox{s.t.}~{}w^{(l)\top}w^{(l)}=1,~{}\text{for}~{}l=1,\dots,L,

where \kappa^{\prime}=(1-\kappa)/(1-2\kappa). Then, by the method of Lagrange multipliers, we have

w[t](l)=κ(Z(l)Z(l)+λ(l)I)1Z(l)Z(l)c[t1](l),w_{[t]}^{(l)}=\kappa^{\prime}(Z^{(l)}Z^{(l)\top}+\lambda^{*(l)}I)^{-1}Z^{(l)}Z^{(l)\top}c^{(l)}_{[t-1]},

where the multiplier \lambda^{*(l)} is the solution of 1/\kappa^{\prime 2}=c^{(l)\top}_{[t-1]}Z^{(l)}Z^{(l)\top}(Z^{(l)}Z^{(l)\top}+\lambda I)^{-2}Z^{(l)}Z^{(l)\top}c^{(l)}_{[t-1]}, which enforces the constraint w_{[t]}^{(l)\top}w_{[t]}^{(l)}=1.
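A minimal sketch of this Step 2(a) update is given below: it forms w^{(l)}=\kappa^{\prime}(Z^{(l)}Z^{(l)\top}+\lambda^{*}I)^{-1}Z^{(l)}Z^{(l)\top}c^{(l)} and locates the multiplier by bisection so that the unit-norm constraint holds (assuming a nonnegative multiplier exists); the function name and default value of kappa are our illustrative choices.

```python
import numpy as np

def update_w(Z, c, kappa=0.4):
    """Closed-form w-update for one dataset: kappa'(ZZ' + lam I)^{-1} ZZ' c with ||w|| = 1."""
    kp = (1.0 - kappa) / (1.0 - 2.0 * kappa)      # kappa'
    M = Z @ Z.T
    d, V = np.linalg.eigh(M)                      # M = V diag(d) V'
    g = V.T @ (M @ c)                             # coordinates of ZZ'c in the eigenbasis

    def norm_w(lam):
        return kp * np.linalg.norm(g / (d + lam))

    lo, hi = 1e-10, 1.0
    while norm_w(hi) > 1.0:                       # expand until the norm drops below 1
        hi *= 10.0
    for _ in range(100):                          # bisection: norm_w is decreasing in lam
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if norm_w(mid) > 1.0 else (lo, mid)
    return kp * (V @ (g / (d + 0.5 * (lo + hi))))
```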

For Step 2(b), when solving c(l)c^{(l)} for fixed w[t](l)w_{[t]}^{(l)}, problem (4) becomes

minc(l)l=1L12nl2(Z(l)c(l)Z(l)w[t](l)22+λc(l)22)\displaystyle\min\limits_{c^{(l)}}\sum_{l=1}^{L}\frac{1}{2n_{l}^{2}}\left(\left\|Z^{(l)\top}c^{(l)}-Z^{(l)\top}w_{[t]}^{(l)}\right\|_{2}^{2}+\lambda\left\|c^{(l)}\right\|_{2}^{2}\right) +pen1(c(1),,c(L))\displaystyle+\text{pen}_{1}\left(c^{(1)},\dots,c^{(L)}\right)
+pen2(c(1),,c(L)).\displaystyle+\text{pen}_{2}\left(c^{(1)},\dots,c^{(L)}\right).

The iSPLS algorithms under the homogeneity and heterogeneity models are different. We adopt the coordinate descent (CD) approach, which minimizes the objective function with respect to one group of coefficients at a time and cycles through all groups. This method transforms a complicated minimization problem into a series of simple ones. The remainder of this section describes the CD algorithm for the heterogeneity model with a sign-based contrasted penalty. The computational algorithms for the homogeneity model and heterogeneity model with a magnitude-based contrasted penalty are described in the Appendix.

2.3.1 iSPLS with the composite MCP

Consider the heterogeneity model with the sign-based contrasted penalty,

minc(l)\displaystyle\min\limits_{c^{(l)}} l=1L12nl2(Z(l)c(l)Z(l)w[t](l)22+λc(l)22)+j=1pρ(l=1Lρ(|cj(l)|;μ1,a);1,b)\displaystyle\sum_{l=1}^{L}\frac{1}{2n_{l}^{2}}\left(\left\|Z^{(l)\top}c^{(l)}-Z^{(l)\top}w_{[t]}^{(l)}\right\|_{2}^{2}+\lambda\left\|c^{(l)}\right\|_{2}^{2}\right)+\sum_{j=1}^{p}\rho\left(\sum_{l=1}^{L}\rho(|c_{j}^{(l)}|;\mu_{1},a);1,b\right) (6)
+μ22j=1pll{sgn(cj(l))sgn(cj(l))}2.\displaystyle+\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\left\{\text{sgn}(c_{j}^{(l)})-\text{sgn}(c_{j}^{(l^{\prime})})\right\}^{2}.

For j=1,\dots,p, with the group parameter vectors c_{k}^{(l)} (k\not=j) fixed at their current estimates c^{(l)}_{k,[t-1]}, we minimize the objective function (6) with respect to c^{(l)}_{j}. Here \lambda is required to be very large because Z^{(l)} is a p\times q matrix with a relatively small q (Chun and Keleş, 2010). With \lambda=\infty, we take the first-order Taylor expansion about c_{j}^{(l)} for the first penalty; the problem is then approximately equivalent to minimizing

12\displaystyle\frac{1}{2} cj(l)2w[t](l)Z(l)Zj(l)cj(l)+αjl|cj(l)|+μ22ll(cj(l)cj(l)2+τ2cj,[t1](l)cj,[t1](l)2+τ2)2,\displaystyle c^{(l)2}_{j}-w_{[t]}^{(l)\top}Z^{(l)}Z^{(l)\top}_{j}c^{(l)}_{j}+\alpha_{jl}|c^{(l)}_{j}|+\frac{\mu_{2}^{*}}{2}\sum_{l^{\prime}\neq l}\Bigg{(}\frac{c_{j}^{(l)}}{\sqrt{c_{j}^{(l)2}+\tau^{2}}}-\frac{c_{j,[t-1]}^{(l^{\prime})}}{\sqrt{c_{j,[t-1]}^{(l^{\prime})2}+\tau^{2}}}\Bigg{)}^{2},

where αjl=ρ˙(l=1Lρ(|cj,[t1](l)|;μ1,a);1,b)ρ˙(|cj,[t1](l)|;μ1,a)\alpha_{jl}=\dot{\rho}(\sum_{l=1}^{L}\rho(|c_{j,[t-1]}^{(l)}|;\mu_{1},a);1,b)\dot{\rho}(|c^{(l)}_{j,[t-1]}|;\mu_{1},a) and μ2=μ2nl2\mu_{2}^{*}=\mu_{2}n_{l}^{2}.

Thus, c^{(l)}_{j,[t]} can be updated as follows (a code sketch of this update is given after the steps below): for l=1,\dots,L,

  • 1.

    Initialize r=0r=0 and cj,[r](l)=cj,[t1](l)c^{(l)}_{j,[r]}=c^{(l)}_{j,[t-1]}.

  • 2.

    Update r=r+1r=r+1. Compute:

    cj,[r](l)=sgn(Sj,[r1](l))(|Sj,[r1](l)|αjl)+(1+μ2(L1))/(cj,[r1](l)2+τ2),c^{(l)}_{j,[r]}=\frac{\text{sgn}(S_{j,[r-1]}^{(l)})(|S_{j,[r-1]}^{(l)}|-\alpha_{jl})_{+}}{(1+\mu_{2}^{*}(L-1))/(c^{(l)2}_{j,[r-1]}+\tau^{2})},

    where

    Sj,[r1](l)=\displaystyle S_{j,[r-1]}^{(l)}= m=1pi=1qwm(l)Zmi(l)Zji(l)+μ2cj,[r1](l)2+τ2llcj,[r1](l)cj,[r1](l)2+τ2,\displaystyle\sum_{m=1}^{p}\sum_{i=1}^{q}w^{(l)}_{m}Z^{(l)}_{mi}Z^{(l)}_{ji}+\frac{\mu_{2}^{*}}{\sqrt{c^{(l)2}_{j,[r-1]}+\tau^{2}}}\sum_{l^{\prime}\not=l}\frac{c_{j,[r-1]}^{(l^{\prime})}}{\sqrt{c_{j,[r-1]}^{(l^{\prime})2}+\tau^{2}}},

    and αjl=ρ˙(l=1Lρ(|cj,[r1](l)|;μ1,a);1,b)ρ˙(|cj,[r1](l)|;μ1,a).\alpha_{jl}=\dot{\rho}(\sum_{l=1}^{L}\rho(|c_{j,[r-1]}^{(l)}|;\mu_{1},a);1,b)\dot{\rho}(|c^{(l)}_{j,[r-1]}|;\mu_{1},a).

  • 3.

    Repeat Step 2 until convergence. The estimate at convergence is cj,[t](l)c^{(l)}_{j,[t]}.
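For reference, the sketch below transcribes this single-coefficient update literally, reusing the mcp() and mcp_deriv() helpers sketched in Section 2.2.3; argument names such as c_prev and mu2_star (for \mu_{2}^{*}=\mu_{2}n_{l}^{2}) are ours, and the function handles one (j, l) pair within one cycle of the coordinate descent.

```python
import numpy as np

def update_c_hetero_sign(Z_list, w_list, c_prev, j, l, mu1, a, b, mu2_star, tau2):
    """One iSPLS-HeteroS update of c_j^(l), the other coefficients held at c_prev (p x L)."""
    L = len(Z_list)
    Z, w = Z_list[l], w_list[l]
    # S_j^(l): fit term plus the smoothed sign attraction towards the other datasets
    fit = float(w @ Z @ Z[j, :])                               # sum_m sum_i w_m Z_mi Z_ji
    pull = sum(c_prev[j, lp] / np.sqrt(c_prev[j, lp] ** 2 + tau2)
               for lp in range(L) if lp != l)
    S = fit + mu2_star / np.sqrt(c_prev[j, l] ** 2 + tau2) * pull
    # alpha_jl from the composite MCP (outer derivative times inner derivative)
    inner = float(np.sum(mcp(np.abs(c_prev[j, :]), mu1, a)))
    alpha = float(mcp_deriv(inner, 1.0, b) * mcp_deriv(abs(c_prev[j, l]), mu1, a))
    # soft-thresholding step with the scaling factor as displayed in the text
    denom = (1.0 + mu2_star * (L - 1)) / (c_prev[j, l] ** 2 + tau2)
    return np.sign(S) * max(abs(S) - alpha, 0.0) / denom
```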

Tuning parameter selection

iSPLS-HeteroS involves regularization parameters a and b. Breheny and Huang (2009) suggested linking them so that the group-level penalty attains its maximum if and only if each of its components is at its maximum. Following published studies, we set a=6. Using this link between the inner and outer penalties, we set b=\frac{1}{2}La\mu_{1}^{2}. iSPLS-HomoS involves only the regularization parameter a, which is also set to 6. We use cross-validation to choose the tuning parameters \mu_{1} and \mu_{2}. Furthermore, iSPLS-HeteroS involves \tau. In our study, we fix \tau^{2}=0.5, following the suggestion of setting it to a small positive number (Dicker et al., 2013). Published results suggest that the approximation is valid as long as \tau is not too large, so that it can still differentiate parameters with different signs.

3 Simulation

We simulate four independent studies, each with sample size 40 or 120 and 5 response variables. For each sample, we simulate 100 predictor variables, which are jointly normally distributed with marginal means zero and variances one. The predictor variables have an auto-regressive correlation structure, where variables j and k have correlation coefficient \rho^{\left|j-k\right|}, with \rho= 0.2 and 0.7 corresponding to weak and strong correlations, respectively. All scenarios follow the model Y^{(l)}=X^{(l)}\beta^{(l)}+\epsilon^{(l)}, where \epsilon^{(l)} is normally distributed with mean zero. Following the data-generating mechanism in Chun and Keleş (2010), the columns \beta_{i}^{(l)}, for i=2,\dots,5, are generated as \beta_{i}^{(l)}=1.2^{i-1}{\beta}_{1}^{(l)}. The sparsity structures of the direction vectors w^{(l)} are controlled by \beta_{1}^{(l)}. Within each dataset, the number of variables associated with the responses is set to 10, and the nonzero coefficients of \beta_{1}^{(l)} range from 0.5 to 4. We simulate under both the homogeneity and heterogeneity models.
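The following sketch generates one dataset under this design; the exact positions of the 10 important variables, their spacing between 0.5 and 4, and the unit error variance are illustrative choices, as the text reports only the ranges.

```python
import numpy as np

def simulate_dataset(n=40, p=100, q=5, rho=0.2, seed=0):
    """One simulated dataset: AR(1) predictors and beta_i = 1.2^{i-1} * beta_1."""
    rng = np.random.default_rng(seed)
    idx = np.arange(p)
    Sigma = rho ** np.abs(idx[:, None] - idx[None, :])         # corr(x_j, x_k) = rho^|j-k|
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
    beta1 = np.zeros(p)
    beta1[:10] = np.linspace(0.5, 4.0, 10)                     # 10 important variables
    beta = np.column_stack([1.2 ** i * beta1 for i in range(q)])
    Y = X @ beta + rng.standard_normal((n, q))                 # normal errors, mean zero
    return X, Y

# four independent studies with a common sparsity structure, e.g. Scenario 1
data = [simulate_dataset(n=40, rho=0.2, seed=s) for s in range(4)]
```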

Under the homogeneity model, the direction vectors have the same sparsity structure, with similar or different nonzero values, corresponding to Scenario 1 and Scenario 2, respectively. Under the heterogeneity model, two scenarios are considered. In Scenario 3, the four datasets share 5 important variables in common, and the remaining important variables are dataset-specific; that is, the direction vectors have partially overlapping sparsity structures. In Scenario 4, the direction vectors have random sparsity structures with random overlaps. These four scenarios cover different degrees of overlap in the sparsity structures.

To better gauge the performance of the proposed approach, we also consider the following alternatives: (a) meta-analysis, where each dataset is analyzed separately using the PLS or SPLS approach and results are then combined across datasets via meta-analysis; and (b) a pooled approach, where the four datasets are pooled together and analyzed by SPLS as a whole. For all approaches, the tuning parameters are selected via 5-fold cross-validation. To evaluate the accuracy of variable selection, the averages of sensitivities and specificities are computed across replicates. We also evaluate prediction performance by calculating mean-squared prediction errors (MSPE).

Summary statistics based on 50 replicates are presented in Tables 1-4. The simulations indicate that the proposed integrative analysis method outperforms its competitors. More specifically, under the fully overlapping (homogeneity) case, when the magnitudes of nonzero values are similar across datasets (Scenario 1), iSPLS-HomoM has the most competitive performance. For example, in Table 1, with \rho=0.2 and n=120, MSPEs are 49.062 (meta-PLS), 5.686 (meta-SPLS), 1.350 (pooled-SPLS), 2.002 (iSPLS-HomoM), 2.414 (iSPLS-HomoS), 3.368 (iSPLS-HeteroM), and 3.559 (iSPLS-HeteroS). Note that under Scenario 1, the performance of iSPLS-HomoM and iSPLS-HomoS may be slightly inferior to that of pooled-SPLS: with fully comparable datasets, it is sensible to pool all data together, and pooled-SPLS may thus generate more accurate results. However, when the nonzero values are quite different across datasets (Scenario 2), as can be seen from Table 2, iSPLS-HomoS outperforms the others, including pooled-SPLS. Under the partially overlapping Scenario 3 (heterogeneity model), iSPLS-HeteroM and iSPLS-HeteroS have better performance; for example, when \rho=0.7 and n=40, they have higher sensitivities (0.821 and 0.825, compared to 0.675, 0.575, 0.800, and 0.800 for the alternatives), smaller MSPEs (24.637 and 23.734, compared to 268.880, 30.928, 84.875, 40.867, and 39.492 for the alternatives), and similar specificities. Even under the non-overlapping Scenario 4, which is not favourable to multi-dataset analysis, the proposed integrative analysis still has reasonable performance. Thus, when the overlapping structure of multiple datasets is unknown, our integrative analysis methods have the potential to generate more satisfactory results than meta-analysis.

Table 1: Simulation results for Scenario 1 (M=4, p=100)
ρ\rho nln_{l} Method MSPE Sensitivity Specificity
0.2 40 meta-PLS 48.972 (4.676) 1 (0) 0 (0)
meta-SPLS 24.739 (3.879) 0.632 (0.135) 0.873 (0.111)
pooled-SPLS 4.377 (2.486) 0.810 (0.127) 0.999 (0.003)
iSPLS-HomoM 9.452 (4.369) 0.840 (0.110) 0.982 (0.018)
iSPLS-HomoS 10.151 (4.027) 0.837 (0.119) 0.980 (0.022)
iSPLS-HeteroM 18.287 (6.022) 0.845 (0.152) 0.757 (0.063)
iSPLS-HeteroS 15.462 (6.251) 0.875 (0.143) 0.743 (0.060)
0.2 120 meta-PLS 49.062 (4.151) 1 (0) 0 (0)
meta-SPLS 5.686 (2.056) 0.799 (0.053) 0.994 (0.007)
pooled-SPLS 1.350 (1.229) 0.937 (0.025) 0.999 (0.000)
iSPLS-HomoM 2.002 (0.920) 0.993 (0.008) 0.956 (0.016)
iSPLS-HomoS 2.414 (0.951) 0.997 (0.008) 0.929 (0.014)
iSPLS-HeteroM 3.368 (1.211) 0.955 (0.039) 0.945 (0.019)
iSPLS-HeteroS 3.559 (1.297) 0.982 (0.051) 0.872 (0.007)
0.7 40 meta-PLS 106.532 (7.066) 1 (0) 0 (0)
meta-SPLS 16.212 (4.033) 0.828 (0.063) 0.962 (0.011)
pooled-SPLS 5.984 (1.939) 0.893 (0.065) 0.984 (0.037)
iSPLS-HomoM 6.956 (1.885) 0.967 (0.018) 0.947 (0.021)
iSPLS-HomoS 7.000 (2.067) 0.967 (0.018) 0.946 (0.020)
iSPLS-HeteroM 13.630 (3.817) 0.896 (0.109) 0.946 (0.019)
iSPLS-HeteroS 13.855 (3.778) 0.909 (0.112) 0.942 (0.020)
0.7 120 meta-PLS 102.629 (9.225) 1 (0) 0 (0)
meta-SPLS 4.824 (1.913) 0.912 (0.049) 0.985 (0.012)
pooled-SPLS 2.454 (1.481) 0.883 (0.056) 0.994 (0.023)
iSPLS-HomoM 2.292 (0.829) 0.987 (0.018) 0.977 (0.014)
iSPLS-HomoS 2.356 (0.785) 0.987 (0.018) 0.976 (0.014)
iSPLS-HeteroM 3.718 (0.995) 0.988 (0.051) 0.948 (0.013)
iSPLS-HeteroS 3.609 (1.077) 0.997 (0.035) 0.942 (0.012)
Table 2: Simulation results for Scenario 2 (M=4, p=100)
ρ\rho nln_{l} Method MSPE Sensitivity Specificity
0.2 40 meta-PLS 87.769 (14.532) 1 (0) 0 (0)
meta-SPLS 31.173 (8.422) 0.532 (0.074) 0.919 (0.073)
pooled-SPLS 33.533 (4.519) 0.883 (0.095) 0.976 (0.042)
iSPLS-HomoM 17.567 (5.086) 0.993 (0.025) 0.681 (0.084)
iSPLS-HomoS 16.881 (4.548) 0.993 (0.025) 0.681 (0.084)
iSPLS-HeteroM 28.803 (6.574) 0.756 (0.122) 0.774 (0.057)
iSPLS-HeteroS 25.990 (5.446) 0.819 (0.102) 0.739 (0.063)
0.2 120 meta-PLS 85.138 (4.172) 1 (0) 0 (0)
meta-SPLS 9.015 (1.283) 0.672 (0.054) 0.994 (0.005)
pooled-SPLS 27.068 (0.867) 0.993 (0.076) 1.000 (0.003)
iSPLS-HomoM 3.673 (0.552) 1.000 (0.025) 0.983 (0.024)
iSPLS-HomoS 3.589 (0.649) 1.000 (0.018) 0.982 (0.043)
iSPLS-HeteroM 6.050 (0.555) 0.898 (0.040) 0.956 (0.024)
iSPLS-HeteroS 6.674 (0.776) 0.949 (0.030) 0.939 (0.032)
0.7 40 meta-PLS 192.366 (10.990) 1 (0) 0 (0)
meta-SPLS 28.179 (5.592) 0.652 (0.078) 0.981 (0.015)
pooled-SPLS 65.284 (5.221) 0.970 (0.101) 0.963 (0.018)
iSPLS-HomoM 10.186 (4.096) 0.997 (0.055) 0.948 (0.023)
iSPLS-HomoS 9.909 (4.031) 0.997 (0.055) 0.947 (0.022)
iSPLS-HeteroM 23.300 (9.108) 0.741 (0.096) 0.953 (0.017)
iSPLS-HeteroS 22.974 (9.806) 0.765 (0.095) 0.950 (0.019)
0.7 120 meta-PLS 175.348 (8.390) 1 (0) 0 (0)
meta-SPLS 14.871 (1.943) 0.745 (0.041) 0.975 (0.010)
pooled-SPLS 61.626 (0.758) 0.963 (0.059) 0.986 (0.008)
iSPLS-HomoM 5.923 (0.913) 0.997 (0.035) 0.972 (0.012)
iSPLS-HomoS 5.764 (0.913) 0.997 (0.035) 0.971 (0.013)
iSPLS-HeteroM 10.742 (1.252) 0.911 (0.016) 0.917 (0.012)
iSPLS-HeteroS 9.354 (1.267) 0.946 (0.009) 0.912 (0.011)
Table 3: Simulation results for Scenario 3 (M=4, p=100)
ρ\rho nln_{l} Method MSPE Sensitivity Specificity
0.2 40 meta-PLS 76.919 (11.918) 1 (0) 0 (0)
meta-SPLS 35.372 (6.920) 0.551 (0.118) 0.900 (0.087)
pooled-SPLS 54.006 (6.920) 0.675 (0.289) 0.730 (0.289)
iSPLS-HomoM 28.495 (4.416) 0.900 (0.057) 0.589 (0.062)
iSPLS-HomoS 27.897 (4.231) 0.900 (0.069) 0.589 (0.070)
iSPLS-HeteroM 23.201 (5.788) 0.800(0.137) 0.847 (0.042)
iSPLS-HeteroS 21.616 (5.632) 0.800(0.134) 0.856 (0.039)
0.2 120 meta-PLS 84.613 (16.931) 1 (0) 0 (0)
meta-SPLS 10.995 (2.382) 0.696 (0.082) 0.990 (0.010)
pooled-SPLS 44.243 (3.532) 0.600 (0.190) 0.847 (0.125)
iSPLS-HomoM 12.445 (1.995) 0.902 (0.050) 0.683 (0.049)
iSPLS-HomoS 12.471 (1.993) 0.908 (0.049) 0.674 (0.058)
iSPLS-HeteroM 8.699 (1.768) 0.882 (0.050) 0.926 (0.016)
iSPLS-HeteroS 8.467 (1.826) 0.882 (0.049) 0.931 (0.015)
0.7 40 meta-PLS 268.880 (12.323) 1 (0) 0 (0)
meta-SPLS 30.928 (7.532) 0.675 (0.084) 0.939 (0.022)
pooled-SPLS 84.875 (7.594) 0.575 (0.152) 0.909 (0.073)
iSPLS-HomoM 40.867 (6.147) 0.800 (0.084) 0.700 (0.152)
iSPLS-HomoS 39.492 (5.919) 0.800 (0.080) 0.700 (0.171)
iSPLS-HeteroM 24.637 (6.887) 0.821 (0.102) 0.900 (0.051)
iSPLS-HeteroS 23.734 (6.373) 0.825 (0.111) 0.911 (0.068)
0.7 120 meta-PLS 258.583 (8.390) 1 (0) 0 (0)
meta-SPLS 12.631 (2.791) 0.900 (0.062) 0.971 (0.011)
pooled-SPLS 73.999 (6.112) 0.800 (0.138) 0.772 (0.135)
iSPLS-HomoM 20.475 (3.493) 0.998 (0.010) 0.364 (0.066)
iSPLS-HomoS 20.463 (3.445) 0.998 (0.010) 0.364 (0.066)
iSPLS-HeteroM 10.228 (2.837) 0.988 (0.019) 0.895 (0.022)
iSPLS-HeteroS 10.113 (2.818) 0.988 (0.062) 0.895 (0.011)
Table 4: Simulation results for Scenario 4 (M=4, p=100)
ρ\rho nln_{l} Method MSPE Sensitivity Specificity
0.2 40 meta-PLS 203.530 (25.691) 1 (0) 0 (0)
meta-SPLS 92.530 (21.452) 0.580 (0.105) 0.880 (0.096)
pooled-SPLS 174.432 (14.915) 0.491 (0.300) 0.608 (0.288)
iSPLS-HomoM 100.245 (14.246) 0.852 (0.067) 0.404 (0.084)
iSPLS-HomoS 98.013 (14.990) 0.851 (0.064) 0.403 (0.073)
iSPLS-HeteroM 79.508 (19.775) 0.633 (0.111) 0.918 (0.026)
iSPLS-HeteroS 81.041 (19.177) 0.626 (0.106) 0.920 (0.025)
0.2 120 meta-PLS 233.403 (41.459) 1 (0) 0 (0)
meta-SPLS 26.850 (5.595) 0.689 (0.060) 0.994 (0.067)
pooled-SPLS 155.342 (12.988) 0.500 (0.098) 0.717 (0.041)
iSPLS-HomoM 42.005 (5.538) 0.914 (0.047) 0.496 (0.055)
iSPLS-HomoS 41.962 (5.566) 0.913 (0.047) 0.498 (0.063)
iSPLS-HeteroM 24.120 (4.955) 0.878 (0.076) 0.925 (0.025)
iSPLS-HeteroS 24.177 (4.865) 0.88 (0.077) 0.926 (0.018)
0.7 40 meta-PLS 542.745 (91.125) 1 (0) 0 (0)
meta-SPLS 69.596 (22.876) 0.654 (0.089) 0.974 (0.024)
pooled-SPLS 357.967 (34.464) 0.401 (0.236 ) 0.753 (0.219)
iSPLS-HomoM 100.322 (17.976) 0.937 (0.055) 0.437 (0.067)
iSPLS-HomoS 97.774 (20.713) 0.937 (0.054) 0.436 (0.069)
iSPLS-HeteroM 66.089 (16.784) 0.904 (0.083) 0.776 (0.041)
iSPLS-HeteroS 64.131 (16.378) 0.904 (0.089) 0.771 (0.024)
0.7 120 meta-PLS 636.34 (73.501) 1 (0) 0 (0)
meta-SPLS 35.067 (8.960) 0.872 (0.015) 0.954 (0.057)
pooled-SPLS 331.250 (9.337) 0.469 (0.110) 0.773 (0.075)
iSPLS-HomoM 56.381 (11.017) 0.992 (0.047) 0.465 (0.018)
iSPLS-HomoS 56.234 (11.021) 0.993 (0.047) 0.461 (0.019)
iSPLS-HeteroM 31.622 (8.855) 0.943 (0.066) 0.913 (0.016)
iSPLS-HeteroS 30.625 (8.501) 0.943 (0.063) 0.911 (0.017)

To sum up, under the homogeneity cases, iSPLS-HomoM and iSPLS-HomoS have the most favourable performance, and under the heterogeneity cases, iSPLS-HeteroM and iSPLS-HeteroS outperform the others. It is also interesting to observe that the performance of the contrasted penalties depends on the degree of similarity across datasets. For example, in Table 2, iSPLS-HomoS (iSPLS-HeteroS), with a less stringent penalty, has relatively lower MSPEs than iSPLS-HomoM (iSPLS-HeteroM), while in Table 1 it is the other way around. This comparison supports the rationale of the proposed contrasted penalization.

4 Data analysis

4.1 Analysis of cutaneous melanoma data

We analyze three datasets from the TCGA cutaneous melanoma (SKCM) study, corresponding to different tumor stages, with 70 samples in stage 1, 60 in stage 2, and 110 in stages 3 and 4. Studies have been conducted on Breslow thickness, an important prognostic marker, which is regulated by gene expressions; however, most of these studies use all samples from the different stages together. Exploratory analysis suggests that, beyond similarity, there is also considerable variation across the three stages. In total, the three datasets contain measurements on 18,947 gene expressions. To generate more accurate results with the quite limited samples, we build our analysis on the results of Sun et al. (2018), who developed a Community Fusion (CoFu) approach to conduct variable selection while taking into account the network community structure of omics measurements. After the unique identification of genes, matching of gene names with those in the SKCM dataset, supervised screening, network construction, and community identification, a total of 21 communities, containing 126 genes, are identified as associated with the response by the CoFu method and are used here for downstream analysis.

We apply the proposed integrative analysis methods and their competitors, meta-analysis and pooled analysis. The identified variables vary across methods and stages. For example, Figure 1 shows the estimation results for genes in communities 3, 5, and 42, in which each row corresponds to one dataset. For a specific dataset, although the results generated by different methods differ, they share some nonzero loadings in common; the numbers of overlapping genes identified by different methods are summarized in Table 5. The genes identified by iSPLS-Hetero shown in Figure 1 demonstrate stage-specific features within a community, indicating differences across tumor stages.

Figure 1: Analysis of the TCGA SKCM data. Rhombus and cross in blue and orange correspond to iSPLS-Homo and Hetero with magnitude and sign penalties, respectively. Pink cross and red circle correspond to meta-SPLS and pooled-SPLS.

To evaluate prediction performance and the stability of identification, we randomly split each dataset into 75% for training and 25% for testing. Estimates are then obtained from the training set and used to make predictions for the testing set. The root mean squared error (RMSE) is used to measure prediction performance. Furthermore, for each gene, we compute its observed occurrence index (OOI) (Huang and Ma, 2010), that is, its probability of being identified across 100 resamplings. The RMSEs and OOIs for each method are shown in Table 6, which suggests the stability of the proposed methods as well as their competitive performance compared to the alternatives.

4.2 Analysis of lung cancer data

We collect two lung cancer datasets, on Lung Adenocarcinoma (LUAD) and Lung Squamous Cell Carcinoma (LUSC), with sample sizes 142 and 89, respectively. Studies have been conducted to analyze FEV1, a measure of lung function, and its relationship with gene expressions; however, the two datasets have been analyzed separately. Since both Adenocarcinoma and Squamous Cell Carcinoma are non-small cell lung carcinomas, we may expect a certain degree of similarity between them. Considering both their differences and similarities, we apply the proposed integrative methods to these two datasets. Our analysis focuses on 474 genes in 26 communities, which are identified as associated with the response based on the results of Sun et al. (2018).

We perform the same procedure as described above. The identified variables vary across methods and datasets. To illustrate the estimation results, Figure 2 shows the behaviors of three communities identified by the above methods, from which both the similarities and differences between the two datasets are apparent. Stability and prediction performance are evaluated by computing the RMSEs and OOIs from 100 resamplings, following the same procedure as described above. The overall results are summarized in Tables 5 and 6; the iSPLS methods have relatively lower RMSEs and higher OOIs than the other methods.

Figure 2: Analysis of the TCGA lung cancer data. Rhombus and cross in blue and orange correspond to iSPLS-Homo and Hetero with magnitude and sign penalties, respectively. Pink cross and red circle correspond to meta-SPLS and pooled-SPLS.
Table 5: Data analysis: numbers of overlapping genes identified by different methods.
iSPLS
Method pooled-SPLS meta-SPLS HomoM HomoS HeteroM HeteroS
SKCM data
pooled-SPLS 100 34 20 21 51 53
meta-SPLS 107 28 29 71 72
iSPLS-HomoM 45 45 45 45
iSPLS-HomoS 46 46 46
iSPLS-HeteroM 83 75
iSPLS-HeteroS 89
Lung cancer data
pooled-SPLS 145 78 37 40 66 51
meta-SPLS 92 39 42 76 58
iSPLS-HomoM 39 39 38 35
iSPLS-HomoS 42 40 36
iSPLS-HeteroM 66 58
iSPLS-HeteroS 72
Table 6: Data analysis: RMSEs and the median of OOIs of different methods
iSPLS
pooled-SPLS meta-SPLS HomoM HomoS HeteroM HeteroS
SKCM data
RMSE 6.210 4.046 4.202 4.163 3.202 3.135
OOI(Median) 0.76 0.75 0.80 0.80 0.77 0.78
Lung cancer data
RMSE 32.367 27.837 22.269 20.412 21.019 20.318
OOI(Median) 0.71 0.73 0.78 0.78 0.76 0.75

5 Discussion

PLS regression has been promoted for ill-conditioned linear regression problems that arise in several disciplines, such as chemistry, economics, medicine, and psychology. In this study, we propose an integrative SPLS (iSPLS) method, which conducts the integrative analysis of multiple independent datasets based on the SPLS technique. This study extends the integrative analysis paradigm to dimension reduction. An important contribution is the development of two contrasted penalties, magnitude-based and sign-based, which promote similarity across datasets more effectively under both the homogeneity and heterogeneity models. We develop effective computational algorithms for the proposed integrative analysis. For a variety of model settings, simulations demonstrate the satisfactory performance of the proposed iSPLS method. The applications to TCGA data suggest that the magnitude-based and sign-based iSPLS do not dominate each other and are both needed in practice. The stability and prediction evaluations provide support for the validity of the proposed method.

This study can be potentially extended in multiple directions. Apart from PLS, integrative analysis can be developed based on other dimension reduction techniques, such as CCA, ICA, and so on. For selection, the MCP penalty is adopted and can be potentially replaced with other two-level selection penalties. Integrative analysis can be developed based on SPLS-SVD. Moreover, iSPLS is applicable to non-linear frameworks such as generalized linear models and survival models. In data analysis, both the magnitude-based iSPLS and sign-based iSPLS have applications far beyond this study.

References

  • Breheny and Huang (2009) Breheny, P. and Huang, J. (2009). Penalized methods for bi-level variable selection. Statistics and Its Interface, 2:369–380.
  • Chiquet et al. (2011) Chiquet, J., Grandvalet, Y., and Ambroise, C. (2011). Inferring multiple graphical structures. Statistics and Computing, 21:537–553.
  • Chun and Keleş (2009) Chun, H. and Keleş, S. (2009). Expression quantitative trait loci mapping with multivariate sparse partial least squares regression. Genetics, 182:79–90.
  • Chun and Keleş (2010) Chun, H. and Keleş, S. (2010). Sparse partial least squares regression for simultaneous dimension reduction and variable selection. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72:3–25.
  • De Jong (1993) De Jong, S. (1993). SIMPLS: an alternative approach to partial least squares regression. Chemometrics and Intelligent Laboratory Systems, 18:251–263.
  • Dicker et al. (2013) Dicker, L., Huang, B., and Lin, X. (2013). Variable selection and estimation with the seamless-l0l_{0} penalty. Statistica Sinica, 23:929–962.
  • Fan and Lv (2010) Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica, 20:101–148.
  • Fang et al. (2018) Fang, K., Fan, X., Zhang, Q., and Ma, S. (2018). Integrative sparse principal component analysis. Journal of Multivariate Analysis, 166:1–16.
  • Grützmann et al. (2005) Grützmann, R., Boriss, H., Ammerpohl, O., Lüttges, J., Kalthoff, H., Schackert, H. K., Klöppel, G., Saeger, H. D., and Pilarsky, C. (2005). Meta-analysis of microarray data on pancreatic cancer defines a set of commonly dysregulated genes. Oncogene, 24:5079–5088.
  • Guerra and Goldstein (2009) Guerra, R. and Goldstein, D. R. (2009). Meta-analysis and combining information in genetics and genomics. CRC Press.
  • Huang et al. (2012a) Huang, J., Breheny, P., and Ma, S. (2012a). A selective review of group selection in high-dimensional models. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 27:481–499.
  • Huang and Ma (2010) Huang, J. and Ma, S. (2010). Variable selection in the accelerated failure time model via the bridge method. Lifetime Data Analysis, 16:176–195.
  • Huang et al. (2012b) Huang, Y., Huang, J., Shia, B.-C., and Ma, S. (2012b). Identification of cancer genomic markers via integrative sparse boosting. Biostatistics, 13:509–522.
  • Jolliffe et al. (2003) Jolliffe, I. T., Trendafilov, N. T., and Uddin, M. (2003). A modified principal component technique based on the LASSO. Journal of Computational and Graphical Statistics, 12:531–547.
  • Liu et al. (2015) Liu, J., Huang, J., Zhang, Y., Lan, Q., Rothman, N., Zheng, T., and Ma, S. (2015). Integrative analysis of prognosis data on multiple cancer subtypes. Biometrics, 70:480–488.
  • Ma et al. (2011) Ma, S., Huang, J., and Song, X. (2011). Integrative analysis and variable selection with multiple high-dimensional data sets. Biostatistics, 12:763–775.
  • Shi et al. (2014) Shi, X., Liu, J., Huang, J., Zhou, Y., Shia, B., and Ma, S. (2014). Integrative analysis of high-throughput cancer studies with contrasted penalization. Genetic Epidemiology, 38:144–151.
  • Sjöström et al. (1983) Sjöström, M., Wold, S., Lindberg, W., Persson, J.-Å., and Martens, H. (1983). A multivariate calibration problem in analytical chemistry solved by partial least squares models in latent variables. Analytica Chimica Acta, 150:61–70.
  • Sun et al. (2018) Sun, Y., Jiang, Y., Li, Y., and Ma, S. (2018). Identification of cancer omics commonality and difference via community fusion. Statistics in Medicine, 38:1200–1212.
  • Ter Braak and de Jong (1998) Ter Braak, C. J. and de Jong, S. (1998). The objective function of partial least squares regression. Journal of Chemometrics: A Journal of the Chemometrics Society, 12:41–54.
  • Wang et al. (2016) Wang, F., Wang, L., and Song, P. X.-K. (2016). Fused lasso with the adaptation of parameter ordering in combining multiple studies with repeated measurements. Biometrics, 72:1184–1193.
  • Wold et al. (1984) Wold, S., Ruhe, A., Wold, H., and Dunn, III, W. (1984). The collinearity problem in linear regression. The partial least squares (PLS) approach to generalized inverses. SIAM Journal on Scientific and Statistical Computing, 5:735–743.
  • Zhang (2010) Zhang, C. H. (2010). Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38:894–942.
  • Zhao et al. (2015) Zhao, Q., Shi, X., Huang, J., Liu, J., Li, Y., and Ma, S. (2015). Integrative analysis of ‘-omics’ data using penalty functions. Wiley Interdisciplinary Reviews Computational Statistics, 7:99–108.

Algorithms

iSPLS with the 2-norm group MCP and magnitude-based contrasted penalty

We adopt a computational algorithm similar to Algorithm 1. The key difference lies in Step 2(b), solving c^{(l)} with w_{[t]}^{(l)} fixed. Under the homogeneity model with the magnitude-based contrasted penalty (iSPLS-HomoM), we have the following problem:

\displaystyle\min\limits_{c^{(l)}} \sum_{l=1}^{L}\frac{1}{2n_{l}^{2}}\left(\left\|Z^{(l)\top}c^{(l)}-Z^{(l)\top}w_{[t]}^{(l)}\right\|_{2}^{2}+\lambda\left\|c^{(l)}\right\|_{2}^{2}\right)+\sum_{j=1}^{p}\rho\left(\left\|{c_{j}}\right\|_{2};\mu_{1},a\right) (7)
+μ22j=1pll(cj(l)cj(l))2.\displaystyle+\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\left(c_{j}^{(l)}-c_{j}^{(l^{\prime})}\right)^{2}.

For j=1,\dots,p, with the group parameter vectors c^{(l)}_{k} (k\not=j) fixed at their current estimates c^{(l)}_{k,[t-1]}, we minimize the objective function (7) with respect to c^{(l)}_{j}. Following the same procedure as in Section 2.3.1, this problem is equivalent to minimizing

12cj(l)2w[t](l)Z(l)Zj(l)cj(l)+ρ˙(cj,[t1]2;μ1,a)cj2+μ22ll(cj(l)cj(l))2.\displaystyle\frac{1}{2}c^{(l)2}_{j}-w_{[t]}^{(l)\top}Z^{(l)}Z^{(l)\top}_{j}c^{(l)}_{j}+\dot{\rho}\left(\left\|c_{j,[t-1]}\right\|_{2};\mu_{1},a\right)\left\|c_{j}\right\|_{2}+\frac{\mu_{2}^{*}}{2}\sum_{l^{\prime}\neq l}\left(c_{j}^{(l)}-c_{j}^{(l^{\prime})}\right)^{2}. (8)

It can be shown that the minimizer of (8) is

cj,[t](l)=(Sj2ρ˙(cj,[t1]2;μ1,a))+Sj(l)(1+μ2(L1))Sj2,c^{(l)}_{j,[t]}=\frac{\left(\left\|S_{j}\right\|_{2}-\dot{\rho}(\left\|c_{j,[t-1]}\right\|_{2};\mu_{1},a)\right)_{+}S^{(l)}_{j}}{(1+\mu_{2}^{*}(L-1))\left\|S_{j}\right\|_{2}},

where Sj(l)=m=1pi=1qwm(l)Zmi(l)Zji(l)+μ2llcj,[t1](l),S_{j}^{(l)}=\sum_{m=1}^{p}\sum_{i=1}^{q}w^{(l)}_{m}Z^{(l)}_{mi}Z^{(l)}_{ji}+\mu_{2}^{*}\sum_{l^{\prime}\neq l}c_{j,[t-1]}^{(l^{\prime})}, and Sj2=l=1LSj(l)2\left\|S_{j}\right\|_{2}=\sqrt{\sum^{L}_{l=1}S^{(l)2}_{j}}.
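As an illustration, the closed-form iSPLS-HomoM group update above can be transcribed as follows, reusing mcp_deriv() from Section 2.2.3; c_prev is a hypothetical p x L array of the current estimates and mu2_star stands for \mu_{2}^{*}=\mu_{2}n_{l}^{2}.

```python
import numpy as np

def update_c_homo_magnitude(Z_list, w_list, c_prev, j, mu1, a, mu2_star):
    """Update the whole group c_j = (c_j^(1), ..., c_j^(L)) for variable j."""
    L = len(Z_list)
    S = np.empty(L)
    for l in range(L):
        fit = float(w_list[l] @ Z_list[l] @ Z_list[l][j, :])   # sum_m sum_i w_m Z_mi Z_ji
        S[l] = fit + mu2_star * sum(c_prev[j, lp] for lp in range(L) if lp != l)
    S_norm = np.linalg.norm(S)
    if S_norm == 0.0:                                          # degenerate case: keep zero
        return np.zeros(L)
    shrink = max(S_norm - float(mcp_deriv(np.linalg.norm(c_prev[j, :]), mu1, a)), 0.0)
    return shrink * S / ((1.0 + mu2_star * (L - 1)) * S_norm)
```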

iSPLS with the composite MCP and magnitude-based contrasted penalty

Under the heterogeneity model with the magnitude-based contrasted penalty (iSPLS-HeteroM), consider

minc(l)\displaystyle\min\limits_{c^{(l)}} l=1L12nl2(Z(l)c(l)Z(l)w[t](l)22+λc(l)22)+j=1pρ(l=1Lρ(|cj(l)|;μ1,a);1,b)\displaystyle\sum_{l=1}^{L}\frac{1}{2n_{l}^{2}}\left(\left\|Z^{(l)\top}c^{(l)}-Z^{(l)\top}w_{[t]}^{(l)}\right\|_{2}^{2}+\lambda\left\|c^{(l)}\right\|_{2}^{2}\right)+\sum_{j=1}^{p}\rho\left(\sum_{l=1}^{L}\rho\left(|c_{j}^{(l)}|;\mu_{1},a\right);1,b\right) (9)
+μ22j=1pll(cj(l)cj(l))2.\displaystyle+\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\left(c_{j}^{(l)}-c_{j}^{(l^{\prime})}\right)^{2}.

Take the first-order Taylor expansion about c_{j}^{(l)} for the first penalty, with c_{k}^{(l)} (k\not=j) fixed at their current estimates c^{(l)}_{k,[t-1]}, and apply the same procedure to the second penalty as in Section 2.3.1. The objective function (9) is then approximately equivalent to minimizing

12cj(l)2w[t](l)Z(l)Zj(l)cj(l)+αjl|cj(l)|+μ22ll(cj(l)cj(l))2,\frac{1}{2}c_{j}^{(l)2}-w_{[t]}^{(l)\top}Z^{(l)}Z^{(l)\top}_{j}c^{(l)}_{j}+\alpha_{jl}|c^{(l)}_{j}|+\frac{\mu_{2}^{*}}{2}\sum_{l^{\prime}\neq l}\left(c_{j}^{(l)}-c_{j}^{(l^{\prime})}\right)^{2},~{} (10)

where αjl=ρ˙(l=1Lρ(|cj,[t1](l)|;μ1,a);1,b)ρ˙(|cj,[t1](l)|;μ1,a)\alpha_{jl}=\dot{\rho}\left(\sum_{l=1}^{L}\rho(|c_{j,[t-1]}^{(l)}|;\mu_{1},a);1,b\right)\dot{\rho}\left(|c^{(l)}_{j,[t-1]}|;\mu_{1},a\right).

Thus, cj,[t](l)c^{(l)}_{j,[t]} can be updated as follows: For l=1,,Ll=1,\dots,L,

  • 1.

    Initialize r=0r=0 and cj,[r](l)=cj,[t1](l)c^{(l)}_{j,[r]}=c^{(l)}_{j,[t-1]}.

  • 2.

    Update r=r+1r=r+1. Compute:

    cj,[r](l)=sgn(Sj,[r1](l))(|Sj,[r1](l)|αjl)+(1+μ2(L1)),c^{(l)}_{j,[r]}=\frac{\text{sgn}\left(S_{j,[r-1]}^{(l)}\right)\left(|S_{j,[r-1]}^{(l)}|-\alpha_{jl}\right)_{+}}{(1+\mu_{2}^{*}(L-1))},

    where

    Sj,[r1](l)=m=1pi=1qwm(l)Zmi(l)Zji(l)+μ2llcj,[t1](l),S_{j,[r-1]}^{(l)}=\sum_{m=1}^{p}\sum_{i=1}^{q}w^{(l)}_{m}Z^{(l)}_{mi}Z^{(l)}_{ji}+\mu_{2}^{*}\sum_{l^{\prime}\not=l}c_{j,[t-1]}^{(l^{\prime})},

    and αjl=ρ˙(l=1Lρ(|cj,[r1](l)|;μ1,a);1,b)ρ˙(|cj,[r1](l)|;μ1,a).\alpha_{jl}=\dot{\rho}\left(\sum_{l=1}^{L}\rho(|c_{j,[r-1]}^{(l)}|;\mu_{1},a);1,b\right)\dot{\rho}\left(|c^{(l)}_{j,[r-1]}|;\mu_{1},a\right).

  • 3.

    Repeat Step 2 until convergence. The estimate at convergence is cj,[t](l)c^{(l)}_{j,[t]}.

iSPLS with the 2-norm group MCP and sign-based contrasted penalty

Consider the homogeneity model with sign-based contrasted penalty (iSPLS-HomoS).

minc(l)\displaystyle\min\limits_{c^{(l)}} l=1L12nl2(Z(l)c(l)Z(l)w[t](l)22+λc(l)22)+j=1pρ(cj2;μ1,a)\displaystyle\sum_{l=1}^{L}\frac{1}{2n_{l}^{2}}\left(\left\|Z^{(l)\top}c^{(l)}-Z^{(l)\top}w_{[t]}^{(l)}\right\|_{2}^{2}+\lambda\left\|c^{(l)}\right\|^{2}_{2}\right)+\sum_{j=1}^{p}\rho\left(\left\|{c_{j}}\right\|_{2};\mu_{1},a\right) (11)
+μ22j=1pll{sgn(cj(l))sgn(cj(l))}2.\displaystyle+\frac{\mu_{2}}{2}\sum_{j=1}^{p}\sum_{l^{\prime}\neq l}\left\{\text{sgn}(c_{j}^{(l)})-\text{sgn}(c_{j}^{(l^{\prime})})\right\}^{2}.

For j=1,,pj=1,\dots,p, following the same procedure in Section 2.3.1, we have the following minimization problem

\displaystyle\frac{1}{2} c^{(l)2}_{j}-w_{[t]}^{(l)\top}Z^{(l)}Z^{(l)\top}_{j}c^{(l)}_{j}+\dot{\rho}\left(\left\|c_{j,[t-1]}\right\|_{2};\mu_{1},a\right)\left\|c_{j}\right\|_{2} (12)
+μ22ll(cj(l)cj(l)2+τ2cj(l)cj(l)2+τ2)2.\displaystyle+\frac{\mu_{2}^{*}}{2}\sum_{l^{\prime}\neq l}\Bigg{(}\frac{c_{j}^{(l)}}{\sqrt{c_{j}^{(l)2}+\tau^{2}}}-\frac{c_{j}^{(l^{\prime})}}{\sqrt{c_{j}^{(l^{\prime})2}+\tau^{2}}}\Bigg{)}^{2}.

It can be shown that the minimizer of (12) is

cj,[t](l)=(Sj2ρ˙(cj,[t1]2;μ1,a))+Sj(l)(1+μ2(L1))/(cj,[t1](l)2+τ2)Sj2,c^{(l)}_{j,[t]}=\frac{\left(\left\|S_{j}\right\|_{2}-\dot{\rho}(\left\|c_{j,[t-1]}\right\|_{2};\mu_{1},a)\right)_{+}S^{(l)}_{j}}{(1+\mu_{2}^{*}(L-1))/(c^{(l)2}_{j,[t-1]}+\tau^{2})\left\|S_{j}\right\|_{2}},

where

Sj(l)\displaystyle S_{j}^{(l)} =m=1pi=1qwm(l)Zmi(l)Zji(l)+μ2cj,[t1](l)2+τ2llcj,[t1](l)cj,[t1](l)2+τ2,\displaystyle=\sum_{m=1}^{p}\sum_{i=1}^{q}w^{(l)}_{m}Z^{(l)}_{mi}Z^{(l)}_{ji}+\frac{\mu_{2}^{*}}{\sqrt{c^{(l)2}_{j,[t-1]}+\tau^{2}}}\sum_{l^{\prime}\not=l}\frac{c_{j,[t-1]}^{(l^{\prime})}}{\sqrt{c_{j,[t-1]}^{(l^{\prime})2}+\tau^{2}}}, (13)

and Sj2=l=1L(Sj(l))2\left\|S_{j}\right\|_{2}=\sqrt{\sum^{L}_{l=1}(S^{(l)}_{j})^{2}}.

Thus, cj,[t](l)c^{(l)}_{j,[t]} can be updated as follows: For l=1,,Ll=1,\dots,L

  • 1.

    Initialize r=0r=0 and cj,[r](l)=cj,[t1](l)c^{(l)}_{j,[r]}=c^{(l)}_{j,[t-1]}

  • 2.

    Update r=r+1r=r+1. Compute:

    cj,[r](l)=(Sj,[r1]2ρ˙(cj,[r1]2;μ1,a))+Sj,[r1](l)(1+μ2(L1))/(cj,[r1](l)2+τ2)Sj,[r1]2.c^{(l)}_{j,[r]}=\frac{\left(\left\|S_{j,[r-1]}\right\|_{2}-\dot{\rho}\left(\left\|c_{j,[r-1]}\right\|_{2};\mu_{1},a\right)\right)_{+}S^{(l)}_{j,[r-1]}}{(1+\mu_{2}^{*}(L-1))/(c^{(l)2}_{j,[r-1]}+\tau^{2})\left\|S_{j,[r-1]}\right\|_{2}}.
  • 3.

    Repeat Step 2 until convergence. The estimate at convergence is cj,[t](l)c^{(l)}_{j,[t]}.