Weighted Sparse Partial Least Squares for Joint Sample and Feature Selection
Abstract
Sparse Partial Least Squares (sPLS) is a common dimensionality reduction technique for data fusion, which projects data samples from two views by seeking linear combinations of a small number of variables with maximum covariance. However, sPLS extracts the combinations between the two data sets using all data samples, so it cannot detect latent subsets of samples. To extend the applicability of sPLS by identifying a specific subset of samples and removing outliers, we propose an ℓ₀-norm constrained weighted sparse PLS (ℓ₀-wsPLS) method for joint sample and feature selection, where the ℓ₀-norm constraints are used to select a subset of samples. We prove that the ℓ₀-norm constraints have the Kurdyka-Łojasiewicz property, so that a globally convergent algorithm can be developed to solve the model. Moreover, multi-view data with the same set of samples are available in various real problems. To this end, we extend the ℓ₀-wsPLS model and propose two multi-view wsPLS models for multi-view data fusion. We develop an efficient iterative algorithm for each multi-view wsPLS model and show its convergence property. Numerical and biomedical data experiments demonstrate the efficiency of the proposed methods.
Index Terms:
ℓ₀-norm, sparse PLS, sparse CCA, multi-view learning, nonconvex optimization.

1 Introduction
A number of paired biomedical omics datasets from the same batches of patients have been accumulated [1, 2]. These invaluable datasets provide new opportunities to explore cooperative mechanisms between different types of biological molecules. Partial Least Squares (PLS) is a popular technique for finding correlations across two data matrices with maximum covariance in a new low-dimensional space by learning a latent vector for each representation [3, 4, 5]. Recently, PLS has been widely used for fusion of paired omics data [6], such as DNA copy number variation and gene expression data, DNA methylation and gene expression data, etc.
However, classical PLS suffers from overfitting when dealing with high-dimensional, small-sample biomedical data. In addition, the outputs of PLS (i.e., the latent vectors) are combinations of all variables, which are difficult to interpret. To address these problems, sparse PLS (sPLS) methods have been proposed to avoid overfitting and improve the interpretability of the model by learning two sparse latent vectors [7, 8, 9, 10, 11]. Moreover, extensions of the sPLS methods have been developed using structured sparse penalties, such as structured sparse PLS and matrix factorization methods [12, 13, 14, 15, 16]. These extensions can improve performance by integrating structural information about the variables when applied to paired biomedical omics datasets. For example, Chen et al. [13] proposed a sparse network-regularized PLS method to identify common modules (co-modules) using matched gene-expression and drug-response data. However, sPLS methods and their variants often perform poorly when applied to high-dimensional biomedical datasets with irrelevant features and noisy samples. The main reason is that they are susceptible to outliers and noisy samples, causing the learned projection vectors to deviate from the desired directions. On one hand, most medical datasets are high-dimensional and polluted by noise due to experimental conditions and environmental factors. On the other hand, even patients collected in medical datasets for the same disease show heterogeneity, as in cancer [17]. It is therefore necessary to mine sample-specific co-modules when analyzing two-view or multi-view data.
To this end, we develop an ℓ₀-norm constrained weighted sparse PLS (ℓ₀-wsPLS) model for joint sample and feature selection, which can simultaneously select features and samples on paired datasets (Figure 1). An ℓ₀-norm constraint on the sample weights is used to select a sample subset, while ℓ₀-norm constraints on the projection vectors are used to select features. However, it is very difficult to find a convergent algorithm for the ℓ₀-wsPLS model in the traditional way, because the ℓ₀-norm and unit ℓ₂-sphere constraints are non-convex and non-smooth. Fortunately, we prove that these constraints satisfy the Kurdyka-Łojasiewicz (KŁ) property. Inspired by the Proximal Alternating Linearized Minimization (PALM) method [18], we develop a block proximal gradient algorithm to solve the ℓ₀-wsPLS model and prove that it converges to a critical point from any initial point.
Furthermore, multi-view biomedical data have accumulated at an unprecedented rate in recent years [19, 20, 21, 22]. These multi-view biomedical data present the possibility of studying the molecular mechanisms of cancer pathogenesis. Many multi-view learning methods have been proposed to exploit cancer multi-omics data and improve generalization ability [23, 24, 25, 26, 27, 28], but mining co-modules across multi-view biomedical data remains a challenge. To this end, we extend the wsPLS model for multi-view omics data analysis and propose two multi-view weighted sparse PLS (mwsPLS) models with two different schemes. For each mwsPLS model, we develop an efficient iterative algorithm based on the PALM framework and prove its convergence.
In this paper, we first show that the proposed methods significantly improve the performance of joint sample and feature selection on simulated data. Second, we demonstrate the ability of ℓ₀-wsPLS to identify sample-specific co-modules on the paired DNA copy number variation and gene expression data, and on the paired DNA methylation and gene expression data, respectively. The results demonstrate that ℓ₀-wsPLS outperforms the other methods, obtaining higher correlation coefficients by removing noisy samples. Third, we also demonstrate the ability of mwsPLS to discover cancer subtype-specific co-modules on the paired miRNA-lncRNA-mRNA expression data of breast cancer. The results show that mwsPLS can identify important miRNA-lncRNA-mRNA co-modules involved in breast cancer related biological processes. The main contributions of this paper are highlighted as follows:
1. We propose an ℓ₀-norm constrained wsPLS model for joint sample and feature selection, where the ℓ₀-norm constraints are used for sample selection. We prove that the ℓ₀-norm constraints satisfy the KŁ property, so that a globally convergent algorithm is developed to solve the ℓ₀-wsPLS model.
2. To integrate more than two datasets, we extend the ℓ₀-wsPLS model and propose two multi-view wsPLS (mwsPLS) models with two different schemes. We develop an efficient iterative algorithm based on the PALM framework and prove its convergence for each mwsPLS model.
3. We apply the ℓ₀-wsPLS and mwsPLS methods to simulated and cancer multi-omics data and compare them with competing methods.
| Notation | Meaning |
|---|---|
| Normal font, e.g., $x$ | A scalar |
| Bold lowercase, e.g., $\mathbf{x}$ | A vector |
| Bold capital, e.g., $\mathbf{X}$ | A matrix |
| $X \in \mathbb{R}^{n\times p}$ | An $n$-by-$p$ matrix |
| $Y \in \mathbb{R}^{n\times q}$ | An $n$-by-$q$ matrix |
| $\mathrm{diag}(w) \in \mathbb{R}^{n\times n}$ | An $n$-by-$n$ diagonal matrix |
| $\|x\|_0$ | ℓ₀-norm of a vector (number of non-zero entries) |
| $\|x\|_1$ | ℓ₁-norm of a vector |
| $\|x\|_2$ | ℓ₂-norm of a vector |
| $\circ$ | Element-wise product of two vectors |

2 Weighted Sparse PLS (wsPLS)
2.1 wsPLS Model
Assume that there are two data matrices $X \in \mathbb{R}^{n\times p}$ and $Y \in \mathbb{R}^{n\times q}$ with the same set of $n$ samples, whose columns are standardized to have zero mean and unit variance. PLS is equivalent to solving the following optimization problem [29]:
$$\max_{u,v}\ u^\top X^\top Y v \quad \text{subject to}\ \|u\|_2^2 = 1,\ \|v\|_2^2 = 1 \tag{1}$$
The basic idea of PLS is to obtain two projection vectors $u$ and $v$ by maximizing the covariance between the latent vectors $Xu$ and $Yv$. Specifically, a sparse PLS (sPLS) with ℓ₀-norm constraints is formulated as:
$$\max_{u,v}\ u^\top X^\top Y v \quad \text{subject to}\ \|u\|_2^2 = \|v\|_2^2 = 1,\ \|u\|_0 \le k_u,\ \|v\|_0 \le k_v \tag{2}$$
where $k_u$ and $k_v$ are two parameters to control the sparsity of $u$ and $v$. The above model is also called the sparse diagonal CCA problem [30, 31].
We observe that the objective function in Eq. (2) can be written as $u^\top X^\top Yv = \sum_{i=1}^{n}(Xu)_i(Yv)_i$. To consider the differences among samples, we set a weight coefficient $w_i$ for each sample, modify the above objective function to $\sum_{i=1}^{n} w_i (Xu)_i(Yv)_i$, and propose a weighted sparse PLS model with ℓ₀-norm constraints, denoted ℓ₀-wsPLS (or wsPLS for short):
$$\max_{u,v,w}\ u^\top X^\top \mathrm{diag}(w)\, Y v \quad \text{subject to}\ \|u\|_2^2 = \|v\|_2^2 = 1,\ \|u\|_0 \le k_u,\ \|v\|_0 \le k_v,\ \|w\|_0 \le k_w,\ 0 \le w_i \le 1 \tag{3}$$
where $w \in \mathbb{R}^n$, $\mathrm{diag}(w)$ is a diagonal matrix whose diagonal entries are the weight coefficients of the samples, and the ℓ₀-norm constraint on $w$ is used to select a subset of samples. The mathematical notations are summarized in Table I.
2.2 Optimization Algorithm
Eq. (3) is equivalent to the following minimization problem, because maximizing the objective is equivalent to minimizing its negative:
$$\min_{u,v,w}\ F(u,v,w) = -u^\top X^\top \mathrm{diag}(w)\, Y v \quad \text{subject to}\ \|u\|_2^2 = \|v\|_2^2 = 1,\ \|u\|_0 \le k_u,\ \|v\|_0 \le k_v,\ \|w\|_0 \le k_w,\ 0 \le w_i \le 1 \tag{4}$$
Obviously, the above ℓ₀-wsPLS model reduces to the sPLS model when all elements of $w$ are one. We observe that $F(u,v,w)$ is also equal to $-w^\top\big((Xu)\circ(Yv)\big)$ or $-v^\top Y^\top \mathrm{diag}(w)\, Xu$. Therefore, we have
$$\nabla_u F = -X^\top \mathrm{diag}(w)\, Yv,\qquad \nabla_v F = -Y^\top \mathrm{diag}(w)\, Xu,\qquad \nabla_w F = -(Xu)\circ(Yv) \tag{5}$$
Furthermore, we know that the Hessian matrices of $F$ with respect to $u$, $v$ and $w$ (each with the other blocks fixed) are zero matrices. Thus, $\nabla_u F$, $\nabla_v F$ and $\nabla_w F$ are Lipschitz continuous, and their Lipschitz constants can be any constants greater than zero ($L_u > 0$, $L_v > 0$ and $L_w > 0$).
Recently, the PALM algorithm [18] has been proposed to solve a class of non-convex and non-smooth problems. Based on the PALM framework, we develop a block proximal gradient algorithm to solve (4) by iteratively updating each variable while fixing the others.
1) Computation of $u$ using $u^t$, $v$ and $w$. We perform the proximal gradient step for $u$ when $v$ and $w$ are fixed:
$$u^{t+1} = \arg\min_{u}\ \langle \nabla_u F,\ u - u^t\rangle + \frac{L_u}{2}\|u - u^t\|_2^2 \quad \text{subject to}\ \|u\|_2^2 = 1,\ \|u\|_0 \le k_u \tag{6}$$
where $\nabla_u F = -X^\top \mathrm{diag}(w)\, Yv$. To solve the above problem (6), we define an ℓ₀-ball projection problem as Eq. (7) and give its closed-form solution in Proposition 1.
$$\min_{u}\ \|u - z\|_2^2 \quad \text{subject to}\ \|u\|_2^2 = 1,\ \|u\|_0 \le k \tag{7}$$
Proposition 1.
Suppose that $z$ is a non-zero vector; then the solution of problem (7) is $u^\ast = T_k(z)/\|T_k(z)\|_2$, where $T_k(z)$ keeps the top $k$ entries of $z$ with the largest absolute values and sets the other entries to zero.
Based on Proposition 1, we obtain the closed-form solution of Eq. (6) as follows:
$$u^{t+1} = \frac{T_{k_u}(z)}{\|T_{k_u}(z)\|_2},\qquad z = u^t - \frac{1}{L_u}\nabla_u F = u^t + \frac{1}{L_u}\, X^\top \mathrm{diag}(w)\, Yv \tag{8}$$
where the partial derivative $\nabla_u F$ is given in Eq. (5) and the Lipschitz constant $L_u$ can be set to any constant greater than zero.
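As a concrete illustration, the projection of Proposition 1 takes only a few lines of NumPy. This is a sketch of our reading of the proposition (keep the top-$k$ magnitudes, then rescale to the unit sphere), not the authors' code:

```python
import numpy as np

def project_l0_sphere(z, k):
    """Project z onto {u : ||u||_0 <= k, ||u||_2 = 1} (Proposition 1):
    keep the k entries of z with largest absolute value, zero the rest,
    then rescale the result to unit Euclidean length."""
    u = np.zeros_like(z, dtype=float)
    idx = np.argsort(np.abs(z))[-k:]   # indices of the top-k magnitudes
    u[idx] = z[idx]
    return u / np.linalg.norm(u)
```

For example, projecting $z = (3, -1, 0.5, 2)$ with $k = 2$ keeps the entries 3 and 2 and normalizes the result.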
2) Computation of $v$ using $u$, $w$ and $v^t$. We perform the proximal gradient step for $v$ when $u$ and $w$ are fixed:
$$v^{t+1} = \arg\min_{v}\ \langle \nabla_v F,\ v - v^t\rangle + \frac{L_v}{2}\|v - v^t\|_2^2 \quad \text{subject to}\ \|v\|_2^2 = 1,\ \|v\|_0 \le k_v \tag{9}$$
where $\nabla_v F = -Y^\top \mathrm{diag}(w)\, Xu$. Based on Proposition 1, we obtain the closed-form solution of Eq. (9) as follows:
$$v^{t+1} = \frac{T_{k_v}(z)}{\|T_{k_v}(z)\|_2},\qquad z = v^t + \frac{1}{L_v}\, Y^\top \mathrm{diag}(w)\, Xu \tag{10}$$
where the partial derivative $\nabla_v F$ is given in Eq. (5) and the Lipschitz constant $L_v$ can be set to any constant greater than zero.
3) Computation of $w$ using $u$, $v$ and $w^t$. We perform the proximal gradient step for $w$ when $u$ and $v$ are fixed:
$$w^{t+1} = \arg\min_{w}\ \langle \nabla_w F,\ w - w^t\rangle + \frac{L_w}{2}\|w - w^t\|_2^2 \quad \text{subject to}\ \|w\|_0 \le k_w,\ 0 \le w_i \le 1 \tag{11}$$
where $\nabla_w F = -(Xu)\circ(Yv)$. To solve the above model (11), we define a projection function $P(\cdot)$ onto the box $[0,1]$, which satisfies, for all entries $i$ of an arbitrary vector $z$:
$$[P(z)]_i = \min\{\max\{z_i,\ 0\},\ 1\} \tag{12}$$
We obtain the closed-form solution of Eq. (11) using the following Proposition 2.
Proposition 2.
The optimal solution of problem (11) is $w^\ast = P\big(\hat T_{k_w}(z)\big)$, where $\hat T_{k_w}(z)$ keeps the $k_w$ largest entries of $z$ and sets the others to zero.
Based on Proposition 2, we obtain the update rule of $w$:
$$w^{t+1} = P\big(\hat T_{k_w}(z)\big),\qquad z = w^t + \frac{1}{L_w}\,(Xu)\circ(Yv) \tag{13}$$
where $P(\cdot)$ is defined in Eq. (12) and the Lipschitz constant $L_w$ can be set to any constant greater than zero.
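The $w$-step can likewise be sketched in NumPy, assuming (as in our reconstruction) that the feasible set for $w$ is $\{w : 0 \le w_i \le 1,\ \|w\|_0 \le k_w\}$:

```python
import numpy as np

def update_w(z, k):
    """Proximal step for w under the reconstructed constraint set
    {w : 0 <= w_i <= 1, ||w||_0 <= k}: keep the k largest entries of the
    gradient-step point z, zero the rest, then clip into [0, 1]."""
    w = np.zeros_like(z, dtype=float)
    idx = np.argsort(z)[-k:]           # k largest (signed) entries
    w[idx] = np.clip(z[idx], 0.0, 1.0)
    return w
```

The selected samples are exactly the (at most) $k$ coordinates left non-zero.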
Combining (8), (10) and (13), we propose the block proximal gradient algorithm to solve Eq. (4) (see Algorithm 1).
Initialization. We assume that all the samples contribute equally and initialize $w^0$ as the all-ones vector in Algorithm 1. Furthermore, Algorithm 1 is run with multiple random initial points $u^0$ and $v^0$. Finally, the model with the optimal objective function value is selected.
Stopping Condition. We can use one of the following ways to stop the algorithm: (i) the stopping criterion can be set as follows:
$$\max\big\{\|u^{t+1}-u^t\|_2,\ \|v^{t+1}-v^t\|_2,\ \|w^{t+1}-w^t\|_2\big\} < \epsilon \tag{14}$$
where $u^t$, $v^t$ and $w^t$ represent the $t$-th iterates of $u$, $v$ and $w$, respectively, and $\epsilon$ is a small tolerance. (ii) Alternatively, the stopping criterion can be set as follows:
$$\big|F^{t+1} - F^{t}\big| < \epsilon \tag{15}$$
where $F^t$ is the objective function value after the $t$-th iteration.
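Putting the three block updates together, the whole procedure fits in a short loop. The following is a minimal sketch of Algorithm 1 under our reconstruction of the objective $F(u,v,w) = -u^\top X^\top \mathrm{diag}(w)\,Yv$; since each block gradient is linear, any step size $1/L$ with $L > 0$ is admissible, and we simply use $L = 1$:

```python
import numpy as np

def wspls(X, Y, ku, kv, kw, n_iter=20, seed=0):
    """Sketch of the block proximal gradient loop (Algorithm 1) for the
    reconstructed objective f(u, v, w) = -u^T X^T diag(w) Y v."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    u = rng.standard_normal(p); u /= np.linalg.norm(u)
    v = rng.standard_normal(q); v /= np.linalg.norm(v)
    w = np.ones(n)                       # all samples weighted equally at start

    def top_k_unit(z, k):                # Proposition 1: l0-ball + unit sphere
        out = np.zeros_like(z)
        idx = np.argsort(np.abs(z))[-k:]
        out[idx] = z[idx]
        return out / np.linalg.norm(out)

    def top_k_box(z, k):                 # Proposition 2 (our reconstruction)
        out = np.zeros_like(z)
        idx = np.argsort(z)[-k:]
        out[idx] = np.clip(z[idx], 0.0, 1.0)
        return out

    for _ in range(n_iter):
        u = top_k_unit(u + X.T @ (w * (Y @ v)), ku)   # gradient step + projection
        v = top_k_unit(v + Y.T @ (w * (X @ u)), kv)
        w = top_k_box(w + (X @ u) * (Y @ v), kw)
    return u, v, w
```

On rank-one data with a planted sample subset, the returned $u$, $v$ and $w$ carry the prescribed sparsity levels; multiple restarts (as described above) guard against poor local critical points.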
2.3 Complexity Analysis
The computational complexity of Algorithm 1 mainly depends on the computation of Eq. (8), Eq. (10) and Eq. (13). Computing Eq. (8) is dominated by the product $X^\top \mathrm{diag}(w)\,Yv$, which costs $O(n(p+q))$. Computing Eq. (10) is dominated by $Y^\top \mathrm{diag}(w)\,Xu$, which also costs $O(n(p+q))$. Similarly, computing Eq. (13) is dominated by $(Xu)\circ(Yv)$, which costs $O(n(p+q))$. Overall, the complexity of Algorithm 1 is $O(Tn(p+q))$, where $T$ is the number of iterations; it is empirically set to 20 in our experiments.
2.4 Convergence Analysis
Lemma 1.
The objective function of problem (4), together with the indicator functions of its constraint sets, has the Kurdyka-Łojasiewicz (KŁ) property.
Proof.
Theorem 1.
For any initial point, the sequence generated by Algorithm 1 converges to a critical point of problem (4).
Proof.
Reference [18] shows that the PALM algorithm converges monotonically to a critical point if the objective function satisfies the KŁ property. Since Lemma 1 shows that the objective function of the ℓ₀-wsPLS model satisfies the KŁ property, and Algorithm 1 is an instance of the PALM framework, Theorem 1 holds. ∎
2.5 Determination of co-modules
A co-module is determined in two steps: (1) extract a submatrix of the data matrix $X$ whose rows and columns correspond to the non-zero entries of $w$ and $u$, respectively; (2) extract another submatrix of the data matrix $Y$ whose rows and columns correspond to the non-zero entries of $w$ and $v$, respectively. The combination of the two extracted sub-matrices is called a co-module (see Figure 1).
We now explain how wsPLS identifies multiple co-modules. After identifying the first co-module using Algorithm 1, we update the matrices $X$ and $Y$ by removing the rows (samples) belonging to the first co-module, and we repeatedly apply Algorithm 1 to identify the next co-module on the updated data matrices (see the three co-modules in the rightmost subgraph of Figure 1).
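The sequential extraction above can be sketched as follows; here `solver` is a placeholder standing in for Algorithm 1 (not part of the source), required only to return $(u, v, w)$ with $w_i > 0$ marking the selected samples:

```python
import numpy as np

def find_comodules(X, Y, solver, n_modules=3):
    """Sequential co-module identification: run the solver, record the
    selected features/samples, drop the selected samples (rows), and
    repeat on the reduced matrices."""
    modules = []
    row_ids = np.arange(X.shape[0])            # track original sample indices
    for _ in range(n_modules):
        if X.shape[0] == 0:
            break
        u, v, w = solver(X, Y)
        picked = np.flatnonzero(w)
        modules.append((np.flatnonzero(u), np.flatnonzero(v), row_ids[picked]))
        rest = np.flatnonzero(w == 0)          # remove the module's samples
        X, Y, row_ids = X[rest], Y[rest], row_ids[rest]
    return modules
```

Tracking `row_ids` keeps the reported sample indices in the coordinates of the original matrices even after rows have been removed.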
3 Multi-view Weighted Sparse PLS
To identify co-modules on multi-view data with three or more data matrices, we extend wsPLS and propose two multi-view wsPLS (mwsPLS) models with different schemes.
3.1 Scheme 1 for mwsPLS using a sum way
Assume that multiple matrices $X_i \in \mathbb{R}^{n\times p_i}$ ($i = 1,\dots,m$) across a common sample set are given and their columns are normalized to zero mean and unit variance. Similar to [32], we propose the multi-view wsPLS (mwsPLS) model in a sum manner, called mwsPLS-scheme1 for short:
$$\max_{\{u_i\},\,w}\ \sum_{i<j} u_i^\top X_i^\top \mathrm{diag}(w)\, X_j u_j \quad \text{subject to}\ \|u_i\|_2^2 = 1,\ \|u_i\|_0 \le k_i\ (i=1,\dots,m),\ \|w\|_0 \le k_w,\ 0 \le w_s \le 1 \tag{16}$$
Eq. (16) is equivalent to
$$\min_{\{u_i\},\,w}\ F = -\sum_{i<j} u_i^\top X_i^\top \mathrm{diag}(w)\, X_j u_j \quad \text{subject to the constraints of Eq. (16)} \tag{17}$$
Optimization. We observe that the objective function of Eq. (17) satisfies $F = -\sum_{i<j} w^\top\big((X_iu_i)\circ(X_ju_j)\big)$, where the constraint on each $u_i$ is independent of the others. So, we can calculate the partial derivatives with respect to $u_i$ ($i=1,\dots,m$) and $w$ as follows:

$$\nabla_{u_i} F = -X_i^\top \mathrm{diag}(w) \sum_{j\ne i} X_j u_j,\qquad \nabla_{w} F = -\sum_{i<j} (X_iu_i)\circ(X_ju_j) \tag{18}$$
Similar to Algorithm 1, we develop a block proximal gradient algorithm to solve (17) by updating the variables iteratively. The detailed procedure is shown in Algorithm 2, where Eq. (19) and Eq. (20) are solved using Proposition 1 and Proposition 2, respectively.
$$u_i^{t+1} = \arg\min_{u_i}\ \langle \nabla_{u_i} F,\ u_i - u_i^t\rangle + \frac{L_i}{2}\|u_i - u_i^t\|_2^2 \quad \text{subject to}\ \|u_i\|_2^2 = 1,\ \|u_i\|_0 \le k_i \tag{19}$$
$$w^{t+1} = \arg\min_{w}\ \langle \nabla_{w} F,\ w - w^t\rangle + \frac{L_w}{2}\|w - w^t\|_2^2 \quad \text{subject to}\ \|w\|_0 \le k_w,\ 0 \le w_s \le 1 \tag{20}$$
Initialization. We set $w^0$ to the all-ones vector in Algorithm 2 and randomly initialize $\{u_i^0\}$ from a normal distribution. Algorithm 2 is run with multiple random initial points and the model with the optimal objective value is selected.
Complexity Analysis. The computational burden of Algorithm 2 mainly depends on the computation of Eq. (19) and Eq. (20). The computation of Eq. (19) is dominated by $X_i^\top \mathrm{diag}(w)\sum_{j\ne i} X_j u_j$, which costs $O(n\sum_i p_i)$. The computation of Eq. (20) is dominated by $\sum_{i<j}(X_iu_i)\circ(X_ju_j)$, which also costs $O(n\sum_i p_i)$ because each $X_iu_i$ costs $O(np_i)$. In total, the complexity of Algorithm 2 is $O(Tn\sum_i p_i)$, where $T$ is the number of iterations; it is empirically set to 20 in our experiments.
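The block gradients in Eq. (18) can be checked numerically. The sketch below implements our reconstruction of the negated sum-way objective $f = -\sum_{i<j} u_i^\top X_i^\top \mathrm{diag}(w)\, X_j u_j$ and uses the identity $\sum_{j\ne i} X_j u_j = (\sum_j X_j u_j) - X_i u_i$ to avoid an inner loop:

```python
import numpy as np

def scheme1_grads(Xs, us, w):
    """Block gradients of the negated sum-way objective of Eq. (17),
    f = -sum_{i<j} u_i^T X_i^T diag(w) X_j u_j  (our reconstruction).
    Returns the gradients w.r.t. each u_i and w.r.t. w."""
    zs = [X @ u for X, u in zip(Xs, us)]      # latent vectors z_i = X_i u_i
    total = np.sum(zs, axis=0)
    gu = [-X.T @ (w * (total - z)) for X, z in zip(Xs, zs)]
    gw = -sum(zs[i] * zs[j] for i in range(len(zs)) for j in range(i + 1, len(zs)))
    return gu, gw
```

Because $f$ is linear in $w$, a finite-difference check of the $w$-gradient is essentially exact.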
3.2 Scheme 2 for mwsPLS using a product way
We observe that $u^\top X^\top \mathrm{diag}(w)\, Yv = w^\top\big((Xu)\circ(Yv)\big)$, so we can intuitively extend wsPLS for multi-view data analysis in a product manner. Herein, we propose the second mwsPLS model using the product way, called mwsPLS-scheme2 for short:
$$\max_{\{u_i\},\,w}\ w^\top\big((X_1u_1)\circ(X_2u_2)\circ\cdots\circ(X_mu_m)\big) \quad \text{subject to}\ \|u_i\|_2^2 = 1,\ \|u_i\|_0 \le k_i\ (i=1,\dots,m),\ \|w\|_0 \le k_w,\ 0 \le w_s \le 1 \tag{21}$$
where $\circ$ denotes the element-wise product of vectors.
Optimization. Formally, the objective function in Eq. (21) is equivalent to
$$\min_{\{u_i\},\,w}\ F = -w^\top c \tag{22}$$
where $c = (X_1u_1)\circ(X_2u_2)\circ\cdots\circ(X_mu_m)$. Therefore, we calculate the partial derivatives for $u_i$ ($i=1,\dots,m$) and $w$ as follows:

$$\nabla_{u_i} F = -X_i^\top\big(w\circ(X_1u_1)\circ\cdots\circ(X_{i-1}u_{i-1})\circ(X_{i+1}u_{i+1})\circ\cdots\circ(X_mu_m)\big),\qquad \nabla_{w} F = -c \tag{23}$$
Finally, we solve the optimization problem (21) in a similar manner to Algorithm 2; the detailed procedure is shown in Algorithm 3.
$$u_i^{t+1} = \arg\min_{u_i}\ \langle \nabla_{u_i} F,\ u_i - u_i^t\rangle + \frac{L_i}{2}\|u_i - u_i^t\|_2^2 \quad \text{subject to}\ \|u_i\|_2^2 = 1,\ \|u_i\|_0 \le k_i \tag{24}$$
$$w^{t+1} = \arg\min_{w}\ \langle \nabla_{w} F,\ w - w^t\rangle + \frac{L_w}{2}\|w - w^t\|_2^2 \quad \text{subject to}\ \|w\|_0 \le k_w,\ 0 \le w_s \le 1 \tag{25}$$
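The product-way gradients of Eq. (23) can also be sketched and verified numerically. This assumes our reconstruction $f = -w^\top\big((X_1u_1)\circ\cdots\circ(X_mu_m)\big)$:

```python
import numpy as np

def scheme2_grads(Xs, us, w):
    """Block gradients of the negated product-way objective of Eq. (22),
    f = -w^T ((X_1 u_1) o ... o (X_m u_m)), with o the element-wise
    product (our reconstruction)."""
    zs = [X @ u for X, u in zip(Xs, us)]
    m = len(zs)
    gu = []
    for i, X in enumerate(Xs):
        # leave-one-out element-wise product of the other latent vectors
        others = np.prod([zs[j] for j in range(m) if j != i], axis=0)
        gu.append(-X.T @ (w * others))        # d f / d u_i
    gw = -np.prod(zs, axis=0)                 # d f / d w
    return gu, gw
```

Computing the leave-one-out products directly (rather than dividing the full product by $z_i$) avoids division by zero when some latent vector has a zero entry.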
Initialization. The initial point setting of Algorithm 3 is the same as that of Algorithm 2, and the description is omitted here.
Complexity Analysis. The complexity of computing $c$ and $\nabla_w F$ is $O(n\sum_i p_i)$. The complexity of computing each $\nabla_{u_i} F$ is also $O(n\sum_i p_i)$. So, the complexity of Algorithm 3 is $O(Tn\sum_i p_i)$, where $T$ is the number of iterations; it is empirically set to 20 in our experiments.
4 Experiments
In this section, we evaluate the effectiveness of these proposed methods for co-module identification and compare them with other competing methods on the synthetic and biomedical data.
4.1 Competing Methods
We compare the performance of the proposed methods with PLS and its variants:
- • PLS: the classical model without any sparsity constraint [29]:
$$\max_{u,v}\ u^\top X^\top Yv \quad \text{subject to}\ \|u\|_2^2 = \|v\|_2^2 = 1,$$
which is solved using the singular value decomposition (SVD) of $X^\top Y$.
- • PMD-sPLS: sparse PLS based on penalized matrix decomposition with an ℓ₁-penalty [34].
- • ℓ₀-sPLS: sparse PLS with ℓ₀-norm constraints [30].
- • ℓ₁-wsPLS: weighted sparse PLS with ℓ₁-norm constraints [31].

4.2 Application to Synthetic Data
To generate the synthetic data $X \in \mathbb{R}^{n\times p}$ and $Y \in \mathbb{R}^{n\times q}$ ($n$ is the number of samples; $p$ and $q$ are the numbers of the two different feature sets, respectively), we first generate $u$, $v$ and $w$, and then generate the data matrices by
$$X = wu^\top + \gamma_1 E_1,\qquad Y = wv^\top + \gamma_2 E_2 \tag{26}$$
where the elements of $E_1$ and $E_2$ are randomly sampled from a standard normal distribution, and $\gamma_1$, $\gamma_2$ are two nonnegative parameters that control the signal-to-noise ratio (SNR), defined by
$$\mathrm{SNR}_X = \frac{\|wu^\top\|_F^2}{\mathbb{E}\|\gamma_1 E_1\|_F^2},\qquad \mathrm{SNR}_Y = \frac{\|wv^\top\|_F^2}{\mathbb{E}\|\gamma_2 E_2\|_F^2} \tag{27}$$
where $\mathbb{E}\|\gamma_1 E_1\|_F^2$ and $\mathbb{E}\|\gamma_2 E_2\|_F^2$ denote the expected sums of squares of the noise.
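A data generator following our reading of Eq. (26) takes only a few lines; the function name and signature are illustrative, not from the source:

```python
import numpy as np

def make_pair(u, v, w, gamma1, gamma2, seed=0):
    """Rank-one synthetic pair following our reading of Eq. (26):
    X = w u^T + gamma1 * E1 and Y = w v^T + gamma2 * E2, where E1 and E2
    have i.i.d. standard-normal entries and gamma1, gamma2 set the noise
    level (and hence the SNR of Eq. (27))."""
    rng = np.random.default_rng(seed)
    X = np.outer(w, u) + gamma1 * rng.standard_normal((w.size, u.size))
    Y = np.outer(w, v) + gamma2 * rng.standard_normal((w.size, v.size))
    return X, Y
```

Setting $\gamma_1 = \gamma_2 = 0$ recovers the noiseless rank-one signal exactly.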
(1) Simulation data I ($n$ = 50, $p$ = 80, $q$ = 100). We first generate three vectors $u$, $v$ and $w$ using
where $\mathbf{1}_s$ denotes a row vector of size $s$ with all elements equal to 1. We then generate simulation data I using formula (26).
(2) Simulation data II ($n$ = 100, $p$ = 800, $q$ = 1000). We first generate three vectors $u$, $v$ and $w$ using
We then generate simulation data II using formula (26).
(3) Simulation data III ($n$ = 1000, $p$ = 8000, $q$ = 10000). We first generate three vectors $u$, $v$ and $w$ using
We then generate simulation data III using formula (26).
We compare the performance of ℓ₀-wsPLS with ℓ₁-wsPLS [31], two sparse PLS methods including ℓ₀-sPLS [30] and PMD-sPLS with an ℓ₁-penalty [34], and the classical PLS [29]. All the experiments are performed on a laptop (2.8-GHz Intel 8-Core CPU and 16-GB RAM). Since neither PLS nor sPLS can estimate $w$, we treat all their sample weights as ones for the convenience of comparison.
For each simulation dataset, we set the sparsity parameters of ℓ₀-wsPLS and ℓ₁-wsPLS to match the true supports of $u$, $v$ and $w$; for comparison, we tune the parameters of ℓ₀-sPLS and ensure that the $u$ and $v$ output by PMD-sPLS have the same sparsity level.
An optimal method should identify the true non-zero patterns of $u$, $v$ and $w$ as in Figure 2 (A). We evaluate the performance of all methods with three evaluation metrics: true positive rate (TPR), true negative rate (TNR) and accuracy (ACC):
$$\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{P}},\qquad \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{N}},\qquad \mathrm{ACC} = \frac{\mathrm{TP}+\mathrm{TN}}{\mathrm{P}+\mathrm{N}} \tag{28}$$
where P denotes the number of positive entries (i.e., non-zero values), N denotes the number of negative entries (i.e., zero values), TP denotes the number of true positives, and TN denotes the number of true negatives.
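These support-recovery metrics can be computed directly from an estimated vector and the ground truth; a minimal sketch:

```python
import numpy as np

def support_metrics(est, truth):
    """TPR, TNR and ACC of a recovered support, treating non-zero entries
    as positives and zero entries as negatives, as in Eq. (28)."""
    e, t = est != 0, truth != 0
    tp = np.sum(e & t)                 # non-zeros correctly recovered
    tn = np.sum(~e & ~t)               # zeros correctly recovered
    tpr = tp / t.sum()
    tnr = tn / (~t).sum()
    acc = (tp + tn) / t.size
    return tpr, tnr, acc
```

Applied separately to $u$, $v$ and $w$, this yields the per-vector columns of Table II; averaging over the three vectors gives the "all" columns.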
Data I | Simulation data I: n = 50, p = 80, q = 100 | ||||||||||||
Metrics | ACC | TPR | TNR | Time | |||||||||
ACC_all | ACC_u | ACC_v | ACC_w | TPR_all | TPR_u | TPR_v | TPR_w | TNR_all | TNR_u | TNR_v | TNR_w | Time | |
PLS | 0.326 | 0.250 | 0.300 | 0.500 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.004 |
(0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.010) | |
PMD-sPLS | 0.800 | 0.916 | 0.856 | 0.500 | 0.880 | 0.815 | 0.823 | 1.000 | 0.761 | 0.950 | 0.870 | 0.000 | 0.016 |
(0.016) | (0.023) | (0.034) | (0.000) | (0.025) | (0.067) | (0.040) | (0.000) | (0.016) | (0.015) | (0.039) | (0.000) | (0.007) | |
ℓ₀-sPLS | 0.870 | 0.980 | 0.966 | 0.500 | 0.967 | 0.960 | 0.943 | 1.000 | 0.823 | 0.987 | 0.976 | 0.000 | 0.029 |
(0.010) | (0.022) | (0.016) | (0.000) | (0.015) | (0.044) | (0.026) | (0.000) | (0.007) | (0.015) | (0.011) | (0.000) | (0.008) | |
ℓ₁-wsPLS | 0.710 | 0.728 | 0.692 | 0.716 | 0.555 | 0.455 | 0.487 | 0.716 | 0.785 | 0.818 | 0.780 | 0.716 | 0.023 |
(0.135) | (0.125) | (0.118) | (0.214) | (0.207) | (0.249) | (0.196) | (0.214) | (0.100) | (0.083) | (0.084) | (0.214) | (0.007) | |
ℓ₀-wsPLS | 0.979 | 0.980 | 0.972 | 0.992 | 0.968 | 0.960 | 0.953 | 0.992 | 0.985 | 0.987 | 0.980 | 0.992 | 0.056 |
(0.016) | (0.019) | (0.016) | (0.024) | (0.024) | (0.037) | (0.027) | (0.024) | (0.012) | (0.012) | (0.011) | (0.024) | (0.007) | |
Data II | Simulation data II: n = 100, p = 800, q = 1000 | ||||||||||||
ACC_all | ACC_u | ACC_v | ACC_w | TPR_all | TPR_u | TPR_v | TPR_w | TNR_all | TNR_u | TNR_v | TNR_w | Time | |
PLS | 0.289 | 0.250 | 0.300 | 0.500 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 0.065 |
(0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.008) | |
PMD-sPLS | 0.868 | 0.795 | 0.963 | 0.500 | 0.816 | 0.625 | 0.913 | 1.000 | 0.889 | 0.853 | 0.984 | 0.000 | 0.079 |
(0.006) | (0.010) | (0.006) | (0.000) | (0.008) | (0.014) | (0.013) | (0.000) | (0.006) | (0.010) | (0.004) | (0.000) | (0.012) | |
ℓ₀-sPLS | 0.926 | 0.890 | 0.997 | 0.500 | 0.917 | 0.780 | 0.995 | 1.000 | 0.929 | 0.927 | 0.998 | 0.000 | 0.114 |
(0.005) | (0.013) | (0.001) | (0.000) | (0.009) | (0.026) | (0.002) | (0.000) | (0.004) | (0.009) | (0.001) | (0.000) | (0.009) | |
ℓ₁-wsPLS | 0.617 | 0.627 | 0.605 | 0.654 | 0.338 | 0.254 | 0.342 | 0.654 | 0.730 | 0.751 | 0.718 | 0.654 | 0.026 |
(0.027) | (0.013) | (0.035) | (0.179) | (0.047) | (0.025) | (0.059) | (0.179) | (0.019) | (0.008) | (0.025) | (0.179) | (0.008) | |
ℓ₀-wsPLS | 0.953 | 0.891 | 0.997 | 1.000 | 0.918 | 0.783 | 0.995 | 1.000 | 0.967 | 0.927 | 0.998 | 1.000 | 0.117 |
(0.005) | (0.013) | (0.001) | (0.000) | (0.009) | (0.026) | (0.002) | (0.000) | (0.004) | (0.009) | (0.001) | (0.000) | (0.010) | |
Data III | Simulation data III: n = 500, p = 8000, q = 10000 | ||||||||||||
ACC_all | ACC_u | ACC_v | ACC_w | TPR_all | TPR_u | TPR_v | TPR_w | TNR_all | TNR_u | TNR_v | TNR_w | Time | |
PLS | 0.284 | 0.250 | 0.300 | 0.500 | 1.000 | 1.000 | 1.000 | 1.000 | 0.000 | 0.000 | 0.000 | 0.000 | 11.965 |
(0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.097) | |
PMD-sPLS | 0.940 | 0.919 | 0.979 | 0.500 | 0.892 | 0.821 | 0.931 | 1.000 | 0.960 | 0.952 | 1.000 | 0.000 | 11.999 |
(0.002) | (0.004) | (0.001) | (0.000) | (0.003) | (0.007) | (0.003) | (0.000) | (0.001) | (0.003) | (0.000) | (0.000) | (0.205) | |
ℓ₀-sPLS | 0.976 | 0.977 | 1.000 | 0.500 | 0.982 | 0.954 | 1.000 | 1.000 | 0.974 | 0.985 | 1.000 | 0.000 | 26.078 |
(0.001) | (0.002) | (0.000) | (0.000) | (0.001) | (0.004) | (0.000) | (0.000) | (0.001) | (0.001) | (0.000) | (0.000) | (0.238) | |
ℓ₁-wsPLS | 0.921 | 0.907 | 0.928 | 0.991 | 0.860 | 0.813 | 0.880 | 0.991 | 0.945 | 0.938 | 0.949 | 0.991 | 0.360 |
(0.138) | (0.138) | (0.143) | (0.019) | (0.243) | (0.277) | (0.239) | (0.019) | (0.096) | (0.092) | (0.102) | (0.019) | (0.029) | |
ℓ₀-wsPLS | 0.990 | 0.977 | 1.000 | 1.000 | 0.982 | 0.954 | 1.000 | 1.000 | 0.993 | 0.985 | 1.000 | 1.000 | 5.948 |
(0.001) | (0.002) | (0.000) | (0.000) | (0.001) | (0.004) | (0.000) | (0.000) | (0.001) | (0.001) | (0.000) | (0.000) | (0.057) |
Figure 2 shows the estimated patterns of $u$, $v$ and $w$ on simulation data I. Table II shows the detailed performance of the different methods in terms of ACC, TPR, TNR and running time (seconds). We find that (1) the sparse PLS methods, including PMD-sPLS and ℓ₀-sPLS, cannot detect the pattern of $w$ (samples), and they identify $u$ and $v$ with some errors; (2) compared to ℓ₁-wsPLS, ℓ₀-wsPLS identifies the pattern of $w$ more accurately. In short, ℓ₀-wsPLS not only discovers the true non-zero patterns of $u$ and $v$, but also identifies the true non-zero pattern of $w$ (samples).
4.3 Application to Paired DNA Copy Number Variation and Gene Expression Data
We demonstrate the ℓ₀-wsPLS method on the comparative genomic hybridization (CGH) data with matched samples from the "PMA" R package [34], which contains the matched DNA copy number variation (CNV) data $X$ and gene expression data $Y$. Here, our aim is to identify a gene set whose expression is strongly correlated with CNV across a subset of samples.
We apply ℓ₀-wsPLS to the CGH dataset and identify a CNV-gene co-expression module which contains 50 CNV regions, 150 genes and 71 patients. For comparison, we also apply ℓ₀-sPLS and PMD-sPLS to learn $u$ and $v$ with the same sparsity level; the setting of ℓ₁-wsPLS is the same as that of ℓ₀-wsPLS. We first see that the co-module identified by ℓ₀-wsPLS has a strong co-correlation pattern (Figure 3 (A), (B) and (C)). Second, we find that the proposed ℓ₀-wsPLS obtains the largest Pearson correlation coefficient (PCC) among ℓ₁-wsPLS, ℓ₀-sPLS and PMD-sPLS (Figure 3 (D)). These results show that wsPLS is more effective at capturing latent patterns by selecting a specific subset of samples than the ℓ₀-sPLS and PMD-sPLS methods.



4.4 Application to Paired DNA Methylation and Gene Expression Data
Aberrant DNA methylation plays an important role in the development of cancer and has been regarded as one of the biomarkers of cancer [35]. In recent years, it has become very important to detect the cross-talk mechanism between DNA methylation and gene expression [36, 37]. To this end, we obtain the paired DNA methylation and gene expression dataset of Lung Adenocarcinoma (LUAD) from The Cancer Genome Atlas (TCGA) [38, 37]. To remove noisy methylation probes and genes, we extract the top 5000 methylation probes and the top 5000 genes with the largest variance for further analysis. Finally, the LUAD dataset contains the DNA methylation data $X$ and the gene expression data $Y$. We apply the ℓ₀-wsPLS method to the LUAD dataset to identify a methylation-gene co-module. For comparison, we also apply ℓ₀-sPLS and PMD-sPLS on the LUAD data to learn $u$ and $v$ with the same sparsity level; the setting of ℓ₁-wsPLS is the same as that of ℓ₀-wsPLS. The co-module identified by ℓ₀-wsPLS has a strong co-correlation pattern (Figure 4 (A), (B) and (C)). In addition, we find that ℓ₀-wsPLS obtains the largest PCC among the compared methods (Figure 4 (D)). These results show that wsPLS is more effective at capturing the latent patterns of the canonical vectors by selecting a specific subset of samples than the ℓ₀-sPLS and PMD-sPLS methods.
4.5 Application to Paired miRNA-lncRNA-mRNA Expression Data
To evaluate the mwsPLS methods on triple datasets with matched samples, we obtain the miRNA-lncRNA-mRNA transcript expression datasets across 751 Breast invasive carcinoma (BRCA) patients from the TCGA database [39]. We remove noisy features (miRNAs, lncRNAs and mRNAs) using a standard-deviation threshold. Finally, we obtain three data matrices: the miRNA expression matrix $X_1$, the lncRNA expression matrix $X_2$ and the mRNA expression matrix $X_3$.
We apply mwsPLS-scheme1 (Algorithm 2) and mwsPLS-scheme2 (Algorithm 3) to the BRCA miRNA-lncRNA-mRNA dataset to identify four subtype-specific miRNA-lncRNA-mRNA co-modules. We define an average correlation score to quantify the co-correlation level of a given co-module, whose three feature sets (i.e., the miRNA, lncRNA and mRNA subsets) and sample subset are determined by the non-zero entries of the learned vectors:
$$\text{score} = \frac{1}{N_{\mathrm{pair}}}\sum_{(a,b)}\big|\mathrm{PCC}(a,b)\big| \tag{29}$$
where $N_{\mathrm{pair}}$ is the number of all possible cross-type feature pairs $(a,b)$ in the co-module, and PCC$(\cdot,\cdot)$ calculates the pair-wise Pearson correlation coefficient based on the corresponding expression data restricted to the sample subset. Simply, the score is defined as the average of the absolute PCCs over all miRNA-lncRNA, miRNA-mRNA and lncRNA-mRNA pairs. We apply a permutation test to evaluate the significance of the score for a given co-module. We first construct 1000 random co-modules by randomly sampling the miRNA set, lncRNA set and sample set, and then compute the scores of these random co-modules. We find that the scores of all co-modules identified by Algorithms 2 and 3 are significantly higher than those of the random co-modules (Figure 5 (A)). Moreover, the objective function curves show the convergence of Algorithms 2 and 3 (Figure 5). As an example, we show the expression pattern of the first co-module identified by mwsPLS-scheme1 (Algorithm 2) in Figure 5 (C). Meanwhile, the correlation matrices (heatmaps) between any two types of molecules in the identified co-module are shown in Figure 5 (D). In terms of the scores in Figure 5 (A), mwsPLS-scheme1 (Algorithm 2) performs as well as or better than mwsPLS-scheme2 (Algorithm 3).
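The score of Eq. (29) averages absolute Pearson correlations over all cross-type feature pairs on the selected samples; a minimal sketch (function name ours, not from the source):

```python
import numpy as np
from itertools import combinations

def avg_abs_pcc(blocks, samples):
    """Average absolute Pearson correlation over all cross-type feature
    pairs of a co-module, restricted to the selected samples -- a sketch
    of the score in Eq. (29).  `blocks` holds the (samples x features)
    expression submatrices of the module's feature sets (e.g. the miRNA,
    lncRNA and mRNA subsets)."""
    total, count = 0.0, 0
    for A, B in combinations(blocks, 2):
        fa = A.shape[1]
        # cross-block part of the joint correlation matrix
        C = np.corrcoef(A[samples].T, B[samples].T)[:fa, fa:]
        total += np.abs(C).sum()
        count += C.size
    return total / count
```

When every feature in the module is perfectly (anti-)correlated with every other, the score attains its maximum of 1.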
To assess the biological functions of the gene sets in these identified co-modules, the co-modules identified by mwsPLS-scheme1 are used for further biological analysis. We retrieve the GO biological process information from the Molecular Signatures Database (MSigDB, http://www.gsea-msigdb.org/gsea/msigdb/index.jsp). Each GO biological process consists of a set of functionally related genes. The hypergeometric test is used for the statistical analysis. The gene sets of the identified co-modules are significantly enriched in a series of important biological processes (Table III).
Module | Biological process term | #gene | P-value |
---|---|---|---|
M1 | microtubule based process | 13 | 9.43e-06 |
microtubule cytoskeleton organization | 10 | 8.03e-06 | |
axonemal dynein complex assembly | 5 | 5.25e-05 | |
axoneme assembly | 6 | 1.11e-04 | |
regulation of microtubule based process | 6 | 2.17e-04 | |
M2 | immune system process | 104 | 4.52e-54 |
regulation of immune system process | 89 | 9.71e-53 | |
immune response | 79 | 2.53e-42 | |
regulation of immune response | 59 | 1.88e-38 | |
positive regulation of immune system process | 59 | 6.98e-37 | |
M3 | regulation of cellular component movement | 24 | 7.68e-08 |
response to ionizing radiation | 6 | 4.10e-06 | |
positive regulation of cell proliferation | 25 | 3.54e-06 | |
regulation of epithelial cell proliferation | 13 | 5.12e-06 | |
positive regulation of locomotion | 15 | 8.19e-06 | |
M4 | extracellular structure organization | 19 | 1.04e-09 |
organ morphogenesis | 33 | 8.30e-09 | |
skeletal system development | 21 | 7.56e-08 | |
regulation of multicellular organismal development | 43 | 5.22e-08 | |
tissue development | 43 | 7.11e-08 |
For example, the gene set in co-module 2 (M2) is significantly enriched in immune-related biological processes, including immune system process, regulation of immune system process and immune response, which have been reported to be related to breast cancer [40, 41]. The gene set in co-module 3 (M3) is significantly enriched in the regulation of epithelial cell proliferation, which has been reported to be associated with breast cancer [42]. These results show that the mwsPLS methods can effectively identify biologically related miRNA-lncRNA-mRNA co-modules.
5 Conclusion
In this paper, we propose an ℓ₀-wsPLS model for joint sample and feature selection, which uses ℓ₀-norm constraints for sample selection. It is difficult to solve the ℓ₀-wsPLS model because the ℓ₀-norm constraints are non-convex and non-smooth. Fortunately, we prove that the objective function of the ℓ₀-wsPLS model satisfies the KŁ property. Based on this, we propose a block proximal gradient algorithm within the PALM framework to solve it and show its convergence property. Furthermore, we extend the ℓ₀-wsPLS model for data fusion over three or more data matrices and propose two mwsPLS models, one in a sum manner and one in a product manner. Similarly, we propose two block proximal gradient algorithms to solve them and prove that both have theoretical convergence guarantees. Numerical and biomedical data experiments demonstrate the efficiency of our methods. The source code of the proposed wsPLS and mwsPLS methods is available at https://github.com/wenwenmin/wsPLS.
Acknowledgment
This work was supported by the National Natural Science Foundation of China (62262069, 61802157, 61902372) and the Open Project Program of the Yunnan Key Laboratory of Intelligent Systems and Computing (No. ISC22Z03).
References
- [1] C. Hutter and J. C. Zenklusen, “The cancer genome atlas: creating lasting value beyond its data,” Cell, vol. 173, no. 2, pp. 283–285, 2018.
- [2] J. D. Welch, V. Kozareva, A. Ferreira, C. Vanderburg, C. Martin, and E. Z. Macosko, “Single-cell multi-omic integration compares and contrasts features of brain cell identity,” Cell, vol. 177, no. 7, pp. 1873–1887, 2019.
- [3] A. Krishnan, L. J. Williams, A. R. McIntosh, and H. Abdi, “Partial least squares (PLS) methods for neuroimaging: a tutorial and review,” Neuroimage, vol. 56, no. 2, pp. 455–475, 2011.
- [4] A.-L. Boulesteix and K. Strimmer, “Partial least squares: a versatile tool for the analysis of high-dimensional genomic data,” Briefings in bioinformatics, vol. 8, no. 1, pp. 32–44, 2007.
- [5] H. Chen, Y. Sun, J. Gao, Y. Hu, and B. Yin, “Solving partial least squares regression via manifold optimization approaches,” IEEE Trans. Neural. Netw. Learn. Syst., vol. 30, no. 2, pp. 588–600, 2019.
- [6] F. Rohart, B. Gautier, A. Singh, and K.-A. Lê Cao, “mixOmics: An R package for ’omics feature selection and multiple data integration,” PLoS Comput. Biol., vol. 13, no. 11, p. e1005752, 2017.
- [7] K.-A. Lê Cao, D. Rossouw, C. Robert-Granié, and P. Besse, “A sparse PLS for variable selection when integrating omics data,” Stat. Appl. Genet. Mol. Biol., vol. 7, no. 1, 2008.
- [8] H. Chun and S. Keleş, “Sparse partial least squares regression for simultaneous dimension reduction and variable selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 1, pp. 3–25, 2010.
- [9] J. M. Monteiro, A. Rao, J. Shawe-Taylor, J. Mourão-Miranda, A. D. Initiative et al., “A multiple hold-out framework for sparse partial least squares,” Journal of neuroscience methods, vol. 271, pp. 182–194, 2016.
- [10] Q. Li, Y. Yan, and H. Wang, “Discriminative weighted sparse partial least squares for human detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1062–1071, 2016.
- [11] G. Zhu, Z. Su et al., “Envelope-based sparse partial least squares,” The Annals of Statistics, vol. 48, no. 1, pp. 161–182, 2020.
- [12] B. Liquet, P. L. de Micheaux, B. P. Hejblum, and R. Thiebaut, “Group and sparse group partial least square approaches applied in genomics context,” Bioinformatics, vol. 32, no. 1, pp. 35–42, 2016.
- [13] J. Chen and S. Zhang, “Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data,” Bioinformatics, vol. 32, no. 11, pp. 1724–1732, 2016.
- [14] W. Min, J. Liu, and S. Zhang, “Group-sparse SVD models via ℓ1- and ℓ0-norm penalties and their applications in biological data,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 2, pp. 536–550, 2021.
- [15] W. Min, X. Wan, T.-H. Chang, and S. Zhang, “A novel sparse graph-regularized singular value decomposition model and its application to genomic data analysis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 3842 –3856, 2022.
- [16] W. Min, T. Xu, X. Wan, and T.-H. Chang, “Structured sparse non-negative matrix factorization with ℓ2,0-norm,” IEEE Trans. Knowl. Data Eng., pp. 1–13, 2022.
- [17] I. Dagogo-Jack and A. T. Shaw, “Tumour heterogeneity and resistance to cancer therapies,” Nature reviews Clinical oncology, vol. 15, no. 2, pp. 81–94, 2018.
- [18] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Math. Program., vol. 146, no. 1-2, pp. 459–494, 2014.
- [19] J. Chen and S. Zhang, “Matrix integrative analysis (MIA) of multiple genomic data for modular patterns,” Frontiers in genetics, vol. 9, p. 194, 2018.
- [20] C. Meng, O. A. Zeleznik, G. G. Thallinger, B. Kuster, A. M. Gholami, and A. C. Culhane, “Dimension reduction techniques for the integrative analysis of multi-omics data,” Briefings in bioinformatics, vol. 17, no. 4, pp. 628–641, 2016.
- [21] N. Rappoport and R. Shamir, “Multi-omic and multi-view clustering algorithms: review and cancer benchmark,” Nucleic Acids Research, vol. 46, no. 20, pp. 10546–10562, 2018.
- [22] L. Wang and R.-C. Li, “A scalable algorithm for large-scale unsupervised multi-view partial least squares,” IEEE Transactions on Big Data, pp. 1–1, 2020.
- [23] Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao, “Binary multi-view clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1774–1782, 2019.
- [24] J. Han, J. Xu, F. Nie, and X. Li, “Multi-view k-means clustering with adaptive sparse memberships and weight allocation,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 2, pp. 816–827, 2022.
- [25] C. Hou, F. Nie, H. Tao, and D. Yi, “Multi-view unsupervised feature selection with adaptive similarity and view weight,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 9, pp. 1998–2011, 2017.
- [26] W. Hu, D. Lin, S. Cao, J. Liu, J. Chen, V. D. Calhoun, and Y.-P. Wang, “Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 2, pp. 390–399, 2018.
- [27] Z. Deng, R. Liu, P. Xu et al., “Multi-view clustering with the cooperation of visible and hidden views,” IEEE Trans. Knowl. Data Eng., pp. 1–1, 2020.
- [28] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
- [29] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” in International Statistical and Optimization Perspectives Workshop, 2005, pp. 34–51.
- [30] M. Asteris, A. Kyrillidis, O. Koyejo, and R. Poldrack, “A simple and provable algorithm for sparse diagonal CCA,” in ICML, 2016, pp. 1148–1157.
- [31] W. Min, J. Liu, and S. Zhang, “Sparse weighted canonical correlation analysis,” Chinese Journal of Electronics, vol. 27, no. 3, pp. 459–466, 2018.
- [32] D. M. Witten and R. J. Tibshirani, “Extensions of sparse canonical correlation analysis with applications to genomic data,” Stat. Appl. Genet. Mol. Biol., vol. 8, no. 1, 2009.
- [33] W. Min, J. Liu, F. Luo, and S. Zhang, “A two-stage method to identify joint modules from matched microRNA and mRNA expression data,” IEEE Trans. Nanobioscience, vol. 15, pp. 362–370, 2016.
- [34] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.
- [35] S. Saghafinia, M. Mina, N. Riggi, D. Hanahan, and G. Ciriello, “Pan-cancer landscape of aberrant DNA methylation across human tumors,” Cell Reports, vol. 25, no. 4, pp. 1066–1080, 2018.
- [36] Z. Wang, E. Curry, and G. Montana, “Network-guided regression for detecting associations between DNA methylation and gene expression,” Bioinformatics, vol. 30, no. 19, pp. 2693–2701, 2014.
- [37] Y. Li and Z. Zhang, “Potential microRNA-mediated oncogenic intercellular communication revealed by pan-cancer analysis,” Scientific reports, vol. 4, no. 1, pp. 1–15, 2014.
- [38] Cancer Genome Atlas Research Network et al., “Comprehensive molecular profiling of lung adenocarcinoma,” Nature, vol. 511, no. 7511, pp. 543–550, 2014.
- [39] Cancer Genome Atlas Network et al., “Comprehensive molecular portraits of human breast tumours,” Nature, vol. 490, no. 7418, p. 61, 2012.
- [40] C. Desmedt, B. Haibe-Kains et al., “Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes,” Clinical cancer research, vol. 14, no. 16, pp. 5158–5165, 2008.
- [41] J. P. Bates, R. Derakhshandeh, L. Jones, and T. J. Webb, “Mechanisms of immune evasion in breast cancer,” BMC cancer, vol. 18, no. 1, pp. 1–14, 2018.
- [42] M. A. Cichon, C. M. Nelson, and D. C. Radisky, “Regulation of epithelial-mesenchymal transition in breast cancer cells by cell contact and adhesion,” Cancer Informatics, vol. 14, p. CIN.S18965, 2015.
Wenwen Min is currently an associate professor in the School of Information Science and Engineering, Yunnan University. He received the Ph.D. degree in Computer Science from the School of Computer Science, Wuhan University, in December 2017. He was a visiting Ph.D. student at the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, from 2015 to 2017, and a Postdoctoral Researcher with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China, from 2019 to 2021. His current research interests include data mining, machine learning, sparse optimization, and bioinformatics. He has authored about 20 papers in journals and conferences, such as IEEE TKDE, IEEE TNNLS, PLoS Comput Biol, Bioinformatics, and IEEE/ACM TCBB.
Taosheng Xu is currently an associate professor in the Hefei Institutes of Physical Science, Chinese Academy of Sciences. He received his B.Sc. (2009) from Hefei University of Technology and his Ph.D. degree in Control Science and Engineering in 2016 from the University of Science and Technology of China. He was a Postdoctoral Research Fellow at the Arieh Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, from 2020 to 2022. His current research interests include data mining, machine learning, bioinformatics, and cancer genomics.
Chris Ding received his Ph.D. degree from Columbia University in 1987. He is currently a chair professor at The Chinese University of Hong Kong, Shenzhen. Before that, he worked at the California Institute of Technology, the Jet Propulsion Laboratory, Lawrence Berkeley National Laboratory, and the University of Texas. His research areas are machine learning, data mining, computational biology, and high-performance computing. He proposed the ℓ2,1 norm, which is widely used in machine learning, and made significant contributions to principal component analysis, non-negative matrix factorization, and feature selection. He designed the minimum redundancy maximum relevance feature selection algorithm, which is widely adopted, e.g., by Uber; two related papers have been cited 11,700 times. He has published over 200 research papers with over 50,000 citations.