
Weighted Sparse Partial Least Squares for Joint Sample and Feature Selection

Wenwen Min, Taosheng Xu and Chris Ding Wenwen Min is with the Yunnan Key Laboratory of Intelligent Systems and Computing, School of Information Science and Engineering, Yunnan University, Kunming 650091, Yunnan, China. E-mail: [email protected]. Taosheng Xu is with the Institute of Intelligent Machines, Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei, 230031, Anhui, China. E-mail: [email protected]. Chris Ding is with the School of Data Science, The Chinese University of Hong Kong, Shenzhen, China. E-mail: [email protected]. Manuscript received XX, 2022; revised XX, 2023.
Abstract

Sparse Partial Least Squares (sPLS) is a common dimensionality reduction technique for data fusion, which projects data samples from two views onto a low-dimensional space by seeking linear combinations of a small number of variables with maximum covariance. However, sPLS extracts the combinations between two data sets using all samples, so it cannot detect latent subsets of samples. To extend the applicability of sPLS by identifying a specific subset of samples and removing outliers, we propose an $\ell_{\infty}/\ell_{0}$-norm constrained weighted sparse PLS ($\ell_{\infty}/\ell_{0}$-wsPLS) method for joint sample and feature selection, where the $\ell_{\infty}/\ell_{0}$-norm constraints are used to select a subset of samples. We prove that the $\ell_{\infty}/\ell_{0}$-norm constraints have the Kurdyka-Łojasiewicz property, so that a globally convergent algorithm can be developed to solve the model. Moreover, multi-view data with a common set of samples arise in many real problems. To this end, we extend the $\ell_{\infty}/\ell_{0}$-wsPLS model and propose two multi-view wsPLS models for multi-view data fusion. We develop an efficient iterative algorithm for each multi-view wsPLS model and establish its convergence property. Experiments on numerical and biomedical data demonstrate the efficiency of the proposed methods.

Index Terms:
$\ell_{\infty}/\ell_{0}$-norm, sparse PLS, sparse CCA, multi-view learning, nonconvex optimization.

1 Introduction

A number of paired biomedical omics datasets from the same cohorts of patients have been accumulated [1, 2]. These invaluable datasets provide new opportunities to explore cooperative mechanisms between different types of biological molecules. Partial Least Squares (PLS) is a popular technique for finding correlations across two data matrices by learning two latent vectors, one for each representation, that maximize covariance in a new low-dimensional space [3, 4, 5]. Recently, PLS has been widely used for fusing paired omics data [6], such as DNA copy number variation and gene expression data, DNA methylation and gene expression data, etc.

However, classical PLS suffers from overfitting when dealing with high-dimensional biomedical data with small sample sizes. In addition, the outputs of PLS (i.e., the latent vectors) are combinations of all variables, which are difficult to interpret. To address these problems, sparse PLS (sPLS) methods have been proposed to avoid overfitting and improve the interpretability of the model by learning two sparse latent vectors [7, 8, 9, 10, 11]. Moreover, several extensions of sPLS have been developed using structured sparse penalties, such as structured sparse PLS and matrix factorization methods [12, 13, 14, 15, 16]. These extensions improve performance on paired biomedical omics datasets by integrating structural information about the variables. For example, Chen et al. [13] proposed a sparse network-regularized PLS method to identify common modules (co-modules) using matched gene-expression and drug-response data. However, sPLS and its variants often perform poorly on high-dimensional biomedical datasets with irrelevant features and noisy samples. The main reason is that they are susceptible to outliers and noisy samples, which cause the learned projection vectors to deviate from the desired direction. On the one hand, most medical datasets are high-dimensional and contaminated due to experimental conditions and environmental factors. On the other hand, even patients with the same disease show heterogeneity, e.g., in cancer [17]. It is therefore necessary to mine sample-specific co-modules from two-view or multi-view data.

To this end, we develop an $\ell_{\infty}/\ell_{0}$-norm constrained weighted sparse PLS ($\ell_{\infty}/\ell_{0}$-wsPLS) model for joint sample and feature selection, which simultaneously selects features and samples on paired datasets (Figure 1). The $\ell_{\infty}/\ell_{0}$-norm constraint is used to select a subset of samples, while the $\ell_{0}$-norm constraints are used to select features. However, it is difficult to find an effective convergent algorithm for the $\ell_{\infty}/\ell_{0}$-wsPLS model in the traditional way because the $\ell_{\infty}/\ell_{0}$-norm and the $\ell_{0}$-norm are non-convex and non-smooth. Fortunately, we prove that the $\ell_{\infty}/\ell_{0}$-norm and $\ell_{0}$-norm constraints satisfy the Kurdyka-Łojasiewicz (KŁ) property. Inspired by the Proximal Alternating Linearized Minimization (PALM) method [18], we develop a block proximal gradient algorithm to solve the $\ell_{\infty}/\ell_{0}$-wsPLS model and prove that it converges to a critical point from any initial point.

Furthermore, multi-view biomedical data have accumulated at an unprecedented rate in recent years [19, 20, 21, 22]. These multi-view biomedical data make it possible to study the molecular mechanisms of cancer pathogenesis. Many multi-view learning methods have been proposed to exploit cancer multi-omics data and improve generalization ability [23, 24, 25, 26, 27, 28], but mining co-modules across multi-view biomedical data remains a challenge. To this end, we extend the wsPLS model for multi-view omics data analysis and propose two multi-view weighted sparse PLS (mwsPLS) models with two different schemes. For each mwsPLS model, we develop an efficient iterative algorithm based on the PALM framework and prove its convergence.

In this paper, we first show that the proposed methods significantly improve the performance of joint sample and feature selection on simulated data. Second, we demonstrate the ability of $\ell_{\infty}/\ell_{0}$-wsPLS to identify sample-specific co-modules on paired DNA copy number variation and gene expression data, and on paired DNA methylation and gene expression data, respectively. The results show that $\ell_{\infty}/\ell_{0}$-wsPLS outperforms the other methods and obtains higher correlation coefficients by removing noisy samples. Third, we demonstrate the ability of mwsPLS to discover cancer subtype-specific co-modules on paired miRNA-lncRNA-mRNA expression data of breast cancer. The results show that mwsPLS identifies several important miRNA-lncRNA-mRNA co-modules that are involved in breast cancer related biological processes. The main contributions of this paper are highlighted as follows:

  1. We propose an $\ell_{\infty}/\ell_{0}$-norm constrained wsPLS model for joint sample and feature selection, where the $\ell_{\infty}/\ell_{0}$-norm constraints are used for sample selection. We prove that the $\ell_{\infty}/\ell_{0}$-norm constraints satisfy the KŁ property, so that a globally convergent algorithm can be developed to solve the $\ell_{\infty}/\ell_{0}$-wsPLS model.

  2. To integrate more than two datasets, we extend the $\ell_{\infty}/\ell_{0}$-wsPLS model and propose two multi-view wsPLS (mwsPLS) models with two different schemes. For each mwsPLS model, we develop an efficient iterative algorithm based on the PALM framework and prove its convergence.

  3. We apply the $\ell_{\infty}/\ell_{0}$-wsPLS and mwsPLS methods to simulated and cancer multi-omics data and compare them with competing methods.

Table I: Summary of notations.

| Notation | Meaning |
| --- | --- |
| Normal font, e.g., $x$ | A scalar |
| Bold lowercase, e.g., $\bm{u}$ | A vector |
| Bold capital, e.g., $\bm{X}$ | A matrix |
| $\bm{X}_{i}$ | An $n$-by-$p_{i}$ matrix |
| $\bm{X}$ | An $n$-by-$p$ matrix |
| $\bm{Y}$ | An $n$-by-$q$ matrix |
| $\|\cdot\|$ or $\|\cdot\|_{2}$ | $\ell_{2}$-norm of a vector |
| $\|\cdot\|_{1}$ | $\ell_{1}$-norm of a vector |
| $\|\cdot\|_{0}$ | $\ell_{0}$-norm of a vector |
| $\|\bm{w}\|_{\infty}$ | $\|\bm{w}\|_{\infty}=\max_{j}|w_{j}|$ |
| $\odot$ | Element-wise product of two vectors |

Figure 1: Overview of the proposed $\ell_{\infty}/\ell_{0}$-norm constrained wsPLS model for joint sample and feature selection, where the $\ell_{\infty}/\ell_{0}$-norm constraints are used to select a subset of samples. The $\ell_{\infty}/\ell_{0}$-wsPLS model not only obtains two sparse canonical vectors $\bm{u}$ and $\bm{v}$, but also identifies a set of samples based on the non-zero elements of $\bm{w}$. In this toy example, $\ell_{\infty}/\ell_{0}$-wsPLS is used to extract three co-modules.

2 Weighted Sparse PLS (wsPLS)

2.1 wsPLS Model

Assume that we are given two data matrices $\bm{X}\in\mathbb{R}^{n\times p}$ and $\bm{Y}\in\mathbb{R}^{n\times q}$ with the same set of samples, whose columns are standardized to have zero mean and unit variance. PLS is equivalent to solving the following optimization problem [29]:

$$\underset{\bm{u},\bm{v}}{\text{maximize}}\ \ \bm{u}^{T}\bm{X}^{T}\bm{Y}\bm{v} \quad \text{subject to}\ \ \|\bm{u}\|=\|\bm{v}\|=1. \tag{1}$$

The basic idea of PLS is to obtain two projection vectors $\bm{u}$ and $\bm{v}$ by maximizing the covariance between the latent vectors $\bm{X}\bm{u}$ and $\bm{Y}\bm{v}$. Specifically, a sparse PLS (sPLS) with $\ell_{0}$-norm constraints is formulated as:

$$\underset{\bm{u},\bm{v}}{\text{maximize}}\ \ \bm{u}^{T}\bm{X}^{T}\bm{Y}\bm{v} \quad \text{subject to}\ \ \|\bm{u}\|_{0}\leq k_{u},\ \|\bm{v}\|_{0}\leq k_{v},\ \|\bm{u}\|=\|\bm{v}\|=1, \tag{2}$$

where $k_{u}$ and $k_{v}$ are two parameters that control the sparsity of $\bm{u}$ and $\bm{v}$. The above model is also called the sparse diagonal CCA problem [30, 31].

We observe that the objective function in Eq. (2) can be written as $\sum_{i=1}^{n}(\bm{X}\bm{u})_{i}(\bm{Y}\bm{v})_{i}$. To account for differences between samples, we assign a weight coefficient to each sample, modify the objective function to $\sum_{i=1}^{n}w_{i}(\bm{X}\bm{u})_{i}(\bm{Y}\bm{v})_{i}=\bm{u}^{T}\bm{X}^{T}\mbox{diag}(\bm{w})\bm{Y}\bm{v}$, and propose a weighted sparse PLS model with $\ell_{\infty}/\ell_{0}$-norm constraints, denoted $\ell_{\infty}/\ell_{0}$-wsPLS (or wsPLS for short):

$$\begin{aligned}\underset{\bm{u},\bm{v},\bm{w}}{\text{maximize}}\ \ & \bm{u}^{T}\bm{X}^{T}\mbox{diag}(\bm{w})\bm{Y}\bm{v} \\ \text{subject to}\ \ & \|\bm{u}\|_{0}\leq k_{u},\ \|\bm{v}\|_{0}\leq k_{v},\ \|\bm{u}\|=\|\bm{v}\|=1, \\ & \|\bm{w}\|_{\infty}\leq 1,\ \|\bm{w}\|_{0}\leq k_{w},\ w_{j}\geq 0\ \forall j,\end{aligned} \tag{3}$$

where $\|\bm{w}\|_{\infty}=\max_{j}|w_{j}|$ and $\mbox{diag}(\bm{w})\in\mathbb{R}_{+}^{n\times n}$ is a diagonal matrix whose diagonal entries $w_{1},w_{2},\cdots,w_{n}$ are the sample weight coefficients; the $\ell_{\infty}/\ell_{0}$-norm constraints are used to select a subset of samples. The mathematical notations are summarized in Table I.

2.2 Optimization Algorithm

Eq. (3) is equivalent to the following optimization problem because $\bm{u}^{T}\bm{X}^{T}\mbox{diag}(\bm{w})\bm{Y}\bm{v}=\bm{w}^{T}[(\bm{X}\bm{u})\odot(\bm{Y}\bm{v})]$:

$$\begin{aligned}\underset{\bm{u},\bm{v},\bm{w}}{\text{minimize}}\ \ & f(\bm{u},\bm{v},\bm{w})=-\bm{w}^{T}[(\bm{X}\bm{u})\odot(\bm{Y}\bm{v})] \\ \text{subject to}\ \ & \|\bm{u}\|_{0}\leq k_{u},\ \|\bm{v}\|_{0}\leq k_{v},\ \|\bm{u}\|=\|\bm{v}\|=1, \\ & \|\bm{w}\|_{\infty}\leq 1,\ \|\bm{w}\|_{0}\leq k_{w},\ w_{j}\geq 0\ \forall j.\end{aligned} \tag{4}$$

Obviously, the $\ell_{\infty}/\ell_{0}$-wsPLS model reduces to the sPLS model when all elements of $\bm{w}$ equal one. We also observe that $f(\bm{u},\bm{v},\bm{w})$ equals $-\bm{u}^{T}\bm{X}^{T}[\bm{w}\odot(\bm{Y}\bm{v})]$ and $-\bm{v}^{T}\bm{Y}^{T}[\bm{w}\odot(\bm{X}\bm{u})]$. Therefore, we have

$$\begin{aligned}\nabla_{\bm{u}}f(\bm{u},\bm{v},\bm{w}) &= -\bm{X}^{T}[\bm{w}\odot(\bm{Y}\bm{v})], \\ \nabla_{\bm{v}}f(\bm{u},\bm{v},\bm{w}) &= -\bm{Y}^{T}[\bm{w}\odot(\bm{X}\bm{u})], \\ \nabla_{\bm{w}}f(\bm{u},\bm{v},\bm{w}) &= -(\bm{X}\bm{u})\odot(\bm{Y}\bm{v}).\end{aligned} \tag{5}$$

Furthermore, the Hessian matrices of $f(\bm{u},\bm{v},\bm{w})$ with respect to $\bm{u}$, $\bm{v}$ and $\bm{w}$ are all zero matrices. Thus, $\nabla_{\bm{u}}f(\bm{u},\bm{v},\bm{w})$, $\nabla_{\bm{v}}f(\bm{u},\bm{v},\bm{w})$ and $\nabla_{\bm{w}}f(\bm{u},\bm{v},\bm{w})$ are Lipschitz continuous, and their Lipschitz constants can be taken as any constants greater than zero ($L_{u}>0$, $L_{v}>0$, and $L_{w}>0$).
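For concreteness, the following minimal NumPy sketch (an illustration under our notation, not the authors' released implementation) evaluates the objective in Eq. (4) and the block gradients in Eq. (5), and numerically checks the identity $\bm{u}^{T}\bm{X}^{T}\mbox{diag}(\bm{w})\bm{Y}\bm{v}=\bm{w}^{T}[(\bm{X}\bm{u})\odot(\bm{Y}\bm{v})]$; all function names are ours.

```python
import numpy as np

def wspls_objective(X, Y, u, v, w):
    """f(u, v, w) = -w^T [(Xu) ⊙ (Yv)], the objective of Eq. (4)."""
    return -np.dot(w, (X @ u) * (Y @ v))

def wspls_gradients(X, Y, u, v, w):
    """Block gradients of f from Eq. (5)."""
    Xu, Yv = X @ u, Y @ v
    grad_u = -X.T @ (w * Yv)
    grad_v = -Y.T @ (w * Xu)
    grad_w = -(Xu * Yv)
    return grad_u, grad_v, grad_w

# Numerical check of u^T X^T diag(w) Y v = w^T [(Xu) ⊙ (Yv)]
rng = np.random.default_rng(0)
n, p, q = 50, 80, 100
X, Y = rng.standard_normal((n, p)), rng.standard_normal((n, q))
u, v, w = rng.standard_normal(p), rng.standard_normal(q), rng.random(n)
lhs = u @ X.T @ np.diag(w) @ Y @ v
assert np.isclose(lhs, -wspls_objective(X, Y, u, v, w))
```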

Recently, the PALM algorithm [18] was proposed to solve a class of non-convex and non-smooth problems. Based on the PALM framework, we develop a block proximal gradient algorithm to solve (4) by iteratively updating each variable while fixing the others.

1) Computation of $\bm{u}^{t+1}$ using $\bm{u}^{t}$, $\bm{v}^{t}$ and $\bm{w}^{t}$. With $\bm{v}$ and $\bm{w}$ fixed, we perform the proximal gradient step for $\bm{u}^{t+1}$:

$$\underset{\bm{u}}{\text{minimize}}\ \ \|\bm{u}-\overline{\bm{u}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{u}\|_{0}\leq k_{u},\ \|\bm{u}\|=1, \tag{6}$$

where $\overline{\bm{u}}=\bm{u}^{t}-\frac{1}{L_{u}}\nabla_{\bm{u}}f(\bm{u}^{t},\bm{v}^{t},\bm{w}^{t})$. To solve problem (6), we define an $\ell_{0}$-ball projection problem in Eq. (7) and give its closed-form solution in Proposition 1.

$$\bm{x}^{*}=\underset{\bm{x}}{\text{argmin}}\ \{\|\bm{x}-\bm{z}\|_{2}^{2}:\ \|\bm{x}\|_{0}\leq k,\ \|\bm{x}\|=1\} \tag{7}$$

Proposition 1.

Suppose that $\bm{z}$ is a non-zero vector; then the solution of problem (7) is $\bm{x}^{*}=\frac{T_{k}(\bm{z})}{\|T_{k}(\bm{z})\|}$, where $T_{k}(\bm{z})$ keeps the $k$ entries of $\bm{z}$ with the largest absolute values and sets the other entries to zero.

Based on Proposition 1, we obtain the closed-form solution of Eq. (6) as follows:

$$\bm{u}^{t+1}:=\frac{T_{k_{u}}(\overline{\bm{u}})}{\|T_{k_{u}}(\overline{\bm{u}})\|}, \tag{8}$$

where $\overline{\bm{u}}=\bm{u}^{t}-\frac{1}{L_{u}}\nabla_{\bm{u}}f(\bm{u}^{t},\bm{v}^{t},\bm{w}^{t})$, the partial derivative is $\nabla_{\bm{u}}f(\bm{u},\bm{v},\bm{w})=-\bm{X}^{T}[\bm{w}\odot(\bm{Y}\bm{v})]$, and the Lipschitz constant $L_{u}$ can be set to any constant greater than zero.
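The update (8) amounts to a hard-thresholding step followed by renormalization. A minimal sketch of the projection in Proposition 1 and the resulting $\bm{u}$-update, assuming NumPy (function names are ours, not from the released code):

```python
import numpy as np

def top_k(z, k):
    """T_k(z): keep the k entries of z with the largest absolute values, zero out the rest."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]          # indices of the k largest |z_j|
    out[idx] = z[idx]
    return out

def project_l0_sphere(z, k):
    """Closed-form solution of Eq. (7): x* = T_k(z) / ||T_k(z)|| (z assumed non-zero)."""
    tk = top_k(z, k)
    return tk / np.linalg.norm(tk)

def update_u(X, Y, u, v, w, k_u, L_u=1.0):
    """Update (8): gradient step on u followed by projection onto {||u||_0 <= k_u, ||u|| = 1}."""
    grad_u = -X.T @ (w * (Y @ v))             # Eq. (5)
    return project_l0_sphere(u - grad_u / L_u, k_u)
```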

2) Computation of $\bm{v}^{t+1}$ using $\bm{u}^{t+1}$, $\bm{v}^{t}$ and $\bm{w}^{t}$. With $\bm{u}$ and $\bm{w}$ fixed, we perform the proximal gradient step for $\bm{v}^{t+1}$:

$$\underset{\bm{v}}{\text{minimize}}\ \ \|\bm{v}-\overline{\bm{v}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{v}\|_{0}\leq k_{v},\ \|\bm{v}\|=1, \tag{9}$$

where $\overline{\bm{v}}=\bm{v}^{t}-\frac{1}{L_{v}}\nabla_{\bm{v}}f(\bm{u}^{t+1},\bm{v}^{t},\bm{w}^{t})$. Based on Proposition 1, we obtain the closed-form solution of Eq. (9) as follows:

$$\bm{v}^{t+1}:=\frac{T_{k_{v}}(\overline{\bm{v}})}{\|T_{k_{v}}(\overline{\bm{v}})\|}, \tag{10}$$

where $\overline{\bm{v}}=\bm{v}^{t}-\frac{1}{L_{v}}\nabla_{\bm{v}}f(\bm{u}^{t+1},\bm{v}^{t},\bm{w}^{t})$, the partial derivative is $\nabla_{\bm{v}}f(\bm{u},\bm{v},\bm{w})=-\bm{Y}^{T}[\bm{w}\odot(\bm{X}\bm{u})]$, and the Lipschitz constant $L_{v}$ can be set to any constant greater than zero.

3) Computation of $\bm{w}^{t+1}$ using $\bm{u}^{t+1}$, $\bm{v}^{t+1}$ and $\bm{w}^{t}$. With $\bm{u}$ and $\bm{v}$ fixed, we perform the proximal gradient step for $\bm{w}^{t+1}$:

$$\underset{\bm{w}}{\text{minimize}}\ \ \|\bm{w}-\overline{\bm{w}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{w}\|_{0}\leq k_{w},\ \|\bm{w}\|_{\infty}\leq 1,\ w_{j}\geq 0\ \forall j, \tag{11}$$

where $\overline{\bm{w}}=\bm{w}^{t}-\frac{1}{L_{w}}\nabla_{\bm{w}}f(\bm{u}^{t+1},\bm{v}^{t+1},\bm{w}^{t})$. To solve problem (11), we define a projection $\widehat{\bm{w}}:=P_{\infty,+}(\bm{w})$ for an arbitrary vector $\bm{w}$, which satisfies, for all $j$:

$$\widehat{w}_{j}=\left\{\begin{array}{cl}0,&\text{if}\ w_{j}\leq 0,\\ 1,&\text{if}\ w_{j}\geq 1,\\ w_{j},&\text{otherwise}.\end{array}\right. \tag{12}$$

We obtain the closed-form solution of Eq. (11) using the following Proposition 2.

Proposition 2.

The optimal solution of problem (11) is $\bm{w}^{*}=T_{k_{w}}(P_{\infty,+}(\overline{\bm{w}}))$.

Based on Proposition 2, we obtain the update rule for $\bm{w}^{t+1}$:

$$\bm{w}^{t+1}:=T_{k_{w}}(P_{\infty,+}(\overline{\bm{w}})), \tag{13}$$

where $\overline{\bm{w}}=\bm{w}^{t}-\frac{1}{L_{w}}\nabla_{\bm{w}}f(\bm{u}^{t+1},\bm{v}^{t+1},\bm{w}^{t})$, $P_{\infty,+}(\cdot)$ is defined in Eq. (12), $\nabla_{\bm{w}}f(\bm{u},\bm{v},\bm{w})=-(\bm{X}\bm{u})\odot(\bm{Y}\bm{v})$, and the Lipschitz constant $L_{w}$ can be set to any constant greater than zero.
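A corresponding sketch of the $\bm{w}$-update in Eqs. (12)-(13), reusing the hypothetical `top_k` helper from the sketch above:

```python
import numpy as np

def project_box_nonneg(w):
    """P_{∞,+}(w) from Eq. (12): clip each entry into the interval [0, 1]."""
    return np.clip(w, 0.0, 1.0)

def update_w(X, Y, u, v, w, k_w, L_w=1.0):
    """Update (13): gradient step on w, clip to [0, 1], then keep the k_w largest entries."""
    grad_w = -(X @ u) * (Y @ v)               # Eq. (5)
    return top_k(project_box_nonneg(w - grad_w / L_w), k_w)
```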

Algorithm 1: $\ell_{\infty}/\ell_{0}$-norm constrained wsPLS ($\ell_{\infty}/\ell_{0}$-wsPLS)
Input: Data matrices {$\bm{X}\in\mathbb{R}^{n\times p}$, $\bm{Y}\in\mathbb{R}^{n\times q}$}, parameters {$k_{u}$, $k_{v}$, $k_{w}$}, and Lipschitz constants {$L_{u}$, $L_{v}$, $L_{w}$}.
Output: $\bm{u}$, $\bm{v}$, and $\bm{w}$.
1: Normalize each column of $\bm{X}$ and $\bm{Y}$ to have zero mean and unit variance.
2: Initialize $(\bm{u}^{0},\bm{v}^{0},\bm{w}^{0})$ and set $t=0$.
3: repeat
4:   Compute $\bm{u}^{t+1}$ according to Eq. (8).
5:   Compute $\bm{v}^{t+1}$ according to Eq. (10).
6:   Compute $\bm{w}^{t+1}$ according to Eq. (13).
7:   $t=t+1$
8: until convergence.
9: return $\bm{u}:=\bm{u}^{t}$, $\bm{v}:=\bm{v}^{t}$ and $\bm{w}:=\bm{w}^{t}$.

Combining (8), (10) and (13), we obtain the block proximal gradient algorithm for solving Eq. (4) (see Algorithm 1).

Initialization. We assume that all samples contribute equally and initialize $\bm{w}$ as $\bm{w}^{0}=\bm{1}$ (all entries equal to one) in Algorithm 1. Furthermore, Algorithm 1 is run from multiple random initial points $\bm{u}^{0}$ and $\bm{v}^{0}$, and the solution with the best objective function value is selected.

Stopping Condition. The algorithm can be stopped in either of the following ways: (i) the stopping criterion is

$$\|\bm{u}^{t}-\bm{u}^{t-1}\|+\|\bm{v}^{t}-\bm{v}^{t-1}\|+\|\bm{w}^{t}-\bm{w}^{t-1}\|<10^{-5}, \tag{14}$$

where $\bm{u}^{t}$, $\bm{v}^{t}$, and $\bm{w}^{t}$ denote the values of $\bm{u}$, $\bm{v}$, and $\bm{w}$ at the $t$-th iteration; (ii) the stopping criterion is

$$\frac{|\mbox{obj}(t)-\mbox{obj}(t-1)|}{\mbox{obj}(t-1)}<10^{-5}, \tag{15}$$

where $\mbox{obj}(t)$ is the objective function value after the $t$-th iteration.
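Putting the pieces together, a compact sketch of Algorithm 1 follows, assuming NumPy and the `project_l0_sphere` and `top_k` helpers sketched above; parameter names and defaults are illustrative, not the released implementation:

```python
import numpy as np

def wspls(X, Y, k_u, k_v, k_w, L=1.0, max_iter=100, tol=1e-5, seed=0):
    """Block proximal gradient iterations for the ℓ∞/ℓ0-wsPLS model in Eq. (4)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    q = Y.shape[1]
    u = rng.standard_normal(p); u /= np.linalg.norm(u)
    v = rng.standard_normal(q); v /= np.linalg.norm(v)
    w = np.ones(n)                            # all samples weighted equally at the start
    for _ in range(max_iter):
        u_old, v_old, w_old = u.copy(), v.copy(), w.copy()
        u = project_l0_sphere(u + X.T @ (w * (Y @ v)) / L, k_u)            # Eq. (8)
        v = project_l0_sphere(v + Y.T @ (w * (X @ u)) / L, k_v)            # Eq. (10)
        w = top_k(np.clip(w + (X @ u) * (Y @ v) / L, 0.0, 1.0), k_w)       # Eq. (13)
        # Stopping criterion (14)
        if (np.linalg.norm(u - u_old) + np.linalg.norm(v - v_old)
                + np.linalg.norm(w - w_old)) < tol:
            break
    return u, v, w
```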

2.3 Complexity Analysis

The computational complexity of Algorithm 1 is dominated by the computation of Eq. (8), Eq. (10) and Eq. (13). Computing Eq. (8) mainly requires $\nabla_{\bm{u}}f(\bm{u},\bm{v},\bm{w})=-\bm{X}^{T}[\bm{w}\odot(\bm{Y}\bm{v})]$, which costs $\mathcal{O}(np+nq+n)$. Computing Eq. (10) mainly requires $\nabla_{\bm{v}}f(\bm{u},\bm{v},\bm{w})=-\bm{Y}^{T}[\bm{w}\odot(\bm{X}\bm{u})]$, which costs $\mathcal{O}(nq+np+n)$. Similarly, computing Eq. (13) mainly requires $\nabla_{\bm{w}}f(\bm{u},\bm{v},\bm{w})=-(\bm{X}\bm{u})\odot(\bm{Y}\bm{v})$, which costs $\mathcal{O}(nq+np+n)$. Overall, the complexity of Algorithm 1 is $\mathcal{O}(tnp+tnq+tn)$, where $t$ is the number of iterations, empirically set to 20 in our experiments.

2.4 Convergence Analysis

In this subsection, we show that Algorithm 1 has theoretical convergence guarantees (Theorem 1).

Lemma 1.

$\Phi=-\bm{u}^{T}\bm{X}^{T}\mbox{diag}(\bm{w})\bm{Y}\bm{v}+\delta_{\|\bm{u}\|=1}+\delta_{\|\bm{u}\|_{0}\leq k_{u}}+\delta_{\|\bm{v}\|=1}+\delta_{\|\bm{v}\|_{0}\leq k_{v}}+\delta_{\|\bm{w}\|_{\infty}\leq 1}+\delta_{\|\bm{w}\|_{0}\leq k_{w}}+\delta_{\bm{w}\geq 0}$ has the Kurdyka-Łojasiewicz (KŁ) property.

Proof.

Inspired by [18], we have: (1) $\delta_{\|\bm{u}\|_{0}\leq k_{u}}$ and $\delta_{\|\bm{u}\|^{2}=1}$ are semi-algebraic functions; (2) the $\|\bm{w}\|_{\infty}$ ball is defined by linear inequalities and is therefore a convex, semi-algebraic set, so $\delta_{\|\bm{w}\|_{\infty}\leq 1}$ is a semi-algebraic function; (3) the constraint set $\bm{w}\geq 0$ is likewise convex and semi-algebraic, so $\delta_{\bm{w}\geq 0}$ is a semi-algebraic function. Then, by Theorem 3 in [18], $\Phi$ is a semi-algebraic function and has the KŁ property. ∎

Theorem 1.

(Global convergence of Algorithm 1) Let $\{(\bm{u}^{t},\bm{v}^{t},\bm{w}^{t})\}_{t\in\mathbb{N}}$ be the sequence generated by Algorithm 1 from any initial point $(\bm{u}^{0},\bm{v}^{0},\bm{w}^{0})$. Then the objective function value $f(\bm{u}^{t},\bm{v}^{t},\bm{w}^{t})$ is non-increasing and $\{(\bm{u}^{t},\bm{v}^{t},\bm{w}^{t})\}_{t\in\mathbb{N}}$ converges to a critical point.

Proof.

Reference [18] shows that the PALM algorithm monotonically decreases the objective and converges to a critical point provided the objective function satisfies the KŁ property. Since Lemma 1 shows that the objective function of the $\ell_{\infty}/\ell_{0}$-wsPLS model satisfies the KŁ property and Algorithm 1 is an instance of the PALM framework, Theorem 1 holds. ∎

2.5 Determination of co-modules

A co-module is determined in two steps: (1) extract a submatrix of the data matrix $\bm{X}\in\mathbb{R}^{n\times p}$ whose rows and columns correspond to the non-zero entries of $\bm{w}$ and $\bm{u}$, respectively; (2) extract another submatrix of the data matrix $\bm{Y}\in\mathbb{R}^{n\times q}$ whose rows and columns correspond to the non-zero entries of $\bm{w}$ and $\bm{v}$, respectively. The combination of the two extracted submatrices is called a co-module (see Figure 1).

We now explain how wsPLS identifies multiple co-modules. After identifying the first co-module using Algorithm 1, we update the matrices $\bm{X}$ and $\bm{Y}$ by removing the rows (samples) belonging to the first co-module, and we repeatedly apply Algorithm 1 to the updated data matrices to identify the next co-module (see the three co-modules in the rightmost panel of Figure 1).

3 Multi-view Weighted Sparse PLS

To identify co-modules in multi-view data with three or more data matrices, we extend wsPLS and propose two multi-view wsPLS (mwsPLS) models with different schemes.

3.1 Scheme 1: mwsPLS with a sum formulation

Assume that $m$ matrices $\bm{X}_{i}\in\mathbb{R}^{n\times p_{i}}$ ($i=1,\cdots,m$) over a common sample set are given and that their columns are normalized to zero mean and unit variance. Similar to [32], we propose a multi-view wsPLS (mwsPLS) model based on a sum formulation, referred to as mwsPLS-scheme1:

$$\begin{aligned}\underset{\bm{u}_{1},\cdots,\bm{u}_{m},\bm{w}}{\text{maximize}}\ \ & \sum_{i<j}\bm{u}_{i}^{T}\bm{X}_{i}^{T}\mbox{diag}(\bm{w})\bm{X}_{j}\bm{u}_{j} \\ \text{subject to}\ \ & \|\bm{u}_{i}\|=1,\ \|\bm{u}_{i}\|_{0}\leq k_{i}\ \forall i, \\ & \|\bm{w}\|_{\infty}\leq 1,\ \|\bm{w}\|_{0}\leq k_{w},\ \bm{w}\geq 0.\end{aligned} \tag{16}$$

Eq. (16) is equivalent to

$$\begin{aligned}\underset{\bm{u}_{1},\cdots,\bm{u}_{m},\bm{w}}{\text{minimize}}\ \ & f(\bm{u},\bm{w})=-\sum_{i<j}\bm{w}^{T}[(\bm{X}_{i}\bm{u}_{i})\odot(\bm{X}_{j}\bm{u}_{j})] \\ \text{subject to}\ \ & \|\bm{u}_{i}\|=1,\ \|\bm{u}_{i}\|_{0}\leq k_{i}\ \forall i, \\ & \|\bm{w}\|_{\infty}\leq 1,\ \|\bm{w}\|_{0}\leq k_{w},\ \bm{w}\geq 0.\end{aligned} \tag{17}$$

Optimization. We observe that the objective function of Eq. (17) satisfies $f(\bm{u},\bm{w})=-\sum_{i<j}\bm{w}^{T}[(\bm{X}_{i}\bm{u}_{i})\odot(\bm{X}_{j}\bm{u}_{j})]=-\bm{u}_{i}^{T}\bm{X}_{i}^{T}[\sum_{j\neq i}\bm{w}\odot(\bm{X}_{j}\bm{u}_{j})]+C$, where $C$ does not depend on $\bm{u}_{i}$. So we can calculate the partial derivatives with respect to $\bm{u}_{i}$ ($i=1,\cdots,m$) and $\bm{w}$ as follows:

$$\begin{aligned}\nabla_{\bm{u}_{i}}f(\bm{u},\bm{w}) &= -\bm{X}_{i}^{T}\sum_{j\neq i}[\bm{w}\odot(\bm{X}_{j}\bm{u}_{j})], \\ \nabla_{\bm{w}}f(\bm{u},\bm{w}) &= -\sum_{i<j}[(\bm{X}_{i}\bm{u}_{i})\odot(\bm{X}_{j}\bm{u}_{j})].\end{aligned} \tag{18}$$

Similar to Algorithm 1, we develop a block proximal gradient algorithm that solves (17) by iterative updates. The detailed procedure is given in Algorithm 2, where Eq. (19) and Eq. (20) are solved using Proposition 1 and Proposition 2, respectively.

Algorithm 2: Multi-view wsPLS (mwsPLS-scheme1) in Eq. (16)
Input: Data matrices $\bm{X}_{i}\in\mathbb{R}^{n\times p_{i}}$ ($i=1,2,\cdots,m$), parameters {$k_{1},\cdots,k_{m}$ and $k_{w}$}, and Lipschitz constants {$L_{u_{1}},\cdots,L_{u_{m}}$ and $L_{w}$}.
Output: $\bm{u}_{1},\cdots,\bm{u}_{m}$ and $\bm{w}$.
1: Standardize the columns of $\bm{X}_{i}$ ($i=1,\cdots,m$).
2: Initialize $(\bm{u}_{1}^{0},\cdots,\bm{u}_{m}^{0},\bm{w}^{0})$ and set $t=0$.
3: repeat
4:   Compute $\bm{u}_{i}^{t+1}$ (for $i=1,2,\cdots,m$) by solving

$$\underset{\bm{u}}{\text{minimize}}\ \ \|\bm{u}-\overline{\bm{u}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{u}\|_{0}\leq k_{i},\ \|\bm{u}\|=1, \tag{19}$$

   where $\overline{\bm{u}}=\bm{u}_{i}^{t}-\frac{1}{L_{u_{i}}}\nabla_{\bm{u}_{i}}f(\bm{u}^{t},\bm{w}^{t})$ and $\nabla_{\bm{u}_{i}}f(\bm{u},\bm{w})=-\bm{X}_{i}^{T}\sum_{j\neq i}[\bm{w}\odot(\bm{X}_{j}\bm{u}_{j})]$.
5:   Compute $\bm{w}^{t+1}$ by solving

$$\underset{\bm{w}}{\text{minimize}}\ \ \|\bm{w}-\overline{\bm{w}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{w}\|_{0}\leq k_{w},\ \|\bm{w}\|_{\infty}\leq 1,\ w_{j}\geq 0\ \forall j, \tag{20}$$

   where $\overline{\bm{w}}=\bm{w}^{t}-\frac{1}{L_{w}}\nabla_{\bm{w}}f(\bm{u}^{t+1},\bm{w}^{t})$ and $\nabla_{\bm{w}}f(\bm{u},\bm{w})=-\sum_{i<j}[(\bm{X}_{i}\bm{u}_{i})\odot(\bm{X}_{j}\bm{u}_{j})]$.
6:   $t=t+1$
7: until convergence (e.g., $\frac{|\mbox{obj}(t)-\mbox{obj}(t-1)|}{\mbox{obj}(t-1)}<10^{-5}$).
8: return $\bm{u}_{i}:=\bm{u}_{i}^{t}$ for all $i$ and $\bm{w}:=\bm{w}^{t}$.

Initialization. We set $\bm{w}^{0}=\bm{1}$ (all entries equal to one) in Algorithm 2 and randomly initialize $\{\bm{u}_{i}^{0},\ i=1,\cdots,m\}$ from a normal distribution. Algorithm 2 is run from multiple random initial points, and the solution with the best objective value is selected.

Complexity Analysis. The computational burden of Algorithm 2 mainly lies in Eq. (19) and Eq. (20). Computing Eq. (19) mainly requires $\nabla_{\bm{u}_{i}}f(\bm{u},\bm{w})=-\bm{X}_{i}^{T}\sum_{j\neq i}[\bm{w}\odot(\bm{X}_{j}\bm{u}_{j})]$, which costs $\mathcal{O}(\sum_{i}np_{i})$. Computing Eq. (20) mainly requires $\nabla_{\bm{w}}f(\bm{u},\bm{w})=-\sum_{i<j}[(\bm{X}_{i}\bm{u}_{i})\odot(\bm{X}_{j}\bm{u}_{j})]$, which costs $\mathcal{O}((m-1)\sum_{i}np_{i})$ because $\sum_{i<j}(np_{i}+np_{j})=(m-1)\sum_{i}np_{i}$. In total, the complexity of Algorithm 2 is $\mathcal{O}(tmnp_{1}+tmnp_{2}+\cdots+tmnp_{m})$, where $t$ is the number of iterations, empirically set to 20 in our experiments.

Convergence Analysis. Similar to Lemma 1, we can prove that the objective function of the mwsPLS-scheme1 model satisfies the KŁ property. Therefore, Algorithm 2 also has theoretical convergence guarantees.
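To make the updates in Eqs. (18)-(20) concrete, the following hedged NumPy sketch performs one round of block updates for mwsPLS-scheme1, reusing the hypothetical `project_l0_sphere` and `top_k` helpers from Section 2:

```python
import numpy as np

def mwspls_sum_step(Xs, us, w, ks, k_w, L=1.0):
    """One pass of the Algorithm 2 block updates for the sum formulation in Eq. (17)."""
    m = len(Xs)
    scores = [Xs[i] @ us[i] for i in range(m)]             # X_i u_i for every view
    for i in range(m):
        # grad_{u_i} f = -X_i^T sum_{j != i} [w ⊙ (X_j u_j)]   (Eq. 18)
        rest = sum(w * scores[j] for j in range(m) if j != i)
        us[i] = project_l0_sphere(us[i] + Xs[i].T @ rest / L, ks[i])       # Eq. (19)
        scores[i] = Xs[i] @ us[i]
    # grad_w f = -sum_{i<j} (X_i u_i) ⊙ (X_j u_j)   (Eq. 18)
    pair_sum = sum(scores[i] * scores[j] for i in range(m) for j in range(i + 1, m))
    w = top_k(np.clip(w + pair_sum / L, 0.0, 1.0), k_w)                    # Eq. (20)
    return us, w
```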

3.2 Scheme 2: mwsPLS with a product formulation

We observe that $\bm{u}_{1}^{T}\bm{X}_{1}^{T}\mbox{diag}(\bm{w})\bm{X}_{2}\bm{u}_{2}=\bm{w}^{T}[(\bm{X}_{1}\bm{u}_{1})\odot(\bm{X}_{2}\bm{u}_{2})]$, which suggests extending wsPLS to multi-view data through an element-wise product. Accordingly, we propose the second mwsPLS model based on a product formulation, referred to as mwsPLS-scheme2:

$$\begin{aligned}\underset{\bm{u}_{1},\cdots,\bm{u}_{m},\bm{w}}{\text{minimize}}\ \ & f(\bm{u},\bm{w})=-\bm{w}^{T}\Big[\bigodot_{i=1}^{m}(\bm{X}_{i}\bm{u}_{i})\Big] \\ \text{subject to}\ \ & \|\bm{u}_{i}\|=1,\ \|\bm{u}_{i}\|_{0}\leq k_{i}\ \forall i, \\ & \|\bm{w}\|_{\infty}\leq 1,\ \|\bm{w}\|_{0}\leq k_{w},\ \bm{w}\geq 0,\end{aligned} \tag{21}$$

where $\bigodot_{i=1}^{m}(\bm{X}_{i}\bm{u}_{i})=(\bm{X}_{1}\bm{u}_{1})\odot(\bm{X}_{2}\bm{u}_{2})\odot\cdots\odot(\bm{X}_{m}\bm{u}_{m})$.

Optimization. Formally, the objective function in Eq. (21) is equivalent to

$$f(\bm{u},\bm{w})=-\bm{u}_{i}^{T}[\bm{X}_{i}^{T}(\bm{w}\odot\bm{z}_{i})], \tag{22}$$

where $\bm{z}_{i}=\bigodot_{j\neq i}(\bm{X}_{j}\bm{u}_{j})$. Therefore, the partial derivatives with respect to $\bm{u}_{i}$ ($i=1,\cdots,m$) and $\bm{w}$ are

$$\begin{aligned}\nabla_{\bm{u}_{i}}f(\bm{u},\bm{w}) &= -\bm{X}_{i}^{T}(\bm{w}\odot\bm{z}_{i}),\ \forall i, \\ \nabla_{\bm{w}}f(\bm{u},\bm{w}) &= -\bigodot_{i=1}^{m}(\bm{X}_{i}\bm{u}_{i}).\end{aligned} \tag{23}$$

Finally, we solve the optimization problem (21) in a similar manner to Algorithm 2; the detailed procedure is shown in Algorithm 3.

Algorithm 3: Multi-view wsPLS (mwsPLS-scheme2) in Eq. (21)
Input: Data matrices $\bm{X}_{i}\in\mathbb{R}^{n\times p_{i}}$ ($i=1,2,\cdots,m$), parameters {$k_{1},\cdots,k_{m}$ and $k_{w}$}, and Lipschitz constants {$L_{u_{1}},\cdots,L_{u_{m}}$ and $L_{w}$}.
Output: $\bm{u}_{1},\cdots,\bm{u}_{m}$ and $\bm{w}$.
1: Standardize the columns of $\bm{X}_{i}$ ($i=1,\cdots,m$).
2: Initialize $(\bm{u}_{1}^{0},\cdots,\bm{u}_{m}^{0},\bm{w}^{0})$ and set $t=0$.
3: repeat
4:   Compute $\bm{u}_{i}^{t+1}$ (for $i=1,2,\cdots,m$) by solving

$$\underset{\bm{u}}{\text{minimize}}\ \ \|\bm{u}-\overline{\bm{u}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{u}\|_{0}\leq k_{i},\ \|\bm{u}\|=1, \tag{24}$$

   where $\overline{\bm{u}}=\bm{u}_{i}^{t}-\frac{1}{L_{u_{i}}}\nabla_{\bm{u}_{i}}f(\bm{u}^{t},\bm{w}^{t})$, $\nabla_{\bm{u}_{i}}f(\bm{u},\bm{w})=-\bm{X}_{i}^{T}(\bm{w}\odot\bm{z}_{i})$ and $\bm{z}_{i}=\bigodot_{j\neq i}(\bm{X}_{j}\bm{u}_{j})$.
5:   Compute $\bm{w}^{t+1}$ by solving

$$\underset{\bm{w}}{\text{minimize}}\ \ \|\bm{w}-\overline{\bm{w}}\|_{2}^{2} \quad \text{subject to}\ \ \|\bm{w}\|_{0}\leq k_{w},\ \|\bm{w}\|_{\infty}\leq 1,\ w_{j}\geq 0\ \forall j, \tag{25}$$

   where $\overline{\bm{w}}=\bm{w}^{t}-\frac{1}{L_{w}}\nabla_{\bm{w}}f(\bm{u}^{t+1},\bm{w}^{t})$ and $\nabla_{\bm{w}}f(\bm{u},\bm{w})=-\bigodot_{i=1}^{m}(\bm{X}_{i}\bm{u}_{i})$.
6:   $t=t+1$
7: until convergence (e.g., $\frac{|\mbox{obj}(t)-\mbox{obj}(t-1)|}{\mbox{obj}(t-1)}<10^{-5}$).
8: return $\bm{u}_{i}:=\bm{u}_{i}^{t}$ for all $i$ and $\bm{w}:=\bm{w}^{t}$.

Initialization. The initial point setting of Algorithm 3 is the same as that of Algorithm 2, and the description is omitted here.

Complexity Analysis. The complexity of computing $\nabla_{\bm{u}_{i}}f(\bm{u},\bm{w})=-\bm{X}_{i}^{T}(\bm{w}\odot\bm{z}_{i})$ with $\bm{z}_{i}=\bigodot_{j\neq i}(\bm{X}_{j}\bm{u}_{j})$ is $\mathcal{O}(\sum_{i}np_{i})$. The complexity of computing $\nabla_{\bm{w}}f(\bm{u},\bm{w})=-\bigodot_{i=1}^{m}(\bm{X}_{i}\bm{u}_{i})$ is also $\mathcal{O}(\sum_{i}np_{i})$. Hence, the complexity of Algorithm 3 is $\mathcal{O}(tmnp_{1}+tmnp_{2}+\cdots+tmnp_{m})$, where $t$ is the number of iterations, empirically set to 20 in our experiments.

Convergence Analysis. Similar to Lemma 1, we can also prove that the objective function of the mwsPLS-scheme2 model satisfies the KŁ property. Therefore, Algorithm 3 also has theoretical convergence guarantees and globally converges to a critical point.
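For completeness, a minimal sketch of the gradients in Eq. (23) for the product formulation (names are illustrative; the same projection steps as in Algorithm 2 are then applied):

```python
import numpy as np

def mwspls_prod_gradients(Xs, us, w):
    """Gradients of f(u, w) = -w^T [⊙_i (X_i u_i)] from Eq. (23)."""
    scores = [X @ u for X, u in zip(Xs, us)]               # X_i u_i for each view
    grad_w = -np.prod(np.stack(scores, axis=0), axis=0)
    grad_us = []
    for i in range(len(Xs)):
        # z_i = element-wise product of all X_j u_j with j != i
        z_i = np.prod(np.stack([s for j, s in enumerate(scores) if j != i], axis=0), axis=0)
        grad_us.append(-Xs[i].T @ (w * z_i))
    return grad_us, grad_w
```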

4 Experiments

In this section, we evaluate the effectiveness of the proposed methods for co-module identification and compare them with competing methods on synthetic and biomedical data.

4.1 Competing Methods

We compare the performance of the proposed methods with PLS and its variants:

  • PLS: the classical model without any sparsity constraint [29]:

    $$\underset{\bm{u},\bm{v}}{\text{maximize}}\ \ \bm{u}^{T}\bm{X}^{T}\bm{Y}\bm{v} \quad \text{subject to}\ \ \|\bm{u}\|=\|\bm{v}\|=1,$$

    which is solved by applying the singular value decomposition (SVD) to $\bm{X}^{T}\bm{Y}$.

  • PMD-sPLS: a sparse model with $\ell_{1}$-norm constraints, inspired by [32]:

    $$\underset{\bm{u},\bm{v}}{\text{maximize}}\ \ \bm{u}^{T}\bm{X}^{T}\bm{Y}\bm{v} \quad \text{subject to}\ \ \|\bm{u}\|=\|\bm{v}\|=1,\ \|\bm{u}\|_{1}<c_{1},\ \|\bm{v}\|_{1}<c_{2},$$

    which is solved by the penalized matrix decomposition (PMD) method [32].

  • $\ell_{0}$-sPLS: a sparse PLS model with $\ell_{0}$-norm constraints, inspired by [30, 33]:

    $$\underset{\bm{u},\bm{v}}{\text{maximize}}\ \ \bm{u}^{T}\bm{X}^{T}\bm{Y}\bm{v} \quad \text{subject to}\ \ \|\bm{u}\|=\|\bm{v}\|=1,\ \|\bm{u}\|_{0}\leq k_{u},\ \|\bm{v}\|_{0}\leq k_{v},$$

    which is solved using an alternating iterative algorithm.

  • $\ell_{2}/\ell_{0}$-wsPLS: a weighted sparse PLS model that uses $\ell_{2}/\ell_{0}$-norm constraints on the weight vector, inspired by [31]:

    $$\begin{aligned}\underset{\bm{u},\bm{v},\bm{w}}{\text{maximize}}\ \ & \bm{u}^{T}\bm{X}^{T}\mbox{diag}(\bm{w})\bm{Y}\bm{v} \\ \text{subject to}\ \ & \|\bm{u}\|_{0}\leq k_{u},\ \|\bm{v}\|_{0}\leq k_{v},\ \|\bm{u}\|=\|\bm{v}\|=1, \\ & \|\bm{w}\|_{2}=1,\ \|\bm{w}\|_{0}\leq k_{w},\end{aligned}$$

    which is solved using the proposed Algorithm 1 with a small modification.

Figure 2: Comparison of wsPLS with PLS and its variants on simulation data I. Red points are truly non-zero and black points are truly zero. (A) The true $\bm{u}$, $\bm{v}$ and $\bm{w}$; (B) the estimated $\bm{u}$, $\bm{v}$ and $\bm{w}$ by PLS; (C) the estimated $\bm{u}$ and $\bm{v}$ by PMD-sPLS; (D) the estimated $\bm{u}$ and $\bm{v}$ by $\ell_{0}$-sPLS; (E) the estimated $\bm{u}$, $\bm{v}$ and $\bm{w}$ by $\ell_{2}/\ell_{0}$-wsPLS; (F) the estimated $\bm{u}$, $\bm{v}$ and $\bm{w}$ by $\ell_{\infty}/\ell_{0}$-wsPLS.

4.2 Application to Synthetic Data

To generate synthetic data $\bm{X}\in\mathbb{R}^{n\times p}$ and $\bm{Y}\in\mathbb{R}^{n\times q}$ ($n$ is the number of samples; $p$ and $q$ are the numbers of features in the two views), we first generate $\bm{u}$, $\bm{v}$ and $\bm{w}$, and then set

$$\bm{X}=\bm{w}\bm{u}^{T}+\gamma_{1}\bm{\epsilon}_{1},\qquad \bm{Y}=\bm{w}\bm{v}^{T}+\gamma_{2}\bm{\epsilon}_{2}, \tag{26}$$

where the elements of $\bm{\epsilon}_{1}$ and $\bm{\epsilon}_{2}$ are randomly sampled from a standard normal distribution, and $\gamma_{1}$, $\gamma_{2}$ are two nonnegative parameters that control the signal-to-noise ratio (SNR), defined by

$$\mbox{SNR}_{1}=\frac{\|\bm{w}\bm{u}^{T}\|_{F}^{2}}{\mathbf{E}(\|\gamma_{1}\bm{\epsilon}_{1}\|_{F}^{2})}=\frac{\|\bm{w}\bm{u}^{T}\|_{F}^{2}}{\gamma_{1}^{2}np},\qquad \mbox{SNR}_{2}=\frac{\|\bm{w}\bm{v}^{T}\|_{F}^{2}}{\mathbf{E}(\|\gamma_{2}\bm{\epsilon}_{2}\|_{F}^{2})}=\frac{\|\bm{w}\bm{v}^{T}\|_{F}^{2}}{\gamma_{2}^{2}nq}, \tag{27}$$

where $\mathbf{E}(\|\gamma_{1}\bm{\epsilon}_{1}\|_{F}^{2})$ and $\mathbf{E}(\|\gamma_{2}\bm{\epsilon}_{2}\|_{F}^{2})$ denote the expected sums of squares of the noise.
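A hedged NumPy sketch of this generation procedure (Eqs. (26)-(27)), where $\gamma$ is solved from the target SNR; the `rep`-style vectors match the description of simulation data I given below:

```python
import numpy as np

def rep(x, n):
    """rep(x, n): a vector of length n with all elements equal to x."""
    return np.full(n, float(x))

def generate_pair(u, v, w, snr1=0.1, snr2=0.1, seed=0):
    """Generate (X, Y) via Eq. (26), choosing gamma to match the target SNRs in Eq. (27)."""
    rng = np.random.default_rng(seed)
    n, p, q = len(w), len(u), len(v)
    signal_X, signal_Y = np.outer(w, u), np.outer(w, v)
    gamma1 = np.linalg.norm(signal_X) / np.sqrt(snr1 * n * p)
    gamma2 = np.linalg.norm(signal_Y) / np.sqrt(snr2 * n * q)
    X = signal_X + gamma1 * rng.standard_normal((n, p))
    Y = signal_Y + gamma2 * rng.standard_normal((n, q))
    return X, Y

# Simulation data I (n = 50, p = 80, q = 100)
n, p, q = 50, 80, 100
u = np.concatenate([rep(1, 10), rep(-1, 10), rep(0, p - 20)])
v = np.concatenate([rep(-1, 15), rep(1, 15), rep(0, q - 30)])
w = np.concatenate([rep(1, 25), rep(0, n - 25)])
X, Y = generate_pair(u, v, w)
```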

(1) Simulation data I ($n$ = 50, $p$ = 80, $q$ = 100). We first generate three vectors $\bm{u}$, $\bm{v}$ and $\bm{w}$ as

$$\begin{aligned}\bm{u} &= [rep(1,10),\ rep(-1,10),\ rep(0,p-20)]^{\rm T}, \\ \bm{v} &= [rep(-1,15),\ rep(1,15),\ rep(0,q-30)]^{\rm T}, \\ \bm{w} &= [rep(1,25),\ rep(0,n-25)]^{\rm T},\end{aligned}$$

where $rep(x,n)$ denotes a row vector of length $n$ with all elements equal to $x$. We then generate simulation data I using formula (26) with $\mbox{SNR}_{1}=\mbox{SNR}_{2}=0.1$.

(2) Simulation data II ($n$ = 100, $p$ = 800, $q$ = 1000). We first generate three vectors $\bm{u}$, $\bm{v}$ and $\bm{w}$ as

$$\begin{aligned}\bm{u} &= [rep(1,100),\ rep(-1,100),\ rep(0,p-200)]^{\rm T}, \\ \bm{v} &= [rep(-1,150),\ rep(1,150),\ rep(0,q-300)]^{\rm T}, \\ \bm{w} &= [rep(1,50),\ rep(0,n-50)]^{\rm T}.\end{aligned}$$

We then generate simulation data II using formula (26) with $\mbox{SNR}_{1}=\mbox{SNR}_{2}=0.1$.

(3) Simulation data III ($n$ = 1000, $p$ = 8000, $q$ = 10000). We first generate three vectors $\bm{u}$, $\bm{v}$ and $\bm{w}$ as

$$\begin{aligned}\bm{u} &= [rep(1,1000),\ rep(-1,1000),\ rep(0,p-2000)]^{\rm T}, \\ \bm{v} &= [rep(-1,1500),\ rep(1,1500),\ rep(0,q-3000)]^{\rm T}, \\ \bm{w} &= [rep(1,250),\ rep(0,n-250)]^{\rm T}.\end{aligned}$$

We then generate simulation data III using formula (26) with $\mbox{SNR}_{1}=\mbox{SNR}_{2}=0.1$.

We compare the performance of $\ell_{\infty}/\ell_{0}$-wsPLS with $\ell_{2}/\ell_{0}$-wsPLS [31], two sparse PLS methods, namely $\ell_{0}$-sPLS [30] and PMD-sPLS with an $\ell_{1}$-penalty [34], and the classical PLS [29]. All experiments are performed on a laptop (2.8-GHz Intel 8-core CPU and 16-GB RAM). Since neither PLS nor sPLS can estimate $\bm{w}$, we assume $\bm{w}=\bm{1}$ for them for the convenience of comparison.

For simulation data I, we set $k_{u}=20$, $k_{v}=30$ and $k_{w}=25$ for $\ell_{\infty}/\ell_{0}$-wsPLS and $\ell_{2}/\ell_{0}$-wsPLS. For comparison, we set $k_{u}=20$, $k_{v}=30$ for $\ell_{0}$-sPLS and tune PMD-sPLS so that its estimated $\bm{u}$ and $\bm{v}$ have the same sparsity level. For simulation data II, we set $k_{u}=200$, $k_{v}=300$ and $k_{w}=50$ for $\ell_{\infty}/\ell_{0}$-wsPLS and $\ell_{2}/\ell_{0}$-wsPLS; likewise, we set $k_{u}=200$, $k_{v}=300$ for $\ell_{0}$-sPLS and match the sparsity level of PMD-sPLS. For simulation data III, we set $k_{u}=2000$, $k_{v}=3000$ and $k_{w}=250$ for $\ell_{\infty}/\ell_{0}$-wsPLS and $\ell_{2}/\ell_{0}$-wsPLS; we set $k_{u}=2000$, $k_{v}=3000$ for $\ell_{0}$-sPLS and match the sparsity level of PMD-sPLS.

An ideal method should recover the true non-zero patterns of $\bm{u}$, $\bm{v}$ and $\bm{w}$ shown in Figure 2 (A). We evaluate all methods with three metrics: true positive rate (TPR), true negative rate (TNR) and accuracy (ACC):

$$\mbox{TPR}=\frac{\mbox{TP}}{\mbox{P}},\quad \mbox{TNR}=\frac{\mbox{TN}}{\mbox{N}},\quad \mbox{ACC}=\frac{\mbox{TP}+\mbox{TN}}{\mbox{P}+\mbox{N}}, \tag{28}$$

where P denotes the number of positives (i.e., truly non-zero entries), N denotes the number of negatives (i.e., truly zero entries), TP denotes the number of true positives, and TN denotes the number of true negatives.
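A small sketch of these support-recovery metrics (Eq. (28)), treating non-zero entries of the true and estimated vectors as positives; the function name is illustrative:

```python
import numpy as np

def support_metrics(true_vec, est_vec):
    """TPR, TNR and ACC of the estimated non-zero pattern against the true one (Eq. 28)."""
    pos, est_pos = true_vec != 0, est_vec != 0
    tp = np.sum(pos & est_pos)
    tn = np.sum(~pos & ~est_pos)
    tpr = tp / np.sum(pos)
    tnr = tn / np.sum(~pos)
    acc = (tp + tn) / true_vec.size
    return tpr, tnr, acc
```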

Table II: Performance comparison of different methods in terms of ACC, TPR, TNR and running time (in seconds) on three simulated datasets. The mean and standard deviation of each measure are computed over 20 simulation runs. The evaluation metrics (ACC, TPR, and TNR) are defined in Eq. (28). For example, ACC_u, ACC_v and ACC_w evaluate the recovery of $\bm{u}$, $\bm{v}$ and $\bm{w}$ respectively, while ACC_all evaluates all variables jointly (i.e., the concatenation of $\bm{u}$, $\bm{v}$, $\bm{w}$).
Data I. Simulation data I: $n$ = 50, $p$ = 80, $q$ = 100
Metrics ACC TPR TNR Time
ACC_all ACC_u ACC_v ACC_w TPR_all TPR_u TPR_v TPR_w TNR_all TNR_u TNR_v TNR_w Time
PLS 0.326 0.250 0.300 0.500 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.004
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.010)
PMD-sPLS 0.800 0.916 0.856 0.500 0.880 0.815 0.823 1.000 0.761 0.950 0.870 0.000 0.016
(0.016) (0.023) (0.034) (0.000) (0.025) (0.067) (0.040) (0.000) (0.016) (0.015) (0.039) (0.000) (0.007)
$\ell_{0}$-sPLS 0.870 0.980 0.966 0.500 0.967 0.960 0.943 1.000 0.823 0.987 0.976 0.000 0.029
(0.010) (0.022) (0.016) (0.000) (0.015) (0.044) (0.026) (0.000) (0.007) (0.015) (0.011) (0.000) (0.008)
$\ell_{2}/\ell_{0}$-wsPLS 0.710 0.728 0.692 0.716 0.555 0.455 0.487 0.716 0.785 0.818 0.780 0.716 0.023
(0.135) (0.125) (0.118) (0.214) (0.207) (0.249) (0.196) (0.214) (0.100) (0.083) (0.084) (0.214) (0.007)
$\ell_{\infty}/\ell_{0}$-wsPLS 0.979 0.980 0.972 0.992 0.968 0.960 0.953 0.992 0.985 0.987 0.980 0.992 0.056
(0.016) (0.019) (0.016) (0.024) (0.024) (0.037) (0.027) (0.024) (0.012) (0.012) (0.011) (0.024) (0.007)
Data II. Simulation data II: $n$ = 100, $p$ = 800, $q$ = 1000
ACC_all ACC_u ACC_v ACC_w TPR_all TPR_u TPR_v TPR_w TNR_all TNR_u TNR_v TNR_w Time
PLS 0.289 0.250 0.300 0.500 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 0.065
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.008)
PMD-sPLS 0.868 0.795 0.963 0.500 0.816 0.625 0.913 1.000 0.889 0.853 0.984 0.000 0.079
(0.006) (0.010) (0.006) (0.000) (0.008) (0.014) (0.013) (0.000) (0.006) (0.010) (0.004) (0.000) (0.012)
$\ell_{0}$-sPLS 0.926 0.890 0.997 0.500 0.917 0.780 0.995 1.000 0.929 0.927 0.998 0.000 0.114
(0.005) (0.013) (0.001) (0.000) (0.009) (0.026) (0.002) (0.000) (0.004) (0.009) (0.001) (0.000) (0.009)
$\ell_{2}/\ell_{0}$-wsPLS 0.617 0.627 0.605 0.654 0.338 0.254 0.342 0.654 0.730 0.751 0.718 0.654 0.026
(0.027) (0.013) (0.035) (0.179) (0.047) (0.025) (0.059) (0.179) (0.019) (0.008) (0.025) (0.179) (0.008)
$\ell_{\infty}/\ell_{0}$-wsPLS 0.953 0.891 0.997 1.000 0.918 0.783 0.995 1.000 0.967 0.927 0.998 1.000 0.117
(0.005) (0.013) (0.001) (0.000) (0.009) (0.026) (0.002) (0.000) (0.004) (0.009) (0.001) (0.000) (0.010)
Data III. Simulation data III: $n$ = 500, $p$ = 8000, $q$ = 10000
ACC_all ACC_u ACC_v ACC_w TPR_all TPR_u TPR_v TPR_w TNR_all TNR_u TNR_v TNR_w Time
PLS 0.284 0.250 0.300 0.500 1.000 1.000 1.000 1.000 0.000 0.000 0.000 0.000 11.965
(0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.000) (0.097)
PMD-sPLS 0.940 0.919 0.979 0.500 0.892 0.821 0.931 1.000 0.960 0.952 1.000 0.000 11.999
(0.002) (0.004) (0.001) (0.000) (0.003) (0.007) (0.003) (0.000) (0.001) (0.003) (0.000) (0.000) (0.205)
$\ell_{0}$-sPLS 0.976 0.977 1.000 0.500 0.982 0.954 1.000 1.000 0.974 0.985 1.000 0.000 26.078
(0.001) (0.002) (0.000) (0.000) (0.001) (0.004) (0.000) (0.000) (0.001) (0.001) (0.000) (0.000) (0.238)
$\ell_{2}/\ell_{0}$-wsPLS 0.921 0.907 0.928 0.991 0.860 0.813 0.880 0.991 0.945 0.938 0.949 0.991 0.360
(0.138) (0.138) (0.143) (0.019) (0.243) (0.277) (0.239) (0.019) (0.096) (0.092) (0.102) (0.019) (0.029)
$\ell_{\infty}/\ell_{0}$-wsPLS 0.990 0.977 1.000 1.000 0.982 0.954 1.000 1.000 0.993 0.985 1.000 1.000 5.948
(0.001) (0.002) (0.000) (0.000) (0.001) (0.004) (0.000) (0.000) (0.001) (0.001) (0.000) (0.000) (0.057)

Figure 2 shows the estimated patterns of $\bm{u}$, $\bm{v}$ and $\bm{w}$ on simulation data I, and Table II reports the detailed results of all methods in terms of ACC, TPR, TNR and running time (seconds). We find that (1) the sparse PLS methods PMD-sPLS and $\ell_{0}$-sPLS cannot detect the pattern of $\bm{w}$ (samples) and identify $\bm{u}$ and $\bm{v}$ with some incorrect entries; (2) compared to $\ell_{2}/\ell_{0}$-wsPLS, $\ell_{\infty}/\ell_{0}$-wsPLS recovers the pattern of $\bm{w}$ more accurately. In short, $\ell_{\infty}/\ell_{0}$-wsPLS not only discovers the true non-zero patterns of $\bm{u}$ and $\bm{v}$, but also identifies the true non-zero pattern of $\bm{w}$ (samples).

4.3 Application to Paired DNA Copy Number Variation and Gene Expression Data

We demonstrate the $\ell_{\infty}/\ell_{0}$-wsPLS method on the comparative genomic hybridization (CGH) data with matched samples from the “PMA” R package [34], which contains matched DNA copy number variation (CNV) data $\bm{X}\in\mathbb{R}^{89\times 2149}$ and gene expression data $\bm{Y}\in\mathbb{R}^{89\times 19672}$. Our aim here is to identify a gene set whose expression is strongly correlated with CNV across a subset of samples.

We apply $\ell_{\infty}/\ell_{0}$-wsPLS to the CGH dataset with $k_{u}=50$, $k_{v}=150$ and $k_{w}=89\times 0.8$ and identify a CNV-gene co-expression module containing 50 CNV regions, 150 genes and 71 patients. For comparison, we also apply $\ell_{0}$-sPLS with $k_{u}=50$ and $k_{v}=150$, and PMD-sPLS with $c_{1}=0.128$, $c_{2}=0.0665$, so that the learned $\bm{u}$ and $\bm{v}$ have the same sparsity level. The setting of $\ell_{2}/\ell_{0}$-wsPLS is the same as that of $\ell_{\infty}/\ell_{0}$-wsPLS. First, the co-module identified by $\ell_{\infty}/\ell_{0}$-wsPLS has a strong co-correlation pattern (Figure 3 (A), (B) and (C)). Second, the proposed $\ell_{\infty}/\ell_{0}$-wsPLS obtains the largest Pearson correlation coefficient (PCC), $r=0.901$, while $\ell_{2}/\ell_{0}$-wsPLS obtains $r=0.816$, $\ell_{0}$-sPLS obtains $r=0.802$ and PMD-sPLS obtains $r=0.779$ (Figure 3 (D)). These results show that wsPLS captures the latent patterns more effectively than the $\ell_{0}$-sPLS and PMD-sPLS methods by selecting a specific subset of samples.

Figure 3: Results on the paired DNA copy number variation and gene expression data (CGH). (A) Heatmaps of the input data. (B) Heatmaps of the co-module identified by wsPLS (circled in red), with randomly selected features shown for comparison. (C) Heatmaps of the correlation submatrix of the identified co-module. (D) Scatterplots of dim1 ($\bm{X}\bm{u}$) vs. dim2 ($\bm{Y}\bm{v}$) for the four methods PMD-sPLS, $\ell_{0}$-sPLS, $\ell_{2}/\ell_{0}$-wsPLS and $\ell_{\infty}/\ell_{0}$-wsPLS (see Section 4.1 for details), where the blue points in the third and fourth panels correspond to $w_{j}=0$, the black line is fitted using all data points, the red line is fitted using only the black data points, and the Pearson correlation coefficient (PCC) $r$ is displayed.

Figure 4: Results on the paired DNA methylation and gene expression data (LUAD). (A) Heatmaps of the input data. (B) Heatmaps of the co-module identified by wsPLS (circled in red), with randomly selected features shown for comparison. (C) Heatmaps of the correlation submatrix of the identified co-module. (D) Scatterplots of dim1 ($\bm{X}\bm{u}$) vs. dim2 ($\bm{Y}\bm{v}$) for the four methods, where the blue points in the third and fourth panels correspond to $w_{j}=0$, the black line is fitted using all data points, the red line is fitted using only the black data points, and the PCC $r$ is displayed.

Figure 5: Application of the proposed Algorithms 2 and 3 to the BRCA miRNA-lncRNA-mRNA dataset. (A) Distribution of the $S$ scores (see Eq. (29) for the definition). The $S$ scores of the identified modules are significantly greater than those of random ones (permutation test, $P < 0.01$). The table in the upper right corner gives the $S$ scores of the co-modules identified by mwsPLS-scheme1 (Algorithm 2) and mwsPLS-scheme2 (Algorithm 3). (B) Convergence curves of the proposed Algorithms 2 and 3. (C) Heatmaps of the first co-module identified by Algorithm 2 (red rectangles), with randomly selected features shown for comparison. (D) Heatmaps of the absolute PCCs of the correlation matrices for (miRNA, lncRNA), (miRNA, mRNA) and (lncRNA, mRNA), where the three red rectangles correspond to the first co-module identified by mwsPLS-scheme1.

4.4 Application to Paired DNA Methylation and Gene Expression Data

Aberrant DNA methylation plays an important role in the development of cancer and has been regarded as a cancer biomarker [35]. In recent years, detecting the cross-talk between DNA methylation and gene expression has become increasingly important [36, 37]. To this end, we obtain the paired DNA methylation and gene expression dataset of Lung Adenocarcinoma (LUAD) from The Cancer Genome Atlas (TCGA) [38, 37]. To remove noisy methylation probes and genes, we keep the top 5000 methylation probes and the top 5000 genes with the largest variance for further analysis. The resulting LUAD dataset contains the DNA methylation data $\bm{X}\in\mathbb{R}^{350\times 5000}$ and gene expression data $\bm{Y}\in\mathbb{R}^{350\times 5000}$. We apply $\ell_{\infty}/\ell_{0}$-wsPLS to the LUAD dataset with $k_{u}=150$, $k_{v}=150$ and $k_{w}=350\times 0.8$ to identify a methylation-gene co-module. For comparison, we also apply $\ell_{0}$-sPLS with $k_{u}=150$ and $k_{v}=150$, and PMD-sPLS with $c_{1}=0.1327$, $c_{2}=0.1375$, so that the learned $\bm{u}$ and $\bm{v}$ have the same sparsity level. The setting of $\ell_{2}/\ell_{0}$-wsPLS is the same as that of $\ell_{\infty}/\ell_{0}$-wsPLS. The co-module identified by $\ell_{\infty}/\ell_{0}$-wsPLS has a strong co-correlation pattern (Figure 4 (A), (B) and (C)). In addition, $\ell_{\infty}/\ell_{0}$-wsPLS obtains the largest PCC, $r=0.878$, while $\ell_{2}/\ell_{0}$-wsPLS obtains $r=0.578$, $\ell_{0}$-sPLS obtains $r=0.792$ and PMD-sPLS obtains $r=0.798$ (Figure 4 (D)). These results show that wsPLS captures the latent patterns of the canonical vectors more effectively than the $\ell_{0}$-sPLS and PMD-sPLS methods by selecting a specific subset of samples.

4.5 Application to Paired miRNA-lncRNA-mRNA Expression Data

To evaluate the mwsPLS methods on triple datasets with matched samples, we obtain the miRNA-lncRNA-mRNA expression data of 751 Breast invasive carcinoma (BRCA) patients from the TCGA database [39]. We remove noisy features (miRNAs, lncRNAs and mRNAs) using a standard deviation threshold. Finally, we obtain three data matrices: the miRNA expression matrix $\bm{X}_{1}\in\mathbb{R}^{751\times 581}$, the lncRNA expression matrix $\bm{X}_{2}\in\mathbb{R}^{751\times 3782}$ and the mRNA expression matrix $\bm{X}_{3}\in\mathbb{R}^{751\times 6200}$.

We apply mwsPLS-scheme1 (Algorithm 2) and mwsPLS-scheme2 (Algorithm 3) to the BRCA miRNA-lncRNA-mRNA dataset with $k_{1}=30$ (miRNAs), $k_{2}=100$ (lncRNAs), $k_{3}=200$ (mRNAs) and $k_{w}=190$ (about $1/4$ of all samples) to identify four subtype-specific miRNA-lncRNA-mRNA co-modules. We define an average correlation score $S$ to quantify the co-correlation level of a given co-module $M(I,J,K,T)$, where $I$, $J$ and $K$ are three feature sets (the miRNA, lncRNA and mRNA subsets) and $T$ is the sample subset:

$$S=\frac{\sum\limits_{i\in I,j\in J}|\text{cor}_{(i,j)}|+\sum\limits_{i\in I,k\in K}|\text{cor}_{(i,k)}|+\sum\limits_{j\in J,k\in K}|\text{cor}_{(j,k)}|}{N}, \tag{29}$$

where $N=|I|\cdot|J|+|I|\cdot|K|+|J|\cdot|K|$ is the number of all possible pairs between the three feature types in the co-module $M(I,J,K,T)$, and $\text{cor}_{(i,j)}$ is the pair-wise Pearson correlation coefficient (PCC) computed from the corresponding expression data restricted to the sample subset $T$. In short, $S$ is the average of the absolute PCCs over all miRNA-lncRNA, miRNA-mRNA, and lncRNA-mRNA pairs. We apply a permutation test to evaluate the significance of the $S$ score of a given co-module: we first construct 1000 random co-modules by randomly sampling the miRNA set, lncRNA set and sample set, and then compute the $S$ scores of these random co-modules. We find that all the $S$ scores of the co-modules identified by Algorithms 2 and 3 are significantly higher than those of the random co-modules (Figure 5 (A)). Moreover, the objective function curves confirm the convergence of Algorithms 2 and 3 (Figure 5 (B)). As an example, we show the expression pattern of the first co-module identified by mwsPLS-scheme1 (Algorithm 2) in Figure 5 (C), and the correlation matrices (heatmaps) between any two types of molecules in this co-module in Figure 5 (D). In terms of the $S$ scores in Figure 5 (A), mwsPLS-scheme1 (Algorithm 2) performs as well as or better than mwsPLS-scheme2 (Algorithm 3).
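A hedged sketch of the $S$ score in Eq. (29); the permutation test simply recomputes this score on randomly sampled feature and sample subsets (names are illustrative):

```python
import numpy as np

def s_score(X_mi, X_lnc, X_mrna, I, J, K, T):
    """Average absolute pairwise PCC over all cross-view pairs in co-module M(I, J, K, T), Eq. (29)."""
    blocks = [X_mi[np.ix_(T, I)], X_lnc[np.ix_(T, J)], X_mrna[np.ix_(T, K)]]
    total, n_pairs = 0.0, 0
    for a in range(3):
        for b in range(a + 1, 3):
            A, B = blocks[a], blocks[b]
            # Cross-block of the correlation matrix of the stacked columns
            C = np.corrcoef(A, B, rowvar=False)[:A.shape[1], A.shape[1]:]
            total += np.abs(C).sum()
            n_pairs += C.size
    return total / n_pairs
```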

To assess the biological function of the gene sets in the identified co-modules, the co-modules identified by mwsPLS-scheme1 are used for further biological analysis. We retrieve the GO biological process annotations from the Molecular Signatures Database (MSigDB, http://www.gsea-msigdb.org/gsea/msigdb/index.jsp). Each GO biological process term consists of a set of functionally related genes. The hypergeometric test is used for the statistical analysis. The gene sets of the identified co-modules are significantly enriched in a series of important biological processes (Table III).

Table III: Enrichment analysis of GO biological process terms on the gene sets of co-modules which are identified by Algorithm 2 on BRCA dataset. We present top five terms of each co-module. “#gene” denotes the number of genes.
Module Biological process term #gene P-value
M1 microtubule based process 13 9.43e-06
microtubule cytoskeleton organization 10 8.03e-06
axonemal dynein complex assembly 5 5.25e-05
axoneme assembly 6 1.11e-04
regulation of microtubule based process 6 2.17e-04
M2 immune system process 104 4.52e-54
regulation of immune system process 89 9.71e-53
immune response 79 2.53e-42
regulation of immune response 59 1.88e-38
positive regulation of immune system process 59 6.98e-37
M3 regulation of cellular component movement 24 7.68e-08
response to ionizing radiation 6 4.10e-06
positive regulation of cell proliferation 25 3.54e-06
regulation of epithelial cell proliferation 13 5.12e-06
positive regulation of locomotion 15 8.19e-06
M4 extracellular structure organization 19 1.04e-09
organ morphogenesis 33 8.30e-09
skeletal system development 21 7.56e-08
regulation of multicellular organismal development 43 5.22e-08
tissue development 43 7.11e-08

For example, the gene set of co-module 2 (M2) is significantly enriched in immune-related biological processes, including immune system process (p = 4.52e-54), regulation of immune system process (p = 9.71e-53) and immune response (p = 2.53e-42), which have been reported to be related to breast cancer [40, 41]. The gene set of co-module 3 (M3) is significantly enriched in regulation of epithelial cell proliferation (p = 5.12e-06), which has been reported to be associated with breast cancer [42]. These results show that the mwsPLS methods can effectively identify biologically relevant miRNA-lncRNA-mRNA co-modules.
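For completeness, a hedged sketch of the hypergeometric (over-representation) test underlying the enrichment p-values in Table III; the gene lists and background universe below are placeholders rather than the actual annotation files:

```python
from scipy.stats import hypergeom

def go_enrichment_pvalue(module_genes, term_genes, background_genes):
    """One-sided hypergeometric test: the probability of observing at least
    the given overlap between a co-module's gene set and a GO biological
    process term, relative to the background gene universe."""
    universe = set(background_genes)
    module = set(module_genes) & universe
    term = set(term_genes) & universe
    N = len(universe)            # size of the gene universe
    K = len(term)                # genes annotated to the GO term
    n = len(module)              # genes in the co-module
    k = len(module & term)       # observed overlap
    # P(X >= k) with X ~ Hypergeometric(N, K, n)
    return hypergeom.sf(k - 1, N, K, n)
```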

5 Conclusion

In this paper, we propose an $\ell_{\infty}/\ell_{0}$-wsPLS model for joint sample and feature selection, which uses $\ell_{\infty}/\ell_{0}$-norm constraints for sample selection. The model is difficult to solve because the $\ell_{\infty}/\ell_{0}$-norm constraints are non-convex and non-smooth. Fortunately, we prove that the objective function of the $\ell_{\infty}/\ell_{0}$-wsPLS model satisfies the KŁ property. Based on this, we develop a block proximal gradient algorithm within the PALM framework and establish its convergence. Furthermore, we extend the $\ell_{\infty}/\ell_{0}$-wsPLS model to data fusion over three or more data matrices and propose two mwsPLS models, one based on a sum scheme and one based on a product scheme. Similarly, we propose two block proximal gradient algorithms to solve them and prove that both have theoretical convergence guarantees. Numerical and biomedical data experiments demonstrate the efficiency of the proposed methods. The source code of the wsPLS and mwsPLS methods is available at https://github.com/wenwenmin/wsPLS.

Acknowledgment

This work was supported by the National Science Foundation of China (62262069, 61802157, 61902372) and the Open Project Program of Yunnan Key Laboratory of Intelligent Systems and Computing (No. ISC22Z03).

References

  • [1] C. Hutter and J. C. Zenklusen, “The cancer genome atlas: creating lasting value beyond its data,” Cell, vol. 173, no. 2, pp. 283–285, 2018.
  • [2] J. D. Welch, V. Kozareva, A. Ferreira, C. Vanderburg, C. Martin, and E. Z. Macosko, “Single-cell multi-omic integration compares and contrasts features of brain cell identity,” Cell, vol. 177, no. 7, pp. 1873–1887, 2019.
  • [3] A. Krishnan, L. J. Williams, A. R. McIntosh, and H. Abdi, “Partial least squares (PLS) methods for neuroimaging: a tutorial and review,” Neuroimage, vol. 56, no. 2, pp. 455–475, 2011.
  • [4] A.-L. Boulesteix and K. Strimmer, “Partial least squares: a versatile tool for the analysis of high-dimensional genomic data,” Briefings in bioinformatics, vol. 8, no. 1, pp. 32–44, 2007.
  • [5] H. Chen, Y. Sun, J. Gao, Y. Hu, and B. Yin, “Solving partial least squares regression via manifold optimization approaches,” IEEE Trans. Neural. Netw. Learn. Syst., vol. 30, no. 2, pp. 588–600, 2019.
  • [6] F. Rohart, B. Gautier, A. Singh, and K.-A. Lê Cao, “mixomics: An r package for ‘omics feature selection and multiple data integration,” PLoS Comput. Biol., vol. 13, no. 11, p. e1005752, 2017.
  • [7] K.-A. Lê Cao, D. Rossouw, C. Robert-Granié, and P. Besse, “A sparse PLS for variable selection when integrating omics data,” Stat. Appl. Genet. Mol. Biol., vol. 7, no. 1, 2008.
  • [8] H. Chun and S. Keleş, “Sparse partial least squares regression for simultaneous dimension reduction and variable selection,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), vol. 72, no. 1, pp. 3–25, 2010.
  • [9] J. M. Monteiro, A. Rao, J. Shawe-Taylor, J. Mourão-Miranda, A. D. Initiative et al., “A multiple hold-out framework for sparse partial least squares,” Journal of neuroscience methods, vol. 271, pp. 182–194, 2016.
  • [10] Q. Li, Y. Yan, and H. Wang, “Discriminative weighted sparse partial least squares for human detection,” IEEE Transactions on Intelligent Transportation Systems, vol. 17, no. 4, pp. 1062–1071, 2016.
  • [11] G. Zhu, Z. Su et al., “Envelope-based sparse partial least squares,” The Annals of Statistics, vol. 48, no. 1, pp. 161–182, 2020.
  • [12] B. Liquet, P. L. de Micheaux, B. P. Hejblum, and R. Thiebaut, “Group and sparse group partial least square approaches applied in genomics context,” Bioinformatics, vol. 32, no. 1, pp. 35–42, 2016.
  • [13] J. Chen and S. Zhang, “Integrative analysis for identifying joint modular patterns of gene-expression and drug-response data,” Bioinformatics, vol. 32, no. 11, pp. 1724–1732, 2016.
  • [14] W. Min, J. Liu, and S. Zhang, “Group-sparse SVD models via L1L_{1}-and L0L_{0}-norm penalties and their applications in biological data,” IEEE Trans. Knowl. Data Eng., vol. 33, no. 2, pp. 536–550, 2021.
  • [15] W. Min, X. Wan, T.-H. Chang, and S. Zhang, “A novel sparse graph-regularized singular value decomposition model and its application to genomic data analysis,” IEEE Trans. Neural Netw. Learn. Syst., vol. 33, no. 8, pp. 3842 –3856, 2022.
  • [16] W. Min, T. Xu, X. Wan, and T.-H. Chang, “Structured sparse non-negative matrix factorization with 2,0\ell_{2,0}-norm,” IEEE Trans. Knowl. Data Eng., pp. 1–13, 2022.
  • [17] I. Dagogo-Jack and A. T. Shaw, “Tumour heterogeneity and resistance to cancer therapies,” Nature reviews Clinical oncology, vol. 15, no. 2, pp. 81–94, 2018.
  • [18] J. Bolte, S. Sabach, and M. Teboulle, “Proximal alternating linearized minimization for nonconvex and nonsmooth problems,” Math. Program., vol. 146, no. 1-2, pp. 459–494, 2014.
  • [19] J. Chen and S. Zhang, “Matrix integrative analysis (MIA) of multiple genomic data for modular patterns,” Frontiers in genetics, vol. 9, p. 194, 2018.
  • [20] C. Meng, O. A. Zeleznik, G. G. Thallinger, B. Kuster, A. M. Gholami, and A. C. Culhane, “Dimension reduction techniques for the integrative analysis of multi-omics data,” Briefings in bioinformatics, vol. 17, no. 4, pp. 628–641, 2016.
  • [21] N. Rappoport and R. Shamir, “Multi-omic and multi-view clustering algorithms: review and cancer benchmark,” Nucleic acids research, vol. 46, no. 20, pp. 10 546–10 562, 2018.
  • [22] L. Wang and R.-C. Li, “A scalable algorithm for large-scale unsupervised multi-view partial least squares,” IEEE Transactions on Big Data, pp. 1–1, 2020.
  • [23] Z. Zhang, L. Liu, F. Shen, H. T. Shen, and L. Shao, “Binary multi-view clustering,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1774–1782, 2019.
  • [24] J. Han, J. Xu, F. Nie, and X. Li, “Multi-view k-means clustering with adaptive sparse memberships and weight allocation,” IEEE Trans. Knowl. Data Eng., vol. 34, no. 2, pp. 816–827, 2022.
  • [25] C. Hou, F. Nie, H. Tao, and D. Yi, “Multi-view unsupervised feature selection with adaptive similarity and view weight,” IEEE Trans. Knowl. Data Eng., vol. 29, no. 9, pp. 1998–2011, 2017.
  • [26] W. Hu, D. Lin, S. Cao, J. Liu, J. Chen, V. D. Calhoun, and Y.-P. Wang, “Adaptive sparse multiple canonical correlation analysis with application to imaging (epi) genomics study of schizophrenia,” IEEE Transactions on Biomedical Engineering, vol. 65, no. 2, pp. 390–399, 2018.
  • [27] Z. Deng, R. Liu, P. Xu et al., “Multi-view clustering with the cooperation of visible and hidden views,” IEEE Trans. Knowl. Data Eng., pp. 1–1, 2020.
  • [28] J. Zhao, X. Xie, X. Xu, and S. Sun, “Multi-view learning overview: Recent progress and new challenges,” Information Fusion, vol. 38, pp. 43–54, 2017.
  • [29] R. Rosipal and N. Krämer, “Overview and recent advances in partial least squares,” in International Statistical and Optimization Perspectives Workshop, 2005, pp. 34–51.
  • [30] M. Asteris, A. Kyrillidis, O. Koyejo, and R. Poldrack, “A simple and provable algorithm for sparse diagonal cca,” in ICML, 2016, pp. 1148–1157.
  • [31] W. Min, J. Liu, and S. Zhang, “Sparse weighted canonical correlation analysis,” Chinese Journal of Electronics, vol. 27, no. 3, pp. 459–466, 2018.
  • [32] D. M. Witten and R. J. Tibshirani, “Extensions of sparse canonical correlation analysis with applications to genomic data,” Stat. Appl. Genet. Mol. Biol., vol. 8, no. 1, 2009.
  • [33] W. Min, J. Liu, F. Luo, and S. Zhang, “A two-stage method to identify joint modules from matched microrna and mrna expression data,” IEEE Trans. Nanobioscience, vol. 15, pp. 362–370, 2016.
  • [34] D. M. Witten, R. Tibshirani, and T. Hastie, “A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis,” Biostatistics, vol. 10, no. 3, pp. 515–534, 2009.
  • [35] S. Saghafinia, M. Mina, N. Riggi, D. Hanahan, and G. Ciriello, “Pan-cancer landscape of aberrant dna methylation across human tumors,” Cell reports, vol. 25, no. 4, pp. 1066–1080, 2018.
  • [36] Z. Wang, E. Curry, and G. Montana, “Network-guided regression for detecting associations between dna methylation and gene expression,” Bioinformatics, vol. 30, no. 19, pp. 2693–2701, 2014.
  • [37] Y. Li and Z. Zhang, “Potential microRNA-mediated oncogenic intercellular communication revealed by pan-cancer analysis,” Scientific reports, vol. 4, no. 1, pp. 1–15, 2014.
  • [38] C. G. A. R. Network et al., “Comprehensive molecular profiling of lung adenocarcinoma,” Nature, vol. 511, no. 7511, pp. 543–550, 2014.
  • [39] C. G. A. Network et al., “Comprehensive molecular portraits of human breast tumours,” Nature, vol. 490, no. 7418, p. 61, 2012.
  • [40] C. Desmedt, B. Haibe-Kains et al., “Biological processes associated with breast cancer clinical outcome depend on the molecular subtypes,” Clinical cancer research, vol. 14, no. 16, pp. 5158–5165, 2008.
  • [41] J. P. Bates, R. Derakhshandeh, L. Jones, and T. J. Webb, “Mechanisms of immune evasion in breast cancer,” BMC cancer, vol. 18, no. 1, pp. 1–14, 2018.
  • [42] M. A. Cichon, C. M. Nelson, and D. C. Radisky, “Regulation of epithelial-mesenchymal transition in breast cancer cells by cell contact and adhesion,” Cancer informatics, vol. 14, p. CIN.S18965, 2015.
Wenwen Min is currently an associate professor in the School of Information Science and Engineering, Yunnan University. He received the Ph.D. degree in Computer Science from the School of Computer Science, Wuhan University, in December 2017. He was a visiting Ph.D. student at the Academy of Mathematics and Systems Science, Chinese Academy of Sciences, from 2015 to 2017, and a Postdoctoral Researcher with the School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China, from 2019 to 2021. His current research interests include data mining, machine learning, sparse optimization and bioinformatics. He has authored about 20 papers in journals and conferences, such as IEEE TKDE, IEEE TNNLS, PLoS Computational Biology, Bioinformatics, and IEEE/ACM TCBB.
Taosheng Xu is currently an associate professor at the Hefei Institutes of Physical Science, Chinese Academy of Sciences. He received his B.Sc. degree from Hefei University of Technology in 2009 and his Ph.D. degree in Control Science and Engineering from the University of Science and Technology of China in 2016. He was a Postdoctoral Research Fellow at the Arieh Warshel Institute for Computational Biology, The Chinese University of Hong Kong, Shenzhen, from 2020 to 2022. His current research interests include data mining, machine learning, bioinformatics and cancer genomics.
Chris Ding received his Ph.D. degree from Columbia University in 1987. He is currently a chair professor at The Chinese University of Hong Kong, Shenzhen. Before that, he worked at the California Institute of Technology, the Jet Propulsion Laboratory, Lawrence Berkeley National Laboratory, and the University of Texas. His research areas are machine learning, data mining, computational biology and high-performance computing. He proposed the L21 norm, which is widely used in machine learning, and made significant contributions to principal component analysis, non-negative matrix factorization and feature selection. He designed the minimum redundancy maximum relevance feature selection algorithm, which is widely adopted, e.g., by Uber; the two related papers have been cited 11,700 times. He has published over 200 research papers with over 50,000 citations.