
Latent regularization for feature selection using kernel methods in tumor classification

Martin Palazzo [email protected] Patricio Yankilevich Pierre Beauseroy Institut Charles Delaunay, Universite de Technologie de Troyes Biomedicine Research Institute of Buenos Aires - Max Planck Partner Institute Universidad Tecnologica Nacional Buenos Aires
Abstract

Cancer tumor transcriptomes are characterized by tens of thousands of gene expression features. Patient prognosis or tumor stage can be assessed from a gene expression profile by machine learning techniques such as supervised classification. Feature selection is a useful approach to identify the key genes that help classify tumors. In this work we propose a feature selection method based on Multiple Kernel Learning that yields a reduced subset of genes and a custom kernel that improves the classification performance when used in support vector classification. During the feature selection process, the method performs a novel latent regularization: the supervised target problem is relaxed by introducing unsupervised structure obtained from the latent space learned by a nonlinear dimensionality reduction model. An improvement of the generalization capacity is obtained and assessed by the tumor classification performance on new unseen test samples when the classifier is trained with the features selected by the proposed method, in comparison with other supervised feature selection approaches.

keywords:
Feature Selection, Gene Expression, Multiple Kernel Learning, Latent variable models

1 Introduction

The computational biology community is producing large quantities of raw data from different organisms. In particular, cancer samples are being recorded at a high pace in large repositories such as The Cancer Genome Atlas (TCGA) [1] and the International Cancer Genome Consortium (ICGC) [2] data portals. These databases contain hundreds of tumor records from humans, each one with data from different omics such as genomics, transcriptomics and proteomics, among others. Furthermore, these records are linked to the clinical data of the patient, such as survival time and tumor stage. Transcriptomics, or gene expression, is considered an important layer of information to be used as a predictor of the clinical outcome of the patient [3]. Gene expression data can be used to predict patient prognosis or tumor stage by modeling the problem as a binary classification task. Nevertheless, there are over 20,000 protein-coding genes, and this high-dimensional context increases the complexity of the prediction problem. For this reason it is necessary to reduce the initial high-dimensional problem while keeping the interpretability of the features involved. This approach should determine which genes are more important to predict the clinical outcome, and involves not only a classification problem but also a feature selection one.
Machine learning is a powerful approach to tackle these two problems by using algorithms that learn rules from data. Supervised feature selection is an important field dedicated mainly to selecting a subset of features that improves a classification task, under the assumption that the initial feature set contains not only useful predictors but also noisy features that should not be considered. More specifically, tumor profiles are described by the expression of thousands of genes, but the majority of these do not provide useful signals for classifying tumor stage or patient prognosis. Moreover, in general a reduced subset of $p$ genes out of an initial feature set of $n$ genes is sufficient to classify tumor profiles with acceptable performance, where $p \ll n$. Therefore feature selection methods are used in biomarker discovery [4]. Most of the time, the important features are selected using a supervised objective function [5]. In some cases this supervised objective may be too strict and difficult to fulfill in order to obtain a model that generalizes to new unseen data [6]. A major question then arises: is it possible to improve the feature selection process by taking advantage of the structure of the training data in feature space, in the search for better classification and generalization performance?
Our work proposes a feature selection method based on Multiple Kernel Learning (MKL) [7], a family of models which defines a new kernel by combining multiple kernel functions via a weighting system to optimize an objective function. Additionally, the proposed method combines MKL with a nonlinear latent feature extraction model to improve the feature selection problem through a combination of supervised and unsupervised approaches, respectively. This combination aims to improve the generalization capacity in classification of the selected features. The strategy maximizes the separability between tumor classes while simultaneously considering the latent structure of the training data. In this work the proposed selection method performs what we name a latent regularization, using simultaneously the labels of the data and unsupervised latent variables to improve the classification task and the generalization. To extract the latent variables of the training data, an unsupervised dimensionality reduction is performed using kernel-PCA (kPCA) [8]. With our method the selected features are the ones that account for the tumor labels and the data structure at the same time.
The proposed method is designed to deal with tumor classification problems where the dimensionality is $n > 18{,}000$ and the sample size is lower than 200 tumor profiles. Tumor profiles are classified by stage or prognosis. In this scenario most feature selection algorithms may underperform due to the lack of tumor samples and the high-dimensional feature set. The combination of the MKL and kPCA models performs feature selection while improving the generalization capacity of the classification task thanks to the latent variables extracted by the kPCA. This latent regularization step works as a label relaxation that is shown to improve the tumor classification performance on new unseen test samples. The proposed method is named Kernel Latent Regularization Feature Selection (KLR-FS).

2 Related Work

Feature selection is a key task in biology due to the high number of genes describing a tumor profile. Feature selection algorithms are useful in this type of problem and are categorized into filter, wrapper and embedded methods [9]. Methods like LASSO [10][11], Minimal Redundancy Maximal Relevance (MRMR) [12], Recursive Feature Elimination (RFE) [13] and the Hilbert-Schmidt Independence Criterion kernelized Lasso (HSIC-Lasso) [14] have been used to select key genes from expression data for tumor classification.
Supervised MKL [7] relies on a linear combination of kernels to improve the classification performance of a support vector machine classifier [15] or to maximize the Kernel Target Alignment (KTA) score [16][17] of the resulting kernel. Different implementations of supervised MKL have been used in cancer problems for patient prognosis prediction. MKL has been applied to multi-omics data fusion considering gene expression, copy number variation, gene methylation, mRNA expression and histopathological features [18] for Glioblastoma Multiforme survival prediction, optimizing the MKL problem to improve a Support Vector Machine (SVM) classification task. The same multi-omics fusion approach using MKL has been applied to 14 cancer types for prognosis assessment, combining the kernels so as to improve the Kernel Target Alignment (KTA) score [19][20]. MKL has also been used to combine multiple feature subsets by optimizing the MKL model with the classification performance of an SVM model on Lung Squamous Cell Carcinoma samples [21]. Moreover, MKL has been used to select subsets of gene features from gene expression data of 9 cancer types, where the MKL solution is optimized to improve an SVM classifier and the result defines the feature subset that optimizes the classification accuracy [22].
Unsupervised dimensionality reduction approaches that perform nonlinear extraction of latent features have been widely used in cancer problems involving molecular data [23][24]. kPCA is the nonlinear dimensionality reduction method used in this work. It has been used on gene expression data for missing data estimation [25], to integrate biological data from different sources in a nutrigenomic study in mouse [26], to find meaningful representations from Leukemia and Lymphoma microarray data [27], and for multi-omic data fusion on Breast cancer tumor profiles [28].
There is no doubt about the potential of kernel-PCA, Multiple Kernel Learning and feature selection approaches in cancer genomics. The high-dimensional context of gene expression data makes these methods useful to reduce the dimensionality and to perform biomarker discovery via feature selection. Moreover, we observe that most supervised feature selection approaches do not take advantage of the structure of the training data [9]. To the best of our knowledge, there is no prior work that mixes, via kernel methods, the latent data structure and the class labels to improve the classification performance of a supervised feature selection method.

3 The KLR-FS method

As introduced in the previous section, this work proposes a feature selection method that aims to retain information about the labels and about the global data structure. To achieve that goal, a new feature selection method is proposed which uses a regularization step that mixes target sample labels with unsupervised latent variables learned from the training samples. This section presents the three main parts of the KLR-FS method. First, the feature selection strategy using MKL, which maximizes the kernel alignment to select a subset of genes by a classic supervised criterion where the labels are the only target. Then, the latent regularization is introduced by explaining how to build a target kernel matrix composed of a mix of supervised labels and unsupervised latent variables, so that the selected features account for both the labels and the structure. Finally, support vector classification is performed on tumor profiles using the resulting selected features and the kernel obtained from the MKL step.

3.1 Feature selection with multiple kernel learning

Through a Multiple Kernel Learning process built to maximize the KTA score, a subset of features is selected and a kernel is obtained. This subsection defines how to select features with MKL using a supervised approach.
Given $\mathcal{X} \subseteq \mathbb{R}^{n}$ an $n$-dimensional space, a function $k: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ which is symmetric and positive semi-definite satisfies

k(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle_{\mathcal{H}}  (1)

where $\phi$ is a mapping function from $\mathcal{X}$ to a high-dimensional feature Hilbert space $\mathcal{H}$ such that

\phi: \mathcal{X} \mapsto \phi(\mathcal{X}) \in \mathcal{H}  (2)

and $\mathcal{H}$ is defined as the Reproducing Kernel Hilbert Space (RKHS) of $k$ [29]. The function $k$ computes the inner product of a pair of vectors in the Hilbert space $\mathcal{H}$. Consider a set of $m$ labeled samples

\left\{ (x_1, y_1), (x_2, y_2), \ldots, (x_m, y_m) \right\}

where $x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$. The Kernel Matrix or Gram Matrix is defined as an $m \times m$ matrix with entries $K_{ij}$. Every entry of the Kernel or Gram Matrix is defined as

K_{ij} = \left\langle \phi(x_i), \phi(x_j) \right\rangle = k(x_i, x_j)  (3)

The kernel function can be thought of as an information bottleneck that concentrates the information required to perform a learning task [30]. Kernels can be interpreted as similarity functions, where an output close to $0$ means that the samples are orthogonal, or highly dissimilar, in the Hilbert space, and an output close to $1$ means that the samples are similar. In this work we use the Radial Basis Function (RBF) kernel, defined as

k(x_i, x_j) = \exp\left( -\gamma \left\| x_i - x_j \right\|^{2} \right)  (4)

where $\gamma$ is a scale parameter.
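As an illustration, the following Python sketch computes the RBF Gram matrix of equation 4 with NumPy; the matrix size and the value of $\gamma$ are arbitrary and only meant to show the computation.

```python
import numpy as np

def rbf_kernel_matrix(X, gamma=0.1):
    """Compute the RBF Gram matrix K_ij = exp(-gamma * ||x_i - x_j||^2)."""
    # Squared Euclidean distances between all pairs of rows of X
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))

# Example: 5 samples with 3 features each (gamma chosen arbitrarily)
K = rbf_kernel_matrix(np.random.randn(5, 3), gamma=0.1)
```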
Given two valid kernels $K_1$ and $K_2$ over a set of samples $M$, the alignment $A$ between both kernels is defined as

A(K_1, K_2) = \frac{\left\langle K_1, K_2 \right\rangle_F}{\sqrt{\left\langle K_1, K_1 \right\rangle_F \left\langle K_2, K_2 \right\rangle_F}}  (5)

and measures the similarity between the two kernels computed on the same sample set $M$ [20][31]. Here $\langle \cdot, \cdot \rangle_F$ is the Frobenius inner product. The alignment can be interpreted as how similarly both kernels map the samples of $M$. If tumor labels are used, then $K_2$ represents an ideal kernel matrix, or target, $K_{yy}$ whose entries are $+1$ if $y_i = y_j$ and $0$ if $y_i \neq y_j$. The alignment of a kernel $K$ built on $x$ and the target $K_{yy}$ can be expressed as

A(K, K_{yy}) = \frac{\left\langle K, K_{yy} \right\rangle_F}{\sqrt{\left\langle K, K \right\rangle_F \left\langle K_{yy}, K_{yy} \right\rangle_F}}  (6)

known as the Kernel Target Alignment (KTA) score. The higher the KTA between a given kernel matrix $K$ and its target $K_{yy}$, the higher the inter-cluster separation between the two classes.
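A minimal sketch of the KTA computation of equation 6, assuming the ideal target kernel $K_{yy}$ is built from binary labels as described above:

```python
import numpy as np

def kernel_target_alignment(K, y):
    """Alignment A(K, K_yy) (eq. 6) with K_yy[i, j] = 1 if y_i == y_j else 0."""
    y = np.asarray(y).reshape(-1, 1)
    K_yy = (y == y.T).astype(float)          # ideal target kernel
    num = np.sum(K * K_yy)                   # Frobenius inner product <K, K_yy>_F
    den = np.sqrt(np.sum(K * K) * np.sum(K_yy * K_yy))
    return num / den
```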
Multiple Kernel Learning (MKL) allows the combination of simple kernels to build a more complex one with higher KTA. MKL is defined as the linear combination of multiple kernels to build a final one [7] and can be expressed as

K_{\mu}(x, x') = \sum_{i=1}^{n} \mu_i K_i(x, x'), \quad \mu_i \geq 0  (7)

where the weight vector $\mu$ contains the weight $\mu_i \geq 0$ of each kernel $K_i$ and is directly related to the importance of each kernel in the final solution. Although there are different objective functions to compute the weights of the MKL model, such as the classification accuracy of a support vector machine, in this work the MKL model is built with the objective of maximizing the KTA. This means that the kernel resulting from the combination of the initial ones presents a higher inter-class separability than each individual kernel on its own. The resulting KTA for the kernel $K_{\mu}$ is $A(K_{\mu}, K_{yy})$, computed as detailed in equation 6.
This work uses a greedy strategy [16] to compute the weights $\mu$ of the resulting kernel $K_{\mu}$ by combining just two kernels at each iteration while maximizing the KTA of the resulting kernel. This strategy allows a simpler computation of the $\mu$ vector, obtaining the weight values at which the partial derivatives of the alignment vanish as the optimality condition. Solving the MKL process with two kernels at each iteration $t$ means that the weight vector is $\boldsymbol{\mu} = [\mu_{\alpha}, \mu_{\beta}]$. Since the solution with two kernels is convex [32], the values of $\mu_{\alpha}$ and $\mu_{\beta}$ can be obtained from

\frac{\partial \mathit{KTA}(\mu_{\alpha}, \mu_{\beta})}{\partial \mu_{\alpha}} = 0, \qquad \frac{\partial \mathit{KTA}(\mu_{\alpha}, \mu_{\beta})}{\partial \mu_{\beta}} = 0  (8)

subject to $\mu_{\alpha}, \mu_{\beta} \geq 0$. From a set of $N$ kernels, the greedy approach starts at iteration $t = 0$ by choosing the kernel $K_i$ from $N$ with the highest $KTA_i$. This kernel $K_i$ is assigned to an initially empty list $P$ containing the kernels that are part of the solution, and it becomes the current solution $K_{\mu}^{(t)} = K_i$. Then, for the following iterations $t = 1 \ldots p$, the method iteratively combines the current $K_{\mu}^{(t)}$ with a new kernel $K_j$ from $N$, $i \neq j$, to obtain a new solution $K_{\mu}^{(t+1)} = \mu_{\alpha} K_{\mu}^{(t)} + \mu_{\beta} K_j$ that maximizes the overall KTA $A(K_{yy}, K_{\mu}^{(t+1)})$. This process adds first to the solution $P$ the kernels that increase the overall KTA the most, so the kernel selected first has the highest weight $\mu_i$. The kernels selected afterwards receive lower weights $\mu_i$, since their contribution $\Delta KTA$ to the overall KTA is lower than that of the previously added kernel. The KTA increment at each iteration satisfies $\Delta KTA^{(t)} > \Delta KTA^{(t+1)} > \ldots > \Delta KTA^{(p)}$. This means that when a number $p$ of kernels is requested in the solution, the weights tend to decrease with $p$ as $\mu_0 > \mu_1 > \ldots > \mu_p$.
Given a dataset of $m$ samples characterized by $n$ features, this work proposes to build $n$ feature-wise kernels $K_i$. The parameter $\gamma_i$ of each feature-wise kernel is selected based on the criterion of maximizing the alignment of that kernel. Then, using the proposed MKL method, a subset $P$ of feature-wise kernels is iteratively selected and combined to increase the overall KTA of the resulting kernel $K_{\mu}$. Only the feature-wise kernels that increase the KTA are included in the final kernel. This approach leads to a sparse solution where the number of selected features $p$, associated with the selected feature-wise kernels, satisfies $p \ll n$. The desired output of the greedy MKL approach is a reduced set of $p$ features and a kernel $K_{\mu}$ that improves the inter-cluster distance between samples of different tumor classes and thus improves the support vector classification. The positive values $\mu_i > 0$ of the weighting vector indicate the feature importance in the result and identify the selected features.
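The sketch below outlines this greedy feature-wise selection loop under simplifying assumptions: instead of solving the stationarity conditions of equation 8 in closed form, the weight of the candidate kernel is found by a coarse grid search, and the loop simply stops after a requested number of features $p$. It is not the authors' implementation, only an illustration of the procedure.

```python
import numpy as np

def greedy_mkl_feature_selection(kernels, K_target, p=10, weight_grid=None):
    """Greedily select p feature-wise kernels maximizing alignment with K_target.

    `kernels` is a list of n precomputed feature-wise Gram matrices.
    The two-kernel weight of eq. 8 is approximated by a coarse grid search.
    """
    if weight_grid is None:
        weight_grid = np.linspace(0.05, 0.95, 19)

    def alignment(K1, K2):
        return np.sum(K1 * K2) / np.sqrt(np.sum(K1 * K1) * np.sum(K2 * K2))

    # Start from the single kernel with the highest alignment to the target.
    scores = [alignment(K, K_target) for K in kernels]
    selected = [int(np.argmax(scores))]
    weights = {selected[0]: 1.0}
    K_mu = kernels[selected[0]].copy()

    for _ in range(p - 1):
        best = (None, None, alignment(K_mu, K_target))
        for j, Kj in enumerate(kernels):
            if j in weights:
                continue
            for beta in weight_grid:
                K_new = (1.0 - beta) * K_mu + beta * Kj
                a = alignment(K_new, K_target)
                if a > best[2]:
                    best = (j, beta, a)
        if best[0] is None:          # no remaining kernel improves the alignment
            break
        j, beta, _ = best
        for k in weights:            # rescale the weights already in the solution
            weights[k] *= (1.0 - beta)
        weights[j] = beta
        K_mu = (1.0 - beta) * K_mu + beta * kernels[j]
        selected.append(j)

    return selected, weights, K_mu
```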
Once $K_{\mu}$ is built, it is used as a custom kernel function in a support vector classifier [29] for binary classification in tumor stage and patient survival problems.

3.2 Latent regularization with nonlinear feature extraction

In the previous section a feature selection method based on multiple kernel learning and kernel target alignment was described. This method selects features using as target a supervised kernel matrix $K_{yy}$ in which the separation between classes is ideal. But this objective is very specific and, since the number of samples is small, it can lead to significantly different solutions when samples in the training set are changed. Thus it may be interesting to be less specific and more conservative about the structure of the data. For this reason, a relaxation of this supervised constraint is proposed in this work by mixing the target kernel $K_{yy}$ with a kernel $K_z$ built on the latent space $Z$ learned by a dimensionality reduction algorithm based on a nonlinear transformation $\phi_z(x)$. Note that $K_{yy}$ is built from the tumor labels and $K_z$ from the extracted latent variables; thus the former is built by a supervised approach and the latter by an unsupervised one. The mixture of both kernels forms a new target kernel $K_{\delta}$ that carries supervised and unsupervised information from the training samples. For the unsupervised and nonlinear dimensionality reduction transformation $\phi_z(x)$ any algorithm can be used, such as Kernel-PCA, Autoencoders or t-SNE. The key aspect is to learn, using an unsupervised criterion, a latent space from the training samples that captures the structure of the data, and to mix it with the supervised labels as a means of relaxing the supervised constraint.
Most supervised feature selection models $f$ are expressed as

y \sim f(x)

where a subset of $p$ selected features minimizes a loss function between the model output $f(x)$ and the true labels $y$. This work proposes a feature selection model that learns not only from the true labels $y$ but also from the latent structure of the training data, as

(z, y) \sim f(x)

where $z = \phi_z(x)$ is the latent representation obtained from a nonlinear and unsupervised mapping function. The mixture of the labels $y$ and $z$ creates a hybrid target composed of a supervised and an unsupervised part. In this work, to extract the latent space $z$, the function $\phi_z(x)$ is learned with the Kernel-PCA method [8].
This method is based on a nonlinear mapping $\phi_z(x)$ from $\mathcal{X}$ to $\mathcal{H}$, where the standard Principal Component Analysis (PCA) algorithm is performed.
PCA is a linear dimensionality reduction method, defined as the orthogonal projection of the training samples onto a low-dimensional space such that the variance of the projected samples is maximized [33]. Given a set of samples $\{x_i\}$, $i = 1, \ldots, m$, characterized by $n$ dimensions, the goal of PCA is to map each $x_i$ into a low-dimensional space of dimension $p$ such that $p < n$. PCA projects the data onto the $p$-dimensional space that maximizes the variance. The PCA transformation is obtained by applying the spectral decomposition to the covariance matrix $C$ of the training data, such that $U\Lambda = CU$. The matrix $U$ contains the stacked vectors $u_1, \ldots, u_p$, where $u_i$ is the $i$-th eigenvector corresponding to the $i$-th largest eigenvalue $\lambda_i$ in the matrix $\Lambda$. By selecting the first $p$ eigenvectors, a new $n \times p$ matrix $U'$ is obtained and defines the desired subspace. Then, by computing $XU' = Z$, the original data is projected by $U'$ into $Z$ of dimension $p$.
To make PCA nonlinear, the kernel methods introduced in the previous section are used to define Kernel-PCA, which performs an implicit PCA in the projected Hilbert space $\phi(x)$ via the kernel trick [34]. In this case the spectral decomposition is applied to a kernel matrix $K$ such that

m \Lambda U = K U

where $U$ and $\Lambda$ are the matrices containing the eigenvectors and eigenvalues respectively, and $m$ is the number of samples. The resulting coordinates $(l_1, \ldots, l_p)$, known as kernel principal components, are then calculated by

l_j = \sum_{i=1}^{m} u_{ij} \, k(x_i, x), \quad j = 1 \ldots p  (9)

which is equivalent to performing a nonlinear PCA in the original input space.
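As a sketch, the latent coordinates of equation 9 can be extracted with scikit-learn's KernelPCA; the data matrix, the number of components and the RBF $\gamma$ below are illustrative choices, not the values used in this work.

```python
import numpy as np
from sklearn.decomposition import KernelPCA

# Hypothetical training matrix: 150 tumor samples x 18,000 genes (already scaled).
X_train = np.random.randn(150, 18000)

# Learn a nonlinear, unsupervised latent space Z with an RBF kernel;
# n_components and gamma are arbitrary illustrative values.
kpca = KernelPCA(n_components=10, kernel="rbf", gamma=1e-4)
Z_train = kpca.fit_transform(X_train)   # latent coordinates (l_1, ..., l_p)
```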
Once the kernel-PCA is computed with the training samples $\boldsymbol{x}_{tr}$, these are mapped to $\boldsymbol{z}$ and a kernel $K_z$ is built on the latent space. $K_z$ captures the unsupervised structure of the training samples in the latent space. Then, by a linear combination of the supervised $K_{yy}$ and the unsupervised $K_z$ kernels, a new hybrid target kernel $K_{\delta}$ is created

K_{\delta} = \delta K_{yy} + (1 - \delta) K_z  (10)

This new kernel is a linear combination of the supervised labels and the unsupervised latent variables, governed by a new parameter named the Mixture Coefficient $\delta \in [0, 1]$.
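A minimal sketch of building $K_{\delta}$ from equation 10, assuming an RBF kernel on the kernel-PCA coordinates for $K_z$ (the exact kernel on the latent space is not prescribed here) and an illustrative value of its scale parameter:

```python
import numpy as np

def mixture_target_kernel(y, Z, delta=0.6, gamma_z=1.0):
    """Build K_delta = delta * K_yy + (1 - delta) * K_z (eq. 10).

    K_yy is the ideal label kernel; K_z is an RBF kernel on the latent
    coordinates Z returned by kernel-PCA. gamma_z is an illustrative value.
    """
    y = np.asarray(y).reshape(-1, 1)
    K_yy = (y == y.T).astype(float)
    sq = np.sum(Z ** 2, axis=1)
    K_z = np.exp(-gamma_z * (sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T))
    return delta * K_yy + (1.0 - delta) * K_z
```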

Figure 1 shows the KLR-FS pipeline. From equation 10 three scenarios are possible. When $\delta = 1$, then $K_{\delta} = K_{yy}$, which corresponds to the supervised kernel described in the previous section, and the feature selection process is completely supervised. On the other hand, when $\delta = 0$, then $K_{\delta} = K_z$, which represents the unsupervised structure provided by the kernel-PCA of the training samples; in this case the target $K_{\delta}$ does not contain any supervised information and the MKL process is unsupervised. Finally, every value $0 < \delta < 1$ corresponds to a kernel $K_{\delta}$ that mixes the supervised and unsupervised components of the problem. Through this approach a latent regularization of the supervised feature selection problem is possible. The hypothesis stated in this work is that the features selected by MKL with a mixture of supervised labels and unsupervised latent variables provide a higher generalization ability and thus a higher classification performance.
Finally, once the MKL step is done, a kernel $K_{\mu}$ and a subset $P$ of selected features are obtained. The $K_{\mu}$ kernel is then used as the kernel in a support vector classifier [35] to classify tumor profiles:

\hat{y}(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i y_i \left\langle \phi_{\mu}(\mathbf{x}), \phi_{\mu}(\mathbf{x}_i) \right\rangle_{\mathcal{H}} + b = \sum_{i=1}^{m} \alpha_i y_i k_{\mu}(\mathbf{x}, \mathbf{x}_i) + b
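In practice a precomputed Gram matrix can be passed directly to a support vector classifier; the sketch below uses scikit-learn's SVC with a precomputed kernel on synthetic data as a stand-in for $K_{\mu}$ (data, kernel and $C$ are illustrative only).

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(80, 10)), rng.normal(size=(20, 10))
y_tr = rng.integers(0, 2, size=80)

def rbf(A, B, gamma=0.1):
    d = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * d)

# Stand-in for the MKL kernel K_mu: train/train and test/train Gram matrices.
K_tr = rbf(X_tr, X_tr)
K_te = rbf(X_te, X_tr)

svc = SVC(kernel="precomputed", C=1.0).fit(K_tr, y_tr)
y_pred = svc.predict(K_te)
```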
Figure 1: The KLR-FS pipeline. 1) A $K_{yy}$ matrix is built using the tumor labels. A kernel-PCA model is trained using the training data and a kernel $K_z$ is built on the latent space learned by the kernel-PCA. The $K_{\delta}$ kernel is obtained from the mixture of $K_{yy}$ and $K_z$. 2) From the training data a set $N$ of feature-wise kernels is built. By MKL a subset of feature-wise kernels is selected and a $K_{\mu}$ kernel is obtained by improving the alignment with the $K_{\delta}$ kernel. The $\mu$ vector indicates the selected features. 3) The $K_{\mu}$ kernel is used in support vector classification.

3.3 Model evaluation

This work is focused on the binary classification of tumor samples using reduced feature subsets selected by KLR-FS. The classification task gives an idea of the predictive power of the selected features. Each binary classification problem has two classes, a Positive and a Negative class. To evaluate a classification model on new independent test samples, four outcomes are possible: True Positive (TP), False Negative (FN), True Negative (TN) and False Positive (FP). For the Positive class, TP counts the correctly classified samples while FN counts those misclassified by the model. For the Negative class, TN counts the correctly classified samples while FP counts those misclassified.
Another way to measure the performance of a classifier is to estimate the area under the Receiver Operating Characteristic (ROC) curve [36], or Area Under the Curve (AUC). The ROC curve represents the TP rate as a function of the FP rate. The AUC is a score used to measure the overall performance of a binary classifier. It ranges from $0.5$ to $1$: an AUC of $0.5$ represents the performance of a random classifier, while an AUC of $1$ corresponds to a perfect classifier. Unlike Accuracy and other classification metrics based on a specific decision threshold defined a priori by the classifier, the AUC is independent of the classification threshold and is computed from a continuous output, which makes it a robust score. The AUC is used to evaluate the classification performance of a binary classifier trained on the features selected by KLR-FS. For all the experiments a Support Vector Classifier is trained using the features selected by each method.
To evaluate the robustness and quality of the feature selection process, the Redundancy Rate (RED) [37][14] metric is used. The RED score measures the mean absolute correlation between the selected features and is computed as

\text{RED} = \frac{1}{p(p-1)} \sum_{f_i, f_j \in P} |\rho_{ij}|  (11)

where $\rho_{ij}$ is the correlation between the $i$-th and the $j$-th selected variables. The RED metric takes values between $0$ and $1$. A low value of RED indicates that most of the selected features within the subset $P$ have low linear correlation with each other; thus low redundancy is expected within the selected feature set, which can be interpreted as a high quality of selection. On the other hand, values of RED close to $1$ correspond to a subset of features with high redundancy, which is an undesired outcome.
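A short sketch of the RED computation of equation 11, where the input matrix contains only the $p$ selected features of the training samples:

```python
import numpy as np

def redundancy_rate(X_selected):
    """RED score (eq. 11): mean absolute pairwise correlation of selected features.

    X_selected has shape (m samples, p selected features).
    """
    p = X_selected.shape[1]
    rho = np.corrcoef(X_selected, rowvar=False)     # p x p correlation matrix
    off_diag = np.abs(rho) - np.eye(p)              # drop the diagonal (|rho_ii| = 1)
    return off_diag.sum() / (p * (p - 1))
```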

3.4 Benchmark methods

The proposed feature selection method for classification is compared with four other supervised feature selection methods. These methods are Minimum Redundancy Maximum Relevance (mRMR)[38], the High-Dimensional Feature Selection by Feature-Wise Kernelized Lasso (HSIC-Lasso) [14], the SVM Recursive Feature Elimination (SVM-RFE)[13] and the univariate ANOVA filter [39].
The mRMR method attempts to discard redundant features while keeping the features with the highest relevance to the target labels $y$. The objective is to select a feature subset that best characterizes the statistical properties of the target label [38]. The selection is constrained so that the selected features are mutually as dissimilar to each other as possible, while remaining as relevant as possible to the target label.
The HSIC-Lasso method [14] is a feature-wise kernelized Lasso that captures non-linear dependency between input features and target labels.
The SVM-RFE is a greedy feature selection method [40][41] that generates a ranking list of features and selects a subset of the top-ranked features. The ranking is built by a feature weight vector ww obtained from the parameters of the hyperplane decision function of a SVM classifier and the top pp features are selected.
Finally, the ANOVA filter is a univariate feature selection method that scores each feature based on the p-value obtained from a statistical test between feature $i$ and the class label $y$.
During support vector classification for the benchmark methods, an RBF kernel is used with a grid search on the $\gamma$ hyperparameter, $\gamma \in \{0.01, 0.1, 1, 10\}$. For KLR-FS the resulting $K_{\mu}$ kernel is used. A grid search on the $C$ hyperparameter, $C \in \{0.1, 1, 10, 100\}$, is used for all kernels. Hyperparameters are selected by 5-fold cross-validation on the training set to select the best model.
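For reference, the following sketch reproduces this tuning setup with scikit-learn's GridSearchCV on synthetic data; the data shapes and the use of AUC as the cross-validation scoring metric are assumptions made for illustration.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 10))      # selected features of the training tumors
y_train = rng.integers(0, 2, size=100)

# Grid over C and gamma as in the benchmark setup, selected by 5-fold CV.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1, 1, 10]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="roc_auc")
search.fit(X_train, y_train)
best_model = search.best_estimator_
```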

4 Datasets

In this work three cancer datasets containing tumor profiles characterized by gene expression (RNA-Seq) features are used to evaluate the proposed method. These datasets are the Breast cancer BRCA-US, the Pancreas cancer PACA-CA and the Lung cancer SMK-CAN-187. The BRCA-US and PACA-CA datasets are available from the International Cancer Genome Consortium [2]. The SMK-CAN-187 dataset was presented by [42] and is available from http://featureselection.asu.edu/ and the Gene Expression Omnibus [43] under GEO accession number GSE4115.
The Breast cancer BRCA-US dataset is composed of 194 Breast cancer samples labeled by survival days since diagnosis. To define two classes, the threshold between low and high survival is set at five years of survival since diagnosis [44]. The BRCA-US tumor samples are characterized by the expression of 20502 protein-coding genes.
The Pancreas cancer PACA-CA dataset is composed of 135 tumor profiles labeled by tumor stage. The stage labels are IA, IB, IIA and IIB. Stages IA and IB are considered early stage, while stages IIA and IIB are considered late stage. The PACA-CA tumor samples are characterized by the expression of 18020 protein-coding genes.
The SMK-CAN-187 is a benchmark microarray-based gene expression dataset with 187 samples and 19993 gene expression features.

Dataset        Gene features   Class           Samples
BRCA-US        20502           Low Survival    63
                               High Survival   131
PACA-CA        18020           Early Stage     84
                               Late Stage      51
SMK-CAN-187    19993           Control         90
                               Tumor           97
Table 1: Size of each dataset.

Table 1 summarizes the size of each dataset. The high-dimensional context of the three datasets and the low number of tumor samples are clearly visible.

4.1 Preprocessing

The initial sample set is randomly split into 80% for training and 20% for testing. Using only the training samples, all gene expression features are auto-scaled to zero mean and unit variance. Test samples are auto-scaled using the transformation learned from the training samples. Then, using the training samples, the feature selection methods are applied and a subset of features is selected. With the selected features a support vector classifier is trained, tuned by 5-fold cross-validation within the training set for hyperparameter selection, and evaluated on the test set.
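A sketch of this preprocessing with scikit-learn, using a random matrix in place of the expression data; the random seed and matrix shapes are illustrative only.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(194, 20502))          # placeholder for the BRCA-US expression matrix
y = rng.integers(0, 2, size=194)

# 80/20 random split; scaling parameters are learned on the training set only.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)
```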

5 Experimental Results

For performance estimation the split between train and test sets is done randomly five times, and each time the feature selection and classification tasks are carried out. The classification results and the redundancy rate of the feature selection are then averaged across all random splits, and the mean and standard deviation of these metrics are reported.
The experimental results section is divided into two subsections. The first is devoted to the latent regularization for feature selection and analyzes how the performance of KLR-FS behaves for different values of the mixture coefficient described in equation 10. The second presents the performance estimation and statistical comparison of the proposed method against the benchmark methods.

5.1 Latent regularization for feature selection

This subsection details how the feature selection process of KLR-FS behaves under different settings of the mixture coefficient $\delta$. As explained in the methods section, varying the value of $\delta$ yields different target kernels $K_{\delta}$, each one with a different mix between the unsupervised $K_z$ and the supervised $K_{yy}$ kernels. A set of six values of the Mixture Coefficient, $\delta \in \{0, 0.2, 0.4, 0.6, 0.8, 1.0\}$, is evaluated, progressively increasing the dependency on the labels. Each value of $\delta$ determines a feature selection model. The resulting kernel $K_{\mu}$ is then used for support vector classification and evaluated on the test set.
Figure 2 shows the AUC results for classification using the features selected at different values of $\delta$ on the three datasets. The AUC score peaks between $\delta = 0.4$ and $\delta = 0.6$ in all cases. These results evidence the importance of the latent regularization for classification, and mean that the maximum AUC score using the KLR-FS features is obtained with a mixture of supervised labels and unsupervised latent structure.
Figure 3 shows how the RED score varies with the Mixture Coefficient $\delta$ for the corresponding selected features, and how the RED score is reduced by the latent regularization. For the BRCA-US and PACA-CA datasets the lowest values of RED are obtained for $\delta = 0.4$ and $\delta = 0.6$, matching the range of $\delta$ values with maximum AUC. For the SMK-CAN dataset the lowest RED scores are obtained between $\delta = 0.6$ and $\delta = 1$, and only when the number of selected features is $p = 20$ or $p = 30$ is the lowest RED score found at $\delta = 1$. Figure 3 also shows that when the number of selected features $p$ is small, the mixture parameter $\delta$ has a stronger impact.

Figure 2: Classification performance measured by AUC-ROC using different numbers of features selected by KLR-FS, across different values of the $\delta$ mixture coefficient.
Figure 3: Redundancy rate (RED) of the features selected by KLR-FS for different values of the $\delta$ mixture coefficient.

5.2 Performance estimation and statistical comparison

In this subsection KLR-FS is compared with the benchmark methods for different numbers of selected features $p$. For performance estimation the results of the classification and selection tasks are averaged over all random iterations. The feature selection is applied on the training samples and the classifier is trained on the selected features to finally classify the test samples. The mixture coefficient used for KLR-FS in the benchmark is $\delta = 0.6$, since it appears to be the best setting according to the AUC and RED scores in the previous section.

Figure 4: Comparison of the classification performance measured by AUC-ROC between KLR-FS and the benchmark methods, for different numbers of selected features $p$ and different datasets.

Figure 4 shows the classification performance on the test set for different numbers $p$ of features selected by each method. For the BRCA-US and PACA-CA datasets KLR-FS shows the highest mean AUC score and the lowest classification variance for every number of selected features. For the SMK-CAN dataset KLR-FS has the highest classification performance with $p = 10$ features, and for larger numbers of features it is outperformed by the RFE method.
In the three datasets the classification results using the KLR-FS features are the most stable as the value of $p$ increases. As described in section 3.1, the resulting $K_{\mu}$ assigns a larger weight $\mu_i$ to the variables selected first, since these are the ones that increase the overall alignment $\Delta A(K_{\delta}, K_{\mu})$ the most, while the variables selected later receive much smaller weights. The KLR-FS results in Figure 4 suggest that these classification problems can be solved by selecting the first 10 variables, since selecting more variables does not introduce a significant improvement in the classification results.

Figure 5: Evolution of the RED score for different numbers of selected features, for each method.

Figure 5 shows the RED score of each subset of selected features for different values of $p$. For the BRCA-US and PACA-CA datasets the lowest score is obtained by the mRMR method with RED $< 0.1$, followed by RFE, KLR-FS and HSIC-Lasso. For the SMK-CAN dataset the lowest RED score is obtained by the RFE method followed by HSIC-Lasso and KLR-FS, while the highest corresponds to mRMR and ANOVA. ANOVA shows the highest RED score in all datasets. KLR-FS and HSIC-Lasso show a similar behaviour, with RED $< 0.2$ in the BRCA-US and PACA-CA datasets and RED $< 0.4$ for the SMK-CAN dataset.
This section compared the KLR-FS method, with $\delta = 0.6$, against four feature selection methods. The main objective of this work is to select features that improve tumor classification, and the classification performance of KLR-FS is observed to be the highest in the majority of the cases.

6 Discussion

This work proposes a feature selection model based on Multiple Kernel Learning coupled with kernel-PCA, with an application to tumor classification using high-dimensional gene expression data from tumor profiles. The proposed method, KLR-FS, aims to select features considering not only the sample labels $y$ but also the latent structure $z$ of the training data, presented in this work as a latent regularization since it is used to improve the generalization capacity of feature selection for classification. The idea is to consider not only a purely supervised target kernel $K_{yy}$ but also a kernel $K_z$ built from the latent variables $z$ obtained from a nonlinear dimensionality reduction $\phi_z$. The latent regularization comes from a kernel built on the latent space $\mathcal{Z}$ generated by the kernel-PCA. This allows exploring the relaxation of the sample labels through a linear combination of a supervised and an unsupervised kernel. The experiments detailed in section 5.1 show that the highest classification performance and the lowest redundancy rate for KLR-FS are obtained when latent information is mixed with label-targeted information. These results show that selecting variables considering both the latent structure $z$ of the data and the labels of the supervised problem can improve the classification task. Moreover, partially learning a latent space works as a regularization term, since it limits the solution space by introducing the need to also capture the general structure of the data. By doing so, it can improve the generalization capacity on new unseen test samples. In addition, a relation between the latent structure of the data and the sample labels $y$ exists, and learning only from the labels with a classic supervised criterion can result in overfitting when training in a context where the number of samples is very limited. Figure 2 reveals how the latent regularization works as a relaxation of the supervised problem by mixing $K_{yy}$ with $K_z$, which evidences the important role of the unsupervised latent variables in the supervised feature selection task. As expected, a value of $\delta = 0$ barely selects any useful feature, since the classification performance declines significantly in almost all cases. Finally, there is consistency between the peak of the AUC and the lowest RED score in the majority of the cases, with a mixture coefficient between $0.4$ and $0.6$.
Although kernel-PCA is used as the nonlinear and unsupervised mapping function $\phi_z(x)$ to learn a low-dimensional latent space from the training data, any other nonlinear transformation could be used, such as Autoencoders or t-SNE [23][45]. In this work, since the tumor data has a considerably small sample set compared to the number of features, $m \ll n$, Autoencoders and t-SNE were not considered, although they could be used in cases where more samples are available.
KLR-FS provides feature importance via the non-zero elements of the resulting sparse $\mu$ vector, which is a useful characteristic for interpreting the results. The model shows that it can deal with high-dimensional initial problems, in this case an initial dimension of $n > 18{,}000$ in all datasets, and with low sample sizes, where classification using the KLR-FS features outperforms the benchmark methods. Although KLR-FS is outperformed in the RED score by RFE in the three datasets and by mRMR in two datasets, it has the highest classification performance and the lowest variance when compared with all the benchmark methods.
KLR-FS outputs a custom kernel $K_{\mu}$ that is used in support vector classification and contributes to improving the classification performance. The resulting $K_{\mu}$ kernel can be used not only for classification tasks but also for the representation of tumor profiles or for visualization by kernel-PCA, since it is built from both supervised and unsupervised latent sources.
The proposed method is relevant in problems where the structure of the training data correlates partially with the tumor labels, so that learning also from the latent structure of the data improves the supervised learning task.

7 Conclusion

In a context of low sample size and a high-dimensional space, we propose a novel feature selection method that enforces the selected features to be discriminant while keeping information about the general data structure. By doing so we expect to avoid overfitting, improve generalization and, finally, improve classification performance. To reach that goal we design a new approach called Kernel Latent Regularization Feature Selection (KLR-FS). It is based on an MKL approach that targets a label kernel relaxed with another kernel built on a latent space. In the proposed application the latent space is obtained using kPCA. The method is applied on high-dimensional gene expression tumor profiles from Breast, Pancreas and Lung cancer. KLR-FS selects genes which are used to classify tumor stage or patient survival with the highest classification performance and a considerably low redundancy rate. The mixture of supervised labels and latent variables reveals an improvement in the generalization capacity when compared with other feature selection methods.
Future work should generalize to multi-modal data, such as multi-omic layers of biological information, in order not only to select features but also to improve the fusion of heterogeneous data.

Acknowledgements

We thank Diego Tomassi and Ivan Lengyel for helpful comments on an early version of this work.

Funding

This work was supported by the Universite de Technologie de Troyes in France and the Universidad Tecnologica Nacional in Argentina. It was also funded by grants from CONICET, ANPCyT and FOCEM-Mercosur.

References

  • [1] J. N. Weinstein, E. A. Collisson, G. B. Mills, K. R. M. Shaw, B. A. Ozenberger, K. Ellrott, I. Shmulevich, C. Sander, J. M. Stuart, C. G. A. R. Network, et al., The cancer genome atlas pan-cancer analysis project, Nature genetics 45 (10) (2013) 1113.
  • [2] I. C. G. Consortium, et al., International network of cancer genome projects, Nature 464 (7291) (2010) 993.
  • [3] D. G. Beer, S. L. Kardia, C.-C. Huang, T. J. Giordano, A. M. Levin, D. E. Misek, L. Lin, G. Chen, T. G. Gharib, D. G. Thomas, et al., Gene-expression profiles predict survival of patients with lung adenocarcinoma, Nature medicine 8 (8) (2002) 816–824.
  • [4] Z. He, W. Yu, Stable feature selection for biomarker discovery, Computational biology and chemistry 34 (4) (2010) 215–225.
  • [5] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, ACM Computing Surveys (CSUR) 50 (6) (2017) 1–45.
  • [6] J. Reunanen, Overfitting in making comparisons between variable selection methods, Journal of Machine Learning Research 3 (Mar) (2003) 1371–1382.
  • [7] M. Gönen, E. Alpaydın, Multiple kernel learning algorithms, Journal of machine learning research 12 (Jul) (2011) 2211–2268.
  • [8] S. Mika, B. Schölkopf, A. J. Smola, K.-R. Müller, M. Scholz, G. Rätsch, Kernel pca and de-noising in feature spaces, in: Advances in neural information processing systems, 1999, pp. 536–542.
  • [9] J. C. Ang, A. Mirzal, H. Haron, H. N. A. Hamed, Supervised, unsupervised, and semi-supervised feature selection: a review on gene selection, IEEE/ACM transactions on computational biology and bioinformatics 13 (5) (2015) 971–989.
  • [10] Z. Y. Algamal, M. H. Lee, Penalized logistic regression with the adaptive lasso for gene selection in high-dimensional cancer classification, Expert Systems with Applications 42 (23) (2015) 9326–9332.
  • [11] M. Moon, K. Nakai, Stable feature selection based on the ensemble l 1-norm support vector machine for biomarker discovery, BMC genomics 17 (13) (2016) 1026.
  • [12] Y. Zhang, A. Li, C. Peng, M. Wang, Improve glioblastoma multiforme prognosis prediction by using feature selection and multiple kernel learning, IEEE/ACM transactions on computational biology and bioinformatics 13 (5) (2016) 825–835.
  • [13] K.-B. Duan, J. C. Rajapakse, H. Wang, F. Azuaje, Multiple svm-rfe for gene selection in cancer classification with expression data, IEEE transactions on nanobioscience 4 (3) (2005) 228–234.
  • [14] M. Yamada, W. Jitkrittum, L. Sigal, E. P. Xing, M. Sugiyama, High-dimensional feature selection by feature-wise kernelized lasso, Neural computation 26 (1) (2014) 185–207.
  • [15] A. Rakotomamonjy, F. R. Bach, S. Canu, Y. Grandvalet, Simplemkl, Journal of Machine Learning Research 9 (Nov) (2008) 2491–2521.
  • [16] J.-B. Pothin, C. Richard, A greedy algorithm for optimizing the kernel alignment and the performance of kernel machines, in: 2006 14th European Signal Processing Conference, IEEE, 2006, pp. 1–4.
  • [17] F. Aiolli, M. Donini, Easymkl: a scalable multiple kernel learning algorithm, Neurocomputing 169 (2015) 215–224.
  • [18] Y. Zhang, A. Li, J. He, M. Wang, A novel mkl method for gbm prognosis prediction by integrating histopathological image and multi-omics data, IEEE journal of biomedical and health informatics.
  • [19] B. Zhu, N. Song, R. Shen, A. Arora, M. J. Machiela, L. Song, M. T. Landi, D. Ghosh, N. Chatterjee, V. Baladandayuthapani, et al., Integrating clinical and multiple omics data for prognostic assessment across human cancers, Scientific reports 7 (1) (2017) 16954.
  • [20] N. Cristianini, J. Shawe-Taylor, A. Elisseeff, J. S. Kandola, On kernel-target alignment, in: Advances in neural information processing systems, 2002, pp. 367–373.
  • [21] A. Zhang, A. Li, J. He, M. Wang, Lscdfs-mkl: A multiple kernel based method for lung squamous cell carcinomas disease-free survival prediction with pathological and genomic data, Journal of biomedical informatics 94 (2019) 103194.
  • [22] W. Du, Z. Cao, T. Song, Y. Li, Y. Liang, A feature selection method based on multiple kernel learning with expression profiles of different types, BioData mining 10 (1) (2017) 4.
  • [23] M. Palazzo, P. Beauseroy, P. Yankilevich, A pan-cancer somatic mutation embedding using autoencoders, BMC bioinformatics 20 (1) (2019) 655.
  • [24] Z. Liu, D. Chen, H. Bensmail, Gene expression data classification with kernel principal component analysis, BioMed Research International 2005 (2) (2005) 155–159.
  • [25] Y. Shan, G. Deng, Kernel pca regression for missing data estimation in dna microarray analysis, in: 2009 IEEE International Symposium on Circuits and Systems, IEEE, 2009, pp. 1477–1480.
  • [26] F. Reverter, E. Vegas, J. M. Oller, Kernel-pca data integration with enhanced interpretability, BMC systems biology 8 (S2) (2014) S6.
  • [27] F. Reverter, E. Vegas, J. M. Oller, Kernel methods for dimensionality reduction applied to the “omics” data, Principal Component Analysis-Multidisciplinary Applications (2012) 1–20.
  • [28] J. Mariette, N. Villa-Vialaneix, Unsupervised multiple kernel learning for heterogeneous data integration, Bioinformatics 34 (6) (2018) 1009–1015.
  • [29] J.-P. Vert, K. Tsuda, B. Schölkopf, A primer on kernel methods, Kernel methods in computational biology 47 (2004) 35–70.
  • [30] J. Shawe-Taylor, N. Cristianini, et al., Kernel methods for pattern analysis, Cambridge university press, 2004.
  • [31] J. Kandola, J. Shawe-Taylor, N. Cristianini, On the extensions of kernel alignment.
  • [32] C. Cortes, M. Mohri, A. Rostamizadeh, Algorithms for learning kernels based on centered alignment, Journal of Machine Learning Research 13 (Mar) (2012) 795–828.
  • [33] C. M. Bishop, Pattern recognition and machine learning, springer, 2006.
  • [34] B. Schölkopf, The kernel trick for distances, in: Advances in neural information processing systems, 2001, pp. 301–307.
  • [35] B. Schölkopf, A. J. Smola, F. Bach, et al., Learning with kernels: support vector machines, regularization, optimization, and beyond, MIT press, 2002.
  • [36] F. Melo, Area under the roc curve, Encyclopedia of Systems Biology (2013) 38–39.
  • [37] Z. Zhao, L. Wang, H. Liu, Efficient spectral feature selection with minimum redundancy, in: Twenty-fourth AAAI conference on artificial intelligence, 2010.
  • [38] H. Peng, F. Long, C. Ding, Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis & Machine Intelligence (8) (2005) 1226–1238.
  • [39] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. de Schaetzen, R. Duque, H. Bersini, A. Nowe, A survey on filter techniques for feature selection in gene expression microarray analysis, IEEE/ACM Transactions on Computational Biology and Bioinformatics 9 (4) (2012) 1106–1119.
  • [40] A. Adorada, R. Permatasari, P. W. Wirawan, A. Wibowo, A. Sujiwo, Support vector machine-recursive feature elimination (svm-rfe) for selection of microrna expression features of breast cancer, in: 2018 2nd International Conference on Informatics and Computational Sciences (ICICoS), IEEE, 2018, pp. 1–4.
  • [41] I. Guyon, J. Weston, S. Barnhill, V. Vapnik, Gene selection for cancer classification using support vector machines, Machine learning 46 (1-3) (2002) 389–422.
  • [42] A. Spira, J. E. Beane, V. Shah, K. Steiling, G. Liu, F. Schembri, S. Gilman, Y.-M. Dumas, P. Calner, P. Sebastiani, et al., Airway epithelial gene expression in the diagnostic evaluation of smokers with suspect lung cancer, Nature medicine 13 (3) (2007) 361–366.
  • [43] R. Edgar, M. Domrachev, A. E. Lash, Gene expression omnibus: Ncbi gene expression and hybridization array data repository, Nucleic acids research 30 (1) (2002) 207–210.
  • [44] L. Chen, H. M. Linden, B. O. Anderson, C. I. Li, Trends in 5-year survival rates among breast cancer patients by hormone receptor status and stage, Breast cancer research and treatment 147 (3) (2014) 609–616.
  • [45] D. Kobak, P. Berens, The art of using t-sne for single-cell transcriptomics, Nature communications 10 (1) (2019) 1–14.