
Feature selection simultaneously preserving both class and cluster structures

Suchismita Das and Nikhil R. Pal Electronics and Communication Sciences Unit, Indian Statistical Institute, 203 B T Road, Kolkata-700108 [email protected],[email protected]
Abstract

When a data set has significant differences in its class and cluster structure, selecting features aiming only at the discrimination of classes would lead to poor clustering performance, and similarly, feature selection aiming only at preserving cluster structures would lead to poor classification performance. To the best of our knowledge, a feature selection method that simultaneously considers class discrimination and cluster structure preservation is not available in the literature. In this paper, we have tried to bridge this gap by proposing a neural network-based feature selection method that focuses both on class discrimination and structure preservation in an integrated manner. In addition to assessing typical classification problems, we have investigated its effectiveness on band selection in hyperspectral images. Based on the results of the experiments, we may claim that the proposed feature/band selection can select a subset of features that is good for both classification and clustering.

keywords:
Feature selection, Structure preserving, Classification, Neural network, Sammon’s Stress, Band selection, Hyperspectral Image.

1 Introduction

Feature selection methods can be broadly classified, on the basis of how they use class label information, into three categories: supervised, semi-supervised, and unsupervised [1, 2]. Supervised feature selection methods exploit the label information to find the relevant features that distinguish samples of different classes [3, 4]. Semi-supervised feature selection is used when some labeled samples are present along with plenty of unlabelled samples [5, 6]; both labeled and unlabelled data are used to modify a hypothesis obtained from the labeled data [7]. Unsupervised feature selection is much more difficult as it needs to find the useful features in the absence of label information [2, 8]. Different criteria have been chosen to select a subset of the original features in different unsupervised feature selection studies. Some of them are: preserving the data distribution such as the manifold structure [9, 10], preserving the cluster structure [11, 12], and preserving data similarity [13, 14]. It is noteworthy that in the case of unsupervised feature selection, some methods try to preserve the “structure” or “geometry” of the data in some sense. In contrast, supervised feature selection methods in most cases do not set any explicit criterion to preserve the structure of the data. They focus only on separating the classes as much as possible using different measures that exploit class information, such as the Fisher score [15, 16], Laplacian score [17], mutual information [18, 19], normalized mutual information [18, 20, 21, 22], ReliefF [23], class correlation [24], and classifier score [25]. We should note here that feature selection criteria are not always driven by a single objective. Feature selection methods often follow a criterion that consists of two or more objectives. The study in [8] proposes a criterion named ‘maximum projection and minimum redundancy’, which is governed by two goals: projecting data into a feature subspace with minimum reconstruction error and minimum redundancy. The studies in [26, 27, 28, 29] claim that both the global structure and the local structure should be preserved in the projected space, as both may carry important discriminating information; hence, they propose feature selection schemes that focus on both global and local structure preservation. The investigation in [30] claims to preserve dual global structures. Going through various feature selection schemes with multiple objectives, we found that, whenever class labels are available, no work on feature selection explicitly focuses on preserving structural information along with class information, although both are important discriminative information and may have a positive impact on the generalization ability of the classifier. Suppose that, for a data set, the class and cluster structures are substantially different. Exploiting only the class labels, it may not be possible to keep the cluster structures in the projected space. For a practical system, even when the primary task is classification, we may need to cluster the samples in the space defined by the selected features. For example, fuzzy rule based classifiers are often designed by clustering the training data for each class and translating each cluster into a rule [31, 32, 33]. We could not find any feature selection method that focuses both on class and cluster separability.
To bridge this gap, in this study we propose a feature selection method that selects features preserving class and cluster-structure information simultaneously. We employ a multi-layer perceptron (MLP) based neural network to develop an embedded feature selection scheme. The training of the proposed MLP based feature selection method is governed by both class-discriminating and cluster (structure) preserving objectives. The philosophy is quite general and can easily be extended to other networks such as the radial basis function network.

2 Proposed Method

Let us denote the input data by an $n\times P$ matrix, $\mathbf{X}=\{\mathbf{x}_{i}\in\mathbb{R}^{P}\}_{i=1}^{n}$. Here, $\mathbf{x}_{i}$ is a $P$-dimensional row vector of the form $\mathbf{x}_{i}=(x_{i1},x_{i2},\cdots,x_{iP})$. Let the collection of class labels of $\mathbf{X}$ be $\mathbf{Z}=\{z_{i}\in\{1,2,\cdots,C\}\}_{i=1}^{n}$, where $z_{i}$ is the class label corresponding to $\mathbf{x}_{i}$. We aim to select a subset of size $Q$ from the original set of features such that the selected subset performs reasonably well in the classification task as well as in clustering. In other words, if we design a classifier using the selected features, the performance of the classifier should be comparable to that of a classifier designed using all features. Similarly, if we cluster the data in the reduced dimension as well as in the original dimension, we expect to get similar partition matrices. Here, we propose a neural network-based framework to select features. Neural networks have been explored for feature selection [34, 35, 36] as well as for classification [37, 38, 39, 40]. However, in our proposed model the neural network simultaneously selects features and learns a classifier, as we follow an embedded method for feature selection. Moreover, our proposed network preserves structural information and class label information simultaneously, whereas the feature selection networks in [34, 36], which solve classification problems, consider class label information in their loss functions but not any structural information. Note that the work in [35] considers a system identification problem. To build the neural network-based embedded feature selector, we employ the multi-layer perceptron (MLP) based framework used in [41, 42, 43]. The basic framework is shown in Fig. 1.

Figure 1: Proposed network for feature selection

As seen in Figure 1, preceding the input layer of the MLP there is a layer consisting of $P$ nodes. Before entering the input layer of the MLP, the $j$th feature passes through the node $f_{j}()$. These nodes act as attenuating gates that effectively allow or block features from contributing to the output of the neural network. For the $i$th instance, its $j$th feature $x_{ij}$, on passing through the gate node $f_{j}()$, becomes $a_{j}x_{ij}$; i.e., $f_{j}(x_{ij})=a_{j}x_{ij}$. In the MLP, a weighted sum of the values available at the input nodes is applied to the hidden nodes of the first hidden layer. A zero value at an input node implies that the corresponding feature is not considered. When training of the MLP-based framework is complete, the $a_{j}$s for the selected features become close to $1$, effectively allowing them to contribute to the classifier, whereas for poor or rejected features the $a_{j}$s become close to $0$, effectively preventing them from contributing to the classifier. In [41, 42, 43], this framework was explored for classification-oriented feature selection, group feature selection, and redundancy-controlled feature selection. Here, we explore this framework for simultaneous structure-preserving and class-discriminating feature selection. Next, we elaborate on the MLP-based framework and the proposed objective functions to train the network.

We denote the $P$ nodes before the input layer of the MLP as $f_{j}()$ for $j=1,2,\dots,P$, where $f_{j}()$ is a gate or modulator function applied to the $j$th feature, $x_{j}$. We have to design $f_{j}()$ in such a way that

f_{j}(x_{j})=a_{j}x_{j}=\begin{cases}x_{j}&\text{ if }x_{j}\text{ is a useful feature.}\\ 0&\text{ otherwise. }\end{cases} \qquad (1)

In our framework, the factor $a_{j}$ is learnable. We implement $a_{j}$ as a smooth continuous function, $a_{j}=\exp(-\lambda_{j}^{2})$. Clearly, when $\lambda_{j}=0$, $\exp(-\lambda_{j}^{2})=1$, and when $\lambda_{j}\rightarrow\pm\infty$, $\exp(-\lambda_{j}^{2})\rightarrow 0$. By adding suitable regularizer terms to the objective function, we design our learning system in such a way that, over the learning process, the gate parameters $\lambda_{j}$ for useful features drop close to zero and those for derogatory or indifferent features rise to high values. So, in our learning system, the learnable parameters, i.e., the $\lambda_{j}$s and the neural network weights, are learned together: the loss function is minimized with respect to both the $\lambda_{j}$s and the neural network weights.
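To make the gating mechanism concrete, a minimal NumPy sketch of Equation (1) with $a_{j}=\exp(-\lambda_{j}^{2})$ is given below; the array values and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gate_values(lam):
    """Gate factors a_j = exp(-lambda_j^2): close to 1 keeps a feature, close to 0 rejects it."""
    return np.exp(-lam ** 2)

def apply_gates(X, lam):
    """Pass each feature through its gate: f_j(x_ij) = a_j * x_ij (broadcast over the n x P matrix)."""
    return X * gate_values(lam)

# Toy usage: lambda_j near 0 keeps the feature, a large lambda_j suppresses it.
X = np.array([[1.0, 2.0, 3.0]])
lam = np.array([0.05, 3.0, 0.1])   # hypothetical gate parameters
print(apply_gates(X, lam))         # the second feature is attenuated almost to 0
```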

Now, we have to define a suitable loss function for selecting features along with learning the embedded classifier. Our aim is to select features that are reasonably good for classification as well as clustering. To satisfy this requirement, we take the loss function as a combination of two losses, $E_{class}$ and $E_{struct}$. $E_{class}$ is considered for preserving class information and $E_{struct}$ is considered for preserving structural information. For the moment, let us consider the network for selecting features for efficient classification only. A suitable loss function to impose class discrimination is the cross-entropy loss [44]. We define $E_{class}$ as the cross-entropy loss involving actual and predicted class labels.

E_{class}=-\dfrac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{C}t^{i}_{k}\log\left(p_{k}(\mathbf{x}_{i})\right) \qquad (2)

Here, $t^{i}_{k}$ is the $k$th element of the one-hot encoded label of the sample $\mathbf{x}_{i}$; in other words, $t^{i}_{k}$ is the $k$th element of the vector $\mathbf{t}^{i}\in\left\{0,1\right\}^{C}$ such that

t_{k}^{i}=\begin{cases}1&\text{ if }k=z_{i}\\ 0&\text{ otherwise}\end{cases} \qquad (3)

In (2), $p_{k}(\mathbf{x}_{i})$ is the predicted probability (by the MLP classifier) of $\mathbf{x}_{i}$ being in the $k$th class. As already discussed, for effective feature selection, the magnitude of $\lambda_{j}$ for the selected features should drop to almost zero and that for rejected features should rise to high values. To ensure this, we add the following regularizer.

E_{select}=\dfrac{1}{P}\sum_{j=1}^{P}a_{j}(1-a_{j})=\dfrac{1}{P}\sum_{j=1}^{P}\exp(-\lambda_{j}^{2})\left(1-\exp(-\lambda_{j}^{2})\right) \qquad (4)

In a feature selection framework, a constraint for selecting a fixed number of features is necessary. The following regularizer tries to keep the number of selected features close to or equal to $Q$.

E_{Q}=\dfrac{1}{Q^{2}}\left\{\left(\sum_{j=1}^{P}a_{j}\right)-Q\right\}^{2}=\dfrac{1}{Q^{2}}\left\{\left(\sum_{j=1}^{P}\exp(-\lambda_{j}^{2})\right)-Q\right\}^{2} \qquad (5)

So, the overall loss function for the selection of features with our framework for classification purposes is the following.

E=E_{class}+\alpha_{1}E_{select}+\alpha_{2}E_{Q} \qquad (6)

Here, $\alpha_{1}\geq 0$ and $\alpha_{2}\geq 0$ are scalar multipliers for adjusting the importance of $E_{select}$ and $E_{Q}$ in the overall error function $E$.
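As an illustration, the following NumPy sketch evaluates the classification-only loss of Equation (6) from its three components; it assumes the one-hot targets and the predicted class probabilities are already available, and the function names are ours.

```python
import numpy as np

def e_class(T, P_pred, eps=1e-12):
    """Cross-entropy of Eq. (2); T is the one-hot label matrix (n x C), P_pred the predicted probabilities."""
    return -np.mean(np.sum(T * np.log(P_pred + eps), axis=1))

def e_select(lam):
    """Regularizer of Eq. (4): pushes every gate value a_j = exp(-lambda_j^2) toward 0 or 1."""
    a = np.exp(-lam ** 2)
    return np.mean(a * (1.0 - a))

def e_q(lam, Q):
    """Regularizer of Eq. (5): keeps the number of (nearly) open gates close to Q."""
    a = np.exp(-lam ** 2)
    return (np.sum(a) - Q) ** 2 / Q ** 2

def loss_classification_only(T, P_pred, lam, Q, alpha1=1.0, alpha2=1.0):
    """Overall loss of Eq. (6)."""
    return e_class(T, P_pred) + alpha1 * e_select(lam) + alpha2 * e_q(lam, Q)
```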

Now let us return to our original agenda of selecting features that perform satisfactorily both for classification and clustering. To preserve the structural information of the data in the lower dimensional space formed by the selected $Q$ features, we consider Sammon's stress [45] as a loss function. Sammon's stress is the loss function of a non-linear mapping, Sammon's mapping, that is able to capture complex non-linear structures in the data and, as a result, also preserves cluster structure. The lower the value of Sammon's stress, the better the lower dimensional representation captures the original inter-point distances, i.e., the structure of the original data. We define Sammon's stress involving the original input space and the selected feature space as follows.

E_{sammons}=\dfrac{1}{(\sum_{i,l=1}^{n}d_{il})}\sum_{i=1}^{n-1}\sum_{l=i+1}^{n}\dfrac{\left(d_{il}^{\mathbf{X}}-d_{il}^{\mathbf{\hat{X}}}\right)^{2}}{d_{il}^{\mathbf{X}}} \qquad (7)

Here, $d_{il}^{\mathbf{X}}$ is the distance between $\mathbf{x}_{i}$ and $\mathbf{x}_{l}$, and $\mathbf{\hat{X}}=\{\mathbf{\hat{x}}_{i}=(a_{1}x_{i1},a_{2}x_{i2},\cdots,a_{P}x_{iP})^{T}\in\mathbb{R}^{P}\}_{i=1}^{n}$, so that $d_{il}^{\mathbf{\hat{X}}}$ is the distance between $\mathbf{\hat{x}}_{i}$ and $\mathbf{\hat{x}}_{l}$. As discussed earlier, at the end of the training of our embedded system, the $a_{j}$s will be close to $0$ or $1$ depending on whether the corresponding features are rejected or selected. Therefore, for a trained system, $d_{il}^{\mathbf{\hat{X}}}$ signifies the distance between the $i$th and $l$th instances in the latent space formed by the implicitly selected $Q$ features. Considering $E_{sammons}$ in Equation (7) as a regularizer, the resultant overall loss function is given by
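A small NumPy/SciPy sketch of Equation (7) is given below, assuming the gated data $\mathbf{\hat{X}}$ is obtained by multiplying each column of $\mathbf{X}$ by its gate value; the normalizer here sums each pair once, which differs from the ordered-pair sum above only by a constant factor that $\beta$ can absorb.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, lam, eps=1e-12):
    """E_sammons of Eq. (7): discrepancy between pairwise distances in the original
    space and in the gated (selected-feature) space, normalized by the original distances."""
    a = np.exp(-lam ** 2)
    d_orig = pdist(X)          # d_il^X over all unordered pairs (i < l)
    d_gated = pdist(X * a)     # d_il^X_hat in the gated space
    return np.sum((d_orig - d_gated) ** 2 / (d_orig + eps)) / (np.sum(d_orig) + eps)
```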

E_{tot}=E_{class}+\beta E_{sammons}+\alpha_{1}E_{select}+\alpha_{2}E_{Q} \qquad (8)

Here, $\beta\geq 0$ is a scalar multiplier that controls the trade-off between the class information and the structural information in the feature selection process. Note that the computational complexity of the loss function in Equation (7) is $O(n^{2})$. For large $n$, computing Equation (7), and hence Equation (8), is computationally intensive. As the weight update at each iteration involves computing Equation (7), the overall computation cost would be high. For small and moderate $n$, we use Equation (8) as the loss function to be minimized. However, for large $n$, to avoid the high computational cost we modify Equation (7) as follows.

E_{struct}=\dfrac{1}{(\sum_{\mathbf{x}_{i},\mathbf{x}_{l}\in S_{t}}d_{il})}\sum_{\mathbf{x}_{i}\in S_{t}}\sum_{\mathbf{x}_{l}\in S_{t};\,\mathbf{x}_{l}\neq\mathbf{x}_{i}}\dfrac{\left(d_{il}^{\mathbf{X}}-d_{il}^{\mathbf{\hat{X}}}\right)^{2}}{d_{il}^{\mathbf{X}}} \qquad (9)

Here, $S_{t}$ is a randomly selected subset of $\mathbf{X}$ at the $t$th iteration. Different $S_{t}$s are chosen at different iterations and hence different sets of inter-point distances are preserved. Since the considered MLP is trained over a large number of iterations, the use of Equation (9) is expected to have almost the same effect as Equation (7). We have to choose $|S_{t}|$ such that Equation (9) is computationally manageable, while at the same time it should be large enough to make $E_{struct}$ an effective substitute for $E_{sammons}$. Adding Equation (9) to Equation (6), we propose the following loss function for our system.
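The sampled regularizer of Equation (9) can be sketched in the same style; the subset size and random-number handling below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def e_struct(X, lam, subset_size=100, rng=None, eps=1e-12):
    """Approximation of Eq. (9): Sammon's stress restricted to a random subset S_t,
    redrawn at every training iteration to keep the cost manageable."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    Xs = X[idx]
    a = np.exp(-lam ** 2)
    d_orig = pdist(Xs)
    d_gated = pdist(Xs * a)
    return np.sum((d_orig - d_gated) ** 2 / (d_orig + eps)) / (np.sum(d_orig) + eps)
```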

E_{tot}=E_{class}+\beta E_{struct}+\alpha_{1}E_{select}+\alpha_{2}E_{Q} \qquad (10)

$E_{tot}$ is minimized with respect to the gate parameters $\lambda_{j}$ and the weights of the network to find their optimal values.
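Putting the previous sketches together, the total loss of Equation (10) can be assembled as below (reusing the helper functions `e_class`, `e_select`, `e_q`, and `e_struct` from the earlier sketches); in practice the gradients with respect to both the network weights (through the predicted probabilities) and the $\lambda_{j}$s would be obtained by automatic differentiation.

```python
def loss_total(T, P_pred, X, lam, Q, beta, alpha1=1.0, alpha2=1.0):
    """Full loss of Eq. (10), minimized jointly over the MLP weights and the gate parameters lam."""
    return (e_class(T, P_pred) + beta * e_struct(X, lam)
            + alpha1 * e_select(lam) + alpha2 * e_q(lam, Q))
```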

3 Experimentation and Results

The feature selection framework proposed in this paper is generic, but it can be adapted to solve specialized problems. We have studied the proposed framework on general datasets as well as on a special problem: band selection in hyperspectral images. The results of feature selection for conventional classification problems are presented in Subsec. 3.1, and the results of band selection for HSIs are presented in Subsec. 3.2.

3.1 Feature selection for conventional classification problems

We have used five publicly available datasets that are very commonly used for classification and clustering. The first four datasets are downloaded from the UCI machine learning repository [46]. AR10P is downloaded from the open-source feature selection repository of Arizona State University [47]. We have also performed experiments with three benchmark HSI datasets for land cover classification problems; we discuss them in a separate subsection (Subsec. 3.2).

The details of the number of features, number of classes, and number of instances for the five datasets are summarized in Table 1.

Table 1: Summary of datasets.
Name Number of features Number of classes Number of instances
P C n
E. coli 7 8 336
Glass 9 6 214
Ionosphere 34 2 351
Sonar 60 2 208
AR10P 2400 10 130

The datasets are used directly without any further processing. Each dataset is partitioned into training and test sets containing approximately $90\%$ and $10\%$ of the total number of instances, respectively. To implement our proposed feature selection scheme we use the neural network shown in Fig. 1 with the number of hidden layers $n_{H}=1$. The input and output layers have $P$ and $C$ nodes respectively, where $P$ is the number of features and $C$ is the number of classes of the considered dataset. The number of hidden nodes in the hidden layer is $8$ ($20$ for the AR10P data set). To get stable feature selection results, the network weights are initialized as follows. First, we consider the usual MLP part of our network (i.e., without feature selection), depicted by the portion within the dotted rectangle in Fig. 1, and initialize its weights randomly. Next, we train this usual MLP with the cross-entropy loss defined in Equation (2) on the training set until convergence. The weights of the converged network are used as the initial weights of the proposed network. The gate parameters $\lambda_{j}$ are initialized with values drawn randomly from a normal distribution with mean $2$ and spread $\frac{1}{\sqrt{P}}$. The initial values of the $\lambda_{j}$s are chosen around $2$ to make the gates effectively almost closed initially. As the learning progresses, the $\lambda_{j}$s are updated so as to admit the useful features into the network. For the proposed system, to select a subset of $Q$ features, the gate parameters $\lambda_{j}$ are sorted in ascending order and the $Q$ features corresponding to the $Q$ smallest $\lambda_{j}$s are selected. The network weights as well as the gate parameters $\lambda_{j}$ are learned using the adaptive gradient algorithm, the ‘train.AdagradOptimizer’ routine of the ‘TensorFlow’ framework [48]. For all experiments with the data sets in Table 1, both $\alpha_{1}$ and $\alpha_{2}$ of the error functions in Equations (6) and (8) are set to $1$. The total number of iterations for training the network is set to $20000$. The five datasets considered here have fewer than $400$ instances ($n<400$), which is not so large. Therefore, we use (8) as the overall loss function to train the MLP based architecture for selecting features that are reasonably good for clustering and classification. When $\beta=0$ in (8), the error function that governs the learning of our MLP based embedded feature selection scheme effectively reduces to (6); the corresponding scheme then considers only classification. Let us name this method feature selection with MLP (FSMLP). When $\beta\neq 0$ in (8), our method takes structure preservation into account along with classification; let us name the corresponding method FSMLPstruct. To understand the importance of adding the structure preserving regularizer (7), we perform feature selection with FSMLP and compare it with FSMLPstruct for different $\beta$ values. We explore three values of $\beta$: $0.1$, $1$, and $10$. Although the optimal value of $\beta$ for a particular dataset and a particular number of selected features $Q$ cannot be determined from these three values, investigating three widely different $\beta$s reveals the role of the weight given to the structure preserving regularizer on the performance of the selected features.
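The gate initialization and the final selection of the top $Q$ features can be sketched as below; this is our own minimal rendering of the procedure just described (sorting by the magnitude of $\lambda_{j}$, since the gate depends on $\lambda_{j}^{2}$), not the authors' code.

```python
import numpy as np

def init_gate_params(P, rng=None):
    """lambda_j ~ N(2, 1/sqrt(P)): all gates start almost closed, since exp(-2^2) is tiny."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=2.0, scale=1.0 / np.sqrt(P), size=P)

def select_top_q(lam, Q):
    """After training, keep the Q features with the smallest |lambda_j|
    (equivalently, the largest gate values a_j = exp(-lambda_j^2))."""
    return np.sort(np.argsort(np.abs(lam))[:Q])
```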
We compare with three other methods, namely an Independent Component Analysis (ICA) based feature selection method [49], an F-score based filter method [16], and a mutual information based filter method [18, 19]. The performance of both FSMLP and FSMLPstruct depends on the initial weights of the network. So, we repeat the initialization of the network weights and gate parameters $\lambda_{j}$ five times and run FSMLP or FSMLPstruct with each of the five initializations. As the performance measure of FSMLP and FSMLPstruct, we report the average performance over the five subsets obtained from the five runs. To check the effectiveness of the methods in selecting features that perform well in classification and clustering simultaneously, we compute the classification scores of the support vector machine (SVM) classifier as well as several structure-preserving indices: Sammon's stress (SS) [45], normalized mutual information (NMI) [18, 20, 21, 22], adjusted Rand index (ARI) [50], and Jaccard index (JI) [51]. As the measure of classification performance, we use the overall classification accuracy (OCA) of the SVM classifier. The optimal hyper-parameters of the SVM are determined through five-fold cross-validation using grid search. Note that here the test set is unseen not only by the SVM classifier but also by the feature selection methods. SS, defined in Equation (7), uses the original inter-point distances $d_{il}^{\mathbf{X}}$ and the latent space inter-point distances $d_{il}^{\mathbf{\hat{X}}}$; here, to compute $d_{il}^{\mathbf{\hat{X}}}$, we use the lower dimensional data formed by the selected $Q$ features. We use NMI, ARI, and JI as structure-preserving performance metrics by supplying the cluster labels obtained from clustering the data in the original space (using all features) as the true labels and the cluster labels obtained from clustering the data in the reduced space formed by the selected $Q$ features as the predicted cluster labels. So, NMI, ARI, and JI measure how well the cluster assignments in the original space and in the selected space agree, effectively quantifying the preservation of the original cluster structure in the selected space. The maximum value of NMI, ARI, or JI is $1$; a value close to $1$ for any of these three measures indicates that the cluster structure of the original space is preserved in the selected space. As the clustering algorithm, we use the fuzzy $C$-means (FCM) algorithm [52] with fuzzy exponent $m=2$. We set the number of clusters for the FCM algorithm to the number of classes. We use two values for the number of selected features $Q$: $Q=\lceil 0.35\times P\rceil$ and $Q=\lceil 0.5\times P\rceil$, i.e., the values are rounded up to the nearest integers.
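The structure-preservation scores can be computed roughly as in the sketch below; note that it uses scikit-learn's k-means as a simple stand-in for the FCM clustering used in the paper and reports only NMI and ARI, so it is an illustration of the evaluation protocol rather than an exact reproduction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def structure_scores(X, selected, n_clusters, seed=0):
    """Cluster the data in the full feature space and in the selected-feature space,
    then compare the two partitions; values near 1 mean the cluster structure is preserved."""
    labels_full = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    labels_sel = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X[:, selected])
    return (normalized_mutual_info_score(labels_full, labels_sel),
            adjusted_rand_score(labels_full, labels_sel))
```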

Tables 2 and 3 summarize the performances of the proposed method and the other comparing methods for the training and test sets, respectively, of the E. coli dataset. We tabulate the four previously mentioned structure preserving measures and one classifier score for two choices of the number of selected features (approximately $35\%$ and $50\%$ of the original dimension), i.e., $Q=3$ and $Q=4$, in Tables 2 and 3.

Table 2: Performance comparison for E. coli, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4
ICA 0.2323 0.2146 0.4250 0.4220 0.2643 0.2617 0.4250 0.4220 0.6788 0.6656
F Score 0.1547 0.0648 0.5417 0.6124 0.3520 0.5073 0.5417 0.6124 0.6325 0.7815
Mutual Info 0.0726 0.0204 0.6061 0.7511 0.4993 0.6884 0.6061 0.7511 0.6823 0.6719
FSMLP 0.1391 0.1098 0.4356 0.4751 0.2705 0.3167 0.4356 0.4751 0.8391 0.8828
FSMLPstruct, $\beta=0.1$ 0.1391 0.1098 0.4356 0.4751 0.2705 0.3167 0.4356 0.4751 0.8391 0.8828
FSMLPstruct, $\beta=1$ 0.0902 0.0391 0.5338 0.6138 0.3868 0.4873 0.5338 0.6138 0.9000 0.9007
FSMLPstruct, $\beta=10$ 0.0766 0.0280 0.5812 0.7137 0.4497 0.6362 0.5812 0.7137 0.8675 0.8768
Table 3: Performance comparison for E. coli, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4
ICA 0.2562 0.2332 0.6191 0.6068 0.2858 0.2775 0.6191 0.6068 0.5588 0.6471
F Score 0.1389 0.0682 0.7124 0.7150 0.4579 0.4552 0.7124 0.7150 0.5882 0.6471
Mutual Info 0.0763 0.0426 0.7047 0.8047 0.4541 0.6114 0.7047 0.8047 0.4545 0.5909
FSMLP 0.1217 0.0976 0.7341 0.7416 0.5284 0.5449 0.7341 0.7416 0.6824 0.7000
FSMLPstruct, $\beta=0.1$ 0.1217 0.0976 0.7341 0.7416 0.5284 0.5449 0.7341 0.7416 0.6824 0.7000
FSMLPstruct, $\beta=1$ 0.0949 0.0382 0.6910 0.8179 0.4215 0.6669 0.6910 0.8179 0.7235 0.8118
FSMLPstruct, $\beta=10$ 0.0932 0.0346 0.6921 0.8571 0.4171 0.7029 0.6921 0.8571 0.7412 0.8588

As we have already discussed in Sec. 2, the lower the value of SS, the better the projected space (formed by the selected features) preserves the original pairwise distances and hence the structure of the original data. We observe in Table 2 that the mutual information based method shows the lowest value of SS, and the second lowest is obtained by FSMLPstruct with $\beta=10$, for both $Q=3$ and $Q=4$. Actually, the SS values of the mutual information based method and FSMLPstruct with $\beta=10$ are almost the same, equal up to two decimal places for both choices of $Q$. The SS values achieved by ICA, the F-score based method, and FSMLP are comparatively higher. So, the mutual information based method and FSMLPstruct with $\beta=10$ preserve the original pairwise distances best in the projected space and are also expected to preserve the structures best. The values of the other three structure preserving measures, i.e., NMI, ARI, and JI, confirm this. We know that the higher the values of NMI, ARI, and JI, the closer the clustering structures of the projected space are to the original clustering structures. The highest values of NMI, ARI, and JI are obtained by the mutual information based method, followed by FSMLPstruct with $\beta=10$. So, in cluster structure preservation, the mutual information based method and FSMLPstruct with $\beta=10$ perform better than the other three methods and even than the two models trained by FSMLPstruct with $\beta=0.1$ and $1$. $\beta$ is the weight of the regularizer $E_{sammons}$ of (7). Although SS and $E_{sammons}$ are not exactly the same, under the influence of $E_{select}$ it is expected that the higher the value of $\beta$, the lower the value of SS. Table 2 reconfirms this: the SS values decrease as $\beta$ increases from $0.1$ to $10$. The SS values of FSMLP (which is basically FSMLPstruct with $\beta=0$) and FSMLPstruct with $\beta=0.1$ are the same for both choices of $Q$; in fact, FSMLP and FSMLPstruct with $\beta=0.1$ agree in all ten measures. This indicates that, for the E. coli dataset, $\beta=0.1$ does not give any effective weight to the structure preservation term and selects the same subsets as FSMLP. For the classification performance measure OCA, FSMLPstruct with $\beta=1$ achieves the highest value, followed by FSMLPstruct with $\beta=10$. The mutual information based method and FSMLPstruct with $\beta=10$ have all the structure preserving measures either almost equal or comparable; however, for OCA, FSMLPstruct with $\beta=10$ is better than the mutual information based method by a margin of more than $18\%$. For the E. coli data, the test set follows the trends observed in the training set with the following exceptions. First, for $Q=3$, the values of NMI, ARI, and JI do not increase as $\beta$ increases from $0.1$ to $10$. Second, for $Q=4$, FSMLPstruct with $\beta=10$ beats all the methods, including the mutual information based method. Analyzing the performances over the training and test sets, for the E. coli data FSMLPstruct with $\beta=10$ is the winner among the seven models considered.

Tables 4 and 5 compare the performance of the proposed method with other methods in terms of different criteria for the Glass dataset on its training and test sets, respectively.

Table 4: Performance comparison for Glass, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5
ICA 0.2699 0.2216 0.5066 0.5416 0.3343 0.3688 0.5066 0.5416 0.6771 0.6823
F Score 0.1284 0.1011 0.5961 0.6343 0.5137 0.5750 0.5961 0.6343 0.6979 0.7396
Mutual Info 0.2091 0.1159 0.5073 0.6113 0.3321 0.5688 0.5073 0.6113 0.6823 0.6719
FSMLP 0.1968 0.1593 0.5512 0.6028 0.4473 0.5235 0.5512 0.6028 0.6302 0.7094
FSMLPstruct, $\beta=0.1$ 0.1968 0.0747 0.5512 0.7501 0.4473 0.6873 0.5512 0.7501 0.6302 0.7510
FSMLPstruct, $\beta=1$ 0.0494 0.0623 0.7732 0.7965 0.7124 0.7665 0.7732 0.7965 0.7740 0.7708
FSMLPstruct, $\beta=10$ 0.0309 0.0406 0.8551 0.7394 0.8507 0.6792 0.8551 0.7394 0.8125 0.7667
Table 5: Performance comparison for Glass, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5
ICA 0.3039 0.2333 0.6536 0.6063 0.3409 0.2979 0.6536 0.6063 0.4545 0.4545
F Score 0.1393 0.1258 0.7028 0.7744 0.5746 0.6861 0.7028 0.7744 0.5455 0.6364
Mutual Info 0.2034 0.1372 0.7805 0.7728 0.6358 0.6655 0.7805 0.7728 0.4545 0.5909
FSMLP 0.2027 0.1669 0.7366 0.7425 0.5053 0.5380 0.7366 0.7425 0.5727 0.6727
FSMLPstruct, $\beta=0.1$ 0.2027 0.0806 0.7366 0.7885 0.5053 0.6190 0.7366 0.7885 0.5727 0.6636
FSMLPstruct, $\beta=1$ 0.0406 0.0636 0.8265 0.8300 0.6951 0.7355 0.8265 0.8300 0.6818 0.6727
FSMLPstruct, $\beta=10$ 0.0215 0.0372 0.9022 0.7950 0.8670 0.6210 0.9022 0.7950 0.6727 0.6818

The chosen numbers of features for the Glass data are $4$ and $5$. The expected behaviour of decreasing SS with increasing $\beta$ is clearly observed for $Q=5$ for both the training and test sets. For $Q=4$, the Glass data also follows the characteristic of the E. coli data of having the same values for FSMLP and FSMLPstruct with $\beta=0.1$ in all ten measures for both the training and test sets. For $Q=4$, from $\beta=0.1$ onwards, increasing $\beta$ produces decreasing SS values and increasing NMI, ARI, and JI values for both the training and test sets. We observe from Tables 4 and 5 that, for $Q=5$, as $\beta$ increases from $0$ (FSMLP) to $0.1$, and then to $1$, the NMI, ARI, and JI values increase for both the training and test sets; however, at $\beta=10$, the NMI, ARI, and JI values decrease compared to $\beta=0.1$ and $1$. We can conclude that, for $Q=4$, FSMLPstruct with $\beta=10$ gives the best structure preserving performance among the considered models, and for $Q=5$, FSMLPstruct with $\beta=1$ is the best in structure preservation. In terms of the classification performance measure OCA, for $Q=4$, FSMLPstruct with $\beta=10$ and FSMLPstruct with $\beta=1$ show the highest OCA values for the training and test sets, respectively. For $Q=5$, FSMLPstruct with $\beta=10$ shows the highest OCA value for the training set and FSMLPstruct with $\beta=1$ shows the highest OCA value for the test set. Inspecting all the performance measures, we conclude that for the Glass dataset, both FSMLPstruct with $\beta=10$ and FSMLPstruct with $\beta=1$ are better at simultaneously preserving class and cluster structures than the other methods.

The performances for the Ionosphere dataset are recorded in Tables 6 and 7 for the training and test sets, respectively.

Table 6: Performance comparison for Ionosphere, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17
ICA 0.1791 0.0926 0.8261 0.8817 0.8766 0.9250 0.8261 0.8817 0.9492 0.9714
F Score 0.1753 0.0914 0.9035 0.9313 0.9497 0.9621 0.9035 0.9313 0.9397 0.9746
Mutual Info 0.1776 0.0803 0.6829 0.8645 0.7838 0.9250 0.6829 0.8645 0.9746 0.9651
FSMLP 0.1721 0.0895 0.6292 0.7330 0.7246 0.8192 0.6292 0.7330 0.9606 0.9733
FSMLPstruct, $\beta=0.1$ 0.1675 0.0878 0.6689 0.7415 0.7641 0.8243 0.6689 0.7415 0.9638 0.9733
FSMLPstruct, $\beta=1$ 0.1505 0.0776 0.7307 0.7867 0.8051 0.8582 0.7307 0.7867 0.9632 0.9568
FSMLPstruct, $\beta=10$ 0.1437 0.0766 0.7894 0.8111 0.8533 0.8723 0.7894 0.8111 0.9683 0.9644
Table 7: Performance comparison for Ionosphere, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17
ICA 0.1652 0.0964 0.7351 0.7351 0.7782 0.7782 0.7351 0.7351 0.9143 0.9429
F Score 0.1952 0.1003 0.4325 0.4778 0.3411 0.4152 0.4325 0.4778 0.9429 0.9429
Mutual Info 0.2186 0.0972 0.5287 0.6541 0.4959 0.6774 0.5287 0.6541 0.9429 0.9429
FSMLP 0.1801 0.0913 0.6518 0.8284 0.6841 0.8682 0.6518 0.8284 0.9371 0.9143
FSMLPstruct, $\beta=0.1$ 0.1722 0.0885 0.7434 0.8284 0.7714 0.8682 0.7434 0.8284 0.9371 0.9314
FSMLPstruct, $\beta=1$ 0.1437 0.0726 0.7764 0.8597 0.8265 0.8884 0.7764 0.8597 0.8971 0.9257
FSMLPstruct, $\beta=10$ 0.1310 0.0720 0.7960 0.9146 0.8454 0.9328 0.7960 0.9146 0.9029 0.9257

For the Ionosphere data set, the number of selected features $Q$ is set to $12$ and $17$. Here, in all cases, as $\beta$ increases, SS decreases and the other structure preserving indices NMI, ARI, and JI increase consistently. Unlike the E. coli and Glass data sets, here when $\beta$ increases from $0$ (FSMLP) to $0.1$, the structure preserving metrics, including SS, shift in the desired direction in most cases and remain the same in some cases. Except for SS, in the other three structure preserving measures the ICA and F-score based methods perform better than FSMLP and FSMLPstruct in all cases. Classification performance is good for almost all methods on the Ionosphere data set. In the training set, for both $Q=12$ and $Q=17$, an accuracy of $97.46\%$ is reached by the mutual information and F-score based methods; however, the FSMLP and FSMLPstruct models reach more than $96\%$ accuracy in every case. For the test set, all the structure preserving indices are better for FSMLP and FSMLPstruct than for the ICA, F-score, and mutual information based methods, although in terms of the classification score OCA, the F-score and mutual information based methods perform marginally better than the FSMLP and FSMLPstruct models. This may have happened because the features selected by the neural network based classifier, which are expected to be discriminatory, may not be the best for the SVM. Moreover, FSMLPstruct makes a compromise between preserving cluster structure and the classifier loss. For the Ionosphere dataset, our proposed models are not the winner; perhaps with a higher $\beta$, FSMLPstruct would deliver better scores.

For the Sonar data, the summary of the performances on the training and test data sets in terms of the five measures for two choices of the number of selected features is available in Tables 8 and 9.

Table 8: Performance comparison for Sonar, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30
ICA 0.7830 0.4963 0.0004 0.0276 -0.0040 0.0382 0.0004 0.0276 0.7112 0.8075
F Score 0.3614 0.2029 0.1167 0.1778 0.1159 0.1926 0.1167 0.1778 0.8289 0.9733
Mutual Info 0.2247 0.0950 0.4601 0.6601 0.5662 0.7585 0.4601 0.6601 0.9733 0.9465
FSMLP 0.1880 0.0910 0.3735 0.5293 0.4432 0.6104 0.3735 0.5293 0.9273 0.9765
FSMLPstruct, $\beta=0.1$ 0.1314 0.0779 0.4917 0.5587 0.5698 0.6358 0.4917 0.5587 0.9775 0.9722
FSMLPstruct, $\beta=1$ 0.0620 0.0285 0.5743 0.6838 0.6620 0.7696 0.5743 0.6838 0.9872 0.9840
FSMLPstruct, $\beta=10$ 0.0481 0.0201 0.6983 0.7473 0.7751 0.8318 0.6983 0.7473 0.9936 0.9968
Table 9: Performance comparison for Sonar, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30
ICA 0.7556 0.4968 0.2513 0.1993 0.1546 0.2383 0.2513 0.1993 0.7143 0.5714
F Score 0.3629 0.1941 0.2906 0.3029 0.3519 0.3536 0.2906 0.3029 0.5714 0.8571
Mutual Info 0.2144 0.0955 0.2906 0.5411 0.3519 0.6378 0.2906 0.5411 0.8095 0.8095
FSMLP 0.1967 0.0977 0.3918 0.5288 0.4633 0.5850 0.3918 0.5288 0.8190 0.8857
FSMLPstruct, $\beta=0.1$ 0.1436 0.0840 0.4026 0.5540 0.4368 0.6155 0.4026 0.5540 0.7714 0.8762
FSMLPstruct, $\beta=1$ 0.0679 0.0302 0.6905 0.6824 0.7223 0.7146 0.6905 0.6824 0.7619 0.8381
FSMLPstruct, $\beta=10$ 0.0473 0.0198 0.5001 0.7156 0.5547 0.7490 0.5001 0.7156 0.8000 0.7810

We set $Q=21$ and $30$ for the Sonar data set. For the Sonar data set, not only do all the structure preserving indices improve with increasing $\beta$, but on the training set, FSMLPstruct with $\beta=10$ is significantly better than the ICA, F-score, and mutual information based methods, and FSMLP, in all five scores for both choices of $Q$. On the test set, for some cases, FSMLPstruct with $\beta=1$ is better than FSMLPstruct with $\beta=10$. For the Sonar data set, the proposed method clearly performs extremely well in terms of both classification and clustering performance.

Tables 10 and 11 summarize the performances of the proposed method and the other comparing methods for the training and test sets, respectively, of the AR10P data set. The original number of features, $P$, for the AR10P data set is $2400$, which is considerably higher than that of the other data sets used in this sub-section. The two choices of the number of selected features here are $40$ and $60$; these are not approximately $35\%$ and $50\%$ of the original dimension as in the previous cases.

Table 10: Performance comparison for AR10P, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60
ICA 0.7619 0.7107 0.2807 0.2449 0.2222 0.2066 0.2807 0.2449 1 1
F Score 0.7933 0.7484 0.1693 0.1709 0.0550 0.0438 0.1693 0.1709 1 1
Mutual Info 0.7805 0.7309 0.1613 0.1863 0.0429 0.0558 0.1613 0.1863 0.9915 0.9915
FSMLP 0.7640 0.7109 0.2296 0.2108 0.1038 0.0943 0.2296 0.2108 0.9983 1
FSMLPstruct, $\beta=25$ 0.7566 0.7061 0.3828 0.5251 0.3907 0.5654 0.3828 0.5251 1 1
FSMLPstruct, $\beta=50$ 0.7567 0.7059 0.5307 0.5089 0.5575 0.5268 0.5307 0.5089 1 1
FSMLPstruct, $\beta=100$ 0.7564 0.7061 0.4849 0.5830 0.4852 0.6353 0.4849 0.5830 1 1
Table 11: Performance comparison for AR10P, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60
ICA 0.7739 0.7256 0.8255 0.8222 0.0031 -0.0523 0.8255 0.8222 0.6154 0.6154
F Score 0.7364 0.6856 0.9072 0.9152 0.3645 0.4155 0.9072 0.9152 0.8462 0.8462
Mutual Info 0.7614 0.7112 0.8451 0.8451 0.0122 0.0122 0.8451 0.8451 0.9231 0.8462
FSMLP 0.7838 0.7213 0.9051 0.9152 0.3686 0.4155 0.9051 0.9152 0.8462 0.7538
FSMLPstruct, $\beta=25$ 0.7795 0.7346 0.9223 0.9146 0.4820 0.4380 0.9223 0.9146 0.4615 0.6154
FSMLPstruct, $\beta=50$ 0.7815 0.7352 0.9050 0.9050 0.3782 0.3782 0.9050 0.9050 0.4615 0.6154
FSMLPstruct, $\beta=100$ 0.7804 0.7356 0.9056 0.9136 0.3557 0.4727 0.9056 0.9136 0.4615 0.6154

The study in [42] proposed a feature selection scheme with redundancy control. For the AR10P data set, it reported an average of $58.9$ selected features without redundancy control and an average number of selected features in the range of $22.8$ to $44.2$ with redundancy control. Hence, we choose the number of selected features $Q$ as $40$ and $60$. From the classification scores shown in Table 10, we note that for all methods and both choices of $Q$, the classification scores on the training set are more than $99\%$. On the training set, we observe that for FSMLPstruct, SS decreases as $\beta$ increases in almost all cases; for the test set, this is not true. For the other structure-preserving measures on the training set, FSMLPstruct with $\beta=50$ is the best among all methods for $Q=40$ and FSMLPstruct with $\beta=100$ is the best among all methods for $Q=60$. On the test set, all methods perform almost the same in terms of the structure-preserving measures. The classification performance of FSMLPstruct is very poor on the test set for the AR10P data. The significant differences between the training and test OCA values for FSMLPstruct indicate poor generalization of the system. This problem may be addressed by choosing the number of nodes of our MLP based model through cross-validation.

Results on the five data sets clearly establish the benefit of introducing the proposed structure preserving regularizer term, $E_{sammons}$, into the overall loss function (8) of the MLP based embedded feature selection scheme. Next, we consider the band (channel) selection problem for hyperspectral satellite images.

3.2 Band selection in hyperspectral images

Let the considered hyperspectral image $I$ be of dimension $H\times W\times P$, where $H$, $W$, and $P$ are the height, width, and number of spectral bands of the image, respectively. We can represent the pixels of $I$ as $\mathbf{x}_{i}\in\mathbb{R}^{P}$, $i=1,2,\dots,H\times W$. Let there be a total of $n$ pixels annotated with $C$ land cover classes. Without any loss of generality, we take the first $n$ pixels, i.e., $i=1,2,\dots,n$, as the pixels having class labels. Our input data for the land cover classification problem is $\mathbf{X}=\{\mathbf{x}_{i}=(x_{i1},x_{i2},\cdots,x_{iP})\in\mathbb{R}^{P}\}_{i=1}^{n}$. The collection of class labels of $\mathbf{X}$ is $\mathbf{Z}=\{z_{i}\in\{1,2,\cdots,C\}\}_{i=1}^{n}$, where $z_{i}$ is the class label corresponding to $\mathbf{x}_{i}$. We aim to select a subset of size $Q$ from the original set of bands such that the selected subset performs reasonably well for land cover classification as well as in clustering. We have performed the experiments with three benchmark HSI datasets for land cover classification problems: Indian Pines, Pavia University, and Salinas [53]. We have used the corrected versions of the Indian Pines and Salinas datasets, having $200$ and $204$ bands, respectively. The Pavia University dataset uses $103$ bands. The pre-processing of the datasets is the same as in [54], following the code available in [55]. For any dataset, its pixel values are scaled to $[0,1]$ using the expression $(x-\min(x))/(\max(x)-\min(x))$, where $x$ is a pixel value and the $\max$ and $\min$ are computed over the entire HSI. The data are then mean normalized across each channel by subtracting the channel-wise means. The datasets are partitioned into training and test sets. For band selection, only the training sets are fed to the model; for measuring performance, both the training and test sets are used. For splitting a dataset into training and test subsets, we drop the pixels of unknown land-cover type. Let $\mathbf{X}$ be the set of pixels with known land-cover type. To obtain the training and test sets, we divide $\mathbf{X}$ into two subsets $\mathbf{A}$ and $\mathbf{B}$ such that $\mathbf{A}\bigcup\mathbf{B}=\mathbf{X}$, $\mathbf{A}\bigcap\mathbf{B}=\phi$, and $\mathbf{A}$ and $\mathbf{B}$ contain, respectively, $25\%$ and $75\%$ of the pixels of $\mathbf{X}$. We use $\mathbf{A}$ as the test set. Note that these datasets suffer from the class imbalance problem. To avoid the learning difficulty raised by class imbalance, in the training set we consider the same number of instances from each class. For this, from the subset $\mathbf{B}$, we randomly select (without replacement) $200$ pixels per class. If a class has fewer than $200$ instances in $\mathbf{B}$, we oversample the class by the synthetic minority oversampling technique (SMOTE) [56] to gather $200$ points. For band selection also, we use the same neural network (Fig. 1), with the number of hidden layers $n_{H}=3$. The numbers of hidden nodes in the three hidden layers are $500$, $350$, and $150$, respectively. Here, the number of input nodes of the MLP is equal to the number of bands ($P$). The network weights and the gate parameters $\lambda_{j}$ are initialized in the same way as in Subsec. 3.1. For all experiments of the current sub-section, $\alpha_{1}$ and $\alpha_{2}$ of the error functions in Equations (6) and (10) are set to $5$ and $1$, respectively.
The total number of iterations for training the network is set to $50000$. The rest of the experimental settings are kept the same as in the experiments of Subsec. 3.1. The number of training instances is $3200$ for the Indian Pines and Salinas data sets and $1800$ for Pavia University; both values of $n$ are high. Computing $E_{sammons}$ in (7) would involve computing $(3200)^{2}$ or $(1800)^{2}$ distances, so adding $E_{sammons}$ to the overall loss function would make each iteration computationally very intensive. So, instead of $E_{sammons}$, its proposed approximation $E_{struct}$, defined in (9), is used. In (9), $|S_{t}|$ is taken as $100$. Varying the value of $\beta$ in (10), we analyse its effect on the OCA, SS, NMI, ARI, and JI. We compute SS, NMI, ARI, and JI as described in Subsec. 3.1, and we use the same clustering algorithm with the same settings as in Subsec. 3.1.
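The pixel pre-processing and class-balancing steps can be sketched roughly as follows; the function names are ours, the cube is assumed to be a NumPy array of shape (H, W, P), and SMOTE is taken from the imbalanced-learn package, so this is an illustrative sketch rather than the authors' exact pipeline.

```python
import numpy as np
from imblearn.over_sampling import SMOTE   # assumption: imbalanced-learn is available

def preprocess_hsi(cube):
    """Scale the whole H x W x P cube to [0, 1], subtract the per-band mean,
    and flatten to an (H*W) x P pixel matrix."""
    x = (cube - cube.min()) / (cube.max() - cube.min())
    x = x - x.reshape(-1, x.shape[-1]).mean(axis=0)
    return x.reshape(-1, x.shape[-1])

def balance_training_set(X_train, y_train, per_class=200, seed=0):
    """Draw up to `per_class` pixels per class without replacement; classes with fewer
    samples are then oversampled with SMOTE so every class contributes `per_class` pixels."""
    rng = np.random.default_rng(seed)
    Xs, ys = [], []
    for c in np.unique(y_train):
        idx = np.where(y_train == c)[0]
        take = rng.choice(idx, size=min(per_class, len(idx)), replace=False)
        Xs.append(X_train[take]); ys.append(y_train[take])
    X_bal, y_bal = np.vstack(Xs), np.concatenate(ys)
    # SMOTE brings every minority class up to the majority count (here, per_class)
    return SMOTE(random_state=seed).fit_resample(X_bal, y_bal)
```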

Tables 12 and 13 summarize the comparative results of FSMLPstruct, FSMLP, and the other band selection methods (ICA, F-score, and mutual information based filter methods) on the training and test sets of Indian Pines, respectively.

Table 12: Performance comparison for Indian Pines training set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.6982 0.5437 0.3071 0.3335 0.7497
F Score 0.5325 0.5567 0.3015 0.0704 0.7978
Mutual Info 0.1966 0.5633 0.3232 0.2493 0.8600
FSMLP 0.0162 0.7994 0.6530 0.6827 0.9157
FSMLPstruct, $\beta=2$ 0.0153 0.8027 0.6588 0.6894 0.9165
FSMLPstruct, $\beta=5$ 0.0140 0.8051 0.6627 0.6945 0.9155
FSMLPstruct, $\beta=20$ 0.0107 0.8188 0.6953 0.7216 0.9153
FSMLPstruct, $\beta=50$ 0.0071 0.8400 0.7404 0.7549 0.9179
Table 13: Performance comparison for Indian Pines test set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.6748 0.5360 0.3052 0.2983 0.6175
F Score 0.5457 0.5518 0.2951 0.1643 0.6738
Mutual Info 0.2041 0.5787 0.3283 0.2395 0.7298
FSMLP 0.0158 0.8729 0.7949 0.8109 0.7871
FSMLPstruct, $\beta=2$ 0.0148 0.8831 0.8210 0.8384 0.7871
FSMLPstruct, $\beta=5$ 0.0137 0.8717 0.7881 0.8017 0.7881
FSMLPstruct, $\beta=20$ 0.0102 0.8896 0.8346 0.8507 0.7882
FSMLPstruct, $\beta=50$ 0.0066 0.8926 0.8360 0.8505 0.7850

Similarly, Tables 14 and 15 summarize the comparative results on the training and test sets of Pavia University.

Table 14: Performance comparison for Pavia University training set, with the number of bands = 35.
Method SS NMI ARI JI OCA
ICA 0.3231 0.6941 0.5858 0.6340 0.8433
F Score 0.2386 0.6451 0.4654 0.4748 0.8233
Mutual Info 0.1387 0.8905 0.8769 0.8871 0.8717
FSMLP 0.1686 0.9072 0.8925 0.8863 0.9213
FSMLPstruct, $\beta=1$ 0.1625 0.9169 0.9014 0.8952 0.9213
FSMLPstruct, $\beta=1.5$ 0.1616 0.9233 0.9112 0.9033 0.9212
FSMLPstruct, $\beta=2$ 0.1592 0.9243 0.9121 0.9044 0.9213
FSMLPstruct, $\beta=2.5$ 0.1590 0.9198 0.9056 0.8993 0.9218
Table 15: Performance comparison for Pavia University test set, with the number of bands = 35.
Method SS NMI ARI JI OCA
ICA 0.3215 0.5533 0.3777 0.2082 0.7573
F Score 0.3350 0.5091 0.3126 0.0886 0.7222
Mutual Info 0.1433 0.8911 0.8814 0.8990 0.7785
FSMLP 0.1584 0.9001 0.8930 0.9084 0.8628
FSMLPstruct, $\beta=1$ 0.1568 0.9156 0.9131 0.9254 0.8650
FSMLPstruct, $\beta=1.5$ 0.1573 0.9175 0.9149 0.9270 0.8672
FSMLPstruct, $\beta=2$ 0.1556 0.9189 0.9166 0.9279 0.8666
FSMLPstruct, $\beta=2.5$ 0.1562 0.9168 0.9143 0.9261 0.8648

In this experiment, we fix the number of selected bands $Q$ to approximately $35\%$ of the original number of bands $P$. So, the number of selected bands is $70$ for Indian Pines and $35$ for Pavia University. Tables 12 and 13 record the values of the structure-preserving indices and classification scores on Indian Pines for different values of $\beta$ in FSMLPstruct (the $\beta$ of Equation (10)). The $\beta$ values considered for Indian Pines are $2$, $5$, $20$, and $50$. Note that FSMLP is basically FSMLPstruct with $\beta=0$. We observe in Tables 12 and 13, for both the training and test sets, that as the value of $\beta$ increases for FSMLPstruct (the last five rows of the corresponding tables), the value of SS becomes smaller. A similar trend is also observed for the Pavia University data set (here, $\beta$ varies over $1$, $1.5$, $2$, and $2.5$) for the training (Table 14) and test (Table 15) sets.

For the Pavia University dataset, we set the values of $\beta$ in FSMLPstruct to $1$, $1.5$, $2$, and $2.5$. Unlike for Indian Pines, for Pavia University we restrict $\beta$ to lower values. This is because the number of selected bands for Pavia University is $35$ whereas that for Indian Pines is $70$; the smaller the number of bands, the smaller the importance ($\beta$) to be given to our structure preserving regularizer in Equation (9) to obtain the desired balance between classification and clustering performance. Table 12, which contains the results for the Indian Pines training data, clearly shows that both FSMLP and FSMLPstruct are better than the ICA, F-score based, and mutual information based methods in all four structure preserving metrics as well as in terms of the OCA. In Table 12, we observe that with increasing values of $\beta$ there is a consistent improvement in the values of the four structure preserving metrics while the OCA values remain at approximately $91\%$. The results in Table 13 for the Indian Pines test set also show that FSMLP and FSMLPstruct perform better than the other three methods in terms of all five metrics. Also, with an increase in $\beta$, all the structure-preserving metrics improve for FSMLPstruct, except that the value of JI decreases slightly when $\beta$ goes from $20$ to $50$. The classification metric OCA is around $78\%$ with the bands selected by FSMLPstruct for the different choices of $\beta$.

It is notable that the test set is completely unseen during band selection, yet the bands selected by the proposed method provide fairly good results for structure preservation as well as for classification. As observed from Tables 14 and 15 for the Pavia University training and test sets, respectively, the lowest (best) SS value among all the comparing methods is achieved by the mutual information based filter method. However, for the other four metrics, i.e., NMI, ARI, JI, and OCA, FSMLP and FSMLPstruct show better values. In the case of the Pavia University dataset, NMI, ARI, and JI do not increase consistently with increasing $\beta$, but the results indicate that it is possible to find a $\beta$ (here $\beta=2$) for which the structures are preserved better while maintaining a good classification score. Tables 16 and 17 summarize the comparative results on the training and test sets of Salinas.

Table 16: Performance comparison for Salinas training set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.7159 0.7066 0.5385 0.5111 0.9325
F Score 0.0167 0.9420 0.9307 0.9291 0.9650
Mutual Info 0.0902 0.9207 0.8775 0.8860 0.9628
FSMLP 0.0072 0.9672 0.9589 0.9628 0.9654
FSMLPstruct, $\beta=2$ 0.0078 0.9678 0.9603 0.9640 0.9658
FSMLPstruct, $\beta=5$ 0.0072 0.9691 0.9626 0.9659 0.9648
FSMLPstruct, $\beta=20$ 0.0055 0.9710 0.9657 0.9683 0.9646
FSMLPstruct, $\beta=50$ 0.0039 0.9688 0.9630 0.9659 0.9649
Table 17: Performance comparison for Salinas test set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.7027 0.6297 0.4144 0.3490 0.8670
F Score 0.0230 0.8584 0.7483 0.7615 0.9081
Mutual Info 0.0820 0.8791 0.7950 0.8230 0.8950
FSMLP 0.0078 0.9368 0.9183 0.9265 0.9077
FSMLPstruct, $\beta=2$ 0.0084 0.9392 0.9231 0.9305 0.9076
FSMLPstruct, $\beta=5$ 0.0079 0.9413 0.9275 0.9340 0.9069
FSMLPstruct, $\beta=20$ 0.0061 0.9366 0.9175 0.9258 0.9065
FSMLPstruct, $\beta=50$ 0.0044 0.9342 0.9136 0.9229 0.9073

We note from Tables 16 and 17 that, for the Salinas dataset, FSMLP and FSMLPstruct are better than the other three methods in all five metrics. All four structure preserving metric scores of FSMLPstruct are better than or comparable to those of FSMLP, while the classification score OCA stays at approximately $96\%$ for the training set and $90\%$ for the test set. Tables 16 and 17 reveal that when $\beta$ is increased from $0$ to $2$, the value of SS increases; however, from $\beta=2$ to $\beta=50$, the values of SS decrease. This exception for the Salinas dataset while increasing $\beta$ from $0$ to $2$ is possibly due to the fact that we do not use the entire training data in Equation (9), and $|S_{t}|=100$ in Equation (9) is not adequate to capture the structure of the data faithfully for the Salinas dataset. As discussed earlier, setting the value of $|S_{t}|$ is crucial for approximating Equation (7) with Equation (9). We have set $|S_{t}|=100$ for all three datasets empirically; however, choosing an optimal value of $|S_{t}|$ for each dataset would be expected to avoid such exceptions.

As we increase the value of $\beta$, more stress is put on reducing the loss in Equation (9). In most cases, increasing $\beta$ results in a drop in SS. This clearly suggests that the loss function in Equation (9) is a computationally efficient substitute for the original Sammon's stress defined in Equation (7).

We have included the thematic maps (Fig. 2), and they reveal that our proposed method is capable of selecting useful bands that can broadly capture the land cover types. Figure 2 illustrates thematic maps of the entire region captured in the Indian Pines dataset. Figure 2(a) shows the ground truth labels. Figures 2(b), 2(c), and 2(d) are thematic maps of the Indian Pines data set using the class labels obtained from the SVM classifier trained on the considered training set represented with the $70$ bands selected by FSMLPstruct with $\beta=0$ (i.e., by FSMLP), $\beta=20$, and $\beta=50$, respectively.

Figure 2: Thematic maps for the Indian Pines dataset: (a) ground truth, and SVM classification results with 70 bands selected by FSMLPstruct considering (b) $\beta=0$ (i.e., FSMLP), (c) $\beta=20$, and (d) $\beta=50$.

Figure 2 shows that, even with increasing stress on the structure-preserving regularizer $E_{struct}$, our proposed band selection method FSMLPstruct is able to select bands that maintain a good land cover classification performance.

4 Conclusion and Discussions

To the best of our knowledge, a feature selection method that simultaneously cares about class discrimination and structure preservation is not available in the literature. In this study, we have tried to bridge this gap by proposing a neural network-based feature selection method that focuses both on class discrimination and structure preservation. To train the proposed system, we use Sammon's stress as a regularizer added to the classification loss. For datasets having a large number of instances, the computational overhead associated with Sammon's stress is very high. Consequently, as the structure-preserving regularizer, we use Sammon's stress computed on a sample of the original data (using dynamic sampling at each iteration during the adaptive gradient descent based learning). Using this regularizer in the experiments with datasets having a large number of instances, we have demonstrated that it is an effective and computationally efficient implementation of the Sammon's stress based structure-preserving regularizer. Our proposed feature selection scheme is generic, so we have investigated its effectiveness on datasets commonly used for assessing classifiers as well as on a specialized case: band selection in hyperspectral images (HSI). We have applied the feature selection scheme to five real-world datasets that are commonly used for assessing classification. In the context of band selection, we have applied our method to three well-known HSI datasets and compared its performance with three other band selection methods. Based on our experiments, we conclude that the proposed feature selection method produces reasonably good classification and clustering scores on the majority of the data sets, indicating that it is capable of selecting a subset of features that is good for both classification and clustering. Our scheme provides a mechanism to control the number of selected features. The proposed method is easily extendable to other networks such as the Radial Basis Function (RBF) network.

References

  • [1] R. Zhang, F. Nie, X. Li, X. Wei, Feature selection with multi-view data: A survey, Information Fusion 50 (2019) 158–167.
  • [2] S. Solorio-Fernández, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, A review of unsupervised feature selection methods, Artificial Intelligence Review 53 (2) (2020) 907–948.
  • [3] Z. Atashgahi, X. Zhang, N. Kichler, S. Liu, L. Yin, M. Pechenizkiy, R. Veldhuis, D. C. Mocanu, Supervised feature selection with neuron evolution in sparse neural networks, arXiv preprint arXiv:2303.07200.
  • [4] R. Jiao, B. Xue, M. Zhang, Solving multi-objective feature selection problems in classification via problem reformulation and duplication handling, IEEE Transactions on Evolutionary Computation.
  • [5] S. An, M. Zhang, C. Wang, W. Ding, Robust fuzzy rough approximations with knn granules for semi-supervised feature selection, Fuzzy Sets and Systems (2023) 108476.
  • [6] Z. Zeng, X. Wang, F. Yan, Y. Chen, Local adaptive learning for semi-supervised feature selection with group sparsity, Knowledge-Based Systems 181 (2019) 104787.
  • [7] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40 (1) (2014) 16–28.
  • [8] S. Wang, W. Pedrycz, Q. Zhu, W. Zhu, Unsupervised feature selection via maximum projection and minimum redundancy, Knowledge-Based Systems 75 (2015) 19–29.
  • [9] L. Du, Y.-D. Shen, Unsupervised feature selection with adaptive structure learning, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 209–218.
  • [10] J.-S. Wu, M.-X. Song, W. Min, J.-H. Lai, W.-S. Zheng, Joint adaptive manifold and embedding learning for unsupervised feature selection, Pattern Recognition 112 (2021) 107742.
  • [11] Z. Li, J. Liu, Y. Yang, X. Zhou, H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering 26 (9) (2013) 2138–2150.
  • [12] C. Hou, F. Nie, X. Li, D. Yi, Y. Wu, Joint embedding learning and sparse regression: A framework for unsupervised feature selection, IEEE transactions on cybernetics 44 (6) (2013) 793–804.
  • [13] Z. Zhao, L. Wang, H. Liu, J. Ye, On similarity preserving feature selection, IEEE Transactions on Knowledge and Data Engineering 25 (3) (2011) 619–632.
  • [14] W. Gao, L. Hu, Y. Li, P. Zhang, Preserving similarity and staring decisis for feature selection, IEEE Transactions on Artificial Intelligence 2 (6) (2021) 584–593.
  • [15] Q. Gu, Z. Li, J. Han, Generalized fisher score for feature selection, arXiv preprint arXiv:1202.3725.
  • [16] Q. Song, H. Jiang, J. Liu, Feature selection based on fda and f-score for multi-class classification, Expert Systems with Applications 81 (2017) 22–27.
  • [17] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, Advances in neural information processing systems 18.
  • [18] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on pattern analysis and machine intelligence 27 (8) (2005) 1226–1238.
  • [19] J. R. Vergara, P. A. Estévez, A review of feature selection methods based on mutual information, Neural computing and applications 24 (2014) 175–186.
  • [20] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, Normalized mutual information feature selection, IEEE Transactions on neural networks 20 (2) (2009) 189–201.
  • [21] M. Tesmer, P. A. Estévez, Amifs: Adaptive feature selection by using mutual information, in: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Vol. 1, IEEE, 2004, pp. 303–308.
  • [22] J. R. Vergara, P. A. Estévez, Cmim-2: an enhanced conditional mutual information maximization criterion for feature selection, Journal of Applied Computer Science Methods 2 (1) (2010) 5–20.
  • [23] R. J. Urbanowicz, M. Meeker, W. La Cava, R. S. Olson, J. H. Moore, Relief-based feature selection: Introduction and review, Journal of biomedical informatics 85 (2018) 189–203.
  • [24] J. Wang, J. Wei, Z. Yang, Supervised feature selection by preserving class correlation, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 1613–1622.
  • [25] T. M. Le, T. M. Vo, T. N. Pham, S. V. T. Dao, A novel wrapper–based feature selection for early diabetes prediction enhanced with a metaheuristic, IEEE Access 9 (2020) 7869–7884.
  • [26] N. Zhou, Y. Xu, H. Cheng, J. Fang, W. Pedrycz, Global and local structure preserving sparse subspace learning: An iterative approach to unsupervised feature selection, Pattern Recognition 53 (2016) 87–101.
  • [27] Y. Zhu, X. Zhang, R. Hu, G. Wen, Adaptive structure learning for low-rank supervised feature selection, Pattern Recognition Letters 109 (2018) 89–96.
  • [28] X. Liu, L. Wang, J. Zhang, J. Yin, H. Liu, Global and local structure preservation for feature selection, IEEE transactions on neural networks and learning systems 25 (6) (2013) 1083–1095.
  • [29] X. Zhu, S. Zhang, R. Hu, Y. Zhu, et al., Local and global structure preservation for robust unsupervised spectral feature selection, IEEE Transactions on Knowledge and Data Engineering 30 (3) (2017) 517–529.
  • [30] Q. Ye, X. Zhang, Y. Sun, Dual global structure preservation based supervised feature selection, Neural Processing Letters 51 (2020) 2765–2787.
  • [31] D. Chakraborty, N. R. Pal, A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification, IEEE Transactions on Neural Networks 15 (1) (2004) 110–123.
  • [32] Y.-C. Chen, N. R. Pal, I.-F. Chung, An integrated mechanism for feature selection and fuzzy rule extraction for classification, IEEE Transactions on Fuzzy Systems 20 (4) (2012) 683–698.
  • [33] G. Xue, Q. Chang, J. Wang, K. Zhang, N. R. Pal, An adaptive neuro-fuzzy system with integrated feature selection and rule extraction for high-dimensional classification problems, IEEE Transactions on Fuzzy Systems.
  • [34] R. Setiono, H. Liu, Neural-network feature selector, IEEE transactions on neural networks 8 (3) (1997) 654–662.
  • [35] G. Castellano, A. M. Fanelli, Variable selection using neural-network models, Neurocomputing 31 (1-4) (2000) 1–13.
  • [36] A. Verikas, M. Bacauskiene, Feature selection with neural networks, Pattern recognition letters 23 (11) (2002) 1323–1335.
  • [37] G. P. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30 (4) (2000) 451–462.
  • [38] S. Dreiseitl, L. Ohno-Machado, Logistic regression and artificial neural network classification models: a methodology review, Journal of biomedical informatics 35 (5-6) (2002) 352–359.
  • [39] S. Agrawal, J. Agrawal, Neural network techniques for cancer prediction: A survey, Procedia Computer Science 60 (2015) 769–774.
  • [40] A. Mabrouk, R. P. D. Redondo, M. Kayed, Deep learning-based sentiment classification: a comparative survey, IEEE Access 8 (2020) 85616–85638.
  • [41] D. Chakraborty, N. R. Pal, Selecting useful groups of features in a connectionist framework, IEEE transactions on neural networks 19 (3) (2008) 381–396.
  • [42] R. Chakraborty, N. R. Pal, Feature selection using a neural framework with controlled redundancy, IEEE transactions on neural networks and learning systems 26 (1) (2014) 35–50.
  • [43] R. Chakraborty, C.-T. Lin, N. R. Pal, Sensor (group feature) selection with controlled redundancy in a connectionist framework, International journal of neural systems 24 (06) (2014) 1450021.
  • [44] Z. Zhang, M. R. Sabuncu, Generalized cross entropy loss for training deep neural networks with noisy labels, in: 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [45] J. W. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on computers 100 (5) (1969) 401–409.
  • [46] D. Dua, C. Graff, UCI machine learning repository (2017).
    URL http://archive.ics.uci.edu/ml
  • [47] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, ACM Computing Surveys (CSUR) 50 (6) (2018) 94.
  • [48] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
  • [49] M. Prasad, A. Sowmya, I. Koch, Efficient feature selection based on independent component analysis, in: Proceedings of the 2004 Intelligent Sensors, Sensor Networks and Information Processing Conference, 2004., IEEE, 2004, pp. 427–432.
  • [50] D. Steinley, Properties of the Hubert-Arabie adjusted Rand index, Psychological Methods 9 (3) (2004) 386.
  • [51] S. Fletcher, M. Z. Islam, et al., Comparing sets of patterns with the jaccard index, Australasian Journal of Information Systems 22.
  • [52] J. C. Bezdek, R. Ehrlich, W. Full, Fcm: The fuzzy c-means clustering algorithm, Computers & Geosciences 10 (2-3) (1984) 191–203.
  • [53] M. Graña, M. Veganzons, B. Ayerdi, Hyperspectral remote sensing scenes (06 2021).
    URL http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes
  • [54] A. Santara, K. Mani, P. Hatwar, A. Singh, A. Garg, K. Padia, P. Mitra, Bass net: Band-adaptive spectral-spatial feature learning neural network for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 55 (9) (2017) 5293–5301.
  • [55] A. Santara, K. Mani, A. Shukla, Deep learning for land-cover classification in hyperspectral images (07 2016).
    URL https://kgpml.github.io/Hyperspectral/
  • [56] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.