Feature selection simultaneously preserving both class and cluster structures
Abstract
When a data set has significant differences in its class and cluster structure, selecting features aiming only at the discrimination of classes would lead to poor clustering performance, and similarly, feature selection aiming only at preserving cluster structures would lead to poor classification performance. To the best of our knowledge, a feature selection method that simultaneously considers class discrimination and cluster structure preservation is not available in the literature. In this paper, we have tried to bridge this gap by proposing a neural network-based feature selection method that focuses both on class discrimination and structure preservation in an integrated manner. In addition to assessing typical classification problems, we have investigated its effectiveness on band selection in hyperspectral images. Based on the results of the experiments, we may claim that the proposed feature/band selection method can select a subset of features that is good for both classification and clustering.
keywords:
Feature selection, Structure preserving, Classification, Neural network, Sammon’s Stress, Band selection, Hyperspectral Image.
1 Introduction
Feature selection methods can be broadly classified on the basis of the utilization of class label information. There are three categories: supervised, semi-supervised, and unsupervised [1, 2]. Supervised feature selection methods exploit the label information to find relevant features that distinguish samples of different classes [3, 4]. Semi-supervised feature selection is used when a few labeled samples are available along with plenty of unlabeled samples [5, 6]; both labeled and unlabeled data are used to modify a hypothesis obtained from the labeled data [7]. Unsupervised feature selection is much more difficult as it needs to find the useful features in the absence of label information [2, 8]. Different criteria have been chosen to select a subset of the original features in different unsupervised feature selection studies. Some of them are: preserving the data distribution such as the manifold structure [9, 10], preserving cluster structure [11, 12], and preserving data similarity [13, 14]. It is noteworthy that in the case of unsupervised feature selection, some methods try to preserve the “structure” or “geometry” of the data in some sense. In contrast, supervised feature selection methods in most cases do not set any explicit criterion to preserve the structure of the data. They focus only on separating the classes as much as possible, using measures that exploit class information such as the Fisher score [15, 16], Laplacian score [17], mutual information [18, 19], normalized mutual information [18, 20, 21, 22], ReliefF [23], class correlation [24], and classifier score [25]. We should note here that feature selection criteria are not always driven by a single objective. Feature selection methods often follow a criterion that consists of two or more objectives. The study in [8] proposes a criterion named ‘maximum projection and minimum redundancy’ which is governed by two goals: projecting data into a feature subspace with minimum reconstruction error and minimum redundancy. The studies in [26, 27, 28, 29] claim that both global structure and local structure should be preserved in the projected space as both of them may carry important discriminating information and hence, they have proposed feature selection schemes that focus both on global and local structure preservation. The investigation in [30] claims to preserve dual global structures. Going through various feature selection schemes with multiple objectives, we found that whenever class labels are available, no work on feature selection explicitly focuses on preserving structural information along with class information, although both carry important discriminative information and may have a positive impact on the generalization ability of the classifier. Suppose, for a data set, the class and cluster structures are substantially different. Exploiting only the class labels, it may not be possible to keep the cluster structures in the projected space. For a practical system, even when the primary task is classification, we may need to cluster the samples in the space defined by the selected features. For example, fuzzy rule based classifiers are often designed by clustering the training data for each class and translating each cluster into a rule [31, 32, 33]. We could not find any feature selection method that focuses both on class and cluster separability.
To bridge this gap, in this study we propose a feature selection method that selects features preserving class and cluster-structure information simultaneously. We employ a multi-layer perceptron (MLP) based neural network to develop an embedded feature selection scheme. The training of the proposed MLP based feature selection method is governed by both class-discriminating and cluster (structure) preserving objectives. The philosophy is quite general and can be easily extended to other networks such as the radial basis function network.
2 Proposed Method
Let us denote the input data by an $n \times d$ matrix $X$. Here, the $i$th row of $X$ is a $d$-dimensional row vector of the form $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{id})$. Let the collection of class labels of $X$ be $Y = \{y_1, y_2, \ldots, y_n\}$, where $y_i$ is the class label corresponding to $\mathbf{x}_i$. We aim to select a subset of size $Q$ from the original set of $d$ features such that the selected subset performs reasonably well in terms of the classification task as well as in clustering. In other words, if we design a classifier using the selected features, the performance of the classifier should be comparable to that of a classifier designed using all features. Similarly, if we cluster the data in the reduced dimension as well as in the original dimension, we expect to get similar partition matrices. Here, we propose a neural network-based framework to select features. Neural networks have been explored for feature selection [34, 35, 36] as well as for classification [37, 38, 39, 40]. However, in our proposed model the neural network simultaneously selects features and learns a classifier, as we follow an embedded method for feature selection. Moreover, our proposed network preserves structural information and class label information simultaneously, whereas the feature selection networks in [34, 36], which solve classification problems, consider class label information in their loss functions but not any structural information. Note that the work in [35] considers a system identification problem. To build the neural network-based embedded feature selector, we employ the multi-layer perceptron (MLP) based framework used in [41, 42, 43]. The basic framework is shown in Fig. 1.

As seen in Figure 1, preceding the input layer of the MLP there is a layer consisting of $d$ nodes. Before entering the input layer of the MLP, the $j$th feature passes through the $j$th of these nodes. These nodes act as attenuating gates that effectively allow or block features from contributing to the output of the neural network. For the $i$th instance, its $j$th feature $x_{ij}$, on passing through the gate node, becomes the attenuated value $\hat{x}_{ij}$, i.e., the gate node multiplies $x_{ij}$ by an attenuating factor defined below. In the MLP, a weighted sum of the values available at the input nodes is applied to the hidden nodes of the first hidden layer. A zero value at an input node implies that the corresponding feature is not considered. When training of the MLP-based framework is complete, the attenuating factors for the selected features become close to $1$, effectively allowing them to contribute to the classifier, whereas for poor or rejected features the factors become close to $0$, effectively preventing them from contributing to the classifier. In [41, 42, 43], this framework was explored for classification-oriented feature selection, group feature selection, and redundancy-controlled feature selection. Here, we explore this framework for simultaneous structure-preserving and class-discriminating feature selection. Next, we elaborate on the MLP-based framework and the proposed objective functions to train the network.
We denote the nodes before the input layer of the MLP as $G_j$s, for $j = 1, 2, \ldots, d$, where $G_j$ is a gate or modulator function applied on the $j$th feature, $x_j$. Now, we have to design $G_j$ in such a way that
$\hat{x}_{ij} = G_j(x_{ij}) = g(\gamma_j)\,x_{ij}$, with $g(\gamma_j) \approx 1$ if the $j$th feature is useful and $g(\gamma_j) \approx 0$ if it is derogatory or indifferent.  (1)
In our framework, the attenuating factor $g(\gamma_j)$ is governed by a learnable parameter $\gamma_j$. We implement $g(\cdot)$ as a smooth continuous function such that when $\gamma_j \to 0$, the value of $g(\gamma_j) \to 1$, and when $|\gamma_j| \to \infty$, the value of $g(\gamma_j) \to 0$ (a Gaussian-type gate such as $e^{-\gamma_j^2}$, for example, has this property). By adding suitable regularizer terms to the objective function, we design our learning system in such a way that, over the learning process, the gate parameters $\gamma_j$ for useful features drop close to zero and those for derogatory or indifferent features rise to high values. So, in our learning system, the learnable parameters $\gamma_j$ and the neural network weights are learned together, i.e., the loss function is minimized with respect to both the $\gamma_j$s and the neural network weights.
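To make the gating mechanism concrete, here is a minimal Python/NumPy sketch of one plausible gate, $g(\gamma) = e^{-\gamma^2}$. The exact functional form used by the method is not reproduced above, so this particular choice is an assumption that merely satisfies the stated properties ($g(\gamma) \to 1$ as $\gamma \to 0$ and $g(\gamma) \to 0$ as $|\gamma|$ grows); the array contents are purely illustrative.

```python
import numpy as np

def gate(gamma):
    """Smooth gate value in (0, 1]; assumed form g(gamma) = exp(-gamma^2)."""
    return np.exp(-np.square(gamma))

def attenuate(X, gamma):
    """Multiply the j-th feature of every row of X (n x d) by its gate value g(gamma_j)."""
    return X * gate(gamma)[np.newaxis, :]

# Toy usage: a near-zero gamma lets a feature pass, a large |gamma| effectively blocks it.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
gamma = np.array([0.05, 3.0, 0.1])   # the second feature is effectively rejected
print(attenuate(X, gamma))
```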
Now, we have to define a suitable loss function for selecting features along with learning the embedded classifier. Our aim is to select features that are reasonably good for classification as well as clustering. To satisfy this requirement, we take the loss function as a combination of two losses, $E_{class}$ and $E_{struct}$: $E_{class}$ is considered for preserving class information and $E_{struct}$ for preserving structural information. For the moment, let us consider the network for selecting features for efficient classification only. A suitable loss function to impose class discrimination is the cross-entropy loss [44]. We define $E_{class}$ as the cross-entropy loss involving the actual and predicted class labels:
$E_{class} = -\sum_{i=1}^{n}\sum_{k=1}^{c} y_{ik}\,\ln(p_{ik})$  (2)
Here, $y_{ik}$ is the $k$th element of the one-hot encoded label of the sample $\mathbf{x}_i$; in other words, $y_{ik}$ is the $k$th element of the vector $\mathbf{y}_i = (y_{i1}, y_{i2}, \ldots, y_{ic})$, $c$ being the number of classes, such that
$y_{ik} = 1$ if $\mathbf{x}_i$ belongs to class $k$, and $y_{ik} = 0$ otherwise.  (3)
In (2), $p_{ik}$ is the predicted probability (by the MLP classifier) of $\mathbf{x}_i$ being in the $k$th class. As already discussed above, for effective feature selection, the magnitude of $\gamma_j$ for the selected features should drop to almost zero and that for the rejected features should rise to high values; equivalently, the gate values $g(\gamma_j)$ should approach $1$ or $0$. To ensure this condition we add the following regularizer:
$R_1 = \sum_{j=1}^{d} g(\gamma_j)\left(1 - g(\gamma_j)\right)$  (4)
In a feature selection framework, a constraint for selecting a fixed number of features is necessary. The following regularizer tries to keep the number of selected features close or equal to $Q$:
$R_2 = \left(\sum_{j=1}^{d} g(\gamma_j) - Q\right)^2$  (5)
So, the overall loss function for the selection of features with our framework for classification purposes is the following.
$E_c = E_{class} + \mu_1 R_1 + \mu_2 R_2$  (6)
Here, $\mu_1$ and $\mu_2$ are scalar multipliers for adjusting the importance of $R_1$ and $R_2$ in the overall error function $E_c$.
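The classification-side objective of Equation (6) can be prototyped in a few lines. In the sketch below, the forms of $R_1$ and $R_2$ are the ones reconstructed above (a gate-polarization term and a squared deviation from the target count $Q$), so they should be read as assumptions rather than the authors' exact expressions; the gate again uses the assumed exponential form, and the default multiplier values are illustrative.

```python
import numpy as np

def gate(gamma):
    return np.exp(-np.square(gamma))          # assumed exponential gate

def cross_entropy(Y_onehot, P, eps=1e-12):
    """E_class of Equation (2): Y_onehot and P are n x c arrays of labels and predicted probabilities."""
    return -np.sum(Y_onehot * np.log(P + eps))

def gate_polarization(gamma):
    """R1 of Equation (4) (assumed form): small when every gate is close to 0 or 1."""
    g = gate(gamma)
    return np.sum(g * (1.0 - g))

def feature_count_penalty(gamma, Q):
    """R2 of Equation (5) (assumed form): keeps the number of open gates close to Q."""
    return (np.sum(gate(gamma)) - Q) ** 2

def classification_loss(Y_onehot, P, gamma, Q, mu1=1.0, mu2=1.0):
    """E_c of Equation (6) with illustrative multipliers mu1 and mu2."""
    return (cross_entropy(Y_onehot, P)
            + mu1 * gate_polarization(gamma)
            + mu2 * feature_count_penalty(gamma, Q))
```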
Now let us focus on our original agenda of selecting features that perform satisfactorily both for classification and clustering. To preserve the structural information of the data in the lower dimensional space formed by the selected features, we consider Sammon’s stress [45] as a loss function. Sammon’s stress is the loss function of a non-linear mapping, named Sammon’s mapping, that is able to capture complex non-linear structures in data and, as a result, also preserves cluster structure. The lower the value of Sammon’s stress, the better the lower dimensional representation captures the original inter-point distances, i.e., the structure of the original data. We define Sammon’s stress involving the original input space and the selected feature space as follows:
$E_{struct} = \frac{1}{\sum_{i<j} d_{ij}} \sum_{i<j} \frac{\left(d_{ij} - \hat{d}_{ij}\right)^2}{d_{ij}}$  (7)
Here, $d_{ij}$ is the distance between $\mathbf{x}_i$ and $\mathbf{x}_j$ in the original space. Let $\hat{\mathbf{x}}_i = \left(g(\gamma_1)x_{i1}, g(\gamma_2)x_{i2}, \ldots, g(\gamma_d)x_{id}\right)$ be the gated representation of $\mathbf{x}_i$; then $\hat{d}_{ij}$ is the distance between $\hat{\mathbf{x}}_i$ and $\hat{\mathbf{x}}_j$. As discussed earlier, at the end of the training of our embedded system, the $g(\gamma_j)$s will be close to $0$ or $1$ depending on whether the corresponding features are rejected or selected. Therefore, for a trained system, $\hat{d}_{ij}$ signifies the distance between the $i$th and $j$th instances in the latent space formed by the implicitly selected features. So, considering $E_{struct}$ in Equation (7) as a regularizer, the resultant overall loss function is given by
$E = E_c + \lambda E_{struct}$  (8)
Here, $\lambda$ is a scalar multiplier that controls the trade-off between the class information and the structural information in the feature selection process. Note that the computational complexity of the loss function in Equation (7) is $O(n^2)$. For large $n$, computing Equation (7), and hence Equation (8), is computationally intensive; since the weight update at each iteration involves computing Equation (7), the overall computational cost would be high. For small and moderate $n$, we use Equation (8) as the loss function to be minimized. However, for large $n$, to avoid the high computational cost, we modify Equation (7) as follows:
$\tilde{E}_{struct} = \frac{1}{\sum_{i<j:\ \mathbf{x}_i,\mathbf{x}_j \in S_t} d_{ij}} \sum_{i<j:\ \mathbf{x}_i,\mathbf{x}_j \in S_t} \frac{\left(d_{ij} - \hat{d}_{ij}\right)^2}{d_{ij}}$  (9)
Here, $S_t$ is a randomly selected subset of $X$, of size $n_s$, chosen at the $t$th iteration. Different $S_t$s are chosen at different iterations and hence different sets of inter-point distances are preserved. Since the considered MLP is trained over a large number of iterations, the use of Equation (9) is expected to have almost the same effect as Equation (7). We have to choose $n_s$ such that Equation (9) is computationally manageable, and at the same time large enough to make Equation (9) an effective substitute for $E_{struct}$. Adding Equation (9) to Equation (6), we propose the following loss function for our system:
$E' = E_c + \lambda \tilde{E}_{struct}$  (10)
$E$ (or $E'$) is minimized with respect to the gate parameters $\gamma_j$ and the weights of the network to find their optimal values.
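As an end-to-end illustration, the sketch below wires the gated MLP and minimizes a loss of the form of Equation (10) with respect to both the gate parameters and the network weights using TensorFlow, the framework used for the experiments. It is only a minimal sketch, not the authors' implementation: the gate form $e^{-\gamma^2}$, the forms of $R_1$ and $R_2$, the layer sizes, the initial $\gamma$ value, and every hyper-parameter in the commented usage are assumptions, and the TF2 Keras Adagrad optimizer stands in for the ‘train.AdagradOptimizer’ routine referred to in Section 3.

```python
import tensorflow as tf

class GatedMLP(tf.keras.Model):
    """MLP preceded by a per-feature gate layer; exp(-gamma^2) is an assumed gate form."""
    def __init__(self, d, c, hidden=20):
        super().__init__()
        # A large initial gamma keeps every gate almost closed at the start of training.
        self.gamma = tf.Variable(tf.fill([d], 2.0), name="gamma")
        self.hidden_layer = tf.keras.layers.Dense(hidden, activation="sigmoid")
        self.output_layer = tf.keras.layers.Dense(c, activation="softmax")

    def gates(self):
        return tf.exp(-tf.square(self.gamma))

    def call(self, x):
        return self.output_layer(self.hidden_layer(x * self.gates()))

def pairwise_dist(x, eps=1e-9):
    sq = tf.reduce_sum(tf.square(x[:, None, :] - x[None, :, :]), axis=-1)
    return tf.sqrt(sq + eps)                    # eps keeps the gradient finite on the diagonal

def sammon_stress(x, x_hat):
    """Equation (7) on a batch: x holds original vectors, x_hat their gated versions."""
    d, d_hat = pairwise_dist(x), pairwise_dist(x_hat)
    upper = tf.linalg.band_part(tf.ones_like(d), 0, -1) - tf.eye(tf.shape(d)[0])  # i < j pairs
    return tf.reduce_sum(upper * tf.square(d - d_hat) / d) / tf.reduce_sum(upper * d)

def train_step(model, optimizer, x, y_onehot, Q, mu1, mu2, lam, n_s):
    """One update of E' = E_class + mu1*R1 + mu2*R2 + lam*E_struct(S_t), cf. Equation (10)."""
    x = tf.cast(x, tf.float32)
    y_onehot = tf.cast(y_onehot, tf.float32)
    idx = tf.random.shuffle(tf.range(tf.shape(x)[0]))[:n_s]           # random subset S_t
    xs = tf.gather(x, idx)
    with tf.GradientTape() as tape:
        p = model(x)
        e_class = -tf.reduce_sum(y_onehot * tf.math.log(p + 1e-12))
        g = model.gates()
        r1 = tf.reduce_sum(g * (1.0 - g))                              # assumed form of (4)
        r2 = tf.square(tf.reduce_sum(g) - Q)                           # assumed form of (5)
        e_struct = sammon_stress(xs, xs * g)                           # Equation (9) on S_t
        loss = e_class + mu1 * r1 + mu2 * r2 + lam * e_struct
    grads = tape.gradient(loss, model.trainable_variables)             # gamma and MLP weights jointly
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

# Illustrative wiring (all values are placeholders, not the paper's settings):
# model = GatedMLP(d=60, c=2)
# opt = tf.keras.optimizers.Adagrad(learning_rate=0.05)
# for step in range(5000):
#     train_step(model, opt, x_train, y_train_onehot, Q=21, mu1=1.0, mu2=1.0, lam=10.0, n_s=100)
```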
3 Experimentation and Results
The feature selection framework proposed in this paper is generic, but it can also be adapted to solve specialized problems. We have studied the proposed framework on general datasets as well as for solving a special problem: band selection in hyperspectral images (HSIs). We present the results of feature selection for conventional classification problems in Subsec. 3.1 and the results of band selection for HSIs in Subsec. 3.2.
3.1 Feature selection for conventional classification problems
We have used five publicly available datasets that are very commonly used for classification and clustering. The first four datasets are downloaded from UCI machine learning repository [46]. AR10P is downloaded from the open-source feature selection repository of Arizona State University[47]. We have also performed the experiments with three benchmark HSI datasets for land cover classification problems. We discuss them in a separate subsection (Subsec. 3.2).
The details of the number of features, number of classes, and number of instances for the five datasets are summarized in Table 1.
Name | Number of features | Number of classes | Number of instances |
---|---|---|---|
E. coli | 7 | 8 | 336 |
Glass | 9 | 6 | 214 |
Ionosphere | 34 | 2 | 351 |
Sonar | 60 | 2 | 208 |
AR10P | 2400 | 10 | 130 |
The datasets are used directly without any further processing and are partitioned into training and test sets containing fixed fractions of the total number of instances. To implement our proposed feature selection scheme, we use the neural network shown in Fig. 1 with a single hidden layer. The input and output layers have $d$ and $c$ nodes respectively, where $d$ is the number of features and $c$ is the number of classes of the considered dataset; the number of nodes in the hidden layer is kept the same for all datasets except AR10P, for which a different value is used. To get stable feature selection results, the network weights are initialized as follows. First, we consider the usual MLP part of our network (i.e., without feature selection), depicted by the portion within the dotted rectangle in Fig. 1, and initialize its weights randomly. Next, we train this usual MLP with the cross-entropy loss defined in Equation (2) on the training set until convergence. The weights of the converged network are used as the initial weights of the proposed network. The gate parameters $\gamma_j$ are initialized with values drawn randomly from a normal distribution whose mean and spread are chosen so that the initial $\gamma_j$s effectively keep the gates almost closed. As the learning progresses, the $\gamma_j$s are updated so as to allow the useful features into the network. For the proposed system, to select a subset of $Q$ features, the gate parameters $\gamma_j$ are sorted in ascending order of magnitude and the features corresponding to the top $Q$ positions (i.e., the most open gates) are selected. The network weights as well as the gate parameters $\gamma_j$ are learned using the adaptive gradient algorithm (the ‘train.AdagradOptimizer’ routine of the ‘TensorFlow’ framework [48]). For all experiments with the data sets in Table 1, both $\mu_1$ and $\mu_2$ of the error functions in Equations (6) and (8) are set to the same fixed value, and the total number of iterations for training the network is kept fixed. The five datasets considered here do not have a very large number of instances; therefore, we use (8) as the overall loss function to train the MLP-based architecture for selecting features that are reasonably good for clustering and classification. When $\lambda = 0$ in (8), the error function that effectively governs the learning of our MLP-based embedded feature selection scheme is (6), and the corresponding scheme considers only classification. Let us name this method feature selection with MLP (FSMLP). When $\lambda > 0$ in (8), our method takes structure preservation into account along with classification; let us name the corresponding method FSMLPstruct. To understand the importance of adding the structure-preserving regularizer (7), we perform feature selection with FSMLP and compare it with FSMLPstruct for different values of $\lambda$. We explore three non-zero values of $\lambda$. Although the optimal value of $\lambda$ for a particular dataset and a particular number of selected features cannot be determined from these three values, we investigate three widely different $\lambda$s to see the role of the weight given to the structure-preserving regularizer on the performance of the selected features. We compare with three other methods, namely Independent Component Analysis (ICA)-based feature selection [49], an F-score based filter method [16], and a mutual information based filter method [18, 19]. The performance of both FSMLP and FSMLPstruct depends on the initial weights of the network.
So, we repeat the initialization of the network weights and gate parameters $\gamma_j$ five times and run FSMLP or FSMLPstruct once with each of the five initializations. As the performance of FSMLP and FSMLPstruct, we report the average performance over the five feature subsets obtained from the five runs. To check the effectiveness of the methods in selecting features that perform well in classification and clustering simultaneously, we compute the classification score of a support vector machine (SVM) classifier as well as several structure-preserving indices: Sammon’s stress (SS) [45], normalized mutual information (NMI) [18, 20, 21, 22], adjusted Rand index (ARI) [50], and Jaccard index (JI) [51]. As the measure of classification performance, we use the overall classification accuracy (OCA) of the SVM classifier; the optimal hyper-parameters of the SVM are determined through five-fold cross-validation using grid search. Note that the test set is unseen not only by the SVM classifier but also by the feature selection methods. SS, defined in Equation (7), uses the original inter-point distances $d_{ij}$ and the latent-space inter-point distances $\hat{d}_{ij}$; here, to compute $\hat{d}_{ij}$, we use the lower dimensional data formed by the selected features. We use NMI, ARI, and JI as structure-preserving performance metrics by supplying the cluster labels obtained from clustering the data in the original space (using all features) as the true labels and the cluster labels obtained from clustering the data in the reduced space formed by the selected features as the predicted labels. So, NMI, ARI, and JI measure how well the cluster assignments in the original space and in the selected space agree, effectively giving a measure of the preservation of the original cluster structure in the selected space. The maximum value of each of NMI, ARI, and JI is $1$; a value close to $1$ for each of these three measures indicates that the cluster structure of the original space is preserved in the selected space. As the clustering algorithm, we use the fuzzy $c$-means (FCM) algorithm [52] with a fixed fuzzy exponent, and we set the number of clusters for FCM to the number of classes. We use two values of the number of selected features $Q$, obtained as fixed fractions of the original dimension and rounded up to the nearest integers using the ceiling function.
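To illustrate the evaluation protocol, the sketch below clusters the data once in the original space and once in the space of the selected features and compares the two label assignments. The paper uses the fuzzy c-means algorithm and also reports the Jaccard index and an SVM-based OCA; in this sketch scikit-learn's KMeans stands in for FCM and only SS, NMI, and ARI are computed, so it illustrates the procedure rather than reproducing the reported numbers.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def sammon_stress(X_full, X_selected):
    """Equation (7): compares pairwise distances in the original and reduced spaces."""
    d = pdist(X_full)
    d_hat = pdist(X_selected)
    mask = d > 0
    return np.sum((d[mask] - d_hat[mask]) ** 2 / d[mask]) / np.sum(d[mask])

def structure_scores(X_full, selected_idx, n_clusters, seed=0):
    """Cluster in both spaces and report SS, NMI, ARI (KMeans is a stand-in for FCM here)."""
    X_sel = X_full[:, selected_idx]
    labels_full = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_full)
    labels_sel = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X_sel)
    return {"SS": sammon_stress(X_full, X_sel),
            "NMI": normalized_mutual_info_score(labels_full, labels_sel),
            "ARI": adjusted_rand_score(labels_full, labels_sel)}

# Example on synthetic data: select 3 of 7 features and use 8 clusters (as for E. coli).
X = np.random.rand(200, 7)
print(structure_scores(X, selected_idx=[0, 2, 5], n_clusters=8))
```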
Tables 2 and 3 summarize the performances of the proposed method and the other comparing methods on the training and test sets, respectively, for the E. coli dataset. We tabulate the four previously mentioned structure-preserving measures and one classifier score for two choices of the number of selected features, $Q = 3$ and $Q = 4$, in Tables 2 and 3.
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=3 | Q=4 | Q=3 | Q=4 | Q=3 | Q=4 | Q=3 | Q=4 | Q=3 | Q=4 | |
ICA | 0.2323 | 0.2146 | 0.4250 | 0.4220 | 0.2643 | 0.2617 | 0.4250 | 0.4220 | 0.6788 | 0.6656 |
F Score | 0.1547 | 0.0648 | 0.5417 | 0.6124 | 0.3520 | 0.5073 | 0.5417 | 0.6124 | 0.6325 | 0.7815 |
Mutual Info | 0.0726 | 0.0204 | 0.6061 | 0.7511 | 0.4993 | 0.6884 | 0.6061 | 0.7511 | 0.6823 | 0.6719 |
FSMLP | 0.1391 | 0.1098 | 0.4356 | 0.4751 | 0.2705 | 0.3167 | 0.4356 | 0.4751 | 0.8391 | 0.8828 |
FSMLPstruct, | 0.1391 | 0.1098 | 0.4356 | 0.4751 | 0.2705 | 0.3167 | 0.4356 | 0.4751 | 0.8391 | 0.8828 |
FSMLPstruct, | 0.0902 | 0.0391 | 0.5338 | 0.6138 | 0.3868 | 0.4873 | 0.5338 | 0.6138 | 0.9000 | 0.9007 |
FSMLPstruct, | 0.0766 | 0.0280 | 0.5812 | 0.7137 | 0.4497 | 0.6362 | 0.5812 | 0.7137 | 0.8675 | 0.8768 |
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=3 | Q=4 | Q=3 | Q=4 | Q=3 | Q=4 | Q=3 | Q=4 | Q=3 | Q=4 | |
ICA | 0.2562 | 0.2332 | 0.6191 | 0.6068 | 0.2858 | 0.2775 | 0.6191 | 0.6068 | 0.5588 | 0.6471 |
F Score | 0.1389 | 0.0682 | 0.7124 | 0.7150 | 0.4579 | 0.4552 | 0.7124 | 0.7150 | 0.5882 | 0.6471 |
Mutual Info | 0.0763 | 0.0426 | 0.7047 | 0.8047 | 0.4541 | 0.6114 | 0.7047 | 0.8047 | 0.4545 | 0.5909 |
FSMLP | 0.1217 | 0.0976 | 0.7341 | 0.7416 | 0.5284 | 0.5449 | 0.7341 | 0.7416 | 0.6824 | 0.7000 |
FSMLPstruct, | 0.1217 | 0.0976 | 0.7341 | 0.7416 | 0.5284 | 0.5449 | 0.7341 | 0.7416 | 0.6824 | 0.7000 |
FSMLPstruct, | 0.0949 | 0.0382 | 0.6910 | 0.8179 | 0.4215 | 0.6669 | 0.6910 | 0.8179 | 0.7235 | 0.8118 |
FSMLPstruct, | 0.0932 | 0.0346 | 0.6921 | 0.8571 | 0.4171 | 0.7029 | 0.6921 | 0.8571 | 0.7412 | 0.8588 |
As we have already discussed in Sec. 2, the lower the value of SS, the better the projected space (formed by the selected features) preserves the original pairwise distances and hence the structure of the original data. We observe in Table 2 that the mutual information based method shows the lowest value of SS and the second lowest is FSMLPstruct with the largest of the three $\lambda$s, for both $Q = 3$ and $Q = 4$. Actually, the SS values of the mutual information based method and of FSMLPstruct with the largest $\lambda$ are almost the same, equal up to two places after the decimal point for both choices of $Q$. The SS values achieved by ICA, the F-score based method, and FSMLP are comparatively higher. So, the mutual information based method and FSMLPstruct with the largest $\lambda$ preserve the original pairwise distances best in the projected space and are hence also expected to preserve the structures best. The values of the other three structure-preserving measures, i.e., NMI, ARI, and JI, confirm that. We know that the higher the values of NMI, ARI, and JI, the closer the clustering structure of the projected space is to the original clustering structure. The highest values of NMI, ARI, and JI are obtained by the mutual information based method, followed by FSMLPstruct with the largest $\lambda$. So, in cluster structure preservation, the mutual information based method and FSMLPstruct with the largest $\lambda$ perform better than the other three methods and even than the other two models trained by FSMLPstruct with smaller $\lambda$s. $\lambda$ is the weight of the regularizer in (7). Although SS and the regularizer are not computed on exactly the same representation, under the influence of $\lambda$ it is expected that the higher the value of $\lambda$, the lower the value of SS, and Table 2 reconfirms that: the SS values decrease as $\lambda$ increases. The SS values of FSMLP (which is basically FSMLPstruct with $\lambda = 0$) and of FSMLPstruct with the smallest non-zero $\lambda$ are the same for both choices of $Q$; in fact, FSMLP and FSMLPstruct with the smallest $\lambda$ have all ten measures the same. This shows that, for the E. coli dataset, the smallest $\lambda$ does not give any effective weightage to the structure-preservation term and the method chooses the same subsets as FSMLP. For the classification performance measure OCA, FSMLPstruct with the intermediate $\lambda$ achieves the highest value, followed by FSMLPstruct with the largest $\lambda$. The mutual information based method and FSMLPstruct with the largest $\lambda$ have all the structure-preserving measures either almost equal or of comparable values; however, for OCA, FSMLPstruct with the largest $\lambda$ is better than the mutual information based method by a margin of more than 18%. For the E. coli data, the test set follows the trends observed on the training set with the following exceptions. First, for $Q = 3$, the values of NMI, ARI, and JI do not increase as $\lambda$ increases. Second, for $Q = 4$, FSMLPstruct with the largest $\lambda$ beats all the methods, including the mutual information based method. Analyzing the performances over the training and test sets, for the E. coli data, FSMLPstruct with the largest $\lambda$ is the winner among the seven models considered.
Tables 4 and 5 compare the performance of the proposed method with other methods in terms of different criteria for the Glass dataset on its training and test sets, respectively.
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=4 | Q=5 | Q=4 | Q=5 | Q=4 | Q=5 | Q=4 | Q=5 | Q=4 | Q=5 | |
ICA | 0.2699 | 0.2216 | 0.5066 | 0.5416 | 0.3343 | 0.3688 | 0.5066 | 0.5416 | 0.6771 | 0.6823 |
F Score | 0.1284 | 0.1011 | 0.5961 | 0.6343 | 0.5137 | 0.5750 | 0.5961 | 0.6343 | 0.6979 | 0.7396 |
Mutual Info | 0.2091 | 0.1159 | 0.5073 | 0.6113 | 0.3321 | 0.5688 | 0.5073 | 0.6113 | 0.6823 | 0.6719 |
FSMLP | 0.1968 | 0.1593 | 0.5512 | 0.6028 | 0.4473 | 0.5235 | 0.5512 | 0.6028 | 0.6302 | 0.7094 |
FSMLPstruct, | 0.1968 | 0.0747 | 0.5512 | 0.7501 | 0.4473 | 0.6873 | 0.5512 | 0.7501 | 0.6302 | 0.7510 |
FSMLPstruct, | 0.0494 | 0.0623 | 0.7732 | 0.7965 | 0.7124 | 0.7665 | 0.7732 | 0.7965 | 0.7740 | 0.7708 |
FSMLPstruct, | 0.0309 | 0.0406 | 0.8551 | 0.7394 | 0.8507 | 0.6792 | 0.8551 | 0.7394 | 0.8125 | 0.7667 |
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=4 | Q=5 | Q=4 | Q=5 | Q=4 | Q=5 | Q=4 | Q=5 | Q=4 | Q=5 | |
ICA | 0.3039 | 0.2333 | 0.6536 | 0.6063 | 0.3409 | 0.2979 | 0.6536 | 0.6063 | 0.4545 | 0.4545 |
F Score | 0.1393 | 0.1258 | 0.7028 | 0.7744 | 0.5746 | 0.6861 | 0.7028 | 0.7744 | 0.5455 | 0.6364 |
Mutual Info | 0.2034 | 0.1372 | 0.7805 | 0.7728 | 0.6358 | 0.6655 | 0.7805 | 0.7728 | 0.4545 | 0.5909 |
FSMLP | 0.2027 | 0.1669 | 0.7366 | 0.7425 | 0.5053 | 0.5380 | 0.7366 | 0.7425 | 0.5727 | 0.6727 |
FSMLPstruct, | 0.2027 | 0.0806 | 0.7366 | 0.7885 | 0.5053 | 0.6190 | 0.7366 | 0.7885 | 0.5727 | 0.6636 |
FSMLPstruct, | 0.0406 | 0.0636 | 0.8265 | 0.8300 | 0.6951 | 0.7355 | 0.8265 | 0.8300 | 0.6818 | 0.6727 |
FSMLPstruct, | 0.0215 | 0.0372 | 0.9022 | 0.7950 | 0.8670 | 0.6210 | 0.9022 | 0.7950 | 0.6727 | 0.6818 |
The chosen numbers of selected features for the Glass data are $Q = 4$ and $Q = 5$. The expected behaviour of decreasing SS with increasing $\lambda$ is clearly observed for $Q = 5$ for both the training and the test set. For $Q = 4$, the Glass data also follows the characteristic of the E. coli data of having the same values for FSMLP and FSMLPstruct with the smallest $\lambda$ in all ten measures for both the training and the test set. For $Q = 4$, from the smallest non-zero $\lambda$ onwards, increasing $\lambda$ produces decreasing SS values and increasing NMI, ARI, and JI values for both the training and the test sets. We observe from Tables 4 and 5 that, for $Q = 5$, as $\lambda$ increases from $0$ (FSMLP) to the smallest non-zero value and then to the intermediate value, the NMI, ARI, and JI values increase for both the training and the test sets; however, at the largest $\lambda$, the NMI, ARI, and JI values decrease compared to the intermediate $\lambda$. We can conclude that, for $Q = 4$, FSMLPstruct with the largest $\lambda$ gives the best structure-preserving performance among the considered models and, for $Q = 5$, FSMLPstruct with the intermediate $\lambda$ is the best in structure preservation. In terms of the classification performance measure OCA, for $Q = 4$, FSMLPstruct with the largest $\lambda$ and FSMLPstruct with the intermediate $\lambda$ show the highest OCA values for the training set and the test set, respectively. On the other hand, for $Q = 5$, FSMLPstruct with the intermediate $\lambda$ shows the highest OCA value for the training set and FSMLPstruct with the largest $\lambda$ shows the highest OCA value for the test set. Inspecting all the performance measures, we conclude that, for the Glass dataset, both FSMLPstruct with the intermediate $\lambda$ and FSMLPstruct with the largest $\lambda$ are comparatively better at simultaneously preserving both class and cluster structures than the other methods.
The performances for the Ionosphere dataset are recorded in Tables 6 and 7 for the training and test sets, respectively.
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=12 | Q=17 | Q=12 | Q=17 | Q=12 | Q=17 | Q=12 | Q=17 | Q=12 | Q=17 | |
ICA | 0.1791 | 0.0926 | 0.8261 | 0.8817 | 0.8766 | 0.9250 | 0.8261 | 0.8817 | 0.9492 | 0.9714 |
F Score | 0.1753 | 0.0914 | 0.9035 | 0.9313 | 0.9497 | 0.9621 | 0.9035 | 0.9313 | 0.9397 | 0.9746 |
Mutual Info | 0.1776 | 0.0803 | 0.6829 | 0.8645 | 0.7838 | 0.9250 | 0.6829 | 0.8645 | 0.9746 | 0.9651 |
FSMLP | 0.1721 | 0.0895 | 0.6292 | 0.7330 | 0.7246 | 0.8192 | 0.6292 | 0.7330 | 0.9606 | 0.9733 |
FSMLPstruct, | 0.1675 | 0.0878 | 0.6689 | 0.7415 | 0.7641 | 0.8243 | 0.6689 | 0.7415 | 0.9638 | 0.9733 |
FSMLPstruct, | 0.1505 | 0.0776 | 0.7307 | 0.7867 | 0.8051 | 0.8582 | 0.7307 | 0.7867 | 0.9632 | 0.9568 |
FSMLPstruct, | 0.1437 | 0.0766 | 0.7894 | 0.8111 | 0.8533 | 0.8723 | 0.7894 | 0.8111 | 0.9683 | 0.9644 |
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=12 | Q=17 | Q=12 | Q=17 | Q=12 | Q=17 | Q=12 | Q=17 | Q=12 | Q=17 | |
ICA | 0.1652 | 0.0964 | 0.7351 | 0.7351 | 0.7782 | 0.7782 | 0.7351 | 0.7351 | 0.9143 | 0.9429 |
F Score | 0.1952 | 0.1003 | 0.4325 | 0.4778 | 0.3411 | 0.4152 | 0.4325 | 0.4778 | 0.9429 | 0.9429 |
Mutual Info | 0.2186 | 0.0972 | 0.5287 | 0.6541 | 0.4959 | 0.6774 | 0.5287 | 0.6541 | 0.9429 | 0.9429 |
FSMLP | 0.1801 | 0.0913 | 0.6518 | 0.8284 | 0.6841 | 0.8682 | 0.6518 | 0.8284 | 0.9371 | 0.9143 |
FSMLPstruct, | 0.1722 | 0.0885 | 0.7434 | 0.8284 | 0.7714 | 0.8682 | 0.7434 | 0.8284 | 0.9371 | 0.9314 |
FSMLPstruct, | 0.1437 | 0.0726 | 0.7764 | 0.8597 | 0.8265 | 0.8884 | 0.7764 | 0.8597 | 0.8971 | 0.9257 |
FSMLPstruct, | 0.1310 | 0.0720 | 0.7960 | 0.9146 | 0.8454 | 0.9328 | 0.7960 | 0.9146 | 0.9029 | 0.9257 |
For the Ionosphere data set, the number of selected features $Q$ is set to 12 and 17. Here, in all the cases, whenever $\lambda$ increases, SS decreases and the other structure-preserving indices NMI, ARI, and JI increase consistently. Unlike the E. coli and Glass data sets, here when $\lambda$ increases from $0$ (in FSMLP) to the smallest non-zero value, the structure-preserving metrics, including SS, shift in the desired direction in most of the cases and remain the same in a few. On the training set, except for SS, ICA and the F-score based method perform better than FSMLP and FSMLPstruct in the other three structure-preserving measures in all cases. Classification performance is good for almost all methods on the Ionosphere data set. In the training set, the highest accuracy (0.9746) is reached by the mutual information based method for $Q = 12$ and by the F-score based method for $Q = 17$; however, FSMLP and the FSMLPstruct models reach more than 0.95 accuracy in every case. For the test set, the structure-preserving indices are generally better for FSMLP and FSMLPstruct than for the ICA, F-score, and mutual information based methods, although in terms of the classification score OCA, the F-score and mutual information based methods perform marginally better than the FSMLP and FSMLPstruct models. This may have happened because the features selected by the neural network based classifier, which are expected to be discriminatory, may not be the best for the SVM. Moreover, FSMLPstruct makes a compromise between preserving cluster structure and the classifier loss. For the Ionosphere dataset, our proposed models are not the winners. Maybe with a higher $\lambda$, FSMLPstruct would deliver better scores.
For the Sonar data, the performances on the training and test sets in terms of the five measures for two choices of the number of selected features are summarized in Tables 8 and 9.
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=21 | Q=30 | Q=21 | Q=30 | Q=21 | Q=30 | Q=21 | Q=30 | Q=21 | Q=30 | |
ICA | 0.7830 | 0.4963 | 0.0004 | 0.0276 | -0.0040 | 0.0382 | 0.0004 | 0.0276 | 0.7112 | 0.8075 |
F Score | 0.3614 | 0.2029 | 0.1167 | 0.1778 | 0.1159 | 0.1926 | 0.1167 | 0.1778 | 0.8289 | 0.9733 |
Mutual Info | 0.2247 | 0.0950 | 0.4601 | 0.6601 | 0.5662 | 0.7585 | 0.4601 | 0.6601 | 0.9733 | 0.9465 |
FSMLP | 0.1880 | 0.0910 | 0.3735 | 0.5293 | 0.4432 | 0.6104 | 0.3735 | 0.5293 | 0.9273 | 0.9765 |
FSMLPstruct, | 0.1314 | 0.0779 | 0.4917 | 0.5587 | 0.5698 | 0.6358 | 0.4917 | 0.5587 | 0.9775 | 0.9722 |
FSMLPstruct, | 0.0620 | 0.0285 | 0.5743 | 0.6838 | 0.6620 | 0.7696 | 0.5743 | 0.6838 | 0.9872 | 0.9840 |
FSMLPstruct, | 0.0481 | 0.0201 | 0.6983 | 0.7473 | 0.7751 | 0.8318 | 0.6983 | 0.7473 | 0.9936 | 0.9968 |
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=21 | Q=30 | Q=21 | Q=30 | Q=21 | Q=30 | Q=21 | Q=30 | Q=21 | Q=30 | |
ICA | 0.7556 | 0.4968 | 0.2513 | 0.1993 | 0.1546 | 0.2383 | 0.2513 | 0.1993 | 0.7143 | 0.5714 |
F Score | 0.3629 | 0.1941 | 0.2906 | 0.3029 | 0.3519 | 0.3536 | 0.2906 | 0.3029 | 0.5714 | 0.8571 |
Mutual Info | 0.2144 | 0.0955 | 0.2906 | 0.5411 | 0.3519 | 0.6378 | 0.2906 | 0.5411 | 0.8095 | 0.8095 |
FSMLP | 0.1967 | 0.0977 | 0.3918 | 0.5288 | 0.4633 | 0.5850 | 0.3918 | 0.5288 | 0.8190 | 0.8857 |
FSMLPstruct, | 0.1436 | 0.0840 | 0.4026 | 0.5540 | 0.4368 | 0.6155 | 0.4026 | 0.5540 | 0.7714 | 0.8762 |
FSMLPstruct, | 0.0679 | 0.0302 | 0.6905 | 0.6824 | 0.7223 | 0.7146 | 0.6905 | 0.6824 | 0.7619 | 0.8381 |
FSMLPstruct, | 0.0473 | 0.0198 | 0.5001 | 0.7156 | 0.5547 | 0.7490 | 0.5001 | 0.7156 | 0.8000 | 0.7810 |
We set $Q = 21$ and $Q = 30$ for the Sonar data set. In the case of the Sonar data set, not only do all the structure-preserving indices improve with increasing $\lambda$, but also, on the training set, FSMLPstruct with the two larger $\lambda$s is significantly better than the ICA, F-score, and mutual information based methods and FSMLP in all five scores for both choices of $Q$. On the test set, in some cases FSMLPstruct with the intermediate $\lambda$ is better than FSMLPstruct with the largest $\lambda$. For the Sonar data set, clearly, the proposed method performs extremely well in terms of classification and clustering performance.
Tables 10 and 11 summarize the performances of the proposed method and the other comparing methods on the training and test sets, respectively, for the AR10P data set. The original number of features for the AR10P data set is 2400, which is considerably higher than that of the other four data sets used in this subsection. The two choices of the number of selected features here are $Q = 40$ and $Q = 60$; these are not the same fractions of the original dimension as used in the previous cases.
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=40 | Q=60 | Q=40 | Q=60 | Q=40 | Q=60 | Q=40 | Q=60 | Q=40 | Q=60 | |
ICA | 0.7619 | 0.7107 | 0.2807 | 0.2449 | 0.2222 | 0.2066 | 0.2807 | 0.2449 | 1 | 1 |
F Score | 0.7933 | 0.7484 | 0.1693 | 0.1709 | 0.0550 | 0.0438 | 0.1693 | 0.1709 | 1 | 1 |
Mutual Info | 0.7805 | 0.7309 | 0.1613 | 0.1863 | 0.0429 | 0.0558 | 0.1613 | 0.1863 | 0.9915 | 0.9915 |
FSMLP | 0.7640 | 0.7109 | 0.2296 | 0.2108 | 0.1038 | 0.0943 | 0.2296 | 0.2108 | 0.9983 | 1 |
FSMLPstruct, | 0.7566 | 0.7061 | 0.3828 | 0.5251 | 0.3907 | 0.5654 | 0.3828 | 0.5251 | 1 | 1 |
FSMLPstruct, | 0.7567 | 0.7059 | 0.5307 | 0.5089 | 0.5575 | 0.5268 | 0.5307 | 0.5089 | 1 | 1 |
FSMLPstruct, | 0.7564 | 0.7061 | 0.4849 | 0.5830 | 0.4852 | 0.6353 | 0.4849 | 0.5830 | 1 | 1 |
Method | SS | NMI | ARI | JI | OCA | |||||
---|---|---|---|---|---|---|---|---|---|---|
Q=40 | Q=60 | Q=40 | Q=60 | Q=40 | Q=60 | Q=40 | Q=60 | Q=40 | Q=60 | |
ICA | 0.7739 | 0.7256 | 0.8255 | 0.8222 | 0.0031 | -0.0523 | 0.8255 | 0.8222 | 0.6154 | 0.6154 |
F Score | 0.7364 | 0.6856 | 0.9072 | 0.9152 | 0.3645 | 0.4155 | 0.9072 | 0.9152 | 0.8462 | 0.8462 |
Mutual Info | 0.7614 | 0.7112 | 0.8451 | 0.8451 | 0.0122 | 0.0122 | 0.8451 | 0.8451 | 0.9231 | 0.8462 |
FSMLP | 0.7838 | 0.7213 | 0.9051 | 0.9152 | 0.3686 | 0.4155 | 0.9051 | 0.9152 | 0.8462 | 0.7538 |
FSMLPstruct, | 0.7795 | 0.7346 | 0.9223 | 0.9146 | 0.4820 | 0.4380 | 0.9223 | 0.9146 | 0.4615 | 0.6154 |
FSMLPstruct, | 0.7815 | 0.7352 | 0.9050 | 0.9050 | 0.3782 | 0.3782 | 0.9050 | 0.9050 | 0.4615 | 0.6154 |
FSMLPstruct, | 0.7804 | 0.7356 | 0.9056 | 0.9136 | 0.3557 | 0.4727 | 0.9056 | 0.9136 | 0.4615 | 0.6154 |
The study in [42] proposed a feature selection scheme with controlled redundancy in the selected features; for the AR10P data set, they reported the average number of selected features without redundancy control and a range of average numbers of selected features when redundancy control was practiced. Guided by those values, we choose the numbers of selected features as $Q = 40$ and $Q = 60$. From the classification scores shown in Table 10, we note that for all the methods and both choices of $Q$, the classification scores on the training set are more than 99%. In the training set, we observe that for FSMLPstruct, as $\lambda$ increases, SS decreases in almost all cases; for the test set, this is not true. For the other structure-preserving measures on the training set, FSMLPstruct with the intermediate $\lambda$ is the best among all the methods for $Q = 40$ and FSMLPstruct with the largest $\lambda$ is the best for $Q = 60$. On the test set, all the methods perform almost the same in terms of the structure-preserving measures. The classification performance of FSMLPstruct is very poor on the test set for the AR10P data. The significant differences between the training and test OCA values for FSMLPstruct indicate poor generalization of the system. This problem may be addressed by choosing the number of nodes of our MLP-based model through cross-validation.
Results from the five data sets clearly establish the benefit of introducing the proposed structure-preserving regularizer term in the overall loss function (8) of the MLP-based embedded feature selection scheme. Next, we shall consider the band (channel) selection problem for hyperspectral satellite images.
3.2 Band selection in hyperspectral images
Let the considered hyperspectral image $I$ be of dimension $H \times W \times B$, where $H$, $W$, and $B$ are the height, width, and number of spectral bands of the image, respectively. We can represent the pixels of $I$ as $B$-dimensional vectors. Let there be a total of $n$ pixels annotated with land cover classes. Without any loss of generality, we take the first $n$ pixels, i.e., $\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n$, as the pixels having class labels. Our input data for the land cover classification problem is $X = \{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$. The collection of class labels of $X$ is $Y = \{y_1, y_2, \ldots, y_n\}$, where $y_i$ is the class label corresponding to $\mathbf{x}_i$. We aim to select a subset of size $Q$ from the original set of $B$ bands such that the selected subset performs reasonably well for land cover classification as well as in clustering. We have performed the experiments with three benchmark HSI datasets for land cover classification problems: Indian pines, Pavia University, and Salinas [53]. We have used the corrected versions of the Indian pines and Salinas datasets, having 200 and 204 bands, respectively. The Pavia University dataset uses 103 bands. The pre-processing of the datasets is the same as done in [54], following the code available in [55]. For any dataset, its pixel values are scaled to $[0, 1]$ using the global minimum and maximum pixel values computed over the entire HSI. The data are then mean normalized across each channel by subtracting the channel-wise means. The datasets are partitioned into training and test sets. For band selection, only the training sets are fed to the model; for measuring performance, both training and test sets are used. For splitting the datasets into training and test subsets, we drop the pixels of unknown land-cover type. Let $P$ be the set of pixels with known land-cover type. To obtain the training and test sets, we divide $P$ into two subsets $P_1$ and $P_2$ such that $P_1 \cup P_2 = P$, $P_1 \cap P_2 = \emptyset$, and $P_1$ and $P_2$ contain fixed fractions of the pixels of $P$. We use $P_2$ as the test set. Note that these datasets suffer from the class imbalance problem. To avoid the learning difficulty raised by class imbalance, in the training set we consider the same number of instances, say $n_c$, from each class. For this, from the subset $P_1$, we randomly select (without replacement) $n_c$ pixels per class. If a class has fewer than $n_c$ instances in $P_1$, we oversample the class by the synthetic minority oversampling technique (SMOTE) [56] to gather $n_c$ points. For band selection also, we use the same neural network (Fig. 1), but now with three hidden layers. Here the number of input nodes of the MLP is equal to the number of bands ($B$). The network weights and the gate parameters $\gamma_j$ are initialized in the same way as described before. For all experiments of the current subsection, $\mu_1$ and $\mu_2$ of the error functions in Equations (6) and (10) are set to fixed values, and the total number of iterations for training the network is kept fixed. The rest of the experimental settings are the same as in the previously described experiments. The numbers of training instances for the Indian pines and Salinas data sets and for Pavia University are all high. Computation of $E_{struct}$ in (7) would involve computing $O(n^2)$ pairwise distances, and adding it to the overall loss function would cause very intensive computation at each iteration. So, instead of $E_{struct}$, its proposed approximation $\tilde{E}_{struct}$ defined in (9) is used; the subset size $n_s$ in (9) is set to the same value for all three datasets. Varying the value of $\lambda$ in (10), we analyse its effect on OCA, SS, NMI, ARI, and JI. We compute SS, NMI, ARI, and JI as described in Subsec. 3.1.
We also use the same clustering algorithm with the same settings as used in Subsec. 3.1.
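A sketch of this pre-processing pipeline is given below: global min–max scaling of the hyperspectral cube, channel-wise mean removal, and drawing the same number of labelled pixels per class. The cube shape and the per-class count are illustrative, and the paper's SMOTE-based oversampling of small classes [56] is replaced here by simple resampling with replacement, as noted in the code.

```python
import numpy as np

def preprocess_cube(cube):
    """cube: H x W x B hyperspectral image; scale to [0, 1] with the global min/max, then subtract channel means."""
    cube = (cube - cube.min()) / (cube.max() - cube.min())
    return cube - cube.mean(axis=(0, 1), keepdims=True)

def balanced_training_pixels(X, y, per_class, seed=0):
    """X: n x B labelled pixels, y: their class labels; draw per_class pixels from every class."""
    rng = np.random.default_rng(seed)
    Xb, yb = [], []
    for c in np.unique(y):
        idx = np.where(y == c)[0]
        if len(idx) >= per_class:
            chosen = rng.choice(idx, size=per_class, replace=False)
        else:
            # The paper tops up such under-represented classes with SMOTE [56];
            # plain resampling with replacement is used here only to keep the sketch short.
            chosen = rng.choice(idx, size=per_class, replace=True)
        Xb.append(X[chosen])
        yb.append(y[chosen])
    return np.vstack(Xb), np.concatenate(yb)

# Example with a cube of shape (145, 145, 200), as for the corrected Indian pines data.
cube = preprocess_cube(np.random.rand(145, 145, 200))
X = cube.reshape(-1, 200)                       # pixels as 200-dimensional band vectors
y = np.random.randint(0, 16, size=X.shape[0])   # dummy labels; real labels accompany the dataset
X_train, y_train = balanced_training_pixels(X, y, per_class=100)
```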
Tables 12 and 13 summarize the comparative results of FSMLPstruct with FSMLP and other band selection methods, ICA, F score, and mutual information based filter methods on the training and test datasets of Indian pines respectively.
Method | SS | NMI | ARI | JI | OCA |
---|---|---|---|---|---|
ICA | 0.6982 | 0.5437 | 0.3071 | 0.3335 | 0.7497 |
F Score | 0.5325 | 0.5567 | 0.3015 | 0.0704 | 0.7978 |
Mutual Info | 0.1966 | 0.5633 | 0.3232 | 0.2493 | 0.8600 |
FSMLP | 0.0162 | 0.7994 | 0.6530 | 0.6827 | 0.9157 |
FSMLPstruct, λ=2 | 0.0153 | 0.8027 | 0.6588 | 0.6894 | 0.9165
FSMLPstruct, λ=5 | 0.0140 | 0.8051 | 0.6627 | 0.6945 | 0.9155
FSMLPstruct, λ=20 | 0.0107 | 0.8188 | 0.6953 | 0.7216 | 0.9153
FSMLPstruct, λ=50 | 0.0071 | 0.8400 | 0.7404 | 0.7549 | 0.9179
Method | SS | NMI | ARI | JI | OCA |
---|---|---|---|---|---|
ICA | 0.6748 | 0.5360 | 0.3052 | 0.2983 | 0.6175 |
F Score | 0.5457 | 0.5518 | 0.2951 | 0.1643 | 0.6738 |
Mutual Info | 0.2041 | 0.5787 | 0.3283 | 0.2395 | 0.7298 |
FSMLP | 0.0158 | 0.8729 | 0.7949 | 0.8109 | 0.7871 |
FSMLPstruct, λ=2 | 0.0148 | 0.8831 | 0.8210 | 0.8384 | 0.7871
FSMLPstruct, λ=5 | 0.0137 | 0.8717 | 0.7881 | 0.8017 | 0.7881
FSMLPstruct, λ=20 | 0.0102 | 0.8896 | 0.8346 | 0.8507 | 0.7882
FSMLPstruct, λ=50 | 0.0066 | 0.8926 | 0.8360 | 0.8505 | 0.7850
Similarly, Tables 14 and 15 summarize the comparative results on the training and test sets of Pavia University.
Method | SS | NMI | ARI | JI | OCA |
---|---|---|---|---|---|
ICA | 0.3231 | 0.6941 | 0.5858 | 0.6340 | 0.8433 |
F Score | 0.2386 | 0.6451 | 0.4654 | 0.4748 | 0.8233 |
Mutual Info | 0.1387 | 0.8905 | 0.8769 | 0.8871 | 0.8717 |
FSMLP | 0.1686 | 0.9072 | 0.8925 | 0.8863 | 0.9213 |
FSMLPstruct, λ=1 | 0.1625 | 0.9169 | 0.9014 | 0.8952 | 0.9213
FSMLPstruct, λ=1.5 | 0.1616 | 0.9233 | 0.9112 | 0.9033 | 0.9212
FSMLPstruct, λ=2 | 0.1592 | 0.9243 | 0.9121 | 0.9044 | 0.9213
FSMLPstruct, λ=2.5 | 0.1590 | 0.9198 | 0.9056 | 0.8993 | 0.9218
Method | SS | NMI | ARI | JI | OCA |
---|---|---|---|---|---|
ICA | 0.3215 | 0.5533 | 0.3777 | 0.2082 | 0.7573 |
F Score | 0.3350 | 0.5091 | 0.3126 | 0.0886 | 0.7222 |
Mutual Info | 0.1433 | 0.8911 | 0.8814 | 0.8990 | 0.7785 |
FSMLP | 0.1584 | 0.9001 | 0.8930 | 0.9084 | 0.8628 |
FSMLPstruct, λ=1 | 0.1568 | 0.9156 | 0.9131 | 0.9254 | 0.8650
FSMLPstruct, λ=1.5 | 0.1573 | 0.9175 | 0.9149 | 0.9270 | 0.8672
FSMLPstruct, λ=2 | 0.1556 | 0.9189 | 0.9166 | 0.9279 | 0.8666
FSMLPstruct, λ=2.5 | 0.1562 | 0.9168 | 0.9143 | 0.9261 | 0.8648
In this experiment, we have fixed the number of selected bands $Q$ at approximately the same fraction of the original number of bands $B$ for each dataset; consequently, more bands are selected for Indian pines ($B = 200$) than for Pavia University ($B = 103$). Tables 12 and 13 record the values of the structure-preserving indices and classification scores on Indian pines for different values of $\lambda$ in FSMLPstruct ($\lambda$ as in Equation (10)). The $\lambda$s considered for Indian pines are 2, 5, 20, and 50. Note here that FSMLP is basically FSMLPstruct with $\lambda = 0$. We observe in Tables 12 and 13 that, both for the training and the test sets, as the value of $\lambda$ increases for FSMLPstruct (in the last five rows of the corresponding tables), the value of SS becomes smaller. A similar trend is also observed for the Pavia University data set (here, $\lambda$ varies over 1, 1.5, 2, and 2.5) for the training (Table 14) and test (Table 15) sets.
For the Pavia University dataset, we have set the values of $\lambda$ in FSMLPstruct as 1, 1.5, 2, and 2.5. Unlike Indian pines, for Pavia University we restrict the $\lambda$s to lower values. This is due to the fact that the number of selected bands for Pavia University is smaller than that for Indian pines; the smaller the number of selected bands, the smaller the importance ($\lambda$) that should be given to our structure-preserving regularizer in Equation (9) to obtain the desired balance between classification and clustering performance. Table 12, which contains the results for the Indian pines training data, clearly shows that both FSMLP and FSMLPstruct are better than the ICA, F-score based, and mutual information based methods in all four structure-preserving metrics as well as in terms of OCA. In Table 12 we also observe that with increasing values of $\lambda$ there is a consistent improvement in the values of the four structure-preserving metrics, while the OCA values remain essentially unchanged at around 0.92. The results shown in Table 13 for the Indian pines test set also show that FSMLP and FSMLPstruct perform better than the other three methods in terms of all five metrics. Also, with an increase in $\lambda$, the structure-preserving metrics generally improve for FSMLPstruct, except that the value of JI slightly decreases when $\lambda$ goes from 20 to 50. The classification metric OCA is around 0.79 with the bands selected by FSMLPstruct for the different choices of $\lambda$.
It is notable here that the test set is completely unseen in the process of band selection, yet the bands selected by the proposed method provide fairly good results for structure preservation as well as for classification. As observed from Table 14 and Table 15, for the Pavia University training and test sets respectively, the lowest (best) SS value among all the comparing methods is achieved by the mutual information based filter method. However, for the other four metrics, i.e., NMI, ARI, JI, and OCA, FSMLP and FSMLPstruct show better values. In the case of the Pavia University dataset, with increasing $\lambda$, NMI, ARI, and JI do not increase consistently, but the results indicate that it is possible to find a $\lambda$ (here $\lambda = 2$) where the structures are preserved better while maintaining a good classification score. Tables 16 and 17 summarize the comparative results on the training and test sets of Salinas.
Method | SS | NMI | ARI | JI | OCA |
---|---|---|---|---|---|
ICA | 0.7159 | 0.7066 | 0.5385 | 0.5111 | 0.9325 |
F Score | 0.0167 | 0.9420 | 0.9307 | 0.9291 | 0.9650 |
Mutual Info | 0.0902 | 0.9207 | 0.8775 | 0.8860 | 0.9628 |
FSMLP | 0.0072 | 0.9672 | 0.9589 | 0.9628 | 0.9654 |
FSMLPstruct, | 0.0078 | 0.9678 | 0.9603 | 0.9640 | 0.9658 |
FSMLPstruct, | 0.0072 | 0.9691 | 0.9626 | 0.9659 | 0.9648 |
FSMLPstruct, | 0.0055 | 0.9710 | 0.9657 | 0.9683 | 0.9646 |
FSMLPstruct, | 0.0039 | 0.9688 | 0.9630 | 0.9659 | 0.9649 |
Method | SS | NMI | ARI | JI | OCA |
---|---|---|---|---|---|
ICA | 0.7027 | 0.6297 | 0.4144 | 0.3490 | 0.8670 |
F Score | 0.0230 | 0.8584 | 0.7483 | 0.7615 | 0.9081 |
Mutual Info | 0.0820 | 0.8791 | 0.7950 | 0.8230 | 0.8950 |
FSMLP | 0.0078 | 0.9368 | 0.9183 | 0.9265 | 0.9077 |
FSMLPstruct, | 0.0084 | 0.9392 | 0.9231 | 0.9305 | 0.9076 |
FSMLPstruct, | 0.0079 | 0.9413 | 0.9275 | 0.9340 | 0.9069 |
FSMLPstruct, | 0.0061 | 0.9366 | 0.9175 | 0.9258 | 0.9065 |
FSMLPstruct, | 0.0044 | 0.9342 | 0.9136 | 0.9229 | 0.9073 |
We note from Tables 16 and 17 that, for the Salinas dataset, FSMLP and FSMLPstruct are better than the other three methods in almost all of the five metrics used. All four structure-preserving metric scores of FSMLPstruct are better than or comparable to those of FSMLP, while the classification score OCA stays at approximately 0.96 for the training set and 0.91 for the test set. Tables 16 and 17 also reveal that when $\lambda$ is increased from $0$ to its smallest considered value, the value of SS increases; however, from the second value of $\lambda$ onward, the values of SS decrease. This exception for the Salinas dataset at the smallest $\lambda$ is possibly due to the fact that we do not use the entire training data in Equation (9), and the subset size $n_s$ used in Equation (9) is not adequate to capture the structure of the data faithfully for the Salinas dataset. As discussed earlier, setting the value of $n_s$ is crucial for approximating Equation (7) with Equation (9). We have set $n_s$ to the same value for all three datasets empirically. However, choosing an optimal value of $n_s$ for each dataset is expected to avoid such exceptions.
As we increase the value of $\lambda$, more stress is placed on reducing the loss term in Equation (9). In most cases, increasing $\lambda$ results in a drop in SS. This clearly suggests that the loss function in Equation (9) is a computationally efficient substitute for the original SS defined in Equation (7).
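One simple way to see that Equation (9) tracks Equation (7) is to compare the two quantities on data of moderate size; the self-contained check below does this on synthetic data (the Sammon stress helper is re-implemented here for self-containment, and the subset size of 200 is purely illustrative).

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X_full, X_selected):
    d, d_hat = pdist(X_full), pdist(X_selected)
    mask = d > 0
    return np.sum((d[mask] - d_hat[mask]) ** 2 / d[mask]) / np.sum(d[mask])

rng = np.random.default_rng(0)
X = rng.random((2000, 50))
X_sel = X[:, :20]                        # pretend the first 20 features/bands were selected

full = sammon_stress(X, X_sel)           # Equation (7): uses all ~2x10^6 pairs
subset_estimates = [sammon_stress(X[idx], X_sel[idx])
                    for idx in (rng.choice(len(X), size=200, replace=False) for _ in range(10))]

print(f"full SS = {full:.4f}, subset SS = {np.mean(subset_estimates):.4f} "
      f"+/- {np.std(subset_estimates):.4f}")
```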
We have included the thematic maps (Fig. 2), which reveal that our proposed method is capable of selecting useful bands that broadly capture the land cover types. Figure 2 illustrates thematic maps of the entire region captured in the Indian pines dataset. Figure 2(a) shows the ground truth labels. Figures 2(b), 2(c), and 2(d) are thematic maps of the Indian pines data set using the class labels obtained from the SVM classifier trained on the considered training set represented with bands selected by FSMLPstruct with $\lambda = 0$ (i.e., by FSMLP) and by FSMLPstruct with two larger values of $\lambda$, respectively.




Figure 2 confirms that even with increasing stress on the structure-preserving regularizer (larger $\lambda$), our proposed band selection method FSMLPstruct is able to select bands that maintain good land cover classification performance.
4 Conclusion and Discussions
To the best of our knowledge, a feature selection method that simultaneously cares about class discrimination and structure preservation is not available in the literature. In this study, we have tried to bridge this gap by proposing a neural network-based feature selection method that focuses both on class discrimination and structure preservation. To learn the proposed system, we use Sammon’s stress as a regularizer added to the classification loss. For datasets having a large number of instances, the computational overhead associated with Sammon’s stress is very high. Consequently, as the structure-preserving regularizer, we use Sammon’s stress computed on a sample of the original data (using dynamic sampling at each iteration during the adaptive gradient descent based learning). In experiments with datasets having a large number of instances, we have demonstrated that this is an effective and computationally efficient implementation of the Sammon’s stress based structure-preserving regularizer. Our proposed feature selection scheme is generic, so we have investigated its effectiveness on datasets commonly used for assessing classifiers as well as for a specialized case: band selection in hyperspectral images (HSI). We have applied the feature selection scheme to five real-world datasets that are commonly used for assessing classification. In the context of band selection, we have applied our method to three well-known HSI datasets and compared its performance with three other band selection methods. Based on our experiments, we conclude that the proposed feature selection method produces reasonably good classification and clustering scores on the majority of the data sets, showing that it is capable of selecting a subset of features that is good for both classification and clustering. Our scheme provides a mechanism to control the number of selected features. The proposed method is easily extendable to other networks such as the Radial Basis Function (RBF) network.
References
- [1] R. Zhang, F. Nie, X. Li, X. Wei, Feature selection with multi-view data: A survey, Information Fusion 50 (2019) 158–167.
- [2] S. Solorio-Fernández, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, A review of unsupervised feature selection methods, Artificial Intelligence Review 53 (2) (2020) 907–948.
- [3] Z. Atashgahi, X. Zhang, N. Kichler, S. Liu, L. Yin, M. Pechenizkiy, R. Veldhuis, D. C. Mocanu, Supervised feature selection with neuron evolution in sparse neural networks, arXiv preprint arXiv:2303.07200.
- [4] R. Jiao, B. Xue, M. Zhang, Solving multi-objective feature selection problems in classification via problem reformulation and duplication handling, IEEE Transactions on Evolutionary Computation.
- [5] S. An, M. Zhang, C. Wang, W. Ding, Robust fuzzy rough approximations with knn granules for semi-supervised feature selection, Fuzzy Sets and Systems (2023) 108476.
- [6] Z. Zeng, X. Wang, F. Yan, Y. Chen, Local adaptive learning for semi-supervised feature selection with group sparsity, Knowledge-Based Systems 181 (2019) 104787.
- [7] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40 (1) (2014) 16–28.
- [8] S. Wang, W. Pedrycz, Q. Zhu, W. Zhu, Unsupervised feature selection via maximum projection and minimum redundancy, Knowledge-Based Systems 75 (2015) 19–29.
- [9] L. Du, Y.-D. Shen, Unsupervised feature selection with adaptive structure learning, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 209–218.
- [10] J.-S. Wu, M.-X. Song, W. Min, J.-H. Lai, W.-S. Zheng, Joint adaptive manifold and embedding learning for unsupervised feature selection, Pattern Recognition 112 (2021) 107742.
- [11] Z. Li, J. Liu, Y. Yang, X. Zhou, H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering 26 (9) (2013) 2138–2150.
- [12] C. Hou, F. Nie, X. Li, D. Yi, Y. Wu, Joint embedding learning and sparse regression: A framework for unsupervised feature selection, IEEE transactions on cybernetics 44 (6) (2013) 793–804.
- [13] Z. Zhao, L. Wang, H. Liu, J. Ye, On similarity preserving feature selection, IEEE Transactions on Knowledge and Data Engineering 25 (3) (2011) 619–632.
- [14] W. Gao, L. Hu, Y. Li, P. Zhang, Preserving similarity and staring decisis for feature selection, IEEE Transactions on Artificial Intelligence 2 (6) (2021) 584–593.
- [15] Q. Gu, Z. Li, J. Han, Generalized fisher score for feature selection, arXiv preprint arXiv:1202.3725.
- [16] Q. Song, H. Jiang, J. Liu, Feature selection based on fda and f-score for multi-class classification, Expert Systems with Applications 81 (2017) 22–27.
- [17] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, Advances in neural information processing systems 18.
- [18] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on pattern analysis and machine intelligence 27 (8) (2005) 1226–1238.
- [19] J. R. Vergara, P. A. Estévez, A review of feature selection methods based on mutual information, Neural computing and applications 24 (2014) 175–186.
- [20] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, Normalized mutual information feature selection, IEEE Transactions on neural networks 20 (2) (2009) 189–201.
- [21] M. Tesmer, P. A. Estévez, Amifs: Adaptive feature selection by using mutual information, in: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Vol. 1, IEEE, 2004, pp. 303–308.
- [22] J. R. Vergara, P. A. Estévez, Cmim-2: an enhanced conditional mutual information maximization criterion for feature selection, Journal of Applied Computer Science Methods 2 (1) (2010) 5–20.
- [23] R. J. Urbanowicz, M. Meeker, W. La Cava, R. S. Olson, J. H. Moore, Relief-based feature selection: Introduction and review, Journal of biomedical informatics 85 (2018) 189–203.
- [24] J. Wang, J. Wei, Z. Yang, Supervised feature selection by preserving class correlation, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 1613–1622.
- [25] T. M. Le, T. M. Vo, T. N. Pham, S. V. T. Dao, A novel wrapper–based feature selection for early diabetes prediction enhanced with a metaheuristic, IEEE Access 9 (2020) 7869–7884.
- [26] N. Zhou, Y. Xu, H. Cheng, J. Fang, W. Pedrycz, Global and local structure preserving sparse subspace learning: An iterative approach to unsupervised feature selection, Pattern Recognition 53 (2016) 87–101.
- [27] Y. Zhu, X. Zhang, R. Hu, G. Wen, Adaptive structure learning for low-rank supervised feature selection, Pattern Recognition Letters 109 (2018) 89–96.
- [28] X. Liu, L. Wang, J. Zhang, J. Yin, H. Liu, Global and local structure preservation for feature selection, IEEE transactions on neural networks and learning systems 25 (6) (2013) 1083–1095.
- [29] X. Zhu, S. Zhang, R. Hu, Y. Zhu, et al., Local and global structure preservation for robust unsupervised spectral feature selection, IEEE Transactions on Knowledge and Data Engineering 30 (3) (2017) 517–529.
- [30] Q. Ye, X. Zhang, Y. Sun, Dual global structure preservation based supervised feature selection, Neural Processing Letters 51 (2020) 2765–2787.
- [31] D. Chakraborty, N. R. Pal, A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification, IEEE Transactions on Neural Networks 15 (1) (2004) 110–123.
- [32] Y.-C. Chen, N. R. Pal, I.-F. Chung, An integrated mechanism for feature selection and fuzzy rule extraction for classification, IEEE Transactions on Fuzzy Systems 20 (4) (2012) 683–698.
- [33] G. Xue, Q. Chang, J. Wang, K. Zhang, N. R. Pal, An adaptive neuro-fuzzy system with integrated feature selection and rule extraction for high-dimensional classification problems, IEEE Transactions on Fuzzy Systems.
- [34] R. Setiono, H. Liu, Neural-network feature selector, IEEE transactions on neural networks 8 (3) (1997) 654–662.
- [35] G. Castellano, A. M. Fanelli, Variable selection using neural-network models, Neurocomputing 31 (1-4) (2000) 1–13.
- [36] A. Verikas, M. Bacauskiene, Feature selection with neural networks, Pattern recognition letters 23 (11) (2002) 1323–1335.
- [37] G. P. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30 (4) (2000) 451–462.
- [38] S. Dreiseitl, L. Ohno-Machado, Logistic regression and artificial neural network classification models: a methodology review, Journal of biomedical informatics 35 (5-6) (2002) 352–359.
- [39] S. Agrawal, J. Agrawal, Neural network techniques for cancer prediction: A survey, Procedia Computer Science 60 (2015) 769–774.
- [40] A. Mabrouk, R. P. D. Redondo, M. Kayed, Deep learning-based sentiment classification: a comparative survey, IEEE Access 8 (2020) 85616–85638.
- [41] D. Chakraborty, N. R. Pal, Selecting useful groups of features in a connectionist framework, IEEE transactions on neural networks 19 (3) (2008) 381–396.
- [42] R. Chakraborty, N. R. Pal, Feature selection using a neural framework with controlled redundancy, IEEE transactions on neural networks and learning systems 26 (1) (2014) 35–50.
- [43] R. Chakraborty, C.-T. Lin, N. R. Pal, Sensor (group feature) selection with controlled redundancy in a connectionist framework, International journal of neural systems 24 (06) (2014) 1450021.
- [44] Z. Zhang, M. R. Sabuncu, Generalized cross entropy loss for training deep neural networks with noisy labels, in: 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.
- [45] J. W. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on computers 100 (5) (1969) 401–409.
- [46] D. Dua, C. Graff, UCI machine learning repository (2017). URL http://archive.ics.uci.edu/ml
- [47] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, ACM Computing Surveys (CSUR) 50 (6) (2018) 94.
- [48] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX symposium on operating systems design and implementation (OSDI 16), 2016, pp. 265–283.
- [49] M. Prasad, A. Sowmya, I. Koch, Efficient feature selection based on independent component analysis, in: Proceedings of the 2004 Intelligent Sensors, Sensor Networks and Information Processing Conference, 2004., IEEE, 2004, pp. 427–432.
- [50] D. Steinley, Properties of the Hubert-Arabie adjusted Rand index, Psychological Methods 9 (3) (2004) 386.
- [51] S. Fletcher, M. Z. Islam, et al., Comparing sets of patterns with the jaccard index, Australasian Journal of Information Systems 22.
- [52] J. C. Bezdek, R. Ehrlich, W. Full, Fcm: The fuzzy c-means clustering algorithm, Computers & Geosciences 10 (2-3) (1984) 191–203.
- [53] M. Graña, M. Veganzons, B. Ayerdi, Hyperspectral remote sensing scenes (06 2021). URL http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes
- [54] A. Santara, K. Mani, P. Hatwar, A. Singh, A. Garg, K. Padia, P. Mitra, Bass net: Band-adaptive spectral-spatial feature learning neural network for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 55 (9) (2017) 5293–5301.
- [55] A. Santara, K. Mani, A. Shukla, Deep learning for land-cover classification in hyperspectral images (07 2016). URL https://kgpml.github.io/Hyperspectral/
- [56] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 16 (2002) 321–357.