
Feature selection simultaneously preserving both class and cluster structures

Suchismita Das and Nikhil R. Pal Electronics and Communication Sciences Unit, Indian Statistical Institute, 203 B T Road, Kolkata-700108 [email protected],[email protected]
Abstract

When a data set has significant differences in its class and cluster structure, selecting features aiming only at the discrimination of classes would lead to poor clustering performance, and similarly, feature selection aiming only at preserving cluster structures would lead to poor classification performance. To the best of our knowledge, a feature selection method that simultaneously considers class discrimination and cluster structure preservation is not available in the literature. In this paper, we have tried to bridge this gap by proposing a neural network-based feature selection method that focuses both on class discrimination and structure preservation in an integrated manner. In addition to assessing typical classification problems, we have investigated its effectiveness on band selection in hyperspectral images. Based on the results of the experiments, we may claim that the proposed feature/band selection can select a subset of features that is good for both classification and clustering.

keywords:
Feature selection, Structure preserving, Classification, Neural network, Sammon’s Stress, Band selection, Hyperspectral Image.

1 Introduction

Feature selection methods can be broadly classified, on the basis of how they use class label information, into three categories: supervised, semi-supervised, and unsupervised [1, 2]. Supervised feature selection methods exploit the label information to find the relevant features that distinguish samples of different classes [3, 4]. Semi-supervised feature selection is used when some labeled samples are present along with plenty of unlabelled samples [5, 6]; both labeled and unlabelled data are used to modify a hypothesis obtained from the labeled data [7]. Unsupervised feature selection is much more difficult as it needs to find the useful features in the absence of label information [2, 8]. Different criteria have been chosen to select a subset of the original features in different unsupervised feature selection studies. Some of them are: preserving the data distribution such as the manifold structure [9, 10], preserving the cluster structure [11, 12], and preserving data similarity [13, 14]. It is noteworthy that in the case of unsupervised feature selection, some methods try to preserve the “structure” or “geometry” of the data in some sense. In contrast, supervised feature selection methods in most cases do not set any explicit criterion to preserve the structure of the data. They focus only on separating the classes as much as possible using different measures that exploit class information, such as the Fisher score [15, 16], Laplacian score [17], mutual information [18, 19], normalized mutual information [18, 20, 21, 22], ReliefF [23], class correlation [24], and classifier score [25]. We should note here that feature selection criteria are not always driven by a single objective. Feature selection methods often follow a criterion that consists of two or more objectives. The study in [8] proposes a criterion named ‘maximum projection and minimum redundancy’, which is governed by two goals: projecting data into a feature subspace with minimum reconstruction error and minimum redundancy. The studies in [26, 27, 28, 29] claim that both the global structure and the local structure should be preserved in the projected space, as both may carry important discriminating information; hence, they propose feature selection schemes that focus on both global and local structure preservation. The investigation in [30] claims to preserve dual global structures. Going through various feature selection schemes with multiple objectives, we found that, whenever class labels are available, no work on feature selection explicitly focuses on preserving structural information along with class information, although both are important discriminative information and may have a positive impact on the generalization ability of the classifier. Suppose that, for a data set, the class and cluster structures are substantially different. Exploiting only the class labels, it may not be possible to keep the cluster structures in the projected space. For a practical system, even when the primary task is classification, we may need to cluster the samples in the space defined by the selected features. For example, fuzzy rule based classifiers are often designed by clustering the training data for each class and translating each cluster into a rule [31, 32, 33]. We could not find any feature selection method that focuses both on class and cluster separability.
To bridge this gap, in this study we propose a feature selection method that selects features preserving class and cluster-structure information simultaneously. We employ a multi-layer perceptron (MLP) based neural network to develop an embedded feature selection scheme. The training of the proposed MLP based feature selection method is governed by both class-discriminating and cluster (structure) preserving objectives. The philosophy is quite general and can easily be extended to other networks such as the radial basis function network.

2 Proposed Method

Let us denote the input data by an $n\times P$ matrix, $\mathbf{X}=\{\mathbf{x}_{i}\in\mathbb{R}^{P}\}_{i=1}^{n}$. Here, $\mathbf{x}_{i}$ is a $P$-dimensional row vector of the form $\mathbf{x}_{i}=(x_{i1},x_{i2},\cdots,x_{iP})$. Let the collection of class labels of $\mathbf{X}$ be $\mathbf{Z}=\{z_{i}\in\{1,2,\cdots,C\}\}_{i=1}^{n}$, where $z_{i}$ is the class label corresponding to $\mathbf{x}_{i}$. We aim to select a subset of size $Q$ from the original set of features such that the selected subset performs reasonably well in the classification task as well as in clustering. In other words, if we design a classifier using the selected features, the performance of the classifier should be comparable to that of a classifier designed using all features. Similarly, if we cluster the data in the reduced dimension as well as in the original dimension, we expect to get similar partition matrices. Here, we propose a neural network-based framework to select features. Neural networks have been explored for feature selection [34, 35, 36] as well as for classification [37, 38, 39, 40]. However, in our proposed model the neural network simultaneously selects features and learns a classifier, as we follow an embedded method for feature selection. Moreover, our proposed network preserves structural information and class label information simultaneously, whereas the feature selection networks in [34, 36], which solve classification problems, consider class label information in their loss functions but not any structural information. Note that the work in [35] considers a system identification problem. To build the neural network-based embedded feature selector, we employ the multi-layer perceptron (MLP) based framework used in [41, 42, 43]. The basic framework is shown in Fig. 1.

Figure 1: Proposed network for feature selection

As seen in Figure 1, preceding the input layer of the MLP there is a layer consisting of $P$ nodes. Before entering the input layer of the MLP, the $j$th feature passes through the node $f_{j}()$. These nodes act as attenuating gates that effectively allow or block features from contributing to the output of the neural network. For the $i$th instance, its $j$th feature $x_{ij}$, on passing through the gate node $f_{j}()$, becomes $a_{j}x_{ij}$; i.e., $f_{j}(x_{ij})=a_{j}x_{ij}$. In the MLP, a weighted sum of the values available at the input nodes is applied to the hidden nodes of the first hidden layer. A zero value at an input node implies that the corresponding feature is not considered. When training of the MLP-based framework is complete, the $a_{j}$s for the selected features become close to $1$, effectively allowing them to contribute to the classifier, whereas for poor or rejected features the $a_{j}$s become close to $0$, effectively preventing them from contributing to the classifier. In [41, 42, 43], this framework was explored for classification-oriented feature selection, group feature selection, and redundancy-controlled feature selection. Here, we explore this framework for simultaneous structure-preserving and class-discriminating feature selection. Next, we elaborate on the MLP-based framework and the proposed objective functions to train the network.

We denote the $P$ nodes before the input layer of the MLP as $f_{j}()$ for $j=1,2,\dots,P$, where $f_{j}()$ is a gate or modulator function applied to the $j$th feature, $x_{j}$. We have to design $f_{j}()$ in such a way that

f_{j}(x_{j})=a_{j}x_{j}=\begin{cases}x_{j}&\text{ if }x_{j}\text{ is a useful feature.}\\ 0&\text{ otherwise. }\end{cases} \qquad (1)

In our framework, the factor $a_{j}$ is learnable. We implement $a_{j}$ as a smooth continuous function, $a_{j}=\exp(-\lambda_{j}^{2})$. Clearly, when $\lambda_{j}=0$, $\exp(-\lambda_{j}^{2})=1$, and when $\lambda_{j}\rightarrow\pm\infty$, $\exp(-\lambda_{j}^{2})\rightarrow 0$. By adding suitable regularizer terms to the objective function, we design our learning system in such a way that, over the learning process, the gate parameters $\lambda_{j}$ for useful features drop close to zero and those for derogatory or indifferent features rise to high values. So, in our learning system, the learnable parameters, i.e., the $\lambda_{j}$s and the neural network weights, are learned together: the loss function is minimized with respect to both the $\lambda_{j}$s and the neural network weights.
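To make the gating mechanism concrete, a minimal NumPy sketch of Equation (1) with $a_{j}=\exp(-\lambda_{j}^{2})$ is given below; the array values and function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def gate_values(lam):
    """Gate factors a_j = exp(-lambda_j^2): close to 1 keeps a feature, close to 0 rejects it."""
    return np.exp(-lam ** 2)

def apply_gates(X, lam):
    """Pass each feature through its gate: f_j(x_ij) = a_j * x_ij (broadcast over the n x P matrix)."""
    return X * gate_values(lam)

# Toy usage: lambda_j near 0 keeps the feature, a large lambda_j suppresses it.
X = np.array([[1.0, 2.0, 3.0]])
lam = np.array([0.05, 3.0, 0.1])   # hypothetical gate parameters
print(apply_gates(X, lam))         # the second feature is attenuated almost to 0
```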

Now, we have to define a suitable loss function for selecting features along with learning the embedded classifier. Our aim is to select features that are reasonably good for classification as well as clustering. To satisfy this requirement, we take the loss function as a combination of two losses, $E_{class}$ and $E_{struct}$. $E_{class}$ is considered for preserving class information and $E_{struct}$ is considered for preserving structural information. For the moment, let us consider the network for selecting features for efficient classification only. A suitable loss function to impose class discrimination is the cross-entropy loss [44]. We define $E_{class}$ as the cross-entropy loss involving actual and predicted class labels.

E_{class}=-\dfrac{1}{n}\sum_{i=1}^{n}\sum_{k=1}^{C}t^{i}_{k}\log\left(p_{k}(\mathbf{x}_{i})\right) \qquad (2)

Here, $t^{i}_{k}$ is the $k$th element of the one-hot encoded label of the sample $\mathbf{x}_{i}$; in other words, $t^{i}_{k}$ is the $k$th element of the vector $\mathbf{t}^{i}\in\left\{0,1\right\}^{C}$ such that

t_{k}^{i}=\begin{cases}1&\text{ if }k=z_{i}\\ 0&\text{ otherwise}\end{cases} \qquad (3)

In (2), $p_{k}(\mathbf{x}_{i})$ is the predicted probability (by the MLP classifier) of $\mathbf{x}_{i}$ being in the $k$th class. As already discussed, for effective feature selection, the magnitude of $\lambda_{j}$ for the selected features should drop to almost zero and that for rejected features should rise to high values. To ensure this, we add the following regularizer.

E_{select}=\dfrac{1}{P}\sum_{j=1}^{P}a_{j}(1-a_{j})=\dfrac{1}{P}\sum_{j=1}^{P}\exp(-\lambda_{j}^{2})\left(1-\exp(-\lambda_{j}^{2})\right) \qquad (4)

In a feature selection framework, a constraint for selecting a fixed number of features is necessary. The following regularizer tries to keep the number of selected features close to or equal to $Q$.

E_{Q}=\dfrac{1}{Q^{2}}\left\{\left(\sum_{j=1}^{P}a_{j}\right)-Q\right\}^{2}=\dfrac{1}{Q^{2}}\left\{\left(\sum_{j=1}^{P}\exp(-\lambda_{j}^{2})\right)-Q\right\}^{2} \qquad (5)

So, the overall loss function for the selection of features with our framework for classification purposes is the following.

E=E_{class}+\alpha_{1}E_{select}+\alpha_{2}E_{Q} \qquad (6)

Here, $\alpha_{1}\geq 0$ and $\alpha_{2}\geq 0$ are scalar multipliers for adjusting the importance of $E_{select}$ and $E_{Q}$ in the overall error function $E$.
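As an illustration, the following NumPy sketch evaluates the classification-only loss of Equation (6) from its three components; it assumes the one-hot targets and the predicted class probabilities are already available, and the function names are ours.

```python
import numpy as np

def e_class(T, P_pred, eps=1e-12):
    """Cross-entropy of Eq. (2); T is the one-hot label matrix (n x C), P_pred the predicted probabilities."""
    return -np.mean(np.sum(T * np.log(P_pred + eps), axis=1))

def e_select(lam):
    """Regularizer of Eq. (4): pushes every gate value a_j = exp(-lambda_j^2) toward 0 or 1."""
    a = np.exp(-lam ** 2)
    return np.mean(a * (1.0 - a))

def e_q(lam, Q):
    """Regularizer of Eq. (5): keeps the number of (nearly) open gates close to Q."""
    a = np.exp(-lam ** 2)
    return (np.sum(a) - Q) ** 2 / Q ** 2

def loss_classification_only(T, P_pred, lam, Q, alpha1=1.0, alpha2=1.0):
    """Overall loss of Eq. (6)."""
    return e_class(T, P_pred) + alpha1 * e_select(lam) + alpha2 * e_q(lam, Q)
```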

Now let us return to our original agenda of selecting features that perform satisfactorily both for classification and clustering. To preserve the structural information of the data in the lower dimensional space formed by the selected $Q$ features, we consider Sammon's stress [45] as a loss function. Sammon's stress is the loss function of a non-linear mapping, Sammon's mapping, that is able to capture complex non-linear structures in the data and, as a result, also preserves cluster structure. The lower the value of Sammon's stress, the better the lower dimensional representation captures the original inter-point distances, i.e., the structure of the original data. We define Sammon's stress involving the original input space and the selected feature space as follows.

E_{sammons}=\dfrac{1}{(\sum_{i,l=1}^{n}d_{il})}\sum_{i=1}^{n-1}\sum_{l=i+1}^{n}\dfrac{\left(d_{il}^{\mathbf{X}}-d_{il}^{\mathbf{\hat{X}}}\right)^{2}}{d_{il}^{\mathbf{X}}} \qquad (7)

Here, $d_{il}^{\mathbf{X}}$ is the distance between $\mathbf{x}_{i}$ and $\mathbf{x}_{l}$, and $\mathbf{\hat{X}}=\{\mathbf{\hat{x}}_{i}=(a_{1}x_{i1},a_{2}x_{i2},\cdots,a_{P}x_{iP})^{T}\in\mathbb{R}^{P}\}_{i=1}^{n}$, so that $d_{il}^{\mathbf{\hat{X}}}$ is the distance between $\mathbf{\hat{x}}_{i}$ and $\mathbf{\hat{x}}_{l}$. As discussed earlier, at the end of the training of our embedded system, the $a_{j}$s will be close to $0$ or $1$ depending on whether the corresponding features are rejected or selected. Therefore, for a trained system, $d_{il}^{\mathbf{\hat{X}}}$ signifies the distance between the $i$th and $l$th instances in the latent space formed by the implicitly selected $Q$ features. Considering $E_{sammons}$ in Equation (7) as a regularizer, the resultant overall loss function is given by
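A small NumPy/SciPy sketch of Equation (7) is given below, assuming the gated data $\mathbf{\hat{X}}$ is obtained by multiplying each column of $\mathbf{X}$ by its gate value; the normalizer here sums each pair once, which differs from the ordered-pair sum above only by a constant factor that $\beta$ can absorb.

```python
import numpy as np
from scipy.spatial.distance import pdist

def sammon_stress(X, lam, eps=1e-12):
    """E_sammons of Eq. (7): discrepancy between pairwise distances in the original
    space and in the gated (selected-feature) space, normalized by the original distances."""
    a = np.exp(-lam ** 2)
    d_orig = pdist(X)          # d_il^X over all unordered pairs (i < l)
    d_gated = pdist(X * a)     # d_il^X_hat in the gated space
    return np.sum((d_orig - d_gated) ** 2 / (d_orig + eps)) / (np.sum(d_orig) + eps)
```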

E_{tot}=E_{class}+\beta E_{sammons}+\alpha_{1}E_{select}+\alpha_{2}E_{Q} \qquad (8)

Here, $\beta\geq 0$ is a scalar multiplier that controls the trade-off between the class information and the structural information in the feature selection process. Note that the computational complexity of the loss function in Equation (7) is $O(n^{2})$. For large $n$, computing Equation (7), and hence Equation (8), is computationally intensive. As the weight update at each iteration involves computing Equation (7), the overall computation cost would be high. For small and moderate $n$, we use Equation (8) as the loss function to be minimized. However, for large $n$, to avoid the high computational cost we modify Equation (7) as follows.

E_{struct}=\dfrac{1}{(\sum_{\mathbf{x}_{i},\mathbf{x}_{l}\in S_{t}}d_{il})}\sum_{\mathbf{x}_{i}\in S_{t}}\sum_{\mathbf{x}_{l}\in S_{t};\,\mathbf{x}_{l}\neq\mathbf{x}_{i}}\dfrac{\left(d_{il}^{\mathbf{X}}-d_{il}^{\mathbf{\hat{X}}}\right)^{2}}{d_{il}^{\mathbf{X}}} \qquad (9)

Here, $S_{t}$ is a randomly selected subset of $\mathbf{X}$ at the $t$th iteration. Different $S_{t}$s are chosen at different iterations and hence different sets of inter-point distances are preserved. Since the considered MLP is trained over a large number of iterations, the use of Equation (9) is expected to have almost the same effect as Equation (7). We have to choose $|S_{t}|$ such that Equation (9) is computationally manageable, while at the same time it should be large enough to make $E_{struct}$ an effective substitute for $E_{sammons}$. Adding Equation (9) to Equation (6), we propose the following loss function for our system.
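The sampled regularizer of Equation (9) can be sketched in the same style; the subset size and random-number handling below are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist

def e_struct(X, lam, subset_size=100, rng=None, eps=1e-12):
    """Approximation of Eq. (9): Sammon's stress restricted to a random subset S_t,
    redrawn at every training iteration to keep the cost manageable."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.choice(len(X), size=min(subset_size, len(X)), replace=False)
    Xs = X[idx]
    a = np.exp(-lam ** 2)
    d_orig = pdist(Xs)
    d_gated = pdist(Xs * a)
    return np.sum((d_orig - d_gated) ** 2 / (d_orig + eps)) / (np.sum(d_orig) + eps)
```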

E_{tot}=E_{class}+\beta E_{struct}+\alpha_{1}E_{select}+\alpha_{2}E_{Q} \qquad (10)

$E_{tot}$ is minimized with respect to the gate parameters $\lambda_{j}$ and the weights of the network to find their optimal values.
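Putting the previous sketches together, the total loss of Equation (10) can be assembled as below (reusing the helper functions `e_class`, `e_select`, `e_q`, and `e_struct` from the earlier sketches); in practice the gradients with respect to both the network weights (through the predicted probabilities) and the $\lambda_{j}$s would be obtained by automatic differentiation.

```python
def loss_total(T, P_pred, X, lam, Q, beta, alpha1=1.0, alpha2=1.0):
    """Full loss of Eq. (10), minimized jointly over the MLP weights and the gate parameters lam."""
    return (e_class(T, P_pred) + beta * e_struct(X, lam)
            + alpha1 * e_select(lam) + alpha2 * e_q(lam, Q))
```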

3 Experimentation and Results

The feature selection framework proposed in this paper is generic, but it can be adapted to solve specialized problems. We have studied the proposed framework on general datasets as well as on a special problem: band selection in hyperspectral images. The results of feature selection for conventional classification problems are presented in Subsec. 3.1, and the results of band selection for HSIs are presented in Subsec. 3.2.

3.1 Feature selection for conventional classification problems

We have used five publicly available datasets that are very commonly used for classification and clustering. The first four datasets are downloaded from the UCI machine learning repository [46]. AR10P is downloaded from the open-source feature selection repository of Arizona State University [47]. We have also performed experiments with three benchmark HSI datasets for land cover classification problems; we discuss them in a separate subsection (Subsec. 3.2).

The details of the number of features, number of classes, and number of instances for the five datasets are summarized in Table 1.

Table 1: Summary of datasets.
Name Number of features Number of classes Number of instances
P C n
E. coli 7 8 336
Glass 9 6 214
Ionosphere 34 2 351
Sonar 60 2 208
AR10P 2400 10 130

The datasets are used directly without any further processing. Each dataset is partitioned into training and test sets containing approximately $90\%$ and $10\%$ of the total number of instances, respectively. To implement our proposed feature selection scheme we use the neural network shown in Fig. 1 with the number of hidden layers $n_{H}=1$. The input and output layers have $P$ and $C$ nodes respectively, where $P$ is the number of features and $C$ is the number of classes of the considered dataset. The number of hidden nodes in the hidden layer is $8$ ($20$ for the AR10P data set). To get stable feature selection results, the network weights are initialized as follows. First, we consider the usual MLP part of our network (i.e., without feature selection), depicted by the portion within the dotted rectangle in Fig. 1, and initialize its weights randomly. Next, we train this usual MLP with the cross-entropy loss defined in Equation (2) on the training set until convergence. The weights of the converged network are used as the initial weights of the proposed network. The gate parameters $\lambda_{j}$ are initialized with values drawn randomly from a normal distribution with mean $2$ and spread $\frac{1}{\sqrt{P}}$. The initial values of the $\lambda_{j}$s are chosen around $2$ to make the gates effectively almost closed initially. As the learning progresses, the $\lambda_{j}$s are updated so as to admit the useful features into the network. For the proposed system, to select a subset of $Q$ features, the gate parameters $\lambda_{j}$ are sorted in ascending order and the $Q$ features corresponding to the $Q$ smallest $\lambda_{j}$s are selected. The network weights as well as the gate parameters $\lambda_{j}$ are learned using the adaptive gradient algorithm, the ‘train.AdagradOptimizer’ routine of the ‘TensorFlow’ framework [48]. For all experiments with the data sets in Table 1, both $\alpha_{1}$ and $\alpha_{2}$ of the error functions in Equations (6) and (8) are set to $1$. The total number of iterations for training the network is set to $20000$. The five datasets considered here have fewer than $400$ instances ($n<400$), which is not so large. Therefore, we use (8) as the overall loss function to train the MLP based architecture for selecting features that are reasonably good for clustering and classification. When $\beta=0$ in (8), the error function that governs the learning of our MLP based embedded feature selection scheme effectively reduces to (6); the corresponding scheme then considers only classification. Let us name this method feature selection with MLP (FSMLP). When $\beta\neq 0$ in (8), our method takes structure preservation into account along with classification; let us name the corresponding method FSMLPstruct. To understand the importance of adding the structure preserving regularizer (7), we perform feature selection with FSMLP and compare it with FSMLPstruct for different $\beta$ values. We explore three values of $\beta$: $0.1$, $1$, and $10$. Although the optimal value of $\beta$ for a particular dataset and a particular number of selected features $Q$ cannot be determined from these three values, investigating three widely different $\beta$s reveals the role of the weight given to the structure preserving regularizer on the performance of the selected features.
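The gate initialization and the final selection of the top $Q$ features can be sketched as below; this is our own minimal rendering of the procedure just described (sorting by the magnitude of $\lambda_{j}$, since the gate depends on $\lambda_{j}^{2}$), not the authors' code.

```python
import numpy as np

def init_gate_params(P, rng=None):
    """lambda_j ~ N(2, 1/sqrt(P)): all gates start almost closed, since exp(-2^2) is tiny."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.normal(loc=2.0, scale=1.0 / np.sqrt(P), size=P)

def select_top_q(lam, Q):
    """After training, keep the Q features with the smallest |lambda_j|
    (equivalently, the largest gate values a_j = exp(-lambda_j^2))."""
    return np.sort(np.argsort(np.abs(lam))[:Q])
```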
We compare with three other methods, namely an Independent Component Analysis (ICA) based feature selection method [49], an F-score based filter method [16], and a mutual information based filter method [18, 19]. The performance of both FSMLP and FSMLPstruct depends on the initial weights of the network. So, we repeat the initialization of the network weights and gate parameters $\lambda_{j}$ five times and run FSMLP or FSMLPstruct with each of the five initializations. As the performance measure of FSMLP and FSMLPstruct, we report the average performance over the five subsets obtained from the five runs. To check the effectiveness of the methods in selecting features that perform well in classification and clustering simultaneously, we compute the classification scores of the support vector machine (SVM) classifier as well as several structure-preserving indices: Sammon's stress (SS) [45], normalized mutual information (NMI) [18, 20, 21, 22], adjusted Rand index (ARI) [50], and Jaccard index (JI) [51]. As the measure of classification performance, we use the overall classification accuracy (OCA) of the SVM classifier. The optimal hyper-parameters of the SVM are determined through five-fold cross-validation using grid search. Note that here the test set is unseen not only by the SVM classifier but also by the feature selection methods. SS, defined in Equation (7), uses the original inter-point distances $d_{il}^{\mathbf{X}}$ and the latent space inter-point distances $d_{il}^{\mathbf{\hat{X}}}$; here, to compute $d_{il}^{\mathbf{\hat{X}}}$, we use the lower dimensional data formed by the selected $Q$ features. We use NMI, ARI, and JI as structure-preserving performance metrics by supplying the cluster labels obtained from clustering the data in the original space (using all features) as the true labels and the cluster labels obtained from clustering the data in the reduced space formed by the selected $Q$ features as the predicted cluster labels. So, NMI, ARI, and JI measure how well the cluster assignments in the original space and in the selected space agree, effectively quantifying the preservation of the original cluster structure in the selected space. The maximum value of NMI, ARI, or JI is $1$; a value close to $1$ for any of these three measures indicates that the cluster structure of the original space is preserved in the selected space. As the clustering algorithm, we use the fuzzy $C$-means (FCM) algorithm [52] with fuzzy exponent $m=2$. We set the number of clusters for the FCM algorithm to the number of classes. We use two values for the number of selected features $Q$: $Q=\lceil 0.35\times P\rceil$ and $Q=\lceil 0.5\times P\rceil$, i.e., the values are rounded up to the nearest integers.
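The structure-preservation scores can be computed roughly as in the sketch below; note that it uses scikit-learn's k-means as a simple stand-in for the FCM clustering used in the paper and reports only NMI and ARI, so it is an illustration of the evaluation protocol rather than an exact reproduction.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def structure_scores(X, selected, n_clusters, seed=0):
    """Cluster the data in the full feature space and in the selected-feature space,
    then compare the two partitions; values near 1 mean the cluster structure is preserved."""
    labels_full = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X)
    labels_sel = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(X[:, selected])
    return (normalized_mutual_info_score(labels_full, labels_sel),
            adjusted_rand_score(labels_full, labels_sel))
```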

Tables 2 and 3 summarize the performances of the proposed method and the other comparing methods for the training and test sets, respectively, of the E. coli dataset. We tabulate the four previously mentioned structure preserving measures and one classifier score for two choices of the number of selected features (approximately $35\%$ and $50\%$ of the original dimension), i.e., $Q=3$ and $Q=4$, in Tables 2 and 3.

Table 2: Performance comparison for E. coli, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4
ICA 0.2323 0.2146 0.4250 0.4220 0.2643 0.2617 0.4250 0.4220 0.6788 0.6656
F Score 0.1547 0.0648 0.5417 0.6124 0.3520 0.5073 0.5417 0.6124 0.6325 0.7815
Mutual Info 0.0726 0.0204 0.6061 0.7511 0.4993 0.6884 0.6061 0.7511 0.6823 0.6719
FSMLP 0.1391 0.1098 0.4356 0.4751 0.2705 0.3167 0.4356 0.4751 0.8391 0.8828
FSMLPstruct, $\beta=0.1$ 0.1391 0.1098 0.4356 0.4751 0.2705 0.3167 0.4356 0.4751 0.8391 0.8828
FSMLPstruct, $\beta=1$ 0.0902 0.0391 0.5338 0.6138 0.3868 0.4873 0.5338 0.6138 0.9000 0.9007
FSMLPstruct, $\beta=10$ 0.0766 0.0280 0.5812 0.7137 0.4497 0.6362 0.5812 0.7137 0.8675 0.8768
Table 3: Performance comparison for E. coli, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4 Q=3 Q=4
ICA 0.2562 0.2332 0.6191 0.6068 0.2858 0.2775 0.6191 0.6068 0.5588 0.6471
F Score 0.1389 0.0682 0.7124 0.7150 0.4579 0.4552 0.7124 0.7150 0.5882 0.6471
Mutual Info 0.0763 0.0426 0.7047 0.8047 0.4541 0.6114 0.7047 0.8047 0.4545 0.5909
FSMLP 0.1217 0.0976 0.7341 0.7416 0.5284 0.5449 0.7341 0.7416 0.6824 0.7000
FSMLPstruct, $\beta=0.1$ 0.1217 0.0976 0.7341 0.7416 0.5284 0.5449 0.7341 0.7416 0.6824 0.7000
FSMLPstruct, $\beta=1$ 0.0949 0.0382 0.6910 0.8179 0.4215 0.6669 0.6910 0.8179 0.7235 0.8118
FSMLPstruct, $\beta=10$ 0.0932 0.0346 0.6921 0.8571 0.4171 0.7029 0.6921 0.8571 0.7412 0.8588

As we have already discussed in Sec. 2, the lower the value of SS, the better the projected space (formed by the selected features) preserves the original pairwise distances and hence the structure of the original data. We observe in Table 2 that the mutual information based method shows the lowest value of SS, and the second lowest is obtained by FSMLPstruct with $\beta=10$, for both $Q=3$ and $Q=4$. Actually, the SS values of the mutual information based method and FSMLPstruct with $\beta=10$ are almost the same, equal up to two decimal places for both choices of $Q$. The SS values achieved by ICA, the F-score based method, and FSMLP are comparatively higher. So, the mutual information based method and FSMLPstruct with $\beta=10$ preserve the original pairwise distances best in the projected space and are also expected to preserve the structures best. The values of the other three structure preserving measures, i.e., NMI, ARI, and JI, confirm this. We know that the higher the values of NMI, ARI, and JI, the closer the clustering structures of the projected space are to the original clustering structures. The highest values of NMI, ARI, and JI are obtained by the mutual information based method, followed by FSMLPstruct with $\beta=10$. So, in cluster structure preservation, the mutual information based method and FSMLPstruct with $\beta=10$ perform better than the other three methods and even than the two models trained by FSMLPstruct with $\beta=0.1$ and $1$. $\beta$ is the weight of the regularizer $E_{sammons}$ of (7). Although SS and $E_{sammons}$ are not exactly the same, under the influence of $E_{select}$ it is expected that the higher the value of $\beta$, the lower the value of SS. Table 2 reconfirms this: the SS values decrease as $\beta$ increases from $0.1$ to $10$. The SS values of FSMLP (which is basically FSMLPstruct with $\beta=0$) and FSMLPstruct with $\beta=0.1$ are the same for both choices of $Q$; in fact, FSMLP and FSMLPstruct with $\beta=0.1$ agree in all ten measures. This indicates that, for the E. coli dataset, $\beta=0.1$ does not give any effective weight to the structure preservation term and selects the same subsets as FSMLP. For the classification performance measure OCA, FSMLPstruct with $\beta=1$ achieves the highest value, followed by FSMLPstruct with $\beta=10$. The mutual information based method and FSMLPstruct with $\beta=10$ have all the structure preserving measures either almost equal or comparable; however, for OCA, FSMLPstruct with $\beta=10$ is better than the mutual information based method by a margin of more than $18\%$. For the E. coli data, the test set follows the trends observed in the training set with the following exceptions. First, for $Q=3$, the values of NMI, ARI, and JI do not increase as $\beta$ increases from $0.1$ to $10$. Second, for $Q=4$, FSMLPstruct with $\beta=10$ beats all the methods, including the mutual information based method. Analyzing the performances over the training and test sets, for the E. coli data FSMLPstruct with $\beta=10$ is the winner among the seven models considered.

Tables 4 and 5 compare the performance of the proposed method with other methods in terms of different criteria for the Glass dataset on its training and test sets, respectively.

Table 4: Performance comparison for Glass, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5
ICA 0.2699 0.2216 0.5066 0.5416 0.3343 0.3688 0.5066 0.5416 0.6771 0.6823
F Score 0.1284 0.1011 0.5961 0.6343 0.5137 0.5750 0.5961 0.6343 0.6979 0.7396
Mutual Info 0.2091 0.1159 0.5073 0.6113 0.3321 0.5688 0.5073 0.6113 0.6823 0.6719
FSMLP 0.1968 0.1593 0.5512 0.6028 0.4473 0.5235 0.5512 0.6028 0.6302 0.7094
FSMLPstruct, $\beta=0.1$ 0.1968 0.0747 0.5512 0.7501 0.4473 0.6873 0.5512 0.7501 0.6302 0.7510
FSMLPstruct, $\beta=1$ 0.0494 0.0623 0.7732 0.7965 0.7124 0.7665 0.7732 0.7965 0.7740 0.7708
FSMLPstruct, $\beta=10$ 0.0309 0.0406 0.8551 0.7394 0.8507 0.6792 0.8551 0.7394 0.8125 0.7667
Table 5: Performance comparison for Glass, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5 Q=4 Q=5
ICA 0.3039 0.2333 0.6536 0.6063 0.3409 0.2979 0.6536 0.6063 0.4545 0.4545
F Score 0.1393 0.1258 0.7028 0.7744 0.5746 0.6861 0.7028 0.7744 0.5455 0.6364
Mutual Info 0.2034 0.1372 0.7805 0.7728 0.6358 0.6655 0.7805 0.7728 0.4545 0.5909
FSMLP 0.2027 0.1669 0.7366 0.7425 0.5053 0.5380 0.7366 0.7425 0.5727 0.6727
FSMLPstruct, $\beta=0.1$ 0.2027 0.0806 0.7366 0.7885 0.5053 0.6190 0.7366 0.7885 0.5727 0.6636
FSMLPstruct, $\beta=1$ 0.0406 0.0636 0.8265 0.8300 0.6951 0.7355 0.8265 0.8300 0.6818 0.6727
FSMLPstruct, $\beta=10$ 0.0215 0.0372 0.9022 0.7950 0.8670 0.6210 0.9022 0.7950 0.6727 0.6818

The chosen numbers of features for the Glass data are $4$ and $5$. The expected behaviour of decreasing SS with increasing $\beta$ is clearly observed for $Q=5$ for both the training and test sets. For $Q=4$, the Glass data also follows the characteristic of the E. coli data of having the same values for FSMLP and FSMLPstruct with $\beta=0.1$ in all ten measures for both the training and test sets. For $Q=4$, from $\beta=0.1$ onwards, increasing $\beta$ produces decreasing SS values and increasing NMI, ARI, and JI values for both the training and test sets. We observe from Tables 4 and 5 that, for $Q=5$, as $\beta$ increases from $0$ (FSMLP) to $0.1$, and then to $1$, the NMI, ARI, and JI values increase for both the training and test sets; however, at $\beta=10$, the NMI, ARI, and JI values decrease compared to $\beta=0.1$ and $1$. We can conclude that, for $Q=4$, FSMLPstruct with $\beta=10$ gives the best structure preserving performance among the considered models, and for $Q=5$, FSMLPstruct with $\beta=1$ is the best in structure preservation. In terms of the classification performance measure OCA, for $Q=4$, FSMLPstruct with $\beta=10$ and FSMLPstruct with $\beta=1$ show the highest OCA values for the training and test sets, respectively. For $Q=5$, FSMLPstruct with $\beta=10$ shows the highest OCA value for the training set and FSMLPstruct with $\beta=1$ shows the highest OCA value for the test set. Inspecting all the performance measures, we conclude that for the Glass dataset, both FSMLPstruct with $\beta=10$ and FSMLPstruct with $\beta=1$ are better at simultaneously preserving class and cluster structures than the other methods.

The performances for the Ionosphere dataset are recorded in Tables 6 and 7 for the training and test sets, respectively.

Table 6: Performance comparison for Ionosphere, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17
ICA 0.1791 0.0926 0.8261 0.8817 0.8766 0.9250 0.8261 0.8817 0.9492 0.9714
F Score 0.1753 0.0914 0.9035 0.9313 0.9497 0.9621 0.9035 0.9313 0.9397 0.9746
Mutual Info 0.1776 0.0803 0.6829 0.8645 0.7838 0.9250 0.6829 0.8645 0.9746 0.9651
FSMLP 0.1721 0.0895 0.6292 0.7330 0.7246 0.8192 0.6292 0.7330 0.9606 0.9733
FSMLPstruct, $\beta=0.1$ 0.1675 0.0878 0.6689 0.7415 0.7641 0.8243 0.6689 0.7415 0.9638 0.9733
FSMLPstruct, $\beta=1$ 0.1505 0.0776 0.7307 0.7867 0.8051 0.8582 0.7307 0.7867 0.9632 0.9568
FSMLPstruct, $\beta=10$ 0.1437 0.0766 0.7894 0.8111 0.8533 0.8723 0.7894 0.8111 0.9683 0.9644
Table 7: Performance comparison for Ionosphere, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17 Q=12 Q=17
ICA 0.1652 0.0964 0.7351 0.7351 0.7782 0.7782 0.7351 0.7351 0.9143 0.9429
F Score 0.1952 0.1003 0.4325 0.4778 0.3411 0.4152 0.4325 0.4778 0.9429 0.9429
Mutual Info 0.2186 0.0972 0.5287 0.6541 0.4959 0.6774 0.5287 0.6541 0.9429 0.9429
FSMLP 0.1801 0.0913 0.6518 0.8284 0.6841 0.8682 0.6518 0.8284 0.9371 0.9143
FSMLPstruct, $\beta=0.1$ 0.1722 0.0885 0.7434 0.8284 0.7714 0.8682 0.7434 0.8284 0.9371 0.9314
FSMLPstruct, $\beta=1$ 0.1437 0.0726 0.7764 0.8597 0.8265 0.8884 0.7764 0.8597 0.8971 0.9257
FSMLPstruct, $\beta=10$ 0.1310 0.0720 0.7960 0.9146 0.8454 0.9328 0.7960 0.9146 0.9029 0.9257

For the Ionosphere data set, the number of selected features $Q$ is set to $12$ and $17$. Here, in all cases, as $\beta$ increases, SS decreases and the other structure preserving indices NMI, ARI, and JI increase consistently. Unlike the E. coli and Glass data sets, here when $\beta$ increases from $0$ (FSMLP) to $0.1$, the structure preserving metrics, including SS, shift in the desired direction in most cases and remain the same in some cases. Except for SS, in the other three structure preserving measures the ICA and F-score based methods perform better than FSMLP and FSMLPstruct in all cases. Classification performance is good for almost all methods on the Ionosphere data set. In the training set, for both $Q=12$ and $Q=17$, an accuracy of $97.46\%$ is reached by the mutual information and F-score based methods; however, the FSMLP and FSMLPstruct models reach more than $96\%$ accuracy in every case. For the test set, all the structure preserving indices are better for FSMLP and FSMLPstruct than for the ICA, F-score, and mutual information based methods, although in terms of the classification score OCA, the F-score and mutual information based methods perform marginally better than the FSMLP and FSMLPstruct models. This may have happened because the features selected by the neural network based classifier, which are expected to be discriminatory, may not be the best for the SVM. Moreover, FSMLPstruct makes a compromise between preserving cluster structure and the classifier loss. For the Ionosphere dataset, our proposed models are not the winner; perhaps with a higher $\beta$, FSMLPstruct would deliver better scores.

For the Sonar data, the summary of the performances on the training and test data sets in terms of the five measures for two choices of the number of selected features is available in Tables 8 and 9.

Table 8: Performance comparison for Sonar, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30
ICA 0.7830 0.4963 0.0004 0.0276 -0.0040 0.0382 0.0004 0.0276 0.7112 0.8075
F Score 0.3614 0.2029 0.1167 0.1778 0.1159 0.1926 0.1167 0.1778 0.8289 0.9733
Mutual Info 0.2247 0.0950 0.4601 0.6601 0.5662 0.7585 0.4601 0.6601 0.9733 0.9465
FSMLP 0.1880 0.0910 0.3735 0.5293 0.4432 0.6104 0.3735 0.5293 0.9273 0.9765
FSMLPstruct, $\beta=0.1$ 0.1314 0.0779 0.4917 0.5587 0.5698 0.6358 0.4917 0.5587 0.9775 0.9722
FSMLPstruct, $\beta=1$ 0.0620 0.0285 0.5743 0.6838 0.6620 0.7696 0.5743 0.6838 0.9872 0.9840
FSMLPstruct, $\beta=10$ 0.0481 0.0201 0.6983 0.7473 0.7751 0.8318 0.6983 0.7473 0.9936 0.9968
Table 9: Performance comparison for Sonar, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30 Q=21 Q=30
ICA 0.7556 0.4968 0.2513 0.1993 0.1546 0.2383 0.2513 0.1993 0.7143 0.5714
F Score 0.3629 0.1941 0.2906 0.3029 0.3519 0.3536 0.2906 0.3029 0.5714 0.8571
Mutual Info 0.2144 0.0955 0.2906 0.5411 0.3519 0.6378 0.2906 0.5411 0.8095 0.8095
FSMLP 0.1967 0.0977 0.3918 0.5288 0.4633 0.5850 0.3918 0.5288 0.8190 0.8857
FSMLPstruct, $\beta=0.1$ 0.1436 0.0840 0.4026 0.5540 0.4368 0.6155 0.4026 0.5540 0.7714 0.8762
FSMLPstruct, $\beta=1$ 0.0679 0.0302 0.6905 0.6824 0.7223 0.7146 0.6905 0.6824 0.7619 0.8381
FSMLPstruct, $\beta=10$ 0.0473 0.0198 0.5001 0.7156 0.5547 0.7490 0.5001 0.7156 0.8000 0.7810

We set $Q=21$ and $30$ for the Sonar data set. For the Sonar data set, not only do all the structure preserving indices improve with increasing $\beta$, but on the training set, FSMLPstruct with $\beta=10$ is significantly better than the ICA, F-score, and mutual information based methods, and FSMLP, in all five scores for both choices of $Q$. On the test set, for some cases, FSMLPstruct with $\beta=1$ is better than FSMLPstruct with $\beta=10$. For the Sonar data set, the proposed method clearly performs extremely well in terms of both classification and clustering performance.

Tables 10 and 11 summarize the performances of the proposed method and the other comparing methods for the training and test sets, respectively, of the AR10P data set. The original number of features, $P$, for the AR10P data set is $2400$, which is considerably higher than that of the other data sets used in this sub-section. The two choices of the number of selected features here are $40$ and $60$; these are not approximately $35\%$ and $50\%$ of the original dimension as in the previous cases.

Table 10: Performance comparison for AR10P, training set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60
ICA 0.7619 0.7107 0.2807 0.2449 0.2222 0.2066 0.2807 0.2449 1 1
F Score 0.7933 0.7484 0.1693 0.1709 0.0550 0.0438 0.1693 0.1709 1 1
Mutual Info 0.7805 0.7309 0.1613 0.1863 0.0429 0.0558 0.1613 0.1863 0.9915 0.9915
FSMLP 0.7640 0.7109 0.2296 0.2108 0.1038 0.0943 0.2296 0.2108 0.9983 1
FSMLPstruct, $\beta=25$ 0.7566 0.7061 0.3828 0.5251 0.3907 0.5654 0.3828 0.5251 1 1
FSMLPstruct, $\beta=50$ 0.7567 0.7059 0.5307 0.5089 0.5575 0.5268 0.5307 0.5089 1 1
FSMLPstruct, $\beta=100$ 0.7564 0.7061 0.4849 0.5830 0.4852 0.6353 0.4849 0.5830 1 1
Table 11: Performance comparison for AR10P, test set, for different choices of $\beta$ and $Q$.
Method SS NMI ARI JI OCA
Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60 Q=40 Q=60
ICA 0.7739 0.7256 0.8255 0.8222 0.0031 -0.0523 0.8255 0.8222 0.6154 0.6154
F Score 0.7364 0.6856 0.9072 0.9152 0.3645 0.4155 0.9072 0.9152 0.8462 0.8462
Mutual Info 0.7614 0.7112 0.8451 0.8451 0.0122 0.0122 0.8451 0.8451 0.9231 0.8462
FSMLP 0.7838 0.7213 0.9051 0.9152 0.3686 0.4155 0.9051 0.9152 0.8462 0.7538
FSMLPstruct, $\beta=25$ 0.7795 0.7346 0.9223 0.9146 0.4820 0.4380 0.9223 0.9146 0.4615 0.6154
FSMLPstruct, $\beta=50$ 0.7815 0.7352 0.9050 0.9050 0.3782 0.3782 0.9050 0.9050 0.4615 0.6154
FSMLPstruct, $\beta=100$ 0.7804 0.7356 0.9056 0.9136 0.3557 0.4727 0.9056 0.9136 0.4615 0.6154

The study in [42] proposed a feature selection scheme with redundancy control. For the AR10P data set, it reported an average of $58.9$ selected features without redundancy control and an average number of selected features in the range of $22.8$ to $44.2$ with redundancy control. Hence, we choose the number of selected features $Q$ as $40$ and $60$. From the classification scores shown in Table 10, we note that for all methods and both choices of $Q$, the classification scores on the training set are more than $99\%$. On the training set, we observe that for FSMLPstruct, SS decreases as $\beta$ increases in almost all cases; for the test set, this is not true. For the other structure-preserving measures on the training set, FSMLPstruct with $\beta=50$ is the best among all methods for $Q=40$ and FSMLPstruct with $\beta=100$ is the best among all methods for $Q=60$. On the test set, all methods perform almost the same in terms of the structure-preserving measures. The classification performance of FSMLPstruct is very poor on the test set for the AR10P data. The significant differences between the training and test OCA values for FSMLPstruct indicate poor generalization of the system. This problem may be addressed by choosing the number of nodes of our MLP based model through cross-validation.

Results on the five data sets clearly establish the benefit of introducing the proposed structure preserving regularizer term, $E_{sammons}$, into the overall loss function (8) of the MLP based embedded feature selection scheme. Next, we consider the band (channel) selection problem for hyperspectral satellite images.

3.2 Band selection in hyperspectral images

Let the considered hyperspectral image $I$ be of dimension $H\times W\times P$, where $H$, $W$, and $P$ are the height, width, and number of spectral bands of the image, respectively. We can represent the pixels of $I$ as $\mathbf{x}_{i}\in\mathbb{R}^{P}$, $i=1,2,\dots,H\times W$. Let there be a total of $n$ pixels annotated with $C$ land cover classes. Without any loss of generality, we take the first $n$ pixels, i.e., $i=1,2,\dots,n$, as the pixels having class labels. Our input data for the land cover classification problem is $\mathbf{X}=\{\mathbf{x}_{i}=(x_{i1},x_{i2},\cdots,x_{iP})\in\mathbb{R}^{P}\}_{i=1}^{n}$. The collection of class labels of $\mathbf{X}$ is $\mathbf{Z}=\{z_{i}\in\{1,2,\cdots,C\}\}_{i=1}^{n}$, where $z_{i}$ is the class label corresponding to $\mathbf{x}_{i}$. We aim to select a subset of size $Q$ from the original set of bands such that the selected subset performs reasonably well for land cover classification as well as in clustering. We have performed the experiments with three benchmark HSI datasets for land cover classification problems: Indian Pines, Pavia University, and Salinas [53]. We have used the corrected versions of the Indian Pines and Salinas datasets, having $200$ and $204$ bands, respectively. The Pavia University dataset uses $103$ bands. The pre-processing of the datasets is the same as in [54], following the code available in [55]. For any dataset, its pixel values are scaled to $[0,1]$ using the expression $(x-\min(x))/(\max(x)-\min(x))$, where $x$ is a pixel value and the $\max$ and $\min$ are computed over the entire HSI. The data are then mean normalized across each channel by subtracting the channel-wise means. The datasets are partitioned into training and test sets. For band selection, only the training sets are fed to the model; for measuring performance, both the training and test sets are used. For splitting a dataset into training and test subsets, we drop the pixels of unknown land-cover type. Let $\mathbf{X}$ be the set of pixels with known land-cover type. To obtain the training and test sets, we divide $\mathbf{X}$ into two subsets $\mathbf{A}$ and $\mathbf{B}$ such that $\mathbf{A}\bigcup\mathbf{B}=\mathbf{X}$, $\mathbf{A}\bigcap\mathbf{B}=\phi$, and $\mathbf{A}$ and $\mathbf{B}$ contain, respectively, $25\%$ and $75\%$ of the pixels of $\mathbf{X}$. We use $\mathbf{A}$ as the test set. Note that these datasets suffer from the class imbalance problem. To avoid the learning difficulty raised by class imbalance, in the training set we consider the same number of instances from each class. For this, from the subset $\mathbf{B}$, we randomly select (without replacement) $200$ pixels per class. If a class has fewer than $200$ instances in $\mathbf{B}$, we oversample the class by the synthetic minority oversampling technique (SMOTE) [56] to gather $200$ points. For band selection also, we use the same neural network (Fig. 1), with the number of hidden layers $n_{H}=3$. The numbers of hidden nodes in the three hidden layers are $500$, $350$, and $150$, respectively. Here, the number of input nodes of the MLP is equal to the number of bands ($P$). The network weights and the gate parameters $\lambda_{j}$ are initialized in the same way as in Subsec. 3.1. For all experiments of the current sub-section, $\alpha_{1}$ and $\alpha_{2}$ of the error functions in Equations (6) and (10) are set to $5$ and $1$, respectively.
The total number of iterations for training the network is set to $50000$. The rest of the experimental settings are kept the same as in the experiments of Subsec. 3.1. The number of training instances is $3200$ for the Indian Pines and Salinas data sets and $1800$ for Pavia University; both values of $n$ are high. Computing $E_{sammons}$ in (7) would involve computing $(3200)^{2}$ or $(1800)^{2}$ distances, so adding $E_{sammons}$ to the overall loss function would make each iteration computationally very intensive. So, instead of $E_{sammons}$, its proposed approximation $E_{struct}$, defined in (9), is used. In (9), $|S_{t}|$ is taken as $100$. Varying the value of $\beta$ in (10), we analyse its effect on the OCA, SS, NMI, ARI, and JI. We compute SS, NMI, ARI, and JI as described in Subsec. 3.1, and we use the same clustering algorithm with the same settings as in Subsec. 3.1.
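The pixel pre-processing and class-balancing steps can be sketched roughly as follows; the function names are ours, the cube is assumed to be a NumPy array of shape (H, W, P), and SMOTE is taken from the imbalanced-learn package, so this is an illustrative sketch rather than the authors' exact pipeline.

```python
import numpy as np
from imblearn.over_sampling import SMOTE   # assumption: imbalanced-learn is available

def preprocess_hsi(cube):
    """Scale the whole H x W x P cube to [0, 1], subtract the per-band mean,
    and flatten to an (H*W) x P pixel matrix."""
    x = (cube - cube.min()) / (cube.max() - cube.min())
    x = x - x.reshape(-1, x.shape[-1]).mean(axis=0)
    return x.reshape(-1, x.shape[-1])

def balance_training_set(X_train, y_train, per_class=200, seed=0):
    """Draw up to `per_class` pixels per class without replacement; classes with fewer
    samples are then oversampled with SMOTE so every class contributes `per_class` pixels."""
    rng = np.random.default_rng(seed)
    Xs, ys = [], []
    for c in np.unique(y_train):
        idx = np.where(y_train == c)[0]
        take = rng.choice(idx, size=min(per_class, len(idx)), replace=False)
        Xs.append(X_train[take]); ys.append(y_train[take])
    X_bal, y_bal = np.vstack(Xs), np.concatenate(ys)
    # SMOTE brings every minority class up to the majority count (here, per_class)
    return SMOTE(random_state=seed).fit_resample(X_bal, y_bal)
```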

Tables 12 and 13 summarize the comparative results of FSMLPstruct, FSMLP, and the other band selection methods (ICA, F-score, and mutual information based filter methods) on the training and test sets of Indian Pines, respectively.

Table 12: Performance comparison for Indian Pines training set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.6982 0.5437 0.3071 0.3335 0.7497
F Score 0.5325 0.5567 0.3015 0.0704 0.7978
Mutual Info 0.1966 0.5633 0.3232 0.2493 0.8600
FSMLP 0.0162 0.7994 0.6530 0.6827 0.9157
FSMLPstruct, $\beta=2$ 0.0153 0.8027 0.6588 0.6894 0.9165
FSMLPstruct, $\beta=5$ 0.0140 0.8051 0.6627 0.6945 0.9155
FSMLPstruct, $\beta=20$ 0.0107 0.8188 0.6953 0.7216 0.9153
FSMLPstruct, $\beta=50$ 0.0071 0.8400 0.7404 0.7549 0.9179
Table 13: Performance comparison for Indian Pines test set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.6748 0.5360 0.3052 0.2983 0.6175
F Score 0.5457 0.5518 0.2951 0.1643 0.6738
Mutual Info 0.2041 0.5787 0.3283 0.2395 0.7298
FSMLP 0.0158 0.8729 0.7949 0.8109 0.7871
FSMLPstruct, $\beta=2$ 0.0148 0.8831 0.8210 0.8384 0.7871
FSMLPstruct, $\beta=5$ 0.0137 0.8717 0.7881 0.8017 0.7881
FSMLPstruct, $\beta=20$ 0.0102 0.8896 0.8346 0.8507 0.7882
FSMLPstruct, $\beta=50$ 0.0066 0.8926 0.8360 0.8505 0.7850

Similarly, Tables 14 and 15 summarize the comparative results on the training and test sets of Pavia University.

Table 14: Performance comparison for Pavia University training set, with the number of bands = 35.
Method SS NMI ARI JI OCA
ICA 0.3231 0.6941 0.5858 0.6340 0.8433
F Score 0.2386 0.6451 0.4654 0.4748 0.8233
Mutual Info 0.1387 0.8905 0.8769 0.8871 0.8717
FSMLP 0.1686 0.9072 0.8925 0.8863 0.9213
FSMLPstruct, $\beta=1$ 0.1625 0.9169 0.9014 0.8952 0.9213
FSMLPstruct, $\beta=1.5$ 0.1616 0.9233 0.9112 0.9033 0.9212
FSMLPstruct, $\beta=2$ 0.1592 0.9243 0.9121 0.9044 0.9213
FSMLPstruct, $\beta=2.5$ 0.1590 0.9198 0.9056 0.8993 0.9218
Table 15: Performance comparison for Pavia University test set, with the number of bands = 35.
Method SS NMI ARI JI OCA
ICA 0.3215 0.5533 0.3777 0.2082 0.7573
F Score 0.3350 0.5091 0.3126 0.0886 0.7222
Mutual Info 0.1433 0.8911 0.8814 0.8990 0.7785
FSMLP 0.1584 0.9001 0.8930 0.9084 0.8628
FSMLPstruct, $\beta=1$ 0.1568 0.9156 0.9131 0.9254 0.8650
FSMLPstruct, $\beta=1.5$ 0.1573 0.9175 0.9149 0.9270 0.8672
FSMLPstruct, $\beta=2$ 0.1556 0.9189 0.9166 0.9279 0.8666
FSMLPstruct, $\beta=2.5$ 0.1562 0.9168 0.9143 0.9261 0.8648

In this experiment, we fix the number of selected bands $Q$ to approximately $35\%$ of the original number of bands $P$. So, the number of selected bands is $70$ for Indian Pines and $35$ for Pavia University. Tables 12 and 13 record the values of the structure-preserving indices and classification scores on Indian Pines for different values of $\beta$ in FSMLPstruct (the $\beta$ of Equation (10)). The $\beta$ values considered for Indian Pines are $2$, $5$, $20$, and $50$. Note that FSMLP is basically FSMLPstruct with $\beta=0$. We observe in Tables 12 and 13, for both the training and test sets, that as the value of $\beta$ increases for FSMLPstruct (the last five rows of the corresponding tables), the value of SS becomes smaller. A similar trend is also observed for the Pavia University data set (here, $\beta$ varies over $1$, $1.5$, $2$, and $2.5$) for the training (Table 14) and test (Table 15) sets.

For the Pavia University dataset, we set the values of $\beta$ in FSMLPstruct to $1$, $1.5$, $2$, and $2.5$. Unlike for Indian Pines, for Pavia University we restrict $\beta$ to lower values. This is because the number of selected bands for Pavia University is $35$ whereas that for Indian Pines is $70$; the smaller the number of bands, the smaller the importance ($\beta$) to be given to our structure preserving regularizer in Equation (9) to obtain the desired balance between classification and clustering performance. Table 12, which contains the results for the Indian Pines training data, clearly shows that both FSMLP and FSMLPstruct are better than the ICA, F-score based, and mutual information based methods in all four structure preserving metrics as well as in terms of the OCA. In Table 12, we observe that with increasing values of $\beta$ there is a consistent improvement in the values of the four structure preserving metrics while the OCA values remain at approximately $91\%$. The results in Table 13 for the Indian Pines test set also show that FSMLP and FSMLPstruct perform better than the other three methods in terms of all five metrics. Also, with an increase in $\beta$, all the structure-preserving metrics improve for FSMLPstruct, except that the value of JI decreases slightly when $\beta$ goes from $20$ to $50$. The classification metric OCA is around $78\%$ with the bands selected by FSMLPstruct for the different choices of $\beta$.

It is notable that the test set is completely unseen during band selection, yet the bands selected by the proposed method provide fairly good results for structure preservation as well as for classification. As observed from Tables 14 and 15 for the Pavia University training and test sets, respectively, the lowest (best) SS value among all the comparing methods is achieved by the mutual information based filter method. However, for the other four metrics, i.e., NMI, ARI, JI, and OCA, FSMLP and FSMLPstruct show better values. In the case of the Pavia University dataset, NMI, ARI, and JI do not increase consistently with increasing $\beta$, but the results indicate that it is possible to find a $\beta$ (here $\beta=2$) for which the structures are preserved better while maintaining a good classification score. Tables 16 and 17 summarize the comparative results on the training and test sets of Salinas.

Table 16: Performance comparison for Salinas training set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.7159 0.7066 0.5385 0.5111 0.9325
F Score 0.0167 0.9420 0.9307 0.9291 0.9650
Mutual Info 0.0902 0.9207 0.8775 0.8860 0.9628
FSMLP 0.0072 0.9672 0.9589 0.9628 0.9654
FSMLPstruct, $\beta=2$ 0.0078 0.9678 0.9603 0.9640 0.9658
FSMLPstruct, $\beta=5$ 0.0072 0.9691 0.9626 0.9659 0.9648
FSMLPstruct, $\beta=20$ 0.0055 0.9710 0.9657 0.9683 0.9646
FSMLPstruct, $\beta=50$ 0.0039 0.9688 0.9630 0.9659 0.9649
Table 17: Performance comparison for Salinas test set, with the number of bands = 70.
Method SS NMI ARI JI OCA
ICA 0.7027 0.6297 0.4144 0.3490 0.8670
F Score 0.0230 0.8584 0.7483 0.7615 0.9081
Mutual Info 0.0820 0.8791 0.7950 0.8230 0.8950
FSMLP 0.0078 0.9368 0.9183 0.9265 0.9077
FSMLPstruct, $\beta=2$ 0.0084 0.9392 0.9231 0.9305 0.9076
FSMLPstruct, $\beta=5$ 0.0079 0.9413 0.9275 0.9340 0.9069
FSMLPstruct, $\beta=20$ 0.0061 0.9366 0.9175 0.9258 0.9065
FSMLPstruct, $\beta=50$ 0.0044 0.9342 0.9136 0.9229 0.9073

We note from Tables 16 and 17 that, for the Salinas dataset, FSMLP and FSMLPstruct are better than the other three methods in all five metrics. All four structure preserving metric scores of FSMLPstruct are better than or comparable to those of FSMLP, while the classification score OCA stays at approximately $96\%$ for the training set and $90\%$ for the test set. Tables 16 and 17 reveal that when $\beta$ is increased from $0$ to $2$, the value of SS increases; however, from $\beta=2$ to $\beta=50$, the values of SS decrease. This exception for the Salinas dataset while increasing $\beta$ from $0$ to $2$ is possibly due to the fact that we do not use the entire training data in Equation (9), and $|S_{t}|=100$ in Equation (9) is not adequate to capture the structure of the data faithfully for the Salinas dataset. As discussed earlier, setting the value of $|S_{t}|$ is crucial for approximating Equation (7) with Equation (9). We have set $|S_{t}|=100$ for all three datasets empirically; however, choosing an optimal value of $|S_{t}|$ for each dataset would be expected to avoid such exceptions.

As we increase the value of $\beta$, more stress is put on reducing the loss in Equation (9). In most cases, increasing $\beta$ results in a drop in SS. This clearly suggests that the loss function in Equation (9) is a computationally efficient substitute for the original Sammon's stress defined in Equation (7).

We have included the thematic maps (Fig. 2), and they reveal that our proposed method is capable of selecting useful bands that can broadly capture the land cover types. Figure 2 illustrates thematic maps of the entire region captured in the Indian Pines dataset. Figure 2(a) shows the ground truth labels. Figures 2(b), 2(c), and 2(d) are thematic maps of the Indian Pines data set using the class labels obtained from the SVM classifier trained on the considered training set represented with the $70$ bands selected by FSMLPstruct with $\beta=0$ (i.e., by FSMLP), $\beta=20$, and $\beta=50$, respectively.

Figure 2: Thematic maps for the Indian Pines dataset: (a) ground truth, and SVM classification results with 70 bands selected by FSMLPstruct considering (b) $\beta=0$ (i.e., FSMLP), (c) $\beta=20$, and (d) $\beta=50$.

Figure 2 shows that, even with increasing stress on the structure-preserving regularizer $E_{struct}$, our proposed band selection method FSMLPstruct is able to select bands that maintain a good land cover classification performance.

4 Conclusion and Discussions

To the best of our knowledge, a feature selection method that simultaneously cares about class discrimination and structure preservation is not available in the literature. In this study, we have tried to bridge this gap by proposing a neural network-based feature selection method that focuses both on class discrimination and structure preservation. To train the proposed system, we use Sammon's stress as a regularizer added to the classification loss. For datasets having a large number of instances, the computational overhead associated with Sammon's stress is very high. Consequently, as the structure-preserving regularizer, we use Sammon's stress computed on a sample of the original data (using dynamic sampling at each iteration during the adaptive gradient descent based learning). Using this regularizer in the experiments with datasets having a large number of instances, we have demonstrated that it is an effective and computationally efficient implementation of the Sammon's stress based structure-preserving regularizer. Our proposed feature selection scheme is generic, so we have investigated its effectiveness on datasets commonly used for assessing classifiers as well as on a specialized case: band selection in hyperspectral images (HSI). We have applied the feature selection scheme to five real-world datasets that are commonly used for assessing classification. In the context of band selection, we have applied our method to three well-known HSI datasets and compared its performance with three other band selection methods. Based on our experiments, we conclude that the proposed feature selection method produces reasonably good classification and clustering scores on the majority of the data sets, indicating that it is capable of selecting a subset of features that is good for both classification and clustering. Our scheme provides a mechanism to control the number of selected features. The proposed method is easily extendable to other networks such as the Radial Basis Function (RBF) network.

References

  • [1] R. Zhang, F. Nie, X. Li, X. Wei, Feature selection with multi-view data: A survey, Information Fusion 50 (2019) 158–167.
  • [2] S. Solorio-Fernández, J. A. Carrasco-Ochoa, J. F. Martínez-Trinidad, A review of unsupervised feature selection methods, Artificial Intelligence Review 53 (2) (2020) 907–948.
  • [3] Z. Atashgahi, X. Zhang, N. Kichler, S. Liu, L. Yin, M. Pechenizkiy, R. Veldhuis, D. C. Mocanu, Supervised feature selection with neuron evolution in sparse neural networks, arXiv preprint arXiv:2303.07200.
  • [4] R. Jiao, B. Xue, M. Zhang, Solving multi-objective feature selection problems in classification via problem reformulation and duplication handling, IEEE Transactions on Evolutionary Computation.
  • [5] S. An, M. Zhang, C. Wang, W. Ding, Robust fuzzy rough approximations with knn granules for semi-supervised feature selection, Fuzzy Sets and Systems (2023) 108476.
  • [6] Z. Zeng, X. Wang, F. Yan, Y. Chen, Local adaptive learning for semi-supervised feature selection with group sparsity, Knowledge-Based Systems 181 (2019) 104787.
  • [7] G. Chandrashekar, F. Sahin, A survey on feature selection methods, Computers & Electrical Engineering 40 (1) (2014) 16–28.
  • [8] S. Wang, W. Pedrycz, Q. Zhu, W. Zhu, Unsupervised feature selection via maximum projection and minimum redundancy, Knowledge-Based Systems 75 (2015) 19–29.
  • [9] L. Du, Y.-D. Shen, Unsupervised feature selection with adaptive structure learning, in: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 209–218.
  • [10] J.-S. Wu, M.-X. Song, W. Min, J.-H. Lai, W.-S. Zheng, Joint adaptive manifold and embedding learning for unsupervised feature selection, Pattern Recognition 112 (2021) 107742.
  • [11] Z. Li, J. Liu, Y. Yang, X. Zhou, H. Lu, Clustering-guided sparse structural learning for unsupervised feature selection, IEEE Transactions on Knowledge and Data Engineering 26 (9) (2013) 2138–2150.
  • [12] C. Hou, F. Nie, X. Li, D. Yi, Y. Wu, Joint embedding learning and sparse regression: A framework for unsupervised feature selection, IEEE transactions on cybernetics 44 (6) (2013) 793–804.
  • [13] Z. Zhao, L. Wang, H. Liu, J. Ye, On similarity preserving feature selection, IEEE Transactions on Knowledge and Data Engineering 25 (3) (2011) 619–632.
  • [14] W. Gao, L. Hu, Y. Li, P. Zhang, Preserving similarity and staring decisis for feature selection, IEEE Transactions on Artificial Intelligence 2 (6) (2021) 584–593.
  • [15] Q. Gu, Z. Li, J. Han, Generalized fisher score for feature selection, arXiv preprint arXiv:1202.3725.
  • [16] Q. Song, H. Jiang, J. Liu, Feature selection based on fda and f-score for multi-class classification, Expert Systems with Applications 81 (2017) 22–27.
  • [17] X. He, D. Cai, P. Niyogi, Laplacian score for feature selection, Advances in neural information processing systems 18.
  • [18] H. Peng, F. Long, C. Ding, Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on pattern analysis and machine intelligence 27 (8) (2005) 1226–1238.
  • [19] J. R. Vergara, P. A. Estévez, A review of feature selection methods based on mutual information, Neural computing and applications 24 (2014) 175–186.
  • [20] P. A. Estévez, M. Tesmer, C. A. Perez, J. M. Zurada, Normalized mutual information feature selection, IEEE Transactions on neural networks 20 (2) (2009) 189–201.
  • [21] M. Tesmer, P. A. Estévez, Amifs: Adaptive feature selection by using mutual information, in: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No. 04CH37541), Vol. 1, IEEE, 2004, pp. 303–308.
  • [22] J. R. Vergara, P. A. Estévez, Cmim-2: an enhanced conditional mutual information maximization criterion for feature selection, Journal of Applied Computer Science Methods 2 (1) (2010) 5–20.
  • [23] R. J. Urbanowicz, M. Meeker, W. La Cava, R. S. Olson, J. H. Moore, Relief-based feature selection: Introduction and review, Journal of biomedical informatics 85 (2018) 189–203.
  • [24] J. Wang, J. Wei, Z. Yang, Supervised feature selection by preserving class correlation, in: Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, 2016, pp. 1613–1622.
  • [25] T. M. Le, T. M. Vo, T. N. Pham, S. V. T. Dao, A novel wrapper–based feature selection for early diabetes prediction enhanced with a metaheuristic, IEEE Access 9 (2020) 7869–7884.
  • [26] N. Zhou, Y. Xu, H. Cheng, J. Fang, W. Pedrycz, Global and local structure preserving sparse subspace learning: An iterative approach to unsupervised feature selection, Pattern Recognition 53 (2016) 87–101.
  • [27] Y. Zhu, X. Zhang, R. Hu, G. Wen, Adaptive structure learning for low-rank supervised feature selection, Pattern Recognition Letters 109 (2018) 89–96.
  • [28] X. Liu, L. Wang, J. Zhang, J. Yin, H. Liu, Global and local structure preservation for feature selection, IEEE transactions on neural networks and learning systems 25 (6) (2013) 1083–1095.
  • [29] X. Zhu, S. Zhang, R. Hu, Y. Zhu, et al., Local and global structure preservation for robust unsupervised spectral feature selection, IEEE Transactions on Knowledge and Data Engineering 30 (3) (2017) 517–529.
  • [30] Q. Ye, X. Zhang, Y. Sun, Dual global structure preservation based supervised feature selection, Neural Processing Letters 51 (2020) 2765–2787.
  • [31] D. Chakraborty, N. R. Pal, A neuro-fuzzy scheme for simultaneous feature selection and fuzzy rule-based classification, IEEE Transactions on Neural Networks 15 (1) (2004) 110–123.
  • [32] Y.-C. Chen, N. R. Pal, I.-F. Chung, An integrated mechanism for feature selection and fuzzy rule extraction for classification, IEEE Transactions on Fuzzy Systems 20 (4) (2012) 683–698.
  • [33] G. Xue, Q. Chang, J. Wang, K. Zhang, N. R. Pal, An adaptive neuro-fuzzy system with integrated feature selection and rule extraction for high-dimensional classification problems, IEEE Transactions on Fuzzy Systems.
  • [34] R. Setiono, H. Liu, Neural-network feature selector, IEEE transactions on neural networks 8 (3) (1997) 654–662.
  • [35] G. Castellano, A. M. Fanelli, Variable selection using neural-network models, Neurocomputing 31 (1-4) (2000) 1–13.
  • [36] A. Verikas, M. Bacauskiene, Feature selection with neural networks, Pattern recognition letters 23 (11) (2002) 1323–1335.
  • [37] G. P. Zhang, Neural networks for classification: a survey, IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews) 30 (4) (2000) 451–462.
  • [38] S. Dreiseitl, L. Ohno-Machado, Logistic regression and artificial neural network classification models: a methodology review, Journal of biomedical informatics 35 (5-6) (2002) 352–359.
  • [39] S. Agrawal, J. Agrawal, Neural network techniques for cancer prediction: A survey, Procedia Computer Science 60 (2015) 769–774.
  • [40] A. Mabrouk, R. P. D. Redondo, M. Kayed, Deep learning-based sentiment classification: a comparative survey, IEEE Access 8 (2020) 85616–85638.
  • [41] D. Chakraborty, N. R. Pal, Selecting useful groups of features in a connectionist framework, IEEE transactions on neural networks 19 (3) (2008) 381–396.
  • [42] R. Chakraborty, N. R. Pal, Feature selection using a neural framework with controlled redundancy, IEEE transactions on neural networks and learning systems 26 (1) (2014) 35–50.
  • [43] R. Chakraborty, C.-T. Lin, N. R. Pal, Sensor (group feature) selection with controlled redundancy in a connectionist framework, International journal of neural systems 24 (06) (2014) 1450021.
  • [44] Z. Zhang, M. R. Sabuncu, Generalized cross entropy loss for training deep neural networks with noisy labels, in: 32nd Conference on Neural Information Processing Systems (NeurIPS), 2018.
  • [45] J. W. Sammon, A nonlinear mapping for data structure analysis, IEEE Transactions on computers 100 (5) (1969) 401–409.
  • [46] D. Dua, C. Graff, UCI machine learning repository (2017).
    URL http://archive.ics.uci.edu/ml
  • [47] J. Li, K. Cheng, S. Wang, F. Morstatter, R. P. Trevino, J. Tang, H. Liu, Feature selection: A data perspective, ACM Computing Surveys (CSUR) 50 (6) (2018) 94.
  • [48] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, et al., Tensorflow: A system for large-scale machine learning, in: 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), 2016, pp. 265–283.
  • [49] M. Prasad, A. Sowmya, I. Koch, Efficient feature selection based on independent component analysis, in: Proceedings of the 2004 Intelligent Sensors, Sensor Networks and Information Processing Conference, 2004., IEEE, 2004, pp. 427–432.
  • [50] D. Steinley, Properties of the Hubert-Arabie adjusted Rand index, Psychological Methods 9 (3) (2004) 386.
  • [51] S. Fletcher, M. Z. Islam, et al., Comparing sets of patterns with the jaccard index, Australasian Journal of Information Systems 22.
  • [52] J. C. Bezdek, R. Ehrlich, W. Full, Fcm: The fuzzy c-means clustering algorithm, Computers & Geosciences 10 (2-3) (1984) 191–203.
  • [53] M. Graña, M. Veganzons, B. Ayerdi, Hyperspectral remote sensing scenes (06 2021).
    URL http://www.ehu.eus/ccwintco/index.php/Hyperspectral_Remote_Sensing_Scenes
  • [54] A. Santara, K. Mani, P. Hatwar, A. Singh, A. Garg, K. Padia, P. Mitra, Bass net: Band-adaptive spectral-spatial feature learning neural network for hyperspectral image classification, IEEE Transactions on Geoscience and Remote Sensing 55 (9) (2017) 5293–5301.
  • [55] A. Santara, K. Mani, A. Shukla, Deep learning for land-cover classification in hyperspectral images (07 2016).
    URL https://kgpml.github.io/Hyperspectral/
  • [56] N. V. Chawla, K. W. Bowyer, L. O. Hall, W. P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of artificial intelligence research 16 (2002) 321–357.