
Adaptive Collaborative Correlation Learning-based Semi-Supervised Multi-Label Feature Selection

Yanyong Huang, Li Yang, Dongjie Wang, Ke Li, Xiuwen Yi, Fengmao Lv and Tianrui Li. Yanyong Huang, Li Yang and Ke Li are with the Joint Laboratory of Data Science and Business Intelligence, School of Statistics, Southwestern University of Finance and Economics, Chengdu 611130, China (e-mail: [email protected]; [email protected]; [email protected]); Dongjie Wang is with the Department of Electrical Engineering and Computer Science, University of Kansas, Lawrence, KS 66045, USA (e-mail: [email protected]); Xiuwen Yi is with the JD Intelligent Cities Research and JD Intelligent Cities Business Unit, Beijing 100176, China (e-mail: [email protected]); Fengmao Lv and Tianrui Li are with the School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China (e-mail: [email protected]; [email protected]).
Abstract

Semi-supervised multi-label feature selection has recently been developed to address the curse of dimensionality in high-dimensional multi-label data in which some samples lack labels. Although many efforts have been made, most existing methods use a predefined graph approach to capture the sample similarity or the label correlation. In this manner, the presence of noise and outliers within the original feature space can undermine the reliability of the resulting sample similarity graph. Such approaches also fail to precisely depict the label correlation due to the existence of unknown labels. Besides, these methods only consider the discriminative power of the selected features while neglecting their redundancy. In this paper, we propose an Adaptive Collaborative Correlation lEarning-based Semi-Supervised Multi-label Feature Selection (Access-MFS) method to address these issues. Specifically, a generalized regression model equipped with an extended uncorrelated constraint is introduced to select discriminative yet uncorrelated features while maintaining consistency between predicted and ground-truth labels in the labeled data. Then, the instance correlation and label correlation are integrated into the proposed regression model to adaptively learn both the sample similarity graph and the label similarity graph, which mutually enhance feature selection performance. Extensive experimental results demonstrate the superiority of the proposed Access-MFS over other state-of-the-art methods.

Index Terms:
feature selection, semi-supervised multi-label learning, generalized regression model, adaptive similarity graph learning.

1 Introduction

Multi-label data, where each instance is associated with multiple labels, widely exists in real-world applications [1, 2]. For example, a short video can be classified into several categories in the task of video classification [3], a news article may be related to several subjects in text categorization [4], and a painting may be tagged with a number of genres in the image annotation task [5]. In these applications, multi-label data is often represented in a high-dimensional feature space with redundant and noisy features, resulting in the "curse of dimensionality" problem and impacting the performance of subsequent tasks. Consequently, developing an effective method to reduce the dimensionality of multi-label data and thereby enhance downstream task performance has emerged as a pressing issue in practical applications.

Multi-label feature selection (MFS) deals with this issue by choosing a compact subset of discriminative features from the original high-dimensional feature space, thus eliminating redundant and noisy information [6, 7]. In recent years, numerous MFS methods have been developed, which can be broadly categorized into two groups. The first group converts multi-label data into multiple independent single-label datasets and then applies traditional feature selection methods to the decomposed data [8, 9, 10]. Typical feature selection methods for single-label data include the Laplacian Score [9], spectral feature selection [11] and group lasso-based feature selection [12]. These methods usually assess the importance of features based on specific criteria while maintaining the local manifold structure of the data, and subsequently select the top-ranked features. However, they fail to consider the correlations among different labels, which could enhance the performance of feature selection. Instead of decomposing multi-label data into several single-label datasets, the second group of MFS methods directly constructs a model from multi-label data to facilitate feature selection [13, 14, 15, 16]. MIFS (multi-label informed feature selection) is a typical method in the second group, which incorporates multi-label information into a low-dimensional subspace to identify discriminative features [13]. Besides, Zhang et al. proposed using both local and global label correlations to guide the feature selection process [17]. Moreover, Fan et al. developed a collaborative learning framework for multi-label feature selection, which learns the label correlation and the feature correlation simultaneously [16]. Despite the demonstrated efficacy of these methods in feature selection, they share a common underlying assumption, i.e., that all data have been comprehensively labeled. Unfortunately, obtaining large amounts of labeled data is expensive and time-consuming in real-world applications [18]. A more common scenario is that only a portion of the samples are labeled [19, 20]. The aforementioned supervised methods cannot be directly applied in the semi-supervised scenario. To address this issue, Chang et al. proposed a unified semi-supervised multi-label feature selection framework through the integration of sparse feature learning with label prediction [21]. Guo et al. developed a multi-label propagation mechanism to learn unknown labels, which was then integrated into an extended linear discriminant analysis for feature selection [22]. Lv et al. combined adaptive graph structure learning and manifold learning to investigate the global structure of data and label correlations, and subsequently embedded them into the feature selection process [23]. Constructing or learning the similarity graph of samples is a promising technique in MFS [24]. However, current semi-supervised multi-label feature selection methods either fail to investigate the sample similarity graph or rely exclusively on predefined graph approaches. Noise and outliers in the original feature space can undermine the reliability of predefined sample similarity graphs. Additionally, these methods describe label correlations in a pre-computed manner, which can lead to inaccuracies and deficiencies, especially with large volumes of unlabeled data.
Existing methods that consider only the correlations between labels or samples overlook the potential for mutual enhancement, which can improve feature selection performance. Furthermore, these methods focus solely on the discriminative capability of selected features while ignoring their redundancy, ultimately degrading the performance of feature selection.

To address these problems, we present a novel MFS method for semi-supervised multi-label data, named Adaptive Collaborative Correlation lEarning-based Semi-Supervised Multi-label Feature Selection (Access-MFS). Specifically, Access-MFS first embeds the feature selection process into a generalized regression model equipped with an extended uncorrelated constraint, which enables the selection of discriminative yet uncorrelated features while preserving consistency between predicted and ground-truth labels in the labeled data. We then integrate the instance correlation and label correlation into the proposed regression model to adaptively learn the sample similarity graph and the label similarity graph. The adaptive collaborative correlation learning module preserves the sample similarity from the high-dimensional to the low-dimensional space, while simultaneously guaranteeing that similar samples are assigned similar labels and that strongly correlated labels lead to consistent predicted labels. It yields more accurate and reliable sample and label similarity graphs than those obtained through predefined graph methods. Finally, an iterative optimization algorithm is developed to solve our model. Extensive experimental results demonstrate that the proposed Access-MFS is superior to state-of-the-art methods.

To sum up, the main contributions of this paper are as follows: (i) To the best of our knowledge, we are the first to introduce adaptive collaborative correlation learning for feature selection on semi-supervised multi-label data. Our method can adaptively learn sample and label similarity graphs directly from the data, rather than relying on predefined graphs. The simultaneous learning of these two similarity graphs is integrated into the feature selection process, where they mutually enhance each other, leading to an improvement in feature selection performance. (ii) We propose a novel model for semi-supervised multi-label feature selection, incorporating a generalized regression module with an extended uncorrelated constraint and an adaptive collaborative correlation learning module. This model effectively selects features that are both discriminative and uncorrelated, while ensuring consistency between predicted labels and existing label data. (iii) We develop an efficient alternating optimization algorithm with guaranteed convergence to solve the proposed model. Comprehensive experiments are conducted on real-world datasets to verify its superiority over state-of-the-art methods.

The remainder of this paper is organized as follows. Section 2 briefly reviews the related work about MFS. We formulate the proposed method Access-MFS in Section 3 and provide an effective solution to this method in the following Section 4. Section 5 presents an analysis of the proposed algorithm’s convergence behavior and time complexity. Section 6 conducts a series of experiments on eight real-world datasets. Finally, we conclude this article in Section 7.

Figure 1: The framework of the proposed adaptive collaborative correlation learning-based semi-supervised multi-label feature selection method (Access-MFS).

2 Related Work

In this section, we briefly introduce some representative works on MFS. MFS methods can be classified into two categories based on whether they decompose the multi-label data.

In the first category, multi-label data is transformed into multiple single-label datasets, upon which single-label feature selection methods are subsequently applied. Single-label feature selection methods are commonly classified into three groups: filter-based, wrapper-based, and embedded methods. Filter-based methods select informative features based on their importance, as defined by a specific evaluation criterion. The Laplacian Score [9], a typical filter-based approach, identifies representative features by utilizing the Laplacian Score metric, which quantifies the locality-preserving ability of the features. This type of method does not take into account the efficacy of the selected features on a downstream learning task, resulting in unsatisfactory performance. Wrapper-based methods typically employ a learning algorithm to assess the selected features until the most suitable ones are identified. Laporte et al. introduced a wrapper-based feature selection method that learns the ranking of features using support vector machines with a sparse constraint [25]. The primary limitation of wrapper-based methods lies in their extensive time requirements and limited generalization capabilities. Embedded methods integrate feature selection into the model learning process, offering a middle ground between filter-based and wrapper-based methods. Liu et al. embedded the feature selection procedure into the locally linear embedding algorithm based on the $\ell_{1}$-norm reconstruction loss [26]. These single-label feature selection methods ignore label correlations, which results in inferior performance.

Unlike the methods in the first category, the second category explores label correlations and constructs the feature selection model directly on multi-label data. MIFS selects discriminative features by factorizing the label matrix to reveal label correlations and diminish the impact of imperfect label information [13]. In GLFS [17], LASSO-based sparse learning is used to select group-preserving label-specific features, enabling the simultaneous exploration of label-group and instance-group correlations. However, these methods assume that all samples are labeled, which may not hold true in real-world scenarios; more often, only a subset of samples is labeled. Multi-label feature selection within the semi-supervised learning paradigm has been proposed to address this issue. In LEDA [22], Guo et al. devised a semi-supervised multi-label propagation method to capture information from labeled and unlabeled data simultaneously. Subsequently, the acquired label information is incorporated into an extended linear discriminant analysis module to select informative features. To explore the local manifold structures of both the feature and label spaces in semi-supervised multi-label data, SMLFS [27] uses a local logistic regression model that incorporates feature graph regularization and label graph regularization. In addition, the $\ell_{2,p}$-norm is applied to the regression coefficient matrix to select important feature dimensions. In [28], Li et al. proposed a semi-supervised multi-label feature selection method that simultaneously utilizes label and sample correlations. By imposing binary hash constraints on the spectral embedding model, SFS-BLL [29] identifies pseudo-labels for unlabeled data and utilizes a self-weighted sparse regression model to select discriminative features, leveraging both existing labels and learned pseudo-labels.

However, the aforementioned semi-supervised multi-label feature selection methods are confronted with several issues. First, they describe the sample similarity graph in a predefined way, which is easily influenced by noise and outliers. Second, these methods pre-compute label correlations from the partially known labels and thus fail to accurately capture the relationships between labels, since a large number of labels remain unknown in a semi-supervised scenario. Third, existing methods fail to simultaneously consider the correlations between samples and between labels, overlooking how these two correlations can mutually enhance feature selection performance. Last, they only take into account the discriminative capability of the chosen features and overlook their redundancy, which limits the performance of feature selection.

3 Proposed method

3.1 Notations

In this paper, we denote matrices with boldface uppercase letters and vectors with boldface lowercase letters. For an arbitrary matrix $\mathbf{M}\in\mathbb{R}^{n\times p}$, $m_{ij}$ indicates the entry in the $i$th row and $j$th column of $\mathbf{M}$, and $\mathbf{m}_{i.}$ and $\mathbf{m}_{.j}$ denote its $i$th row and $j$th column, respectively. We use $\|\mathbf{M}\|_{F}=\sqrt{\sum_{i=1}^{n}\sum_{j=1}^{p}|m_{ij}|^{2}}$ to denote the Frobenius norm of $\mathbf{M}$. The $\ell_{2,1}$-norm of $\mathbf{M}$ is defined as $\|\mathbf{M}\|_{2,1}=\sum_{i=1}^{n}\sqrt{\sum_{j=1}^{p}m_{ij}^{2}}=\sum_{i=1}^{n}\|\mathbf{m}_{i.}\|_{2}$, where $\|\mathbf{m}_{i.}\|_{2}$ indicates the $\ell_{2}$-norm of the $i$th row vector $\mathbf{m}_{i.}$. Let $\operatorname{Tr}(\mathbf{M})$ and $\mathbf{M}^{T}$ denote the trace and the transpose of $\mathbf{M}$, respectively. Besides, $\mathbf{H}=\mathbf{I}-\frac{1}{n}\mathbf{1}_{n}\mathbf{1}_{n}^{T}$ is a centering matrix, where $\mathbf{I}$ is an identity matrix and $\mathbf{1}_{n}\in\mathbb{R}^{n\times 1}$ is a column vector of ones.

Let $\mathbf{X}=[\mathbf{X}_{l},\mathbf{X}_{u}]\in\mathbb{R}^{d\times n}$ be a given data matrix with $d$-dimensional features and $n$ instances, where $\mathbf{X}_{l}\in\mathbb{R}^{d\times n_{1}}$ is the labeled data, $\mathbf{X}_{u}\in\mathbb{R}^{d\times n_{2}}$ is the unlabeled data, and $n_{1}+n_{2}=n$. The corresponding label matrix of $\mathbf{X}$ is denoted by $\mathbf{Y}=\left[\begin{array}{c}\mathbf{Y}_{l}\\ \mathbf{Y}_{u}\end{array}\right]\in\mathbb{R}^{n\times c}$ with $c$ categories, where $\mathbf{Y}_{l}\in\mathbb{R}^{n_{1}\times c}$ and $\mathbf{Y}_{u}\in\mathbb{R}^{n_{2}\times c}$ denote the labels of $\mathbf{X}_{l}$ and $\mathbf{X}_{u}$, respectively. In $\mathbf{Y}_{l}$, $y_{ij}=1$ indicates that the $i$th instance is labeled with the $j$th class; otherwise it is 0. The label matrix $\mathbf{Y}_{u}$ of the unlabeled data is set to 0. In semi-supervised multi-label feature selection, our aim is to select $k$ informative features from the given dataset, where $k$ is much smaller than the total number of features. The framework of Access-MFS is shown in Fig. 1.
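To make the notation concrete, the following minimal Python sketch (with hypothetical synthetic shapes, not data from the paper) lays out the semi-supervised quantities defined above: the column-wise data matrix X = [X_l, X_u], the label matrix Y whose unlabeled block Y_u is set to 0, and an indicator of which samples are labeled.

```python
import numpy as np

# Hypothetical shapes for illustration only; not the paper's datasets.
rng = np.random.default_rng(0)
d, n1, n2, c = 20, 30, 70, 4            # features, labeled, unlabeled, classes
n = n1 + n2

X = rng.standard_normal((d, n))          # X = [X_l, X_u], features stored column-wise
Y = np.zeros((n, c))                     # Y = [Y_l; Y_u]
Y[:n1] = (rng.random((n1, c)) < 0.3).astype(float)  # known binary labels for X_l
                                         # Y_u stays 0, as stated in Section 3.1
labeled_mask = np.zeros(n, dtype=bool)
labeled_mask[:n1] = True                 # used later to form the diagonal matrix U
```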

3.2 Formulation of Access-MFS

In order to learn a mapping from samples with high-dimensional features to the desired information (such as classification labels or latent features), the least squares regression model has been extensively applied in various machine learning tasks, including classification, regression and dimension reduction [30, 31]. Given the input dataset $\mathcal{D}=\{(\mathbf{X},\mathbf{Y})\}$, where $\mathbf{X}\in\mathbb{R}^{d\times n}$ and $\mathbf{Y}\in\mathbb{R}^{n\times c}$ are the data matrix and the corresponding label matrix, respectively, the traditional least squares regression model is defined as follows:

$$\min_{\mathbf{W},\mathbf{b}}\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{Y}\|_{F}^{2}+\lambda\Omega(\|\mathbf{W}\|), \tag{1}$$

where $\mathbf{W}\in\mathbb{R}^{d\times c}$ is a projection matrix, $\Omega(\|\mathbf{W}\|)$ indicates a certain regularization on $\mathbf{W}$, and $\lambda$ is a regularization parameter. However, the abovementioned model cannot be directly applied to semi-supervised multi-label feature selection: Eq. (1) cannot be solved when only a subset of the labels in the label matrix $\mathbf{Y}$ is known. Furthermore, to select discriminative features and avoid the trivial solution, a column full-rank constraint is typically imposed on $\mathbf{W}$ in Eq. (1), which disregards the redundancy of the selected features. To deal with these two issues, we propose a generalized regression model with an extended uncorrelated constraint for semi-supervised multi-label feature selection, which is formulated as follows:

$$\min_{\mathbf{W},\mathbf{b},\mathbf{F}}\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}\quad\text{s.t. }\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}, \tag{2}$$

where $\mathbf{F}=\left[\begin{array}{c}\mathbf{F}_{l}\\ \mathbf{F}_{u}\end{array}\right]\in\mathbb{R}^{n\times c}$ is the predicted label matrix and $\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}$ is an extended uncorrelated constraint. In Eq. (2), $\mathbf{F}$ comprises the predicted labels $\mathbf{F}_{l}$ for the labeled data $\mathbf{X}_{l}$ and $\mathbf{F}_{u}$ for the unlabeled data $\mathbf{X}_{u}$. We treat the predicted label matrix $\mathbf{F}$ as an optimization variable in Eq. (2) while ensuring that the predicted labels $\mathbf{F}_{l}$ remain consistent with the ground-truth labels $\mathbf{Y}_{l}$ of $\mathbf{X}_{l}$. Besides, the $\ell_{2,1}$-norm is applied to $\mathbf{W}$ to induce row sparsity, facilitating feature selection. We can then select the top $k$ features by computing the $\ell_{2}$-norm of each row of $\mathbf{W}$ and sorting the resulting scores in descending order.
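As a small illustration of this selection rule (a sketch, not the authors' code), the feature ranking induced by $\mathbf{W}$ can be computed as follows:

```python
import numpy as np

def select_top_k_features(W: np.ndarray, k: int) -> np.ndarray:
    """Rank features by the l2-norm of the rows of the projection matrix W (d x c)
    and return the indices of the top-k features, as described for Eq. (2)."""
    scores = np.linalg.norm(W, axis=1)   # ||w_i.||_2 for each feature i
    return np.argsort(scores)[::-1][:k]  # sort in descending order, keep top k
```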

To select discriminative and uncorrelated features, we propose an extended uncorrelated constraint on $\mathbf{W}$, denoted as $\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}$. Here, $\mathbf{R}$ is defined as $\mathbf{X}\mathbf{H}\mathbf{X}^{T}+\lambda\mathbf{D}+\theta\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}$, where the matrix $\mathbf{D}\in\mathbb{R}^{d\times d}$ is diagonal, with the $i$th diagonal entry given by $\mathbf{D}(i,i)=\frac{1}{2\sqrt{\|\mathbf{w}_{i.}\|_{2}^{2}+\epsilon}}$ ($\epsilon$ is a small constant to avoid division by zero). Additionally, $\mathbf{L}_{s}=\mathbf{D}_{s}-\mathbf{S}$ is the Laplacian matrix of the instance similarity matrix $\mathbf{S}$, $\mathbf{D}_{s}$ is the degree matrix of $\mathbf{S}$, and $\theta$ is a trade-off parameter. The proposed uncorrelated constraint consists of three terms obtained by expanding $\mathbf{R}$. The first term promotes the orthonormality of the projected dimensions, which makes the low-dimensional instances uncorrelated. The second term is added to avoid the singularity of $\mathbf{X}\mathbf{H}\mathbf{X}^{T}$. The third term helps reduce the variance among samples within the same neighborhood under the graph structure. When $\lambda=\theta=0$, the proposed constraint reduces to the traditional uncorrelated constraint. This constraint is beneficial for selecting discriminative features that are also uncorrelated.
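The matrix $\mathbf{R}$ in the extended uncorrelated constraint can be assembled directly from its definition; the sketch below (assuming $\mathbf{X}$, $\mathbf{S}$ and the current $\mathbf{W}$ are given, with lam and theta standing for $\lambda$ and $\theta$) is one way to do so:

```python
import numpy as np

def build_R(X, S, W, lam, theta, eps=1e-8):
    """Sketch of R = X H X^T + lambda * D + theta * X L_s X^T from Section 3.2."""
    d, n = X.shape
    H = np.eye(n) - np.ones((n, n)) / n                              # centering matrix H
    D = np.diag(1.0 / (2.0 * np.sqrt(np.sum(W**2, axis=1) + eps)))   # diagonal D(i, i)
    L_s = np.diag(S.sum(axis=1)) - S                                 # Laplacian of S
    return X @ H @ X.T + lam * D + theta * X @ L_s @ X.T
```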

For semi-supervised multi-label feature selection [22, 23, 16, 28], the sample similarity structure and the label correlations serve as two crucial priors. Most existing methods use a predefined graph approach to capture the sample similarity or the label correlations. However, the presence of noise and outliers within the original feature space can undermine the reliability of the resulting graph. Besides, in semi-supervised settings where some labels are unknown, this approach cannot fully ascertain the label correlations. In this paper, we maintain the sample similarity structure and investigate label correlations by adaptively learning both the instance similarity graph and the label similarity graph. Guided by two fundamental assumptions that similar instances tend to have similar labels, and that samples close to each other in high-dimensional space should also be close in low-dimensional space, we formulate the instance similarity graph learning as follows:

$$\min_{\mathbf{S}}\frac{1}{2}\sum_{i,j=1}^{n}\|\mathbf{f}_{i.}-\mathbf{f}_{j.}\|_{2}^{2}s_{ij}+\frac{1}{2}\sum_{i,j=1}^{n}\|\mathbf{W}^{T}\mathbf{x}_{.i}-\mathbf{W}^{T}\mathbf{x}_{.j}\|_{2}^{2}s_{ij}+\alpha\|\mathbf{S}\|_{F}^{2}\quad\text{s.t. }s_{ii}=0,\ s_{ij}\geq 0,\ \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n}, \tag{3}$$

where $\mathbf{S}$ is the instance similarity matrix, $\mathbf{f}_{i.}$ in the predicted label matrix $\mathbf{F}$ denotes the label of the $i$th instance $\mathbf{x}_{.i}$, and $\alpha$ is a regularization parameter that controls the sparsity of $\mathbf{S}$. The first two terms of Eq. (3) learn the instance similarity matrix from the perspectives of the label space and the feature space, respectively. The third term serves as a regularization to prevent the trivial solution in which each row of $\mathbf{S}$ contains a single element equal to 1 while all other elements are 0. By defining the Laplacian matrix $\mathbf{L}_{s}=\mathbf{D}_{s}-\mathbf{S}$, where $\mathbf{D}_{s}$ is a diagonal matrix whose $i$th diagonal element equals the sum of the $i$th row of $\mathbf{S}$, we can simplify Eq. (3) as follows:

$$\min_{\mathbf{S}}\operatorname{Tr}(\mathbf{F}^{T}\mathbf{L}_{s}\mathbf{F})+\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})+\alpha\|\mathbf{S}\|_{F}^{2}\quad\text{s.t. }s_{ii}=0,\ s_{ij}\geq 0,\ \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n}. \tag{4}$$

Furthermore, in order to explore label correlations, we operate under the assumption that a stronger positive correlation between two labels leads to greater similarity in the predicted labels. This assumption guides the learning of the label similarity graph, which can be expressed as follows:

$$\min_{\mathbf{P}}\frac{1}{2}\sum_{i,j=1}^{c}\|\mathbf{f}_{.i}-\mathbf{f}_{.j}\|_{2}^{2}p_{ij}+\beta\|\mathbf{P}\|_{F}^{2}\quad\text{s.t. }p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c}, \tag{5}$$

where $\mathbf{P}\in\mathbb{R}^{c\times c}$ denotes the label similarity matrix, $\mathbf{f}_{.i}$ is the $i$th label vector corresponding to the $i$th column of $\mathbf{F}$, and $\beta$ is a regularization parameter. From Eq. (5), it is evident that a larger $p_{ij}$ indicates greater similarity between $\mathbf{f}_{.i}$ and $\mathbf{f}_{.j}$, and vice versa. We define the Laplacian matrix $\mathbf{L}_{p}=\mathbf{D}_{p}-\mathbf{P}$, where $\mathbf{D}_{p}$ is a diagonal matrix whose $i$th diagonal entry is $\sum_{j=1}^{c}p_{ij}$. Then, Eq. (5) can be rewritten as

$$\min_{\mathbf{P}}\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T})+\beta\|\mathbf{P}\|_{F}^{2}\quad\text{s.t. }p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c}. \tag{6}$$

By jointly utilizing Eqs. (4) and (6), we are able to collaboratively learn both the sample similarity matrix and the label similarity matrix in an adaptive manner, rather than relying on predefined methods. This helps enhance the performance of feature selection.

By combining Eqs. (2), (4) and (6), the proposed semi-supervised multi-label feature selection method Access-MFS is summarized as follows:

$$\begin{aligned}\min_{\Xi}\ &\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}\\
&+\theta\big(\operatorname{Tr}(\mathbf{F}^{T}\mathbf{L}_{s}\mathbf{F})+\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})+\alpha\|\mathbf{S}\|_{F}^{2}\big)+\mu\big(\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T})+\beta\|\mathbf{P}\|_{F}^{2}\big)\\
\text{s.t. }&\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I},\ s_{ii}=0,\ s_{ij}\geq 0,\ \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n},\ p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c},\end{aligned} \tag{7}$$

where $\Xi=\{\mathbf{W},\mathbf{b},\mathbf{F},\mathbf{S},\mathbf{P}\}$, and $\theta$ and $\mu$ are two trade-off hyper-parameters. The two regularization parameters $\alpha$ and $\beta$ can be automatically determined during the optimization process.

In our Access-MFS method, we integrate feature selection, the collaborative learning of the instance similarity graph and the label similarity graph, and the label learning into a unified framework, enabling these different learning tasks to promote each other. Access-MFS can offer two key benefits: First, the proposed extended regression model not only maintains consistency between the predicted labels and the ground-truth of the labeled data but also selects discriminative and uncorrelated features. Second, Access-MFS is capable of adaptively learning the instance and label similarity graphs that are more reliable than those generated by predefined methods. This collaborative learning of the two similarity-induced graphs can preserve the local structures of the feature space and label space concurrently.

4 Optimization and Algorithm

There are five variables that need to be optimized in Eq. (7). Since these variables are interrelated, it is difficult to solve for them simultaneously. To address this, an alternating iterative algorithm is proposed to solve the optimization problem by optimizing one variable at a time while keeping the others fixed.

Since there is no constraint on the variable $\mathbf{b}$, we can solve for it by setting the first-order derivative of the objective function in Eq. (7) w.r.t. $\mathbf{b}$ to 0, in accordance with the Karush-Kuhn-Tucker (KKT) conditions [32]. The optimal solution is easily obtained as $\mathbf{b}=\frac{1}{n}(\mathbf{F}^{T}\mathbf{1}_{n}-\mathbf{W}^{T}\mathbf{X}\mathbf{1}_{n})$. By substituting this solution for $\mathbf{b}$, Eq. (7) can be reformulated as follows:

$$\begin{aligned}\min_{\mathbf{W},\mathbf{F},\mathbf{S},\mathbf{P}}\ &\|\mathbf{H}\mathbf{X}^{T}\mathbf{W}-\mathbf{H}\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}\\
&+\theta\big(\operatorname{Tr}(\mathbf{F}^{T}\mathbf{L}_{s}\mathbf{F})+\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})+\alpha\|\mathbf{S}\|_{F}^{2}\big)+\mu\big(\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T})+\beta\|\mathbf{P}\|_{F}^{2}\big)\\
\text{s.t. }&\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I},\ s_{ii}=0,\ s_{ij}\geq 0,\ \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n},\ p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c}.\end{aligned} \tag{8}$$

In the following, we describe the process for updating the variables $\mathbf{W}$, $\mathbf{F}$, $\mathbf{S}$ and $\mathbf{P}$ in Eq. (8).
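For concreteness, the closed-form bias used in this reformulation can be computed as in the short sketch below (variable names are illustrative):

```python
import numpy as np

def optimal_bias(X, W, F):
    """Closed form obtained by zeroing the derivative w.r.t. b:
    b = (1/n) * (F^T 1_n - W^T X 1_n)."""
    n = X.shape[1]
    ones = np.ones(n)
    return (F.T @ ones - W.T @ X @ ones) / n
```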

4.1 Update $\mathbf{W}$ by Fixing Other Variables

When the other variables are fixed, we can disregard the terms in Eq. (8) that do not involve $\mathbf{W}$. Then, $\mathbf{W}$ can be obtained by solving the following problem:

$$\min_{\mathbf{W}}\|\mathbf{H}\mathbf{X}^{T}\mathbf{W}-\mathbf{H}\mathbf{F}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}+\theta\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})\quad\text{s.t. }\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}. \tag{9}$$

According to [33], we have $\frac{\partial\|\mathbf{W}\|_{2,1}}{\partial\mathbf{W}}=2\mathbf{D}\mathbf{W}$. Then, Eq. (9) can be transformed into the following equivalent form:

$$\min_{\mathbf{W}}\|\mathbf{H}\mathbf{X}^{T}\mathbf{W}-\mathbf{H}\mathbf{F}\|_{F}^{2}+\lambda\operatorname{Tr}(\mathbf{W}^{T}\mathbf{D}\mathbf{W})+\theta\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})\quad\text{s.t. }\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}. \tag{10}$$

Based on matrix calculus, Eq. (10) can be further derived as follows:

$$\begin{aligned}&\min_{\mathbf{W}}\|\mathbf{H}\mathbf{X}^{T}\mathbf{W}-\mathbf{H}\mathbf{F}\|_{F}^{2}+\lambda\operatorname{Tr}(\mathbf{W}^{T}\mathbf{D}\mathbf{W})+\theta\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})\\
\Leftrightarrow\ &\min_{\mathbf{W}}\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{H}\mathbf{X}^{T}\mathbf{W}-2\mathbf{W}^{T}\mathbf{X}\mathbf{H}\mathbf{F}+\mathbf{F}^{T}\mathbf{H}\mathbf{F})+\lambda\operatorname{Tr}(\mathbf{W}^{T}\mathbf{D}\mathbf{W})+\theta\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})\\
\Leftrightarrow\ &\min_{\mathbf{W}}\operatorname{Tr}(\mathbf{W}^{T}\mathbf{R}\mathbf{W})-2\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{H}\mathbf{F})+\operatorname{Tr}(\mathbf{F}^{T}\mathbf{H}\mathbf{F}).\end{aligned} \tag{11}$$

Since the constraint $\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}$ holds and $\mathbf{F}$ is fixed, Eq. (11) simplifies to the following problem:

$$\max_{\mathbf{W}}\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{H}\mathbf{F})\quad\text{s.t. }\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}. \tag{12}$$

Let $\mathbf{A}=\mathbf{R}^{\frac{1}{2}}\mathbf{W}$ and $\mathbf{B}=\mathbf{R}^{-\frac{1}{2}}\mathbf{X}\mathbf{H}\mathbf{F}$. Then, Eq. (12) is transformed into

$$\max_{\mathbf{A}}\operatorname{Tr}(\mathbf{A}^{T}\mathbf{B})\quad\text{s.t. }\mathbf{A}^{T}\mathbf{A}=\mathbf{I}. \tag{13}$$

According to [34], the optimal solution of Eq. (13) is $\mathbf{A}=\mathbf{V}_{l}\mathbf{V}_{r}^{T}$, where $\mathbf{V}_{l}$ and $\mathbf{V}_{r}$ denote the left and right singular matrices derived from the compact singular value decomposition (C-SVD) of $\mathbf{B}$, respectively. Hence, the optimal solution of Eq. (12) is

$$\mathbf{W}=\mathbf{R}^{-\frac{1}{2}}\mathbf{A}. \tag{14}$$

In Eq. (14), since $\mathbf{R}$ contains $\mathbf{D}$, which depends on $\mathbf{W}$, we update $\mathbf{W}$ and $\mathbf{D}$ alternately to obtain the final optimal solution. The detailed procedure for solving problem (12) is listed in Algorithm 1.

4.2 Update $\mathbf{F}$ by Fixing Other Variables

By ignoring the terms that involve only fixed variables, we can obtain $\mathbf{F}$ by solving the following problem:

$$\min_{\mathbf{F}}\|\mathbf{H}\mathbf{X}^{T}\mathbf{W}-\mathbf{H}\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\theta\operatorname{Tr}(\mathbf{F}^{T}\mathbf{L}_{s}\mathbf{F})+\mu\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T}). \tag{15}$$

Let $\mathbf{U}$ be a diagonal matrix whose $i$th diagonal entry $u_{ii}$ is 1 if the $i$th sample $\mathbf{x}_{.i}$ is labeled and 0 otherwise. Then, we have $\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}=\operatorname{Tr}((\mathbf{F}-\mathbf{Y})^{T}\mathbf{U}(\mathbf{F}-\mathbf{Y}))$. By following a deduction similar to that in Eq. (11), problem (15) can be transformed into the following equivalent trace form:

$$\min_{\mathbf{F}}\operatorname{Tr}(\mathbf{F}^{T}\mathbf{Q}\mathbf{F}-2\mathbf{F}^{T}\mathbf{C})+\mu\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T}), \tag{16}$$

where $\mathbf{Q}=\theta\mathbf{L}_{s}+\mathbf{H}+\mathbf{U}$ and $\mathbf{C}=\mathbf{H}\mathbf{X}^{T}\mathbf{W}+\mathbf{U}\mathbf{Y}$.

Taking the derivative of Eq. (16) w.r.t. $\mathbf{F}$ and setting it to 0, we obtain

$$\mathbf{Q}\mathbf{F}+\mu\mathbf{F}\mathbf{L}_{p}=\mathbf{C}. \tag{17}$$

The optimal solution $\mathbf{F}$ of Eq. (17) can be obtained using the algorithm proposed in [35].
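Note that Eq. (17) is a Sylvester equation in $\mathbf{F}$. The paper solves it with the algorithm of [35]; as a sketch, a generic Sylvester solver reaches the same solution:

```python
import numpy as np
from scipy.linalg import solve_sylvester

def update_F(Q, L_p, C, mu):
    """Solve Q F + mu * F L_p = C (Eq. (17)) for F.
    This uses SciPy's generic Sylvester solver in place of the
    algorithm of [35] referenced in the paper."""
    return solve_sylvester(Q, mu * L_p, C)
```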

Input: Data matrix $\mathbf{X}$, centering matrix $\mathbf{H}$, predicted label matrix $\mathbf{F}$, Laplacian matrix $\mathbf{L}_{s}$, and the parameters $\lambda$ and $\theta$.
Initialize: $\mathbf{D}=\mathbf{I}_{d}$.
while not converged do
    Compute $\mathbf{R}=\mathbf{X}\mathbf{H}\mathbf{X}^{T}+\lambda\mathbf{D}+\theta\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}$;
    Compute $\mathbf{B}=\mathbf{R}^{-\frac{1}{2}}\mathbf{X}\mathbf{H}\mathbf{F}$;
    Compute $\mathbf{V}_{l}\mathbf{\Sigma}\mathbf{V}_{r}^{T}=\mathbf{B}$ via the C-SVD of $\mathbf{B}$;
    Compute $\mathbf{A}=\mathbf{V}_{l}\mathbf{V}_{r}^{T}$;
    Update $\mathbf{W}=\mathbf{R}^{-\frac{1}{2}}\mathbf{A}$;
    Update $\mathbf{D}=\operatorname{diag}\Big(\frac{1}{2\sqrt{\|\mathbf{w}_{1.}\|_{2}^{2}+\epsilon}},\frac{1}{2\sqrt{\|\mathbf{w}_{2.}\|_{2}^{2}+\epsilon}},\cdots,\frac{1}{2\sqrt{\|\mathbf{w}_{d.}\|_{2}^{2}+\epsilon}}\Big).
end while
Output: The feature selection matrix $\mathbf{W}$.
Algorithm 1: Algorithm to solve problem (12)
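A compact Python rendering of Algorithm 1 is sketched below (illustrative only; epsilon, the convergence test and the PSD square root are implementation choices not fixed by the paper):

```python
import numpy as np

def inv_sqrt_psd(M, eps=1e-10):
    """R^{-1/2} for a symmetric positive semidefinite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(np.clip(vals, eps, None))) @ vecs.T

def solve_W(X, H, F, L_s, lam, theta, eps=1e-8, max_iter=30, tol=1e-6):
    """Sketch of Algorithm 1: alternately update W and D until convergence."""
    d, c = X.shape[0], F.shape[1]
    D = np.eye(d)                                        # initialize D = I_d
    W = np.zeros((d, c))
    for _ in range(max_iter):
        R = X @ H @ X.T + lam * D + theta * X @ L_s @ X.T
        R_half_inv = inv_sqrt_psd(R)
        B = R_half_inv @ X @ H @ F                       # B = R^{-1/2} X H F
        Vl, _, VrT = np.linalg.svd(B, full_matrices=False)  # compact SVD of B
        W_new = R_half_inv @ (Vl @ VrT)                  # W = R^{-1/2} V_l V_r^T
        if np.linalg.norm(W_new - W) < tol:
            return W_new
        W = W_new
        D = np.diag(1.0 / (2.0 * np.sqrt(np.sum(W**2, axis=1) + eps)))
    return W
```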

4.3 Update $\mathbf{S}$ by Fixing Other Variables

While $\mathbf{W}$, $\mathbf{F}$ and $\mathbf{P}$ are fixed, the optimization problem in Eq. (8) is equivalent to solving Eq. (3). By defining a matrix $\mathbf{M}=(m_{ij})_{n\times n}$, where $m_{ij}=\frac{1}{2}\|\mathbf{f}_{i.}-\mathbf{f}_{j.}\|_{2}^{2}+\frac{1}{2}\|\mathbf{W}^{T}\mathbf{x}_{.i}-\mathbf{W}^{T}\mathbf{x}_{.j}\|_{2}^{2}$, Eq. (3) is transformed into

$$\min_{\mathbf{S}}\sum_{i,j=1}^{n}m_{ij}s_{ij}+\alpha\|\mathbf{S}\|_{F}^{2}\quad\text{s.t. }s_{ii}=0,\ s_{ij}\geq 0,\ \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n}. \tag{18}$$

Note that Eq. (18) is independent for each row. Hence, for each $i$, we can rewrite Eq. (18) by completing the square as follows:

$$\min_{\mathbf{s}_{i.}}\frac{1}{2}\Big\|\mathbf{s}_{i.}+\frac{\mathbf{m}_{i.}}{2\alpha}\Big\|_{2}^{2}\quad\text{s.t. }s_{ii}=0,\ s_{ij}\geq 0,\ \mathbf{s}_{i.}\mathbf{1}_{n}=1. \tag{19}$$

Then, the Lagrangian function of Eq. (19) is written as:

$$\begin{aligned}\mathcal{L}(\mathbf{s}_{i.},\psi,\bm{\varphi})&=\frac{1}{2}\Big\|\mathbf{s}_{i.}+\frac{\mathbf{m}_{i.}}{2\alpha}\Big\|_{2}^{2}-\psi(\mathbf{s}_{i.}\mathbf{1}_{n}-1)-\bm{\varphi}^{T}\mathbf{s}_{i.}\\
&=\frac{1}{2}\sum_{j=1}^{n}\Big(s_{ij}+\frac{m_{ij}}{2\alpha}\Big)^{2}-\psi\Big(\sum_{j=1}^{n}s_{ij}-1\Big)-\sum_{j=1}^{n}\varphi_{j}s_{ij},\end{aligned} \tag{20}$$

where $\psi$ and $\bm{\varphi}=[\varphi_{1},\varphi_{2},\cdots,\varphi_{n}]^{T}$ are the Lagrangian multipliers. Based on the KKT conditions, we have $s_{ij}+\frac{m_{ij}}{2\alpha}-\psi-\varphi_{j}=0$ and $s_{ij}\varphi_{j}=0$. Then, we obtain $s_{ij}=\big(\psi-\frac{m_{ij}}{2\alpha}\big)_{+}$, where $(z)_{+}=\max(z,0)$.

In practice, each sample tends to be similar to its neighbours. Therefore, we require $\mathbf{s}_{i.}$ to have $k_{s}$ nonzero values, indicating that the $i$th sample is similar to its $k_{s}$ closest neighbours. Assuming that $m_{i1},m_{i2},\cdots,m_{in}$ are arranged in ascending order, we prefer $s_{ik_{s}}>0$ and $s_{i,k_{s}+1}=0$. As a result, we have $\alpha=\frac{m_{i,k_{s}+1}}{2\psi}$. Besides, according to the constraint $\mathbf{s}_{i.}\mathbf{1}_{n}=1$, we get $\psi=\frac{1}{k_{s}}+\frac{\sum_{j=1}^{k_{s}}m_{ij}}{2k_{s}\alpha}$. By substituting $\psi$ into the expression for $\alpha$, we can set the regularization parameter to $\alpha=\frac{k_{s}m_{i,k_{s}+1}-\sum_{j=1}^{k_{s}}m_{ij}}{2}$. The optimal solution $s_{ij}$ is then obtained as follows:

$$s_{ij}=\begin{cases}\dfrac{m_{i,k_{s}+1}-m_{ij}}{k_{s}m_{i,k_{s}+1}-\sum_{h=1}^{k_{s}}m_{ih}}, & j\leq k_{s};\\[2mm] 0, & j>k_{s}.\end{cases} \tag{21}$$
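The row-wise solution of Eq. (21) is cheap to implement; the sketch below (with hypothetical helper names, and with the diagonal excluded so that $s_{ii}=0$) keeps the $k_{s}$ nearest neighbours of each sample:

```python
import numpy as np

def pairwise_sq_dists(Z):
    """Squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z**2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def update_S(F, XtW, k_s):
    """Eq. (21): row-wise closed-form update of the instance similarity matrix S.
    F is n x c (predicted labels); XtW = X^T W holds the projected instances."""
    n = F.shape[0]
    M = 0.5 * pairwise_sq_dists(F) + 0.5 * pairwise_sq_dists(XtW)   # m_ij
    S = np.zeros((n, n))
    for i in range(n):
        m = M[i].copy()
        m[i] = np.inf                         # enforce s_ii = 0
        idx = np.argsort(m)                   # ascending m_ij
        top, nxt = idx[:k_s], m[idx[k_s]]     # k_s nearest and the (k_s+1)-th value
        denom = k_s * nxt - m[top].sum()
        S[i, top] = (nxt - m[top]) / max(denom, 1e-12)
    return S
```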

4.4 Update $\mathbf{P}$ by Fixing Other Variables

With the other variables fixed, $\mathbf{P}$ can be optimized by solving the following problem:

$$\min_{\mathbf{P}}\frac{1}{2}\sum_{i,j=1}^{c}\|\mathbf{f}_{.i}-\mathbf{f}_{.j}\|_{2}^{2}p_{ij}+\beta\sum_{i=1}^{c}\|\mathbf{p}_{i.}\|_{2}^{2}\quad\text{s.t. }p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c}. \tag{22}$$

Let $\mathbf{G}$ be a $c\times c$ matrix with its $(i,j)$th entry $g_{ij}=\frac{1}{2}\|\mathbf{f}_{.i}-\mathbf{f}_{.j}\|_{2}^{2}$. Then, Eq. (22) is transformed into

$$\min_{\mathbf{P}}\sum_{i,j=1}^{c}g_{ij}p_{ij}+\beta\sum_{i=1}^{c}\|\mathbf{p}_{i.}\|_{2}^{2}\quad\text{s.t. }p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c}. \tag{23}$$

Given that the optimization of each row of $\mathbf{P}$ is independent, Eq. (23) can be addressed by solving each subproblem individually as follows:

$$\min_{\mathbf{p}_{i.}}\frac{1}{2}\Big\|\mathbf{p}_{i.}+\frac{\mathbf{g}_{i.}}{2\beta}\Big\|_{2}^{2}\quad\text{s.t. }p_{ii}=0,\ p_{ij}\geq 0,\ \mathbf{p}_{i.}\mathbf{1}_{c}=1. \tag{24}$$

Following a procedure similar to that used for Eq. (19), we can determine the regularization parameter $\beta=\frac{k_{p}g_{i,k_{p}+1}-\sum_{j=1}^{k_{p}}g_{ij}}{2}$, where $k_{p}$ denotes the number of nonzero elements in $\mathbf{p}_{i.}$. Furthermore, the optimal solution $p_{ij}$ is formulated as follows:

$$p_{ij}=\begin{cases}\dfrac{g_{i,k_{p}+1}-g_{ij}}{k_{p}g_{i,k_{p}+1}-\sum_{h=1}^{k_{p}}g_{ih}}, & j\leq k_{p};\\[2mm] 0, & j>k_{p}.\end{cases} \tag{25}$$
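The update of $\mathbf{P}$ mirrors the update of $\mathbf{S}$, with the columns of $\mathbf{F}$ playing the role of the samples; a self-contained sketch (assuming $k_{p}<c-1$) is:

```python
import numpy as np

def update_P(F, k_p):
    """Eq. (25): row-wise closed-form update of the label similarity matrix P."""
    c = F.shape[1]
    sq = np.sum(F**2, axis=0)
    G = 0.5 * np.maximum(sq[:, None] + sq[None, :] - 2.0 * F.T @ F, 0.0)  # g_ij
    P = np.zeros((c, c))
    for i in range(c):
        g = G[i].copy()
        g[i] = np.inf                         # enforce p_ii = 0
        idx = np.argsort(g)
        top, nxt = idx[:k_p], g[idx[k_p]]     # assumes k_p < c - 1
        P[i, top] = (nxt - g[top]) / max(k_p * nxt - g[top].sum(), 1e-12)
    return P
```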

Up to this point, we have obtained the updated expressions for $\mathbf{W}$, $\mathbf{F}$, $\mathbf{S}$ and $\mathbf{P}$. The detailed optimization procedure of Access-MFS is summarized in Algorithm 2.

Input:
    1. The data matrix $\mathbf{X}=[\mathbf{X}_{l},\mathbf{X}_{u}]\in\mathbb{R}^{d\times n}$ and the centering matrix $\mathbf{H}$;
    2. The label matrix $\mathbf{Y}=\left[\begin{array}{c}\mathbf{Y}_{l}\\ \mathbf{0}\end{array}\right]\in\mathbb{R}^{n\times c}$;
    3. The trade-off parameters $\lambda$, $\theta$ and $\mu$.
Initialization: $\mathbf{F}=\mathbf{Y}$; randomly initialize $\mathbf{W}$; initialize $\mathbf{S}$ and $\mathbf{P}$ by Eq. (21) and Eq. (25), respectively.
while not converged do
    Update $\mathbf{W}$ by using Algorithm 1;
    Update $\mathbf{F}$ by solving Eq. (17);
    Update $\mathbf{S}$ via Eq. (21);
    Update $\mathbf{P}$ via Eq. (25);
end while
Output: Sort the scores of all features, computed as the $\ell_{2}$-norm of the rows of $\mathbf{W}$, in descending order and select the top $k$ ranked features.
Algorithm 2: Access-MFS
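To show how the four updates fit together, the following high-level sketch wires up the helper functions sketched in the previous subsections (solve_W, update_F, update_S, update_P); it is an illustrative composition under those assumptions, not the authors' MATLAB implementation:

```python
import numpy as np

def access_mfs(X, Y, labeled_mask, lam, theta, mu, k_s, k_p, n_iter=30, seed=0):
    """Sketch of Algorithm 2; returns feature scores given by the row norms of W."""
    d, n = X.shape
    c = Y.shape[1]
    H = np.eye(n) - np.ones((n, n)) / n
    U = np.diag(labeled_mask.astype(float))      # u_ii = 1 iff sample i is labeled
    F = Y.copy()                                 # initialize F = Y
    W = np.linalg.qr(np.random.default_rng(seed).standard_normal((d, c)))[0]
    S = update_S(F, X.T @ W, k_s)                # initialize S via Eq. (21)
    P = update_P(F, k_p)                         # initialize P via Eq. (25)
    for _ in range(n_iter):
        L_s = np.diag(S.sum(axis=1)) - S
        L_p = np.diag(P.sum(axis=1)) - P
        W = solve_W(X, H, F, L_s, lam, theta)    # Algorithm 1
        Q = theta * L_s + H + U
        C = H @ X.T @ W + U @ Y
        F = update_F(Q, L_p, C, mu)              # Eq. (17)
        S = update_S(F, X.T @ W, k_s)            # Eq. (21)
        P = update_P(F, k_p)                     # Eq. (25)
    return np.linalg.norm(W, axis=1)             # rank features by ||w_i.||_2
```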

5 Convergence and Complexity Analysis

In this section, we provide a theoretical analysis of the proposed algorithm Access-MFS, which includes both the convergence analysis and the complexity analysis.

5.1 Convergence Analysis

Since the objective function in Eq. (8) is not jointly convex w.r.t. the four variables $\mathbf{W}$, $\mathbf{F}$, $\mathbf{S}$ and $\mathbf{P}$, we divide it into four sub-objective functions, namely Eqs. (9), (15), (18), and (22), each of which is solved separately with the other variables fixed. Therefore, the convergence of Algorithm 2 can be demonstrated by proving that each sub-objective function decreases monotonically. We first present the following theorem to prove the monotonic decrease of the objective function in Eq. (9) when $\mathbf{W}$ is updated while $\mathbf{F}$, $\mathbf{S}$ and $\mathbf{P}$ are fixed.

Theorem 1.

The objective function in Eq. (9) decreases monotonically until convergence when $\mathbf{W}$ is updated using the rules outlined in Algorithm 2, with the other variables fixed.

We can prove this theorem based on the property in [33] and the result in [34]. Due to space constraints, the detailed proof of Theorem 1 is provided in the supplementary material.

Besides, as Eq. (15) is equivalent to Eq. (16), we only need to demonstrate that the objective function in Eq. (16) decreases monotonically during the update of $\mathbf{F}$. It is easy to verify that the matrix $\mathbf{Q}$ and the Laplacian matrix $\mathbf{L}_{p}$ are both positive semidefinite. Consequently, the terms $\operatorname{Tr}(\mathbf{F}^{T}\mathbf{Q}\mathbf{F}-2\mathbf{F}^{T}\mathbf{C})$ and $\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T})$ are convex in $\mathbf{F}$, and hence the objective function in Eq. (16) is convex. According to [35], the objective function in Eq. (16) decreases monotonically under the updating rules in Algorithm 2. Additionally, the objective functions in Eqs. (18) and (22) are convex w.r.t. $\mathbf{S}$ and $\mathbf{P}$ and possess closed-form solutions, ensuring the convergence of the updates for $\mathbf{S}$ and $\mathbf{P}$, respectively. Therefore, we can conclude that the objective function in Eq. (8) decreases monotonically with each iteration of Algorithm 2. Moreover, the convergence behavior of Algorithm 2 is further demonstrated through experiments.

5.2 Complexity Analysis

In Algorithm 2, four variables are updated alternately by solving Eqs. (9), (15), (18), and (22). The optimization of Eq. (9) is further transformed into solving Eq. (13). According to [34], the time complexity of updating $\mathbf{W}$ in Eq. (13) is $\mathcal{O}(d^{3})$. For updating $\mathbf{F}$, the algorithm proposed in [35] is applied to solve Eq. (15); based on [35], the computational complexity of updating $\mathbf{F}$ is $\mathcal{O}(n^{2}c+c^{2}n)$. Eq. (18) is solved by updating $\mathbf{S}$ row by row; according to the closed-form solution of $\mathbf{S}$ in Eq. (21), the time complexity of updating $\mathbf{S}$ is $\mathcal{O}(nk_{s}d)$. The optimization of $\mathbf{P}$ in Eq. (22) is similar to that of $\mathbf{S}$; in the same way, the time complexity of updating $\mathbf{P}$ is $\mathcal{O}(ck_{p}n)$. Therefore, the total computational complexity of Algorithm 2 is $\mathcal{O}((d^{3}+n^{2}c+c^{2}n+nk_{s}d+ck_{p}n)T)$, where $T$ is the number of iterations.

6 Experiments

In this section, we compare Access-MFS with several state-of-the-art semi-supervised multi-label feature selection methods. We conduct extensive experiments on eight real-world benchmark datasets to demonstrate the effectiveness of our approach.

6.1 Experimental Schemes

6.1.1 Datasets

In our experiments, we collected eight real-world benchmark multi-label datasets, each with distinct statistical characteristics, to evaluate the performance of the various feature selection methods. These datasets consist of the biology dataset VirusGO (http://www.uco.es/kdis/mllresources), two medicine datasets, Parkinson (https://github.com/XSilverBullet/parkinson-) and Medical (http://www.uco.es/kdis/mllresources), as well as five text datasets from the same repository: Langlog, Enron, Slashdot, Recreation and Computer. The detailed statistics of these multi-label datasets are summarized in Table I. Within this table, Label card. refers to the average number of labels per instance, while Label dens. reflects the label density, calculated as Label card. divided by the total number of labels.

TABLE I: A detailed description of the datasets.
Datasets Instances Features Labels Label card. Label dens.
VirusGO 207 749 6 1.2174 0.2029
Parkinson 401 91 5 1.2594 0.2519
Medical 978 1449 45 1.2454 0.0277
Langlog 1460 1004 75 1.1801 0.0157
Enron 1702 1001 53 3.3784 0.0637
Slashdot 3782 1079 22 1.1809 0.0537
Recreation 5000 606 22 1.4232 0.0647
Computer 5000 681 33 1.5082 0.0457

6.1.2 Comparison Methods

To evaluate the performance of the proposed method Access-MFS, we compare it with several state-of-the-art multi-label feature selection methods. This includes nine semi-supervised methods (namely SMILE, FSSRDM, SCFS, LEDA, LSMR, SMLFS, SFAM, SMDR-IC, and SFS-BLL), one supervised method (LFFS), and one unsupervised method (UAFS-BH). The following is a brief introduction to the methods being compared.

• All-Fea utilizes all original features for comparison.

• SMILE [36] uses label correlations to infer labels for unlabeled samples and embeds them into an adapted neighborhood graph model for feature selection.

• FSSRDM [37] selects features through an $\ell_{2,1}$-norm-based sparse regression model that incorporates maximizing the dependence between labels and features.

• SCFS [38] ensures consistency between the feature and label spaces by leveraging spectral analysis to learn sample similarity during feature selection.

• LEDA [22] integrates a multi-label propagation mechanism into an extended linear discriminant analysis for selecting informative features.

• LSMR [39] constructs a multi-label semi-supervised regression model for feature selection.

• SMLFS [27] incorporates the feature graph and label graph into a logistic regression model with $\ell_{2,p}$ regularization to learn the regression coefficient matrix.

• SFAM [23] combines adaptive graph structure learning and local manifold learning in a unified feature selection framework.

• SMDR-IC [28] selects important features by incorporating label and sample correlations into a sparse regression model equipped with $\ell_{1}$-norm and $\ell_{2,1}$-norm regularization.

• SFS-BLL [29] utilizes binary hash learning to generate pseudo labels and integrates them into a self-weighted sparse regression model for feature selection.

• LFFS [16] incorporates both global and local label correlation structural information into a sparse regression model to learn the feature selection matrix.

• UAFS-BH [40] integrates binary hash constraints into the spectral embedding model to facilitate the learning of weakly-supervised labels and the selection of features.

6.1.3 Implementation Details

Several parameters need to be set in Access-MFS and the compared methods. For Access-MFS, the objective function in Eq. (8) includes five parameters: $\lambda$, $\theta$, $\mu$, $\alpha$ and $\beta$. The regularization parameters $\alpha$ and $\beta$ are automatically determined while solving for $\mathbf{S}$ and $\mathbf{P}$ in Eqs. (21) and (25), respectively. The remaining three parameters $\lambda$, $\theta$ and $\mu$ are tuned using a grid search within the range $\{10^{-3},10^{-2},10^{-1},1,10\}$. For all compared methods, the parameters are tuned in accordance with the original papers to achieve optimal results. For each dataset, following the approach of existing semi-supervised feature selection methods [29, 41], we randomly select $\{10\%,20\%,30\%,40\%,50\%\}$ of the instances as the training set from the entire dataset, using the remainder as the testing set. Additionally, the test data also serve as unlabeled data for the semi-supervised multi-label feature selection methods. As determining the optimal number of selected features is still an open problem [42], we set the range for the number of selected features between 100 and 200 with an increment of 10 across all datasets (except for the Parkinson dataset, which contains only 91 features). For the Parkinson dataset, we vary the number of selected features from 5 to 55 with an increment of 5. Then, the Multi-Label K-Nearest Neighbor (MLKNN) classifier proposed in [43] is executed on the selected features. Four commonly used evaluation metrics, Average Precision (AP) [1], Macro-F1 (MaF) [1], Ranking Loss (RL) [1] and One Error (OE) [1], are employed to assess the performance of the different methods. For Average Precision and Macro-F1, a higher value indicates better performance, while for Ranking Loss and One Error, a lower value indicates better performance. All methods are implemented in MATLAB R2022b and executed on a desktop with an Intel Core i9-10900 CPU at 2.80 GHz and 64 GB RAM. The experiments are independently conducted five times and the average results are reported for comparison.
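For reference, two of the four ranking-based metrics have standard implementations; the sketch below uses scikit-learn and treats the paper's Average Precision as label-ranking average precision (the paper's experiments are in MATLAB, so this is purely illustrative):

```python
from sklearn.metrics import (label_ranking_average_precision_score,
                             label_ranking_loss)

def ranking_metrics(Y_true, Y_scores):
    """Average Precision (as label-ranking average precision) and Ranking Loss
    for binary ground-truth labels Y_true and real-valued scores Y_scores."""
    ap = label_ranking_average_precision_score(Y_true, Y_scores)
    rl = label_ranking_loss(Y_true, Y_scores)
    return ap, rl
```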

TABLE II: Performance comparison between Access-MFS and other methods on eight datasets in terms of AP and MaF.
Datasets VirusGO Parkinson Medical Langlog Enron Slashdot Recreation Computer
AP MaF AP MaF AP MaF AP MaF AP MaF AP MaF AP MaF AP MaF
Access-MFS .8918 .5051 .8079 .4223 .8365 .1951 .2914 .0092 .6198 .0737 .5859 .1756 .5377 .1455 .6536 .1152
All-Fea .8621 .4577 .7169 .2586 .7666 .1366 .1846 .0011 .6020 .0645 .5055 .0455 .4364 .0457 .6278 .0587
SMILE .6798 .1685 .7135 .2327 .3985 .0031 .1753 .0016 .5692 .0468 .4080 .0296 .4293 .0344 .6342 .0684
FSSRDM .6955 .2555 .6929 .1814 .4107 .0067 .2544 .0010 .5947 .0633 .4136 .0346 .4320 .0451 .6285 .0493
SCFS .8639 .4163 .6863 .2340 .4249 .0070 .2543 .0007 .5674 .0486 .4320 .0341 .4518 .0502 .6274 .0635
LEDA .6798 .1685 .7450 .3248 .4074 .0032 .1796 .0010 .5989 .0616 .4176 .0384 .4331 .0437 .6261 .0525
LSMR .6901 .2595 .6829 .1991 .4937 .0300 .2603 .0008 .5987 .0659 .4345 .0394 .4396 .0401 .6266 .0553
SMLFS .8412 .3854 .7572 .2984 .5318 .0489 .2733 .0036 .5946 .0606 .4117 .0320 .4250 .0297 .6240 .0487
SFAM .8543 .4341 .6951 .2352 .5976 .0747 .2279 .0012 .5877 .0542 .5037 .0802 .4327 .0379 .6312 .0571
SMDR-IC .7768 .3375 .6220 .1093 .4114 .0032 .2486 .0023 .5410 .0423 .4076 .0243 .4235 .0198 .6201 .0522
SFS-BLL .8442 .4088 .7530 .3197 .7384 .1230 .1693 .0012 .6006 .0626 .4154 .0333 .4369 .0422 .6270 .0585
LFFS .7884 .3594 .7168 .1824 .5139 .0354 .2538 .0010 .5947 .0633 .4390 .0439 .4320 .0451 .6285 .0493
UAFS-BH .8636 .4273 .6594 .1684 .4664 .0233 .2662 .0040 .5521 .0452 .4010 .0301 .4782 .0805 .6319 .0604
Figure 2: AP of different methods with varying numbers of selected features.
Figure 3: MaF of different methods with varying numbers of selected features.
Figure 4: AP of different methods on eight datasets with varying percentages of labeled training data.
Figure 5: MaF of different methods on eight datasets with varying percentages of labeled training data.

6.2 Experimental Results and Performance Analysis

In this section, we demonstrate the superior performance of the proposed Access-MFS compared with the competing methods in terms of AP, MaF, RL and OE. Table II summarizes the experimental results of the various methods on eight multi-label datasets in terms of the AP and MaF metrics, with the ratio of labeled instances fixed at 40% and with 30 features selected for the Parkinson dataset and 150 features for the remaining datasets. Due to space constraints, the experimental results for RL and OE can be found in Table 1 of the supplementary material. From Table II, we can see that the proposed Access-MFS consistently outperforms the other methods across all datasets. On the VirusGO, Parkinson and Recreation datasets, Access-MFS achieves an average improvement of over 10% in both AP and MaF compared to all other methods. On the Medical dataset, Access-MFS gains over 32% and 15% average improvement in AP and MaF, respectively. On the Slashdot dataset, Access-MFS outperforms the other methods by more than 13% in average AP and MaF. On the Computer dataset, Access-MFS obtains over 2.5% average improvement in both AP and MaF. On the Langlog and Enron datasets, Access-MFS achieves an average improvement of over 3.6% in AP and still outperforms all other methods in MaF. Besides, Table 1 in the supplementary material also shows that the proposed Access-MFS remains better than the other methods on all datasets in terms of RL and OE. Furthermore, compared to the baseline All-Fea, Access-MFS is superior across all datasets in terms of AP, MaF, RL, and OE, demonstrating its effectiveness. Access-MFS also shows better performance than the supervised method LFFS on all datasets, indicating the effectiveness of the proposed semi-supervised learning strategy in enhancing feature selection performance.

Because the optimal number of selected features is difficult to determine in advance, we also report the performance (AP, MaF, RL, and OE) of the different methods as the number of selected features varies. Owing to the page limitation, only the results for AP and MaF are reported here; the results for RL and OE can be found in Figs. 1 and 2 of the supplementary material. Figs. 2 and 3 show the AP and MaF values, respectively, for different numbers of selected features, with the labeled-instance ratio fixed at 40%. As illustrated in these two figures, the proposed Access-MFS outperforms the other methods in most cases as the number of selected features varies from 5 to 55 on the Parkinson dataset and from 100 to 200 on the other datasets. Furthermore, we investigate the performance of the different methods under varying ratios of labeled instances. Figs. 4 and 5 show the AP and MaF, respectively, when the ratio of labeled instances ranges from 10% to 50%; the corresponding results for RL and OE are given in Figs. 3 and 4 of the supplementary material. From these figures, we can see that our method achieves better performance in most cases. In addition, its performance improves as the ratio of labeled instances increases, indicating that the proposed approach effectively leverages labeled information to guide the collaborative learning of the sample and label similarity graphs, thereby enhancing feature selection. This superior performance is attributed to the adaptive collaborative learning of similarity graphs from both the sample and label perspectives, which is embedded into a generalized regression model with an extended uncorrelated constraint and a consistency constraint between the predicted and existing labels.
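The curves in Figs. 2-5 are produced by sweeping two experimental knobs: the number of selected features and the fraction of labeled training instances. A minimal sketch of that outer loop is shown below; rank_features and train_and_score are hypothetical placeholders for the feature-selection and downstream-classification steps, and the default grids simply mirror the ranges quoted above.

import numpy as np

def sweep(X_train, Y_train, X_test, Y_test, rank_features, train_and_score,
          feature_counts=range(100, 201, 10),
          labeled_ratios=(0.1, 0.2, 0.3, 0.4, 0.5), seed=0):
    # rank_features(X, Y, labeled_idx) -> feature indices sorted by importance
    # train_and_score(Xtr, Ytr, labeled_idx, Xte, Yte) -> dict of metric values
    rng = np.random.default_rng(seed)
    n = X_train.shape[0]
    results = {}
    for ratio in labeled_ratios:
        # Randomly keep the labels of a subset of the training instances.
        labeled_idx = rng.choice(n, size=int(ratio * n), replace=False)
        ranking = rank_features(X_train, Y_train, labeled_idx)
        for k in feature_counts:
            top_k = ranking[:k]
            results[(ratio, k)] = train_and_score(
                X_train[:, top_k], Y_train, labeled_idx,
                X_test[:, top_k], Y_test)
    return results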

6.3 Parameter Sensitivity Experiment

In this section, we conduct experiments to investigate the sensitivity of the proposed Access-MFS to its three parameters, i.e., λ, θ, and μ. Fig. 6 shows the variation of AP on the Enron and Recreation datasets when two of the parameters λ, θ, and μ are varied while the third is kept fixed. We observe that λ and θ are relatively more sensitive than μ. Moreover, when μ is fixed, the proposed method achieves strong performance with the combinations λ = 10 or 1 and θ = 10 in most cases. Hence, λ and θ can be empirically fine-tuned within this narrower range to achieve better performance.
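In practice, λ and θ can be tuned by a small grid search with μ held fixed, keeping the combination with the best validation AP. The sketch below is only illustrative: the log-scale grid and the evaluate_with_params callback (which would run Access-MFS plus the downstream classifier for a given parameter setting) are assumptions, not the tuning procedure used in the paper.

import itertools

def tune_lambda_theta(evaluate_with_params, mu=1.0,
                      grid=(1e-2, 1e-1, 1, 10, 1e2)):
    # evaluate_with_params(lam, theta, mu) -> validation AP for that setting.
    best_lam, best_theta, best_ap = None, None, -float("inf")
    for lam, theta in itertools.product(grid, grid):
        ap = evaluate_with_params(lam, theta, mu)
        if ap > best_ap:
            best_lam, best_theta, best_ap = lam, theta, ap
    return best_lam, best_theta, best_ap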

Figure 6: Variations of AP with different parameters on the Enron and Recreation datasets.
Figure 7: Convergence curves of Access-MFS on the VirusGO, Parkinson, Enron, and Recreation datasets.

6.4 Convergence Experiment

The theoretical convergence of the proposed Access-MFS algorithm has been established in Section 5.1. This subsection provides experimental validation of its convergence and analyzes its convergence speed. Fig. 7 displays the convergence curves of Access-MFS on the VirusGO, Parkinson, Enron, and Recreation datasets, showing how the objective function value varies with the number of iterations. As illustrated in Fig. 7, the curves drop steeply within the first few iterations and stabilize within about 10 iterations. These results demonstrate that the proposed method converges effectively and quickly.
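The behaviour in Fig. 7 also suggests a simple stopping rule for the iterative solver: terminate once the relative change of the objective drops below a tolerance or a small iteration cap is reached. The sketch below is a generic template, with update_step (one full round of variable updates returning the new objective value) and the tolerance treated as assumptions rather than the paper's exact criterion.

def run_until_converged(update_step, initial_obj, tol=1e-6, max_iter=50):
    # update_step() performs one round of variable updates and returns the
    # new objective value; initial_obj is the objective before iterating.
    history = [initial_obj]
    for _ in range(max_iter):
        obj = update_step()
        history.append(obj)
        # Stop when the relative decrease of the objective becomes negligible.
        if abs(history[-2] - obj) <= tol * max(abs(history[-2]), 1.0):
            break
    return history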

TABLE III: Means of AP and MaF for different variants of Access-MFS on eight datasets.
Metrics Methods VirusGO Parkinson Medical Langlog Enron Slashdot Recreation Computer
AP Access-MFS .8918 .8079 .8365 .2914 .6198 .5859 .5377 .6536
   Access-MFS-I .8624 .7843 .8009 .2797 .5805 .5453 .5109 .6477
   Access-MFS-II .8659 .7871 .7967 .2871 .5854 .5521 .5100 .6453
   Access-MFS-III .8570 .7568 .7753 .2652 .5399 .4502 .4841 .6349
MaF Access-MFS .5051 .4223 .1951 .0092 .0737 .1756 .1455 .1152
    Access-MFS-I .4328 .3878 .1770 .0048 .0688 .1290 .1199 .0890
    Access-MFS-II .4515 .4054 .1644 .0027 .0682 .1346 .1187 .0956
    Access-MFS-III .4276 .3499 .1545 .0009 .0610 .0878 .0970 .0803

6.5 Ablation Study

In this section, we conduct an ablation study to evaluate the effect of the adaptive collaborative correlation learning of the instance similarity graph and the label similarity graph in the proposed method. To this end, three variants of Access-MFS are designed for comparison as follows:

(1) Access-MFS-I: It retains only the instance similarity graph learning module of Access-MFS:

$$\begin{aligned}
\min_{\Xi}\;&\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}\\
&+\theta\big(\operatorname{Tr}(\mathbf{F}^{T}\mathbf{L}_{s}\mathbf{F})+\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})+\alpha\|\mathbf{S}\|_{F}^{2}\big)\\
\text{s.t.}\;&\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I},\; s_{ii}=0,\; s_{ij}\geq 0,\; \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n}.
\end{aligned}\tag{26}$$

(2) Access-MFS-II: It retains only the label similarity graph learning module of Access-MFS:

$$\begin{aligned}
\min_{\Xi}\;&\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}
+\mu\big(\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T})+\beta\|\mathbf{P}\|_{F}^{2}\big)\\
\text{s.t.}\;&\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I},\; p_{ii}=0,\; p_{ij}\geq 0,\; \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c}.
\end{aligned}\tag{27}$$

(3) Access-MFS-III: It performs feature selection without the adaptive collaborative learning modules for either the instance similarity graph or the label similarity graph:

$$\begin{aligned}
\min_{\Xi}\;&\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}\\
\text{s.t.}\;&\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I}.
\end{aligned}\tag{28}$$

Table III presents the ablation results of Access-MFS and its three variants in terms of AP and MaF; the corresponding results for RL and OE can be found in Table 2 of the supplementary material. From these tables, we can see that our method outperforms the variants that use only instance similarity graph learning or only label similarity graph learning (i.e., Access-MFS-I and Access-MFS-II). In addition, the variants with a single similarity graph learning module (Access-MFS-I and Access-MFS-II) perform better than the one without any similarity graph learning, i.e., Access-MFS-III. This demonstrates the effectiveness of the proposed adaptive collaborative correlation learning module, which automatically captures the local geometric structure of the samples and accurately learns label correlations from the existing labels; the two processes mutually reinforce each other and thereby improve feature selection performance.
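Reading (26)-(28) together makes the composition of the ablation explicit. Assuming the full Access-MFS objective is obtained by retaining both graph-learning terms (this form is reconstructed from the variants as a sketch; the complete formulation given earlier in the paper may contain additional coupling terms), it reads

$$\begin{aligned}
\min_{\Xi}\;&\|\mathbf{X}^{T}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{T}-\mathbf{F}\|_{F}^{2}+\|\mathbf{F}_{l}-\mathbf{Y}_{l}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{2,1}\\
&+\theta\big(\operatorname{Tr}(\mathbf{F}^{T}\mathbf{L}_{s}\mathbf{F})+\operatorname{Tr}(\mathbf{W}^{T}\mathbf{X}\mathbf{L}_{s}\mathbf{X}^{T}\mathbf{W})+\alpha\|\mathbf{S}\|_{F}^{2}\big)
+\mu\big(\operatorname{Tr}(\mathbf{F}\mathbf{L}_{p}\mathbf{F}^{T})+\beta\|\mathbf{P}\|_{F}^{2}\big)\\
\text{s.t.}\;&\mathbf{W}^{T}\mathbf{R}\mathbf{W}=\mathbf{I},\; s_{ii}=0,\; s_{ij}\geq 0,\; \mathbf{S}\mathbf{1}_{n}=\mathbf{1}_{n},\; p_{ii}=0,\; p_{ij}\geq 0,\; \mathbf{P}\mathbf{1}_{c}=\mathbf{1}_{c},
\end{aligned}$$

so that Access-MFS-I drops the μ-weighted block, Access-MFS-II drops the θ-weighted block, and Access-MFS-III drops both.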

7 Conclusion

To address the high dimensionality of multi-label data and the difficulty of obtaining labels for all instances, we proposed a novel semi-supervised multi-label feature selection method, named Access-MFS, which uses only a subset of labeled samples. Unlike existing methods that rely on predefined approaches to describe either sample correlations or label correlations separately, our method enhances semi-supervised multi-label feature selection by adaptively and collaboratively learning both sample correlations and label correlations. A generalized regression model equipped with an extended uncorrelated constraint was presented to select informative features from semi-supervised multi-label data. Meanwhile, an adaptive collaborative learning module for the instance and label similarity graphs was integrated into this regression model to preserve sample similarity in both the high-dimensional and low-dimensional spaces, assign similar labels to similar samples, and establish strong label correlations for consistent predictions. An alternating iterative optimization algorithm was also developed to solve the proposed objective function. Experimental results on real-world semi-supervised multi-label data showed that Access-MFS outperforms the state-of-the-art methods.

Acknowledgments

This work was supported by the Natural Science Foundation Project of Sichuan Province (No. 24NSFSC0508), the Youth Fund Project of Humanities and Social Science Research of Ministry of Education (No. 21YJCZH045), and the National Natural Science Foundation of China (No. 62176221).
