
Complementary Classifier Induced Partial Label Learning

Yuheng Jia, Key Laboratory of Computer Network and Information Integration, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, [email protected]; Chongjie Si, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200240, China, [email protected]; and Min-Ling Zhang, Key Laboratory of Computer Network and Information Integration, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, [email protected]
(2023)
Abstract.

In partial label learning (PLL), each training sample is associated with a set of candidate labels, among which only one is valid. The core of PLL is to disambiguate the candidate labels to get the ground-truth one. In disambiguation, the existing works usually do not fully investigate the effectiveness of the non-candidate label set (a.k.a. complementary labels), which accurately indicates a set of labels that do not belong to a sample. In this paper, we use the non-candidate labels to induce a complementary classifier, which naturally forms an adversarial relationship against the traditional PLL classifier, to eliminate the false-positive labels in the candidate label set. Besides, we assume the feature space and the label space share the same local topological structure captured by a dynamic graph, and use it to assist disambiguation. Extensive experimental results validate the superiority of the proposed approach against state-of-the-art PLL methods on 4 controlled UCI data sets and 6 real-world data sets, and reveal the usefulness of complementary learning in PLL. The code has been released in the link https://github.com/Chongjie-Si/PL-CL.

partial label learning, complementary classifier, label disambiguation
copyright: acmcopyright; journalyear: 2023; doi: XXXXXXX.XXXXXXX; conference: 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 06–10, 2023, Long Beach, CA; price: 15.00; isbn: 978-1-4503-XXXX-X/23/08; ccs: Computing methodologies, Learning paradigms

1. Introduction

In traditional multi-class learning, each training sample is associated with a single label indicating the class that sample belongs to. However, collecting such highly accurate labels is expensive and time-consuming, which is also a bottleneck of many real-world classification tasks. Therefore, partial label learning (PLL) (Hüllermeier and Beringer, 2005), (Zhang et al., 2016), (Lyu et al., 2021) was proposed, in which an instance is associated with a set of candidate labels, among which only one is valid. PLL has been applied to many real-world applications, such as natural language processing (Yang and Vozila, 2014), web mining (Huiskes and Lew, 2008) and image classification (Zeng et al., 2013), (Liu and Dietterich, 2012a), etc.

Formally speaking, suppose $\mathcal{X}=\mathbb{R}^{q}$ denotes the $q$-dimensional feature space and $\mathcal{Y}=\{0,1\}^{l}$ is the label space with $l$ classes. Given a partial label data set $\mathcal{D}=\{(\mathbf{x}_{i},S_{i})\,|\,1\leq i\leq n\}$, where $\mathbf{x}_{i}\in\mathcal{X}$ is the $i$-th sample, $S_{i}\subseteq\mathcal{Y}$ is the corresponding candidate label set, and $n$ is the number of samples, the task of PLL is to learn a function $f:\mathcal{X}\rightarrow\mathcal{Y}$ based on $\mathcal{D}$. As the ground-truth labels $\{y_{1},y_{2},\ldots,y_{n}\}$ are concealed in the corresponding candidate label sets, i.e., $y_{i}\in S_{i}$, and are not directly accessible, PLL is a quite challenging problem.

Accordingly, disambiguating the candidate labels becomes the core task of PLL. A naive strategy is to treat all the candidate labels of a sample equally and then average the modeling outputs of the candidate labels, which is known as the averaging-based framework (Cour et al., 2011). For example, PL-KNN (Hüllermeier and Beringer, 2005) averages the candidate labels of neighboring instances to make predictions. However, this naive strategy may fail when false-positive labels dominate the candidate label set. The second strategy is to identify the ground-truth label from the candidate label set, which is known as the identification-based framework (a.k.a. the disambiguation-based approach).

Figure 1. Framework of PL-CL. PL-CL utilizes the candidate and non-candidate labels to generate an ordinary classifier and a complementary classifier, and constrains the outputs of these two classifiers in an adversarial manner. A dynamic graph is also adopted to exploit the local topology consistency of the feature manifold and the label manifold to further assist disambiguation.

To disambiguate the candidate labels, we believe the following three kinds of prior information are important; previous methods usually use only one or two of them to achieve disambiguation. The first type is the correlations among instances, i.e., if two samples are similar to each other in the feature space, they are likely to have the same label. For instance, PL-AGGD (Wang et al., 2022) and IPAL (Zhang and Yu, 2015) use the feature matrix to construct a similarity graph and further use that graph to realize disambiguation. The second kind of prior is the mapping from instances to the candidate labels, i.e., we can expect that the mapping error of an instance to the correct label is small, while the mapping error of that instance to an incorrect label in the candidate label set (i.e., a false-positive label) is relatively large. For example, SURE (Feng and An, 2019) adopts such a mapping to automatically remove the incorrect labels by maximizing an infinity norm. The last kind of information is the non-candidate label set, which precisely describes “a set of labels that should not be assigned to a sample”. Although the information in the non-candidate label set is directly available and accurate, most existing works overlook this valuable information during disambiguation.

To fully take advantage of the above-mentioned priors, as shown in Fig. 1, we propose a novel PLL method named PL-CL, short for Partial Label learning based on Complementary Learning. Specifically, PL-CL uses the candidate labels to design an ordinary classifier, and meanwhile adopts the non-candidate labels to develop a complementary classifier. As the ordinary classifier indicates which label should be assigned to a sample while the complementary classifier specifies the labels that should not be assigned to that sample, the outputs of those two classifiers form an adversarial relationship. Moreover, as the non-candidate labels are accurate, the trained complementary classifier is highly reliable and can be used to promote the ordinary classifier through the adversarial relationship, which finally enhances PLL. Additionally, inspired by (Hou et al., 2016), we assume that the feature space and label space of a data set usually lie in two different manifolds but share a similar local structure. Therefore, PL-CL constructs an adaptive local topological graph shared by both the feature space and the label space, so that similar samples in the feature space possess similar labels, which further benefits the disambiguation of PLL. PL-CL is finally formulated as an adaptive graph regularized dual classifiers learning problem, and solved by an alternating iterative algorithm. Comprehensive experiments on both the controlled UCI data sets and the real-world data sets validate that PL-CL performs significantly better than state-of-the-art PLL methods, and also substantiate the effectiveness of the complementary classifier and the shared adaptive local graph structure in PLL.

The rest of the paper is organized as follows. We first review some related works in Section 2 and introduce the proposed approach in Section 3. The numerical solution is provided in Section 4. Experimental results and analyses are presented in Section 5, and conclusions and future research directions are finally given in Section 6.

2. Related Work

2.1. Partial Label Learning

PLL is a representative weakly supervised learning framework which learns from ambiguous supervision information. In PLL, an instance is associated with a set of labels, among which only one is valid. Since the ground-truth label of each sample is concealed in its candidate labels and is not directly accessible during the training phase, PLL is a quite challenging task.

To address the mentioned challenge, the key approach is candidate label disambiguation, e.g., (Feng and An, 2018), (Nguyen and Caruana, 2008), (Feng and An, 2019), (Zhang and Yu, 2015). The existing methods can be summarized into two categories: averaging-based approaches and identification-based approaches. For the averaging-based approaches (Hüllermeier and Beringer, 2005), (Cour et al., 2011), each candidate label of a training sample is treated in an equal manner and the model prediction is yielded by averaging the modeling outputs of the candidate labels. For example, PL-KNN (Hüllermeier and Beringer, 2005) averages the candidate labels of neighboring instances to make predictions. This strategy is intuitive; however, it may fail when false-positive labels dominate the candidate label set. For the identification-based approaches (Wang et al., 2022), (Feng and An, 2019), (Feng and An, 2018), the valid label is considered as a latent variable and is identified through an iterative optimization procedure. For instance, PL-AGGD (Wang et al., 2022) uses the feature matrix to construct a similarity graph, which is further utilized to generate the labeling confidence of each candidate label.

In this paper, we summarize the following three kinds of priors that are important for disambiguation. The first type is the correlations among instances, i.e., similar samples in the feature space tend to share the same label. The second kind of prior is the mapping from instances to the candidate labels, i.e., the mapping error of an instance to the valid label is small, while that to an incorrect label in the candidate label set is large. The last kind of information is the non-candidate label set, which precisely describes a set of labels that should not be assigned to a sample. Existing works mainly focus on the first two types of priors, while the valuable information in the non-candidate labels (complementary labels), which is directly available and accurate, is overlooked. Therefore, we will fully take advantage of the above-mentioned priors (especially the complementary labels) to achieve PLL.

2.2. Learning from Complementary Labels

Complementary-label learning (CLL) is another representative framework of weakly supervised learning (Feng et al., 2020), (Yu et al., 2018). In traditional CLL (Ishida et al., 2017), an instance is associated with only one complementary label, which indicates a class that this sample does not belong to. Although complementary label information is adopted in this paper, our usage is quite different from that in traditional complementary-label learning. First, in partial label learning the number of complementary labels of each sample is usually more than one, while an instance has only one complementary label in complementary-label learning. Additionally, the aim of the complementary classifier in PLL is to assist the disambiguation of the candidate labels, which is different from CLL. To the best of our knowledge, the information of complementary labels has been overlooked to some extent in previous PLL works, and our work is the first to induce a complementary classifier to assist the construction of the ordinary classifier through an adversarial prior in PLL.

3. The Proposed Approach

PLL aims to induce a multi-class classifier $f:\mathcal{X}\rightarrow\mathcal{Y}$ based on the data set $\mathcal{D}=\{(\mathbf{x}_{i},S_{i})\,|\,1\leq i\leq n\}$. Denote by $\mathbf{X}=[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times q}$ the instance matrix and by $\mathbf{Y}=[\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{n}]^{\mathsf{T}}\in\{0,1\}^{n\times l}$ the partial label matrix with $l$ labels. Note that the value of $y_{ij}$ is binary, and $y_{ij}=1$ (resp. $y_{ij}=0$) if the $j$-th label resides in (resp. does not reside in) the candidate label set of $\mathbf{x}_{i}$.

As analyzed in the introduction, three kinds of priors are important to label disambiguation. We use all the three kinds of priors to solve the PLL problem, and formulate the loss function as

(1) $\min\ \mathcal{L}(\mathbf{X},\mathbf{Y})+\hat{\mathcal{L}}(\mathbf{X},\hat{\mathbf{Y}})+\Psi(\mathbf{X},\mathbf{Y}),$

where $\hat{\mathbf{Y}}=[\hat{y}_{ij}]_{n\times l}$ is the complementary label matrix, i.e., $\hat{y}_{ij}=1$ (resp. $\hat{y}_{ij}=0$) if the $j$-th label is in the non-candidate (resp. candidate) label set of the $i$-th instance. $\mathcal{L}(\mathbf{X},\mathbf{Y})$ denotes the mapping from the instance matrix to the candidate labels, i.e., the ordinary classifier. $\hat{\mathcal{L}}(\mathbf{X},\hat{\mathbf{Y}})$ denotes the loss of the complementary classifier, i.e., the mapping from the instances to the non-candidate labels, and $\Psi(\mathbf{X},\mathbf{Y})$ uses the correlations of instances to disambiguate the candidate labels. These three terms correspond to three different ways of disambiguation in PLL, and their details will be introduced in the following subsections.
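To make the notation concrete, the following small sketch (ours, not part of the released PL-CL code; names are illustrative) shows how the partial label matrix $\mathbf{Y}$ and the complementary label matrix $\hat{\mathbf{Y}}$ can be assembled from the candidate label sets:

```python
import numpy as np

def build_label_matrices(candidate_sets, n_labels):
    """candidate_sets: list of iterables of label indices, one per sample."""
    n = len(candidate_sets)
    Y = np.zeros((n, n_labels))
    for i, S in enumerate(candidate_sets):
        Y[i, list(S)] = 1.0          # y_ij = 1 iff the j-th label is a candidate of sample i
    Y_hat = 1.0 - Y                  # complementary (non-candidate) label matrix
    return Y, Y_hat

# toy usage: 3 samples, 4 classes
Y, Y_hat = build_label_matrices([{0, 2}, {1}, {1, 2, 3}], n_labels=4)
```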

3.1. Ordinary Classifier

First, PL-CL uses a weight matrix $\mathbf{W}=[\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{l}]\in\mathbb{R}^{q\times l}$ to map the instance matrix to the labeling confidence matrix $\mathbf{P}=[\mathbf{p}_{1},\mathbf{p}_{2},\ldots,\mathbf{p}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times l}$, where $\mathbf{p}_{i}\in\mathbb{R}^{l}$ is the labeling confidence vector corresponding to $\mathbf{x}_{i}$, and $p_{ij}$ represents the probability of the $j$-th label being the ground-truth label of $\mathbf{x}_{i}$. To be specific, PL-CL uses the following least squares loss to form the ordinary classifier:

(2) $\begin{split}\mathcal{L}(\mathbf{X},\mathbf{Y})&=\left\|\mathbf{X}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}\right\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2}\\ {\rm s.t.}&\quad\sum_{j}p_{ij}=1,\ 0\leq p_{ij}\leq y_{ij},\ \forall i,j,\end{split}$

where $\mathbf{b}=[b_{1},b_{2},\ldots,b_{l}]\in\mathbb{R}^{l}$ is the bias term and $\mathbf{1}_{n}\in\mathbb{R}^{n}$ is an all-ones vector. $\|\mathbf{W}\|_{F}$ denotes the Frobenius norm of the weight matrix, and minimizing $\|\mathbf{W}\|_{F}^{2}$ controls the complexity of the model. $\lambda$ is a hyper-parameter trading off these terms. As $\mathbf{P}$ represents the labeling confidence, it should be strictly non-negative, i.e., $p_{ij}\geq 0$, and $p_{ij}$ has a chance to be positive only when the $j$-th label lies in the candidate label set of the $i$-th sample, i.e., $p_{ij}\leq y_{ij}$. Moreover, as each row of $\mathbf{P}$ indicates the labeling confidence of a certain sample, the sum of its elements should be equal to 1, i.e., $\sum_{j}p_{ij}=1$. The ideal state of $\mathbf{p}_{i}$ is one-hot, for there is only one ground-truth label. By minimizing Eq. (2), the ground-truth label is expected to produce a smaller mapping error than the incorrect ones residing in the candidate label set, and the correct label is thus likely to be selected.
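As a small illustration (ours, not a step of PL-CL itself), a matrix $\mathbf{P}$ satisfying the constraints of Eq. (2) can be obtained by spreading the confidence uniformly over each candidate label set:

```python
import numpy as np

def uniform_candidate_confidence(Y):
    """Y: (n, l) binary candidate label matrix; returns P with p_ij = y_ij / sum_j y_ij."""
    return Y / Y.sum(axis=1, keepdims=True)

P0 = uniform_candidate_confidence(np.array([[1., 0., 1., 0.],
                                            [0., 1., 0., 0.]]))
# each row of P0 sums to 1 and is zero outside the candidate set
```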

3.2. Complementary Classifier

Suppose the matrix $\mathbf{Q}=[\mathbf{q}_{1},\mathbf{q}_{2},\ldots,\mathbf{q}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times l}$ is the complementary-labeling confidence matrix, where $q_{ij}$ denotes the confidence of the $j$-th label NOT being the ground-truth label of $\mathbf{x}_{i}$. Similar to the ordinary classifier, we construct a complementary classifier:

(3) $\begin{split}\min_{\hat{\mathbf{W}},\hat{\mathbf{b}}}\ &\left\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\right\|_{F}^{2}+\lambda\left\|\hat{\mathbf{W}}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\hat{y}_{ij}\leq q_{ij}\leq 1,\ \forall i,j,\end{split}$

where $\hat{\mathbf{W}}\in\mathbb{R}^{q\times l}$ and $\hat{\mathbf{b}}\in\mathbb{R}^{l}$ are the weight matrix and bias of the complementary classifier. Different from $\mathbf{p}_{i}$, the ideal state of $\mathbf{q}_{i}$ is “zero-hot”, i.e., only one value in $\mathbf{q}_{i}$ is 0 while the others are 1, for there is only one ground-truth label in PLL. Therefore, we require $\hat{y}_{ij}\leq q_{ij}\leq 1,\ \forall i,j$, which means $0\leq q_{ij}\leq 1$ if the $j$-th label resides in the candidate label set, and $q_{ij}=1$ if $\hat{y}_{ij}=1$.

The ordinary classifier predicts which label should be assigned to an instance, while the complementary classifier specifies a group of labels that should not be allocated to that instance. Therefore, an adversarial relationship exists between the outputs of the ordinary and complementary classifiers, i.e., a larger (resp. smaller) element in $\mathbf{P}$ implies a smaller (resp. larger) element in $\mathbf{Q}$, and vice versa. Besides, as the elements in both $\mathbf{Q}$ and $\mathbf{P}$ are non-negative and no more than 1, the above adversarial relationship can be formulated as $\mathbf{p}_{i}+\mathbf{q}_{i}=\mathbf{1}_{l}$. Taking the adversarial relationship into account, the complementary classifier becomes:

(4) $\begin{split}\hat{\mathcal{L}}(\mathbf{X},\hat{\mathbf{Y}})=\ &\alpha\left\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\right\|_{F}^{2}+\beta\left\|\mathbf{1}_{n\times l}-\mathbf{P}-\mathbf{Q}\right\|_{F}^{2}+\lambda\left\|\hat{\mathbf{W}}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\hat{y}_{ij}\leq q_{ij}\leq 1,\end{split}$

where $\mathbf{1}_{n\times l}\in\mathbb{R}^{n\times l}$ is an all-ones matrix, and $\alpha$ and $\beta$ are two hyper-parameters. Different from the candidate label set, the complementary information in the non-candidate label set is directly accessible and accurate. By minimizing Eq. (4), this valuable information is passed to the labeling confidence matrix $\mathbf{P}$, which further assists label disambiguation.
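The following toy check (ours) illustrates the adversarial coupling enforced by the $\beta$-weighted term of Eq. (4): for an ideally disambiguated sample the ordinary output is one-hot, the complementary output is zero-hot, and the coupling loss vanishes.

```python
import numpy as np

p = np.array([0., 1., 0., 0.])          # ideal ordinary confidence (one-hot)
q = np.array([1., 0., 1., 1.])          # ideal complementary confidence ("zero-hot")
coupling = np.sum((1.0 - p - q) ** 2)   # the beta-weighted term of Eq. (4) for one sample
assert coupling == 0.0
```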

3.3. Local Consistency in Label Manifold and Feature Manifold

As suggested by (Hou et al., 2016), the label space and feature space of a data set lie in two different manifolds but with a similar local structure, i.e., if $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are similar, their corresponding labeling confidence vectors $\mathbf{p}_{i}$ and $\mathbf{p}_{j}$ should also be similar. To this end, we first construct a local non-negative similarity matrix $\mathbf{G}=[g_{ij}]_{n\times n}$ to capture the local manifold structure of the feature space, i.e.,

(5) $\begin{split}\min_{\mathbf{G}}&\quad\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U},\end{split}$

where $\mathbf{0}_{n\times n}$ is an all-zeros matrix of size $n\times n$. $\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n}$ ensures that the sum of each column of $\mathbf{G}$ is 1, i.e., $\mathbf{G}$ is normalized. $\mathbf{U}=[u_{ij}]_{n\times n}$ denotes the $k$-nearest neighbors (KNN) location matrix, i.e., $u_{ij}=1$ if $\mathbf{x}_{i}$ belongs to the KNN of $\mathbf{x}_{j}$, otherwise $u_{ij}=0$. $\mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U}$ means that the elements of $\mathbf{G}$ are no less than 0 and no more than the elements of $\mathbf{U}$ at the same locations, which ensures that one sample can only be reconstructed by its neighbors. We further assume that the label space shares the same local structure as the feature space captured by $\mathbf{G}$, and use it to assist label disambiguation, i.e.,

(6) $\begin{split}\min_{\mathbf{P}}&\quad\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y}.\end{split}$

By minimizing Eq. (6), the local manifold structure $\mathbf{G}$ will help produce a better $\mathbf{P}$. Finally, we combine Eqs. (5) and (6) to jointly optimize the local structure $\mathbf{G}$ and the labeling confidence matrix $\mathbf{P}$, i.e.,

(7) $\begin{split}\Psi(\mathbf{X},\mathbf{Y})=&\quad\gamma\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U},\\ &\quad\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y},\end{split}$

where $\gamma$ and $\mu$ are two hyper-parameters. By minimizing Eq. (7), information is exchanged between the feature space and the label space to achieve better local consistency.

3.4. Model Formulation

Based on the discussions above, the objective function of PL-CL is finally formulated as

(8) $\begin{split}\min_{\mathbf{W},\mathbf{b},\hat{\mathbf{W}},\hat{\mathbf{b}},\mathbf{P},\mathbf{Q},\mathbf{G}}\ &\left\|\mathbf{X}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}\right\|_{F}^{2}+\beta\left\|\mathbf{1}_{n\times l}-\mathbf{P}-\mathbf{Q}\right\|_{F}^{2}\\ +&\ \alpha\left\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ +&\ \gamma\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}+\lambda\left(\|\mathbf{W}\|_{F}^{2}+\|\hat{\mathbf{W}}\|_{F}^{2}\right)\\ {\rm s.t.}\quad&\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y},\ \hat{\mathbf{Y}}\leq\mathbf{Q}\leq\mathbf{1}_{n\times l},\\ &\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U},\end{split}$

where $\beta,\alpha,\gamma,\mu$ and $\lambda$ are hyper-parameters balancing the different terms. PL-CL comprehensively takes the information in the candidate label set, the non-candidate label set, and the sample relationships into consideration to perform label disambiguation. As will be shown later, it is quite robust to all these hyper-parameters.

The most challenging issue in PLL is the inaccurate supervision. As the non-candidate labels precisely indicate the classes to which an instance does not belong, they can be regarded as a kind of accurate supervision, and leveraging them naturally helps PLL.

To the best of our knowledge, this is the first work to use the non-candidate labels to construct a complementary classifier in PLL to help label disambiguation. Moreover, the adversarial prior used to introduce the complementary classifier is also novel in PLL.

4. Numerical Solution

The optimization problem in Eq. (8) has seven variables with different constraints; we therefore solve it in an alternating and iterative manner.

4.1. Update $\mathbf{W}$, $\mathbf{b}$ and $\hat{\mathbf{W}}$, $\hat{\mathbf{b}}$

With the other variables fixed, problem (8) with respect to $\mathbf{W}$ and $\mathbf{b}$ can be reformulated as

(9) $\min_{\mathbf{W},\mathbf{b}}\quad\left\|\mathbf{X}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}\right\|_{F}^{2}+\lambda\left\|\mathbf{W}\right\|_{F}^{2},$

which is a regularized least squares problem and the corresponding closed-form solution is

(10) $\begin{split}\mathbf{W}&=\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}+\lambda\mathbf{I}_{q\times q}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{P}\\ \mathbf{b}&=\frac{1}{n}\left(\mathbf{P}^{\mathsf{T}}\mathbf{1}_{n}-\mathbf{W}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{1}_{n}\right),\end{split}$

where $\mathbf{I}_{q\times q}$ is the identity matrix of size $q\times q$. Similarly, problem (8) with respect to $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ is

(11) $\min_{\hat{\mathbf{W}},\hat{\mathbf{b}}}\quad\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\|_{F}^{2}+\frac{\lambda}{\alpha}\|\hat{\mathbf{W}}\|_{F}^{2},$

which has an analytical solution similar to that in Eq. (10), i.e.,

(12) $\begin{split}\hat{\mathbf{W}}&=\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}+\frac{\lambda}{\alpha}\mathbf{I}_{q\times q}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Q}\\ \hat{\mathbf{b}}&=\frac{1}{n}\left(\mathbf{Q}^{\mathsf{T}}\mathbf{1}_{n}-\hat{\mathbf{W}}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{1}_{n}\right).\end{split}$
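A minimal sketch (ours; variable names are illustrative) of the closed-form updates in Eqs. (10) and (12), which are ordinary ridge-regression solutions:

```python
import numpy as np

def ridge_update(X, T, lam):
    """Solve min_{W,b} ||X W + 1 b^T - T||_F^2 + lam ||W||_F^2 as in Eqs. (10) and (12)."""
    n, q = X.shape
    W = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ T)
    b = (T.T @ np.ones(n) - W.T @ X.T @ np.ones(n)) / n
    return W, b

# ordinary classifier: ridge_update(X, P, lam); complementary classifier: ridge_update(X, Q, lam / alpha)
```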

Kernel Extension. The above linear model may be too simple to capture the complex relationships between instances and labels, therefore we extend the linear model to a kernel version. Suppose $\phi(\cdot):\mathbb{R}^{q}\rightarrow\mathbb{R}^{h}$ represents a feature transformation that maps the feature space to a higher-dimensional space, and $\mathbf{\Phi}=[\phi(\mathbf{x}_{1}),\phi(\mathbf{x}_{2}),\ldots,\phi(\mathbf{x}_{n})]^{\mathsf{T}}$ denotes the feature matrix in the higher-dimensional space. With the feature mapping, we can rewrite problem (9) as follows:

(13) $\begin{split}\min_{\mathbf{W},\mathbf{b}}\quad&\|\mathbf{M}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{M}=\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P},\end{split}$

where $\mathbf{M}=[\mathbf{m}_{1},\mathbf{m}_{2},\ldots,\mathbf{m}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times l}$. The Lagrangian function of Eq. (13) is formulated as

(14) $\begin{split}\mathcal{L}(\mathbf{W},\mathbf{b},\mathbf{M},\mathbf{A})=\ &\|\mathbf{M}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2}\\ &-{\rm tr}\left(\mathbf{A}^{\mathsf{T}}(\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}-\mathbf{M})\right),\end{split}$

where $\mathbf{A}\in\mathbb{R}^{n\times l}$ is the Lagrange multiplier and ${\rm tr}(\cdot)$ is the trace of a matrix. Based on the KKT conditions, we have

(15) $\begin{split}&\frac{\partial\mathcal{L}}{\partial\mathbf{W}}=2\lambda\mathbf{W}-\mathbf{\Phi}^{\mathsf{T}}\mathbf{A}=0,\quad\frac{\partial\mathcal{L}}{\partial\mathbf{b}}=\mathbf{A}^{\mathsf{T}}\mathbf{1}_{n}=0,\\ &\frac{\partial\mathcal{L}}{\partial\mathbf{M}}=2\mathbf{M}+\mathbf{A}=0,\quad\frac{\partial\mathcal{L}}{\partial\mathbf{A}}=\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}-\mathbf{M}=0.\end{split}$

Define a kernel matrix $\mathbf{K}=\mathbf{\Phi}\mathbf{\Phi}^{\mathsf{T}}$ with its element $k_{ij}=\phi(\mathbf{x}_{i})^{\mathsf{T}}\phi(\mathbf{x}_{j})=\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})$, where $\mathcal{K}(\cdot,\cdot)$ is the kernel function. For PL-CL, we use the Gaussian function $\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})=\exp(-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}/(2\sigma^{2}))$ as the kernel function and set $\sigma$ to the average distance over all pairs of training instances. Then, we have

(16) $\begin{split}\mathbf{A}&=\left(\frac{1}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}\left(\mathbf{P}-\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}\right),\\ \mathbf{b}&=\left(\frac{\mathbf{s}\mathbf{P}}{\mathbf{s}\mathbf{1}_{n}}\right)^{\mathsf{T}},\end{split}$

where $\mathbf{s}=\mathbf{1}_{n}^{\mathsf{T}}\left(\frac{1}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}$. We can also obtain $\mathbf{W}=\mathbf{\Phi}^{\mathsf{T}}\mathbf{A}/(2\lambda)$ from Eq. (15). The output of the model can be denoted by $\mathbf{H}=\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}=\frac{1}{2\lambda}\mathbf{K}\mathbf{A}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}$.

As the optimization problem regarding $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ has the same form as problem (13), we similarly have

(17) $\begin{split}\hat{\mathbf{A}}&=\left(\frac{\alpha}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}\left(\mathbf{Q}-\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}\right),\\ \hat{\mathbf{b}}&=\left(\frac{\hat{\mathbf{s}}\mathbf{Q}}{\hat{\mathbf{s}}\mathbf{1}_{n}}\right)^{\mathsf{T}},\end{split}$

where $\hat{\mathbf{s}}=\mathbf{1}_{n}^{\mathsf{T}}\left(\frac{\alpha}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}$, and the output of the complementary classifier is defined as $\hat{\mathbf{H}}=\frac{\alpha}{2\lambda}\mathbf{K}\hat{\mathbf{A}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}$.
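The kernel-based updates can be sketched as follows (ours, assuming the Gaussian kernel described above; any kernel giving $\mathbf{K}=\mathbf{\Phi}\mathbf{\Phi}^{\mathsf{T}}$ works):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, sigma):
    return np.exp(-cdist(X, X, metric="sqeuclidean") / (2.0 * sigma ** 2))

def kernel_update(K, P, lam):
    """Eq. (16); for the complementary classifier call kernel_update(K, Q, lam / alpha), which matches Eq. (17)."""
    n = K.shape[0]
    Minv = np.linalg.inv(K / (2.0 * lam) + 0.5 * np.eye(n))
    s = np.ones(n) @ Minv                                 # s = 1^T (K/(2*lam) + I/2)^{-1}
    b = (s @ P) / (s @ np.ones(n))                        # bias, one entry per label
    A = Minv @ (P - np.outer(np.ones(n), b))              # Lagrange multipliers
    H = K @ A / (2.0 * lam) + np.outer(np.ones(n), b)     # model output
    return A, b, H
```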

4.2. Update Q

Fixing the other variables, the $\mathbf{Q}$-subproblem can be equivalently rewritten as

(18) $\begin{split}\min_{\mathbf{Q}}\quad&\left\|\mathbf{Q}-\frac{\alpha\hat{\mathbf{H}}+\beta(\mathbf{1}_{n\times l}-\mathbf{P})}{\alpha+\beta}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\hat{\mathbf{Y}}\leq\mathbf{Q}\leq\mathbf{1}_{n\times l},\end{split}$

where the solution is

(19) $\mathbf{Q}=\mathcal{T}_{1}\left(\mathcal{T}_{\hat{\mathbf{Y}}}\left(\frac{\alpha\hat{\mathbf{H}}+\beta(\mathbf{1}_{n\times l}-\mathbf{P})}{\alpha+\beta}\right)\right).$

$\mathcal{T}_{1}$ and $\mathcal{T}_{\hat{\mathbf{Y}}}$ are two element-wise thresholding operators, i.e., $\mathcal{T}_{1}(m):=\min\{1,m\}$ with $m$ being a scalar, and $\mathcal{T}_{\hat{\mathbf{Y}}}(m):=\max\{\hat{y}_{ij},m\}$.
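A sketch (ours) of the resulting $\mathbf{Q}$-update in Eq. (19): an element-wise clipping of a weighted average of the complementary output $\hat{\mathbf{H}}$ and $\mathbf{1}_{n\times l}-\mathbf{P}$.

```python
import numpy as np

def update_Q(H_hat, P, Y_hat, alpha, beta):
    target = (alpha * H_hat + beta * (1.0 - P)) / (alpha + beta)
    return np.clip(target, Y_hat, 1.0)   # lower bound y_hat_ij (T_Yhat), upper bound 1 (T_1)
```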

4.3. Update G

Fixing the other variables, the $\mathbf{G}$-subproblem is rewritten as

(20) $\begin{split}\min_{\mathbf{G}}\quad&\gamma\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U}.\end{split}$

Notice that each column of $\mathbf{G}$ is independent of the other columns, therefore we can solve the $\mathbf{G}$-subproblem column by column. To solve for the $i$-th column of $\mathbf{G}$, we have

(21) $\begin{split}\min_{\mathbf{G}_{\cdot i}}\quad&\gamma\Big\|\mathbf{x}_{i}-\sum_{u_{ji}=1}g_{ji}\mathbf{x}_{j}\Big\|_{F}^{2}+\mu\Big\|\mathbf{p}_{i}-\sum_{u_{ji}=1}g_{ji}\mathbf{p}_{j}\Big\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{G}_{\cdot i}^{\mathsf{T}}\mathbf{1}_{n}=1,\ \mathbf{0}_{n}\leq\mathbf{G}_{\cdot i}\leq\mathbf{U}_{\cdot i}.\end{split}$

As there are only $k$ non-zero elements in $\mathbf{G}_{\cdot i}$ to be updated, corresponding to the reconstruction coefficients of $\mathbf{x}_{i}$ over its top-$k$ neighbors, let $\hat{\mathbf{g}}_{i}\in\mathbb{R}^{k}$ denote the vector of these elements and $\mathcal{N}_{i}$ the set of the indexes of these neighbors corresponding to $\hat{\mathbf{g}}_{i}$. Define matrices $\mathbf{D}^{x_{i}}=[\mathbf{x}_{i}-\mathbf{x}_{\mathcal{N}_{i(1)}},\mathbf{x}_{i}-\mathbf{x}_{\mathcal{N}_{i(2)}},\ldots,\mathbf{x}_{i}-\mathbf{x}_{\mathcal{N}_{i(k)}}]^{\mathsf{T}}\in\mathbb{R}^{k\times q}$ and $\mathbf{D}^{p_{i}}=[\mathbf{p}_{i}-\mathbf{p}_{\mathcal{N}_{i(1)}},\mathbf{p}_{i}-\mathbf{p}_{\mathcal{N}_{i(2)}},\ldots,\mathbf{p}_{i}-\mathbf{p}_{\mathcal{N}_{i(k)}}]^{\mathsf{T}}\in\mathbb{R}^{k\times l}$, and let the Gram matrices be $\mathbf{B}^{x_{i}}=\mathbf{D}^{x_{i}}(\mathbf{D}^{x_{i}})^{\mathsf{T}}\in\mathbb{R}^{k\times k}$ and $\mathbf{B}^{p_{i}}=\mathbf{D}^{p_{i}}(\mathbf{D}^{p_{i}})^{\mathsf{T}}\in\mathbb{R}^{k\times k}$. We can then transform the problem in Eq. (21) into the following form:

(22) $\begin{split}\min_{\hat{\mathbf{g}}_{i}}\quad&\hat{\mathbf{g}}_{i}^{\mathsf{T}}\left(\gamma\mathbf{B}^{x_{i}}+\mu\mathbf{B}^{p_{i}}\right)\hat{\mathbf{g}}_{i}\\ {\rm s.t.}\quad&\hat{\mathbf{g}}_{i}^{\mathsf{T}}\mathbf{1}_{k}=1,\ \mathbf{0}_{k}\leq\hat{\mathbf{g}}_{i}\leq\mathbf{1}_{k}.\end{split}$

The optimization problem in Eq. (22) is a standard quadratic programming (QP) problem and can be solved by off-the-shelf QP tools. $\mathbf{G}$ is updated by concatenating all the solved $\hat{\mathbf{g}}_{i}$ together.
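For illustration, one column of Eq. (22) can be solved with a generic solver as below (ours; SciPy's SLSQP is used here, but any QP solver applies):

```python
import numpy as np
from scipy.optimize import minimize

def solve_column(Bx, Bp, gamma, mu):
    """Bx, Bp: (k, k) Gram matrices from Eq. (21); returns the k reconstruction weights g_hat_i."""
    k = Bx.shape[0]
    H = gamma * Bx + mu * Bp
    cons = [{"type": "eq", "fun": lambda g: g.sum() - 1.0}]     # weights sum to 1
    res = minimize(lambda g: g @ H @ g, np.full(k, 1.0 / k),    # feasible uniform start
                   jac=lambda g: 2.0 * H @ g,
                   bounds=[(0.0, 1.0)] * k, constraints=cons, method="SLSQP")
    return res.x
```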

4.4. Update P

With the other variables fixed, the $\mathbf{P}$-subproblem is

(23) $\begin{split}\min_{\mathbf{P}}\quad&\left\|\mathbf{H}-\mathbf{P}\right\|_{F}^{2}+\beta\left\|\mathbf{1}_{n\times l}-\mathbf{Q}-\mathbf{P}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y},\end{split}$

and we rewrite the problem in Eq. (23) into the following form:

(24) $\begin{split}\min_{\mathbf{P}}\quad&\left\|\mathbf{P}-\frac{\mathbf{H}+\beta(\mathbf{1}_{n\times l}-\mathbf{Q})}{1+\beta}\right\|_{F}^{2}+\frac{\mu}{1+\beta}\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y}.\end{split}$

In order to solve problem (24), we denote $\widetilde{\mathbf{p}}={\rm vec}(\mathbf{P})\in[0,1]^{nl}$, where ${\rm vec}(\cdot)$ is the vectorization operator. Similarly, $\widetilde{\mathbf{o}}={\rm vec}\left(\frac{\mathbf{H}+\beta(\mathbf{1}_{n\times l}-\mathbf{Q})}{1+\beta}\right)\in\mathbb{R}^{nl}$ and $\widetilde{\mathbf{y}}={\rm vec}(\mathbf{Y})\in\{0,1\}^{nl}$. Let $\mathbf{T}=2(\mathbf{I}_{n\times n}-\mathbf{G})(\mathbf{I}_{n\times n}-\mathbf{G})^{\mathsf{T}}\in\mathbb{R}^{n\times n}$. Based on these notations, the optimization problem (24) can be written as

(25) $\begin{split}\min_{\widetilde{\mathbf{p}}}\quad&\frac{1}{2}\widetilde{\mathbf{p}}^{\mathsf{T}}\left(\mathbf{C}+\frac{2(1+\beta)}{\mu}\mathbf{I}_{nl\times nl}\right)\widetilde{\mathbf{p}}-\frac{2(1+\beta)}{\mu}\widetilde{\mathbf{o}}^{\mathsf{T}}\widetilde{\mathbf{p}}\\ {\rm s.t.}\quad&\sum_{i=1,\,i\%n=j,\,0\leq j\leq n-1}^{nl}\widetilde{p}_{i}=1,\ \mathbf{0}_{nl}\leq\widetilde{\mathbf{p}}\leq\widetilde{\mathbf{y}},\end{split}$

where $\mathbf{C}\in\mathbb{R}^{nl\times nl}$ is defined as:

(26) $\mathbf{C}=\left[\begin{array}{cccc}\mathbf{T}&\mathbf{0}_{n\times n}&\cdots&\mathbf{0}_{n\times n}\\ \mathbf{0}_{n\times n}&\mathbf{T}&\ddots&\vdots\\ \vdots&\ddots&\ddots&\mathbf{0}_{n\times n}\\ \mathbf{0}_{n\times n}&\cdots&\mathbf{0}_{n\times n}&\mathbf{T}\end{array}\right].$

As the problem in Eq. (25) is a standard QP problem, it can be solved by off-the-shelf QP tools.
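The block-diagonal matrix $\mathbf{C}$ of Eq. (26) can be assembled directly with a Kronecker product, as in the following sketch (ours):

```python
import numpy as np

def build_C(G, n_labels):
    """C has l copies of T = 2 (I - G)(I - G)^T on its diagonal, i.e. C = I_l kron T (Eq. (26))."""
    n = G.shape[0]
    T = 2.0 * (np.eye(n) - G) @ (np.eye(n) - G).T
    return np.kron(np.eye(n_labels), T)
```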

4.5. The Overall Optimization Algorithm

As a summary, PL-CL first initializes $\mathbf{G}$ and $\mathbf{P}$ by solving problems (5) and (6) respectively, and initializes $\mathbf{Q}$ as:

(27) $q_{ij}=\left\{\begin{array}{ll}\frac{1}{l-\sum_{j}y_{ij}}&\text{if }y_{ij}=0\\ 0&\text{otherwise}.\end{array}\right.$

Then, PL-CL iteratively and alternately updates one variable with the others fixed until the model converges. For an unseen instance $\mathbf{x}^{\ast}$, the predicted label $y^{\ast}$ is

(28) $y^{\ast}=\mathop{\arg\max}_{k}\ \sum_{i=1}^{n}\frac{1}{2\lambda}\mathbf{A}_{ik}\mathcal{K}(\mathbf{x}^{\ast},\mathbf{x}_{i})+b_{k}.$
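A sketch (ours) of the prediction rule in Eq. (28) for an unseen instance $\mathbf{x}^{\ast}$:

```python
import numpy as np

def predict(x_star, X_train, A, b, lam, kernel):
    """kernel(x, x') -> scalar, e.g. the Gaussian kernel used by PL-CL."""
    k_vec = np.array([kernel(x_star, x_i) for x_i in X_train])   # K(x*, x_i), i = 1..n
    scores = k_vec @ A / (2.0 * lam) + b                          # one score per label
    return int(np.argmax(scores))                                 # Eq. (28)
```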

The pseudo code of PL-CL is summarized in Algorithm 1.

Algorithm 1 The pseudo code of PL-CL
  Input: Partial label data $\mathcal{D}$, hyper-parameters $\alpha$, $\beta$, $\mu$, $\gamma$ and $\lambda$, an unseen sample $\mathbf{x}^{\ast}$
  Output: The predicted label $y^{\ast}$ of $\mathbf{x}^{\ast}$
  Process:
  Initialize $\mathbf{G}$ according to Eq. (5).
  Initialize $\mathbf{P}$ according to Eq. (6).
  Initialize $\mathbf{Q}$ according to Eq. (27).
  repeat
     Update $\mathbf{A}$, $\mathbf{b}$ according to Eq. (16).
     Update $\hat{\mathbf{A}}$, $\hat{\mathbf{b}}$ according to Eq. (17).
     Update $\mathbf{Q}$ according to Eq. (19).
     Update $\mathbf{G}$ according to Eq. (22).
     Update $\mathbf{P}$ according to Eq. (25).
  until Convergence
  Return: $y^{\ast}$ according to Eq. (28).

4.6. Complexity Analysis

The complexity for solving $\mathbf{W}$, $\mathbf{b}$, $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ is $O(n^{3})+O(nql)$, that for updating $\mathbf{Q}$ is $O(nl)$, and the complexities for updating $\mathbf{G}$ and $\mathbf{P}$ are $O(nk^{3})$ and $O(n^{3}q^{3})$, respectively. Therefore, the overall complexity of PL-CL is $O(n^{3}+nql+nl+nk^{3}+n^{3}q^{3})$.

5. Experiments

5.1. Compared Methods

In order to validate the effectiveness of PL-CL, we compared it with six state-of-the-art PLL approaches:

  • PL-AGGD (Wang et al., 2022): a PLL approach which constructs a similarity graph and further uses that graph to realize disambiguation [hyper-parameter configurations: $k=10$, $T=20$, $\lambda=1$, $\mu=1$, $\gamma=0.05$];

  • SURE (Feng and An, 2019): a self-guided retraining based approach which maximizes an infinity norm regularization on the modeling outputs to realize disambiguation [hyper-parameter configurations: $\lambda$ and $\beta$ $\in\{0.001,0.01,0.05,0.1,0.3,0.5,1\}$];

  • LALO (Feng and An, 2018): an identification-based approach with constrained local consistency to differentiate the candidate labels [hyper-parameter configurations: $k=10$, $\lambda=0.05$, $\mu=0.005$];

  • IPAL (Zhang and Yu, 2015): an instance-based algorithm which identifies the valid label via an iterative label propagation procedure [hyper-parameter configurations: $\alpha=0.95$, $k=10$, $T=100$];

  • PLDA (Wang and Zhang, 2022): a dimensionality reduction approach for partial label learning [hyper-parameter configurations: PL-KNN];

  • PL-KNN (Hüllermeier and Beringer, 2005): an averaging-based method that averages the neighbors’ candidate labels to make prediction on an unseen sample [hyper-parameter configurations: $k=10$].

5.2. Experimental Settings

For PL-CL, we set $k=10$ and $\lambda=0.03$, and $\gamma$, $\mu$, $\alpha$ and $\beta$ were chosen from $\{0.001,0.01,0.1,0.2,0.5,1,1.5,2,4\}$. The kernel functions of SURE, PL-AGGD and LALO were the same as that of PL-CL. All the hyper-parameter configurations of each method were set according to the original paper. Ten runs of 50%/50% random train/test splits were performed, and the averaged accuracy with standard deviations on the test data sets was recorded for all the comparing algorithms.

Table 1. Characteristics of the real-world data sets.
Data set # Examples # Features # Labels # Average Labels Task Domain
FG-NET (Panis et al., 2016) 1002 262 78 7.48 facial age estimation
Lost (Cour et al., 2009) 1122 108 16 2.23 automatic face naming
MSRCv2 (Liu and Dietterich, 2012b) 1758 48 23 3.16 object classification
Mirflickr (Huiskes and Lew, 2008) 2780 1536 14 2.76 web image classification
Soccer Player (Zeng et al., 2013) 17472 279 171 2.09 automatic face naming
Yahoo!News (Guillaumin et al., 2010) 22991 163 219 1.91 automatic face naming
Table 2. The comparison of different methods on the real-world data sets. •/○ indicates whether PL-CL is statistically superior/inferior to the compared algorithm according to the pairwise t-test at the 0.05 significance level.
Data set FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5)
PL-CL 0.072 ± 0.009 0.709 ± 0.022 0.469 ± 0.016 0.642 ± 0.012 0.534 ± 0.004 0.618 ± 0.003 0.433 ± 0.022 0.575 ± 0.015
SURE 0.052 ± 0.006 • 0.693 ± 0.020 • 0.445 ± 0.021 • 0.631 ± 0.021 • 0.519 ± 0.004 • 0.598 ± 0.002 • 0.356 ± 0.019 • 0.494 ± 0.020 •
PL-AGGD 0.063 ± 0.009 • 0.683 ± 0.014 • 0.451 ± 0.012 • 0.610 ± 0.011 • 0.524 ± 0.004 • 0.607 ± 0.004 • 0.387 ± 0.015 • 0.530 ± 0.015 •
LALO 0.065 ± 0.009 • 0.680 ± 0.014 • 0.448 ± 0.015 • 0.626 ± 0.013 • 0.523 ± 0.003 • 0.600 ± 0.003 • 0.423 ± 0.020 • 0.566 ± 0.014 •
IPAL 0.051 ± 0.011 • 0.646 ± 0.023 • 0.488 ± 0.031 ○ 0.527 ± 0.009 • 0.528 ± 0.003 • 0.625 ± 0.004 ○ 0.349 ± 0.019 • 0.500 ± 0.019 •
PLDA 0.042 ± 0.005 • 0.289 ± 0.045 • 0.422 ± 0.013 • 0.480 ± 0.015 • 0.493 ± 0.003 • 0.380 ± 0.003 • 0.150 ± 0.012 • 0.232 ± 0.012 •
PL-KNN 0.036 ± 0.006 • 0.296 ± 0.021 • 0.393 ± 0.014 • 0.454 ± 0.015 • 0.483 ± 0.005 • 0.368 ± 0.004 • 0.288 ± 0.013 • 0.440 ± 0.016 •
Table 3. Characteristics of the UCI data sets.
Data set ecoli glass vehicle steel
# Examples 336 214 846 1941
# Features 7 9 18 27
# Labels 8 6 4 7
Table 4. Win/tie/loss counts on the controlled UCI data sets between PL-CL and the other compared approaches according to the pairwise t-test at the 0.05 significance level. I: varying $\epsilon$; II: varying $p$ ($r=1$); III: varying $p$ ($r=2$); IV: varying $p$ ($r=3$).
I II III IV Total
SURE 22/6/0 26/2/0 26/2/0 24/4/0 98/14/0
PL-AGGD 26/2/0 25/3/0 23/5/0 22/6/0 96/16/0
LALO 27/1/0 24/4/0 26/2/0 22/6/0 99/13/0
IPAL 28/0/0 23/5/0 26/2/0 25/3/0 102/10/0
PLDA 28/0/0 28/0/0 28/0/0 28/0/0 112/0/0
PL-KNN 28/0/0 28/0/0 28/0/0 28/0/0 112/0/0
Figure 2. Classification accuracy on the controlled UCI data sets: (a) $\epsilon$ varying from 0.1 to 0.7 ($p=1$, $r=1$); (b) $p$ varying from 0.1 to 0.7 ($r=1$); (c) $p$ varying from 0.1 to 0.7 ($r=2$); (d) $p$ varying from 0.1 to 0.7 ($r=3$).
Table 5. The comparison of different methods on the real-world data sets on transductive accuracy. •/○ indicates whether PL-CL is statistically superior/inferior to the compared algorithm according to the pairwise t-test at the 0.05 significance level.
Data set FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5)
PL-CL 0.159 ± 0.016 0.832 ± 0.019 0.585 ± 0.012 0.697 ± 0.016 0.715 ± 0.002 0.840 ± 0.004 0.600 ± 0.023 0.737 ± 0.018
SURE 0.158 ± 0.012 0.798 ± 0.019 • 0.603 ± 0.015 ○ 0.652 ± 0.022 • 0.700 ± 0.003 • 0.798 ± 0.005 • 0.590 ± 0.018 • 0.727 ± 0.019 •
PL-AGGD 0.130 ± 0.015 • 0.804 ± 0.016 • 0.551 ± 0.015 • 0.653 ± 0.015 • 0.698 ± 0.003 • 0.817 ± 0.005 • 0.530 ± 0.021 • 0.679 ± 0.024 •
LALO 0.153 ± 0.016 • 0.817 ± 0.012 • 0.548 ± 0.009 • 0.675 ± 0.017 • 0.698 ± 0.003 • 0.821 ± 0.004 • 0.592 ± 0.024 • 0.730 ± 0.015 •
IPAL 0.148 ± 0.021 0.793 ± 0.017 • 0.680 ± 0.013 ○ 0.586 ± 0.007 • 0.681 ± 0.003 • 0.839 ± 0.003 0.563 ± 0.021 • 0.698 ± 0.022 •
PLDA 0.042 ± 0.005 • 0.351 ± 0.060 • 0.479 ± 0.015 • 0.564 ± 0.015 • 0.493 ± 0.004 • 0.460 ± 0.009 • 0.150 ± 0.012 • 0.232 ± 0.012 •
PL-KNN 0.041 ± 0.007 • 0.338 ± 0.016 • 0.415 ± 0.013 • 0.466 ± 0.013 • 0.504 ± 0.005 • 0.403 ± 0.009 • 0.285 ± 0.016 • 0.438 ± 0.014 •
Table 6. Ablation study of the proposed approach on the real-world data sets. •/○ indicates whether PL-CL is statistically superior/inferior to the compared variant according to the pairwise t-test at the 0.05 significance level.
Classification Accuracy
Kernel Complementary Classifier Graph FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5) Average
0.061 ± 0.006 • 0.622 ± 0.019 • 0.381 ± 0.015 • 0.249 ± 0.010 • 0.492 ± 0.003 • 0.430 ± 0.051 • 0.402 ± 0.031 • 0.551 ± 0.024 • 0.398
0.060 ± 0.008 • 0.654 ± 0.019 • 0.426 ± 0.017 • 0.533 ± 0.012 • 0.515 ± 0.010 • 0.526 ± 0.006 • 0.413 ± 0.026 • 0.564 ± 0.018 • 0.461
0.057 ± 0.009 • 0.705 ± 0.023 • 0.462 ± 0.015 • 0.637 ± 0.011 • 0.530 ± 0.004 • 0.517 ± 0.046 • 0.416 ± 0.017 • 0.560 ± 0.019 • 0.485
0.065 ± 0.008 • 0.684 ± 0.022 • 0.456 ± 0.014 • 0.635 ± 0.012 • 0.529 ± 0.003 • 0.607 ± 0.003 • 0.426 ± 0.024 • 0.566 ± 0.017 • 0.496
0.072 ± 0.009 0.709 ± 0.022 0.469 ± 0.016 0.642 ± 0.012 0.534 ± 0.004 0.618 ± 0.003 0.433 ± 0.022 0.575 ± 0.015 0.507
Transductive Accuracy
Kernel Complementary Classifier Graph FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5) Average
0.161 ± 0.017 • 0.729 ± 0.017 • 0.414 ± 0.013 • 0.464 ± 0.008 • 0.492 ± 0.004 • 0.439 ± 0.003 • 0.580 ± 0.024 • 0.717 ± 0.022 • 0.499
0.159 ± 0.012 • 0.748 ± 0.015 • 0.515 ± 0.018 • 0.594 ± 0.013 • 0.700 ± 0.004 • 0.774 ± 0.032 • 0.588 ± 0.021 • 0.726 ± 0.010 • 0.600
0.141 ± 0.011 • 0.819 ± 0.017 • 0.572 ± 0.013 • 0.685 ± 0.014 • 0.714 ± 0.002 • 0.839 ± 0.004 • 0.568 ± 0.025 • 0.706 ± 0.022 • 0.630
0.150 ± 0.015 • 0.824 ± 0.021 • 0.567 ± 0.008 • 0.687 ± 0.012 • 0.701 ± 0.003 • 0.829 ± 0.004 • 0.570 ± 0.022 • 0.710 ± 0.021 • 0.629
0.159 ± 0.016 0.832 ± 0.019 0.585 ± 0.012 0.697 ± 0.016 0.715 ± 0.002 0.840 ± 0.004 0.600 ± 0.023 0.737 ± 0.018 0.646
Figure 3. Sensitivity and convergence curves of PL-CL: (a)-(e) sensitivity w.r.t. $\alpha$, $\beta$, $\gamma$, $\mu$ and $\lambda$; (f)-(k) convergence curves on FG-NET, Lost, MSRCv2, Mirflickr, Soccer Player and Yahoo!News.

5.3. Performance on Real-World Data Sets

We evaluated the performance of our method and the compared ones on 6 real-world data sets, whose details are summarized in Table 1. For facial age estimation, e.g., FG-NET (Panis et al., 2016), human faces are represented as instances while the ages annotated by crowd-sourcing labelers are considered as candidate labels. For automatic face naming, i.e., Lost (Cour et al., 2009), Soccer Player (Zeng et al., 2013) and Yahoo!News (Guillaumin et al., 2010), each face cropped from an image or a video frame is taken as an instance with the names extracted from the corresponding captions or subtitles as its candidate labels. For the object classification task, e.g., MSRCv2 (Liu and Dietterich, 2012b), image segments are considered as instances with objects appearing in the same image as candidate labels. For the web image classification task, e.g., Mirflickr (Huiskes and Lew, 2008), each image is represented as an instance with the tags extracted from the web page being candidate labels.

As the average number of candidate labels of FG-NET is very large, the classification accuracy on it can be quite low. Following (Zhang et al., 2016), we further employed the mean absolute error (MAE) as an evaluation criterion. Specifically, for FG-NET(MAE3)/(MAE5), test examples are considered to be correctly classified if the error between the predicted age and the true age is no more than 3/5 years.
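This criterion amounts to the following simple computation (our sketch):

```python
import numpy as np

def mae_accuracy(pred_ages, true_ages, k):
    """Fraction of test examples whose |predicted age - true age| is at most k years (k = 3 or 5)."""
    return np.mean(np.abs(np.asarray(pred_ages) - np.asarray(true_ages)) <= k)
```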

Table 2 shows the classification accuracy with standard deviation of each approach on the real-world data sets, where we can observe that

  • PL-CL significantly outperforms PL-AGGD, LALO, PLDA and PL-KNN on all the real-world data sets. Compared with two graph-based approaches PL-AGGD and LALO, PL-CL achieves outstanding performance, which validates the effectiveness of the proposed complementary classifier.

  • IPAL performs well on MSRCv2 and Yahoo!News, while it does not perform as well on the other data sets. In contrast, PL-CL performs well on all the data sets, which shows the robustness and effectiveness of PL-CL.

  • In a nutshell, PL-CL achieves significantly superior accuracy in 95.8% of cases according to the pairwise t-test at the 0.05 significance level.

5.4. Performance on Controlled UCI Data Sets

Table 3 summarizes the characteristics of four UCI data sets. Following the widely-used controlling protocol in PLL (Zeng et al., 2013), (Liu and Dietterich, 2012a), a synthetic partial label data set can be generated from a UCI data set, controlled by three parameters $p$, $r$ and $\epsilon$. Specifically, $p$ controls the proportion of the training instances with partial labels, $r$ denotes the number of false positive labels residing in the candidate label set, and $\epsilon$ stands for the co-occurring probability of one candidate label and the ground-truth label.
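For reference, the $p$/$r$ controlling protocol (detailed below) can be sketched as follows (ours; the $\epsilon$-based protocol instead fixes one specific label to co-occur with the ground-truth label):

```python
import numpy as np

def make_partial_labels(y_true, n_labels, p, r, seed=0):
    """Turn a fraction p of samples into partial label examples with r random false positives each."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    n = len(y_true)
    Y = np.zeros((n, n_labels))
    Y[np.arange(n), y_true] = 1.0                          # the ground-truth label is always a candidate
    for i in rng.choice(n, size=int(p * n), replace=False):
        others = np.delete(np.arange(n_labels), y_true[i])
        Y[i, rng.choice(others, size=r, replace=False)] = 1.0   # add r false positive labels
    return Y
```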

With fixed $p=1$ and $r=1$, the classification accuracy of all the methods when $\epsilon$ varies from 0.1 to 0.7 with step size 0.1 is shown in Fig. 2(a). A specific label was chosen to co-occur with the ground-truth label with probability $\epsilon$, and the rest of the labels were randomly chosen with probability $(1-\epsilon)/(l-2)$. Figs. 2(b)-(d) show the classification accuracy with $p$ varying from 0.1 to 0.7 when $r=1,2,3$, respectively. We randomly selected a proportion $p$ of the training samples as partial label examples with randomly chosen $r$ labels as the corresponding false positive labels, i.e., the size of the candidate label set of each instance is $r+1$ (under this setting, as the false positive labels are randomly chosen, $\epsilon$ is set to $\frac{1}{l-1}$). Table 4 summarizes the detailed win/tie/loss counts between PL-CL and the other methods according to the pairwise t-test with a significance level of 0.05. From Fig. 2 and Table 4 we can conclude that:

  • With increasing $p$ and $\epsilon$, the classification accuracy of all the methods generally drops, as a larger $p$ or a larger $\epsilon$ means a more difficult PLL task. Even so, our approach still performs better than the compared methods.

  • As the number of candidate labels increases, the gaps between PL-CL and the others become more obvious, which means that with more candidate labels, the effect of the complementary classifier is more prominent.

  • PL-CL clearly outperforms the other methods under all the settings, i.e., PL-CL outperforms the other compared methods significantly in 92.1% of cases and achieves at least comparable performance in 7.9% of cases out of a total of 672 cases. Moreover, PL-CL is never significantly outperformed by any compared approach.

5.5. Further Analysis

5.5.1. Disambiguation Ability

Transductive accuracy evaluates the classification accuracy of a PLL method on the training examples, which reflects the ability of label disambiguation. Table 5 shows the transductive accuracy of all the methods on all the real-world data sets, where we can find that PL-CL outperforms the other compared methods in 89.5% of cases and is at least comparable to them in 6.3% of cases according to the pairwise t-test, which validates the disambiguation ability of PL-CL.

5.5.2. Ablation Study

In order to validate the effectiveness of the complementary classifier and the adaptive graph, ablation studies were conducted and the results are shown in Table 6. It can be observed from Table 6 that the complementary classifier is quite useful when comparing the classification accuracy of the second and fourth rows or the third and fifth rows, as the method with the complementary classifier is superior to that without it. Additionally, when we compare the last two rows of the table, we can find that the model with the graph structure achieves superior performance to that without the graph. In general, PL-CL outperforms its degenerated versions in all cases w.r.t. both classification accuracy and transductive accuracy, which shows that both the complementary classifier and the graph structure play significant roles in our model. Besides, the use of the kernel can also improve the performance of PL-CL, which can be observed by comparing the results in the first and second rows. It is worth noting that although the kernel contributes to the improvement of the performance, the main improvement of PL-CL comes from the complementary classifier and the graph.

5.5.3. Sensitivity and Convergence Analysis

Figs. 3(a)-(e) illustrate the sensitivity of PL-CL w.r.t. $\alpha$, $\beta$, $\mu$, $\gamma$ and $\lambda$ on Lost, where we can find that PL-CL performs quite stably as the values of the hyper-parameters change within a reasonably wide range. Additionally, Figs. 3(f)-(k) show the convergence curves of PL-CL, where PL-CL converges quickly, within about 8 iterations, on all the data sets, which is also a desirable property in practice.

6. Conclusion and Future Work

In this paper, a novel method PL-CL based on a complementary classifier has been proposed to solve the PLL task. Different from previous works, we incorporated the information of complementary labels by inducing a complementary classifier, and used it to achieve candidate label disambiguation through a novel adversarial prior. Besides, PL-CL adopts an adaptive local graph shared by both the feature space and the label space to assist disambiguation. Comprehensive experiments on four controlled UCI data sets as well as six real-world data sets validated the effectiveness of PL-CL and the usefulness of the proposed complementary classifier. In the future, we will investigate how to design deep-learning-based PLL models by leveraging the complementary classifier.

References

  • Cour et al. (2009) Timothée Cour, Benjamin Sapp, Chris Jordan, and Benjamin Taskar. 2009. Learning from ambiguously labeled images. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society, 919–926. https://doi.org/10.1109/CVPR.2009.5206667
  • Cour et al. (2011) Timothée Cour, Benjamin Sapp, and Ben Taskar. 2011. Learning from Partial Labels. J. Mach. Learn. Res. 12 (2011), 1501–1536. https://doi.org/10.5555/1953048.2021049
  • Feng and An (2018) Lei Feng and Bo An. 2018. Leveraging Latent Label Distributions for Partial Label Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, Jérôme Lang (Ed.). ijcai.org, 2107–2113. https://doi.org/10.24963/ijcai.2018/291
  • Feng and An (2019) Lei Feng and Bo An. 2019. Partial Label Learning with Self-Guided Retraining. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 3542–3549. https://doi.org/10.1609/aaai.v33i01.33013542
  • Feng et al. (2020) Lei Feng, Takuo Kaneko, Bo Han, Gang Niu, Bo An, and Masashi Sugiyama. 2020. Learning with Multiple Complementary Labels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 3072–3081. http://proceedings.mlr.press/v119/feng20a.html
  • Guillaumin et al. (2010) Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. 2010. Multiple Instance Metric Learning from Automatically Labeled Bags of Faces. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 6311), Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer, 634–647. https://doi.org/10.1007/978-3-642-15549-9_46
  • Hou et al. (2016) Peng Hou, Xin Geng, and Min-Ling Zhang. 2016. Multi-Label Manifold Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, Dale Schuurmans and Michael P. Wellman (Eds.). AAAI Press, 1680–1686. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12144
  • Huiskes and Lew (2008) Mark J. Huiskes and Michael S. Lew. 2008. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, MIR 2008, Vancouver, British Columbia, Canada, October 30-31, 2008, Michael S. Lew, Alberto Del Bimbo, and Erwin M. Bakker (Eds.). ACM, 39–43. https://doi.org/10.1145/1460096.1460104
  • Hüllermeier and Beringer (2005) Eyke Hüllermeier and Jürgen Beringer. 2005. Learning from Ambiguously Labeled Examples. 3646 (2005), 168–179. https://doi.org/10.1007/11552253_16
  • Ishida et al. (2017) Takashi Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. 2017. Learning from Complementary Labels. (2017), 5639–5649. https://proceedings.neurips.cc/paper/2017/hash/1dba5eed8838571e1c80af145184e515-Abstract.html
  • Liu and Dietterich (2012a) Li-Ping Liu and Thomas G. Dietterich. 2012a. A Conditional Multinomial Mixture Model for Superset Label Learning. (2012), 557–565. https://proceedings.neurips.cc/paper/2012/hash/aaebdb8bb6b0e73f6c3c54a0ab0c6415-Abstract.html
  • Liu and Dietterich (2012b) Li-Ping Liu and Thomas G. Dietterich. 2012b. A Conditional Multinomial Mixture Model for Superset Label Learning. (2012), 557–565. https://proceedings.neurips.cc/paper/2012/hash/aaebdb8bb6b0e73f6c3c54a0ab0c6415-Abstract.html
  • Lyu et al. (2021) Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang, and Yidong Li. 2021. GM-PLL: Graph Matching Based Partial Label Learning. IEEE Trans. Knowl. Data Eng. 33, 2 (2021), 521–535. https://doi.org/10.1109/TKDE.2019.2933837
  • Nguyen and Caruana (2008) Nam Nguyen and Rich Caruana. 2008. Classification with partial labels. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, Ying Li, Bing Liu, and Sunita Sarawagi (Eds.). ACM, 551–559. https://doi.org/10.1145/1401890.1401958
  • Panis et al. (2016) Gabriel Panis, Andreas Lanitis, Nicolas Tsapatsoulis, and Timothy F. Cootes. 2016. Overview of research on facial ageing using the FG-NET ageing database. IET Biom. 5, 2 (2016), 37–46. https://doi.org/10.1049/iet-bmt.2014.0053
  • Wang et al. (2022) Deng-Bao Wang, Min-Ling Zhang, and Li Li. 2022. Adaptive Graph Guided Disambiguation for Partial Label Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 12 (2022), 8796–8811. https://doi.org/10.1109/TPAMI.2021.3120012
  • Wang and Zhang (2022) Wei Wang and Min-Ling Zhang. 2022. Partial Label Learning with Discrimination Augmentation. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, Aidong Zhang and Huzefa Rangwala (Eds.). ACM, 1920–1928. https://doi.org/10.1145/3534678.3539363
  • Yang and Vozila (2014) Fan Yang and Paul Vozila. 2014. Semi-Supervised Chinese Word Segmentation Using Partial-Label Learning With Conditional Random Fields. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 90–98. https://doi.org/10.3115/v1/d14-1010
  • Yu et al. (2018) Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. 2018. Learning with Biased Complementary Labels. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11205), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer, 69–85. https://doi.org/10.1007/978-3-030-01246-5_5
  • Zeng et al. (2013) Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma. 2013. Learning by Associating Ambiguously Labeled Images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013. IEEE Computer Society, 708–715. https://doi.org/10.1109/CVPR.2013.97
  • Zhang et al. (2016) Min-Ling Zhang, Bin-Bin Zhou, and Xu-Ying Liu. 2016. Partial Label Learning via Feature-Aware Disambiguation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi (Eds.). ACM, 1335–1344. https://doi.org/10.1145/2939672.2939788
  • Zhang and Yu (2015) Min-Ling Zhang and Fei Yu. 2015. Solving the partial label learning problem: An instance-based approach. In Twenty-fourth international joint conference on artificial intelligence.