Complementary Classifier Induced Partial Label Learning
Abstract.
In partial label learning (PLL), each training sample is associated with a set of candidate labels, among which only one is valid. The core of PLL is to disambiguate the candidate labels to get the ground-truth one. In disambiguation, the existing works usually do not fully investigate the effectiveness of the non-candidate label set (a.k.a. complementary labels), which accurately indicates a set of labels that do not belong to a sample. In this paper, we use the non-candidate labels to induce a complementary classifier, which naturally forms an adversarial relationship against the traditional PLL classifier, to eliminate the false-positive labels in the candidate label set. Besides, we assume the feature space and the label space share the same local topological structure captured by a dynamic graph, and use it to assist disambiguation. Extensive experimental results validate the superiority of the proposed approach over state-of-the-art PLL methods on 4 controlled UCI data sets and 6 real-world data sets, and reveal the usefulness of complementary learning in PLL. The code has been released at https://github.com/Chongjie-Si/PL-CL.
1. Introduction
In traditional multi-class learning, each training sample is associated with a single label indicating the class that sample belongs to. However, collecting such highly accurate labels is expensive and time-consuming, which is also a bottleneck of many real-world classification tasks. Therefore, partial label learning (PLL) (Hüllermeier and Beringer, 2005), (Zhang et al., 2016), (Lyu et al., 2021) was proposed, in which an instance is associated with a set of candidate labels, among which only one is valid. PLL has been applied to many real-world applications, such as natural language processing (Yang and Vozila, 2014), web mining (Huiskes and Lew, 2008) and image classification (Zeng et al., 2013), (Liu and Dietterich, 2012a), etc.
Formally speaking, suppose $\mathcal{X} \subseteq \mathbb{R}^{q}$ denotes the $q$-dimensional feature space and $\mathcal{Y} = \{1, 2, \ldots, l\}$ is the label space with $l$ classes. Given a partial label data set $\mathcal{D} = \{(\boldsymbol{x}_i, S_i)\}_{i=1}^{n}$, where $\boldsymbol{x}_i \in \mathcal{X}$ is the $i$-th sample, $S_i \subseteq \mathcal{Y}$ is the corresponding candidate label set, and $n$ is the number of samples, the task of PLL is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ based on $\mathcal{D}$. As the ground-truth label $y_i$ of each sample is concealed in the corresponding candidate label set, i.e., $y_i \in S_i$, and is not directly accessible, PLL is a quite challenging problem.
Accordingly, disambiguating the candidate labels becomes the core task of PLL. A naive strategy is to treat all the candidate labels of a sample equally and then average the modeling outputs of the candidate labels, which is known as the averaging-based framework (Cour et al., 2011). For example, PL-KNN (Hüllermeier and Beringer, 2005) averages the candidate labels of neighboring instances to make predictions. However, this naive strategy may fail when false-positive labels dominate the candidate label set. The second strategy is to identify the ground-truth label from the candidate label set, which is known as the identification-based framework (a.k.a. the disambiguation-based approach).

To disambiguate the candidate labels, we believe the following three kinds of prior information are important, while previous methods usually exploit only one or two of them. The first type is the correlations among instances, i.e., if two samples are similar to each other in the feature space, they are likely to have the same label. For instance, PL-AGGD (Wang et al., 2022) and IPAL (Zhang and Yu, 2015) use the feature matrix to construct a similarity graph and further use that graph to realize disambiguation. The second kind of prior is the mapping from instances to the candidate labels, i.e., we could expect that the mapping error of an instance to the correct label is small, while the mapping error of that instance to an incorrect label in the candidate label set (i.e., a false-positive label) is relatively large. For example, SURE (Feng and An, 2019) adopts such a mapping to automatically remove the incorrect labels by maximizing an infinity norm. The last kind of information is the non-candidate label set, which precisely describes a set of labels that should not be assigned to a sample. Although the information in the non-candidate label set is directly available and accurate, the majority of existing works overlook this valuable information during disambiguation.
To fully take advantage of the above-mentioned priors, as shown in Fig. 1, we propose a novel PLL method named PL-CL, i.e., Partial Label learning based on Complementary Learning. Specifically, PL-CL uses the candidate labels to design an ordinary classifier, and meanwhile adopts the non-candidate labels to develop a complementary classifier. As the ordinary classifier indicates which label should be assigned to a sample while the complementary classifier specifies the labels that should not be assigned to that sample, the outputs of the two classifiers form an adversarial relationship. Moreover, as the non-candidate labels are accurate, the trained complementary classifier is highly reliable and can be used to promote the ordinary classifier through the adversarial relationship, which finally enhances PLL. Additionally, inspired by (Hou et al., 2016), which observes that the feature space and the label space of a data set usually lie in two different manifolds but share a similar local structure, PL-CL constructs an adaptive local topological graph shared by both the feature space and the label space. With this graph, similar samples in the feature space will possess similar labels, which further benefits the disambiguation of PLL. PL-CL is finally formulated as an adaptive graph regularized dual-classifier learning problem, and solved by an alternating iterative algorithm. Comprehensive experiments on both the controlled UCI data sets and the real-world data sets validate that PL-CL performs significantly better than state-of-the-art PLL methods, and also substantiate the effectiveness of the complementary classifier and the shared adaptive local graph structure in PLL.
The rest of the paper is organized as follows. We first review some related works in Section 2 and introduce the proposed approach in Section 3. The numerical solution is provided in Section 4. Experimental results and analyses are shown in Section 5, and conclusions and future research directions are finally given in Section 6.
2. Related Work
2.1. Partial Label Learning
PLL is a representative weakly supervised learning framework, which learns from ambiguous supervision information. In PLL, an instance is associated with a set of labels, among which only one is valid. Since the ground-truth label of each sample is concealed in its candidate labels and is not directly accessible during the training phase, PLL is a quite challenging task.
To address the mentioned challenge, the key approach is candidate label disambiguation, e.g., (Feng and An, 2018), (Nguyen and Caruana, 2008), (Feng and An, 2019), (Zhang and Yu, 2015). The existing methods can be summarized into two categories: averaging-based approaches and identification-based approaches. For the averaging-based approaches (Hüllermeier and Beringer, 2005), (Cour et al., 2011), each candidate label of a training sample is treated in an equal manner and the model prediction is yielded by averaging the modeling outputs of the candidate labels. For example, PL-KNN (Hüllermeier and Beringer, 2005) averages the candidate labels of neighboring instances to make predictions. This strategy is intuitive; however, it may fail when false-positive labels dominate the candidate label set. For the identification-based approaches (Wang et al., 2022), (Feng and An, 2019), (Feng and An, 2018), the valid label is considered as a latent variable and is identified through an iterative optimization procedure. For instance, PL-AGGD (Wang et al., 2022) uses the feature matrix to construct a similarity graph, which is further utilized for generating the labeling confidence of each candidate label.
In this paper, we summarize the following three kinds of priors that are important for disambiguation. The first type is the correlations among instances, i.e., similar samples in the feature space may share the same label. The second kind of prior is the mapping from instances to the candidate labels, i.e., the mapping error of an instance to the valid label is small, while that to an incorrect label in the candidate label set is large. The last kind of information is the non-candidate label set, which precisely describes a set of labels that should not be assigned to a sample. Existing works mainly focus on the first two types of priors, while the valuable information in the non-candidate labels (complementary labels), which is directly available and accurate, is overlooked. Therefore, we fully take advantage of the above-mentioned priors (especially the complementary labels) to achieve PLL.
2.2. Learning from Complementary Labels
Complementary-label learning (CLL) is another representative framework of weakly supervised learning (Feng et al., 2020), (Yu et al., 2018). In traditional CLL (Ishida et al., 2017), an instance is associated with only one complementary label, which indicates a class that the sample does not belong to. Although complementary label information is also adopted in this paper, it is used quite differently from traditional complementary-label learning. First, in partial label learning the number of complementary labels of each sample is usually more than one, while an instance has only one complementary label in complementary-label learning. Additionally, the aim of the complementary classifier in PLL is to assist the disambiguation of the candidate labels, which is different from CLL. To the best of our knowledge, the information of complementary labels has been overlooked to some extent in previous PLL works, and our work is the first to induce a complementary classifier that assists the construction of the ordinary classifier through an adversarial prior in PLL.
3. The Proposed Approach
PLL aims to induce a multi-class classifier $f: \mathcal{X} \rightarrow \mathcal{Y}$ based on the data set $\mathcal{D}$. Denote by $\mathbf{X} = [\boldsymbol{x}_1, \ldots, \boldsymbol{x}_n]^{\top} \in \mathbb{R}^{n \times q}$ the instance matrix and by $\mathbf{Y} \in \{0, 1\}^{n \times l}$ the partial label matrix with $l$ labels. Note that the value of $Y_{ij}$ is binary, and $Y_{ij} = 1$ (resp. $Y_{ij} = 0$) if the $j$-th label resides in (resp. does not reside in) the candidate label set of $\boldsymbol{x}_i$.
As analyzed in the introduction, three kinds of priors are important to label disambiguation. We use all three kinds of priors to solve the PLL problem, and formulate the loss function as

$$\min \ \mathcal{L}_{o}(\mathbf{X}, \mathbf{Y}) + \mathcal{L}_{c}(\mathbf{X}, \bar{\mathbf{Y}}) + \mathcal{L}_{g}(\mathbf{X}), \qquad (1)$$

where $\bar{\mathbf{Y}} = \mathbf{E} - \mathbf{Y}$ is the complementary label matrix ($\mathbf{E}$ being the all-ones matrix), i.e., $\bar{Y}_{ij} = 1$ (resp. $\bar{Y}_{ij} = 0$) if the $j$-th label is in the non-candidate (resp. candidate) label set of the $i$-th instance. $\mathcal{L}_{o}$ denotes the loss of the mapping from the instance matrix to the candidate labels, i.e., the ordinary classifier. $\mathcal{L}_{c}$ denotes the loss of the complementary classifier, i.e., the mapping from the instances to the non-candidate labels, and $\mathcal{L}_{g}$ uses the correlations of instances to disambiguate the candidate labels. These three terms correspond to three different ways of disambiguation in PLL, and their details will be introduced in the following subsections.
3.1. Ordinary Classifier
First, PL-CL uses a weight matrix $\mathbf{W} \in \mathbb{R}^{q \times l}$ to map the instance matrix $\mathbf{X}$ to the labeling confidence matrix $\mathbf{P} \in [0, 1]^{n \times l}$, where $\boldsymbol{p}_i$ (the $i$-th row of $\mathbf{P}$) is the labeling confidence vector corresponding to $\boldsymbol{x}_i$, and $P_{ij}$ represents the probability of the $j$-th label being the ground-truth label of $\boldsymbol{x}_i$. To be specific, PL-CL uses the following least squares loss to form the ordinary classifier:

$$\mathcal{L}_{o} = \|\mathbf{X}\mathbf{W} + \mathbf{1}\boldsymbol{b}^{\top} - \mathbf{P}\|_F^2 + \lambda\|\mathbf{W}\|_F^2, \qquad (2)$$

where $\boldsymbol{b} \in \mathbb{R}^{l}$ is the bias term and $\mathbf{1}$ is an all-ones vector. $\|\mathbf{W}\|_F$ denotes the Frobenius norm of the weight matrix, and minimizing it will control the complexity of the model. $\lambda$ is a hyper-parameter trading off these terms. As $\mathbf{P}$ represents the labeling confidence, it should be strictly non-negative, and $P_{ij}$ has a chance to be positive only when the $j$-th label lies in the candidate label set of the $i$-th sample, i.e., we have $\mathbf{0} \leq \mathbf{P} \leq \mathbf{Y}$. Moreover, as each row of $\mathbf{P}$ indicates the labeling confidence of a certain sample, the sum of its elements should be equal to 1, i.e., $\mathbf{P}\mathbf{1} = \mathbf{1}$. The ideal state of each row of $\mathbf{P}$ is one-hot, for there is only one ground-truth label. By minimizing Eq. (2), the ground-truth label is expected to produce a smaller mapping error than the incorrect ones residing in the candidate label set, so the correct label is likely to be selected.
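To make the constraints on $\mathbf{P}$ concrete, the following NumPy sketch evaluates the ordinary-classifier term of Eq. (2) and maps a confidence matrix onto a feasible point (non-negative, supported only on the candidate labels, rows summing to one). The function names are illustrative, and the projection is a simple renormalization heuristic rather than the exact constrained update used by PL-CL.

```python
import numpy as np

def ordinary_loss(X, W, b, P, lam):
    """Value of the ordinary-classifier term of Eq. (2):
    ||X W + 1 b^T - P||_F^2 + lam * ||W||_F^2."""
    residual = X @ W + b[None, :] - P
    return np.sum(residual ** 2) + lam * np.sum(W ** 2)

def project_confidence(P, Y):
    """Map P onto a feasible point of {0 <= P <= Y, P 1 = 1}: keep mass only on the
    candidate labels and renormalize each row. A simple feasibility heuristic, not
    the exact constrained update solved by PL-CL."""
    P = np.clip(P, 0.0, None) * Y               # confidences outside the candidate set are forced to 0
    uniform = Y / Y.sum(axis=1, keepdims=True)  # fallback: uniform over candidate labels
    row_sums = P.sum(axis=1, keepdims=True)
    return np.where(row_sums > 1e-12, P / np.maximum(row_sums, 1e-12), uniform)
```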
3.2. Complementary Classifier
Suppose matrix $\mathbf{Q} \in [0, 1]^{n \times l}$ is the complementary-labeling confidence matrix, where $Q_{ij}$ denotes the confidence of the $j$-th label NOT being the ground-truth label of $\boldsymbol{x}_i$. Similar to the ordinary classifier, we construct a complementary classifier:

$$\mathcal{L}_{c} = \|\mathbf{X}\bar{\mathbf{W}} + \mathbf{1}\bar{\boldsymbol{b}}^{\top} - \mathbf{Q}\|_F^2 + \alpha\|\bar{\mathbf{W}}\|_F^2, \qquad (3)$$

where $\bar{\mathbf{W}}$ and $\bar{\boldsymbol{b}}$ are the weight matrix and bias for the complementary classifier. Different from $\mathbf{P}$, the ideal state of each row of $\mathbf{Q}$ is "zero-hot", i.e., only one value in the row is 0 with all the others 1, for there is only one ground-truth label in PLL. Therefore, we require $\bar{\mathbf{Y}} \leq \mathbf{Q} \leq \mathbf{E}$, which means $0 \leq Q_{ij} \leq 1$ if the $j$-th label resides in the candidate label set, and $Q_{ij} = 1$ if $\bar{Y}_{ij} = 1$.
The ordinary classifier predicts which label should be assigned to an instance, while the complementary classifier specifies a group of labels that should not be allocated to that instance. Therefore, an adversarial relationship exists between the outputs of the ordinary and complementary classifiers, i.e., a larger (resp. smaller) element in $\mathbf{P}$ implies a smaller (resp. larger) element in $\mathbf{Q}$ at the same position, and vice versa. Besides, as the elements in both $\mathbf{P}$ and $\mathbf{Q}$ are non-negative and no more than 1, the above adversarial relationship can be formulated as $\mathbf{P} + \mathbf{Q} = \mathbf{E}$. Taking the adversarial relationship into account, the complementary classifier becomes:

$$\mathcal{L}_{c} = \|\mathbf{X}\bar{\mathbf{W}} + \mathbf{1}\bar{\boldsymbol{b}}^{\top} - \mathbf{Q}\|_F^2 + \alpha\|\bar{\mathbf{W}}\|_F^2 + \beta\|\mathbf{P} + \mathbf{Q} - \mathbf{E}\|_F^2, \qquad (4)$$

where $\mathbf{E}$ is an all-ones matrix, and $\alpha$ and $\beta$ are two hyper-parameters. Different from the candidate label set, the complementary information in the non-candidate label set is directly accessible and accurate; by minimizing Eq. (4), this valuable information is passed to the labeling confidence matrix $\mathbf{P}$, which further assists label disambiguation.
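A small NumPy sketch of the loss in Eq. (4) makes the adversarial coupling explicit; the parameter names follow the notation above and the helper itself is only illustrative.

```python
import numpy as np

def complementary_loss(X, W_bar, b_bar, Q, P, alpha, beta):
    """Value of Eq. (4): fit the complementary confidences Q from the features and
    couple them to the ordinary confidences P via the adversarial term ||P + Q - E||_F^2,
    where E is the all-ones matrix."""
    residual = X @ W_bar + b_bar[None, :] - Q
    fit = np.sum(residual ** 2)
    regularization = alpha * np.sum(W_bar ** 2)
    adversarial = beta * np.sum((P + Q - 1.0) ** 2)
    return fit + regularization + adversarial
```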
3.3. Local Consistency in Label Manifold and Feature Manifold
As suggested by (Hou et al., 2016), the label space and the feature space of a data set lie in two different manifolds but with a similar local structure, i.e., if $\boldsymbol{x}_i$ and $\boldsymbol{x}_j$ are similar, their corresponding labeling confidence vectors $\boldsymbol{p}_i$ and $\boldsymbol{p}_j$ should also be similar. To this end, we first construct a local non-negative similarity matrix $\mathbf{G} \in \mathbb{R}^{n \times n}$ to capture the local manifold structure of the feature space, i.e.,

$$\min_{\mathbf{G}} \ \|\mathbf{X}^{\top} - \mathbf{X}^{\top}\mathbf{G}\|_F^2, \quad \text{s.t.} \ \mathbf{G}^{\top}\mathbf{1} = \mathbf{1}, \ \mathbf{0} \leq \mathbf{G} \leq \mathbf{S}, \qquad (5)$$

where $\mathbf{0}$ is an all-zeros matrix with the size of $n \times n$. $\mathbf{G}^{\top}\mathbf{1} = \mathbf{1}$ will ensure that the sum of each column of $\mathbf{G}$ is 1, i.e., $\mathbf{G}$ is normalized. $\mathbf{S} \in \{0, 1\}^{n \times n}$ denotes the $k$-nearest neighbors (KNN) location matrix, i.e., $S_{ij} = 1$ if $\boldsymbol{x}_i$ belongs to the KNN of $\boldsymbol{x}_j$, otherwise $S_{ij} = 0$. $\mathbf{0} \leq \mathbf{G} \leq \mathbf{S}$ means that the elements of $\mathbf{G}$ will be no less than 0 and no more than the elements of $\mathbf{S}$ at the same location, which ensures that one sample can only be reconstructed by its neighbors. We further assume the label space shares the same local structure as the feature space captured by $\mathbf{G}$, and use it to assist label disambiguation, i.e.,

$$\min_{\mathbf{P}} \ \|\mathbf{P}^{\top} - \mathbf{P}^{\top}\mathbf{G}\|_F^2. \qquad (6)$$

By minimizing Eq. (6), the local manifold structure will help produce a better $\mathbf{P}$. Finally, we combine Eqs. (5) and (6) together to jointly optimize the local structure $\mathbf{G}$ and the labeling confidence matrix $\mathbf{P}$, i.e.,

$$\mathcal{L}_{g} = \gamma\|\mathbf{X}^{\top} - \mathbf{X}^{\top}\mathbf{G}\|_F^2 + \mu\|\mathbf{P}^{\top} - \mathbf{P}^{\top}\mathbf{G}\|_F^2, \quad \text{s.t.} \ \mathbf{G}^{\top}\mathbf{1} = \mathbf{1}, \ \mathbf{0} \leq \mathbf{G} \leq \mathbf{S}, \qquad (7)$$

where $\gamma$ and $\mu$ are two hyper-parameters. By minimizing Eq. (7), the feature space and the label space are well connected, achieving better local consistency.
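The sketch below, again illustrative NumPy code, shows how the KNN location matrix $\mathbf{S}$ can be built and how the shared-graph term of Eq. (7) is evaluated for given $\mathbf{G}$ and $\mathbf{P}$; it does not solve the constrained problem itself.

```python
import numpy as np

def knn_location_matrix(X, k):
    """Binary matrix S with S[i, j] = 1 iff x_i is among the k nearest neighbors of x_j."""
    n = X.shape[0]
    dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    np.fill_diagonal(dist2, np.inf)              # a sample is not a neighbor of itself
    S = np.zeros((n, n))
    nearest = np.argsort(dist2, axis=0)[:k, :]   # indexes of the k nearest neighbors per column
    for j in range(n):
        S[nearest[:, j], j] = 1.0
    return S

def shared_graph_loss(X, P, G, gamma, mu):
    """Value of Eq. (7): the same graph G reconstructs both the features and the
    labeling confidences from the neighbors."""
    feature_term = np.sum((X.T - X.T @ G) ** 2)
    label_term = np.sum((P.T - P.T @ G) ** 2)
    return gamma * feature_term + mu * label_term
```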
3.4. Model Formulation
Based on the discussions above, the objective function of PL-CL is finally formulated as
$$\begin{aligned} \min_{\mathbf{W}, \boldsymbol{b}, \bar{\mathbf{W}}, \bar{\boldsymbol{b}}, \mathbf{P}, \mathbf{Q}, \mathbf{G}} \ & \|\mathbf{X}\mathbf{W} + \mathbf{1}\boldsymbol{b}^{\top} - \mathbf{P}\|_F^2 + \lambda\|\mathbf{W}\|_F^2 + \|\mathbf{X}\bar{\mathbf{W}} + \mathbf{1}\bar{\boldsymbol{b}}^{\top} - \mathbf{Q}\|_F^2 + \alpha\|\bar{\mathbf{W}}\|_F^2 \\ & + \beta\|\mathbf{P} + \mathbf{Q} - \mathbf{E}\|_F^2 + \gamma\|\mathbf{X}^{\top} - \mathbf{X}^{\top}\mathbf{G}\|_F^2 + \mu\|\mathbf{P}^{\top} - \mathbf{P}^{\top}\mathbf{G}\|_F^2 \\ \text{s.t.} \ & \mathbf{0} \leq \mathbf{P} \leq \mathbf{Y}, \ \mathbf{P}\mathbf{1} = \mathbf{1}, \ \bar{\mathbf{Y}} \leq \mathbf{Q} \leq \mathbf{E}, \ \mathbf{G}^{\top}\mathbf{1} = \mathbf{1}, \ \mathbf{0} \leq \mathbf{G} \leq \mathbf{S}, \end{aligned} \qquad (8)$$

where $\lambda$, $\alpha$, $\beta$, $\gamma$ and $\mu$ are the hyper-parameters balancing the different terms. PL-CL comprehensively takes the information in the candidate label set, the non-candidate label set, and the sample relationships into consideration to perform label disambiguation. As will be shown later, PL-CL is quite robust to all these hyper-parameters.
The most challenging issue in PLL is the inaccurate supervision. As the non-candidate labels precisely indicate the classes that an instance does not belong to, they can be regarded as a kind of accurate supervision, and leveraging them will certainly help PLL.
To the best of our knowledge, this is the first work to use the non-candidate labels to construct a complementary classifier in PLL to help label disambiguation. Moreover, the adversarial prior used to introduce the complementary classifier is also novel in PLL.
4. Numerical Solution
The optimization problem in Eq. (8) involves seven variables with different constraints; we therefore solve it in an alternating and iterative manner.
4.1. Update W, b and W̄, b̄
With the other variables fixed, problem (8) with respect to $\mathbf{W}$ and $\boldsymbol{b}$ can be reformulated as

$$\min_{\mathbf{W}, \boldsymbol{b}} \ \|\mathbf{X}\mathbf{W} + \mathbf{1}\boldsymbol{b}^{\top} - \mathbf{P}\|_F^2 + \lambda\|\mathbf{W}\|_F^2, \qquad (9)$$

which is a regularized least squares problem, and the corresponding closed-form solution is

$$\mathbf{W} = (\mathbf{X}^{\top}\mathbf{H}\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{H}\mathbf{P}, \quad \boldsymbol{b} = \frac{1}{n}(\mathbf{P} - \mathbf{X}\mathbf{W})^{\top}\mathbf{1}, \qquad (10)$$

where $\mathbf{I}$ is an identity matrix with the size of $q \times q$ and $\mathbf{H} = \mathbf{I}_n - \frac{1}{n}\mathbf{1}\mathbf{1}^{\top}$ is the centering matrix. Similarly, the problem (8) with respect to $\bar{\mathbf{W}}$ and $\bar{\boldsymbol{b}}$ is

$$\min_{\bar{\mathbf{W}}, \bar{\boldsymbol{b}}} \ \|\mathbf{X}\bar{\mathbf{W}} + \mathbf{1}\bar{\boldsymbol{b}}^{\top} - \mathbf{Q}\|_F^2 + \alpha\|\bar{\mathbf{W}}\|_F^2, \qquad (11)$$

which has an analytical solution similar to Eq. (10), i.e.,

$$\bar{\mathbf{W}} = (\mathbf{X}^{\top}\mathbf{H}\mathbf{X} + \alpha\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{H}\mathbf{Q}, \quad \bar{\boldsymbol{b}} = \frac{1}{n}(\mathbf{Q} - \mathbf{X}\bar{\mathbf{W}})^{\top}\mathbf{1}. \qquad (12)$$
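A compact NumPy sketch of the closed-form updates in Eqs. (10) and (12), assuming the centering-matrix form written above; the same routine serves the ordinary classifier (targets $\mathbf{P}$) and the complementary classifier (targets $\mathbf{Q}$).

```python
import numpy as np

def solve_ridge_with_bias(X, T, lam):
    """Closed-form minimizer of ||X W + 1 b^T - T||_F^2 + lam * ||W||_F^2.
    Optimizing b first gives b = mean(T - X W); substituting it back yields a
    ridge problem on the centered data, solved below."""
    n, q = X.shape
    H = np.eye(n) - np.ones((n, n)) / n                   # centering matrix
    W = np.linalg.solve(X.T @ H @ X + lam * np.eye(q), X.T @ H @ T)
    b = (T - X @ W).mean(axis=0)
    return W, b
```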
Kernel Extension. The above linear model may be too simple to tackle the complex relationships between the instances and the labels; therefore we extend the linear model to a kernel version. Suppose $\phi(\cdot)$ represents a feature transformation that maps the feature space to a higher-dimensional space, and $\boldsymbol{\Phi} = [\phi(\boldsymbol{x}_1), \ldots, \phi(\boldsymbol{x}_n)]^{\top}$ denotes the feature matrix in the higher-dimensional space. With the feature mapping, we can rewrite problem (9) as follows:

$$\min_{\mathbf{W}, \boldsymbol{b}, \mathbf{E}_r} \ \|\mathbf{E}_r\|_F^2 + \lambda\|\mathbf{W}\|_F^2, \quad \text{s.t.} \ \boldsymbol{\Phi}\mathbf{W} + \mathbf{1}\boldsymbol{b}^{\top} - \mathbf{P} = \mathbf{E}_r, \qquad (13)$$

where $\mathbf{E}_r$ is the regression error matrix. The Lagrangian function of Eq. (13) is formulated as

$$\mathcal{L} = \mathrm{tr}(\mathbf{E}_r^{\top}\mathbf{E}_r) + \lambda\,\mathrm{tr}(\mathbf{W}^{\top}\mathbf{W}) + \mathrm{tr}\big(\boldsymbol{\Lambda}^{\top}(\boldsymbol{\Phi}\mathbf{W} + \mathbf{1}\boldsymbol{b}^{\top} - \mathbf{P} - \mathbf{E}_r)\big), \qquad (14)$$

where $\boldsymbol{\Lambda}$ is the Lagrange multiplier and $\mathrm{tr}(\cdot)$ is the trace of a matrix. Based on the KKT conditions, we have

$$\frac{\partial\mathcal{L}}{\partial\mathbf{W}} = 2\lambda\mathbf{W} + \boldsymbol{\Phi}^{\top}\boldsymbol{\Lambda} = \mathbf{0}, \quad \frac{\partial\mathcal{L}}{\partial\mathbf{E}_r} = 2\mathbf{E}_r - \boldsymbol{\Lambda} = \mathbf{0}, \quad \frac{\partial\mathcal{L}}{\partial\boldsymbol{b}} = \boldsymbol{\Lambda}^{\top}\mathbf{1} = \mathbf{0}, \quad \boldsymbol{\Phi}\mathbf{W} + \mathbf{1}\boldsymbol{b}^{\top} - \mathbf{P} = \mathbf{E}_r. \qquad (15)$$

Define a kernel matrix $\mathbf{K} \in \mathbb{R}^{n \times n}$ with its element $K_{ij} = \kappa(\boldsymbol{x}_i, \boldsymbol{x}_j) = \phi(\boldsymbol{x}_i)^{\top}\phi(\boldsymbol{x}_j)$, where $\kappa(\cdot, \cdot)$ is the kernel function. For PL-CL, we use the Gaussian function as the kernel function and set its width to the average distance of all pairs of training instances. Then, with $\mathbf{W} = \boldsymbol{\Phi}^{\top}\mathbf{A}$, we have

$$\mathbf{A} = (\mathbf{K} + \lambda\mathbf{I})^{-1}(\mathbf{P} - \mathbf{1}\boldsymbol{b}^{\top}), \qquad (16)$$

where $\mathbf{A} = -\boldsymbol{\Lambda}/(2\lambda)$. We can also obtain $\boldsymbol{b}$ from Eq. (15). The output of the model can be denoted by $f(\mathbf{X}) = \mathbf{K}\mathbf{A} + \mathbf{1}\boldsymbol{b}^{\top}$.
As the optimization problem regarding $\bar{\mathbf{W}}$ and $\bar{\boldsymbol{b}}$ has the same form as problem (13), similarly, we have

$$\bar{\mathbf{A}} = (\mathbf{K} + \alpha\mathbf{I})^{-1}(\mathbf{Q} - \mathbf{1}\bar{\boldsymbol{b}}^{\top}), \qquad (17)$$

where $\bar{\mathbf{W}} = \boldsymbol{\Phi}^{\top}\bar{\mathbf{A}}$, and the output of the complementary classifier is defined as $\bar{f}(\mathbf{X}) = \mathbf{K}\bar{\mathbf{A}} + \mathbf{1}\bar{\boldsymbol{b}}^{\top}$.
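The kernelized updates can be sketched as follows; the Gaussian kernel width is set to the average pairwise distance of the training instances, as stated above, and the solver mirrors the linear closed form with the kernel matrix in place of the features (illustrative code, not the released implementation).

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X1, X2, sigma):
    """Gaussian (RBF) kernel matrix between two sets of instances."""
    d = cdist(X1, X2, metric="euclidean")
    return np.exp(-d ** 2 / (2.0 * sigma ** 2))

def kernel_solve(K, T, lam):
    """Solve min_{A, b} ||K A + 1 b^T - T||_F^2 + lam * tr(A^T K A):
    centering eliminates b, giving A = (H K + lam I)^{-1} H T and b = mean(T - K A)."""
    n = K.shape[0]
    H = np.eye(n) - np.ones((n, n)) / n
    A = np.linalg.solve(H @ K + lam * np.eye(n), H @ T)
    b = (T - K @ A).mean(axis=0)
    return A, b

# Illustrative usage (X_train, X_test and P are assumed to be given):
# sigma = cdist(X_train, X_train).mean()            # average pairwise distance
# K = gaussian_kernel(X_train, X_train, sigma)
# A, b = kernel_solve(K, P, lam=1.0)                # ordinary classifier
# scores = gaussian_kernel(X_test, X_train, sigma) @ A + b
```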
4.2. Update Q
Fixing the other variables, the $\mathbf{Q}$-subproblem can be equivalently rewritten as

$$\min_{\mathbf{Q}} \ \|\bar{\mathbf{F}} - \mathbf{Q}\|_F^2 + \beta\|\mathbf{P} + \mathbf{Q} - \mathbf{E}\|_F^2, \quad \text{s.t.} \ \bar{\mathbf{Y}} \leq \mathbf{Q} \leq \mathbf{E}, \qquad (18)$$

where $\bar{\mathbf{F}} = \mathbf{K}\bar{\mathbf{A}} + \mathbf{1}\bar{\boldsymbol{b}}^{\top}$ denotes the current output of the complementary classifier, and the solution is

$$\mathbf{Q} = \min\Big(\max\Big(\frac{1}{1 + \beta}\big(\bar{\mathbf{F}} + \beta(\mathbf{E} - \mathbf{P})\big), \ \bar{\mathbf{Y}}\Big), \ \mathbf{E}\Big), \qquad (19)$$

where $\max(\cdot, \cdot)$ and $\min(\cdot, \cdot)$ are two element-wise thresholding operators, i.e., $[\max(\mathbf{M}, \mathbf{N})]_{ij} = \max(M_{ij}, N_{ij})$ and $[\min(\mathbf{M}, \mathbf{N})]_{ij} = \min(M_{ij}, N_{ij})$.
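Since the objective in Eq. (18) is separable over the entries of $\mathbf{Q}$ and the constraint is a box, the update amounts to clipping the unconstrained minimizer, as the following sketch (with illustrative names) shows.

```python
import numpy as np

def update_Q(F_bar, P, Y, beta):
    """Q-update of Eq. (19): minimize ||F_bar - Q||_F^2 + beta * ||P + Q - E||_F^2
    subject to (1 - Y) <= Q <= 1, where F_bar is the current complementary-classifier
    output. The problem is element-wise, so the unconstrained minimizer is clipped
    into the box."""
    Y_bar = 1.0 - Y
    Q_unconstrained = (F_bar + beta * (1.0 - P)) / (1.0 + beta)
    return np.clip(Q_unconstrained, Y_bar, 1.0)
```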
4.3. Update G
Fixing the other variables, the $\mathbf{G}$-subproblem is rewritten as

$$\min_{\mathbf{G}} \ \gamma\|\mathbf{X}^{\top} - \mathbf{X}^{\top}\mathbf{G}\|_F^2 + \mu\|\mathbf{P}^{\top} - \mathbf{P}^{\top}\mathbf{G}\|_F^2, \quad \text{s.t.} \ \mathbf{G}^{\top}\mathbf{1} = \mathbf{1}, \ \mathbf{0} \leq \mathbf{G} \leq \mathbf{S}. \qquad (20)$$

Notice that each column of $\mathbf{G}$ is independent of the other columns; therefore, we can solve the $\mathbf{G}$-subproblem column by column. For the $i$-th column $\boldsymbol{g}_i$ of $\mathbf{G}$, we have

$$\min_{\boldsymbol{g}_i} \ \gamma\|\boldsymbol{x}_i - \mathbf{X}^{\top}\boldsymbol{g}_i\|_2^2 + \mu\|\boldsymbol{p}_i - \mathbf{P}^{\top}\boldsymbol{g}_i\|_2^2, \quad \text{s.t.} \ \boldsymbol{g}_i^{\top}\mathbf{1} = 1, \ \mathbf{0} \leq \boldsymbol{g}_i \leq \boldsymbol{s}_i, \qquad (21)$$

where $\boldsymbol{s}_i$ is the $i$-th column of $\mathbf{S}$. As there are only $k$ non-zero elements in $\boldsymbol{g}_i$ that are supposed to be updated, which correspond to the reconstruction coefficients of $\boldsymbol{x}_i$ by its top-$k$ neighbors, let $\hat{\boldsymbol{g}}_i \in \mathbb{R}^{k}$ denote the vector of these elements and let $\mathcal{N}_i$ be the set of the indexes of these neighbors of $\boldsymbol{x}_i$. Define matrices $\mathbf{X}_i \in \mathbb{R}^{k \times q}$ and $\mathbf{P}_i \in \mathbb{R}^{k \times l}$ by stacking $\{\boldsymbol{x}_j\}_{j \in \mathcal{N}_i}$ and $\{\boldsymbol{p}_j\}_{j \in \mathcal{N}_i}$ row by row, and let the Gram matrices be $\mathbf{A}_i = \mathbf{X}_i\mathbf{X}_i^{\top}$ and $\mathbf{B}_i = \mathbf{P}_i\mathbf{P}_i^{\top}$. Then we can transform the problem in Eq. (21) into the following form:

$$\min_{\hat{\boldsymbol{g}}_i} \ \hat{\boldsymbol{g}}_i^{\top}(\gamma\mathbf{A}_i + \mu\mathbf{B}_i)\hat{\boldsymbol{g}}_i - 2(\gamma\mathbf{X}_i\boldsymbol{x}_i + \mu\mathbf{P}_i\boldsymbol{p}_i)^{\top}\hat{\boldsymbol{g}}_i, \quad \text{s.t.} \ \hat{\boldsymbol{g}}_i^{\top}\mathbf{1} = 1, \ \mathbf{0} \leq \hat{\boldsymbol{g}}_i \leq \mathbf{1}. \qquad (22)$$

The optimization problem in Eq. (22) is a standard Quadratic Programming (QP) problem, and can be solved by off-the-shelf QP tools. $\mathbf{G}$ is updated by concatenating all the solved $\hat{\boldsymbol{g}}_i$ (placed back at the neighbor positions) together.
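Each column of $\mathbf{G}$ can be obtained with any off-the-shelf QP solver; the sketch below uses SciPy's SLSQP as a stand-in and follows the Gram-matrix form of Eq. (22) (variable names are illustrative).

```python
import numpy as np
from scipy.optimize import minimize

def update_G_column(X, P, S, j, gamma, mu):
    """Solve the per-column QP of Eq. (22) for the j-th column of G.
    Only the coefficients on the k neighbors of x_j (marked in S) are free; they are
    non-negative, bounded by 1, and sum to one."""
    idx = np.where(S[:, j] > 0)[0]                  # neighbor indexes of x_j
    Xn, Pn = X[idx], P[idx]                         # neighbor features and confidences
    A = gamma * (Xn @ Xn.T) + mu * (Pn @ Pn.T)      # combined Gram matrix
    v = gamma * (Xn @ X[j]) + mu * (Pn @ P[j])      # linear term from the reconstruction targets

    def objective(g):
        return g @ A @ g - 2.0 * v @ g

    k = len(idx)
    res = minimize(objective, x0=np.full(k, 1.0 / k), method="SLSQP",
                   bounds=[(0.0, 1.0)] * k,
                   constraints=[{"type": "eq", "fun": lambda g: g.sum() - 1.0}])
    g_col = np.zeros(X.shape[0])
    g_col[idx] = res.x                              # place the coefficients back at the neighbor positions
    return g_col
```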
4.4. Update P
With the other variables fixed, the $\mathbf{P}$-subproblem is

$$\min_{\mathbf{P}} \ \|\mathbf{F} - \mathbf{P}\|_F^2 + \beta\|\mathbf{P} + \mathbf{Q} - \mathbf{E}\|_F^2 + \mu\|\mathbf{P}^{\top} - \mathbf{P}^{\top}\mathbf{G}\|_F^2, \quad \text{s.t.} \ \mathbf{0} \leq \mathbf{P} \leq \mathbf{Y}, \ \mathbf{P}\mathbf{1} = \mathbf{1}, \qquad (23)$$

where $\mathbf{F} = \mathbf{K}\mathbf{A} + \mathbf{1}\boldsymbol{b}^{\top}$ denotes the current output of the ordinary classifier, and we rewrite the problem in Eq. (23) into the following form:

$$\min_{\mathbf{P}} \ (1 + \beta)\,\mathrm{tr}(\mathbf{P}^{\top}\mathbf{P}) - 2\,\mathrm{tr}\big(\mathbf{P}^{\top}(\mathbf{F} + \beta(\mathbf{E} - \mathbf{Q}))\big) + \mu\,\mathrm{tr}\big(\mathbf{P}^{\top}(\mathbf{I} - \mathbf{G})(\mathbf{I} - \mathbf{G})^{\top}\mathbf{P}\big), \quad \text{s.t.} \ \mathbf{0} \leq \mathbf{P} \leq \mathbf{Y}, \ \mathbf{P}\mathbf{1} = \mathbf{1}. \qquad (24)$$

In order to solve the problem (24), we denote $\boldsymbol{p} = \mathrm{vec}(\mathbf{P})$, where $\mathrm{vec}(\cdot)$ is the vectorization operator. Similarly, $\boldsymbol{v} = \mathrm{vec}\big(\mathbf{F} + \beta(\mathbf{E} - \mathbf{Q})\big)$ and $\boldsymbol{y} = \mathrm{vec}(\mathbf{Y})$. Let $\mathbf{M} \in \mathbb{R}^{nl \times nl}$ be a square matrix. Based on these notations, the optimization problem (24) can be written as

$$\min_{\boldsymbol{p}} \ \boldsymbol{p}^{\top}\mathbf{M}\boldsymbol{p} - 2\boldsymbol{v}^{\top}\boldsymbol{p}, \quad \text{s.t.} \ \mathbf{0} \leq \boldsymbol{p} \leq \boldsymbol{y}, \ (\mathbf{1}_{l}^{\top} \otimes \mathbf{I}_n)\boldsymbol{p} = \mathbf{1}_n, \qquad (25)$$

where $\mathbf{M}$ is defined as:

$$\mathbf{M} = (1 + \beta)\,\mathbf{I}_{nl} + \mu\big(\mathbf{I}_{l} \otimes (\mathbf{I}_n - \mathbf{G})(\mathbf{I}_n - \mathbf{G})^{\top}\big). \qquad (26)$$
As the problem in Eq. (25) is a standard QP problem, it can be solved by off-the-shelf QP tools.
4.5. The Overall Optimization Algorithm
As a summary, PL-CL first initializes $\mathbf{G}$ and $\mathbf{P}$ by solving problems (5) and (6) respectively, and initializes $\mathbf{Q}$ as:

$$\mathbf{Q} = \mathbf{E} - \mathbf{P}. \qquad (27)$$

Then, PL-CL iteratively and alternately updates one variable with the others fixed until the model converges. For an unseen instance $\boldsymbol{x}^{*}$, the predicted label is

$$\hat{y} = \arg\max_{1 \leq j \leq l} \ \big[\boldsymbol{\kappa}^{\top}\mathbf{A} + \boldsymbol{b}^{\top}\big]_{j}, \qquad (28)$$

where $\boldsymbol{\kappa} = [\kappa(\boldsymbol{x}^{*}, \boldsymbol{x}_1), \ldots, \kappa(\boldsymbol{x}^{*}, \boldsymbol{x}_n)]^{\top}$ is the kernel vector between $\boldsymbol{x}^{*}$ and the training instances.
The pseudo code of PL-CL is summarized in Algorithm 1.
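Assuming Eq. (28) simply takes the arg max of the kernelized ordinary-classifier output (the complementary classifier being used during training to shape $\mathbf{P}$), prediction can be sketched as follows, reusing the `gaussian_kernel` helper from the kernel-extension sketch.

```python
import numpy as np

def predict(X_test, X_train, A, b, sigma):
    """Predict labels for unseen instances with the kernelized ordinary classifier."""
    K_test = gaussian_kernel(X_test, X_train, sigma)   # from the kernel-extension sketch
    scores = K_test @ A + b[None, :]
    return np.argmax(scores, axis=1)
```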
4.6. Complexity Analysis
The complexity for solving $\mathbf{W}$, $\boldsymbol{b}$, $\bar{\mathbf{W}}$ and $\bar{\boldsymbol{b}}$ is , that for updating $\mathbf{Q}$ is , and the complexities for updating $\mathbf{G}$ and $\mathbf{P}$ are and , respectively. Therefore, the overall complexity of PL-CL is .
5. Experiments
5.1. Compared Methods
In order to validate the effectiveness of PL-CL, we compared it with six state-of-the-art PLL approaches:
- PL-AGGD (Wang et al., 2022): a PLL approach which constructs a similarity graph and further uses that graph to realize disambiguation [hyper-parameter configurations: , , , , ];
- SURE (Feng and An, 2019): a self-guided retraining based approach which maximizes an infinity norm regularization on the modeling outputs to realize disambiguation [hyper-parameter configurations: and ];
- LALO (Feng and An, 2018): an identification-based approach with constrained local consistency to differentiate the candidate labels [hyper-parameter configurations: , , ];
- IPAL (Zhang and Yu, 2015): an instance-based algorithm which identifies the valid label via an iterative label propagation procedure [hyper-parameter configurations: , , ];
- PLDA (Wang and Zhang, 2022): a dimensionality reduction approach for partial label learning [hyper-parameter configurations: PL-KNN];
- PL-KNN (Hüllermeier and Beringer, 2005): an averaging-based method that averages the neighbors' candidate labels to make predictions on an unseen sample [hyper-parameter configurations: ].
5.2. Experimental Settings
For PL-CL, we set , , , , and were chosen from . The kernel functions of SURE, PL-AGGD and LALO were the same as that of PL-CL. All the hyper-parameter configurations of each compared method were set according to the original papers. Ten runs of 50%/50% random train/test splits were performed, and the average accuracy with standard deviation on the test data was recorded for all the comparing algorithms.
Table 1. Characteristics of the real-world partial label data sets.

| Data set | # Examples | # Features | # Labels | Avg. # Candidate Labels | Task Domain |
|---|---|---|---|---|---|
| FG-NET (Panis et al., 2016) | 1002 | 262 | 78 | 7.48 | facial age estimation |
| Lost (Cour et al., 2009) | 1122 | 108 | 16 | 2.23 | automatic face naming |
| MSRCv2 (Liu and Dietterich, 2012b) | 1758 | 48 | 23 | 3.16 | object classification |
| Mirflickr (Huiskes and Lew, 2008) | 2780 | 1536 | 14 | 2.76 | web image classification |
| Soccer Player (Zeng et al., 2013) | 17472 | 279 | 171 | 2.09 | automatic face naming |
| Yahoo!News (Guillaumin et al., 2010) | 22991 | 163 | 219 | 1.91 | automatic face naming |
Table 2. Classification accuracy (mean ± std) of each comparing method on the real-world data sets.

| Data set | FG-NET | Lost | MSRCv2 | Mirflickr | Soccer Player | Yahoo!News | FG-NET(MAE3) | FG-NET(MAE5) |
|---|---|---|---|---|---|---|---|---|
| PL-CL | 0.072 ± 0.009 | 0.709 ± 0.022 | 0.469 ± 0.016 | 0.642 ± 0.012 | 0.534 ± 0.004 | 0.618 ± 0.003 | 0.433 ± 0.022 | 0.575 ± 0.015 |
| SURE | 0.052 ± 0.006 | 0.693 ± 0.020 | 0.445 ± 0.021 | 0.631 ± 0.021 | 0.519 ± 0.004 | 0.598 ± 0.002 | 0.356 ± 0.019 | 0.494 ± 0.020 |
| PL-AGGD | 0.063 ± 0.009 | 0.683 ± 0.014 | 0.451 ± 0.012 | 0.610 ± 0.011 | 0.524 ± 0.004 | 0.607 ± 0.004 | 0.387 ± 0.015 | 0.530 ± 0.015 |
| LALO | 0.065 ± 0.009 | 0.680 ± 0.014 | 0.448 ± 0.015 | 0.626 ± 0.013 | 0.523 ± 0.003 | 0.600 ± 0.003 | 0.423 ± 0.020 | 0.566 ± 0.014 |
| IPAL | 0.051 ± 0.011 | 0.646 ± 0.023 | 0.488 ± 0.031 | 0.527 ± 0.009 | 0.528 ± 0.003 | 0.625 ± 0.004 | 0.349 ± 0.019 | 0.500 ± 0.019 |
| PLDA | 0.042 ± 0.005 | 0.289 ± 0.045 | 0.422 ± 0.013 | 0.480 ± 0.015 | 0.493 ± 0.003 | 0.380 ± 0.003 | 0.150 ± 0.012 | 0.232 ± 0.012 |
| PL-KNN | 0.036 ± 0.006 | 0.296 ± 0.021 | 0.393 ± 0.014 | 0.454 ± 0.015 | 0.483 ± 0.005 | 0.368 ± 0.004 | 0.288 ± 0.013 | 0.440 ± 0.016 |
Table 3. Characteristics of the controlled UCI data sets.

| Data set | ecoli | glass | vehicle | steel |
|---|---|---|---|---|
| # Examples | 336 | 214 | 846 | 1941 |
| # Features | 7 | 9 | 18 | 27 |
| # Labels | 8 | 6 | 4 | 7 |
Table 4. Win/tie/loss counts (pairwise t-test at the 0.05 significance level) of PL-CL against each comparing method on the controlled UCI data sets under the four settings I–IV.

| Method | I | II | III | IV | Total |
|---|---|---|---|---|---|
| SURE | 22/6/0 | 26/2/0 | 26/2/0 | 24/4/0 | 98/14/0 |
| PL-AGGD | 26/2/0 | 25/3/0 | 23/5/0 | 22/6/0 | 96/16/0 |
| LALO | 27/1/0 | 24/4/0 | 26/2/0 | 22/6/0 | 99/13/0 |
| IPAL | 28/0/0 | 23/5/0 | 26/2/0 | 25/3/0 | 102/10/0 |
| PLDA | 28/0/0 | 28/0/0 | 28/0/0 | 28/0/0 | 112/0/0 |
| PL-KNN | 28/0/0 | 28/0/0 | 28/0/0 | 28/0/0 | 112/0/0 |
Table 5. Transductive accuracy (mean ± std) of each comparing method on the real-world data sets.

| Data set | FG-NET | Lost | MSRCv2 | Mirflickr | Soccer Player | Yahoo!News | FG-NET(MAE3) | FG-NET(MAE5) |
|---|---|---|---|---|---|---|---|---|
| PL-CL | 0.159 ± 0.016 | 0.832 ± 0.019 | 0.585 ± 0.012 | 0.697 ± 0.016 | 0.715 ± 0.002 | 0.840 ± 0.004 | 0.600 ± 0.023 | 0.737 ± 0.018 |
| SURE | 0.158 ± 0.012 | 0.798 ± 0.019 | 0.603 ± 0.015 | 0.652 ± 0.022 | 0.700 ± 0.003 | 0.798 ± 0.005 | 0.590 ± 0.018 | 0.727 ± 0.019 |
| PL-AGGD | 0.130 ± 0.015 | 0.804 ± 0.016 | 0.551 ± 0.015 | 0.653 ± 0.015 | 0.698 ± 0.003 | 0.817 ± 0.005 | 0.530 ± 0.021 | 0.679 ± 0.024 |
| LALO | 0.153 ± 0.016 | 0.817 ± 0.012 | 0.548 ± 0.009 | 0.675 ± 0.017 | 0.698 ± 0.003 | 0.821 ± 0.004 | 0.592 ± 0.024 | 0.730 ± 0.015 |
| IPAL | 0.148 ± 0.021 | 0.793 ± 0.017 | 0.680 ± 0.013 | 0.586 ± 0.007 | 0.681 ± 0.003 | 0.839 ± 0.003 | 0.563 ± 0.021 | 0.698 ± 0.022 |
| PLDA | 0.042 ± 0.005 | 0.351 ± 0.060 | 0.479 ± 0.015 | 0.564 ± 0.015 | 0.493 ± 0.004 | 0.460 ± 0.009 | 0.150 ± 0.012 | 0.232 ± 0.012 |
| PL-KNN | 0.041 ± 0.007 | 0.338 ± 0.016 | 0.415 ± 0.013 | 0.466 ± 0.013 | 0.504 ± 0.005 | 0.403 ± 0.009 | 0.285 ± 0.016 | 0.438 ± 0.014 |
Table 6. Ablation study of PL-CL on the real-world data sets (mean ± std): each row is a degenerated version indicated by which components (kernel, complementary classifier, adaptive graph) are enabled.

(a) Classification Accuracy

| Kernel | Complementary Classifier | Graph | FG-NET | Lost | MSRCv2 | Mirflickr | Soccer Player | Yahoo!News | FG-NET(MAE3) | FG-NET(MAE5) | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ✕ | ✕ | ✕ | 0.061 ± 0.006 | 0.622 ± 0.019 | 0.381 ± 0.015 | 0.249 ± 0.010 | 0.492 ± 0.003 | 0.430 ± 0.051 | 0.402 ± 0.031 | 0.551 ± 0.024 | 0.398 |
| ✓ | ✕ | ✕ | 0.060 ± 0.008 | 0.654 ± 0.019 | 0.426 ± 0.017 | 0.533 ± 0.012 | 0.515 ± 0.010 | 0.526 ± 0.006 | 0.413 ± 0.026 | 0.564 ± 0.018 | 0.461 |
| ✓ | ✕ | ✓ | 0.057 ± 0.009 | 0.705 ± 0.023 | 0.462 ± 0.015 | 0.637 ± 0.011 | 0.530 ± 0.004 | 0.517 ± 0.046 | 0.416 ± 0.017 | 0.560 ± 0.019 | 0.485 |
| ✓ | ✓ | ✕ | 0.065 ± 0.008 | 0.684 ± 0.022 | 0.456 ± 0.014 | 0.635 ± 0.012 | 0.529 ± 0.003 | 0.607 ± 0.003 | 0.426 ± 0.024 | 0.566 ± 0.017 | 0.496 |
| ✓ | ✓ | ✓ | 0.072 ± 0.009 | 0.709 ± 0.022 | 0.469 ± 0.016 | 0.642 ± 0.012 | 0.534 ± 0.004 | 0.618 ± 0.003 | 0.433 ± 0.022 | 0.575 ± 0.015 | 0.507 |

(b) Transductive Accuracy

| Kernel | Complementary Classifier | Graph | FG-NET | Lost | MSRCv2 | Mirflickr | Soccer Player | Yahoo!News | FG-NET(MAE3) | FG-NET(MAE5) | Average |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ✕ | ✕ | ✕ | 0.161 ± 0.017 | 0.729 ± 0.017 | 0.414 ± 0.013 | 0.464 ± 0.008 | 0.492 ± 0.004 | 0.439 ± 0.003 | 0.580 ± 0.024 | 0.717 ± 0.022 | 0.499 |
| ✓ | ✕ | ✕ | 0.159 ± 0.012 | 0.748 ± 0.015 | 0.515 ± 0.018 | 0.594 ± 0.013 | 0.700 ± 0.004 | 0.774 ± 0.032 | 0.588 ± 0.021 | 0.726 ± 0.010 | 0.600 |
| ✓ | ✕ | ✓ | 0.141 ± 0.011 | 0.819 ± 0.017 | 0.572 ± 0.013 | 0.685 ± 0.014 | 0.714 ± 0.002 | 0.839 ± 0.004 | 0.568 ± 0.025 | 0.706 ± 0.022 | 0.630 |
| ✓ | ✓ | ✕ | 0.150 ± 0.015 | 0.824 ± 0.021 | 0.567 ± 0.008 | 0.687 ± 0.012 | 0.701 ± 0.003 | 0.829 ± 0.004 | 0.570 ± 0.022 | 0.710 ± 0.021 | 0.629 |
| ✓ | ✓ | ✓ | 0.159 ± 0.016 | 0.832 ± 0.019 | 0.585 ± 0.012 | 0.697 ± 0.016 | 0.715 ± 0.002 | 0.840 ± 0.004 | 0.600 ± 0.023 | 0.737 ± 0.018 | 0.646 |
5.3. Performance on Real-World Data Sets
We evaluated the performance of our method and the compared ones on 6 real-world data sets, whose details are summarized in Table 1. For facial age estimation, e.g., FG-NET (Panis et al., 2016), human faces are represented as instances while the ages annotated by crowd-sourcing labelers are considered as candidate labels. For automatic face naming, i.e., Lost (Cour et al., 2009), Soccer Player (Zeng et al., 2013) and Yahoo!News (Guillaumin et al., 2010), each face cropped from an image or a video frame is taken as an instance with the names extracted from the corresponding captions or subtitles as its candidate labels. For the object classification task, e.g., MSRCv2 (Liu and Dietterich, 2012b), image segments are considered as instances with the objects appearing in the same image as candidate labels. For the web image classification task, e.g., Mirflickr (Huiskes and Lew, 2008), each image is represented as an instance with the tags extracted from the web page being its candidate labels.
As the average number of candidate labels of FG-NET is very large, the classification accuracy on it can be quite low. Following (Zhang et al., 2016), we therefore further employed the mean absolute error (MAE) as an evaluation criterion. Specifically, for FG-NET(MAE3)/FG-NET(MAE5), a test example is considered to be correctly classified if the error between the predicted age and the true age is no more than 3/5 years.
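The MAE-based criterion can be computed as below (an illustrative helper): a prediction on FG-NET counts as correct when it deviates from the true age by at most 3 or 5 years.

```python
import numpy as np

def accuracy_within_tolerance(predicted_ages, true_ages, tol):
    """Fraction of test examples whose absolute age error is at most `tol` years,
    i.e., the criterion behind FG-NET(MAE3) (tol=3) and FG-NET(MAE5) (tol=5)."""
    predicted_ages = np.asarray(predicted_ages)
    true_ages = np.asarray(true_ages)
    return float(np.mean(np.abs(predicted_ages - true_ages) <= tol))
```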
Table 2 shows the classification accuracy with standard deviation of each approach on the real-world data sets, where we can observe that
- PL-CL significantly outperforms PL-AGGD, LALO, PLDA and PL-KNN on all the real-world data sets. Compared with the two graph-based approaches PL-AGGD and LALO, PL-CL achieves outstanding performance, which validates the effectiveness of the proposed complementary classifier.
- IPAL performs well on MSRCv2 and Yahoo!News, while it is not well suited to the other data sets. In contrast, PL-CL performs well on all the data sets, which shows the robustness and effectiveness of PL-CL.
- In a nutshell, PL-CL achieves significantly superior accuracy in 95.8% of cases according to the pairwise t-test at the 0.05 significance level.
5.4. Performance on Controlled UCI Data Sets
Table 3 summarizes the characteristics of the four UCI data sets. Following the widely-used controlling protocol in PLL (Zeng et al., 2013), (Liu and Dietterich, 2012a), a synthetic partial label data set can be generated from a UCI data set, controlled by three parameters $p$, $r$ and $\varepsilon$. Specifically, $p$ controls the proportion of the training instances with partial labels, $r$ denotes the number of false-positive labels residing in the candidate label set, and $\varepsilon$ stands for the co-occurring probability of one specific candidate label and the ground-truth label.
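A sketch of the candidate-set generator under this protocol is given below; the concrete sampling choices (e.g., which label is designated as the co-occurring one) are illustrative assumptions rather than necessarily the exact generator used in our experiments.

```python
import numpy as np

def make_partial_labels(y, n_classes, p, r=1, eps=None, seed=None):
    """Generate a binary candidate-label matrix following the controlling protocol:
    a proportion p of instances becomes partially labeled; for those instances either a
    designated label co-occurs with the ground-truth label with probability eps (when eps
    is given, r is fixed to 1), or r false-positive labels are drawn uniformly at random."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Y = np.zeros((n, n_classes))
    Y[np.arange(n), y] = 1.0                                 # ground-truth label is always a candidate
    for i in rng.permutation(n)[: int(round(p * n))]:        # instances that receive extra candidate labels
        others = np.delete(np.arange(n_classes), y[i])
        if eps is not None:
            coupled = others[0]                              # illustrative choice of the co-occurring label
            extra = [coupled] if rng.random() < eps else [rng.choice(others[1:])]
        else:
            extra = rng.choice(others, size=r, replace=False)
        Y[i, extra] = 1.0
    return Y
```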
With $p$ and $r$ fixed, the classification accuracy of all the methods when $\varepsilon$ varies from 0.1 to 0.7 with step size 0.1 is shown in Fig. 2(a); a specific label was chosen to co-occur with the ground-truth label with probability $\varepsilon$, and the rest of the labels were randomly chosen otherwise. Fig. 2(b)-(d) show the classification accuracy with $p$ varying from 0.1 to 0.7 when $r = 1, 2, 3$, respectively; we randomly selected a proportion $p$ of the training samples as partial label examples with $r$ randomly chosen labels as the corresponding false-positive labels, i.e., the size of the candidate label set of each instance is $r + 1$ (under this setting, as the false-positive labels are randomly chosen, $\varepsilon$ is not explicitly controlled). Table 4 summarizes the detailed win/tie/loss counts between PL-CL and the other methods according to the pairwise t-test with a significance level of 0.05. From Fig. 2 and Table 4 we can conclude that:
- With increasing $\varepsilon$ and $p$, the classification accuracy drops gradually in general for all the methods, as a larger $\varepsilon$ or a larger $p$ means a more difficult PLL task. Even so, our approach still performs better than the compared methods.
- As the number of candidate labels increases, the gaps between PL-CL and the others become more obvious, which means that with more candidate labels, the effect of the complementary classifier is more prominent.
- PL-CL clearly outperforms the other methods under all the settings in general, i.e., PL-CL outperforms the other compared methods significantly in 92.1% of cases and achieves at least comparable performance in 7.9% of cases out of the total of 672 cases. Moreover, PL-CL is never significantly outperformed by any compared approach.
5.5. Further Analysis
5.5.1. Disambiguation Ability
Transductive accuracy evaluates the classification accuracy of a PLL method on the training examples, which can reflect its ability of label disambiguation. Table 5 shows the transductive accuracy of all the methods on all the real-world data sets, where we can find that PL-CL outperforms the other compared methods in 89.5% of cases and is at least comparable to them in 6.3% of cases according to the pairwise t-test, which validates the disambiguation ability of PL-CL.
5.5.2. Ablation Study
In order to validate the effectiveness of the complementary classifier and the adaptive graph, ablation studies are conducted and the results are shown in Table 6. It can be observed from Table 6 that the complementary classifier is quite useful when comparing the classification accuracy of the second and fourth rows, or of the third and fifth rows: the versions with the complementary classifier are superior to those without it. Additionally, when we compare the last two rows of the table, we can find that the model with the graph structure achieves superior performance to that without the graph. In general, PL-CL outperforms its degenerated versions in all cases w.r.t. both classification accuracy and transductive accuracy, which proves that both the complementary classifier and the graph structure play significant roles in our model. Besides, the use of the kernel extension can also improve the performance of PL-CL, which can be observed by comparing the results in the first and second rows. It is worth noting that although the kernel contributes to the improvement of the performance, the main improvement of PL-CL comes from the complementary classifier and the graph.
5.5.3. Sensitivity and Convergence Analysis
Figs. 3(a)-(e) illustrate the sensitivity of PL-CL w.r.t. its hyper-parameters on Lost, where we can find that PL-CL performs quite stably as the values of the hyper-parameters change within a reasonably wide range. Additionally, Figs. 3(f)-(k) show the convergence curves of PL-CL, where PL-CL converges fast within about 8 iterations on all the data sets, which is also a desirable property in practice.
6. Conclusion and Future Work
In this paper, a novel method named PL-CL, which is based on a complementary classifier, has been proposed to solve the PLL task. Different from previous works, we incorporated the information of complementary labels by inducing a complementary classifier, and used it to achieve candidate label disambiguation through a novel adversarial prior. Besides, PL-CL adopts an adaptive local graph shared by both the feature space and the label space to assist disambiguation. Comprehensive experiments on four controlled UCI data sets as well as six real-world data sets validated the effectiveness of PL-CL and the usefulness of the proposed complementary classifier. In the future, we will investigate how to design deep-learning-based PLL models by leveraging the complementary classifier.
References
- Cour et al. (2009) Timothée Cour, Benjamin Sapp, Chris Jordan, and Benjamin Taskar. 2009. Learning from ambiguously labeled images. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society, 919–926. https://doi.org/10.1109/CVPR.2009.5206667
- Cour et al. (2011) Timothée Cour, Benjamin Sapp, and Ben Taskar. 2011. Learning from Partial Labels. J. Mach. Learn. Res. 12 (2011), 1501–1536. https://doi.org/10.5555/1953048.2021049
- Feng and An (2018) Lei Feng and Bo An. 2018. Leveraging Latent Label Distributions for Partial Label Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, Jérôme Lang (Ed.). ijcai.org, 2107–2113. https://doi.org/10.24963/ijcai.2018/291
- Feng and An (2019) Lei Feng and Bo An. 2019. Partial Label Learning with Self-Guided Retraining. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 3542–3549. https://doi.org/10.1609/aaai.v33i01.33013542
- Feng et al. (2020) Lei Feng, Takuo Kaneko, Bo Han, Gang Niu, Bo An, and Masashi Sugiyama. 2020. Learning with Multiple Complementary Labels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 3072–3081. http://proceedings.mlr.press/v119/feng20a.html
- Guillaumin et al. (2010) Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. 2010. Multiple Instance Metric Learning from Automatically Labeled Bags of Faces. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 6311), Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer, 634–647. https://doi.org/10.1007/978-3-642-15549-9_46
- Hou et al. (2016) Peng Hou, Xin Geng, and Min-Ling Zhang. 2016. Multi-Label Manifold Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, Dale Schuurmans and Michael P. Wellman (Eds.). AAAI Press, 1680–1686. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12144
- Huiskes and Lew (2008) Mark J. Huiskes and Michael S. Lew. 2008. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, MIR 2008, Vancouver, British Columbia, Canada, October 30-31, 2008, Michael S. Lew, Alberto Del Bimbo, and Erwin M. Bakker (Eds.). ACM, 39–43. https://doi.org/10.1145/1460096.1460104
- Hüllermeier and Beringer (2005) Eyke Hüllermeier and Jürgen Beringer. 2005. Learning from Ambiguously Labeled Examples. In Advances in Intelligent Data Analysis VI (Lecture Notes in Computer Science, Vol. 3646). Springer, 168–179. https://doi.org/10.1007/11552253_16
- Ishida et al. (2017) Takashi Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. 2017. Learning from Complementary Labels. In Advances in Neural Information Processing Systems 30 (NIPS 2017). 5639–5649. https://proceedings.neurips.cc/paper/2017/hash/1dba5eed8838571e1c80af145184e515-Abstract.html
- Liu and Dietterich (2012a) Li-Ping Liu and Thomas G. Dietterich. 2012a. A Conditional Multinomial Mixture Model for Superset Label Learning. In Advances in Neural Information Processing Systems 25 (NIPS 2012). 557–565. https://proceedings.neurips.cc/paper/2012/hash/aaebdb8bb6b0e73f6c3c54a0ab0c6415-Abstract.html
- Liu and Dietterich (2012b) Li-Ping Liu and Thomas G. Dietterich. 2012b. A Conditional Multinomial Mixture Model for Superset Label Learning. In Advances in Neural Information Processing Systems 25 (NIPS 2012). 557–565. https://proceedings.neurips.cc/paper/2012/hash/aaebdb8bb6b0e73f6c3c54a0ab0c6415-Abstract.html
- Lyu et al. (2021) Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang, and Yidong Li. 2021. GM-PLL: Graph Matching Based Partial Label Learning. IEEE Trans. Knowl. Data Eng. 33, 2 (2021), 521–535. https://doi.org/10.1109/TKDE.2019.2933837
- Nguyen and Caruana (2008) Nam Nguyen and Rich Caruana. 2008. Classification with partial labels. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, Ying Li, Bing Liu, and Sunita Sarawagi (Eds.). ACM, 551–559. https://doi.org/10.1145/1401890.1401958
- Panis et al. (2016) Gabriel Panis, Andreas Lanitis, Nicolas Tsapatsoulis, and Timothy F. Cootes. 2016. Overview of research on facial ageing using the FG-NET ageing database. IET Biom. 5, 2 (2016), 37–46. https://doi.org/10.1049/iet-bmt.2014.0053
- Wang et al. (2022) Deng-Bao Wang, Min-Ling Zhang, and Li Li. 2022. Adaptive Graph Guided Disambiguation for Partial Label Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 12 (2022), 8796–8811. https://doi.org/10.1109/TPAMI.2021.3120012
- Wang and Zhang (2022) Wei Wang and Min-Ling Zhang. 2022. Partial Label Learning with Discrimination Augmentation. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, Aidong Zhang and Huzefa Rangwala (Eds.). ACM, 1920–1928. https://doi.org/10.1145/3534678.3539363
- Yang and Vozila (2014) Fan Yang and Paul Vozila. 2014. Semi-Supervised Chinese Word Segmentation Using Partial-Label Learning With Conditional Random Fields. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 90–98. https://doi.org/10.3115/v1/d14-1010
- Yu et al. (2018) Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. 2018. Learning with Biased Complementary Labels. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11205), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer, 69–85. https://doi.org/10.1007/978-3-030-01246-5_5
- Zeng et al. (2013) Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma. 2013. Learning by Associating Ambiguously Labeled Images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013. IEEE Computer Society, 708–715. https://doi.org/10.1109/CVPR.2013.97
- Zhang et al. (2016) Min-Ling Zhang, Bin-Bin Zhou, and Xu-Ying Liu. 2016. Partial Label Learning via Feature-Aware Disambiguation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi (Eds.). ACM, 1335–1344. https://doi.org/10.1145/2939672.2939788
- Zhang and Yu (2015) Min-Ling Zhang and Fei Yu. 2015. Solving the Partial Label Learning Problem: An Instance-Based Approach. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015).