
Complementary Classifier Induced Partial Label Learning

Yuheng Jia, Key Laboratory of Computer Network and Information Integration, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, [email protected]; Chongjie Si, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai 200240, China, [email protected]; and Min-Ling Zhang, Key Laboratory of Computer Network and Information Integration, School of Computer Science and Engineering, Southeast University, Nanjing 210096, China, [email protected]
(2023)
Abstract.

In partial label learning (PLL), each training sample is associated with a set of candidate labels, among which only one is valid. The core of PLL is to disambiguate the candidate labels to get the ground-truth one. In disambiguation, the existing works usually do not fully investigate the effectiveness of the non-candidate label set (a.k.a. complementary labels), which accurately indicates a set of labels that do not belong to a sample. In this paper, we use the non-candidate labels to induce a complementary classifier, which naturally forms an adversarial relationship against the traditional PLL classifier, to eliminate the false-positive labels in the candidate label set. Besides, we assume the feature space and the label space share the same local topological structure captured by a dynamic graph, and use it to assist disambiguation. Extensive experimental results validate the superiority of the proposed approach against state-of-the-art PLL methods on 4 controlled UCI data sets and 6 real-world data sets, and reveal the usefulness of complementary learning in PLL. The code has been released in the link https://github.com/Chongjie-Si/PL-CL.

partial label learning, complementary classifier, label disambiguation
copyright: acmcopyright; journalyear: 2023; doi: XXXXXXX.XXXXXXX; conference: 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, August 06–10, 2023, Long Beach, CA; price: 15.00; isbn: 978-1-4503-XXXX-X/23/08; ccs: Computing methodologies, Learning paradigms

1. Introduction

In traditional multi-class learning, each training sample is associated with a single label indicating the class that sample belongs to. However, collecting such highly accurate labels is expensive and time-consuming, which is also a bottleneck of many real-world classification tasks. Therefore, partial label learning (PLL) (Hüllermeier and Beringer, 2005), (Zhang et al., 2016), (Lyu et al., 2021) was proposed, in which an instance is associated with a set of candidate labels, among which only one is valid. PLL has been applied to many real-world applications, such as natural language processing (Yang and Vozila, 2014), web mining (Huiskes and Lew, 2008) and image classification (Zeng et al., 2013), (Liu and Dietterich, 2012a), etc.

Formally speaking, suppose $\mathcal{X}=\mathbb{R}^{q}$ denotes the $q$-dimensional feature space and $\mathcal{Y}=\{0,1\}^{l}$ is the label space with $l$ classes. Given a partial label data set $\mathcal{D}=\{(\mathbf{x}_{i},S_{i})\,|\,1\leq i\leq n\}$, where $\mathbf{x}_{i}\in\mathcal{X}$ is the $i$-th sample, $S_{i}\subseteq\mathcal{Y}$ is the corresponding candidate label set, and $n$ is the number of samples, the task of PLL is to learn a function $f:\mathcal{X}\rightarrow\mathcal{Y}$ based on $\mathcal{D}$. As the ground-truth labels $\{y_{1},y_{2},\ldots,y_{n}\}$ are concealed in the corresponding candidate label sets, i.e., $y_{i}\in S_{i}$, and are not directly accessible, PLL is a quite challenging problem.

Accordingly, disambiguating the candidate labels becomes the core task of PLL. A naive strategy is to treat all the candidate labels of a sample equally and then average the modeling outputs of the candidate labels, which is known as the averaging-based framework (Cour et al., 2011). For example, PL-KNN (Hüllermeier and Beringer, 2005) averages the candidate labels of neighboring instances to make predictions. However, this naive strategy may fail when false-positive labels dominate the candidate label set. The second strategy is to identify the ground-truth label from the candidate label set, which is known as the identification-based framework (a.k.a. the disambiguation-based approach).

Figure 1. Framework of PL-CL. PL-CL utilizes the candidate and non-candidate labels to generate an ordinary classifier and a complementary classifier, and constrains the outputs of these two classifiers in an adversarial manner. A dynamic graph is also adopted to exploit the local topology consistency of the feature manifold and the label manifold to further assist disambiguation.

To disambiguate the candidate labels, we believe the following three kinds of prior information are important; previous methods usually use only one or two of them to achieve disambiguation. The first type is the correlations among instances, i.e., if two samples are similar to each other in the feature space, they are likely to have the same label. For instance, PL-AGGD (Wang et al., 2022) and IPAL (Zhang and Yu, 2015) use the feature matrix to construct a similarity graph and further use that graph to realize disambiguation. The second kind of prior is the mapping from instances to the candidate labels, i.e., we can expect that the mapping error of an instance to the correct label is small, while the mapping error of that instance to an incorrect label in the candidate label set (i.e., a false-positive label) is relatively large. For example, SURE (Feng and An, 2019) adopts such a mapping to automatically remove the incorrect labels by maximizing an infinity norm. The last kind of information is the non-candidate label set, which precisely describes “a set of labels that should not be assigned to a sample”. Although the information in the non-candidate label set is directly available and accurate, most existing works overlook this valuable information during disambiguation.

To fully take advantage of the above-mentioned priors, as shown in Fig. 1, we propose a novel PLL method named PL-CL, short for Partial Label learning based on Complementary Learning. Specifically, PL-CL uses the candidate labels to design an ordinary classifier, and meanwhile adopts the non-candidate labels to develop a complementary classifier. As the ordinary classifier indicates which label should be assigned to a sample while the complementary classifier specifies the labels that should not be assigned to that sample, the outputs of those two classifiers form an adversarial relationship. Moreover, as the non-candidate labels are accurate, the trained complementary classifier is highly reliable and can be used to promote the ordinary classifier through the adversarial relationship, which finally enhances PLL. Additionally, inspired by (Hou et al., 2016), we assume that the feature space and label space of a data set usually lie in two different manifolds but share a similar local structure. Therefore, PL-CL constructs an adaptive local topological graph shared by both the feature space and the label space, so that similar samples in the feature space possess similar labels, which further benefits the disambiguation of PLL. PL-CL is finally formulated as an adaptive graph regularized dual classifiers learning problem, and solved by an alternating iterative algorithm. Comprehensive experiments on both the controlled UCI data sets and the real-world data sets validate that PL-CL performs significantly better than state-of-the-art PLL methods, and also substantiate the effectiveness of the complementary classifier and the shared adaptive local graph structure in PLL.

The rest of the paper is organized as follows. We first review some related works in Section 2 and introduce the proposed approach in Section 3. The numerical solution is provided in Section 4. Experimental results and analyses are presented in Section 5, and conclusions and future research directions are finally given in Section 6.

2. Related Work

2.1. Partial Label Learning

PLL is a representative weakly supervised learning framework which learns from ambiguous supervision information. In PLL, an instance is associated with a set of labels, among which only one is valid. Since the ground-truth label of each sample is concealed in its candidate labels and is not directly accessible during the training phase, PLL is a quite challenging task.

To address the mentioned challenge, the key approach is candidate label disambiguation, e.g., (Feng and An, 2018), (Nguyen and Caruana, 2008), (Feng and An, 2019), (Zhang and Yu, 2015). The existing methods can be summarized into two categories: averaging-based approaches and identification-based approaches. For the averaging-based approaches (Hüllermeier and Beringer, 2005), (Cour et al., 2011), each candidate label of a training sample is treated in an equal manner and the model prediction is yielded by averaging the modeling outputs of the candidate labels. For example, PL-KNN (Hüllermeier and Beringer, 2005) averages the candidate labels of neighboring instances to make predictions. This strategy is intuitive; however, it may fail when false-positive labels dominate the candidate label set. For the identification-based approaches (Wang et al., 2022), (Feng and An, 2019), (Feng and An, 2018), the valid label is considered as a latent variable and is identified through an iterative optimization procedure. For instance, PL-AGGD (Wang et al., 2022) uses the feature matrix to construct a similarity graph, which is further utilized to generate the labeling confidence of each candidate label.

In this paper, we summarize the following three kinds of priors that are important for disambiguation. The first type is the correlations among instances, i.e., similar samples in the feature space tend to share the same label. The second kind of prior is the mapping from instances to the candidate labels, i.e., the mapping error of an instance to the valid label is small, while that to an incorrect label in the candidate label set is large. The last kind of information is the non-candidate label set, which precisely describes a set of labels that should not be assigned to a sample. Existing works mainly focus on the first two types of priors, while the valuable information in the non-candidate labels (complementary labels), which is directly available and accurate, is overlooked. Therefore, we will fully take advantage of the above-mentioned priors (especially the complementary labels) to achieve PLL.

2.2. Learning from Complementary Labels

Complementary-label learning (CLL) is another representative framework of weakly supervised learning (Feng et al., 2020), (Yu et al., 2018). In traditional CLL (Ishida et al., 2017), an instance is associated with only one complementary label, which indicates a class that this sample does not belong to. Although complementary label information is adopted in this paper, our usage is quite different from that in traditional complementary-label learning. First, in partial label learning the number of complementary labels of each sample is usually more than one, while an instance has only one complementary label in complementary-label learning. Additionally, the aim of the complementary classifier in PLL is to assist the disambiguation of the candidate labels, which is different from CLL. To the best of our knowledge, the information of complementary labels has been overlooked to some extent in previous PLL works, and our work is the first to induce a complementary classifier to assist the construction of the ordinary classifier through an adversarial prior in PLL.

3. The Proposed Approach

PLL aims to induce a multi-class classifier $f:\mathcal{X}\rightarrow\mathcal{Y}$ based on the data set $\mathcal{D}=\{(\mathbf{x}_{i},S_{i})\,|\,1\leq i\leq n\}$. Denote by $\mathbf{X}=[\mathbf{x}_{1},\mathbf{x}_{2},\ldots,\mathbf{x}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times q}$ the instance matrix and by $\mathbf{Y}=[\mathbf{y}_{1},\mathbf{y}_{2},\ldots,\mathbf{y}_{n}]^{\mathsf{T}}\in\{0,1\}^{n\times l}$ the partial label matrix with $l$ labels. Note that the value of $y_{ij}$ is binary, and $y_{ij}=1$ (resp. $y_{ij}=0$) if the $j$-th label resides in (resp. does not reside in) the candidate label set of $\mathbf{x}_{i}$.

As analyzed in the introduction, three kinds of priors are important to label disambiguation. We use all the three kinds of priors to solve the PLL problem, and formulate the loss function as

(1) $\min\ \mathcal{L}(\mathbf{X},\mathbf{Y})+\hat{\mathcal{L}}(\mathbf{X},\hat{\mathbf{Y}})+\Psi(\mathbf{X},\mathbf{Y}),$

where $\hat{\mathbf{Y}}=[\hat{y}_{ij}]_{n\times l}$ is the complementary label matrix, i.e., $\hat{y}_{ij}=1$ (resp. $\hat{y}_{ij}=0$) if the $j$-th label is in the non-candidate (resp. candidate) label set of the $i$-th instance. $\mathcal{L}(\mathbf{X},\mathbf{Y})$ denotes the mapping from the instance matrix to the candidate labels, i.e., the ordinary classifier. $\hat{\mathcal{L}}(\mathbf{X},\hat{\mathbf{Y}})$ denotes the loss of the complementary classifier, i.e., the mapping from the instances to the non-candidate labels, and $\Psi(\mathbf{X},\mathbf{Y})$ uses the correlations of instances to disambiguate the candidate labels. These three terms correspond to three different ways of disambiguation in PLL, and their details will be introduced in the following subsections.
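To make the notation concrete, the following small sketch (ours, not part of the released PL-CL code; names are illustrative) shows how the partial label matrix $\mathbf{Y}$ and the complementary label matrix $\hat{\mathbf{Y}}$ can be assembled from the candidate label sets:

```python
import numpy as np

def build_label_matrices(candidate_sets, n_labels):
    """candidate_sets: list of iterables of label indices, one per sample."""
    n = len(candidate_sets)
    Y = np.zeros((n, n_labels))
    for i, S in enumerate(candidate_sets):
        Y[i, list(S)] = 1.0          # y_ij = 1 iff the j-th label is a candidate of sample i
    Y_hat = 1.0 - Y                  # complementary (non-candidate) label matrix
    return Y, Y_hat

# toy usage: 3 samples, 4 classes
Y, Y_hat = build_label_matrices([{0, 2}, {1}, {1, 2, 3}], n_labels=4)
```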

3.1. Ordinary Classifier

First, PL-CL uses a weight matrix $\mathbf{W}=[\mathbf{w}_{1},\mathbf{w}_{2},\ldots,\mathbf{w}_{l}]\in\mathbb{R}^{q\times l}$ to map the instance matrix to the labeling confidence matrix $\mathbf{P}=[\mathbf{p}_{1},\mathbf{p}_{2},\ldots,\mathbf{p}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times l}$, where $\mathbf{p}_{i}\in\mathbb{R}^{l}$ is the labeling confidence vector corresponding to $\mathbf{x}_{i}$, and $p_{ij}$ represents the probability of the $j$-th label being the ground-truth label of $\mathbf{x}_{i}$. To be specific, PL-CL uses the following least squares loss to form the ordinary classifier:

(2) $\begin{split}\mathcal{L}(\mathbf{X},\mathbf{Y})&=\left\|\mathbf{X}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}\right\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2}\\ {\rm s.t.}&\quad\sum_{j}p_{ij}=1,\ 0\leq p_{ij}\leq y_{ij},\ \forall i,j,\end{split}$

where $\mathbf{b}=[b_{1},b_{2},\ldots,b_{l}]\in\mathbb{R}^{l}$ is the bias term and $\mathbf{1}_{n}\in\mathbb{R}^{n}$ is an all-ones vector. $\|\mathbf{W}\|_{F}$ denotes the Frobenius norm of the weight matrix, and minimizing $\|\mathbf{W}\|_{F}^{2}$ controls the complexity of the model. $\lambda$ is a hyper-parameter trading off these terms. As $\mathbf{P}$ represents the labeling confidence, it should be strictly non-negative, i.e., $p_{ij}\geq 0$, and $p_{ij}$ has a chance to be positive only when the $j$-th label lies in the candidate label set of the $i$-th sample, i.e., $p_{ij}\leq y_{ij}$. Moreover, as each row of $\mathbf{P}$ indicates the labeling confidence of a certain sample, the sum of its elements should be equal to 1, i.e., $\sum_{j}p_{ij}=1$. The ideal state of $\mathbf{p}_{i}$ is one-hot, for there is only one ground-truth label. By minimizing Eq. (2), the ground-truth label is expected to produce a smaller mapping error than the incorrect ones residing in the candidate label set, and the correct label is thus likely to be selected.
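As a small illustration (ours, not a step of PL-CL itself), a matrix $\mathbf{P}$ satisfying the constraints of Eq. (2) can be obtained by spreading the confidence uniformly over each candidate label set:

```python
import numpy as np

def uniform_candidate_confidence(Y):
    """Y: (n, l) binary candidate label matrix; returns P with p_ij = y_ij / sum_j y_ij."""
    return Y / Y.sum(axis=1, keepdims=True)

P0 = uniform_candidate_confidence(np.array([[1., 0., 1., 0.],
                                            [0., 1., 0., 0.]]))
# each row of P0 sums to 1 and is zero outside the candidate set
```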

3.2. Complementary Classifier

Suppose the matrix $\mathbf{Q}=[\mathbf{q}_{1},\mathbf{q}_{2},\ldots,\mathbf{q}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times l}$ is the complementary-labeling confidence matrix, where $q_{ij}$ denotes the confidence of the $j$-th label NOT being the ground-truth label of $\mathbf{x}_{i}$. Similar to the ordinary classifier, we construct a complementary classifier:

(3) $\begin{split}\min_{\hat{\mathbf{W}},\hat{\mathbf{b}}}\ &\left\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\right\|_{F}^{2}+\lambda\left\|\hat{\mathbf{W}}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\hat{y}_{ij}\leq q_{ij}\leq 1,\ \forall i,j,\end{split}$

where $\hat{\mathbf{W}}\in\mathbb{R}^{q\times l}$ and $\hat{\mathbf{b}}\in\mathbb{R}^{l}$ are the weight matrix and bias of the complementary classifier. Different from $\mathbf{p}_{i}$, the ideal state of $\mathbf{q}_{i}$ is “zero-hot”, i.e., only one value in $\mathbf{q}_{i}$ is 0 while the others are 1, for there is only one ground-truth label in PLL. Therefore, we require $\hat{y}_{ij}\leq q_{ij}\leq 1,\ \forall i,j$, which means $0\leq q_{ij}\leq 1$ if the $j$-th label resides in the candidate label set, and $q_{ij}=1$ if $\hat{y}_{ij}=1$.

The ordinary classifier predicts which label should be assigned to an instance, while the complementary classifier specifies a group of labels that should not be allocated to that instance. Therefore, an adversarial relationship exists between the outputs of the ordinary and complementary classifiers, i.e., a larger (resp. smaller) element in $\mathbf{P}$ implies a smaller (resp. larger) element in $\mathbf{Q}$, and vice versa. Besides, as the elements in both $\mathbf{Q}$ and $\mathbf{P}$ are non-negative and no more than 1, the above adversarial relationship can be formulated as $\mathbf{p}_{i}+\mathbf{q}_{i}=\mathbf{1}_{l}$. Taking the adversarial relationship into account, the complementary classifier becomes:

(4) $\begin{split}\hat{\mathcal{L}}(\mathbf{X},\hat{\mathbf{Y}})=\ &\alpha\left\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\right\|_{F}^{2}+\beta\left\|\mathbf{1}_{n\times l}-\mathbf{P}-\mathbf{Q}\right\|_{F}^{2}+\lambda\left\|\hat{\mathbf{W}}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\hat{y}_{ij}\leq q_{ij}\leq 1,\end{split}$

where $\mathbf{1}_{n\times l}\in\mathbb{R}^{n\times l}$ is an all-ones matrix, and $\alpha$ and $\beta$ are two hyper-parameters. Different from the candidate label set, the complementary information in the non-candidate label set is directly accessible and accurate. By minimizing Eq. (4), this valuable information is passed to the labeling confidence matrix $\mathbf{P}$, which further assists label disambiguation.
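The following toy check (ours) illustrates the adversarial coupling enforced by the $\beta$-weighted term of Eq. (4): for an ideally disambiguated sample the ordinary output is one-hot, the complementary output is zero-hot, and the coupling loss vanishes.

```python
import numpy as np

p = np.array([0., 1., 0., 0.])          # ideal ordinary confidence (one-hot)
q = np.array([1., 0., 1., 1.])          # ideal complementary confidence ("zero-hot")
coupling = np.sum((1.0 - p - q) ** 2)   # the beta-weighted term of Eq. (4) for one sample
assert coupling == 0.0
```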

3.3. Local Consistency in Label Manifold and Feature Manifold

As suggested by (Hou et al., 2016), the label space and feature space of a data set lie in two different manifolds but with a similar local structure, i.e., if $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ are similar, their corresponding labeling confidence vectors $\mathbf{p}_{i}$ and $\mathbf{p}_{j}$ should also be similar. To this end, we first construct a local non-negative similarity matrix $\mathbf{G}=[g_{ij}]_{n\times n}$ to capture the local manifold structure of the feature space, i.e.,

(5) $\begin{split}\min_{\mathbf{G}}&\quad\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U},\end{split}$

where $\mathbf{0}_{n\times n}$ is an all-zeros matrix of size $n\times n$. $\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n}$ ensures that the sum of each column of $\mathbf{G}$ is 1, i.e., $\mathbf{G}$ is normalized. $\mathbf{U}=[u_{ij}]_{n\times n}$ denotes the $k$-nearest neighbors (KNN) location matrix, i.e., $u_{ij}=1$ if $\mathbf{x}_{i}$ belongs to the KNN of $\mathbf{x}_{j}$, otherwise $u_{ij}=0$. $\mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U}$ means that the elements of $\mathbf{G}$ are no less than 0 and no more than the elements of $\mathbf{U}$ at the same locations, which ensures that one sample can only be reconstructed by its neighbors. We further assume that the label space shares the same local structure as the feature space captured by $\mathbf{G}$, and use it to assist label disambiguation, i.e.,

(6) $\begin{split}\min_{\mathbf{P}}&\quad\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y}.\end{split}$

By minimizing Eq. (6), the local manifold structure $\mathbf{G}$ will help produce a better $\mathbf{P}$. Finally, we combine Eqs. (5) and (6) to jointly optimize the local structure $\mathbf{G}$ and the labeling confidence matrix $\mathbf{P}$, i.e.,

(7) $\begin{split}\Psi(\mathbf{X},\mathbf{Y})=&\quad\gamma\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}&\quad\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U},\\ &\quad\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y},\end{split}$

where $\gamma$ and $\mu$ are two hyper-parameters. By minimizing Eq. (7), information is exchanged between the feature space and the label space to achieve better local consistency.

3.4. Model Formulation

Based on the discussions above, the objective function of PL-CL is finally formulated as

(8) $\begin{split}\min_{\mathbf{W},\mathbf{b},\hat{\mathbf{W}},\hat{\mathbf{b}},\mathbf{P},\mathbf{Q},\mathbf{G}}\ &\left\|\mathbf{X}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}\right\|_{F}^{2}+\beta\left\|\mathbf{1}_{n\times l}-\mathbf{P}-\mathbf{Q}\right\|_{F}^{2}\\ +&\ \alpha\left\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ +&\ \gamma\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}+\lambda\left(\|\mathbf{W}\|_{F}^{2}+\|\hat{\mathbf{W}}\|_{F}^{2}\right)\\ {\rm s.t.}\quad&\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y},\ \hat{\mathbf{Y}}\leq\mathbf{Q}\leq\mathbf{1}_{n\times l},\\ &\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U},\end{split}$

where $\beta,\alpha,\gamma,\mu$ and $\lambda$ are hyper-parameters balancing the different terms. PL-CL comprehensively takes the information in the candidate label set, the non-candidate label set, and the sample relationships into consideration to perform label disambiguation. As will be shown later, it is quite robust to all these hyper-parameters.

The most challenging issue in PLL is the inaccurate supervision. As the non-candidate labels precisely indicate the classes to which an instance does not belong, they can be regarded as a kind of accurate supervision, and leveraging them naturally helps PLL.

To the best of our knowledge, this is the first work to use the non-candidate labels to construct a complementary classifier in PLL to help label disambiguation. Moreover, the adversarial prior used to introduce the complementary classifier is also novel in PLL.

4. Numerical Solution

The optimization problem in Eq. (8) has seven variables with different constraints; we therefore solve it in an alternating and iterative manner.

4.1. Update $\mathbf{W}$, $\mathbf{b}$ and $\hat{\mathbf{W}}$, $\hat{\mathbf{b}}$

With the other variables fixed, problem (8) with respect to $\mathbf{W}$ and $\mathbf{b}$ can be reformulated as

(9) $\min_{\mathbf{W},\mathbf{b}}\quad\left\|\mathbf{X}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}\right\|_{F}^{2}+\lambda\left\|\mathbf{W}\right\|_{F}^{2},$

which is a regularized least squares problem and the corresponding closed-form solution is

(10) $\begin{split}\mathbf{W}&=\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}+\lambda\mathbf{I}_{q\times q}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{P}\\ \mathbf{b}&=\frac{1}{n}\left(\mathbf{P}^{\mathsf{T}}\mathbf{1}_{n}-\mathbf{W}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{1}_{n}\right),\end{split}$

where $\mathbf{I}_{q\times q}$ is the identity matrix of size $q\times q$. Similarly, problem (8) with respect to $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ is

(11) $\min_{\hat{\mathbf{W}},\hat{\mathbf{b}}}\quad\|\mathbf{X}\hat{\mathbf{W}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}-\mathbf{Q}\|_{F}^{2}+\frac{\lambda}{\alpha}\|\hat{\mathbf{W}}\|_{F}^{2},$

which has an analytical solution similar to that in Eq. (10), i.e.,

(12) $\begin{split}\hat{\mathbf{W}}&=\left(\mathbf{X}^{\mathsf{T}}\mathbf{X}+\frac{\lambda}{\alpha}\mathbf{I}_{q\times q}\right)^{-1}\mathbf{X}^{\mathsf{T}}\mathbf{Q}\\ \hat{\mathbf{b}}&=\frac{1}{n}\left(\mathbf{Q}^{\mathsf{T}}\mathbf{1}_{n}-\hat{\mathbf{W}}^{\mathsf{T}}\mathbf{X}^{\mathsf{T}}\mathbf{1}_{n}\right).\end{split}$
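A minimal sketch (ours; variable names are illustrative) of the closed-form updates in Eqs. (10) and (12), which are ordinary ridge-regression solutions:

```python
import numpy as np

def ridge_update(X, T, lam):
    """Solve min_{W,b} ||X W + 1 b^T - T||_F^2 + lam ||W||_F^2 as in Eqs. (10) and (12)."""
    n, q = X.shape
    W = np.linalg.solve(X.T @ X + lam * np.eye(q), X.T @ T)
    b = (T.T @ np.ones(n) - W.T @ X.T @ np.ones(n)) / n
    return W, b

# ordinary classifier: ridge_update(X, P, lam); complementary classifier: ridge_update(X, Q, lam / alpha)
```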

Kernel Extension. The above linear model may be too simple to capture the complex relationships between instances and labels, therefore we extend the linear model to a kernel version. Suppose $\phi(\cdot):\mathbb{R}^{q}\rightarrow\mathbb{R}^{h}$ represents a feature transformation that maps the feature space to a higher-dimensional space, and $\mathbf{\Phi}=[\phi(\mathbf{x}_{1}),\phi(\mathbf{x}_{2}),\ldots,\phi(\mathbf{x}_{n})]^{\mathsf{T}}$ denotes the feature matrix in the higher-dimensional space. With the feature mapping, we can rewrite problem (9) as follows:

(13) $\begin{split}\min_{\mathbf{W},\mathbf{b}}\quad&\|\mathbf{M}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{M}=\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P},\end{split}$

where $\mathbf{M}=[\mathbf{m}_{1},\mathbf{m}_{2},\ldots,\mathbf{m}_{n}]^{\mathsf{T}}\in\mathbb{R}^{n\times l}$. The Lagrangian function of Eq. (13) is formulated as

(14) $\begin{split}\mathcal{L}(\mathbf{W},\mathbf{b},\mathbf{M},\mathbf{A})=\ &\|\mathbf{M}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2}\\ &-{\rm tr}\left(\mathbf{A}^{\mathsf{T}}(\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}-\mathbf{M})\right),\end{split}$

where $\mathbf{A}\in\mathbb{R}^{n\times l}$ is the Lagrange multiplier and ${\rm tr}(\cdot)$ is the trace of a matrix. Based on the KKT conditions, we have

(15) $\begin{split}&\frac{\partial\mathcal{L}}{\partial\mathbf{W}}=2\lambda\mathbf{W}-\mathbf{\Phi}^{\mathsf{T}}\mathbf{A}=0,\quad\frac{\partial\mathcal{L}}{\partial\mathbf{b}}=\mathbf{A}^{\mathsf{T}}\mathbf{1}_{n}=0,\\ &\frac{\partial\mathcal{L}}{\partial\mathbf{M}}=2\mathbf{M}+\mathbf{A}=0,\quad\frac{\partial\mathcal{L}}{\partial\mathbf{A}}=\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}-\mathbf{P}-\mathbf{M}=0.\end{split}$

Define a kernel matrix $\mathbf{K}=\mathbf{\Phi}\mathbf{\Phi}^{\mathsf{T}}$ with its element $k_{ij}=\phi(\mathbf{x}_{i})^{\mathsf{T}}\phi(\mathbf{x}_{j})=\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})$, where $\mathcal{K}(\cdot,\cdot)$ is the kernel function. For PL-CL, we use the Gaussian function $\mathcal{K}(\mathbf{x}_{i},\mathbf{x}_{j})=\exp(-\|\mathbf{x}_{i}-\mathbf{x}_{j}\|_{2}^{2}/(2\sigma^{2}))$ as the kernel function and set $\sigma$ to the average distance over all pairs of training instances. Then, we have

(16) $\begin{split}\mathbf{A}&=\left(\frac{1}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}\left(\mathbf{P}-\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}\right),\\ \mathbf{b}&=\left(\frac{\mathbf{s}\mathbf{P}}{\mathbf{s}\mathbf{1}_{n}}\right)^{\mathsf{T}},\end{split}$

where $\mathbf{s}=\mathbf{1}_{n}^{\mathsf{T}}\left(\frac{1}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}$. We can also obtain $\mathbf{W}=\mathbf{\Phi}^{\mathsf{T}}\mathbf{A}/(2\lambda)$ from Eq. (15). The output of the model can be denoted by $\mathbf{H}=\mathbf{\Phi}\mathbf{W}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}=\frac{1}{2\lambda}\mathbf{K}\mathbf{A}+\mathbf{1}_{n}\mathbf{b}^{\mathsf{T}}$.

As the optimization problem regarding $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ has the same form as problem (13), we similarly have

(17) $\begin{split}\hat{\mathbf{A}}&=\left(\frac{\alpha}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}\left(\mathbf{Q}-\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}\right),\\ \hat{\mathbf{b}}&=\left(\frac{\hat{\mathbf{s}}\mathbf{Q}}{\hat{\mathbf{s}}\mathbf{1}_{n}}\right)^{\mathsf{T}},\end{split}$

where $\hat{\mathbf{s}}=\mathbf{1}_{n}^{\mathsf{T}}\left(\frac{\alpha}{2\lambda}\mathbf{K}+\frac{1}{2}\mathbf{I}_{n\times n}\right)^{-1}$, and the output of the complementary classifier is defined as $\hat{\mathbf{H}}=\frac{\alpha}{2\lambda}\mathbf{K}\hat{\mathbf{A}}+\mathbf{1}_{n}\hat{\mathbf{b}}^{\mathsf{T}}$.
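The kernel-based updates can be sketched as follows (ours, assuming the Gaussian kernel described above; any kernel giving $\mathbf{K}=\mathbf{\Phi}\mathbf{\Phi}^{\mathsf{T}}$ works):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, sigma):
    return np.exp(-cdist(X, X, metric="sqeuclidean") / (2.0 * sigma ** 2))

def kernel_update(K, P, lam):
    """Eq. (16); for the complementary classifier call kernel_update(K, Q, lam / alpha), which matches Eq. (17)."""
    n = K.shape[0]
    Minv = np.linalg.inv(K / (2.0 * lam) + 0.5 * np.eye(n))
    s = np.ones(n) @ Minv                                 # s = 1^T (K/(2*lam) + I/2)^{-1}
    b = (s @ P) / (s @ np.ones(n))                        # bias, one entry per label
    A = Minv @ (P - np.outer(np.ones(n), b))              # Lagrange multipliers
    H = K @ A / (2.0 * lam) + np.outer(np.ones(n), b)     # model output
    return A, b, H
```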

4.2. Update Q

Fixing the other variables, the $\mathbf{Q}$-subproblem can be equivalently rewritten as

(18) $\begin{split}\min_{\mathbf{Q}}\quad&\left\|\mathbf{Q}-\frac{\alpha\hat{\mathbf{H}}+\beta(\mathbf{1}_{n\times l}-\mathbf{P})}{\alpha+\beta}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\hat{\mathbf{Y}}\leq\mathbf{Q}\leq\mathbf{1}_{n\times l},\end{split}$

where the solution is

(19) $\mathbf{Q}=\mathcal{T}_{1}\left(\mathcal{T}_{\hat{\mathbf{Y}}}\left(\frac{\alpha\hat{\mathbf{H}}+\beta(\mathbf{1}_{n\times l}-\mathbf{P})}{\alpha+\beta}\right)\right).$

$\mathcal{T}_{1}$ and $\mathcal{T}_{\hat{\mathbf{Y}}}$ are two element-wise thresholding operators, i.e., $\mathcal{T}_{1}(m):=\min\{1,m\}$ with $m$ being a scalar, and $\mathcal{T}_{\hat{\mathbf{Y}}}(m):=\max\{\hat{y}_{ij},m\}$.
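A sketch (ours) of the resulting $\mathbf{Q}$-update in Eq. (19): an element-wise clipping of a weighted average of the complementary output $\hat{\mathbf{H}}$ and $\mathbf{1}_{n\times l}-\mathbf{P}$.

```python
import numpy as np

def update_Q(H_hat, P, Y_hat, alpha, beta):
    target = (alpha * H_hat + beta * (1.0 - P)) / (alpha + beta)
    return np.clip(target, Y_hat, 1.0)   # lower bound y_hat_ij (T_Yhat), upper bound 1 (T_1)
```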

4.3. Update G

Fixing the other variables, the $\mathbf{G}$-subproblem is rewritten as

(20) $\begin{split}\min_{\mathbf{G}}\quad&\gamma\left\|\mathbf{X}^{\mathsf{T}}-\mathbf{X}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{G}^{\mathsf{T}}\mathbf{1}_{n}=\mathbf{1}_{n},\ \mathbf{0}_{n\times n}\leq\mathbf{G}\leq\mathbf{U}.\end{split}$

Notice that each column of $\mathbf{G}$ is independent of the other columns, therefore we can solve the $\mathbf{G}$-subproblem column by column. To solve for the $i$-th column of $\mathbf{G}$, we have

(21) $\begin{split}\min_{\mathbf{G}_{\cdot i}}\quad&\gamma\Big\|\mathbf{x}_{i}-\sum_{u_{ji}=1}g_{ji}\mathbf{x}_{j}\Big\|_{F}^{2}+\mu\Big\|\mathbf{p}_{i}-\sum_{u_{ji}=1}g_{ji}\mathbf{p}_{j}\Big\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{G}_{\cdot i}^{\mathsf{T}}\mathbf{1}_{n}=1,\ \mathbf{0}_{n}\leq\mathbf{G}_{\cdot i}\leq\mathbf{U}_{\cdot i}.\end{split}$

As there are only $k$ non-zero elements in $\mathbf{G}_{\cdot i}$ to be updated, corresponding to the reconstruction coefficients of $\mathbf{x}_{i}$ over its top-$k$ neighbors, let $\hat{\mathbf{g}}_{i}\in\mathbb{R}^{k}$ denote the vector of these elements and $\mathcal{N}_{i}$ the set of the indexes of these neighbors corresponding to $\hat{\mathbf{g}}_{i}$. Define matrices $\mathbf{D}^{x_{i}}=[\mathbf{x}_{i}-\mathbf{x}_{\mathcal{N}_{i(1)}},\mathbf{x}_{i}-\mathbf{x}_{\mathcal{N}_{i(2)}},\ldots,\mathbf{x}_{i}-\mathbf{x}_{\mathcal{N}_{i(k)}}]^{\mathsf{T}}\in\mathbb{R}^{k\times q}$ and $\mathbf{D}^{p_{i}}=[\mathbf{p}_{i}-\mathbf{p}_{\mathcal{N}_{i(1)}},\mathbf{p}_{i}-\mathbf{p}_{\mathcal{N}_{i(2)}},\ldots,\mathbf{p}_{i}-\mathbf{p}_{\mathcal{N}_{i(k)}}]^{\mathsf{T}}\in\mathbb{R}^{k\times l}$, and let the Gram matrices be $\mathbf{B}^{x_{i}}=\mathbf{D}^{x_{i}}(\mathbf{D}^{x_{i}})^{\mathsf{T}}\in\mathbb{R}^{k\times k}$ and $\mathbf{B}^{p_{i}}=\mathbf{D}^{p_{i}}(\mathbf{D}^{p_{i}})^{\mathsf{T}}\in\mathbb{R}^{k\times k}$. We can then transform the problem in Eq. (21) into the following form:

(22) $\begin{split}\min_{\hat{\mathbf{g}}_{i}}\quad&\hat{\mathbf{g}}_{i}^{\mathsf{T}}\left(\gamma\mathbf{B}^{x_{i}}+\mu\mathbf{B}^{p_{i}}\right)\hat{\mathbf{g}}_{i}\\ {\rm s.t.}\quad&\hat{\mathbf{g}}_{i}^{\mathsf{T}}\mathbf{1}_{k}=1,\ \mathbf{0}_{k}\leq\hat{\mathbf{g}}_{i}\leq\mathbf{1}_{k}.\end{split}$

The optimization problem in Eq. (22) is a standard quadratic programming (QP) problem and can be solved by off-the-shelf QP tools. $\mathbf{G}$ is updated by concatenating all the solved $\hat{\mathbf{g}}_{i}$ together.
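For illustration, one column of Eq. (22) can be solved with a generic solver as below (ours; SciPy's SLSQP is used here, but any QP solver applies):

```python
import numpy as np
from scipy.optimize import minimize

def solve_column(Bx, Bp, gamma, mu):
    """Bx, Bp: (k, k) Gram matrices from Eq. (21); returns the k reconstruction weights g_hat_i."""
    k = Bx.shape[0]
    H = gamma * Bx + mu * Bp
    cons = [{"type": "eq", "fun": lambda g: g.sum() - 1.0}]     # weights sum to 1
    res = minimize(lambda g: g @ H @ g, np.full(k, 1.0 / k),    # feasible uniform start
                   jac=lambda g: 2.0 * H @ g,
                   bounds=[(0.0, 1.0)] * k, constraints=cons, method="SLSQP")
    return res.x
```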

4.4. Update P

With the other variables fixed, the $\mathbf{P}$-subproblem is

(23) $\begin{split}\min_{\mathbf{P}}\quad&\left\|\mathbf{H}-\mathbf{P}\right\|_{F}^{2}+\beta\left\|\mathbf{1}_{n\times l}-\mathbf{Q}-\mathbf{P}\right\|_{F}^{2}+\mu\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y},\end{split}$

and we rewrite the problem in Eq. (23) into the following form:

(24) $\begin{split}\min_{\mathbf{P}}\quad&\left\|\mathbf{P}-\frac{\mathbf{H}+\beta(\mathbf{1}_{n\times l}-\mathbf{Q})}{1+\beta}\right\|_{F}^{2}+\frac{\mu}{1+\beta}\left\|\mathbf{P}^{\mathsf{T}}-\mathbf{P}^{\mathsf{T}}\mathbf{G}\right\|_{F}^{2}\\ {\rm s.t.}\quad&\mathbf{P}\mathbf{1}_{l}=\mathbf{1}_{n},\ \mathbf{0}_{n\times l}\leq\mathbf{P}\leq\mathbf{Y}.\end{split}$

In order to solve problem (24), we denote $\widetilde{\mathbf{p}}={\rm vec}(\mathbf{P})\in[0,1]^{nl}$, where ${\rm vec}(\cdot)$ is the vectorization operator. Similarly, $\widetilde{\mathbf{o}}={\rm vec}\left(\frac{\mathbf{H}+\beta(\mathbf{1}_{n\times l}-\mathbf{Q})}{1+\beta}\right)\in\mathbb{R}^{nl}$ and $\widetilde{\mathbf{y}}={\rm vec}(\mathbf{Y})\in\{0,1\}^{nl}$. Let $\mathbf{T}=2(\mathbf{I}_{n\times n}-\mathbf{G})(\mathbf{I}_{n\times n}-\mathbf{G})^{\mathsf{T}}\in\mathbb{R}^{n\times n}$. Based on these notations, the optimization problem (24) can be written as

(25) $\begin{split}\min_{\widetilde{\mathbf{p}}}\quad&\frac{1}{2}\widetilde{\mathbf{p}}^{\mathsf{T}}\left(\mathbf{C}+\frac{2(1+\beta)}{\mu}\mathbf{I}_{nl\times nl}\right)\widetilde{\mathbf{p}}-\frac{2(1+\beta)}{\mu}\widetilde{\mathbf{o}}^{\mathsf{T}}\widetilde{\mathbf{p}}\\ {\rm s.t.}\quad&\sum_{i=1,\,i\%n=j,\,0\leq j\leq n-1}^{nl}\widetilde{p}_{i}=1,\ \mathbf{0}_{nl}\leq\widetilde{\mathbf{p}}\leq\widetilde{\mathbf{y}},\end{split}$

where $\mathbf{C}\in\mathbb{R}^{nl\times nl}$ is defined as:

(26) $\mathbf{C}=\left[\begin{array}{cccc}\mathbf{T}&\mathbf{0}_{n\times n}&\cdots&\mathbf{0}_{n\times n}\\ \mathbf{0}_{n\times n}&\mathbf{T}&\ddots&\vdots\\ \vdots&\ddots&\ddots&\mathbf{0}_{n\times n}\\ \mathbf{0}_{n\times n}&\cdots&\mathbf{0}_{n\times n}&\mathbf{T}\end{array}\right].$

As the problem in Eq. (25) is a standard QP problem, it can be solved by off-the-shelf QP tools.
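The block-diagonal matrix $\mathbf{C}$ of Eq. (26) can be assembled directly with a Kronecker product, as in the following sketch (ours):

```python
import numpy as np

def build_C(G, n_labels):
    """C has l copies of T = 2 (I - G)(I - G)^T on its diagonal, i.e. C = I_l kron T (Eq. (26))."""
    n = G.shape[0]
    T = 2.0 * (np.eye(n) - G) @ (np.eye(n) - G).T
    return np.kron(np.eye(n_labels), T)
```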

4.5. The Overall Optimization Algorithm

As a summary, PL-CL first initializes $\mathbf{G}$ and $\mathbf{P}$ by solving problems (5) and (6) respectively, and initializes $\mathbf{Q}$ as:

(27) $q_{ij}=\left\{\begin{array}{ll}\frac{1}{l-\sum_{j}y_{ij}}&\text{if }y_{ij}=0\\ 0&\text{otherwise}.\end{array}\right.$

Then, PL-CL iteratively and alternately updates one variable with the others fixed until the model converges. For an unseen instance $\mathbf{x}^{\ast}$, the predicted label $y^{\ast}$ is

(28) $y^{\ast}=\mathop{\arg\max}_{k}\ \sum_{i=1}^{n}\frac{1}{2\lambda}\mathbf{A}_{ik}\mathcal{K}(\mathbf{x}^{\ast},\mathbf{x}_{i})+b_{k}.$
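A sketch (ours) of the prediction rule in Eq. (28) for an unseen instance $\mathbf{x}^{\ast}$:

```python
import numpy as np

def predict(x_star, X_train, A, b, lam, kernel):
    """kernel(x, x') -> scalar, e.g. the Gaussian kernel used by PL-CL."""
    k_vec = np.array([kernel(x_star, x_i) for x_i in X_train])   # K(x*, x_i), i = 1..n
    scores = k_vec @ A / (2.0 * lam) + b                          # one score per label
    return int(np.argmax(scores))                                 # Eq. (28)
```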

The pseudo code of PL-CL is summarized in Algorithm 1.

Algorithm 1 The pseudo code of PL-CL
  Input: Partial label data $\mathcal{D}$, hyper-parameters $\alpha$, $\beta$, $\mu$, $\gamma$ and $\lambda$, an unseen sample $\mathbf{x}^{\ast}$
  Output: The predicted label $y^{\ast}$ of $\mathbf{x}^{\ast}$
  Process:
  Initialize $\mathbf{G}$ according to Eq. (5).
  Initialize $\mathbf{P}$ according to Eq. (6).
  Initialize $\mathbf{Q}$ according to Eq. (27).
  repeat
     Update $\mathbf{A}$, $\mathbf{b}$ according to Eq. (16).
     Update $\hat{\mathbf{A}}$, $\hat{\mathbf{b}}$ according to Eq. (17).
     Update $\mathbf{Q}$ according to Eq. (19).
     Update $\mathbf{G}$ according to Eq. (22).
     Update $\mathbf{P}$ according to Eq. (25).
  until Convergence
  Return: $y^{\ast}$ according to Eq. (28).

4.6. Complexity Analysis

The complexity for solving $\mathbf{W}$, $\mathbf{b}$, $\hat{\mathbf{W}}$ and $\hat{\mathbf{b}}$ is $O(n^{3})+O(nql)$, that for updating $\mathbf{Q}$ is $O(nl)$, and the complexities for updating $\mathbf{G}$ and $\mathbf{P}$ are $O(nk^{3})$ and $O(n^{3}q^{3})$, respectively. Therefore, the overall complexity of PL-CL is $O(n^{3}+nql+nl+nk^{3}+n^{3}q^{3})$.

5. Experiments

5.1. Compared Methods

In order to validate the effectiveness of PL-CL, we compared it with six state-of-the-art PLL approaches:

  • PL-AGGD (Wang et al., 2022): a PLL approach which constructs a similarity graph and further uses that graph to realize disambiguation [hyper-parameter configurations: $k=10$, $T=20$, $\lambda=1$, $\mu=1$, $\gamma=0.05$];

  • SURE (Feng and An, 2019): a self-guided retraining based approach which maximizes an infinity norm regularization on the modeling outputs to realize disambiguation [hyper-parameter configurations: $\lambda$ and $\beta$ $\in\{0.001,0.01,0.05,0.1,0.3,0.5,1\}$];

  • LALO (Feng and An, 2018): an identification-based approach with constrained local consistency to differentiate the candidate labels [hyper-parameter configurations: $k=10$, $\lambda=0.05$, $\mu=0.005$];

  • IPAL (Zhang and Yu, 2015): an instance-based algorithm which identifies the valid label via an iterative label propagation procedure [hyper-parameter configurations: $\alpha=0.95$, $k=10$, $T=100$];

  • PLDA (Wang and Zhang, 2022): a dimensionality reduction approach for partial label learning [hyper-parameter configurations: PL-KNN];

  • PL-KNN (Hüllermeier and Beringer, 2005): an averaging-based method that averages the neighbors’ candidate labels to make prediction on an unseen sample [hyper-parameter configurations: $k=10$].

5.2. Experimental Settings

For PL-CL, we set $k=10$ and $\lambda=0.03$, and $\gamma$, $\mu$, $\alpha$ and $\beta$ were chosen from $\{0.001,0.01,0.1,0.2,0.5,1,1.5,2,4\}$. The kernel functions of SURE, PL-AGGD and LALO were the same as that of PL-CL. All the hyper-parameter configurations of each method were set according to the original paper. Ten runs of 50%/50% random train/test splits were performed, and the averaged accuracy with standard deviations on the test data sets was recorded for all the comparing algorithms.

Table 1. Characteristics of the real-world data sets.
Data set # Examples # Features # Labels # Average Labels Task Domain
FG-NET (Panis et al., 2016) 1002 262 78 7.48 facial age estimation
Lost (Cour et al., 2009) 1122 108 16 2.23 automatic face naming
MSRCv2 (Liu and Dietterich, 2012b) 1758 48 23 3.16 object classification
Mirflickr (Huiskes and Lew, 2008) 2780 1536 14 2.76 web image classification
Soccer Player (Zeng et al., 2013) 17472 279 171 2.09 automatic face naming
Yahoo!News (Guillaumin et al., 2010) 22991 163 219 1.91 automatic face naming
Table 2. The comparison of different methods on the real-world data sets. •/○ indicates whether PL-CL is statistically superior/inferior to the compared algorithm according to the pairwise t-test at the 0.05 significance level.
Data set FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5)
PL-CL 0.072 ± 0.009 0.709 ± 0.022 0.469 ± 0.016 0.642 ± 0.012 0.534 ± 0.004 0.618 ± 0.003 0.433 ± 0.022 0.575 ± 0.015
SURE 0.052 ± 0.006 • 0.693 ± 0.020 • 0.445 ± 0.021 • 0.631 ± 0.021 • 0.519 ± 0.004 • 0.598 ± 0.002 • 0.356 ± 0.019 • 0.494 ± 0.020 •
PL-AGGD 0.063 ± 0.009 • 0.683 ± 0.014 • 0.451 ± 0.012 • 0.610 ± 0.011 • 0.524 ± 0.004 • 0.607 ± 0.004 • 0.387 ± 0.015 • 0.530 ± 0.015 •
LALO 0.065 ± 0.009 • 0.680 ± 0.014 • 0.448 ± 0.015 • 0.626 ± 0.013 • 0.523 ± 0.003 • 0.600 ± 0.003 • 0.423 ± 0.020 • 0.566 ± 0.014 •
IPAL 0.051 ± 0.011 • 0.646 ± 0.023 • 0.488 ± 0.031 ○ 0.527 ± 0.009 • 0.528 ± 0.003 • 0.625 ± 0.004 ○ 0.349 ± 0.019 • 0.500 ± 0.019 •
PLDA 0.042 ± 0.005 • 0.289 ± 0.045 • 0.422 ± 0.013 • 0.480 ± 0.015 • 0.493 ± 0.003 • 0.380 ± 0.003 • 0.150 ± 0.012 • 0.232 ± 0.012 •
PL-KNN 0.036 ± 0.006 • 0.296 ± 0.021 • 0.393 ± 0.014 • 0.454 ± 0.015 • 0.483 ± 0.005 • 0.368 ± 0.004 • 0.288 ± 0.013 • 0.440 ± 0.016 •
Table 3. Characteristics of the UCI data sets.
Data set ecoli glass vehicle steel
# Examples 336 214 846 1941
# Features 7 9 18 27
# Labels 8 6 4 7
Table 4. Win/tie/loss counts on the controlled UCI data sets between PL-CL and the other compared approaches according to the pairwise t-test at the 0.05 significance level. I: varying $\epsilon$; II: varying $p$ ($r=1$); III: varying $p$ ($r=2$); IV: varying $p$ ($r=3$).
I II III IV Total
SURE 22/6/0 26/2/0 26/2/0 24/4/0 98/14/0
PL-AGGD 26/2/0 25/3/0 23/5/0 22/6/0 96/16/0
LALO 27/1/0 24/4/0 26/2/0 22/6/0 99/13/0
IPAL 28/0/0 23/5/0 26/2/0 25/3/0 102/10/0
PLDA 28/0/0 28/0/0 28/0/0 28/0/0 112/0/0
PL-KNN 28/0/0 28/0/0 28/0/0 28/0/0 112/0/0
Figure 2. Classification accuracy on the controlled UCI data sets: (a) $\epsilon$ varying from 0.1 to 0.7 ($p=1$, $r=1$); (b) $p$ varying from 0.1 to 0.7 ($r=1$); (c) $p$ varying from 0.1 to 0.7 ($r=2$); (d) $p$ varying from 0.1 to 0.7 ($r=3$).
Table 5. The comparison of different methods on the real-world data sets on transductive accuracy. •/○ indicates whether PL-CL is statistically superior/inferior to the compared algorithm according to the pairwise t-test at the 0.05 significance level.
Data set FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5)
PL-CL 0.159 ± 0.016 0.832 ± 0.019 0.585 ± 0.012 0.697 ± 0.016 0.715 ± 0.002 0.840 ± 0.004 0.600 ± 0.023 0.737 ± 0.018
SURE 0.158 ± 0.012 0.798 ± 0.019 • 0.603 ± 0.015 ○ 0.652 ± 0.022 • 0.700 ± 0.003 • 0.798 ± 0.005 • 0.590 ± 0.018 • 0.727 ± 0.019 •
PL-AGGD 0.130 ± 0.015 • 0.804 ± 0.016 • 0.551 ± 0.015 • 0.653 ± 0.015 • 0.698 ± 0.003 • 0.817 ± 0.005 • 0.530 ± 0.021 • 0.679 ± 0.024 •
LALO 0.153 ± 0.016 • 0.817 ± 0.012 • 0.548 ± 0.009 • 0.675 ± 0.017 • 0.698 ± 0.003 • 0.821 ± 0.004 • 0.592 ± 0.024 • 0.730 ± 0.015 •
IPAL 0.148 ± 0.021 0.793 ± 0.017 • 0.680 ± 0.013 ○ 0.586 ± 0.007 • 0.681 ± 0.003 • 0.839 ± 0.003 0.563 ± 0.021 • 0.698 ± 0.022 •
PLDA 0.042 ± 0.005 • 0.351 ± 0.060 • 0.479 ± 0.015 • 0.564 ± 0.015 • 0.493 ± 0.004 • 0.460 ± 0.009 • 0.150 ± 0.012 • 0.232 ± 0.012 •
PL-KNN 0.041 ± 0.007 • 0.338 ± 0.016 • 0.415 ± 0.013 • 0.466 ± 0.013 • 0.504 ± 0.005 • 0.403 ± 0.009 • 0.285 ± 0.016 • 0.438 ± 0.014 •
Table 6. Ablation study of the proposed approach on the real-world data sets. •/○ indicates whether PL-CL is statistically superior/inferior to the compared variant according to the pairwise t-test at the 0.05 significance level.
Classification Accuracy
Kernel Complementary Classifier Graph FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5) Average
0.061 ± 0.006 • 0.622 ± 0.019 • 0.381 ± 0.015 • 0.249 ± 0.010 • 0.492 ± 0.003 • 0.430 ± 0.051 • 0.402 ± 0.031 • 0.551 ± 0.024 • 0.398
0.060 ± 0.008 • 0.654 ± 0.019 • 0.426 ± 0.017 • 0.533 ± 0.012 • 0.515 ± 0.010 • 0.526 ± 0.006 • 0.413 ± 0.026 • 0.564 ± 0.018 • 0.461
0.057 ± 0.009 • 0.705 ± 0.023 • 0.462 ± 0.015 • 0.637 ± 0.011 • 0.530 ± 0.004 • 0.517 ± 0.046 • 0.416 ± 0.017 • 0.560 ± 0.019 • 0.485
0.065 ± 0.008 • 0.684 ± 0.022 • 0.456 ± 0.014 • 0.635 ± 0.012 • 0.529 ± 0.003 • 0.607 ± 0.003 • 0.426 ± 0.024 • 0.566 ± 0.017 • 0.496
0.072 ± 0.009 0.709 ± 0.022 0.469 ± 0.016 0.642 ± 0.012 0.534 ± 0.004 0.618 ± 0.003 0.433 ± 0.022 0.575 ± 0.015 0.507
Transductive Accuracy
Kernel Complementary Classifier Graph FG-NET Lost MSRCv2 Mirflickr Soccer Player Yahoo!News FG-NET(MAE3) FG-NET(MAE5) Average
0.161 ± 0.017 • 0.729 ± 0.017 • 0.414 ± 0.013 • 0.464 ± 0.008 • 0.492 ± 0.004 • 0.439 ± 0.003 • 0.580 ± 0.024 • 0.717 ± 0.022 • 0.499
0.159 ± 0.012 • 0.748 ± 0.015 • 0.515 ± 0.018 • 0.594 ± 0.013 • 0.700 ± 0.004 • 0.774 ± 0.032 • 0.588 ± 0.021 • 0.726 ± 0.010 • 0.600
0.141 ± 0.011 • 0.819 ± 0.017 • 0.572 ± 0.013 • 0.685 ± 0.014 • 0.714 ± 0.002 • 0.839 ± 0.004 • 0.568 ± 0.025 • 0.706 ± 0.022 • 0.630
0.150 ± 0.015 • 0.824 ± 0.021 • 0.567 ± 0.008 • 0.687 ± 0.012 • 0.701 ± 0.003 • 0.829 ± 0.004 • 0.570 ± 0.022 • 0.710 ± 0.021 • 0.629
0.159 ± 0.016 0.832 ± 0.019 0.585 ± 0.012 0.697 ± 0.016 0.715 ± 0.002 0.840 ± 0.004 0.600 ± 0.023 0.737 ± 0.018 0.646
Figure 3. Sensitivity and convergence curves of PL-CL: (a)-(e) sensitivity w.r.t. $\alpha$, $\beta$, $\gamma$, $\mu$ and $\lambda$; (f)-(k) convergence curves on FG-NET, Lost, MSRCv2, Mirflickr, Soccer Player and Yahoo!News.

5.3. Performance on Real-World Data Sets

We evaluated the performance of our method and the compared ones on 6 real-world data sets, whose details are summarized in Table 1. For facial age estimation, e.g., FG-NET (Panis et al., 2016), human faces are represented as instances while the ages annotated by crowd-sourcing labelers are considered as candidate labels. For automatic face naming, i.e., Lost (Cour et al., 2009), Soccer Player (Zeng et al., 2013) and Yahoo!News (Guillaumin et al., 2010), each face cropped from an image or a video frame is taken as an instance with the names extracted from the corresponding captions or subtitles as its candidate labels. For the object classification task, e.g., MSRCv2 (Liu and Dietterich, 2012b), image segments are considered as instances with objects appearing in the same image as candidate labels. For the web image classification task, e.g., Mirflickr (Huiskes and Lew, 2008), each image is represented as an instance with the tags extracted from the web page being candidate labels.

As the average number of candidate labels of FG-NET is very large, the classification accuracy on it can be quite low. Following (Zhang et al., 2016), we further employed the mean absolute error (MAE) as an evaluation criterion. Specifically, for FG-NET(MAE3)/(MAE5), test examples are considered to be correctly classified if the error between the predicted age and the true age is no more than 3/5 years.
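This criterion amounts to the following simple computation (our sketch):

```python
import numpy as np

def mae_accuracy(pred_ages, true_ages, k):
    """Fraction of test examples whose |predicted age - true age| is at most k years (k = 3 or 5)."""
    return np.mean(np.abs(np.asarray(pred_ages) - np.asarray(true_ages)) <= k)
```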

Table 2 shows the classification accuracy with standard deviation of each approach on the real-world data sets, where we can observe that

  • PL-CL significantly outperforms PL-AGGD, LALO, PLDA and PL-KNN on all the real-world data sets. Compared with two graph-based approaches PL-AGGD and LALO, PL-CL achieves outstanding performance, which validates the effectiveness of the proposed complementary classifier.

  • IPAL performs well on MSRCv2 and Yahoo!News, while it does not perform as well on the other data sets. In contrast, PL-CL performs well on all the data sets, which shows the robustness and effectiveness of PL-CL.

  • In a nutshell, PL-CL achieves significantly superior accuracy in 95.8% of cases according to the pairwise t-test at the 0.05 significance level.

5.4. Performance on Controlled UCI Data Sets

Table 3 summarizes the characteristics of four UCI data sets. Following the widely-used controlling protocol in PLL (Zeng et al., 2013), (Liu and Dietterich, 2012a), a synthetic partial label data set can be generated from a UCI data set, controlled by three parameters $p$, $r$ and $\epsilon$. Specifically, $p$ controls the proportion of the training instances with partial labels, $r$ denotes the number of false positive labels residing in the candidate label set, and $\epsilon$ stands for the co-occurring probability of one candidate label and the ground-truth label.
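For reference, the $p$/$r$ controlling protocol (detailed below) can be sketched as follows (ours; the $\epsilon$-based protocol instead fixes one specific label to co-occur with the ground-truth label):

```python
import numpy as np

def make_partial_labels(y_true, n_labels, p, r, seed=0):
    """Turn a fraction p of samples into partial label examples with r random false positives each."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true)
    n = len(y_true)
    Y = np.zeros((n, n_labels))
    Y[np.arange(n), y_true] = 1.0                          # the ground-truth label is always a candidate
    for i in rng.choice(n, size=int(p * n), replace=False):
        others = np.delete(np.arange(n_labels), y_true[i])
        Y[i, rng.choice(others, size=r, replace=False)] = 1.0   # add r false positive labels
    return Y
```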

With fixed $p=1$ and $r=1$, the classification accuracy of all the methods when $\epsilon$ varies from 0.1 to 0.7 with step size 0.1 is shown in Fig. 2(a). A specific label was chosen to co-occur with the ground-truth label with probability $\epsilon$, and the rest of the labels were randomly chosen with probability $(1-\epsilon)/(l-2)$. Figs. 2(b)-(d) show the classification accuracy with $p$ varying from 0.1 to 0.7 when $r=1,2,3$, respectively. We randomly selected a proportion $p$ of the training samples as partial label examples with randomly chosen $r$ labels as the corresponding false positive labels, i.e., the size of the candidate label set of each instance is $r+1$ (under this setting, as the false positive labels are randomly chosen, $\epsilon$ is set to $\frac{1}{l-1}$). Table 4 summarizes the detailed win/tie/loss counts between PL-CL and the other methods according to the pairwise t-test with a significance level of 0.05. From Fig. 2 and Table 4 we can conclude that:

  • With increasing $p$ and $\epsilon$, the classification accuracy of all the methods generally drops, as a larger $p$ or a larger $\epsilon$ means a more difficult PLL task. Even so, our approach still performs better than the compared methods.

  • As the number of candidate labels increases, the gaps between PL-CL and the others become more obvious, which means that with more candidate labels, the effect of the complementary classifier is more prominent.

  • PL-CL clearly outperforms the other methods under all the settings, i.e., PL-CL outperforms the other compared methods significantly in 92.1% of cases and achieves at least comparable performance in 7.9% of cases out of a total of 672 cases. Moreover, PL-CL is never significantly outperformed by any compared approach.

5.5. Further Analysis

5.5.1. Disambiguation Ability

Transductive accuracy evaluates the classification accuracy of a PLL method on the training examples, which reflects the ability of label disambiguation. Table 5 shows the transductive accuracy of all the methods on all the real-world data sets, where we can find that PL-CL outperforms the other compared methods in 89.5% of cases and is at least comparable to them in 6.3% of cases according to the pairwise t-test, which validates the disambiguation ability of PL-CL.

5.5.2. Ablation Study

In order to validate the effectiveness of the complementary classifier and the adaptive graph, ablation studies were conducted and the results are shown in Table 6. It can be observed from Table 6 that the complementary classifier is quite useful when comparing the classification accuracy of the second and fourth rows or the third and fifth rows, as the method with the complementary classifier is superior to that without it. Additionally, when we compare the last two rows of the table, we can find that the model with the graph structure achieves superior performance to that without the graph. In general, PL-CL outperforms its degenerated versions in all cases w.r.t. both classification accuracy and transductive accuracy, which shows that both the complementary classifier and the graph structure play significant roles in our model. Besides, the use of the kernel can also improve the performance of PL-CL, which can be observed by comparing the results in the first and second rows. It is worth noting that although the kernel contributes to the improvement of the performance, the main improvement of PL-CL comes from the complementary classifier and the graph.

5.5.3. Sensitivity and Convergence Analysis

Figs. 3(a)-(e) illustrate the sensitivity of PL-CL w.r.t. $\alpha$, $\beta$, $\mu$, $\gamma$ and $\lambda$ on Lost, where we can find that PL-CL performs quite stably as the values of the hyper-parameters change within a reasonably wide range. Additionally, Figs. 3(f)-(k) show the convergence curves of PL-CL, where PL-CL converges quickly, within about 8 iterations, on all the data sets, which is also a desirable property in practice.

6. Conclusion and Future Work

In this paper, a novel method PL-CL based on a complementary classifier has been proposed to solve the PLL task. Different from previous works, we incorporated the information of complementary labels by inducing a complementary classifier, and used it to achieve candidate label disambiguation through a novel adversarial prior. Besides, PL-CL adopts an adaptive local graph shared by both the feature space and the label space to assist disambiguation. Comprehensive experiments on four controlled UCI data sets as well as six real-world data sets validated the effectiveness of PL-CL and the usefulness of the proposed complementary classifier. In the future, we will investigate how to design deep-learning-based PLL models by leveraging the complementary classifier.

References

  • Cour et al. (2009) Timothée Cour, Benjamin Sapp, Chris Jordan, and Benjamin Taskar. 2009. Learning from ambiguously labeled images. In 2009 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2009), 20-25 June 2009, Miami, Florida, USA. IEEE Computer Society, 919–926. https://doi.org/10.1109/CVPR.2009.5206667
  • Cour et al. (2011) Timothée Cour, Benjamin Sapp, and Ben Taskar. 2011. Learning from Partial Labels. J. Mach. Learn. Res. 12 (2011), 1501–1536. https://doi.org/10.5555/1953048.2021049
  • Feng and An (2018) Lei Feng and Bo An. 2018. Leveraging Latent Label Distributions for Partial Label Learning. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, Jérôme Lang (Ed.). ijcai.org, 2107–2113. https://doi.org/10.24963/ijcai.2018/291
  • Feng and An (2019) Lei Feng and Bo An. 2019. Partial Label Learning with Self-Guided Retraining. In The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, The Thirty-First Innovative Applications of Artificial Intelligence Conference, IAAI 2019, The Ninth AAAI Symposium on Educational Advances in Artificial Intelligence, EAAI 2019, Honolulu, Hawaii, USA, January 27 - February 1, 2019. AAAI Press, 3542–3549. https://doi.org/10.1609/aaai.v33i01.33013542
  • Feng et al. (2020) Lei Feng, Takuo Kaneko, Bo Han, Gang Niu, Bo An, and Masashi Sugiyama. 2020. Learning with Multiple Complementary Labels. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event (Proceedings of Machine Learning Research, Vol. 119). PMLR, 3072–3081. http://proceedings.mlr.press/v119/feng20a.html
  • Guillaumin et al. (2010) Matthieu Guillaumin, Jakob Verbeek, and Cordelia Schmid. 2010. Multiple Instance Metric Learning from Automatically Labeled Bags of Faces. In Computer Vision - ECCV 2010, 11th European Conference on Computer Vision, Heraklion, Crete, Greece, September 5-11, 2010, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 6311), Kostas Daniilidis, Petros Maragos, and Nikos Paragios (Eds.). Springer, 634–647. https://doi.org/10.1007/978-3-642-15549-9_46
  • Hou et al. (2016) Peng Hou, Xin Geng, and Min-Ling Zhang. 2016. Multi-Label Manifold Learning. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence, February 12-17, 2016, Phoenix, Arizona, USA, Dale Schuurmans and Michael P. Wellman (Eds.). AAAI Press, 1680–1686. http://www.aaai.org/ocs/index.php/AAAI/AAAI16/paper/view/12144
  • Huiskes and Lew (2008) Mark J. Huiskes and Michael S. Lew. 2008. The MIR flickr retrieval evaluation. In Proceedings of the 1st ACM SIGMM International Conference on Multimedia Information Retrieval, MIR 2008, Vancouver, British Columbia, Canada, October 30-31, 2008, Michael S. Lew, Alberto Del Bimbo, and Erwin M. Bakker (Eds.). ACM, 39–43. https://doi.org/10.1145/1460096.1460104
  • Hüllermeier and Beringer (2005) Eyke Hüllermeier and Jürgen Beringer. 2005. Learning from Ambiguously Labeled Examples. 3646 (2005), 168–179. https://doi.org/10.1007/11552253_16
  • Ishida et al. (2017) Takashi Ishida, Gang Niu, Weihua Hu, and Masashi Sugiyama. 2017. Learning from Complementary Labels. (2017), 5639–5649. https://proceedings.neurips.cc/paper/2017/hash/1dba5eed8838571e1c80af145184e515-Abstract.html
  • Liu and Dietterich (2012a) Li-Ping Liu and Thomas G. Dietterich. 2012a. A Conditional Multinomial Mixture Model for Superset Label Learning. (2012), 557–565. https://proceedings.neurips.cc/paper/2012/hash/aaebdb8bb6b0e73f6c3c54a0ab0c6415-Abstract.html
  • Liu and Dietterich (2012b) Li-Ping Liu and Thomas G. Dietterich. 2012b. A Conditional Multinomial Mixture Model for Superset Label Learning. (2012), 557–565. https://proceedings.neurips.cc/paper/2012/hash/aaebdb8bb6b0e73f6c3c54a0ab0c6415-Abstract.html
  • Lyu et al. (2021) Gengyu Lyu, Songhe Feng, Tao Wang, Congyan Lang, and Yidong Li. 2021. GM-PLL: Graph Matching Based Partial Label Learning. IEEE Trans. Knowl. Data Eng. 33, 2 (2021), 521–535. https://doi.org/10.1109/TKDE.2019.2933837
  • Nguyen and Caruana (2008) Nam Nguyen and Rich Caruana. 2008. Classification with partial labels. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Las Vegas, Nevada, USA, August 24-27, 2008, Ying Li, Bing Liu, and Sunita Sarawagi (Eds.). ACM, 551–559. https://doi.org/10.1145/1401890.1401958
  • Panis et al. (2016) Gabriel Panis, Andreas Lanitis, Nicolas Tsapatsoulis, and Timothy F. Cootes. 2016. Overview of research on facial ageing using the FG-NET ageing database. IET Biom. 5, 2 (2016), 37–46. https://doi.org/10.1049/iet-bmt.2014.0053
  • Wang et al. (2022) Deng-Bao Wang, Min-Ling Zhang, and Li Li. 2022. Adaptive Graph Guided Disambiguation for Partial Label Learning. IEEE Trans. Pattern Anal. Mach. Intell. 44, 12 (2022), 8796–8811. https://doi.org/10.1109/TPAMI.2021.3120012
  • Wang and Zhang (2022) Wei Wang and Min-Ling Zhang. 2022. Partial Label Learning with Discrimination Augmentation. In KDD ’22: The 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, Washington, DC, USA, August 14 - 18, 2022, Aidong Zhang and Huzefa Rangwala (Eds.). ACM, 1920–1928. https://doi.org/10.1145/3534678.3539363
  • Yang and Vozila (2014) Fan Yang and Paul Vozila. 2014. Semi-Supervised Chinese Word Segmentation Using Partial-Label Learning With Conditional Random Fields. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL, Alessandro Moschitti, Bo Pang, and Walter Daelemans (Eds.). ACL, 90–98. https://doi.org/10.3115/v1/d14-1010
  • Yu et al. (2018) Xiyu Yu, Tongliang Liu, Mingming Gong, and Dacheng Tao. 2018. Learning with Biased Complementary Labels. In Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part I (Lecture Notes in Computer Science, Vol. 11205), Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss (Eds.). Springer, 69–85. https://doi.org/10.1007/978-3-030-01246-5_5
  • Zeng et al. (2013) Zinan Zeng, Shijie Xiao, Kui Jia, Tsung-Han Chan, Shenghua Gao, Dong Xu, and Yi Ma. 2013. Learning by Associating Ambiguously Labeled Images. In 2013 IEEE Conference on Computer Vision and Pattern Recognition, Portland, OR, USA, June 23-28, 2013. IEEE Computer Society, 708–715. https://doi.org/10.1109/CVPR.2013.97
  • Zhang et al. (2016) Min-Ling Zhang, Bin-Bin Zhou, and Xu-Ying Liu. 2016. Partial Label Learning via Feature-Aware Disambiguation. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, Balaji Krishnapuram, Mohak Shah, Alexander J. Smola, Charu C. Aggarwal, Dou Shen, and Rajeev Rastogi (Eds.). ACM, 1335–1344. https://doi.org/10.1145/2939672.2939788
  • Zhang and Yu (2015) Min-Ling Zhang and Fei Yu. 2015. Solving the partial label learning problem: An instance-based approach. In Twenty-fourth international joint conference on artificial intelligence.