
Label Distribution Learning from Logical Label

Yuheng Jia1,2    Jiawei Tang1,2    Jiahao Jiang1,2 1School of Computer Science and Engineering, Southeast University
2Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China
{yhjia, jwtang, jhjiang}@seu.edu.cn
Abstract

Label distribution learning (LDL) is an effective method to predict the label description degrees (a.k.a. label distribution) of a sample. However, annotating label distributions (LDs) for training samples is extremely costly. Therefore, recent studies often first use label enhancement (LE) to generate an estimated label distribution from the logical labels and then apply an external LDL algorithm to the recovered label distributions to predict the label distributions of unseen samples. However, this step-wise manner overlooks the possible connections between LE and LDL. Moreover, the existing LE approaches may assign some description degrees to invalid labels. To solve the above problems, we propose a novel method that learns an LDL model directly from the logical labels, which unifies LE and LDL into a joint model and avoids the drawbacks of the previous LE methods. We also give the generalization error bound of our method and theoretically prove that directly learning an LDL model from the logical labels is feasible. Extensive experiments on various datasets show that the proposed approach can construct a reliable LDL model directly from the logical labels and produce more accurate label distributions than the state-of-the-art LE methods. The code and the supplementary file can be found at https://github.com/seutjw/DLDL.

1 Introduction

Multi-label learning (MLL) Tsoumakas and Katakis (2007); Zhang and Zhou (2014) is a well-studied machine learning paradigm where each sample is associated with a set of labels. In MLL, each label is denoted by a logical value (0 or 1), indicating whether the label can describe a sample. However, the logical label cannot precisely describe the relative importance of each label to a sample. To this end, label distribution learning (LDL) Geng (2016) was proposed, in which real numbers are used to indicate the relative importance of a label to a certain sample. For example, Fig. 1(a) is a natural scene image, which is annotated with three positive labels (“Sky”, “Mountain” and “Sea”) and one negative label (“Human”) as shown in Fig. 1(b). Since the relative importance of each label to this image is different, for example, the label “Mountain” is more important than the label “Sea”, the real numbers (also known as the label description degrees) in Fig. 1(c) can better describe this image. The description degrees of all labels constitute a label distribution (LD). Specifically, if $d^{l}_{\boldsymbol{x}}$ represents the description degree of the label $l$ to the instance $\boldsymbol{x}$, it is subject to the non-negative constraint $d^{l}_{\boldsymbol{x}}\in[0,1]$ and the sum-to-one constraint $\sum_{l}d^{l}_{\boldsymbol{x}}=1$. LDL aims to predict the LD for unseen samples, which is a more general paradigm than traditional MLL.

[Figure 1: (a) Natural scene; (b) Logical label; (c) Label distribution]
Figure 1: An example of using label distribution to describe a natural scene image in (a). The histogram in (b) denotes the logical label, indicating whether a label can describe the image in (a), and the line chart in (c) shows the label distribution, revealing to what degree a label can describe the image in (a).

In LDL, a training set with samples annotated by LDs is required to train an LDL model. Unfortunately, the acquisition of label distributions of instances is a very costly and time-consuming process. Moreover, in reality, most datasets are only annotated by logical labels, which cannot be used directly by the existing LDL methods. Then a question naturally arises: can we directly train an LDL model from the logical labels?

Recently, label enhancement (LE) Xu et al. (2021) was proposed to partially answer this question. Specifically, LE first recovers the label distributions of the training samples from their logical labels and then performs an existing LDL algorithm on the recovered LDs. Following Xu et al. (2021), many variants of LE were proposed, such as Shao et al. (2018), Zheng et al. (2023) and Zhang et al. (2021). We refer the readers to the related work section for more details. To achieve LE, those methods generally construct a linear mapping from the features to the logical labels and then normalize the output of the linear mapping as the recovered LDs. However, those LE methods may assign positive description degrees to some negative logical labels. Moreover, as all the positive logical labels are annotated with the same value 1, simply building a mapping from features to logical labels is not the best choice, as it cannot differentiate the description degrees of different labels. Last but not least, the LE-based strategy predicts the LD for unseen samples in a step-wise manner, which may lose the connection between LE and LDL model training.

In this paper, we come up with a novel model named DLDL, i.e., Directly Label Distribution Learning from the logical label, which gives a favorable answer to the question “can we directly train an LDL model from the logical labels”. The major contributions of our method are summarized as follows:

  • Our model combines LE and LDL into a single model, which can learn an LDL model directly from instances annotated by logical labels. Through this joint manner, the LE and LDL processes better match each other.

  • We strictly constrain the description degree $d^{l}_{\boldsymbol{x}}$ to be 0 when the logical value of the label $l$ is 0. This constraint avoids assigning positive description degrees to negative logical labels.

  • The existing LE methods usually minimize a least-squares loss between the recovered LD and the logical label, which cannot well differentiate the description degrees of different labels. The proposed model discards this fidelity term and instead uses the KL-divergence to minimize the difference between the recovered LD and the predicted LD, which is a better measure of the difference between two distributions.

  • By using the Rademacher complexity, we derive the generalization bound of the proposed model and, for the first time, theoretically prove that it is possible to construct an LDL model directly from logical labels.

Extensive experiments on six benchmark datasets clearly show that the LD recovered by our method is better than that recovered by state-of-the-art LE methods on the training set, and the prediction performance of our model on the testing set is also better than the traditional step-wise strategy.

2 Related Works

Notations: Let $n$, $m$ and $c$ represent the number of samples, the dimension of features, and the number of labels, respectively. Let $\boldsymbol{x}\in\mathbb{R}^{m}$ denote a feature vector and $\boldsymbol{y}\in\{0,1\}^{c}$ denote its corresponding logical label vector. The feature matrix and the corresponding logical label matrix are denoted by $\mathbf{X}=[\boldsymbol{x}_{1};\boldsymbol{x}_{2};\ldots;\boldsymbol{x}_{n}]\in\mathbb{R}^{n\times m}$ and $\mathbf{Y}=[\boldsymbol{y}_{1};\boldsymbol{y}_{2};\ldots;\boldsymbol{y}_{n}]\in\{0,1\}^{n\times c}$, respectively. Let $\mathcal{Y}=\{l_{1},l_{2},\ldots,l_{c}\}$ be the complete set of labels. The description degree of the label $l$ to the instance $\boldsymbol{x}$ is denoted by $d^{l}_{\boldsymbol{x}}$, which satisfies $d^{l}_{\boldsymbol{x}}\in[0,1]$ and $\sum_{l}d^{l}_{\boldsymbol{x}}=1$, and the label distribution of $\boldsymbol{x}_{i}$ is denoted by $\boldsymbol{d}_{i}=(d^{l_{1}}_{\boldsymbol{x}_{i}},d^{l_{2}}_{\boldsymbol{x}_{i}},\ldots,d^{l_{c}}_{\boldsymbol{x}_{i}})$.
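For concreteness, these objects can be set up as plain numpy arrays; the sizes below are made up for illustration and are not taken from any of the datasets used later.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, c = 100, 20, 5                                 # samples, feature dimension, labels (illustrative)

X = rng.standard_normal((n, m))                      # feature matrix, one row per instance
Y = (rng.random((n, c)) < 0.4).astype(float)         # logical label matrix in {0, 1}
Y[Y.sum(axis=1) == 0, 0] = 1.0                       # make sure every instance has a positive label

# A valid label distribution matrix D satisfies 0 <= D <= Y and each row sums to one;
# the uniform distribution over the positive labels is one trivial example.
D = Y / Y.sum(axis=1, keepdims=True)
assert np.allclose(D.sum(axis=1), 1.0) and np.all(D <= Y + 1e-12)
```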

2.1 Label Distribution Learning

LDL is a new machine learning paradigm that constructs a model to predict the label distribution of samples. At first, LDL was achieved through problem transformation, which transforms LDL into a set of single-label learning problems such as PT-SVM and PT-Bayes Geng (2016), or through algorithm adaptation, which adapts existing machine learning algorithms to LDL, such as AA-kNN and AA-BP Geng (2016). SA-IIS Geng (2016) is the first model specially designed for LDL, whose objective function is a mixture of maximum entropy loss Berger et al. (1996) and KL-divergence. Based on SA-IIS, SA-BFGS Geng (2016) adopts BFGS to optimize the loss function, which is faster than SA-IIS. Sparsity conditional energy label distribution learning (SCE-LDL) Yang et al. (2016) is a three-layer energy-based model for LDL, which is further improved by incorporating sparsity constraints into the objective function. To reduce feature noise, latent semantics encoding for LDL (LSE-LDL) Xu et al. (2019) converts the original data features into latent semantic features and removes irrelevant features by feature selection. LDL forests (LDLFs) Shen et al. (2017) is based on differentiable decision trees and may be combined with representation learning to provide an end-to-end learning framework. LDL by optimal transport (LDLOT) Zhao and Zhou (2018) builds an objective function using the optimal transport distance and label correlations. $L^{2}$ Liu et al. (2021a) is the first attempt to combine LE and LDL into a unified model, utilizing the manifold structure of samples and label correlations to learn the prediction model.

2.2 Label Enhancement

The above-mentioned LDL methods all assume that each sample in the training set is annotated with a label distribution. However, precisely annotating the LDs of the training samples is extremely costly and time-consuming. On the contrary, many datasets annotated by logical labels are readily available. To this end, LE was proposed, which aims to convert the logical labels of the samples in the training set into LDs. GLLE Xu et al. (2021) is the first LE algorithm, which assumes that the LDs of two instances are similar to each other if the instances are similar in the feature space. LEMLL Shao et al. (2018) adopts the local linear embedding technique to evaluate the relationship of samples in the feature space. Generalized label enhancement with sample correlations (gLESC) Zheng et al. (2023) tries to obtain the sample correlations from both the feature and the label spaces. Bidirectional label enhancement (BD-LE) Liu et al. (2021b) takes the reconstruction error from the label distribution space to the feature space into consideration.

All of these LE methods, as well as the LE loss of $L^{2}$ Liu et al. (2021a), can be generally formulated as

$$\min_{\mathbf{W}}\ \|\mathbf{XW}-\mathbf{Y}\|_{F}^{2}+\phi(\mathbf{XW},\mathbf{X}), \qquad (1)$$

where $\mathbf{X}$ and $\mathbf{Y}$ are the feature matrix and the logical label matrix, $\|\cdot\|_{F}$ denotes the Frobenius norm of a matrix, $\mathbf{W}\in\mathbb{R}^{m\times c}$ builds a linear mapping from features to logical labels, and $\phi(\mathbf{XW},\mathbf{X})$ models the geometric structure of the samples, which is used to assist LD recovery. After minimizing Eq. (1), those methods usually normalize $\mathbf{XW}$ as the recovered LD.
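For reference, a minimal sketch of this generic two-step recipe, with $\phi$ dropped and a small ridge term added purely for numerical stability (both simplifications are ours and do not correspond to any specific published LE method):

```python
import numpy as np

def naive_le(X, Y, lam=1.0):
    """Least-squares mapping from features to logical labels, i.e. Eq. (1) with phi dropped
    and a ridge regularizer added for stability, followed by post-normalization of XW."""
    n, m = X.shape
    W = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ Y)
    Z = np.clip(X @ W, 1e-12, None)              # crude non-negativity fix
    return Z / Z.sum(axis=1, keepdims=True)      # rows normalized to sum to one
```

Because nothing ties the output rows to the constraint $\mathbf{0}\leq\boldsymbol{d}_{i}\leq\boldsymbol{y}_{i}$, such a baseline can assign positive degrees to negative labels, which is exactly the issue discussed next.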

Although those LE methods have achieved great success, they still suffer from the following limitations. First, the fidelity term that minimizes the distance between the recovered LD (i.e., $\mathbf{XW}$) and the logical label $\mathbf{Y}$ is inappropriate, because all positive logical labels are annotated with the same value and the linear mapping therefore cannot differentiate the description degrees of different labels. Besides, the Frobenius norm is not the best choice to measure the difference between two distributions. Second, Eq. (1) does not consider the physical restrictions of the label distribution, i.e., $\forall i,\ \mathbf{0}\leq\boldsymbol{d}_{i}\leq\boldsymbol{y}_{i},\ \sum_{l}d^{l}_{\boldsymbol{x}_{i}}=1$. Although those methods perform a post-normalization to satisfy these constraints, they may still assign positive description degrees to some negative logical labels. Furthermore, to predict the LD for unseen samples, those methods need to first perform LE and then carry out an existing LDL algorithm. This step-wise manner does not consider potential connections between LE and LDL.

3 The Proposed Approach

To solve the above-mentioned issues, we propose a novel model named DLDL. Different from the previous two-step strategy, our method can generate a label distribution matrix $\mathbf{D}=[\boldsymbol{d}_{1};\boldsymbol{d}_{2};\ldots;\boldsymbol{d}_{n}]\in\mathbb{R}^{n\times c}$ for samples annotated by logical labels and, at the same time, construct a weight matrix $\mathbf{W}\in\mathbb{R}^{m\times c}$ to predict the LD for unseen samples.

We use the following priors to recover the LD matrix $\mathbf{D}$ from the logical label matrix $\mathbf{Y}$. First, as each row of $\mathbf{D}$ (e.g., $\boldsymbol{d}_{i}$) denotes the recovered LD of a sample, it should obey the non-negative constraint and the sum-to-one constraint to make it a well-defined distribution, i.e., $\mathbf{0}_{c}\leq\boldsymbol{d}_{i}\leq\mathbf{1}_{c}$ and $\boldsymbol{d}_{i}\mathbf{1}_{c}=1$, where $\mathbf{0}_{c},\mathbf{1}_{c}\in\mathbb{R}^{c}$ denote an all-zeros vector and an all-ones vector, and $\mathbf{0}\leq\boldsymbol{d}_{i}\leq\mathbf{1}$ means each element of $\boldsymbol{d}_{i}$ is non-negative and no more than 1. Moreover, to avoid assigning a positive description degree to a negative logical label, we require that $\mathbf{0}\leq\boldsymbol{d}_{i}\leq\boldsymbol{y}_{i},\ \forall i$. Reformulating the above constraints in matrix form, we have $\mathbf{0}_{n\times c}\leq\mathbf{D}\leq\mathbf{Y}$ and $\mathbf{D}\mathbf{1}_{c}=\mathbf{1}_{n}$, where $\mathbf{0}_{n\times c}$ is an all-zeros matrix of size $n\times c$ and the above inequalities hold element-wise.

Second, we utilize the geometric structure of the samples to help recover the LDs from the logical labels, i.e., if two samples are similar to each other in the feature space, they are likely to have similar LDs. To capture the similarity among samples, we define the local similarity matrix $\mathbf{A}=[A_{ij}]_{n\times n}$ as

$$A_{ij}=\begin{cases}\exp\left(-\|\boldsymbol{x}_{i}-\boldsymbol{x}_{j}\|_{2}^{2}/2\sigma^{2}\right),&\text{if }\boldsymbol{x}_{j}\in\mathcal{N}(\boldsymbol{x}_{i}),\\ 0,&\text{otherwise}.\end{cases} \qquad (2)$$

$\mathcal{N}(\boldsymbol{x}_{i})$ denotes the $k$-nearest neighbors of $\boldsymbol{x}_{i}$, and $\sigma$ is the hyper-parameter of the RBF kernel. Based on the constructed local similarity matrix $\mathbf{A}$, we can infer that the value of $\|\boldsymbol{d}_{i}-\boldsymbol{d}_{j}\|_{2}$ should be small when $A_{ij}$ is large, i.e.,

$$\min_{\mathbf{D}}\ \sum_{i,j}A_{ij}\|\boldsymbol{d}_{i}-\boldsymbol{d}_{j}\|_{2}^{2}=\mathrm{tr}\left(\mathbf{D}^{T}\mathbf{G}\mathbf{D}\right), \qquad (3)$$

in which $\mathrm{tr}(\cdot)$ denotes the trace of a matrix, $\mathbf{G}=\hat{\mathbf{A}}-\mathbf{A}$ is the graph Laplacian matrix Jia et al. (2020b, a), and $\hat{\mathbf{A}}$ is a diagonal matrix with elements $\hat{A}_{ii}=\sum_{j=1}^{n}A_{ij}$.
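A direct numpy construction of $\mathbf{A}$ and $\mathbf{G}$ could look as follows (the defaults for $k$ and $\sigma$ are illustrative; the experiments later fix $k=20$):

```python
import numpy as np

def build_graph(X, k=20, sigma=1.0):
    """Local similarity matrix A of Eq. (2) and graph Laplacian G = diag(A1) - A used in Eq. (3)."""
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    A = np.zeros_like(sq_dists)
    for i in range(X.shape[0]):
        neighbors = np.argsort(sq_dists[i])[1:k + 1]          # k nearest neighbors, excluding x_i itself
        A[i, neighbors] = np.exp(-sq_dists[i, neighbors] / (2.0 * sigma ** 2))
    G = np.diag(A.sum(axis=1)) - A                            # \hat{A} - A
    return A, G
```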

By taking the above priors into consideration, the LD matrix $\mathbf{D}$ can be inferred from the logical label matrix $\mathbf{Y}$, i.e.,

$$\min_{\mathbf{D}}\ \mathrm{tr}\left(\mathbf{D}^{T}\mathbf{G}\mathbf{D}\right)+\|\mathbf{D}\|_{F}^{2} \qquad (4)$$
$$\text{s.t.}\ \mathbf{0}_{n\times c}\leq\mathbf{D}\leq\mathbf{Y},\ \mathbf{D}\mathbf{1}_{c}=\mathbf{1}_{n},$$

where an additional term $\|\mathbf{D}\|_{F}^{2}$ is imposed as regularization. Different from the previous LE methods in Eq. (1), we discard the fidelity term $\|\mathbf{XW}-\mathbf{Y}\|_{F}^{2}$. However, as we strictly constrain $\mathbf{0}\leq\mathbf{D}\leq\mathbf{Y}$, this constraint has a similar effect to the fidelity term, i.e., it encourages the description degrees of the valid labels to be large, while avoiding the drawbacks of the fidelity term, namely assigning description degrees to some negative logical labels and failing to differentiate the description degrees of different labels.

Based on the recovered LD, we adopt a non-linear model that maps the features to the recovered LD, which can be used to predict the LD for unseen samples. Specifically, the non-linear model is formulated as

$$P_{ij}=\dfrac{1}{Z_{i}}\exp\left(\sum^{m}_{k=1}X_{ik}W_{kj}\right), \qquad (5)$$

where $\mathbf{W}$ is the weight matrix, $\mathbf{P}=[P_{ij}]_{n\times c}$ is the prediction matrix, and $Z_{i}=\sum^{c}_{j=1}\exp\left(\sum^{m}_{k=1}X_{ik}W_{kj}\right)$ is the per-sample normalization term. To infer the weight matrix $\mathbf{W}$, we minimize the KL-divergence between the recovered label distribution matrix $\mathbf{D}$ and the prediction matrix $\mathbf{P}$, because KL-divergence is a good metric to measure the difference between two distributions. Accordingly, the loss function for inferring $\mathbf{W}$ becomes

$$\min_{\mathbf{W}}\ \mathrm{KL}(\mathbf{D},\mathbf{P})+\|\mathbf{W}\|_{F}^{2}, \qquad (6)$$

where the widely-used squared Frobenius norm is imposed to control the model complexity.
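As a concrete sketch, Eq. (5) and the KL term of Eq. (6) translate directly into numpy (the epsilon guard is our numerical-stability assumption, not part of the model):

```python
import numpy as np

def predict_ld(X, W):
    """Prediction matrix P of Eq. (5): row-wise softmax of XW."""
    S = X @ W
    S -= S.max(axis=1, keepdims=True)          # subtract row maxima for numerical stability
    E = np.exp(S)
    return E / E.sum(axis=1, keepdims=True)

def kl_divergence(D, P, eps=1e-12):
    """KL(D, P) summed over all training instances, as used in Eq. (6)."""
    return np.sum(D * (np.log(D + eps) - np.log(P + eps)))
```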

To directly induce an LD prediction model from the logical labels, we combine Eqs. (4) and (6) together and the final optimization problem of our method becomes:

$$\min_{\mathbf{D},\mathbf{W}}\ \mathrm{KL}(\mathbf{D},\mathbf{P})+\alpha\,\mathrm{tr}\left(\mathbf{D}^{T}\mathbf{G}\mathbf{D}\right)+\beta\|\mathbf{D}\|_{F}^{2}+\gamma\|\mathbf{W}\|_{F}^{2} \qquad (7)$$
$$\text{s.t.}\ \mathbf{0}_{n\times c}\leq\mathbf{D}\leq\mathbf{Y},\ \mathbf{D}\mathbf{1}_{c}=\mathbf{1}_{n},$$

in which $\alpha$, $\beta$ and $\gamma$ are hyper-parameters to adjust the contributions of the different terms. By minimizing Eq. (7), the proposed model can recover the LDs of samples with logical labels and, simultaneously, can predict the LD of unseen samples through Eq. (5).
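For completeness, the full objective of Eq. (7) can be evaluated as below (a sketch; the constraints on $\mathbf{D}$ are enforced by the optimization procedure of Section 3.1, not inside this function):

```python
import numpy as np

def dldl_objective(X, D, W, G, alpha, beta, gamma, eps=1e-12):
    """Value of Eq. (7): KL(D, P) + alpha tr(D^T G D) + beta ||D||_F^2 + gamma ||W||_F^2."""
    S = X @ W
    S -= S.max(axis=1, keepdims=True)                    # row-wise softmax, Eq. (5)
    P = np.exp(S) / np.exp(S).sum(axis=1, keepdims=True)
    kl = np.sum(D * (np.log(D + eps) - np.log(P + eps)))
    return (kl
            + alpha * np.trace(D.T @ G @ D)
            + beta * np.sum(D ** 2)
            + gamma * np.sum(W ** 2))
```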

3.1 Optimization

Eq. (7) has two variables; we solve it by an alternating and iterative process, i.e., update $\mathbf{W}$ with $\mathbf{D}$ fixed, and then update $\mathbf{D}$ with $\mathbf{W}$ fixed.

1) Update W. When $\mathbf{D}$ is fixed, the optimization problem (7) with respect to $\mathbf{W}$ is formulated as

$$\min_{\mathbf{W}}\ \mathrm{KL}(\mathbf{D},\mathbf{P})+\gamma\|\mathbf{W}\|_{F}^{2}. \qquad (8)$$

Expanding the KL-divergence and substituting $P_{ij}$ from Eq. (5) into Eq. (8), we obtain the objective function of $\mathbf{W}$ as

$$T(\mathbf{W})=\sum_{i}\ln\sum_{j}\exp\left(\sum_{k}X_{ik}W_{kj}\right)-\sum_{i,j}D_{ij}\sum_{k}X_{ik}W_{kj}+\gamma\|\mathbf{W}\|_{F}^{2}. \qquad (9)$$

Eq. (9) is a convex problem; we use the gradient descent algorithm to update $\mathbf{W}$, where the gradient is expressed as

$$\dfrac{\partial T(\mathbf{W})}{\partial W_{kj}}=\sum_{i}\dfrac{\exp\left(\sum_{k}X_{ik}W_{kj}\right)X_{ik}}{\sum_{j}\exp\left(\sum_{k}X_{ik}W_{kj}\right)}-\sum_{i}D_{ij}X_{ik}+2\gamma W_{kj}. \qquad (10)$$
Algorithm 1 Solve the Problem in (11)

Input: The initial label distribution matrix $\mathbf{D}$, the initial weight matrix $\mathbf{W}$, and $\mathbf{B}=\mathbf{\Phi}=\mathbf{0}_{n\times c}$.
Parameter: $\rho=1.2$, $\tau=0.001$ and $\tau_{max}=0.002$.
Output: The updated $\mathbf{D}$.

1:  while $\|\mathbf{B}-\mathbf{D}\|_{\infty}>10^{-3}$ do
2:     Update $\mathbf{D}$ by Eq. (14);
3:     Update $\mathbf{B}$ by solving Eq. (16);
4:     Update $\mathbf{\Phi}=\mathbf{\Phi}+\tau(\mathbf{B}-\mathbf{D})$ and $\tau=\min(\rho\tau,\tau_{max})$;
5:  end while
6:  $\mathbf{D}=\mathbf{B}$; return $\mathbf{D}$.
Algorithm 2 The DLDL Algorithm

Input: The feature matrix $\mathbf{X}$ and the logical label matrix $\mathbf{Y}$.
Parameter: $\alpha$, $\beta$, $\gamma$ and $\sigma$.
Output: The label distribution matrix $\mathbf{D}$ and the weight matrix $\mathbf{W}$.

1:  Initialize $\mathbf{W}$ and $\mathbf{D}$;
2:  repeat
3:     Update $\mathbf{W}$ by using Eq. (10);
4:     Update $\mathbf{D}$ by Algorithm 1;
5:  until convergence or the maximum number of iterations is reached;
6:  return $\mathbf{D}$ and $\mathbf{W}$.

2) Update D. With $\mathbf{W}$ fixed, the optimization problem (7) regarding $\mathbf{D}$ is formulated as

$$\min_{\mathbf{D}}\ \mathrm{KL}(\mathbf{D},\mathbf{P})+\alpha\,\mathrm{tr}\left(\mathbf{D}^{T}\mathbf{G}\mathbf{D}\right)+\beta\|\mathbf{D}\|_{F}^{2} \qquad (11)$$
$$\text{s.t.}\ \mathbf{0}\leq\mathbf{D}\leq\mathbf{Y},\ \mathbf{D}\mathbf{1}_{c}=\mathbf{1}_{n}.$$

Eq. (11) involves multiple constraints and diverse losses; we introduce an auxiliary variable $\mathbf{B}=\mathbf{D}\in\mathbb{R}^{n\times c}$ to simplify it, and the corresponding augmented Lagrangian becomes

$$\min_{\mathbf{D},\mathbf{B}}\ \mathrm{KL}(\mathbf{D},\mathbf{P})+\alpha\,\mathrm{tr}\left(\mathbf{D}^{T}\mathbf{G}\mathbf{D}\right)+\beta\|\mathbf{D}\|_{F}^{2}+\left<\mathbf{\Phi},\mathbf{B}-\mathbf{D}\right>+\frac{\tau}{2}\|\mathbf{B}-\mathbf{D}\|^{2}_{F} \qquad (12)$$
$$\text{s.t.}\ \mathbf{0}\leq\mathbf{B}\leq\mathbf{Y},\ \mathbf{B}\mathbf{1}_{c}=\mathbf{1}_{n},$$

where $\mathbf{\Phi}\in\mathbb{R}^{n\times c}$ is the Lagrange multiplier, and $\tau$ is the penalty parameter that introduces the augmented equality constraint. Eq. (12) can be minimized by solving the following sub-problems iteratively.

a) $\mathbf{D}$-subproblem: Removing the terms irrelevant to $\mathbf{D}$, the $\mathbf{D}$-subproblem of Eq. (12) becomes

$$\min_{\mathbf{D}}\ U(\mathbf{D})=\mathrm{KL}(\mathbf{D},\mathbf{P})+\alpha\,\mathrm{tr}\left(\mathbf{D}^{T}\mathbf{G}\mathbf{D}\right)+\beta\|\mathbf{D}\|_{F}^{2}+\left<\mathbf{\Phi},\mathbf{B}-\mathbf{D}\right>+\frac{\tau}{2}\|\mathbf{B}-\mathbf{D}\|^{2}_{F}, \qquad (13)$$

and its gradient with respect to $\mathbf{D}$ can be expressed as

$$\dfrac{\partial U(\mathbf{D})}{\partial D_{ij}}=\begin{cases}1+\ln D_{ij}-\ln P_{ij}+\alpha\left[(\mathbf{G}^{T}+\mathbf{G})\mathbf{D}\right]_{ij}+2\beta D_{ij}-\Phi_{ij}+\tau(D_{ij}-B_{ij}),&\text{if }D_{ij}\neq 0,\\ 0,&\text{if }D_{ij}=0.\end{cases} \qquad (14)$$
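Written with matrices, the nonzero branch of Eq. (14) is $\mathbf{1}+\ln\mathbf{D}-\ln\mathbf{P}+\alpha(\mathbf{G}^{T}+\mathbf{G})\mathbf{D}+2\beta\mathbf{D}-\mathbf{\Phi}+\tau(\mathbf{D}-\mathbf{B})$, so one gradient step on the $\mathbf{D}$-subproblem can be sketched as follows (step size illustrative; feasibility is restored later through the $\mathbf{B}$-subproblem):

```python
import numpy as np

def gradient_step_D(D, P, G, B, Phi, alpha, beta, tau, lr=0.01, eps=1e-12):
    """One gradient step on U(D) in Eq. (13), using the matrix form of Eq. (14)."""
    grad = (1.0 + np.log(D + eps) - np.log(P + eps)
            + alpha * (G.T + G) @ D
            + 2.0 * beta * D
            - Phi
            + tau * (D - B))
    D = D - lr * grad
    return np.clip(D, 0.0, None)      # keep the logarithm defined; the B-subproblem handles the rest
```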

b) $\mathbf{B}$-subproblem: When $\mathbf{D}$ is fixed, problem (12) can be written as

$$\min_{\mathbf{B}}\ \left<\mathbf{\Phi},\mathbf{B}-\mathbf{D}\right>+\frac{\tau}{2}\|\mathbf{B}-\mathbf{D}\|^{2}_{F} \qquad (15)$$
$$\text{s.t.}\ \mathbf{0}\leq\mathbf{B}\leq\mathbf{Y},\ \mathbf{B}\mathbf{1}_{c}=\mathbf{1}_{n}.$$

Let $\hat{\boldsymbol{b}}=[\boldsymbol{b}_{1}^{T};\boldsymbol{b}_{2}^{T};\cdots;\boldsymbol{b}_{n}^{T}]=vec(\mathbf{B})\in\mathbb{R}^{nc}$, where $\boldsymbol{b}_{i}$ is the $i$-th row of $\mathbf{B}$ and $vec(\cdot)$ is the vectorization operator. Then, problem (15) is equivalent to

$$\min_{\hat{\boldsymbol{b}}}\ \hat{\boldsymbol{b}}^{T}\mathbf{\Sigma}\hat{\boldsymbol{b}}-\left(2\hat{\boldsymbol{d}}^{T}+\frac{2}{\tau}\hat{\boldsymbol{\phi}}^{T}\right)\hat{\boldsymbol{b}} \qquad (16)$$
$$\text{s.t.}\ \mathbf{0}_{nc}\leq\hat{\boldsymbol{b}}\leq\hat{\boldsymbol{y}},\ \ \sum_{j=c(i-1)+1}^{ci}\hat{b}_{j}=1\ \ (\forall\ 1\leq i\leq n),$$

in which $\hat{\boldsymbol{d}}$, $\hat{\boldsymbol{\phi}}$ and $\hat{\boldsymbol{y}}$ represent $vec(\mathbf{D})$, $vec(\mathbf{\Phi})$ and $vec(\mathbf{Y})$, respectively, and $\hat{b}_{j}$ is the $j$-th element of $\hat{\boldsymbol{b}}$. Eq. (16) is a quadratic programming (QP) problem that can be solved by off-the-shelf QP tools.
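Alternatively, completing the square in problem (15) shows that each row of $\mathbf{B}$ is the Euclidean projection of the corresponding row of $\mathbf{D}-\mathbf{\Phi}/\tau$ onto $\{\boldsymbol{b}:\mathbf{0}\leq\boldsymbol{b}\leq\boldsymbol{y}_{i},\ \boldsymbol{b}\mathbf{1}=1\}$, which can be computed by bisection instead of a generic QP solver. A sketch of that equivalent reformulation:

```python
import numpy as np

def project_row(v, y, n_iter=50):
    """Project v onto {b : 0 <= b <= y, sum(b) = 1} by bisection on the shift theta,
    where the optimum has the form b = clip(v - theta, 0, y)."""
    lo, hi = v.min() - y.max() - 1.0, v.max()
    for _ in range(n_iter):
        theta = 0.5 * (lo + hi)
        if np.clip(v - theta, 0.0, y).sum() > 1.0:
            lo = theta
        else:
            hi = theta
    return np.clip(v - 0.5 * (lo + hi), 0.0, y)

def update_B(D, Phi, Y, tau):
    """B-subproblem (15): B = proj(D - Phi / tau), applied row by row."""
    V = D - Phi / tau
    return np.vstack([project_row(V[i], Y[i]) for i in range(V.shape[0])])
```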

The detailed solution for updating $\mathbf{D}$ in Eq. (11) is summarized in Algorithm 1, and the pseudo-code of DLDL is presented in Algorithm 2. In this paper, we initialize $\mathbf{W}$ as an identity matrix and initialize $\mathbf{D}$ according to the method in Wang et al. (2023).

3.2 Generalization Bound

In this section, we first provide the Rademacher complexity of DLDL, which is a commonly used tool for analyzing data-dependent risk bounds.

Definition 1. Let $\mathcal{H}$ be a family of functions mapping from $\mathcal{X}$ to $[0,1]$ and $\mathcal{S}$ be a fixed set of samples with size $n$. Then, the empirical Rademacher complexity of $\mathcal{H}$ with respect to $\mathcal{S}$ is defined as

$$\widehat{\mathcal{R}}_{S}(\mathcal{H})=\mathbb{E}_{\boldsymbol{\sigma}}\left[\sup_{h\in\mathcal{H}}\frac{1}{n}\sum_{i=1}^{n}\sigma_{i}h(x_{i})\right]. \qquad (17)$$

Lemma 1. Let $\mathcal{H}$ be a family of functions and $\ell$ a loss function upper bounded by $\Theta$. Then, for any $\delta>0$, with probability at least $1-\delta$, for all $h\in\mathcal{H}$,

$$\mathcal{L}(h)\leq\mathcal{L}_{S}(h)+\widehat{\mathcal{R}}_{S}(\ell\circ\mathcal{H})+3\Theta\sqrt{\frac{\log 2/\delta}{2n}}, \qquad (18)$$

where $\mathcal{L}(h)$ and $\mathcal{L}_{S}(h)$ are the generalization risk and the empirical risk with respect to $h$.

Theorem 1. The KL-divergence loss function $\ell$ can be written as $\mathrm{KL}(\mathbf{D},\mathbf{P})$, where $\mathbf{D}\in\mathbb{R}^{n\times c}$ and $\mathbf{P}\in\mathbb{R}^{n\times c}$ are the recovered LD matrix and the prediction matrix, respectively, and $\mathbf{P}$ can be expressed through the prediction weight matrix $\mathbf{W}\in\mathbb{R}^{m\times c}$. Let $\mathcal{H}=\mathbf{D}\times\mathbf{W}$ represent the family of functions for DLDL, with functions $(\mathbf{D},\mathbf{W})\in\mathcal{H}$. Our algorithm further encourages the complexity of $\mathbf{W}$ and the rank of $\mathbf{D}$ to be upper bounded by $\epsilon_{1}$ and $\epsilon_{2}$, respectively, i.e., $\|\mathbf{W}\|_{F}\leq\epsilon_{1}$ and $\mathrm{rank}(\mathbf{D})\leq\epsilon_{2}$. According to Definition 1, the Rademacher complexity of DLDL with the KL-divergence loss $\ell$ is upper bounded as follows:

$$\widehat{\mathcal{R}}_{S}(\ell\circ\mathcal{H})\leq\dfrac{\epsilon_{2}\sqrt{cm\cdot\exp(m|X_{max}\epsilon_{1}|)/\exp(-m|X_{min}\epsilon_{1}|)}}{\sqrt{n}}, \qquad (19)$$

in which $X_{max}$ and $X_{min}$ are the maximum and minimum elements of the feature matrix $\mathbf{X}$, respectively.

The proof is given in section A of the supplementary file. According to Lemma 1 and Theorem 1, the Rademacher complexity of DLDL is controlled by the number of training samples $n$. Because the numerator is a finite number, when DLDL has infinitely many training samples ($n$ tends to infinity), its Rademacher complexity tends to zero.

Theorem 2. Denote by $\mathbf{D}\in\mathbb{R}^{n\times c}$ and $\mathbf{W}\in\mathbb{R}^{m\times c}$ the recovered LD matrix and the prediction weight matrix, and let the complexity of $\mathbf{W}$ and the rank of $\mathbf{D}$ be upper bounded by $\epsilon_{1}$ and $\epsilon_{2}$, respectively. Then we have the following upper bound on $\Theta$:

$$\Theta\leq\sum_{i=1}^{n}\sum_{j=1}^{c}\ln\frac{m\exp(m|X_{max}\epsilon_{1}|)}{\exp(-m|X_{min}\epsilon_{1}|)}. \qquad (20)$$

The proof is given in section A of the supplementary file. The right-hand side of Eq. (20) is also a finite number, so when $n$ tends to infinity, the term $3\Theta\sqrt{\frac{\log 2/\delta}{2n}}$ in Lemma 1 tends to zero.

According to Lemma 1, Theorem 1 and Theorem 2, we finally have

$$\mathcal{L}(h)\leq\mathcal{L}_{S}(h)+\dfrac{\epsilon}{\sqrt{n}}, \qquad (21)$$

in which $\epsilon$ is a finite number. From Eq. (21), we can clearly see that when the number of training samples tends to infinity, the generalization risk of $h$ is upper bounded by its empirical risk.

The above generalization error bound assumes that the recovered label distribution matrix $\mathbf{D}$ is exactly the ground-truth one, so the last thing to prove is that, under certain conditions, DLDL can recover the ground-truth label distribution from the logical labels.

Theorem 3. If the number of zeros in the ground-truth label distribution matrix $\mathbf{D}_{g}$ is large enough, then the average difference between $\mathbf{D}$ and $\mathbf{D}_{g}$ has the following upper bound:

$$\frac{\sum_{i,l}|d_{il}-d_{g(il)}|}{nc}\leq\frac{2p_{2}\epsilon}{(1+\epsilon)+\frac{\lambda}{\Sigma k}}+p_{3}, \qquad (22)$$

in which $p_{2}\in[0,1]$, and $p_{3}$, $\epsilon$ and $\frac{\lambda}{\Sigma k}$ are small positive numbers.

The detailed proof of Theorem 3 is given in section B of the supplementary file. The average difference between $\mathbf{D}$ and $\mathbf{D}_{g}$ tends to zero when $p_{3}$, $\epsilon$ and $\frac{\lambda}{\Sigma k}$ are small enough, which indicates that DLDL can precisely recover the ground-truth label distribution from the logical labels under this circumstance. Combining all the lemmas and theorems, we conclude that directly learning an LDL model from logical labels is theoretically feasible.

4 Experiments

We select six real-world datasets from various fields for the experiments. Natural Scene (abbr. NS) Geng (2016); Geng et al. (2022) is generated from the preference distribution of each scene image, SCUT-FBP (abbr. SCUT) Xie et al. (2015) is a benchmark dataset for facial beauty perception, RAF-ML (abbr. RAF) Shang and Deng (2019) is a multi-label facial expression dataset, SCUT-FBP5500 (abbr. FBP) Liang et al. (2018) is a large dataset for facial beauty prediction, Ren_CECps (abbr. REN) Quan and Ren (2009) is a Chinese emotion corpus of weblog articles, and Twitter_LDL (abbr. Twitter) Yang et al. (2017) is a visual sentiment dataset. Some basic statistics of these datasets are given in section C of the supplementary file.

To verify whether our method can directly learn an LDL model from logical labels, we generate the logical labels from the ground-truth LDs. Specifically, when the description degree is higher than a predefined threshold $\delta$, we set the corresponding logical label to 1; otherwise, the corresponding logical label is set to 0. In this paper, $\delta$ is fixed to 0.01.
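This binarization step is simple to reproduce (a sketch; $\delta=0.01$ as stated above):

```python
import numpy as np

def distributions_to_logical(D_true, delta=0.01):
    """A label is marked positive (1) iff its ground-truth description degree exceeds delta."""
    return (D_true > delta).astype(float)
```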

4.1 Baselines and Settings

In this paper, we split each dataset into three subsets: training set (60%), validation set (20%) and testing set (20%). The training set is used for recovery experiments, that is, we perform LE methods to recover the LDs of training instances; the validation set is used to select the optimal hyper-parameters for each method; the testing set is used for predictive experiments, that is, we use an LDL model learned from the training set with the recovered LDs to predict the LDs of testing instances.

In the recovery experiment, for DLDL, $\alpha$ and $\gamma$ are chosen from $\{10^{-3},10^{-2},\cdots,10,10^{2}\}$, $\beta$ is selected from $\{10^{-3},10^{-2},\cdots,1,10\}$, the maximum number of iterations $t$ is fixed to 5, and the number of neighbors $k$ is set to 20. We compare DLDL with five state-of-the-art LE methods and one unified method, each configured as suggested in the respective literature: 1) FLE Wang et al. (2023): $\alpha$, $\beta$, $\lambda$, $\gamma$ are chosen from $\{10^{-3},10^{-2},\cdots,10,10^{2}\}$; 2) GLLE Xu et al. (2021): $\lambda$ is chosen from $\{10^{-3},10^{-2},\cdots,1\}$; 3) LEMLL Shao et al. (2018): the number of neighbors $K$ is set to 10 and $\epsilon$ is set to 0.2; 4) LESC Zheng et al. (2023): $\beta$ is set to 0.001 and $\gamma$ is set to 1; 5) FCM Gayar et al. (2006): the number of clustering centers is set to 10 times the number of labels; 6) $L^{2}$ Liu et al. (2021a): $\alpha$, $\gamma$, $\lambda$ are chosen from $\{10^{-5},10^{-3},\cdots,10^{3},10^{5}\}$, $k$ is set to 20, and $\beta$, $\mu$ keep their default values.

In the predictive experiment, we apply the SA-BFGS Geng (2016) algorithm to predict the LDs of testing instances based on the LDs recovered by the five baseline LE methods. Note that DLDL and $L^{2}$ can directly predict the LDs without an external LDL model.

As suggested in Geng (2016), we adopt three distance metrics (i.e., Chebyshev, Clark and One-error) and one similarity metric (i.e., Intersection) to evaluate the recovery and the predictive performances. The formulas of these evaluation metrics are summarized in section C of the supplementary file.
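For reference, the three distance measures and the similarity measure can be computed as follows; the first three follow their standard definitions in the LDL literature, while the One-error implementation reflects our reading of the metric (the top predicted label falling on a zero-degree ground-truth label) rather than a formula given in this paper.

```python
import numpy as np

def chebyshev(D, P):
    """Mean per-instance Chebyshev distance (smaller is better)."""
    return np.abs(D - P).max(axis=1).mean()

def clark(D, P, eps=1e-12):
    """Mean per-instance Clark distance (smaller is better)."""
    return np.sqrt(((D - P) ** 2 / (D + P + eps) ** 2).sum(axis=1)).mean()

def intersection(D, P):
    """Mean per-instance intersection similarity (larger is better)."""
    return np.minimum(D, P).sum(axis=1).mean()

def one_error(D, P):
    """Fraction of instances whose top predicted label has zero ground-truth degree."""
    top = P.argmax(axis=1)
    return (D[np.arange(len(D)), top] == 0).mean()
```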

4.2 Visualization of the Recovered Label Distributions

In this subsection, we visualize two typical recovery results of all the models on two different datasets in Fig. 2. We can observe that DLDL performs better on the recovery of non-positive labels, and its recovered distributions are also closer to the ground-truth label distributions than those of the other comparison algorithms. For example, on the Twitter dataset, the logical values corresponding to labels 1, 4, 5, 6 and 8 are all 0, indicating that these labels cannot describe this sample. However, the six comparison methods still assign positive description degrees to these invalid labels. In contrast, DLDL avoids this problem and produces differentiated description degrees fairly close to the ground-truth values.

Figure 2: The visualization of two typical recovery results on the Twitter (left) and NS (right) datasets.

4.3 Results

4.3.1 Recovery Results

In this subsection, we calculate the distances and similarities between the ground-truth LDs and the LDs recovered by each LE method via the four evaluation metrics on each dataset. Table 1 shows the detailed recovery results, where all methods are ranked according to their performance and the best average rank on each metric is shown in boldface. In addition, the “$\downarrow$” after a metric means “the smaller the better”, while “$\uparrow$” means “the larger the better”. Based on the recovery results, we have the following observations:

  • DLDL achieves the lowest average rank in terms of all the four evaluation metrics. To be specific, out of the 24 statistical comparisons (6 datasets × 4 evaluation metrics), DLDL ranks 1st in all cases, which clearly shows the superiority of our approach in LE.

  • On some datasets, our method achieves superior performance. For example, on FBP, REN and Twitter, DLDL significantly improves the results compared with other algorithms in terms of the Chebyshev, One-error and Intersection.

  • Under the metric of One-error, DLDL shows overwhelming advantages. This is attributed to the constraint $\mathbf{0}\leq\boldsymbol{d}_{i}\leq\boldsymbol{y}_{i}$, which avoids assigning label description degrees to the negative logical labels.

Method Chebyshev\downarrow Avg.Rank Clark\downarrow Avg.Rank
NS SCUT RAF FBP REN Twitter NS SCUT RAF FBP REN Twitter
DLDL 0.0845(1) 0.2821(1) 0.3133(1) 0.2783(1) 0.0306(1) 0.2989(1) 1.00(1) 2.3668(1) 0.9538(1) 1.0991(1) 1.0588(1) 0.8568(1) 1.2227(1) 1.00(1)
L2\textit{L}^{2} 0.3556(6) 0.3818(7) 0.3837(5) 0.3966(7) 0.6445(3) 0.5415(7) 5.83(6) 2.4620(5) 1.4968(7) 1.5966(2) 1.5060(6) 2.6508(3) 2.4016(7) 5.00(5)
FLE 0.3496(5) 0.3780(6) 0.3901(7) 0.3863(5) 0.6637(4) 0.3310(3) 5.00(5) 2.4682(6) 1.4949(6) 1.6153(7) 1.5011(5) 2.6574(4) 2.3846(6) 5.67(7)
GLLE 0.3257(2) 0.3466(2) 0.3801(3) 0.3630(2) 0.6686(5) 0.4710(4) 3.00(2) 2.4285(3) 1.4722(2) 1.6100(6) 1.4787(3) 2.6598(5) 2.3669(3) 3.67(3)
LEMLL 0.3291(3) 0.3527(3) 0.3699(2) 0.3897(6) 0.6369(2) 0.3271(2) 3.00(2) 2.4279(2) 1.4855(5) 1.6049(3) 1.6214(7) 2.6420(2) 2.3647(2) 3.50(2)
LESC 0.3601(7) 0.3665(5) 0.3845(6) 0.3790(4) 0.6746(7) 0.5105(6) 5.83(6) 2.4751(7) 1.4847(4) 1.6063(5) 1.4904(4) 2.6650(7) 2.3844(5) 5.33(6)
FCM 0.3466(4) 0.3596(4) 0.3825(4) 0.3638(3) 0.6725(6) 0.5064(5) 4.33(4) 2.4540(4) 1.4778(3) 1.6055(4) 1.4770(2) 2.6642(6) 2.3832(4) 3.83(4)
Method One-error\downarrow Avg.Rank Intersection\uparrow Avg.Rank
NS SCUT RAF FBP REN Twitter NS SCUT RAF FBP REN Twitter
DLDL 0.0000(1) 0.0037(1) 0.0189(1) 0.0912(1) 0.0000(1) 0.0859(1) 1.00(1) 0.9082(1) 0.6894(1) 0.6298(1) 0.6987(1) 0.9694(1) 0.6985(1) 1.00(1)
L2\textit{L}^{2} 0.6119(3) 0.2743(6) 0.2810(2) 0.2765(4) 0.8157(7) 0.6578(6) 4.67(5) 0.4123(4) 0.5068(5) 0.4976(6) 0.5014(7) 0.2430(2) 0.3444(7) 5.17(6)
FLE 0.6364(6) 0.2693(5) 0.2904(7) 0.2754(3) 0.8130(6) 0.6569(3) 5.00(7) 0.3907(5) 0.5197(4) 0.4949(7) 0.5195(6) 0.2133(3) 0.5023(3) 4.67(4)
GLLE 0.6259(4) 0.2663(3) 0.2879(5) 0.2774(5) 0.8109(4) 0.6575(5) 4.33(3) 0.4586(3) 0.5623(2) 0.5077(3) 0.5466(2) 0.2046(4) 0.4264(4) 3.00(2)
LEMLL 0.6348(5) 0.2679(4) 0.2883(6) 0.3187(7) 0.7377(2) 0.6161(2) 4.33(3) 0.4739(2) 0.4635(6) 0.5280(2) 0.5368(4) 0.1826(7) 0.5866(2) 3.83(3)
LESC 0.6386(7) 0.2894(7) 0.2860(3) 0.2752(2) 0.8117(5) 0.6569(4) 4.67(5) 0.3885(6) 0.5350(3) 0.5005(5) 0.5228(5) 0.1948(7) 0.3726(6) 5.33(7)
FCM 0.5968(2) 0.2635(2) 0.2874(4) 0.2780(6) 0.7943(3) 0.6595(7) 4.00(2) 0.3724(7) 0.4319(7) 0.5011(4) 0.5437(3) 0.1852(2) 0.3826(5) 4.67(4)
Table 1: The recovery results of training instances on the six datasets; the best average rank (i.e., Avg.Rank) is shown in boldface. The full table with standard deviations is provided in Section D of the supplementary file.
Method Chebyshev\downarrow Avg.Rank Clark\downarrow Avg.Rank
NS SCUT RAF FBP REN Twitter NS SCUT RAF FBP REN Twitter
SA-BFGS 0.3533 0.4202 0.1583 0.3323 0.5871 0.3532 - 2.4212 1.5491 1.4531 1.4703 2.6407 2.5577 -
DLDL 0.4071(2) 0.4086(2) 0.3911(1) 0.2972(1) 0.6164(1) 0.3657(1) 1.33(1) 2.5059(1) 1.5321(1) 1.6165(1) 1.1913(1) 2.6403(1) 2.5688(1) 1.00(1)
L2\textit{L}^{2} 0.4967(7) 0.4468(7) 0.4064(7) 0.4118(7) 0.6913(7) 0.5444(7) 7.00(7) 2.5111(4) 1.6470(7) 1.6356(7) 1.5491(7) 2.6723(6) 2.4084(3) 5.67(7)
FLE 0.4011(1) 0.4254(6) 0.3934(3) 0.3915(6) 0.6897(6) 0.4330(2) 4.00(5) 2.5062(2) 1.5426(3) 1.6246(5) 1.5013(5) 2.6735(7) 2.3951(4) 4.33(4)
GLLE 0.4257(3) 0.4006(1) 0.3968(4) 0.3641(3) 0.6833(4) 0.4807(4) 3.16(2) 2.5203(6) 1.6169(6) 1.6242(4) 1.4785(2) 2.6686(4) 2.3764(7) 4.83(6)
LEMLL 0.4490(4) 0.4148(5) 0.3975(5) 0.3286(2) 0.6321(2) 0.4706(3) 3.50(3) 2.5278(7) 1.5548(5) 1.6334(6) 1.5188(6) 2.6486(2) 2.4233(2) 4.67(5)
LESC 0.4743(6) 0.4143(4) 0.4008(6) 0.3795(5) 0.6873(5) 0.5138(6) 5.33(6) 2.5065(3) 1.5353(2) 1.6172(2) 1.4915(4) 2.6714(5) 2.3856(5) 3.50(2)
FCM 0.4692(5) 0.4132(3) 0.3931(2) 0.3689(4) 0.6736(3) 0.5071(5) 3.67(4) 2.5192(5) 1.5452(4) 1.6216(3) 1.4828(3) 2.6610(3) 2.3832(6) 4.00(3)
Method One-error\downarrow Avg.Rank Intersection\uparrow Avg.Rank
NS SCUT RAF FBP REN Twitter NS SCUT RAF FBP REN Twitter
SA-BFGS 0.4796 0.2940 0.2481 0.2718 0.7898 0.6075 - 0.5183 0.4731 0.8029 0.6226 0.3269 0.6035 -
DLDL 0.5852(1) 0.2924(1) 0.2853(1) 0.1134(1) 0.1939(1) 0.3015(1) 1.00(1) 0.4347(2) 0.4825(1) 0.4864(1) 0.6523(1) 0.2719(1) 0.6025(1) 1.17(1)
L2\textit{L}^{2} 0.6549(7) 0.3404(7) 0.2909(4) 0.3171(7) 0.8149(4) 0.6605(7) 6.00(7) 0.3695(5) 0.3871(7) 0.4689(6) 0.5482(3) 0.1852(7) 0.3387(7) 5.83(7)
FLE 0.6525(6) 0.3093(4) 0.2931(6) 0.2754(2) 0.8159(5) 0.5791(3) 4.33(3) 0.3530(7) 0.4810(3) 0.4802(3) 0.5087(7) 0.1864(6) 0.3523(4) 5.00(6)
GLLE 0.6401(3) 0.3384(6) 0.2924(5) 0.2772(4) 0.8162(6) 0.6602(6) 5.00(6) 0.4002(3) 0.4250(6) 0.4643(5) 0.5455(4) 0.1974(4) 0.4117(2) 4.00(4)
LEMLL 0.6170(2) 0.3030(2) 0.2870(2) 0.2919(6) 0.8090(2) 0.5539(2) 2.67(2) 0.4412(1) 0.4514(5) 0.4776(4) 0.5929(2) 0.2636(2) 0.3392(6) 3.33(2)
LESC 0.6431(5) 0.3073(3) 0.2902(3) 0.2770(3) 0.8214(7) 0.6589(5) 4.33(3) 0.3772(4) 0.4679(4) 0.4703(5) 0.5238(6) 0.1910(5) 0.3770(3) 4.50(5)
FCM 0.6417(4) 0.3151(5) 0.2941(7) 0.2775(5) 0.8132(3) 0.6177(4) 4.67(5) 0.3583(6) 0.4818(2) 0.4824(2) 0.5389(5) 0.2070(3) 0.3408(5) 3.83(3)
Table 2: The predictive results of testing instances on the six datasets; the best average rank (i.e., Avg.Rank) is shown in boldface. The full table with standard deviations is provided in Section D of the supplementary file.

4.3.2 Predictive Results

In this subsection, SA-BFGS is used as the predictive model to generate the LDs of testing instances for the five LE methods, while DLDL and $L^{2}$ can directly predict the LDs of testing instances. Then, we rank all methods and highlight the best average rank on each metric in boldface. The detailed predictive results are presented in Table 2. In addition, SA-BFGS is also trained on the ground-truth LDs of the training instances, and its results are recorded as the upper bound of the predictive results. From the reported results, we observe that:

  • DLDL achieves the lowest average rank in terms of all four evaluation metrics. Specifically, out of the 24 statistical comparisons, DLDL ranks 1st in 87.5% of the cases and 2nd in the remaining 12.5%. In general, DLDL performs better than most comparison algorithms.

  • In some cases, the performance of our approach even exceeds the upper bound. For example, on the SCUT dataset, DLDL performs better than SA-BFGS trained on the ground-truth LDs in terms of Chebyshev, Clark and Intersection, and slightly better on One-error, which shows that our model has considerable potential in learning from logical labels directly.

  • Although $L^{2}$ is also a joint model, it does not consider the restrictions of the label distribution and adopts an inappropriate fidelity term. Compared with $L^{2}$, our method DLDL achieves lower average ranks in terms of all four metrics in both the recovery and the predictive results, indicating the effectiveness and superiority of our method.

Method Recovery Predictive
NS SCUT RAF FBP REN Twitter NS SCUT RAF FBP REN Twitter
DLDL 0.1086 0.2854 0.3142 0.2771 0.0328 0.2091 0.3474 0.4021 0.3009 0.3113 0.6451 0.2622
α=0\alpha=0 0.2509 0.3072 0.3516 0.3687 0.0692 0.2092 0.4148 0.4207 0.3068 0.4867 0.7337 0.3515
β=0\beta=0 0.1431 0.2943 0.3441 0.4254 0.0401 0.3026 0.4133 0.4104 0.3219 0.4730 0.6890 0.2911
γ=0\gamma=0 0.1087 0.3029 0.3357 0.3419 0.0400 0.3026 0.3649 0.4069 0.3115 0.4514 0.6733 0.3178
Table 3: Ablation study of recovery and predictive results (the selected metric is Chebyshev\downarrow)

4.4 Further Analysis

4.4.1 Ablation Study

To verify the necessity of the terms involved in our approach, we conduct ablation experiments and present the results on Chebyshev in Table 3, where $\alpha$, $\beta$ and $\gamma$ are the hyper-parameters that adjust the contributions of the different terms. When one of the hyper-parameters is fixed to 0, the remaining ones are determined by the performance on the validation set. From the results, we observe that the performance of DLDL degrades clearly when the term controlled by $\beta$ is removed, indicating that it is critical to control the smoothness of the label distribution matrix $\mathbf{D}$. In general, DLDL outperforms its variants with individual terms removed, which verifies the effectiveness and rationality of our model.

4.4.2 Significance Test

In this subsection, a Bonferroni–Dunn test is applied to check whether DLDL achieves competitive performance compared to other algorithms. Through this test, we find that DLDL performs significantly better than the compared algorithms on the four metrics. For detailed significance test results, please refer to Section E in the supplementary file.

5 Conclusion and Future Work

This paper gives a preliminary positive answer to the question “can we directly train an LDL model from the logical labels” both theoretically and empirically. Specifically, we propose a new algorithm called DLDL, which unifies the conventional label enhancement and label distribution learning into a joint model. Moreover, our method avoids some common issues faced by the previous LE methods. Extensive experiments validate the advantage of DLDL against other LE algorithms in label enhancement, and also confirm the effectiveness of our method in directly training an LDL model from the logical labels. Nevertheless, our method is still inferior to the traditional LDL model when the ground-truth LD of the training set is available. In the future, we will explore possible ways to improve the predictive performance of our algorithm.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant (62106044) and the Natural Science Foundation of Jiangsu Province under Grant (BK20210221).

References

  • Berger et al. [1996] Adam L. Berger, Stephen Della Pietra, and Vincent J. Della Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39–71, 1996.
  • Gayar et al. [2006] Neamat El Gayar, Friedhelm Schwenker, and Günther Palm. A study of the robustness of KNN classifiers trained using soft labels. In Artificial Neural Networks in Pattern Recognition, Second IAPR Workshop, pages 67–80. Springer, 2006.
  • Geng et al. [2022] Xin Geng, Renyi Zheng, Jiaqi Lv, and Yu Zhang. Multilabel ranking with inconsistent rankers. IEEE Transactions on Pattern Analysis & Machine Intelligence, 44(09):5211–5224, 2022.
  • Geng [2016] Xin Geng. Label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.
  • Jia et al. [2020a] Yuheng Jia, Hui Liu, Junhui Hou, and Sam Kwong. Clustering-aware graph construction: A joint learning perspective. IEEE Trans. Signal Inf. Process. over Networks, 6:357–370, 2020.
  • Jia et al. [2020b] Yuheng Jia, Hui Liu, Junhui Hou, and Sam Kwong. Pairwise constraint propagation with dual adversarial manifold regularization. IEEE Trans. Neural Networks Learn. Syst., 31(12):5575–5587, 2020.
  • Liang et al. [2018] Lingyu Liang, Luojun Lin, Lianwen Jin, Duorui Xie, and Mengru Li. SCUT-FBP5500: A diverse benchmark dataset for multi-paradigm facial beauty prediction. In 24th International Conference on Pattern Recognition, pages 1598–1603. IEEE Computer Society, 2018.
  • Liu et al. [2021a] Xinyuan Liu, Jihua Zhu, Zhongyu Li, Zhiqiang Tian, Xiuyi Jia, and Lei Chen. Unified framework for learning with label distribution. Inf. Fusion, 75:116–130, 2021.
  • Liu et al. [2021b] Xinyuan Liu, Jihua Zhu, Qinghai Zheng, Zhongyu Li, Ruixin Liu, and Jun Wang. Bidirectional loss function for label enhancement and distribution learning. Knowledge-Based Systems, 213:106690, 2021.
  • Quan and Ren [2009] Changqin Quan and Fuji Ren. Construction of a blog emotion corpus for chinese emotional expression analysis. In Proceedings of the 2009 conference on empirical methods in natural language processing, pages 1446–1454, 2009.
  • Shang and Deng [2019] Li Shang and Weihong Deng. Blended emotion in-the-wild: Multi-label facial expression recognition using crowdsourced annotations and deep locality feature learning. International Journal of Computer Vision, 127(6-7):884–906, 2019.
  • Shao et al. [2018] Ruifeng Shao, Ning Xu, and Xin Geng. Multi-label learning with label enhancement. In 2018 IEEE International Conference on Data Mining (ICDM), pages 437–446, 2018.
  • Shen et al. [2017] Wei Shen, Kai Zhao, Yilu Guo, and Alan L. Yuille. Label distribution learning forests. CoRR, abs/1702.06086, 2017.
  • Tsoumakas and Katakis [2007] Grigorios Tsoumakas and Ioannis Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1–13, 2007.
  • Wang et al. [2023] Ke Wang, Ning Xu, Miaogen Ling, and Xin Geng. Fast label enhancement for label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 35(2):1502–1514, 2023.
  • Xie et al. [2015] Duorui Xie, Lingyu Liang, Lianwen Jin, Jie Xu, and Mengru Li. Scut-fbp: A benchmark dataset for facial beauty perception. In 2015 IEEE International Conference on Systems, Man, and Cybernetics, pages 1821–1826, 2015.
  • Xu et al. [2019] Suping Xu, Lin Shang, and Furao Shen. Latent semantics encoding for label distribution learning. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 3982–3988. ijcai.org, 2019.
  • Xu et al. [2021] Ning Xu, Yun-Peng Liu, and Xin Geng. Label enhancement for label distribution learning. IEEE Transactions on Knowledge and Data Engineering, 33(4):1632–1643, 2021.
  • Yang et al. [2016] Xu Yang, Xin Geng, and Deyu Zhou. Sparsity conditional energy label distribution learning for age estimation. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence, pages 2259–2265. IJCAI/AAAI Press, 2016.
  • Yang et al. [2017] Jufeng Yang, Ming Sun, and Xiaoxiao Sun. Learning visual sentiment distributions via augmented conditional probability neural network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, pages 224–230. AAAI Press, 2017.
  • Zhang and Zhou [2014] Min-Ling Zhang and Zhi-Hua Zhou. A review on multi-label learning algorithms. IEEE Transactions on Knowledge and Data Engineering, 26(8):1819–1837, 2014.
  • Zhang et al. [2021] Min-Ling Zhang, Qian-Wen Zhang, Jun-Peng Fang, Yu-Kun Li, and Xin Geng. Leveraging implicit relative labeling-importance information for effective multi-label learning. IEEE Transactions on Knowledge and Data Engineering, 33(5):2057–2070, 2021.
  • Zhao and Zhou [2018] Peng Zhao and Zhi-Hua Zhou. Label distribution learning by optimal transport. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), pages 4506–4513. AAAI Press, 2018.
  • Zheng et al. [2023] Qinghai Zheng, Jihua Zhu, Haoyu Tang, Xinyuan Liu, Zhongyu Li, and Huimin Lu. Generalized label enhancement with sample correlations. IEEE Transactions on Knowledge and Data Engineering, 35(1):482–495, 2023.