Label-Wise Document Pre-Training for Multi-Label Text Classification
Beijing University of Posts and Telecommunications, China
Email: {liuhan,yuancx,xjwang}@bupt.edu.cn
Abstract
A major challenge of multi-label text classification (MLTC) is to simultaneously exploit possible label differences and label correlations. In this paper, we tackle this challenge by developing a Label-Wise Pre-Training (LW-PT) method to obtain document representations with label-aware information. The basic idea is that a multi-label document can be represented as a combination of multiple label-wise representations, and that correlated labels always co-occur in the same or similar documents. LW-PT implements this idea by constructing label-wise document classification tasks and training label-wise document encoders. Finally, the pre-trained label-wise encoders are fine-tuned on the downstream MLTC task. Extensive experimental results validate that the proposed method has significant advantages over previous state-of-the-art models and is able to discover reasonable label relationships. The code is released to facilitate other researchers: https://github.com/laddie132/LW-PT
Keywords: Pre-Training · Document Representation · Multi-Label Classification
1 Introduction
Multi-label text classification (MLTC) is the task of assigning a document to one or more class labels. In recent years, MLTC has been widely used in many scenarios, such as tag recommendation [7], information retrieval [8], and sentiment analysis [4]. Different from traditional single-label classification, where one instance is associated with one target label, in MLTC a text is naturally associated with multiple labels. This makes it both an essential and challenging task in Natural Language Processing (NLP).
A straightforward solution to MLTC is to decompose the problem into a series of binary text classification problems [3], one for each label. Such a solution, however, neglects the fact that information about one label may help in learning another related label; in real-world multi-label text classification applications, one label is often associated with other labels. For example, in a music tagging task, the music style "metal" is naturally associated with "rock". It is well accepted that, to achieve good performance, we should consider not only the differences between labels but also their correlations. Especially when some labels have insufficient training examples, the label correlations may provide helpful extra information.
In recent years, deep learning models such as CNN [10] and LSTM [9] have been firmly established as state-of-the-art approaches to text representation. However, these methods cannot learn label differences, since they simply apply logistic regression to each label independently to achieve multi-label classification. There are also sequence learning models for MLTC, such as CNN-RNN [5] and the Sequence Generation Model (SGM) [18], which generate a sequence of possible labels with an RNN decoder. Although label differences can be captured by attention over the decoder states, the correlation between arbitrary pairs of labels cannot be dynamically modeled by the fixed recurrent decoder.
To alleviate the above problems, we propose a novel pre-training task and model, Label-Wise Pre-Training (LW-PT), to obtain label-wise document representations. Given a target document and one of its labels, we sample a document that also has this label as a positive example, and several documents that do not have this label as negative examples. The pre-training goal is to learn a classifier that distinguishes the positive example from the negative ones. In this way, we train a document encoder that captures label-sensitive information. Finally, each document is represented as the concatenation of all label-wise representations.
Specifically, to explicitly model label differences, we propose two label-wise encoders that incorporate a self-attention mechanism into the pre-training task: the Label-Wise LSTM (LW-LSTM) encoder for short documents and the Hierarchical Label-Wise LSTM (HLW-LSTM) encoder for long documents. For the document representation on each label, they share the same LSTM layer but use different attention weights for different labels.
Obviously, labels that appear together in a document may have similar concepts. For example, the music style labels "metal" and "rock" tend to appear together, so the pre-training tasks for these two labels may share the same or similar training samples. Therefore, the label-wise document representation can also capture label correlations.
The label-wise document representation is fine-tuned with an MLP layer for multi-label classification. Experiments demonstrate that our method achieves state-of-the-art performance and substantially improves over several strong baselines.
Our contributions are as follows:
1. We propose a novel label-wise pre-training task and model, LW-PT, for the MLTC task, which encodes a document as a combination of label-wise representations by effectively exploiting both label differences and label correlations.
2. Two label-wise encoders, LW-LSTM and HLW-LSTM, are proposed for pre-training and fine-tuning, capturing label-aware information for both short and long documents.
3. To the best of our knowledge, this is the first pre-training task designed to obtain label-wise document representations. Experiments show that the proposed method outperforms previous approaches on several datasets.
2 Method
This section introduces the proposed method, including the pre-training task and the multi-label classification model, as shown in Figure 1. The former constructs a label-wise representation for a document, and the latter fuses the label-wise representations for MLTC.
[Figure 1: (a) the label-wise pre-training task and model LW-PT; (b) the downstream multi-label classification model.]
2.1 Task & Model
We design a multi-label pre-training task and model, Label-Wise Pre-Training (LW-PT). Given a target document $d$ and its label set $\mathcal{L}_d$, for each label $l \in \mathcal{L}_d$ we sample one positive example that also has label $l$ and several negative examples that do not have label $l$. These two parts are combined into a candidate set of size $K$. The training goal is to discriminate the positive example from all $K$ candidates. We repeat this for every label in the label set of the target document to construct the training samples.
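To make the sampling procedure concrete, the following minimal sketch builds one pre-training instance per (document, label) pair. This is our own illustration rather than the released code; the function name, the tuple layout, and the split of the candidate set into one positive and $K-1$ negatives are assumptions.

```python
import random
from typing import Dict, List, Set, Tuple

def build_pretraining_samples(labels: List[Set[str]],
                              num_candidates: int = 3,
                              seed: int = 0) -> List[Tuple[int, str, List[int], int]]:
    """For each (document, label) pair, sample one positive candidate sharing the
    label and (num_candidates - 1) negative candidates without it.
    Returns (target_doc_idx, label, candidate_doc_indices, positive_position)."""
    rng = random.Random(seed)
    by_label: Dict[str, List[int]] = {}            # label -> indices of documents having it
    for i, label_set in enumerate(labels):
        for l in label_set:
            by_label.setdefault(l, []).append(i)

    samples = []
    for i, label_set in enumerate(labels):
        for l in label_set:
            positives = [j for j in by_label[l] if j != i]
            negatives = [j for j in range(len(labels)) if l not in labels[j]]
            if not positives or len(negatives) < num_candidates - 1:
                continue                           # not enough documents to sample from
            pos = rng.choice(positives)
            candidates = rng.sample(negatives, num_candidates - 1) + [pos]
            rng.shuffle(candidates)
            samples.append((i, l, candidates, candidates.index(pos)))
    return samples
```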
Figure 1-a shows the architecture of the pre-training task and the LW-PT model. We denote the target document encoder as T-Encoder and the candidate encoder as C-Encoder; the two encoders do not share parameters.
Therefore, for label $l$, we compute the representations of the target document $d$ and the candidates $c_1, \dots, c_K$, obtaining $t_l$ and $c_{1,l}, \dots, c_{K,l}$:

$t_l = \text{T-Encoder}(d, l)$   (1)
$c_{i,l} = \text{C-Encoder}(c_i, l)$   (2)

Further, the similarity between the target and each candidate is calculated directly by a dot product, and the model is trained with the negative log-likelihood loss:

$\mathcal{L}_{\mathrm{PT}} = -\log \dfrac{\exp(t_l \cdot c_{+,l})}{\sum_{i=1}^{K} \exp(t_l \cdot c_{i,l})}$   (3)

where $c_{+,l}$ denotes the representation of the positive candidate.
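Under this notation, the objective is a softmax over dot-product scores against the candidate set. A minimal PyTorch-style sketch follows; the tensor shapes and function name are our assumptions.

```python
import torch
import torch.nn.functional as F

def pretraining_loss(t_l: torch.Tensor,
                     c_l: torch.Tensor,
                     positive_idx: torch.Tensor) -> torch.Tensor:
    """t_l: (batch, hidden) label-wise representation of the target document from
    T-Encoder; c_l: (batch, K, hidden) label-wise representations of the K candidates
    from C-Encoder; positive_idx: (batch,) position of the positive candidate.
    Dot-product scores followed by negative log-likelihood, i.e. cross-entropy over
    the candidate set (Eq. 3)."""
    scores = torch.bmm(c_l, t_l.unsqueeze(-1)).squeeze(-1)   # (batch, K)
    return F.cross_entropy(scores, positive_idx)
```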
Obviously, different encoders can be plugged into this framework. In the MLTC task, different labels may attend to different parts of a document, so we design label-wise encoders. Normally, the Label-Wise LSTM (LW-LSTM) meets this requirement; however, LSTM is not effective on long documents. Inspired by the Hierarchical Attention Network (HAN) [19], we also introduce the Hierarchical Label-Wise LSTM (HLW-LSTM) to model long documents consisting of several sentences. The details are described below.
2.1.1 LW-LSTM
We use a BiLSTM and a self-attention mechanism to obtain the document representation on label $l$. Denote the word-level embedding of a document as $E \in \mathbb{R}^{n \times e}$, where $n$ is the maximum number of words in a document and $e$ is the embedding dimension.

$H = \mathrm{BiLSTM}(E)$   (4)
$\alpha_l = \mathrm{softmax}(H u_l)$   (5)
$v_l = \alpha_l^{\top} H$   (6)

where $H \in \mathbb{R}^{n \times 2h}$, $\alpha_l \in \mathbb{R}^{n}$, and $u_l \in \mathbb{R}^{2h}$ is a trainable context vector for label $l$.
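A compact PyTorch sketch of Eqs. (4)-(6) is given below, assuming a shared BiLSTM and one trainable context vector per label; the module and parameter names are ours, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LWLSTMEncoder(nn.Module):
    """Label-wise BiLSTM encoder: one shared BiLSTM over the words, plus one
    attention context vector per label (a sketch of Eqs. 4-6)."""

    def __init__(self, embed_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, num_layers=2,
                              batch_first=True, bidirectional=True)
        # One trainable context vector u_l per label.
        self.label_context = nn.Parameter(torch.randn(num_labels, 2 * hidden_dim))

    def forward(self, word_embs: torch.Tensor, label: int) -> torch.Tensor:
        # word_embs: (batch, num_words, embed_dim)
        h, _ = self.bilstm(word_embs)                   # (batch, n, 2*hidden)
        scores = h @ self.label_context[label]          # (batch, n)
        alpha = F.softmax(scores, dim=-1)               # attention weights for this label
        return torch.einsum('bt,bth->bh', alpha, h)     # (batch, 2*hidden)
```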
2.1.2 HLW-LSTM
Similar to the Hierarchical Attention Network (HAN) [19], we use BiLSTMs with a hierarchical attention mechanism. For label $l$, denote the word-level embedding of a document as $E \in \mathbb{R}^{m \times n \times e}$, where $m$ is the maximum number of sentences in a document, $n$ is the maximum number of words in a sentence, and $e$ is the embedding dimension. Thus, $E_i$ represents the $i$-th sentence of a given document.

$H_i = \mathrm{BiLSTM}(E_i)$   (7)

where $H_i \in \mathbb{R}^{n \times 2h}$.

After that, we obtain the $i$-th sentence vector on a specific label $l$:

$\alpha_{i,l} = \mathrm{softmax}(H_i u_l^{w})$   (8)
$s_{i,l} = \alpha_{i,l}^{\top} H_i$   (9)

where $s_{i,l} \in \mathbb{R}^{2h}$, and $u_l^{w}$ is the word-level context vector on label $l$, which measures the importance of each word for the current label. It is a trainable parameter and randomly initialized.

Thus, every sentence vector of the document is obtained. As above, we use a BiLSTM and self-attention to get the document representation $v_l$:

$G_l = \mathrm{BiLSTM}([s_{1,l}, \dots, s_{m,l}])$   (10)
$\beta_l = \mathrm{softmax}(G_l u_l^{s})$   (11)
$v_l = \beta_l^{\top} G_l$   (12)

where $G_l \in \mathbb{R}^{m \times 2h}$ and $\beta_l \in \mathbb{R}^{m}$. Here $u_l^{s}$ is the sentence-level context vector on label $l$, which measures the importance of each sentence for the current label; it is also a trainable parameter and randomly initialized.
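The hierarchical variant can be sketched analogously: a word-level BiLSTM with per-label attention yields sentence vectors (Eqs. 7-9), and a sentence-level BiLSTM with per-label attention yields the document vector (Eqs. 10-12). Again, the names and dimensions below are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HLWLSTMEncoder(nn.Module):
    """Hierarchical label-wise encoder: word-level BiLSTM + per-label attention
    produces sentence vectors, sentence-level BiLSTM + per-label attention
    produces the document vector (a sketch of Eqs. 7-12)."""

    def __init__(self, embed_dim: int, hidden_dim: int, num_labels: int):
        super().__init__()
        self.word_lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.sent_lstm = nn.LSTM(2 * hidden_dim, hidden_dim, batch_first=True,
                                 bidirectional=True)
        self.word_context = nn.Parameter(torch.randn(num_labels, 2 * hidden_dim))
        self.sent_context = nn.Parameter(torch.randn(num_labels, 2 * hidden_dim))

    @staticmethod
    def _attend(h: torch.Tensor, u: torch.Tensor) -> torch.Tensor:
        # h: (batch, T, 2H), u: (2H,) -> weighted sum over T
        alpha = F.softmax(h @ u, dim=-1)
        return torch.einsum('bt,bth->bh', alpha, h)

    def forward(self, word_embs: torch.Tensor, label: int) -> torch.Tensor:
        # word_embs: (batch, num_sents, num_words, embed_dim)
        b, s, w, e = word_embs.shape
        h_w, _ = self.word_lstm(word_embs.view(b * s, w, e))       # (b*s, w, 2H)
        sent_vecs = self._attend(h_w, self.word_context[label])    # (b*s, 2H)
        h_s, _ = self.sent_lstm(sent_vecs.view(b, s, -1))           # (b, s, 2H)
        return self._attend(h_s, self.sent_context[label])          # (b, 2H)
```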
2.2 Multi-Label Classification
After pre-training, we use T-Encoder and C-Encoder in the downstream model for the MLTC task. As illustrated in Figure 1-b, we concatenate the outputs of T-Encoder and C-Encoder to obtain the document representation $r_l$ for each label $l$. Then, all label-wise representations are concatenated to obtain the final representation $r$ for a given document $d$:

$r_l^{T} = \text{T-Encoder}(d, l), \quad r_l^{C} = \text{C-Encoder}(d, l)$   (13)
$r_l = [r_l^{T}; r_l^{C}]$   (14)
$r = [r_1; r_2; \dots; r_{|\mathcal{L}|}]$   (15)
After that, we simply use an MLP layer to predict the probability $\hat{y}_l$ of each label $l$:

$\hat{y} = \sigma(W r + b)$   (16)

where $W$ and $b$ are trainable parameters and $\sigma$ is the sigmoid function.
We use the (binary) cross-entropy loss for the MLTC task:

$\mathcal{L}_{\mathrm{MLTC}} = -\sum_{l=1}^{|\mathcal{L}|} \big[ y_l \log \hat{y}_l + (1 - y_l) \log(1 - \hat{y}_l) \big]$   (17)

where $\hat{y}_l$ is the predicted probability of label $l$ for the current document, and $y_l$ is the ground-truth indicator of whether label $l$ appears in the current document.
It should be noted that T-Encoder and C-Encoder are fine-tuned together with the downstream MLTC model.
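A sketch of the downstream head described by Eqs. (13)-(17) is given below: the two pre-trained encoders are kept trainable, their outputs are concatenated per label and across labels, and a single linear layer with sigmoids gives the per-label probabilities. The interface and names follow our sketches above and are assumptions.

```python
import torch
import torch.nn as nn

class LWPTClassifier(nn.Module):
    """Downstream MLTC head: per-label concat of T-Encoder and C-Encoder outputs,
    concat across labels, then one linear layer with sigmoid per label.
    Both encoders stay trainable, so they are fine-tuned with the classifier."""

    def __init__(self, t_encoder: nn.Module, c_encoder: nn.Module,
                 encoder_dim: int, num_labels: int):
        super().__init__()
        self.t_encoder, self.c_encoder = t_encoder, c_encoder
        self.num_labels = num_labels
        self.mlp = nn.Linear(2 * encoder_dim * num_labels, num_labels)

    def forward(self, word_embs: torch.Tensor) -> torch.Tensor:
        per_label = [torch.cat([self.t_encoder(word_embs, l),
                                self.c_encoder(word_embs, l)], dim=-1)
                     for l in range(self.num_labels)]       # Eqs. (13)-(14)
        doc_repr = torch.cat(per_label, dim=-1)              # Eq. (15)
        return torch.sigmoid(self.mlp(doc_repr))             # Eq. (16)

# Eq. (17): binary cross-entropy over the per-label probabilities.
loss_fn = nn.BCELoss()
```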
3 Experiments
In this section, we evaluate the proposed method on two datasets with short and long documents, and compare with previous state-of-the-art models on several widely used metrics.
3.1 Dataset
We use two MLTC datasets, RMSC [20] and AAPD [18], which differ in document length and language: the former consists of longer Chinese documents with several sentences, while the latter consists of shorter English abstracts. Table 1 shows their statistics.
• RMSC-V2 [20]: The dataset is collected from a popular Chinese music review website (https://music.douban.com). Each piece of music includes a set of human-annotated styles and the associated reviews; 22 styles are defined. The dataset contains over 7.1k samples, 288K reviews, and 3.6M words. However, the initial version does not provide an official split, so we split the dataset into train/valid/test with the same ratio as the original paper.
• AAPD [18]: The dataset is collected from an English academic website (https://arxiv.org) in the computer science field. Each sample contains an abstract and its corresponding subjects. 54 labels are defined and 55,840 samples are included.
Table 1: Statistics of the datasets (number of train/valid/test samples, number of labels, average labels per sample, average words per sample).

Datasets | Train | Valid | Test | Labels | Avg. labels | Avg. words |
---|---|---|---|---|---|---|
RMSC-V2 | 5020 | 646 | 1506 | 22 | 2.22 | 497.09 |
AAPD | 53,840 | 1,000 | 1,000 | 54 | 2.41 | 167.27 |
3.2 Evaluation Metrics
We use four widely used evaluation metrics, following [5, 18, 20]; a computation sketch is given after the list.
• One-Error: the fraction of samples whose top-1 predicted label is not in the ground truth.
• Hamming Loss: the fraction of incorrectly predicted labels over the total number of labels.
• Macro-F1: the average of the per-label F1 scores.
• Micro-F1: the F1 score computed over all sample-label pairs.
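As referenced above, the four metrics can be computed from predicted probabilities roughly as follows. The 0.5 threshold and the use of scikit-learn are our assumptions and not necessarily the paper's exact evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score, hamming_loss

def evaluate(y_true: np.ndarray, y_prob: np.ndarray, threshold: float = 0.5) -> dict:
    """y_true: (N, L) binary ground-truth matrix; y_prob: (N, L) predicted probabilities."""
    y_pred = (y_prob >= threshold).astype(int)
    top1 = y_prob.argmax(axis=1)
    # One-Error: top-1 predicted label is not among the true labels.
    one_error = float(np.mean(y_true[np.arange(len(y_true)), top1] == 0))
    return {
        "one_error": one_error,
        "hamming_loss": hamming_loss(y_true, y_pred),
        "macro_f1": f1_score(y_true, y_pred, average="macro", zero_division=0),
        "micro_f1": f1_score(y_true, y_pred, average="micro", zero_division=0),
    }
```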
3.3 Details
For the RMSC-V2 dataset, we tokenize with the jieba toolkit (https://github.com/fxsjy/jieba) and train word embeddings with the Skip-gram model [12]; the embedding dimension and hidden size are both 100. For the AAPD dataset, we also train word embeddings with the Skip-gram model [12]; the embedding dimension and hidden size are both 256.
We use a 2-layer LSTM for the LW-LSTM encoder and a hierarchical LSTM for the HLW-LSTM encoder. The Adam optimizer [11] with learning rate 0.001 is used, together with layer normalization [2] and dropout with probability 0.2. In the pre-training procedure, we run 3k iterations with 3 candidate documents and a batch size of 128. In fine-tuning, we train for 20 epochs.
It should be noted that the pre-training procedure is conducted only on the training split of each dataset.
3.4 Baselines
We compare our method with several MLTC baselines.
• Binary Relevance (BR) [3] transforms the multi-label problem into several single-label classification problems and ignores the correlations between labels.
• Classifier Chains (CC) [16] transforms the multi-label problem into a chain of single-label classification problems.
• Label Powerset (LP) [17] transforms the multi-label problem into a multi-class problem, training one multi-class classifier over all unique label sets.
• CNN [10] uses a convolutional neural network with several kernel sizes and predicts each label with logistic regression.
• LSTM [9] uses a 2-layer LSTM network with self-attention on the last layer to obtain the document representation, and predicts each label with logistic regression.
• Hierarchical Attention Network (HAN) [19] uses hierarchical attention at the word and sentence levels to obtain the document representation, and predicts each label with logistic regression.
• HAN+LG [20] introduces a label graph matrix into HAN.
• CNN-RNN [5] combines CNN and RNN in a Seq2Seq framework to generate labels one by one.
• SGM+GE [18] computes a weighted global embedding based on all labels, as opposed to just the top one, at each timestep.
• set-RNN [15] adapts RNN sequence models to the setting where the target is a set of labels rather than a sequence.
• reg-LSTM [1] proposes a simple BiLSTM architecture with regularization.
• MAGNET [13] uses a graph attention network to capture the attentive dependency structure among the labels.
3.5 Results
The complete results on the RMSC-V2 and AAPD datasets are shown in Table 2. We report One-Error, Hamming Loss, Macro F1, and Micro F1 for all models. Compared to previous approaches, our LW-LSTM or HLW-LSTM encoder with the pre-training mechanism achieves outstanding performance, and adding the fine-tuning procedure further improves performance by a substantial margin.
Note that the LW-LSTM row denotes directly using the LW-LSTM encoder in the downstream task model without the pre-training mechanism, and likewise for the HLW-LSTM row. We can see that pre-training and fine-tuning contribute significantly to the performance. Which label-wise encoder to use is determined by the dataset: for long documents with several sentences (e.g., RMSC), we suggest the HLW-LSTM encoder with hierarchical attention; for short documents with fewer sentences (e.g., AAPD), we suggest the LW-LSTM encoder.
We can also see that the Macro F1 score improves markedly when the pre-training mechanism is used. This is because the pre-training approach improves the learning of low-frequency labels by capturing the correlations between labels, and Macro F1 is the average of per-label F1 scores. These experimental results are consistent with our ideas.
Table 2: Results on RMSC-V2 (left four metric columns) and AAPD (right four metric columns). OE: One-Error; HL: Hamming Loss; (-) lower is better, (+) higher is better.

Models | OE(-) | HL(-) | Macro F1(+) | Micro F1(+) | OE(-) | HL(-) | Macro F1(+) | Micro F1(+) |
---|---|---|---|---|---|---|---|---|
BR | 74.4 | 0.083 | 24.7 | 41.8 | - | 0.0316 | - | 64.6 |
CC | 67.5 | 0.107 | 29.9 | 44.3 | - | 0.0306 | - | 65.4 |
LP | 56.2 | 0.096 | 37.7 | 50.3 | - | 0.0312 | - | 63.4 |
CNN | 23.84 | 0.0702 | 41.11 | 59.10 | 19.90 | 0.0264 | 40.37 | 65.00 |
LSTM | 24.24 | 0.0688 | 36.16 | 58.81 | 18.50 | 0.0253 | 39.16 | 67.08 |
HAN | 18.26 | 0.0590 | 53.18 | 66.75 | 15.50 | 0.0236 | 51.80 | 70.81 |
HAN+LG | 17.60 | 0.0580 | 55.20 | 68.21 | 14.50 | 0.0235 | 52.97 | 71.19 |
CNN-RNN | - | - | - | - | - | 0.0278 | - | 66.4 |
SGM+GE | - | - | - | - | - | 0.0245 | - | 71.0 |
set-RNN | - | - | - | - | - | 0.0241 | 54.8 | 72.0 |
reg-LSTM | - | - | - | - | - | - | - | 70.5 |
MAGNET | - | - | - | - | - | 0.0252 | - | 69.6 |
LW-LSTM | 16.67 | 0.0591 | 59.82 | 68.00 | 16.60 | 0.0238 | 53.57 | 71.82 |
+PT | 17.53 | 0.0588 | 65.37 | 69.62 | 16.70 | 0.0235 | 54.15 | 72.41 |
+FT | 17.53 | 0.0596 | 66.44 | 69.80 | 14.10 | 0.0227 | 59.18 | 72.80 |
HLW-LSTM | 14.94 | 0.0586 | 64.73 | 69.74 | 16.00 | 0.0239 | 52.61 | 70.37 |
+PT | 15.27 | 0.0583 | 67.04 | 70.68 | 17.30 | 0.0239 | 53.69 | 71.21 |
+FT | 14.54 | 0.0537 | 69.41 | 72.18 | 17.00 | 0.0241 | 55.67 | 71.31 |
3.6 Ablation Study
In this section, we conduct ablation studies on the RMSC-V2 dataset to validate the components of the proposed method.
Encoder
To demonstrate that the label-wise encoders are better suited to multi-label classification, we plug the traditional LSTM and HAN encoders into the pre-training mechanism. As shown in Table 3, the LW-LSTM encoder substantially outperforms the LSTM encoder under pre-training, and HLW-LSTM likewise outperforms HAN. This indicates that the label-wise approach is very effective and plays a vital role in the MLTC task.
Model | OE(-) | HL(-) | Macro F1(+) | Micro F1(+) |
---|---|---|---|---|
LSTM+PT | 19.99 | 0.0623 | 48.36 | 63.71 |
LW-LSTM+PT | 17.53 | 0.0588 | 65.37 | 69.62 |
HAN+PT | 17.93 | 0.0623 | 50.02 | 64.94 |
HLW-LSTM+PT | 15.27 | 0.0583 | 67.04 | 70.68 |
Size of Candidates
To further explore the pre-training mechanism, we tried several candidate-set sizes, namely 3, 4, and 5. As shown in Table 4, three candidates with the HLW-LSTM encoder and pre-training performs best. Too many candidates may hurt the learning of the documents, while too few candidates make the task too simple to learn effectively.
n | OE(-) | HL(-) | Macro F1(+) | Micro F1(+) |
---|---|---|---|---|
3 | 15.27 | 0.0583 | 67.04 | 70.68 |
4 | 15.80 | 0.0568 | 65.03 | 70.54 |
5 | 15.94 | 0.0577 | 65.23 | 70.46 |
3.7 Analysis
To examine the ability to discover label correlations, we take three correlated labels, "alternative", "metal", and "rock", from the music tagging task of the RMSC-V2 dataset as an example. In Table 5, the target song (i.e., document) "Janes Addiction - Nothings Shocking" has the labels "alternative" and "rock". We compute its top-50 most similar songs via the cosine similarity of the document representations learnt from each of the three labels "alternative", "metal", and "rock". For each set of 50 similar songs, we calculate the label frequency (i.e., the proportion of songs that contain a specific label) and report the statistics in Table 5. Proportions smaller than 10% are not shown. A sketch of this similarity search is given after Table 5.
From Table 5 we find that, among the top-50 songs most similar to the target under the "alternative"-wise document representation, 76% have the label "rock", 42% "alternative", and 26% "metal". Similar observations hold for the "metal"-wise and "rock"-wise document representations. This reveals that the proposed label-wise document representation is capable of capturing label correlations. Meanwhile, all three columns show similar results, which means the label-wise representations on these three labels are also similar. It is also interesting to notice that, even though the target song does not have the label "metal", it can still be accurately represented by the "metal"-wise representation, because "metal" is correlated with "alternative" and "rock".
Table 5: Label frequencies among the top-50 most similar songs of the target song, under each label-wise representation.

Rep. on Alternative | | Rep. on Rock | | Rep. on Metal | |
---|---|---|---|---|---|
Label | P | Label | P | Label | P |
rock | 76% | rock | 86% | rock | 82% |
alternative | 42% | metal | 38% | metal | 50% |
metal | 26% | punk | 24% | alternative | 24% |
indie | 18% | alternative | 22% | punk | 12% |
britpop | 18% | indie | 18% | indie | 10% |
postpunk | 12% | postpunk | 12% | darkwave | 10% |
punk | 12% | … | … | … | … |
… | … | … | … | … | … |
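The similar-song statistics above can be reproduced with a simple cosine-similarity search over the label-wise representations. Below is a minimal sketch; the array layout and function name are assumptions.

```python
import numpy as np

def top_k_similar(target_vec: np.ndarray, all_vecs: np.ndarray, k: int = 50) -> np.ndarray:
    """Indices of the k documents whose label-wise representations are most similar
    to the target's, by cosine similarity. all_vecs: (N, dim), target_vec: (dim,)."""
    a = target_vec / np.linalg.norm(target_vec)
    b = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    return np.argsort(-(b @ a))[:k]
```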
As mentioned above, our pre-training mechanism captures label correlations, so low-frequency labels can also be predicted accurately. To demonstrate this, we count the frequency of each label and its corresponding F1 score. As shown in Figure 2, label frequencies are discretized into bins and the average F1 score within each bin is calculated, using the HLW-LSTM encoder on RMSC-V2 and the LW-LSTM encoder on AAPD. The model with both pre-training and fine-tuning performs best in most bins and is particularly better on low-frequency labels.
[Figure 2: Average F1 score of labels grouped by frequency, on RMSC-V2 (HLW-LSTM) and AAPD (LW-LSTM).]
3.8 Case Study
We present some examples for the HLW-LSTM models with and without the pre-training and fine-tuning mechanisms. As shown in Table 6, five typical examples are selected from the RMSC-V2 music tagging task. In the first three examples, the models with pre-training accurately predict the whole label set. In the fourth, further applying the fine-tuning mechanism yields an accurate prediction. In the last, both our pre-training and fine-tuning models predict an extra label, "pop", which does not appear in the ground truth. However, we visited the website (https://music.douban.com/subject/1774742/) and found that this song also has some tags similar to "pop" that do not appear in the dataset. Note that the labels are human-annotated and may be inaccurate or incomplete. As for the label "indie", which is not predicted, this is where the model needs further improvement.
Ground Truth | HLW-LSTM | +PT | +PT+FT |
---|---|---|---|
jazz,soul | jazz | jazz,soul | jazz,soul |
classical,ost,piano | ost,piano | classical,ost,piano | classical,ost,piano |
folk,punk | folk,punk,rock | folk,punk | folk,punk |
britpop,indie,rock | alternative,britpop,indie,rock | alternative,britpop,indie,rock | britpop,indie,rock |
electronic,indie | electronic | electronic,pop | electronic,pop |
4 Related Work
Much research has focused on multi-label text classification. In earlier years, traditional machine learning methods such as Naive Bayes and SVM were widely used for text classification, and researchers directly transformed the multi-label task into several single-label tasks.
Binary Relevance (BR) [3] is the earliest such method, learning several single-label classifiers independently. Label Powerset (LP) [17] transforms the problem into a multi-class one, with a single multi-class classifier trained over all unique label sets. Classifier Chains (CC) [16] further converts it into a chain of single-label tasks. However, these methods cannot learn label-wise document representations and have limited performance because of insufficient learning.
Among neural network models, the sequence-to-sequence (Seq2Seq) framework can represent label correlations naturally. Therefore, CNN-RNN [5] and the Sequence Generation Model (SGM) [18] view the MLTC task as a sequence generation problem. Further, set-RNN [15] adapts RNN sequence models to the setting where the target is a set of labels rather than a sequence. However, these methods cannot dynamically model the correlation between arbitrary labels through the fixed recurrent decoder. Moreover, the computational cost of the Seq2Seq framework is a major challenge for practical applications.
Our label-wise document representation is closely related to word embeddings. A word may have several meanings, and a single static word embedding can hardly represent the entire word. Thus, dynamic word embeddings (i.e., pre-trained language models) such as ELMo [14] and BERT [6] were proposed to obtain context-specific representations. However, such representations are still static with respect to different labels.
5 Conclusions
We propose a pre-training task and model, LW-PT, for multi-label text classification. Specifically, two label-wise encoders, LW-LSTM and HLW-LSTM, are introduced to handle short and long documents, respectively. In our method, label differences are modeled by the label-wise encoders, and label correlations are captured during pre-training based on the observation that labels appearing together in a document tend to have similar concepts. Experiments show that our method outperforms previous approaches by a substantial margin. How to introduce unsupervised learning and transfer learning into the label-wise pre-training mechanism remains for future work.
Acknowledgements
The research is supported by the Fundamental Research Funds for the Central Universities.
References
- [1] Adhikari, A., Ram, A., Tang, R., Lin, J.: Rethinking complex neural network architectures for document classification. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4046–4051 (2019)
- [2] Ba, J.L., Kiros, J.R., Hinton, G.E.: Layer normalization. arXiv preprint arXiv:1607.06450 (2016)
- [3] Boutell, M.R., Luo, J., Shen, X., Brown, C.M.: Learning multi-label scene classification. Pattern Recognition 37(9), 1757–1771 (2004)
- [4] Cambria, E., Olsher, D., Rajagopal, D.: Senticnet 3: a common and common-sense knowledge base for cognition-driven sentiment analysis. In: Twenty-eighth AAAI conference on artificial intelligence (2014)
- [5] Chen, G., Ye, D., Xing, Z., Chen, J., Cambria, E.: Ensemble application of convolutional and recurrent neural networks for multi-label text categorization. In: 2017 International Joint Conference on Neural Networks (IJCNN). pp. 2377–2383 (2017)
- [6] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
- [7] Fürnkranz, J., Hüllermeier, E., Mencía, E.L., Brinker, K.: Multilabel classification via calibrated label ranking. Machine learning 73(2), 133–153 (2008)
- [8] Gopal, S., Yang, Y.: Multilabel classification with meta-level features. In: Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. pp. 315–322 (2010)
- [9] Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural computation 9(8), 1735–1780 (1997)
- [10] Kim, Y.: Convolutional neural networks for sentence classification. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1746–1751 (2014)
- [11] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
- [12] Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems. pp. 3111–3119 (2013)
- [13] Pal, A., Selvakumar, M., Sankarasubbu, M.: Multi-label text classification using attention-based graph neural network. arXiv preprint arXiv:2003.11644 (2020)
- [14] Peters, M.E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., Zettlemoyer, L.: Deep contextualized word representations. arXiv preprint arXiv:1802.05365 (2018)
- [15] Qin, K., Li, C., Pavlu, V., Aslam, J.: Adapting rnn sequence prediction model to multi-label set prediction. In: NAACL-HLT 2019: Annual Conference of the North American Chapter of the Association for Computational Linguistics. pp. 3181–3190 (2019)
- [16] Read, J., Pfahringer, B., Holmes, G., Frank, E.: Classifier chains for multi-label classification. Machine learning 85(3), 333 (2011)
- [17] Tsoumakas, G., Katakis, I.: Multi-label classification: An overview. International Journal of Data Warehousing and Mining (IJDWM) 3(3), 1–13 (2007)
- [18] Yang, P., Sun, X., Li, W., Ma, S., Wu, W., Wang, H.: Sgm: Sequence generation model for multi-label classification. In: COLING 2018: 27th International Conference on Computational Linguistics. pp. 3915–3926 (2018)
- [19] Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E.: Hierarchical attention networks for document classification. In: Proceedings of the 2016 conference of the North American chapter of the association for computational linguistics: human language technologies. pp. 1480–1489 (2016)
- [20] Zhao, G., Xu, J., Zeng, Q., Ren, X., Sun, X.: Review-driven multi-label music style classification by exploiting style correlations. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 2884–2891 (2019)