
Description-Enhanced Label Embedding Contrastive Learning for Text Classification

Kun Zhang, Le Wu, Guangyi Lv, Enhong Chen, Shulan Ruan, Jing Liu, Zhiqiang Zhang, Jun Zhou, and Meng Wang. Kun Zhang, Le Wu, and Meng Wang are with the School of Computer and Information, Hefei University of Technology, Hefei, Anhui 230029, China (email: zhang1028kun, lewu.ustc, [email protected]). Guangyi Lv is with the AI Lab at Lenovo Research, Beijing 100094, China (email: [email protected]). Enhong Chen and Shulan Ruan are with the School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China (email: [email protected], [email protected]). Jing Liu is with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (email: [email protected]). Zhiqiang Zhang and Jun Zhou are with Ant Group Co., Ltd., Hangzhou 310007, China (email: lingyao.zzq, [email protected]). Le Wu is the corresponding author.
Abstract

Text classification is one of the fundamental tasks in natural language processing, which requires an agent to determine the most appropriate category for input sentences. Recently, deep neural networks have achieved impressive performance in this area, especially Pre-trained Language Models (PLMs). Usually, these methods concentrate on input sentences and corresponding semantic embedding generation. However, for the other essential component, labels, most existing works either treat them as meaningless one-hot vectors or use vanilla embedding methods to learn label representations along with model training, underestimating the semantic information and guidance that these labels reveal. To alleviate this problem and better exploit label information, in this paper, we employ Self-Supervised Learning (SSL) in the model learning process and design a novel self-supervised Relation of Relation (R2) classification task to utilize labels from a one-hot perspective. Then, we propose a novel Relation of Relation Learning Network (R2-Net) for text classification, in which text classification and R2 classification are treated as optimization targets. Meanwhile, a triplet loss is employed to enhance the analysis of differences and connections among labels. Moreover, considering that one-hot usage still falls short of fully exploiting label information, we incorporate external knowledge from WordNet to obtain multi-aspect descriptions for label semantic learning and extend R2-Net to a novel Description-Enhanced Label Embedding network (DELE) from a label embedding perspective. One step further, since these fine-grained descriptions may introduce unexpected noise, we develop a mutual interaction module to select appropriate parts from input sentences and labels simultaneously based on Contrastive Learning (CL) for noise mitigation. Extensive experiments on different text classification tasks reveal that R2-Net can effectively improve classification performance, and DELE can make better use of label information and further improve performance. As a byproduct, we have released the code (https://github.com/little1tow/DELE_pytorch) to facilitate further research.

Index Terms:
Text Classification, Label Embedding, Contrastive Learning, Representation Learning.

I Introduction

As one of the fundamental tasks in Natural Language Processing (NLP), text classification focuses on identifying the most appropriate category for sentences or the most suitable relation for sentence pairs. For example, Paraphrase Identification (PI) aims at identifying whether a sentence pair expresses the same meaning (Yes or No) [1]. Natural Language Inference (NLI) aims to classify an input sentence pair into one of three relations (i.e., Entailment, Contradiction, Neutral) [2]. QA Topic Classification requires an agent to select the most suitable topic for a given question-answer pair [3]. Fig. 1(a) illustrates some examples with different relations from different tasks.

Figure 1: (a) Some text classification examples: label signals can imply specific semantic expressions of input sentences (e.g., in the Entailment pair (a and b1), “A lady” is replaced with “A person”; in the Contradiction pair (a and b2), “A lady” is replaced with “Nobody”). (b) Some fine-grained noun descriptions for the label word “Business” from WordNet, which give detailed explanations about different attributes of the label “Business”.

As a vital technology, text classification has been applied successfully to various NLP fields, e.g., sentiment analysis [4, 5, 6], information retrieval [7], question answering [8], and dialogue systems [9]. Based on label usage, most of the existing work can be grouped into two categories. The first category is one-hot encoding of labels. Researchers usually focus on designing a deep model to learn representations for the input text [10, 11, 12]. Then, a simple classifier is employed to predict the label distribution. A Cross-Entropy loss between the prediction and the one-hot label encoding is finally adopted for model training [13, 14]. However, labels can reveal common characteristics of examples within the same category [15] and provide rich semantic information as well as guidance for sentence semantics learning [16]. Treating labels as independent and meaningless one-hot vectors causes potential information loss and struggles to capture fine-grained interactions between text and labels. Gururangan et al. [17] have observed that different relations among sentence pairs imply specific semantic expressions. Taking case one in Fig. 1(a) as an example, when constructing sentence pairs that are semantically contradictory, negation words are usually used (e.g., replacing “A lady” with “Nobody” in pair a and b2). Moreover, replacing exact numbers with approximate values often generates an “entailment” relation. Therefore, the connections and differences among different relations (e.g., via pairwise relation comparison) are helpful for capturing more implicit common semantic features, which is a promising direction for fully exploiting label information in text classification. The problem then becomes how to design this type of signal and integrate it into the model learning process, which is one of the main focuses of this paper.

The second category is label embedding methods. Different from one-hot encoding methods, this type of work usually uses dense vectors to represent labels in the same semantic space as text embeddings. Then, an attention mechanism is employed to measure the impact of label semantics on the input text for classification. For example, Xiao et al. [18] proposed a label-specific attention network to enhance text representation learning. Zhang et al. [19] proposed to retrieve keywords from documents as pseudo descriptions for label embedding learning. Despite the inspiring performance, there are still some shortcomings. Specifically, vanilla embedding methods treat labels as general words and generate representations during model training. Knowledge-enhanced methods usually use a single description to enrich the embedding. They either focus only on the literal meanings of labels or describe labels in a coarse-grained manner, which is insufficient for precise and relevant label embedding learning. As shown in Fig. 1(b), even if a particular meaning of a word is utilized as label semantics, this meaning still has different attributes and can be described from different perspectives (e.g., noun.group, noun.act). Moreover, even if different input texts have the same label, their semantics may be associated with different aspects of the label. For example, for the label “Business”, the noun.group attribute is used to identify QA(1), and the noun.act attribute is used for QA(2) in case two of Fig. 1(a). Thus, it is promising to consider fine-grained knowledge (e.g., multi-aspect descriptions from WordNet) to enhance label embedding. Meanwhile, different from keyword extraction and single description usage, multi-aspect descriptions may not all relate to the specific label meaning, which will introduce unanticipated noise and harm precise label semantic representation. For example, the noun.cognition attribute in Fig. 1(b) is not the intended sense of the label “Business” used in Fig. 1(a); it becomes noise and confuses models about the semantics of the label “Business”. Therefore, another main focus of this paper is how to introduce fine-grained information (i.e., multi-aspect descriptions) for better label embedding learning while eliminating the unexpected noise.

In order to mine the implicit information inside labels in a one-hot encoding manner, in our preliminary work [20], we propose a novel Relation of Relation Learning Network (R2-Net) to fully exploit labels in a simple but effective way. We first utilize a PLM encoder (e.g., BERT [21]) and a CNN-based encoder to model the global and local semantic meanings of input words and sentences, respectively. Then, inspired by SSL, we propose a self-supervised Relation of Relation (R2) classification task to enhance the learning ability of R2-Net, so that inter-class relations of input texts are modelled comprehensively (i.e., differences among different label relations). Moreover, a triplet loss is used to capture intra-class connections (i.e., inputs with the same label are represented closer together, while those with different labels are pushed further apart).

However, R2-Net still mines label information in a one-hot encoding manner, which inhibits the potential of label utilization. To this end, in this paper, we focus on label embedding and extend R2-Net to a novel Description-Enhanced Label Embedding network (DELE), which incorporates fine-grained descriptions from WordNet (https://wordnet.princeton.edu/) as prior knowledge to boost label embedding learning. Specifically, we first extract multiple descriptions from WordNet for each label word to enrich the label embedding learning. After that, considering that multi-aspect descriptions may not all relate to the specific meaning of labels, we design a novel mutual interaction module based on the CL framework to explore bidirectional interactions between input sentences and labels. Along this line, sentence representations can be leveraged to denoise the multi-aspect descriptions of labels and select the most relevant parts for better semantic representations. Meanwhile, label semantics can also be used to guide the importance selection over input sentences, as existing methods do. Extensive experiments over different types of text classification tasks demonstrate that DELE makes better use of labels and achieves better classification performance.

In summary, the main contributions of this paper lie in the following parts: 1) We develop a novel self-supervised R2 task to mine the implicit information inside labels from a one-hot encoding perspective (Preliminary Work); 2) We propose to leverage multi-aspect descriptions from WordNet to fully exploit label information from the label embedding perspective (Extended Work); 3) We also design a novel mutual interaction module based on CL for better description denoising and integration (Extended Work); 4) Extensive experiments over different types of text classification tasks demonstrate the effectiveness of our proposed methods.

The remainder of this paper is organized as follows. Section II summarizes related work. Section III gives formal definitions of text classification and our proposed R2 classification. Sections IV and V report technical details of our proposed R2-Net and DELE. The experiments and detailed analysis are reported in Section VI. Finally, we discuss and conclude our work in Sections VII and VIII.

II Related Work

In this section, we will introduce the related work, which is grouped into two lines of literature: 1) Text Classification, focusing on sentence semantic modelling and relation identification with different label usages; and 2) Contrastive Learning, introducing the recent progress of CL on text classification.

II-A Text Classification

With the development of various neural networks such as CNN [22], GRU [23], and attention mechanisms [24, 5], plenty of methods have been developed for text classification on large datasets like SNLI [25], Quora [26], and SST-5 [27]. Usually, researchers employ neural networks [21, 28] to generate text representations. Then, a simple classifier, such as a Multi-Layer Perceptron (MLP), is used to predict the label distribution. Next, a cross-entropy loss is applied for model training [13, 29]. Based on label usage, there are two categories, One-hot Encoding methods and Label Embedding methods, which are summarized as follows.

For One-hot Encoding methods, researchers focus on the input text and treat labels as meaningless one-hot training signals. For example, Zeng et al. [30] focused on the input sentences and generated representations by extracting lexical-level and sentence-level features. Zhang et al. [10] developed DRr-Net to precisely select important parts of a sentence for text representation and classification. Similarly, Li et al. [31] leveraged a sequential decision process with reinforcement learning to exploit the potential of sentences for classification. Since the advent of PLMs (e.g., BERT [21], RoBERTa [28]), this learning paradigm has become much simpler and more effective, and these methods have all achieved impressive performance. However, in most scenarios, treating labels as independent and meaningless one-hot training signals ignores the implicit semantics and common feature indications of label words, causing information loss and limiting the learning capability of representation methods.

To better exploit label information, label embedding methods have been proposed. In practice, researchers treat the text classification problem as a label-text joint embedding problem, where labels are embedded in the same space as texts. Then, an attention mechanism is employed to measure semantic relations between labels and texts for classification [18, 32, 33, 34, 35]. Traditionally, vanilla embedding is the most common choice. For example, Du et al. [36] used vanilla embedding to represent label semantics from scratch and designed EXAM to explicitly calculate matching scores between text and labels at the word level. To generate better label embeddings, side information as well as relation modelling are taken into consideration. For side information, Zhu et al. [35] used TF-IDF to select the most effective words from documents as label descriptions, and leveraged an attention mechanism to fuse the input sentences and label information for better classification. Rivas et al. [37] extracted textual definitions from Oxford Dictionaries for label modelling. For relations, Guo et al. [14] developed a label confusion model to estimate label dependency for better label exploration. In addition, there exist other methods that leverage label embedding to tackle multi-label classification [15, 38], hierarchical text classification [39], and so on.

Compared with existing work, our proposed work has the following main improvements. First, we propose to use multi-aspect real sentences as descriptions to provide fine-grained side information for labels. Second, we argue that additional descriptions may introduce unexpected noise and harm model performance. Therefore, we develop a novel mutual interaction module based on CL to fuse input sentences and labels, as well as to alleviate the unexpected noise introduced by multi-aspect descriptions.

II-B Contrastive Learning

As a core component of self-supervised learning, CL has led to state-of-the-art performance in the unsupervised training of deep learning models, and has achieved impressive results in Computer Vision [40] and Natural Language Processing [41, 20]. Traditionally, CL uses data augmentations to generate “positive” pairs and randomly selects “negative” samples from the mini-batch without considering labels. The target is to pull together an anchor and a positive sample in the embedding space, and push apart the anchor from many negative samples [42]. This paradigm can be treated as contrastive instance discrimination [43]. SimCLR [44], MoCo [45], and MAE [46] are representative works.

However, instance-level CL has some weaknesses in mining common features among instances of the same category. Therefore, researchers have proposed to consider side information for better sampling, such as supervised CL [42, 47] and cluster-based CL [48]. The most relevant line of work is supervised CL, which takes labels into consideration. For example, Khosla et al. [42] used labels to select positive examples with the same label as the anchor example. Therefore, models can obtain multiple positive examples not only from data augmentation, but also from the same category. One step further, Gao et al. [47] leveraged “Entailment” and “Contradiction” labels to select positive and negative examples simultaneously. Besides, Yang et al. [49, 50] developed a noise-robust contrastive learning loss to handle false negatives in CL and improve performance. In our preliminary work [20], we designed a relation of relation task to force the model to measure label relations with contrastive learning. These works have all made progress in improving CL performance and achieved promising results in downstream tasks. Nevertheless, these supervised CL-based methods still treat labels as one-hot signals and use these signals to directly select hard examples, underestimating the potential semantic information of labels. That is to say, there is still plenty of room for further improving text representation and classification with better label utilization methods.

Figure 2: (A) The overall structure of R2-Net. (B) Using PLMs to obtain the representations of input words and sentences. (C) Our proposed CNN-based local encoding for the model ability enhancement on sentence partial information extraction. (D) Relation identification module for our proposed R2 classification.

III Problem Statements

In this section, we will introduce the definition of text classification task, our proposed relation of relation (R2) classification task, and necessary notations.

III-A Text Classification

Given a word sequence denoted by $\bm{X}=\{x_{1},x_{2},...,x_{n}\}$, where $n$ is the length of the word sequence, as well as the label set $\mathcal{Y}$ ($|\mathcal{Y}|=m$), where the $j^{th}$ label is represented by the one-hot vector $\bm{y}_{j}$ and $m$ is the number of labels, the target is to learn a classifier $\xi$ that is capable of computing the conditional probability $P(\bm{y}|\bm{X},\mathcal{Y})$ and predicting the most appropriate label for the input text:

\begin{split}&P(\bm{y}|\bm{X},\mathcal{Y})=\xi(\bm{X},\mathcal{Y}),\\ &y^{*}=argmax_{y\in\mathcal{Y}}P(\bm{y}|\bm{X},\mathcal{Y}),\end{split} (1)

where the true label $y\in\mathcal{Y}$ indicates the semantic category of the input text. For example, $\mathcal{Y}=\{Yes, No\}$ for the PI task, and $\mathcal{Y}=\{entailment, contradiction, neutral\}$ for the NLI task.

III-B Relation of Relation (R2) Classification

Previous studies [17] have demonstrated that categories can help reveal implicit patterns for text classification. Therefore, we propose a novel Relation of Relation (R2) classification task to guide models to exploit category information precisely. Given two input texts $\bm{X}_{1}$ and $\bm{X}_{2}$, the goal is to learn a classification function $\mathcal{F}$ to identify whether these two input texts have the same semantic relation:

\begin{split}\mathcal{F}(\bm{X}_{1},\bm{X}_{2})=\begin{cases}1,&\mbox{if }\bm{y}_{1}=\bm{y}_{2},\\ 0,&\mbox{if }\bm{y}_{1}\neq\bm{y}_{2},\end{cases}\end{split} (2)

where $\bm{y}_{1}$ and $\bm{y}_{2}$ stand for the one-hot label representations of the two input texts, respectively.
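To make the R2 supervision concrete, the following minimal Python sketch builds relation-of-relation targets from a batch of class labels according to Eq. (2). The all-pairs construction and the function name are illustrative assumptions, not the exact sampling strategy used in R2-Net.

```python
import itertools
import torch

def r2_labels(labels):
    """Sketch: build relation-of-relation (R2) targets from a batch of class labels.
    Enumerating all in-batch pairs is an assumption made for illustration."""
    pairs, targets = [], []
    for i, j in itertools.combinations(range(len(labels)), 2):
        pairs.append((i, j))
        targets.append(1 if labels[i] == labels[j] else 0)  # Eq. (2)
    return pairs, torch.tensor(targets)

# Example: labels of four texts -> pairwise same-relation indicators.
pairs, targets = r2_labels([2, 0, 2, 1])
# pairs   = [(0,1), (0,2), (0,3), (1,2), (1,3), (2,3)]
# targets = [0, 1, 0, 0, 0, 0]
```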

Next, we will introduce the technical details of our proposed R2-Net and DELE. Some necessary notations are listed in Table I for better illustration.

TABLE I: Notations used in R2-Net and DELE.
Notation | Explanation
$\bm{X}$ | The input text sequence
$\mathcal{Y}$ | The label set
$\bm{H}$ | Word embedding matrix for the entire input text
$\bm{v}_{g},\bm{v}_{l}$ | Global and local representations for the input text
$\bm{v}_{*}$ | Different vector representations from different modules
$\alpha_{1},\alpha_{2},...,\alpha_{L}$ | Trainable weights for different layers in PLMs
$S^{(i)}$ | Fine-grained description set for the $i^{th}$ label
$\bm{E}_{f}$ | Vanilla label embedding matrix
$\bm{E}$ | Final label embedding matrix
$\bm{W}_{*},\bm{U}_{*},\bm{\omega}_{*},\bm{b}_{*}$ | Trainable parameters of our proposed models
$\bm{\beta}^{l},\bm{\beta}^{t}$ | Attention weights used in DELE

IV Relation of Relation Learning Network (R2-Net)

Fig. 2(A) reports the overall architecture of R2-Net. To better describe the technical details of R2-Net, similar to Section III, we elaborate them from two aspects: 1) Text Classification Part; 2) R2 Classification Part.

IV-A Text Classification Part

This part focuses on identifying the most suitable label for a given text. Specifically, we first utilize powerful PLMs [21, 28], such as BERT, to generate sentence semantic representation globally. Meanwhile, we develop a CNN-based encoder to capture keywords and phrase information from a local perspective. Thus, the input text sequence can be encoded in a comprehensive manner. Then, we leverage a Multi-Layer Perceptron (MLP) to predict the corresponding label.

IV-A1 Global Encoding

With the full usage of large corpora and multi-layer transformers, PLMs (e.g., BERT [21]) have made remarkable progress in many NLP tasks. Thus, we take pre-trained BERT as an example to describe how to generate sentence semantic representations for the input. Moreover, inspired by ELMo [51], we use a weighted sum of the hidden states of words from different transformer layers as the final contextual representations of the input words.

Specifically, we first split the input text into BPE tokens [52] and add the special token “[CLS]” at the beginning of the word sequence and “[SEP]” at the end. An additional “[SEP]” token is inserted when there is more than one sentence. Then, BERT is utilized to generate the word representations $\bm{H}$ and the sentence representation $\bm{v}_{g}$. As illustrated in Fig. 2(B), suppose there are $L$ layers in BERT. The contextual word representations for the input text sequence $\bm{X}$ are then a per-layer weighted sum of the transformer block outputs, with weights $\alpha_{1},\alpha_{2},...,\alpha_{L}$:

\begin{split}&\bm{h}_{0}^{l},\bm{H}^{l}=BERT_{l}(\bm{X}),\\ &\bm{H}=\sum_{l=1}^{L}\alpha_{l}\bm{H}^{l},\quad\bm{v}_{g}=\bm{h}_{0}^{L},\\ \end{split} (3)

where $\bm{h}_{0}^{l}$ denotes the representation of the first token “[CLS]” at the $l^{th}$ layer, and $\bm{v}_{g}$ denotes the global semantic representation of the input. $\bm{H}$ represents the sequence features of the whole input. $\alpha_{l}$ is the weight of the $l^{th}$ layer in BERT and is learned during model training.
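The following PyTorch sketch illustrates the global encoding step of Eq. (3), assuming the HuggingFace Transformers BERT interface. The softmax normalization of the layer weights follows ELMo-style mixing and is an assumption, as are the module and variable names.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class GlobalEncoder(nn.Module):
    """Minimal sketch of Eq. (3): per-layer weighted sum of BERT hidden states
    plus the last-layer "[CLS]" vector as the global representation v_g."""
    def __init__(self, model_name="bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name, output_hidden_states=True)
        num_layers = self.bert.config.num_hidden_layers
        # One trainable weight alpha_l per transformer layer.
        self.alpha = nn.Parameter(torch.ones(num_layers) / num_layers)

    def forward(self, input_ids, attention_mask):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        # hidden_states = (embeddings, layer_1, ..., layer_L); keep the L transformer layers.
        layers = torch.stack(out.hidden_states[1:], dim=0)       # [L, B, n, d_p]
        weights = torch.softmax(self.alpha, dim=0).view(-1, 1, 1, 1)
        H = (weights * layers).sum(dim=0)                        # weighted word features [B, n, d_p]
        v_g = out.hidden_states[-1][:, 0]                        # "[CLS]" of the last layer
        return H, v_g
```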

IV-A2 Local Encoding

The semantic relation within the text sequence is not only connected with the important words, but also affected by the local information (e.g., phrase and local structure). Though BERT leverages multi-layer transformers to perceive important words in the sentence pair, it still has some weaknesses in modelling local information. To alleviate these shortcomings, we develop a CNN-based local encoder to extract local information from the input.

Fig. 2(C) reports the local encoding structure. The input of this module is $\bm{H}$ from global encoding. We use convolution operations with different kernel sizes (e.g., bi-gram and tri-gram) to process these features; each kernel size models local patterns of a different scale. Next, we leverage average pooling and max pooling to enhance these local features and concatenate them before sending them to a non-linear transformation. Suppose we have $K$ different kernel sizes. This process can be formulated as follows:

\begin{split}\bm{H}^{k}&=CNN_{k}(\bm{H}),\quad k=1,2,...,K,\\ \bm{h}^{k}_{max}&=max(\bm{H}^{k}),\quad\bm{h}^{k}_{avg}=avg(\bm{H}^{k}),\\ \bm{h}_{concat}&=[\bm{h}^{1}_{max};\bm{h}^{1}_{avg};...;\bm{h}^{K}_{max};\bm{h}^{K}_{avg}],\\ \bm{v}_{l}&=ReLU(\bm{W}\bm{h}_{concat}+\bm{b}),\\ \end{split} (4)

where $CNN_{k}$ denotes the convolution operation with the $k^{th}$ kernel, and $[\cdot;\cdot]$ is the concatenation operation. $\bm{v}_{l}$ represents the local semantic representation of the input. $\{\bm{W}\in\mathbb{R}^{d_{p}\times(2K\cdot d_{p})},\bm{b}\in\mathbb{R}^{d_{p}}\}$ are trainable parameters, where $d_{p}$ is the output size of the pre-trained model. $ReLU(\cdot)$ is the activation function.
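A minimal PyTorch sketch of the CNN-based local encoder in Eq. (4). Kernel sizes follow Table III; padding, channel sizes, and module names are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalEncoder(nn.Module):
    """Minimal sketch of Eq. (4): multi-kernel 1-D convolutions over the word
    features H, max/avg pooling, concatenation, and a non-linear projection."""
    def __init__(self, d_p=768, kernel_sizes=(1, 2, 3)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(d_p, d_p, kernel_size=k, padding=k // 2) for k in kernel_sizes
        )
        # Max- and avg-pooled features of every kernel are concatenated: 2*K*d_p -> d_p.
        self.fc = nn.Linear(2 * len(kernel_sizes) * d_p, d_p)

    def forward(self, H):                        # H: [B, n, d_p] from global encoding
        x = H.transpose(1, 2)                    # Conv1d expects [B, d_p, n]
        pooled = []
        for conv in self.convs:
            h_k = conv(x)                        # local n-gram features H^k
            pooled.append(h_k.max(dim=2).values)     # max pooling over positions
            pooled.append(h_k.mean(dim=2))           # average pooling over positions
        h_concat = torch.cat(pooled, dim=1)      # [B, 2*K*d_p]
        return F.relu(self.fc(h_concat))         # local representation v_l
```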

After obtaining the global representation $\bm{v}_{g}$ and the local representation $\bm{v}_{l}$, we investigate different fusion methods to integrate them, including simple concatenation, weighted concatenation, and weighted sum. We find that simple concatenation is flexible and achieves comparable performance without extra trainable parameters. Thus, we employ the concatenation $\bm{v}=[\bm{v}_{g};\bm{v}_{l}]$ as the final semantic representation of the input text sequence.

IV-A3 Label Prediction

This component is adopted to predict the label of the input text, which is the essential part of text classification. Specifically, the input of this component is the semantic representation $\bm{v}$, and we leverage a two-layer MLP to make the final classification:

P(y|\bm{X})=MLP_{1}(\bm{v}). (5)

IV-B Relation of Relation Learning Part

This part aims to fully exploit label information through pairwise comparison and contrastive learning, thereby helping to improve model performance. To achieve this goal, we employ two critical modules to analyze the pairwise relation and the triplet-based relation simultaneously.

IV-B1 R2 Classification

Inspired by the SSL objectives used in PLMs (e.g., MLM and NSP in BERT), we intend R2-Net to make full use of label information in a similar way. Therefore, we develop a novel self-supervised R2 classification task. Instead of just identifying the most suitable label, we obtain more information about the input text sequences by analyzing the pairwise relation between their semantic representations (i.e., $\bm{v}_{1}$ for $\bm{X}_{1}$ and $\bm{v}_{2}$ for $\bm{X}_{2}$). Since a learnable nonlinear transformation between representations and the loss substantially improves model performance [44], we first transform $\bm{v}_{1}$ and $\bm{v}_{2}$ with a nonlinear transformation. Then, we leverage heuristic matching [53, 54] to model the similarity and difference between $\bm{v}_{1}$ and $\bm{v}_{2}$. Next, we send the matching result $\bm{u}$ to an MLP with one hidden layer for the final classification. This process is formulated as follows:

\begin{split}&\bar{\bm{v}}_{1}=MLP_{1}(\bm{v}_{1}),\quad\bar{\bm{v}}_{2}=MLP_{1}(\bm{v}_{2}),\\ &\bm{u}=[\bar{\bm{v}}_{1};\bar{\bm{v}}_{2};(\bar{\bm{v}}_{1}\odot\bar{\bm{v}}_{2});(\bar{\bm{v}}_{1}-\bar{\bm{v}}_{2})],\\ &P(\hat{y}|\bm{X}_{1},\bm{X}_{2})=MLP_{2}(\bm{u}),\\ \end{split} (6)

where the concatenation retains all the information [54], the element-wise product is a certain measure of “similarity” between the two sentences [55], and their difference captures the degree of distributional inclusion in each dimension [56]. $\hat{y}\in\{1,0\}$ indicates whether the two input sequences have the same relation.
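A sketch of the R2 classification head in Eq. (6): a shared nonlinear projection $MLP_{1}$, heuristic matching, and a one-hidden-layer $MLP_{2}$. Hidden sizes follow Table III; activation choices, the input size (assumed to be the concatenation $[\bm{v}_{g};\bm{v}_{l}]$), and layer names are assumptions.

```python
import torch
import torch.nn as nn

class R2Classifier(nn.Module):
    """Sketch of Eq. (6): shared projection, heuristic matching, binary R2 prediction."""
    def __init__(self, d_in=2 * 768, d_m=300):
        super().__init__()
        self.mlp1 = nn.Sequential(nn.Linear(d_in, d_m), nn.ReLU())      # nonlinear transformation
        self.mlp2 = nn.Sequential(nn.Linear(4 * d_m, d_m), nn.ReLU(),
                                  nn.Linear(d_m, 2))                    # same / different relation

    def forward(self, v1, v2):
        v1_bar, v2_bar = self.mlp1(v1), self.mlp1(v2)
        # Heuristic matching: concatenation, element-wise product, and difference.
        u = torch.cat([v1_bar, v2_bar, v1_bar * v2_bar, v1_bar - v2_bar], dim=-1)
        return self.mlp2(u)                                             # R2 logits
```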

IV-B2 Triplet Distance Calculation

Apart from leveraging the R2 task to learn inter-class relation information, we also intend to learn intra-class connections from triplet-based relations. Thus, we introduce a triplet loss [57] into R2-Net. As a fundamental similarity objective, the triplet loss is widely applied in information retrieval [7]; it pulls together input sequences with the same label and pushes apart those with different labels. The inputs of this component are three semantic representations: $\bm{v}_{a}$ for the anchor text $\bm{X}_{a}$, $\bm{v}_{p}$ for the positive text $\bm{X}_{p}$, and $\bm{v}_{n}$ for the negative text $\bm{X}_{n}$. We first transform them into a common space with a fully connected layer [44]. Then, we calculate the distance between anchor and positive, as well as the distance between anchor and negative:

\begin{split}&\bar{\bm{v}}_{i}=ReLU(\bm{W}_{d}\bm{v}_{i}+\bm{b}_{d}),\quad i\in\{a,p,n\},\\ &d_{ap}=Dist(\bar{\bm{v}}_{a},\bar{\bm{v}}_{p}),\quad d_{an}=Dist(\bar{\bm{v}}_{a},\bar{\bm{v}}_{n}),\\ \end{split} (7)

where $\{\bm{W}_{d}\in\mathbb{R}^{d_{m}\times d_{p}},\bm{b}_{d}\in\mathbb{R}^{d_{m}}\}$ are trainable parameters and $d_{m}$ is the hidden state size. $Dist(\cdot)$ is the distance function, for which we use the Euclidean distance.
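A sketch of the triplet branch: the projection and distance computation of Eq. (7), together with the per-triplet hinge term that the triplet loss in Section IV-C (Eq. (9)) sums over a batch. The margin follows Table III; everything else is an assumed, illustrative implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TripletBranch(nn.Module):
    """Illustrative sketch of Eq. (7): project anchor/positive/negative representations
    into a common space and compute Euclidean distances; the hinge term mirrors Eq. (9)."""
    def __init__(self, d_p=768, d_m=300, margin=0.2):
        super().__init__()
        self.proj = nn.Linear(d_p, d_m)   # W_d, b_d
        self.margin = margin

    def forward(self, v_a, v_p, v_n):
        a, p, n = (F.relu(self.proj(v)) for v in (v_a, v_p, v_n))
        d_ap = F.pairwise_distance(a, p)                          # anchor-positive distance
        d_an = F.pairwise_distance(a, n)                          # anchor-negative distance
        hinge = torch.clamp(d_ap - d_an + self.margin, min=0.0)   # per-triplet term of Eq. (9)
        return d_ap, d_an, hinge
```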

Figure 3: The overall structure of DELE, which consists of Encoder, our proposed Mutual Interaction, and Projection. (Query) and (Key) denote the components in attention calculation.

IV-C Model Learning

As mentioned in Section III, both text classification and the R2 task can be treated as classification tasks. Thus, we employ the Cross-Entropy loss for each of them:

\begin{split}L_{s}&=-\sum_{i=1}^{N}\bm{y}_{i}\mathrm{log}P(y_{i}|\bm{X}_{i}),\\ L_{R^{2}}&=-\sum_{j=1}^{N/2}\hat{\bm{y}}_{j}\mathrm{log}P(\hat{y}_{j}|(\bm{X}_{1},\bm{X}_{2})_{j}),\\ \end{split} (8)

where $N$ is the size of one training batch, $\bm{y}_{i}$ is the one-hot vector of the true label of the $i^{th}$ instance, and $\hat{\bm{y}}_{j}$ is the one-hot vector of the true relation of relations of the $j^{th}$ instance pair.

Moreover, we introduce triplet loss to better analyze the connections and differences among labels of different pairs:

L_{d}=\sum_{i=1}^{N/3}max((d_{ap}-d_{an}+margin)_{i},0). (9)

Finally, we treat the weighted sum of these losses, controlled by a hyper-parameter $\eta$, as the loss function for one mini-batch:

Loss=\eta L_{s}+(1-\eta)(L_{R^{2}}+L_{d}). (10)

V Description-Enhanced Label Embedding network (DELE)

In our preliminary work [20], we propose R2-Net to exploit label information in a one-hot manner for better supervision and text classification. However, one-hot usage still falls short of exploiting labels comprehensively. As mentioned in Section I, representing labels with dense vectors is a promising direction. However, the particular meaning required by label semantics and the multiple senses contained in label words or phrases pose a big challenge for label embedding methods. Thus, in this section, we focus on describing and representing label semantics in a fine-grained manner.

Specifically, we propose to employ fine-grained descriptions to generate better label embedding, and extend R2-Net to a novel DELE. The overall architecture is reported in Fig. 3. Key contributions lie in the following two areas. The first is incorporating fine-grained descriptions from WordNet to enrich the label embedding learning process, so that label semantics can be fully exploited. The second is developing a novel mutual interaction module based on CL framework, which is used to analyze bidirectional interactions between input text sequences and labels. Along this line, label semantics can better guide the importance selection from input text. Meanwhile, text representations can be used to denoise fine-grained descriptions for better label embedding learning.

As reported in Fig. 3, DELE consists of three modules: 1) Encoder module: encoding the input text sequence and all labels; 2) Mutual Interaction module: developing S2L Attention and L2S Attention to enhance embedding learning and denoise fine-grained descriptions; 3) Projection module: projecting the different learned embeddings into the same space for contrastive learning. Next, we will report the technical details of DELE.

V-A Encoder

For the input text sequence, we employ an operation similar to that in R2-Net to encode it, and obtain the word representations $\bm{H}$ and the global sequence representation $\bm{v}_{g}$.

As for labels, we design a novel Label Encoder to incorporate fine-grained descriptions for label embedding learning. Specifically, we employ WordNet as prior knowledge and retrieve relevant descriptions for labels. Since noun senses of words are usually employed as labels, we select the noun descriptions of labels from WordNet. Moreover, words in WordNet have about 25 attributes (e.g., noun.act, noun.animal, noun.event; see https://wordnet.princeton.edu/documentation/lexnames5wn). To ensure relevance and mitigate the noise problem, we manually select no more than 3 noun descriptions for each label word based on the most relevant attributes. If the label consists of more than one word (e.g., “Business & Finance” in the Yahoo! Answers dataset), we select no more than 3 noun descriptions for each word and treat all of them as the fine-grained descriptions of the label. After that, our designed label encoder is used to generate the corresponding representation.

Fig. 4 illustrates the structure of the label encoder. Taking the $i^{th}$ label as an example, we first use $\bm{E}_{f}\in\mathbb{R}^{m\times d_{p}}$ to denote the vanilla label embedding matrix. The embedding of the $i^{th}$ label is obtained by extracting the $i^{th}$ row $\hat{\bm{e}}_{i}$ from $\bm{E}_{f}$. Meanwhile, we obtain the fine-grained description set $\bm{S}^{(i)}$ for the $i^{th}$ label, where the $j^{th}$ description is denoted as $\bm{s}^{(i)}_{j}$. Then, we employ BERT to encode these descriptions and average the outputs of “[CLS]” from the last layer as the description representation. Next, we add the vanilla embedding $\hat{\bm{e}}_{i}$ and the description representation $\bar{\bm{e}}_{i}$ to obtain the enriched representation $\bm{e}_{i}$ for the $i^{th}$ label as follows:

\begin{split}\bar{\bm{e}}_{i}&=\frac{1}{|\bm{S}^{(i)}|}\sum_{j=1}^{|\bm{S}^{(i)}|}BERT_{L}(\bm{s}^{(i)}_{j}),\\ \hat{\bm{e}}_{i}&=\bm{y}_{i}\bm{E}_{f},\quad\bm{e}_{i}=\hat{\bm{e}}_{i}+\bar{\bm{e}}_{i}.\\ \end{split} (11)
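A sketch of the description-enhanced label encoder of Eq. (11), assuming a HuggingFace-style BERT encoder shared with the input encoder; the module layout, argument names, and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LabelEncoder(nn.Module):
    """Sketch of Eq. (11): vanilla label embedding plus the averaged "[CLS]"
    representations of the label's WordNet descriptions."""
    def __init__(self, bert, num_labels, d_p=768):
        super().__init__()
        self.bert = bert                                   # shared PLM encoder
        self.E_f = nn.Embedding(num_labels, d_p)           # vanilla label embedding matrix E_f

    def forward(self, label_id, desc_input_ids, desc_attention_mask):
        # desc_input_ids: [num_desc, seq_len], tokenized WordNet descriptions of this label.
        out = self.bert(input_ids=desc_input_ids, attention_mask=desc_attention_mask)
        e_bar = out.last_hidden_state[:, 0].mean(dim=0)    # average "[CLS]" over the descriptions
        e_hat = self.E_f(label_id)                         # vanilla embedding of this label
        return e_hat + e_bar                               # enriched label representation e_i
```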

V-B Mutual Interaction

In the previous part, we introduced fine-grained descriptions from WordNet to enrich the label embedding learning. However, only the specific meaning of a word is treated as a label. Thus, not all descriptions are helpful, and some even obscure label semantics. To this end, we develop a novel mutual interaction module based on CL to mitigate this unexpected noise. As shown in Fig. 3, this module leverages S2L Attention and L2S Attention to tackle this problem.

V-B1 S2L Attention

Since the label semantic meaning is highly related to the input text, we first leverage the text representation $\bm{v}_{g}$ as guidance to select the necessary label representations from the label embedding $\bm{E}$ as follows:

\begin{split}\bm{E}&=[\bm{e}_{1},\bm{e}_{2},...,\bm{e}_{m}],\\ \bm{\beta}^{l}&=\bm{\omega}_{l}^{T}tanh(\bm{W}_{l}\bm{E}+\bm{U}_{l}\bm{v}_{g}\otimes\bm{I}_{l}),\\ \bar{\bm{h}}_{e}&=\sum_{i=1}^{m}\frac{exp(\beta^{l}_{i})}{\sum_{k=1}^{m}exp(\beta^{l}_{k})}\bm{e}_{i},\end{split} (12)

where $\{\bm{\omega}_{l}\in\mathbb{R}^{1\times d_{a}},\bm{W}_{l}\in\mathbb{R}^{d_{a}\times d_{p}},\bm{U}_{l}\in\mathbb{R}^{d_{a}\times d_{p}}\}$ are trainable parameters and $d_{a}$ is the size of the attention unit. $\bm{I}_{l}\in\mathbb{R}^{m}$ is a row vector of ones, and $\bm{U}_{l}\bm{v}_{g}\otimes\bm{I}_{l}$ means repeating $\bm{U}_{l}\bm{v}_{g}$ $m$ times. $\bm{\beta}^{l}$ is the unnormalized attention weight over label representations, and $\bar{\bm{h}}_{e}$ denotes the text-supervised semantic vector generated from label semantics. With this operation, the unexpected noise from fine-grained descriptions can be effectively mitigated and label semantics can be expressed accurately.
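A sketch of the S2L attention of Eq. (12) for a single example; the additive-attention parameterization follows the equation, while the attention size and the use of bias-free linear layers are assumptions.

```python
import torch
import torch.nn as nn

class S2LAttention(nn.Module):
    """Sketch of Eq. (12): the global text representation v_g guides the
    selection over the label embeddings E."""
    def __init__(self, d_p=768, d_a=100):
        super().__init__()
        self.W_l = nn.Linear(d_p, d_a, bias=False)
        self.U_l = nn.Linear(d_p, d_a, bias=False)
        self.omega_l = nn.Linear(d_a, 1, bias=False)

    def forward(self, E, v_g):
        # E: [m, d_p] label embeddings; v_g: [d_p] global text representation.
        scores = self.omega_l(torch.tanh(self.W_l(E) + self.U_l(v_g)))  # v_g broadcast over labels
        beta_l = torch.softmax(scores.squeeze(-1), dim=0)               # normalized attention weights
        h_e = (beta_l.unsqueeze(-1) * E).sum(dim=0)                     # text-supervised semantic vector
        return h_e, beta_l
```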

Figure 4: The Architecture of Label Encoder in DELE, which incorporates vanilla embedding and our proposed description-enhanced embedding.

V-B2 L2S Attention

Meanwhile, we intend to analyze the impact of labels on the text in a similar way. Since there are multiple labels, we let each label guide the selection over the input text, and then employ max-pooling to merge the results into the label-supervised semantic vector:

\begin{split}\beta^{t}&=\bm{\omega}^{T}tanh(\bm{W}\bm{H}+\bm{U}\bm{e}_{t}\otimes\bm{I}),\\ \bar{\bm{h}}_{s}^{t}&=\sum_{i=1}^{n}\frac{exp(\beta^{t}_{i})}{\sum_{k=1}^{n}exp(\beta^{t}_{k})}\bm{h}_{i},\quad\bm{h}_{i}\in\bm{H},\\ \bar{\bm{h}}_{s}&=maxpooling([\bar{\bm{h}}_{s}^{1},\bar{\bm{h}}_{s}^{2},...,\bar{\bm{h}}_{s}^{m}]),\end{split} (13)

where $t\in\{1,2,...,m\}$ is the index of the $t^{th}$ label, and $\beta^{t}$ is the unnormalized attention weight over word representations under the guidance of the $t^{th}$ label. $\{\bm{\omega}\in\mathbb{R}^{1\times L_{s}},\bm{W}\in\mathbb{R}^{L_{s}\times d_{p}},\bm{U}\in\mathbb{R}^{L_{s}\times d_{p}}\}$ are also trainable parameters, where $L_{s}$ is the sentence length. $\bar{\bm{h}}_{s}$ denotes the label-supervised semantic vector derived from sentence semantics.
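A sketch of the L2S attention of Eq. (13) for a single example: one attention pass per label over the word representations, merged by element-wise max-pooling. The per-label loop and the attention hidden size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class L2SAttention(nn.Module):
    """Sketch of Eq. (13): each label embedding e_t guides attention over the
    word representations H; the per-label results are merged by max-pooling."""
    def __init__(self, d_p=768, d_a=100):
        super().__init__()
        self.W = nn.Linear(d_p, d_a, bias=False)
        self.U = nn.Linear(d_p, d_a, bias=False)
        self.omega = nn.Linear(d_a, 1, bias=False)

    def forward(self, H, E):
        # H: [n, d_p] word representations; E: [m, d_p] label embeddings.
        h_list = []
        for e_t in E:                                            # guidance from the t-th label
            scores = self.omega(torch.tanh(self.W(H) + self.U(e_t))).squeeze(-1)
            beta_t = torch.softmax(scores, dim=0)
            h_list.append((beta_t.unsqueeze(-1) * H).sum(dim=0))
        h_s = torch.stack(h_list, dim=0).max(dim=0).values       # label-supervised semantic vector
        return h_s
```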

V-C Projection

In the previous sections, we have obtained semantic vectors from the text perspective and the label perspective. To minimize their distance for better representation learning and classification, inspired by SimCLR [58], we employ an MLP to project the semantic vectors into the same space and improve the quality of the learned vectors as follows:

\bm{z}_{e}=MLP_{2}(\bar{\bm{h}}_{e}),\quad\bm{z}_{s}=MLP_{2}(\bar{\bm{h}}_{s}), (14)

where $\bm{z}_{e}$ and $\bm{z}_{s}$ are the projection vectors used in the contrastive loss. Since our target is to predict the most suitable label for the input text and $\bar{\bm{h}}_{s}$ is generated from the text perspective, we select the label-supervised semantic vector $\bar{\bm{h}}_{s}$ to make the final decision:

\begin{split}&P(y|\bm{X},\mathcal{Y})=MLP_{3}(\bar{\bm{h}}_{s}),\\ &y^{*}=argmax_{y\in\mathcal{Y}}P(\bm{y}|\bm{X},\mathcal{Y}).\end{split} (15)

Similarly, we also employ the R2 classification task to assist in improving the model's ability. The text-supervised semantic vector $\bar{\bm{h}}_{e}$ is selected to make this prediction as follows:

\begin{split}&\hat{\bm{u}}=[\bar{\bm{h}}_{e}^{1};\bar{\bm{h}}_{e}^{2};(\bar{\bm{h}}_{e}^{1}\odot\bar{\bm{h}}_{e}^{2});(\bar{\bm{h}}_{e}^{1}-\bar{\bm{h}}_{e}^{2})],\\ &P(\hat{y}|(\bm{X}_{1},\mathcal{Y}),(\bm{X}_{2},\mathcal{Y}))=MLP_{4}(\hat{\bm{u}}),\\ &\hat{y}^{*}=argmax_{\hat{y}\in\{0,1\}}P(\hat{y}|(\bm{X}_{1},\mathcal{Y}),(\bm{X}_{2},\mathcal{Y})),\end{split} (16)

where $MLP_{3}(\cdot)$ and $MLP_{4}(\cdot)$ are MLPs consisting of one hidden layer and a softmax output layer. $y^{*}$ and $\hat{y}^{*}$ are the predicted label for $\bm{X}$ and the predicted relation of relations of $\bm{X}_{1}$ and $\bm{X}_{2}$, respectively.

TABLE II: Experimental results (accuracy) on different datasets from the NLI task.
+ and - denote the percentage decrease or increase in error rate compared with the corresponding backbone.
Type | Model | SNLI Full Test (3-classes) | SNLI Hard Test (3-classes) | SICK Test (3-classes) | SciTail Test (2-classes)
One-hot Encoding | (1) DRCN [2] | 86.5% | 68.3% | 87.4% | 85.7%
 | (2) CSRAN [59] | 88.5% | 76.8% | 89.7% | 86.5%
 | (3) RE2 [60] | 88.9% | 77.3% | 89.8% | 86.2%
 | (4) DRr-Net [10] | 87.7% | 71.4% | 88.3% | 87.4%
 | (5) SimCSE [47] | 89.4% | 81.2% | 90.3% | 93.3%
 | (6) BERT-(base) [21] | 90.3% (0.0%) | 80.6% (0.0%) | 88.7% (0.0%) | 93.1% (0.0%)
 | (7) BERT-(large) [21] | 90.7% (+4.12%) | 81.3% (+3.61%) | 88.3% (-3.54%) | 93.6% (+7.25%)
 | (8) RoBERTa-(base) [28] | 90.9% (0.0%) | 81.5% (0.0%) | 90.3% (0.0%) | 93.8% (0.0%)
 | (9) ALBERT-(base) [61] | 86.2% | 77.5% | 87.3% | 91.4%
Label Embedding | (10) FLE-BERT(base) [15] | 90.5% (+2.06%) | 80.4% (-1.03%) | 89.3% (+5.31%) | 93.4% (+4.35%)
 | (11) LEAM-BERT(base) [62] | 90.3% (0.0%) | 80.8% (+1.03%) | 89.0% (+2.65%) | 93.4% (+4.35%)
 | (12) EXAM-BERT(base) [36] | 90.6% (+3.09%) | 81.0% (+2.06%) | 89.5% (+7.08%) | 93.8% (+10.14%)
 | (13) LGDSC-BERT(base) [35] | 91.3% (+10.31%) | 81.4% (+4.12%) | 89.5% (+7.08%) | 94.0% (+13.04%)
 | (14) LCM-BERT(base) [14] | 90.8% (+5.15%) | 81.3% (+3.61%) | 90.3% (+14.16%) | 94.1% (+14.49%)
 | (15) EXAM-RoBERTa(base) [36] | 91.5% (+6.59%) | 81.9% (+2.16%) | 90.5% (+2.06%) | 94.1% (+4.84%)
Our Methods | (16) R2-Net-BERT-(base) | 91.1% (+8.25%) | 81.0% (+2.06%) | 89.2% (+4.42%) | 92.9% (-2.89%)
 | (17) R2-Net-RoBERTa-(base) | 91.3% (+4.40%) | 81.4% (-0.54%) | 89.5% (-8.24%) | 93.9% (+1.61%)
 | (18) DELE-BERT-(base) | 91.3% (+10.31%) | 81.6% (+5.15%) | 89.9% (+10.16%) | 94.1% (+14.49%)
 | (19) DELE-RoBERTa-(base) | 91.8% (+9.89%) | 83.2% (+9.19%) | 90.7% (+4.12%) | 94.5% (+11.29%)

V-D Model Learning

DELE has three targets to optimize: the contrastive learning target, the R2 classification target, and the text classification target. They are detailed below.

Contrastive Learning Target. As mentioned before, minimizing the distance between the semantic representations of the input text and the proper label is pivotal for text classification. Thus, we select NT-Xent [58] as the contrastive loss, where the positive pair consists of the label-supervised semantic vector and the text-supervised semantic vector, and the negative examples are the remaining instances in the current batch:

L_{1}=\sum_{i=1}^{N}-\mathrm{log}\frac{sim(\bm{z}^{i}_{s},\bm{z}^{i}_{e})/\tau}{\sum_{j=1}^{K}\mathbbm{1}_{[j\neq i]}sim(\bm{z}^{i}_{s},\bm{z}^{j}_{s})/\tau}, (17)

where $sim(\cdot)$ is the similarity function, $N$ is the size of one training batch, and $\tau$ is the temperature parameter. $\mathbbm{1}_{[j\neq i]}$ is an indicator function evaluating to 1 iff $j\neq i$.
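For reference, the following is a minimal sketch of an NT-Xent-style implementation in the spirit of Eq. (17): each $(\bm{z}^{i}_{s},\bm{z}^{i}_{e})$ pair is treated as positive, and the remaining in-batch $\bm{z}_{s}$ vectors serve as negatives. The exponential/log-sum-exp form and cosine similarity follow SimCLR/SimCSE conventions and are assumptions here, as is the temperature value.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_s, z_e, tau=0.1):
    """NT-Xent-style sketch in the spirit of Eq. (17), written as a
    cross-entropy-like log-sum-exp over one positive and in-batch negatives."""
    z_s = F.normalize(z_s, dim=-1)                  # [N, d]
    z_e = F.normalize(z_e, dim=-1)                  # [N, d]
    pos = (z_s * z_e).sum(dim=-1) / tau             # sim(z_s^i, z_e^i) / tau
    neg = z_s @ z_s.t() / tau                       # sim(z_s^i, z_s^j) / tau
    neg.fill_diagonal_(float("-inf"))               # drop the j == i terms
    denom = torch.logsumexp(torch.cat([pos.unsqueeze(1), neg], dim=1), dim=1)
    return (denom - pos).mean()                     # average over the batch
```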

R2 classification target. For this classification target, we select Cross-Entropy as the training loss:

L_{2}=\sum_{i=1}^{N/2}-\hat{\bm{y}}_{i}\mathrm{log}P(\hat{y}_{i}|(\bm{X}_{1},\mathcal{Y})_{i},(\bm{X}_{2},\mathcal{Y})_{i}), (18)

where $\hat{\bm{y}}_{i}$ is the one-hot ground truth indicating the true relation of relations of the $i^{th}$ pair.

Text Classification Target. For this task-related target, we select Cross-Entropy as the training loss:

L_{3}=\sum_{i=1}^{N}-\bm{y}_{i}\mathrm{log}P(y_{i}|\bm{X},\mathcal{Y}), (19)

where $\bm{y}_{i}$ is the one-hot ground truth representing the true label of the $i^{th}$ example.

The final loss for one mini-batch is calculated as a weighted sum. Here, we use the weights $\delta$ and $\mu$ to control the impact of the CL target and the R2 task on the final classification performance:

Loss=\delta L_{1}+\mu L_{2}+L_{3}. (20)

VI Experiments

In this section, the evaluation datasets and metrics are first introduced. Then, model implementation and training details are reported. Next, empirical results and a detailed analysis of the models are presented. For all reported results, we use boldface and underline for the best and second-best results, respectively.

TABLE III: Hyper-parameter configuration in R2-Net and DELE.
Model | Hyper-parameter | Value
BERT-base | layers / hidden size / attention heads | 12 / $d_{p}=768$ / 12
RoBERTa-base | layers / hidden size / attention heads | 12 / $d_{p}=768$ / 12
R2-Net | kernel sizes of CNN | $d_{k}=\{1,2,3\}$
 | hidden size of $MLP_{1}$ | $d_{m}=300$
 | hidden size of $MLP_{2}$ | $d_{2}=300$
 | backbone learning rate | $lr_{1}=10^{-5}$
 | learning rate of other parameters | $lr_{2}=10^{-3}$
 | margin in loss $L_{d}$ | $margin=0.2$
DELE | number of backbone layers used | $L=2$
 | attention size of mutual interaction | $d_{a}=100$
 | hidden size of $MLP_{3}$ and $MLP_{4}$ | $d_{3}=200$
TABLE IV: Results (accuracy) on the PI, Sentiment Analysis, and QA Topic Classification tasks.
+ and - denote the percentage decrease or increase in error rate compared with the corresponding backbone.
Type | Model | Quora Test (2-classes) | MSRP Test (2-classes) | SST-5 Test (5-classes) | Yahoo! Answer Test (10-classes)
One-hot Encoding | (1) DRCN [2] | 90.2% | 82.5% | - | 75.1%
 | (2) RE2 [60] | 89.3% | 78.5% | - | 75.5%
 | (3) DRr-Net [10] | 89.8% | 82.9% | 50.8% | 74.7%
 | (4) SimCSE [47] | 91.5% | 84.8% | 54.1% | 76.2%
 | (5) BERT-(base) [21] | 91.0% (0.0%) | 84.2% (0.0%) | 53.1% (0.0%) | 75.7% (0.0%)
 | (6) BERT-(large) [21] | 91.4% (+4.44%) | 85.4% (+7.59%) | 54.9% (+3.84%) | 76.3% (+2.67%)
 | (7) RoBERTa-(base) [28] | 90.6% (0.0%) | 87.1% (0.0%) | 56.2% (0.0%) | 75.9% (0.0%)
 | (8) ALBERT-(base) [61] | 90.3% | 88.6% | 47.3% | 74.2%
Label Embedding | (9) FLE-BERT(base) [15] | 91.2% (+2.22%) | 84.2% (0.0%) | 53.4% (+0.64%) | 75.6% (-0.41%)
 | (10) LEAM-BERT(base) [62] | 91.3% (+3.34%) | 83.9% (-1.90%) | 53.8% (+1.49%) | 75.9% (+0.82%)
 | (11) EXAM-BERT(base) [36] | 91.4% (+4.44%) | 84.2% (0.0%) | 54.3% (+2.56%) | 76.3% (+2.47%)
 | (12) LGDSC-BERT(base) [35] | 91.6% (+6.67%) | 84.5% (+1.90%) | 54.3% (+2.56%) | 76.2% (+2.06%)
 | (13) LCM-BERT(base) [14] | 91.5% (+5.56%) | 84.2% (0.0%) | 54.6% (+3.20%) | 76.4% (+2.88%)
 | (14) EXAM-RoBERTa(base) [36] | 92.0% (+14.89%) | 85.6% (-11.63%) | 57.0% (+1.82%) | 76.3% (+1.65%)
Our Methods | (15) R2-Net-BERT-(base) | 91.6% (+6.67%) | 84.3% (+0.63%) | 54.1% (+2.13%) | 76.2% (+2.06%)
 | (16) R2-Net-RoBERTa-(base) | 91.7% (+11.70%) | 84.5% (-20.15%) | 56.9% (+1.60%) | 76.3% (+1.65%)
 | (17) DELE-BERT-(base) | 91.7% (+7.78%) | 84.8% (+3.80%) | 54.7% (+3.41%) | 76.5% (+3.29%)
 | (18) DELE-RoBERTa-(base) | 92.3% (+18.09%) | 86.7% (-3.10%) | 57.8% (+3.65%) | 76.8% (+3.73%)

VI-A Datasets and Evaluation Method

To evaluate our proposed R2-Net and DELE comprehensively, we select different benchmark datasets: SNLI [25], SICK [63], and SciTail [64] for Natural Language Inference (NLI), Quora Question Pair (Quora) [26] and MSRP [1] datasets for Paraphrase Identification (PI), Yahoo! Answers (Yahoo) for QA topic classification, as well as SST-5 [27] for sentiment classification. These tasks focus on different aspects and exhibit different characteristics of text classification task.

For evaluation, we select Accuracy and Error Rate Comparison as evaluation metrics, the same as most baselines. We note that for each experiment, we repeat the evaluation process 5 times with different seeds and random initializations, and report the best results for our proposed methods and the baselines.

VI-B Model Implementation and Training Details

To obtain the best performance, we tuned the hyper-parameters on the validation set of each dataset and used early stopping to select the best values. BERT-(base) and RoBERTa-(base) are selected as backbones. We note that all baselines use the same experimental settings for a fair comparison. To achieve the best model performance, R2-Net and DELE have different parameter settings on different datasets. Therefore, we report the common hyper-parameters in Table III and summarize them here.

For R2-Net, the kernel sizes of the CNN in local encoding are $d_{k}=\{1,2,3\}$. The hidden size of $MLP_{1}$ is $d_{m}=300$. The margin in Eq. (9) is $margin=0.2$. For the PLMs, we set the learning rate to $10^{-5}$ and use AdamW to fine-tune their parameters. For the remaining parameters, the learning rate is set to $10^{-3}$ and decays during model training; an Adam optimizer with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ is adopted to optimize these parameters.

For DELE, the number of output layers used in the PLMs is $L=2$. The attention size in the mutual interaction module is $d_{a}=100$. The hidden size of $MLP_{2}$ in the projection is $d_{2}=300$. The hidden size of $MLP_{3}$ and $MLP_{4}$ in the classification layer is $d_{3}=200$. For model training, we leverage Adam as the optimizer. Inspired by [65], we use the following schedule to control the learning rate in the $i^{th}$ training batch:

lr=\epsilon(lr_{max}-lr_{min})+lr_{min}, (21)
\epsilon=\begin{cases}0.5\mathrm{cos}\left(\frac{\pi}{\eta C}(i-\eta C+1)\right)+0.5,&if~{}i\leq\eta C,\\ \left(0.5\mathrm{cos}\left(\frac{\pi}{C-\eta C}(i-\eta C)\right)+0.5\right)^{2},&if~{}i>\eta C,\end{cases}

where $lr_{max}=1$ and $lr_{min}=0.000001$. $C$ is the total number of training batches, and $\eta\in[0,1]$ is the percentage of warm-up loops.
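A small sketch of the warm-up/decay coefficient in Eq. (21); the function name and the example warm-up ratio are assumptions.

```python
import math

def lr_coefficient(i, C, eta=0.1):
    """Sketch of the schedule coefficient in Eq. (21); eta=0.1 is an assumed warm-up ratio."""
    warmup = eta * C
    if i <= warmup:
        # warm-up phase: ramps up from ~0 to ~1
        eps = 0.5 * math.cos(math.pi / warmup * (i - warmup + 1)) + 0.5
    else:
        # decay phase: squared cosine decay from ~1 down to 0 at i = C
        eps = (0.5 * math.cos(math.pi / (C - warmup) * (i - warmup)) + 0.5) ** 2
    return eps  # lr = eps * (lr_max - lr_min) + lr_min
```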

VI-C Model Complexity Analysis

In order to better demonstrate the superiority of our proposed methods, we provide an additional computational complexity analysis. Since both of our methods are based on PLMs, we only analyze the extra computational complexity. For our preliminary work R2-Net, the additional components are the CNN-based local encoder, the R2 classification component, and the triplet distance calculation component. For the CNN-based local encoder, we leverage three kernels with sizes $d_{k}=\{1,2,3\}$ and an MLP; the extra parameter size is $14+d_{p}*(6*d_{p})$. For the remaining two components, three different MLPs are used, so the extra parameter size is $(d_{m}*2d_{p})+(d_{p}*4d_{m})+(d_{m}*d_{p})$. Therefore, the total extra parameter size is $(6d_{p}+7d_{m})*d_{p}+14$.

For our proposed DELE, the additional components are the mutual interaction module and the R2 classification layer. For the former, the extra parameter size is $(d_{a}+d_{a}*d_{p}+d_{a}*d_{p})+(L_{s}+L_{s}*d_{p}+L_{s}*d_{p})$. For the latter, the extra parameter size is $d_{4}*4d_{p}+d_{4}$. Thus, the total extra parameter size is $(2d_{a}+2L_{s}+4d_{4})*d_{p}+(d_{a}+L_{s}+d_{4})$. Note that $L_{s}$ is the sentence length, and the values of the other notations can be found in Table III. As for our proposed label encoder, since we leverage the same PLM to encode the fine-grained descriptions as the input encoder does, it does not add additional computational complexity.

As the values in Table III indicate, the extra computational complexity of our proposed methods is acceptable in practice. Moreover, in DELE, we make full use of label information to generate positive semantic representations for input sentences and use in-batch negative sampling to obtain negative semantic representations for each input sentence. Furthermore, to alleviate the computational complexity of pair-wise contrastive learning, we follow [47] and transform the pair-wise calculation into a cross-entropy calculation during implementation, which largely decreases the computational cost of pair-wise contrastive learning.

VI-D Overall Experimental Results

In this section, we give a detailed analysis of the experimental results by reporting accuracy and the error rate comparison with the backbones. Note that the error rate comparison is computed with respect to the corresponding backbone. For example, the $+8.25\%$ reported for R2-Net-BERT-(base) in Table II(16) denotes that it decreases the error rate by $8.25\%$ compared with the backbone BERT-(base). A detailed analysis is reported in the following parts.

VI-D1 Performance on NLI task

Table II reports the results on the NLI task. We can conclude that our proposed R2-Net and DELE achieve highly competitive performance over all datasets. Moreover, DELE-RoBERTa-(base) has the best performance. To achieve such advantages, we first select PLMs as backbones, so that knowledge from large corpora can be accessed. This is one of the reasons that our methods outperform other baselines, especially the PLM-free baselines (Table II(1)-(4)), by a large margin. Second, we develop a novel R2 task to help models fully exploit label information in a one-hot manner. Along this line, R2-Net and DELE can obtain intra-class and inter-class knowledge among input texts with the same or different labels, which helps them achieve better performance than all baselines, including the PLM baselines (Table II(5)-(9)). Moreover, we incorporate fine-grained descriptions from WordNet into label embedding learning, so that the particular label semantics can be better represented. This is another vital reason why DELE achieves the best performance, and it also demonstrates the effectiveness of our proposed description-enhanced label embedding method.

Moreover, incorporating label information (one-hot encoding or label embedding) better ensures model performance when dealing with difficult examples (the Hard test in Table II), in which text pairs with obviously identical words have been removed [17]. By employing the R2 task to exploit labels in a one-hot manner, R2-Net can better mine implicit information from the input text, which leads to inspiring performance. Moreover, label embedding methods can exploit label information better than one-hot encoding methods; therefore, we observe the superiority of label embedding methods. Furthermore, DELE employs fine-grained descriptions to enhance label embedding learning. Thus, it further improves upon label embedding methods and achieves the best performance.

To make a fair comparison, we replace the basic encoders of the label embedding methods (e.g., LEAM, EXAM) with the same PLM backbones as our methods use, and report the results in Table II(10)-(15), in which DELE still performs better. The advantages of DELE lie in the more advanced label encoder and the consideration of a denoising operation for the additional knowledge. For the label encoder, DELE not only learns the embedding during model training so that the learned results are suitable for the current task, but also employs multi-aspect real sentences from WordNet as descriptions to enhance label semantic modelling. Meanwhile, considering that introducing fine-grained descriptions will import unexpected noise, we develop a novel mutual interaction module based on CL that leverages the attention mechanism to select the most relevant parts from the input text and labels simultaneously. Then, with the help of the CL framework, DELE is able to use proper fine-grained knowledge to enhance the understanding of labels and improve model performance.

Figure 5: Results of R2-Net with different kernel sizes and DELE with different $\delta$ and $\mu$ on different datasets.

VI-D2 Performance on PI task

Apart from the NLI task, we also select the PI task to evaluate model performance. The PI task concerns whether two sentences express the same meaning and has broad applications in question answering communities (e.g., https://www.quora.com/). Table IV reports the corresponding results. We observe that R2-Net and DELE still achieve highly competitive performance compared with the other baselines. For one thing, the results demonstrate that although PI is a binary classification task, its labels still contain useful semantics. For another, we can conclude that our proposed R2 task and the label embedding method with fine-grained descriptions are useful for exploiting label information and boosting model performance.

Besides, we observe that almost all models perform better on Quora than on MSRP. One possible reason is that Quora has more data (over 400k sentence pairs in Quora vs. 5,801 sentence pairs in MSRP). Inter-sentence interaction is probably another reason: Lan et al. [66] have observed that the Quora dataset contains many sentence pairs with less complicated interactions (many identical words in the sentence pairs). This may also explain why ALBERT achieves the best performance on the MSRP dataset.

TABLE V: Ablation performance (accuracy) of R2-Net-BERT-(base).
Model | SNLI Full Test | SciTail Test
(1) BERT-(base) | 90.3% | 92.0%
(2) R2-Net (w/o local encoder) | 90.7% | 92.6%
(3) R2-Net (w/o R2 task learning) | 90.5% | 92.3%
(4) R2-Net (w/o triplet loss) | 90.9% | 92.6%
(5) R2-Net (w/ NT-Xent loss) | 91.3% | 93.3%
(6) R2-Net-BERT-(base) | 91.1% | 92.9%

VI-D3 Performance on Sentiment Analysis and QA Topic Classification

For a more thorough evaluation, we further conduct experiments on the SST-5 and Yahoo! Answer datasets, which have more complex and realistic labels (e.g., somewhat positive, Business). Table IV summarizes the results. Similarly, our proposed methods achieve stable and competitive performance. Moreover, different from the previous experiments, the label embedding baselines (i.e., EXAM, LGDSC) do not achieve larger improvements than the PLM baselines. One possible reason is that embedding these complicated labels with coarse-grained information cannot capture sufficient useful information for label exploitation and text classification.

Moreover, compared with the PLM baselines, the improvement of DELE is less obvious. After analyzing the dataset more deeply, we observe that some QA pairs contain more than 512 words as well as irregular textual expressions. Such input noise harms label utilization, because we leverage text representations to guide the denoising of label descriptions. Meanwhile, we manually select no more than three descriptions per label, which may be insufficient for labels with rich semantics. Furthermore, the particular meaning of a label word requires a precise selection of label descriptions, whereas DELE applies its attention mechanism to the label embedding $\bm{E}$, in which the vanilla embedding and the description embedding have already been integrated. Applying the denoising operation at an earlier stage may therefore achieve better performance. Nevertheless, DELE is an early attempt at leveraging fine-grained knowledge to enhance label embedding while denoising the additional knowledge, which also illustrates its advancement.
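To illustrate the earlier-stage denoising alternative mentioned above, the hypothetical sketch below scores each WordNet description against its label's vanilla embedding and down-weights irrelevant ones before the fusion into $\bm{E}$; the function name, the additive fusion, and the temperature value are assumptions for illustration only and are not part of DELE.

import torch
import torch.nn.functional as F

def fuse_with_early_denoising(vanilla_emb, desc_emb, temperature=0.1):
    # vanilla_emb: [num_labels, dim]           learned label embeddings
    # desc_emb:    [num_labels, num_desc, dim] encoded description sentences
    # cosine relevance of each description to its own label
    rel = F.cosine_similarity(desc_emb, vanilla_emb.unsqueeze(1), dim=-1)   # [L, D]
    weight = F.softmax(rel / temperature, dim=-1)                           # [L, D]
    # irrelevant descriptions are suppressed before fusion
    desc_summary = (weight.unsqueeze(-1) * desc_emb).sum(dim=1)             # [L, dim]
    return vanilla_emb + desc_summary                                        # fused matrix E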

Figure 6: Visualization of semantic representations from two representative baselines (BERT-(base) (A) and EXAM (B)), as well as our proposed R2-Net (C) and DELE (D). Note that BERT-(base) is selected as the backbone in this experiment.

VI-D4 Parameter Sensitivity Test

As illustrated in Eq. 4 and Eq. 20, the kernel sizes $d_k$ in R2-Net and the hyperparameters $\{\delta,\mu\}$ in DELE are essential for model performance. To this end, we conduct an additional parameter sensitivity test to verify their impact.

The kernel sizes $d_k$ determine the scales at which the local encoder extracts information from the input text. Results with different kernel size settings are reported in Fig. 5(A). We observe that R2-Net performs better with more kernel sizes, among which $\{1,2,3\}$, $\{1,2,5\}$, and $\{2,3,5\}$ achieve the best performance. Moreover, using $\{1,2,3\}$ as kernel sizes yields the most stable performance. We speculate that these kernel sizes pay more attention to uni-gram, bi-gram, and tri-gram information, which helps enhance the representation learning of PLMs. Meanwhile, $\{2,3,5\}$ achieves similar performance, which also demonstrates the effectiveness of multi-scale kernels.
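As a reference point, the snippet below sketches a multi-kernel CNN local encoder over PLM token representations with the best-performing kernel sizes {1, 2, 3}; the channel number, pooling choice, and class name are illustrative assumptions rather than the exact R2-Net configuration.

import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    # Multi-kernel CNN: kernel sizes 1/2/3 roughly capture uni-, bi-, and tri-gram features.
    def __init__(self, dim, kernel_sizes=(1, 2, 3), channels=128):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv1d(dim, channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, token_hidden):            # token_hidden: [batch, seq_len, dim]
        x = token_hidden.transpose(1, 2)        # Conv1d expects [batch, dim, seq_len]
        # max-pool each n-gram feature map over the sequence and concatenate
        pooled = [conv(x).relu().max(dim=-1).values for conv in self.convs]
        return torch.cat(pooled, dim=-1)        # [batch, channels * len(kernel_sizes)]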

For $\delta$ and $\mu$, when CL or the R2 task is removed (i.e., $\delta=0$ or $\mu=0$), model performance is not comparable with that obtained when CL and the R2 task are enabled (i.e., $\delta\neq 0$ or $\mu\neq 0$), demonstrating the necessity of these two SSL frameworks. Moreover, as the label set becomes larger or more semantically rich, the best values of $\delta$ and $\mu$ also become larger. This phenomenon indicates that our proposed SSL objectives extract more critical information from more diverse and realistic labels and have a bigger impact on model performance, which again proves the necessity of better label utilization methods. Furthermore, we observe that $\mu$ has a bigger impact than $\delta$ when labels are harder, which is somewhat counterintuitive and differs from the other results. The possible reason is the underutilization of label information: limited information and unexpected noise decrease the effectiveness of our proposed CL framework. This finding is consistent with the observation in Section VI-D3.
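For clarity, a minimal sketch of the joint objective assumed in this sensitivity test is shown below, where delta weights the CL term and mu weights the R2 term; the exact formulation is given by Eq. 20, and this code only illustrates its assumed shape and a hypothetical search grid.

def combined_loss(loss_cls, loss_cl, loss_r2, delta=0.5, mu=0.5):
    # assumed shape of the joint objective: setting delta=0 or mu=0 disables that branch
    return loss_cls + delta * loss_cl + mu * loss_r2

# hypothetical grid for the sensitivity test in Fig. 5 (loss values are placeholders)
for delta in (0.0, 0.1, 0.5, 1.0):
    for mu in (0.0, 0.1, 0.5, 1.0):
        total = combined_loss(0.7, 0.4, 0.3, delta, mu)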

TABLE VI: Ablation performance (accuracy) of DELE-BERT-(base).
Model SNLI Full Test SciTail Test
(1) BERT-(base) 90.3% 93.1%
(2) BERT + Atten 90.6% 93.5%
(3) $\bar{\bm{h}}_s$ (w/o mutual interaction) 90.4% 93.2%
(4) $\bar{\bm{h}}_e$ (w/o mutual interaction) 43.7% 65.4%
(5) $\bar{\bm{h}}_s$ (w/o description) 90.8% 93.8%
(6) $\bar{\bm{h}}_e$ (w/o description) 89.3% 90.1%
(7) $\bar{\bm{h}}_s$ (w/o R2 task) 90.3% 92.9%
(8) $\bar{\bm{h}}_e$ (w/o R2 task) 88.5% 90.4%
(9) DELE-BERT-(base) 91.3% 94.1%

VI-D5 Ablation Study

The overall experiments have proved the superiority of our proposed methods. However, it is still unclear which parts of R2-Net and DELE play the most important roles in label utilization and performance improvement. Therefore, we perform an ablation study for a comprehensive analysis. The corresponding results are reported in Table V and Table VI. Note that we select BERT-(base) as the backbone to compare the importance of each part; leveraging RoBERTa-(base) as the backbone yields similar results.

(a) For R2-Net, we conduct experiments on the CNN-based local encoder, the R2 classification task, and the loss function. Results are summarized in Table V. We observe varying degrees of performance decline: the R2 task has the biggest impact, whereas the triplet loss has a relatively small impact on model performance. Replacing the triplet loss with a contrastive loss slightly improves performance. These observations indicate that the R2 task is the key component for exploiting relation information, and that a stronger contrastive loss can further improve model performance.
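To make rows (4)-(6) of Table V concrete, the snippet below contrasts a margin-based triplet loss with an NT-Xent-style loss that contrasts one positive against a set of negatives; the sampling strategy, batch shapes, and temperature are illustrative assumptions, not the exact R2-Net setup.

import torch
import torch.nn.functional as F

def triplet_loss(anchor, positive, negative, margin=1.0):
    # margin-based triplet loss, as used in the full R2-Net (Table V, row 6)
    return F.triplet_margin_loss(anchor, positive, negative, margin=margin)

def nt_xent_loss(anchor, positive, negatives, temperature=0.5):
    # NT-Xent-style variant corresponding to row (5): one positive vs. K negatives
    anchor = F.normalize(anchor, dim=-1)                                # [batch, dim]
    candidates = torch.cat([positive.unsqueeze(1), negatives], dim=1)   # [batch, 1+K, dim]
    candidates = F.normalize(candidates, dim=-1)
    logits = torch.einsum('bd,bkd->bk', anchor, candidates) / temperature
    targets = torch.zeros(anchor.size(0), dtype=torch.long)             # positive sits at index 0
    return F.cross_entropy(logits, targets)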

(b) For DELE, we focus on the label embedding module, the mutual interaction module, and the R2 task. Results are reported in Table VI. Here, “BERT+Atten” denotes leveraging vanilla label embedding to select important parts of the text without CL, and “$\bar{\bm{h}}_s$ (w/o mutual interaction)” indicates leveraging $\bar{\bm{h}}_s$ for classification with no interaction between the input text and labels. The other ablation settings are analogous. From the results, we make the following observations. First of all, results (3-4) demonstrate the necessity of mutual interaction between the input text and labels, especially result (4): when there is no interaction, label semantics alone cannot support the final classification at all. Second, results (5-6) prove that utilizing fine-grained descriptions is helpful for semantic understanding and performance improvement. Moreover, results (7-8) show that the R2 task provides additional signals for label utilization, which demonstrates the effectiveness and necessity of better label utilization methods. Last but not least, the comparison between $\bar{\bm{h}}_s$ and $\bar{\bm{h}}_e$ under the same setting (e.g., w/o description) indicates that CL constrains the representations learned from different views to have similar semantics. However, $\bar{\bm{h}}_e$ cannot replace the input text representation; it only serves as an enhancement.

VI-D6 Case Study of R2-Net and DELE

To provide some intuitive examples explaining the superiority of our models, we sample 1,500 sentence pairs from the SNLI test set and feed them into the BERT-(base) baseline, the EXAM-BERT-(base) baseline, R2-Net, and DELE to generate representations. Then, we use t-SNE [67] to visualize them with the same parameter settings. Fig. 6 reports the corresponding results. Comparing the four figures, we observe that the representations generated by DELE have smaller intra-class distances and more distinguishable inter-class distances. In other words, by considering fine-grained knowledge and mutual interactions, DELE is able to pull together samples with the same label and push apart samples with different labels, which benefits text classification. These observations not only explain why DELE achieves impressive performance, but also demonstrate the value of more detailed and cleaner label descriptions, as well as a more comprehensive analysis of the interaction between input text and labels, both of which are helpful for exploiting label semantics and improving classification performance.
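For reproducibility, the following is a minimal sketch of how such a visualization can be produced with scikit-learn's t-SNE; the perplexity and other parameter values are illustrative and not necessarily the settings used for Fig. 6.

import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def visualize(representations, labels, title):
    # representations: [n, dim] numpy array of sentence-pair embeddings; labels: [n] class ids
    points = TSNE(n_components=2, perplexity=30, init='pca',
                  random_state=0).fit_transform(representations)
    plt.scatter(points[:, 0], points[:, 1], c=labels, s=5, cmap='tab10')
    plt.title(title)
    plt.show()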

VII Discussion

Here, we discuss the broader impact and future directions of this study. To fully exploit label information for text classification, we developed R2-Net and DELE from the one-hot encoding perspective and the label embedding perspective, respectively. Beyond the progress achieved, our work can also provide inspiration for related research. For example, in eXtreme Multi-label Classification (XMC) tasks, learning the label embedding matrix is challenging, and inappropriate embedding methods may perform worse than sparse one-vs-all and partitioning approaches. Our proposed methods offer a novel strategy to generate accurate label embeddings, which may inspire related research, such as using additional descriptions to distinguish the similarity relations among hundreds of labels.

Meanwhile, our proposed methods still leave room for improvement. For example, when selecting fine-grained descriptions as additional knowledge, we rely on a manual selection process, which is insufficient for complex labels and impractical for extreme multi-label classification. Better knowledge selection methods or additional forms of knowledge are therefore needed to further boost semantic representation learning. Regarding additional information, we only consider label descriptions, whereas the relations and hierarchical structures among labels are also beneficial for label semantic modelling. Thus, in the future, we plan to exploit label relations in XMC and use Graph Neural Networks (GNNs) to obtain relation-aware label embeddings. We can then transfer our proposed methods to XMC by combining relation-aware label embeddings with our description-enhanced label embeddings to further enhance model performance.

VIII Conclusion and Future Work

In this paper, we argued that current text classification methods are deficient in label utilization and ignore the guidance and semantic information contained in labels. We then presented a study on exploiting labels from both the one-hot encoding and label embedding perspectives. Specifically, inspired by the self-supervised learning methods used in PLMs, we developed a novel Relation of Relation (R2) task to mine the potential of labels from the one-hot encoding perspective. Based on this self-supervised task, we designed a simple but effective method named R2-Net for text classification, in which a triplet loss is employed to constrain R2-Net for better label relation analysis. One step further, since the one-hot encoding manner still has weaknesses in exploiting label information, we focused on label embedding and proposed DELE for better label utilization. In DELE, fine-grained descriptions from WordNet are adopted to enhance label embedding learning, and a mutual interaction module is designed to denoise the additional knowledge and select the most relevant parts from the input text and labels simultaneously. Finally, extensive experiments on multiple benchmark datasets demonstrate the superiority of R2-Net and DELE, as well as the usefulness of our initial attempt at exploiting fine-grained knowledge for label embedding.

Acknowledgment

This research was partially supported by grants from the Young Scientists Fund of the National Natural Science Foundation of China (No. 62006066), the National Natural Science Foundation of China (No. 61727809, 61922073, and 72188101), the Fundamental Research Funds for the Central Universities (JZ2021HGTB0075), the Joint Funds of the National Natural Science Foundation of China (U22A2094), and the Open Project Program of the National Laboratory of Pattern Recognition.

References

  • [1] W. B. Dolan and C. Brockett, “Automatically constructing a corpus of sentential paraphrases,” in IWP, 2005.
  • [2] S. Kim, J.-H. Hong, I. Kang, and N. Kwak, “Semantic sentence matching with densely-connected recurrent and co-attentive information,” in AAAI, 2019, pp. 6586–6593.
  • [3] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading wikipedia to answer open-domain questions,” in ACL, 2017, pp. 1870–1879.
  • [4] S. F. Yilmaz, E. B. Kaynak, A. Koç, H. Dibeklioğlu, and S. S. Kozat, “Multi-label sentiment analysis on 100 languages with dynamic weighting for label imbalance,” IEEE TNNLS, vol. 34, no. 1, pp. 331–343, 2023.
  • [5] F. Huang, X. Li, C. Yuan, S. Zhang, J. Zhang, and S. Qiao, “Attention-emotion-enhanced convolutional lstm for sentiment analysis,” IEEE TNNLS, vol. 33, no. 9, pp. 4332–4345, 2022.
  • [6] L. Zhu, W. Li, Y. Shi, and K. Guo, “Sentivec: Learning sentiment-context vector via kernel optimization function for sentiment analysis,” IEEE TNNLS, vol. 32, no. 6, pp. 2561–2572, 2021.
  • [7] T.-Y. Liu, “Learning to rank for information retrieval,” in Found. Trends Inf. Retr., vol. 3, 2009, pp. 225–331.
  • [8] Q. Liu, Z. Huang, Z. Huang, C. Liu, E. Chen, Y. Su, and G. Hu, “Finding similar exercises in online education systems,” in SIGKDD, 2018, pp. 1821–1830.
  • [9] I. V. Serban, A. Sordoni, Y. Bengio, A. C. Courville, and J. Pineau, “Building end-to-end dialogue systems using generative hierarchical neural network models,” in AAAI, 2016, pp. 3776–3783.
  • [10] K. Zhang, G. Lv, L. Wang, L. Wu, E. Chen, F. Wu, and X. Xie, “Drr-net: Dynamic re-read network for sentence semantic matching,” in AAAI, vol. 33, 2019, pp. 7442–7449.
  • [11] K. Zhang, G. Lv, L. Wu, E. Chen, Q. Liu, and M. Wang, “Ladra-net: Locally aware dynamic reread attention net for sentence semantic matching,” IEEE TNNLS, pp. 1–14, 2021.
  • [12] Z. Tan, J. Chen, Q. Kang, M. Zhou, A. Abusorrah, and K. Sedraoui, “Dynamic embedding projection-gated convolutional neural networks for text classification,” IEEE TNNLS, vol. 33, no. 3, pp. 973–982, 2022.
  • [13] Z. Kun, L. Guangyi, W. Le, C. Enhong, L. Qi, and W. Han, “Image-enhanced multi-level sentence representation net for natural language inference,” in IEEE ICDM, 2018, pp. 747–756.
  • [14] B. Guo, S. Han, X. Han, H. Huang, and T. Lu, “Label confusion learning to enhance text classification models,” Arxiv, vol. abs/2012.04987, 2020.
  • [15] Y. Xiong, Y. Feng, H. Wu, H. Kamigaito, and M. Okumura, “Fusing label embedding into BERT: An efficient improvement for text classification,” in Findings of ACL-IJCNLP, 2021, pp. 1743–1750.
  • [16] H. Zhang, L. Xiao, W. Chen, Y. Wang, and Y. Jin, “Multi-task label embedding for text classification,” in EMNLP, 2018, pp. 4545–4553.
  • [17] S. Gururangan, S. Swayamdipta, O. Levy, R. Schwartz, S. R. Bowman, and N. A. Smith, “Annotation artifacts in natural language inference data,” in NAACL-HLT, 2018, pp. 107–112.
  • [18] L. Xiao, X. Huang, B. Chen, and L. Jing, “Label-specific document representation for multi-label text classification,” in EMNLP-IJCNLP, 2019, pp. 466–475.
  • [19] R. Zhang, Y.-S. Wang, Y. Yang, D. Yu, T. Vu, and L. Lei, “Long-tailed extreme multi-label text classification with generated pseudo label descriptions,” Arxiv, vol. abs/2204.00958, 2022.
  • [20] K. Zhang, L. Wu, G. Lv, M. Wang, E. Chen, and S. Ruan, “Making the relation matters: Relation of relation learning network for sentence semantic matching,” in AAAI, 2021, pp. 14 411–14 419.
  • [21] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “Bert: Pre-training of deep bidirectional transformers for language understanding,” Arxiv, vol. abs/1810.04805, 2018.
  • [22] Y. Kim, “Convolutional neural networks for sentence classification,” Arxiv, vol. abs/1408.5882, 2014.
  • [23] J. Chung, C. Gulcehre, K. Cho, and Y. Bengio, “Empirical evaluation of gated recurrent neural networks on sequence modeling,” Arxiv, vol. abs/1412.3555, 2014.
  • [24] A. P. Parikh, O. Täckström, D. Das, and J. Uszkoreit, “A decomposable attention model for natural language inference,” in EMNLP, 2016, pp. 2249–2255.
  • [25] S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, “A large annotated corpus for learning natural language inference,” in EMNLP, 2015, p. 632–642.
  • [26] S. Iyer, N. Dandekar, and K. Csernai, “First quora dataset release: Question pairs,” 2017.
  • [27] R. Socher, A. Perelygin, J. Wu, J. Chuang, C. D. Manning, A. Y. Ng, and C. Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in EMNLP, 2013, pp. 1631–1642.
  • [28] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “Roberta: A robustly optimized bert pretraining approach,” Arxiv, vol. abs/1907.11692, 2019.
  • [29] W. Xu and Y. Tan, “Semisupervised text classification by variational autoencoder,” IEEE TNNLS, vol. 31, no. 1, pp. 295–308, 2020.
  • [30] D. Zeng, K. Liu, S. Lai, G. Zhou, and J. Zhao, “Relation classification via convolutional deep neural network,” in COLING, 2014, pp. 2335–2344.
  • [31] X. Liu, L. Mou, H. Cui, Z. Lu, and S. Song, “Finding decision jumps in text classification,” Neurocomputing, vol. 371, pp. 177–187, 2020.
  • [32] L. Cai, Y. Song, T. Liu, and K. Zhang, “A hybrid bert model that incorporates label semantics via adjustive attention for multi-label text classification,” IEEE Access, vol. 8, pp. 152 183–152 192, 2020.
  • [33] A. Mueller, J. Krone, S. Romeo, S. Mansour, E. Mansimov, Y. Zhang, and D. Roth, “Label semantic aware pre-training for few-shot text classification,” Arxiv, vol. abs/2204.07128, 2022.
  • [34] M. Liu, L. Liu, J. Cao, and Q. Du, “Co-attention network with label embedding for text classification,” Neurocomputing, vol. 471, pp. 61–69, 2022.
  • [35] X. Zhu, Z. Peng, J. Guo, and S. Dietze, “Generating effective label description for label-aware sentiment classification,” Expert Systems with Applications, p. 119194, 2022.
  • [36] C. Du, Z. Chen, F. Feng, L. Zhu, T. Gan, and L. Nie, “Explicit interaction model towards text classification,” in AAAI, vol. 33, 2019, pp. 6359–6366.
  • [37] K. Rivas Rojas, G. Bustamante, A. Oncevay, and M. A. Sobrevilla Cabezudo, “Efficient strategies for hierarchical text classification: External knowledge and auxiliary tasks,” in ACL, 2020, pp. 2252–2257.
  • [38] H. Wu, S. Qin, R. Nie, J. Cao, and S. Gorbachev, “Effective collaborative representation learning for multilabel text categorization,” IEEE TNNLS, vol. 33, no. 10, pp. 5200–5214, 2022.
  • [39] X. Wang, L. Zhao, B. Liu, T. Chen, F. Zhang, and D. Wang, “Concept-based label embedding via dynamic routing for hierarchical text classification,” in ACL-IJCNLP, 2021, pp. 5010–5019.
  • [40] N. Wang, W. Zhou, and H. Li, “Contrastive transformation for self-supervised correspondence learning,” in AAAI, 2021, pp. 10 174–10 182.
  • [41] Z. Wu, S. Wang, J. Gu, M. Khabsa, F. Sun, and H. Ma, “Clear: Contrastive learning for sentence representation,” Arxiv, vol. abs/2012.15466, 2020.
  • [42] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” in NeurIPS, 2020, pp. 1–13.
  • [43] T. T. Cai, J. Frankle, D. J. Schwab, and A. S. Morcos, “Are all negatives created equal in contrastive instance discrimination?” Arxiv, vol. abs/2010.06682, 2020.
  • [44] T. Chen, S. Kornblith, M. Norouzi, and G. E. Hinton, “A simple framework for contrastive learning of visual representations,” Arxiv, vol. abs/2002.05709, 2020.
  • [45] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, “Momentum contrast for unsupervised visual representation learning,” Arxiv, vol. abs/1911.05722, 2020.
  • [46] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” Arxiv, vol. abs/2111.06377, 2021.
  • [47] T. Gao, X. Yao, and D. Chen, “Simcse: Simple contrastive learning of sentence embeddings,” in EMNLP, 2021, pp. 6894–6910.
  • [48] Y. Li, P. Hu, Z. Liu, D. Peng, J. T. Zhou, and X. Peng, “Contrastive clustering,” in AAAI, 2021, pp. 8547–8555.
  • [49] M. Yang, Y. Li, Z. Huang, Z. Liu, P. Hu, and X. Peng, “Partially view-aligned representation learning with noise-robust contrastive loss,” in CVPR, 2021, pp. 1134–1143.
  • [50] M. Yang, Y. Li, P. Hu, J. Bai, J. C. Lv, and X. Peng, “Robust multi-view clustering with incomplete information,” IEEE TPAMI, 2022.
  • [51] M. E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, and L. Zettlemoyer, “Deep contextualized word representations,” Arxiv, vol. abs/1802.05365, 2018.
  • [52] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in ACL, 2016, p. 1715–1725.
  • [53] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, and D. Inkpen, “Enhanced lstm for natural language inference,” in ACL, 2017, pp. 1657–1668.
  • [54] K. Zhang, E. Chen, Q. Liu, C. Liu, and G. Lv, “A context-enriched neural network method for recognizing lexical entailment.” in AAAI, 2017, pp. 3127–3133.
  • [55] L. Mou, R. Men, G. Li, Y. Xu, L. Zhang, R. Yan, and Z. Jin, “Natural language inference by tree-based convolution and heuristic matching,” in ACL, 2016, pp. 130–136.
  • [56] J. Weeds, D. Clarke, J. Reffin, D. Weir, and B. Keller, “Learning to distinguish hypernyms and co-hyponyms,” in COLING, 2014, pp. 2249–2259.
  • [57] F. Schroff, D. Kalenichenko, and J. Philbin, “Facenet: A unified embedding for face recognition and clustering,” in CVPR, 2015, pp. 815–823.
  • [58] T. Chen, S. Kornblith, M. Norouzi, and G. Hinton, “A simple framework for contrastive learning of visual representations,” in ICML, 2020, pp. 1597–1607.
  • [59] Y. Tay, A. T. Luu, and S. C. Hui, “Co-stack residual affinity networks with multi-level attention refinement for matching text sequences,” in ACL, 2018, pp. 4492–4502.
  • [60] R. Yang, J. Zhang, X. Gao, F. Ji, and H. Chen, “Simple and effective text matching with richer alignment features,” in ACL, 2019, pp. 4699–4709.
  • [61] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “Albert: A lite bert for self-supervised learning of language representations,” in ICLR, 2020, pp. 1–13.
  • [62] G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin, “Joint embedding of words and labels for text classification,” in ACL, 2018, pp. 2321–2331.
  • [63] M. Marelli, L. Bentivogli, M. Baroni, R. Bernardi, S. Menini, and R. Zamparelli, “Semeval-2014 task 1: Evaluation of compositional distributional semantic models on full sentences through semantic relatedness and textual entailment,” in SemEval, 2014, pp. 1–8.
  • [64] T. Khot, A. Sabharwal, and P. Clark, “Scitail: A textual entailment dataset from science question answering,” in AAAI, 2018, pp. 5189–5197.
  • [65] J. Howard and S. Ruder, “Universal language model fine-tuning for text classification,” in ACL, 2018, p. 328–339.
  • [66] W. Lan and W. Xu, “Neural network models for paraphrase identification, semantic textual similarity, natural language inference, and question answering,” in COLING, 2018, pp. 3890–3902.
  • [67] L. v. d. Maaten and G. Hinton, “Visualizing data using t-sne,” Journal of machine learning research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[Uncaptioned image] Kun Zhang received the PhD degree in computer science and technology from University of Science and Technology of China, Hefei, China, in 2019. He is currently a faculty member with the Hefei University of Technology (HFUT), China. His research interests include Natural Language Understanding and Recommender Systems. He has published several papers in refereed journals and conferences, such as the IEEE Transactions on Systems, Man, and Cybernetics: Systems, IEEE Transactions on Neural Networks and Learning Systems, the ACM Transactions on Knowledge Discovery from Data, AAAI, KDD, ACL, and ICDM. He received the KDD 2018 Best Student Paper Award.
[Uncaptioned image] Le Wu received the PhD degree in computer science from the University of Science and Technology of China (USTC). She is a professor with Hefei University of Technology (HFUT), China. Her general area of research is data mining, recommender systems, and social network analysis. She has published several papers in refereed journals and conferences, such as the IEEE Transactions on Knowledge and Data Engineering, the ACM Transactions on Intelligent Systems and Technology, AAAI, IJCAI, KDD, SDM, and ICDM. She is the recipient of the Best of SDM 2015 Award.
[Uncaptioned image] Guangyi Lv received the PhD degree in computer science and technology from University of Science and Technology of China, Hefei, China, in 2019. He is currently an Advisory Researcher at Lenovo Research. His major research interests include deep learning, natural language processing, and adaptive AI. He has published several papers in refereed journals and conferences, such as the IEEE Transactions on Systems, Man, and Cybernetics: Systems, the IEEE Transactions on Big Data, AAAI, IJCAI, and ICDM.
[Uncaptioned image] Enhong Chen (SM’07) received the PhD degree from USTC. He is a professor and vice dean of the School of Computer Science, USTC. His general area of research includes data mining and machine learning, social network analysis, and recommender systems. He has published more than 100 papers in refereed conferences and journals, including the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Mobile Computing, KDD, ICDM, NIPS, and CIKM. He was on program committees of numerous conferences including KDD, ICDM, and SDM. His research is supported by the National Science Foundation for Distinguished Young Scholars of China. He is a senior member of the IEEE.
[Uncaptioned image] Shulan Ruan received the B.S. degree from Hunan University, Changsha, China, in 2018. He is currently working toward the Ph.D. degree with the School of Computer Science and Technology, University of Science and Technology of China, Hefei, China. His research interests include sentiment analysis, computer vision, and natural language processing. He has published several papers in AAAI and ICME.
[Uncaptioned image] Jing Liu received the B.E. and M.S. degrees from Shandong University, Shandong, in 2001 and 2004, respectively, and the Ph.D. degree from the Institute of Automation, Chinese Academy of Sciences, Beijing, in 2008. She is a Professor with the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences. Her current research interests include deep learning, image content analysis and classification, multimedia, understanding and retrieval.
[Uncaptioned image] Zhiqiang Zhang is currently a Staff Engineer at Ant Group. His research interests mainly focus on graph machine learning. He has led a team to build an industrial graph machine learning system, AGL, at Ant Group. He has published more than 30 papers in top-tier machine learning and data mining conferences, including NeurIPS, VLDB, SIGKDD, and AAAI.
[Uncaptioned image] Jun Zhou is currently a Senior Staff Engineer at Ant Group. His research mainly focuses on machine learning and data mining. He has participated in the development of several distributed systems and machine learning platforms at Alibaba and Ant Group, such as Apsaras (Distributed Operating System), MaxCompute (Big Data Platform), and KunPeng (Parameter Server). He has published more than 40 papers in top-tier machine learning and data mining conferences, including VLDB, WWW, NeurIPS, and AAAI.
[Uncaptioned image] Meng Wang (SM’17) received the BE and PhD degrees from USTC, in 2003 and 2008, respectively. He is a professor with HFUT. His current research interests include multimedia content analysis, computer vision, and pattern recognition. He has authored more than 200 book chapters, journal, and conference papers in these areas. He is the recipient of the ACM SIGMM Rising Star Award 2014. He is an associate editor of the IEEE Transactions on Knowledge and Data Engineering, the IEEE Transactions on Circuits and Systems for Video Technology, and the IEEE Transactions on Neural Networks and Learning Systems. He is an IEEE Fellow.