Improve Text Classification Accuracy with Intent Information
Abstract
Text classification, a core component of task-oriented dialogue systems, attracts continuous attention from both academia and industry, and tremendous progress has been made. However, existing methods do not consider the use of label information, which may weaken the performance of text classification systems in some token-aware scenarios. To address this problem, in this paper we introduce label information, in the form of label embeddings, into the text classification task and achieve remarkable performance on benchmark datasets.
Index Terms:
Text Classification, Label Information, Label Embedding, Pre-trained Language Model.
I Introduction
In recent years, goal-oriented dialogue systems have been widely applied in intelligent voice assistants, e.g., Apple Siri and Amazon Alexa, where intent classification technology plays a crucial part. Given an input utterance in natural language, the intent classification module aims to detect the user's intent [10, 11, 23, 14]. Previous works have been proposed to better understand the semantics of the utterance [16, 15, 3, 4]. A simple example of intent classification is shown in Figure 1.
Despite the impressive results, most existing methods focus only on the semantic understanding of utterances and are trained directly on the input utterances alone. Because the design and training process of these models are noise-agnostic, it is difficult for a model to transfer the knowledge learned from a fixed and limited training set to unknown open-domain user inputs, whose data distribution may be completely different. This means that a model well trained on an ideal dataset may perform poorly on open-domain user inputs, indicating that current models suffer from poor robustness. In short, we argue that for the intent classification task, modeling a given dataset based solely on its sentences is harmful to model robustness. Intuitively, learning from label information at the same time should improve the performance and robustness of the model compared to sentence-only training. Thus, it is reasonable to combine label information and text information when modeling the text classification task.
Conventional goal-oriented dialogue systems mainly train the intent classification model with a cross-entropy loss, treating the label as a one-hot vector. However, in real application scenarios, the input sentence may contain many errors, such as insertion, deletion, and substitution errors. In this case, the overall sentence representation is difficult to understand and easily leads to wrong predictions.
In addition, existing text classification approaches only consider utterances at a coarse granularity, which reduces the model's ability to explore the relationship between token-level information and label information. For example, in Figure 1, if the token "When" appears in the input sentence, the "NUM" intent is likely to be highly correlated with it, while the "LOC" intent is not.
Intent | Text |
---|---|
NUM | When was the first liver transplant |
LOC | What is the largest city in the world |
ENTY | What do you call a newborn kangaroo |
DESC | What does cc in engines mean |
To address the above issues, in this paper we propose to jointly model label information and text information for the text classification task. Specifically, we consider the goal-oriented dialogue system scenario, where the label is an intent.
II Related Work
II-A Text Classification
Text classification is a common task in natural language processing (NLP), where the goal is to assign a label or category to a given piece of text. It is often used in applications such as sentiment analysis, where the goal is to predict whether a piece of text is positive or negative, or topic classification, where the goal is to identify the topic of a text. In related works, text classification has been extensively studied and many different approaches have been proposed. Some common approaches include using bag-of-words models, which represent the text as a collection of individual words, and using word embedding models, which capture the meaning and context of words in a continuous numerical space. More recently, transformer models like BERT have been shown to achieve remarkable performance on text classification tasks. Overall, text classification is an important and active area of research in NLP, with many different approaches and techniques being developed and studied.
II-B Pre-trained Language Model for Text Classification
Due to the powerful text representations provided by BERT-like Pre-trained Language Models (PLMs) [7, 19, 8, 12], previous works have adopted BERT-like PLMs for the text classification task and achieved remarkable success [22, 6]. BERT stands for "Bidirectional Encoder Representations from Transformers." BERT is a transformer model, a neural network architecture that uses self-attention mechanisms to process input data. BERT is trained to predict missing words in a sentence, which allows it to understand the context of words in a sentence and to use that understanding to perform various NLP tasks such as sentiment analysis and question answering.
II-C Label Embedding
[22] proposed to jointly train label embeddings and text embeddings for the text classification task. However, their work neither considers the problem of noisy text nor exploits the ability of powerful PLMs. Instead, they inject label information into the text encoding process through a carefully designed merging module. Recent work [5] explored the effect of label information by simultaneously learning the features of input samples and the parameters of classifiers in the same space, and showed that label information benefits the intent classification task.
III Method
In this section, we will describe the proposed framework in detail. Figure 2 shows the overall architecture of the model.
III-A Label-Sentence Co-Attention Model
III-A1 Text Encoder
Our sentence encoder is based on the pre-trained language model RoBERTa. RoBERTa [19] stands for Robustly Optimized BERT Pre-training Approach. Given a text input $X = (x_1, \dots, x_n)$ with $n$ tokens, RoBERTa encodes the text and outputs a text representation $H \in \mathbb{R}^{n \times d}$, where $d$ is the embedding dimension.
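To make the encoding step concrete, the following is a minimal sketch of the text encoder using the HuggingFace Transformers library with a roberta-base checkpoint; the exact checkpoint, the example utterance, and the pooling choice are illustrative assumptions rather than the paper's exact implementation.

```python
# Minimal sketch of the text encoder (HuggingFace Transformers, roberta-base).
import torch
from transformers import RobertaTokenizer, RobertaModel

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
text_encoder = RobertaModel.from_pretrained("roberta-base")

utterance = "When was the first liver transplant"
inputs = tokenizer(utterance, return_tensors="pt")

with torch.no_grad():
    outputs = text_encoder(**inputs)

# Token-level representation H of shape (1, n, d); d = 768 for roberta-base.
token_reps = outputs.last_hidden_state
# A simple sentence-level representation: the <s> (CLS-like) token.
sentence_rep = token_reps[:, 0, :]   # shape (1, d)
```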
III-A2 Label Encoder
The label encoder encodes the label information: each label, in text format, is fed to RoBERTa to produce a learnable label embedding. Note that we share the parameters of the text encoder and the label encoder.
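A sketch of the label encoder, continuing the snippet above and reusing the same tokenizer and RoBERTa encoder so that parameters are shared; the label names and the choice of the <s> token as the label embedding are illustrative assumptions.

```python
# Sketch of the label encoder, reusing the tokenizer and RoBERTa encoder from
# the text-encoder sketch above (parameters are shared between the two).
labels = ["NUM", "LOC", "ENTY", "DESC"]   # illustrative TREC-style label set

label_inputs = tokenizer(labels, return_tensors="pt", padding=True)
with torch.no_grad():
    label_outputs = text_encoder(**label_inputs)

# One embedding per label, taken from the <s> token: shape (num_labels, d).
label_embeddings = label_outputs.last_hidden_state[:, 0, :]
```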
III-A3 Text-Label Fusion Layer
There are many ways to fuse two embeddings in NLP. One common method is to concatenate the two vectors, creating a single vector that contains both the text and label information; this vector can then be fed into a neural network for further processing. Another method is to use a weighted average of the two vectors, where the weights reflect the relative importance of the text and label information, allowing the model to give more or less emphasis to each type of information depending on the task at hand. Inspired by CLIP [20], we fuse the text information and label information by applying a dot product between the output representations of the text encoder and the label encoder.
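Continuing the sketches above, the following is a hedged illustration of CLIP-style dot-product fusion; normalizing both representations (making the dot product a cosine similarity) is an assumption borrowed from CLIP, not necessarily the paper's exact choice.

```python
# CLIP-style dot-product fusion, continuing the sketches above. Normalizing
# both sides (cosine similarity) is an assumption borrowed from CLIP.
import torch.nn.functional as F

text_vec = F.normalize(sentence_rep, dim=-1)        # (1, d)
label_mat = F.normalize(label_embeddings, dim=-1)   # (num_labels, d)

# One score per label: higher means the utterance matches that intent better.
logits = text_vec @ label_mat.T                     # (1, num_labels)
```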
III-B Classifier
The preprocessed inputs are fed into the classifier, which processes the data and produces an output prediction. In the case of a text classification task, the output prediction is the predicted label or category for the input text.
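As a sketch continuing the fusion snippet, the fused scores can be used directly as classification logits, with the prediction being the highest-scoring label; this is an illustrative head, not necessarily the exact one used in the paper.

```python
# The fused scores act as classification logits; the prediction is simply the
# highest-scoring label (a sketch, not necessarily the exact head used).
probs = logits.softmax(dim=-1)                  # (1, num_labels)
predicted_intent = labels[probs.argmax(dim=-1).item()]
# With a fine-tuned encoder, this should be "NUM" for the example utterance.
```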
III-C Training Objective
In this work, cross-entropy loss is used to optimize the predicted label distribution toward the ground-truth label. It can be defined as $\mathcal{L} = -\sum_{i=1}^{N} \sum_{c=1}^{C} y_{i,c} \log \hat{y}_{i,c}$, where $N$ is the number of training samples, $C$ is the number of intent classes, $y_{i,c}$ is the one-hot ground-truth label, and $\hat{y}_{i,c}$ is the predicted probability.
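A minimal PyTorch sketch of this objective, assuming the fused scores are used as logits; the batch values below are placeholders for illustration only.

```python
# Cross-entropy objective in PyTorch; the batch below uses placeholder values.
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

batch_logits = torch.randn(4, 6, requires_grad=True)  # (batch_size, num_labels)
gold_intents = torch.tensor([0, 3, 2, 5])             # ground-truth intent ids
loss = criterion(batch_logits, gold_intents)
loss.backward()
```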
IV Experiment
IV-A Datasets
To evaluate the effectiveness of our proposed method, we conduct experiments on the benchmark datasets TREC and ATIS. Details of the datasets are shown in Table II. The datasets used in our paper follow the same format and partition as in [2]. Intent detection accuracy is used as the evaluation metric in the experiments.
Dataset | #Class | Avg. Length | Train | Test |
---|---|---|---|---|
TREC | 6 | 8.89 | 5,452 | 500 |
ATIS | 22 | 11.14 | 4,978 | 893 |
TREC6
The TREC question classification dataset [17] consists of open-domain, fact-based questions, each labeled with one of 6 coarse-grained question types such as NUM, LOC, ENTY, and DESC (see Figure 1 for examples).
ATIS
The Airline Travel Information System (ATIS) dataset is a benchmark commonly used to evaluate the performance of different models on natural language understanding (NLU) tasks. The ATIS dataset contains a large collection of sentences and questions related to flight reservations, along with the correct intent for each sentence. For example, a sentence in the ATIS dataset might be "I would like to book a flight from New York to Los Angeles" with the corresponding intent "book flight."
IV-B Experimental Settings
We use RoBERTa-base [19] as the text encoder for all experiments. We follow [2] for the data split and preprocessing. To give the PLM a better initialization, we take the pre-trained RoBERTa checkpoint from [2] as the checkpoint of both our text encoder and label encoder for all experiments. The training batch size is selected from {32, 64}. For each experimental setting, we train the model for 10 epochs.
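An illustrative, self-contained training-loop skeleton reflecting the reported settings (batch size selected from {32, 64}, 10 epochs); the placeholder linear model, random data, optimizer choice, and learning rate are assumptions, not the paper's exact configuration.

```python
# Self-contained training-loop skeleton reflecting the reported settings
# (batch size from {32, 64}, 10 epochs). The linear model, random data, and
# learning rate are placeholders, not the paper's configuration.
import torch
from torch import nn
from torch.optim import AdamW
from torch.utils.data import DataLoader, TensorDataset

model = nn.Linear(768, 6)                              # placeholder classifier
dataset = TensorDataset(torch.randn(64, 768),          # placeholder features
                        torch.randint(0, 6, (64,)))    # placeholder intent ids
loader = DataLoader(dataset, batch_size=32, shuffle=True)  # 32 or 64
optimizer = AdamW(model.parameters(), lr=2e-5)             # assumed lr
criterion = nn.CrossEntropyLoss()

for epoch in range(10):                                # 10 epochs
    for features, gold in loader:
        optimizer.zero_grad()
        loss = criterion(model(features), gold)
        loss.backward()
        optimizer.step()
```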
IV-C Results
Label Embeddings | Fusion Methods | TREC6 | ATIS |
---|---|---|---|
No | No | 84.5 | 94.3 |
Yes | Add | 84.7 | 94.6 |
Yes | Dot Product | 85.3 | 95.1 |
Table III shows the main results of the experiments. After adding label information, the fusion models perform better than the model without explicit label information. This suggests that incorporating label information is beneficial for text classification tasks, as it helps the model better understand the context and meaning of the text and make more accurate predictions. These results are consistent with previous studies in natural language processing, which have shown that incorporating label information can improve the performance of text classification models.
V Conclusion
The label information introduced by this method enables the model to perform better text classification. Incorporating label information is a common approach in natural language processing (NLP) and can be very effective at improving the accuracy of text classification models. With label information, the model is able to better understand the context and meaning of the text, which helps it make more accurate predictions. Additionally, by explicitly modeling the interaction between the label and the text, the model can learn more about token-level information, such as individual words and their relationships to one another, which can further improve its performance on the text classification task. Overall, incorporating label information into a text classification model is a powerful technique for improving its performance and enhancing its ability to understand natural language.
References
- [1] E. Bastianelli, A. Vanzo, P. Swietojanski, and V. Rieser. SLURP: A spoken language understanding resource package. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, 2020.
- [2] Y. Chang and Y. Chen. Contrastive learning for improving ASR robustness in spoken language understanding. In Proceedings of the 23rd Annual Conference of the International Speech Communication Association, 2022.
- [3] D. Chen, Z. Huang, X. Wu, S. Ge, and Y. Zou. Towards joint intent detection and slot filling via higher-order attention. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, 2022.
- [4] D. Chen, Z. Huang, and Y. Zou. Leveraging bilinear attention to improve spoken language understanding. In IEEE International Conference on Acoustics, Speech and Signal Processing, 2022.
- [5] Q. Chen, R. Zhang, Y. Zheng, and Y. Mao. Dual contrastive learning: Text classification via label-aware data augmentation. CoRR, abs/2201.08702, 2022.
- [6] Q. Chen, Z. Zhuo, and W. Wang. BERT for joint intent classification and slot filling. CoRR, abs/1902.10909, 2019.
- [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics, 2019.
- [8] L. Hou, Z. Huang, L. Shang, X. Jiang, X. Chen, and Q. Liu. Dynabert: Dynamic BERT with adaptive width and depth. In Advances in Neural Information Processing Systems, 2020.
- [9] E. Hovy, L. Gerber, U. Hermjakob, C. Lin, and D. Ravichandran. Toward semantics-based answer pinpointing. In Proceedings of the First International Conference on Human Language Technology Research, 2001.
- [10] C. Huang and Y. Chen. Adapting pretrained transformer to lattices for spoken language understanding. In IEEE Automatic Speech Recognition and Understanding Workshop, 2019.
- [11] C. Huang and Y. Chen. Learning asr-robust contextualized embeddings for spoken language understanding. In 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020.
- [12] Z. Huang, L. Hou, L. Shang, X. Jiang, X. Chen, and Q. Liu. Ghostbert: Generate more features with cheap operations for BERT. In ACL/IJCNLP, 2021.
- [13] Z. Huang, F. Liu, X. Wu, S. Ge, H. Wang, W. Fan, and Y. Zou. Audio-oriented multimodal machine comprehension via dynamic inter- and intra-modality attention. In AAAI, 2021.
- [14] Z. Huang, F. Liu, P. Zhou, and Y. Zou. Sentiment injected iteratively co-interactive network for spoken language understanding. In ICASSP, 2021.
- [15] Z. Huang, F. Liu, and Y. Zou. Federated learning for spoken language understanding. In COLING. International Committee on Computational Linguistics, 2020.
- [16] Z. Huang, M. Rao, A. Raju, Z. Zhang, B. Bui, and C. Lee. MTL-SLT: multi-task learning for spoken language tasks. In Proceedings of the 4th Workshop on NLP for Conversational AI, ConvAI@ACL, 2022.
- [17] X. Li and D. Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002.
- [18] F. Liu, Y. Liu, X. Ren, X. He, and X. Sun. Aligning visual regions and textual concepts for semantic-grounded image representations. In Annual Conference on Neural Information Processing Systems, 2019.
- [19] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. Preprint arXiv:1907.11692, 2019.
- [20] A. Radford, J. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, 2021.
- [21] P. Shivakumar, M. Yang, and P. Georgiou. Spoken language intent detection using confusion2vec. In 20th Annual Conference of the International Speech Communication Association, 2019.
- [22] G. Wang, C. Li, W. Wang, Y. Zhang, D. Shen, X. Zhang, R. Henao, and L. Carin. Joint embedding of words and labels for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018.
- [23] P. Zhou, Z. Huang, F. Liu, and Y. Zou. PIN: A novel parallel interactive network for spoken language understanding. In ICPR, 2020.