Cross domain emotion recognition using few shot knowledge transfer
Abstract
Emotion recognition from text is a challenging task due to diverse emotion taxonomies, lack of reliable labeled data in different domains, and highly subjective annotation standards. Few-shot and zero-shot techniques can generalize across unseen emotions by projecting the documents and emotion labels onto a shared embedding space. In this work, we explore the task of few-shot emotion recognition by transferring the knowledge gained from supervision on the GoEmotions Reddit dataset to the SemEval tweets corpus, using different emotion representation methods. The results show that knowledge transfer using external knowledge bases and fine-tuned encoders performs comparably to supervised baselines while requiring minimal supervision from the task dataset.
Index Terms— Emotion recognition, Few shot classification, Unsupervised
1 Introduction
Emotion recognition in unstructured text is an important natural language understanding task, with numerous downstream applications [1]. It enables a better understanding of customers’ opinions in product reviews [2, 3], public sentiment in social media [4], hate speech [5], political stances in news articles [6], and users’ mood in chatbot interactions [7, 8]. Emotion recognition research has spurred the creation of several NLP benchmark datasets and tasks, particularly in the domain of social media posts and question answering forums [4, 9]. Most contemporary emotion recognition approaches involve supervised modeling using deep neural networks and transformers [10, 11]. However, supervised models rely on high-quality labeled data, which is difficult to acquire for subjective tasks like emotion recognition, and which is noisy because annotators of different cultural backgrounds understand emotions inconsistently [12]. Moreover, emotion datasets follow different taxonomies, making it difficult to adapt existing models to newer domains and emotions. Lately, unsupervised (few-shot and zero-shot) models have gained popularity, where a large model with billions of parameters is pretrained on extensive unlabeled text corpora with some general language modeling objective, and then optionally fine-tuned on a small amount of labeled data of the target task [13, 14, 15]. Zero-shot and few-shot models project the labels (emotions) and documents to a common embedding space using pretrained encoders and matrix factorization methods, and compare their representations to estimate a similarity score [16]. We extend this approach to the emotion recognition task and show that models learned on one taxonomy can be adapted to another with minimal fine-tuning.
In this work, we explore various few-shot emotion recognition models on the GoEmotions Reddit dataset [9]. We experiment with different methods to project the emotion label to a sentence embedding. We also show successful knowledge transfer from GoEmotions to SemEval tweets [4], even though the two datasets follow different emotion taxonomies. Our contributions are three-fold: 1) we improve on the supervised baseline of the GoEmotions Reddit dataset using an SBERT encoder, 2) we develop few-shot emotion recognition methods leveraging external knowledge bases like WordNet, and 3) we achieve performance comparable to the supervised baseline on SemEval tweets emotion classification by knowledge transfer from GoEmotions Reddit comments.
2 Related Work
Few-shot and zero-shot text classification: In the domain of zero-shot text classification for social media, Chen et al. [17] rely on external knowledge-base-guided projection methods for mapping posts and labels into a shared embedding space. For unsupervised document classification, Haj-Yahia et al. [18] perform category label enrichment through expert interventions followed by WordNet [19] based definitions to compute similarities between documents and labels. We draw inspiration from these methods in our definition-based projection method, where we leverage emotion label definitions from WordNet [19] and map them into the same sentence embedding space as social media (Reddit, Twitter) comments.
Few-shot emotion classification: Guibon et al. [20] explore the problem of emotion classification from conversations in a few-shot setting by computing Euclidean distances between the example conversational utterances and emotion-centric prototypes, formed from the average of the word embeddings of utterances. This is the basis for our projection methods, where we project the example and emotion into the same vector space before computing the similarity score. Specifically, in our labeled sentences projection method, we average the embeddings of a few labeled examples from the dataset for each emotion label to compute a label-specific embedding.
Knowledge transfer for emotion classification: For multi-turn conversations, Hazarika et al. [21] explore transfer learning from a generative source model trained for dialogue modeling to the target domain of emotion classification from low and moderately sized datasets. For social media comments and personal reports, Demszky et al. [9] perform knowledge transfer from GoEmotions-specific models by fine-tuning output layers for target datasets through a set of freezing and unfreezing operations. This motivates our use of GoEmotions-specific models in our unsupervised methods on the SemEval tweets dataset, without the additional step of fine-tuning.
3 Data
Tweets and Reddit posts contain expressions of personal opinions and emotions and serve as good test sets for emotion recognition from text. We chose the GoEmotions Reddit dataset [9] and the benchmark SemEval 2018 tweets dataset [4] for our work.
3.1 GoEmotions Data
The GoEmotions dataset contains 58K Reddit comments, manually annotated for 27 emotion categories. Each comment can have zero or more emotion labels. Figure 1 shows the class distribution in this data. As shown in the figure, the dataset is highly imbalanced, with the largest class (admiration) containing 5512 samples versus 96 samples for the smallest class (grief). The Reddit comments come from the reddit-data-tools project (https://github.com/dewarim/reddit-data-tools) and span the years 2005-2019. Subreddits with high levels of profanity or low emotion content were excluded. The comments are 3 to 30 tokens long, with a median length of 12 tokens. At least three raters annotated each example, achieving significant interrater correlation for each emotion.
3.2 SemEval 2018 Data
We used the English subset of the Affect in Tweets dataset of the 2018 International Workshop on Semantic Evaluation. The dataset contains 10.9K tweets from 2016-2017, annotated for 11 emotion categories. Each tweet can have zero or more emotion labels. Figure 2 shows the class distribution for this dataset. At least 7 raters annotated each tweet, achieving significant interrater correlation for each emotion. Three emotion labels (anticipation, trust, and pessimism) are absent from the GoEmotions taxonomy. The tweets are 1 to 36 tokens long, with a median length of 16 tokens. In our experiments, we attempt to transfer knowledge from the GoEmotions Reddit comments to SemEval tweets using unsupervised models.
4 Methods
We consider both supervised and unsupervised approaches to model the sentence-level emotion.
4.1 Supervised methods
Herein we focus on well-established linear models as well as the more recently proposed transformer models for supervised emotion classification.
Linear models: As simple baselines, we trained logistic regression, linear support vector machine (SVM), and XGBoost [22] classifiers. Our feature set comprised bag-of-n-grams counts, LIWC [23] category scores, and a curated set of common emoticons and emojis. We removed punctuation and special characters, lemmatized the word tokens, and excluded n-grams occurring fewer than three times. To address the class imbalance of the emotion datasets, we weighted the classes according to $w_e = \frac{N}{K \, n_e}$, where $N$ is the total number of samples, $K$ is the number of emotions, and $n_e$ is the number of examples labeled with emotion $e$, using the scikit-learn framework (https://scikit-learn.org/stable/).
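As a hedged illustration of the class weighting described above, the sketch below assumes the weights follow scikit-learn's "balanced" heuristic: the total number of samples divided by the product of the number of emotions and the per-emotion count. The function name `balanced_class_weights` and the toy labels are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def balanced_class_weights(label_sets, emotions):
    """Return the weight N / (K * n_e) for each emotion e.

    N is the number of samples, K the number of emotions, and n_e the
    number of samples labeled with e. This mirrors scikit-learn's
    "balanced" heuristic and is an assumed reading of the text.
    """
    n_samples = len(label_sets)
    n_emotions = len(emotions)
    counts = Counter(e for labels in label_sets for e in labels)
    # Emotions with no examples are skipped to avoid division by zero.
    return {
        e: n_samples / (n_emotions * counts[e])
        for e in emotions
        if counts[e] > 0
    }


# Toy multi-label data (hypothetical): four comments, three emotions.
toy_labels = [{"joy"}, {"joy"}, {"joy", "grief"}, {"anger"}]
toy_emotions = ["joy", "grief", "anger"]
weights = balanced_class_weights(toy_labels, toy_emotions)
print(weights)
```

Rare emotions receive larger weights than frequent ones, which is the intended counterbalance for a skewed label distribution.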
Transformers: We fine-tuned BERT [14] and SBERT [24] based encoders, each with a dense output layer. We used a pretrained BERT-Base model from the Hugging Face Transformers library (https://huggingface.co/transformers/). For SBERT, we used the all-mpnet-base-v2 model, pretrained using a billion sentence pairs from multiple datasets. Both transformer encoders have a hidden state dimension of 768. We did not perform any text preprocessing. In our transformer-based experiments, we calculated the class weights using $w_e = \sqrt{n_e^{-} / n_e^{+}}$, where $n_e^{-}$ and $n_e^{+}$ are the number of negative and positive examples of emotion class $e$, respectively. The square root dampens the impact of the negative to positive ratio, which otherwise over-corrects the class imbalance.
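The square-root damping can be sketched as follows. This is a minimal illustration that assumes the weight for each emotion is the square root of the ratio of negative to positive examples, as the description above suggests; the function name `sqrt_ratio_weights`, the fallback value of 1.0 for classes without positives, and the toy label matrix are assumptions.

```python
import math


def sqrt_ratio_weights(label_matrix):
    """Return sqrt(negatives / positives) for each emotion column.

    label_matrix is a list of rows, one per example, each row a list of
    0/1 indicators with one entry per emotion.
    """
    n_rows = len(label_matrix)
    n_classes = len(label_matrix[0])
    weights = []
    for j in range(n_classes):
        positives = sum(row[j] for row in label_matrix)
        negatives = n_rows - positives
        # Fall back to 1.0 when a class has no positives (an assumption).
        weights.append(math.sqrt(negatives / positives) if positives else 1.0)
    return weights


# Toy labels (hypothetical): column 0 is frequent, column 1 is rare.
toy_matrix = [[1, 0], [1, 0], [1, 1], [0, 0], [1, 0]]
toy_weights = sqrt_ratio_weights(toy_matrix)
print(toy_weights)
```

In the toy data the rare class has a raw negative-to-positive ratio of 4, which the square root reduces to 2, so the correction is softer than the raw ratio would give.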
4.2 Unsupervised methods
We propose a method of low resource few shot transfer learning by utilizing the embeddings produced by the transformer models fine-tuned on GoEmotions for an unsupervised classification task on the SemEval dataset. The GoEmotions and SemEval datasets have different taxonomies with overlapping labels in the emotion space. We leverage the fact that emotion classes not included as a part of training data are words and can be represented in the same word/sentence embedding space as comments. We project the sentences (Reddit comments or tweets) and the emotion labels onto the same embedding space. This allows us to compute the similarity between a given sentence and different emotion classes. We assign an emotion to a sentence if their cosine similarity score is higher than the emotion-specific threshold. We use the development set of the target dataset to set the threshold values. We discuss different projection approaches in the following subsections.
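The decision rule above (assign an emotion when the cosine similarity exceeds an emotion-specific threshold tuned on the development set) can be sketched as below. The text does not state which criterion is used to pick the thresholds, so maximising development-set F1 over a small grid is an assumption, as are the function names and the toy two-dimensional vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_emotions(sentence_vec, emotion_vecs, thresholds):
    """Assign every emotion whose similarity exceeds its own threshold."""
    return {
        emotion
        for emotion, vec in emotion_vecs.items()
        if cosine(sentence_vec, vec) > thresholds[emotion]
    }


def tune_threshold(dev_scores, dev_gold, grid):
    """Pick the grid value with the best dev-set F1 for one emotion.

    Maximising F1 is an assumed criterion; ties keep the earlier grid value.
    """
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        tp = sum(1 for s, g in zip(dev_scores, dev_gold) if s > t and g)
        fp = sum(1 for s, g in zip(dev_scores, dev_gold) if s > t and not g)
        fn = sum(1 for s, g in zip(dev_scores, dev_gold) if s <= t and g)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t


# Toy example (hypothetical vectors and similarity scores).
emotion_vecs = {"joy": [1.0, 0.0], "anger": [0.0, 1.0]}
joy_threshold = tune_threshold([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0], [0.1, 0.5, 0.85])
thresholds = {"joy": joy_threshold, "anger": 0.5}
predicted = predict_emotions([0.9, 0.1], emotion_vecs, thresholds)
print(joy_threshold, predicted)
```

Because each emotion is thresholded independently, a sentence can receive zero, one, or several labels, which matches the multi-label nature of both datasets.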



WordNet Definition: WordNet is a lexical database that arranges words in synonym sets called synsets. Each synset provides a short sentence defining the meaning of its constituent words. The synset definition of an emotion should be semantically similar to sentences in which that emotion is expressed. Therefore, we form the emotion representation by appending the WordNet definition of the emotion to its lexical form. For example, we represent the emotion embarrassment as “embarrassment: the shame you feel when your inadequacy or guilt is made public.” We compute the sentence embeddings of this emotion definition and of the input sentence, and then the cosine similarity between them. Fig. 3(a) illustrates this approach.
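A minimal sketch of the definition-based projection follows. The definition string is the one quoted above; everything else is an assumption made for illustration. In particular, `toy_encode` is a bag-of-words stand-in for the sentence encoder (it is not SBERT), and the two input sentences are invented.

```python
import math
import re
from collections import Counter


def label_text(emotion, definition):
    """Append the definition to the lexical form of the emotion."""
    return f"{emotion}: {definition}"


def toy_encode(text):
    """Stand-in encoder (assumption): sparse bag-of-words counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def sparse_cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def definition_similarity(sentence, emotion, definition, encode=toy_encode):
    """Similarity between a sentence and an emotion's definition text."""
    return sparse_cosine(encode(sentence), encode(label_text(emotion, definition)))


definition = "the shame you feel when your inadequacy or guilt is made public"
related = definition_similarity("I feel such shame and guilt", "embarrassment", definition)
unrelated = definition_similarity("We went hiking today", "embarrassment", definition)
print(related, unrelated)
```

Swapping `toy_encode` for a real sentence encoder would leave the rest of the sketch unchanged, since only the `encode` argument depends on the embedding model.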
Labeled Sentences: We find the emotion representation by averaging the sentence embeddings of the development set sentences which are labeled with the corresponding emotion. We use this to compute the cosine similarity against the sentence embedding of the input sentences, as shown in Fig. 3(b).
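The averaging step can be sketched as below, assuming the sentence embeddings have already been computed. The function name `emotion_prototypes` and the toy two-dimensional embeddings are hypothetical.

```python
def emotion_prototypes(embeddings, label_sets):
    """Average the embeddings of all sentences labeled with each emotion.

    embeddings is a list of equal-length vectors; label_sets is a parallel
    list of sets of emotion names. A sentence with several labels
    contributes to several prototypes.
    """
    sums, counts = {}, {}
    for vec, labels in zip(embeddings, label_sets):
        for emotion in labels:
            if emotion not in sums:
                sums[emotion] = [0.0] * len(vec)
                counts[emotion] = 0
            sums[emotion] = [s + x for s, x in zip(sums[emotion], vec)]
            counts[emotion] += 1
    return {e: [s / counts[e] for s in sums[e]] for e in sums}


# Toy development set (hypothetical embeddings and labels).
dev_embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dev_labels = [{"joy"}, {"anger"}, {"joy", "anger"}]
prototypes = emotion_prototypes(dev_embeddings, dev_labels)
print(prototypes)
```

Emotions that never occur in the development set get no prototype under this scheme, so the method depends on having at least one labeled example per emotion.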
Word Embeddings: The emotion representation is the word embedding of the emotion label. For each input sentence, we remove punctuation and stop words, and average the word embeddings of each token to find the sentence representation, as shown in Fig. 3(c).
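The sentence representation described above can be sketched as follows. The tiny stop-word list and the two-dimensional word vectors are hypothetical placeholders (they are not GloVe vectors), and skipping out-of-vocabulary tokens is an assumption.

```python
import re

# Tiny illustrative stop-word list (assumption, not the list used in the paper).
STOP_WORDS = {"i", "am", "so", "the", "a", "is", "this"}


def sentence_vector(sentence, word_vectors, stop_words=STOP_WORDS):
    """Average the word vectors of the non-stop-word tokens.

    Punctuation is dropped by the tokenizer; tokens without a vector are
    skipped. Returns None when no token has a vector.
    """
    tokens = re.findall(r"[a-z]+", sentence.lower())
    vecs = [word_vectors[t] for t in tokens if t not in stop_words and t in word_vectors]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]


# Hypothetical 2-d vectors standing in for pretrained word embeddings.
toy_vectors = {"happy": [1.0, 0.0], "excited": [0.8, 0.2], "joy": [0.9, 0.1]}
sent_vec = sentence_vector("I am so happy, so excited!", toy_vectors)
label_vec = toy_vectors["joy"]  # the emotion is represented by its own word vector
print(sent_vec, label_vec)
```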
5 Experiments
We perform three sets of emotion classification experiments: 1) supervised and 2) unsupervised modeling of GoEmotions Reddit comments, and 3) knowledge transfer from GoEmotions to SemEval using unsupervised approaches. We use macro precision, recall, and F1 scores to evaluate our models. We follow the same 8:1:1 train-test-dev split, as used in [9], for the supervised experiments on GoEmotions. For the unsupervised and knowledge transfer experiments, we chose smaller development sets than the original splits to emulate few-shot learning. We used 938 Reddit comments (80% smaller) and 572 SemEval tweets (35% smaller) as our GoEmotions and SemEval development sets.
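For reference, macro-averaged precision, recall, and F1 for multi-label predictions can be computed as in the sketch below. It assumes that each metric is computed per emotion and then averaged with equal weight, and that undefined ratios count as zero; the function name and toy labels are hypothetical.

```python
def macro_prf(gold, pred, emotions):
    """Macro precision, recall, and F1 over a list of emotions.

    gold and pred are parallel lists of sets of emotion names. Ratios with
    a zero denominator are counted as 0.0 (an assumed convention).
    """
    precisions, recalls, f1s = [], [], []
    for e in emotions:
        tp = sum(1 for g, p in zip(gold, pred) if e in g and e in p)
        fp = sum(1 for g, p in zip(gold, pred) if e not in g and e in p)
        fn = sum(1 for g, p in zip(gold, pred) if e in g and e not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(emotions)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Toy predictions (hypothetical).
gold = [{"joy"}, {"joy", "anger"}, {"anger"}, set()]
pred = [{"joy"}, {"joy"}, {"joy", "anger"}, set()]
macro_p, macro_r, macro_f1 = macro_prf(gold, pred, ["joy", "anger"])
print(macro_p, macro_r, macro_f1)
```

Under this definition the macro F1 is the mean of the per-emotion F1 scores, so it need not equal the harmonic mean of the macro precision and macro recall.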
We trained the transformer models using a sigmoid binary cross entropy loss function for 5 epochs with a learning rate of 5e-5 and a batch size of 16. We experimented with two variations of the sentence encoder for the WordNet Definition and Labeled Sentences approaches: 1) a pretrained SBERT, and 2) SBERT fine-tuned on the supervised emotion classification task on the GoEmotions training set. The latter constitutes a knowledge transfer of emotion-aware sentence embeddings from GoEmotions to SemEval. We also tried BERT encoders, but SBERT performed better, so we only present its results. We used 200-dimensional GloVe [25] vectors for the Word Embeddings unsupervised approach.
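The sigmoid binary cross entropy objective can be written out for a single example as in the sketch below. How the class weights enter the loss is not specified in the text; scaling only the positive term by a per-emotion weight is an assumption, and the function names are hypothetical.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def weighted_bce_with_logits(logits, targets, pos_weights):
    """Mean sigmoid binary cross entropy over the emotions of one example.

    The positive term of each emotion is scaled by its weight (assumed
    convention). Uses log(sigmoid(z)) = -softplus(-z) for stability.
    """
    total = 0.0
    for z, y, w in zip(logits, targets, pos_weights):
        log_p = -softplus(-z)      # log of predicted probability
        log_not_p = -softplus(z)   # log of one minus that probability
        total += -(w * y * log_p + (1 - y) * log_not_p)
    return total / len(logits)


# One example with two emotions: the first is present, the second absent.
loss = weighted_bce_with_logits([0.0, 0.0], [1, 0], [2.0, 2.0])
print(loss)
```

Each emotion gets its own sigmoid output, so the labels are treated as independent binary decisions instead of competing in a softmax.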
6 Results and Discussion
6.1 Supervised approaches
Table 1 shows the results of supervised emotion classification on the GoEmotions dataset. BERT and SBERT achieve similar F1 scores of 0.49 and 0.48, respectively, both better than the baseline model. However, their recall is lower than the baseline. The linear models achieve comparable recall but suffer in precision. Gradient boosting attained the best precision score of 0.47.
Table 1: Supervised emotion classification results on GoEmotions.

Model | Precision | Recall | F1 |
---|---|---|---|
Baseline [9] | 0.40 | 0.63 | 0.46 |
Logistic Regression | 0.30 | 0.57 | 0.39 |
Linear SVC | 0.29 | 0.61 | 0.39 |
XGBoost | 0.47 | 0.46 | 0.44 |
BERT | 0.43 | 0.59 | 0.49 |
SBERT | 0.45 | 0.57 | 0.48 |
6.2 Unsupervised approaches
Table 2 shows the results of unsupervised emotion classification on the GoEmotions dataset. We use the GoEmotions development set to tune the threshold values, as described in Sec. 4.2. The unsupervised approaches perform far below the supervised models of Table 1, with the best method (Labeled Sentences) achieving only 0.15 F1. Replacing the pretrained SBERT encoder with the fine-tuned SBERT layer of the supervised models dramatically improves the performance: we achieve a maximum F1 score of 0.40 with this hybrid approach. The Labeled Sentences approach performs slightly better than the WordNet Definition approach, followed by the Word Embeddings method. The model is no longer unsupervised because we use the training set to fine-tune the SBERT embeddings. However, we are not constrained to the dataset’s label set and can make predictions for new emotions. Our experiments on knowledge transfer, described next, support this.
6.3 Knowledge Transfer
Table 3 shows the results of emotion classification on the SemEval dataset. We use the SemEval development set to tune the thresholds for the unsupervised approaches. The precision and F1 scores increase when we use the fine-tuned SBERT encoder of the GoEmotions-trained supervised models. Therefore, the knowledge gained by the SBERT embeddings from supervision on GoEmotions improves emotion classification on the SemEval dataset. We attain a maximum F1 score of 0.49 using the WordNet Definition approach. This is comparable to the NTUA-SLP supervised baseline, which achieves 0.53 F1. The WordNet Definition method performs slightly better than the Labeled Sentences approach.
Table 2: Unsupervised emotion classification results on GoEmotions (Lbl St: Labeled Sentences; Word Em: Word Embeddings; FT: fine-tuned on GoEmotions).

Model | Precision | Recall | F1 |
---|---|---|---|
Unsupervised | | | |
WordNet + SBERT | 0.11 | 0.33 | 0.12 |
Lbl St + SBERT | 0.23 | 0.14 | 0.15 |
Word Em + GloVe | 0.11 | 0.41 | 0.11 |
Supervised | | | |
WordNet + SBERT FT | 0.35 | 0.51 | 0.38 |
Lbl St + SBERT FT | 0.38 | 0.48 | 0.40 |
Baseline [9] | 0.40 | 0.63 | 0.46 |
SBERT | 0.45 | 0.57 | 0.48 |

Table 3: Emotion classification results on SemEval (abbreviations as in Table 2).

Model | Precision | Recall | F1 |
---|---|---|---|
Unsupervised | | | |
WordNet + SBERT | 0.30 | 0.60 | 0.39 |
Lbl St + SBERT | 0.30 | 0.51 | 0.37 |
Word Em + GloVe | 0.28 | 0.70 | 0.36 |
Knowledge Transfer | | | |
WordNet + SBERT FT | 0.48 | 0.56 | 0.49 |
Lbl St + SBERT FT | 0.40 | 0.66 | 0.47 |
Supervised Baselines | | | |
Random | - | - | 0.29 |
SVM-Unigrams | - | - | 0.44 |
NTUA-SLP [26] | - | - | 0.53 |
The unsupervised approach allows us to make predictions on emotions unseen by the fine-tuned encoder. For example, the WordNet + SBERT FT model scores 0.13, 0.31, and 0.28 F1 on the trust, pessimism, and anticipation emotions, respectively, which are not found in the GoEmotions label set (see Sec. 3.2). Fig. 4 shows the change in F1 score versus the size of the SemEval development set, which is used to set the thresholds for the unsupervised approaches. The performance steadies after about 600 examples and does not increase significantly after that. The WordNet Definition and Word Embeddings approaches seem more robust to the number of examples used for tuning the threshold than the Labeled Sentences method.
7 Conclusion
We presented several unsupervised and few-shot methods for emotion classification leveraging lexical knowledge bases, and evaluated them on the GoEmotions Reddit dataset. The sentence embeddings fine-tuned on the GoEmotions Reddit comments improved classification performance on the SemEval tweets, with minimal in-domain supervision. This showed that we could successfully transfer the knowledge gained from supervision on one emotion dataset to another, even if they followed different label taxonomies. Future work includes using synonym and semantic relations with knowledge graphs to enrich the emotion representations beyond simple definitions.
References
- [1] Rosalind W. Picard, Affective Computing, MIT Press, Cambridge, MA, 1997.
- [2] Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts, “Recursive deep models for semantic compositionality over a sentiment treebank,” in Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, Oct. 2013, pp. 1631–1642, Association for Computational Linguistics.
- [3] Maria Pontiki, Dimitris Galanis, John Pavlopoulos, Harris Papageorgiou, Ion Androutsopoulos, and Suresh Manandhar, “SemEval-2014 task 4: Aspect based sentiment analysis,” in Proceedings of the 8th International Workshop on Semantic Evaluation (SemEval 2014), Dublin, Ireland, Aug. 2014, pp. 27–35, Association for Computational Linguistics.
- [4] Saif Mohammad, Felipe Bravo-Marquez, Mohammad Salameh, and Svetlana Kiritchenko, “SemEval-2018 task 1: Affect in tweets,” in Proceedings of The 12th International Workshop on Semantic Evaluation, New Orleans, Louisiana, June 2018, pp. 1–17, Association for Computational Linguistics.
- [5] Flor Miriam Plaza del Arco, Sercan Halat, Sebastian Padó, and Roman Klinger, “Multi-task learning with sentiment, emotion, and target detection to recognize hate speech and offensive language,” 2021.
- [6] Isabelle Augenstein, Tim Rocktäschel, Andreas Vlachos, and Kalina Bontcheva, “Stance detection with bidirectional conditional encoding,” in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, Nov. 2016, pp. 876–885, Association for Computational Linguistics.
- [7] Zhaojiang Lin, Peng Xu, Genta Indra Winata, Farhad Bin Siddique, Zihan Liu, Jamin Shin, and Pascale Fung, “CAiRE: An end-to-end empathetic chatbot,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 09, pp. 13622–13623, Apr. 2020.
- [8] Li Zhou, Jianfeng Gao, Di Li, and Heung-Yeung Shum, “The design and implementation of XiaoIce, an empathetic social chatbot,” 2019.
- [9] Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan Cowen, Gaurav Nemade, and Sujith Ravi, “GoEmotions: A Dataset of Fine-Grained Emotions,” in 58th Annual Meeting of the Association for Computational Linguistics (ACL), 2020.
- [10] Yen-Hao Huang, Ssu-Rui Lee, Mau-Yun Ma, Yi-Hsin Chen, Ya-Wen Yu, and Yi-Shin Chen, “EmotionX-IDEA: Emotion BERT - an affectional model for conversation,” CoRR, vol. abs/1908.06264, 2019.
- [11] Hassan Alhuzali and Sophia Ananiadou, “SpanEmo: Casting Multi-label Emotion Classification as Span-prediction,” in Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume. 2021, pp. 1573–1584, Association for Computational Linguistics.
- [12] Laura-Ana-Maria Bostan and Roman Klinger, “An analysis of annotated corpora for emotion classification in text,” in Proceedings of the 27th International Conference on Computational Linguistics, Santa Fe, New Mexico, USA, Aug. 2018, pp. 2104–2119, Association for Computational Linguistics.
- [13] Richard Socher, Milind Ganjoo, Christopher D Manning, and Andrew Ng, “Zero-shot learning through cross-modal transfer,” in Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds. 2013, vol. 26, Curran Associates, Inc.
- [14] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” CoRR, vol. abs/1810.04805, 2018.
- [15] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei, “Language Models are Few-Shot Learners,” Advances in Neural Information Processing Systems, vol. 33, pp. 1877–1901, 2020.
- [16] Samira Zad and Mark Finlayson, “Systematic Evaluation of a Framework for Unsupervised Emotion Recognition for Narrative Text,” in Proceedings of the First Joint Workshop on Narrative Understanding, Storylines, and Events, July 2020, pp. 26–37, Association for Computational Linguistics.
- [17] Qi Chen, Wei Wang, Kaizhu Huang, and Frans Coenen, “Zero-shot text classification via knowledge graph embedding for social media data,” IEEE Internet of Things Journal, pp. 1–1, 2021.
- [18] Zied Haj-Yahia, Adrien Sieg, and Léa A. Deleris, “Towards unsupervised text classification leveraging experts and word embeddings,” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, July 2019, pp. 371–379, Association for Computational Linguistics.
- [19] Christiane Fellbaum, WordNet: An Electronic Lexical Database, Bradford Books, 1998.
- [20] Gaël Guibon, Matthieu Labeau, Hélène Flamein, Luce Lefeuvre, and Chloé Clavel, “Few-shot emotion recognition in conversation with sequential prototypical networks,” in The 2021 Conference on Empirical Methods in Natural Language Processing (EMNLP 2021), 2021.
- [21] Devamanyu Hazarika, Soujanya Poria, Roger Zimmermann, and Rada Mihalcea, “Emotion recognition in conversations with transfer learning from generative conversation modeling,” CoRR, vol. abs/1910.04980, 2019.
- [22] Tianqi Chen and Carlos Guestrin, “XGBoost: A scalable tree boosting system,” in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, New York, NY, USA, 2016, KDD ’16, pp. 785–794, ACM.
- [23] Yla R. Tausczik and James W. Pennebaker, “The psychological meaning of words: LIWC and computerized text analysis methods,” Journal of Language and Social Psychology, vol. 29, no. 1, pp. 24–54, 2010.
- [24] Nils Reimers and Iryna Gurevych, “Sentence-BERT: Sentence embeddings using Siamese BERT-networks,” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing, Nov. 2019, Association for Computational Linguistics.
- [25] Jeffrey Pennington, Richard Socher, and Christopher D. Manning, “GloVe: Global vectors for word representation,” in Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1532–1543.
- [26] Christos Baziotis, Nikos Athanasiou, Alexandra Chronopoulou, Athanasia Kolovou, Georgios Paraskevopoulos, Nikolaos Ellinas, Shrikanth S. Narayanan, and Alexandros Potamianos, “NTUA-SLP at SemEval-2018 task 1: Predicting affective content in tweets with deep attentive RNNs and transfer learning,” CoRR, vol. abs/1804.06658, 2018.