
Guilt by Association: Emotion Intensities in Lexical Representations

Shahab Raji
Rutgers University
[email protected]
Gerard de Melo
Hasso Plattner Institute/University of Potsdam
[email protected]
Abstract

What do word vector representations reveal about the emotions associated with words? In this study, we consider the task of estimating word-level emotion intensity scores for specific emotions, exploring unsupervised, supervised, and finally self-supervised methods of extracting emotional associations from word vector representations. Overall, we find that word vectors carry substantial potential for inducing fine-grained emotion intensity scores, showing a far higher correlation with human ground truth ratings than achieved by state-of-the-art emotion lexicons.

1 Introduction

Vector representations of words induced using distributional co-occurrence signals provide valuable information about associations between words Baroni et al. (2014). In this paper, we consider the question: What do word vector representations reveal about the emotions associated with words?

There has been substantial research on methods to label words with associated emotions. Crowdsourcing approaches have been used to compile databases that study the nexus between words and emotions. Mohammad and Turney (2013) labelled around 15,000 unigrams with eight basic emotions and two sentiments, providing binary tags for each word. Mohammad (2018) instead solicited real-valued emotion intensity ratings in $[0,1]$ for roughly 500–1800 words for each of eight emotions. DepecheMood Staiano and Guerini (2014) and the improved DepecheMood++ Araque et al. (2018) are based on simple statistical methods that create an emotion lexicon from emotionally tagged data crawled from specific Web sites. We conjecture that vector representations allow us to obtain higher-quality ratings of the emotional associations of words.

Word vectors Mikolov et al. (2013); Pennington et al. (2014) have often been evaluated on standard word relatedness benchmarks. In this paper, we instead explore to what extent they encode emotional associations. Earlier methods Strapparava and Mihalcea (2008); Mac Kim et al. (2010) used corpus statistics in tandem with dimensionality reduction techniques for emotion prediction. Rothe et al. (2016) proposed predicting sentiment polarity ratings from word vectors, while other studies Buechel and Hahn (2018b); Buechel et al. (2018) predicted valence, arousal, and dominance using supervised deep neural networks. Khosla et al. (2018) proposed incorporating valence, arousal, and dominance ratings as additional signals into dense word embeddings. Buechel and Hahn (2018a) showed that emotion intensity scores can be predicted based on a lexicon providing valence–arousal–dominance ratings. We show that we can make use of word vectors to obtain high correlations with emotion intensities without any need for such ratings.

Contribution.

Overall, our intriguing finding in this paper is that dense word vectors allow us to predict much more accurate emotion associations than state-of-the-art emotion lexicons. We show that this holds even without any supervision, while different kinds of supervised setups yield even better results.

2 Emotion Intensities as Associations

While many emotion lexicons only provide binary emotion labels, it is clear that words may exhibit different degrees of association with an emotion. For instance, the word party suggests a greater degree of joy than the word sea, although the latter may also be associated with joy.

Ground truth.

The NRC Emotion/Affect Intensity Lexicon Mohammad (2018) provides human ratings of emotion intensity for words, scaled to $[0,1]$ such that a score of 1 signifies that a word “conveys the highest amount of” a particular emotion, while 0 corresponds to the lowest amount. The real-valued scores were computed based on best–worst scaling of ratings solicited via crowdsourcing.

In our work, instead of viewing this resource as a lexicon that provides emotion intensity scores, we propose to treat it as providing associations between pairs of words, one of the two words being an emotion. Thus, we propose to derive semantic association benchmarks similar to widely used semantic relatedness benchmark sets such as the RG-65 Rubenstein and Goodenough (1965) and WS353 Finkelstein et al. (2001) test collections.

Data split.

In order to more fairly evaluate resources that only provide unigrams, we disregard bigrams in the lexicon, resulting in a total of 9,706 word pairs. The data covers the eight basic emotions proposed in Plutchik’s wheel of emotions Plutchik (1980), i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. To facilitate an evaluation of unsupervised and supervised techniques, for each emotion, we split the corresponding data into training/validation/test portions with a ratio of 64%/16%/20%. This results in a total of 7,762 pairs for training and validation, of which 1,552 instances serve as validation data, and 1,944 pairs in the test sets.
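For concreteness, the following sketch shows one way such a per-emotion split could be reproduced. The function and variable names, the use of scikit-learn, and the fixed random seed are our illustrative assumptions, not part of the original setup:

    # Illustrative per-emotion 64%/16%/20% split; `pairs_by_emotion` is assumed
    # to map each emotion to a list of (word, intensity) pairs from the lexicon.
    from sklearn.model_selection import train_test_split

    def split_lexicon(pairs_by_emotion, seed=0):
        splits = {}
        for emotion, pairs in pairs_by_emotion.items():
            # Carve off the 20% test portion first ...
            rest, test = train_test_split(pairs, test_size=0.20, random_state=seed)
            # ... then split the remaining 80% into 64%/16% of the total.
            train, val = train_test_split(rest, test_size=0.20, random_state=seed)
            splits[emotion] = (train, val, test)
        return splits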

3 Estimating Emotion Intensity

Emotion lexicons are typically created using crowdsourcing Mohammad and Turney (2013) or by drawing on emotion-labeled text Staiano and Guerini (2014); Araque et al. (2018). We investigate deriving emotion scores from word vector representations without any such emotion-specific signals.

Method           Resource         Anger   Anticip.  Disgust  Fear    Joy     Sadness  Surprise  Trust   Overall
Baselines        EmoLex           0.081   -0.071    -0.037   0.101   0.059   0.035    -0.186    0.250   0.076
                 DepecheMood*     0.089   –         –        0.077   -0.020  0.204    –         –       0.042
                 DepecheMood++*   0.207   –         –        0.180   0.010   0.337    –         –       0.106
                 EmoWordNet*      0.100   –         –        0.077   0.032   0.190    –         –       0.048
Unsupervised     word2vec         0.321   0.266     0.056    0.081   0.482   0.321    0.249     0.403   0.254
                 GloVe            0.360   0.413     0.031    0.137   0.352   0.281    -0.025    0.439   0.246
                 fastText         0.388   0.332     0.071    0.038   0.306   0.281    0.186     0.497   0.249
                 EWE              0.301   0.434     0.120    0.127   0.236   0.181    -0.143    0.348   0.194
                 Counter-fitting  0.354   0.379     0.075    0.200   0.507   0.379    0.334     0.459   0.322
                 AffectVec        0.426   0.430     0.262    0.447   0.631   0.559    -0.078    0.592   0.419
                 BERT base        0.130   0.104     -0.136   0.081   0.002   -0.012   -0.037    0.155   0.033
Self-sup. FFNN   word2vec         0.339   0.229     0.300    0.215   0.412   0.447    0.103     0.311   0.311
                 GloVe            0.266   0.256     0.214    0.158   0.351   0.386    0.024     0.207   0.239
                 fastText         0.352   0.300     0.348    0.272   0.347   0.530    0.137     0.469   0.359
                 EWE              0.259   0.242     0.251    0.174   0.330   0.304    -0.086    0.270   0.228
                 Counter-fitting  0.345   0.305     0.177    0.205   0.467   0.353    0.119     0.445   0.300
                 AffectVec        0.411   0.418     0.273    0.417   0.631   0.564    -0.091    0.574   0.405
                 BERT base        0.347   0.081     0.259    0.080   0.389   0.327    0.114     0.310   0.238
Self-sup. SVR    word2vec         0.410   0.227     0.311    0.226   0.331   0.472    0.098     0.361   0.303
                 GloVe            0.419   0.379     0.335    0.256   0.494   0.509    0.061     0.441   0.386
                 fastText         0.346   0.302     0.409    0.272   0.389   0.513    0.158     0.480   0.371
                 EWE              0.348   0.327     0.386    0.249   0.390   0.486    -0.094    0.487   0.322
                 Counter-fitting  0.368   0.349     0.224    0.213   0.471   0.388    0.138     0.469   0.314
                 AffectVec        0.437   0.457     0.302    0.490   0.633   0.600    -0.128    0.605   0.426
                 BERT base        0.331   0.029     0.287    0.029   0.374   0.356    0.079     0.309   0.248
Supervised FFNN  word2vec         0.681   0.568     0.711    0.713   0.695   0.749    0.712     0.694   0.699
                 GloVe            0.704   0.559     0.733    0.723   0.678   0.731    0.643     0.730   0.698
                 fastText         0.716   0.596     0.761    0.736   0.691   0.743    0.756     0.750   0.722
                 EWE              0.582   0.322     0.618    0.636   0.582   0.662    0.438     0.602   0.554
                 Counter-fitting  0.675   0.580     0.641    0.676   0.660   0.649    0.732     0.693   0.665
                 AffectVec        0.700   0.614     0.664    0.703   0.724   0.686    0.747     0.701   0.695
                 BERT base        0.624   0.486     0.550    0.578   0.578   0.570    0.612     0.580   0.533
Supervised SVR   word2vec         0.642   0.566     0.654    0.654   0.680   0.734    0.721     0.664   0.672
                 GloVe            0.720   0.633     0.727    0.671   0.719   0.723    0.722     0.694   0.705
                 fastText         0.660   0.589     0.745    0.698   0.637   0.697    0.746     0.700   0.685
                 EWE              0.685   0.618     0.705    0.633   0.642   0.720    0.709     0.713   0.677
                 Counter-fitting  0.654   0.569     0.558    0.623   0.618   0.636    0.711     0.665   0.630
                 AffectVec        0.678   0.606     0.631    0.676   0.657   0.671    0.668     0.675   0.664
                 BERT base        0.567   0.478     0.484    0.532   0.576   0.559    0.683     0.556   0.553
Table 1: Main results for emotion intensity prediction, reported in terms of Pearson correlation. (*These baselines cover only a subset of the eight emotions; their overall scores are averaged based only on the labels they cover.)

3.1 Unsupervised Prediction

While past work on emotion intensity prediction has considered it an entirely separate task, we here consider emotion intensity scoring as similar in nature to regular lexical associations between words. Given two words $w_1$, $w_2$, the cosine similarity $\cos(\mathbf{v}_{w_1}, \mathbf{v}_{w_2})$ of their corresponding word vectors $\mathbf{v}_{w_1}$, $\mathbf{v}_{w_2}$ is expected to reflect their association, as most word vector spaces capture semantic relatedness Hill et al. (2014).

Hence, for an unsupervised estimation of emotion intensities, we simply look up the association between a targeted vocabulary word $w$ and the emotion $e$ under consideration, obtaining

\sigma_{\mathrm{u}}(w, e) = \cos(\mathbf{v}_w, \mathbf{v}_{w_e}) = \frac{\mathbf{v}_w \cdot \mathbf{v}_{w_e}}{|\mathbf{v}_w|\, |\mathbf{v}_{w_e}|}   (1)

as the emotion intensity score. Here, we assume there is a single word $w_e$ denoting $e$ (e.g., joy).
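A minimal sketch of this estimator, assuming the word vectors have been loaded into a dictionary of NumPy arrays (all names are ours, for illustration only):

    # Unsupervised emotion intensity score (Eq. 1): cosine similarity between
    # the vector of word w and the vector of the emotion word w_e.
    import numpy as np

    def sigma_u(word, emotion, vectors):
        v_w, v_e = vectors[word], vectors[emotion]
        return float(np.dot(v_w, v_e) / (np.linalg.norm(v_w) * np.linalg.norm(v_e)))

    # E.g., sigma_u("party", "joy", vectors) should exceed sigma_u("sea", "joy", vectors).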

3.2 Supervised Prediction

If a training set is available, then for each emotion $e$ covered by it, we train a regression model $f_e(\mathbf{v}_w \mid \theta_e)$ parametrized by $\theta_e$ that allows us to predict the emotion intensity for $e$, given the vector of a word as input. Finally, we simply define

\sigma_{\mathrm{s}}(w, e) = f_e(\mathbf{v}_w \mid \theta_e)   (2)

to invoke the relevant $f_e$ based on the emotion $e$ under consideration.

We consider two kinds of models. The first is a feedforward neural network with a single ReLU-activated hidden layer of 64 neurons, a dropout rate of 0.2 applied to the hidden layer, and a single output neuron that predicts the intensity of the emotion. The loss function is the mean squared error of the prediction with respect to the ground truth. The second is a support vector regression (SVR) model with an RBF kernel.
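The two regressors can be instantiated roughly as follows. The architecture and hyperparameters follow the description above, while the choice of libraries and any unspecified settings are assumptions on our part:

    # Per-emotion regressors: a small feedforward network and an RBF-kernel SVR.
    import torch.nn as nn
    from sklearn.svm import SVR

    def make_ffnn(dim):
        # One ReLU hidden layer of 64 units with dropout 0.2, one output neuron;
        # trained with mean squared error against the ground-truth intensities.
        return nn.Sequential(
            nn.Linear(dim, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1),
        )

    loss_fn = nn.MSELoss()   # MSE loss, as stated above
    svr = SVR(kernel="rbf")  # the alternative SVR model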

3.3 Self-Supervised Prediction

Finally, we propose a hybrid self-supervised technique that relies on supervised learning but does not require any pre-existing training data. Instead, for each emotion $e$, we first identify sets

T_e^+ = \operatorname*{argmax}_{T \subset \mathcal{V},\, |T| = k} \sum_{w \in T} \sigma_{\mathrm{u}}(w, e)   (3)

T_e^- = \operatorname*{argmin}_{T \subset \mathcal{V},\, |T| = k} \sum_{w \in T} \sigma_{\mathrm{u}}(w, e)   (4)

of words in the vocabulary $\mathcal{V}$ with the top $k$ highest and lowest intensity predictions as defined by Eq. 1, i.e., our unsupervised technique. For each $e$, $T_e^+ \cup T_e^-$ along with the corresponding labels is used to train a supervised model $\bar{f}_e(\mathbf{v}_w \mid \bar{\theta}_e)$ as in Section 3.2. Finally, the overall model again just involves invoking the relevant $\bar{f}_e$:

\sigma_{\bar{\mathrm{s}}}(w, e) = \bar{f}_e(\mathbf{v}_w \mid \bar{\theta}_e)   (5)
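Concretely, since $T_e^+$ and $T_e^-$ simply collect the $k$ highest- and lowest-scoring words, the procedure can be sketched as follows, reusing sigma_u from the sketch in Section 3.1. We assume, plausibly but without explicit confirmation above, that the pseudo-labels are the $\sigma_{\mathrm{u}}$ scores themselves:

    # Self-supervised training (Eqs. 3-5): score the vocabulary with Eq. 1,
    # keep the top-k and bottom-k words, and fit a supervised regressor on them.
    import numpy as np

    def self_supervised_fit(emotion, vocab, vectors, make_regressor, k=100):
        scores = np.array([sigma_u(w, emotion, vectors) for w in vocab])
        order = np.argsort(scores)
        idx = np.concatenate([order[-k:], order[:k]])  # T_e^+ and T_e^-
        X = np.stack([vectors[vocab[i]] for i in idx])
        y = scores[idx]                                # assumed pseudo-labels
        model = make_regressor()                       # e.g., SVR(kernel="rbf")
        model.fit(X, y)
        return model                                   # \bar{f}_e in Eq. 5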

4 Evaluation

4.1 Main Results

We evaluate to what extent word vectors capture emotion intensity information by comparing the presented methods against the ground truth test sets from Section 2. The score for out-of-vocabulary words is taken to be 0.0, and we retain the same test set for each emotion across all methods.
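This evaluation protocol amounts to the following sketch, where sigma_u from Section 3.1 stands in for any of the scoring functions and all names are illustrative:

    # Pearson correlation between predicted and gold intensities on a test set
    # of (word, emotion, intensity) triples; OOV words receive a score of 0.0.
    from scipy.stats import pearsonr

    def evaluate(test_pairs, vectors, score_fn):
        gold, pred = [], []
        for word, emotion, intensity in test_pairs:
            gold.append(intensity)
            pred.append(score_fn(word, emotion, vectors) if word in vectors else 0.0)
        return pearsonr(gold, pred)[0]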

Table 1 provides an overview of the results for a series of baselines along with the unsupervised, self-supervised, and regular supervised methods proposed in Section 3. The “Overall” column reports the correlation with the union of all word–emotion pairings in the ground truth test set. As we have an equal number of word–emotion pairs for each emotion, this serves as an aggregate measure of the result quality.

Baselines.

We first evaluate the state-of-the-art emotion lexicons, finding that the scores that they provide exhibit very low correlation with the ground truth. In the case of EmoLex Mohammad and Turney (2013), this is because it merely provides binary labels, not intensities.

For DepecheMood Staiano and Guerini (2014), DepecheMood++ Araque et al. (2018), and EmoWordNet Badaro et al. (2018), we conjecture that the data-driven automated techniques used to create them based on coarse-grained document-level labels do not result in word-level scores of the same sort as those solicited from human raters. The additional labels in DepecheMood and DepecheMood++ (e.g., Amused) may carry some information about some of the labels in our ground truth data (e.g., Joy). However, since the term–document matrices are calculated independently of each other, mapping these emotions to a target emotion results in less accurate emotion association scores.

Unsupervised method.

We next evaluate various pretrained word vector models against the test set, including pretrained word2vec trained on Google News Mikolov et al. (2013), GloVe 840B Common Crawl embeddings Pennington et al. (2014), and fastText trained on Wikipedia Joulin et al. (2017). Among these, we find that fastText vectors outperform word2vec and GloVe. They also outperform the pretrained BERT-base (uncased) model Devlin et al. (2019), for which we use the average of the word-piece-level output embeddings as word-level vectors. Even higher correlation can be attained with vectors that have been post-processed. One example is the set of counter-fitted vectors Mrkšić et al. (2016), obtained by taking the PARAGRAM-SL999 vectors of Wieting et al. (2015) and optimizing them using synonymy and antonymy constraints. Better results still are obtained by the AffectVec technique Raji and de Melo (2020), which post-processes the same vectors using not only synonymy and antonymy constraints but also sentiment polarity ones. Sentiment polarity evidently helps to better distinguish different emotional associations. In contrast, the emotion-enriched word vectors (EWE) introduced by Agrawal et al. (2018) do not perform well for emotion intensity prediction.
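For reference, the BERT-based word vectors can be derived along the following lines; we assume each word is encoded in isolation, which the text does not state explicitly:

    # Static word vector from BERT: average the word-piece output embeddings.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
    model = AutoModel.from_pretrained("bert-base-uncased")

    def bert_word_vector(word):
        inputs = tokenizer(word, return_tensors="pt", add_special_tokens=False)
        with torch.no_grad():
            out = model(**inputs).last_hidden_state  # shape (1, n_pieces, 768)
        return out.mean(dim=1).squeeze(0)            # average over word pieces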

Supervised method.

For the supervised methods, we report the mean correlation over 20 runs of the learning algorithm. As expected, the models succeed in learning emotional associations from word vectors, attaining higher correlations than the unsupervised techniques. As this setting does not rely on cosine similarities, post-processing the word vector space no longer pays off, and very strong results are obtained using plain GloVe and fastText embeddings.

Self-supervised method.

Unsupervised prediction is less accurate than supervised prediction, but is applicable to arbitrary emotions without any need for pre-existing emotion-specific training data. The self-supervised approach has the same advantage of not requiring training data and can thus as well be applied to arbitrary emotions. We evaluate it for $k = 100$, i.e., with 200 automatically induced training instances per emotion. For BERT vectors, we use the fastText vocabulary to find the most and least similar words to the emotion. As expected, without access to gold standard training data, self-supervision is unable to compete with the supervised approach. However, it is able to surpass the unsupervised approach despite drawing on it for training, owing to its ability to selectively pick out the most pertinent cues from the word vector.

4.2 Unsupervised Sentence Classification

Finally, we explore using our results for unsupervised sentence-level emotion classification on the GoEmotions dataset Demszky et al. (2020) with its fine-grained inventory of 28 different labels. Emotion lexicons are often used for unsupervised analysis, but existing lexicons do not cover all emotions in this dataset (e.g., remorse, gratitude, caring). Given an input document $D$, we choose the label

\operatorname*{arg\,max}_{e \in \Sigma}\, \sum_{w \in D} \gamma_{w,D}\, \sigma_{\mathrm{u}}(w, e),   (6)

where $\Sigma$ is the set of labels, $\gamma_{w,D}$ denotes the TF-IDF score of $w$ in $D$, and $\sigma_{\mathrm{u}}(w, e)$ is our unsupervised word scoring (Eq. 1). An exception is made for the neutral label, which we choose if the average prediction score across the labels in $\Sigma$ is $\leq 0.03$, based on the reasoning that, in line with the annotation instructions in the original paper, low or conflicting scores should be labeled neutral.
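A sketch of this classifier, reusing sigma_u from Section 3.1 and assuming a precomputed TF-IDF weighting function (both names are ours):

    # Unsupervised sentence-level emotion classification (Eq. 6) with the
    # neutral-label threshold described above.
    import numpy as np

    def classify(tokens, labels, vectors, tfidf):
        scores = {e: sum(tfidf(w, tokens) * sigma_u(w, e, vectors)
                         for w in tokens if w in vectors)
                  for e in labels}                 # labels excludes "neutral"
        if np.mean(list(scores.values())) <= 0.03:
            return "neutral"                       # low or conflicting scores
        return max(scores, key=scores.get)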

The results in Table 2 show that an entirely unsupervised approach is able to greatly outperform the random baseline. For reference, we also show the results of the supervised BERT model from Demszky et al. (2020), while our models are entirely unsupervised. Note that this dataset is fairly challenging, since many of the 28 emotions are easy to confuse, e.g., fear, nervousness, embarrassment, disappointment, and disapproval.

Method             Precision  Recall    F1
Random                   3.6     3.6     3.6
Unsupervised
   word2vec             26.2    13.7    11.2
   GloVe                30.0     8.0     2.3
   fastText             30.1    10.9     5.8
   Counter-fitting      23.7    20.3    16.8
   AffectVec            25.7    20.0    16.8
Supervised
   BERT                 40.0    63.0    46.0
Table 2: Results for 28-way unsupervised sentence classification on the GoEmotions dataset, in terms of macro-averaged precision, recall, and F1-score in %.

5 Conclusion

In this paper, we show that pretrained word vectors readily provide higher-quality emotion intensity information than state-of-the-art emotion lexicons, despite not being trained on emotion-labeled data, particularly if the word vector space is post-processed. We further investigate supervised learning and find that a regular supervised variant obtains very high correlations, while a self-supervised variant that does not require gold standard training data is able to outperform the unsupervised method and can likewise be applied to arbitrary emotion labels. Overall, our results show that word vectors have remained an underexploited resource for word-level emotion intensity assessment. We will share our training/validation/test splits and lexicons to promote further research in this area.

Broader Impact

Assessing the emotions evoked by a particular text has important applications, such as discovering disappointed customers on social media, designing conversational agents that emulate human empathy by responding in a more appropriate way, and numerous forms of digital humanities analyses. However, there are also important concerns and risks when using automated techniques to assess the emotional impact of text. Clearly, consent to assess the text must have been granted, and malicious as well as potentially harmful applications must be avoided. Beyond this, however, even for applications deemed beneficial, there is a risk that inaccurate assessments may lead to undesirable outcomes. The empirical results in this paper show that many commonly used resources and techniques have a low correlation with human ratings, and that even the best BERT-based learning approaches are highly imperfect, as they may latch on to superficial lexical cues and mere correlations that can exhibit substantial biases. This is discussed further by Mohammad (2020). As a result, prior to any potential use of such techniques, a thorough analysis of potential risks must be conducted.

References