Guilt by Association: Emotion Intensities in Lexical Representations
Abstract
What do word vector representations reveal about the emotions associated with words? In this study, we consider the task of estimating word-level emotion intensity scores for specific emotions, exploring unsupervised, supervised, and self-supervised methods of extracting emotional associations from word vector representations. Overall, we find that word vectors carry substantial potential for inducing fine-grained emotion intensity scores, showing a far higher correlation with human ground truth ratings than achieved by state-of-the-art emotion lexicons.
1 Introduction
Vector representations of words induced using distributional co-occurrence signals provide valuable information about associations between words (Baroni et al., 2014). In this paper, we consider the question: What do word vector representations reveal about the emotions associated with words?
There has been substantial research on methods to label words with associated emotions. Crowdsourcing approaches have been used to compile databases to study the nexus between words and emotions. Mohammad and Turney (2013) labelled around 15,000 unigrams with 8 basic emotions and two sentiments, providing binary tags for each word. Mohammad (2018) instead solicited real-valued emotion intensity ratings for roughly 500–1,800 words for each of eight emotions. DepecheMood (Staiano and Guerini, 2014) and the improved DepecheMood++ (Araque et al., 2018) are based on simple statistical methods to create an emotion lexicon from emotionally tagged data crawled from specific Web sites. We conjecture that vector representations allow us to obtain higher-quality ratings of the emotional associations of words.
Word vectors (Mikolov et al., 2013; Pennington et al., 2014) have often been evaluated on standard word relatedness benchmarks. In this paper, we instead explore to what extent they encode emotional associations. Earlier methods (Strapparava and Mihalcea, 2008; Mac Kim et al., 2010) used corpus statistics in tandem with dimensionality reduction techniques for emotion prediction. Rothe et al. (2016) proposed predicting sentiment polarity ratings from word vectors, while other studies (Buechel and Hahn, 2018b; Buechel et al., 2018) predicted valence, arousal, and dominance using supervised deep neural networks. Khosla et al. (2018) proposed incorporating valence, arousal, and dominance ratings as additional signals into dense word embeddings. Buechel and Hahn (2018a) showed that emotion intensity scores can be predicted based on a lexicon providing valence–arousal–dominance ratings. We show that we can make use of word vectors to obtain high correlations with emotion intensities without any need for such ratings.
Contribution.
Overall, our intriguing finding in this paper is that dense word vectors allow us to predict much more accurate emotion associations than state-of-the-art emotion lexicons. We show that this holds even without any supervision, while different kinds of supervised setups yield even better results.
2 Emotion Intensities as Associations
While many emotion lexicons only provide binary emotion labels, it is clear that words may exhibit different degrees of association with an emotion. For instance, the word party suggests a greater degree of joy than the word sea, although the latter may also be associated with joy.
Ground truth.
The NRC Emotion/Affect Intensity Lexicon (Mohammad, 2018) provides human ratings of emotion intensity for words, scaled to $[0, 1]$ such that a score of 1 signifies that a word “conveys the highest amount of” a particular emotion, while 0 corresponds to the lowest amount. The real-valued scores were computed based on best–worst scaling of ratings solicited via crowdsourcing.
In our work, instead of viewing this resource as a lexicon that provides emotion intensity scores, we propose to treat it as providing associations between pairs of words, one of the two words being an emotion. Thus, we derive semantic association benchmarks similar to widely used semantic relatedness benchmark sets such as the RG-65 (Rubenstein and Goodenough, 1965) and WS353 (Finkelstein et al., 2001) test collections.
Data split.
In order to more fairly evaluate resources that only provide unigrams, we disregard bigrams in the lexicon, resulting in a total of 9,706 word pairs. The data covers the eight basic emotions proposed in Plutchik’s wheel of emotions (Plutchik, 1980), i.e., anger, anticipation, disgust, fear, joy, sadness, surprise, and trust. To facilitate an evaluation of unsupervised and supervised techniques, for each emotion, we split the corresponding data into training/validation/test portions with a ratio of 64%/16%/20%. This results in a total of 7,762 pairs for training and validation combined, of which 1,552 instances serve as validation data, and 1,944 pairs in the test sets. A sketch of this split procedure follows below.
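For concreteness, the following sketch shows one way such a per-emotion split could be produced; the function name, the random seed, and the input format (a list of word–emotion–score triples) are our own assumptions, not specified above:

```python
import random

def split_lexicon(triples, seed=42):
    """Split (word, emotion, score) triples into 64%/16%/20%
    train/validation/test portions, separately for each emotion.
    Function name, seed, and input format are illustrative."""
    random.seed(seed)
    by_emotion = {}
    for word, emotion, score in triples:
        by_emotion.setdefault(emotion, []).append((word, score))

    splits = {"train": [], "val": [], "test": []}
    for emotion, items in by_emotion.items():
        random.shuffle(items)
        n_train = int(0.64 * len(items))
        n_val = int(0.16 * len(items))
        splits["train"] += [(w, emotion, s) for w, s in items[:n_train]]
        splits["val"] += [(w, emotion, s) for w, s in items[n_train:n_train + n_val]]
        splits["test"] += [(w, emotion, s) for w, s in items[n_train + n_val:]]
    return splits
```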
3 Estimating Emotion Intensity
Emotion lexicons are typically created using crowdsourcing (Mohammad and Turney, 2013) or by drawing on emotion-labeled text (Staiano and Guerini, 2014; Araque et al., 2018). We investigate deriving emotion scores from word vector representations without any emotion-specific signals.
Table 1: Correlation of predicted emotion intensity scores with the ground truth test sets, per emotion and overall (∗: resource provides scores only for a subset of the eight emotions; missing scores marked “–”).

| Method | Model | Data | Anger | Anticipation | Disgust | Fear | Joy | Sadness | Surprise | Trust | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Baselines | | EmoLex | 0.081 | -0.071 | -0.037 | 0.101 | 0.059 | 0.035 | -0.186 | 0.250 | 0.076 |
| | | DepecheMood∗ | 0.089 | – | – | 0.077 | -0.020 | 0.204 | – | – | 0.042 |
| | | DepecheMood++∗ | 0.207 | – | – | 0.180 | 0.010 | 0.337 | – | – | 0.106 |
| | | EmoWordNet∗ | 0.100 | – | – | 0.077 | 0.032 | 0.190 | – | – | 0.048 |
| Unsupervised | | word2vec | 0.321 | 0.266 | 0.056 | 0.081 | 0.482 | 0.321 | 0.249 | 0.403 | 0.254 |
| | | GloVe | 0.360 | 0.413 | 0.031 | 0.137 | 0.352 | 0.281 | -0.025 | 0.439 | 0.246 |
| | | fastText | 0.388 | 0.332 | 0.071 | 0.038 | 0.306 | 0.281 | 0.186 | 0.497 | 0.249 |
| | | EWE | 0.301 | 0.434 | 0.120 | 0.127 | 0.236 | 0.181 | -0.143 | 0.348 | 0.194 |
| | | Counter-fitting | 0.354 | 0.379 | 0.075 | 0.200 | 0.507 | 0.379 | 0.334 | 0.459 | 0.322 |
| | | AffectVec | 0.426 | 0.430 | 0.262 | 0.447 | 0.631 | 0.559 | -0.078 | 0.592 | 0.419 |
| | | BERT base | 0.130 | 0.104 | -0.136 | 0.081 | 0.002 | -0.012 | -0.037 | 0.155 | 0.033 |
| Self-sup. | FFNN | word2vec | 0.339 | 0.229 | 0.300 | 0.215 | 0.412 | 0.447 | 0.103 | 0.311 | 0.311 |
| | | GloVe | 0.266 | 0.256 | 0.214 | 0.158 | 0.351 | 0.386 | 0.024 | 0.207 | 0.239 |
| | | fastText | 0.352 | 0.300 | 0.348 | 0.272 | 0.347 | 0.530 | 0.137 | 0.469 | 0.359 |
| | | EWE | 0.259 | 0.242 | 0.251 | 0.174 | 0.330 | 0.304 | -0.086 | 0.270 | 0.228 |
| | | Counter-fitting | 0.345 | 0.305 | 0.177 | 0.205 | 0.467 | 0.353 | 0.119 | 0.445 | 0.300 |
| | | AffectVec | 0.411 | 0.418 | 0.273 | 0.417 | 0.631 | 0.564 | -0.091 | 0.574 | 0.405 |
| | | BERT base | 0.347 | 0.081 | 0.259 | 0.080 | 0.389 | 0.327 | 0.114 | 0.310 | 0.238 |
| Self-sup. | SVR | word2vec | 0.410 | 0.227 | 0.311 | 0.226 | 0.331 | 0.472 | 0.098 | 0.361 | 0.303 |
| | | GloVe | 0.419 | 0.379 | 0.335 | 0.256 | 0.494 | 0.509 | 0.061 | 0.441 | 0.386 |
| | | fastText | 0.346 | 0.302 | 0.409 | 0.272 | 0.389 | 0.513 | 0.158 | 0.480 | 0.371 |
| | | EWE | 0.348 | 0.327 | 0.386 | 0.249 | 0.390 | 0.486 | -0.094 | 0.487 | 0.322 |
| | | Counter-fitting | 0.368 | 0.349 | 0.224 | 0.213 | 0.471 | 0.388 | 0.138 | 0.469 | 0.314 |
| | | AffectVec | 0.437 | 0.457 | 0.302 | 0.490 | 0.633 | 0.600 | -0.128 | 0.605 | 0.426 |
| | | BERT base | 0.331 | 0.029 | 0.287 | 0.029 | 0.374 | 0.356 | 0.079 | 0.309 | 0.248 |
| Supervised | FFNN | word2vec | 0.681 | 0.568 | 0.711 | 0.713 | 0.695 | 0.749 | 0.712 | 0.694 | 0.699 |
| | | GloVe | 0.704 | 0.559 | 0.733 | 0.723 | 0.678 | 0.731 | 0.643 | 0.730 | 0.698 |
| | | fastText | 0.716 | 0.596 | 0.761 | 0.736 | 0.691 | 0.743 | 0.756 | 0.750 | 0.722 |
| | | EWE | 0.582 | 0.322 | 0.618 | 0.636 | 0.582 | 0.662 | 0.438 | 0.602 | 0.554 |
| | | Counter-fitting | 0.675 | 0.580 | 0.641 | 0.676 | 0.660 | 0.649 | 0.732 | 0.693 | 0.665 |
| | | AffectVec | 0.700 | 0.614 | 0.664 | 0.703 | 0.724 | 0.686 | 0.747 | 0.701 | 0.695 |
| | | BERT base | 0.624 | 0.486 | 0.550 | 0.578 | 0.578 | 0.570 | 0.612 | 0.580 | 0.533 |
| Supervised | SVR | word2vec | 0.642 | 0.566 | 0.654 | 0.654 | 0.680 | 0.734 | 0.721 | 0.664 | 0.672 |
| | | GloVe | 0.720 | 0.633 | 0.727 | 0.671 | 0.719 | 0.723 | 0.722 | 0.694 | 0.705 |
| | | fastText | 0.660 | 0.589 | 0.745 | 0.698 | 0.637 | 0.697 | 0.746 | 0.700 | 0.685 |
| | | EWE | 0.685 | 0.618 | 0.705 | 0.633 | 0.642 | 0.720 | 0.709 | 0.713 | 0.677 |
| | | Counter-fitting | 0.654 | 0.569 | 0.558 | 0.623 | 0.618 | 0.636 | 0.711 | 0.665 | 0.630 |
| | | AffectVec | 0.678 | 0.606 | 0.631 | 0.676 | 0.657 | 0.671 | 0.668 | 0.675 | 0.664 |
| | | BERT base | 0.567 | 0.478 | 0.484 | 0.532 | 0.576 | 0.559 | 0.683 | 0.556 | 0.553 |
3.1 Unsupervised Prediction
While past work on emotion intensity prediction has treated it as an entirely separate task, we here consider emotion intensity scoring as similar in nature to regular lexical associations between words. Given two words $w_1$, $w_2$, the cosine similarity $\cos(\mathbf{v}_{w_1}, \mathbf{v}_{w_2})$ of their corresponding word vectors $\mathbf{v}_{w_1}$, $\mathbf{v}_{w_2}$ is expected to reflect their association, as most word vector spaces capture semantic relatedness (Hill et al., 2014).

Hence, for an unsupervised estimation of emotion intensities, we simply look up the association between a targeted vocabulary word $w$ and the emotion $e$ under consideration, obtaining

$$\sigma_e(w) = \cos(\mathbf{v}_w, \mathbf{v}_{w_e}) \qquad (1)$$

as the emotion intensity score. Here, we assume there is a single word $w_e$ denoting $e$ (e.g., joy).
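A minimal sketch of this scorer, assuming `vectors` is a plain dictionary mapping words to NumPy arrays (populated from any of the pretrained models evaluated in Section 4); the function name is ours:

```python
import numpy as np

def unsup_intensity(word, emotion, vectors):
    """Eq. 1: cosine similarity between a word's vector and the vector
    of the word naming the emotion (e.g., "joy").
    Out-of-vocabulary words receive a score of 0.0 (cf. Section 4.1)."""
    if word not in vectors or emotion not in vectors:
        return 0.0
    v, e = vectors[word], vectors[emotion]
    return float(np.dot(v, e) / (np.linalg.norm(v) * np.linalg.norm(e)))
```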
3.2 Supervised Prediction
If a training set is available, then for each emotion $e$ covered by it, we train a regression model $f_{\theta_e}$ parametrized by $\theta_e$ that allows us to predict the emotion intensity for $e$, given the vector $\mathbf{v}_w$ of a word $w$ as input. Finally, we simply define

$$\sigma_e(w) = f_{\theta_e}(\mathbf{v}_w) \qquad (2)$$

to invoke the relevant $f_{\theta_e}$ based on the emotion under consideration.
We consider two kinds of models. The first is a feedforward neural network with a ReLU-activated hidden layer of 64 neurons, a dropout rate of 0.2 applied to that hidden layer, and a single output neuron that predicts the emotion's intensity. The loss function is the mean squared error of the prediction with respect to the ground truth. The second option is a support vector regression (SVR) model with an RBF kernel.
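As an illustration, the two regressors could be instantiated as follows; this is a sketch using PyTorch and scikit-learn as assumed frameworks, with training details (optimizer, epochs) and SVR hyperparameters left unspecified, as the text does not fix them:

```python
import torch.nn as nn
from sklearn.svm import SVR

class IntensityFFNN(nn.Module):
    """Feedforward regressor described above: a 64-unit ReLU hidden
    layer with dropout 0.2 and a single output neuron."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, 64),
            nn.ReLU(),
            nn.Dropout(0.2),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

loss_fn = nn.MSELoss()    # mean squared error against the ground truth
svr = SVR(kernel="rbf")   # the alternative support vector regressor
```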
3.3 Self-Supervised Prediction
Finally, we propose a hybrid self-supervised technique that relies on supervised learning but does not require any pre-existing training data. Instead, for each emotion $e$, we first identify sets

$$W_e^{+} = \operatorname{top-}k_{\,w \in V}\;\sigma_e(w) \qquad (3)$$

$$W_e^{-} = \operatorname{bottom-}k_{\,w \in V}\;\sigma_e(w) \qquad (4)$$

of the words in the vocabulary $V$ with the $k$ highest and $k$ lowest intensity predictions as defined by Eq. 1, i.e., our unsupervised technique. For each $e$, the set $W_e^{+} \cup W_e^{-}$ along with the corresponding labels is used to train a supervised model $f_{\theta_e}$ as above in Section 3.2. Finally, the overall model again just involves invoking the relevant $f_{\theta_e}$:

$$\sigma_e(w) = f_{\theta_e}(\mathbf{v}_w) \qquad (5)$$
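The procedure can be sketched as follows, reusing `unsup_intensity` from Section 3.1; using the unsupervised scores themselves as training targets is our reading of "the corresponding labels" above, so treat it as an assumption:

```python
import numpy as np
from sklearn.svm import SVR

def self_supervised_model(emotion, vectors, k=100):
    """Eqs. 3-5: rank the vocabulary by the unsupervised score (Eq. 1),
    keep the k highest- and k lowest-scoring words, and fit a regressor
    on these automatically induced training instances."""
    scored = sorted((unsup_intensity(w, emotion, vectors), w)
                    for w in vectors if w != emotion)
    seeds = scored[:k] + scored[-k:]       # bottom-k and top-k words
    X = np.stack([vectors[w] for _, w in seeds])
    y = np.array([score for score, _ in seeds])
    return SVR(kernel="rbf").fit(X, y)     # or the FFNN from Section 3.2
```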
4 Evaluation
4.1 Main Results
We evaluate to what extent word vectors capture emotion intensity information by comparing the presented methods against the ground truth test sets from Section 2. The score for out-of-vocabulary words is taken to be 0.0, and we retain the same test set for each emotion.
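The evaluation loop then amounts to the following sketch; treating test items as word–emotion–score triples and using Pearson's r are our assumptions, as the text does not name the correlation coefficient:

```python
from scipy.stats import pearsonr

def evaluate(test_triples, score_fn):
    """Correlate predicted emotion intensities with the gold ratings.
    `score_fn(word, emotion)` returns 0.0 for OOV words, so the full
    test set is retained for every method."""
    gold = [score for _, _, score in test_triples]
    pred = [score_fn(word, emotion) for word, emotion, _ in test_triples]
    return pearsonr(gold, pred)[0]
```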
Table 1 provides an overview of the results for a series of baselines along with the unsupervised, self-supervised, and regular supervised methods proposed in Section 3. The “Overall” column reports the correlation with the union of all word–emotion pairings in the ground truth test set. As we have an equal number of word–emotion pairs for each emotion, this serves as an aggregate measure of the result quality.
Baselines.
We first evaluate the state-of-the-art emotion lexicons, finding that the scores they provide exhibit very low correlation with the ground truth. In the case of EmoLex (Mohammad and Turney, 2013), this is because it merely provides binary labels, not intensities.
For DepecheMood (Staiano and Guerini, 2014), DepecheMood++ (Araque et al., 2018), and EmoWordNet (Badaro et al., 2018), we conjecture that the data-driven automated techniques used to create them from coarse-grained document-level labels do not yield word-level scores of the same sort as those solicited from human raters. The additional labels in DepecheMood and DepecheMood++ (e.g., Amused) may carry some information about some of the labels in our ground truth data (e.g., Joy). However, since the term–document matrices are calculated independently of each other, mapping these emotions to a target emotion results in less accurate emotion association scores.
Unsupervised method.
We next evaluate various pretrained word vector models against the test set, including word2vec trained on Google News (Mikolov et al., 2013), GloVe 840B Common Crawl embeddings (Pennington et al., 2014), and fastText trained on Wikipedia (Joulin et al., 2017). Among these, we find that fastText vectors outperform word2vec and GloVe. They also outperform the pretrained BERT-base (uncased) model (Devlin et al., 2019), for which we use the average of the word-piece-level output embeddings as word-level vectors. Even higher correlation can be attained with vectors that have been post-processed. One example is the set of counter-fitted vectors (Mrksic et al., 2016), obtained by taking the PARAGRAM-SL999 vectors of Wieting et al. (2015) and optimizing them using synonymy and antonymy constraints. Better results are obtained by the AffectVec technique (Raji and de Melo, 2020), which post-processes the same vectors using not only synonymy and antonymy constraints but also sentiment polarity ones. Sentiment polarity evidently helps to better distinguish different emotional associations. In contrast, the emotion-enriched word vectors (EWE) introduced by Agrawal et al. (2018) do not perform well for word-level emotion intensity prediction.
Supervised method.
For supervised methods, we report the mean correlation over 20 runs of the learning algorithm. As expected, the models succeed in learning emotional associations from word vectors with a higher correlation than unsupervised techniques. As this technique does not rely on cosine similarities, post-processing word vector spaces proves fruitless, and very strong results are obtained using GloVe and fastText embeddings.
Self-supervised method.
Unsupervised prediction is less accurate than supervised prediction, but is applicable to arbitrary emotions without any need for pre-existing emotion-specific training data. The self-supervised approach shares this advantage of not requiring training data and can thus likewise be applied to arbitrary emotions. We evaluate it for $k = 100$, i.e., with 200 automatically induced training instances per emotion. For BERT vectors, we use the fastText vocabulary to find the words most and least similar to the emotion. As expected, without access to gold standard training data, self-supervision is unable to compete with the supervised approach. However, it surpasses the unsupervised approach despite drawing on it for training, owing to its ability to selectively pick out the most pertinent cues from the word vectors.
4.2 Unsupervised Sentence Classification
Finally, we explore using our results for unsupervised sentence-level emotion classification on the GoEmotions dataset (Demszky et al., 2020) with its fine-grained inventory of 28 different labels. Emotion lexicons are often used for unsupervised analysis, but existing lexicons do not cover all emotions in this dataset (e.g., remorse, gratitude, caring). Given an input document $d$, we choose the label

$$\hat{y}(d) = \operatorname*{arg\,max}_{e \in \mathcal{E}} \sum_{w \in d} \operatorname{tfidf}(w, d)\; \sigma_e(w) \qquad (6)$$

where $\mathcal{E}$ is the set of labels, $\operatorname{tfidf}(w, d)$ denotes the TF-IDF score of $w$ in $d$, and $\sigma_e$ is our unsupervised word scoring (Eq. 1). An exception is made for the neutral label, which we choose if the average prediction score across the labels in $\mathcal{E}$ falls below a threshold, based on the reasoning that, similar to the annotation instructions in the original paper, low scores or conflicting ones should be labeled neutral.
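A sketch of this decision rule, reusing `unsup_intensity` from Section 3.1; here `tfidf` is assumed to be a precomputed mapping from the document's tokens to their TF-IDF scores, and the neutral threshold is left as a free parameter because its value is not recoverable from the text above:

```python
import numpy as np

def classify_sentence(tokens, tfidf, labels, vectors, neutral_threshold):
    """Eq. 6: score every emotion label by the TF-IDF-weighted sum of
    word-level intensity estimates; fall back to "neutral" when the
    average score across labels is low."""
    scores = {e: sum(tfidf[w] * unsup_intensity(w, e, vectors) for w in tokens)
              for e in labels}
    if np.mean(list(scores.values())) < neutral_threshold:
        return "neutral"
    return max(scores, key=scores.get)
```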
The results in Table 2 show that an entirely unsupervised approach is able to greatly outperform the random baseline. For reference, we also include the supervised BERT model from Demszky et al. (2020). Note that this dataset is fairly challenging, since many of the 28 emotions are easy to confuse, e.g., fear, nervousness, embarrassment, disappointment, and disapproval.
Table 2: Sentence-level emotion classification on GoEmotions (scores in %).

| Method | Precision | Recall | F1 |
|---|---|---|---|
| Random | 3.6 | 3.6 | 3.6 |
| Unsupervised | | | |
| word2vec | 26.2 | 13.7 | 11.2 |
| GloVe | 30.0 | 8.0 | 2.3 |
| fastText | 30.1 | 10.9 | 5.8 |
| Counter-fitting | 23.7 | 20.3 | 16.8 |
| AffectVec | 25.7 | 20.0 | 16.8 |
| Supervised | | | |
| BERT | 40 | 63 | 46 |
5 Conclusion
In this paper, we show that pretrained word vectors readily provide higher-quality emotional intensity information than state-of-the-art emotion lexicons, despite not being trained on emotion-labeled data, particularly if the word vector space is post-processed. We further investigate supervised learning and find that a regular supervised variant obtains very high correlations, while a self-supervised variant that does not require gold standard training data is able to outperform the unsupervised method and can as well be applied to arbitrary emotion labels. Overall, our results confirm that word vectors have remained underexplored for word-level emotion intensity assessment. We will share our training/validation/test splits and lexicons to promote further research in this area.
Broader Impact
Assessing the emotions evoked by a particular text has important applications, such as discovering disappointed customers on social media, designing conversational agents that emulate human empathy by responding in a more appropriate way, as well as numerous forms of digital humanities analyses. However, there are also important concerns and risks when using automated techniques to assess the emotional impact of text. Clearly, consent to assess the text must have been granted and malicious as well as potentially harmful applications must be avoided. Beyond this, however, even for applications deemed beneficial, there is a risk that inaccurate assessments may lead to undesirable outcomes. The empirical results in this paper show that many commonly used resources and techniques have a low correlation with human ratings and even the best BERT-based learning approaches are highly imperfect, as they may latch on to superficial lexical cues and mere correlations that may exhibit substantial biases. This is discussed further by Mohammad (2020). As a result, prior to any potential use of such techniques, a thorough analysis of potential risks must be conducted.
References
- Agrawal et al. (2018) Ameeta Agrawal, Aijun An, and Manos Papagelis. 2018. Learning emotion-enriched word representations. In Proceedings of the 27th International Conference on Computational Linguistics, pages 950–961, Santa Fe, New Mexico, USA.
- Araque et al. (2018) Oscar Araque, Lorenzo Gatti, Jacopo Staiano, and Marco Guerini. 2018. DepecheMood++: A bilingual emotion lexicon built through simple yet powerful techniques. arXiv preprint arXiv:1810.03660.
- Badaro et al. (2018) Gilbert Badaro, Hussein Jundi, Hazem Hajj, and Wassim El-Hajj. 2018. EmoWordNet: Automatic expansion of emotion lexicon using English WordNet. In Proceedings of the Seventh Joint Conference on Lexical and Computational Semantics, pages 86–93, New Orleans, Louisiana.
- Baroni et al. (2014) Marco Baroni, Georgiana Dinu, and Germán Kruszewski. 2014. Don’t count, predict! a systematic comparison of context-counting vs. context-predicting semantic vectors. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 238–247, Baltimore, Maryland.
- Buechel and Hahn (2018a) Sven Buechel and Udo Hahn. 2018a. Emotion representation mapping for automatic lexicon construction (mostly) performs on human level. In Proceedings of the 27th International Conference on Computational Linguistics, pages 2892–2904, Santa Fe, New Mexico, USA.
- Buechel and Hahn (2018b) Sven Buechel and Udo Hahn. 2018b. Word emotion induction for multiple languages as a deep multi-task learning problem. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1907–1918, New Orleans, Louisiana.
- Buechel et al. (2018) Sven Buechel, João Sedoc, H. Andrew Schwartz, and Lyle H. Ungar. 2018. Learning neural emotion analysis from 100 observations: The surprising effectiveness of pre-trained word representations. CoRR, abs/1810.10949.
- Demszky et al. (2020) Dorottya Demszky, Dana Movshovitz-Attias, Jeongwoo Ko, Alan S. Cowen, Gaurav Nemade, and Sujith Ravi. 2020. GoEmotions: A dataset of fine-grained emotions. CoRR, abs/2005.00547.
- Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota.
- Finkelstein et al. (2001) Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin. 2001. Placing search in context: The concept revisited. In Proceedings of the 10th International Conference on World Wide Web, WWW ’01, pages 406–414, New York, NY, USA. ACM.
- Hill et al. (2014) Felix Hill, Roi Reichart, and Anna Korhonen. 2014. Simlex-999: Evaluating semantic models with (genuine) similarity estimation. CoRR, abs/1408.3456.
- Joulin et al. (2017) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2017. Bag of tricks for efficient text classification. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 427–431, Valencia, Spain.
- Khosla et al. (2018) Sopan Khosla, Niyati Chhaya, and Kushal Chawla. 2018. Aff2vec: Affect-enriched distributional word representations. CoRR, abs/1805.07966.
- Mac Kim et al. (2010) Sunghwan Mac Kim, Alessandro Valitutti, and Rafael A Calvo. 2010. Evaluation of unsupervised emotion models to textual affect recognition. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text, pages 62–70.
- Mikolov et al. (2013) Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26, pages 3111–3119.
- Mohammad (2018) Saif M. Mohammad. 2018. Word affect intensities. In Proceedings of the 11th Edition of the Language Resources and Evaluation Conference (LREC 2018), Miyazaki, Japan.
- Mohammad (2020) Saif M. Mohammad. 2020. Practical and ethical considerations in the effective use of emotion and sentiment lexicons.
- Mohammad and Turney (2013) Saif M. Mohammad and Peter D. Turney. 2013. Crowdsourcing a word–emotion association lexicon. Computational Intelligence, 29(3):436–465.
- Mrksic et al. (2016) Nikola Mrksic, Diarmuid Ó Séaghdha, Blaise Thomson, Milica Gasic, Lina Maria Rojas-Barahona, Pei-Hao Su, David Vandyke, Tsung-Hsien Wen, and Steve J. Young. 2016. Counter-fitting word vectors to linguistic constraints. CoRR, abs/1603.00892.
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D. Manning. 2014. GloVe: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543.
- Plutchik (1980) Robert Plutchik. 1980. A general psychoevolutionary theory of emotion. In Theories of Emotion, pages 3–33. Elsevier.
- Raji and de Melo (2020) Shahab Raji and Gerard de Melo. 2020. What sparks joy: The AffectVec emotion database. In Proceedings of The Web Conference 2020, pages 2991–2997, New York, NY, USA. ACM.
- Rothe et al. (2016) Sascha Rothe, Sebastian Ebert, and Hinrich Schütze. 2016. Ultradense word embeddings by orthogonal transformation. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
- Rubenstein and Goodenough (1965) Herbert Rubenstein and John B. Goodenough. 1965. Contextual correlates of synonymy. Commun. ACM, 8(10):627–633.
- Staiano and Guerini (2014) Jacopo Staiano and Marco Guerini. 2014. Depeche mood: A lexicon for emotion analysis from crowd-annotated news. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), volume 2, pages 427–433.
- Strapparava and Mihalcea (2008) Carlo Strapparava and Rada Mihalcea. 2008. Learning to identify emotions in text. In Proceedings of the 2008 ACM symposium on Applied computing, pages 1556–1560.
- Wieting et al. (2015) John Wieting, Mohit Bansal, Kevin Gimpel, and Karen Livescu. 2015. From paraphrase database to compositional paraphrase model and back. Transactions of the Association for Computational Linguistics, 3:345–358.