
Sentiment Analysis of Persian-English Code-mixed Texts

1st Nazanin Sabri
Electrical and Computer Engineering
University of Tehran
Tehran, Iran
nazanin.sabri@ut.ac.ir

2nd Ali Edalat
Electrical and Computer Engineering
University of Tehran
Tehran, Iran
ali.edalat@ut.ac.ir

3rd Behnam Bahrak
Electrical and Computer Engineering
University of Tehran
Tehran, Iran
bahrak@ut.ac.ir
Abstract

The rapid production of data on the internet and the need to understand, from business and research perspectives, how users are feeling have prompted the creation of numerous automatic monolingual sentiment detection systems. More recently, however, due to the unstructured nature of data on social media, we are observing more instances of multilingual and code-mixed texts. This development has created a new demand for code-mixed sentiment analysis systems. In this study, we collect and label a dataset of Persian-English code-mixed tweets. We then introduce a model that uses BERT pretrained embeddings as well as translation models to automatically learn the polarity scores of these tweets. Our model outperforms baseline models that use Naive Bayes and Random Forest methods.

Index Terms:
code-mixed language, sentiment analysis, Persian-English text

I Introduction

Online social networking platforms place very few constraints on the type and structure of the textual content posted online. The unstructured nature of this content results in texts that can stray far from the grammatical, syntactic, or semantic rules of the language they are written in [1, 2]. One deviation that is observed quite often is the use of words from more than one language in the same text [3], commonly known as “code-mixed” text. These texts are written in language A but include one or more words from language B (either written in the script of language B or transliterated into language A).
In this study, we investigate Persian-English code-mixed data and perform sentiment analysis on these texts. Sentiment analysis of such texts differs from that of purely Persian texts because the words carrying the emotional content may be written in English, preventing Persian-only sentiment analysis models from producing the correct outputs. Grammatical differences can also render such monolingual models useless [4]. The difficulties of this task have been demonstrated for other language pairs [5, 6]. We chose English as the second language because of its prominent usage overall and, more importantly, among Persian speakers.
Since Persian is a low-resource language, in order to perform the aforementioned task we first needed to create a dataset of texts labeled with the correct sentiment scores. We therefore used the Twitter API to search for a list of “Finglish” words (English words written using the Persian/Farsi alphabet). After the data was collected, two annotators labeled all 3,640 tweets. A third annotator was then added to the project to label the tweets on which the first two annotators disagreed.
Once the dataset was complete, an ensemble model was created and used to detect the sentiment scores of the texts.
The rest of this paper is structured as follows: Section II provides a brief overview of related work. Next, we look at our dataset in detail in Section III. Our models, as well as our text cleaning and preparation steps are described in Section IV. We then report our results in Section V and the study is concluded in Section VI.

II Related Work

The prevalence of code-mixed textual data, due to the unstructured and uncontrolled nature of the web and social networks, has resulted in growing attention to the topic in recent years, including multiple shared tasks on the subject [7, 8, 9, 10].
One language pair that has been the center of attention in code-mixed text analysis is Hindi-English [11, 12, 13, 14, 15, 16]. With the large population of multilingual individuals in India, such forms of text have become quite common. Other language pairs (such as Bengali-English, Bambara-French, and Tamil-English) have been studied as well [5, 17, 18, 19].
Some studies approach the task by hand-engineering helpful features. In [20], various features (e.g., the number of code switches in the text) were introduced and fed to a multi-layer perceptron model. Counts of sentiment and opinion lexicon matches, the number of uppercase words, and POS tags are among the features that have been utilized.
Other studies seek methods with less need for feature engineering, for instance using cross-lingual word embeddings [21] or subword embeddings [13]. In the SemEval-2020 task on Spanglish and Hinglish [10], BERT-based and ensemble methods are reported to be the most common and successful approaches. In [22], an approach is introduced to deal with different variations of the same word by substituting words based on their context words. In [23], a benchmark for linguistic code-switching is presented, the aim of which is to enable standardized evaluation of models.
To the best of our knowledge, the dataset annotated and presented as part of this study is the first dataset for the Code-mixed Persian-English sentiment analysis task. We also believe that there are no other studies on this specific subset of the topic available.

III Data

In this section, we describe the distribution and characteristics of our data. As described in Section I, our dataset consists of 3,640 tweets labeled with polarity values. The dataset fields include the search term whose API query retrieved the tweet, the text content of the tweet, the labels assigned to it, and the final label, which is calculated through majority voting.
We selected a list of 44 unique Finglish (English words transliterated into the Persian alphabet) terms in order to collect this data. Some of the words in the list include: پرفکت (perfect), هپی (happy), بورینگ (boring), and سینگل (single).
In our dataset, 69.2% of the instances received identical labels from the first two annotators. A third annotator was then asked to label the rest of the data. In the resulting dataset, 15.7% of the tweets were labeled as positive, 59.7% as negative, and 21.5% as neutral. Table I displays two examples of annotated data in our dataset. The predominance of negative data can be explained by two facts. First, due to the access restrictions placed on Twitter in Iran, only 9.24% of Iranians use Twitter [24]; as a result, the users on the platform largely share a particular belief system [25], which could account for the observed negative opinions. Second, the data collection was conducted in the last months of 2019, which included the beginning stages of the spread of the Coronavirus in Iran. Even though our keywords did not relate to the Coronavirus in any way, the shift in mood was observable in our data. We believe, however, that this imbalance reflects the current state of our society; the dataset can still be used for code-mixed sentiment detection, as the tweets exhibit the characteristics of code-mixed language and there are enough examples of the other polarity classes for a model to learn from.
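The label-aggregation step can be sketched as follows; this is a minimal illustration under our reading of the procedure, not the authors' code, and the function name is our own:

```python
from collections import Counter

def majority_vote(labels):
    """Return the label assigned by a majority of annotators.

    With three annotators and three classes, the third annotator's label
    breaks the tie between the first two; a plain majority therefore exists
    unless all three disagree, in which case an extra resolution rule
    (not described in the paper) would be needed.
    """
    label, count = Counter(labels).most_common(1)[0]
    return label

final = majority_vote(["negative", "neutral", "negative"])  # → "negative"
```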
To preserve the privacy of the users, all user mentions in the texts have been replaced with @USERMENTION. However, since all tweets were public (at least at the time of collection) and were collected using the official Twitter API, tweet IDs have been provided to allow use in future research should the need arise. This dataset is publicly available on GitHub (https://github.com/nazaninsbr/Persian-English-Code-mixed-Sentiment-Analysis).

TABLE I: Example annotated tweets from our dataset
Tweet Label
@USERMENTION زندگی خوشگل نیست، بورینگ و بی معنیه (“Life isn't pretty; it's boring and meaningless”) Negative
اوهوم هپی نیو آواتار و ازین حرفا بانو جان (“Uh-huh, happy new avatar and all that, dear lady”) Positive

IV Method

In this section, we describe our text processing, feature extraction, and data representation methods, as well as our model.
In the text processing step, we aim to create a vectorized representation of the textual input so that the data can be fed into our machine learning model. To do so, we take the following steps:

  1. Finding the non-Persian words in the sentence: By the definition of our task, we know that the text we are processing includes non-Persian words; however, we do not know which words they are. This knowledge is useful because identifying these words allows us to use methods such as translation to convert them to their original language. To find them, we use a list of Persian words collected from Wikipedia (https://github.com/behnam/persian-words-frequency/blob/master/persian-wikipedia.txt). The large collection of articles available on Wikipedia ensures that the most frequent words of the language appear in the list. We then check every word of the text against this list; if a word is not in the list, we add it to our non-Persian word candidates.

  2. Translation: Next, we translate the non-Persian words. First, we use an automatic tool, Yandex Translate (https://translate.yandex.com). This tool, however, faces difficulties when asked to translate Twitter-specific slang and expressions. Thus, we use common Twitter expression lists (bit.ly/3h2qyNm) to create a dictionary of our own.

  3. Embedding creation: To create an embedding for our textual data, we use the pretrained multilingual BERT [26] model that Google has provided on its GitHub page (https://github.com/google-research/bert). Since the model is multilingual, it can create embeddings for words from more than one language, which fits our problem well, as code-mixed data by definition contains words from more than one language. Furthermore, since the model uses subword tokenization and embeddings, it can offer a better understanding of slang and other uncommon words that are made up of better-known subwords.
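The first two steps above can be sketched as follows. The miniature word list and slang dictionary are hypothetical stand-ins for the Wikipedia frequency list and the Twitter-expression dictionary described above; in the actual pipeline, words not covered by the dictionary would additionally be sent to Yandex, which this self-contained sketch omits:

```python
# Hypothetical miniature stand-ins for the real resources.
PERSIAN_WORDS = {"زندگی", "نیست", "و", "بی"}        # Wikipedia frequency list
SLANG_DICT = {"بورینگ": "boring", "هپی": "happy"}   # Twitter-expression dictionary

def non_persian_candidates(tweet):
    """Step 1: any token absent from the Persian word list is a candidate."""
    return [tok for tok in tweet.split() if tok not in PERSIAN_WORDS]

def translate(token):
    """Step 2: dictionary lookup first; an external translator (the paper
    uses Yandex) would handle tokens the dictionary does not cover.
    This sketch simply returns the token unchanged in that case."""
    return SLANG_DICT.get(token, token)

tweet = "زندگی بورینگ نیست"
candidates = non_persian_candidates(tweet)        # ["بورینگ"]
translated = [translate(t) for t in candidates]   # ["boring"]
```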

After our data passes these steps, it is fed into an ensemble model consisting of three Bidirectional Long Short-Term Memory (Bi-LSTM) networks.

  • Our first network is a stacked Bi-LSTM network. The model is quite simple and aims to encode the general information available in the sentences.

  • Our second model adds the attention mechanism to the Bi-LSTM network which enables the model to pay more attention to the most important words in the sentence. The attention mechanism also helps with the encoding of long-distance dependencies and information.

  • Our third model uses pooling layers, another way of making sure that long-distance dependencies are accounted for and that information is not lost.

The final model takes the outputs of all three models and uses a weighted average to produce the output. To find the best weight assigned to the output of each model, we use the optimization algorithms offered by SciPy [27].
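The weight search can be sketched as follows. The paper uses SciPy's optimization algorithms; to keep this sketch self-contained, it instead performs a simple grid search over the three-model weight simplex, and all function names are illustrative:

```python
import itertools

def weighted_average(probas, weights):
    """Combine per-model class-probability vectors with a weighted average."""
    n_classes = len(probas[0])
    return [sum(w * p[c] for w, p in zip(weights, probas))
            for c in range(n_classes)]

def find_weights(model_outputs, gold, step=0.1):
    """Search the 3-model weight simplex for weights maximizing accuracy.

    model_outputs[i][j] is model i's probability vector for example j;
    gold[j] is the true class index of example j.
    """
    best_w, best_acc = None, -1.0
    grid = [i * step for i in range(int(1 / step) + 1)]
    for w1, w2 in itertools.product(grid, repeat=2):
        w3 = 1.0 - w1 - w2
        if w3 < -1e-9:            # outside the simplex
            continue
        correct = 0
        for j, label in enumerate(gold):
            combined = weighted_average([m[j] for m in model_outputs],
                                        (w1, w2, w3))
            if max(range(len(combined)), key=combined.__getitem__) == label:
                correct += 1
        acc = correct / len(gold)
        if acc > best_acc:
            best_acc, best_w = acc, (w1, w2, w3)
    return best_w, best_acc
```

In practice a continuous optimizer (as offered by SciPy) explores the simplex more finely than this coarse grid, but the objective being maximized is the same.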

V Results

We used 10-fold cross-validation and averaged the metrics across folds in order to report more reliable values. Additionally, to contextualize our results, Naive Bayes and Random Forest models were also applied to the data. The results are presented in Table II.
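The evaluation protocol can be sketched as follows; `train_and_score` is a hypothetical callback standing in for fitting a model on the training folds and computing one metric on the held-out fold:

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k contiguous, near-equal folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(n, k, train_and_score):
    """Average a metric over k folds.

    train_and_score(train_idx, test_idx) fits a model on train_idx and
    returns one metric value computed on the held-out test_idx.
    """
    folds = k_fold_indices(n, k)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```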

TABLE II: 10-fold cross-validation model performance results: each reported metric is calculated by averaging the values across all 10 folds
Model Name Accuracy Precision Recall F1
Naive Bayes 60.88 69.00 60.88 47.17
Random Forest 62.12 60.37 62.12 57.67
Our Model 66.17 64.16 65.99 63.66

We can see that our ensemble model outperforms the baseline models on all metrics. Through our experiments, we find that the attention and pooling mechanisms both improve model performance. We further find that the ensemble of all three networks performs better than each individual model, as each model appears to make up for the shortcomings of the others.

VI Conclusion

In this study, we presented a Persian-English code-mixed dataset consisting of 3,640 tweets collected through the Twitter API, each labeled with its corresponding polarity score. We then used neural classification models to learn the polarity scores of these data. Our models employ Yandex and dictionary-based translation techniques to translate the code-mixed words in our texts, and use pretrained BERT embeddings to represent the data. Our model reached an accuracy of 66.17% and an F1 score of 63.66% on the data.
Future work could focus on other methods of dealing with the code-mixed words or ways in which we could find word-based polarity scores for our code-mixed words using sentence level scores.

References

  • [1] Monika Arora and Vineet Kansal. Character level embedding with deep convolutional neural network for text normalization of unstructured data for twitter sentiment analysis. Social Network Analysis and Mining, 9(1):12, 2019.
  • [2] Geetika Gautam and Divakar Yadav. Sentiment analysis of twitter data using machine learning approaches and semantic analysis. In 2014 Seventh International Conference on Contemporary Computing (IC3), pages 437–442. IEEE, 2014.
  • [3] Anab Maulana Barik, Rahmad Mahendra, and Mirna Adriani. Normalization of indonesian-english code-mixed twitter data. In Proceedings of the 5th Workshop on Noisy User-generated Text (W-NUT 2019), pages 417–424, 2019.
  • [4] Wei Xu, Bo Han, and Alan Ritter. Proceedings of the workshop on noisy user-generated text. In Proceedings of the Workshop on Noisy User-generated Text, 2015.
  • [5] Soumil Mandal, Sainik Kumar Mahata, and Dipankar Das. Preparing bengali-english code-mixed corpus for sentiment analysis of indian languages. arXiv preprint arXiv:1803.04000, 2018.
  • [6] Somnath Banerjee, Sahar Ghannay, Sophie Rosset, Anne Vilnat, and Paolo Rosso. Limsi_upv at semeval-2020 task 9: Recurrent convolutional neural network for code-mixed sentiment analysis. arXiv preprint arXiv:2008.13173, 2020.
  • [7] Braja Gopal Patra, Dipankar Das, and Amitava Das. Sentiment analysis of code-mixed indian languages: An overview of sail_code-mixed shared task@ icon-2017. arXiv preprint arXiv:1803.06745, 2018.
  • [8] Thamar Solorio, Elizabeth Blair, Suraj Maharjan, Steven Bethard, Mona Diab, Mahmoud Ghoneim, Abdelati Hawwari, Fahad AlGhamdi, Julia Hirschberg, Alison Chang, et al. Overview for the first shared task on language identification in code-switched data. In Proceedings of the First Workshop on Computational Approaches to Code Switching, pages 62–72, 2014.
  • [9] Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. In Proceedings of the 14th International Workshop on Semantic Evaluation (SemEval-2020), Barcelona, Spain, December 2020. Association for Computational Linguistics.
  • [10] Parth Patwa, Gustavo Aguilar, Sudipta Kar, Suraj Pandey, Srinivas PYKL, Björn Gambäck, Tanmoy Chakraborty, Thamar Solorio, and Amitava Das. Semeval-2020 task 9: Overview of sentiment analysis of code-mixed tweets. arXiv e-prints, pages arXiv–2008, 2020.
  • [11] Vivek Srivastava and Mayank Singh. Iit gandhinagar at semeval-2020 task 9: code-mixed sentiment classification using candidate sentence generation and selection. arXiv preprint arXiv:2006.14465, 2020.
  • [12] Aditya Joshi, Ameya Prabhu, Manish Shrivastava, and Vasudeva Varma. Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 2482–2491, 2016.
  • [13] Ameya Prabhu, Aditya Joshi, Manish Shrivastava, and Vasudeva Varma. Towards sub-word level compositions for sentiment analysis of hindi-english code mixed text. arXiv preprint arXiv:1611.00472, 2016.
  • [14] Dinkar Sitaram, Savitha Murthy, Debraj Ray, Devansh Sharma, and Kashyap Dhar. Sentiment analysis of mixed language employing hindi-english code switching. In 2015 International Conference on Machine Learning and Cybernetics (ICMLC), volume 1, pages 271–276. IEEE, 2015.
  • [15] Koustav Rudra, Shruti Rijhwani, Rafiya Begum, Kalika Bali, Monojit Choudhury, and Niloy Ganguly. Understanding language preference for expression of opinion and sentiment: What do hindi-english speakers do on twitter? In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1131–1141, 2016.
  • [16] Madan Gopal Jhanwar and Arpita Das. An ensemble model for sentiment analysis of hindi-english code-mixed data. arXiv preprint arXiv:1806.04450, 2018.
  • [17] Arouna Konate and Ruiying Du. Sentiment analysis of code-mixed bambara-french social media text using deep learning techniques. Wuhan University Journal of Natural Sciences, 23(3):237–243, 2018.
  • [18] Bharathi Raja Chakravarthi, Navya Jose, Shardul Suryawanshi, Elizabeth Sherly, and John P McCrae. A sentiment analysis dataset for code-mixed malayalam-english. arXiv preprint arXiv:2006.00210, 2020.
  • [19] Bharathi Raja Chakravarthi, Vigneshwaran Muralidaran, Ruba Priyadharshini, and John P McCrae. Corpus creation for sentiment analysis in code-mixed tamil-english text. arXiv preprint arXiv:2006.00206, 2020.
  • [20] Souvick Ghosh, Satanu Ghosh, and Dipankar Das. Sentiment identification in code-mixed social media text. arXiv preprint arXiv:1707.01184, 2017.
  • [21] Pranaydeep Singh and Els Lefever. Sentiment analysis for hinglish code-mixed tweets by means of cross-lingual word embeddings. In Proceedings of the The 4th Workshop on Computational Approaches to Code Switching, pages 45–51, 2020.
  • [22] Rajat Singh, Nurendra Choudhary, and Manish Shrivastava. Automatic normalization of word variations in code-mixed social media text. arXiv preprint arXiv:1804.00804, 2018.
  • [23] Gustavo Aguilar, Sudipta Kar, and Thamar Solorio. Lince: A centralized benchmark for linguistic code-switching evaluation. arXiv preprint arXiv:2005.04322, 2020.
  • [24] Iran’s Social Media Statistics, (accessed October 25, 2020). https://gs.statcounter.com/social-media-stats/all/iran.
  • [25] Emad Khazraee. Mapping the political landscape of persian twitter: The case of 2013 presidential election. Big Data & Society, 6(1):2053951719835232, 2019.
  • [26] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  • [27] Pauli Virtanen, Ralf Gommers, Travis E. Oliphant, Matt Haberland, Tyler Reddy, David Cournapeau, Evgeni Burovski, Pearu Peterson, Warren Weckesser, Jonathan Bright, Stéfan J. van der Walt, Matthew Brett, Joshua Wilson, K. Jarrod Millman, Nikolay Mayorov, Andrew R. J. Nelson, Eric Jones, Robert Kern, Eric Larson, CJ Carey, İlhan Polat, Yu Feng, Eric W. Moore, Jake VanderPlas, Denis Laxalde, Josef Perktold, Robert Cimrman, Ian Henriksen, E. A. Quintero, Charles R. Harris, Anne M. Archibald, Antônio H. Ribeiro, Fabian Pedregosa, Paul van Mulbregt, and SciPy 1.0 Contributors. SciPy 1.0: fundamental algorithms for scientific computing in python. Nature Methods, 17:261–272, 2020.