
Large Language Models for Multi-label Propaganda Detection

Tanmay Chavan*    Aditya Kane*
*Equal contribution
Pune Institute of Computer Technology, Pune
{chavantanmay1402, adityakane1}@gmail.com
Abstract

The spread of propaganda through the internet has increased drastically over the past years. Lately, propaganda detection has started gaining importance because of its negative impact on society. In this work, we describe our approach for the WANLP 2022 shared task, which addresses propaganda detection in a multi-label setting. The task requires the model to label a given text with one or more of 21 propaganda techniques. We show that an ensemble of five models performs best on the task, scoring a micro-F1 of 59.73%. We also conduct comprehensive ablations and propose various future directions for this work.

1 Introduction

The advent of social media has enabled people to view, create, and share information easily on the internet. Such information can be accessed by a very large number of people within surprisingly short periods. Moreover, most social media websites place few restrictions on what users post and lack preemptive mechanisms to screen posts before they are published. This has enabled the free flow of information from various strata of society that might otherwise have been restricted due to the lack of access to proper news sources. However, it has also led to a stark increase in the spread of propaganda through the internet. Information propagated through social media posts reflects individuals' personal opinions, and hence is often biased and lacks rigorous fact-checking. Such problems are less frequent in traditional media sources such as newspapers and TV news channels, whose content is subjected to a higher level of scrutiny.

The presence of propaganda online poses a serious threat to society, as it can polarize the majority opinion and lead to violent events. A wave of misinformation-based propaganda was observed during the COVID-19 pandemic (Cinelli et al., 2020). However, the problem of propaganda detection is much more complicated than it appears. The biggest challenge is that the bulk of propaganda is partially based on truths, but is presented in a manner that is misleading or unnecessarily polarizing. Propaganda posts are also written professionally and are compelling, which leads most readers to believe the information is authentic. All of these problems make it difficult to train a model to detect propaganda, and much more difficult to interpret the results of such models. The purpose of the shared task (Alam et al., 2022), a multi-label classification problem, is to develop effective methods for detecting propaganda in a dataset of Arabic tweets.

Transformer-based models have achieved great success in text classification tasks, and ensembles of such models often outperform the individual models. Thus, we explore both individual models and ensembles for this task. Furthermore, we experimented with oversampling, where we repeat samples carrying minority labels. We also pretrained the DeHateBERT model on 1 million tweets to study the effect of domain-specific pretraining on downstream performance. We report the results of all these experiments and propose an ensemble-based method for this task.

Figure 1: Distribution of label counts
Figure 2: Data distribution

2 Related Work

Da San Martino et al. (2019) effectively addressed the problem of categorizing propaganda into seventeen fine-grained techniques, which helps distinguish between different types of propaganda. They also presented a corpus annotated according to these classes. Previous shared tasks have also generated successful results: SemEval-2020 Task 11 (Da San Martino et al., 2020) used the PTC corpus to build models that detect and classify propaganda, and SemEval-2021 Task 6 (Dimitrov et al., 2021) helped develop novel approaches to detect propaganda in a multimodal setting. Yu et al. (2021) studied the interpretability of propaganda detection and presented an interpretable model.

BERT-based models pre-trained on large corpora have been shown to yield better performance than most deep learning approaches without pre-training (Min et al., 2021). Several BERT models pre-trained on massive Arabic datasets are available, and we test some of them for this task; AraBERT (Antoun et al., 2020), MARBERT, and ARBERT (Abdul-Mageed et al., 2021) are some examples. However, most of these models are pretrained on structured text which differs significantly from tweets. Research has shown that domain-specific pretraining can yield better performance than pretraining on general text (Brady, 2021). Hence, we used DeHateBERT (Aluru et al., 2020).

3 Data

The dataset consists of 504 training examples, 52 validation examples, and 52 testing examples. Our models were finally evaluated on a separate testing dataset consisting of 323 examples. Each example can carry one or more of the 20 propaganda techniques (the complete list of techniques can be found at https://propaganda.qcri.org/annotations/definitions.html); thus, it is a multi-label dataset. The number of label occurrences is illustrated in Figure 2. As the figure shows, the distribution is heavily skewed, indicating a strong class imbalance. We try multiple methods to mitigate this imbalance, as elaborated in Section 6. Since the dataset is multi-label, we use a one-hot encoding for each label, i.e., a multi-hot vector per example, to denote the ground truth labels.
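To make the label representation concrete, below is a minimal sketch of this multi-hot encoding; the technique names and the helper function are illustrative placeholders, not the task's actual label list.

```python
# Minimal sketch of the multi-hot (one-hot per label) encoding used for the multi-label setup.
# LABELS is a placeholder for the shared task's list of propaganda techniques.
LABELS = ["Loaded_Language", "Exaggeration-Minimisation", "Name_Calling-Labeling"]  # truncated

def encode(example_labels, label_list=LABELS):
    """Return a multi-hot vector with 1.0 at the positions of the techniques present."""
    return [1.0 if label in example_labels else 0.0 for label in label_list]

print(encode(["Loaded_Language"]))  # [1.0, 0.0, 0.0]
```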

Furthermore, we make some key observations about the number of labels per example in Figure 1. Most examples have a single label, and the number of examples with more than one label diminishes quickly, with only one example having 7 labels.

We use basic preprocessing to minimize noise in the inputs. First, we remove all links in the tweet. Then we remove user mentions and hashtags (denoted by "@" and "#" followed by a string, respectively). Finally, we replace underscores ("_") with spaces, so that the separated words contribute to the semantics of the sentence. Note that we retain emojis, since they carry significant meaning and can help the model better detect sentiment.
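A minimal sketch of these preprocessing steps, under one possible reading of them; the regular expressions below are our illustration, not the exact implementation used.

```python
import re

def preprocess(tweet: str) -> str:
    """Sketch of the preprocessing pipeline: drop links, mentions, and hashtags; keep emojis."""
    tweet = re.sub(r"https?://\S+|www\.\S+", " ", tweet)   # remove links
    tweet = re.sub(r"[@#]\S+", " ", tweet)                 # remove mentions and hashtags
    tweet = tweet.replace("_", " ")                        # underscores -> spaces
    return re.sub(r"\s+", " ", tweet).strip()              # collapse extra whitespace
```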

4 System

Given this problem of multi-label classification with a high class imbalance, we experimented with several architectures and found that DeHateBERT performed the best on the dataset. A full account of all of our successful experiments, as well as failed experiments, is given in Sections 5 and 6.

We tried several models, namely AraBERT v1, v02, and v2, MARBERT, ARBERT, XLM-RoBERTa (Conneau et al., 2020), and AraELECTRA (Antoun et al., 2021). Note that the difference between AraBERTv2 and AraBERTv02 is that the former expects input pre-segmented with the Farasa segmenter (Darwish and Mubarak, 2016), whereas the latter operates on unsegmented text; Arabic words often need to be segmented before being fed to the tokenizer. We used a specific variant of DeHateBERT, which is initialized from multilingual BERT and fine-tuned only on Arabic datasets. We found that this variant performed amongst the best in terms of micro-F1 on the test split of our dataset. Our model training is fairly straightforward: we train DeHateBERT on our multi-label dataset for 30 epochs and choose the best-performing epoch based on validation micro-F1. We use a learning rate of 3e-6 and binary cross-entropy loss, since the labels in our dataset are multi-hot vectors.
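The following sketch illustrates this fine-tuning setup with the Hugging Face transformers library; the checkpoint identifier is an assumption, and the dataset loading and training loop are omitted.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed checkpoint id for the Arabic DeHateBERT variant; substitute the actual model id.
MODEL_NAME = "Hate-speech-CNERG/dehatebert-mono-arabic"
NUM_LABELS = 21

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# problem_type="multi_label_classification" makes the model use BCEWithLogitsLoss
# on multi-hot float labels, matching the binary cross-entropy loss described above.
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=NUM_LABELS, problem_type="multi_label_classification"
)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-6)

def training_step(texts, multi_hot_labels):
    """One optimization step on a batch of tweets and their multi-hot label vectors."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    labels = torch.tensor(multi_hot_labels, dtype=torch.float)
    outputs = model(**batch, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```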

We also create an ensemble of five models: DeHateBERT, ARBERT, AraBERTv02, AraBERTv01, and MARBERT. Our ensemble system is shown in Figure 3. We use hard voting to obtain the final predictions: for each sample, we record the predicted labels of each of the five models; then, for each of the 21 labels, we check how many models predict that label, and if a majority of the models predict it, we include that label in the ensemble output for the sample. We find that this ensemble gives the best performance.

Figure 3: Ensemble system diagram. The ensemble uses hard voting: the prediction of each model is recorded, and if a majority of the models predict a label, it is included among the predicted labels. The figure illustrates this process in the multi-label setting.
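A minimal sketch of this hard-voting procedure; the array shapes and the helper function are our own illustration.

```python
import numpy as np

def hard_vote(model_predictions: np.ndarray) -> np.ndarray:
    """Majority vote over multi-hot predictions.

    model_predictions has shape (n_models, n_samples, n_labels) with 0/1 entries.
    A label is kept for a sample if more than half of the models predict it.
    """
    votes = model_predictions.sum(axis=0)          # (n_samples, n_labels)
    majority = model_predictions.shape[0] / 2
    return (votes > majority).astype(int)

# Example with 5 models, 1 sample, 3 labels: label 0 gets 3/5 votes and is kept.
preds = np.array([[[1, 0, 0]], [[1, 1, 0]], [[0, 0, 0]], [[1, 0, 0]], [[0, 1, 0]]])
print(hard_vote(preds))  # [[1 0 0]]
```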

The dataset has a significant class imbalance. To overcome this, we tried to augment the dataset by oversampling: we duplicated samples containing less frequent target classes, obtaining a larger dataset with duplicate samples but less overall class imbalance. However, this did not yield better performance. We discuss this in detail in Section 6.

5 Results

The official scoring metric for the shared task is the micro-F1 score. We present the results of the models we tried in Table 1, computed with the official scorer module provided by the organizers. The ensemble has the highest score. MARBERT and DeHateBERT have roughly similar scores and perform better than the AraBERT variants. This suggests that a model may perform better at a classification task if it is pretrained on a corpus drawn from a similar source (here, tweets) rather than a corpus with similar characteristics but drawn from a different source. The oversampled DeHateBERT model performs worse than the model trained on the original dataset.

However, ARBERT outperforms all other single models. Another key observation is that ARBERT outperforms MARBERT, which in turn outperforms all variants of AraBERT. A possible explanation is that the AraBERT variants are trained on far less data than ARBERT and MARBERT, and that ARBERT is pretrained on a wider variety of sources than MARBERT, which may account for its better performance.

We also speculate that the high performance of the ensemble arises because the constituent models are pretrained on different datasets, which enables the ensemble to capture a wider range of vocabulary and semantics and hence predict classes more accurately. The hard-voting mechanism ensures that the ensemble does not predict too many classes for each sample, limiting the number of false positives.
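For reference, the micro-F1 metric reported in Table 1 can be computed on multi-hot predictions as in the following sketch, shown here with scikit-learn rather than the official scorer; the inputs are toy values.

```python
import numpy as np
from sklearn.metrics import f1_score

# y_true and y_pred are multi-hot matrices of shape (n_samples, n_labels).
y_true = np.array([[1, 0, 1], [0, 1, 0]])
y_pred = np.array([[1, 0, 0], [0, 1, 1]])
# average="micro" aggregates true/false positives and negatives over all labels.
print(f1_score(y_true, y_pred, average="micro"))
```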

Model Micro-F1 (%)
AraBERTv01 54.195
AraBERTv2 50.841
AraBERTv02 53.996
AraBERTv02-twitter 54.135
DeHateBERT 56.484
Oversampling + DeHateBERT 52.529
MARBERT 56.556
ARBERT 59.048
Ensemble 59.725
Table 1: Results of our experiments on the WANLP-22 propaganda detection task dataset.

6 Discussion

We conducted several experiments apart from our best-performing model. Specifically, we tried pretraining on a large Arabic sentiment analysis tweet dataset as well as oversampling the classes having few samples.

We retrained the DeHateBERT model on 1 million tweets from the Large Arabic Twitter Data for Sentiment Analysis dataset using masked language modeling. We found that this pretraining on the sentiment analysis tweet dataset did not result in any performance gains. We speculate this is primarily because the number of tweets we pretrained on is much smaller than the corpus the model was originally pretrained on.
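A sketch of this continued-pretraining step with masked language modeling, assuming a Hugging Face checkpoint and a plain-text file of tweets; both names below are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL_NAME = "Hate-speech-CNERG/dehatebert-mono-arabic"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
# If the checkpoint was saved as a classifier, the MLM head is newly initialized here.
model = AutoModelForMaskedLM.from_pretrained(MODEL_NAME)

# tweets.txt is a placeholder: one preprocessed tweet per line.
dataset = load_dataset("text", data_files={"train": "tweets.txt"})["train"]
dataset = dataset.map(lambda x: tokenizer(x["text"], truncation=True, max_length=128),
                      batched=True, remove_columns=["text"])

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="dehatebert-mlm", num_train_epochs=1,
                           per_device_train_batch_size=32),
    train_dataset=dataset,
    data_collator=collator,
)
trainer.train()
```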

In another attempt, we implemented oversampling, repeating samples of less frequent classes. We calculate the average number of examples per class and derive an oversampling factor, i.e., the number of times an example must be repeated for its class to reach the average number of samples; we clip this factor to 10. Note that, since this is a multi-label scenario, we must take care not to oversample examples that also contain the most frequent classes, otherwise the procedure has little effect on the imbalance. A sketch of this factor computation is given below.
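Since the exact procedure leaves some details open, the following sketch shows one way to compute the per-example oversampling factors described above; the function and its clipping rule are our interpretation.

```python
from collections import Counter
import numpy as np

def oversampling_factors(label_lists, clip=10):
    """Per-example repetition factors that push rare classes toward the mean class count.

    label_lists: list of label lists, one per example (multi-label).
    The factor of an example is driven by its rarest label, clipped to `clip`;
    examples containing the most frequent class are left untouched.
    """
    counts = Counter(label for labels in label_lists for label in labels)
    mean_count = np.mean(list(counts.values()))
    max_count = max(counts.values())
    factors = []
    for labels in label_lists:
        if any(counts[label] == max_count for label in labels):
            factors.append(1)  # skip examples carrying the most frequent class
            continue
        rarest = min(counts[label] for label in labels)
        factors.append(int(min(clip, max(1, round(mean_count / rarest)))))
    return factors
```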

Currently, we use hard voting to choose the final output of the ensemble. We believe better results could be obtained with a more sophisticated aggregation method, such as an SVM meta-classifier, in place of hard voting.

7 Conclusion

This paper describes our approach for the WANLP 2022 shared task. We experimented with multiple transformer-based models, namely AraBERT, ARBERT, MARBERT, and others. We also present ablations with monolingual pretraining, oversampling, and an ensemble of the aforementioned transformer-based models. We show that the ensemble of models pretrained on various sources of data performs best, with a micro-F1 score of 59.73%. We foresee several possible future directions: one line of work is to improve the ensemble mechanism and better handle the class imbalance in the multi-label setting; another is to study the effects of domain-specific pretraining on downstream classification tasks such as multi-label classification.

Acknowledgement

We thank Neeraja Kirtane for her reviews and inputs to this paper.

References

  • Abdul-Mageed et al. (2021) Muhammad Abdul-Mageed, AbdelRahim Elmadany, and El Moatez Billah Nagoudi. 2021. ARBERT & MARBERT: Deep bidirectional transformers for Arabic. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 7088–7105, Online. Association for Computational Linguistics.
  • Alam et al. (2022) Firoj Alam, Hamdy Mubarak, Wajdi Zaghouani, Preslav Nakov, and Giovanni Da San Martino. 2022. Overview of the WANLP 2022 shared task on propaganda detection in Arabic. In Proceedings of the Seventh Arabic Natural Language Processing Workshop, Abu Dhabi, UAE. Association for Computational Linguistics.
  • Aluru et al. (2020) Sai Saketh Aluru, Binny Mathew, Punyajoy Saha, and Animesh Mukherjee. 2020. Deep learning models for multilingual hate speech detection. CoRR, abs/2004.06465.
  • Antoun et al. (2020) Wissam Antoun, Fady Baly, and Hazem Hajj. 2020. AraBERT: Transformer-based model for Arabic language understanding. In Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection, pages 9–15, Marseille, France. European Language Resource Association.
  • Antoun et al. (2021) Wissam Antoun, Fady Baly, and Hazem Hajj. 2021. AraELECTRA: Pre-training text discriminators for Arabic language understanding. In Proceedings of the Sixth Arabic Natural Language Processing Workshop, pages 191–195, Kyiv, Ukraine (Virtual). Association for Computational Linguistics.
  • Brady (2021) Oliver J. Brady. 2021. Aitbert: Domain-specific pretraining on alternative social media to improve hate speech classification.
  • Cinelli et al. (2020) Matteo Cinelli, Walter Quattrociocchi, Alessandro Galeazzi, Carlo Michele Valensise, Emanuele Brugnoli, Ana Lucia Schmidt, Paola Zola, Fabiana Zollo, and Antonio Scala. 2020. The COVID-19 social media infodemic. Scientific Reports, 10(1).
  • Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised cross-lingual representation learning at scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 8440–8451, Online. Association for Computational Linguistics.
  • Da San Martino et al. (2020) Giovanni Da San Martino, Alberto Barrón-Cedeño, Henning Wachsmuth, Rostislav Petrov, and Preslav Nakov. 2020. SemEval-2020 task 11: Detection of propaganda techniques in news articles. In Proceedings of the Fourteenth Workshop on Semantic Evaluation, pages 1377–1414, Barcelona (online). International Committee for Computational Linguistics.
  • Da San Martino et al. (2019) Giovanni Da San Martino, Seunghak Yu, Alberto Barrón-Cedeño, Rostislav Petrov, and Preslav Nakov. 2019. Fine-grained analysis of propaganda in news article. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5636–5646, Hong Kong, China. Association for Computational Linguistics.
  • Darwish and Mubarak (2016) Kareem Darwish and Hamdy Mubarak. 2016. Farasa: A new fast and accurate Arabic word segmenter. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1070–1074, Portorož, Slovenia. European Language Resources Association (ELRA).
  • Dimitrov et al. (2021) Dimitar Dimitrov, Bishr Bin Ali, Shaden Shaar, Firoj Alam, Fabrizio Silvestri, Hamed Firooz, Preslav Nakov, and Giovanni Da San Martino. 2021. SemEval-2021 task 6: Detection of persuasion techniques in texts and images. In Proceedings of the 15th International Workshop on Semantic Evaluation (SemEval-2021), pages 70–98, Online. Association for Computational Linguistics.
  • Min et al. (2021) Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heinz, and Dan Roth. 2021. Recent advances in natural language processing via large pre-trained language models: A survey.
  • Yu et al. (2021) Seunghak Yu, Giovanni Da San Martino, Mitra Mohtarami, James Glass, and Preslav Nakov. 2021. Interpretable propaganda detection in news articles. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2021), pages 1597–1605, Held Online. INCOMA Ltd.