Emotion Based Hate Speech Detection using Multimodal Learning
Abstract
In recent years, monitoring hate speech and offensive language on social media platforms has become paramount due to their widespread usage among all age groups, races, and ethnicities. Consequently, there have been substantial research efforts towards automated detection of such content using Natural Language Processing (NLP). While these methods successfully filter textual data, no research has focused on detecting hateful content in multimedia data. With the increased ease of data storage and the exponential growth of social media platforms, multimedia content now proliferates across the internet as much as text data; nevertheless, it escapes automatic filtering systems. Hate speech and offensiveness can be detected in multimedia primarily via three modalities, i.e., visual, acoustic, and verbal. Our preliminary study concluded that the most essential features for classifying hate speech are the speaker's emotional state and its influence on the spoken words, so we limit our current research to these two modalities. This paper proposes the first multimodal deep learning framework to combine the auditory features representing emotion with the semantic features to detect hateful content. Our results demonstrate that incorporating emotional attributes leads to significant improvement over text-based models in detecting hateful multimedia content. This paper also presents a new Hate Speech Detection Video Dataset (HSDVD) collected for the purpose of multimodal learning, as no such dataset exists today.
Index Terms:
Multimedia, Multimodal Learning, Hate Speech, Transfer Learning
I Introduction
Around the world, we are seeing a disturbing groundswell of racism and intolerance, including but not limited to rising antisemitism, anti-Muslim hatred, and hate crimes against Asians. Social media and other forms of communication are being exploited as platforms for bigotry. Hate speech is defined as any communication in speech, writing, or behavior that attacks or uses pejorative or discriminatory language with reference to a person or a group based on who they are; in other words, based on their religion, ethnicity, nationality, race, color, descent, gender, or other identity factors. It can lead to incitement, which explicitly and deliberately aims at triggering discrimination, hostility, and violence (https://www.un.org/en/genocideprevention/documents/UN%20Strategy%20and%20Plan%20of%20Action%20on%20Hate%20Speech%2018%20June%20SYNOPSIS.pdf). Therefore, real-time detection and filtering of all forms of hate speech is paramount.
Hate speech can be present in various formats like text, audio, images, and video on social media. Several NLP methods have been explored for hate speech detection in text, such as neural networks [1, 2, 3], n-grams [4, 5], and graph-based models [6, 7]. However, there is a lack of substantial research on multimedia data. Additionally, it should be noted that hate speech in multimedia data does not depend on the text alone but also on the emotional effect that the tone and delivery of the speech have on the listener. According to Patrick [8], hate speech and offensive behavior are linked to the emotional and psychological state of the speaker, which is reflected in the affective emotions of their language [9]. For example, a political leader calmly speaking about immigration policies at a conference is less harmful than delivering the same speech with extreme anger and disgust towards the targeted immigrants, as the latter incites hostility against the immigrants in the country. Accounting for emotion in detecting hate speech also reduces the number of false positives produced by systems that only consider text data as input, as emotion provides more context to the speaker's intent.
Therefore, we propose a method to classify hate speech using a multimodal deep learning architecture that combines semantic and emotion features extracted from speech. We manually collected a hate speech detection video dataset (HSDVD), as no such dataset exists today to the best of our knowledge. Due to the limited size of the data, transfer learning [10] was employed to pre-train the models responsible for capturing the unimodal language and speech embeddings. To summarize, there are three machine learning models. The first detects hate speech in text and was built by pre-training transformer networks such as BERT and ALBERT on existing Twitter datasets. The second, the emotion detection multi-task learning (MTL) model, is trained on the IEMOCAP dataset to predict the level of valence, arousal, and dominance in the audio. Due to the challenge of representing complex emotions [11] such as anger or fear with discrete classes, we use a dimensional representation of emotions defined by the valence, arousal, and dominance attributes [12].
Lastly, for multimodal learning (MML), a multilayer perceptron model is trained on HSDVD to detect hate speech. It takes as input embeddings generated by the best performing text and emotion models. In addition, the text-based models are fine-tuned and evaluated on HSDVD to create a benchmark for comparison with the MML framework. Both are tested on a holdout dataset from HSDVD. We experimented with two baselines, and the MML framework showed a gain in precision and recall over its respective baselines. This confirms our hypothesis on the significance of multiple modalities for the hate speech detection task.
II Literature Review
Research on multimedia hate speech content has begun to emerge in the last couple of years, yet the prior work is limited. The hateful memes challenge by Facebook drew significant attention from researchers in 2020 (https://ai.facebook.com/blog/hateful-memes-challenge-winners/). The top 3 winning teams used pre-trained multimodal transformer models to combine the visual features of the image with textual features of the caption with considerable success [13, 14, 15].
A more recent effort was also made in the field of video hate speech detection [16]. However, it focuses only on the text aspect of the video, discarding the additional features that multimedia data provides. Another study focusing on offensive video detection collected and published a dataset in Portuguese [17]. It used transcripts along with social media features like tags and titles to detect offensiveness. Nevertheless, such models depend on features available only after the content has reached a wider audience, by which time damage to the targeted group has already been done. Therefore, a need for a more sophisticated method of hate speech detection in multimedia data arises.
To perform feature extraction on text, we train a hate speech detection language model. NLP techniques for hate speech detection have evolved through several stages in the past decade. Early approaches experimented with TF-IDF [5], bag-of-words (BOW) or n-grams [4], and user-specific features such as age or social media features like shares, retweets, or reports for author profiling [6, 18]. Since the maturity of deep learning approaches, much of the research in recent years has focused on neural architectures. Badjatiya et al. [19] and Gambäck et al. [20] were the first to use recurrent neural networks (RNNs) and convolutional neural networks (CNNs), respectively, for hate speech detection in tweets. Current state-of-the-art models fine-tune pre-trained transformers like BERT and ALBERT. Seven of the top ten teams in the 2019 offensive language detection task used BERT with some variation in parameters and pre-processing [21, 22]. Again in 2020 [23], the top ten teams used combinations of BERT, RoBERTa, or XLM-RoBERTa, and the winning team used ALBERT [24].
There has been extensive research in Speech Emotion Recognition (SER), as it has a wide variety of applications such as human-computer interaction [25], sentiment analysis, and enhancing film sound design [26]. Studies can be divided into two categories based on whether they classify speech into an emotional state (happiness, sadness, anger, fear, disgust, boredom) [27] or predict the emotional attributes, namely valence, arousal, and dominance [28]. Zisad et al. [29] built a CNN model on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) merged with some locally generated, problem-specific data to classify speech emotion as calm, angry, fearful, disgusted, happy, surprised, neutral, or sad. Their work uses tonal properties like MFCCs as features for the model. Zhang et al. [30] proposed that an attention-based CNN model trained on the speech spectrogram instead of acoustic or statistical features can give better results; this work also focuses on archetype emotion classification. Along similar lines, Wieser et al. [31] compare an end-to-end learning network trained on raw audio data with a feature-based network. Bojanić et al. [32], in their call-center-based SER system, focus on both archetype emotions and the emotional attributes of speech. Parthasarathy and Busso [33] claim that the emotional attributes are interrelated and hence predicting them with a unified learning framework gives better results; they use multi-task learning with deep neural network models.
III Datasets
In this transfer learning approach, the target domain’s feature space must overlap with the source domain to ensure a positive transfer [10]. Additionally, the consistency of training data distribution in all the models is critical to the success of transfer learning [34]. Hence, we experiment with multiple hate speech detection datasets for the text models and an acoustic dataset consisting of a varying range of emotional attributes for the emotion model.
The growing interest in automated hate speech detection has led to the creation of a plethora of text datasets from sources like Yahoo, Twitter, Wikipedia comments, and Reddit [35]. Twitter datasets were selected for this task, as the microblogging platform is closest to the target multimedia domain consisting of video/audio blogs. Each dataset has its advantages but also suffers from unintended biases [36]. Therefore, multiple combinations of datasets were experimented with, and the best model was chosen for feature extraction. All the datasets were subjected to the same pre-processing steps, i.e., removing Twitter user handles, URLs, and special characters (except exclamation and question marks).
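As an illustration, a minimal sketch of this pre-processing step is shown below; the regular expressions are our own illustrative choices rather than the exact patterns used.

```python
import re

def preprocess_tweet(text: str) -> str:
    """Pre-processing described above: drop Twitter user handles, URLs,
    and special characters except exclamation and question marks."""
    text = re.sub(r"@\w+", " ", text)                   # remove user handles
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"[^A-Za-z0-9!?\s]", " ", text)       # keep alphanumerics, '!' and '?'
    return re.sub(r"\s+", " ", text).strip()            # collapse whitespace

print(preprocess_tweet("@user Check this out!! https://t.co/xyz #tag"))
# -> "Check this out!! tag"
```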
OffensEval 2019: This dataset by Zampieri et al. [37] has been used extensively in SemEval 2019 Task 6 [21] and SemEval 2020 Task 12 [23] with considerable success. The data was collected from Twitter with the help of keywords and was annotated using crowdsourcing. It consists of 13k tweets with three levels of annotation. We focus only on the level A annotations, which classify a tweet as offensive (33%) or not, and the level B annotations, which classify the target of the offense as an individual (17.8%), a group (8.2%), or otherwise. Posts containing profane language and targeted offense, including insults, threats, or swear words, were classified as offensive.
Waseem and Hovy 2016 (W&H): A dataset of 16k TweetIDs was published by Waseem and Hovy [38]. The authors annotated the tweets into three classes, namely racism (11.7%), sexism (20.0%), or neither. An additional third-party review recorded an inter-annotator agreement of 0.84. Only 10k tweets were retrieved, as the remaining had been taken down since 2016. The class distribution of the retrieved dataset is 9.2% racism, 17.5% sexism, and 73.3% neither. It was noted by Madukwe et al. [35] that the dataset might be biased towards specific users since all the racist tweets were collected from only nine users. Mishra et al. [39] also called attention to specific tweets that lacked explicit abusive traits but were annotated as racist or sexist regardless.
Davidson 2017: Davidson et al. [40] published a dataset of 24k tweets, which were labeled as hate speech (5.77%), offensive (77.43%), or neither. Tweets were collected using a lexicon compiled by hatebase.org (https://hatebase.org/) containing hateful words and phrases. It was further annotated by CrowdFlower workers (now known as Appen, https://appen.com/), and an inter-annotator agreement of 0.92 was reported. Madukwe et al. [35] noted inconsistencies in the labels and the lack of a diverse group of annotators in this dataset.
A combination of the aforementioned datasets was used to create three subsets of data for the text model experiments, as shown in Table I. OffensEval, being the most reliable data source, was used as dataset A. For dataset B, W&H was combined with a subset of the Davidson dataset; to avoid class imbalance, Davidson was sub-sampled to retain only 3.5k offensive tweets. The racist tweets from the W&H dataset were dropped in dataset C to further reduce noisy data.
Table I. Dataset combinations used for the text model experiments.

| Dataset | Composition | Hate speech | Not hate speech |
|---|---|---|---|
| A | OffensEval | 4400 (33.2%) | 8840 (66.8%) |
| B | Davidson + W&H | 8408 (41.5%) | 11839 (58.4%) |
| C | Davidson + W&H (no racist tweets) | 5620 (32%) | 11963 (68%) |
IEMOCAP Dataset 2007: The interactive emotional dyadic motion capture database, collected by the Speech Analysis and Interpretation Laboratory (SAIL) at the University of Southern California, is an audio-visual database [41]. A total of ten actors were recorded while performing given scripts, and to preserve some naturalness in the emotions, they were also asked to improvise certain hypothetical scenarios. It contains approximately 12 hours of recordings. In total, the corpus contains 10039 samples (scripted sessions: 5255 samples; spontaneous sessions: 4784 samples) with an average duration of 4.5 seconds. Annotators were asked to evaluate the corpus in terms of the attributes valence (1-negative, 5-positive), arousal (1-calm, 5-excited), and dominance (1-weak, 5-strong). Each sample was evaluated by two different annotators, and speaker-dependent z-normalization was used to compensate for inter-evaluator variation.
Hate Speech Detection Video Dataset (HSDVD): Multimedia data consists of various types of media like text, images, audio, video, and animation. No existing hate speech dataset incorporates video and audio data, which account for a significant amount of content produced on social media platforms today. For this purpose, we compiled the Hate Speech Detection Video Dataset (HSDVD). It was collected from Twitter and YouTube using the Twitter API and PyTube, respectively. Hate speech comprises abuse targeting a broad category of ethnicities, genders, and other identities. Since HSDVD will be used only to compare relative performances, and due to the availability of limited resources, we collected data concerning specific groups so that each group has sufficient training data for the model. We began by creating a lexicon of sexist and ethnic slurs (https://en.wikipedia.org/wiki/List_of_ethnic_slurs_by_ethnicity) focusing on these groups of people. Tweets with video posts were searched using the lexicon and the Twitter API. The YouTube videos, on the other hand, primarily consist of viral social media posts whose original copies were taken down; therefore, we searched by combining the lexicon with phrases like "went viral" or "racist rant". Finally, 1k records of Tweet IDs and YouTube links were collected and classified as hate speech (25%) or not. Both authors labeled each record after agreeing on the definition of hate speech and the target groups. The Cohen's kappa coefficient was 0.92, indicating high agreement.
The collected dataset focuses on hate speech targeted on the basis of gender or sexual orientation, and towards autistic people, Muslims, Jews, Sikhs, Latinos, Native Americans, and Asians. Although various definitions of hate speech exist, the common elements we considered are [42]:
- use of sexist, ethnic, or racial slurs intended to incite violence or hate against a target group
- intended to threaten or harm a target group
- abusive or offensive words being used to attack a target group
- humor being used to attack a target group
- intended to be derogatory or humiliating to a target group
- negative stereotyping of a target group
It is important to note that some terms are offensive when used to hurt a target group but can also be used casually by the same group, such as n**ga. Other terms can be used abusively but also in raps or among friends, such as f**k and bi**h. Almost all terms that can be used to cause offense can also be used for informational purposes to spread awareness. Moreover, some videos can be implicitly offensive without containing any explicit slurs. Such nuances are often difficult to differentiate using machine learning models. However, they are more easily distinguishable in audio/video, where emotion and intent are apparent, than in text alone; hence the model can benefit from both audio and text modalities.
According to Poletto et al. [36], most existing hate speech datasets may be prone to unintended annotator bias and topic bias, including the above-mentioned text datasets and HSDVD. However, our purpose is to demonstrate the effect of multiple modalities when detecting hate speech in multimedia and to compare only the models' relative performances. Hence, our results will most likely be independent of biases that may exist in the annotations.
IV Approach
We propose three deep learning models, namely text based hate speech classification, speech based emotion attribute prediction, and finally, a multimodal deep learning model to classify hate speech based on text and emotion.
Modality refers to a way in which something is experienced or expressed. Humans perceive the world through multiple modalities such as sound, image, language, smell, and taste. In the context of hate speech, an individual determines something as racist or sexist after combining visual, audio, and textual information from a particular instance. Therefore, a capable artificial intelligence system should also consider information in different formats when determining hate speech. In this paper, we focus on language, which captures the semantic information, and the vocal signal, which encodes paraverbal information through the tone, pitch, and pacing of the voice. Although visual features have the potential to provide additional perspectives on hate speech detection, their efficacy was limited when learning over the narrow scope of the current dataset. For example, only 15% of the dataset contains distinguishable facial expressions, while in 10% of the videos, religious objects could be linked to hate speech comments. Therefore, a more extensive analysis with additional resources and an extended HSDVD can be highly beneficial as future work. Nevertheless, with the current findings, we hope to build the foundation and observe the advantages of these two modalities compared to one in the multimedia hate speech detection task.
The most critical aspect for the performance of multimodal learning specific to this task is the representation and fusion of information for prediction, which can be achieved together using deep neural networks. We use a joint representation technique that combines unimodal signals into a common representation space. There are other kinds of techniques called coordinated representations that process unimodal signals separately but under certain similarity constraints to project them onto a coordinated space [43]. The joint technique is used for this task, as all the modalities are present during both training and inference steps. Mathematically, it is expressed as:
$z_m = f(x_1, \ldots, x_n)$ (1)

where the multimodal representation $z_m$ is calculated using a function $f$ (a neural network activation function) that combines the unimodal representations $x_1, \ldots, x_n$.
For unimodal representations of text and audio data, deep neural networks have become increasingly effective in the last decade [44, 45, 46]. We train two separate neural networks to learn text and audio embeddings. Since each successive layer in deep learning captures more abstract features, we use the last layer for final representations [47]. To construct the multimodal representation, the individual embeddings of both modalities are then fed to a common neural layer that projects these modalities into a joint space [48, 49]. Multiple hidden layers can be used for further training from the multimodal representations before a non-linear classification layer is added to make the predictions (as seen in Figure 2).
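As a concrete illustration of this joint representation (Equation (1)), the sketch below concatenates the unimodal embeddings and projects them into a shared space with a single dense layer; the embedding sizes follow the paper, while the joint dimension and activation are our own assumptions.

```python
import tensorflow as tf

def joint_representation(embedding_dims, joint_dim=256):
    """Toy joint-representation module: one input per modality, concatenated
    and projected onto a shared space by a dense layer (Equation (1))."""
    inputs = [tf.keras.Input(shape=(d,)) for d in embedding_dims]
    x = tf.keras.layers.Concatenate()(inputs)                   # combine unimodal embeddings
    z = tf.keras.layers.Dense(joint_dim, activation="relu")(x)  # shared projection f(x_1, ..., x_n)
    return tf.keras.Model(inputs=inputs, outputs=z)

# Text (768) and speech (510) embedding sizes used later in the paper.
fusion = joint_representation(embedding_dims=[768, 510])
fusion.summary()
```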
As neural networks require a large amount of labeled training data, it is common to pre-train embeddings. Since HSDVD contains only 1000 records, we utilize the technique of transfer learning to train the unimodal embeddings. Transfer learning helps in improving the performance of a model on the target domain by transferring knowledge from a different but related source domain [10]. In this approach of homogeneous transfer learning, $\mathcal{X}_S = \mathcal{X}_T$, where $\mathcal{X}$ is the space of all possible feature vectors in a domain $D$ [50]. As previously mentioned, this type of transfer learning is sensitive to differences in the domain and distribution of data. Therefore, the main objective is to reduce the distribution difference between the source and target domains [34]. In the pre-trained text model, both the input feature space and the labels extensively overlap with those of the target domain, whereas for the speech model, the feature space is the same but the classification objectives are different.
IV-A Text Based Hate Speech Classification Task
In order to obtain text embeddings, we experiment with transformer networks, i.e., BERT base uncased and ALBERT base uncased, pre-trained on offensive tweets from datasets A, B, and C to classify hate speech in text. These models are further fine-tuned and evaluated on HSDVD to serve as the baseline for comparing the performance of multimodal over text-based learning. BERT stands for Bidirectional Encoder Representations from Transformers [51], and its architecture has proven highly successful for transfer learning [24]. It was pre-trained on a 3.3-billion-word text corpus and can be fine-tuned on any text-based classification task. ALBERT, or A Lite BERT, is a successor of BERT that improves upon its architecture for memory-efficient and faster training [52].
The tweets from all the datasets are subjected to the same pre-processing steps as specified in section 3. The sentences are broken down into token sequences of fixed length using WordPiece embeddings [53]. A special token [CLS] is also appended to the beginning of each sequence. After model training, we experiment with two methods of extracting text embeddings. The final text feature representation to be used for MML is given by $E_t \in \mathbb{R}^{d}$, where $d$ is the size of the embedding.
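A sketch of this tokenization and embedding step using the Hugging Face transformers library is shown below; the checkpoint here is the generic pre-trained BERT rather than the model pre-trained on the hate speech datasets, and serves only to illustrate the interface.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")
model.eval()

def text_embedding(sentence: str) -> torch.Tensor:
    """Tokenize with WordPiece, prepend [CLS], pad/truncate to 128 tokens,
    and return the final hidden state of the [CLS] token (size d = 768)."""
    enc = tokenizer(sentence, padding="max_length", truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return out.last_hidden_state[:, 0, :]        # [CLS] representation, shape (1, 768)

e_t = text_embedding("example tweet after pre-processing")
print(e_t.shape)  # torch.Size([1, 768])
```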
IV-B Emotion Attribute Prediction Task
Unlike the pre-trained transformers used for text, we build the multi-task deep learning model for emotion attribute prediction from scratch. The array of human emotions can be represented in a three-dimensional feature space defined by the attributes valence, arousal, and dominance. Given the complexity of human interactions [54], it is better to represent emotions using these abstract attributes instead of discrete classes such as happy, angry, or sad [11]. Moreover, due to the interrelation between these attributes [33], a unified learning framework gives better results than single-task learning. The success of the multi-task learning paradigm has also been illustrated in the MTL survey by Zhang et al. [55]. Therefore, we formulate the prediction of each emotion attribute, ranging between 0 and 1, as a regression problem solved using a multi-task deep learning model. The proposed approach jointly learns the objective of predicting valence, arousal, and dominance as continuous values (Figure 1).
The input representation for the IEMOCAP audio data consists of speech features, including energy, entropy, chroma, spectral, and mel-frequency cepstral coefficient (MFCC) features, computed using the pyAudioAnalysis (https://github.com/tyiannak/pyAudioAnalysis) feature extractor. It extracts frame-by-frame short term features with a fixed window size, where the frame moves over the speech signal one step size at a time [56]. Over each short term feature sequence, statistical calculations such as the arithmetic mean and standard deviation are performed to extract mid term features defined by their window size and step size parameters. We experiment with two different input representations using these feature vectors, which are then fed to the neural network.
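A sketch of this feature extraction step with pyAudioAnalysis follows; the window and step sizes mirror those reported in section 5(B), and the module names correspond to recent versions of the library (older releases expose the same functionality under different names).

```python
from pyAudioAnalysis import audioBasicIO, MidTermFeatures

# Read a sample and convert it to a mono signal.
sampling_rate, signal = audioBasicIO.read_audio_file("sample.wav")
signal = audioBasicIO.stereo_to_mono(signal)

# Short-term features over 50 ms windows with 50 ms steps; mid-term statistics
# (mean, standard deviation) over 1000 ms windows with 1000 ms steps.
mid_feats, short_feats, feat_names = MidTermFeatures.mid_feature_extraction(
    signal, sampling_rate,
    int(1.0 * sampling_rate), int(1.0 * sampling_rate),     # mid-term window and step (1000 ms)
    int(0.05 * sampling_rate), int(0.05 * sampling_rate))   # short-term window and step (50 ms)

print(mid_feats.shape)  # (number of mid-term features, number of mid-term windows)
```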
[Figure 1: Multi-task learning (MTL) framework for predicting valence, arousal, and dominance with hard parameter sharing.]
The MTL framework, inspired by Caruana [57] and depicted in Figure 1, uses hard parameter sharing. It consists of two neural layers that are shared among all the attributes. These shared nodes create a joint representation for valence, arousal, and dominance. The third layer, however, is specific to each task and learns representations to optimize that task. The entire model is trained to minimize the Mean Square Error (MSE) loss. This generates one loss function per attribute, given by $L_{val}$, $L_{aro}$, and $L_{dom}$. The overall loss is calculated by taking a weighted sum of these losses. Equation (2) gives the overall loss function for the MTL model.
$L = \alpha L_{val} + \beta L_{aro} + \gamma L_{dom}$ (2)
where the values of $\alpha$, $\beta$, and $\gamma$ vary between 0 and 1 in steps of 0.1. These hyperparameters are tuned using a validation set during the experiments. Finally, the attribute specific layers are used to create speech embeddings, given by Equation (3).
$E_s = [E_{val}; E_{aro}; E_{dom}]$ (3)
where $E_{val}$, $E_{aro}$, and $E_{dom}$ are the feature vectors generated by the respective valence, arousal, and dominance specific layers.
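A minimal Keras sketch of this hard-parameter-sharing MTL model is given below; the shared layer sizes and dropout rate are illustrative assumptions, while the 170-unit attribute-specific layers match the speech embedding size reported in section 5(C).

```python
import tensorflow as tf

def build_mtl_model(input_dim=1360, alpha=0.1, beta=0.1, gamma=0.2):
    """Hard parameter sharing: two shared dense layers, then one
    attribute-specific layer and regression head per emotion attribute."""
    inp = tf.keras.Input(shape=(input_dim,))
    shared = tf.keras.layers.Dense(512, activation="relu",
                                   kernel_initializer="he_normal")(inp)
    shared = tf.keras.layers.Dense(256, activation="relu",
                                   kernel_initializer="he_normal")(shared)
    shared = tf.keras.layers.Dropout(0.2)(shared)        # dropout before the task-specific layers

    outputs = []
    for name in ("val", "aro", "dom"):
        task = tf.keras.layers.Dense(170, activation="relu",
                                     kernel_initializer="he_normal",
                                     name=f"{name}_layer")(shared)   # attribute-specific layer
        outputs.append(tf.keras.layers.Dense(1, name=name)(task))    # regression head

    model = tf.keras.Model(inp, outputs)
    # Weighted sum of the three MSE losses, as in Equation (2).
    model.compile(optimizer="adam",
                  loss={"val": "mse", "aro": "mse", "dom": "mse"},
                  loss_weights={"val": alpha, "aro": beta, "dom": gamma})
    return model

mtl = build_mtl_model()   # input_dim=1360 corresponds to the f2 representation
```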
IV-C Multimodal Hate Speech Classification Task
Figure 2 shows the multimodal learning (MML) framework where both text and audio embeddings are fed to three dense layers, followed by a classification layer to classify it as hate speech or not. As specified in section 4, a neural network layer can be used to combine the unimodal embeddings where each node activation function projects multiple modalities onto a shared space. The input can be represented as:
$E = [E_t; E_s]$ (4)
where $E_t$ and $E_s$ are the text and speech embeddings, such that $E \in \mathbb{R}^{d_t + d_s}$, with $d_t$ and $d_s$ being the respective embedding sizes. This is followed by three multilayer perceptron layers interleaved with two dropout layers to prevent overfitting [58]. Each hidden dense layer uses a rectified linear unit (ReLU) activation function [59], defined by $f(x) = \max(0, x)$. It speeds up the convergence of stochastic gradient descent and has been shown to overcome the shortcomings of other functions such as Tanh and Sigmoid [60]. For weight initialization in these layers, He normalized initialization [61] is applied. It is a minor adaptation of the Xavier normalized initialization to address specific characteristics of the ReLU activation function, which is non-linear for half of its input [62]. Finally, a sigmoid activation function is used at the output layer to predict $\hat{y}$, given by:
$\hat{y} = \sigma(W h + b)$ (5)

where the sigmoid $\sigma(z) = 1 / (1 + e^{-z})$ generates a value between 0 and 1, $h$ is the output of the last hidden layer, and $W$ and $b$ are the weights and bias of the output layer. A threshold of 0.7 is used to make the binary prediction on whether an input is hate speech or not.
[Figure 2: Multimodal learning (MML) framework combining text and speech embeddings for hate speech classification.]
To optimize the model parameters, a binary cross entropy loss is minimized. Additionally, an L2 weight constraint was applied to all layers to keep the learned weights small, as suggested by Hinton et al. [58]. Thus, the final loss function can be defined as

$L = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\right] + \lambda \sum_{j} w_j^2$ (6)

where $N$ is the number of training examples, $\lambda$ is the hyperparameter that decides the amount of penalty on the weights $w_j$, $y_i$ denotes the true labels, and $\hat{y}_i$ denotes the predicted labels. The model was optimized during training using the Adam optimizer, which has been shown to outperform other optimizers, especially towards the end of optimization as gradients become sparser [63].
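The sketch below mirrors this classifier in Keras; the 1278-dimensional input, ReLU activations, He initialization, two dropout layers, L2 penalty, sigmoid output, and 0.7 decision threshold follow the text, while the hidden-layer sizes, dropout rate, and L2 coefficient are our own assumptions.

```python
import tensorflow as tf

def build_mml_classifier(input_dim=1278, l2_penalty=1e-4, units=(512, 256, 128)):
    """Three dense layers with ReLU and He initialization, interleaved with two
    dropout layers, followed by a sigmoid output (Equations (5) and (6))."""
    reg = tf.keras.regularizers.l2(l2_penalty)
    inp = tf.keras.Input(shape=(input_dim,))
    h = inp
    for i, n in enumerate(units):
        h = tf.keras.layers.Dense(n, activation="relu",
                                  kernel_initializer="he_normal",
                                  kernel_regularizer=reg)(h)
        if i < len(units) - 1:                       # two dropout layers between the dense layers
            h = tf.keras.layers.Dropout(0.3)(h)
    out = tf.keras.layers.Dense(1, activation="sigmoid")(h)

    model = tf.keras.Model(inp, out)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="binary_crossentropy",
                  metrics=[tf.keras.metrics.Precision(), tf.keras.metrics.Recall()])
    return model

mml = build_mml_classifier()
# Binary decision with the 0.7 threshold described above:
# y_pred = (mml.predict(E) >= 0.7).astype(int)
```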
V Experiments and Evaluation
V-A Text Model Experiments
Multiple combinations of datasets and models are experimented with to generate embeddings. Both BERT and ALBERT were pre-trained on all three datasets A, B, and C, as shown in Table II. To select the best model as the benchmark, they are fine-tuned on the HSDVD training set and evaluated on a holdout test set. The baseline models can be used to test the relative performance of multimodal learning when audio features are combined with the text features. The same train and test sets are used to evaluate the MML model so that the text model serves as a true benchmark and ensures a fair comparison.
The input to both models is a token sequence generated by breaking down sentences after pre-processing, as specified in section 4(A). The maximum length of the sequence is limited to 128, and shorter sequences are padded with zeros, as suggested by Devlin et al. [51]. Each model was trained for 6 epochs with a batch size of 8, a learning rate of 5e-6, and the Adam optimizer with a weight decay rate of 0.01. After each epoch, the model was evaluated on a validation set, and the best performing epoch was saved.
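A sketch of this fine-tuning setup using the Hugging Face Trainer API is shown below; the hyperparameters follow the text, while `train_ds` and `val_ds` are placeholders for tokenized training and validation datasets.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

args = TrainingArguments(
    output_dir="hate-speech-bert",
    num_train_epochs=6,               # 6 epochs
    per_device_train_batch_size=8,    # batch size 8
    learning_rate=5e-6,               # learning rate 5e-6
    weight_decay=0.01,                # weight decay 0.01
    evaluation_strategy="epoch",      # evaluate on the validation set after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,      # keep the best performing epoch
)

# train_ds and val_ds are placeholder tokenized datasets (not defined here).
trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=train_ds, eval_dataset=val_ds)
trainer.train()
```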
The macro averages of precision, recall, and F1-score over both classes are used for an unbiased comparison. As seen in Table II, both the BERTA and ALBERTB models show competitive results. Therefore, pre-trained versions of both models with the specified parameters are used to generate text embeddings in the MML experiments.
Table II. Baseline text model results on the HSDVD test set (macro-averaged, %).

| Model | P | R | F1 |
|---|---|---|---|
| BERTA | 90.50 | 91.73 | 91.11 |
| BERTB | 88.97 | 91.12 | 90.03 |
| BERTC | 90.30 | 89.00 | 89.64 |
| ALBERTA | 89.02 | 91.25 | 90.12 |
| ALBERTB | 90.20 | 91.99 | 91.08 |
| ALBERTC | 90.10 | 88.50 | 89.29 |
V-B Emotion Model Experiments
The interactive emotional dyadic motion capture (IEMOCAP) dataset is used for training and evaluating the emotion model. It consists of 10039 audio-visual recordings of actors carrying out scripted dialogue scenarios to express various emotional states. Since the models require a fixed-size input representation and the sample lengths vary from 1 to 34 seconds, we experimented with two different feature representations, f1 and f2. Both representations use similar features like energy, entropy, chroma, spectral, and mel-frequency cepstral coefficient (MFCC) features. However, the first set contains long term features calculated over the entire sample, while the second set consists of mid term features calculated over the first 10 seconds of the sample recording. As described in section 4(B), 68 short term features are calculated over a window size of 50 ms and a step size of 50 ms at a sampling frequency of 44 kHz. Then 136 mid term features are calculated over the short term sequence, with a window size and step size of 1000 ms. For f1, the long term features are calculated by averaging over the mid term sequence. For f2, the mid term feature vectors are concatenated and padded with zeros or truncated accordingly to obtain a fixed-size representation of 1360 features spanning 10 seconds of each input signal. The feature values for both experiments are scaled to between -1 and 1 before being fed to the neural network.
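A sketch of how f1 and f2 could be derived from the mid term feature matrix (136 features by number of mid term windows) is shown below; the per-sample scaling is a simplification, since in practice the scaling statistics would be computed over the training set.

```python
import numpy as np

def make_f1(mid_feats: np.ndarray) -> np.ndarray:
    """f1: long term features obtained by averaging the mid term sequence (136 values)."""
    return mid_feats.mean(axis=1)

def make_f2(mid_feats: np.ndarray, max_windows: int = 10) -> np.ndarray:
    """f2: concatenate the mid term vectors of the first 10 one-second windows,
    zero-padding shorter samples, giving a fixed 136 * 10 = 1360-value vector."""
    n_feats, n_windows = mid_feats.shape
    padded = np.zeros((n_feats, max_windows))
    padded[:, :min(n_windows, max_windows)] = mid_feats[:, :max_windows]
    return padded.flatten(order="F")      # window-by-window concatenation

def scale_minus1_1(x: np.ndarray) -> np.ndarray:
    """Scale feature values to the [-1, 1] range before feeding the network."""
    return 2 * (x - x.min()) / (x.max() - x.min() + 1e-8) - 1
```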
Table III. Tuned hyperparameters of the two MTL emotion models.

| Model | Parameter | Value | Parameter | Value |
|---|---|---|---|---|
| MTLf1 | epochs | 30 | alpha | 0.2 |
| | learning rate | 1e-4 | beta | 0.1 |
| | learning decay | 0.99 | gamma | 0.2 |
| | L2 regularization | 1e-7 | batch size | 32 |
| MTLf2 | epochs | 18 | alpha | 0.1 |
| | learning rate | 1e-3 | beta | 0.1 |
| | learning decay | 0.96 | gamma | 0.2 |
| | L2 regularization | 1e-9 | batch size | 128 |
As seen in Figure 1, the MTL model shares two hidden layers. A dropout of 0.2 is applied to the output of these layers before passing it to the attribute specific dense layer. All hidden units are activated with the ReLU function, and weights are initialized with the He normalized distribution [61]. To further increase generalization, an L2 kernel regularization was also applied [58]. The regularization hyperparameter is tuned using a validation dataset. Table III lists the model parameters that were tuned over a hyperband during training, using the validation loss as the measure of performance. As specified in section 4(B), this also includes the loss weights for valence, arousal, and dominance, respectively.
Table IV. RMSE of the MTL models for each emotion attribute.

| Model | Valence RMSE | Arousal RMSE | Dominance RMSE |
|---|---|---|---|
| MTLf1 | 0.2594 | 0.4232 | 0.4418 |
| MTLf2 | 0.1846 | 0.1124 | 0.1431 |
We evaluate the two regression models using the Root Mean Squared Error (RMSE), a measure of the standard deviation of the residuals, i.e., the prediction errors. As seen in Table IV, MTLf2 outperforms MTLf1 in predicting all three emotion attributes and is therefore used to generate speech embeddings for the MML model.
Table V. Samples misclassified by the BERTBaseline model and the corresponding BERTA+CLS predictions.

| Sample | BERTBaseline | BERTA+CLS |
|---|---|---|
| Yea, if you are a black man, you better stay out of trouble and keep your goddamn hands outside of your pocket. You are basically begging to get shot in public. But if you are indiana nick bear drapped in a snake flag and some ninji star from the mall. Yeah, welcome to congress sir. | HateSpeech | Not HateSpeech |
| The white british population has decreased by six hundered thousand. While the minority population has increased by one twenty million. Yes lads, we are winning. | HateSpeech | Not HateSpeech |
| Anita Hill testifies that supreme court nominie Clarence Thomas sexually harrased her. Hill was called a scorned women and a litle bit nutty and a litle bit sl*tty. | HateSpeech | Not HateSpeech |
| Maternity flight suits, pregnant women are gonna fight our war. Its a mockery of the US military. China's military becomes more masculine, our military needs to become as joe biden says feminine, whatever feminine means anymore. | Not HateSpeech | HateSpeech |
V-C Multimodal Learning Experiments
The MML model is trained on HSDVD, consisting of 1k video recordings. These are first converted to the WAV audio format using the ffmpeg library (https://ffmpeg.org/) before speech processing for the emotion model. Since the data was collected from sources like Twitter and YouTube, it is more prone to noise, unlike the IEMOCAP dataset, which was recorded in a controlled environment. We apply the spectral gating technique using Audacity (https://www.audacityteam.org/) for noise reduction. The audio is then fed to the MTLf2 model to generate speech embeddings Es (Equation (3)) of size 510, since each attribute specific layer consists of 170 hidden units.
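A sketch of the audio extraction step, invoking the ffmpeg command-line tool from Python, is shown below; the mono 44.1 kHz output settings are our assumption, matching the sampling frequency used for feature extraction.

```python
import subprocess
from pathlib import Path

def video_to_wav(video_path: str, out_dir: str = "wav") -> str:
    """Extract the audio track of a video as a mono 44.1 kHz WAV file using ffmpeg."""
    Path(out_dir).mkdir(exist_ok=True)
    out_path = str(Path(out_dir) / (Path(video_path).stem + ".wav"))
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path,   # -y: overwrite existing output
         "-vn",                              # drop the video stream
         "-ac", "1", "-ar", "44100",         # mono, 44.1 kHz
         out_path],
        check=True)
    return out_path

wav_file = video_to_wav("tweet_video.mp4")   # placeholder file name
```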
For text processing, the speech samples are first converted to text using the Houndify API (https://www.houndify.com/). Further pre-processing steps are bypassed, as this data is not noisy like the Twitter data. It does contain some special characters like punctuation and contractions, which were also preserved in the Twitter data during pre-training. The sentences are converted to token sequences of size 128 before being processed by the selected text models, i.e., BERTA and ALBERTB. A special token [CLS] is appended to the beginning of each sequence, as given in section 4(A). The final hidden vector of this token can be used as a fixed-size feature representation of the input sequence; this is the recommended method according to the original work [51]. Another well-known method is averaging all the contextual word embeddings extracted from the last hidden layer (for example, [64, 65, 66]). These two options are also provided by the popular bert-as-a-service repository (https://github.com/hanxiao/bert-as-service/), hence we used both methods in our experiments. The final feature vector Et generated using either method has a size of 768.
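The sketch below contrasts the two extraction methods described above ([CLS] token versus averaging the last-layer token embeddings), assuming a fine-tuned BERT or ALBERT model and tokenizer loaded via the transformers library.

```python
import torch

def extract_embedding(model, tokenizer, text: str, method: str = "cls") -> torch.Tensor:
    """Return a 768-dimensional sentence embedding from the last hidden layer."""
    enc = tokenizer(text, padding="max_length", truncation=True,
                    max_length=128, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**enc, output_hidden_states=True).hidden_states[-1]  # (1, 128, 768)
    if method == "cls":
        return hidden[:, 0, :]                           # final hidden vector of [CLS]
    mask = enc["attention_mask"].unsqueeze(-1)           # ignore padding positions
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # mean of contextual word embeddings
```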
Table VI. MML model results on the HSDVD test set compared to the text-only baselines (macro-averaged, %).

| Model | P | R | F1 |
|---|---|---|---|
| BERTA+CLS | 93.00 | 92.89 | 92.94 |
| BERTA+avg | 90.60 | 91.70 | 91.14 |
| BERTBaseline | 90.50 | 91.73 | 91.11 |
| ALBERTB+CLS | 92.36 | 92.90 | 92.62 |
| ALBERTB+avg | 90.34 | 91.76 | 91.04 |
| ALBERTBaseline | 90.20 | 91.99 | 91.08 |
The input embedding E, obtained by concatenating Es and Et (Equation (4)), consists of 1278 features. It is fed to the MML neural network depicted in Figure 2. The dataset is split into training (80%), validation (10%), and testing (10%) sets. The validation set is used for tuning hyperparameters such as the learning rate (1e-4) and decay rate (0.99) of the Adam optimizer. Furthermore, early stopping is applied to halt model training if the validation loss does not decrease for more than 10 epochs. As given in section 4(C), the binary cross entropy loss is calculated using Equation (6).
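A sketch of this training configuration follows, reusing the `mml` model from the earlier sketch; `E` and `y` are placeholders for the fused embeddings and labels, the exponential schedule approximates the reported decay rate, and the decay steps, batch size, and epoch cap are our own assumptions.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# 80/10/10 split of the HSDVD embeddings E and labels y (placeholders).
X_train, X_tmp, y_train, y_tmp = train_test_split(E, y, test_size=0.2, random_state=42)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=42)

lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4, decay_steps=100, decay_rate=0.99)
mml.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr_schedule),
            loss="binary_crossentropy")

# Stop training when the validation loss has not improved for 10 epochs.
early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=10,
                                              restore_best_weights=True)
mml.fit(X_train, y_train, validation_data=(X_val, y_val),
        epochs=200, batch_size=32, callbacks=[early_stop])
```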
For evaluation, the same metrics as for the baseline models are used, i.e., the macro averages of precision, recall, and F1-score over both classes. Table VI lists the evaluation scores for both BERTA and ALBERTB with the different text embedding extraction techniques. Baseline results have also been included for evaluating the MML model's performance. BERTA+CLS outperforms all the models in every metric, comparable to state-of-the-art text models. Additionally, both BERTA+CLS and ALBERTB+CLS perform significantly better than their respective baselines, indicating the significance of speech features when processing multimedia input. It should also be noted that embeddings extracted by averaging all token representations did not result in much of a performance increase for BERTA+avg, while the performance of ALBERTB+avg slightly decreased compared to its baseline. Therefore, the mean pooling technique might not produce a good representation of the input in such transfer learning tasks.
To further evaluate the functionality of MML, we looked at some examples listed in Table V that were misclassified by the BERTBaseline model. Notably, most misclassifications were false positives, i.e., content classified as hate speech when it was not. This is in line with the larger increase in precision of BERTA+CLS compared to the increase in recall. For better intuition, consider samples 1 and 2, which were spoken sarcastically. At a glance, the keywords and wording might indicate hate speech; however, the mocking tone of the speaker clearly distinguishes the utterance as sarcasm. Sample 3 was part of an informational segment on Twitter. It also contains several sexist words such as scorned women, nutty, and sl*tty, but was correctly classified as not hate speech by the MML model, which can again be attributed to the tone of the speaker. Rectifying false positives for such content, which aims at spreading awareness about racism and sexism, is increasingly important to avoid stifling freedom of speech in the process.
VI Conclusion
In this paper, we proposed a multimodal learning framework that considers both the tone and the words of the speaker in a video or audio clip to determine hate speech. The model benefits from the complementary relationship that exists between these modalities in the real world. Our experiments demonstrated that adding speech features is more beneficial than relying on text features alone when detecting hate speech in the multimedia domain. This initial evidence suggests that including other modalities and adding knowledge from different perspectives can further enhance the model's performance. For instance, we expect hate speech detection to also benefit from visual features such as detecting religious attire/objects, violence, or facial expressions. Finally, this work also calls attention to the need for a different kind of system for hate speech detection in multimedia data, which accounts for a large portion of the internet today, and encourages new research avenues for this task.
References
- [1] J. H. Park and P. Fung, “One-step and Two-step Classification for Abusive Language Detection on Twitter,” in ALW@ACL, 2017.
- [2] C. Wang, "Interpreting Neural Network Hate Speech Classifiers," in Proceedings of the 2nd Workshop on Abusive Language Online (ALW2). Brussels, Belgium: Association for Computational Linguistics, October 2018, pp. 86–92. [Online]. Available: 10.18653/v1/W18-5111
- [3] J. Pavlopoulos, P. Malakasiotis, and I. Androutsopoulos, “Deep Learning for User Comment Moderation,” in Proceedings of the First Workshop on Abusive Language Online. Vancouver, BC, Canada: Association for Computational Linguistics, August 2017, pp. 25–35. [Online]. Available: 10.18653/v1/W17-3004
- [4] S. Sood, J. Antin, and E. Churchill, “Using Crowdsourcing to Improve Profanity Detection,” 2012.
- [5] K. Dinakar, R. Reichart, and H. Lieberman, “Modeling the Detection of Textual Cyberbullying,” in The Social Mobile Web, 2011.
- [6] P. Mishra, M. D. Tredici, H. Yannakoudakis, and E. Shutova, “Abusive Language Detection with Graph Convolutional Networks,” in NAACL, 2019.
- [7] G. Aglionby, C. Davis, P. Mishra, A. Caines, H. Yannakoudakis, M. Rei, E. Shutova, and P. Buttery, “CAMsterdam at SemEval-2019 Task 6: Neural and graph-based feature extraction for the identification of offensive tweets,” in Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota, USA: Association for Computational Linguistics, June 2019, pp. 556–563. [Online]. Available: 10.18653/v1/S19-2100
- [8] G. T. W. Patrick, "The psychology of profanity," Psychological Review, vol. 8, no. 2, p. 113, 1901.
- [9] E. A. Mabry, “Dimensions of Profanity,” Psychological Reports, vol. 35, no. 1, pp. 387–391, 1974. [Online]. Available: 10.2466/pr0.1974.35.1.387
- [10] F. Zhuang, Z. Qi, K. Duan, D. Xi, Y. Zhu, H. Zhu, H. Xiong, and Q. He, “A Comprehensive Survey on Transfer Learning,” Proceedings of the IEEE, vol. 109, no. 1, pp. 43–76, 2021. [Online]. Available: 10.1109/JPROC.2020.3004555
- [11] L. Devillers, L. Vidrascu, and L. Lamel, “Challenges in real-life emotion annotation and machine learning based detection,” Neural networks : the official journal of the International Neural Network Society, vol. 18, pp. 407–22, 06 2005. [Online]. Available: 10.1016/j.neunet.2005.03.007
- [12] M. A. Nicolaou, H. Gunes, and M. Pantic, “Continuous Prediction of Spontaneous Affect from Multiple Cues and Modalities in Valence-Arousal Space,” IEEE Transactions on Affective Computing, vol. 2, no. 2, pp. 92–105, 2011. [Online]. Available: 10.1109/T-AFFC.2011.9
- [13] R. Zhu, “Enhance Multimodal Transformer With External Label And In-Domain Pretrain: Hateful Meme Challenge Winning Solution,” 2020.
- [14] N. Muennighoff, “Vilio: State-of-the-art Visio-Linguistic Models applied to Hateful Memes,” 2020.
- [15] R. Velioglu and J. Rose, “Detecting Hate Speech in Memes Using Multimodal Deep Learning Approaches: Prize-winning solution to Hateful Memes Challenge,” 2020.
- [16] C. S. Wu and U. Bhandary, “Detection of Hate Speech in Videos Using Machine Learning,” in 2020 International Conference on Computational Science and Computational Intelligence (CSCI), 2020, pp. 585–590. [Online]. Available: 10.1109/CSCI51800.2020.00104
- [17] C. Alcântara, V. Moreira, and D. Feijo, “Offensive Video Detection: Dataset and Baseline Results,” in Proceedings of the 12th Language Resources and Evaluation Conference. Marseille, France: European Language Resources Association, May 2020, pp. 4309–4319.
- [18] P. Mishra, M. D. Tredici, H. Yannakoudakis, and E. Shutova, “Author Profiling for Abuse Detection,” in Proceedings of the 27th International Conference on Computational Linguistics. Santa Fe, New Mexico, USA: Association for Computational Linguistics, August 2018, pp. 1088–1098.
- [19] P. Badjatiya, S. Gupta, M. Gupta, and V. Varma, “Deep Learning for Hate Speech Detection in Tweets,” 06 2017. [Online]. Available: 10.1145/3041021.3054223
- [20] B. Gambäck and U. K. Sikdar, “Using Convolutional Neural Networks to Classify Hate-Speech,” in Proceedings of the First Workshop on Abusive Language Online. Vancouver, BC, Canada: Association for Computational Linguistics, August 2017, pp. 85–90. [Online]. Available: 10.18653/v1/W17-3013
- [21] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, “SemEval-2019 Task 6: Identifying and Categorizing Offensive Language in Social Media (OffensEval),” in Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota, USA: Association for Computational Linguistics, June 2019, pp. 75–86. [Online]. Available: 10.18653/v1/S19-2010
- [22] P. Liu, W. Li, and L. Zou, “NULI at SemEval-2019 Task 6: Transfer Learning for Offensive Language Detection using Bidirectional Transformers,” in Proceedings of the 13th International Workshop on Semantic Evaluation. Minneapolis, Minnesota, USA: Association for Computational Linguistics, June 2019, pp. 87–91. [Online]. Available: 10.18653/v1/S19-2011
- [23] M. Zampieri, P. Nakov, S. Rosenthal, P. Atanasova, G. Karadzhov, H. Mubarak, L. Derczynski, Z. Pitenis, and Ç. Çöltekin, “SemEval-2020 Task 12: Multilingual Offensive Language Identification in Social Media (OffensEval 2020),” in Proceedings of the Fourteenth Workshop on Semantic Evaluation. Barcelona (online): International Committee for Computational Linguistics, December 2020, pp. 1425–1447. [Online]. Available: 10.18653/v1/2020.semeval-1.188
- [24] G. Wiedemann, S. Yimam, and C. Biemann, “UHH-LT at SemEval-2020 Task 12: Fine-tuning of pre-trained transformer networks for offensive language detection,” Proceedings of the International Workshop on Semantic Evaluation (SemEval), 2020.
- [25] S. Ramakrishnan and I. M. E. Emary, “Speech emotion recognition approaches in human computer interaction,” Telecommun Syst, vol. 52, pp. 1467–1478, 2013. [Online]. Available: 10.1007/s11235-011-9624-z
- [26] S. Cunningham, H. Ridley, and J. Weinel, “Supervised machine learning for audio emotion recognition,” Pers Ubiquit Comput, vol. 25, pp. 637–650, 2021. [Online]. Available: 10.1007/s00779-020-01389-0
- [27] P. Chandrasekar, S. Chapaneri, and D. Jayaswal, “Automatic Speech Emotion Recognition: A survey,” in 2014 International Conference on Circuits, Systems, Communication and Information Technology Applications (CSCITA), 2014, pp. 341–346. [Online]. Available: 10.1109/CSCITA.2014.6839284
- [28] K. Sridhar and C. Busso, “Modeling Uncertainty in Predicting Emotional Attributes from Spontaneous Speech,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 8384–8388. [Online]. Available: 10.1109/ICASSP40776.2020.9054237
- [29] S. N. Zisad, M. S. Hossain, and K. Andersson, “Speech emotion recognition in neurological disorders using Convolutional Neural Network,” Proceedings of the 13th International Conference on Brain Informatics (BI2020), pp. 287–296, 2020.
- [30] Y. Zhang, J. Du, Z. Wang, J. Zhang, and Y. Tu, "Attention Based Fully Convolutional Network for Speech Emotion Recognition," in Proceedings of APSIPA ASC, November 2018, pp. 1771–1775. [Online]. Available: 10.23919/APSIPA.2018.8659587
- [31] I. Wieser, P. Barros, and S. Heinrich, “Understanding auditory representations of emotional expressions with neural networks,” Neural Comput & Applic, vol. 32, pp. 1007–1022, 2020. [Online]. Available: 10.1007/s00521-018-3869-3
- [32] M. Bojanić, V. Delić, and A. Karpov, "Call Redistribution for a Call Center Based on Speech Emotion Recognition," Applied Sciences, vol. 10, no. 13, 2020. [Online]. Available: 10.3390/app10134653
- [33] S. Parthasarathy and C. Busso, “Jointly Predicting Arousal, Valence and Dominance with Multi-Task Learning,” in Proc. Interspeech 2017, 2017, pp. 1103–1107. [Online]. Available: 10.21437/Interspeech.2017-1494
- [34] S. J. Pan and Q. Yang, “A Survey on Transfer Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 10, pp. 1345–1359, 2010. [Online]. Available: 10.1109/TKDE.2009.191
- [35] K. Madukwe, X. Gao, and B. Xue, “In Data We Trust: A Critical Analysis of Hate Speech Detection Datasets,” in Proceedings of the Fourth Workshop on Online Abuse and Harms. Online: Association for Computational Linguistics, November 2020, pp. 150–161. [Online]. Available: 10.18653/v1/2020.alw-1.18
- [36] F. Poletto, V. Basile, and M. Sanguinetti, “Resources and benchmark corpora for hate speech detection: a systematic review,” Lang Resources & Evaluation, vol. 55, pp. 477–523, 2021. [Online]. Available: 10.1007/s10579-020-09502-8
- [37] M. Zampieri, S. Malmasi, P. Nakov, S. Rosenthal, N. Farra, and R. Kumar, "Predicting the Type and Target of Offensive Posts in Social Media," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 1415–1420. [Online]. Available: 10.18653/v1/N19-1144
- [38] Z. Waseem and D. Hovy, “Hateful Symbols or Hateful People? Predictive Features for Hate Speech Detection on Twitter,” in Proceedings of the NAACL Student Research Workshop. San Diego, California: Association for Computational Linguistics, June 2016, pp. 88–93. [Online]. Available: 10.18653/v1/N16-2013
- [39] P. Mishra, H. Yannakoudakis, and E. Shutova, “Tackling online abuse: A survey of automated abuse detection methods,” 2019.
- [40] T. Davidson, D. Warmsley, M. Macy, and I. Weber, “Automated hate speech detection and the problem of offensive language,” Proceedings of the eleventh international conference on web and social media, AAAI, pp. 512–515, 2017.
- [41] C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Journal of Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
- [42] P. Fortuna and S. Nunes, “A Survey on Automatic Detection of Hate Speech in Text,” ACM Comput. Surv., vol. 51, no. 4, jul 2018. [Online]. Available: 10.1145/3232676
- [43] T. Baltrusaitis, C. Ahuja, and L.-P. Morency, “Multimodal Machine Learning: A Survey and Taxonomy,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 2, pp. 423–443, feb 2019. [Online]. Available: 10.1109/TPAMI.2018.2798607
- [44] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, “Distributed Representations of Words and Phrases and their Compositionality,” in Advances in Neural Information Processing Systems, C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, Eds., vol. 26. Curran Associates, Inc., 2013.
- [45] J. Ngiam, A. Khosla, M. Kim, J. Nam, H. Lee, and A. Y. Ng, “Multimodal Deep Learning,” in Proceedings of the 28th International Conference on International Conference on Machine Learning. Madison, WI, USA: Omnipress, 2011, pp. 689–696.
- [46] C. N. Anagnostopoulos, T. Iliou, and I. Giannoukos, "Features and classifiers for emotion recognition from speech: a survey from 2000 to 2011," Artif Intell Rev, vol. 43, pp. 155–177, 2015. [Online]. Available: 10.1007/s10462-012-9368-5
- [47] Y. Bengio, A. Courville, and P. Vincent, “Representation Learning: A Review and New Perspectives,” IEEE transactions on pattern analysis and machine intelligence, vol. 35, pp. 1798–1828, 08 2013. [Online]. Available: 10.1109/TPAMI.2013.50
- [48] Y. Mroueh, E. Marcheret, and V. Goel, “Deep multimodal learning for Audio-Visual Speech Recognition,” in IEEE, 04 2015, pp. 2130–2134. [Online]. Available: 10.1109/ICASSP.2015.7178347
- [49] Z. Wu, Y.-G. Jiang, J. Wang, J. Pu, and X. Xue, “Exploring Inter-Feature and Inter-Class Relationships with Deep Neural Networks for Video Classification,” in Proceedings of the 22nd ACM International Conference on Multimedia. New York, NY, USA: Association for Computing Machinery, 2014, pp. 167–176. [Online]. Available: 10.1145/2647868.2654931
- [50] K. Weiss, T. M. Khoshgoftaar, and D. Wang, “A survey of transfer learning,” J Big Data, vol. 3, pp. 9–9, 2016. [Online]. Available: 10.1186/s40537-016-0043-6
- [51] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1. Long and Short Papers, 2019, pp. 4171–4186.
- [52] Z. Lan, M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut, “ALBERT: A lite BERT for self-supervised learning of language representations,” in International Conference on Learning Representations, and others, Ed., 2020.
- [53] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., "Google's neural machine translation system: Bridging the gap between human and machine translation," 2016.
- [54] E. Mower, A. Metallinou, C. C. Lee, A. Kazemzadeh, C. Busso, S. Lee, and S. Narayanan, “Interpreting ambiguous emotional expressions,” in International Conference on Affective Computing and Intelligent Interaction, 2009, pp. 1–8.
- [55] Y. Zhang and Q. Yang, “A Survey on Multi-Task Learning,” IEEE Transactions on Knowledge and Data Engineering, vol. PP, 07 2017. [Online]. Available: 10.1109/TKDE.2021.3070203
- [56] T. Giannakopoulos, "pyAudioAnalysis: An Open-Source Python Library for Audio Signal Analysis," PLoS ONE, vol. 10, no. 12, p. e0144610, 2015. [Online]. Available: 10.1371/journal.pone.0144610
- [57] R. Caruana, “Multitask Learning,” Machine Learning, vol. 28, pp. 41–75, 1997. [Online]. Available: 10.1023/A:1007379606734
- [58] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Improving neural networks by preventing co-adaptation of feature detectors,” CoRR, vol. abs/1207.0580, 2012.
- [59] X. Glorot, A. Bordes, and Y. Bengio, “Deep Sparse Rectifier Neural Networks,” in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, G. Gordon, D. Dunson, and M. Dudík, Eds., vol. 15. Fort Lauderdale, FL, USA: PMLR, 11-13 Apr 2011, pp. 315–323.
- [60] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classi- fication with deep convolutional neural networks,” Advances in neural information processing systems, pp. 1097–1105, 2012.
- [61] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on imagenet classification,” Proceedings of the 2015 IEEE international conference on computer vision, pp. 1026–1034, 2015.
- [62] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” Aistats, vol. 9, pp. 249–256, 2010.
- [63] S. Ruder, “An overview of gradient descent optimization algorithms,” 09 2016.
- [64] C. May, A. Wang, S. Bordia, S. R. Bowman, and R. Rudinger, "On Measuring Social Biases in Sentence Encoders," in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Minneapolis, Minnesota: Association for Computational Linguistics, June 2019, pp. 622–628. [Online]. Available: 10.18653/v1/N19-1063
- [65] T. Zhang, V. Kishore, F. Wu, K. Q. Weinberger, and Y. Artzi, “BERTScore: Evaluating Text Generation with BERT,” ArXiv, vol. abs/1904.09675, 2020.
- [66] Y. Qiao, C. Xiong, Z. Liu, and Z. Liu, “Understanding the Behaviors of BERT in Ranking,” 2019.