Independent Ethical Assessment of Text Classification Models: A Hate Speech Detection Case Study
Abstract.
An independent ethical assessment of an artificial intelligence system is an impartial examination of the system’s development, deployment, and use in alignment with ethical values. System-level qualitative frameworks that describe high-level requirements and component-level quantitative metrics that measure individual ethical dimensions have been developed over the past few years. However, there exists a gap between the two, which hinders the execution of independent ethical assessments in practice. This study bridges this gap and designs a holistic independent ethical assessment process for a text classification model with a special focus on the task of hate speech detection. The assessment is further augmented with protected attributes mining and counterfactual-based analysis to enhance bias assessment. It covers assessments of technical performance, data bias, embedding bias, classification bias, and interpretability. The proposed process is demonstrated through an assessment of a deep hate speech detection model.
1. Introduction
Internal and external independent assessments of artificial intelligence (AI) systems are conducted to examine quality, responsibility, and accountability in their development, deployment, and use (on Artificial Intelligence, 2020; Madiega, 2019). Major ethical failures involving language models have recently drawn attention to the need for accountability in this domain: Microsoft’s AI chatbot Tay tweeted racist comments after being trained on Twitter data, and a completely fake GPT-3-generated blog hit no. 1 on Hacker News, hinting at a mass misinformation threat. Independence in this context is defined as the freedom from conditions that may jeopardize the ability of the assessment activity to fulfill assessment responsibilities in an impartial manner (of Internal Auditors, 2021). A three-level classification of independent assessments is provided in Appendix A.
An independent ethical assessment requires both system-level qualitative guidelines that describe the process and requirements as well as component-level quantitative tools that enable the calculation of assessment metrics related to individual ethical dimensions. The Assessment List for Trustworthy Artificial Intelligence (on Artificial Intelligence, 2020) and the Artificial Intelligence Auditing Framework (of Internal Auditors, 2017) are examples of such qualitative guidelines, amongst others (Audit and (ISACA), 2018; Raji et al., 2020; Office, 2020). Quantitative models that focus on different ethical components have also been developed, such as the techniques for bias detection in embedding models by (Papakyriakopoulos et al., 2020; Garg et al., 2018; Rozado, 2020) and the metrics for evaluating bias in text classification datasets and models by (Dixon et al., 2018; Blodgett et al., 2020; Hutchinson et al., 2020; Park et al., 2018; Sap et al., 2019). Quantitative measures require target metrics, which must be defined by the application context rather than by the framework itself.
There exists a gap between system-level qualitative guidelines and component-level quantitative models that hinders execution of independent ethical assessments in practice (Brown et al., 2021). This study bridges this gap for text classification systems in the context of hate speech detection models. While hate speech detection aims to serve the positive cause of reducing hateful content, it also poses the risk of silencing harmless content, or even worse, doing so with a false positive bias and without an ability to provide an explanation of its reasoning.
We summarize our main contributions as follows:
• We develop a process for independent ethical assessments of text classification systems by bridging high-level qualitative guidelines and low-level technical models.
• We propose methods to mine protected attributes from unstructured data.
• We demonstrate approaches and metrics for quantifying bias in data, word embeddings, and classification models.
The paper is organized as follows. Section 2 presents the proposed assessment process. Section 3 describes the text classification system under consideration and its technical assessment. Section 4 presents the data bias assessment using two different approaches. Section 5 describes the embedding bias metrics and assessment. Section 6 presents the classification bias assessment using the original, swapped, and synthetic datasets. Section 7 describes the local and global interpretability assessment. Section 8 concludes the study.
2. Process Description
This process performs quantitative assessments to address the qualitative requirements set by the Assessment List for Trustworthy Artificial Intelligence (ALTAI) by the European Commission (on Artificial Intelligence, 2020). The classification accuracy assessment addresses Requirement #2 Technical Robustness and Safety. Bias assessments are conducted for the data, word embeddings, and classification model, which correspond to Requirement #5 Diversity, Non-discrimination and Fairness. ALTAI defines bias as “systematic and repeatable errors in a computer system that create unfair outcomes, such as favoring one arbitrary group of users over others.” Data and embedding bias are possible causes of unfairness and may influence classification bias, which is an actual measure of fairness. Protected attributes are defined as those qualities, traits, or characteristics, such as gender, that cannot be discriminated against. Model interpretability assessments are performed to address Requirement #4 Transparency. Since the AI system under consideration is not computationally intensive and was trained on a laptop, its environmental impact is deemed to be negligible. Nevertheless, Requirement #6 Societal and Environmental Well-being is briefly addressed in Appendix I. This process does not cover all of the ALTAI requirements; future work is necessary to address the remainder.
3. The AI System and Technical Assessment
The system under assessment is a pre-trained natural language processing (NLP) system that identifies hateful comments on a social media platform. The system has been trained on a dataset of Twitter posts in English and classifies comments/posts as hateful or non-hateful. It is a proof of concept model that was developed by a separate team within the same organization. Information about the dataset and model are provided in Appendix B.
Requirement #2 in ALTAI states that reliable technical performance of the system is one of the critical requirements for trustworthy AI. We assessed the performance of the model on the test dataset using various metrics. The results are presented in Table 1 and the accuracy of the model is 88%.
Table 1. Technical performance of the classifier on the test dataset.

| | Precision | Recall | F1-Score | Support |
|---|---|---|---|---|
| Not-hateful | 0.78 | 0.92 | 0.84 | 1,031 |
| Hateful | 0.95 | 0.87 | 0.91 | 2,502 |
| Macro Avg. | 0.87 | 0.89 | 0.87 | 3,803 |
| Weighted Avg. | 0.89 | 0.88 | 0.88 | 3,803 |
4. Data Bias Assessment
Requirement #5 in ALTAI outlines the need for an assessment of the input data for bias. This data bias assessment step is motivated by the possible impact that data bias may have on the fairness of the classification model. In the case of structured datasets, correlations between protected attribute features and decision variables can be analyzed to assess data bias. For unstructured data like text, however, this approach may not be possible because protected attributes are not represented as separate, standalone features. This study uses two approaches to tackle this problem and assess bias in the training data.
In the first approach, which is inspired by (Dixon et al., 2018), a curated set of identity terms is first compiled, and the frequencies of these terms across hateful comments and across all comments in the training dataset are then computed and compared. The results for all the identity terms are provided in Appendix C. They indicate that some identity terms, such as “gay” and “white”, are disproportionately used in hateful comments, which could be indicative of false positive bias.
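The sketch below illustrates this term-frequency comparison under stated assumptions: the training data is a pandas DataFrame with hypothetical “text” and “label” columns (1 = hateful), and the term list is an illustrative subset of the curated set, not the study’s actual artifacts.

```python
import pandas as pd

# Illustrative subset of the curated identity-term list (the full list is in Appendix C)
identity_terms = ["gay", "white", "queer", "muslim", "black", "feminist"]

def identity_term_rates(df: pd.DataFrame, terms) -> pd.DataFrame:
    """Percentage of hateful, not-hateful, and all comments containing each term."""
    rows = []
    for term in terms:
        contains = df["text"].str.contains(rf"\b{term}\b", case=False, regex=True)
        rows.append({
            "term": term,
            "hateful_%": 100 * contains[df["label"] == 1].mean(),
            "not_hateful_%": 100 * contains[df["label"] == 0].mean(),
            "overall_%": 100 * contains.mean(),
        })
    return pd.DataFrame(rows)

# Toy usage with a tiny illustrative dataset
toy = pd.DataFrame({
    "text": ["white people ruin everything", "I met a gay couple today", "nice weather out there"],
    "label": [1, 0, 0],
})
print(identity_term_rates(toy, identity_terms))
```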
The second approach is based on the frequency of protected attribute references across hateful comments and overall. Protected attribute mining is first conducted to identify references made to each subgroup of each attribute in the comments (as elaborated upon in Section 6.1 below). The frequencies of these references across the hateful comments and across all the comments are then calculated and compared. The results are shown in Table 2 for the protected attributes of gender and religion. They show that some bias is present in the data, such as for the “Islam” and “female” subgroups: in our dataset, references to Islam have a larger presence in hateful comments than references to Christianity, and the same holds true for references to the female subgroup compared to the male subgroup.
Table 2. Frequency of protected attribute references across hateful, not-hateful, and all comments.

| Protected Attribute | Subgroup | Hateful | Not-hateful | Overall |
|---|---|---|---|---|
| Gender | Female | 12.28% | 11.53% | 12.02% |
| Gender | Male | 11.51% | 17.37% | 13.52% |
| Religion | Islam | 0.10% | 0.15% | 0.12% |
| Religion | Christianity | 0.07% | 0.24% | 0.13% |
5. Embedding Bias Assessment
The embedding model should be assessed for bias as it influences the classification model’s decisions, which relates to Requirement #5 in ALTAI. This embedding bias assessment is motivated by the potential influence that embedding bias has on the fairness of the classification model. To assess embedding bias, we analyze the similarity between a diverse set of 501 neutral words (e.g., “admirable” and “miserable”) (Garg et al., 2018) and word groups associated with three protected attributes: gender, religion, and ethnicity (Appendix D). Neutral words are those expected to be similarly associated with all protected attribute words (Garg et al., 2018). The similarity metric used is the cosine similarity between the individual terms’ embeddings, aggregated using the average mean absolute error (AMAE) and average root mean squared error (ARMSE).
Given a protected attribute (e.g., gender) and a subgroup within it (e.g., female), the similarity between an individual neutral word and that subgroup is calculated by averaging the cosine similarities between the neutral word and each subgroup word (for example, “nice” and “she”). This is done for all the neutral words and subgroups. The AMAE and ARMSE of the average cosine similarities across the subgroups are then calculated (the equations are provided in Appendix E). This is performed for all the protected attributes and the results are illustrated in Figure 1. The magnitudes of the AMAE and ARMSE show how strong the bias is for a particular attribute. In principle, the two measures are similar to the statistical notion of variance of the cosine similarities across the subgroups of an attribute: the higher the measure, the greater the disparity in association and the more biased the embeddings are for that attribute.
[Figure 1. AMAE and ARMSE embedding bias scores for the gender, religion, and ethnicity attributes.]
The AMAE and ARMSE scores indicate that the embedding model is more biased with respect to religion terms and less so with respect to gender terms. Appendix E provides a comparison of the bias magnitudes of this system’s embedding model against other mainstream embedding models.
6. Classification Bias Assessment
While classification accuracy at an aggregate level is related to ALTAI Requirement #2 Technical Robustness and Safety, an accuracy report broken down by various protected attributes would address Requirement #5 Diversity, Non-discrimination and Fairness. In contrast to data and embedding bias, classification bias may directly impact the system’s stakeholders.
To assess this bias, subgroups based on gender, ethnicity, religion, etc. are first identified in different comments. Bias assessment is then performed to understand how the model’s prediction accuracy differs across these protected attributes. This assessment can be enhanced by measuring changes in the model’s accuracy as the subgroups in the comments are swapped (for example, does the model’s true positive prediction for “men are universally terrible” change to a false negative if “men” is swapped with “women”?). Synthetic sentences containing protected attribute references are also generated and analyzed for a better understanding of how the model learns biases.
6.1. Protected Attributes Mining
To extract protected attributes from text, two approaches are applied in this study: 1) a look-up based approach and 2) a named-entity recognition (NER) based approach. These approaches have been validated using datasets with human-annotated comments (Dixon et al., 2018). The look-up approach counts the words extracted from the text examples and looks them up in a list of common identity terms such as pronouns and professions. The NER-based approach uses spaCy’s pre-trained NER model to extract entities (i.e., protected attribute references). A named entity is a “real-world object” with a type associated with it, which might be a country, a product, a religion, etc. In this study, entities tagged with the “NORP” type are extracted; “NORP” refers to “nationalities or religious or political groups”. spaCy’s NER model takes into account the context and part of speech to determine the tag for a given word. Gender and religion are the protected attributes addressed in this subsection. Several examples of protected attribute extractions from comments are included in Appendix G.
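A minimal sketch of both mining approaches, under stated assumptions: the look-up list below is a small illustrative subset of the Appendix D word groups, and the spaCy pipeline name is an assumption rather than the one necessarily used in the study.

```python
import spacy

# Assumes a pre-trained English pipeline such as "en_core_web_sm" is installed
nlp = spacy.load("en_core_web_sm")

# Illustrative subset of the gender look-up lists in Appendix D
GENDER_LOOKUP = {
    "male": {"he", "him", "his", "man", "men", "boy", "husband"},
    "female": {"she", "her", "hers", "woman", "women", "girl", "wife"},
}

def mine_protected_attributes(comment: str) -> dict:
    doc = nlp(comment)
    tokens = {token.text.lower() for token in doc}
    # Look-up based approach: match tokens against curated identity-term lists
    genders = [group for group, words in GENDER_LOOKUP.items() if tokens & words]
    # NER based approach: keep entities tagged NORP (nationalities, religious or political groups)
    norp = [ent.text for ent in doc.ents if ent.label_ == "NORP"]
    return {"gender": genders, "norp": norp}

print(mine_protected_attributes("She said the Muslims in her neighborhood were welcoming."))
# Expected output along the lines of: {'gender': ['female'], 'norp': ['Muslims']}
```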
6.2. Bias Assessment based on Protected Attributes Mining
The PwC Responsible AI (RAI) Toolkit performs a bias assessment using protected attribute references, ground-truth labels, and prediction probabilities (Rao et al., 2019). It checks whether the model discriminates against various groups or individuals using a variety of bias metrics. Table 3 and Figure 2 show the results for gender. There were 501 and 442 references to male and female subgroups, respectively. Appendix F provides a description of the seven bias metrics used.
Table 3. Average predicted probability of the hateful class by gender subgroup.

| Actual Comment Type | Subgroup | Avg. Predicted Probability |
|---|---|---|
| Not-hateful | Female | 14.0% hateful |
| Not-hateful | Male | 15.4% hateful |
| Hateful | Female | 85.6% hateful |
| Hateful | Male | 80.4% hateful |
[Figure 2. Bias metric results for the gender attribute.]
6.3. Assessment Based on Swapped Protected Attributes
Versions of the comments in which identity terms are swapped help address the limitations of relying solely on the ground-truth data, such as the disproportionate representation of one subgroup over another. Examples are provided in Appendix G.
A bias assessment can be performed using the actual label and the predicted probabilities before and after identity swapping. We define “favor” as the model’s confidence in its prediction one way or the other. For not-hateful comments, a decrease in the prediction probability after swapping identities (e.g., male to female) means the model favors the new identity; for hateful comments, an increase in the prediction probability after swapping means the model favors the new identity. This process of swapping is repeated for all the comments with protected attribute references. The assessment indicated that the model favored the female subgroup for 55.9% of the comments and the male subgroup for 43.4% of the comments. The prediction probability, rounded to four decimal places, did not change for the remaining 0.7% of the comments.
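A minimal sketch of this favor computation, assuming the original and swapped prediction probabilities and the ground-truth labels are already available as arrays (the function name and toy values are illustrative):

```python
import numpy as np

def favor_breakdown(p_orig, p_swap, labels, decimals=4):
    """Share of comments where the swapped (new) identity, the original identity,
    or neither is favored, following the definition of "favor" above."""
    p_orig = np.round(np.asarray(p_orig, dtype=float), decimals)
    p_swap = np.round(np.asarray(p_swap, dtype=float), decimals)
    delta = p_swap - p_orig                      # change in predicted hateful probability
    hateful = np.asarray(labels) == 1
    # Hateful comments: an increase favors the new identity; not-hateful: a decrease favors it
    favors_new = np.where(hateful, delta > 0, delta < 0)
    favors_orig = np.where(hateful, delta < 0, delta > 0)
    unchanged = delta == 0
    return favors_new.mean(), favors_orig.mean(), unchanged.mean()

# Toy usage: probabilities before/after swapping identity references in three comments
print(favor_breakdown([0.20, 0.90, 0.50], [0.15, 0.95, 0.50], [0, 1, 1]))
# (0.666..., 0.0, 0.333...): the new identity is favored in two comments, no change in one
```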
6.4. Bias Assessment Based on Synthetic Counterfactual Dataset
Bias assessment can also be performed using a synthetic counterfactual dataset. This method offers greater flexibility but is labor-intensive and the resulting dataset may not be representative of the actual texts the model encounters in practice. This idea is related to counterfactual fairness, which as stated by (Kusner et al., 2017), “captures the intuition that a decision is fair towards an individual if it is the same in (a) the actual world and (b) a counterfactual world where the individual belonged to a different demographic group”. Examples of such counterfactuals for religion are provided in Appendix G.
Once the synthetic counterfactual dataset is generated, a bias assessment similar to the swapped dataset assessment above can be conducted. We used several templates (some inspired by (Dixon et al., 2018)) for gender and religion. The average prediction probabilities (Appendix G) indicate that there appears to be a bias towards treating references to Islam as more hateful than references to Christianity.
We use the following threshold-insensitive counterfactual bias metric for each reference subgroup:
(1)   $B = \dfrac{1}{N}\sum_{i=1}^{N} s_i \left( f(x_i) - \dfrac{1}{|C_i|} \sum_{x' \in C_i} f(x') \right)$

where $X = (x_1, \ldots, x_N)$ is a series of reference group (e.g., female) examples (e.g., (“she is kind”, “women are universally terrible”, …)), $C = (C_1, \ldots, C_N)$ is a series of sets of counterfactual examples of the form ((“he is kind”, …), (“men are universally terrible”, …)), and $f$ is the model’s predicted probability of the hateful class. $y = (y_1, \ldots, y_N)$ is a series of labels for the reference group examples, where 0 corresponds to not-hateful comments and 1 corresponds to hateful comments. $s_i$ is $-1$ when $y_i$ is the preferred label and $1$ otherwise; it is a term that corrects for the direction of the bias. For the model under consideration, the preferred label is not-hateful and therefore $s_i$ is $-1$ for not-hateful comments. A positive value of $B$ means the model favors the reference subgroup.
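A minimal sketch of this metric, assuming the reference-group probabilities, the per-example counterfactual probabilities, and the labels are already available (the function and variable names are illustrative):

```python
import numpy as np

def counterfactual_bias(p_ref, p_cf_sets, labels):
    """Threshold-insensitive counterfactual bias B for a reference subgroup (Eq. 1).
    p_ref:     predicted hateful probability for each reference-group example
    p_cf_sets: for each example, the probabilities of its counterfactual versions
    labels:    1 = hateful, 0 = not-hateful (the preferred label)"""
    p_ref = np.asarray(p_ref, dtype=float)
    signs = np.where(np.asarray(labels) == 1, 1.0, -1.0)   # s_i corrects the bias direction
    cf_means = np.array([np.mean(cf) for cf in p_cf_sets])
    return float(np.mean(signs * (p_ref - cf_means)))      # > 0: model favors the reference group

# Toy usage with "female" as the reference subgroup and male counterfactuals
print(counterfactual_bias(p_ref=[0.10, 0.85],          # "she is kind", "women are ... terrible"
                          p_cf_sets=[[0.12], [0.80]],  # probabilities of the male counterparts
                          labels=[0, 1]))              # 0.035 > 0: the female subgroup is favored
```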
Setting “Islam” as the reference subgroup returns a bias value of -0.0067, which denotes the model slightly favors the “Christianity” subgroup. For gender, setting “Male” as the reference subgroup would return a bias value of 0.0737. A comprehensive assessment may also include other bias metrics such as equalized odds (Hardt et al., 2016).
7. Classification Interpretability Assessment
Requirement #4 in ALTAI focuses on transparency requirements and describes explainability as one of the elements to consider alongside traceability and open communication. The question is simple: why are certain tweets being classified as hate speech? Understanding hate in speech is a function of the vocabulary and the context of the language used, and we as humans have an innate ability to recognize hate. We seek to understand how an AI model infers it. Does it consider the vocabulary? If so, how does it do so, given that we have not explicitly fed the model a list of abusive words or profane phrases? Does it understand the context? If so, how does it distinguish between cases where strong words are used as free speech or self-expression (e.g., “I am Gay”) and cases where the same words are used disparagingly in hateful comments directed at someone? We explored two methods to address these questions. Overall, the two explainability packages offer end users a perspective that ties model decisions to the vocabulary used in a particular tweet. This may help automate, at least partially, some of the content moderation tasks social media platforms are trying to perform and, at the same time, provide measurable evidence of what drives a hate prediction.
The first method focuses on local interpretability and uses LIME, a model-agnostic method that helps explain a given prediction through the features used by the model (Ribeiro et al., 2016). In the case of a text classification model, LIME assigns each token in a given sentence an importance score, which can be either positive or negative depending on the word’s effect on the prediction. For example, in the tweet shown in Figure 3, LIME shows that words like “vote” and “liberals” contribute to the model predicting not-hateful, while words like “filthy” and “sick” push the model to classify the comment as hateful. LIME can be used to examine individual hateful tweets, as in Figure 3, to understand how the model determines hate. This helps the end user analyze the classification and see what kind of vocabulary is too strong for the model to predict a comment as not-hateful; here, the use of strong language was the key driver of the hateful prediction. A minimal sketch of generating such an explanation is given after Figure 3.
[Figure 3. LIME explanation of the token-level contributions for a sample tweet.]
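The sketch below shows how such a token-level explanation could be produced with the lime package; the classifier wrapper, the word list inside it, and the example tweet are illustrative stand-ins, not the study’s actual model or data.

```python
import numpy as np
from lime.lime_text import LimeTextExplainer

def predict_proba(texts):
    """Stand-in for the real classifier: scores comments by the share of words
    drawn from a tiny illustrative 'strong language' list."""
    strong = {"filthy", "sick", "terrible"}
    p_hate = np.array([sum(w in strong for w in t.lower().split()) / max(len(t.split()), 1)
                       for t in texts])
    return np.column_stack([1 - p_hate, p_hate])

explainer = LimeTextExplainer(class_names=["not-hateful", "hateful"])
tweet = "go vote you filthy liberals you make me sick"   # illustrative text, not from the dataset
explanation = explainer.explain_instance(tweet, predict_proba, num_features=6)
print(explanation.as_list())   # (token, weight) pairs; positive weights push towards "hateful"
```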
The second method concerns global interpretability and uses SHAP values. SHAP is a model-agnostic explainability tool that generates feature importances over the whole dataset by estimating each feature’s contribution from many perturbations of the input (Lundberg and Lee, 2017). The global interpretability results are provided in Appendix H. They show that, at an overall level, strong language is what drives the model to predict hate; the keywords listed there had the largest impact on the model’s detection of hate speech.
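A hedged sketch of producing per-token SHAP attributions for a text classifier, assuming a stand-in predict_proba like the one in the LIME sketch; the tokenizer regex, output names, and example comments are assumptions, and the global ranking in Figure 6 would come from averaging absolute attributions over many comments.

```python
import numpy as np
import shap

def predict_proba(texts):
    """Stand-in scorer, as in the LIME sketch above."""
    strong = {"filthy", "sick", "terrible"}
    p_hate = np.array([sum(w in strong for w in t.lower().split()) / max(len(t.split()), 1)
                       for t in texts])
    return np.column_stack([1 - p_hate, p_hate])

comments = ["you filthy liberals make me sick",
            "thank you all for voting today"]             # illustrative comments

masker = shap.maskers.Text(r"\W+")                        # split comments into word-level tokens
explainer = shap.Explainer(predict_proba, masker, output_names=["not-hateful", "hateful"])
shap_values = explainer(comments)

# Per-token attribution towards the "hateful" output for the first comment;
# averaging absolute attributions over many comments yields a global token ranking
print(list(zip(shap_values[0].data, shap_values[0].values[:, 1])))
```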
8. Conclusions
This study proposed a process for independent ethical assessments of text classification systems and demonstrated it through a first-party assessment of a hate speech detection model. While the ALTAI recommendations are largely applicable to text classification systems, they are underspecified, leaving an implementation gap. This study addresses that gap through a series of component-wise quantitative assessments. A discussion of the other requirements, with ideas for addressing them, is provided in Appendix I. Risk mitigation is also a logical next step to any risk assessment process. This work serves as a concrete blueprint for the design and implementation of independent ethical assessment processes for text classification systems. More work is needed to apply these ideas to other domains.
Acknowledgements.
We wish to thank Todd Morrill for thorough feedback on this article.

References
- Audit and (ISACA) (2018) Information Systems Audit and Control Association (ISACA). 2018. Auditing Artificial Intelligence. https://ec.europa.eu/futurium/en/system/files/ged/auditing-artificial-intelligence.pdf
- Blodgett et al. (2020) Su Lin Blodgett, Solon Barocas, Hal Daumé III, and Hanna Wallach. 2020. Language (technology) is power: A critical survey of “bias” in NLP. arXiv preprint arXiv:2005.14050 (2020).
- Brown et al. (2021) Shea Brown, Jovana Davidovic, and Ali Hasan. 2021. The algorithm audit: Scoring the algorithms that score us. Big Data & Society 8, 1 (2021), 2053951720983865.
- Davidson et al. (2017) Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11.
- Dixon et al. (2018) Lucas Dixon, John Li, Jeffrey Sorensen, Nithum Thain, and Lucy Vasserman. 2018. Measuring and mitigating unintended bias in text classification. In Proceedings of the 2018 AAAI/ACM Conference on AI, Ethics, and Society. 67–73.
- eurostat Statistics Explained (2017) eurostat Statistics Explained. 2017. Glossary:Carbon dioxide equivalent. Retrieved May 19, 2021 from https://ec.europa.eu/eurostat/statistics-explained/index.php?title=Glossary:Carbon_dioxide_equivalent#:~:text=A%20carbon%20dioxide%20equivalent%20or,with%20the%20same%20global%20warming
- Garg et al. (2018) Nikhil Garg, Londa Schiebinger, Dan Jurafsky, and James Zou. 2018. Word embeddings quantify 100 years of gender and ethnic stereotypes. Proceedings of the National Academy of Sciences 115, 16 (2018), E3635–E3644.
- Hardt et al. (2016) Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. arXiv preprint arXiv:1610.02413 (2016).
- Hutchinson et al. (2020) Ben Hutchinson, Vinodkumar Prabhakaran, Emily Denton, Kellie Webster, Yu Zhong, and Stephen Denuyl. 2020. Social biases in NLP models as barriers for persons with disabilities. arXiv preprint arXiv:2005.00813 (2020).
- Kazim et al. (2021) Emre Kazim, Danielle Mendes Thame Denny, and Adriano Koshiyama. 2021. AI auditing and impact assessment: according to the UK information commissioner’s office. AI and Ethics (2021), 1–10.
- Kusner et al. (2017) Matt J Kusner, Joshua R Loftus, Chris Russell, and Ricardo Silva. 2017. Counterfactual fairness. arXiv preprint arXiv:1703.06856 (2017).
- Lacoste et al. (2019) Alexandre Lacoste, Alexandra Luccioni, Victor Schmidt, and Thomas Dandres. 2019. Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700 (2019).
- Lundberg and Lee (2017) Scott Lundberg and Su-In Lee. 2017. A unified approach to interpreting model predictions. arXiv preprint arXiv:1705.07874 (2017).
- Madiega (2019) T Madiega. 2019. EU guidelines on ethics in artificial intelligence: Context and implementation. European Parliamentary Research Service, PE 640 (2019).
- of Internal Auditors (2021) Institute of Internal Auditors. 2021. International Professional Practices Framework (IPPF). Institute of Internal Auditors.
- of Internal Auditors (2017) The Institute of Internal Auditors. 2017. Artificial Intelligence Auditing Framework. https://na.theiia.org/periodicals/Public%20Documents/GPI-Artificial-Intelligence-Part-II.pdf
- Office (2020) Information Commissioner’s Office. 2020. Guidance on the AI auditing framework. https://ico.org.uk/media/about-the-ico/consultations/2617219/guidance-on-the-ai-auditing-framework-draft-for-consultation.pdf
- on Artificial Intelligence (2020) High-Level Expert Group on Artificial Intelligence. 2020. Assessment List for Trustworthy Artificial Intelligence (ALTAI). https://doi.org/10.2759/791819
- Papakyriakopoulos et al. (2020) Orestis Papakyriakopoulos, Simon Hegelich, Juan Carlos Medina Serrano, and Fabienne Marco. 2020. Bias in word embeddings. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 446–457.
- Park et al. (2018) Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Reducing gender bias in abusive language detection. arXiv preprint arXiv:1808.07231 (2018).
- Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. 2014. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
- Raji et al. (2020) Inioluwa Deborah Raji, Andrew Smart, Rebecca N White, Margaret Mitchell, Timnit Gebru, Ben Hutchinson, Jamila Smith-Loud, Daniel Theron, and Parker Barnes. 2020. Closing the AI accountability gap: defining an end-to-end framework for internal algorithmic auditing. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency. 33–44.
- Rao et al. (2019) A Rao, F Palaci, and W Chow. 2019. A practical guide to Responsible Artificial Intelligence (AI). Technical Report. PwC.
- Ribeiro et al. (2016) Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. “Why should I trust you?” Explaining the predictions of any classifier. In Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining. 1135–1144.
- Rozado (2020) David Rozado. 2020. Wide range screening of algorithmic bias in word embedding models using large sentiment lexicons reveals underreported bias types. PloS one 15, 4 (2020), e0231189.
- Russell (2019) JP Russell. 2019. The ASQ auditing handbook. ASQ Quality Press.
- Sap et al. (2019) Maarten Sap, Dallas Card, Saadia Gabriel, Yejin Choi, and Noah A Smith. 2019. The risk of racial bias in hate speech detection. In Proceedings of the 57th annual meeting of the association for computational linguistics. 1668–1678.
- Wachter et al. (2021) Sandra Wachter, Brent Mittelstadt, and Chris Russell. 2021. Why fairness cannot be automated: Bridging the gap between EU non-discrimination law and AI. Computer Law & Security Review 41 (2021), 105567.
- Zampieri et al. (2019) Marcos Zampieri, Shervin Malmasi, Preslav Nakov, Sara Rosenthal, Noura Farra, and Ritesh Kumar. 2019. Predicting the type and target of offensive posts in social media. arXiv preprint arXiv:1902.09666 (2019).
- Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender Bias in Coreference Resolution: Evaluation and Debiasing Methods. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 2.
Appendix A A three-level independent assessment classification
The three-level classification that is common in the audit and assurance landscape (Russell, 2019) can be adapted.
A first-party ethical assessment of an AI system is an internal assessment conducted by a team within the same organization that owns the AI system, but one that had no role in its development and has no vested interest in its assessment results.
A second-party assessment is an external assessment conducted by a contracted party outside the organization on behalf of a customer.
A third-party assessment is an external assessment by a party independent of the organization-customer relationship and assesses the level of conformity of the AI system to certain ethical criteria and standards (Russell, 2019).
The AI system development team and the AI system assessment team are the two primary groups involved in a first-party ethical assessment. The two teams are independent of each other, although they are still within the same organization. The AI system development team is analogous to the first and second lines in the IIA’s Three Lines Model (of Internal Auditors, 2021), which are the creators, executors, operators, managers, and supervisors in charge of designing, building, and deploying the AI system as well as risk assessment and strategy. The AI system assessment team corresponds closely to the third line, which comprises the auditors and ethicists who assess the system against the ethical requirements.
Appendix B Dataset and model description
The text dataset: The dataset was created by combining two public datasets (Davidson et al., 2017; Zampieri et al., 2019) and used to train the hate speech classifier. The model was trained on 34,220 tweets, and an unseen test dataset of 3,803 tweets was also provided to us.
The word embedding model: The classifier uses the pre-trained Stanford GloVe Common Crawl word embeddings with 300 dimensions (Pennington et al., 2014) for training and predictions.
The classification model: The classifier is a convolutional neural network that takes word embeddings as input. The network architecture is a 1D CNN (32 filters with a kernel size of 17), followed by a max pooling layer (pool size = 4) and then two fully connected dense layers (a 25-unit layer followed by a 1-unit output). The assessment uses the pre-trained model as is (i.e., only to make predictions where needed), and the whole process is repeatable without knowledge of the architecture.
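A minimal Keras sketch of an architecture matching this description; the sequence length, activation functions, flattening step, and training configuration are assumptions and not the development team’s actual code.

```python
from tensorflow.keras import layers, models

SEQ_LEN, EMB_DIM = 100, 300   # assumed maximum sequence length; GloVe Common Crawl vectors are 300-d

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, EMB_DIM)),                        # pre-computed word embeddings as input
    layers.Conv1D(filters=32, kernel_size=17, activation="relu"),  # 1D CNN: 32 filters, kernel size 17
    layers.MaxPooling1D(pool_size=4),
    layers.Flatten(),
    layers.Dense(25, activation="relu"),                           # 25-unit dense layer
    layers.Dense(1, activation="sigmoid"),                         # probability that the comment is hateful
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```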
Appendix C Frequency of identity terms in data
The statistics are based on the training dataset and are provided in Table 4. It contains three columns: (1) the % of comments labelled as “Hateful” that contain an identity term, (2) the % of comments labelled as “Not-hateful” that contain an identity term, and (3) the % of comments Overall (“Hateful” and “Not-hateful”) that contain an identity term.
Table 4. Frequency of identity terms in the training dataset (% of comments containing each term).

| Term | Hateful % | Not-hateful % | Overall % |
|---|---|---|---|
| atheist | 0.0044 | 0.0085 | 0.0058 |
| queer | 0.2220 | 0.0427 | 0.1607 |
| gay | 0.3864 | 0.1282 | 0.2981 |
| transgender | 0.0044 | 0.0085 | 0.0058 |
| lesbian | 0.0266 | 0.0256 | 0.0263 |
| homosexual | 0.0133 | 0.0085 | 0.0117 |
| feminist | 0.0222 | 0.0000 | 0.0146 |
| black | 0.6706 | 0.5042 | 0.6137 |
| white | 1.3367 | 0.5640 | 1.0725 |
| heterosexual | 0.0044 | 0.0000 | 0.0029 |
| islam | 0.0044 | 0.0000 | 0.0029 |
| muslim | 0.0178 | 0.0171 | 0.0175 |
Appendix D Protected attribute word groups
D.1. Word groups for three protected attributes (based on (Garg et al., 2018; Zhao et al., 2018))
Religion:
Islam: ’allah’, ’ramadan’, ’turban’, ’emir’, ’salaam’, ’sunni’, ’koran’, ’imam’, ’sultan’, ’prophet’, ’veil’, ’ayatollah’, ’shiite’, ’mosque’, ’islam’, ’sheik’, ’muslim’, ’muhammad’;
Christianity: ’baptism’, ’messiah’, ’catholicism’, ’resurrection’, ’christianity’, ’salvation’, ’protestant’, ’gospel’, ’trinity’, ’jesus’, ’christ’, ’christian’, ’cross’, ’catholic’, ’church’, ’christians’, ’catholics’
Gender:
Male: ’cowboy’, ’cowboys’, ’cameramen’, ’cameraman’, ’busboy’, ’busboys’, ’bellboy’, ’bellboys’, ’barman’, ’barmen’, ’tailor’, ’tailors’, ’prince’, ’princes’, ’governor’, ’governors’, ’adultor’, ’adultors’, ’god’, ’gods’, ’host’, ’hosts’, ’abbot’, ’abbots’, ’actor’, ’actors’, ’bachelor’, ’bachelors’, ’baron’, ’barons’, ’beau’, ’beaus’, ’bridegroom’, ’bridegrooms’, ’brother’, ’brothers’, ’duke’, ’dukes’, ’emperor’, ’emperors’, ’enchanter’, ’father’, ’fathers’, ’fiance’, ’fiances’, ’priest’, ’priests’, ’gentleman’, ’gentlemen’, ’grandfather’, ’grandfathers’, ’headmaster’, ’headmasters’, ’hero’, ’heros’, ’lad’, ’lads’, ’landlord’, ’landlords’, ’male’, ’males’, ’man’, ’men’, ’manservant’, ’manservants’, ’marquis’, ’masseur’, ’masseurs’, ’master’, ’masters’, ’monk’, ’monks’, ’nephew’, ’nephews’, ’priest’, ’priests’, ’sorcerer’, ’sorcerers’, ’stepfather’, ’stepfathers’, ’stepson’, ’stepsons’, ’steward’, ’stewards’, ’uncle’, ’uncles’, ’waiter’, ’waiters’, ’widower’, ’widowers’, ’wizard’, ’wizards’, ’airman’, ’airmen’, ’boy’, ’boys’, ’groom’, ’grooms’, ’businessman’, ’businessmen’, ’chairman’, ’chairmen’, ’dude’, ’dudes’, ’dad’, ’dads’, ’daddy’, ’daddies’, ’son’, ’sons’, ’guy’, ’guys’, ’grandson’, ’grandsons’, ’guy’, ’guys’, ’he’, ’himself’, ’him’, ’his’, ’husband’, ’husbands’, ’king’, ’kings’, ’lord’, ’lords’, ’sir’, ’sir’, ’mr.’, ’mr.’, ’policeman’, ’spokesman’, ’spokesmen’;
Female: ’cowgirl’, ’cowgirls’, ’camerawomen’, ’camerawoman’, ’busgirl’, ’busgirls’, ’bellgirl’, ’bellgirls’, ’barwoman’, ’barwomen’, ’seamstress’, ’seamstress’, ’princess’, ’princesses’, ’governess’, ’governesses’, ’adultress’, ’adultresses’, ’godess’, ’godesses’, ’hostess’, ’hostesses’, ’abbess’, ’abbesses’, ’actress’, ’actresses’, ’spinster’, ’spinsters’, ’baroness’, ’barnoesses’, ’belle’, ’belles’, ’bride’, ’brides’, ’sister’, ’sisters’, ’duchess’, ’duchesses’, ’empress’, ’empresses’, ’enchantress’, ’mother’, ’mothers’, ’fiancee’, ’fiancees’, ’nun’, ’nuns’, ’lady’, ’ladies’, ’grandmother’, ’grandmothers’, ’headmistress’, ’headmistresses’, ’heroine’, ’heroines’, ’lass’, ’lasses’, ’landlady’, ’landladies’, ’female’, ’females’, ’woman’, ’women’, ’maidservant’, ’maidservants’, ’marchioness’, ’masseuse’, ’masseuses’, ’mistress’, ’mistresses’, ’nun’, ’nuns’, ’niece’, ’nieces’, ’priestess’, ’priestesses’, ’sorceress’, ’sorceresses’, ’stepmother’, ’stepmothers’, ’stepdaughter’, ’stepdaughters’, ’stewardess’, ’stewardesses’, ’aunt’, ’aunts’, ’waitress’, ’waitresses’, ’widow’, ’widows’, ’witch’, ’witches’, ’airwoman’, ’airwomen’, ’girl’, ’girls’, ’bride’, ’brides’, ’businesswoman’, ’businesswomen’, ’chairwoman’, ’chairwomen’, ’chick’, ’chicks’, ’mom’, ’moms’, ’mommy’, ’mommies’, ’daughter’, ’daughters’, ’gal’, ’gals’, ’granddaughter’, ’granddaughters’, ’girl’, ’girls’, ’she’, ’herself’, ’her’, ’her’, ’wife’, ’wives’, ’queen’, ’queens’, ’lady’, ’ladies’, ”ma’am”, ’miss’, ’mrs.’, ’ms.’, ’policewoman’, ’spokeswoman’, ’spokeswomen’
Ethnicity:
Chinese: ’chung’, ’liu’, ’wong’, ’huang’, ’ng’, ’hu’, ’chu’, ’chen’, ’lin’, ’liang’, ’wang’, ’wu’, ’yang’, ’tang’, ’chang’, ’hong’, ’li’;
Hispanic: ’ruiz’, ’alvarez’, ’vargas’, ’castillo’, ’gomez’, ’soto’, ’gonzalez’, ’sanchez’, ’rivera’, ’mendoza’, ’martinez’, ’torres’, ’rodriguez’, ’perez’, ’lopez’, ’medina’, ’diaz’, ’garcia’, ’castro’, ’cruz’;
White: ’harris’, ’nelson’, ’robinson’, ’thompson’, ’moore’, ’wright’, ’anderson’, ’clark’, ’jackson’, ’taylor’, ’scott’, ’davis’, ’allen’, ’adams’, ’lewis’, ’williams’, ’jones’, ’wilson’, ’martin’, ’johnson’
Appendix E Embedding bias metrics and values
Let $N \in \mathbb{R}^{n \times d}$ be a matrix whose rows are the $d$-dimensional word embeddings of the $n$ neutral terms (e.g., crazy, nice, etc.). Let $S^{(j)} \in \mathbb{R}^{m_j \times d}$ be the matrix whose rows are the $d$-dimensional word embeddings of the $m_j$ terms that represent subgroup $j$ (e.g., $m_1 = 2$ if $j = 1$ represents the male subgroup, which has terms {he, boy}). Let $c^{(j)} \in \mathbb{R}^{n}$ be the vector of averaged cosine similarity scores between the neutral terms and the subgroup terms for subgroup $j$. We compute the $i$-th entry of $c^{(j)}$ as follows.

(2)   $c^{(j)}_i = \dfrac{1}{m_j} \sum_{k=1}^{m_j} \cos\!\left(N_i, S^{(j)}_k\right)$

where $N_i$ and $S^{(j)}_k$ denote the $i$-th and $k$-th rows of $N$ and $S^{(j)}$, respectively, and $\cos(\cdot,\cdot)$ is the cosine similarity. We measure deviations between subgroups (e.g., male vs. female) as proxies for word embedding bias. These measures include the mean absolute error (MAE), the root mean squared error (RMSE), and several variations. When two subgroups are compared, we compute

(3)   $\mathrm{MAE}\!\left(c^{(1)}, c^{(2)}\right) = \dfrac{1}{n} \sum_{i=1}^{n} \left| c^{(1)}_i - c^{(2)}_i \right|$

In the event that there are more than two subgroups being compared, we compute the average MAE (AMAE) score across all pairwise comparisons between subgroups.

(4)   $\mathrm{AMAE} = \dfrac{2}{p(p-1)} \sum_{j=1}^{p-1} \sum_{l=j+1}^{p} \mathrm{MAE}\!\left(c^{(j)}, c^{(l)}\right)$

where $p$ is the number of subgroups. Similarly, we compute the RMSE and averaged RMSE (ARMSE) as follows.

(5)   $\mathrm{RMSE}\!\left(c^{(1)}, c^{(2)}\right) = \sqrt{\dfrac{1}{n} \sum_{i=1}^{n} \left( c^{(1)}_i - c^{(2)}_i \right)^2}$

(6)   $\mathrm{ARMSE} = \dfrac{2}{p(p-1)} \sum_{j=1}^{p-1} \sum_{l=j+1}^{p} \mathrm{RMSE}\!\left(c^{(j)}, c^{(l)}\right)$
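A minimal sketch of these computations, assuming the neutral-term and subgroup embeddings are already available as NumPy arrays whose rows are word vectors (the toy data below is random and purely illustrative):

```python
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def avg_similarity(neutral, subgroup):
    # Eq. (2): mean cosine similarity between each neutral word and all subgroup words
    return np.array([np.mean([cosine(n, s) for s in subgroup]) for n in neutral])

def amae_armse(neutral, subgroups):
    # Eqs. (3)-(6): pairwise MAE/RMSE between subgroup similarity vectors, averaged over all pairs
    sims = [avg_similarity(neutral, s) for s in subgroups]
    maes, rmses = [], []
    for j in range(len(sims)):
        for l in range(j + 1, len(sims)):
            diff = sims[j] - sims[l]
            maes.append(np.mean(np.abs(diff)))
            rmses.append(np.sqrt(np.mean(diff ** 2)))
    return float(np.mean(maes)), float(np.mean(rmses))

# Toy usage with random 300-d "embeddings" for 5 neutral words and two subgroups
rng = np.random.default_rng(0)
neutral = rng.normal(size=(5, 300))
male, female = rng.normal(size=(3, 300)), rng.normal(size=(4, 300))
print(amae_armse(neutral, [male, female]))
```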
[Figure. Comparison of the AMAE and ARMSE bias scores of this system’s embedding model against other mainstream embedding models.]
Appendix F Classification bias metric definitions
Equal Opportunity: It measures the difference between true positive rates (proportion of samples correctly classified into positive class) for reference and protected groups.
Gini Equality: It measures the difference in the Gini inequality of benefits between reference and protected groups, where the benefits are defined as: Benefits = Prediction − Target + 1.
Normalized Treatment Equality: It measures the difference between the false negative to false positive ratio in reference and protected groups.
Overall Accuracy Equality: It measures the difference between accuracy in reference and protected groups.
Positive Predictive Value: It measures the difference between positive predictive values (proportion of positive samples among all samples classified as positive) for reference and protected groups.
Positive Class Balance: It measures the difference between average predicted probability, given predicted class is positive, for reference and protected groups.
Statistical Parity: It measures the difference between positive rates (proportion of samples classified into the positive class) for reference and protected groups.
The choice of metrics depends on which type of error the user assigns more penalty to and therefore wants to equalize across different groups. For example, Equal Opportunity means equalizing the true positive rate (TPR) across the different groups; if TPR is a criterion the user really cares about, then Equal Opportunity is a suitable fairness metric to consider. If the user cares more about the probability of receiving the favorable result, the user should consider Statistical Parity. The selection of the metric therefore depends on the user’s needs and preferences, and automation of this process remains a challenge (Wachter et al., 2021).
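A minimal sketch of how two of the listed metrics could be computed, assuming binary predictions, binary labels, and a boolean protected-group indicator are available (the function name and toy data are illustrative):

```python
import numpy as np

def bias_gaps(y_true, y_pred, protected):
    """Equal Opportunity and Statistical Parity gaps between reference and protected groups.
    protected: boolean array, True for the protected group, False for the reference group."""
    y_true, y_pred, protected = map(np.asarray, (y_true, y_pred, protected))

    def tpr(mask):            # true positive rate within a group
        positives = mask & (y_true == 1)
        return (y_pred[positives] == 1).mean()

    def positive_rate(mask):  # share of the group classified as positive
        return (y_pred[mask] == 1).mean()

    return {
        "equal_opportunity_gap": tpr(~protected) - tpr(protected),
        "statistical_parity_gap": positive_rate(~protected) - positive_rate(protected),
    }

# Toy usage: 1 = hateful; "protected" marks comments referencing one subgroup
print(bias_gaps(y_true=[1, 1, 0, 0, 1, 0],
                y_pred=[1, 0, 0, 1, 1, 0],
                protected=[True, False, False, True, True, False]))
```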
Appendix G Classification bias supplementary data and results
| Comment | Gender Extracted | Religion Extracted |
|---|---|---|
| A visit to the DC Holocaust Museum revealed Hitler won by 43% of the popular vote and 32% of the seats. He also used the Schutzstaffel (SS) to intimidate his opponents, reminding one of the Antifa thugs we have today. Hitler also got the Muslims on his side. Uncanny. | Male | Islam |
| “Seton Catholic where their own students talk trash about how low of a division there football team is in.” | - | Christianity |
| How to develop their competitive skills and social skills in Home Schooling…as my child too complain abt bullying and unjust behaviour of teachers since she is super active… | Female | - |
| Version | Comment | Gender | Label | Prediction |
|---|---|---|---|---|
| Original | This Jerry Lewis ripoff needs to just go away already. The guy is so good at acting like a fool because he is a fool. | Male | Hateful | 0.9811 |
| Swapped | This Jerry Lewis ripoff needs to just go away already. The gal is so good at acting like a fool because she is a fool. | Female | Hateful | 0.9891 |
| Templates | Subgroup: Christianity | Subgroup: Islam | Label |
|---|---|---|---|
| I am a [Identity] person, ask me anything | I am a Christian person, ask me anything | I am a Muslim person, ask me anything | Not-hateful |
| [Identity] people are gross and universally terrible | Christian people are gross and universally terrible | Muslim people are gross and universally terrible | Hateful |
| Actual Comment Type | Subgroup | Avg. Predicted Probability |
|---|---|---|
| Not-hateful | Islam | 15.6% hateful |
| Not-hateful | Christianity | 11.2% hateful |
| Hateful | Islam | 67.9% hateful |
| Hateful | Christianity | 63.5% hateful |
| Actual Comment Type | Subgroup | Avg. Predicted Probability |
|---|---|---|
| Not-hateful | Male | 12.9% hateful |
| Not-hateful | Female | 13.4% hateful |
| Hateful | Male | 57.4% hateful |
| Hateful | Female | 57.2% hateful |
Appendix H Global interpretability results
The global interpretability of predictions using SHAP provides a holistic view of how a given token impacts the overall behavior of the model. The results are illustrated in Figure 6. Each dot in this plot signifies the presence of a token in a particular comment: the vertical location shows which token is being depicted, the color shows whether that token’s importance was high or low for that observation, and the horizontal location shows whether the token’s presence pushed the model’s output probability higher or lower. We generated this view for the hateful comments in particular and found that profane language and cuss words turned out to be significant and to impact the model positively, thereby driving it to classify these comments as “hateful”.
[Figure 6. SHAP summary plot of token importances for the hateful comments.]
Appendix I Other ALTAI Requirements
ALTAI Requirement #6 focuses on societal and environmental well-being and considers the environmental impact of AI systems. There exist different methods and mechanisms that can be used to evaluate the associated environmental impacts. Our process uses the Machine Learning Emissions Calculator tool (Lacoste et al., 2019) to estimate the amount of carbon emissions produced by training the classifier under assessment here. This can be measured in Carbon dioxide equivalent (CO2eq) units, which is “a metric measure used to compare the emissions from various greenhouse gases on the basis of their global-warming potential, by converting amounts of other gases to the equivalent amount of carbon dioxide with the same global warming potential.” (eurostat Statistics Explained, 2017). The model assessed in this study has been trained on CPUs locally and its carbon footprint was considered to be negligible.
There are several additional ethical AI requirements that have not been applicable or covered in this paper. The model under assessment here has been trained offline on reliable datasets, but the resilience of the model to cyber-attacks, such as data poisoning and model evasion, is a critical dimension if the model is trained online and uses comment labels obtained through crowdsourcing (ALTAI Requirement #2). The model here used public datasets; in the event of use of or application to private data, the assessment should also evaluate privacy dimensions (ALTAI Requirement #3). This also includes data governance on both a macro and a micro level, with the focus areas of data availability, sharing, usability, consistency, and integrity.
The assessment may also account for the interests of stakeholders and how important each requirement is to each of them. The relevance matrix proposed by (Brown et al., 2021) can be used to connect stakeholder interests with the model’s ethical performance. The assessment output can also inform AI impact assessments (Kazim et al., 2021), which are usually created by the first party and are not necessarily covered in second- or third-party reviews.