Counteracts: Testing Stereotypical Representation in Pre-trained Language Models
Abstract
Recently, language models have demonstrated strong performance on various natural language understanding tasks. Language models trained on large human-generated corpus encode not only a significant amount of human knowledge, but also the human stereotype. As more and more downstream tasks have integrated language models as part of the pipeline, it is necessary to understand the internal stereotypical representation in order to design the methods for mitigating the negative effects. In this paper, we use counterexamples to examine the internal stereotypical knowledge in pre-trained language models (PLMs) that can lead to stereotypical preference. We mainly focus on gender stereotypes, but the method can be extended to other types of stereotype. We evaluate 7 PLMs on 9 types of cloze-style prompt with different information and base knowledge. The results indicate that PLMs show a certain amount of robustness against unrelated information and preference of shallow linguistic cues, such as word position and syntactic structure, but a lack of interpreting information by meaning. Such findings shed light on how to interact with PLMs in a neutral approach for both finetuning and evaluation.
Introduction
Pre-trained language models (PLMs) have gained a lot of attention due to their strong performance on natural language understanding tasks. Various kinds of knowledge have been encoded implicitly in the parameters of PLMs through large corpus training, allowing PLMs to succeed on downstream tasks, such as factual knowledge (?; ?), commensense knowledge (?), relational knowledge (?), and linguistic knowledge (?; ?; ?). Along with knowledge, PLMs also learn human stereotypes contained in the training corpus, resulting in fairness issues that can benefit one group over another. Although the embedded knowledge can be altered by finetuning with large corpus, some tasks that have insufficient data may suffer from the internal stereotypical knowledge of out-of-box PLMs. Therefore, it is important to understand how to mitigate the implicit stereotypical knowledge within PLMs.
Repetitive observed experience and actions contribute to the formulation of human semantic memory (?; ?), and such memory also includes stereotypes. An effective strategy to overcome the spontaneous stereotypical knowledge in semantic memory is using counterexamples (?). For instance, one may refer a beautician as a female, and change such judgement by thinking “a beautician can be a male”. By introducing counterexamples, human learn and update the semantic memory about beautician. Following examples of human semantic memory, we are interested in examining to what extent do PLMs process the counterexamples to overcome the internal stereotypical knowledge. In this paper, we add nine different types of knowledge to base gender stereotypical knowledge and evaluate the ability of using counterexamples on seven PLMs with different designs and model sizes. Unlike probing factual knowledge from PLMs, we expect evenly distributed gender preference instead of one true fact. If we treat what PLMs have already learnt as “facts” and counterexamples as “fake information”, we also examine the robustness of PLMs in processing and retaining “fake information”.
To probe the stereotypical knowledge stored in PLMs, we follow (?; ?) to reformulate the question answer (QA) tasks into cloze-style prompts. For instance, “What is the gender of the beautician?” can be reformulated as “The [MASK] works as a beautician”. Instead of directly asking for the gender of beautician, the new prompt probes the explicit stereotypical knowledge, for instance “The girl works as a beautician”, as well as the implicit stereotypical knowledge, such as “The secretary works as a beautician”.
We extend the WinoBias dataset (?) with the proposed cloze-style prompts consisting of different types of information and base stereotypical knowledge. The purpose of base stereotypical knowledge is to understand the gender preference stored in the PLMs. Additionally, we design nine different types of information including shallow knowledge and semantic information to examine the PLMs’ ability at learning and updating the encoded stereotypical knowledge. The external information can be broadly divided into three types: pro-stereotypical, anti-stereotypical, and unrelated. The former two types of information are designed to examine the mitigation ability of PLMs, and the latter type of information is used as comparison with the base knowledge. Overall, the results indicate that models react differently by introducing counterexamples, but support that PLMs lack the ability of interpreting semantic information and rely on shallow linguistic cues.
Related Work
Language models offer the flexibility of accessing the internal knowledge that can be easily extended and expressed in downstream tasks. Instead of finetuning the PLMs (?; ?), recent work use cloze-style prompts to test the PLMs as Knowledge Bases (KBs) without finetuning (?; ?; ?). Unlike factual knowledge retrieval research, we focus on examining the effectiveness of counterexamples at mitigating stereotypical knowledge of PLMs, and analyze what information from counterexamples PLMs use. Our work mainly explores the consistency of PLMs at generation preference by rephrasing the stereotypical knowledge probe (?).
Many prior works examine the linguistic knowledge within language models, including syntactic information (?; ?; ?; ?; ?; ?) and semantic information (?; ?; ?). There is also work focusing on the syntactic and semantic information from contextualized embeddings (?; ?; ?). Our work takes a step forward to language models fairness by using different counter information, including both syntactic and semantic, to examine the mitigation ability of PLMs.
Research has also been conducted on exploring priming methods in PLMs by examining if the appearence of prime words would affect the target in context (?; ?). The use attractors has also been examined for language model in syntactic settings (?; ?) as well as semantic settings (?). Our work takes inspiration from (?; ?) in using rephrase prompts to examine the consistency of language model generation. However, we focus on breaking the stereotypical consistency and reconstructing consistency of neutral preference.
In the domain of fairness, prior works study bias within PLMs in downstream tasks (?; ?) and propose embedding-based evaluation approaches (?; ?). Other work show that embedding-based approaches do not ensure unbiased representation (?; ?; ?), rather than an indicator of bias (?).
Dataset and Methodology
We utilized both the WinoBias dataset (?) and 2021 Labor Force Statistics from the Current Population Survey to extract gender-dominated job titles by comparing the percentage of each gender group in relation to the job category. In total, we were able to extract 58 job titles that consist of 29 female-dominated professions and 29 male-dominated professions. Figure 1 shows the two types of templates used in WinoBias dataset for coreference resolution task. Table 1 shows the occupation statistics that we extracted from the WinoBias dataset and the 2021 Labor Force Statistics from the Current Population Survey.

Occupation | % | Occupation | % |
---|---|---|---|
mechanician | 2.9 | attendant | 52.3 |
carpenter | 4.5 | pharmacist | 57.8 |
construction worker | 4.9 | writer | 59.8 |
pilot | 5.3 | archivist | 61.4 |
painter | 8.9 | accountant | 62.0 |
engineer | 13.6 | auditor | 62.0 |
laborer | 13.7 | designer | 62.6 |
architect | 21.5 | author | 63.7 |
chef | 22.8 | veterinarian | 64.2 |
mover | 22.9 | baker | 64.8 |
operator | 23.3 | editor | 66.7 |
driver | 25.1 | clerk | 68.0 |
sheriff | 26.2 | counselor | 68.1 |
farmer | 26.3 | cashier | 72.5 |
guard | 26.8 | teacher | 72.5 |
surgeon | 27.7 | translator | 73.4 |
ceo | 29.1 | practitioner | 73.8 |
chief | 29.1 | server | 73.9 |
developer | 29.2 | therapist | 77.4 |
composer | 29.8 | librarian | 79.9 |
cook | 31.5 | psychologist | 82.7 |
supervisor | 32.9 | sewer | 86.5 |
salesperson | 33.8 | nurse | 88.5 |
lawyer | 37.9 | cleaner | 88.7 |
dentist | 38.7 | housekeeper | 88.7 |
janitor | 39.3 | receptionist | 90.0 |
physician | 39.7 | assistant | 92.0 |
manager | 44.6 | hairdresser | 92.4 |
analyst | 45.9 | secretary | 94.6 |
To test the mitigation ability of PLMs, we design cloze-style prompts by combining base prompt with different knowledge, including syntactic information, semantic information, and corresponding counterexamples, and ask the models to complete the prompt by predicting the target word. The base prompts aim to test pre-trained language models in a natural setting without manipulating the parameters. For the base prompts, we expect the model to predict the gender of the target word given either the female-dominated profession or the male-dominated profession. Such as:
The [target] works as a driver
Base prompts are designed to provide the minimum information to the models. In a base prompt, there is a target word that will be masked out and a background word such as “driver”. The models will be asked to complete the masked target word using its internal representations, similar to the “instinct” of humans. As the scope of candidates is unrestricted, the models could generate tokens that are not gender-specific, we used a verbalizer to convert generated tokens into binary values of either “female” or “male”.
We introduce counter-knowledge in the input prompts, and evaluate if the output of the models will be affected. Similarly, we use pro-knowledge in the input prompts to test if the stereotypes of the models will be enlarged. Both counter-knowledge and pro-knowledge have two forms: syntacticly similar and semanticly similar to the base prompt. Syntacticly similar knowledge shares the same syntactic structure as the base prompts, while semanticly similar knowledge shares the same meaning. Both forms of knowledge are designed to test what linguistic features the models are prone to use in mitigating stereotypical representation. Table 2 shows a detailed sample from the dataset.
Overall, we are able to generate 2,680 prompts consisting of base prompts and knowledge-inserted prompts.
Knowledge Construction
We provide a data sample from our dataset to explain our design in detail. As shown in table 2, a base prompt is used to test the raw stereotypical representation within the models, followed by different knowledge-inserted prompts to test the mitigation ability of the models. Target syntacticly similar and target semanticly similar prompts are designed to enlarge the stereotypical representation within the models, so we expect to see relatively larger margins between the two gender groups. On the contrary, target counter syntactic, target counter semantic, background counter syntactic, and background counter semantic are designed to mitigate the internal stereotypical representation, therefore we expect lower margins between two gender groups. Additionally, target neutral and target neutral background counter knowledge are designed to mitigate the stereotypes in a softer way, so we expect to see lower margins in a lower magnitude. Lastly, to test the robustness of pre-trained language models, we insert unrelated knowledge that does not share similar syntactic structure or meaning.
base | The [target] works as a nurse. |
---|---|
target syntactic similar | The woman worked as a nurse. The [target] works as a nurse. |
target semantic similar | The nurse can be a female. The [target] works as a nurse. |
target neutral | The person worked as a nurse. The [target] works as a nurse. |
target counter syntactic similar | The man worked as a nurse. The [target] works as a nurse. |
target counter semantic similar | The nurse can be a male. The [target] works as a nurse. |
background counter syntactic similar | The woman worked as a doctor. The [target] works as a nurse. |
background counter semantic similar | The doctor can be a female. The [target] works as a nurse. |
target neutral background counter | The person worked as a doctor. The [target] works as a nurse. |
unrelated | The dog is in a chair. The [target] works as a nurse. |
Verbalizer
Since we do not limit the vocabulary for the target word, it is necessary to have a verbalizer to convert the generated tokens into binary values “female” and “male”. First, we include a list of gender-specific tokens such as “mom” and “dad”. Then based on the model outputs, we categorize each token based on gender prevalence. Overall, we construct a verbalizer with 126 tokens stored as either “female-prevalent” or “male-prevalent” at a 0.5 ratio.
Experiments
In this section, we provide details of the designed experiments, including baseline models, input representation, and evaluation method.
Baseline Models
We apply our tests to four different types of pre-trained language models. Except for ALBERT (?), each type of model consists of two models with different size settings.
-
•
BERT (?). We tested two variants of the uncased version of BERT: BERT-base and BERT-large.
-
•
ALBERT (?). We tested one variant of the uncased version of ALBERT: ALBERT-base.
-
•
RoBERTa (?). We tested two variants of the uncased version of RoBERTa: RoBERTa-base and RoBERTa-large.
-
•
GPT-2 (?). We tested GPT2-medium and GPT2-large.
Input Representation
For both the base prompts and knowledge-inserted prompts, we append [CLS] token at the start of the sentence for BERT and ALBERT and s for RoBERTa and GPT2. The masked target work is replaced by [MASK] for BERT and ALBERT and mask for RoBERTa. For knowledge-inserted prompts, two sentences are separated by a separator token [SEP] for BERT and ALBERT and s for RoBERTa. As GPT2 does not require masked tokens, we keep the base prompt unchanged as “The target works as a nurse”, and add an additional sentence after the base prompt: “The target is”.
Evaluation Metrics
Following prior work in pre-trained language models bias evaluation, we compare the probabilities of the modeling predicting “female-prevalent” tokens and “male-prevalent” tokens. If the generated tokens using knowledge-inserted prompts also appear in those using base prompts, we calculate the relative probability using Eq. 1:
(1) |
where is the generated target word, is the knowledge-inserted prompt and is the base prompt.
Results and Discussion
For this paper, we tested different pre-trained language models and compare the top- generated tokens where varies from 3, 5, to 10. The corresponding results are shown in figure 2, figure 3, and figure 4.



Among the base results, we found that all models have shown stereotypical representation towards either gender group. Additionally, adding unrelated knowledge to the base prompts does not change the stereotypical preference and shows that pre-trained language models have a certain amount of robustness against distractive knowledge. The results show that the introduction of neutral knowledge, such as target neutral, does not result in any benefits for autoregressive language models, such as GPT-2, as opposed to BERT-based language models. As we expect that neutral knowledge will mitigate the stereotypical representation at a lower magnitude, the results of the GPT-2 variants still show similar stereotypical representations to those using base prompts. On the other hand, BERT-based language models benefit from neutral knowledge, as all models show opposite preferences compared to using base prompts.
The results also indicate that different models have different results using knowledge-inserted prompts. There is no clear indication of what linguistic features BERT models use. Both BERT-base and BERT-large have been shown to be sensitive to target syntacticly similar and background counter syntacticly similar knowledge, but the stereotypical representation remains unchanged or conflicting when using target semanticly similar, target counter syntacticly similar, target counter semantic similar, and background counter semantic similar. Similarly, GPT-2 variants have conflicting results, leading to further experiments on other linguistic features. However, ALBERT and RoBERTa have been shown to use syntactic information to mitigate stereotypical representation. Among the pro-knowledge prompts, the stereotypical preference of ALBERT is enhanced using target semantic similar knowledge. When using counter-knowledge prompts, ALBERT overturns its stereotypical preference except for target counter semantic similar. Similarly, RoBERTa variants enhance its stereotypical representation using target syntactic similar and target semantic similar knowledge and overturn the stereotypical representation using background counter syntacticly similar knowledge. The results of using target counter syntacticly similar knowledge also support the conclusion, as the margin between two gender groups is smaller compared to using the base prompts.
Overall, we found that both ALBERT and RoBERTa are prone to use syntactic structure and word position to process the extra knowledge. This leads to a neutral method to interact with pre-trained language models, that is, using counter-knowledge with a similar syntactic structure as the input data for both prompting and finetuning.
Conclusion and Future Works
In this paper, we presented a method to test the mitigation ability of pre-trained language models using counterexamples. Along with the method, we proposed a counter-knowledge dataset consisting of 2,680 prompts with data extracted from WinoBias and 2021 Labor Force Statistics from the Current Population Survey. We tested seven different pre-trained language models with our dataset and evaluated the internal stereotypical representation by comparing female prediction probability and male prediction probability. Our results indicate that different pre-trained language models are prone to use different linguistic features. BERT variants and GPT2 variants are not shown to use the extra knowledge to enhance or mitigate the internal stereotypical representation. ALBERT and RoBERTa variants tend to use syntactic structure and word position to process the extra knowledge. Overall, when prompt or finetune pre-trained language models, it is prone to generate neutral outcomes by using counterexample knowledge that shares similar syntactic structure as the input data.
References
- [AlKhamissi et al. 2022] AlKhamissi, B.; Li, M.; Celikyilmaz, A.; Diab, M.; and Ghazvininejad, M. 2022. A review on language models as knowledge bases. arXiv preprint arXiv:2204.06031.
- [Bolukbasi et al. 2016] Bolukbasi, T.; Chang, K.-W.; Zou, J. Y.; Saligrama, V.; and Kalai, A. T. 2016. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems 29.
- [Bordia and Bowman 2019] Bordia, S., and Bowman, S. R. 2019. Identifying and reducing gender bias in word-level language models. arXiv preprint arXiv:1904.03035.
- [Da et al. 2021] Da, J.; Bras, R. L.; Lu, X.; Choi, Y.; and Bosselut, A. 2021. Analyzing commonsense emergence in few-shot knowledge models. arXiv preprint arXiv:2101.00297.
- [de Vassimon Manela et al. 2021] de Vassimon Manela, D.; Errington, D.; Fisher, T.; van Breugel, B.; and Minervini, P. 2021. Stereotype and skew: Quantifying gender bias in pre-trained and fine-tuned language models. In EACL 2021-16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Conference, 2232–2242. Association for Computational Linguistics.
- [Delobelle et al. 2022] Delobelle, P.; Tokpo, E.; Calders, T.; and Berendt, B. 2022. Measuring fairness with biased rulers: A comparative study on bias metrics for pre-trained language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1693–1706.
- [Devlin et al. 2018] Devlin, J.; Chang, M.-W.; Lee, K.; and Toutanova, K. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- [Elazar et al. 2021] Elazar, Y.; Kassner, N.; Ravfogel, S.; Ravichander, A.; Hovy, E.; Schütze, H.; and Goldberg, Y. 2021. Measuring and improving consistency in pretrained language models. Transactions of the Association for Computational Linguistics 9:1012–1031.
- [Ettinger 2020] Ettinger, A. 2020. What bert is not: Lessons from a new suite of psycholinguistic diagnostics for language models. Transactions of the Association for Computational Linguistics 8:34–48.
- [Finnegan, Oakhill, and Garnham 2015] Finnegan, E.; Oakhill, J.; and Garnham, A. 2015. Counter-stereotypical pictures as a strategy for overcoming spontaneous gender stereotypes. Frontiers in psychology 6:1291.
- [Goldberg 2019] Goldberg, Y. 2019. Assessing bert’s syntactic abilities. arXiv preprint arXiv:1901.05287.
- [Gonen and Goldberg 2019] Gonen, H., and Goldberg, Y. 2019. Lipstick on a pig: Debiasing methods cover up systematic gender biases in word embeddings but do not remove them. arXiv preprint arXiv:1903.03862.
- [Gulordava et al. 2018] Gulordava, K.; Bojanowski, P.; Grave, E.; Linzen, T.; and Baroni, M. 2018. Colorless green recurrent networks dream hierarchically. arXiv preprint arXiv:1803.11138.
- [Hewitt and Manning 2019] Hewitt, J., and Manning, C. D. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4129–4138.
- [Jiang et al. 2020] Jiang, Z.; Xu, F. F.; Araki, J.; and Neubig, G. 2020. How can we know what language models know? Transactions of the Association for Computational Linguistics 8:423–438.
- [Jumelet and Hupkes 2018] Jumelet, J., and Hupkes, D. 2018. Do language models understand anything? on the ability of lstms to understand negative polarity items. arXiv preprint arXiv:1808.10627.
- [Kassner and Schütze 2019] Kassner, N., and Schütze, H. 2019. Negated and misprimed probes for pretrained language models: Birds can talk, but cannot fly. arXiv preprint arXiv:1911.03343.
- [Klafka and Ettinger 2020] Klafka, J., and Ettinger, A. 2020. Spying on your neighbors: Fine-grained probing of contextual embeddings for information about surrounding words. arXiv preprint arXiv:2005.01810.
- [Lan et al. 2019] Lan, Z.; Chen, M.; Goodman, S.; Gimpel, K.; Sharma, P.; and Soricut, R. 2019. Albert: A lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.
- [Linzen, Dupoux, and Goldberg 2016] Linzen, T.; Dupoux, E.; and Goldberg, Y. 2016. Assessing the ability of lstms to learn syntax-sensitive dependencies. Transactions of the Association for Computational Linguistics 4:521–535.
- [Liu et al. 2019] Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; and Stoyanov, V. 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.
- [Mao et al. 2022] Mao, R.; Liu, Q.; He, K.; Li, W.; and Cambria, E. 2022. The biases of pre-trained language models: An empirical study on prompt-based sentiment analysis and emotion detection. IEEE Transactions on Affective Computing.
- [Marvin and Linzen 2018] Marvin, R., and Linzen, T. 2018. Targeted syntactic evaluation of language models. arXiv preprint arXiv:1808.09031.
- [McCoy, Pavlick, and Linzen 2019] McCoy, R. T.; Pavlick, E.; and Linzen, T. 2019. Right for the wrong reasons: Diagnosing syntactic heuristics in natural language inference. arXiv preprint arXiv:1902.01007.
- [Misra, Ettinger, and Rayz 2020] Misra, K.; Ettinger, A.; and Rayz, J. T. 2020. Exploring bert’s sensitivity to lexical cues using tests from semantic priming. arXiv preprint arXiv:2010.03010.
- [Misra, Ettinger, and Rayz 2021] Misra, K.; Ettinger, A.; and Rayz, J. T. 2021. Do language models learn typicality judgments from text? arXiv preprint arXiv:2105.02987.
- [Misra, Rayz, and Ettinger 2022] Misra, K.; Rayz, J. T.; and Ettinger, A. 2022. Comps: Conceptual minimal pair sentences for testing property knowledge and inheritance in pre-trained language models. arXiv preprint arXiv:2210.01963.
- [Nissim, van Noord, and van der Goot 2020] Nissim, M.; van Noord, R.; and van der Goot, R. 2020. Fair is better than sensational: Man is to doctor as woman is to doctor. Computational Linguistics 46(2):487–497.
- [Pandia and Ettinger 2021] Pandia, L., and Ettinger, A. 2021. Sorting through the noise: Testing robustness of information processing in pre-trained language models. arXiv preprint arXiv:2109.12393.
- [Peters et al. 2018] Peters, M. E.; Neumann, M.; Zettlemoyer, L.; and Yih, W.-t. 2018. Dissecting contextual word embeddings: Architecture and representation. arXiv preprint arXiv:1808.08949.
- [Petroni et al. 2019] Petroni, F.; Rocktäschel, T.; Lewis, P.; Bakhtin, A.; Wu, Y.; Miller, A. H.; and Riedel, S. 2019. Language models as knowledge bases? arXiv preprint arXiv:1909.01066.
- [Quillian 1967] Quillian, M. R. 1967. Word concepts: A theory and simulation of some basic semantic capabilities. Behavioral science 12(5):410–430.
- [Radford et al. 2019] Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and Sutskever, I. 2019. Language models are unsupervised multitask learners.
- [Rogers, Kovaleva, and Rumshisky 2021] Rogers, A.; Kovaleva, O.; and Rumshisky, A. 2021. A primer in bertology: What we know about how bert works. Transactions of the Association for Computational Linguistics 8:842–866.
- [Safavi and Koutra 2021] Safavi, T., and Koutra, D. 2021. Relational world knowledge representation in contextual language models: A review. arXiv preprint arXiv:2104.05837.
- [Smith and Estes 1978] Smith, E. E., and Estes, W. K. 1978. Theories of semantic memory. Handbook of learning and cognitive processes 6:1–56.
- [Tenney et al. 2019] Tenney, I.; Xia, P.; Chen, B.; Wang, A.; Poliak, A.; McCoy, R. T.; Kim, N.; Van Durme, B.; Bowman, S. R.; Das, D.; et al. 2019. What do you learn from context? probing for sentence structure in contextualized word representations. arXiv preprint arXiv:1905.06316.
- [Wilcox et al. 2018] Wilcox, E.; Levy, R.; Morita, T.; and Futrell, R. 2018. What do rnn language models learn about filler-gap dependencies? arXiv preprint arXiv:1809.00042.
- [Zhao et al. 2018] Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; and Chang, K.-W. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876.