
The Lifecycle of "Facts": A Survey of Social Bias in Knowledge Graphs

Angelie Kraft1    Ricardo Usbeck12
1Department of Informatics, Universität Hamburg, Germany
2Hamburger Informatik Technologie-Center e.V. (HITeC), Germany
{angelie.kraft, ricardo.usbeck}@uni-hamburg.de
Abstract

Knowledge graphs are increasingly used in a plethora of downstream tasks or in the augmentation of statistical models to improve factuality. However, social biases are ingrained in these representations and propagate downstream. We conducted a critical analysis of literature concerning biases at different steps of a knowledge graph lifecycle. We investigated factors that introduce bias, as well as the biases that knowledge graphs and their embedded versions subsequently render. Limitations of existing measurement and mitigation strategies are discussed and paths forward are proposed.


1 Introduction

Knowledge graphs (KGs) provide a structured and transparent form of information representation and lie at the core of popular Semantic Web technologies. They are utilized as a source of truth in a variety of downstream tasks (e.g., information extraction Martínez-Rodríguez et al. (2020), link prediction (Getoor and Taskar, 2007; Ngomo et al., 2021), or question-answering Höffner et al. (2017); Diefenbach et al. (2018); Chakraborty et al. (2021); Jiang and Usbeck (2022)) and in hybrid AI systems (e.g., knowledge-augmented language models Peters et al. (2019); Sun et al. (2020); Yu et al. (2022) or conversational AI Gao et al. (2018); Gerritse et al. (2020)). In the latter, KGs are employed to enhance the factuality of statistical models Athreya et al. (2018); Rony et al. (2022). In this overview article, we question the ethical integrity of these facts and investigate the lifecycle of KGs (Auer et al., 2012; Paulheim, 2017) with respect to bias influences.[1]

[1] We focus on the KG lifecycle from a bias and fairness lens. For reference, the processes investigated in Section 3 correspond to the authoring stage in the taxonomy by Auer et al. (2012). The representation issues in KGs (Section 4) and KG embeddings (Sections 5 and 7) which affect downstream task bias relate to Auer et al.'s classification stage.

Figure 1: Overview of the knowledge graph lifecycle as discussed in this paper. Exclamation marks indicate factors that introduce or amplify bias. We examine bias-inducing factors of triple crowd-sourcing, hand-crafted ontologies, and automated information extraction (Section 3), as well as the resulting social biases in KGs (Section 4) and KG embeddings, including approaches for measurement and mitigation (Section 5).

We claim that KGs manifest social biases and potentially propagate harmful prejudices. To utilize the full potential of KG technologies, such ethical risks must be targeted and avoided during development and application. Using an extensive literature analysis, this article provides a reflection on previous efforts and suggestions for future work.

We collected articles via Google Scholar[2] and filtered for titles including knowledge graph/base/resource, ontologies, named entity recognition, or relation extraction, paired with variants of bias, debiasing, harms, ethical, and fairness. We selected peer-reviewed publications (in journals, conference or workshop proceedings, and book chapters) from 2010 onward, related to social bias in the KG lifecycle. This resulted in a final count of 18 papers. Table 1 gives an overview of the reviewed works and Figure 1 illustrates the analyzed lifecycle stages.

[2] A literature search on Science Direct, ACM Digital Library, and Springer did not provide additional results.

2 Notes on Bias, Fairness, and Factuality

In the following, we clarify our operational definitions of the most relevant concepts in our analysis.

2.1 Bias

If we refer to a model or representation as biased, we — unless otherwise specified — mean that the model or representation is socially biased, i.e., biased towards certain social groups. This is usually indicated by a systematic and unfairly discriminating deviation in the way members of these groups are represented compared to others (Friedman and Nissenbaum, 1996) (also known as algorithmic bias). Such bias can stem from pre-existing societal inequalities and attitudes, such as prejudice and stereotypes, or arise on an algorithmic level, through design choices and formalization (Friedman and Nissenbaum, 1996). From a more impact-focused perspective, algorithmic bias can be described as "a skew that [causes] harm" (Kate Crawford, Keynote at NIPS2017). Such harm can manifest itself in unfair distribution of resources or derogatory misrepresentation of a disfavored group. We refer to fairness as the absence of bias.

2.2 Unwanted Biases and Harms

One can distinguish between allocational and representational harms (Barocas et al., as cited in Blodgett et al., 2020), where the former refers to the unfair distribution of chances and resources and the latter more broadly denotes types of insult or derogation, distorted representation, or lack of representation altogether. To quantify biases that lead to representational harm, analyses of more abstract constructs are required. Mehrabi et al. (2021a), for example, measure indicators of representational harm via polarized perceptions: a predominant association of groups with either negative or positive prejudice, denigration, or favoritism. Polarized perceptions are assumed to correspond to societal stereotypes. They can overgeneralize to all members of a social group (e.g., "all lawyers are dishonest"). Harm, then, is to be prevented by avoiding or removing algorithmic bias. However, different views on the conditions for fairness can be found in the literature and, in consequence, different definitions of unwanted bias.

2.3 Factuality versus Fairness

We consider a KG factual if it is representative of the real world. For example, if it contains only male U.S. presidents, it truthfully represents the world as it is and has been. However, inference based on this snapshot would lead to the prediction that people of other genders cannot or will not become presidents. This would be false with respect to U.S. law and/or undermine the potential of non-male persons. Statistical inference over historical entities is one of the main usages of KGs. The factuality narrative, thus, risks consolidating and propagating pre-existing societal inequalities and works against matters of social fairness. Even if the data represented are not affected by sampling errors, they are restricted to describing the world as it is as opposed to the world as it should be. We strive for the latter kind of inference basis. Apart from that, in the following sections we will learn that popular KGs are indeed affected by sampling biases, which further amplify societal biases.

3 Entering the Lifecycle: Bias in Knowledge Graph Creation

We enter the lifecycle view (Figure 1) by investigating the processes underlying the creation of KGs. We focus on the human factors behind the authoring of ontologies and triples which constitute KGs. Furthermore, we address automated information extraction, i.e., the detection and extraction of entities and relations from text, since these approaches can be subject to algorithmic bias.

3.1 Triples: Crowd-Sourcing of Facts

Popular large-scale KGs, like Wikidata Vrandecic and Krötzsch (2014) and DBpedia Auer et al. (2007), are the products of continuous crowd-sourcing efforts. Both of these examples are closely related to Wikipedia, where the top five languages (English, Cebuano, German, Swedish, and French) constitute 35% of all articles on the platform.[3] Wikipedia thus tends to be Euro-centric. Moreover, the majority of authors are white males.[4] As a result, the data convey a particular, homogeneous set of interests and knowledge (Beytía et al., 2022; Wagner et al., 2015). This sampling bias affects the geospatial coverage of information (Janowicz et al., 2018) and leads to higher barriers for female personalities to receive a biographic entry (Beytía et al., 2022). In an experiment, Demartini (2019) asked crowd contributors to provide a factual answer to the (politically charged) question of whether or not Catalonia is a part of Spain. The diverging responses indicated that participants' beliefs of what counts as true differed largely. This is an example of bias that goes beyond the subliminal psychological level; structural aspects like consumed media and social discourse play an important role. To counter this problem, Demartini (2019) suggests actively asking contributors for evidence supporting their statements, as well as keeping track of their demographic backgrounds. This makes underlying motivations and possible sources of bias traceable.

[3] https://en.wikipedia.org/wiki/List_of_Wikipedias
[4] https://en.wikipedia.org/wiki/Gender_bias_on_Wikipedia; https://en.wikipedia.org/wiki/Racial_bias_on_Wikipedia

3.2 Ontologies: Manual Creation of Rules

Ontologies determine rules regarding allowed types of entities and relations or their usage. They are often hand-made and a source of bias (Janowicz et al., 2018) due to the influence of opinions, motivations, and personal choices (Keet, 2021): Factors like scientific opinions (e.g., historical ideas about race), socio-culture (e.g., how many people a person can be married to), or political and religious views (e.g., classifying a person of type X as a terrorist or a protestor) can proximately lead to an encoding of social bias. Structural constraints, like an ontology's granularity level, can also induce bias (Keet, 2021). Furthermore, issues can arise from the types of information used to characterize a person entity. Whether one attributes the person with their skin color or not could theoretically determine the emergence of racist bias in a downstream application (Paparidis and Kotis, 2021). Geller and Kollapally (2021) give a practical example of detecting and alleviating ontology bias in a real-world scenario. The authors discovered that ontological gaps in the medical context led to an under-reporting of race-specific incidents. They were able to suggest countermeasures based on a structured analysis of real incidents and external terminological resources.

3.3 Extraction: Automated Extraction of Information

Natural language processing (NLP) methods can be used to recognize and extract entities (named entity recognition; NER) and their relations (relation extraction; RE), which are then represented as (head entity, relation, tail entity) tuples (or, equivalently, (subject, predicate, object) triples).
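To make this representation concrete, the following minimal Python sketch stores one extracted fact as such a tuple; the sentence and the resulting triple are illustrative placeholders, not the output of any specific extraction system.

```python
# Minimal sketch: storing one extracted fact as a (head, relation, tail) tuple.
# The sentence and the resulting triple are illustrative placeholders only.
from collections import namedtuple

Triple = namedtuple("Triple", ["head", "relation", "tail"])

sentence = "Ada Lovelace was born in London."

# Hypothetical output of an NER + relation extraction step over the sentence:
triple = Triple(head="Ada Lovelace", relation="birthPlace", tail="London")

print(triple)  # Triple(head='Ada Lovelace', relation='birthPlace', tail='London')
```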

Mehrabi et al. (2020) showed that the NER system CoreNLP (Manning et al., 2014) exhibits binary gender bias. They used a number of template sentences, like "<Name> is going to school" or "<Name> is a person", filled with male and female names[5] from 139 years of census data. The model returned more erroneous tags for female names. Similarly, Mishra et al. (2020) created synthetic sentences from adjusted Winogender (Rudinger et al., 2018) templates with names associated with different ethnicities and genders. A range of NER systems were evaluated (bidirectional LSTMs with a Conditional Random Field (BiLSTM CRF) (Huang et al., 2015) on GloVe (Pennington et al., 2014), ConceptNet (Speer et al., 2017), and ELMo (Peters et al., 2017) embeddings, CoreNLP, and spaCy[6] NER models). Across models, non-white names yielded on average lower performance scores than white names. Generally, ELMo exhibited the least bias. Although ConceptNet is debiased for gender and ethnicity,[7] it was found to produce strongly varied accuracy values.

[5] While most of the works presented here refer to gender as a binary concept, this does not agree with our understanding. We acknowledge that gender is continuous and technology must do this reality justice.
[6] https://spacy.io/
[7] https://blog.conceptnet.io/posts/2017/conceptnet-numberbatch-17-04-better-less-stereotyped-word-vectors/
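As a rough illustration of this template-based probing setup, the sketch below compares how often a pretrained spaCy model tags inserted names as PERSON across two name groups. The templates and name lists are small illustrative placeholders, not the census-derived lists used by Mehrabi et al. (2020).

```python
# Hedged sketch of template-based NER bias probing in the spirit of
# Mehrabi et al. (2020). Name lists and templates are illustrative placeholders.
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

templates = ["{} is going to school.", "{} is a person."]
name_groups = {
    "male": ["John", "Michael", "David"],
    "female": ["Mary", "Linda", "Susan"],
}

def person_tag_rate(names):
    # Fraction of filled templates in which the inserted name is tagged PERSON.
    hits, total = 0, 0
    for name in names:
        for template in templates:
            doc = nlp(template.format(name))
            tagged = any(ent.label_ == "PERSON" and name in ent.text for ent in doc.ents)
            hits += int(tagged)
            total += 1
    return hits / total

for group, names in name_groups.items():
    print(group, person_tag_rate(names))
```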

Gaut et al. (2020) analyzed binary gender bias in a popular open-source neural relation extraction (NRE) model, OpenNRE (Han et al., 2019). For this purpose, the authors created a new dataset, named WikiGenderBias (sourced from Wikipedia and DBpedia). All sentences describe a gendered subject with one of four relations: spouse, hypernym, birthDate, or birthPlace (DBpedia mostly uses occupation-related hypernyms). The most notable bias was found for the spouse relation, which was predicted more reliably for male than for female entities. This observation stands in contrast to the predominance of female instances with the spouse relation in WikiGenderBias. The authors experimented with three different mitigation strategies: downsampling the training data to equalize the number of male and female instances, augmenting the data by artificially introducing new female instances, and word embedding debiasing (Bolukbasi et al., 2016). Only downsampling facilitated a reduction of bias that did not come at the cost of model performance.

Nowadays, contextualized transformer-based encoders are used in various NLP applications, including NER and NRE. Several works have analyzed the societal biases encoded in large-scale word embeddings (like word2vec (Mikolov et al., 2013; Bolukbasi et al., 2016) or BERT (Devlin et al., 2019; Kurita et al., 2019)) or language models (like GPT-2 (Radford et al., 2019; Kirk et al., 2021) and GPT-3 (Brown et al., 2020; Abid et al., 2021)). Thus, it is likely that these biases also affect the downstream tasks discussed here. Li et al. (2021) used two types of tasks to analyze bias in BERT-based RE on the newly created Wiki80 and TACRED (Zhang et al., 2017) benchmarks. For the first task, they masked only the entity names with a special token (masked-entity; ME), whereas for the second task, only the entity names were given (only-entity; OE). The model maintained higher performance in the OE setting, indicating that the entity names were more informative of the predicted relation than the contextual information. This hints at what the authors call semantic bias.
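The following small sketch illustrates how such masked-entity (ME) and only-entity (OE) inputs could be constructed from a sentence with annotated entity spans; the sentence, the mask token, and the separator are assumptions for illustration and do not reproduce the exact preprocessing of Li et al. (2021).

```python
# Illustrative construction of masked-entity (ME) and only-entity (OE) inputs
# for relation extraction probing. Sentence, spans, and tokens are placeholders.
sentence = "Marie Curie was awarded the Nobel Prize in Physics."
head, tail = "Marie Curie", "Nobel Prize in Physics"

# ME: keep the context, replace entity names with a special token.
me_input = sentence.replace(head, "[ENT]").replace(tail, "[ENT]")

# OE: keep only the entity names, discard the context.
oe_input = f"{head} [SEP] {tail}"

print(me_input)  # "[ENT] was awarded the [ENT]."
print(oe_input)  # "Marie Curie [SEP] Nobel Prize in Physics"
```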

A Note on Reporting Bias

Generally, when extracting knowledge from text, one should be aware that the frequency with which facts are reported is not representative of their real-world prevalence. Humans tend to mention only events, outcomes, or properties that fall outside of what they perceive as ordinary (Gordon and Van Durme, 2013) (e.g., "a banana is yellow" is too trivial to be reported). This phenomenon is called reporting bias and likely stems from a need to be as informative and non-redundant as possible when sharing knowledge.

4 Bias in Knowledge Graphs

Next in our investigation of the lifecycle (Figure 1) comes the representation of entities and relations as a KG. In the following, we illustrate which social biases are manifested in KGs and how.

4.1 Descriptive Statistics

Janowicz et al. (2018) demonstrated that DBpedia, which is sourced from Wikipedia info boxes, mostly represents the western and industrialized world. Matching the coverage of location entries in the KG with population density all over the world showed that several countries and continents are underrepresented. A disproportionate 70% of the person entities in Wikidata are male (20% are female, less than 1% are neither male nor female, and for roughly 10% the gender is not indicated) (Beytía et al., 2022). Radstok et al. (2021) found that the most frequent occupation is researcher and Beytía et al. (2022) identified arts, sports, and science and technology as the most prominent occupation categories. In reality, only about 2% of people in the U.S. are researchers (Radstok et al., 2021). This gap is likely caused by reporting bias as discussed earlier (Section 3.3). Radstok et al. (2021), moreover, observed that mentions of ethnic group membership decreased and changed in focus between the 18th and 21st century. Greeks are the most frequently labeled ethnic group among historic entries (over 400 times) and African Americans among modern entries (only roughly 100 times).

4.2 Semantic Polarity

Mehrabi et al. (2021b) focused on biases in common sense KGs like ConceptNet (Speer et al., 2017) and GenericsKB (Bhakthavatsalam et al., 2020) (which contains sentences), which are at risk of causing representational harms (see Section 2.2). They utilized regard (Sheng et al., 2019) and sentiment as intermediate bias proxies. Both concepts express the polarity of statements and can be measured via classifiers that predict a neutral, negative, or positive label (Sheng et al., 2019; Dhamala et al., 2021). Groups that are referred to in a mostly positive way are interpreted as favored and vice versa. Mehrabi et al. (2021b) applied this principle to natural language statements generated from ConceptNet triples. They found that subject and object entities relating to the professions CEO, nurse, and physician were more often favored, while performing artist, politician, and prisoner were more often disfavored. Similarly, several Islam-related entities were on the negative end, while Christian and Hindu entities were valued more ambiguously. As for gender, no significant difference was found.
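A rough sketch of this polarity-probing idea is given below, using the Hugging Face sentiment-analysis pipeline as a stand-in for the regard and sentiment classifiers used by Mehrabi et al. (2021b); the verbalized triples and group labels are illustrative.

```python
# Hedged sketch: scoring the polarity of natural-language statements generated
# from KG triples, in the spirit of Mehrabi et al. (2021b). A generic sentiment
# pipeline stands in for the regard/sentiment classifiers of the original work.
from collections import defaultdict
from transformers import pipeline

classifier = pipeline("sentiment-analysis")

# (group, verbalized triple) pairs, e.g. from ConceptNet templates.
statements = [
    ("nurse", "A nurse is capable of helping patients."),
    ("prisoner", "A prisoner is capable of committing a crime."),
]

polarity = defaultdict(list)
for group, text in statements:
    result = classifier(text)[0]  # {'label': 'POSITIVE'/'NEGATIVE', 'score': ...}
    signed = result["score"] if result["label"] == "POSITIVE" else -result["score"]
    polarity[group].append(signed)

for group, scores in polarity.items():
    print(group, sum(scores) / len(scores))  # average polarity per group
```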

5 Bias in Knowledge Graph Embeddings

Vector representations of KGs are used in a range of downstream tasks or combined with other types of neural models (Nickel et al., 2016; Ristoski et al., 2019). They facilitate efficient aggregation of connectivity patterns and convey latent information.

Embeddings are created through statistical modeling and summarize distributional characteristics. So, if a KG like Wikidata contains mostly (if not only) male presidents, the relationship between the gender male and the profession president is assumed to manifest itself accordingly in the model. In fact, the papers summarized below provide evidence that the social biases of KGs are modeled or further amplified by KG embeddings (KGEs). The following sections are organized by measurement strategy to give an overview of existing approaches and the information gained from them.

Table 1: Overview of reviewed works concerning the sources, measurement, and mitigation of bias in KGs/KGEs.

Bias Source
  Crowd-Sourcing: Beytía et al. (2022); Janowicz et al. (2018); Demartini (2019)
  Ontologies: Janowicz et al. (2018); Keet (2021); Paparidis and Kotis (2021); Geller and Kollapally (2021)
  Extraction: Mehrabi et al. (2020); Mishra et al. (2020); Gaut et al. (2020); Li et al. (2021)

Bias Measurement (Representation / Method)
  KG / Descriptive Statistics: Janowicz et al. (2018); Radstok et al. (2021); Beytía et al. (2022)
  KG / Semantic Polarity: Mehrabi et al. (2021b)
  KGE / Analogies: Bourli and Pitoura (2020)
  KGE / Projection: Bourli and Pitoura (2020)
  KGE / Update-Based: Fisher et al. (2020b); Keidar et al. (2021); Du et al. (2022)
  KGE / Link Prediction: Keidar et al. (2021); Arduini et al. (2020); Radstok et al. (2021); Du et al. (2022)

Bias Mitigation (Representation / Method)
  KGE / Data Balancing: Radstok et al. (2021); Du et al. (2022)
  KGE / Adversarial Learning: Fisher et al. (2020a); Arduini et al. (2020)
  KGE / Hard Debiasing: Bourli and Pitoura (2020)

5.1 Stereotypical Analogies

The idea behind analogy tests is to see whether demographics are associated with attributes in stereotypical ways (e.g., "Man is to computer programmer as woman is to homemaker" (Bolukbasi et al., 2016)). In their in-depth analysis of a TransE-embedded Wikidata KG, Bourli and Pitoura (2020) investigated occupational analogies for binary gender seeds. TransE (Bordes et al., 2013) represents a triple $(h, r, t)$ (with head $h$, relation $r$, tail $t$) in a single space such that $h + r \approx t$. The authors identified the model's most likely instance of the claim "a is to x as b is to y" (with $(a, b)$ a pair of demographic seeds and $(x, y)$ a pair of attributes) via a cosine score: $S_{(a,b)}(x,y) = \cos(\vec{a} + \vec{r} - \vec{b},\ \vec{x} + \vec{r} - \vec{y})$, where $r$ is the relation has_occupation. In their study, the highest scoring analogy was "woman is to fashion model as man is to businessperson". This example appears rather stereotypical, but other highly ranked analogies were less so, like "Japanese entertainer" versus "businessperson" (Bourli and Pitoura, 2020). A systematic evaluation of how stereotypical the results are is missing here. In comparison, the work that originally introduced analogy testing for word2vec (Bolukbasi et al., 2016) employed human annotators to rate stereotypical and gender-appropriate analogies (e.g., "sister" versus "brother").
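As a worked illustration of the score above, the following sketch computes $S_{(a,b)}(x,y)$ on randomly initialized vectors; in practice, the vectors would be trained TransE embeddings of Wikidata entities, and the entity names here are placeholders.

```python
# Minimal numpy sketch of the analogy score S_{(a,b)}(x, y) = cos(a + r - b, x + r - y)
# from Bourli and Pitoura (2020), on random placeholder vectors instead of
# trained TransE embeddings.
import numpy as np

rng = np.random.default_rng(0)
dim = 50
emb = {name: rng.normal(size=dim) for name in
       ["woman", "man", "fashion_model", "businessperson", "has_occupation"]}

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def analogy_score(a, b, x, y, r="has_occupation"):
    # "a is to x as b is to y", scored via cosine similarity of TransE offsets.
    return cos(emb[a] + emb[r] - emb[b], emb[x] + emb[r] - emb[y])

print(analogy_score("woman", "man", "fashion_model", "businessperson"))
```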

5.2 Projection onto a Bias Subspace

Projection-based measurement of bias is another approach that was first proposed by Bolukbasi et al. (2016) for word embeddings, and was adapted for TransE by Bourli and Pitoura (2020). In a first step, a one-dimensional gender direction $\vec{d}_g$ is extracted. Then, a projection score metric $S$ is computed to indicate gender bias, using the projection $\pi$ of an occupation vector $\vec{o}$ onto $\vec{d}_g$ and a set of occupations $C$: $S(C) = \frac{1}{|C|}\sum_{o \in C} \lVert \pi_{\vec{d}_g}\,\vec{o} \rVert$. Occupations with higher scores are interpreted as more gender-biased and those with close-to-zero scores as neutral.
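A minimal sketch of this projection score on random placeholder vectors is shown below, assuming a unit-length gender direction so that the projection norm reduces to $|\vec{o} \cdot \vec{d}_g|$.

```python
# Sketch of the projection-based bias score S(C) = (1/|C|) * sum ||proj_{d_g}(o)||
# (Bolukbasi et al., 2016; adapted for TransE by Bourli and Pitoura, 2020),
# computed on random placeholder vectors.
import numpy as np

rng = np.random.default_rng(1)
dim = 50
gender_dir = rng.normal(size=dim)
gender_dir /= np.linalg.norm(gender_dir)  # unit-length gender direction d_g

occupations = {name: rng.normal(size=dim) for name in ["nurse", "engineer", "teacher"]}

def projection_score(occupation_vecs, d_g):
    # With unit-length d_g, the norm of the projection of o onto d_g is |o . d_g|.
    norms = [abs(o @ d_g) for o in occupation_vecs]
    return sum(norms) / len(norms)

print(projection_score(list(occupations.values()), gender_dir))
```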

5.3 Update-Based Measurement

The translational likelihood (TL) metric was tailored to translation-based modeling approaches (Fisher et al., 2020b). To compute this metric, the embedding of a person entity is updated for one step towards one pole of a seed dimension. This update is performed in the same way as the model was originally fitted. For example, if head entity person x is updated in the direction of male gender, the TL value is given by the difference between the likelihood of person x being a doctor after versus before the update. If the absolute value averaged across all human entities is high, this indicates a bias regarding the examined seed-attribute pair. Fisher et al. (2020b) argue that this measurement technique avoids model-specificity as it generalizes to any scoring function. However, Keidar et al. (2021) found that the TL metric does not compare well between different types of embeddings (details in Section 6). It should, thus, only be used for the comparison of biases within one kind of representation. Du et al. (2022) propose an approach comparable to Fisher et al. (2020b) to measure individual-level bias. Instead of updating towards a gender dimension, the authors suggest flipping the entity's gender and fully re-training the model afterward. The difference between pre- and post-update link prediction errors gives the bias metric. A validation of the approach was done on TransE for a Freebase subset (FB5M (Bordes et al., 2015)) (Du et al., 2022). The summed per-gender averages (a group-level metric) were found to correlate with U.S. census gender distributions of occupations.
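The sketch below illustrates a simplified version of the TL idea for a TransE-style scorer: the person embedding is nudged one step along a gender direction and the plausibility of an occupation triple is compared before and after. The fixed-step update is a simplification of the gradient-based update in Fisher et al. (2020b), and all vectors are random placeholders.

```python
# Simplified sketch of the translational likelihood (TL) idea for a TransE-style
# scorer. Vectors are random placeholders; the fixed-step update stands in for
# the original gradient-based update rule.
import numpy as np

rng = np.random.default_rng(2)
dim = 50
person, r_occ, doctor = (rng.normal(size=dim) for _ in range(3))
gender_dir = rng.normal(size=dim)
gender_dir /= np.linalg.norm(gender_dir)

def transe_score(h, r, t):
    # Higher (less negative) = more plausible triple under TransE.
    return -np.linalg.norm(h + r - t)

def translational_likelihood(h, r, t, direction, step=0.1):
    # Difference in triple plausibility after vs. before nudging h along `direction`.
    before = transe_score(h, r, t)
    after = transe_score(h + step * direction, r, t)
    return after - before

print(translational_likelihood(person, r_occ, doctor, gender_dir))
```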

6 Downstream Task Bias: Link Prediction

Link prediction is a standard downstream task that targets the prediction of relations between entities in a given KG. Systematic deviations in the relations suggested for entities with different demographics indicate reproduced social bias.

For the measurement of fairness or bias in link prediction, Keidar et al. (2021) distinguish between demographic parity and predictive parity. The assumption underlying demographic parity is that equality between predictions for demographic counterfactuals (opposite demographics, for example, female versus male in a binary understanding) is the ideal state (Dwork et al., 2012). That is, the probability of predicting a label should be the same for both groups. Predictive parity is given, on the other hand, if the probability of true positive predictions (positive predictive value or precision) is equal between groups (Chouldechova, 2017). Hence, this measure factors in the label distribution by demographic. With these metrics, Keidar et al. (2021) analyzed different embedding types, namely TransE, ComplEx, RotatE, and DistMult, each fitted on the benchmark datasets FB15k-237 (Toutanova and Chen, 2015) and Wikidata5m (Wang et al., 2021). They averaged the scores across a large set of human-associated relations to automatically detect which relations are most biased. The results showed that position played on a sports team was most consistently gender-biased across embeddings. Arduini et al. (2020) analyzed link prediction parity regarding the relations gender and occupation to estimate debiasing effects on TransH (Wang et al., 2014) and TransD (Ji et al., 2015). The comparability between different forms of vector representations is a strength of downstream metrics. In contrast, measures like the analogy test or projection score (Bourli and Pitoura, 2020) are based on specific distance metrics, and TL (Fisher et al., 2020b) was shown to lack transferability across representations (Keidar et al., 2021) (Section 5.3).
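To make the two parity notions concrete, the following sketch computes a demographic parity gap and a predictive parity gap from toy link-prediction outcomes; the group labels and data are illustrative.

```python
# Sketch of the two fairness notions for link prediction discussed above,
# computed on toy predictions with a binary sensitive attribute.
from dataclasses import dataclass

@dataclass
class Example:
    group: str   # sensitive attribute value, e.g. "female" or "male"
    label: int   # ground-truth link present (1) or absent (0)
    pred: int    # predicted link present (1) or absent (0)

def demographic_parity_gap(examples, a="female", b="male"):
    # Difference in positive-prediction rates between the two groups.
    def rate(g):
        grp = [e for e in examples if e.group == g]
        return sum(e.pred for e in grp) / len(grp)
    return abs(rate(a) - rate(b))

def predictive_parity_gap(examples, a="female", b="male"):
    # Difference in precision (positive predictive value) between the groups.
    def precision(g):
        pos = [e for e in examples if e.group == g and e.pred == 1]
        return sum(e.label for e in pos) / len(pos) if pos else 0.0
    return abs(precision(a) - precision(b))

toy = [Example("female", 1, 1), Example("female", 0, 1),
       Example("male", 1, 1), Example("male", 0, 0)]
print(demographic_parity_gap(toy), predictive_parity_gap(toy))
```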

Du et al. (2022) interpret the correlation between gender and link prediction errors as an indicator of group bias. With this, they found, for example, that engineer and nurse are stereotypically biased in FB5M. However, the ground truth gender ratio was found not to be predictive of the bias metric (e.g., despite its higher male ratio, animator produced a stronger female bias value). For validation, it was shown that the predicted bias values correlate with the gender distributions of occupations according to the U.S. census (again, on TransE). Furthermore, the authors investigated how much single triples contribute to group bias via an influence function. They found that gender bias is mostly driven by triples containing gendered entities and triples of low degree.

7 Breaking the Cycle? Bias Mitigation in Knowledge Graph Embeddings

A number of works have attempted to mitigate biases in KGEs post hoc. Given that pre-existing biases are hard to eradicate from KGs, manipulating embedding procedures may alleviate the issue at least on the representation level. In the following, we summarize the respective approaches.

7.1 Data Balancing

Radstok et al. (2021) explored the effects of training an embedding model on a gender-balanced subset of Wikidata triples. First, the authors worked with the originally gender-imbalanced Wikidata12k (Leblay and Chekol, 2018; Dasgupta et al., 2018) and DBpedia15k (Sun et al., 2017), on which they fit a TransE and a DistMult model (Yang et al., 2015). They then added more female triples from the Wikidata/DBpedia graph to even out the binary gender distribution among the top-5 most common occupations. Through link prediction, they compared the number of male and female predictions with the ground truth frequencies. More female entities were predicted after the data balancing intervention. However, the absolute difference between the female ratios in the data and in the predictions increased, making the model both less accurate and less fair. Moreover, the authors note that this process is not scalable, since for some domains there are no or only a limited number of female entities (e.g., female U.S. presidents do not exist in Wikidata).

Du et al. (2022) experimented with adding and removing triples to gender-balance a Freebase subset (Bordes et al., 2015). For the first approach, the authors added synthetic triples (as opposed to real entities from another source as was done by Radstok et al. (2021)) for occupations with a higher male ratio. The resulting bias change was inconsistent across occupations. This appears in line with the authors’ finding that ground truth gender ratios are not perfectly predictive of downstream task bias (Section 6). For the second strategy, the triples that most strongly influenced an existing bias were determined and removed. This outperformed random triple removal.

7.2 Adversarial Learning

Adversarial learning for model fairness aims to prevent the prediction of a specific personal attribute from a person's entity embedding. As an adversarial loss, Fisher et al. (2020a) used the KL-divergence between the link prediction score distribution and an idealized target distribution. For example, given an even target score distribution over a set of religions, the model is incentivized to assign each of them equal probability. However, in their experiments, this treatment failed to fully remove the targeted bias. This is likely caused by related information encoded in the embedding that can still inform the same bias.
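A minimal PyTorch sketch of such an adversarial fairness term is shown below: a KL divergence that pushes the predicted distribution over candidate religion tails toward a uniform target. The tensor shapes and the surrounding training loop are assumptions; this is a simplified illustration, not the exact formulation of Fisher et al. (2020a).

```python
# Minimal sketch of a KL-based fairness loss: push the link-prediction score
# distribution over a set of sensitive tail entities (e.g., religions) toward
# a uniform target. Scores would come from the KGE scoring function in practice.
import torch
import torch.nn.functional as F

def fairness_loss(scores):
    # scores: (batch, n_religions) unnormalized link-prediction scores.
    log_probs = F.log_softmax(scores, dim=-1)
    target = torch.full_like(scores, 1.0 / scores.shape[-1])  # uniform target
    return F.kl_div(log_probs, target, reduction="batchmean")

scores = torch.randn(4, 5)  # 4 person entities, 5 candidate religions
print(fairness_loss(scores))
```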

Arduini et al. (2020) used a Filtering Adversarial Network (FAN) with a filter and a discriminator module. The filter intends to remove sensitive attribute information from the input, while the discriminator tries to predict the sensitive attribute from the output. Both modules were separately pre-trained (filter as an identity mapper of the embedding and discriminator as a gender predictor) and then jointly trained as adversaries. In their experiments, the gender classification accuracy for high- and low-degree entities was close to random for the filtered embeddings (TransH and TransD). For an additional occupation classifier, accuracy remained unaffected after treatment.

7.3 Hard Debiasing

Bourli and Pitoura (2020) propose applying the projection-based approach explained in Section 5.2 to the debiasing of TransE occupation embeddings. To achieve this, the occupation embedding's linear projection onto the previously computed gender direction is subtracted from it. A variant of this technique ("soft" debiasing) aims to preserve some degree of gender information by applying a weight $0 < \lambda < 1$ to the projection value before subtraction. In the authors' experiments, the correlation between gender and occupation was effectively removed, as indicated by the projection measure (Bourli and Pitoura, 2020). However, the debiasing degree determined by $\lambda$ was found to be in a trade-off with model accuracy. This technique was closely adapted from Bolukbasi et al. (2016), regarding which Gonen and Goldberg (2019) criticize that gender bias is only reduced according to their specific measure and not the "complete manifestation of this bias".
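The sketch below illustrates hard and soft debiasing via projection subtraction on random placeholder vectors; in practice, a trained TransE occupation embedding and a learned gender direction would be used.

```python
# Sketch of hard/soft debiasing via subtraction of the projection onto the
# gender direction (Bolukbasi et al., 2016; as used by Bourli and Pitoura, 2020).
# Vectors are random placeholders.
import numpy as np

rng = np.random.default_rng(3)
dim = 50
occupation = rng.normal(size=dim)
gender_dir = rng.normal(size=dim)
gender_dir /= np.linalg.norm(gender_dir)

def debias(vec, d_g, lam=1.0):
    # lam = 1.0 -> hard debiasing (projection fully removed);
    # 0 < lam < 1 -> soft debiasing (some gender information preserved).
    projection = (vec @ d_g) * d_g
    return vec - lam * projection

hard = debias(occupation, gender_dir)
soft = debias(occupation, gender_dir, lam=0.5)
print(abs(hard @ gender_dir), abs(soft @ gender_dir))  # ~0 for hard debiasing
```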

8 Discussion

In this article, we cover a wide range of evidence for harmful biases at different stages during the lifecycle of "facts" as represented in KGs. Some of the most influential graphs misrepresent the world as it is due to sampling and algorithmic biases at the creation step. Pre-existing biases are exaggerated in these representations. Embedding models learn to encode the same or further amplified versions of these biases. Since the training of high-quality embeddings is costly, they are, in practice, pre-trained once and afterward reused and fine-tuned for different systems. These systems preserve the inherited biases over long periods, exacerbating the issue further. Our survey shows that KGs may qualify as resources for historic facts, but they do not qualify for inference regarding various human attributes. Future work on biases in KGs and KGEs should aim for improvement in the following areas:

Attribute and Seed Choices

Bias metrics usually examine one or a few specific attributes (e.g., occupation) and their correlations with selected seed dimensions (e.g., gender). Occupation is by far the most researched attribute in the articles we found (Arduini et al., 2020; Radstok et al., 2021; Bourli and Pitoura, 2020; Fisher et al., 2020a, b). Only Keidar et al. (2021) propose to aggregate the correlations between a set of seed dimensions and all relations in a graph. All the works used binary gender as the seed dimension and some additionally addressed ethnicity, religion, and nationality (Fisher et al., 2020a, b; Mehrabi et al., 2021b).

Lack of Validation

Most of the KGE bias metrics presented here are interpreted as valid if they detect unfairly discriminating association patterns that intuitively align with existing stereotypes. Besides that, several works investigate the comparability between different metrics. Although both of these practices deliver valuable information on validity, they largely ignore the societal context. Only Du et al. (2022) compared embedding-level bias metrics with census-aligned data to assess compatibility with real-world inequalities. We suggest that future work consider a more comprehensive study of construct validity (Does the measurement instrument measure the construct in a meaningful and useful capacity?) (Jacobs and Wallach, 2021). One requirement is that the obtained measurements capture all relevant aspects of the construct the instrument claims to measure. That is, a gender bias measure must measure all relevant aspects of gender bias (Stanczak and Augenstein, 2021) (including, e.g., nonbinary gender and a distinction between benevolent and hostile forms of sexist stereotyping (Glick and Fiske, 1997)). Unless proven otherwise, we must be skeptical that this is achieved by existing approaches (Gonen and Goldberg, 2019). As a result of minimal validation, detailed interpretation guidelines are generally not provided. Therefore, the distinctions between strong and weak bias or weak bias and random variation are mostly vague.

(In-)Effectiveness of Mitigation Strategies

Data balancing is the most intuitive approach to bias mitigation and was proven to be effective in the context of text processing (Meade et al., 2022). However, for KGEs, data balancing methods were found to inconsistently reduce bias (Section 7.1). Adversarial learning yielded promising outcomes in the study by Arduini et al. (2020). Their FAN approach does not rely on pre-specified attributes. This is in contrast to Fisher et al. (2020a), whose intervention was found to miss non-targeted, yet bias-related information. This problem relates to one of the main criticisms of hard and soft debiasing: instead of alleviating the problem, these techniques risk concealing the full extent of the bias (Gonen and Goldberg, 2019).

Reported Motivations

Many, yet not all, works in the field name potential social harms as a motivation for their research on social bias in KGs (Mehrabi et al., 2021b; Fisher et al., 2020a, b; Radstok et al., 2021). Only Mehrabi et al. (2021b) drew from established taxonomies and targeted biases associated with representational harms (Barocas et al., as cited in Blodgett et al., 2020). Similarly, most works lack a clear working definition of social bias. For example, aspects of pre-existing societal biases captured in the data and biases arising through the algorithm (Friedman and Nissenbaum, 1996) are usually not disentangled. Only Bourli and Pitoura (2020) compared model bias to the original KG frequencies and showed that the statistical modeling caused an amplification.

9 Recommendations

To avoid harms caused by biases in KGs and their embeddings, we identify and recommend several actions for practitioners and researchers.

Transparency and Accountability

KGs should by default be published with bias-sensitive documentation to facilitate transparency and accountability regarding potential risks. Data Statements (Bender and Friedman, 2018) report curation criteria, language variety, demographics of the data authors and annotators, and relevant indicators of context, quality, and provenance. Datasheets for Datasets (Gebru et al., 2021) additionally state motivation, composition, preparation, distribution, and maintenance. The associated questionnaire can accompany the dataset creation process to avoid risks early on. Especially in the case of ongoing crowd-sourcing efforts for encyclopedic KGs, the demographic background of contributors should be reported (Demartini, 2019). Researchers using subsets of these KGs should investigate the respective data dumps for potential biases and report limitations transparently. Similarly, KG embedding models should be published with Model Cards (Mitchell et al., 2019) documenting intended use, underlying data, ethical considerations, and limitations. Stating contact details for reporting problems and concerns establishes accountability (Mitchell et al., 2019; Gebru et al., 2021).

Improving Representativeness

To tackle selection bias, data collection should aim to employ authors and annotators from diverse social groups and with varied cultural backgrounds. Annotations should be determined via aggregation (see Hovy and Prabhumoye, 2021). For open, editable KGs, interventions like edit-a-thons are helpful to introduce more authors from underrepresented groups (Vetter et al., 2022) (e.g., the Art+Feminism campaign aims to fill the gender gap in Wikimedia knowledge bases[8]). In order for such interventions to take effect, research must update databases and benchmarks frequently (see Koch et al., 2021). In addition, the timeliness of encyclopedic data is necessary to avoid perpetuating historic biases.

[8] https://outreachdashboard.wmflabs.org/campaigns/artfeminism_2022/overview

Tackling Algorithmic Bias

Evaluation and prevention of harmful biases must become part of the development pipeline Stanczak and Augenstein (2021). Algorithmic biases are best evaluated with a combination of multiple quantitative (Section 5) and qualitative measures (Kraft et al., 2022; Dev et al., 2021), considering multiple demographic dimensions (beyond gender and occupation). Evaluating the content of attributions in light of social discourse and the intended use of a technology facilitates an assessment of potential harms (Selbst et al., 2019). Downstream task bias may exist independently from a measured embedding bias (Goldfarb-Tarrant et al., 2021), therefore a task- and context-oriented evaluation is preferred (Section 6). We have presented several bias-mitigating strategies for different KGEs, which might alleviate the issue in some cases (Section 7). However, more research is needed to establish more effective and robust mitigation methods, as well as metrics used to evaluate their impact (Gonen and Goldberg, 2019; Blodgett et al., 2020).

10 Related Work

Although a wide range of surveys investigates biases in NLP, none of them specifically addresses KG-based methods. Blodgett et al. (2020) critically investigated the theoretical foundation of works analyzing bias in NLP. The authors claim that most works lack a clear taxonomy. We came to a similar conclusion with respect to evaluations of KGs and their embeddings. Sun et al. (2019) and Stanczak and Augenstein (2021) surveyed algorithmic measurement and mitigation strategies for gender bias in NLP. Sheng et al. (2021) summarized approaches for the measurement and mitigation of bias in generative language models. Some of the methods presented earlier are derived from works discussed in these surveys and adapted to the constraints of KG embeddings (e.g., Bourli and Pitoura (2020) adapted hard debiasing (Bolukbasi et al., 2016)). Criticisms point to the monolingual focus on the English language, the predominant assumption of a gender binary, and a lack of interdisciplinary collaboration.

Shah et al. (2020) identified four sources of predictive biases: label bias (label distributions are imbalanced and erroneous regarding certain demographics), selection bias (the data sample is not representative of the real world distribution), semantic bias/input representation bias (e.g., feature creation with biased embeddings), and overamplification through the predictive model (slight differences between human attributes are overemphasized by the model). All of these factors are reflected in the lifecycle as discussed in this article. To counter the risks, Shah et al. (2020) suggest employing multiple annotators and methods of aggregation (see also Hovy and Prabhumoye, 2021), re-stratification, re-weighting, or data augmentation, debiasing of models, and, finally, standardized data and model documentation.

11 Conclusion and Paths Forward

Our survey shows that biases affect KGs at different stages of their lifecycle. Social biases enter KGs in various ways at the creation step (e.g., through crowd-sourcing of triples and ontologies) and manifest in popular graphs, like DBpedia (Beytía et al., 2022) or ConceptNet (Mehrabi et al., 2021b). Embedding models can capture exaggerated versions of these biases (Bourli and Pitoura, 2020), which finally propagate downstream (Keidar et al., 2021). We acknowledge that KGs have enormous potential for a variety of knowledge-driven downstream applications Martínez-Rodríguez et al. (2020); Ngomo et al. (2021); Jiang and Usbeck (2022) and improvements in the truthfulness of statistical models Athreya et al. (2018); Rony et al. (2022). Yet, although KGs are factual about historic instances, they also perpetuate historically emerging social inequalities. Thus, ethical implications must be considered when developing or reusing these technologies.

We showed that most embedding-based measurement approaches for bias are still restricted to a limited number of demographic seeds and attributes. Furthermore, their alignment with social bias as a construct is not sufficiently validated. Some debiasing strategies appear effective within rather narrow definitions of bias. More in-depth scrutiny is required for a broader understanding of bias. Future work should be grounded in an investigation of concepts like gender or ethnic bias and strive for more comprehensive operationalizations and validation studies. Finally, the motivations and conceptualizations should be communicated clearly.

Acknowledgments

We acknowledge the financial support from the Federal Ministry for Economic Affairs and Energy of Germany in the project CoyPu (project number 01MK21007[G]) and the German Research Foundation in the project NFDI4DS (project number 460234259).

References