
Fairness in Image Search: A Study of Occupational Stereotyping in Image Retrieval and its Debiasing

Swagatika Dash
Information School
University of Washington
Seattle, WA 98105, USA
[email protected]
Abstract

Multi-modal search engines have experienced significant growth and widespread use in recent years, making them the second most common use of the internet. While search engine systems offer a range of services, image search has recently become a focal point in the information retrieval community; as the adage goes, a picture is worth a thousand words. Although popular search engines like Google excel at image search accuracy and agility, there is an ongoing debate over whether their search results can be biased in terms of gender, language, demographics, socio-cultural aspects, and stereotypes. This potential for bias can have a significant impact on individuals’ perceptions and influence their perspectives.

In this paper, we present our study on bias and fairness in web search, with a focus on keyword-based image search. We first discuss several kinds of biases that exist in search systems and why it is important to mitigate them. We narrow down our study to assessing and mitigating occupational stereotypes in image search, a prevalent fairness issue in image retrieval. For the assessment of stereotypes, we take gender as an indicator. We explore various open-source and proprietary APIs for gender identification from images. With these, we examine the extent of gender bias in top-ranked image search results obtained for several occupational keywords. To mitigate the bias, we then propose a fairness-aware re-ranking algorithm that optimizes (a) relevance of the search result to the keyword and (b) fairness w.r.t. the genders identified. We experiment on 100 top-ranked images obtained for 10 occupational keywords and consider random re-ranking and re-ranking based on relevance as baselines. Our experimental results show that the fairness-aware re-ranking algorithm produces rankings with better fairness scores and competitive relevance scores compared to the baselines.

Report Organization: The paper is organized as follows. Section 1 is the introductory section that discusses web search in general, the importance of image search, and the ranking of search results. Section 2 summarizes the bias and fairness aspects of search and the challenges in assessing the fairness of image search outputs. We then discuss different categories of biases in image search in Section 3 and cover occupational stereotyping as an important fairness issue in Section 4. Section 5 is devoted to prior art on the mitigation of biases in web search. In Section 6, we present implementation details of models and frameworks for the assessment of gender stereotypes in image search. Based on this and existing work on text-search re-ranking, we devise a fairness-aware re-ranking algorithm, which is discussed in Section 7. Section 8 concludes the paper with pointers to future work.

1 Introduction

Search engines play a significant and decisive role in accessing the digital information ecosystem. Every minute, an estimated 3.8 million queries are processed by the Google search engine, and this number continues to increase exponentially Grind et al. (2019). Search engines are arguably the most powerful pieces of software in the global economy, controlling how much of the world accesses information on the internet. A 2017 international survey found that 86% of people use search engines daily Dutton et al. (2017). Other findings from a study Dutton et al. (2013) include that search engines are one of the first places people go to seek information. Moreover, search engines are the second most common use of the internet after email Dutton et al. (2017). A vast majority of internet-using adults in countries like the U.S.A. also rely on search engines to find and fact-check information Dutton et al. (2013). A study Mitchell et al. (2017) shows that search engines are the second most likely news gateway to inspire follow-up actions such as further searching, online sharing, and talking about the information with others. Hence, search engines have not only attained remarkable growth and usage over a relatively short period; they are also currently proving to be the most trusted source of information Robertson et al. (2018).

The motives behind using a search engine differ for every user. Users formulate search terms differently based on their intentions and likewise expect different results: articles, videos, or even an entire site. Even though queries may not always have unique purposes and outcomes, according to Broder Broder (2002), there are three basic types of search, i.e., (a) informational search queries (where the user looks for certain information), (b) navigational search queries (where the user wants to visit a specific site or find a certain vendor), and (c) transactional search queries (where the user wishes to execute a transaction, for example, buying something). Over the years, the proliferation of internet and web search usage has increased the volume of informational queries, among other forms, by orders of magnitude; this has also given rise to multi-modal search platforms serving queries for images, speech (audio), web pages, knowledge cards, etc.

1.1 Importance of Image Search

Images convey much more information than words. They have a powerful impact on what we recognize and what we remember in the future; at any point in time, they speak louder than words. The web tool MozCast shows that more than 19% of Google searches return images, which suggests that images, rather than text, are becoming the language of the internet. The growth in visual image search has given rise to a large body of research in the field of image information retrieval (IIR).

1.2 Importance of Search Result Ranking

Internet search rankings are known to have a significant impact on users’ perceptions and decisions, mainly because most search users trust and choose higher-ranked results more than lower-ranked ones Epstein & Robertson (2015) and often do not look below the third result Fidel (2012). Surprisingly, even though the highest-ranked results are valued the most, there are no standard qualifiers identifying the top results as the most relevant information for a search keyword Mai (2016). Because of search engines’ proprietary nature, users are unaware of how their algorithms work. The majority of search engine users consider search engine results to be unbiased and accurate Zickuhr et al. (2012). Highly ranked results not only shape a user’s opinion and influence their beliefs and unconscious bias; they can also affect their search interactions and experiences. In this context, it is troubling that these results can often be unfair (biased) in terms of gender, language, demography, socio-cultural aspects, and stereotypes. Bias attaches to search engines through what they index, what they present overall, and what they present to a particular user. This is a deep concern because people are most vulnerable to bias precisely when they are unaware of it. Hence, the romanticized view that the search engine bypasses structural biases and skewed data does not match reality at all. The influence carried by the design decisions of search engines is broad: it affects not only the perceptions of individual information seekers but society at large, influencing our cultures and politics by steering people’s perspectives toward stereotypical, skewed results Robertson et al. (2018).

2 Fairness in Web Search

Fairness in web search is the absence of any prejudice or inclination toward an individual or a group based on their inherent or acquired characteristics. In most current search engines, there is clear evidence of the absence of fairness across all the different dimensions of search, i.e., text, image, audio, and speech. In image search, there are biased associations between the attributes of an image and representations of social concepts. For instance, state-of-the-art unsupervised models trained on popular image datasets like ImageNet automatically learn bias from the way groups of people are stereotypically portrayed on the web Deng et al. (2009). With the proliferation of artificial intelligence, the Internet of Things, and web search and intelligence capabilities in day-to-day life, reducing (if not eliminating) bias is of paramount importance.

2.1 Examples of Unfairness in Image Search

A Google search result in 2016 for the keyword “three white teenagers” spat out happy and shiny men and women laughing and holding sports equipment. However, the search results for “three black teenagers” offered an array of mug shots. Google acknowledged this bias and responded that its search algorithms mirror the availability and frequency of online content. “This means that sometimes unpleasant portrayals of sensitive subject matter online can affect what image search results appear for a given query,” the company said in a statement to the Huffington Post UK Guarino (2016).

The portrayal of black women as sassy and angry presents a disturbing picture of black womanhood in modern society. In Algorithms of Oppression, Safiya Umoja Noble challenges the claim that Google provides equity across all forms of ideas, identities, and activities. She argues that search algorithms privilege whiteness and discriminate against people of color, especially women of color, due to two main factors: (a) the monopoly of a relatively small number of internet search engines and (b) private interests in promoting certain aspects of the images, which are typically made available when a cursor hovers over the result.

2.2 Challenges in Evaluating Image Search Fairness vis-à-vis General Web Search

Image search results are typically displayed in a grid-like structure, unlike web search results, which are arranged as a sequential list. Users can view and scroll results not only vertically but also horizontally. These differences in user behavior lead to challenges in evaluating search results from a user-experience standpoint. There are three key differences between the Search Engine Result Pages (SERPs) of web search and image search: (1) an image search engine typically places results on a grid-based panel rather than in a one-dimensional ranked list, so users can view results both vertically and horizontally; (2) users can view results by scrolling down without clicking a “next page” button, because the image search engine has no explicit pagination feature; and (3) instead of a snippet, i.e., a query-dependent abstract of the landing page, an image snapshot is shown together with metadata Xie et al. (2019). These differences in user experience mean that users have instant access to more results, and because images convey a large amount of information at a glance (as opposed to text or web pages), tackling bias and unfairness in image search results becomes even more important. The following section lists some of the biases typically observed in image search. Note that we use the terms bias and fairness interchangeably, since unfairness can result from certain kinds of biases in search algorithms and procedures.

3 Type of Image Search Biases

In this section, we report certain kinds of biases that are typically seen in image search results.

  1.

    Position Bias: One of the key sources of bias in web search results is that the probability of a click is strongly influenced by a document’s position in the SERP (Search Engine Results Page) Craswell et al. (2008).

  2.

    Confirmation Bias: Most people share a psychological tendency to interpret information in web search results in a way that confirms their preconceptions. The SERP commonly presents messages with diverse perspectives and expertise, all focused on a single topic or search term; search results are perhaps unique in the extent to which they can highlight differing views on a topic. Individual convictions lead to one-sided information processing, and if these convictions are not justified by evidence, people run the risk of being misinformed Schweiger et al. (2014).

  3.

    Domain Bias: Ieong et al. investigated domain bias, a phenomenon in web search whereby users tend to prefer a search result just because it comes from a reputable domain, and found that domains can flip a user’s preference about 25% of the time under a blind domain test.

  4.

    Selection Bias: This bias occurs when a dataset is imbalanced across regions or groups: it over-represents one group and under-represents others. When ML algorithms are trained through web scraping, the search results mostly revolve around the data that is present in vast amounts on the web, so the selection does not reflect a random sample and is not representative of the actual population. This particular bias, which could be referred to as the (re)search bubble effect, is introduced by the inherent, personalized nature of internet search engines, which tailor results according to derived user preferences based on non-reproducible criteria. In other words, internet search engines adjust to their users’ beliefs and attitudes, leading to the creation of a personalized (re)search bubble that includes entries which have not been subjected to a rigorous peer-review process. Internet search engine algorithms are in a state of constant flux, producing differing results at any given moment, even if the query remains identical Ćurković & Košec (2018).

  5.

    Historical Bias: This type of bias stems from socio-economic issues in the world and seeps in gradually through the data generation process. For example, a search for images of nurses returns only a few male nurses, owing to already existing stereotypes encoded in historical data. Suppose that most people in a profession were once male, but that today people of both genders work in it; if the data from a given time frame reflects its creators’ preconceived notions, it may produce historical bias as time progresses Lim et al. (2020).

  6.

    Human Reporting Bias: The frequency of a particular type of content on the web may not reflect the real-world frequency of that content or event, since what people share online is a skewed sample. For example, wedding images from various cultures may not be uniformly uploaded to the web and indexed Kulshrestha et al. (2017).

  7.

    Racial Bias: This form of bias occurs when data skews in favor of particular demographics. For instance, a greater number of images of certain demographics (e.g., younger populations or particular races) are indexed in certain countries simply because those populations access and use web applications more Kulshrestha et al. (2017).

  8.

    Association Bias: This bias occurs when the data for a machine learning model amplifies a cultural bias. For instance, datasets created in an automatic or semi-automatic manner may contain a collection of jobs in which all doctors are men and all nurses are women. This does not mean that women cannot be doctors or that men cannot be nurses Lim et al. (2020).

Various forms of bias can undoubtedly be a potential menace to the active internet population. Some kinds of bias can have an even more adverse impact and should be mitigated both from the system (algorithmic) dimension and the user (behavioral) dimension. In the following section, we delve into one such form of bias, occupational stereotypes, a specific yet important fairness issue in image search results.
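To make the position-bias item above concrete, here is a minimal sketch under the standard examination hypothesis (a click requires the result to be both examined and relevant); the rank-wise examination propensities below are illustrative values, not measurements from any real click log.

```python
def expected_clicks(relevance, propensities):
    """Expected click rate per rank under the examination hypothesis:
    P(click at rank k) = P(examined at rank k) * P(relevant)."""
    return [p * r for p, r in zip(propensities, relevance)]

# Ten equally relevant documents (relevance 0.8 each) ...
relevance = [0.8] * 10
# ... but examination probability decays sharply with rank (illustrative).
propensities = [1.0, 0.6, 0.4, 0.3, 0.2, 0.15, 0.1, 0.08, 0.06, 0.05]

clicks = expected_clicks(relevance, propensities)
# the top result draws roughly 20x the clicks of rank 10, despite equal relevance
print(clicks[0] / clicks[9])
```

Equal relevance everywhere, yet the click counts differ by an order of magnitude: this is exactly the skew that click-trained rankers inherit if position bias is left uncorrected.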

4 Occupational Stereotypes in Image Search Results

Stereotyping is the generalization of a group of people. At times even if it is statistically almost accurate, it is not universally valid. In this context, one of the most prevalent and persistent biases in the United States is portraying and perpetuating inequality in the representation of women on various online information sources Zhao et al. (2018).

A recent study from the University of Washington Langston (2015) assessed gender representations in online image search results for 45 different occupations. The study found that in a few jobs like CEO, women are significantly underrepresented in Google search results. It also claims that, across all other professions, women are slightly underrepresented on average. Other published search results data Silberg & Manyika (2019) show similar trends: for example, women make up 11 percent of the top 100 Google image search results for CEO, in contrast to the actual percentage of women CEOs in the US, which is 27 percent. These biases are highly insidious, as they are transparent neither to the user nor to the search engine designers.

This form of bias is mainly attributed to two factors: (a) slight exaggeration of gender ratios and (b) systematic over- or under-representation of genders Kay et al. (2015). Male-dominated professions have even more men in their search results than real-world distributions warrant. This effect persists even when people rate the quality of search results or select the best image representing an occupation: they unknowingly prefer images whose gender matches the stereotype of that particular occupation.

While ranking images in search results based on quality, people do not systematically prefer either gender; instead, stereotyping dominates the decision-making process, and they prefer images whose gender matches the stereotype for that occupation. Additionally, the skewed search results exhibit biases in how the genders are depicted overall: results that match the gender stereotype of a profession tend to be portrayed as more professional-looking and less inappropriate-looking Kay et al. (2015). Figures 2 and 4 provide insights into such issues and disparities for frequent occupational image search queries.
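As a simple illustration of how such gender skew could be quantified, the sketch below compares the fraction of one gender in a set of top-ranked results against a real-world baseline. The labels and the 27% CEO baseline follow the figures cited above, while the helper function itself is our own hypothetical construction, not the cited studies' exact measurement.

```python
def gender_skew(labels, baseline_female_fraction):
    """Difference between the observed female fraction in top-k results
    and a real-world baseline; negative values mean under-representation."""
    observed = sum(1 for g in labels if g == "female") / len(labels)
    return observed - baseline_female_fraction

# Hypothetical gender labels for the top-10 results of a query like "CEO":
top10 = ["male"] * 9 + ["female"]
# observed 10% female vs. the 27% real-world share cited above
print(round(gender_skew(top10, 0.27), 2))  # -0.17
```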

Figure 2: Distribution of genders across top 10 Google Search results for the query term “Nurse”, as of June 2021
Figure 4: Distribution of genders across top 10 Google Search results for the query term “Computer Programmer”, as of June 2021

For male-dominated professions, the two aforementioned effects, i.e., a slight exaggeration of gender ratios and systematic over- or under-representation of genders, amplify each other; in female-dominated professions, they cancel each other out Kay et al. (2015). The study also revealed that while there may be a slight under-representation of women and a slight exaggeration of gender stereotypes, the results are not completely divorced from reality. As the research and strategy development nonprofit Catalyst reports, women currently hold only 30 (i.e., 6%) of the CEO positions at S&P 500 companies Hopkins et al. (2021).

In 2015, a journalist writing about this study found that when searching for CEOs, the first picture of a woman to appear on the second page of image results was a ”Barbie doll” Hopkins et al. (2021). Furthermore, another study revealed that Google’s online advertising system features advertisements for high-income jobs for male Internet users much more often than female users Schroeder & Borgerson (2015).

Some other examples of occupational stereotypes are as follows. An image search for the keyword “US authors” yields only twenty-five percent women among the search results, in contrast to the actual percentage of 56%. Similarly, the search results for the keyword “telemarketers” depict 64% female pictures, although that occupation is evenly split between men and women in reality Langston (2015).

In one research study from the University of Washington, participants were asked to rank images based on the professionalism depicted in top image results. The study found that the majority gender for a profession tended to be ranked as more competent, professional, and trustworthy Langston (2015). Images of persons whose gender does not match the occupational stereotype are more likely to be rated as provocative or inappropriate, e.g., for construction workers. “Getty Images last year created a new online image catalog of women in the workplace – one that countered visual stereotypes on the internet of moms as frazzled caregivers rather than powerful CEOs” Schroeder & Borgerson (2015).

To establish this as a problem, we need to understand whether such gender stereotyping affects or shifts users’ perceptions of gender dominance in a particular profession. The results of a study by University of Washington researchers Kay et al. (2015) suggest that exposure to skewed image search results shifted users’ perceptions by at least 7%, at least for short-term changes in perception. However, these short-term biases can have a lasting effect over time, ranging from personal perceptions to high-stakes decision-making processes like hiring.

The skewed representation and gender stereotypes in image search results for occupations also contribute to the type of images selected by users. An image that matches the stereotype for an occupation is more likely to be selected as an exemplar result Kay et al. (2015).

From the aforementioned points, we can be certain that occupational stereotypes adversely affect users’ belief systems about different occupations and their related attributes. There is thus a need to mitigate this type of bias in image search results. Some of the existing approaches to alleviating this bias, and our own implementation for de-biasing, are described in the following sections.

5 Existing Approaches to Mitigate Bias in Image Search

Due to the growing dependence on search engines, automatic curation of biased content has become a mandate. Mitigating biases in search results would promote effective navigation of the web and improve users’ decision-making Fogg (2002). One possible way is to collect a large number of ranked search results and re-rank them in a post-hoc manner so that the top results shown to users become fairer. Another is to make changes in the search and ranking algorithms themselves so as to address biases. Post-hoc approaches are model-agnostic and hence preferable in information retrieval, especially considering the black-box nature of search systems.

5.1 Mitigation through Re-ranking of Search Results

Search ranking has made our psychological heuristics and vulnerabilities susceptible to exploitation on an unprecedented scale and in many unexpected ways Bond et al. (2012). Algorithms trained on biased data reflect the underlying bias. This has led to the emergence of datasets designed to evaluate the fairness of algorithms, and benchmarks have been proposed to quantify the discrimination imposed by search algorithms Hardt et al. (2016); Kilbertus et al. (2017). In this regard, the goal of re-ranking search results is to bring more fairness and diversity to the search results without sacrificing relevance. However, due to severely imbalanced training datasets, methods to integrate de-biasing capabilities into these search algorithms remain largely an open problem. Concerns regarding the power and influence of ranking algorithms are exacerbated by the lack of transparency of search engine algorithms Pasquale (2015). The proprietary nature of these systems, and the high level of technical sophistication required to understand their logic, make the parameters and processes used by ranking algorithms opaque Pasquale (2015); Gillespie (2014). To overcome these challenges, researchers have developed techniques inspired by the social sciences to audit the algorithms for potential biases Sandvig et al. (2014). To quantify bias and compute fairness-aware re-ranked results for a search task, such algorithms seek to achieve a desired distribution of top-ranked results with respect to one or more protected attributes like gender and age Epstein et al. (2017). Depending on the choice of desired distribution, this type of framework can be tailored to achieve fairness criteria such as equality of opportunity and demographic parity Geyik et al. (2019).
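A minimal sketch of such a fairness-aware re-ranker, loosely in the spirit of Geyik et al. (2019), is shown below: at each rank it greedily picks the most relevant remaining candidate whose group has not yet exceeded its desired share. The function name, data layout, and ceiling-based quota are our own illustrative choices, not the authors' exact algorithm.

```python
import math

def fair_rerank(candidates, desired, k):
    """Greedy fairness-aware re-ranking: fill each rank with the most
    relevant remaining candidate whose group count stays within its
    desired share of the ranks filled so far."""
    ranked, counts = [], {g: 0 for g in desired}
    pool = sorted(candidates, key=lambda c: -c[1])  # (group, relevance) pairs
    while len(ranked) < k and pool:
        pos = len(ranked) + 1
        # first candidate whose group is under quota; fall back to most relevant
        pick = next((i for i, (g, _) in enumerate(pool)
                     if counts[g] < math.ceil(desired[g] * pos)), 0)
        group, rel = pool.pop(pick)
        counts[group] += 1
        ranked.append((group, rel))
    return ranked

# A relevance-only ranking would place all "m" candidates first.
cands = [("m", .9), ("m", .8), ("m", .7), ("f", .6), ("f", .5), ("f", .4)]
top4 = fair_rerank(cands, desired={"m": 0.5, "f": 0.5}, k=4)
print([g for g, _ in top4])  # ['m', 'f', 'm', 'f']
```

Within each group the original relevance order is preserved, which is why such post-hoc approaches can keep relevance competitive while improving the fairness of the top ranks.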

5.2 The Need for Re-ranking Keyword-based Image Search Results

According to a study Jain & Varma (2011), keyword-based image search has three limitations:

  (a)

    There is no straightforward and fully automated way of going from text queries to visual features. In the search process, visual features are mainly used for secondary tasks like finding similar images. Since the search keyword/query is fed to the search engine as text rather than as an image, search engines are forced to rely on static, textual features extracted from the image’s parent web page and surrounding text, which might not describe its salient visual information.

  (b)

    Image rankers are trained on query-image pairs labeled with relevance judgments determined by human experts. Such labels are well known to be noisy due to various factors including ambiguous queries, unknown user intent, and subjectivity in human judgments. This leads to learning a sub-optimal ranker.

  (c)

    A static ranker is typically built to handle disparate user queries. It is therefore unable to adapt its parameters to suit the query at hand, which might lead to sub-optimal results. In this regard, Jain and Varma Jain & Varma (2011) demonstrated that these problems can be mitigated by employing a re-ranking algorithm that leverages aggregate user click-through data.

There are different methods for re-ranking keyword-based image search results, which are described in the following subsections.

5.3 Reranking by Modeling Users’ Click Data

One way to re-rank search engine results is through user click data. For a given query, if we can identify images that have previously been clicked in response to that query, a Gaussian Process (GP) regressor can be trained on these images to predict their normalized click counts. This regressor is then used to predict normalized click counts for the top-ranked 1000 images, and the final re-ranking is based on a linear combination of the predicted click counts and the original ranking scores Epstein & Robertson (2015). This way of modeling tackles re-ranking nicely while still coping with the limitations described earlier.

  • The GP regressor is trained on not just textual features but also visual features extracted from the set of previously clicked images. Consequently, images that are visually similar to the clicked images in terms of measured shape, color, and texture properties are automatically ranked high.

  • Expert labels might be erroneous or inconsistent, with different experts assigning different levels of relevance to the same query-image pair. Such factors bias the training set, resulting in a learned ranker that is sub-optimal. The click-based re-ranker provides an alternative by tackling this problem directly. The hypothesis is that, for a given query, most of the previously clicked images are highly relevant and should hence be leveraged to mitigate the inaccuracies of the baseline ranker.

  • As the GP regressor is trained afresh on each incoming query, it is free to tailor its parameters to suit the query at hand. For example, images named TajMahal.jpg are extremely likely to be of the Taj Mahal, so for landmark queries the query-image file-name match feature is important. But this feature may be uninformative for city queries: images of Delhi’s tourist attractions are sometimes named delhi.jpg, but so are people’s photographs from their trips to Delhi. In this case, a single static ranker would be inadequate, whereas the GP regressor learns this directly from the click training data and weights the feature differently in the two situations.

The key assumption here is that, for a given query, the clicked images are highly relevant. This holds for image search for an obvious reason: in a normal textual web search, only a two-line snippet of each document is displayed in the results, so a clicked document might not turn out to be relevant to the keyword; relevance can only be determined once the user does or does not read the document. In image search, by contrast, most search results are thumbnails that allow the user to see the entire image before clicking on it. Therefore, users predominantly tend to click on relevant images and most likely discard distracting ones Jain & Varma (2011).
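The linear combination of predicted click counts and original ranking scores described above can be sketched as follows. The weighting parameter, example scores, and click predictions are all hypothetical; in practice a trained GP regressor would supply the click predictions.

```python
def rerank_with_clicks(results, predicted_clicks, alpha=0.5):
    """Final score = alpha * original ranking score + (1 - alpha) *
    predicted normalized click count (both assumed to lie in [0, 1])."""
    scored = [(alpha * s + (1 - alpha) * c, doc)
              for (doc, s), c in zip(results, predicted_clicks)]
    return [doc for _, doc in sorted(scored, reverse=True)]

# Hypothetical baseline scores and click predictions for three images:
# img_b ranks lower on textual features but attracts far more clicks.
results = [("img_a", 0.9), ("img_b", 0.7), ("img_c", 0.5)]
clicks = [0.1, 0.9, 0.4]
print(rerank_with_clicks(results, clicks))  # ['img_b', 'img_a', 'img_c']
```

Tuning alpha trades off trust in the static ranker against trust in the click model; alpha = 1 recovers the original ranking.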

5.4 De-biased Reinforcement Learning Click model (DRLC) for re-ranking search results

Users’ clicks on web search results are a key signal for evaluating and improving web search quality, and hence are widely used in state-of-the-art Learning-To-Rank (LTR) algorithms. However, this has a drawback from the perspective of fairness of the ranked results: these algorithms cannot account for the scenario in which a search result goes unclicked not because it is irrelevant, but because of the low rank assigned to it on the SERP. If this kind of bias in users’ click-log data is incorporated into an LTR model, the underlying bias propagates to the model. In this regard, a reinforcement learning model for re-ranking appears very effective and can avoid the proliferation of position bias in the search results.

5.4.1 Importance of Reinforcement Learning in Information Retrieval Process

Modern information retrieval interfaces typically involve multiple pages of search results, and users are likely to access more than one page. In a common retrieval scenario, the search results are split into multiple pages that the user traverses by clicking the “next page” button. The user generally trusts the ranking of the Search Engine Results Page (SERP) and examines the page by browsing the ranked list from top to bottom, clicking on the relevant documents and returning to the SERP in the same session. A good multi-page search system begins with a static method and continues to adapt its model based on feedback from the user. Here lies the importance of reinforcement learning: this type of relevance feedback method Joachims et al. (2007) has proven very effective for improving retrieval accuracy in interactive information retrieval tasks.

According to the Rocchio algorithm Rocchio (1971), the search engine takes feedback from the user, adds weight to terms from known relevant documents, and subtracts weight from terms in known irrelevant documents. Most feedback methods thus balance the initial query and the feedback information based on a fixed value.
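A minimal sketch of this Rocchio update over term-weight vectors is given below; the weighting constants alpha, beta, and gamma are conventional illustrative defaults, not values prescribed by the original paper, and the toy vectors are hypothetical.

```python
def rocchio(query, relevant, irrelevant, alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio feedback: move the query vector toward the centroid of
    known relevant documents and away from known irrelevant ones."""
    def centroid(docs):
        if not docs:
            return [0.0] * len(query)
        return [sum(d[i] for d in docs) / len(docs) for i in range(len(query))]
    rel_c, irr_c = centroid(relevant), centroid(irrelevant)
    return [alpha * q + beta * r - gamma * s
            for q, r, s in zip(query, rel_c, irr_c)]

# Toy 3-term vocabulary: weights shift toward the relevant document's terms
# and away from the irrelevant document's dominant term.
updated = rocchio([1.0, 0.0, 0.5],
                  relevant=[[0.8, 0.4, 0.0]],
                  irrelevant=[[0.0, 0.0, 1.0]])
print([round(x, 2) for x in updated])  # [1.6, 0.3, 0.35]
```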

Learning-to-rank methods have been widely used in information retrieval, where all documents are represented by feature vectors reflecting their relevance to the query Liu (2011). The learning-to-rank approach aims to learn a scoring function for the candidate documents by minimizing a carefully designed loss function. The work of Zeng et al. Zeng et al. (2018) considers the multi-page search scenario and applies relevance feedback techniques to state-of-the-art learning-to-rank models. Multi-page search is an interactive process between the user and the search engine: at each time step, the search engine selects M documents to construct a ranked list; the user browses this list from top to bottom, clicks the relevant documents, skips the irrelevant ones, and then clicks the “next page” button for more results. In the paper Zeng et al. (2018), the multi-page search process is mathematically formulated as a Markov Decision Process. The search engine is treated as the agent, which selects documents from the remaining candidate document set to deliver to the user to satisfy the user’s information need. The state of the environment consists of the query, the remaining documents, the rank position, and the user’s click information. A soft-max policy is applied to balance exploration and exploitation during training, and the reward is designed based on an IR evaluation metric. A classical policy gradient method based on the REINFORCE algorithm is applied to optimize the search policy.
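The soft-max policy mentioned above can be sketched as follows; the candidate-document scores are hypothetical model outputs, not values from a trained ranker.

```python
import math

def softmax(scores, temperature=1.0):
    """Soft-max selection policy over candidate-document scores: higher
    scores get higher selection probability, but every candidate keeps a
    nonzero chance of being picked, which preserves exploration."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Three candidate documents with hypothetical relevance scores.
probs = softmax([2.0, 1.0, 0.5])
# probabilities follow the score order and sum to 1
```

Raising the temperature flattens the distribution (more exploration); lowering it makes selection nearly greedy (more exploitation).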

Refer to caption
Figure 5: Multi Page Search Process by Zeng et al. Zeng et al. (2018)

Zeng et al. Zeng et al. (2018) proposed a technical schema that uses user feedback on the top-ranked documents to re-rank the remaining documents. Compared with existing methods, their method enjoys the following advantages: (i) it formulates the multi-page process as a Markov Decision Process and applies policy gradient to train the search policy, which can optimize the search evaluation metric directly; (ii) it applies a Recurrent Neural Network to process the feedback and improves the traditional learning-to-rank model with the feedback information, following Rocchio. The authors used the traditional learning-to-rank methods ListNet, RankNet, and RankBoost as initial models, constructed experiments on the OHSUMED dataset, and simulated the interaction between the search engine and the user with the Dependent Click Model (DCM). Experimental results show that their model improves the ranking accuracy of traditional learning-to-rank methods and has better generalization ability.
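The softmax-policy REINFORCE update described above can be sketched as follows. This is a toy sketch, not Zeng et al.'s actual implementation: the linear policy, the feature dimensions, the learning rate, and the binary "click" reward are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy setting: score each candidate document with a linear policy, sample one
# to place next, and reinforce it with the observed reward (standing in for
# an IR-metric-based reward such as a DCG gain from a simulated click).
features = rng.normal(size=(5, 4))   # 5 candidate docs, 4 features each
theta = np.zeros(4)                  # policy parameters
lr = 0.1

for _ in range(100):
    probs = softmax(features @ theta)
    action = rng.choice(len(probs), p=probs)
    reward = 1.0 if action == 2 else 0.0   # pretend doc 2 is the relevant one
    # REINFORCE gradient for a softmax policy: feature of the taken action
    # minus the probability-weighted average feature, scaled by the reward.
    grad = features[action] - probs @ features
    theta += lr * reward * grad
```

After training, the policy assigns document 2 a higher selection probability than the uniform baseline, mimicking how feedback shifts the ranking policy toward documents that earn reward.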

Zhou et al. Zhou et al. (2021) proposed a De-biased Reinforcement Learning Click model (DRLC) that relaxes previously made assumptions about users' examination behavior and the resulting latent bias. To implement this model, CNNs are used as the value network for reinforcement learning, trained to learn a policy that reduces bias in click logs. The experiments demonstrated the effectiveness of DRLC in reducing bias in click logs, leading to improved modeling performance and showing its potential for improving web search quality.

It is worth mentioning that probabilistic Graphical Model Frameworks (GMFs) have traditionally been used for search result ranking, and these models broadly operate in one of two directions:

  (a) Direction 1: This considers the search process as a sequence of events. Click prediction is based on probability models and assumptions. While such models are flexible and interpretable, they are limited by a weak learning model with fewer features.

  (b) Direction 2: This considers the search process as represented by vectors. While this allows a variety of features to be considered easily and fed into a stronger learning model such as neural networks Chakraborty et al. (2000), it cannot account for bias in an interpretable way.

Since DRLC is also a PGM-based method, it can be organized flexibly for different ranking scenarios and yields an interpretable model that reduces a variety of biases. At the same time, DRLC is built on a more dynamic system, reinforcement learning Sutton & Barto (2018); Zhou & Agichtein (2020), which allows it to take advantage of stronger learning models (neural networks). The DRLC model can thus overcome issues faced by traditional GMFs.

5.5 Best Practices and Policies to Mitigate Search Engines’ Algorithmic Bias

According to Lee et al. [30], understanding various causes of biases is the first step for adopting effective algorithmic hygiene. But it is a challenging task to assess the search results for bias. Even when flaws in the training data are corrected, the results may still be problematic because context matters during the bias detection phase. When detecting bias, computer programmers generally examine the set of outputs that the algorithm produces to check for anomalous results. However, the downside of these approaches is that not all unequal outcomes are unfair; even error rates are not a simple litmus test for biased algorithms. In this regard, below are some of the evaluation protocols and metric formulations.

  • Conducting quantitative experimental studies on bias and unfairness.

  • Defining objective metrics that consider fairness and/or bias.

  • Formulating bias-aware protocols to evaluate existing algorithms.

  • Evaluating existing strategies in unexplored domains.

In the following sections, we discuss our implementations and insights on assessing occupational stereotypes in image search results and on debiasing the search results through re-ranking. As gender stereotyping is a prevalent issue in occupational search results, we explore various techniques for detecting the gender distribution of a given set of images (corresponding to a set of images retrieved by a search engine). This is presented in Section 6. After this, in Section 7, we present a framework for re-ranking image search results to make the results fairer while preserving the relevance of the retrieved images with respect to the input query.

| Name | Source | Framework | Input | Output | Size |
| SSR-Net Yang et al. (2018) | t.ly/MFwu | Keras/TensorFlow | (64, 64, 3) | Real | 0.32 MB |
| ConvNet Levi & Hassner (2015) | t.ly/FPFT | Caffe | (256, 256, 3) | Binary | 43.5 MB |
| Inception-V3 (Carnie) | t.ly/SxXJ | TensorFlow | (256, 256, 3) | Binary | 166 MB |
| ConvNet (Jiang) | t.ly/IMSY | TensorFlow | (160, 160, 3) | Real | 246 MB |
| ConvNet (Chengwei) | t.ly/G9ff | Keras/TensorFlow | (64, 64, 3) | Real | 186 MB |
| ConvNet (Serengil) | t.ly/6WH9 | Keras/TensorFlow | (224, 224, 3) | Real | 553 MB |
Table 1: Existing approaches to gender detection from facial data
Refer to caption
Figure 6: Top 5 search results for the keyword “biologist” (circa: June 2021)

6 Implementation: Automatic Assessment of Occupational Stereotypes

In this section, we investigate and re-implement a few existing frameworks for occupational stereotype assessment of image-search results. One key component of the analysis of the fairness of search results for occupations is to get insights into the distribution of gender in the ranked search results. To this end, the following sections aim to discuss our implementations, experimental results, and assessment of challenges involved in computing the distribution of gender in search results.

6.1 Automatic Detection of Gender of Image Search Results: Open Source Systems

Automatic gender detection from images deals with classifying images into gender categories such as male, female, both, and uncertain. Of late, gender identification has become relevant to an increasing number of applications related to facial recognition and analysis, especially since the rise of social media. Gender detection typically relies on spatial features of the human body, specifically the face, which requires identification and alignment of facial elements in the image. The problem becomes harder in unconstrained environments, such as web-search-retrieved images, primarily due to variations in pose, illumination, occlusion, and interference of other objects with the facial region of interest. Figure 6 shows this through the top-5 search results obtained from a popular search engine for the keyword “biologist”. As can be seen, the result images can be diverse and very different from the frontal face images typically used by applications requiring gender identification.

In the following section, we summarize some of the existing works on this problem and enlist a few off-the-shelf methods that can be tried for fairness assessment related to gender.

6.1.1 Existing Open Source Systems

Following the article by Chernov Chernov (2019), we summarize some of the existing projects on gender identification in Table 1 that can be tried off-the-shelf on image search results.

Most current gender identification models rely on a face-detection pre-processing step (details given in Section 6.1.2), and several models exist for that as well. Some of the popular implementations are:

  • OpenCV Haar Cascade: OpenCV (https://opencv.org/) is a collection of APIs for real-time computer vision. It provides a powerful cascade classifier for object detection Viola & Jones (2001) along with pre-trained machine learning models for frontal face detection, implementing classical machine-learning-based classification techniques.

  • ConvNet by Levi and Hassner: Levi and Hassner Levi & Hassner (2015) have released a set of deep learning models for face detection and for gender and age identification.

  • MultiTask CNN: The multitask model for joint face alignment and detection Zhang et al. (2016) learns the face alignment and identification tasks mutually, so that each task benefits the other.

  • Facelib: Facelib (https://github.com/sajjjadayobi/FaceLib) implements a MobileNets Howard et al. (2017) based face detection module. It also has a self-trained gender predictor based on the UTKFace dataset Zhang et al. (2017).

  • Detectron: Detectron Wu et al. (2019) is a popular deep-learning-based tool for object detection and image segmentation that implements masked region-based CNNs (Mask R-CNN) for object segmentation. Although the pretrained models from Detectron and Detectron2 may not provide segmentation for faces, they can help extract segments corresponding to persons in the images. This is particularly beneficial when the images do not contain clear faces, but partial and oriented human faces and bodies.

Refer to caption
Figure 7: Pipeline architecture Rothe et al. (2015) commonly followed for gender identification from images

6.1.2 Central Idea

Most of the existing approaches to gender identification from images follow a pipeline approach, as shown in Figure 7. The input image is pre-processed (sometimes gray-scaled) and then passed through a face/body identification model that extracts the region of interest, i.e., the human face or body. The original image is then cropped to retain the identified portion, which is then given to a gender/age prediction model. The gender prediction model is typically a deep neural network such as a multi-layered Convolutional Neural Network (CNN), which extracts gender-specific semantic representations from the cropped image and passes this information on to a dense classification layer.
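Structurally, the pipeline above can be sketched as the composition of a detector, a cropper, and a classifier. The sketch below uses stand-in callables to show only the data flow; no real face-detection model is invoked, and all names are ours:

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) of a detected region

def gender_pipeline(image,
                    detect_faces: Callable[[object], List[Box]],
                    crop: Callable[[object, Box], object],
                    classify: Callable[[object], str]) -> List[str]:
    """Pre-process -> detect face region(s) -> crop -> classify each crop.
    Returns one predicted label per detected face; empty list if no face."""
    labels = []
    for box in detect_faces(image):
        face = crop(image, box)
        labels.append(classify(face))
    return labels

# Stand-ins illustrating the flow (a real system would plug in, e.g., an
# MTCNN detector and a CNN gender classifier at these two slots):
image = [[0] * 64 for _ in range(64)]            # dummy 64x64 "image"
fake_detector = lambda img: [(0, 0, 32, 32)]     # pretend one face was found
fake_crop = lambda img, b: [row[b[0]:b[0] + b[2]] for row in img[b[1]:b[1] + b[3]]]
fake_classifier = lambda face: "female"
print(gender_pipeline(image, fake_detector, fake_crop, fake_classifier))  # → ['female']
```

Keeping the detector and classifier as swappable components is what allows the mix-and-match evaluation of pipelines reported later in Table 3.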

Since we are interested in conducting inference on search results, we should ideally consider pre-trained models (like the ones described in Section 6.1.1). It goes without saying that existing models exhibit acceptable-to-superior performance on frontal faces, as they are often trained on datasets like ImageNet or UTKFace containing frontal face images. Their performance, however, degrades in unconstrained environments, such as web-search-retrieved images. In the following section, we discuss some of the challenges related to gender identification on such images.

6.1.3 Challenges in Estimation of Gender using Existing Frameworks

Image search can return a truly diverse set of images with varying poses, illuminations, occlusions, artifacts, orientations, and quality. This poses additional challenges for existing pre-trained gender detection frameworks. Some of the challenges we observed through our challenge dataset (refer to Section 6.1.4) are given below:

  1. Face/body partially visible and/or misaligned: Unlike frontal images, images from keyword search may be misaligned or only partially informative. For example, Figure 6 (a), (b), and (c), which correspond to top-ranked results, do not provide enough information about faces and gender. Occupation-related searches, in particular, retrieve images of persons focused on their work, which results in a higher number of non-frontal, partial, and misaligned images.

  2. Interference of objects: For occupational keyword search, the retrieved images may have occupation-related objects/instruments blocking faces or bodies. For example, biologists are often pictured with instruments such as microscopes, or wearing masks.

  3. Images with varying quality and resolution: Since web search results are simply indexed images, they may vary in resolution, frame height and width, and aspect ratio. Hence, processing all the retrieved images uniformly may not yield ideal results.

| # | Search Term | Abbr. | %M | %F | %Both | %Uncertain |
| 1 | “biologist” | BIO | 27 | 43 | 11 | 19 |
| 2 | “chief executive officer” | CEO | 51 | 11 | 6 | 32 |
| 3 | “cook” | COOK | 23 | 30 | 13 | 34 |
| 4 | “engineer” | ENG | 43 | 18 | 21 | 18 |
| 5 | “nurse” | NUR | 1 | 50 | 43 | 1 |
| 6 | “police officer” | POL | 70 | 10 | 18 | 2 |
| 7 | “primary school teacher” | PST | 5 | 11 | 81 | 3 |
| 8 | “computer programmer” | PRO | 38 | 13 | 13 | 36 |
| 9 | “software developer” | SD | 26 | 11 | 14 | 49 |
| 10 | “truck driver” | TD | 44 | 7 | 0 | 49 |
Table 2: Search terms and gender distribution in the collected image search dataset

6.1.4 Creation of Challenge Image-Search Datasets for Evaluation

From our initial inspection of search results for occupational queries such as “biologist”, we observed that the retrieved images differ significantly from the typical frontal face images used by most gender identifiers. We therefore prepared a manually labeled challenge dataset for evaluating gender identification systems, targeted at assessing and de-biasing search systems. The dataset was created by manually assigning gender labels to the top 100 Google image search results for each keyword. We collected search results for 10 occupation keywords, as shown in Table 2. For annotation, we considered 4 labels: (a) Male, (b) Female, (c) Both, and (d) Uncertain. For labeling, we leveraged the Amazon Mechanical Turk framework, where workers were asked to assign one of these labels to each image. Appropriate guidelines were given to them for ambiguous cases, especially those where faces or gender-specific attributes are not clearly visible in the images.

As shown in Table 2, for certain occupations like nurse, the results are heavily biased towards the female gender, whereas the opposite holds for a male-dominated occupation such as truck driving. It is also interesting to note that for certain occupations like cook and software developer, the top-ranked images, more often than not, do not provide any gender-specific information.

| Candidate Pipeline | BIO | CEO | COOK | ENG | NUR | POL | PST | PRO | SD | TD | Avg. |
| Cascade + Facelib | 0.31 | 0.70 | 0.42 | 0.33 | 0.36 | 0.41 | 0.07 | 0.42 | 0.51 | 0.57 | 0.41 |
| ConvNet + Facelib | 0.39 | 0.81 | 0.55 | 0.38 | 0.19 | 0.53 | 0.19 | 0.48 | 0.54 | 0.64 | 0.47 |
| MTCNN + Facelib | 0.49 | 0.87 | 0.64 | 0.54 | 0.31 | 0.68 | 0.33 | 0.61 | 0.60 | 0.81 | 0.58 |
| Facelib + Facelib | 0.26 | 0.68 | 0.39 | 0.21 | 0.29 | 0.30 | 0.08 | 0.41 | 0.50 | 0.51 | 0.363 |
| Detectron2 + Facelib | 0.46 | 0.65 | 0.32 | 0.53 | 0.39 | 0.69 | 0.80 | 0.54 | 0.50 | 0.76 | 0.564 |
| Detectron2-MTCNN + Facelib | 0.55 | 0.73 | 0.43 | 0.55 | 0.51 | 0.72 | 0.53 | 0.55 | 0.51 | 0.83 | 0.591 |
| Cascade + ConvNet | 0.34 | 0.77 | 0.45 | 0.32 | 0.32 | 0.41 | 0.12 | 0.39 | 0.52 | 0.56 | 0.42 |
| ConvNet + ConvNet | 0.27 | 0.73 | 0.46 | 0.33 | 0.26 | 0.41 | 0.12 | 0.42 | 0.51 | 0.56 | 0.40 |
| MTCNN + ConvNet | 0.52 | 0.78 | 0.59 | 0.51 | 0.56 | 0.65 | 0.43 | 0.58 | 0.57 | 0.75 | 0.594 |
| Facelib + ConvNet | 0.25 | 0.68 | 0.38 | 0.22 | 0.22 | 0.27 | 0.07 | 0.41 | 0.50 | 0.50 | 0.35 |
| Detectron2 + ConvNet | 0.47 | 0.65 | 0.38 | 0.52 | 0.30 | 0.59 | 0.76 | 0.48 | 0.58 | 0.71 | 0.544 |
| Detectron2-MTCNN + ConvNet | 0.55 | 0.80 | 0.41 | 0.59 | 0.49 | 0.69 | 0.51 | 0.57 | 0.55 | 0.74 | 0.59 |
Table 3: Accuracy of gender detection on the test dataset for various open source candidate systems
Refer to caption
Figure 8: Combination of systems on which evaluation is carried out

6.1.5 Experimental Setup

We now describe our experimental setup. Our candidate frameworks are shown in Figure 8. We pick 5 different APIs for face detection; these APIs provide contours for faces/bodies. The images are then cropped based on the contour information with the help of the OpenCV-DNN library. We then use two different models available for gender prediction, resulting in 10 different variants. Additionally, we implement a fallback mechanism for face detection, wherein if one mechanism (e.g., MTCNN) fails to detect faces, the pipeline falls back to another (e.g., Detectron2). Experiments are run with default configurations, and evaluation is carried out on the challenge dataset described in Section 6.1.4. The source code for this experiment is available at https://github.com/swagatikadash010/gender_age.git.

6.1.6 Evaluation Results

The results are shown in Table 3. We see that the Detectron2-MTCNN fallback mechanism for detecting faces works well with both types of gender detectors. Pipelines with MTCNN-based face detectors give competitive performance and can be used when running time needs to be reduced (Detectron-based pipelines take around 10x longer to complete). It is also worth noting that for occupations where the ground-truth gender distribution is imbalanced, the performance of all the variants drops. This is expected, as the chances of false positives and false negatives growing are higher when datasets are imbalanced.

The confusion matrix for the best-performing system is given in Table 4. Looking at this matrix, we see that the model misidentifies 15 females as males, which is not satisfactory. For males, however, the model performs better, correctly identifying most of the images containing males.

| Actual \ Predicted | Female | Male | Both | Uncertain |
| Female | 23 | 15 | 4 | 2 |
| Male | 6 | 18 | 2 | 1 |
| Both | 2 | 7 | 2 | 0 |
| Uncertain | 3 | 2 | 1 | 13 |
Table 4: Confusion matrix for best performing system i.e., Detectron2-MTCNN + ConvNet
Refer to caption
Figure 9: Accuracy scores reported by the Amazon Rekognition model for all occupations (average accuracy = 79.49%)

While open-source pipelines for gender identification are easily accessible and provide transparent and replicable outcomes, their performance is unreliable on open-ended images, because most of the pre-trained models treat frontal face data as their source of truth. One possible mitigation is to train the same systems on large-scale open-ended images; datasets such as Google's Open Images Dataset (https://opensource.google/projects/open-images-dataset) can be used for this purpose. Additionally, certain architectural changes are needed to capture and aggregate non-facial features, such as features from the body, hands, and surrounding environment. To this end, the work by Pavlakos et al. Pavlakos et al. (2019) is relevant, as it aims to detect demographic attributes from the positions of humans in the image by modeling expressive body capture.

Refer to caption
(a) Male
Refer to caption
(b) Both Male and Female
Refer to caption
(c) Male
Figure 10: Images that are not detected by Rekognition as of June 2021

6.2 Automatic Detection of Gender of Image Search Results: Amazon Rekognition

With the best open-source APIs, we obtained a maximum accuracy of 83% and an average accuracy across all occupations of around 60%. We also considered one proprietary system for gender detection, Amazon Rekognition, for detecting the gender of the faces in an image. Amazon Rekognition is based on a scalable, deep-learning-based architecture for analyzing billions of images and videos, with higher accuracy than its open-source counterparts. It includes a simple, easy-to-use API that can quickly analyze any image or video file stored in Amazon S3. For our project, we recorded the “FaceDetails” response for an image and retrieved the “Gender” attribute for the faces detected in it. With Amazon Rekognition, the average accuracy of gender detection is 79.49%, and the maximum accuracy score, 93.94%, is obtained for the query term “truck driver”. We therefore adopted this proprietary system for our further analysis and the implementation of our de-biasing algorithm (Section 7). In future work, we will also make efforts to improve the accuracy of gender detection with open-source APIs.

Some of the known issues in the Amazon Rekognition system include low accuracy in detecting rear-view and non-frontal faces in general. Moreover, if an image contains faces or gender features that are overshadowed by other objects, the detection accuracy goes down. Figure 10 shows some such examples.

Refer to caption
Figure 11: Distribution of male and female in top 30 Image Search Results

6.3 Insights from the Distribution of Gender (Detected by Amazon Rekognition) Across Top 30 Google Search Results

If we consider the top 30 Google search results for all occupations and the gender distribution across them, we get the statistics presented in Figure 11. Note that we have ignored images for which Rekognition detects “no face” or “both male and female”.

From the plot, we can see that the distribution of males and females is somewhat balanced for some occupations, such as biologist and computer programmer. However, for occupations like nurse and primary school teacher, the distribution is heavily skewed towards females. This shows a bias in image search results for these occupations.

This sums up our exploration and implementation of frameworks for assessing occupational stereotyping in image search. While we focused on only one vital aspect of occupational stereotyping, i.e., gender estimation, we believe that the underlying frameworks and architecture can be extended to other measures of stereotyping, such as race and ethnicity. In the following section, we discuss our implementation of image re-ranking techniques that aim to produce a fairer ranked set of images for a given query (than vanilla search engine outputs), while preserving the relevance of the images with respect to the input query.

7 Implementation: Fairness Aware Re-Ranker

7.1 Why re-ranking?

Ranking reflects a search engine's estimated relevance of web pages (or, in our case, images) to the query. However, each search engine keeps its underlying ranking algorithms secret. Search engines vary in their underlying ranking implementation, the display of ranked search results, and the presentation of related pages. Proprietary systems like Google may consider many factors along with relevance (some of them personalized to a specific user), and these may be tuned in many ways. The findings of a study by Zhao (2004) suggest that a higher rank in a Google retrieval requires a combination of factors such as Google PageRank, the popularity of the website, the density of keywords on the home page, and the keywords in the URL. Given the secrecy of, and the rapid, frequent changes to, these ranking algorithms, it is impossible to have an authoritative description of them. To make search results fairer, one may argue that search engines and their underlying ranking algorithms could be taught to provide fairer rankings without compromising relevance. This is, however, a hard ask for external developers, given the opaque nature of search engines. Another way is to re-rank the retrieved results in a post-hoc manner, optimizing fairness and relevance together. Post-hoc re-ranking not only makes the re-ranking framework agnostic to the underlying search and ranking procedure, but is also far more scalable, transparent, and controllable.

7.2 Existing Work on Re-ranking

Though systems for re-ranking image search results remain elusive at this point, there have been several attempts to re-rank text/webpage search results. For example, the TREC 2019 and 2020 Fair Ranking tasks Biega et al. (2020) invite and evaluate systems for fair ranking, while maintaining relevance, for search algorithms designed to retrieve academic papers. Precisely, the goal is to provide fair exposure to different groups of authors while maintaining good relevance of the ranked papers with regard to given queries. Most participating systems, such as Feng et al. (2020), define a cost function that indicates how relevant a document is for a given query and how fair it is to a certain author group if it is ranked at a certain position. The author groups here may correspond to the authors' country or gender. For relevance, a document relevance metric such as the BM25 score Robertson et al. (1995) is considered. For fairness, Feng et al. Feng et al. (2020; 2021) consider the Kullback-Leibler (KL) divergence between the group distribution of the ranked list created at a certain step and that of the whole retrieved corpus. The rationale behind the fairness cost is this: if, at any position in the final ranked list, the documents ranked so far represent an author-group distribution close to the author-group distribution of the whole corpus (as measured by KL divergence), the re-ranked documents will exhibit more fairness towards the author groups. Feng et al. consider off-the-shelf systems for detecting author-group attributes such as country and gender. We implement this strategy for re-ranking image results, as described in the following sections.

Refer to caption
Figure 12: A pictorial representation of measuring relevance score

7.3 Methodology

We propose a fair ranking algorithm for images incorporating both relevance and fairness with respect to the gender distribution for the given search query. The objective is to assign a higher rank to an image that maximizes a defined relevance score (or, equivalently, minimizes the relevance cost) while ensuring fairness. For relevance cost measurement, we propose the scheme depicted in Figure 12. For a given query (say, “biologist”) and a retrieved image, we first extract a set of object labels from the image, using an off-the-shelf system such as Amazon Rekognition (the “detect_labels” handle) for object detection. Once object terms are identified, we extract their word-embedding representations using GloVe embeddings Pennington et al. (2014), specifically the “glove-wiki-gigaword-300-binary” pretrained model. GloVe is an unsupervised learning algorithm for obtaining vector representations of words; training is performed on aggregated global word-word co-occurrence statistics from a large corpus, and the resulting representations showcase interesting linear substructures of the word vector space. Once embeddings are extracted, they are averaged, and we compute the cosine distance between the averaged object-term vector and the vector representing the query word. Intuitively, this distance indicates how dissimilar the set of objects is to the query: the greater the dissimilarity, the higher the relevance cost.
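The relevance-cost computation in Figure 12 can be sketched as follows. Toy 3-dimensional embeddings stand in for the 300-dimensional GloVe vectors here; the lookup table, its values, and the function names are illustrative assumptions:

```python
import numpy as np

# Toy word embeddings standing in for the pretrained GloVe model
EMB = {
    "biologist":  np.array([0.9, 0.1, 0.0]),
    "microscope": np.array([0.8, 0.2, 0.1]),
    "lab":        np.array([0.7, 0.3, 0.0]),
    "truck":      np.array([0.0, 0.1, 0.9]),
}

def avg_embedding(terms):
    """Average the embeddings of a list of terms."""
    return np.mean([EMB[t] for t in terms], axis=0)

def cosine_dist(u, v):
    return 1.0 - (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def relevance_cost(query, object_labels):
    """Cost = cosine distance between the query vector and the averaged
    object-label vector; more dissimilar objects -> higher cost."""
    return cosine_dist(avg_embedding([query]), avg_embedding(object_labels))

# Objects plausibly seen in a biologist image should cost less than a truck
assert relevance_cost("biologist", ["microscope", "lab"]) < \
       relevance_cost("biologist", ["truck"])
```

In the actual implementation, the `EMB` lookup would be replaced by the gensim-loaded GloVe model and `object_labels` by the labels returned by the object detector.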

Our re-ranking method is as follows. Assume that for a given query q, a set of images I′ has already been retrieved by an image search engine from a large indexed image corpus I. Our intention is to re-rank I′ to form a re-ranked list R. We initialize R with an empty list and gradually move images from I′ into R. At each step, the idea is to select the image i that minimizes the overall cost of adding it to R. The overall cost is given below:

C(i, w, R, I′, q) = w_r · cosine_dist(w2v([objects_i]), w2v([q])) + w_g · KL(p(g, R + i) || p(g, I′))    (1)

where w_r and w_g are user-defined weights that control how much each term contributes to the overall cost of adding an image (w_r + w_g = 1). w2v(·) and cosine_dist(·) are functions that compute the average word embedding of given terms and the cosine distance between two embeddings (vectors), respectively. objects_i denotes the objects identified in image i, p(g, ·) denotes the distribution of the group property (in our case, gender) in a given set of observations, and KL(·) is the Kullback-Leibler divergence between two distributions. The rationale behind this formulation is similar to that of Feng et al. Feng et al. (2020; 2021).
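The greedy loop implied by Equation 1 can be sketched as follows. This is a minimal illustration assuming precomputed per-image relevance costs and gender labels; the epsilon-smoothing of the empirical distributions (to keep the KL term finite) and all names are our additions:

```python
import math
from collections import Counter

def kl_divergence(p, q):
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

def distribution(labels, groups=("male", "female"), eps=1e-9):
    """Empirical gender distribution, clamped away from zero."""
    c, n = Counter(labels), max(len(labels), 1)
    return {g: max(c[g] / n, eps) for g in groups}

def rerank(images, rel_cost, gender, w_r=0.5, w_g=0.5):
    """Greedily build R from I': at each step pick the image i minimizing
    w_r * relevance_cost(i) + w_g * KL(gender dist of R+i || gender dist of I')."""
    corpus_dist = distribution([gender[i] for i in images])
    remaining, ranked = list(images), []
    while remaining:
        def cost(i):
            fair = kl_divergence(distribution([gender[j] for j in ranked + [i]]),
                                 corpus_dist)
            return w_r * rel_cost[i] + w_g * fair
        best = min(remaining, key=cost)
        remaining.remove(best)
        ranked.append(best)
    return ranked

# Toy example: 2 male and 2 female images, males slightly more "relevant"
imgs = ["m1", "m2", "f1", "f2"]
rel = {"m1": 0.10, "m2": 0.12, "f1": 0.15, "f2": 0.18}
gen = {"m1": "male", "m2": "male", "f1": "female", "f2": "female"}
print(rerank(imgs, rel, gen))  # → ['m1', 'f1', 'm2', 'f2']
```

With equal weights the greedy rule interleaves the genders to track the 50/50 corpus distribution, whereas setting w_r = 1 recovers a pure relevance-cost ordering.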

7.4 Experimental Setup

We now describe our experimental setup, dataset, implementation, and evaluation details. The implementation can be checked out from https://github.com/swagatikadash010/Image_Search_Bias.

7.4.1 Dataset

We use the same dataset described in Section 6.1.4: the top 100 Google search results for each of the 10 occupations, i.e., “Biologist”, “CEO”, “Cook”, “Engineer”, “Nurse”, “Police Officer”, “Primary School Teacher”, “Programmer”, “Software Developer”, and “Truck Driver”. The original rankings given by Google are used as ground truth for relevance, serving as the reference for relevance metric computation. Additionally, we have ground-truth files recording the gender of the persons present in the images, obtained through crowdsourcing. These help in evaluating the fairness metric.

7.4.2 Baselines and Systems for Comparison

We use random ranking and relevance-only ranking as our baselines for comparison. We vary the weights w_r and w_g given in Equation 1, experimenting with w_r ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and w_g = 1 − w_r. It is worth noting that, unlike Feng et al. Feng et al. (2021), we do not normalize the relevance cost, since in practice the cosine distance is bounded between 0 and 1 for word embeddings.

Refer to caption
Figure 13: Performance plot for different models averaged over 10 occupations
Refer to caption
(a)
Refer to caption
(b)
Figure 15: Performance plot for queries “CEO” and “Nurse”

7.4.3 Evaluation Criteria

We consider two metrics for evaluating the systems: (a) relevance and (b) fairness. For relevance, we consider the bucketed ranking accuracy of the systems. Unlike document search, absolute ranks are less meaningful here, because image search results are often displayed in a grid rather than a list; we surmise that images within a given bucket are equally relevant. For example, if the first 30 images are visible to the user on the first page, the next 30 on the second page, and so on, then the images in the first bucket of size 30 are equally relevant. Based on this idea, we map the ground-truth relevance ranks (Google's ranks) and the predicted ranks into buckets of 30 and compute the relevance score as follows:

Relevance = #(predicted_rank == ground_truth_rank) / #total_images    (2)
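The bucketed relevance metric can be sketched as follows (bucket size 30, per the page-size assumption above; the function name and toy ranks are illustrative):

```python
def bucket_relevance(predicted_ranks, ground_truth_ranks, bucket_size=30):
    """Fraction of images whose predicted rank falls in the same
    page-sized bucket as their ground-truth (Google) rank."""
    assert len(predicted_ranks) == len(ground_truth_ranks)
    matches = sum(1 for p, g in zip(predicted_ranks, ground_truth_ranks)
                  if p // bucket_size == g // bucket_size)
    return matches / len(predicted_ranks)

# Toy example: ranks 0-29 fall in bucket 0, ranks 30-59 in bucket 1, ...
pred = [3, 12, 31, 95]
truth = [5, 40, 33, 90]
print(bucket_relevance(pred, truth))  # 3 of 4 pairs share a bucket → 0.75
```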

For fairness, we follow Geyik et al. Geyik et al. (2019) and compute the Normalized Discounted KL Divergence (NDKL) as a measure of the degree of unfairness. This is given as:

Unfairness(R) = (1/Z) · Σ_{i=1}^{|R|} [1 / log₂(i+1)] · KL(p(g, R_i) || p(g, D′))    (3)

where Z = Σ_{i=1}^{|R|} 1 / log₂(i+1), R_i denotes the top-i prefix of the ranked list R, and D′ denotes the whole retrieved set.

Refer to caption (subfigure panels (a)–(o))
Figure 17: Example rankings given by Google (Row 1), relevance only (Row 2) and weighted relevance and fairness model (Row 3) for Keyword “Engineer”. Search was conducted in June 2021.

Intuitively, this metric penalizes the degree of unfairness (computed through KL divergence) observed in higher-ranked documents and discounts unfairness as we proceed through the rankings.

7.5 Results and Analysis

The overall results are plotted in Figure 13. As expected, the random baseline does not re-rank well, and a trade-off between relevance and fairness is evident. There is no clear winner that optimizes both aspects best; a model can be selected based on how sensitive each of the two aspects is for a specific search application.

While for most of the occupations we get plots similar to the one shown above, queries like “CEO” and “Nurse” yielded different observations. Figure 15 presents the performance plots for these two keywords. For “CEO”, the overall relevance score is very low even for the relevance-only baseline. This is because, for CEO, the extracted labels are not semantically very relevant to the keyword “Chief Executive Officer”; the minimum relevance cost is 0.6974 across all 100 images. For “Nurse”, the gender distribution of the retrieved corpus is very heavily biased towards females (around 72%). Biases of this kind may prevent the KL divergence term from yielding meaningful indications of fairness cost.

Figure 17 presents some anecdotal examples where we qualitatively compare images ranked by Google, by the relevance-only baseline, and by our fairness-aware algorithm. The query keyword here is “Engineer”. As expected, while the relevance-only model provides images that are quite relevant to the query term, they are heavily male-dominated. This is mitigated by our fairness-aware algorithm (here we set w_{r}=w_{g}=0.5), which distributes the images better across genders.
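To illustrate how a weighted relevance-fairness cost of this kind can drive re-ranking, here is a greedy sketch: at each rank it picks the remaining image that maximizes w_r times its relevance minus w_g times the KL divergence of the resulting prefix's gender distribution from the desired one. This is an illustrative reconstruction under our assumptions, not the paper's exact algorithm; the item tuples and function names are hypothetical.

```python
import math

def _kl(p, q, eps=1e-12):
    """KL divergence between discrete gender distributions."""
    return sum(v * math.log((v + eps) / (q.get(g, 0.0) + eps))
               for g, v in p.items() if v > 0)

def greedy_fair_rerank(items, desired, w_r=0.5, w_g=0.5):
    """Greedy fairness-aware re-ranking (illustrative sketch).

    items:   list of (image_id, relevance, gender) tuples, relevance in [0, 1].
    desired: target gender distribution, e.g. {'male': 0.5, 'female': 0.5}.
    Score of placing a candidate next:
        w_r * relevance - w_g * KL(prefix gender dist || desired)."""
    remaining = list(items)
    ranking, counts = [], {}
    while remaining:
        best, best_score = None, -math.inf
        for cand in remaining:
            _, rel, gender = cand
            # Gender counts if this candidate were appended.
            c = dict(counts)
            c[gender] = c.get(gender, 0) + 1
            n = len(ranking) + 1
            prefix = {g: k / n for g, k in c.items()}
            score = w_r * rel - w_g * _kl(prefix, desired)
            if score > best_score:
                best, best_score = cand, score
        ranking.append(best)
        counts[best[2]] = counts.get(best[2], 0) + 1
        remaining.remove(best)
    return ranking
```

With w_r=w_g=0.5 and a balanced target, a slightly less relevant female-labeled image can be promoted over a more relevant male-labeled one, which is exactly the behavior visible in Row 3 of Figure 17; setting w_g=0 recovers the relevance-only ordering.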

8 Conclusion and Future Directions

Search and retrieval systems have become an integral part of human lives and have recently attained remarkable success and user trust. Their efficacy in retrieving and representing results fairly, however, remains below par, which has raised concerns in the information retrieval community. In this paper, we discussed our explorations of fairness in search-engine ranking, primarily focusing on gender stereotyping in occupational keyword-based image search. We discussed the fairness issues that arise from default search and retrieval mechanisms and proposed a fairness-aware ranking procedure that can help mitigate the bias. For gender bias assessment, we employed both open-source pre-trained models and a proprietary system, Amazon Rekognition, for gender identification, which helped us gauge the gender bias in search results obtained for several occupational keywords. For de-biasing, our proposed ranking algorithm uses these gender identification APIs and models and re-ranks the retrieved images through a carefully designed cost function that considers both relevance and fairness. On the retrieved image sets for 10 occupational keywords, we plotted the performance of our de-biased model and compared it with baselines using random and relevance-only re-ranking. Our experimental results show that the proposed model produces fairer image search rankings while remaining competitive in relevance.

8.1 Future Work

  • The maximum average accuracy for gender detection is 59.1% for the open-source models and 79.49% for the proprietary system. This lower accuracy is primarily due to the open-ended nature of the images and the absence of prominent facial features. In the future, we will explore models that use body and environmental features to classify gender more accurately.

  • For measuring relevance, we extracted labels using the Amazon Rekognition API and used word embeddings to compute the semantic similarity between the extracted labels and the occupation keyword. In the future, we will consider joint vision-and-language models like VisualBERT Li et al. (2019) to compute the relevance scores.

  • For de-biasing, we only considered a cost-based re-ranking algorithm that does not improve the search over time. We can use this cost to optimize the ranking procedure itself with the help of reinforcement learning.

  • This study only considered the search results from Google. We will take image search results from other popular search engines and open-source search frameworks for our experiments.

  • This study only considered occupational stereotypes, which align with a few types of biases (as described in Section 3). Mitigating other forms of bias (such as racial bias) is also on our agenda.

Acknowledgements

We would like to thank Professor Yunhe Feng, Department of Computer Science and Engineering, University of North Texas and Professor Chirag Shah, School of Information, University of Washington, for their continuous guidance and support.

References

  • Biega et al. (2020) Asia J Biega, Fernando Diaz, Michael D Ekstrand, and Sebastian Kohlmeier. Overview of the TREC 2019 fair ranking track. arXiv preprint arXiv:2003.11650, 2020.
  • Bond et al. (2012) Robert M Bond, Christopher J Fariss, Jason J Jones, Adam DI Kramer, Cameron Marlow, Jaime E Settle, and James H Fowler. A 61-million-person experiment in social influence and political mobilization. Nature, 489(7415):295–298, 2012.
  • Broder (2002) Andrei Broder. A taxonomy of web search. In ACM Sigir forum, volume 36, pp.  3–10. ACM New York, NY, USA, 2002.
  • Chakraborty et al. (2000) B Chakraborty, R Kaustubha, A Hegde, A Pereira, W Done, R Kirlin, A Moghaddamjoo, A Georgakis, C Kotropoulos, and Pitas Xafopoulos. Bishop, cm, neural networks for pattern recognition, oxford university press, new york, 1995. carreira-perpiñán m., mode-finding for mixtures of gaussian distributions, ieee transaction on pattern analysis and machine intelligence, vol. 22, no. 11, november 2000, 1318-1323. IEEE transaction on Pattern Analysis and Machine Intelligence, 22(11):1318–1323, 2000.
  • Chernov (2019) Pavel Chernov. Age and gender estimation. open-source projects overview. simple project from scratch. https://medium.com/@pavelchernov/age-and-gender-estimation-open-source-projects-overview-simple-project-from-scratch-69581831297e, 2019. (Accessed on 04/28/2023).
  • Craswell et al. (2008) Nick Craswell, Onno Zoeter, Michael Taylor, and Bill Ramsey. An experimental comparison of click position-bias models. In Proceedings of the 2008 international conference on web search and data mining, pp.  87–94, 2008.
  • Ćurković & Košec (2018) Marko Ćurković and Andro Košec. Bubble effect: including internet search engines in systematic reviews introduces selection bias and impedes scientific reproducibility. BMC medical research methodology, 18(1):1–3, 2018.
  • Deng et al. (2009) Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pp.  248–255. Ieee, 2009.
  • Dutton et al. (2013) William H Dutton, Grant Blank, and Darja Groselj. Cultures of the internet: the internet in Britain: Oxford Internet Survey 2013 Report. Oxford Internet Institute, 2013.
  • Dutton et al. (2017) William H Dutton, Bianca Reisdorf, Elizabeth Dubois, and Grant Blank. Search and politics: The uses and impacts of search in Britain, France, Germany, Italy, Poland, Spain, and the United States. Quello Center Working Paper, 2017.
  • Epstein & Robertson (2015) Robert Epstein and Ronald E Robertson. The search engine manipulation effect (seme) and its possible impact on the outcomes of elections. Proceedings of the National Academy of Sciences, 112(33):E4512–E4521, 2015.
  • Epstein et al. (2017) Robert Epstein, Ronald E Robertson, David Lazer, and Christo Wilson. Suppressing the search engine manipulation effect (seme). Proceedings of the ACM on Human-Computer Interaction, 1(CSCW):1–22, 2017.
  • Feng et al. (2020) Yunhe Feng, Daniel Saelid, Ke Li, Ruoyuan Gao, and Chirag Shah. University of Washington at TREC 2020 fairness ranking track. arXiv preprint arXiv:2011.02066, 2020.
  • Feng et al. (2021) Yunhe Feng, Daniel Saelid, Ke Li, Ruoyuan Gao, and Chirag Shah. Towards fairness-aware ranking by defining latent groups using inferred features. In International Workshop on Algorithmic Bias in Search and Recommendation, pp.  1–8. Springer, 2021.
  • Fidel (2012) Raya Fidel. Human information interaction: An ecological approach to information behavior. Mit Press, 2012.
  • Fogg (2002) Brian J Fogg. Persuasive technology: using computers to change what we think and do. Ubiquity, 2002(December):2, 2002.
  • Geyik et al. (2019) Sahin Cem Geyik, Stuart Ambler, and Krishnaram Kenthapadi. Fairness-aware ranking in search & recommendation systems with application to LinkedIn talent search. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp.  2221–2231, 2019.
  • Gillespie (2014) Tarleton Gillespie. The relevance of algorithms. Media technologies: Essays on communication, materiality, and society, 167(2014):167, 2014.
  • Grind et al. (2019) Kirsten Grind, Sam Schechner, Robert McMillan, and John West. How google interferes with its search algorithms and changes your results. The Wall Street Journal, 15, 2019.
  • Guarino (2016) Ben Guarino. Google faulted for racial bias in image search results for black teenagers. Washington Post, 6:2016, 2016.
  • Hardt et al. (2016) Moritz Hardt, Eric Price, and Nati Srebro. Equality of opportunity in supervised learning. Advances in neural information processing systems, 29:3315–3323, 2016.
  • Hopkins et al. (2021) Margaret M Hopkins, Deborah Anne O’Neil, Diana Bilimoria, and Alison Broadfoot. Buried treasure: Contradictions in the perception and reality of women’s leadership. Frontiers in Psychology, 12:1804, 2021.
  • Howard et al. (2017) Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  • Jain & Varma (2011) Vidit Jain and Manik Varma. Learning to re-rank: query-dependent image re-ranking using click data. In Proceedings of the 20th international conference on World wide web, pp.  277–286, 2011.
  • Joachims et al. (2007) Thorsten Joachims, Laura Granka, Bing Pan, Helene Hembrooke, Filip Radlinski, and Geri Gay. Evaluating the accuracy of implicit feedback from clicks and query reformulations in web search. ACM Transactions on Information Systems (TOIS), 25(2):7–es, 2007.
  • Kay et al. (2015) Matthew Kay, Cynthia Matuszek, and Sean A Munson. Unequal representation and gender stereotypes in image search results for occupations. In Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems, pp.  3819–3828, 2015.
  • Kilbertus et al. (2017) Niki Kilbertus, Mateo Rojas-Carulla, Giambattista Parascandolo, Moritz Hardt, Dominik Janzing, and Bernhard Schölkopf. Avoiding discrimination through causal reasoning. arXiv preprint arXiv:1706.02744, 2017.
  • Kulshrestha et al. (2017) Juhi Kulshrestha, Motahhare Eslami, Johnnatan Messias, Muhammad Bilal Zafar, Saptarshi Ghosh, Krishna P Gummadi, and Karrie Karahalios. Quantifying search bias: Investigating sources of bias for political searches in social media. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, pp.  417–432, 2017.
  • Langston (2015) Jennifer Langston. Who's a CEO? Google image results can shift gender biases. UW News, April, 2015.
  • Levi & Hassner (2015) Gil Levi and Tal Hassner. Age and gender classification using convolutional neural networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) workshops, 2015. URL https://osnathassner.github.io/talhassner/projects/cnn_agegender.
  • Li et al. (2019) Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. Visualbert: A simple and performant baseline for vision and language. arXiv preprint arXiv:1908.03557, 2019.
  • Lim et al. (2020) Sora Lim, Adam Jatowt, Michael Färber, and Masatoshi Yoshikawa. Annotating and analyzing biased sentences in news articles using crowdsourcing. In Proceedings of the 12th Language Resources and Evaluation Conference, pp.  1478–1484, Marseille, France, May 2020. European Language Resources Association. ISBN 979-10-95546-34-4. URL https://aclanthology.org/2020.lrec-1.184.
  • Liu (2011) Tie-Yan Liu. Learning to rank for information retrieval. Springer Science & Business Media, 2011.
  • Mai (2016) Jens-Erik Mai. Looking for information: A survey of research on information seeking, needs, and behavior. Emerald Group Publishing, 2016.
  • Mitchell et al. (2017) Amy Mitchell, Jeffrey Gottfried, Elisa Shearer, and Kristine Lu. How Americans encounter, recall and act upon digital news. Pew Research Center, 2017.
  • Pasquale (2015) Frank Pasquale. The black box society. Harvard University Press, 2015.
  • Pavlakos et al. (2019) Georgios Pavlakos, Vasileios Choutas, Nima Ghorbani, Timo Bolkart, Ahmed A. A. Osman, Dimitrios Tzionas, and Michael J. Black. Expressive body capture: 3d hands, face, and body from a single image. In Proceedings IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2019.
  • Pennington et al. (2014) Jeffrey Pennington, Richard Socher, and Christopher D Manning. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp.  1532–1543, 2014.
  • Robertson et al. (2018) Ronald E Robertson, Shan Jiang, Kenneth Joseph, Lisa Friedland, David Lazer, and Christo Wilson. Auditing partisan audience bias within google search. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–22, 2018.
  • Robertson et al. (1995) Stephen E Robertson, Steve Walker, Susan Jones, Micheline M Hancock-Beaulieu, Mike Gatford, et al. Okapi at TREC-3. NIST Special Publication SP, 109:109, 1995.
  • Rocchio (1971) Joseph John Rocchio. The smart retrieval system: Experiments in automatic document processing. Relevance feedback in information retrieval, pp.  313–323, 1971.
  • Rothe et al. (2015) Rasmus Rothe, Radu Timofte, and Luc Van Gool. Dex: Deep expectation of apparent age from a single image. In Proceedings of the IEEE international conference on computer vision workshops, pp.  10–15, 2015.
  • Sandvig et al. (2014) Christian Sandvig, Kevin Hamilton, Karrie Karahalios, and Cedric Langbort. Auditing algorithms: Research methods for detecting discrimination on internet platforms. Data and discrimination: converting critical concerns into productive inquiry, 22:4349–4357, 2014.
  • Schroeder & Borgerson (2015) Jonathan E Schroeder and Janet L Borgerson. Critical visual analysis of gender: Reactions and reflections. Journal of Marketing Management, 31(15-16):1723–1731, 2015.
  • Schweiger et al. (2014) Stefan Schweiger, Aileen Oeberst, and Ulrike Cress. Confirmation bias in web-based search: a randomized online study on the effects of expert information and social tags on information search and evaluation. Journal of medical Internet research, 16(3):e94, 2014.
  • Silberg & Manyika (2019) Jake Silberg and James Manyika. Notes from the ai frontier: Tackling bias in ai (and in humans). McKinsey Global Institute (June 2019), 2019.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Viola & Jones (2001) Paul Viola and Michael Jones. Rapid object detection using a boosted cascade of simple features. In Proceedings of the 2001 IEEE computer society conference on computer vision and pattern recognition. CVPR 2001, volume 1, pp.  I–I. Ieee, 2001.
  • Wu et al. (2019) Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
  • Xie et al. (2019) Xiaohui Xie, Jiaxin Mao, Yiqun Liu, Maarten de Rijke, Yunqiu Shao, Zixin Ye, Min Zhang, and Shaoping Ma. Grid-based evaluation metrics for web image search. In The World Wide Web Conference, pp.  2103–2114, 2019.
  • Yang et al. (2018) Tsun-Yi Yang, Yi-Hsuan Huang, Yen-Yu Lin, Pi-Cheng Hsiu, and Yung-Yu Chuang. Ssr-net: A compact soft stagewise regression network for age estimation. In IJCAI, volume 5, pp.  7, 2018.
  • Zeng et al. (2018) Wei Zeng, Jun Xu, Yanyan Lan, Jiafeng Guo, and Xueqi Cheng. Multi page search with reinforcement learning to rank. In Proceedings of the 2018 ACM SIGIR International Conference on Theory of Information Retrieval, pp.  175–178, 2018.
  • Zhang et al. (2016) Kaipeng Zhang, Zhanpeng Zhang, Zhifeng Li, and Yu Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. IEEE Signal Processing Letters, 23(10):1499–1503, 2016.
  • Zhang et al. (2017) Zhifei Zhang, Yang Song, and Hairong Qi. Age progression/regression by conditional adversarial autoencoder. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
  • Zhao et al. (2018) Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. Gender bias in coreference resolution: Evaluation and debiasing methods. arXiv preprint arXiv:1804.06876, 2018.
  • Zhao (2004) Lisa Zhao. Jump higher: Analyzing web-site rank in google. Information technology and libraries, 23(3):108, 2004.
  • Zhou & Agichtein (2020) Jianghong Zhou and Eugene Agichtein. Rlirank: Learning to rank with reinforcement learning for dynamic search. In Proceedings of The Web Conference 2020, pp.  2842–2848, 2020.
  • Zhou et al. (2021) Jianghong Zhou, Sayyed M Zahiri, Simon Hughes, Khalifeh Al Jadda, Surya Kallumadi, and Eugene Agichtein. De-biased modelling of search click behavior with reinforcement learning. arXiv preprint arXiv:2105.10072, 2021.
  • Zickuhr et al. (2012) Kathryn Zickuhr, Lee Rainie, Kristen Purcell, Mary Madden, and Joanna Brenner. Libraries, patrons, and e-books. Pew Internet & American Life Project, 2012.