Metric-Free Individual Fairness with Cooperative Contextual Bandits

Qian Hu, Huzefa Rangwala {qhu3, rangwala}@gmu.edu Department of Computer Science, George Mason University

Abstract

Data mining algorithms are increasingly used in automated decision making across all walks of daily life. Unfortunately, as reported in several studies these algorithms learn bias from data and environment leading to unequitable and unfair solutions. To mitigate bias in machine learning, different formalizations of fairness have been proposed that can be categorized into group fairness and individual fairness. Group fairness requires that different groups should be treated similarly which might be unfair to some individuals within a group. On the other hand, individual fairness requires that similar individuals be treated similarly. However, individual fairness remains understudied due to its reliance on problem-specific similarity metric. We propose a metric-free individual fairness and a cooperative contextual bandits (CCB) algorithm. The CCB algorithm utilizes fairness as a reward and attempts to maximize it. The advantage of treating fairness as a reward is that the fairness criterion does not need to be differentiable. The proposed algorithm is tested on multiple real-world benchmark datasets. The results show the effectiveness of the proposed algorithm at mitigating bias and at achieving both individual and group fairness.

Index Terms:

fair machine learning, decision making, contextual bandits

I Introduction

Machine learning (ML) algorithms have permeated many levels of human society and achieved tremendous success in many scenarios ranging from improved healthcare, education and commerce. For example, ML enables people with disabilities to live better lives [1, 2] and can identify tumors before the on-set of cancer [3]. Educational technologies based on ML provide personlalized learning experiences to students and improve their learning outcomes [4].

However, concerns and evidence of machine learning bias are increasing with its adoption. For example, Buolamwini et al. [5] found that commercial face recognition software exhibit skin-type and gender biases. This study showed that the classification error rate for light-skinned men is substantially lower than dark-skinned women, 0.8% and 34.7%, respectively. A recidivism prediction algorithm COMPAS investigated by ProPublica misclassifies African-American defendants at a higher risk of recidivism than their Caucasian counterparts (45% vs. 23%) [6]. A study by Ali et al. [7] showed skewed job and housing advertisement delivery. For instance, advertisements for jobs in taxi industry disproportionately target communities of color (75%), even though the specified targeting audience is identical for all race. The increasing amount of bias cases in machine learning applications require the development of bias-free algorithms.

Therefore, research communities in machine learning, policy, social science, law and statistics have invested in the development of fairness definitions [8], and methods [9, 10, 11, 12]. The proposed fairness definitions include fairness through unawareness [13], counterfactual fairness [10], statistical parity [14], equalized odds and equal opportunity [15] and individual fairness [13].

The proposed fairness definitions can be classified into two categories: group and individual fairness. Group fairness requires that the protected group as a whole should be treated statistically similarly to the advantaged groups. For example, if the hiring rate for males is 60%, then the hiring rate for female candidates should be 60% as well. Although most existing algorithms proposed for achieving fairness focus on group fairness, it has the potential to be unfair to specific individuals. For example, we can accept certain people in an identified group to achieve group fairness, as long as the ratio of being accepted is identical across all groups. However, it is unfair to qualified people in that group. In addition, forcing the acceptance rates to be the same across all groups can lead to inferior prediction performance if the distribution across the different groups are different.

As an alternate formalization, individual fairness was proposed by Dwork et al. [13]. This requires that similar individuals be treated similarly. To implement this concept, a problem-specific similarity metric is required. However, defining a similarity metric for each task has its own ethical issues, which is a critical challenge in making individual fairness practical [16].

We propose the concept of metric-free individual fairness where the prediction of an individual should not be influenced by their sensitive attribute. Compared to traditional individual fairness which assures fairness by comparing an individual to others, the proposed metric-free individual fairness compares the predictions of an individual conditioned on different values of the sensitive attribute.

To implement the proposed metric-free individual fairness, we propose an algorithm called cooperative contextual bandits. The algorithm consists of multiple contextual bandits where each bandit corresponds to a sensitive attribute. Given an individual, the algorithm predicts the individual’s probability of getting a positive outcome using the bandit corresponding to their sensitive attribute and compares it to the probability of getting a positive outcome by using the bandit corresponding to the other sensitive attribute value(s). The proximity of the two probability distributions is treated as fairness. Namely, if the prediction of the individual is not affected by their sensitive attribute, it is considered fair. The algorithm treats the proximity of the two probability distributions as a reward signal and maximizes it to achieve fairness. As we need to measure the divergence of the predictions between different bandits, we develop a gradient contextual bandits algorithm. Compared to traditional contextual bandits which predict the action value, gradient contextual bandits directly map a state to an action distribution that can be used to measure divergence.

The proposed algorithm is analyzed theoretically and empirically. It is shown that the proposed fair algorithm converges linearly to a local minimum. We empirically evaluate the algorithm on several public datasets and the experimental results show the effectiveness of the proposed algorithm at mitigating bias.

The contributions of this work can be summarized as follows

•

We propose a new concept of individual fairness which does not need a similarity metric. The proposed metric-free individual fairness allows for broader adoption of individual fairness.
•

We propose gradient contextual bandits algorithm. The proposed gradient contextual bandits can be used in many other domains such as recommender systems. One of the advantage of the gradient contextual bandits is that it learns a stochastic policy, while the policy learned by classic contextual bandits is deterministic.
•

Based on gradient contextual bandits, we propose a novel algorithm, cooperative contextual bandits, to implement the proposed metric-free individual fairness. The proposed algorithm relies on the reward signal to achieve fairness, therefore, the fairness measure is not required to be differentiable.

II Related Work

II-A Fairness Formalization

Different definitions of fairness have been proposed [17]. Demographic parity imposes the constraint that a predictor predicts a particular outcome for individuals across groups with nearly equal probability [18, 19, 20]. This notion has the disadvantage of trading utility for fairness. Hardt et al. [21] proposed equalized odds and equal opportunity. A classifier satisfies equalized odds if the predicted outcome is independent of the sensitive attribute conditioned on the true outcome. For binary classification, if the true outcome is positive, equalized odds requires that a classifier yield equal true positive rates across the groups. If the true outcome is negative, it requires the predictions lead to equal false positive rates. Equalized odds is a stronger constraint compared to equal opportunity. Equal opportunity only requires non-discrimination within the advantaged group. For example, the qualified students in different groups should have similar probability of getting admitted into colleges.

Based on causal inference, Kusner et al. [10] proposed counterfactual fairness. A predictor satisfies counterfactual fairness if its output remains the same when an individual’s sensitive attribute is flipped to its counterfactual value. Counterfactual fairness is an individual-level fairness formalization. Dwork el al. [13] explicitly proposed individual fairness based on the idea that similar individuals should be treated similarly. However, this notion of fairness largely depends on the similarity metric. To achieve this notion of fairness, a reliable and non-discriminating similarity metric is required [8].

Recently, several efforts have been made to eliminate the need of similarity metric of individual fairness [22]. Ilvento [23] proposed to approximate a similarity metric for individual fairness by using human judgement, namely, a human provides with sufficient domain knowledge to evaluate similarity. Although an explicit similarity metric is not necessary in this work, human judgements are needed, which is prone to error and hard to implement. Bechavod et al. [24] proposed a metric-free online learning algorithm which assumes access to a human auditor identifying fairness violations.

Different from the aforementioned metric-free individual fairness, our definition of metric-free individual fairness needs no human judgements.

II-B Fair Algorithms

In this section, we mainly survey the works on individual fairness. Based on John Rawls’ notion of fair equality of opportunity, Joseph et al. [25] proposed Rawlsian fairness, which states that in the case of hiring, a weaker applicant should never be favored over a stronger candidate. Their proposed fair algorithm is based on contextual bandits and confidence intervals, which chains confidence intervals of different individuals to determine which individuals should be favored over others. Zemel et al. [26] take a different perspective on how to achieve individual fairness. They proposed an algorithm to learn a representation while removing sensitive information from the learned representation. The proposed algorithm is able to achieve both individual and group fairness.

Following the idea of learning a fair representation, Edwards et al. [27] proposed to remove sensitive information from the representation by adversarial learning. An encoder learns to generate the representation and an adversary predicts the sensitive attributes from the representation. To achieve fairness, the adversary tries to predict sensitive information from the representation while the encoder seeks to refrain the adversary from predicting it. Louizos et al. [28] proposed using variational autoencoder for learning a fair representation by encouraging independence between sensitive attributes and the representation. To further remove dependencies, the algorithm is integrated with an additional term based on maximum mean discrepancy. Compared to the algorithms achieving individual fairness by learning a fair representation, our algorithm directly optimizes the individual fairness criterion.

II-C Cooperative Contextual Bandits

Cooperative contextual bandits have been applied in areas ranging from recommender systems [29, 30, 31] to robotics [32]. Although it is based on the idea of combining multiple bandits, the ways that how the bandits are combined and the reasons to combine them are different.

To improve information sharing and overcome data sparsity issues in recommender systems where users are connected through social networks, Cesa et al. [29] proposed to combine the bandits through the Laplacian matrix of the graph. A similar work done by Wu et al. [30] combines the bandits by parameters propagation through the graph. In robotics, Landgren [32] proposed distributed multi-agent bandits for searching robots which are combined by using consensus communication protocol. Each robot is represented by a bandit, and the main reason for combining multiple bandits is to help agents with sensor defects.

Our proposed cooperative contextual bandits algorithm is different from existing algorithms in the sense that we impose constraints on the output space of the bandits.

III Preliminaries

In this section, we present traditional definition of individual fairness and contextual bandits. Without loss of generality, in this work we assume a binary sensitive attribute.

Given an individual $((x_{i},s_{i}),y_{i})$ , where $x_{i}\in X$ denotes the feature vector of individual $i$ , $s_{i}\in\{0,1\}$ is the individual’s sensitive attribute, $y_{i}$ is the ground-truth label.

III-A Metric-Based Individual Fairness

The most popular individual fairness is based on the idea that similar individuals should be treated similarly, regardless of their sensitive attributes. Given a classifier, two individuals $x_{i}$ and $x_{j}$ are mapped to two distributions $H(x_{i})$ and $H(x_{j})$ , respectively. Individual fairness requires that the difference between the two output distributions is upper bounded by the distance between the two individuals.

Definition 1

(Metric-Based Individual Fairness) $\forall x_{i},x_{j}\in X$ , the metric based individual fairness has the constraint that $D(H(x_{i}),H(x_{j}))\leq d(x_{i},x_{j})$ , where $d:X\times X\to\mathbb{R}$ is a distance metric for two individuals and $D$ measures divergence between the two outcomes.

III-B Contextual Bandits

The classic contextual bandits problem is formalized as a finite set of $K$ actions $A$ , context space $X$ , a set of policies $\Pi$ which maps a context $x\in X$ to an action $a\in A$ based on action value $Q(x,a)$ .

At time $t=1,...,T$ , a context $x_{t}$ is shown. A policy chooses an action $a_{t}$ by estimating reward value of actions based on the history $(x_{1},a_{1},r_{1}(a_{1})),...,(x_{t-1},a_{t-1},r_{t-1}(a_{t-1}))$ and receives a reward $r_{t}(a_{t})$ . The goal of the contextual bandit algorithm is to maximize the expected reward of a policy $R(\pi)=E_{(x,r)\sim D}[r(\pi(x))]=\sum_{t=1}^{T}r_{t}(a_{t})$ , where $D$ is the distribution over context and reward.

At each round, the policy decides which action to take. It has two options, taking the action with the highest estimated value (exploitation) or explore actions that might have higher value (exploration). Contextual bandit algorithms have to make a sensible tradeoff between exploitation and exploration. Many algorithms have been proposed to overcome the exploitation and exploration challenge such as $\epsilon$ -greedy, LinUCB [33] and Thompson Sampling [34].

IV Proposed Algorithm

In this section, we propose our algorithms for metric free individual fairness based on gradient contextual bandit.

IV-A Metric-Free Individual Fairness

Definition 2

(Metric-Free Individual Fairness) Given an individual with $(x_{i},s_{i})$ , metric-free individual fairness is defined as, $D(P(Y|x_{i},S=s_{i}),P(Y|x_{i},S=1-s_{i}))<\epsilon$ , for arbitrarily small positive constant $\epsilon$ , where $P(Y|x_{i},S=s_{i})$ is the distribution of the outcome given an individual $x_{i}$ and the sensitive attribute $s_{i}$ .

Metric-free individual fairness requires that the outcome of an individual is independent of their sensitive attributes given their input features. This implies that the prediction for an individual should not be influenced by their sensitive attributes.

IV-A1 Relationship to Counterfactual Fairness

Counterfactual fairness [10] is defined such that a predictor $\hat{Y}$ is counterfactually fair for any individual $x$ with sensitive attribute $s$ , if $P(\hat{Y}_{S\leftarrow s}|X=x,A=a)=P(\hat{Y}_{S\leftarrow s^{\prime}}|X=x,A=a)$ , where $S\leftarrow s$ denotes the intervention that sets sensitive attribute $S$ to value $s$ .

The main difficulty of adopting counterfactual fairness is untestable assumptions about causal relationships between variables [35]. Although counterfactual fairness holds under different levels of assumptions from weak to strong, the utility and feasibility of the models under these assumptions decrease [10]. For example, the weakest assumption only requires to identify non-descendants of sensitive attribute, but in many practical problems there exist few non-descendants of sensitive attribute. The strongest assumption postulates a fully deterministic model which guarantees the maximum information extraction, however, it is the hardest to validate. Validating a causal assumption is sometimes unrealistic; Russell et al. [35] propose to avoid doing this by integrating multiply competing assumptions. However, it results in approximate counterfactual fairness and still requires the formalization of causal relations between variables. Compared to counterfactual fairness, the proposed metric-free individual fairness is based on conditional independence between the prediction and the sensitive attribute, which requires no assumptions and is simple to implement. The experimental results show the effectiveness of models implementing it at removing bias.

IV-B Gradient Contextual Bandit

Contextual bandits algorithms make a decision at each step by estimating the action value $Q(x,a)$ , which is the expected reward of taking action $a$ given context $x$ . The gradient contextual bandits learns a stochastic policy that maps a context vector to a probability distribution of actions. This is desirable in our case as we want to know the distance between two distributions of actions given a context vector, which is detailed in Section IV-C1. Besides that, a stochastic policy is beneficial in other domains such as recommender systems. For example, a user might like an action movie as much as a science fiction movie, which can be easily modeled by using a stochastic policy, while a deterministic policy can only choose the movie that the user likes the best.

Formally, we want to learn a policy $\pi_{\theta}(a|x)=P(A=a|X=x,\theta)$ parameterized by $\theta$ , which is the probability of choosing action $a$ given context vector $x$ that maximizes the expected reward

J(\theta)=E_{x\sim P(X)}[\pi_{\theta}(a|x)Q(x,a)]

(1)

In the following text, we ignore $\theta$ and use $\pi_{\theta}(a|x)$ and $\pi(a|x)$ interchangeably. As the action space is discrete, we can parameterize the policy as

\pi(a|x)=\frac{e^{\phi(x,a)}}{\sum_{b}e^{\phi(x,b)}}

(2)

where $\phi(x,a)$ is a preference score of choosing action $a$ , which is mapped from feature vector $x$ .

The gradient of the expected reward of a policy with respect to the policy parameter is

\nabla J(\theta)=E_{x\sim P(X)}[\nabla\pi_{\theta}(a|x)Q(x,a)]

(3)

However, the probability distribution of X is unknown. Instead, we seek to optimize the empirical reward $J_{t}(\theta)$ , which is an unbiased estimator of the expected reward, namely, $J(\theta)=E[J_{t}(\theta)]$ . The empirical policy gradient is

\nabla J_{t}(\theta)=\nabla\log(\pi_{\theta}(a_{t}|x_{t})r_{t}(a_{t})).

(4)

The gradient of the policy with respect to its parameters is the gradient of the logarithm of the policy times the reward. The derivation of the policy gradient is detailed as following

$\displaystyle\nabla J_{t}(\theta)$	$\displaystyle=\sum_{a}\nabla\pi_{\theta}(a\|x_{t})q(x_{t},a)$	(5)
	$\displaystyle=\sum_{a}\pi_{\theta}(a\|x_{t})\frac{\nabla\pi_{\theta}(a\|x_{t})}{\pi_{\theta}(a\|x_{t})}q(x_{t},a)$
	$\displaystyle=E_{\pi_{\theta}}[\nabla\log\pi_{\theta}(a\|x_{t})q(x_{t},a)]$
	$\displaystyle=E_{\pi_{\theta}}[\nabla\log\pi_{\theta}(a_{t}\|x_{t})q(x_{t},a_{t})]$
	$\displaystyle=E_{\pi_{\theta}}[\nabla\log\pi_{\theta}(a_{t}\|x_{t})R_{t}]$
	$\displaystyle=\nabla\log\pi_{\theta}(a_{t}\|x_{t})R_{t}.$

where $E_{\pi_{\theta}}$ is the expectation taken with respect to $\pi_{\theta}$ . Interestingly, we find that the derived algorithm is similar to the REINFORCE algorithm [36]. The difference between contextual bandits and full reinforcement learning (RL) is that the action taken by full RL agent will influence the environment while that taken by a contextual-bandit agent will not. Therefore, REINFORCE uses return from current step, while gradient contextual bandits use reward at current step.

The policy parameter can be updated by using stochastic gradient ascent algorithm as

\theta_{t+1}=\theta_{t}+\alpha\nabla\log\pi_{\theta}(a_{t}|x_{t})r_{t}(a_{t}).

(6)

IV-C Cooperative Contextual Bandits

Refer to caption — Figure 1: The Cooperative Contextual Bandits Model. GCB₀ and GCB₁ represent the bandits corresponding to sensitive attribute values of 0 (male) and 1 (female), respectively.

Metric-free individual fairness imposes the constraint that the prediction for an individual is independent of their sensitive attribute. We propose cooperative contextual bandits algorithm which is composed of two gradient contextual bandits. Figure 1 shows the model architecture of the proposed model. Each gradient contextual bandit corresponds to a sensitive attribute group, e.g., male or female. Each gradient contextual bandit learns a policy $\pi_{s}$ which maps a feature vector to an action distribution. The individual’s context vector is denoted by $x_{t}$ and is input into both bandits, each of which outputs an action distribution. The algorithm takes an action according to the action distribution of the bandit corresponding to $s_{t}$ . A reward is determined by the ground truth label $y_{t}$ and the divergence between the action distributions of the two bandits. The bandit corresponding to $s_{t}$ receives the reward feedback and gets updated. Note that the other bandit is not updated.

According to the definition of metric-free individual fairness, the difference between outcomes given by different gradient contextual bandits should be trivial, i.e. $D(\pi_{s_{i}}(x_{i}),\pi_{1-s_{i}}(x_{i}))<\epsilon$ , for arbitrarily small $\epsilon$ . Since we are using gradient contextual bandits which directly maps the context vector to the probability distribution of actions, the difference of the outcomes can be measured by KL divergence between the outcomes. Note that, metric-free individual fairness in this work means that there is no requirement of similarity metric between individuals, while we still need to measure the divergence between the two output distributions. The former is problem-specific and hard to justify if it is a fair metric, while the latter is easy to define and the choice of which has no effect on fairness.

IV-C1 Reward Function

At time step $t$ , an individual $(x_{t},s_{t})$ is presented, an action $a_{t}$ is taken and a reward $r_{t}(a_{t})$ is received. The action is the predicted outcome for the individual, i.e., whether he/she is hired. The reward function is composed of two parts, fairness and accuracy.

r_{t}(a_{t})=r_{t}^{a}(a_{t})-\lambda\text{KL}(\pi_{s_{t}}(x_{t}),\pi_{1-s_{t}}(x_{t}))

(7)

where $r_{t}^{a}(a_{t})\in\{0,1\}$ is the accuracy reward,

r_{t}^{a}(a_{t})=\begin{cases}0,&a_{t}\neq y_{t}\\ 1,&a_{t}=y_{t}\end{cases}

$\lambda$ is a hyperparameter trading off between accuracy and fairness. The reward is high when the prediction is correct while the difference between the outcomes from different contextual bandits is small.

IV-C2 Model Architecture of Gradient Contextual Bandits

The input into the gradient contextual bandit is a feature vector describing an individual. The bandit learns a policy that maps the feature vector to an action distribution. The policy is parameterized as a feed-forward neural network. As we assume a binary discrete label, the last layer of the neural networks is a softmax layer whose output is the probability distribution of actions. The neural networks are composed of one hidden layer and the activation function is a ReLU function [37].

Algorithm 1 shows the detail of the proposed algorithm ¹¹1The code is made public at https://github.com/anony-user-123/public.

Input : Data

D=\{((x_{i},s_{i}),y_{i})\}_{i=1}^{N}

, learning rate

\alpha

\lambda

, number of steps T.

3Initialize parameters

\{\theta_{0}^{0},\theta_{1}^{0}\}

4for $t=1,...,T$ do

5 Sample example

((x_{t},s_{t}),y_{t})

from

D

6 Sample

a_{t}\sim\pi_{s_{t}}(x_{t})

7 Take action

a_{t}

, compute reward

r_{t}(a_{t})

using Equation 7

\theta_{s_{t}}^{t+1}=\theta_{s_{t}}^{t}+\alpha\nabla\log B_{s_{t}}(a_{t}|x_{t})r_{t}(a_{t})

return $\{\theta_{0}^{T},\theta_{1}^{T}\}$

Algorithm 1 Cooperative Contextual Bandits

IV-D Convergence Analysis

In this section, we analyze the convergence property of the proposed algorithm.

Assumption 1

There exists $Q$ , such that the reward function satisfies

|r_{t}(a_{t})|\leq Q

(8)

for all $t$ and $a_{t}$ .

Assumption 1 states that the reward function is bounded.

Assumption 2

Let $s\in\{0,1\}$ . The first and second derivatives of the score function are elementwise bounded by constants $F$ and $S$ , respectively, and $0\leq F,S<\infty$

|\nabla\log\pi_{s}|\leq F

(9)

|\nabla^{2}\log\pi_{s}|\leq S.

(10)

The policy is parameterized as

\pi_{s}(a|x)=\frac{e^{\phi(x,a)}}{\sum_{b}e^{\phi(x,b)}}.

(11)

Therefore, $\nabla\log\pi_{s}=(1-\pi_{s}(a|x))\nabla\phi(x,a)$ , which is bounded if $\nabla\phi(x,a)$ and the parameter $\theta_{s}$ is bounded. The second derivative is derived as

	$\displaystyle\nabla^{2}\log\pi_{s}(a\|x)$	$\displaystyle=(1-\pi(a\|x))^{2}\nabla^{2}\phi(x,a)\nabla^{2}\pi_{s}(a\|x)$		(12)
		$\displaystyle-(1-\pi(a\|x))^{2}\nabla\phi(x,a)\nabla\phi(x,a)^{T}$		(12)

which is also bounded if $\nabla^{2}\phi(x,a)$ , $\nabla^{2}\pi_{s}(a|x)$ and $\nabla\phi(x,a)$ , $\theta_{s}$ are bounded.

Lemma 1

Under Assumption 2, for $\forall s\in\{0,1\},J_{s}(\theta)$ is $L$ -Lipschitz smooth and $0\leq L<\infty$

||\nabla J_{s}(\theta_{1})-\nabla J_{s}(\theta_{2})||\leq L||\theta_{1}-\theta_{2}||

(13)

Lemma 1 assures that the objective function $J_{s}(\theta)$ is smooth, which is essential in analyzing the convergence of the algorithm.

Proof

The Hessian matrix of the score function $\log\pi_{\theta_{s}}$ is

\nabla^{2}\log\pi_{\theta_{s}}=\pi_{\theta_{s}}^{-1}\nabla^{2}\pi_{\theta_{s}}-\nabla\log\pi_{\theta_{s}}\nabla\log\pi_{\theta_{s}}^{T}.

(14)

Therefore,

\nabla^{2}\pi_{\theta_{s}}=\pi_{\theta_{s}}\nabla^{2}\log\pi_{\theta_{s}}+\pi_{\theta_{s}}\nabla\log\pi_{\theta_{s}}\nabla\log\pi_{\theta_{s}}^{T}.

(15)

The second derivative of the objective function is bounded as shown below

$\displaystyle\nabla^{2}J(\theta_{s})$	$\displaystyle=\int_{X}\nabla^{2}\pi_{\theta_{s}}(a\|x)q(x,a)dx$	(16)
	$\displaystyle=\pi_{\theta_{s}}(\int_{X}\nabla^{2}\log\pi_{\theta_{s}}(a\|x)q(x,a)dx+$
	$\displaystyle\int_{X}\nabla\log\pi_{\theta_{s}}(a\|x)\nabla\log\pi_{\theta_{s}}(x\|a)^{T}q(x,a)dx)$
	$\displaystyle\leq SQ+F^{2}Q=L.$

Bounded second derivative implies Lipschitz continuity, therefore $J(\theta_{s})$ is $L$ -smooth.

Theorem 1

For $\forall s\in\{0,1\}$ , let $\{\theta_{s}^{t}\}_{t=0}^{T}$ be the sequence of learned parameters of the policy $\pi_{s}$ given by Algorithm 1. Then, under Assumption 1 and 2, we have

||\nabla J(\theta_{s}^{m})||_{2}^{2}\leq\frac{1}{T}\sum_{t=0}^{T}||\nabla J(\theta_{s}^{k})||_{2}^{2}\leq\frac{J(\theta_{s}^{*})-J(\theta_{s}^{0})}{MT},

(17)

where $\nabla J(\theta_{s}^{m})=\underset{t\in\{0,...,T\}}{\min}\nabla J(\theta_{s}^{t})$ , $J(\theta_{s}^{*})=\underset{t\in\{0,...,T\}}{\max}J(\theta_{s}^{t})$ , and M is a constant with $0<M<\infty$ .

Theorem 1 shows that the average of the gradient norm square converges to a neighborhood near zero with rate of $\frac{1}{T}$ or the minimizer of the gradient norm square converges to zero in $O(\frac{1}{T})$ iterations. The minimizer can be obtained by using early stopping.

Proof

The objective function $J(\theta_{s})$ is smooth, which directly implies

$\displaystyle J(\theta_{s}^{k+1})$	$\displaystyle\geq J(\theta_{s}^{k})+\langle\nabla J(\theta_{s}^{k}),\theta_{s}^{k+1}-\theta_{s}^{k}\rangle-\frac{L}{2}\|\|\theta_{s}^{k+1}-\theta_{s}^{k}\|\|_{2}^{2}$	(18)
	$\displaystyle\geq J(\theta_{s}^{k})+\alpha\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}-\frac{L}{2}\|\|\theta_{s}^{k+1}-\theta_{s}^{k}\|\|_{2}^{2}$
	$\displaystyle\geq J(\theta_{s}^{k})+\alpha\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}-\frac{L\alpha^{2}}{2}\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$
	$\displaystyle\geq J(\theta_{s}^{k})+\frac{2\alpha-L\alpha^{2}}{2}\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$
	$\displaystyle\geq J(\theta_{s}^{k})+M\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$

where $M=\frac{2\alpha-L\alpha^{2}}{2}$ .

By telescoping sum the above equation with $k$ from $0$ to $K$ , we obtain

J(\theta_{s}^{k})\geq J(\theta_{s}^{1})+M\sum_{k=0}^{K}||\nabla J(\theta_{s}^{k})||_{2}^{2}.

(19)

Therefore,

	$\displaystyle\frac{1}{K}\sum_{k=0}^{K}\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$	$\displaystyle\leq\frac{J(\theta_{s}^{k})-J(\theta_{s}^{1})}{MK}$		(20)
		$\displaystyle\leq\frac{J(\theta_{s}^{*})-J(\theta_{s}^{1})}{MK}$		(20)

where $J(\theta_{s}^{*})$ is the maximizer of $J(\theta_{s}^{k})$ with respect to k.

Defining $\nabla J(\theta_{s}^{m})=\underset{t\in\{0,...,T\}}{\min}\nabla J(\theta_{s}^{t})$ , we obtain

||\nabla J(\theta_{s}^{m})||_{2}^{2}\leq\frac{1}{T}\sum_{t=0}^{T}||\nabla J(\theta_{s}^{k})||_{2}^{2}\leq\frac{J(\theta_{s}^{*})-J(\theta_{s}^{0})}{MT}.

(21)

This Completes the proof for Theorem 1

TABLE I: Experimental results on public datasets using discrimination as model selection criterion.

adult compas german health acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist LR 0.8387 $|$ 0.1873 $|$ 0.8307 0.6878 $|$ 0.2859 $|$ 0.7493 0.7700 $|$ 0.0759 $|$ 0.6590 0.8786 $|$ 0.0119 $|$ 0.9883 Rawlsian 0.8378 $|$ 0.1686 $|$ 0.8359 0.6802 $|$ 0.3877 $|$ 0.7054 0.7500 $|$ 0.0287 $|$ 0.6920 0.8765 $|$ 0.0072 $|$ 0.9837 LFR 0.7548 $|$ 0.0016 $|$ 0.9990 0.6254 $|$ 0.0356 $|$ 0.5114 0.6950 $|$ 0.0282 $|$ 0.9840 0.8785 $|$ 0.0000 $|$ 1.0000 ALFR 0.8420 $|$ 0.0722 $|$ 0.8473 0.4986 $|$ 0.0487 $|$ 0.9101 0.6850 $|$ 0.0000 $|$ 1.0000 0.8785 $|$ 0.0000 $|$ 1.0000 CCB 0.7544 $|$ 0.0002 $|$ 0.9999 0.4768 $|$ 0.0031 $|$ 0.9960 0.6950 $|$ 0.0282 $|$ 0.9840 0.8785 $|$ 0.0002 $|$ 0.9999 • acc = accuracy, discri = discrimination, consist = consistency.

TABLE II: Experimental results on public datasets using delta as model selection criterion.

adult compas german health acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist LR 0.8387 $|$ 0.1873 $|$ 0.8307 0.6878 $|$ 0.2859 $|$ 0.7493 0.7700 $|$ 0.0759 $|$ 0.6590 0.8786 $|$ 0.0119 $|$ 0.9883 Rawlsian 0.8373 $|$ 0.1651 $|$ 0.8351 0.6802 $|$ 0.3877 $|$ 0.7054 0.7500 $|$ 0.0287 $|$ 0.6920 0.8765 $|$ 0.0072 $|$ 0.9837 LFR 0.7756 $|$ 0.0197 $|$ 0.7938 0.6556 $|$ 0.0408 $|$ 0.5147 0.7500 $|$ 0.1058 $|$ 0.7020 0.8785 $|$ 0.0000 $|$ 1.0000 ALFR 0.8327 $|$ 0.0465 $|$ 0.8220 0.5137 $|$ 0.0398 $|$ 0.9396 0.7450 $|$ 0.0797 $|$ 0.7760 0.8803 $|$ 0.0015 $|$ 0.9879 CCB 0.8165 $|$ 0.0077 $|$ 0.8341 0.6254 $|$ 0.0145 $|$ 0.7272 0.7000 $|$ 0.0423 $|$ 0.9810 0.8790 $|$ 0.0021 $|$ 0.9956 • acc = accuracy, discri = discrimination, consist = consistency.

V Experimental Protocol

V-A Datasets

We evaluate the performance of CCB model on four widely used benchmarks. We also perform a case study on a dataset collected at an educational institute for assessing the performance of various classifiers in academic performance prediction. All the tasks on these datasets are binary classification problem and the sensitive attributes are converted into two groups, i.e. protected and advantaged group. All the datasets are randomly split into 70%, 15% and 15% for training, validation and testing, respectively. The categorical features are one-hot encoded and continuous features are normalized with z-score.

Adult

The Adult income dataset has 48,842 samples [38]. The goal is to predict if an adult’s income is above 50K based on 14 categorical and continuous features. An individual’s gender is the sensitive attribute.

Compas

The COMPAS algorithm is used to assess a pretrial defendant’s risk of recidivism in Broward County, Florida. ProPublica collected 11,757 defendants assessed by COMPAS algorithm over two year from 2013 to 2014. Profiles were created for defendants using their criminal history, before and after they were scored by COMPAS [6]. The task is to predict whether a defendant will re-offend and the sensitive attribute is whether a defendant is black.

German

The German dataset has 1000 samples and 20 features [38]. The task is to predict whether a individual has a good or bad credit rating. The sensitive attribute is a person’s gender.

Health

The Health dataset is extracted from Heritage Health Prize which has 147,743 samples. We follow Brierley et al. [39] to preprocess the dataset. The task of this dataset is to predict whether an individual will spend any days in the hospital in the next year. The sensitive attribute is a person’s age. The age is converted to a binary variable, i.e., whether a persons is older than 65 years old.

V-B Baselines

We choose several algorithms as baselines which focus on achieving individual fairness.

Logistic Regression LR

This baseline does not have fairness constraint.

Rawlsian Fairness Rawlsian

Joseph et al. [25] proposed a concept of individual fairness based on the idea that a worse candidate should never be favored over a better one. The algorithm is implemented by using classical contextual bandits. Given a pool of candidates, their context vectors are mapped to corresponding qualification intervals. Overlapping intervals are chained together and the candidates chained to the candidate with the highest upper confidence bound are treated equally and over other candidates. The hyperparameters for this model are regularization parameter $\lambda$ selected from $\{1.0,3.0,9.0\}$ and the scaling parameter $\delta$ is chosen from $\{0.3,0.5,0.9\}$ .

Learning Fair Representation LFR

Zemel et al. [26] proposed to train a fair representation and build a classifier based on the representation. The idea of making the representation fair is to remove sensitive information from it, which is achieved by imposing the constraint that a random element from the protected group has the same probability of mapping to a particular prototype as a random element from the advantaged group. Following the original work, we set $A\_x$ as 0.01, $A\_y$ and $A\_z$ are chosen from $\{0.1,0.5,1,5,10\}$ , the dimension of the representation is chose from $\{10,20,30\}$ .

Adversarial Learned Fair Representation ALFR

Following the idea of fair representation learning, Edwards et al. [27] proposed to learn a fair representation by using an adversary. The feature space is transformed to latent space by an encoder. An adversary tries to predict the sensitive attribute from the representation, while the encoder tries to generate a representation based on which the sensitive attribute can not be easily predicted. The encoder and the adversary play a minimax game to generate a fair representation. The hyperparameters of this method include the dimension of the representation which is chosen from $\{10,20,30,40\}$ ; and the number of neurons of the hidden layer which is chosen from $\{10,20,30,40\}$ .

For the CCB algorithm, the hyperparameters include the number of neurons of the hidden layer and $\lambda$ . The number of neurons of the hidden layer is chosen from $\{10,20,30,40\}$ and $\lambda$ is chosen from $\{10,20,30,40,50,60,70,80,90,100\}$ .

V-C Evaluation Metrics

Following Zemel [26], we use three evaluation metrics accuracy, discrimination and consistency to evaluate the performance of the models.

The accuracy metric assesses the predictive accuracy of a classifier. Discrimination measures the difference between the two groups’ rate of being predicted as positive. Consistency compares an individual’s predicted result with his/her $k$ -nearest neighbors. If the predicted result is close to the results of the neighbors, it has high consistency. We compute the similarity between individuals by using Gower’s similarity [40]. Gower’s similarity can handle feature vectors with both categorical and continuous features. Gower similarity is defined as

\text{Gower}(i,j)=\frac{\sum_{k=1}^{N}w_{k}S_{ijk}}{\sum_{k=1}^{N}w_{k}}

(22)

where $N$ is the number of features and $w_{k}$ is the weight of the $k$ -th variable, in this paper the weights are set to one; $S_{ijk}$ is the contribution by the $k$ -th variable. If the $k$ -th variable is continuous, $S_{ijk}$ is defined as

S_{ijk}=1-\frac{|x_{ik}-x_{jk}|}{r_{k}}

(23)

where $x_{ik}$ is the value of $k$ -th feature of $i$ and $r_{k}$ is the range of values for the $k$ -th variable. If the $k$ -th variable is categorical, $S_{ijk}$ is 1 if $x_{ik}=x_{jk}$ or 0, otherwise.

V-D Model Selection Criterion

We perform model selection on the validation dataset for two different criterion, namely, discrimination and delta. Discrimination is the same as the metric used to evaluate the model introduced in Section V-C. As the model selected by discrimination does not take accuracy into account, we use delta as another selection criterion. Delta is the difference between accuracy and discrimination. A model selected by delta trades off some fairness for accuracy.

TABLE III: The prediction results by using different submodels of CCB.

adult compas german health acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist Model 0 $0.7548|0.0019|0.9976$ $0.4787|0.0019|0.9926$ $0.6950|0.0028|0.9510$ $0.8785|0.0000|1.0000$ Model 1 $0.7543|0.0000|1.0000$ $0.4749|0.0000|1.0000$ $0.6850|0.0000|1.0000$ $0.8784|0.0002|0.9998$ Reversed Model $0.7548|0.0021|0.9977$ $0.4768|0.0050|0.9958$ $0.6850|0.0310|0.9670$ $0.8784|0.0001|0.9999$ Original Model $0.7544|0.0002|0.9999$ $0.4768|0.0031|0.9960$ $0.6950|0.0282|0.9840$ $0.8785|0.0002|0.9999$ • acc = accuracy, discri = discrimination, consist = consistency.

VI Experimental Results

VI-A Results and Analysis

Table I shows the performance metrics on the test set by using discrimination as model selection criterion on the validation set. From the results, we can see that the proposed method CCB achieves comparable or better performance than baselines . The discrimination scores are pushed close to 0 and consistency scores close to 1. In addition, the proposed method is able to achieve both group fairness (measured by discrimination) and individual fairness (measured by consistency), although the proposed method is targeting at achieving individual fairness. LR that has no fairness constraints achieves almost the best performance in terms of predictive accuracy. However, its predictions have the highest bias. Other methods achieve fairness at the cost of accuracy. Rawlsian as an individual fairness algorithm fails to remove bias from both adult and compas dataset and has limited capability of removing bias for german and health datasets. The reason might be that its implementation using interval chaining is a weak constraint on the model. The experimental results also reveal that german and health have far less bias than adult and compas, even LR has very limited discriminative predictions on german and health.

Table II shows the results using delta as model selection criterion. As the delta criterion trades some fairness for accuracy, the results shows that the model is not able to achieve the same level of fairness as the ones using discrimination as selection criterion. However, the models are able to achieve higher predictive accuracy. Another observation is that the models are not able to achieve individual fairness by using delta as selection criterion. It is interesting to see that on the two most biased datasets adult and compas, CCB has the most improvement in terms of accuracy compared to other methods by changing the criterion from discrimination to delta. The improvements for adult and compas in terms of accuracy are 8.23% and 31.17%, respectively.

TABLE IV: Dataset Statistics

Major #S #C #G #M #F #AA #NAA Major1 6,127 16 124,716 1,927(31.45%) 4,200(68.55%) 759(12.39%) 5,368(87.61%) Major2 450 7 23,708 338(75.11%) 112(24.89%) 27(6.00%) 423(94.00%) Major3 2,430 11 90,819 1,942(79.92%) 488(20.08%) 157(6.46%) 2,273(93.54%) Major4 671 10 65,396 575(85.69%) 96(14.31%) 66(9.84%) 605(90.16%) Major5 5,110 17 84,504 1,200(23.48%) 3,910(76.52%) 694(13.58%) 4,416(86.42%) • #S total number of students, #C number of courses for prediction, #G total number of grades • #M number of male students, #F number of female students, #AA number of African-American students #NNA number of non-African-American students.

TABLE V: Experimental results on educational dataset with gender as sensitive attribute.

Method BIOL CEIE CS ECE PSYC acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist LR 0.7662 $|$ 0.0613 $|$ 0.8152 0.6761 $|$ 0.0837 $|$ 0.7451 0.6628 $|$ 0.1007 $|$ 0.7569 0.7545 $|$ 0.0980 $|$ 0.7655 0.7769 $|$ 0.0192 $|$ 0.9578 Rawlsian 0.5889 $|$ 0.0807 $|$ 0.8120 0.6250 $|$ 0.0866 $|$ 0.7052 0.5582 $|$ 0.0913 $|$ 0.8301 0.6660 $|$ 0.1498 $|$ 0.7036 0.7559 $|$ 0.0960 $|$ 0.9396 LFR 0.6470 $|$ 0.0369 $|$ 0.9691 0.6983 $|$ 0.0518 $|$ 0.9631 0.6004 $|$ 0.0228 $|$ 0.9463 0.7389 $|$ 0.0273 $|$ 0.9912 0.7898 $|$ 0.0248 $|$ 0.9865 ALFR 0.6802 $|$ 0.0202 $|$ 0.9675 0.7062 $|$ 0.0240 $|$ 0.9855 0.6124 $|$ 0.0134 $|$ 0.9821 0.7465 $|$ 0.0114 $|$ 0.9783 0.7903 $|$ 0.0125 $|$ 0.9878 CCB 0.6663 $|$ 0.0089 $|$ 0.9791 0.5901 $|$ 0.0334 $|$ 0.9624 0.6051 $|$ 0.0257 $|$ 0.9416 0.7300 $|$ 0.0279 $|$ 0.9790 0.7861 $|$ 0.0094 $|$ 0.9954

•

acc = accuracy, discri = discrimination, consist = consistency.

TABLE VI: Experimental results on educational dataset with race as sensitive attribute.

Method BIOL CEIE CS ECE PSYC acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist acc $|$ discri $|$ consist LR 0.7662 $|$ 0.1004 $|$ 0.8152 0.6761 $|$ 0.1411 $|$ 0.7451 0.6628 $|$ 0.1085 $|$ 0.7569 0.7545 $|$ 0.1238 $|$ 0.7655 0.7769 $|$ 0.0276 $|$ 0.9578 Rawlsian 0.5854 $|$ 0.1129 $|$ 0.7870 0.5849 $|$ 0.3658 $|$ 0.7349 0.5561 $|$ 0.1857 $|$ 0.8007 0.6999 $|$ 0.1446 $|$ 0.7416 0.7608 $|$ 0.0776 $|$ 0.9570 LFR 0.6202 $|$ 0.0569 $|$ 0.9051 0.7099 $|$ 0.1722 $|$ 0.9701 0.6107 $|$ 0.0599 $|$ 0.9897 0.7441 $|$ 0.0800 $|$ 0.9852 0.7874 $|$ 0.0172 $|$ 0.9933 ALFR 0.6850 $|$ 0.0505 $|$ 0.9504 0.7274 $|$ 0.0862 $|$ 0.9688 0.6129 $|$ 0.0086 $|$ 0.9715 0.7435 $|$ 0.0384 $|$ 0.9887 0.7898 $|$ 0.0156 $|$ 0.9882 CCB 0.6443 $|$ 0.0115 $|$ 0.9767 0.6781 $|$ 0.1071 $|$ 0.8798 0.5944 $|$ 0.0372 $|$ 0.9824 0.7440 $|$ 0.0137 $|$ 0.9842 0.7863 $|$ 0.0042 $|$ 0.9993 • acc = accuracy, discri = discrimination, consist = consistency.

VI-B Discrepancy between Submodels

The CCB model is composed of two gradient contextual bandits. To achieve individual fairness, we put a constraint on the distance of the predictions from the two contextual bandits. If the model is fair, ideally, given an individual the output distributions of the two bandits should be close to each other. If that is the case, we can use this property to do model compression, namely, we just keep one contextual bandit at inference time. In this section, we propose two ways to investigate the discrepancy between the two submodels.

Table III shows the results of using only one submodel for predictions. Model 0 is the model corresponding to sensitive attribute 0 and Model 1 is the model corresponding to sensitive attribute 1. Reversed Model implies that we do predictions for individuals with sensitive attribute 0 using Model 1, and for individuals with sensitive attribute 1 using Model 0. Original Model implies that we are predicting for individuals with sensitive attribute 0 using Model 0 and for individuals with sensitive attribute 1 using Model 1. The results show that there is no significant difference by using different models for predictions. Thus, we can keep any one of the submodels for prediction after it is trained.

VI-C Convergence Analysis

Figure 2 shows the convergence rates of the proposed model on the four public datasets. The X axis is the number of iterations and the Y axis is the accumulated reward. The convergence curves show that the accumulated reward increase monotonically on four datasets. The convergence rates are different on different datasets. For the health and adult datasets, the rewards accumulate slowly at the beginning due to exploration and then increase fast; at the end, converge smoothly. On the compas and german datasets, the reward accumulation rates are linear.

VI-D A Case Study on Educational Setting

Prior research in the educational domain has focussed on the development of machine learning models for assisting students with their learning such as performance prediction [41], knowledge tracing [42] and degree pathway recommendations [43]. With the increasing adoption of machine learning in education, fairness is a big concern. A biased machine learning model can negatively impact a minority group of students. For example, an unfair at-risk identification approach which predicts some group of students to have lower GPAs than others can discourage them. To mitigate bias, we apply our proposed model to educational setting and do a case study on student performance prediction.

VI-D1 Datasets

The dataset is collected at George Mason University from Fall 2009 to Fall 2019 including the top 5 largest majors, Biology (BIOL), Civil Engineering (CEIE), Computer Science (CS), Electrical Engineering (ECE), and Psychology (PSYC). For each course in a major, we build a model to predict if a student is going to fail that course (grade $<$ 3.0). The features fed into the model are students’ grades in courses taken prior to the target course. The sensitive attributes include gender (male/female) and race (African-American/Non-African-American). Table IV shows the statistics of the datasets.

VI-D2 Experimental Protocol

Following the experimental protocol of public dataset, we choose the same baselines and use the same evaluation metrics to evaluate the proposed model on removing bias in the setting of student performance prediction. For model selection, we choose discrimination.

VI-D3 Experimental Results

Table V shows the results using gender as sensitive attribute. From the results of LR, we can see that different majors have different levels of bias; PSYC has the least biased predictions, while CS has the highest biased predictions. Similar to the results on public datasets, it is interesting to see that Rawlsian is not able to remove bias and in some cases it leads to even more unfair predictions. Table VI shows the results using race as sensitive attribute. Compared to the results by using gender as sensitive attribute, predictions with race as sensitive attribute is more biased. For example, in terms of predictions of LR, the discriminations of using gender and race as sensitive attribute are 7.27% and 10.03%, respectively. Overall, the proposed model is able to remove bias for both race and gender. It achieves comparable or even better results than the baselines. Compared to two competitive baselines LFR and ALFR which need to first learn a fair representation and then train a classifier on it, CCB is easier to implement.

VII Conclusions

In this work, we proposed the concept of metric free individual fairness. Traditional definition of individual fairness needs a problem-specific similarity metric, which refrains its adoption. The proposed metric free individual fairness eliminates the requirement of a similarity metric and is easy to implement. The proposed gradient contextual bandit algorithm learns a stochastic policy which can be applied to other domains such as recommender systems and for information retrieval. We also proposed a CCB fair algorithm. The experimental results show the effectiveness of the proposed algorithm at removing unfairness. In the future, we plan to explore applying the proposed gradient contextual bandits to other domains. We also want to explore using different fairness measures as reward signals and see how they influence the performance of the model.

VIII Acknowledgements

This work was supported by the National Science Foundation grant 1447489 and DUE-1937905. The computational resources was provided by ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA. (URL:http://orc.gmu.edu)

References

[1] Y. Zhao, S. Szpiro, and S. Azenkot, “Foresee: A customizable head-mounted vision enhancement system for people with low vision,” in Proceedings of the 17th International ACM SIGACCESS Conference on Computers & Accessibility, 2015, pp. 239–249.
[2] O. Rudovic, J. Lee, M. Dai, B. Schuller, and R. W. Picard, “Personalized machine learning for robot perception of affect and engagement in autism therapy,” Science Robotics, vol. 3, no. 19, p. eaao6760, 2018.
[3] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi et al., “International evaluation of an ai system for breast cancer screening,” Nature, vol. 577, no. 7788, pp. 89–94, 2020.
[4] R. S. Baker and P. S. Inventado, “Educational data mining and learning analytics,” in Learning analytics. Springer, 2014, pp. 61–75.
[5] J. Buolamwini and T. Gebru, “Gender shades: Intersectional accuracy disparities in commercial gender classification,” in Proceedings of the 1st Conference on Fairness, Accountability and Transparency, ser. Proceedings of Machine Learning Research, S. A. Friedler and C. Wilson, Eds., vol. 81. New York, NY, USA: PMLR, 23–24 Feb 2018, pp. 77–91. [Online]. Available: http://proceedings.mlr.press/v81/buolamwini18a.html
[6] J. Angwin, J. Larson, S. Mattu, and L. Kirchner, “Machine bias,” ProPublica.
[7] M. Ali, P. Sapiezynski, M. Bogen, A. Korolova, A. Mislove, and A. Rieke, “Discrimination through optimization: How facebook’s ad delivery can lead to skewed outcomes,” arXiv preprint arXiv:1904.02095, 2019.
[8] P. Gajane, “On formalizing fairness in prediction with machine learning,” 10 2017.
[9] M. B. Zafar, I. Valera, M. Gomez-Rodriguez, and K. P. Gummadi, “Fairness constraints: A flexible approach for fair classification,” Journal of Machine Learning Research, vol. 20, no. 75, pp. 1–42, 2019. [Online]. Available: http://jmlr.org/papers/v20/18-262.html
[10] M. J. Kusner, J. Loftus, C. Russell, and R. Silva, “Counterfactual fairness,” in Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, Eds. Curran Associates, Inc., 2017, pp. 4066–4076. [Online]. Available: http://papers.nips.cc/paper/6995-counterfactual-fairness.pdf
[11] M. B. Zafar, I. Valera, M. Gomez Rodriguez, and K. P. Gummadi, “Fairness beyond disparate treatment & disparate impact,” Proceedings of the 26th International Conference on World Wide Web - WWW ’17, 2017. [Online]. Available: http://dx.doi.org/10.1145/3038912.3052660
[12] T. Calders and S. Verwer, “Three naive bayes approaches for discrimination-free classification,” Data Mining and Knowledge Discovery, vol. 21, no. 2, pp. 277–292, 2010.
[13] C. Dwork, M. Hardt, T. Pitassi, O. Reingold, and R. Zemel, “Fairness through awareness,” in Proceedings of the 3rd innovations in theoretical computer science conference. ACM, 2012, pp. 214–226.
[14] M. Feldman, S. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, “Certifying and removing disparate impact,” 2014.
[15] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” CoRR, vol. abs/1610.02413, 2016. [Online]. Available: http://arxiv.org/abs/1610.02413
[16] A. Chouldechova and A. Roth, “The frontiers of fairness in machine learning,” CoRR, vol. abs/1810.08810, 2018. [Online]. Available: http://arxiv.org/abs/1810.08810
[17] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, “A survey on bias and fairness in machine learning,” arXiv preprint arXiv:1908.09635, 2019.
[18] M. Feldman, S. A. Friedler, J. Moeller, C. Scheidegger, and S. Venkatasubramanian, “Certifying and removing disparate impact,” in proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, 2015, pp. 259–268.
[19] S. Corbett-Davies, E. Pierson, A. Feller, S. Goel, and A. Huq, “Algorithmic decision making and the cost of fairness,” in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2017, pp. 797–806.
[20] S. Barocas and A. D. Selbst, “Big data’s disparate impact,” Calif. L. Rev., vol. 104, p. 671, 2016.
[21] M. Hardt, E. Price, and N. Srebro, “Equality of opportunity in supervised learning,” in Advances in neural information processing systems, 2016, pp. 3315–3323.
[22] X. Gitiaux and H. Rangwala, “Multi-differential fairness auditor for black box classifiers,” arXiv preprint arXiv:1903.07609, 2019.
[23] C. Ilvento, “Metric learning for individual fairness,” arXiv preprint arXiv:1906.00250, 2019.
[24] Y. Bechavod, C. Jung, and Z. S. Wu, “Metric-free individual fairness in online learning,” arXiv preprint arXiv:2002.05474, 2020.
[25] M. Joseph, M. Kearns, J. H. Morgenstern, and A. Roth, “Fairness in learning: Classic and contextual bandits,” in Advances in Neural Information Processing Systems, 2016, pp. 325–333.
[26] R. Zemel, Y. Wu, K. Swersky, T. Pitassi, and C. Dwork, “Learning fair representations,” in International Conference on Machine Learning, 2013, pp. 325–333.
[27] H. Edwards and A. Storkey, “Censoring representations with an adversary,” arXiv preprint arXiv:1511.05897, 2015.
[28] C. Louizos, K. Swersky, Y. Li, M. Welling, and R. Zemel, “The variational fair autoencoder,” arXiv preprint arXiv:1511.00830, 2015.
[29] N. Cesa-Bianchi, C. Gentile, and G. Zappella, “A gang of bandits,” in Advances in Neural Information Processing Systems, 2013, pp. 737–745.
[30] Q. Wu, H. Wang, Q. Gu, and H. Wang, “Contextual bandits in a collaborative environment,” in Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval, 2016, pp. 529–538.
[31] R. K. Kolla, K. Jagannathan, and A. Gopalan, “Collaborative learning of stochastic bandits over a social network,” IEEE/ACM Transactions on Networking, vol. 26, no. 4, pp. 1782–1795, 2018.
[32] P. C. Landgren, “Distributed multi-agent multi-armed bandits,” Ph.D. dissertation, Princeton University, 2019.
[33] L. Li, W. Chu, J. Langford, and R. E. Schapire, “A contextual-bandit approach to personalized news article recommendation,” in Proceedings of the 19th international conference on World wide web. ACM, 2010, pp. 661–670.
[34] S. Agrawal and N. Goyal, “Thompson sampling for contextual bandits with linear payoffs,” in International Conference on Machine Learning, 2013, pp. 127–135.
[35] C. Russell, M. J. Kusner, J. Loftus, and R. Silva, “When worlds collide: integrating different counterfactual assumptions in fairness,” in Advances in Neural Information Processing Systems, 2017, pp. 6414–6423.
[36] R. J. Williams, “Simple statistical gradient-following algorithms for connectionist reinforcement learning,” Machine learning, vol. 8, no. 3-4, pp. 229–256, 1992.
[37] R. H. Hahnloser, R. Sarpeshkar, M. A. Mahowald, R. J. Douglas, and H. S. Seung, “Digital selection and analogue amplification coexist in a cortex-inspired silicon circuit,” Nature, vol. 405, no. 6789, pp. 947–951, 2000.
[38] D. Dua and C. Graff, “UCI machine learning repository,” 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[39] P. Brierley, D. Vogel, and R. Axelrod, “Heritage provider network health prize round 1 milestone prize: How we did it–team ‘market makers’,” 2011.
[40] J. C. Gower, “A general coefficient of similarity and some of its properties,” Biometrics, pp. 857–871, 1971.
[41] A. Elbadrawy, A. Polyzou, Z. Ren, M. Sweeney, G. Karypis, and H. Rangwala, “Predicting student performance using personalized analytics,” Computer, vol. 49, no. 4, pp. 61–69, 2016.
[42] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein, “Deep knowledge tracing,” in Advances in neural information processing systems, 2015, pp. 505–513.
[43] A. Elbadrawy and G. Karypis, “Domain-aware grade prediction and top-n course recommendation,” in Proceedings of the 10th ACM Conference on Recommender Systems, 2016, pp. 183–190.

$\displaystyle\nabla J_{t}(\theta)$	$\displaystyle=\sum_{a}\nabla\pi_{\theta}(a\|x_{t})q(x_{t},a)$	(5)
	$\displaystyle=\sum_{a}\pi_{\theta}(a\|x_{t})\frac{\nabla\pi_{\theta}(a\|x_{t})}{\pi_{\theta}(a\|x_{t})}q(x_{t},a)$
	$\displaystyle=E_{\pi_{\theta}}[\nabla\log\pi_{\theta}(a\|x_{t})q(x_{t},a)]$
	$\displaystyle=E_{\pi_{\theta}}[\nabla\log\pi_{\theta}(a_{t}\|x_{t})q(x_{t},a_{t})]$
	$\displaystyle=E_{\pi_{\theta}}[\nabla\log\pi_{\theta}(a_{t}\|x_{t})R_{t}]$
	$\displaystyle=\nabla\log\pi_{\theta}(a_{t}\|x_{t})R_{t}.$

	$\displaystyle\nabla^{2}\log\pi_{s}(a\|x)$	$\displaystyle=(1-\pi(a\|x))^{2}\nabla^{2}\phi(x,a)\nabla^{2}\pi_{s}(a\|x)$		(12)
		$\displaystyle-(1-\pi(a\|x))^{2}\nabla\phi(x,a)\nabla\phi(x,a)^{T}$		(12)

$\displaystyle J(\theta_{s}^{k+1})$	$\displaystyle\geq J(\theta_{s}^{k})+\langle\nabla J(\theta_{s}^{k}),\theta_{s}^{k+1}-\theta_{s}^{k}\rangle-\frac{L}{2}\|\|\theta_{s}^{k+1}-\theta_{s}^{k}\|\|_{2}^{2}$	(18)
	$\displaystyle\geq J(\theta_{s}^{k})+\alpha\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}-\frac{L}{2}\|\|\theta_{s}^{k+1}-\theta_{s}^{k}\|\|_{2}^{2}$
	$\displaystyle\geq J(\theta_{s}^{k})+\alpha\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}-\frac{L\alpha^{2}}{2}\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$
	$\displaystyle\geq J(\theta_{s}^{k})+\frac{2\alpha-L\alpha^{2}}{2}\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$
	$\displaystyle\geq J(\theta_{s}^{k})+M\|\|\nabla J(\theta_{s}^{k})\|\|_{2}^{2}$