
Long-Term Fairness with Unknown Dynamics

Tongxin Yin*
Computer Science and Engineering
University of Michigan
Ann Arbor, MI 48109
[email protected]

Reilly Raab*
Computer Science and Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
[email protected]

Mingyan Liu
Computer Science and Engineering
University of Michigan
Ann Arbor, MI 48109
[email protected]

Yang Liu
Computer Science and Engineering
University of California, Santa Cruz
Santa Cruz, CA 95064
[email protected]

*These authors contributed equally to this work.
Abstract

While machine learning can myopically reinforce social inequalities, it may also be used to dynamically seek equitable outcomes. In this paper, we formalize long-term fairness in the context of online reinforcement learning. This formulation can accommodate dynamical control objectives, such as driving equity inherent in the state of a population, that cannot be incorporated into static formulations of fairness. We demonstrate that this framing allows an algorithm to adapt to unknown dynamics by sacrificing short-term incentives to drive a classifier-population system towards more desirable equilibria. For the proposed setting, we develop an algorithm that adapts recent work in online learning. We prove that this algorithm achieves simultaneous probabilistic bounds on cumulative loss and cumulative violations of fairness (as statistical regularities between demographic groups). We compare our proposed algorithm to the repeated retraining of myopic classifiers, as a baseline, and to a deep reinforcement learning algorithm that lacks safety guarantees. Our experiments model human populations according to evolutionary game theory and integrate real-world datasets.

1 Introduction

As machine learning (ML) algorithms are deployed for tasks with real-world social consequences (e.g., school admissions, loan approval, medical interventions, etc.), the possibility exists for runaway social inequalities (Crawford and Calo, 2016; Chaney et al., 2018; Fuster et al., 2018; Ensign et al., 2018). While “fairness” has become a salient ethical concern in contemporary research, the closed-loop dynamics of real-world systems comprising ML policies and populations that mutually adapt to each other (Fig. 5 in the supplementary material) remain poorly understood.

In this paper, our primary contribution is to consider the problem of long-term fairness, or algorithmic fairness in the context of a dynamically responsive population, as a reinforcement learning (RL) problem subject to constraint. In our formulation, the central learning task is to develop a policy that minimizes cumulative loss (e.g., financial risk, negative educational outcomes, misdiagnoses, etc.) incurred by an ML agent interacting with a human population up to a finite time horizon, subject to constraints on cumulative “violations of fairness”, which we refer to in a single time step as disparity and cumulatively as distortion.

Our central hypothesis is that an RL formulation of long-term fairness may allow an agent to learn to sacrifice short-term utility in order to drive the system towards more desirable equilibria. The core practical difficulties posed by our general problem formulation, however, are the potentially unknown dynamics of the system under control, which must be determined by the RL agent online (i.e., during actual deployment), and the general non-convexity of the losses or constraints considered. Additionally, we address continuous state and action spaces, in general, which preclude familiar methods with performance guarantees in discrete settings.

Our secondary contributions are 1) to show that long-term fairness can be solved within asymptotic, probabilistic bounds under certain dynamical assumptions and 2) to demonstrate that the problem of long-term fairness can also be addressed more flexibly. For theoretical guarantees, we develop L-UCBFair, an online RL method, and prove sublinear bounds on regret (suboptimality of cumulative loss) and distortion (suboptimality of cumulative disparity) with high probability (Section 3.1). To demonstrate practical solutions without limiting assumptions, we apply R-TD3, an adaptation of a well-known deep reinforcement learning method (viz., TD3), to a Lagrangian relaxation of the central problem with time-dependent regularization. We compare L-UCBFair and R-TD3 to a baseline, myopic policy in interaction with simulated populations initialized with synthetic or real-world data and updated according to evolutionary game theory (Section 4).

This paper considers fairness in terms of statistical regularities across (ideally) socioculturally meaningful groups. Acknowledging that internal conflict exists between different statistical measures of fairness, we show that an RL approach to long-term fairness can mitigate trade-offs between fairness defined on the statistics of immediate policy decision outcomes (Chen et al., 2022) (e.g., acceptance rate disparities (Dwork et al., 2012; Zemel et al., 2013; Feldman et al., 2015)) and underlying distributional parameters (e.g., qualification rate (Raab and Liu, 2021; Zhang et al., 2020)).

1.1 Related Work

Our effort to formalize long-term fairness as a reinforcement learning problem bridges recent work on “fairness in machine learning”, which has developed in response to the proliferation of data-driven methods in society, and “safe reinforcement learning”, which seeks theoretical safety guarantees in the control of dynamical systems.

Dynamics of Fairness in Machine Learning  We distinguish long-term fairness from the dynamics of fair allocation problems Joseph et al. (2016); Jabbari et al. (2017); Tang et al. (2021); Liu et al. (2017) and emphasize side-effects of algorithmic decisions affecting future decision problems. By formalizing long-term fairness in terms of cumulative losses and disparities, we iterate on a developing research trend that accounts for the dynamical response of a human population to deployed algorithmic prediction: either as a singular reaction Liu et al. (2018); Hu et al. (2019); Perdomo et al. (2020) or as a sequence of mutual updates to the population and the algorithm Coate and Loury (1993); D’Amour et al. (2020); Zhang et al. (2020); Heidari et al. (2019); Wen et al. (2019); Liu et al. (2020); Hu and Chen (2018); Mouzannar et al. (2019); Williams and Kolter (2019); Raab and Liu (2021). In particular, Perdomo et al. (2020) introduces the concept of “performative prediction”, analyzing the fixed points of interactions between a population and an algorithmic classifier, but with state treated as a pure function of the classifier’s actions. For more realistic dynamics, Mouzannar et al. (2019) and Raab and Liu (2021) model updates to qualification rates that depend on both the previous state and the classifier’s actions, but they only treat myopic classifiers that optimize immediate utility (subject to fairness constraints) rather than learning to anticipate dynamical population responses. Wen et al. (2019) adopt reinforcement learning to achieve long-term fairness, but they explore a setting restricted to discrete state and action spaces. In particular, that setting recommends a tabular explore-then-commit approach to reinforcement learning that does not generalize to continuous spaces and thus cannot be directly used in our setting. In addition, we provide separate and tighter bounds for both utility and disparity.

Safe Reinforcement Learning  L-UCBFair furthers recent efforts in safe RL. While “model-based” approaches, in which the algorithm learns an explicit dynamical model of the environment, constitute one thread of prior work Efroni et al. (2020); Singh et al. (2020); Brantley et al. (2020); Zheng and Ratliff (2020); Kalagarla et al. (2021); Liu et al. (2021); Ding et al. (2021), such algorithms are typified by significant time and space complexity. Among “model-free” algorithms, the unknown dynamics of our setting preclude the use of a simulator that can generate arbitrary state-action pairs Xu et al. (2021); Ding et al. (2020); Bai et al. (2022). While Wei et al. (2022) introduce a model-free and simulator-free algorithm, the tabular setting considered is only applicable to discrete state and action spaces. To tackle continuous state spaces, Ding et al. (2021); Ghosh et al. (2022) consider linear dynamics: Ding et al. (2021) develop a primal-dual algorithm with safe exploration, and Ghosh et al. (2022) use a softmax policy design. Both algorithms are based on the work of Jin et al. (2020), which proposed a least-squares value iteration method, using an Upper Confidence Bound (UCB) Auer et al. (2002) to estimate a state-action “Q” function. To our knowledge, L-UCBFair is the first model-free, simulator-free RL algorithm that provides theoretical safety guarantees for both discrete and continuous state and action spaces. Moreover, L-UCBFair achieves bounds on regret and distortion as tight as any algorithm thus far with a discrete action space Ghosh et al. (2022).

2 Problem Formulation

A binary classification task is an initial motivating example for our formulation of long-term fairness, though the formal problem we propose is more widely applicable. To this initial task, we introduce “fairness” constraints, then population dynamics, and then cumulative loss and “disparity”, before formalizing the problem of optimizing cumulative loss subject to constraints on cumulative disparity.

We provide notation for the binary classification task: a random individual, sampled i.i.d. from a population, has features X\in\mathbf{R}^{d}, a label Y\in\{-1,1\}, and a demographic group G\in\mathcal{G} (where \mathcal{G}=[n] for n\geq 2). Denote the joint distribution of these variables in the population as s\coloneqq\Pr(X,Y,G). The task is to predict Y (as \hat{Y}) from X and G. Specifically, the task is to choose a classifier a, such that \hat{Y}\sim a(X,G), that minimizes some bounded loss \mathscr{L}\in[0,1] over s. This basic classification task is

\min_{a}\quad\mathscr{L}(s,a)

Typically, \mathscr{L} corresponds to the expectation value of a loss function L (such as zero-one loss) defined for individuals drawn from s. We consider this example, but, in general, allow arbitrary, (unit-interval) bounded loss functions \mathscr{L}:

\mathscr{L}(s,a)\stackrel{e.g.}{=}\operatorname*{E}_{\substack{X,Y,G\sim s\\ \hat{Y}\sim a(X,G)}}\big[L(Y,\hat{Y})\big]

The standard “fair” classification task (without a dynamically responsive population) is to constrain classifier a such that the disparity \mathscr{D}\in[0,1] induced on distribution s by a is bounded by some value c\in[0,1]:

\min_{a}\quad\mathscr{L}(s,a)\quad\text{subject to}\quad\mathscr{D}(s,a)\leq c

A standard example of disparity is the expected divergence of group acceptance rates \beta, e.g., when \mathcal{G}=\{g_{1},g_{2}\},

\mathscr{D}(s,a)\stackrel{e.g.}{=}\big|\beta_{s,a}(g_{1})-\beta_{s,a}(g_{2})\big|^{2},\quad\text{where }\beta_{s,a}(g)\coloneqq\Pr_{\substack{X,Y,G\sim s\\ \hat{Y}\sim a(X,G)}}(\hat{Y}{=}1\mid G{=}g)

In this example, \mathscr{D} measures the violation of “demographic parity”: the condition under which a classifier achieves group-independent positive classification rates, \forall g,\Pr(\hat{Y}{=}1\mid G{=}g)=\Pr(\hat{Y}{=}1) Dwork et al. (2012). In this paper, we also consider measures of fairness based on inherent population statistics (e.g., parity of group qualification rates \Pr(Y{=}1\mid G{=}g)), which must be driven dynamically Raab and Liu (2021); Zhang et al. (2020). Such notions of disparity are well-suited to an RL formulation of long-term fairness.
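To ground this running example, the short sketch below estimates the group acceptance rates and the demographic-parity disparity from a finite sample. The function and variable names, and the NumPy usage, are our own illustrative choices rather than part of the paper's implementation.

```python
import numpy as np

def demographic_parity_disparity(y_hat, group):
    """Empirical estimate of the example disparity D(s, a): squared gap in group acceptance rates."""
    y_hat, group = np.asarray(y_hat), np.asarray(group)
    beta = [np.mean(y_hat[group == g] == 1) for g in (0, 1)]  # acceptance rate per group
    return (beta[0] - beta[1]) ** 2

# Toy usage: predictions in {-1, 1} for two groups of a random sample.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=1000)
y_hat = np.where(rng.random(1000) < 0.6 + 0.1 * group, 1, -1)  # group 1 is accepted more often
print(demographic_parity_disparity(y_hat, group))
```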

State, Action, and Policy

Considering the semantics of iterated classification tasks, we identify the distribution s\in\mathcal{S} of individuals in the population as a state and the classifier a\in\mathcal{A} as an action. While state space \mathcal{S} is arbitrary, we assume that action space \mathcal{A} admits a Euclidean metric, under which it is closed (i.e., \mathcal{A} is isomorphic to [0,1]^{m}, m\in\mathbf{Z}_{>0}). At a given time \tau, a_{\tau} is sampled stochastically according to the current policy \pi_{\tau}: a_{\tau}\sim\pi_{\tau}(s_{\tau}). We assume s_{\tau} is fully observable at time \tau. In practice, s_{\tau} must be approximated from finitely many empirical samples, though this caveat introduces well-understood errors that vanish in the limit of infinitely many samples.

Dynamics

In contrast to a “one-shot” fair classification task, we assume that a population may react to classification, inducing the distribution s to change. Importantly, such “distribution shift” is a well-known, real-world phenomenon that can increase realized loss and disparity when deployed classification policies are fixed Chen et al. (2022). For classification policies that are free to change in response to a mutating distribution s, subsequent classification tasks depend on the (stochastic) predictions made in previous tasks. In our formulation, we assume the existence of a dynamical kernel \mathbf{P} that maps a state s and action a at time \tau to a distribution over possible states at time \tau+1:

s_{\tau+1}\sim\mathbf{P}(s_{\tau},a_{\tau}) (1)

We stipulate that \mathbf{P} may be initially unknown, but it does not explicitly depend on time and may be reasonably approximated “online”. While real-world dynamics may depend on information other than the current distribution \Pr(X,Y,G) (e.g., exogenous parameters, history, or additional variables of state), we identify s with the current distribution for simplicity in our treatment of long-term fairness.
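The paper keeps the kernel \mathbf{P} abstract; purely for illustration, the following sketch wraps the state, action, and unknown transition kernel in an episodic-environment interface. The class name, the choice of qualification rates as state, and the placeholder dynamics inside step are all assumptions of this sketch, not the paper's model.

```python
import numpy as np

class PopulationEnv:
    """Sketch of the controlled system: s_{t+1} ~ P(s_t, a_t) with P hidden from the agent.

    The state is the pair of group qualification rates and the action is a pair of
    group-specific acceptance thresholds; the body of `step` stands in for the
    unknown kernel P and is purely illustrative.
    """

    def __init__(self, horizon=100, seed=0):
        self.horizon = horizon
        self.rng = np.random.default_rng(seed)

    def reset(self):
        self.t = 0
        self.state = self.rng.uniform(0.1, 0.9, size=2)  # random initial qualification rates
        return self.state

    def step(self, action):
        # Placeholder dynamics: qualification rates drift away from high thresholds, with small noise.
        drift = 0.1 * (1.0 - np.asarray(action, dtype=float)) - 0.05
        noise = 0.01 * self.rng.standard_normal(2)
        self.state = np.clip(self.state + drift + noise, 0.0, 1.0)
        self.t += 1
        return self.state, self.t >= self.horizon
```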

Reward and Utility

Because standard RL literature motivates maximizing reward rather than minimizing loss, let us define the instantaneous reward r\in[0,1] and a separate, instantaneous “utility” g\in[0,1] for an RL agent as

r(s_{\tau},a_{\tau})\coloneqq 1-\mathscr{L}(s_{\tau},a_{\tau}) (2)
g(s_{\tau},a_{\tau})\coloneqq 1-\mathscr{D}(s_{\tau},a_{\tau}) (3)

where r and g do not explicitly depend on time \tau.

Value and Quality Functions

Learnable dynamics inspire us to optimize anticipated cumulative reward, given constraints on anticipated cumulative utility. Let j represent either reward r or utility g. We use the letter V (for “value”) to denote the future expected accumulation of j over steps [h,\ldots,H] (without time-discounting) starting from state s, using policy \pi. Likewise, we denote the “quality” of an action a in state s with the letter Q. For j\in\{r,g\},

V_{j,h}^{\pi}(s)=\operatorname*{E}\Big[\sum_{\tau=h}^{H}j(s_{\tau},a_{\tau})\;\Big|\;s_{h}=s\Big] (4)
Q_{j,h}^{\pi}(s,a)=\operatorname*{E}\Big[\sum_{\tau=h}^{H}j(s_{\tau},a_{\tau})\;\Big|\;s_{h}=s,a_{h}=a\Big] (5)

By the boundedness of r,g\in[0,1], V and Q belong to the interval [0,H-h+1].
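As a point of reference, the value V_{r,1}^{\pi}(s) can be approximated by Monte Carlo rollouts of \pi from s. The sketch below assumes the hypothetical PopulationEnv-style interface from the previous sketch; it is not the estimator L-UCBFair uses (which is least-squares value iteration).

```python
import numpy as np

def monte_carlo_value(env, policy, reward_fn, s0, H, n_rollouts=100):
    """Estimate V^pi_{r,1}(s0) = E[sum_{tau=1}^{H} r(s_tau, a_tau)] by averaging H-step rollouts of pi."""
    totals = []
    for _ in range(n_rollouts):
        env.reset()
        env.state, env.t = np.array(s0, dtype=float), 0  # restart each rollout from s0
        s, total = env.state, 0.0
        for _ in range(H):
            a = policy(s)
            total += reward_fn(s, a)
            s, done = env.step(a)
            if done:
                break
        totals.append(total)
    return float(np.mean(totals))
```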

“Long-term fairness” via reinforcement learning

The central problem explored in this paper is

\max_{\pi}\quad V_{r,1}^{\pi}(s)\quad\text{subject to}\quad V_{g,1}^{\pi}(s)\geq\tilde{c} (6)

We emphasize that this construction of long-term fairness considers a finite time horizon of H steps, and we denote the optimal policy as \pi^{\star}.

Assumption 2.1 (Slater’s Condition).

i.e., “strict primal feasibility”: \exists\,\gamma>0 and \bar{\pi} such that V_{g,1}^{\bar{\pi}}(s)\geq\tilde{c}+\gamma.

Slater’s condition is also adopted in Efroni et al. (2020); Ding et al. (2021); Ghosh et al. (2022).

The Online Setting

Given initially unknown dynamics, we formulate long-term fairness for the “online” setting, in which learning is only possible through actual “live” deployment of policy, rather than through simulation. As it is not possible to unconditionally guarantee constraint satisfaction in Eq. 6 over a finite number of episodes, we instead measure two types of regret: one that measures the suboptimality of a policy with respect to cumulative incurred loss, which we will continue to call “regret”, and one that measures the suboptimality of a policy with respect to cumulative induced disparity, which we will call “distortion”. Note that we define regret and distortion in Eq. 7 by marginalizing over the stochasticity of state transitions and the sampling of actions:

\operatorname{Regret}(\pi,s_{1})\coloneqq V_{r,1}^{\pi^{*}}(s_{1})-V_{r,1}^{\pi}(s_{1}) (7)
\operatorname{Distortion}(\pi,s_{1})\coloneqq\max\big[0,\tilde{c}-V_{g,1}^{\pi}(s_{1})\big]

3 Algorithms and Analysis

We show that it is possible to provide guarantees for long-term fairness in the online setting. Specifically, we develop an algorithm, L-UCBFair, and prove that it provides probabilistic, sublinear bounds for regret and distortion under a linear MDP assumption (3.2) and properly chosen parameters (Section A.1, in the supplementary material). L-UCBFair is the first model-free algorithm to provide such guarantees in the online setting with continuous state and action spaces.

3.1 L-UCBFair

Episodic MDP

L-UCBFair inherits from a family of algorithms that treat an episodic Markov decision process (MDP) Jin et al. (2020). Therefore, we first map the long-term fairness problem to the episodic MDP setting, which we denote as \text{MDP}(\mathcal{S},\mathcal{A},H,\mathbf{P},\mathscr{L},\mathscr{D}). The algorithm runs for K episodes, each consisting of H time steps. At the beginning of each episode, which we index with k, the agent commits to a sequence of policies \pi^{k}=(\pi_{1}^{k},\pi_{2}^{k},\ldots,\pi_{H}^{k}) for the next H steps. At each step h within an episode, an action a_{h}^{k}\in\mathcal{A} is sampled according to policy \pi_{h}^{k}, then the state s_{h+1}^{k} is sampled according to the transition kernel \mathbf{P}(s_{h}^{k},a_{h}^{k}). \mathscr{L} and \mathscr{D} are the loss and disparity functions. The initial state s_{1}^{k} is sampled arbitrarily at the start of each episode.

Episodic Regret

Because L-UCBFair predetermines its policy for an entire episode, we amend our definitions of regret and distortion for all HK time steps by breaking them into a sum over K episodes of length H.

\operatorname{Regret}(K)=\sum_{k=1}^{K}\Big(V_{r,1}^{\pi^{*}}(s_{1}^{k})-V_{r,1}^{\pi^{k}}(s_{1}^{k})\Big) (8)
\operatorname{Distortion}(K)=\max\Big[0,\sum_{k=1}^{K}\big(\tilde{c}-V_{g,1}^{\pi^{k}}(s_{1}^{k})\big)\Big]
The Lagrangian

Consider the Lagrangian function \mathcal{L} associated with Eq. 6, with dual variable \nu\geq 0:

\mathcal{L}(\pi,\nu):=V_{r,1}^{\pi}(s)+\nu\big(V_{g,1}^{\pi}(s)-\tilde{c}\big) (9)

L-UCBFair approximately solves the primal problem \max_{\pi}\min_{\nu}\mathcal{L}(\pi,\nu), which is non-trivial, since the objective function is seldom concave in practical parameterizations of \pi. Let \nu^{*} denote the optimal value of \nu.

Lemma 3.1 (Boundedness of \nu^{*}).

For \bar{\pi} and \gamma>0 satisfying Slater’s Condition (Assumption 2.1),

\nu^{*}\leq\frac{V_{r,1}^{\pi^{*}}(s_{1})-V_{r,1}^{\bar{\pi}}(s_{1})}{\gamma}\leq\frac{H}{\gamma}\coloneqq\mathscr{V}

Lemma 3.1 defines H/\gamma=\mathscr{V} as an upper bound on the optimal dual variable \nu^{*}; \mathscr{V} is an input to L-UCBFair.

Assumption 3.2 (Linear MDP).

\text{MDP}(\mathcal{S},\mathcal{A},H,\mathbf{P},\mathscr{L},\mathscr{D}) is a linear MDP with feature map \phi:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{d}: For any h, there exist d signed measures \mu_{h}=\{\mu_{h}^{1},\ldots,\mu_{h}^{d}\} over \mathcal{S}, such that, for any (s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S},

\mathbb{P}_{h}(s^{\prime}\mid s,a)=\langle\phi(s,a),\mu_{h}(s^{\prime})\rangle,

In addition, there exist vectors \theta_{r,h},\theta_{g,h}\in\mathbb{R}^{d}, such that, for any (s,a)\in\mathcal{S}\times\mathcal{A},

r(s,a)=\langle\phi(s,a),\theta_{r,h}\rangle;\quad g(s,a)=\langle\phi(s,a),\theta_{g,h}\rangle

Assumption 3.2 addresses the curse of dimensionality when the state space \mathcal{S} is the space of distributions over X,Y,G. This assumption is also used in Jin et al. (2020); Ghosh et al. (2022), with a similar assumption made in Ding et al. (2021).
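Under Assumption 3.2, the action-value functions are linear in the feature map, so an agent only needs a d-dimensional weight vector per step and per objective. The sketch below illustrates this parameterization with a made-up feature map; in the paper's experiments \phi is learned (Section C.1).

```python
import numpy as np

d = 8  # feature dimension of the (hypothetical) map phi

def phi(s, a):
    """Stand-in feature map phi(s, a) in R^d. In the paper this map is learned (Section C.1);
    this cosine construction is purely illustrative."""
    z = np.concatenate([np.atleast_1d(s), np.atleast_1d(a)]).astype(float)
    freqs = np.arange(1, d + 1)[:, None]
    return np.cos(freqs * z[None, :]).mean(axis=1)

def q_linear(w, s, a):
    """Under the linear MDP assumption, Q_h(s, a) = <w_h, phi(s, a)> for a d-dimensional weight w_h."""
    return float(w @ phi(s, a))
```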

3.1.1 Explicit Construction

L-UCBFair, or “LSVI-UCB for Fairness” (Algorithm 1), is based on an optimistic modification of Least-Squares Value Iteration, where optimism is realized by an Upper Confidence Bound, as in LSVI-UCB Jin et al. (2020). For each H-step episode k, L-UCBFair maintains estimates for Q^{k}_{r},Q^{k}_{g} and a time-indexed policy \pi^{k}. In each episode k, L-UCBFair updates Q^{k}_{r},Q^{k}_{g}, interacts with the environment, and updates the dual variable \nu_{k}, which is held fixed within the episode.

LSVI-UCB Jin et al. (2020)

The estimation of Q is challenging. Consider the Bellman equation: \forall(s,a)\in\mathcal{S}\times\mathcal{A},

Q_{h}^{\star}(s,a)\leftarrow r_{h}(s,a)+\operatorname*{E}_{s^{\prime}\sim\mathbf{P}_{h}(\cdot\mid s,a)}\max_{a^{\prime}\in\mathcal{A}}Q_{h+1}^{\star}(s^{\prime},a^{\prime})

It is impossible to iterate over all (s,a) pairs, since \mathcal{S} and \mathcal{A} are continuous. Instead, LSVI parameterizes Q_{h}^{\star}(s,a) by the linear form \mathrm{w}_{h}^{\top}\phi(s,a), which is supported by Jin et al. (2020). As an additional complication, \mathbf{P} is unknown and is estimated by LSVI from empirical samples. Combining these techniques, the Bellman equation can be replaced by a least squares problem.

\mathbf{w}_{h}\leftarrow\underset{\mathbf{w}\in\mathbb{R}^{d}}{\operatorname{argmin}}\sum_{\tau=1}^{k-1}\Big[r_{h}(s_{h}^{\tau},a_{h}^{\tau})+\max_{a\in\mathcal{A}}Q_{h+1}(s_{h+1}^{\tau},a)-\mathbf{w}^{\top}\phi(s_{h}^{\tau},a_{h}^{\tau})\Big]^{2}+\varsigma\|\mathbf{w}\|^{2}

In addition, a “bonus term” \beta\big(\phi^{\top}\Lambda_{h}^{-1}\phi\big)^{1/2} is added to the estimate of Q to encourage exploration.
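A minimal sketch of these two ingredients, the regularized least-squares solve for \mathbf{w}_{h} and the UCB bonus added to the Q estimate, is shown below; the function names are ours, and the regression targets are assumed to have been computed from the replay of past episodes.

```python
import numpy as np

def lsvi_weights(phis, targets, varsigma=1.0):
    """Ridge-regression step of LSVI: given past features phi_tau and regression targets
    y_tau = r(s_h^tau, a_h^tau) + max_a' Q_{h+1}(s_{h+1}^tau, a'), return
    w = Lambda^{-1} sum_tau phi_tau y_tau with Lambda = sum_tau phi_tau phi_tau^T + varsigma I."""
    phis = np.asarray(phis, dtype=float)          # shape (k-1, d)
    targets = np.asarray(targets, dtype=float)    # shape (k-1,)
    Lambda = phis.T @ phis + varsigma * np.eye(phis.shape[1])
    w = np.linalg.solve(Lambda, phis.T @ targets)
    return w, Lambda

def optimistic_q(w, Lambda, phi_sa, beta, H):
    """Optimistic estimate min[<w, phi> + beta * sqrt(phi^T Lambda^{-1} phi), H]."""
    bonus = beta * np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return min(float(w @ phi_sa) + bonus, H)
```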

Adaptive Search Policy  Compared to the works of Ding et al. (2021) and Ghosh et al. (2022), the major challenge we face for long-term fairness is a continuous action space \mathcal{A}, which renders the independent computation of Q^{k}_{r},Q^{k}_{g} for each action impossible. To handle this issue, we propose an adaptive search policy: Instead of drawing an action directly from a distribution over continuous values, \pi represents a distribution over finitely many, finite regions of \mathcal{A}. After sampling a region from a Softmax scheme, the agent draws action a uniformly at random from it.

Definition 3.3.

Given a set of distinct actions I=\{I_{0},\cdots,I_{M}\}\subset\mathcal{A}, where \mathcal{A} is a closed set in Euclidean space, define \mathcal{I}_{i}=\{a\colon\|a-I_{i}\|_{2}\leq\|a-I_{j}\|_{2},\ \forall j>i\} as the subset of actions closer to I_{i} than to I_{j}, i.e., the Voronoi region corresponding to locus I_{i}, with tie-breaking imposed by the order of indices i. Also define the locus function I(a)=\min_{i}\operatorname*{argmin}_{I_{i}}\|a-I_{i}\|_{2}.

Lemma 3.4.

The Voronoi partitioning described above satisfies \mathcal{I}_{i}\cap\mathcal{I}_{j}=\varnothing,\ \forall i\neq j and \cup_{i=1}^{M}\mathcal{I}_{i}=\mathcal{A}.

Theorem 3.5.

If the number M of distinct loci or regions partitioning \mathcal{A} is sufficiently large, there exists a set of loci I such that \forall a\in\mathcal{I}_{i},\ i\in[M],\ \|a-I_{i}\|_{2}\leq\epsilon_{I}.
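A sketch of how the adaptive search policy can sample an action follows: compute the composite score Q_{r}+\nu Q_{g} at the loci, sample a locus through a softmax, then draw uniformly from its cell. For simplicity, the sketch assumes the loci form a uniform grid on [0,1]^{m} so each Voronoi cell is an axis-aligned box, which is a simplification of Definition 3.3.

```python
import numpy as np

def sample_action(loci, q_r, q_g, nu, alpha, half_width, rng):
    """Adaptive-search sketch: softmax over the loci of Q_r + nu * Q_g, then a uniform draw
    from the chosen locus's cell. Assumes the loci form a uniform grid on [0, 1]^m, so each
    Voronoi cell is an axis-aligned box with the given half-width (a simplification)."""
    loci = np.asarray(loci, dtype=float)
    if loci.ndim == 1:                                  # allow scalar (one-dimensional) actions
        loci = loci[:, None]
    scores = alpha * (np.asarray(q_r, dtype=float) + nu * np.asarray(q_g, dtype=float))
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    i = rng.choice(len(loci), p=probs)                  # softmax choice of locus
    a = loci[i] + rng.uniform(-half_width, half_width, size=loci[i].shape)
    return np.clip(a, 0.0, 1.0), i

# Toy usage with M = 5 loci on [0, 1] and a dual variable nu = 0.5.
rng = np.random.default_rng(0)
loci = np.linspace(0.1, 0.9, 5)
a, i = sample_action(loci, q_r=[1, 2, 3, 2, 1], q_g=[3, 2, 1, 2, 3],
                     nu=0.5, alpha=1.0, half_width=0.1, rng=rng)
```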

In addition to defining a Voronoi partitioning of the action space for the adaptive search policy of L-UCBFair, we make the following assumption:

Assumption 3.6.

There exists \rho>0, such that \|\phi(s,a)-\phi(s,a^{\prime})\|_{2}\leq\rho\|a-a^{\prime}\|_{2}.

This assumption is used to bound the distance between the estimated action-value function and the value function, and thus the gap between the adaptive search policy and the optimal policy.

Dual Update  For L-UCBFair, the update method for the dual variable \nu in Eq. 9 is also essential. Since V_{r,1}^{\pi}(s) and V_{g,1}^{\pi}(s) are unknown, we use V_{r,1}^{k}(s) and V_{g,1}^{k}(s) to estimate them. \nu is iteratively updated by minimizing Eq. 9 with step-size \eta, and \mathscr{V} is an upper bound for \nu. A similar method is also used in Ding et al. (2021); Ghosh et al. (2022).
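The dual step itself is a one-line projected update; the sketch below mirrors the last line of Algorithm 1, with v_g_hat standing in for the estimate V_{g,1}^{k}(s_{1}).

```python
def dual_update(nu, v_g_hat, c_tilde, eta, v_cap):
    """Projected dual step (cf. the last line of Algorithm 1): move nu along the estimated
    constraint violation c_tilde - V_{g,1}^k(s_1), then clip the result to [0, v_cap]."""
    return min(max(nu + eta * (c_tilde - v_g_hat), 0.0), v_cap)
```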

3.1.2 Analysis

We next bound the regret and distortion, defined in Eq. 8, realizable by L-UCBFair. We then compare L-UCBFair with existing algorithms for discrete action spaces and discuss the importance of the number of regions M and the distance \epsilon_{I}.

Theorem 3.7 (Boundedness).

With probability 1-p, there exists a constant b such that L-UCBFair achieves

\text{Regret}(K)\leq\big(b\zeta H^{2}\sqrt{d^{3}}+(\mathscr{V}+1)H\big)\sqrt{K} (10)
\text{Distortion}(K)\leq\frac{b\zeta(1+\mathscr{V})H^{2}\sqrt{d^{3}}}{\mathscr{V}}\sqrt{K} (11)

when selecting parameter values \varsigma=1, \epsilon_{I}\leq\frac{1}{2\rho(1+\mathscr{V})KH\sqrt{d}}, \zeta=\log(\log(M)4dHK/p), and \beta=\mathcal{O}(dH\sqrt{\zeta}).

For a detailed proof of Theorem 3.7, refer to Section A.1. Theorem 3.7 indicates that L-UCBFair reaches \tilde{\mathcal{O}}\big(H^{2}\sqrt{d^{3}K}\big) bounds for both regret and distortion with high probability. Compared to the algorithms introduced by Ding et al. (2021); Ghosh et al. (2022), which work with a discrete action space, L-UCBFair guarantees the same asymptotic bounds on regret and distortion.

Algorithm 1 L-UCBFair
  Input: A set of loci \{I_{0},I_{1},\cdots,I_{M}\} satisfying Definition 3.3. \epsilon_{I}=\frac{1}{2\rho(1+\mathscr{V})KH\sqrt{d}}. \nu_{1}=0. w_{r,h}=w_{g,h}=0. \alpha=\frac{\log(M)K}{2(1+\mathscr{V}+H)}. \eta=\mathscr{V}/\sqrt{KH^{2}}. \beta=C_{1}dH\sqrt{\log(4\log(M)dT/p)}. \varsigma=1.
  for episode k=1,2,\ldots,K do
     Receive the initial state s^{k}_{1}.
     for step h=H,H-1,\cdots,1 do
        \Lambda_{h}^{k}\leftarrow\sum_{\tau=1}^{k-1}\phi(s_{h}^{\tau},a_{h}^{\tau})\phi(s_{h}^{\tau},a_{h}^{\tau})^{\top}+\varsigma\mathbf{I}
        for j\in\{r,g\} do
           w_{j,h}^{k}\leftarrow(\Lambda_{h}^{k})^{-1}\Big[\sum_{\tau=1}^{k-1}\phi(s_{h}^{\tau},a_{h}^{\tau})\big(j(s_{h}^{\tau},a_{h}^{\tau})+V_{j,h+1}^{k}(s_{h+1}^{\tau})\big)\Big]
        end for
        for iteration i=1,\cdots,M and index j\in\{r,g\} do
           \xi_{i,j}\leftarrow\big(\phi(\cdot,I_{i})^{\top}(\Lambda_{h}^{k})^{-1}\phi(\cdot,I_{i})\big)^{1/2},\quad Q_{j,h}^{k}(\cdot,I_{i})\leftarrow\min\big[\langle w_{j,h}^{k},\phi(\cdot,I_{i})\rangle+\beta\xi_{i,j},\,H\big]
        end for
        \text{SM}_{h,k}(I_{i}\mid\cdot)=\frac{\exp\big(\alpha\big(Q_{r,h}^{k}(\cdot,I_{i})+\nu_{k}Q_{g,h}^{k}(\cdot,I_{i})\big)\big)}{\sum_{j}\exp\big(\alpha\big(Q_{r,h}^{k}(\cdot,I_{j})+\nu_{k}Q_{g,h}^{k}(\cdot,I_{j})\big)\big)}
        \pi_{h}^{k}(a\mid\cdot)\leftarrow\frac{1}{\int_{b\in\mathcal{I}(a)}db}\,\text{SM}_{h,k}(I(a)\mid\cdot)
        V_{r,h}^{k}(\cdot)\leftarrow\int_{a\in\mathcal{A}}\pi_{h}^{k}(a\mid\cdot)\,Q_{r,h}^{k}(\cdot,a)\,da,\quad V_{g,h}^{k}(\cdot)\leftarrow\int_{a\in\mathcal{A}}\pi_{h}^{k}(a\mid\cdot)\,Q_{g,h}^{k}(\cdot,a)\,da
     end for
     for step h=1,\cdots,H do
        Compute Q_{r,h}^{k}(s_{h}^{k},I_{i}), Q_{g,h}^{k}(s_{h}^{k},I_{i}), \pi(I_{i}\mid s_{h}^{k}).
        Take action a_{h}^{k}\sim\pi_{h}^{k}(\cdot\mid s_{h}^{k}) and observe s_{h+1}^{k}.
     end for
     \nu_{k+1}=\max\big\{\min\big\{\nu_{k}+\eta\big(\tilde{c}-V_{g,1}^{k}(s_{1})\big),\,\mathscr{V}\big\},\,0\big\}
  end for

3.2 R-TD3

Because assumptions made in Section 3.1 (e.g., the linear MDP assumption) are often violated in the real world, we also consider more flexible and arguably powerful deep reinforcement learning methods. Concretely, we utilize “Twin-Delayed Deep Deterministic Policy Gradient” (TD3) Fujimoto et al. (2018) with the implementation and default parameters provided by the open-source package “Stable Baselines 3” Raffin et al. (2021) on a (Lagrangian) relaxation of the long-term fairness problem. We term this specific algorithm for long-term fairness R-TD3.

In general, methods such as TD3 lack provable safety guarantees; however, they may still confirm our hypothesis that agents trained via RL can learn to sacrifice short-term utility in order to drive dynamics towards preferable long-term states, and they can explicitly incorporate dynamical control objectives provided as functions of state.

To treat the long-term fairness problem (Eq. 6) using unconstrained optimization techniques (i.e., methods like TD3), we consider a time-dependent Lagrangian relaxation of Eq. 6. We train R-TD3 to optimize

\min_{\pi}\operatorname*{E}_{a_{\tau}\sim\pi(s_{\tau})}\left[\sum_{\tau=1}^{H}\big[\kappa_{\tau}\mathscr{L}(s_{\tau},a_{\tau})+\lambda_{\tau}\mathscr{D}(s_{\tau},a_{\tau})\big]\right] (12)

where s_{\tau+1}\sim\mathbf{P}(s_{\tau},a_{\tau}), \lambda_{\tau}=\tau/H, and \kappa_{\tau}=1-\lambda_{\tau}.

Strictly applied, myopic fairness constraints can lead to undesirable dynamics and equilibria Raab and Liu (2021). Relaxing these constraints (hard \to soft) for the near future while emphasizing them long-term, we hope to develop classifiers that learn to transition to more favorable equilibria.
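Concretely, the relaxation in Eq. 12 amounts to reward shaping with a time-dependent weight. A minimal sketch of the per-step reward we would hand to an off-the-shelf TD3 agent is shown below; the function name is ours, and the schedule simply follows \lambda_{\tau}=\tau/H.

```python
def shaped_reward(loss, disparity, t, H):
    """Per-step reward for R-TD3 (our reading of Eq. 12, negated because TD3 maximizes reward):
    lambda_t = t / H gradually shifts weight from the loss term to the disparity term."""
    lam = t / H
    return -((1.0 - lam) * loss + lam * disparity)
```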

3.3 Greedy Baseline

In our experiments, we compare L-UCBFair and R-TD3 to a “Greedy Baseline” agent as a proxy for a myopic status quo in which policy is repeatedly determined by optimizing for immediate utility, without regard for the population dynamics induced by algorithmic actions. Our chosen algorithm for the greedy baseline is simply gradient descent on f, defined as loss regularized by disparity, performed anew at each time step with fixed parameter \lambda.

f_{\tau}(\pi)=\operatorname*{E}_{a_{\tau}\sim\pi}\big[(1-\lambda)\mathscr{L}(s_{\tau},a_{\tau})+\lambda\mathscr{D}(s_{\tau},a_{\tau})\big] (13)

While such an algorithm does not guarantee constraint satisfaction, it is nonetheless “constraint aware” in precisely the same way as a firm that (probabilistically) incurs penalties for violating constraints.
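A sketch of one baseline update is shown below; finite differences stand in for whatever gradient oracle the loss and disparity admit, so the helper names and step sizes are illustrative assumptions.

```python
import numpy as np

def greedy_step(A, loss_fn, disp_fn, lam=0.5, lr=0.05, eps=1e-3):
    """One myopic baseline update: numerical gradient descent on
    f(A) = (1 - lam) * L(s, A) + lam * D(s, A) with respect to the group thresholds A,
    re-run from scratch at every time step."""
    A = np.asarray(A, dtype=float)
    f = lambda x: (1.0 - lam) * loss_fn(x) + lam * disp_fn(x)
    grad = np.array([(f(A + eps * e) - f(A - eps * e)) / (2 * eps) for e in np.eye(len(A))])
    return A - lr * grad
```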

4 Simulated Environments

We describe our experiments with the algorithms we have detailed for long-term fairness as an RL problem: We consider a series of binary (Y\in\{-1,1\}) classification tasks on a population of two groups \mathcal{G}=\{g_{1},g_{2}\} modeled according to evolutionary game theory (using replicator dynamics). We consider two families of distributions of real-valued features for the population: One that is purely synthetic, for which X\sim\mathcal{N}(Y,1), independent of group G, and one that is based on a logistic regression to real-world data. Both families of distributions are parameterized by the joint distribution \Pr(Y,G). RL agents are trained on episodes of length H initialized with randomly sampled states.

4.1 Setting

The following assumptions simplify our hypothesis space for classifiers in order to better handle continuous state space. These assumptions appeared in Raab and Liu (2021).

Assumption 4.1 (Well-behaved feature).

For purely synthetic data, we require X to be a “well-behaved” real-valued feature or “score” within each group. That is,

\forall g,\quad\Pr(Y{=}1\mid G{=}g,X{=}x)\text{ strictly increases in }x

As an intuitive example of Assumption 4.1, if Y represents qualification for a fixed loan and X represents credit score, we require higher credit scores to strongly imply a higher likelihood that an individual is qualified for the loan.

Theorem 4.2 (Threshold Bayes-optimality).

For each group g, when Assumption 4.1 is satisfied, the Bayes-optimal, deterministic binary classifier is a threshold policy

\hat{Y}=1\text{ if }x\geq A_{g}\text{ and }-1\text{ otherwise}

where A_{g} is the feature threshold for group g.

As a result of Theorem 4.2, we consider our action space to be the space of group-specific thresholds, and denote an individual action as the vector \mathbf{A}\coloneqq(A_{1},A_{2},\ldots,A_{n}).
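The action therefore reduces to a vector of group thresholds; the sketch below applies such an action to sampled individuals under the synthetic model X\sim\mathcal{N}(Y,1). Names and sample sizes are illustrative.

```python
import numpy as np

def threshold_classify(x, group, A):
    """Apply the threshold action A = (A_1, ..., A_n): predict +1 iff an individual's
    scalar feature x meets the threshold of their group (cf. Theorem 4.2)."""
    x, group, A = np.asarray(x), np.asarray(group), np.asarray(A)
    return np.where(x >= A[group], 1, -1)

# Toy usage with the synthetic feature X ~ N(Y, 1) and two equally sized groups.
rng = np.random.default_rng(0)
y = rng.choice([-1, 1], size=1000)
x = rng.normal(loc=y, scale=1.0)
g = rng.integers(0, 2, size=1000)
y_hat = threshold_classify(x, g, A=[0.5, 0.4])
```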

4.2 Replicator Dynamics

Our use of replicator dynamics closely mirrors that of Raab and Liu (2021) as an “equitable” model of a population, in which individuals may be modeled identically, independently of group membership, yet persistent outcome disparities may nonetheless emerge from disparate initial conditions between groups. In particular, we parameterize the evolving distribution \Pr(X,Y\mid G), assuming constant group sizes, in terms of “qualification rates” q_{g}\coloneqq\Pr(Y{=}1\mid G{=}g) and update these qualification rates according to the discrete-time replicator dynamics:

q_{g}[t+1]=q_{g}[t]\frac{W_{1}^{g}[t]}{\overline{W}^{g}[t]};\quad\overline{W}^{g}[t]\coloneqq W_{1}^{g}[t]\,q_{g}[t]+(1-q_{g}[t])\,W_{-1}^{g}[t]

In this model, the fitness W_{y}^{g}>0 of label Y{=}y in group G{=}g may be interpreted as the “average utility to the individual” in group g of possessing label y, and thus the relative replication rate of label y in group g, as agents update their labels by mimicking the successful strategies of in-group peers. Also following Raab and Liu (2021), we model W_{y}^{g} in terms of acceptance and rejection rates with a group-independent utility matrix U:

W_{y}^{g}=\sum_{\hat{y}\in\{{-}1,1\}}U_{y,\hat{y}}\Pr(\hat{Y}{=}\hat{y}\mid Y{=}y,G{=}g)

We choose the matrix U to eliminate dominant strategies (i.e., cases in which agents prefer one label over the other independent of classification), to assert that agents always prefer acceptance over rejection, and to imply that the costs of qualification are greater than the costs of non-qualification among accepted individuals. While other parameterizations of U are valid, this choice of parameters guarantees internal equilibrium of the replicator dynamics for a Bayes-optimal classifier and a “well-behaved” scalar-valued feature X, such that \Pr(Y{=}1\mid X{=}x) is monotonically increasing in x Raab and Liu (2021).
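For concreteness, the sketch below performs one replicator update for two groups under the synthetic score model X\sim\mathcal{N}(y,1); the numeric entries of the utility matrix U are illustrative assumptions consistent with the qualitative constraints above, not the values used in the paper.

```python
import math

def replicator_step(q, A, U):
    """One discrete-time replicator update per group,
    q_g <- q_g * W_1^g / (q_g * W_1^g + (1 - q_g) * W_{-1}^g),
    where the fitness W_y^g = sum_{yhat} U[y][yhat] * Pr(Yhat = yhat | Y = y, G = g) uses
    acceptance probabilities for X ~ N(y, 1) under the group threshold A_g."""
    Phi = lambda z: 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF
    q_next = []
    for qg, Ag in zip(q, A):
        acc = {+1: 1.0 - Phi(Ag - 1.0), -1: 1.0 - Phi(Ag + 1.0)}  # Pr(accept | Y = y)
        W = {y: U[y][+1] * acc[y] + U[y][-1] * (1.0 - acc[y]) for y in (+1, -1)}
        q_next.append(qg * W[+1] / (qg * W[+1] + (1.0 - qg) * W[-1]))
    return q_next

# Illustrative utility matrix: acceptance always preferred; qualification costly when accepted.
U = {+1: {+1: 0.9, -1: 0.1}, -1: {+1: 1.0, -1: 0.2}}
print(replicator_step(q=[0.4, 0.6], A=[0.5, 0.5], U=U))
```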

4.3 Data Synthesis and Processing

In addition to a synthetic distribution, for which we assume X\sim\mathcal{N}(Y,1), independent of G, for all time, we also consider real-world distributions in simulating and comparing algorithms for “long-term fairness”. In both cases, as mentioned above, we wish to parameterize distributions in terms of qualification rates q_{g}. As we perform binary classification on discrete groups and scalar-valued features, in addition to parameterizing a distribution in terms of q_{g}, we desire a scalar-valued feature for each example, rather than the multi-dimensional features common to real-world data. Our solution to parameterize a distribution of groups and scalar features is to use an additional learning step for “preprocessing”: Given a static dataset \mathcal{D} from which (X^{\prime},Y,G) is drawn i.i.d. (e.g., the “Adult Data Set” Dua and Graff (2017)), at each time step, we train a stochastic binary classifier \tilde{a}, such that \hat{Y}^{\prime}\sim\tilde{a}(X^{\prime},G), with a loss that re-weights examples by label value in order to simulate the desired q_{g}: \min_{\tilde{a}}\;\operatorname*{E}_{\tilde{a},\mathcal{D}}[w(X^{\prime},Y,G)L(Y,\hat{Y}^{\prime})], where w(X^{\prime},Y,G)=[(1-Y)/2+Yq_{g}]/\operatorname*{E}_{\mathcal{D}}[Y|G], L is zero-one loss, and, in our experiments, we choose \tilde{a} according to logistic regression. We interpret \Pr(\hat{Y}^{\prime}{=}1) as a new, scalar feature value X\in\mathbf{R} mapped from the higher-dimensional features X^{\prime} as the output of the learned “preprocessing” function \tilde{a}. Assumption 4.1 is as hard to satisfy in general as solving the Bayes-optimal binary classification task over higher-dimensional features; nonetheless, we expect it to be approximately satisfied by such a “preprocessing” pipeline.
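A sketch of this preprocessing step is shown below: it fits a label-reweighted logistic regression so that each group's effective qualification rate matches a target q_{g}, then uses the predicted probability as the scalar feature. The exact weighting here (target rate over empirical rate per label) is our reading of the weight w in the text, and the scikit-learn usage and names are our own choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def scalar_feature(X_raw, y, group, target_q):
    """Preprocessing sketch: fit a logistic regression with per-example weights chosen so each
    group's effective qualification rate matches target_q[g], then return Pr(Y = 1 | X', G)
    as the new scalar feature X."""
    y01 = (np.asarray(y) + 1) // 2                      # map labels {-1, +1} -> {0, 1}
    group = np.asarray(group)
    w = np.ones(len(y01), dtype=float)
    for g in np.unique(group):
        in_g = group == g
        p_hat = y01[in_g].mean()                        # empirical qualification rate in group g
        w[in_g & (y01 == 1)] = target_q[g] / p_hat
        w[in_g & (y01 == 0)] = (1 - target_q[g]) / (1 - p_hat)
    clf = LogisticRegression(max_iter=1000).fit(X_raw, y01, sample_weight=w)
    return clf.predict_proba(X_raw)[:, 1]
```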

4.4 Linearity of Dynamics

L-UCBFair relies on Assumption 3.2, which asserts the existence of some Hilbert space in which the state dynamics \mathbf{P} are linear. Such linearity for real-world (continuous-time) dynamics holds only in an infinite-dimensional Hilbert space Brunton et al. (2021) and is not computationally tractable. In addition, the “feature map” \phi that maps state-action pairs to the aforementioned Hilbert space must be learned by the policy maker. In our experiments, we use a neural network to estimate a feature map \hat{\phi} which approximately satisfies the linear MDP assumption. We defer details to Section C.1.

5 Experimental Results

Do RL agents learn to seek favorable equilibria against short-term utility? Is a Lagrangian relaxation of long-term fairness sufficient to encourage this behavior? We give positive demonstrations for both questions.

5.1 Losses and Disparities Considered

Our experiments consider losses \mathscr{L} which combine true-positive and true-negative rates, where, for \alpha,\beta\in[0,1],

\mathscr{L}(s,a)=1-\alpha\,\texttt{tp}(s,a)-\beta\,\texttt{tn}(s,a), (14)

where \texttt{tp}(s,a)=\Pr_{s,a}(\hat{Y}{=}1,Y{=}1) and \texttt{tn}(s,a)=\Pr_{s,a}(\hat{Y}{=}{-}1,Y{=}{-}1). For disparity \mathscr{D}, we consider demographic parity (DP) Dwork et al. (2012), equal opportunity (EOp) Hardt et al. (2016), and qualification rate (QR):

Disparity   Functional Form                                                     \xi_{s,a}(g) or \xi^{y}_{s,a}(g)=\Pr_{s,a}(\cdot)
DP          \big|\xi_{s,a}(g_{1})-\xi_{s,a}(g_{2})\big|^{2}/2                   \hat{Y}{=}1\mid G{=}g
QR          \big|\xi_{s,a}(g_{1})-\xi_{s,a}(g_{2})\big|^{2}/2                   Y{=}1\mid G{=}g
EOp         \big|\xi_{s,a}(g_{1})-\xi_{s,a}(g_{2})\big|^{2}/2                   \hat{Y}{=}1\mid Y{=}1,G{=}g
EO          \sum_{y}\big|\xi^{y}_{s,a}(g_{1})-\xi^{y}_{s,a}(g_{2})\big|^{2}/2   \hat{Y}{=}\hat{y}\mid Y{=}y,G{=}g

QR is not meaningful for myopic fair classification, which does not consider the mutable population state.

Figure 1: The greedy baseline algorithm (left) and L-UCBFair (right) are tasked to maximize the fraction of true-positive classifications (\mathscr{L}=1-\texttt{tp}, Eq. 14), subject to demographic parity (\mathscr{D}{=}\texttt{DP}, Section 5.1). The greedy algorithm uses \lambda{=}0.5 in Eq. 13, while L-UCBFair is trained for 2,000 steps on episodes of length 100 prior to generating this “phase portrait”. We depict the expected dynamics (averaged over 20 policy iterations for each state) of the classifier-population system, parameterized by the time-evolving qualification rate in each group (1 on the horizontal, 2 on the vertical). Each group is of equal size and identically modeled by the standard normal X\sim\mathcal{N}(Y,1). Note that states in the left plot attract to universal non-qualification \Pr(Y{=}1){=}0, while the right plot converges to universal qualification. The lower plot shows average loss over pairs of randomly sampled episodes.

5.2 Results

Our experiments show that algorithms trained with an RL formulation of long-term fairness can drive a reactive population toward states with higher utility and fairness, even when short-term utility is misaligned with desirable dynamics. Our central hypothesis, that long-term fairness via RL may induce an algorithm to sacrifice short-term utility for better long-term outcomes, is concretely demonstrated by Fig. 1, in which a greedy classifier and L-UCBFair, maximizing true positive rate \texttt{tp} (Section 5.1) subject to demographic parity \texttt{DP} (Section 5.1), drive a population to universal non-qualification (\Pr(Y{=}1)\to 0) and universal qualification (\Pr(Y{=}1)\to 1), respectively. Each phase plot shows the dynamics of qualification rates q_{g}=\Pr(Y{=}1\mid G{=}g), which parameterize the population state s and define the axes, with streamlines; color depicts averaged disparity \mathscr{D} incurred by a in state s.

Figure 2: Using a modelled population initialized with the “Adult” dataset, reweighted for equal group representation (Section 4.3), L-UCBFair (left) and R-TD3 (right) are tasked, as in Fig. 1, to maximize the fraction of true-positive classifications (\mathscr{L}=1-\texttt{tp}, Eq. 14), subject to demographic parity (\mathscr{D}{=}\texttt{DP}, Section 5.1). L-UCBFair performs almost indistinguishably from the experiment on the synthetic dataset (Fig. 1), while R-TD3 learns qualitatively similar behavior with more aggressive short-term violations of the fairness constraint.

While both algorithms incur little or no violation of demographic parity, the myopic algorithm eventually precludes future true-positive classifications (arrows in Fig. 1 approach a low-qualification state), while L-UCBFair maintains stochastic thresholds at equilibrium (mean [0.49, 0.38], by group) with a non-trivial fraction of true positives. The episodic mean loss and disparity training curves for L-UCBFair are depicted in Fig. 3.

We show that RL algorithms that are not limited by the same restrictive assumptions as L-UCBFair are applicable to long-term fairness. In Fig. 2, R-TD3 achieves similar qualitative behavior (i.e., driving near-universal qualification at the expense of short-term utility) when optimizing a loss subject to scheduled disparity regularization. This figure also highlights R-TD3’s lack of guarantees: it incurs prominent violations of the fairness constraint and fails to convincingly asymptote to the global optimum.

Figure 3: L-UCBFair 20-step sliding mean & std training loss (left) and disparity (right) for the Fig. 1 setting.
Figure 4: Phase portraits for L-UCBFair (left) and R-TD3 (right) interacting on the synthetic distribution X\sim\mathcal{N}(Y,1) with groups of equal size. Both algorithms use \mathscr{L}=1-\texttt{tp}-\texttt{tn} (i.e., zero-one loss) and \mathscr{D}=\texttt{QR}. Shading: qualification rate disparity for the next time-step.

Finally, we demonstrate the capability of RL to utilize notions of fairness that are impossible to treat in the myopic setting, viz. qualification rate parity, in Fig. 4. In this example, while both RL agents achieve qualification rate parity, we note that L-UCBFair fails to realize the optimal equilibrium discovered by R-TD3.

Concluding remarks

We have demonstrated that desirable social outcomes that are in conflict with myopic optimization may be realized as solutions to long-term fairness formalized through the lens of reinforcement learning. Moreover, we have shown that this problem both admits solutions with theoretical guarantees and can be relaxed to accommodate a wider class of recent advances in RL. We hope our contributions spur interest in long-term mechanisms and incentive structures for machine learning to be a driver of positive social change.

Acknowledgement

This work is partially supported by the National Science Foundation (NSF) under grants IIS-2143895 and IIS-2040800, and CCF-2023495.

References

  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine learning, 47(2):235–256, 2002.
  • Bai et al. (2022) Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, pages 3682–3689, 2022.
  • Brantley et al. (2020) Kianté Brantley, Miro Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. Advances in Neural Information Processing Systems, 33:16315–16326, 2020.
  • Brunton et al. (2021) Steven L Brunton, Marko Budišić, Eurika Kaiser, and J Nathan Kutz. Modern koopman theory for dynamical systems. arXiv preprint arXiv:2102.12086, 2021.
  • Chaney et al. (2018) Allison JB Chaney, Brandon M Stewart, and Barbara E Engelhardt. How algorithmic confounding in recommendation systems increases homogeneity and decreases utility. In Proceedings of the 12th ACM Conference on Recommender Systems, pages 224–232. ACM, 2018.
  • Chen et al. (2022) Yatong Chen, Reilly Raab, Jialu Wang, and Yang Liu. Fairness transferability subject to bounded distribution shift. arXiv preprint arXiv:2206.00129, 2022.
  • Coate and Loury (1993) Stephen Coate and Glenn C Loury. Will affirmative-action policies eliminate negative stereotypes? The American Economic Review, pages 1220–1240, 1993.
  • Crawford and Calo (2016) Kate Crawford and Ryan Calo. There is a blind spot in AI research. Nature News, 538(7625):311, 2016.
  • D’Amour et al. (2020) Alexander D’Amour, Hansa Srinivasan, James Atwood, Pallavi Baljekar, D Sculley, and Yoni Halpern. Fairness is not static: Deeper understanding of long term fairness via simulation studies. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 525–534, 2020.
  • Ding et al. (2020) Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020.
  • Ding et al. (2021) Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304–3312. PMLR, 2021.
  • Dua and Graff (2017) Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.
  • Dwork et al. (2012) Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard Zemel. Fairness through awareness. In Proceedings of the 3rd innovations in theoretical computer science conference, pages 214–226, 2012.
  • Efroni et al. (2020) Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189, 2020.
  • Ensign et al. (2018) Danielle Ensign, Sorelle A Friedler, Scott Neville, Carlos Scheidegger, and Suresh Venkatasubramanian. Runaway feedback loops in predictive policing. In Conference of Fairness, Accountability, and Transparency, 2018.
  • Epasto et al. (2020) Alessandro Epasto, Mohammad Mahdian, Vahab Mirrokni, and Emmanouil Zampetakis. Optimal approximation-smoothness tradeoffs for soft-max functions. Advances in Neural Information Processing Systems, 33:2651–2660, 2020.
  • Feldman et al. (2015) Michael Feldman, Sorelle A Friedler, John Moeller, Carlos Scheidegger, and Suresh Venkatasubramanian. Certifying and removing disparate impact. In proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pages 259–268, 2015.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International conference on machine learning, pages 1587–1596. PMLR, 2018.
  • Fuster et al. (2018) Andreas Fuster, Paul Goldsmith-Pinkham, Tarun Ramadorai, and Ansgar Walther. Predictably unequal? The effects of machine learning on credit markets. The Effects of Machine Learning on Credit Markets, 2018.
  • Ghosh et al. (2022) Arnob Ghosh, Xingyu Zhou, and Ness Shroff. Provably efficient model-free constrained rl with linear function approximation. arXiv preprint arXiv:2206.11889, 2022.
  • Hardt et al. (2016) Moritz Hardt, Eric Price, and Nathan Srebro. Equality of opportunity in supervised learning, 2016.
  • Heidari et al. (2019) Hoda Heidari, Vedant Nanda, and Krishna P. Gummadi. On the long-term impact of algorithmic decision policies: Effort unfairness and feature segregation through social learning. the International Conference on Machine Learning (ICML), 2019.
  • Hu and Chen (2018) Lily Hu and Yiling Chen. A short-term intervention for long-term fairness in the labor market. In Proceedings of the 2018 World Wide Web Conference on World Wide Web, pages 1389–1398. International World Wide Web Conferences Steering Committee, 2018.
  • Hu et al. (2019) Lily Hu, Nicole Immorlica, and Jennifer Wortman Vaughan. The disparate effects of strategic manipulation. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 259–268, 2019.
  • Jabbari et al. (2017) Shahin Jabbari, Matthew Joseph, Michael Kearns, Jamie Morgenstern, and Aaron Roth. Fairness in reinforcement learning. In International conference on machine learning, pages 1617–1626. PMLR, 2017.
  • Jin et al. (2020) Chi Jin, Zhuoran Yang, Zhaoran Wang, and Michael I Jordan. Provably efficient reinforcement learning with linear function approximation. In Conference on Learning Theory, pages 2137–2143. PMLR, 2020.
  • Joseph et al. (2016) Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. Fairness in learning: Classic and contextual bandits. Advances in neural information processing systems, 29, 2016.
  • Kalagarla et al. (2021) Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 8030–8037, 2021.
  • Liu et al. (2018) Lydia T Liu, Sarah Dean, Esther Rolf, Max Simchowitz, and Moritz Hardt. Delayed impact of fair machine learning. In International Conference on Machine Learning, pages 3150–3158. PMLR, 2018.
  • Liu et al. (2020) Lydia T Liu, Ashia Wilson, Nika Haghtalab, Adam Tauman Kalai, Christian Borgs, and Jennifer Chayes. The disparate equilibria of algorithmic decision making when individuals invest rationally. In Proceedings of the 2020 Conference on Fairness, Accountability, and Transparency, pages 381–391, 2020.
  • Liu et al. (2021) Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021.
  • Liu et al. (2017) Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C Parkes. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875, 2017.
  • Mouzannar et al. (2019) Hussein Mouzannar, Mesrob I Ohannessian, and Nathan Srebro. From fair decision making to social equality. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 359–368. ACM, 2019.
  • Pan et al. (2019) Ling Pan, Qingpeng Cai, Qi Meng, Wei Chen, Longbo Huang, and Tie-Yan Liu. Reinforcement learning with dynamic boltzmann softmax updates. arXiv preprint arXiv:1903.05926, 2019.
  • Paszke et al. (2019) Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, pages 8024–8035. Curran Associates, Inc., 2019.
  • Perdomo et al. (2020) Juan Perdomo, Tijana Zrnic, Celestine Mendler-Dünner, and Moritz Hardt. Performative prediction. In International Conference on Machine Learning, pages 7599–7609. PMLR, 2020.
  • Raab and Liu (2021) Reilly Raab and Yang Liu. Unintended selection: Persistent qualification rate disparities and interventions. Advances in Neural Information Processing Systems, 34:26053–26065, 2021.
  • Raffin et al. (2021) Antonin Raffin, Ashley Hill, Adam Gleave, Anssi Kanervisto, Maximilian Ernestus, and Noah Dormann. Stable-baselines3: Reliable reinforcement learning implementations. Journal of Machine Learning Research, 22(268):1–8, 2021.
  • Singh et al. (2020) Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • Tang et al. (2021) Wei Tang, Chien-Ju Ho, and Yang Liu. Bandit learning with delayed impact of actions. Advances in Neural Information Processing Systems, 34:26804–26817, 2021.
  • Wei et al. (2022) Honghao Wei, Xin Liu, and Lei Ying. Triple-q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In International Conference on Artificial Intelligence and Statistics, pages 3274–3307. PMLR, 2022.
  • Wen et al. (2019) Min Wen, Osbert Bastani, and Ufuk Topcu. Fairness with dynamics. arXiv preprint arXiv:1901.08568, 2019.
  • Williams and Kolter (2019) Joshua Williams and J Zico Kolter. Dynamic modeling and equilibria in fair decision making. arXiv preprint arXiv:1911.06837, 2019.
  • Xu et al. (2021) Tengyu Xu, Yingbin Liang, and Guanghui Lan. Crpo: A new approach for safe reinforcement learning with convergence guarantee. In International Conference on Machine Learning, pages 11480–11491. PMLR, 2021.
  • Zemel et al. (2013) Rich Zemel, Yu Wu, Kevin Swersky, Toni Pitassi, and Cynthia Dwork. Learning fair representations. In International conference on machine learning, pages 325–333. PMLR, 2013.
  • Zhang et al. (2020) Xueru Zhang, Ruibo Tu, Yang Liu, Mingyan Liu, Hedvig Kjellström, Kun Zhang, and Cheng Zhang. How do fair decisions fare in long-term qualification? arXiv preprint arXiv:2010.11300, 2020.
  • Zheng and Ratliff (2020) Liyuan Zheng and Lillian Ratliff. Constrained upper confidence reinforcement learning. In Learning for Dynamics and Control, pages 620–629. PMLR, 2020.

Appendix A Proof

Without loss of generality, we assume \|\phi(s,a)\|\leq 1 for all (s,a)\in\mathcal{S}\times\mathcal{A}, and \max\{\|\mu_{h}(\mathcal{S})\|,\|\theta_{h}\|\}\leq\sqrt{d} for all h\in[H].

A.1 Proof of Theorem 3.7

See 3.7 Outline The outline of this proof simulates the proof in Ghosh et al. (2022). For brevity, denote hVj,h+1π(s,a)=𝔼sh(s,a)Vj,h+1π(s) for j=r,g\mathbb{P}_{h}V_{j,h+1}^{\pi}(s,a)=\mathbb{E}_{s^{\prime}\sim\mathbb{P}_{h}(\cdot\mid s,a)}V_{j,h+1}^{\pi}\left(s^{\prime}\right)\text{ for }j=r,g. Then

$Q_{j,h}^{\pi}(s,a)=\left(r_{h}+\mathbb{P}_{h}V_{j,h+1}^{\pi}\right)(s,a)$ (15)
$V_{j,h}^{\pi}(s)=\left\langle\pi_{h}(\cdot\mid s),Q_{j,h}^{\pi}(s,\cdot)\right\rangle_{\mathcal{A}}$ (16)
$\left\langle\pi_{h}(\cdot\mid s),Q_{j,h}^{\pi}(s,\cdot)\right\rangle_{\mathcal{A}}=\sum_{a\in\mathcal{A}}\pi_{h}(a\mid s)\,Q_{j,h}^{\pi}(s,a)$ (17)

Similar to Efroni et al. (2020), we establish

$\text{Regret}(K)+\nu\,\text{Distortion}(K)$
$=\sum_{k=1}^{K}\left(V_{r,1}^{\pi^{*}}\left(s_{1}\right)-V_{r,1}^{\pi_{k}}\left(s_{1}\right)\right)+\nu\sum_{k=1}^{K}\left(b-V_{g,1}^{\pi_{k}}\left(s_{1}\right)\right)$
$\leq\underbrace{\sum_{k=1}^{K}\left(V_{r,1}^{\pi^{*}}\left(s_{1}\right)+\nu_{k}V_{g,1}^{\pi^{*}}\left(s_{1}\right)\right)-\left(V_{r,1}^{k}\left(s_{1}\right)+\nu_{k}V_{g,1}^{k}\left(s_{1}\right)\right)}_{\mathcal{T}_{1}}$
$\quad+\underbrace{\sum_{k=1}^{K}\left(V_{r,1}^{k}\left(s_{1}\right)-V_{r,1}^{\pi_{k}}\left(s_{1}\right)\right)+\nu\sum_{k=1}^{K}\left(V_{g,1}^{k}\left(s_{1}\right)-V_{g,1}^{\pi_{k}}\left(s_{1}\right)\right)}_{\mathcal{T}_{2}}$
$\quad+\underbrace{\frac{1}{2\eta}\nu^{2}+\frac{\eta}{2}H^{2}K}_{\mathcal{T}_{3}}$

$\mathcal{T}_{3}$ is easily bounded for an appropriate choice of the step size $\eta$. The main remaining task is to bound $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$.
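For instance, with the choice $\eta=\mathscr{V}/\sqrt{KH^{2}}$ made below and any $\nu\in[0,\mathscr{V}]$,

$\mathcal{T}_{3}=\frac{\nu^{2}}{2\eta}+\frac{\eta}{2}H^{2}K\leq\frac{\mathscr{V}H\sqrt{K}}{2}+\frac{\mathscr{V}H\sqrt{K}}{2}=\mathscr{V}H\sqrt{K}=\mathcal{O}\left(\mathscr{V}H\sqrt{K}\right).$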

Bounding $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$. We have the following two lemmas.

Lemma A.1 (Boundedness of $\mathcal{T}_{1}$).

With probability $1-p/2$, we have $\mathcal{T}_{1}\leq KH\left(\frac{\log(M)}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dK}{\varsigma}}\right)$. Specifically, if $\alpha=\frac{\log(M)K}{2(1+\mathscr{V}+H)}$ and $\varsigma=1$, we have $\mathcal{T}_{1}\leq 2H(1+\mathscr{V}+H)+2KH^{2}(1+\mathscr{V})\rho\epsilon_{I}\sqrt{dK}$ with probability $1-p/2$.

Lemma A.2 (Boundedness of $\mathcal{T}_{2}$).

(Ghosh et al., 2022) With probability $1-p/2$, $\mathcal{T}_{2}\leq\mathcal{O}\left((\nu+1)H^{2}\zeta\sqrt{d^{3}K}\right)$, where $\zeta=\log[\log(M)4dHK/p]$.

Lemma A.2 follows the same logic as in Ghosh et al. (2022), and we defer the proof of Lemma A.1 to Section A.3. We are now ready to prove Theorem 3.7.

Proof.

For any $\nu\in[0,\mathscr{V}]$, with probability $1-p$,

$\text{Regret}(K)+\nu\,\text{Distortion}(K)$
$\leq\mathcal{T}_{1}+\mathcal{T}_{2}+\mathcal{T}_{3}$
$\leq\frac{1}{2\eta}\nu^{2}+\frac{\eta}{2}H^{2}K+\frac{HK\log M}{\alpha}+2KH^{2}(1+\mathscr{V})\rho\epsilon_{I}\sqrt{dK}+\mathcal{O}\left((\nu+1)H^{2}\zeta\sqrt{d^{3}K}\right)$ (19)

Taking $\nu=0$, $\eta=\frac{\mathscr{V}}{\sqrt{KH^{2}}}$, $\alpha=\frac{K\log M}{2(1+\mathscr{V}+H)}$, and $\epsilon_{I}=\frac{1}{2\rho(1+\mathscr{V})KH\sqrt{d}}$, there exists a constant $b$ such that

$\text{Regret}(K)\leq\frac{\mathscr{V}H}{2}\sqrt{K}+2H(1+\mathscr{V}+H)+2H^{2}K(1+\mathscr{V})\rho\epsilon_{I}\sqrt{dK}+\mathcal{O}\left(H^{2}\zeta\sqrt{d^{3}K}\right)$
$\leq\big(b\zeta H^{2}\sqrt{d^{3}}+(\mathscr{V}+1)H\big)\sqrt{K}=\tilde{\mathcal{O}}\left(H^{2}\sqrt{d^{3}K}\right).$

Taking $\nu=\mathscr{V}$, $\eta=\frac{\mathscr{V}}{\sqrt{KH^{2}}}$, $\alpha=\frac{K\log M}{2(1+\mathscr{V}+H)}$, $\epsilon_{I}=\frac{1}{2\rho(1+\mathscr{V})KH\sqrt{d}}$,

$\text{Regret}(K)+\mathscr{V}\,\text{Distortion}(K)\leq(\mathscr{V}+1)H\sqrt{K}+(1+\mathscr{V})\,\mathcal{O}\left(H^{2}\zeta\sqrt{d^{3}K}\right)$

Following the idea of Efroni et al. (2020), there exists a policy $\pi^{\prime}$ such that $V_{r,1}^{\pi^{\prime}}=\frac{1}{K}\sum_{k=1}^{K}V_{r,1}^{\pi_{k}}$ and $V_{g,1}^{\pi^{\prime}}=\frac{1}{K}\sum_{k=1}^{K}V_{g,1}^{\pi_{k}}$. Indeed, $V_{r,1}^{\pi}$ and $V_{g,1}^{\pi}$ are linear in the occupancy measure induced by $\pi$, and the average of the $K$ occupancy measures is itself an occupancy measure, which induces a policy $\pi^{\prime}$ attaining $V_{r,1}^{\pi^{\prime}}$ and $V_{g,1}^{\pi^{\prime}}$. We take $\nu=0$ when $\sum_{k=1}^{K}\left(b-V_{g,1}^{\pi_{k}}\left(s_{1}^{k}\right)\right)<0$, and $\nu=\mathscr{V}$ otherwise. Hence, we have

$V_{r,1}^{\pi^{*}}\left(s_{1}\right)-\frac{1}{K}\sum_{k=1}^{K}V_{r,1}^{\pi_{k}}\left(s_{1}\right)+\mathscr{V}\max\left(c-\frac{1}{K}\sum_{k=1}^{K}V_{g,1}^{\pi_{k}}\left(s_{1}\right),\,0\right)$
$=V_{r,1}^{\pi^{*}}\left(s_{1}\right)-V_{r,1}^{\pi^{\prime}}\left(s_{1}\right)+\mathscr{V}\max\left(c-V_{g,1}^{\pi^{\prime}}\left(s_{1}\right),\,0\right)$
$\leq\frac{\mathscr{V}+1}{K}H\sqrt{K}+\frac{\mathscr{V}+1}{K}\mathcal{O}\left(H^{2}\zeta\sqrt{d^{3}K}\right)$ (20)

Since $\mathscr{V}=2H/\gamma$, using the result of Lemma A.11 we have

$\max\left(c-\frac{1}{K}\sum_{k=1}^{K}V_{g,1}^{\pi_{k}}\left(s_{1}^{k}\right),\,0\right)\leq\frac{\mathscr{V}+1}{K\mathscr{V}}\mathcal{O}\left(H^{2}\zeta\sqrt{d^{3}K}\right)$
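In more detail, inequality (20) has the form required by Lemma A.11 with $C=\mathscr{V}$ (the condition $C\geq 2\nu^{*}$ is what the choice $\mathscr{V}=2H/\gamma$ is meant to ensure) and

$\delta=\frac{\mathscr{V}+1}{K}\left(H\sqrt{K}+\mathcal{O}\left(H^{2}\zeta\sqrt{d^{3}K}\right)\right),$

so Lemma A.11 gives $\left[c-V_{g,1}^{\pi^{\prime}}\left(s_{1}\right)\right]_{+}\leq 2\delta/\mathscr{V}$, which, after absorbing constants into the $\mathcal{O}(\cdot)$ notation and recalling that $V_{g,1}^{\pi^{\prime}}$ is the average of the $V_{g,1}^{\pi_{k}}$, is the displayed bound.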

In the following subsections, we prove Lemma A.1 and Lemma A.3.

A.2 Preparation for Lemma A.1

In order to bound $\mathcal{T}_{1}$ and $\mathcal{T}_{2}$, we introduce the following lemma.

Lemma A.3.

There exists a constant $B_{2}$ such that for any fixed $p\in(0,1)$, with probability at least $1-p/2$, the following event holds:

$\left\|\sum_{\tau=1}^{k-1}\phi_{j,h}^{\tau}\left[V_{j,h+1}^{k}\left(s_{h+1}^{\tau}\right)-\mathbb{P}_{h}V_{j,h+1}^{k}\left(s_{h}^{\tau},a_{h}^{\tau}\right)\right]\right\|_{\left(\Lambda_{h}^{k}\right)^{-1}}\leq B_{2}dHq$

for $j\in\{r,g\}$, where $q=\sqrt{\log\left[4\left(B_{1}+1\right)\log(M)dT/p\right]}$ for some constant $B_{1}$.

We defer the proof of Lemma A.3 to Section A.4.

Lemma A.3 provides a bound relating the estimated value function $V^{k}_{j,h}$ and the value function $V^{\pi}_{j,h}$ under a given policy at episode $k$. We now introduce the following lemma, which appears in Ghosh et al. (2022). It bounds the difference between the value function without bonus in L-UCBFair and the true value function of any policy $\pi$ in terms of their expected difference at the next step, plus an error term.

Lemma A.4.

(Ghosh et al., 2022) There exists an absolute constant $\beta=C_{1}dH\sqrt{\zeta}$ with $\zeta=\log(\log(M)4dT/p)$ such that, for any fixed policy $\pi$, on the event defined in Lemma A.3 we have

$\left\langle\phi(s,a),w_{j,h}^{k}\right\rangle-Q_{j,h}^{\pi}(s,a)=\mathbb{P}_{h}\left(V_{j,h+1}^{k}-V_{j,h+1}^{\pi}\right)(s,a)+\Delta_{h}^{k}(s,a)$

for some $\Delta_{h}^{k}(s,a)$ that satisfies $\left|\Delta_{h}^{k}(s,a)\right|\leq\beta\sqrt{\phi(s,a)^{T}\left(\Lambda_{h}^{k}\right)^{-1}\phi(s,a)}$.

Lemma A.5.

(Ghosh et al., 2022) With probability at least $1-p/2$ (on the event defined in Lemma A.3),

$Q_{r,h}^{\pi}(s,a)+\nu_{k}Q_{g,h}^{\pi}(s,a)\leq Q_{r,h}^{k}(s,a)+\nu_{k}Q_{g,h}^{k}(s,a)-\mathbb{P}_{h}\left(V_{h+1}^{k}-V_{h+1}^{\pi,\nu_{k}}\right)(s,a)$

We also introduce the following lemma, which bounds the difference between the value functions obtained by the L-UCBFair policy and by the greedy policy.

Lemma A.6.

Define $\bar{V}_{h}^{k}(\cdot)=\max_{a}\left[Q_{r,h}^{k}(\cdot,a)+\nu_{k}Q_{g,h}^{k}(\cdot,a)\right]$, the value function corresponding to the greedy policy. Then

$\bar{V}_{h}^{k}(s)-V_{h}^{k}(s)\leq\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}.$ (21)
Proof.

Let $a_{g}$ denote the action selected by the greedy policy. Then

$\bar{V}_{h}^{k}(s)-V_{h}^{k}(s)=\left[Q_{r,h}^{k}\left(s,a_{g}\right)+\nu_{k}Q_{g,h}^{k}\left(s,a_{g}\right)\right]$ (22)
$\quad-\int_{a}\pi_{h,k}(a\mid s)\left[Q_{r,h}^{k}(s,a)+\nu_{k}Q_{g,h}^{k}(s,a)\right]da$ (23)
$\leq\left[Q_{r,h}^{k}\left(s,a_{g}\right)+\nu_{k}Q_{g,h}^{k}\left(s,a_{g}\right)\right]$ (24)
$\quad-\sum_{i}\text{SM}_{\alpha}(I_{i}\mid s)\left[Q_{r,h}^{k}(s,I_{i})+\nu_{k}Q_{g,h}^{k}(s,I_{i})\right]+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}$ (25)
$\leq\frac{\log\left(\sum_{i}\exp\left(\alpha\left(Q_{r,h}^{k}(s,I_{i})+\nu_{k}Q_{g,h}^{k}(s,I_{i})\right)\right)\right)}{\alpha}$ (26)
$\quad-\sum_{i}\text{SM}_{\alpha}(I_{i}\mid s)\left[Q_{r,h}^{k}(s,I_{i})+\nu_{k}Q_{g,h}^{k}(s,I_{i})\right]+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}$ (27)
$\leq\frac{\log(M)}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}.$ (28)

The first inequality follows from Lemma A.10 and the second inequality holds because of Proposition 1 in Pan et al. (2019).
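For completeness, we note that the bound in the final step can also be seen from the standard log-sum-exp (soft-max) identity: for any values $x_{1},\dots,x_{M}$ and the distribution $p_{i}\propto\exp(\alpha x_{i})$,

$\frac{1}{\alpha}\log\sum_{i=1}^{M}\exp\left(\alpha x_{i}\right)-\sum_{i=1}^{M}p_{i}x_{i}=\frac{1}{\alpha}\mathcal{H}(p)\leq\frac{\log M}{\alpha},$

since the Shannon entropy $\mathcal{H}(p)$ of a distribution over $M$ elements is at most $\log M$.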

A.3 Proof of Lemma A.1

We are now ready to prove Lemma A.1.

Proof.

This proof parallels that of Lemma 3 in Ghosh et al. (2022).

We prove this lemma by induction. At step $H$, we have $Q_{j,H+1}^{k}=0=Q_{j,H+1}^{\pi}$ by definition. On the event defined in Lemma A.3, using Lemma A.4 we have, for $j=r,g$,

$\left|\left\langle\phi(s,a),w_{j,H}^{k}\right\rangle-Q_{j,H}^{\pi}(s,a)\right|\leq\beta\sqrt{\phi(s,a)^{T}\left(\Lambda_{H}^{k}\right)^{-1}\phi(s,a)}$

Thus $Q_{j,H}^{\pi}(s,a)\leq\min\left\{\left\langle\phi(s,a),w_{j,H}^{k}\right\rangle+\beta\sqrt{\phi(s,a)^{T}\left(\Lambda_{H}^{k}\right)^{-1}\phi(s,a)},\,H\right\}=Q_{j,H}^{k}(s,a)$.

From the definition of $\bar{V}_{H}^{k}$,

$\bar{V}_{H}^{k}(s)=\max_{a}\left[Q_{r,H}^{k}(s,a)+\nu_{k}Q_{g,H}^{k}(s,a)\right]\geq\sum_{a}\pi(a\mid s)\left[Q_{r,H}^{\pi}(s,a)+\nu_{k}Q_{g,H}^{\pi}(s,a)\right]=V_{H}^{\pi,\nu_{k}}(s)$

for any policy $\pi$. In particular, it holds for the optimal policy $\pi^{*}$. Using Lemma A.6, we get

$V_{H}^{\pi^{*},\nu_{k}}(s)-V_{H}^{k}(s)\leq\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}$

Now, suppose the claim holds down to step $h+1$ and consider step $h$. Since it holds at step $h+1$, for any policy $\pi$,

$\mathbb{P}_{h}\left(V_{h+1}^{\pi,\nu_{k}}-V_{h+1}^{k}\right)(s,a)\leq(H-h)\left(\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}\right)$

From Lemma A.5 and the above result, we have, for any $(s,a)$,

$Q_{r,h}^{\pi}(s,a)+\nu_{k}Q_{g,h}^{\pi}(s,a)\leq Q_{r,h}^{k}(s,a)+\nu_{k}Q_{g,h}^{k}(s,a)+(H-h)\left(\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}\right)$

Hence,

$V_{h}^{\pi,\nu_{k}}(s)\leq\bar{V}_{h}^{k}(s)+(H-h)\left(\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}\right)$

Now, again from Lemma A.6, we have $\bar{V}_{h}^{k}(s)-V_{h}^{k}(s)\leq\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}$. Thus,

$V_{h}^{\pi,\nu_{k}}(s)-V_{h}^{k}(s)\leq(H-h+1)\left(\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}\right)$

Now, since this holds for any policy $\pi$, it also holds for $\pi^{*}$. From the definition of $V^{\pi,\nu_{k}}$, we have

$\left(V_{r,h}^{\pi^{*}}(s)+\nu_{k}V_{g,h}^{\pi^{*}}(s)\right)-\left(V_{r,h}^{k}(s)+\nu_{k}V_{g,h}^{k}(s)\right)\leq(H-h+1)\left(\frac{\log M}{\alpha}+2(1+\mathscr{V})H\rho\epsilon_{I}\sqrt{\frac{dk}{\varsigma}}\right)$

Hence, the result follows by summing over $k=1,\dots,K$ and setting $h=1$.

A.4 Proof of Lemma A.3

We first define some useful sets. Let $\mathcal{Q}_{j}=\left\{Q\mid Q(\cdot,a)=\min\left\{w_{j}^{T}\phi(\cdot,a)+\beta\sqrt{\phi(\cdot,a)^{T}\Lambda^{-1}\phi(\cdot,a)},\,H\right\},\ a\in\mathcal{A}\right\}$ be the set of Q functions, for $j\in\{r,g\}$. Since the minimum eigenvalue of $\Lambda$ is no smaller than one, the Frobenius norm of $\Lambda^{-1}$ is bounded.

Let $\mathcal{V}_{j}=\left\{V_{j}\mid V_{j}(\cdot)=\int_{a}\pi(a\mid\cdot)Q_{j}(\cdot,a)\,da;\ Q_{r}\in\mathcal{Q}_{r},\ Q_{g}\in\mathcal{Q}_{g},\ \nu\in[0,\mathscr{V}]\right\}$ be the set of value functions, for $j\in\{r,g\}$. Define

$\Pi=\left\{\pi\mid\forall a\in\mathcal{A},\ \pi(a\mid\cdot)=\frac{1}{\int_{b\in\mathcal{I}(a)}db}\,\text{SM}_{\alpha}\left(Q_{r}(\cdot,I(a))+\nu Q_{g}(\cdot,I(a))\right),\ Q_{r}\in\mathcal{Q}_{r},\ Q_{g}\in\mathcal{Q}_{g},\ \nu\in[0,\mathscr{V}]\right\}$

as the set of policies.

It is easy to verify that $V_{j}^{k}\in\mathcal{V}_{j}$.
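For intuition only (this is not part of the proof), the following is a minimal sketch of how a policy of this form could be sampled in practice: the combined value $Q_{r}+\nu Q_{g}$ is evaluated at the $M$ interval representatives, a soft-max with temperature $\alpha$ selects an interval, and an action is drawn uniformly within it. The function names and the uniform-grid discretization of a one-dimensional action space $[0,1]$ are illustrative assumptions.

import numpy as np

def sample_interval_softmax_policy(q_r, q_g, nu, alpha, rng=None):
    """Sample an action from a soft-max-over-intervals policy (illustrative sketch).

    q_r, q_g : arrays of shape (M,) holding Q_r(s, I_i) and Q_g(s, I_i)
               evaluated at the M interval representatives for the current state.
    nu       : dual variable weighting the fairness value function.
    alpha    : soft-max temperature parameter.
    Assumes a one-dimensional action space [0, 1] split into M equal intervals.
    """
    rng = np.random.default_rng() if rng is None else rng
    m = len(q_r)
    combined = q_r + nu * q_g                      # Q_r(s, I_i) + nu * Q_g(s, I_i)
    logits = alpha * (combined - combined.max())   # shift for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # SM_alpha over the M intervals
    i = rng.choice(m, p=probs)                     # pick an interval
    # Spread the interval's probability uniformly over the interval itself,
    # mirroring the 1 / (length of I(a)) factor in the definition of Pi.
    low, high = i / m, (i + 1) / m
    return rng.uniform(low, high)

# Example usage with arbitrary values (M = 4 intervals):
action = sample_interval_softmax_policy(
    q_r=np.array([0.2, 0.5, 0.1, 0.4]),
    q_g=np.array([0.3, 0.1, 0.6, 0.2]),
    nu=1.0,
    alpha=5.0,
)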

We now turn to the proof of Lemma A.3. To prove Lemma A.3, we need the $\epsilon$-covering number for the set of value functions (Lemma A.9, Ghosh et al. (2022)). To this end, we show that if two Q functions and the dual variables $\nu$ are close, then corresponding bounds on the policies and value functions follow (Lemma A.7, Lemma A.8). Although the proofs of Lemma A.7 and Lemma A.8 differ from Ghosh et al. (2022), the results remain the same, so Lemma A.9 still holds. We therefore only state Lemma A.9 and omit its proof.

We first state and prove Lemma A.7.

Lemma A.7.

Let $\pi$ be the policy of L-UCBFair corresponding to $Q^{k}_{r}+\nu_{k}Q^{k}_{g}$, i.e.,

$\pi(a\mid\cdot)=\frac{1}{\int_{b\in\mathcal{I}(a)}db}\,\text{SM}_{\alpha}\left(Q_{r}(\cdot,I(a))+\nu Q_{g}(\cdot,I(a))\right)$ (29)

and let $\tilde{\pi}$ be the analogous policy

$\tilde{\pi}(a\mid\cdot)=\frac{1}{\int_{b\in\mathcal{I}(a)}db}\,\text{SM}_{\alpha}\left(\tilde{Q}_{r}(\cdot,I(a))+\tilde{\nu}\tilde{Q}_{g}(\cdot,I(a))\right).$ (30)

If $\left|Q_{j}-\tilde{Q}_{j}\right|\leq\epsilon^{\prime}$ and $\left|\nu-\tilde{\nu}\right|\leq\epsilon^{\prime}$, then $\left|\int_{a}\left(\pi(a\mid s)-\tilde{\pi}(a\mid s)\right)da\right|\leq 2\alpha\epsilon^{\prime}(1+\mathscr{V}+H)$.

Proof.
$\left|\int_{a}\left(\pi(a\mid s)-\tilde{\pi}(a\mid s)\right)da\right|$ (31)
$=\left|\sum_{i=1}^{M}\int_{a\in\mathcal{I}_{i}}\left(\pi(I(a)\mid s)-\tilde{\pi}(I(a)\mid s)\right)da\right|$
$=\left|\sum_{i=1}^{M}\int_{b\in\mathcal{I}_{i}}db\left(\pi(I_{i}\mid s)-\tilde{\pi}(I_{i}\mid s)\right)\right|$
$\leq\sum_{i=1}^{M}\left|\text{SM}_{\alpha}\left(Q_{r}(s,I_{i})+\nu Q_{g}(s,I_{i})\right)-\text{SM}_{\alpha}\left(\tilde{Q}_{r}(s,I_{i})+\tilde{\nu}\tilde{Q}_{g}(s,I_{i})\right)\right|$
$\leq 2\alpha\left|Q_{r}(\cdot,I(a))+\nu Q_{g}(\cdot,I(a))-\tilde{Q}_{r}(\cdot,I(a))-\tilde{\nu}\tilde{Q}_{g}(\cdot,I(a))\right|$ (32)

The last inequality holds because of Theorem 4.4 in Epasto et al. (2020). Using Corollary A.13, we have

$\left|\int_{a}\left(\pi(a\mid s)-\tilde{\pi}(a\mid s)\right)da\right|\leq 2\alpha\epsilon^{\prime}(1+\mathscr{V}+H)$ (33)

With Lemma A.7 in hand, we can further bound the value functions.

Lemma A.8.

If $\left|\tilde{Q}_{j}-Q_{j}^{k}\right|\leq\epsilon^{\prime}$, where $\tilde{Q}_{j}\in\mathcal{Q}_{j}$, then there exists $\tilde{V}_{j}\in\mathcal{V}_{j}$ such that

$\left|V_{j}^{k}-\widetilde{V}_{j}\right|\leq 2H\alpha\epsilon^{\prime}(1+\mathscr{V}+H)+\epsilon^{\prime}.$
Proof.

For any $s$,

$\left|V_{j}^{k}(s)-\widetilde{V}_{j}(s)\right|$
$=\left|\int_{a}\pi(a\mid s)Q_{j}^{k}(s,a)\,da-\int_{a}\tilde{\pi}(a\mid s)\tilde{Q}_{j}(s,a)\,da\right|$
$=\left|\int_{a}\pi(a\mid s)Q_{j}^{k}(s,a)\,da-\int_{a}\pi(a\mid s)\tilde{Q}_{j}(s,a)\,da+\int_{a}\pi(a\mid s)\tilde{Q}_{j}(s,a)\,da-\int_{a}\tilde{\pi}(a\mid s)\tilde{Q}_{j}(s,a)\,da\right|$
$\leq\left|\int_{a}\pi(a\mid s)\left(Q_{j}^{k}(s,a)-\tilde{Q}_{j}(s,a)\right)da\right|+\left|\int_{a}\pi(a\mid s)\tilde{Q}_{j}(s,a)\,da-\int_{a}\tilde{\pi}(a\mid s)\tilde{Q}_{j}(s,a)\,da\right|$
$\leq\epsilon^{\prime}+H\left|\int_{a}\left(\pi(a\mid s)-\tilde{\pi}(a\mid s)\right)da\right|$
$\leq\epsilon^{\prime}+2H\alpha\epsilon^{\prime}(1+\mathscr{V}+H)$

Using the lemmas above, we obtain the same result as Lemma 13 of Ghosh et al. (2022), stated as follows.

Lemma A.9.

(Ghosh et al., 2022) For any $V_{j}\in\mathcal{V}_{j}$, there exists a $\tilde{V}_{j}\in\mathcal{V}_{j}$ parameterized by $\left(\tilde{w}_{r},\tilde{w}_{g},\tilde{\beta},\Lambda,\tilde{\mathscr{V}}\right)$ such that $\operatorname{dist}\left(V_{j},\tilde{V}_{j}\right)\leq\epsilon$, where

$\operatorname{dist}\left(V_{j},\tilde{V}_{j}\right)=\sup_{s}\left|V_{j}(s)-\tilde{V}_{j}(s)\right|.$

Let $N_{\epsilon}^{V_{j}}$ be the $\epsilon$-covering number of the set $\mathcal{V}_{j}$. Then

$\log N_{\epsilon}^{V_{j}}\leq d\log\left(1+\frac{8H\sqrt{dk}}{\sqrt{\varsigma}\,\epsilon^{\prime}}\right)+d^{2}\log\left[1+8d^{1/2}\beta^{2}/\left(\varsigma\left(\epsilon^{\prime}\right)^{2}\right)\right]+\log\left(1+\frac{\mathscr{V}}{\epsilon^{\prime}}\right)$

where $\epsilon^{\prime}=\frac{\epsilon}{2H\alpha(1+\mathscr{V}+H)+1}$.

Lemma A.10.

$\left|Q^{k}_{j,h}(s,a)-Q^{k}_{j,h}(s,I(a))\right|\leq 2H\rho\epsilon_{I}\sqrt{\frac{dK}{\varsigma}}$.

Proof.
$\left|Q^{k}_{j,h}(s,a)-Q^{k}_{j,h}(s,I(a))\right|$ (34)
$=\left|\left(w^{k}_{j,h}\right)^{T}\left(\phi(s,a)-\phi(s,I(a))\right)\right|$ (35)
$\leq\left\|w^{k}_{j,h}\right\|_{2}\left\|\phi(s,a)-\phi(s,I(a))\right\|_{2}$ (36)

From Lemma A.12 and 3.6, the result follows.

A.5 Preliminary Results

Lemma A.11.

(Ding et al., 2021) Let $\nu^{*}$ be the optimal dual variable and let $C\geq 2\nu^{*}$. Then, if

$V_{r,1}^{\pi^{*}}\left(s_{1}\right)-V_{r,1}^{\pi}\left(s_{1}\right)+C\left[c-V_{g,1}^{\pi}\left(s_{1}\right)\right]_{+}\leq\delta,$

we have

$\left[c-V_{g,1}^{\pi}\left(s_{1}\right)\right]_{+}\leq\frac{2\delta}{C}.$
Lemma A.12.

(Jin et al., 2020) For any $(k,h)$, the weight $w_{j,h}^{k}$ satisfies

$\left\|w_{j,h}^{k}\right\|\leq 2H\sqrt{dk/\varsigma}.$
Corollary A.13.

If $\operatorname{dist}\left(Q_{r},\tilde{Q}_{r}\right)\leq\epsilon^{\prime}$, $\operatorname{dist}\left(Q_{g},\tilde{Q}_{g}\right)\leq\epsilon^{\prime}$, and $\left|\tilde{\nu}_{k}-\nu_{k}\right|\leq\epsilon^{\prime}$, then $\operatorname{dist}\left(Q_{r}^{k}+\nu_{k}Q_{g}^{k},\,\tilde{Q}_{r}+\tilde{\nu}_{k}\tilde{Q}_{g}\right)\leq\epsilon^{\prime}(1+\mathscr{V}+H)$.

Appendix B Additional Figures

Figure 5: The interaction of an algorithmic classifier and a reactive population. Given state $s_{\tau}$, the classifier uses policy $\pi$ to select action $a_{\tau}$. The population, in state $s_{\tau}$, reacts to $a_{\tau}$, transitioning to state $s_{\tau+1}$, and the process repeats.

Figure 6: A synthetic distribution is updated according to a dynamical kernel $\mathbf{P}$ based on evolutionary dynamics (Section 4.2), when a classifier repeatedly predicts $\hat{Y}{=}1$ iff $X\geq 0.5$. We visualize how the distribution of $X$ and the conditional qualification rates $\Pr(Y{=}1\mid X)$ change in each group $g\in\{1\text{ (red, solid)},2\text{ (blue, dashed)}\}$, fading the plotted lines over 10 time steps. In this example, the feature values $X$ in each group decrease with time, while the qualification rates of agents at any fixed value of $X$ decrease.

Appendix C Experiment

C.1 Experiment Details

Device and Packages. We run all experiments on a single 1080Ti GPU. We implement the R-TD3 agent using Stable-Baselines3 (Raffin et al., 2021). The neural network is implemented using PyTorch (Paszke et al., 2019).
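As a point of reference, a minimal sketch of training a TD3 agent with Stable-Baselines3 looks roughly as follows; the environment here is a standard continuous-control placeholder ("Pendulum-v1") standing in for our classifier-population environment, and the snippet omits the reward shaping and fairness regularization described in Section 3.2.

from stable_baselines3 import TD3

# Placeholder environment: "Pendulum-v1" is a registered Gym task standing in
# for the classifier-population environment used in our experiments.
model = TD3("MlpPolicy", "Pendulum-v1", verbose=1)

# The R-TD3 runs reported in the figures train for 200,000 steps.
model.learn(total_timesteps=200_000)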

Neural Network to learn $\phi$. We use a multi-layer perceptron to learn $\phi$. Specifically, we sample 100,000 data points using a random policy, storing $s$, $a$, $r$, and $g$. The network takes the state and action as inputs, which pass through fully connected (fc) layers of sizes 256, 128, 64, and 64. ReLU is used as the activation function between fc layers, and a softmax layer is applied after the last fc layer. We treat the output of this network as $\hat{\phi}$. To learn $\hat{\phi}$, we apply two separate fc layers (without bias) of size 1 to $\hat{\phi}$ and treat the outputs as the predicted $r$ and predicted $g$. A combination of the MSE losses for $r$ and $g$ is adopted. We use Adam as the optimizer, with weight decay 1e-4, learning rate 1e-3, and batch size 128.
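A minimal PyTorch sketch of this feature-learning setup is given below, under the assumptions just described; the class and variable names are ours, the input dimensions are illustrative, and data loading and training-loop details are omitted.

import torch
import torch.nn as nn

class PhiNetwork(nn.Module):
    """MLP that maps (state, action) to a learned feature vector phi_hat."""
    def __init__(self, state_dim, action_dim, phi_dim=64):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
            nn.Linear(64, phi_dim),
            nn.Softmax(dim=-1),          # softmax applied after the last fc layer
        )
        # Two separate, bias-free linear heads predict r and g from phi_hat,
        # so that r and g are linear in the learned feature.
        self.r_head = nn.Linear(phi_dim, 1, bias=False)
        self.g_head = nn.Linear(phi_dim, 1, bias=False)

    def forward(self, state, action):
        phi_hat = self.backbone(torch.cat([state, action], dim=-1))
        return phi_hat, self.r_head(phi_hat), self.g_head(phi_hat)

# Training step sketch: combined MSE loss on r and g, Adam with weight decay.
net = PhiNetwork(state_dim=4, action_dim=2)          # dimensions are illustrative
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)
mse = nn.MSELoss()

def train_step(state, action, r_target, g_target):
    phi_hat, r_pred, g_pred = net(state, action)
    loss = mse(r_pred.squeeze(-1), r_target) + mse(g_pred.squeeze(-1), g_target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()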

Note that $\hat{\phi}$ is linear with respect to $r$ and $g$ by construction, but the linearity of the transition kernel cannot be captured by this scheme. We therefore effectively assume that there always exists a measure $\mu_{h}$ such that, for the learned $\hat{\phi}$, the transition kernel is linear; this is a stronger assumption than Assumption 3.2.

(a) A baseline, greedy classifier performing gradient descent to (locally) maximize true positives, with static fairness regularization (columns).
(b) L-UCBFair, trained for 2,000 steps on the same cumulative utility functions.
(c) An R-TD3 agent (Section 3.2) trained for 200,000 steps on the same cumulative utility functions.
Figure 7: A comparison of learning policies trained to optimize cumulative true positive fraction subject to three different regularized fairness constraints (columns), with $\nu=1$ (L-UCBFair), $\lambda=0.5$ (greedy agent), and the time-dependent regularization detailed in Section 3.2 (R-TD3). The first policy (top row) is a baseline, myopic policy that greedily seeks to optimize current utility in any state by performing gradient descent. The second policy (bottom row) is trained using deep reinforcement learning (R-TD3) as detailed in Section 3.2 for 200,000 steps before we terminate learning and generate the phase portraits depicted. This experiment uses the synthetic distribution. In all cases, the baseline, greedy policy drives the system to promote unqualified individuals, with low qualification rates in each group, while the R-TD3 agent is able to drive the system to more favorable equilibria characterized by higher qualification rates. The shading in the phase plots depicts the violation of the regularizing fairness constraint within each column, validating the claim that the R-TD3 agent learns to sacrifice short-term utility to drive towards preferable system states.
(a) A baseline, greedy classifier locally maximizing true positive classifications, regularized by fairness (columns).
(b) L-UCBFair, trained for 2,000 steps on the same cumulative utility functions.
(c) An R-TD3 agent (Section 3.2) trained for 200,000 steps on the same cumulative utility functions.
Figure 8: A repetition of the experiment performed in Fig. 7, rewarding true positive fraction, using data synthesized from the UCI “Adult Data Set”, as detailed in Section 4.3, with equal group size reweighting. For this experiment, an individual’s sex defines their “group” membership, which is an imbalanced label in the dataset ($\approx 67$% male, group 2, vertical axis) that we re-weight for equal representation (Section 4). The stark difference between Fig. 7 and this experiment in the qualitative behavior of the greedy agent can be largely explained by the fact that $\Pr(Y{=}1\mid X{=}x)$ is not actually monotonically increasing in $x$, as stipulated by 4.1. Indeed, if $\Pr(Y{=}1\mid X{=}x)$ is sufficiently rough, the threshold selected by the baseline agent is liable to appear as if sampled uniformly at random, which is how the initial threshold value is chosen for each of the 20 iterations averaged over for each pair of group qualification rates used to generate the phase portraits above. Despite this failure mode of the baseline agent, however, the R-TD3 agent is still largely able to drive the system towards equilibria with more equal qualification rates in both groups. The line of equal qualification rates in both groups is depicted in black, running from the lower-left corner of each phase plot to the upper-right.

C.2 Zero-One Accuracy with More Weight on True Positives

(a) A baseline, greedy classifier locally maximizing utility, regularized by fairness (columns).
(b) An R-TD3 agent (Section 3.2) trained for 200,000 steps on the same cumulative utility functions.
Figure 9: A repetition of the experiment performed in Fig. 7 with a different base utility function (true positive fraction + 0.8 true negative fraction), weighting each regularized disparity term with $y=1$, on the same synthetic distribution. While our observations are largely consistent with Fig. 7, we also note that the R-TD3 agent drives a subset of the state space in the third pane to an equilibrium less desirable than the one that the myopic agent reaches.
(a) A baseline, greedy classifier locally maximizing utility, regularized by fairness (columns).
(b) An R-TD3 agent (Section 3.2) trained for 200,000 steps on the same cumulative utility functions.
Figure 10: A repetition of the experiment performed in Fig. 9 (i.e., with base utility function true positive fraction + 0.8 true negative fraction and a $y=1$ weighted regularized disparity term), on the UCI “Adult Data Set”, as detailed in Section 4.3, with groups re-weighted for equal representation.

C.3 Training Curves: L-UCBFair


(a) Demographic Parity
(b) Equal Opportunity
(c) Equalized Odds
Figure 11: L-UCBFair 20-step sliding mean & std for the setting in Fig. 7.
(a) Demographic Parity
(b) Equal Opportunity
(c) Equalized Odds
Figure 12: L-UCBFair 20-step sliding mean & std for the setting in Fig. 8.

C.4 Training Curves: R-TD3

(a) Demographic Parity
(b) Equal Opportunity
(c) Equalized Odds
Figure 13: R-TD3 100-step sliding mean & std for the setting in Fig. 8.
(a) Demographic Parity
(b) Equal Opportunity
(c) Equalized Odds
Figure 14: R-TD3 100-step sliding mean & std for the setting in Fig. 10.

C.5 Reduction of Utility

Figure 15: The figure depicts the short-term impact on utility of the L-UCBFair algorithm compared to a greedy baseline agent that operates without fairness constraints. In this experiment, both algorithms were designed to optimize the fraction of true-positive classifications, but only L-UCBFair was subject to the additional constraint of demographic parity. As the results indicate, L-UCBFair experiences a reduction in utility compared to the greedy baseline, but it is able to drive the system towards a state that is preferable in the long term.