
Risk Preferences of Learning Algorithms*
*We thank seminar audiences at Harvard and MIT and the mathoverflow.com user fedja for helpful conversations.

Andreas Haupt ([email protected]) and Aroon Narayanan ([email protected]), Massachusetts Institute of Technology
Abstract

Agents’ learning from feedback shapes economic outcomes, and many economic decision-makers today employ learning algorithms to make consequential choices. This note shows that a widely used learning algorithm, ε-Greedy, exhibits emergent risk aversion: it prefers actions with lower variance. When presented with actions of the same expectation, ε-Greedy chooses the lower-variance action with probability approaching one under a wide range of conditions. This emergent preference can have wide-ranging consequences, from fairness concerns to homogenization, and it persists transiently even when the riskier action has a strictly higher expected payoff. We discuss two methods to correct this bias. The first requires the algorithm to reweight data as a function of how likely the actions were to be chosen. The second requires the algorithm to hold optimistic estimates of actions for which it has not collected much data. We show that these corrections restore risk neutrality.

keywords:
Online learning, behavior attribution, fairness.
journal: Games and Economic Behavior

https://www.andyhaupt.com/ and https://sites.google.com/view/aroon-narayanan

1 Introduction

Decision-makers often confront the same problem repeatedly, using the outcomes of their choices in the past to guide their understanding of the best course of action for the next time they encounter it. For example, credit scores are used to approve or deny credit based on the borrower’s past credit behavior, and pretrial detention decisions are made based on the defendant’s criminal history. Who gets the money to advance their lives—and who gets put in jail with curtailed liberties—will crucially depend on how the prior data is used to make those decisions.

Many heuristics for solving these types of problems—essentially revolving around keeping an estimate of how well an action has performed in the past—have been developed. These have also been turned into formal algorithms deployed on computers and shown to have desirable properties. Such “learning algorithms” simultaneously make decisions and learn how to make future decisions better. They are now widely used in the economy, from product recommendation and pricing to highly consequential areas such as credit decisions and pretrial detentions.

While most deployed algorithms have a plethora of provably desirable properties, such as identifying the best option in the limit (no-regret), we define and demonstrate an important bias in widely used learning algorithms: risk aversion, which emerges without being explicitly specified by the algorithm designer. For example, consider a repeated binary choice: either pull lever A and get a deterministic payoff of 0, or pull lever B and get a stochastic payoff of 1 or −1, each with equal probability. At any point in time, contingent on past observations of payoffs from the actions taken (the bandit feedback setting), an algorithm chooses an action, potentially randomly, to take next. What is the probability that the learning algorithm selects each of the actions after t rounds of interaction? A risk-averse learning algorithm would choose the deterministic-reward action more often than the other action. We prove that a classic algorithm, ε-Greedy, chooses the deterministic action with probability converging to 1. This holds without ε-Greedy being explicitly designed to be risk-averse: its specification is simply to record the average reward received from each action and to choose, with high probability, the action with the highest estimated average reward and, with the remaining probability, an action uniformly at random. Risk aversion is an emergent property of the algorithm.

Risk aversion of algorithms is more than a mere intellectual curiosity. It has stark implications across many real-world applications, particularly for fairness in algorithmic choice. This is especially true given that learning algorithms are increasingly being deployed for highly consequential societal decisions, in the form of pretrial risk assessment tools (RATs) and credit scoring algorithms, which necessitates extra attention to their emergent and unintended properties. In many of the economic settings where algorithms make such decisions, a risk-averse algorithm can perpetuate deep inequities in society. As a concrete example, consider a firm making credit decisions using a risk-averse algorithm. Underrepresented minorities often have wide variance in their credit scores, shaped partly by inequities in access to good credit opportunities:

As the white suburbs and black inner cities diverged in their mortgage access, two different credit markets emerged in both zones. Lower-risk mortgages led to higher wealth and stability in the white suburbs. These conditions also led to a healthy consumer credit market. In the redlined black ghettos, the economic climate was radically different. (Baradaran, 2018, 893).

When faced with minority applicants with higher variability in credit history, a risk-averse algorithm may systematically deny them even when it would have approved privileged applicants with a similar expected repayment probability but features correlated with less variability in credit repayment, and hence perpetuate centuries of inequity. Another setting in which risk aversion can play a significant role is recommendation systems, which regularly improve their recommendations using feedback from users. The choices that the recommendation algorithm takes, such as which content to show next to a user or which products to return for a search query in e-commerce, are determined by how valuable the recommendation is deemed to be. Here, too, risk aversion can lead the recommendation system to suppress “noisier” content (which, in most cases, will be the less mainstream, more marginalized content) even when its deployers do not find such bias desirable. In the long run, this bias can also lead to the homogenization of content on these platforms, as more divergent content gets screened out by the algorithm’s decision not to recommend it.

Our first formal result is that the ε-Greedy algorithm exhibits perfect risk aversion under some conditions: as t → ∞, it chooses the riskless action with probability one. The proof shows that the mechanism behind these algorithmic risk preferences is related to how the algorithm estimates the value of each action. If an algorithm’s estimate of each action is the simple average of the rewards observed for that action in the past, as in ε-Greedy, its estimates will be biased, because the algorithm undersamples actions after bad reward draws. We then discuss two corrections that make the algorithm risk-neutral. Our first correction, which we call Reweighted ε-Greedy, counters the undersampling by adjusting the estimate to account for the probability with which the action was chosen in the first place. We show that Reweighted ε-Greedy is risk-neutral. We also propose a correction for a broader class of settings, Optimistic ε-Greedy. It adds an optimism term to the estimate that corrects for the pessimism induced by bad draws, in the style of the celebrated Upper Confidence Bound (UCB) algorithm (see Auer (2002); Auer et al. (2002); Lattimore and Szepesvári (2020)). Our third formal result shows that this correction also makes ε-Greedy risk-neutral. Finally, we use simulations to explore whether the conditions in our theoretical analysis are necessary and to document the transient persistence of risk-averse behavior even with unequal expected values for the actions.

Related literature

There is a large literature on learning in economics, compare, e.g., Bolton and Harris (1999), Keller et al. (2005), Klein and Rady (2011); see Bergemann and Välimäki (2017) for an early survey. The closest in spirit is Bardhi et al. (2020), which shows that even arbitrarily small differences in early-career discrimination can be highly consequential later on. Our results complement this literature by showing that algorithmic learning can exhibit unintended discrimination with strong long-run consequences. A second branch of the literature is empirical. Papers such as Farber and Gibbons (1996) and Altonji and Pierret (2001) study learning by employers, showing that it has testable implications for wage dynamics. Crawford and Shum (2005) applies learning to demand for pharmaceutical drugs. Recently, there has also been a strand that empirically analyzes the behavior of learning algorithms, particularly in relation to collusion. Calvano et al. (2020), Musolff (2022), and Brown and MacKay (2023) find in different settings that pricing algorithms learn to play collusive equilibria, raising antitrust concerns about the use of such algorithms. Such emergent behavior in a game-theoretic setting closely resembles the emergent preference we illustrate in an algorithmic decision-theoretic setting.

Our paper is connected to the computer science literature on the effect of biased payoff estimates in recommendation systems. Marlin and Zemel (2009) observes that online learning confounds average user scores in recommendation systems and proposes algorithmic interventions to correct this bias. Chaney et al. (2018) proposes a model and shows that recommendation systems’ biased estimates of user preferences can increase homogeneity and decrease user utility. Our study identifies noise in reward distributions as a source of biased estimates, rather than taking bias as given.

We also relate to the study of fairness in bandit problems. While Joseph et al. (2016) considers fairness (which is equivalent to our notion of risk neutrality) as a constraint for algorithm design and constructs algorithms that (approximately) satisfy it, this paper provides evidence on the risk preferences of an existing algorithm, ε-Greedy, and proposes two ways to mitigate them. See also Patil et al. (2021) and Liu et al. (2017) for treatments of fairness in bandit problems.

Finally, we relate to the regret analysis of bandit algorithms under diffusion scaling. Kalvit and Zeevi (2021a) studies this for the Upper Confidence Bound algorithm (compare Theorem 3). Fan and Glynn (2021) derives the limit action distribution of Thompson sampling as a solution to a random ordinary differential equation.

Outline

The structure of the rest of this note is as follows. In section 2, we introduce our online learning setup and our definition of risk aversion, along with formal definitions of our algorithms. The result on ε-Greedy’s risk aversion is presented in section 3. We discuss two corrections of risk aversion in section 4. We complement our theoretical analysis with simulations in section 5.

2 Model

In a bandit problem, a decision maker repeatedly takes an action from a finite set A, |A| < ∞. Each action a ∈ A is associated with a sub-Gaussian distribution F_a ∈ Δ(ℝ) with expectation μ_a and variance proxy σ_a². (A distribution is sub-Gaussian with variance proxy σ_a² if ∫ e^{λ(x−μ_a)} dF_a(x) ≤ exp(λ²σ_a²/2) for all λ ∈ ℝ.) A strategy or algorithm used by the decision maker is a function π: ⋃_{t=1}^∞ (A×ℝ)^t → Δ(A). We denote action-reward histories by (a_{1:t}, r_{1:t}) and the probability that action a ∈ A is chosen in round t+1 by π_{a,t+1} = π(a_{1:t}, r_{1:t})_a.

An algorithm π generates (potentially random) sequences of actions (a_t)_{t∈ℕ} and rewards, or payoffs, (r_t)_{t∈ℕ}: in each round t ∈ ℕ, the algorithm chooses an action a_t ∼ π_t and receives a reward r_t ∼ F_{a_t}. Denote by N_a(t) ≔ |{1 ≤ t' ≤ t : a_{t'} = a}| the number of times action a has been chosen up to time t.
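To fix ideas, the following minimal sketch (Python with numpy; not part of the original analysis) simulates this interaction protocol for an arbitrary policy. The two reward distributions and the uniform placeholder policy are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative two-action bandit: a "safe" point mass at 0 and a "risky" Uniform[-1, 1] arm.
actions = ["safe", "risky"]
reward_dists = {
    "safe": lambda: 0.0,
    "risky": lambda: float(rng.uniform(-1.0, 1.0)),
}

def uniform_policy(history):
    """Placeholder policy: ignores the history and mixes uniformly over actions."""
    return {a: 1.0 / len(actions) for a in actions}

def run(policy, T=1000):
    """Interact for T rounds: draw a_t from pi_t, then observe r_t ~ F_{a_t}."""
    history = []  # list of (action, reward) pairs
    for _ in range(T):
        probs = policy(history)
        a = rng.choice(actions, p=[probs[b] for b in actions])
        r = reward_dists[a]()
        history.append((a, r))
    return history

history = run(uniform_policy)
print({a: sum(1 for (b, _) in history if b == a) for a in actions})
```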

The main concept in this paper is a notion of risk aversion of algorithms. An algorithm is risk-neutral if it (asymptotically) chooses uniformly among actions of equal expectation. In the long run, a risk-averse algorithm prefers less risky actions over others of the same expectation. In the extreme case where the algorithm (asymptotically) exclusively chooses the least risky action among those of the same expectation, we call it perfectly risk-averse.

Definition 1.

We call π risk-neutral if for any actions a, a' ∈ A such that μ_a = μ_{a'},

\lim_{t\to\infty}\mathbb{P}[a_{t}=a]=\lim_{t\to\infty}\mathbb{P}[a_{t}=a^{\prime}].

We call an algorithm risk-averse if for all a, a' ∈ A such that μ_a = μ_{a'} and F_{a'} ≺_SOSD F_a (i.e., F_a second-order stochastically dominates F_{a'}, so that a is the less risky action), it holds that

\lim_{t\to\infty}\mathbb{P}[a_{t}=a]>\lim_{t\to\infty}\mathbb{P}[a_{t}=a^{\prime}].

An algorithm is perfectly risk-averse if, whenever there exists an a ∈ A such that for all actions a' ∈ A∖{a} either μ_{a'} < μ_a or F_{a'} ≺_SOSD F_a, it is true that

\lim_{t\to\infty}\mathbb{P}[a_{t}=a]=1.

This paper considers the ε-Greedy algorithm and two variants of ε-Greedy with different sufficient statistics.

Definition 2 (ε-Greedy).

Let (ε_t)_{t∈ℕ} be a [0,1]-valued sequence. ε-Greedy chooses the empirically best action with probability 1 − ε_t and randomizes uniformly over all actions with probability ε_t, i.e.,

\pi_{at}=\begin{cases}\operatorname{Unif}\bigl(\operatorname{arg\,max}_{a^{\prime}\in A}\mu_{a^{\prime}}(t-1)\bigr)&\text{ w.p. }1-\varepsilon_{t}\\ \operatorname{Unif}(A)&\text{ w.p. }\varepsilon_{t},\end{cases}

where μ_a(t−1) is the historical average payoff

\mu_{a}(t)\coloneqq\frac{1}{N_{a}(t)}\sum_{\begin{subarray}{c}1\leq t^{\prime}\leq t\\ a_{t^{\prime}}=a\end{subarray}}r_{t^{\prime}}.

When ε-Greedy takes an action to maximize μ_a(t−1), we say it exploits or takes an exploitation step. Otherwise, it explores. We also call μ_a(t−1) ε-Greedy’s sufficient statistic.
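As a concrete illustration (ours, not the paper's), the following minimal Python sketch implements ε-Greedy as defined above. The reward distributions, the horizon, and the schedule ε_t = 1/t are illustrative choices, and two details the definition leaves open (tie-breaking and the estimate for never-sampled actions) are handled by simple conventions noted in the comments.

```python
import numpy as np

rng = np.random.default_rng(1)

def eps_greedy(arms, T=20000, eps=lambda t: 1.0 / t):
    """Run epsilon-Greedy for T rounds; return the sequence of chosen arm indices."""
    k = len(arms)
    sums = np.zeros(k)     # cumulative reward per arm
    counts = np.zeros(k)   # number of pulls per arm
    chosen = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        # Historical average payoff; arms never pulled get the conventional estimate 0.
        means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        if rng.random() < eps(t):
            a = int(rng.integers(k))                     # explore uniformly
        else:
            best = np.flatnonzero(means == means.max())
            a = int(rng.choice(best))                    # exploit, breaking ties uniformly
        r = arms[a]()
        sums[a] += r
        counts[a] += 1
        chosen[t - 1] = a
    return chosen

# Safe arm: point mass at 0. Risky arm: Uniform[-1, 1] (same mean, positive variance).
arms = [lambda: 0.0, lambda: float(rng.uniform(-1.0, 1.0))]
choices = eps_greedy(arms)
print("share of safe-arm pulls over the last 1000 rounds:", np.mean(choices[-1000:] == 0))
```

With two actions of equal mean, Theorem 1 below predicts that the printed share of safe-arm pulls approaches one.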

The first variant reweighs data points to change their importance.

Definition 3 (Reweighted ε-Greedy).

Reweighted ε-Greedy uses a reweighted payoff estimate as its sufficient statistic:

\mu_{a,r}(t)=\frac{1}{N_{a}(t)}\sum_{\begin{subarray}{c}1\leq t^{\prime}\leq t\\ a_{t^{\prime}}=a\end{subarray}}\frac{r_{t^{\prime}}}{\sqrt{\pi_{at^{\prime}}}}.
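A short sketch of how this statistic could be maintained incrementally (Python; the helper name is ours). The only change relative to ε-Greedy is that each observed reward is divided by the square root of the probability with which its action was chosen in that round.

```python
import numpy as np

def update_reweighted(sums, counts, a, r, prob_a):
    """Incremental update of the Reweighted epsilon-Greedy statistic.

    sums[a] accumulates r / sqrt(prob_a), so the sufficient statistic is
    mu_{a,r}(t) = sums[a] / counts[a].  prob_a is the probability pi_{a t'}
    with which action a was chosen in the round that produced r.
    """
    sums[a] += r / np.sqrt(prob_a)
    counts[a] += 1
    return sums[a] / counts[a]

# Example: a reward of 0.4 for action 0, which was chosen with probability 0.05.
sums, counts = np.zeros(2), np.zeros(2)
print(update_reweighted(sums, counts, 0, 0.4, 0.05))  # 0.4 / sqrt(0.05) ~= 1.79
```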

A second intervention adds an optimism term to the sufficient statistic of ε-Greedy.

Definition 4 (Optimistic ε-Greedy).

Optimistic ε-Greedy uses the sufficient statistic μ_{a,o}, such that

\mu_{a,o}(t)=\mu_{a}(t)+\rho\sqrt{\frac{\log(t)}{N_{a}(t)}}.

If action a has not been chosen before, μ_{a,o}(t−1) = ∞.
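A sketch of the optimistic statistic (Python; the inputs and the value of ρ are illustrative):

```python
import numpy as np

def optimistic_stat(mean_a, n_a, t, rho=2.0):
    """mu_{a,o}(t) = mu_a(t) + rho * sqrt(log(t) / N_a(t)); actions that have
    never been chosen receive an infinite bonus, so each is tried at least once."""
    if n_a == 0:
        return np.inf
    return mean_a + rho * np.sqrt(np.log(t) / n_a)

# A rarely sampled arm gets a larger bonus than a frequently sampled one.
print(optimistic_stat(mean_a=0.0, n_a=5, t=1000, rho=2.0))
print(optimistic_stat(mean_a=0.0, n_a=500, t=1000, rho=2.0))
```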

3 Risk aversion of ε-Greedy

We first show that ε-Greedy exhibits perfectly risk-averse behavior.

Theorem 1.

Let (ε_t)_{t∈ℕ} be such that ε_t → 0 and ∑_{t=1}^∞ ε_t = ∞. If there is a deterministic, centered dominant action, and all other actions have symmetric continuous distributions, then ε-Greedy is perfectly risk-averse.

A discussion of the conditions in this result is in order before we move to the proof. Both conditions on the exploration rate are necessary to yield a no-regret algorithm, compare Lattimore and Szepesvári (2020), and hence are rather innocuous. We show in our simulations in section 5 that risk aversion, though not necessarily perfect risk aversion, persists when the requirement that the dominant action be deterministic is relaxed, and that it also persists when the symmetry requirement is relaxed. Our result also holds for Rademacher-distributed action rewards.

The intuition behind this result lies in the sampling bias of ε\varepsilon-Greedy. Upon receiving a low payoff for an action, it becomes less likely to choose that action and, hence, less likely to receive data to correct its estimate. This means that it keeps a pessimistic estimate of reward. This bias leads to behavior that is consistent with risk aversion.

Proof of Theorem 1.

We first observe that we can restrict attention to bandit problems with two actions of the same expectation: one deterministic dominant action a and another action a' with a continuous, symmetric distribution. Under the assumptions on the exploration rate made in the theorem, ε-Greedy chooses dominated actions with vanishing probability; we prove this in Lemma 1 in the appendix. Since this probability vanishes, we may restrict attention to actions of equal expectation. In addition, a union bound shows that if the probability that the algorithm chooses action a over any single action a' converges to 1, then a is chosen with probability converging to one among all actions of the same expectation. Hence, it is without loss to restrict to two-action bandit problems.

We also observe that the result is trivial for two deterministic actions with the same expectation. Hence, we may assume that F_{a'} has positive variance. Furthermore, it is without loss to assume that the deterministic action is centered: ε-Greedy is invariant to the addition of a constant to all reward distributions.

As a final reduction step, since ε_t → 0, it is sufficient to show that it becomes unlikely that ε-Greedy chooses a' in exploitation steps, i.e.,

\mathbb{P}[\mu_{a}(t)<\mu_{a^{\prime}}(t)]\to 0.

We express this event as a property of a stochastic process. As one of the actions is deterministic and (without loss of generality) centered, the sum of the payoffs of the non-deterministic action, denoted by (X_t)_{t∈ℕ}, is a sufficient statistic for the dynamics of the algorithm. (X_t)_{t∈ℕ} is a lazy random walk starting at the origin, X_0 = 0, with transition kernel

X_{t+1}=\begin{cases}X_{t}+r&\text{ with probability }(1-\tfrac{\varepsilon}{2})1_{X_{t}>0}+\tfrac{1}{2}1_{X_{t}=0}+\tfrac{\varepsilon}{2}1_{X_{t}<0},\\ X_{t}&\text{ else,}\end{cases}

where r ∼ F_{a'}. We call this the advantage walk and depict it in Figure 1.

(a) A stylized depiction of an advantage walk.
(b) One realization of the advantage walk for ε-Greedy where the safe action has distribution 𝟙_{0} while the risky action has distribution U[−1, 1].
Figure 1: The advantage walk for ε-Greedy. The main intuition for risk aversion in online algorithms is that a random walk with non-uniform variance spends more time in places with less variance.
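The advantage walk is easy to simulate directly. The sketch below (Python/numpy, with the illustrative distributions of Figure 1(b): a safe arm at 0 and a risky arm with Uniform[−1, 1] payoffs, and an assumed schedule ε_t = 1/t) reports the fraction of rounds in which the walk sits below zero, i.e., rounds in which the safe action is favored in exploitation steps.

```python
import numpy as np

rng = np.random.default_rng(2)

def advantage_walk(T=20000, eps=lambda t: 1.0 / t):
    """Simulate the advantage walk: X is the running sum of risky-arm payoffs,
    and it moves (the risky arm is pulled) with probability 1 - eps/2, 1/2,
    or eps/2 according to whether X is positive, zero, or negative."""
    X = 0.0
    below = 0
    for t in range(1, T + 1):
        e = eps(t)
        if X > 0:
            p_move = 1.0 - e / 2.0
        elif X == 0:
            p_move = 0.5
        else:
            p_move = e / 2.0
            below += 1
        if rng.random() < p_move:
            X += rng.uniform(-1.0, 1.0)  # increment drawn from the risky arm
    return below / T

print("fraction of rounds with the walk below zero:", advantage_walk())
```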

Consider the time elapsed since the random walk last crossed zero,

\tau_{t}\coloneqq t-\max\{2\leq t^{\prime}\leq t\;|\;X_{t^{\prime}-1}\leq 0\leq X_{t^{\prime}}\}.

We claim that τ_t → ∞ in probability as t → ∞.

To prove this claim, observe that for any c ≥ 0 and ε > 0, there are C > 0 and t ∈ ℕ such that for all t' ≥ t,

\mathbb{P}[\tau_{t^{\prime}}\leq c]\leq\mathbb{P}[\lvert X_{t^{\prime}-c}\rvert<C]+\varepsilon\leq 2\varepsilon.

The first inequality is a consequence of the sub-Gaussianity of F_{a'}. The second inequality is a result of var(F_{a'}) > 0, the conditional independence of increments, and ∑_{t'=t−c}^{t} ε_{t'} → ∞ as t → ∞.

We also consider the number of steps taken since the last crossing while the walk is on the positive side. These are distributed as

P_{t}\sim\sum_{t^{\prime}=t-\tau_{t}}^{t}Z_{t^{\prime}},\quad Z_{t^{\prime}}\sim\operatorname{Bern}(1-\varepsilon_{t^{\prime}}/2).

As P_t is a sum of independent bounded random variables, by Hoeffding’s inequality, P_t is close to its conditional expectation E[P_t | τ_t] with high probability in τ_t, and hence with high probability in t. This conditional expectation is

\mathbb{E}[P_{t}\,|\,\tau_{t}]=\sum_{t^{\prime}=t-\tau_{t}}^{t}\Bigl(1-\frac{\varepsilon_{t^{\prime}}}{2}\Bigr).

Hence, there is t ∈ ℕ such that for all t' ≥ t,

\mathbb{P}[X_{t^{\prime}}>0\,|\,\tau_{t^{\prime}}]\geq\mathbb{P}[Y_{1},Y_{2},\dots,Y_{P_{t^{\prime}}}>0\,|\,\tau_{t^{\prime}}],

where Y_0 = 0 and (Y_t)_{t∈ℕ} is a standard random walk with increment distribution F_{a'}.

Hence, for any δ > 0, there is t ∈ ℕ such that for all t' ≥ t,

\mathbb{P}[X_{t^{\prime}}>0\,|\,\tau_{t^{\prime}}]\leq c\,\mathbb{P}[Y_{1},Y_{2},\dots,Y_{P_{t^{\prime}}+1}>0]\leq\frac{c\,(1+\delta)}{\sqrt{\pi\sum_{s=t^{\prime}-\tau_{t^{\prime}}}^{t^{\prime}}\bigl(1-\frac{\varepsilon_{s}}{2}\bigr)}}.\qquad(1)

For the first inequality, we can choose c ≥ 1/P[r > X_{t'−τ_{t'}}], where r ∼ F_{a'} is independent of (X_t)_{t∈ℕ}; a single step from the crossing point could lead from 0 to X_{t'−τ_{t'}} or a higher value. Because X_{t'−τ_{t'}} is reached from X_{t'−τ_{t'}−1} < 0, the probability P[r > X_{t'−τ_{t'}}] is positive, and hence c is well-defined.

The second inequality uses the well-known property that a symmetric random walk stays positive until time t with probability approximately 1/√(πt) (Frisch and Frisch, 1995, Eqn. 35). (This property also holds for Rademacher-distributed increments and is the only place where we use the assumption that the distribution is continuous and symmetric.)

Observe that the right-hand side of (1) approaches 0 as τ_{t'} → ∞, and recall that τ_t → ∞ in probability as t → ∞. Given these two facts,

\mathbb{P}[X_{t}>0]\xrightarrow{t\to\infty}0.

As also P[X_t = 0] → 0 as t → ∞, this concludes the proof. ∎

We highlight that a similar argument applies to actions that are dominated by others in expectation. For two actions a, a' of the same expectation, such that a is deterministic but a' is not, and a third action a'' such that μ_{a''} > μ_a = μ_{a'}, the probability that a is taken, conditional on a or a' being taken, converges to one.

4 Achieving risk-neutrality

Next, we propose variants of the ε-Greedy sufficient statistic that achieve risk neutrality. These corrections are motivated by the technical analysis underlying the previous theorem, which pinpoints the reason for the emergent risk aversion, as well as by known algorithmic ideas (see, e.g., Auer (2002)). Crucially, we are also able to link them directly to insights on how policymakers could address possible biases in deployed algorithms. Thus, these “corrected” algorithms should also be seen as clarifying the underlying mechanisms and their redressal.

4.1 A reweighting approach to risk-neutrality

The first approach stems from our analysis of the variance of the algorithm’s estimates: reweighting the data can restore risk neutrality. Reweighted ε-Greedy is provably risk-neutral if exploration is sufficiently high.

Theorem 2 (Reweighted ε-Greedy).

Let ε_t = t^{−1/2+κ}, κ ∈ (0, 1/2). Reweighted ε-Greedy is risk-neutral for two centered actions, one of which is deterministic.

The reweighting proposed here means that payoffs resulting from currently unfavoured actions are weighted more heavily in the sufficient statistic of the algorithm. In the credit scoring setting, if the option of rejecting the loan is currently favored, the outcome of any credit that is given (due to exploration) is weighted more heavily. Thus, this correction tells us to assign more importance to outcomes that resulted from choices that seemed a priori less attractive.

It is worth noting that this reweighting does not lead to an unbiased estimator of action rewards; we show in simulations in Appendix B that an unbiased estimator leads to a risk-affine algorithm. This is a consequence of what the rescaling does to the internal sufficient statistic: the goal of the algorithm proposed here is to equalize variance, which is at odds with producing an unbiased estimator.

Several comments on the assumptions are in order. The assumption on exploration means that a sufficient amount of exploration is needed for this algorithm to be risk-neutral; we provide a simulation in section 5 showing that this condition is necessary for risk neutrality. The assumption of centeredness and determinism is needed for our proof, but risk neutrality seems to hold beyond it, as we show in section 5. On the other hand, the requirement that there are only two actions with the same expectation is crucial, as we show in another simulation in section 5.

Proof.

Note that if both actions are deterministic, the conclusion of the theorem holds trivially. Otherwise, denote the non-deterministic action by a ∈ A. The evolution of choices can be expressed solely in terms of the stochastic process

Y_{t}=\sum_{\begin{subarray}{c}1\leq t^{\prime}\leq t\\ a_{t^{\prime}}=a\end{subarray}}\frac{r_{t^{\prime}}}{\sqrt{\pi_{at^{\prime}}}}.

If Y_t > 0, action a is chosen with probability 1 − ε_t/2; if Y_t = 0, it is chosen with probability 1/2; and if Y_t < 0, it is chosen with probability ε_t/2. We define the random array

X_{t^{\prime}t}=\frac{1}{\sqrt{t}}Y_{t^{\prime}}.

This random array has the following properties:

Martingale

It is an L²-martingale array, i.e., (X_{t't})_{1≤t'≤t} is a square-integrable martingale with respect to its natural filtration.

Asymptotic Variance

The sum of the conditional variances of the martingale increments is constant (and deterministic):

\begin{split}\sum_{t^{\prime}=1}^{t}\mathbb{E}[(X_{t^{\prime}t}-X_{(t^{\prime}-1)t})^{2}\,|\,X_{(t^{\prime}-1)t}]&=\sum_{t^{\prime}=1}^{t}\sum_{a\in A}\pi_{a(t^{\prime}+1)}\,\mathbb{E}\left[\frac{r_{a}^{2}}{t\,\pi_{a(t^{\prime}+1)}}\,\middle|\,X_{t^{\prime}t}\right]\\ &=\sum_{t^{\prime}=1}^{t}\frac{1}{t}\sum_{a\in A}\sigma_{a}^{2}\\ &=\sigma^{2}_{a},\end{split}

where the last equality uses that the deterministic action contributes zero variance. In particular, the sum of conditional variances converges (trivially) to σ_a² in probability as t → ∞.

Lindeberg Condition

For any ε > 0, we have

\begin{split}\sum_{t^{\prime}=1}^{t}\mathbb{E}[(X_{t^{\prime}t}-X_{(t^{\prime}-1)t})^{2}\mathds{1}_{\lvert X_{t^{\prime}t}-X_{(t^{\prime}-1)t}\rvert\geq\varepsilon}\,|\,X_{(t^{\prime}-1)t}]&\leq 2\sum_{t^{\prime}=1}^{t}t^{-1}(t^{\prime})^{1-2\kappa}\,\mathbb{E}[r^{2}\mathds{1}_{r\geq t^{\frac{1}{2}}(t^{\prime})^{\frac{1}{2}-\kappa}\varepsilon}]\\ &=\frac{2}{t}\sum_{t^{\prime}=1}^{t}(t^{\prime})^{1-2\kappa}\,\mathbb{E}[r^{2}\mathds{1}_{r\geq t^{\frac{1}{2}}(t^{\prime})^{\frac{1}{2}-\kappa}\varepsilon}]\\ &\leq\frac{2}{t}\sum_{t^{\prime}=1}^{t}(t^{\prime})^{1-2\kappa}\sigma_{a}^{2}e^{\frac{\lambda^{2}\sigma_{a}^{2}}{2}-\lambda t^{\frac{1}{2}}(t^{\prime})^{\frac{1}{2}-\kappa}\varepsilon}\\ &\leq\frac{2}{t}\sum_{t^{\prime}=1}^{t}t^{1-2\kappa}\sigma_{a}^{2}e^{\frac{\lambda^{2}\sigma_{a}^{2}}{2}-\lambda t^{\frac{1}{2}}\varepsilon}\\ &\to 0.\end{split}

The first inequality uses the fact that we can bound

\mathbb{E}[(X_{t^{\prime}t}-X_{(t^{\prime}-1)t})^{2}\,|\,X_{(t^{\prime}-1)t}]\leq\frac{\mathbb{E}[r^{2}]}{t\min_{a\in A}\pi_{at^{\prime}}}\leq\frac{2\,\mathbb{E}[r^{2}]}{t\,\varepsilon_{t^{\prime}}}=\frac{2\sigma_{a}^{2}}{t\,\varepsilon_{t^{\prime}}}.

The second step is arithmetic. The third applies a Chernoff bound. The last uses 1 ≤ t' ≤ t. The convergence follows because exponential decay dominates polynomial growth and because convergence of a sequence implies convergence of its Cesàro mean.

Given these conditions, we can apply a martingale central limit theorem (Hall and Heyde, 1980, Corollary 3.1) and conclude that the distribution of X_tt converges to N(0, σ_a²). This means that

\mathbb{P}[X_{tt}>0],\;\mathbb{P}[X_{tt}<0]\xrightarrow{t\to\infty}\frac{1}{2},

and hence

\lim_{t\to\infty}\mathbb{P}[a_{t}=a]=\frac{1}{2}. ∎

While this approach, as we have shown, works for two actions, another approach to risk neutrality, optimism, allows us to guarantee risk neutrality for an arbitrary number of actions of equal expectation. This is what we discuss in the next subsection.

4.2 An optimism approach to risk-neutrality

Another way to modify the sufficient statistic is not to reweight but to explicitly favor alternatives that have been chosen less frequently in the past. Conventionally, this is referred to as optimism in the multi-armed bandit literature, compare (Slivkins et al., 2019, Section 1.3.3). We show that it ensures risk neutrality.

Theorem 3 (Optimistic ε-Greedy).

There exists ρ_0 > 1 such that for any ρ ≥ ρ_0 and any (ε_t)_{t∈ℕ} with ε_t → 0, Optimistic ε-Greedy is risk-neutral.

Proof Sketch.

Note first that in non-exploration steps, Optimistic ε-Greedy follows the same policy as Upper Confidence Bound with exploration coefficient ρ, compare Auer (2002). We adapt the proof of (Kalvit and Zeevi, 2021b, Theorem 2) to our variant of ε-Greedy. Theorem 2 in Kalvit and Zeevi (2021b) shows that an optimistic policy without exploration has the property that

\lim_{t\to\infty}\mathbb{P}[a_{t}=a]=\frac{1}{\lvert\operatorname{arg\,max}_{a^{\prime}\in A}\mu_{a^{\prime}}\rvert}

for all a ∈ argmax_{a'∈A} μ_{a'} and ρ > ρ_0. As Optimistic ε-Greedy does not incur regret, as we show in Appendix A, this property implies that Optimistic ε-Greedy is risk-neutral.

The full proof can be found in (Kalvit and Zeevi, 2021a, Appendix D). The argument goes as follows. First, show that N_i(t)/t > 1/(2|argmax_{a∈A} μ_a|) with probability approaching 1 in the limit, i.e., the fraction of times any optimal action is chosen can be bounded below in probability. This is proved by means of a union bound and a Hoeffding bound; the operative equation that yields this lower bound is (Kalvit and Zeevi, 2021a, Eqn. 40), which depends on (Kalvit and Zeevi, 2021a, Eqns. 35 and 39). Using this lower bound, one shows that |N_i(t) − N_j(t)| is small, i.e., the difference in the number of times any two optimal actions are chosen is small as time goes to infinity; this uses Markov’s inequality, along with the Law of the Iterated Logarithm. Now consider Optimistic ε-Greedy. Since it implements the Upper Confidence Bound policy with probability 1 − ε_t and randomizes otherwise, in either case it must eventually choose uniformly from amongst the highest-mean actions, except for the vanishing probability with which it chooses dominated actions. ∎

It is worth noting why the mathematical intuition from the earlier result on ε-Greedy breaks down. In that proof, the main object was a random walk with different variances for positive and negative values of the sufficient statistic. Optimism may be seen as introducing a drift towards the origin, and the proof shows that this drift is strong enough to correct risk aversion. We also note that this approach works for non-centered random variables.

This result demonstrates that one way to build a fairer world is to be as optimistic as is rationally possible. Linking back to the credit decisions example we referenced in the introduction, credit scoring algorithms should evaluate applicants in the best possible light, adjusting for the risk profile of minority applicants by accounting for the fact that they often have lower access to good credit opportunities.

5 Simulations

For the final section of this note, we use simulations to investigate how far our results extend beyond the conditions we have imposed in our theoretical results.
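The probabilities reported in the figures are estimates of P[a_t = a] obtained by averaging indicators over independent runs. A minimal harness of this kind (Python/numpy; the algorithm, distributions, horizon, and number of runs are illustrative stand-ins, not a reproduction of the paper's exact simulation code) could look as follows.

```python
import numpy as np

rng = np.random.default_rng(3)

def eps_greedy_choices(arms, T, eps):
    """One run of epsilon-Greedy; returns the chosen arm index in every round."""
    sums, counts = np.zeros(len(arms)), np.zeros(len(arms))
    out = np.empty(T, dtype=int)
    for t in range(1, T + 1):
        means = np.where(counts > 0, sums / np.maximum(counts, 1), 0.0)
        if rng.random() < eps(t):
            a = int(rng.integers(len(arms)))
        else:
            a = int(rng.choice(np.flatnonzero(means == means.max())))
        sums[a] += arms[a]()
        counts[a] += 1
        out[t - 1] = a
    return out

def estimate_choice_probs(arms, T=5000, runs=100, eps=lambda t: 1.0 / t):
    """Estimate P[a_t = a] at a few checkpoints by averaging indicators over runs."""
    checkpoints = np.linspace(T // 10, T, 10, dtype=int) - 1
    hits = np.zeros((len(checkpoints), len(arms)))
    for _ in range(runs):
        choices = eps_greedy_choices(arms, T, eps)
        for i, t in enumerate(checkpoints):
            hits[i, choices[t]] += 1
    return checkpoints + 1, hits / runs

# Safe arm: point mass at 0. Risky arm: Uniform[-1, 1] (equal means).
arms = [lambda: 0.0, lambda: float(rng.uniform(-1.0, 1.0))]
for t, p in zip(*estimate_choice_probs(arms)):
    print(f"t = {t:5d}   P[safe] ~ {p[0]:.2f}   P[risky] ~ {p[1]:.2f}")
```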

On the conditions of Theorem 1

Our first set of simulations, shown in Figure 2, relates to the conditions in Theorem 1. 2(a) shows that within the conditions of Theorem 1, ε-Greedy exhibits perfect risk aversion. 2(b) shows that the assumption of symmetry is not necessary: even for asymmetric distributions such as the exponential distribution, risk aversion holds. 2(c) shows that it also holds for distributions that are not centered around 0. If there is no optimal deterministic action, we show in 5(a) that ε-Greedy is risk-averse yet not perfectly risk-averse. Hence, we find that the risk aversion result is relatively robust.

On the conditions of Theorem 2

Next, Figure 3 shows that the results of Theorem 2 extend beyond its assumptions, but only as long as exploration is sufficiently high. The restriction to two actions is necessary, as Figure 6 shows.

On the conditions of Theorem 3

Finally, we consider the assumptions of Theorem 3. In Figure 4, we see that Optimistic ε-Greedy is risk-neutral quite generally as long as ρ is sufficiently high, which is in line with Theorem 3. When ρ is low, the level of optimism is insufficient to counteract ε-Greedy’s pessimism.

Finite-time behavior for dominated actions

We show in 5(b) that ε-Greedy’s risk aversion has large transient effects before asymptotic guarantees kick in. This clarifies that the bias we identify can persist for a long time—for example, credit decisions can continue to be significantly discriminatory with such an algorithm even when the minority candidate has a strictly higher likelihood of repaying the loan.

(a) Perfect risk aversion.
(b) Perfect risk aversion.
(c) Perfect risk aversion.
Figure 2: Plots of the behaviour of ε-Greedy, under and outside the conditions of Theorem 1. In all plots, the dominated action has payoff distribution 𝟙_{−1}. In (a), the safe action has distribution 𝟙_{0}, the riskier action has payoff distribution U[−0.5, 0.5], and the riskiest action has payoff distribution U[−1, 1]. In (b), the safe action has distribution 𝟙_{10} while the risky action has an exponential distribution with rate 10, i.e., Exp(10). In (c), the safe action has distribution 𝟙_{0.5} while the risky action has distribution U[−1, 2]. We set ε_t = t^{−1}. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.
(a) Risk neutral.
(b) Risk neutral.
(c) Risk averse.
Figure 3: Plots showing that Reweighted ε-Greedy is risk-neutral for two actions as long as exploration is sufficiently high. In (a), the safe action has payoff distribution 𝟙_{0} while the risky action has payoff distribution U[−1, 1]. In (b), the safe action on the left has distribution U[0.25, 0.75] while the risky action has payoff distribution U[0, 1]. We set ε_t = t^{−0.49} in (a) and (b). In (c), we use the same setup as in experiment (a), but with an exploration rate of ε_t = t^{−1}. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.
(a) Risk neutral.
(b) Risk neutral.
(c) Risk averse.
Figure 4: Plots showing that Optimistic ε-Greedy is quite generally risk-neutral, as long as the exploration coefficient ρ is sufficiently high. In all plots, the dominated action has payoff distribution 𝟙_{−1}. In (a), the safe action has payoff distribution U[−0.25, 0.25] while the risky action has payoff distribution U[−1, 1]. In (b), the safe action on the left has distribution U[0.25, 0.75] while the risky action has payoff distribution U[0, 1]. Both (a) and (b) have ρ = 2. In (c), the safe action has payoff distribution 𝟙_{0}, and the risky action has payoff distribution U[−1, 1], with ρ = 0.02. We set ε_t = t^{−1}. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.
(a) ε-Greedy with no optimal safe action.
(b) ε-Greedy with a strictly better risky action.
Figure 5: Plots showing that ε-Greedy’s risk aversion is quite general. In (a), the dominated action has reward distribution 𝟙_{−1}, the safe action has distribution U[−0.25, 0.25], the riskier action has payoff distribution U[−0.5, 0.5], and the riskiest action has payoff distribution U[−1, 1]. In (b), the safe action has payoff distribution 𝟙_{0} while the risky action has payoff distribution U[−1, 1.2]; the dominated action has payoff distribution 𝟙_{−1}. We set ε_t = t^{−1}. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.
Figure 6: Reweighted ε-Greedy where we add a third action with distribution 𝟙_{−1} to the setup where the safe action has payoff distribution 𝟙_{0} while the risky action has payoff distribution U[−1, 1]. We set ε_t = t^{−0.49}. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.

References

  • Altonji and Pierret (2001) Joseph G Altonji and Charles R Pierret. 2001. Employer learning and statistical discrimination. The Quarterly Journal of Economics 116, 1 (2001), 313–350.
  • Auer (2002) Peter Auer. 2002. Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research 3, Nov (2002), 397–422.
  • Auer et al. (2002) Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. Finite-time analysis of the multiarmed bandit problem. Machine learning 47, 2 (2002), 235–256.
  • Baradaran (2018) Mehrsa Baradaran. 2018. Jim Crow Credit. UC Irvine L. Rev. 9 (2018), 887.
  • Bardhi et al. (2020) Arjada Bardhi, Yingni Guo, and Bruno Strulovici. 2020. Early-Career Discrimination: Spiraling or Self-Correcting? (2020).
  • Bergemann and Välimäki (2017) Dirk Bergemann and Juuso Välimäki. 2017. Bandit Problems. Palgrave Macmillan UK, London, 1–7. https://doi.org/10.1057/978-1-349-95121-5_2386-1
  • Bolton and Harris (1999) Patrick Bolton and Christopher Harris. 1999. Strategic experimentation. Econometrica 67, 2 (1999), 349–374.
  • Brown and MacKay (2023) Zach Y Brown and Alexander MacKay. 2023. Competition in pricing algorithms. American Economic Journal: Microeconomics 15, 2 (2023), 109–156.
  • Calvano et al. (2020) Emilio Calvano, Giacomo Calzolari, Vincenzo Denicolo, and Sergio Pastorello. 2020. Artificial intelligence, algorithmic pricing, and collusion. American Economic Review 110, 10 (2020), 3267–3297.
  • Chaney et al. (2018) Allison J. B. Chaney, Brandon M. Stewart, and Barbara E. Engelhardt. 2018. How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility. In Proceedings of the 12th ACM Conference on Recommender Systems (Vancouver, British Columbia, Canada) (RecSys ’18). Association for Computing Machinery, New York, NY, USA, 224–232. https://doi.org/10.1145/3240323.3240370
  • Crawford and Shum (2005) Gregory S Crawford and Matthew Shum. 2005. Uncertainty and learning in pharmaceutical demand. Econometrica 73, 4 (2005), 1137–1173.
  • Fan and Glynn (2021) Lin Fan and Peter W. Glynn. 2021. Diffusion Approximations for Thompson Sampling. arXiv:2105.09232 [cs.LG]
  • Farber and Gibbons (1996) Henry S Farber and Robert Gibbons. 1996. Learning and wage dynamics. The Quarterly Journal of Economics 111, 4 (1996), 1007–1047.
  • Frisch and Frisch (1995) Uriel Frisch and Hélene Frisch. 1995. Universality of escape from a half-space for symmetrical random walks. In Lévy Flights and Related Topics in Physics: Proceedings of the International Workshop Held at Nice, France, 27–30 June 1994. Springer, Berlin, 262–268.
  • Hall and Heyde (1980) P. Hall and C.C. Heyde. 1980. 3 - The Central Limit Theorem. In Martingale Limit Theory and its Application, P. Hall and C.C. Heyde (Eds.). Academic Press, 51–96. https://doi.org/10.1016/B978-0-12-319350-6.50009-8
  • Joseph et al. (2016) Matthew Joseph, Michael Kearns, Jamie H Morgenstern, and Aaron Roth. 2016. Fairness in learning: Classic and contextual bandits. Advances in Neural Information Processing Systems 29 (2016).
  • Kalvit and Zeevi (2021a) Anand Kalvit and Assaf Zeevi. 2021a. A Closer Look at the Worst-case Behavior of Multi-armed Bandit Algorithms. arXiv:2106.02126 [cs.LG]
  • Kalvit and Zeevi (2021b) Anand Kalvit and Assaf Zeevi. 2021b. A closer look at the worst-case behavior of multi-armed bandit algorithms. Advances in Neural Information Processing Systems 34 (2021), 8807–8819.
  • Keller et al. (2005) Godfrey Keller, Sven Rady, and Martin Cripps. 2005. Strategic experimentation with exponential bandits. Econometrica 73, 1 (2005), 39–68.
  • Klein and Rady (2011) Nicolas Klein and Sven Rady. 2011. Negatively correlated bandits. The Review of Economic Studies 78, 2 (2011), 693–732.
  • Lattimore and Szepesvári (2020) Tor Lattimore and Csaba Szepesvári. 2020. Bandit Algorithms. Cambridge University Press. https://doi.org/10.1017/9781108571401
  • Liu et al. (2017) Yang Liu, Goran Radanovic, Christos Dimitrakakis, Debmalya Mandal, and David C Parkes. 2017. Calibrated fairness in bandits. arXiv preprint arXiv:1707.01875 (2017).
  • Marlin and Zemel (2009) Benjamin M. Marlin and Richard S. Zemel. 2009. Collaborative Prediction and Ranking with Non-Random Missing Data. In Proceedings of the Third ACM Conference on Recommender Systems (New York, New York, USA) (RecSys ’09). Association for Computing Machinery, New York, NY, USA, 5–12. https://doi.org/10.1145/1639714.1639717
  • Musolff (2022) Leon Musolff. 2022. Algorithmic pricing facilitates tacit collusion: Evidence from e-commerce. In Proceedings of the 23rd ACM Conference on Economics and Computation. 32–33.
  • Patil et al. (2021) Vishakha Patil, Ganesh Ghalme, Vineet Nair, and Yadati Narahari. 2021. Achieving fairness in the stochastic multi-armed bandit problem. The Journal of Machine Learning Research 22, 1 (2021), 7885–7915.
  • Slivkins et al. (2019) Aleksandrs Slivkins et al. 2019. Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning 12, 1-2 (2019), 1–286.

Appendix A Regret analysis of algorithms

This section provides evidence that the algorithms we consider do not incur regret.

Definition 5.

An algorithm π is no-regret, or does not incur regret, if for all bandit problems in which F_a is sub-Gaussian for all a ∈ A, and for all a, a' ∈ A, μ_a < μ_{a'} implies

\mathbb{P}[a_{t}=a]\xrightarrow{t\to\infty}0.

We provide a proof sketch for the following statement, which implies that ε-Greedy does not incur regret.

Lemma 1.

Let ε_t → 0 and ∑_{t=1}^∞ ε_t = ∞. Also let a ∈ A with μ_a < max_{a'∈A} μ_{a'}. Then, for any δ > 0, there is t ∈ ℕ such that for all t' ≥ t,

\pi_{at^{\prime}}\leq\delta.
Proof Sketch.

Choose t' ∈ ℕ such that for all t̃ ≥ t', ε_{t̃} ≤ δ/2. Let a' ∈ argmax_{a''∈A} μ_{a''}. It is sufficient to show that for some t'', in exploitation steps,

\mathbb{P}[\mu_{a}(t)-\mu_{a^{\prime}}(t)\geq 0]\leq\frac{\delta}{2}.

By Hoeffding’s inequality, with high probability in t'', both actions are chosen at least (1/3)∑_{s=1}^{t''} ε_s times. Conditional on this event,

\mathbb{P}[\mu_{a}(t)-\mu_{a^{\prime}}(t)\geq 0]\xrightarrow[t\to\infty]{}0.

In particular,

\mathbb{P}[\mu_{a}(t)-\mu_{a^{\prime}}(t)\geq 0]\leq\frac{\delta}{2}

for some t'' ∈ ℕ and all t̃ ≥ t''. Choosing t ≥ max{t', t''} yields the claim. ∎

Corollary 1.

ε-Greedy does not incur regret.

Next, we provide a proof sketch that Optimistic ε-Greedy does not incur regret.

Proposition 1.

Optimistic ε-Greedy does not incur regret.

Proof Sketch.

This proof is similar to the proof that Upper Confidence Bound does not incur regret (see, e.g., Auer (2002)). The only difference between the Upper Confidence Bound algorithm and Optimistic ε-Greedy is the exploration, which vanishes in the limit. It remains the case that the confidence bounds are valid, and hence there is a logarithmic upper bound on the expected number of times the algorithm chooses a sub-optimal action in exploitation steps. As in the original analysis of the Upper Confidence Bound algorithm, this amounts to a sub-linearly growing number of sub-optimal choices, and hence a vanishing probability of choosing a sub-optimal action. ∎

Finally, we give evidence that Reweighted ε-Greedy does not incur regret, first with a theoretical argument for a restricted class of instances and then, for more general settings, using an experiment.

Proposition 2.

Let ε_t = t^{−1/2+κ}. Reweighted ε-Greedy does not incur regret if the best arm is deterministic and centered.

Proof Sketch.

We show that Reweighted ε-Greedy does not incur regret with one deterministic, centered arm and one other arm with negative expected payoff. Using a union bound, this implies the no-regret property for any number of negative-expectation arms against a deterministic centered arm.

Let a, a' ∈ A be such that 0 = μ_{a'} > μ_a. The statement is trivial if F_a is deterministic.

If F_a is non-deterministic, observe that, as in the proof of Theorem 2,

Y_{t}=\sum_{1\leq t^{\prime}\leq t}\frac{r_{t^{\prime}}}{\sqrt{\pi_{t^{\prime}}}}

is a sufficient statistic for Reweighted ε-Greedy. It is sufficient to show that P[Y_t < 0] → 1 as t → ∞. (As in Theorem 1, P[Y_t = 0] → 0 as t → ∞.) To show this, we condition on Y_t’s last crossing time τ_t through zero. As in the proof of Theorem 1, we observe that t − τ_t grows large. We will show that, conditional on a large t − τ_t, the probability that Y_t > 0 is small, which implies the claim.

To this end, we observe that each increment has a bias of μ_a/√(1 − ε_t/2) and a standard deviation of

\frac{\sqrt{\operatorname{var}(F_{a})}}{\sqrt{1-\frac{\varepsilon_{t}}{2}}},

and that increments are independent conditional on the last crossing time. An application of a Chernoff bound shows that the probability that the walk is positive is small. ∎

We supplement this theoretical result with a simulation beyond the conditions of Proposition 2. Figure 7 shows that even with a non-deterministic, non-centered action of highest expected payoff, Reweighted ε-Greedy appears not to incur regret.

(a) Reweighted ε-Greedy for two actions, with a strictly better risky action.
(b) Reweighted ε-Greedy for three actions, with a strictly better risky action.
Figure 7: Plots illustrating that Reweighted ε-Greedy is no-regret. In (a), the lower-mean action has distribution Exp(1) and the higher-mean action has payoff distribution Exp(2) + N(0, 1). In (b), the lower-mean actions have reward distributions 𝟙_{0.35} and U[0.25, 0.75] while the higher-mean action has payoff distribution U[−1, 3]. We set ε_t = t^{−0.49}. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.

Appendix B Additional simulations

Both ε-Greedy’s and Reweighted ε-Greedy’s reward estimates are biased (compare (Lattimore and Szepesvári, 2020, Section 11.2)). It is natural to ask whether debiasing the reward estimates can address emergent risk preferences. Figure 8 simulates an ε-Greedy with the unbiased sufficient statistic

\mu_{a,d}(t)=\frac{1}{N_{a}(t)}\sum_{\begin{subarray}{c}1\leq t^{\prime}\leq t\\ a_{t^{\prime}}=a\end{subarray}}\frac{r_{t^{\prime}}}{\pi_{at^{\prime}}}.

The debiasing leads to risk affinity as opposed to risk neutrality. One intuition for this is that the division by the square root of the choice probability in Reweighted ε-Greedy provides a correction that makes the algorithm exactly risk-neutral. Debiasing, which divides by the choice probability itself (a smaller number than its square root), leads to a stronger correction, in the direction of risk affinity.
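For concreteness, here is a sketch of the two statistics side by side (Python; the function names are ours): the debiased variant divides each reward by the choice probability itself, while the reweighted variant of section 4.1 divides by its square root, so a bad draw observed under a small choice probability is amplified much more strongly by the debiased statistic.

```python
import numpy as np

def debiased_stat(rewards, probs):
    """mu_{a,d}(t): average of r / pi over the rounds in which the action was chosen."""
    rewards, probs = np.asarray(rewards, float), np.asarray(probs, float)
    return float(np.mean(rewards / probs))

def reweighted_stat(rewards, probs):
    """mu_{a,r}(t): average of r / sqrt(pi), as in Definition 3."""
    rewards, probs = np.asarray(rewards, float), np.asarray(probs, float)
    return float(np.mean(rewards / np.sqrt(probs)))

rewards = [0.3, -0.2, 0.5]   # rewards observed for one action
probs = [0.9, 0.05, 0.5]     # probabilities with which it was chosen in those rounds
print(debiased_stat(rewards, probs), reweighted_stat(rewards, probs))
```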

Figure 8: ε-Greedy with a debiased reward estimate. The safe action has reward distribution 𝟙_{0} while the risky action has reward distribution U[−1, 1]. Note that the bands around the plot line are 90% confidence intervals, with 100 independent runs.