
Examining average and discounted reward optimality criteria in reinforcement learning

Vektor Dewanto, Marcus Gallagher
School of Information Technology and Electrical Engineering
University of Queensland, Australia
[email protected], [email protected]
Abstract

In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are average and discounted rewards. While the latter is more popular, it is problematic to apply in environments without an inherent notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) justifications for and complications of an artificial discount factor, and c) benefits of directly maximizing the average reward criterion, which is discounting-free. Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredients and mechanisms for applying a family of discounting-free optimality criteria (Veinott, 1969) to RL.

1 Introduction

Reinforcement learning (RL) is concerned with sequential decision making, where a decision maker has to choose an action based on its current state. Determining the best actions amounts to finding an optimal mapping from every state to a probability distribution over actions available at that state. Thus, one fundamental component of any RL method is the optimality criterion, by which we define what we mean by such an optimal mapping.

The most popular optimality criterion in RL is the discounted reward (Duan et al., 2016; Machado et al., 2018; Henderson et al., 2018). On the other hand, there is growing interest in average reward optimality, as surveyed by Mahadevan (1996a) and Dewanto et al. (2020). In this paper, we discuss both criteria in order to obtain a comprehensive understanding of their properties, relationships, and differences. This is important because the choice of optimality criterion affects almost every aspect of RL methods, including the policy evaluation function, the policy gradient formulation, and the resulting optimal policy, where the term policy refers to the above-mentioned mapping. Thus, the choice of optimality criterion eventually impacts the performance of an RL system (and the choices made within it, e.g. approximation techniques and hyperparameters).

This paper presents a thorough examination of the connection between average and discounted rewards, as well as a discussion of their pros and cons in RL. Our examination is devised through the broader lens of refined optimality criteria (which generalize average and discounted rewards), inspired by the seminal work of Mahadevan (1996b). It is also broader in the sense of algorithmic styles, covering value- and policy-iteration methods, as well as tabular and function-approximation settings in RL. In this paper, we also attempt to compile and unify various justifications for deviating from the common practice of maximizing the discounted reward criterion in RL.

There are a number of papers that specifically compare and contrast average and discounted rewards in RL. For example, Mahadevan (1994) empirically investigated average- versus discounted-reward Q-learning (in the context of value-iteration RL). Tsitsiklis and Van Roy (2002) theoretically compared average- versus discounted-reward temporal-difference (TD) methods for policy evaluation (in the context of policy-iteration RL). We provide updates and extensions to those existing comparisons in order to obtain a comprehensive view of discounting-based and discounting-free RL.

Since RL is approximate dynamic programming (DP), we begin by reviewing optimality criteria in DP, model classification (which plays an important role in average-reward optimality), and the interpretation of discounting in Sec 2. We then provide justifications for discounting in environments without any inherent notion of discounting (Sec 3). This is followed by the difficulties that arise from introducing an artificial discount factor (Sec 4). We discuss the benefits of maximizing the average-reward criterion in Sec 5, as well as our findings and viewpoint in Sec 6.

2 Preliminaries

Sequential decision making is often formulated as a Markov decision process (MDP) with a state set $\mathcal{S}$, an action set $\mathcal{A}$, a reward set $\mathcal{R}$, and a decision-epoch set $\mathcal{T}$. Here, all but $\mathcal{T}$ are finite sets, yielding an infinite-horizon finite MDP. At the current decision epoch $t \in \mathcal{T}$, a decision maker (henceforth, an agent) is in a current state $s_t \in \mathcal{S}$, and chooses to then execute a current action $a_t \in \mathcal{A}$. Consequently, at the next decision epoch $t+1$, it arrives in the next state $s_{t+1} \in \mathcal{S}$ and earns an (immediate) scalar reward $r_{t+1} \in \mathbb{R}$. The decision epochs occur every single time unit (i.e. timestep) from $t=0$ until the maximum timestep (denoted $t_{\mathrm{max}}$); hence, we have a discrete-time MDP.

For $t = 0, 1, \ldots, t_{\mathrm{max}} = \infty$, the agent experiences a sequence (trajectory) of states $s_t$, actions $a_t$, and rewards $r_{t+1}$; that is, $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{t_{\mathrm{max}}}$. The initial state, the next state, and the next reward are governed by the environment dynamics, which is fully specified by three time-homogeneous (time-invariant) entities: i) the initial state distribution $\mathring{p}$, from which $S_{t=0} \sim \mathring{p}$, ii) the one-(time)step state transition distribution $p(\cdot|s_t, a_t)$, from which $S_{t+1} | S_t = s_t, A_t = a_t \sim p(\cdot|s_t, a_t)$, and iii) the reward function $r(s_t, a_t) = \mathbb{E}_{p(\cdot|s_t,a_t)}\big[\sum_{r \in \mathcal{R}} \Pr\{r | s_t, a_t, S_{t+1}\} \cdot r\big] \eqqcolon r_{t+1}$, where $S_t$ and $A_t$ denote the state and action random variables, respectively. The existence of a probability space that holds this infinite sequence of random variables $(S_0, A_0, S_1, A_1, \ldots)$ can be shown using the Ionescu-Tulcea theorem (Lattimore and Szepesvári, 2020, p513). We assume that the rewards are bounded, i.e. $|r(s,a)| \leq r_{\mathrm{max}} < \infty$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, and that the MDP is unichain and aperiodic (see Sec 2.2 for MDP classification).
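To make this notation concrete, the following minimal Python/NumPy sketch (with illustrative numbers and hypothetical array names) encodes a finite MDP through its initial state distribution $\mathring{p}$, transition distribution $p(\cdot|s,a)$, and reward function $r(s,a)$, and samples a trajectory under a stationary randomized policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny finite MDP with |S| = 3 states and |A| = 2 actions (illustrative numbers).
n_states, n_actions = 3, 2
p0 = np.array([1.0, 0.0, 0.0])                                     # initial state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s'] = p(s'|s, a)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))             # r[s, a] = r(s, a)

# A stationary randomized policy: pi[s, a] = pi(a|s).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def sample_trajectory(p0, P, r, pi, t_max, rng):
    """Sample (s_0, a_0, r_1, s_1, a_1, r_2, ...) for t_max steps under a stationary policy."""
    s = rng.choice(len(p0), p=p0)
    states, actions, rewards = [s], [], []
    for _ in range(t_max):
        a = rng.choice(P.shape[1], p=pi[s])           # A_t ~ pi(.|s_t)
        s_next = rng.choice(P.shape[2], p=P[s, a])    # S_{t+1} ~ p(.|s_t, a_t)
        actions.append(a)
        rewards.append(r[s, a])                       # expected immediate reward r(s_t, a_t)
        states.append(s_next)
        s = s_next
    return states, actions, rewards

states, actions, rewards = sample_trajectory(p0, P, r, pi, t_max=10, rng=rng)
print(rewards)
```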

The solution to a sequential decision-making problem is an optimal mapping from every state to a probability distribution over the (available) action set $\mathcal{A}$ in that state (in general, different states may have different action sets). It is optimal with respect to some optimality criterion, as discussed later in Sec 2.1. Any such mapping (regardless of whether it is optimal) is called a policy and generally depends on the timestep $t$. It is denoted as $\pi_t: \mathcal{S} \mapsto [0,1]^{|\mathcal{A}|}$, or alternatively $\pi_t: \mathcal{S} \times \mathcal{A} \mapsto [0,1]$, where $\pi_t(a_t|s_t) \coloneqq \Pr\{A_t = a_t | S_t = s_t\}$ indicates the probability of selecting action $a_t \in \mathcal{A}$ given the state $s_t \in \mathcal{S}$ at timestep $t \in \mathcal{T}$. Thus, each action is sampled from a conditional action distribution, i.e. $A_t | S_t = s_t \sim \pi_t(\cdot|s_t)$.

The most specific solution space is the stationary and deterministic policy set $\Pi_{\mathrm{SD}}$, whose policies $\pi \in \Pi_{\mathrm{SD}}$ are stationary (time-invariant), i.e. $\pi \coloneqq \pi_0 = \pi_1 = \ldots = \pi_{t_{\mathrm{max}}-1}$, as well as deterministic, i.e. $\pi(\cdot|s_t)$ has a single action support (hence the mapping reduces to $\pi: \mathcal{S} \mapsto \mathcal{A}$). In this work, we consider a more general policy set, namely the stationary policy set $\Pi_{\mathrm{S}}$. It includes the stationary randomized (stochastic) set $\Pi_{\mathrm{SR}}$ and its degenerate counterpart, the stationary deterministic set $\Pi_{\mathrm{SD}}$.

2.1 Optimality criteria

In a basic notion of optimality, a policy with the largest value is optimal. That is,

$$v_x(\pi_x^*) \geq v_x(\pi), \quad \forall \pi \in \Pi_{\mathrm{S}}. \qquad (1)$$

Here, the function $v_x$ measures the value (utility) of a policy $\pi$ based on the infinite reward sequence that is earned by an agent following $\pi$. The subscript $x$ indicates the specific type of value function, which induces a specific $x$-optimality criterion, hence the $x$-optimal policy, denoted $\pi_x^*$.

One intuitive value function is the expected total reward (this intuition seems to lead to the so-called reward hypothesis (Sutton and Barto, 2018, p53)). That is,

$$v_{\mathrm{tot}}(\pi, s) \coloneqq \lim_{t_{\mathrm{max}} \to \infty} \mathbb{E}_{A_t \sim \pi(\cdot|s_t),\, S_{t+1} \sim p(\cdot|s_t,a_t)}\left[\sum_{t=0}^{t_{\mathrm{max}}-1} r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right], \quad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}. \qquad (2)$$

However, $v_{\mathrm{tot}}$ may be infinite (unbounded, divergent, non-summable); if a limit is equal to infinity, we assert that the limit does not exist. No comparison can be made between finite and infinite policy values, nor between infinite policy values. Howard (1960) therefore examined the expected average reward (also referred to as the gain), defined as

$$v_g(\pi, s) \coloneqq \lim_{t_{\mathrm{max}} \to \infty} \frac{1}{t_{\mathrm{max}}}\, \mathbb{E}_{A_t \sim \pi(\cdot|s_t),\, S_{t+1} \sim p(\cdot|s_t,a_t)}\left[\sum_{t=0}^{t_{\mathrm{max}}-1} r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right], \qquad (3)$$

which is finite for all $\pi \in \Pi_{\mathrm{S}}$ and all $s \in \mathcal{S}$. For more details, including the interpretation of the gain, we refer the reader to Dewanto et al. (2020).
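As a quick sanity check of (3), the gain of a stationary policy can be estimated by simply averaging the rewards along a sufficiently long trajectory; a minimal sketch (illustrative MDP, hypothetical helper) follows.

```python
import numpy as np

def estimate_gain(p0, P, r, pi, t_max, rng):
    """Monte-Carlo estimate of the gain v_g(pi, s_0): the long-run average reward per timestep."""
    s = rng.choice(len(p0), p=p0)
    total = 0.0
    for _ in range(t_max):
        a = rng.choice(P.shape[1], p=pi[s])
        total += r[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])
    return total / t_max   # approximates the Cesaro limit in Eq (3) for large t_max

# Illustrative 2-state, 2-action MDP and a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])     # P[s, a, s']
r = np.array([[1.0, 0.0], [-1.0, 2.0]])      # r[s, a]
p0 = np.array([1.0, 0.0])
pi = np.full((2, 2), 0.5)

print(estimate_gain(p0, P, r, pi, t_max=50_000, rng=np.random.default_rng(1)))
```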

Alternatively, Blackwell (1962, Sec 4) attempted to tackle the infiniteness of (2) through the expected total discounted reward, which was studied earlier by Bellman (1957). (There are other discounting schemes, e.g. $1/(t^2+t)$, $1/(1+\kappa t)$ for $\kappa > 0$ (hyperbolic), and $t^{-\kappa}$ for $\kappa > 1$ (Hutter, 2006; Lattimore and Hutter, 2014), but they do not exhibit the advantageous properties of the geometric discounting (4), such as its decomposition via (6); hence, those other schemes are not considered here.) That is,

$$v_\gamma(\pi, s) \coloneqq \lim_{t_{\mathrm{max}} \to \infty} \mathbb{E}_{A_t \sim \pi(\cdot|s_t),\, S_{t+1} \sim p(\cdot|s_t,a_t)}\left[\sum_{t=0}^{t_{\mathrm{max}}-1} \gamma^t\, r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right], \quad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}, \qquad (4)$$

with a discount factor $\gamma \in [0,1)$. In particular, according to what is later known as the truncated Laurent series expansion (7), Blackwell suggested finding policies that are $\gamma$-discounted optimal for all discount factors $\gamma$ sufficiently close to 1, and established their existence in finite MDPs. Subsequently, Smallwood (1966) identified that the discount-factor interval can be divided into a finite number of intervals (see also Hordijk et al. (1985) and Feinberg and Shwartz (2002, Thm 2.16)), i.e. $[0 = \gamma_m, \gamma_{m-1}), [\gamma_{m-1}, \gamma_{m-2}), \ldots, [\gamma_0, \gamma_{-1} = 1)$, in such a way that there exist policies $\pi^*_{\gamma_i}$ for $0 \leq i \leq m$ that are $\gamma$-discounted optimal for all $\gamma \in [\gamma_i, \gamma_{i-1})$. This leads to the concept of Blackwell optimality. A policy $\pi^*_{\mathrm{Bw}}$ is Blackwell optimal if there exists a critical discount factor $\gamma_{\mathrm{Bw}} \in [0,1)$ (critical in that it specifies how close to 1 is sufficient for attaining Blackwell optimal policies) such that

$$v_\gamma(\pi^*_{\mathrm{Bw}}, s) \geq v_\gamma(\pi, s), \quad \text{for } \gamma_{\mathrm{Bw}} \leq \gamma < 1, \text{ and for all } \pi \in \Pi_{\mathrm{S}} \text{ and all } s \in \mathcal{S}. \qquad (5)$$

Note that whenever the policy value function $v$ depends not only on the policy $\pi$ but also on the state $s$ from which the value is measured, the basic optimality (1) requires that the optimal policy has a value greater than or equal to every other policy's value in every state $s \in \mathcal{S}$.

In order to obtain Blackwell optimal policies, Veinott (1969, Eqn 27) introduced a family of new optimality criteria. That is, a policy $\pi_n^*$ is $n$-discount optimal, for $n = -1, 0, \ldots$, if

$$\lim_{\gamma \to 1} \frac{1}{(1-\gamma)^n} \Big( v_\gamma(\pi_n^*, s) - v_\gamma(\pi, s) \Big) \geq 0, \qquad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}.$$

He showed that $(n = |\mathcal{S}|)$-discount optimality is equivalent to Blackwell optimality because the selectivity increases with $n$ (hence, $n_i$-discount optimality implies $n_j$-discount optimality for all $i > j$). (This is eventually refined to $(n = |\mathcal{S}| - n_{\mathrm{re}}(\pi^*_{\mathrm{Bw}}))$-discount optimality, where $n_{\mathrm{re}}(\pi^*_{\mathrm{Bw}})$ is the number of recurrent classes under the Blackwell optimal policy (Feinberg and Shwartz, 2002, Thm 8.3, Sec 8.1.4).) He then developed a policy-iteration algorithm (for finding $n$-discount optimal policies) that utilizes the full Laurent series expansion of $\boldsymbol{v}_\gamma(\pi)$ for any $\pi \in \Pi_{\mathrm{S}}$, where the policy value vector $\boldsymbol{v}_x(\pi) \in \mathbb{R}^{|\mathcal{S}|}$ is obtained by stacking the state-wise values $v_x(\pi, s)$ for all $s \in \mathcal{S}$. That is,

$$\begin{aligned}
\boldsymbol{v}_\gamma(\pi) &= \frac{1}{\gamma}\Big(\frac{\gamma}{1-\gamma}\boldsymbol{v}_{-1}(\pi) + \boldsymbol{v}_0(\pi) + \sum_{n=1}^{\infty}\Big(\frac{1-\gamma}{\gamma}\Big)^n \boldsymbol{v}_n(\pi)\Big) && \text{(Full expansion)} \qquad (6) \\
&= \frac{1}{1-\gamma}\boldsymbol{v}_{-1}(\pi) + \boldsymbol{v}_0(\pi) + \Big\{\frac{1-\gamma}{\gamma}\boldsymbol{v}_0(\pi) + \frac{1}{\gamma}\sum_{n=1}^{\infty}\Big(\frac{1-\gamma}{\gamma}\Big)^n \boldsymbol{v}_n(\pi)\Big\} && \text{(Using the additive identity $\boldsymbol{v}_0 - \boldsymbol{v}_0 = \boldsymbol{0}$)} \\
&= \frac{1}{1-\gamma}\boldsymbol{v}_{-1}(\pi) + \boldsymbol{v}_0(\pi) + \big\{\boldsymbol{e}(\pi, \gamma)\big\}, && \text{(Truncated expansion)} \qquad (7)
\end{aligned}$$

where $\boldsymbol{v}_n \in \mathbb{R}^{|\mathcal{S}|}$ for $n = -1, 0, \ldots$ denotes the expansion coefficients. The truncated form appeared first (before the full one) in Blackwell (1962, Thm 4a), where $\boldsymbol{v}_{-1}$ is equivalent to the gain from all states, $\boldsymbol{v}_g \in \mathbb{R}^{|\mathcal{S}|}$, $\boldsymbol{v}_0$ is equivalent to the so-called bias, and the residual $\boldsymbol{e}(\pi, \gamma) = \mathcal{O}(1-\gamma)$ converges to $\boldsymbol{0}$ as $\gamma$ approaches 1.
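For a concrete handle on $\boldsymbol{v}_{-1}$ (the gain) and $\boldsymbol{v}_0$ (the bias), the sketch below computes them for an aperiodic unichain MC using the standard limiting-matrix and deviation-matrix formulas of Puterman (1994, Ch 8), and then checks the truncated expansion (7) numerically; the transition matrix and rewards are illustrative choices of ours.

```python
import numpy as np

# Induced chain of a fixed stationary policy: P_pi[s, s'] (cf. Eq (8) in Sec 2.2) and r_pi[s].
# Illustrative numbers for a 3-state aperiodic unichain MC.
P_pi = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.4, 0.5]])
r_pi = np.array([1.0, -0.5, 2.0])
n = len(r_pi)

# Stationary distribution p*: solve p* P_pi = p*, sum(p*) = 1; P_star has p* in every row.
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
p_star, *_ = np.linalg.lstsq(A, b, rcond=None)
P_star = np.tile(p_star, (n, 1))

# Laurent coefficients (Puterman, 1994, Ch 8): gain v_{-1} = P* r, bias v_0 = H r,
# with the deviation matrix H = (I - P_pi + P*)^{-1} - P*.
Z = np.linalg.inv(np.eye(n) - P_pi + P_star)
v_gain = P_star @ r_pi
v_bias = (Z - P_star) @ r_pi

# Check the truncated expansion (7): v_gamma ~= v_gain / (1 - gamma) + v_bias + O(1 - gamma).
for gamma in [0.9, 0.99, 0.999]:
    v_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    residual = v_gamma - (v_gain / (1.0 - gamma) + v_bias)
    print(gamma, np.max(np.abs(residual)))   # shrinks roughly linearly in (1 - gamma)
```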

2.2 MDP classification

A stationary policy $\pi$ of a finite MDP induces a stationary finite Markov chain (MC) with an $|\mathcal{S}|$-by-$|\mathcal{S}|$ one-step transition (stochastic) matrix $\boldsymbol{P}_{\!\pi}$, whose $[s,s']$-th entry indicates

$$p_\pi(s'|s) \coloneqq \Pr\{S_{t+1} = s' | S_t = s; \pi\} = \sum_{a \in \mathcal{A}} p(s'|s,a)\, \pi(a|s), \quad \forall s, s' \in \mathcal{S}. \qquad (8)$$

Therefore, an MDP can be classified on the basis of the set of MCs induced by its stationary policies. To this end, we need to review the classification of states, and then of MCs, below; mainly following Puterman (1994, Appendix A) and Douc et al. (2018, Ch 6.2).

State classification:

Let $\boldsymbol{P}_{\!\pi}^t$ denote the $t$-th power of $\boldsymbol{P}_{\!\pi}$ of a Markov chain. This $\boldsymbol{P}_{\!\pi}^t$ is another $|\mathcal{S}|$-by-$|\mathcal{S}|$ stochastic matrix, whose $[s,s']$-entry indicates $p_\pi^t(s'|s) \coloneqq \Pr\{S_t = s' | S_0 = s; \pi\}$, i.e. the probability of being in state $s' \in \mathcal{S}$ in $t$ timesteps starting from a state $s \in \mathcal{S}$.

A state $s' \in \mathcal{S}$ is accessible from state $s \in \mathcal{S}$, denoted $s \to s'$, if $p_\pi^t(s'|s) > 0$ for some $t \geq 0$. Furthermore, $s$ and $s'$ communicate if $s \to s'$ and $s' \to s$.

A subset $\tilde{\mathcal{S}}$ of $\mathcal{S}$ is called a closed set if no state outside $\tilde{\mathcal{S}}$ is accessible from any state in $\tilde{\mathcal{S}}$. Furthermore, a subset $\tilde{\mathcal{S}}$ of $\mathcal{S}$ is called a recurrent class (or a recurrent chain) if all states in $\tilde{\mathcal{S}}$ communicate and $\tilde{\mathcal{S}}$ is closed. In a finite MC, there exists at least one recurrent class.

Let $N_{\mathrm{vis}}^{s,\pi}$ denote the number of visits to a state $s$ under a policy $\pi$, i.e. $N_{\mathrm{vis}}^{s,\pi} \coloneqq \sum_{t=0}^{\infty} \mathbb{I}[S_t = s\, |\, \pi]$, where $\mathbb{I}$ denotes the indicator function. Then, the expected number of visits to a state $s \in \mathcal{S}$, starting from the state $s$ itself under a policy $\pi$, is given by $\mathbb{E}\big[N_{\mathrm{vis}}^{s,\pi}\big] = \sum_{t=0}^{\infty} p_\pi^t(s|s)$.

A state $s \in \mathcal{S}$ is recurrent if and only if $\mathbb{E}\big[N_{\mathrm{vis}}^{s,\pi}\big] = \infty$. This means that a recurrent state will be re-visited infinitely often. Equivalently, starting from a recurrent state $s \in \mathcal{S}$, the probability of returning to $s$ itself for the first time in finite time is 1.

A state $s \in \mathcal{S}$ is transient if and only if $\mathbb{E}\big[N_{\mathrm{vis}}^{s,\pi}\big] < \infty$. This means that a transient state will never be re-visited again after some point in time. Equivalently, starting from a transient state $s \in \mathcal{S}$, the probability of returning to $s$ itself for the first time in finite time is less than 1. A transient state is a non-recurrent state (hence, every state is either recurrent or transient).

The period of a state $s \in \mathcal{S}$ is defined as the greatest common divisor $t_{\mathrm{gcd}}$ of all $t \geq 1$ for which $p_\pi^t(s|s) > 0$. Whenever the period $t_{\mathrm{gcd}} > 1$, we classify $s$ as periodic; otherwise, $s$ is aperiodic (not periodic). Intuitively, if returning to a state $s$ occurs at irregular times, then $s$ is aperiodic. Note that periodicity is a class property, implying that all states in a recurrent class have the same period.

An ergodic class (or an ergodic chain) is a class that is both recurrent and aperiodic. Its recurrent and aperiodic states are referred to as ergodic states. Note, however, that the term "ergodic" is also used in ergodic theory (mathematics), where its definition does not involve the notion of aperiodicity.

Markov chain (MC) classification:

An MC is irreducible if the set of all its states forms a single recurrent class. An irreducible MC is aperiodic if all its states are aperiodic. Otherwise, it is periodic. An MC is reducible if it has both recurrent states and transient states. The state set of a reducible MC can be partitioned into one or more disjoint recurrent classes, plus a set of transient states.

An MC is unichain if it consists of a single (uni-) recurrent class (chain), plus a (possibly empty) set of transient states. Otherwise, it is multichain.
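The above chain classification can be checked mechanically from $\boldsymbol{P}_{\!\pi}$; the following sketch (a hypothetical helper of ours, based on reachability) identifies the recurrent classes and transient states of a finite MC and reports whether it is unichain.

```python
import numpy as np

def classify_mc(P_pi, tol=1e-12):
    """Classify a finite MC given its transition matrix P_pi[s, s'].

    Returns (recurrent_classes, transient_states, is_unichain)."""
    n = P_pi.shape[0]
    # reach[s, s'] = True iff s' is accessible from s in zero or more steps.
    reach = np.eye(n, dtype=bool) | (P_pi > tol)
    for _ in range(n):   # transitive closure by repeated "squaring"
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    # A state s is recurrent iff every state accessible from s can access s back.
    recurrent = np.array([all(reach[t, s] for t in range(n) if reach[s, t]) for s in range(n)])
    transient_states = [s for s in range(n) if not recurrent[s]]
    # Group recurrent states into communicating (recurrent) classes.
    classes, seen = [], set()
    for s in range(n):
        if recurrent[s] and s not in seen:
            cls = {t for t in range(n) if reach[s, t] and reach[t, s]}
            classes.append(sorted(cls))
            seen |= cls
    return classes, transient_states, len(classes) == 1

P_example = np.array([[0.5, 0.5, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.3, 0.0, 0.7]])   # illustrative: state 1 absorbing, 0 and 2 transient
print(classify_mc(P_example))             # ([[1]], [0, 2], True) -> unichain
```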

MDP classification based on the pattern of MCs induced by all stationary policies:

An MDP is unichain if the induced MC corresponding to every stationary policy is unichain. Otherwise, it is multichain (i.e. at least one induced MC is multichain, containing two or more recurrent classes). This is a restrictive definition of a multichain MDP; in the literature, a looser definition is also used, in which multichain simply means a general MDP.

A recurrent MDP is a special case of a unichain MDP in which the MC corresponding to every stationary policy is irreducible. In other words, a recurrent MDP is a unichain MDP all of whose induced MCs have an empty transient state set.

MDP classification based on the pattern of state accessibility under some stationary policy:

An MDP is communicating (also termed strongly connected) if, for every pair of states $s$ and $s'$ in the state set $\mathcal{S}$, there exists a stationary policy (which may depend on $s$ and $s'$) under which $s'$ is accessible from $s$. A recurrent MDP is always communicating, but a communicating MDP may not be recurrent; it may be unichain or multichain.

An MDP is weakly communicating (also termed simply connected) if there exists a closed state set $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ for which there exists a stationary policy under which $s$ is accessible from $s'$ for every pair of states $s, s' \in \tilde{\mathcal{S}}$, plus a (possibly empty) set of states that are transient under every stationary policy. The weakly communicating classification is more general than the communicating one in that any communicating MDP is weakly communicating, with no state that is transient under every stationary policy. Unichain MDPs are always weakly communicating.

There exists an MDP that is not weakly communicating. Such a non-weakly-communicating MDP is always multichain, but a multichain MDP may also be weakly communicating (or communicating).

2.3 Discounting in environments with an inherent notion of discounting

An environment with an inherent notion of discounting has a discount factor $\gamma$ that encodes one of the following entities; thus, $\gamma$ is part of the environment specification (definition, description).

Firstly, the time value of rewards, i.e. the value of a unit reward $t$ timesteps in the future is $\gamma^t$: this is related to psychological concepts. For example, some people prefer rewards now rather than later (Mischel et al., 1972), hence they assign greater values to early rewards through a small $\gamma$ (being shortsighted). It is also natural to believe that there is more certainty about the near future than the far future, because immediate rewards are (exponentially more) likely due to recent actions. The time preference is also well motivated in economics (Samuelson, 1937). This includes using $\gamma$ to account for the decreasing value of money (because of inflation), as well as the interpretation of $(1-\gamma)/\gamma$ as a positive interest rate. Moreover, commercial activities have failure (abandonment) risk due to changing government regulations and consumer preferences over time.

Secondly, the uncertainty about random termination independent of the agent's actions: such termination comes from external control beyond the agent, e.g. someone shutting down a robot, engine failure (due to weather or natural disaster), or the death of a living organism.

In particular, whenever the random termination time $T_{\mathrm{max}}$ follows a geometric distribution $\mathrm{Geo}(p = 1 - \gamma)$, we have the following identity between the total and discounted rewards,

$$v_{T_{\mathrm{max}}}(\pi, s) \coloneqq \mathbb{E}_{S_t, A_t}\left[\mathbb{E}_{T_{\mathrm{max}}}\left[\sum_{t=0}^{T_{\mathrm{max}}-1} r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right]\right] = \underbrace{v_\gamma(\pi, s)}_{\text{see (4)}}, \quad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}, \qquad (9)$$

where the discount factor $\gamma$ plays the role of the geometric distribution parameter (Puterman, 1994, Prop 5.3.1). This discounting implies that at every timestep (for any state-action pair), the agent enters the 0-reward absorbing terminal state with probability $(1-\gamma)$, see Fig 4(a). Note that because $\gamma$ is invariant to states and actions (as well as time), this way of capturing the randomness of $T_{\mathrm{max}}$ may be inaccurate in cases where termination depends on states, actions, or both.
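The identity (9) is easy to verify by simulation: terminate each rollout at a random time $T_{\mathrm{max}} \sim \mathrm{Geo}(1-\gamma)$, average the undiscounted returns, and compare with the analytic discounted value. A sketch with an illustrative two-state chain (our own numbers) follows.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.8

# Induced chain of a fixed policy: P_pi[s, s'] and r_pi[s] (illustrative numbers).
P_pi = np.array([[0.7, 0.3], [0.4, 0.6]])
r_pi = np.array([1.0, -2.0])
s0 = 0

# Analytic discounted value, Eq (4): v_gamma = (I - gamma P_pi)^{-1} r_pi.
v_gamma = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Monte Carlo estimate of Eq (9): total (undiscounted) reward up to a random termination
# time T_max ~ Geometric(p = 1 - gamma), i.e. P(T_max = k) = (1 - gamma) gamma^(k-1), k >= 1.
returns = []
for _ in range(50_000):
    t_max = rng.geometric(1.0 - gamma)
    s, total = s0, 0.0
    for _ in range(t_max):
        total += r_pi[s]
        s = rng.choice(2, p=P_pi[s])
    returns.append(total)

print(v_gamma[s0], np.mean(returns))   # agree up to Monte Carlo error
```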

3 Discounting without an inherent notion of discounting

From now on, we focus on environments without an inherent notion of discounting, where $\gamma$ is not part of the environment specification (cf. Sec 2.3). We emphasize the qualification "inherent" since any MDP can always be thought of as having some notion of discounting from the Blackwell optimality point of view (5). This is because a Blackwell optimal policy is guaranteed to exist in finite MDPs (Puterman, 1994, Thm 10.1.4), implying the existence of its discount factor $\gamma_{\mathrm{Bw}} \in [0,1)$ and of a (potentially very long) finite-horizon MDP model that gives exactly the same Blackwell-optimal policy as its infinite-horizon counterpart (Chang et al., 2013, Ch 1.3).

When there is no inherent notion of discounting, the discount factor $\gamma$ is imposed for bounded sums (Sec 2.1) and becomes part of the solution method (algorithm). This is what we refer to as artificial discounting, which induces artificial interpretations such as those described in Sec 2.3. The $\gamma_{\mathrm{Bw}}$ mentioned in the previous paragraph is one such artificial discount factor. In addition to bounding the sum, we observe other justifications that have been made for introducing an artificial $\gamma$. They are explained in the following Secs 3.1, 3.2 and 3.3.

3.1 Approximation to the average reward (the gain) as $\gamma$ approaches 1

For recurrent MDPs, gain optimality is the most selective criterion (for an intuitive explanation of the selectivity of optimality criteria, refer to Fig 1) because there are no transient states (Feinberg and Shwartz, 2002, Sec 3.1). This implies that a gain optimal policy is also Blackwell optimal in recurrent MDPs, for which one should therefore target the gain optimality criterion. Nonetheless, the following relationships exist between the average-reward value $v_g$ and the discounted-reward value $v_\gamma$ of a policy $\pi \in \Pi_{\mathrm{S}}$.

Firstly, for every state $s_0 \in \mathcal{S}$,

$$\begin{aligned}
v_g(\pi, s_0) &= \lim_{\gamma \to 1}\ (1-\gamma)\, v_\gamma(\pi, s_0) && \text{(Puterman, 1994, Corollary 8.2.5)} \qquad (10) \\
&= (1-\gamma) \sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, v_\gamma(\pi, s), \quad \forall \gamma \in [0,1), && \text{(Singh et al., 1994, Sec 5.3)} \qquad (11)
\end{aligned}$$

where $p_\pi^\star(s|s_0)$ denotes the stationary probability of a state $s$, i.e. the long-run (steady-state) probability of being in state $s$ when the MC begins in $s_0$. Here, (10) is obtained by multiplying the left- and right-hand sides of (7) by $(1-\gamma)$, then taking the limit of both sides as $\gamma \to 1$. The derivation of (11) begins with taking the expectation of $v_\gamma(\pi, S)$ with respect to $p_\pi^\star$, then utilizes the discounted-reward Bellman (expectation) equation and the stationarity of $p_\pi^\star$. That is,

$$\begin{aligned}
\sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, \Big\{ v_\gamma(\pi, s) \Big\}
&= \sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0) \Big\{ \underbrace{r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s'|s)\, v_\gamma(\pi, s')}_{v_\gamma(\pi,s) \text{ via the Bellman equation}} \Big\} \\
&= \underbrace{\sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, r_\pi(s)}_{\text{gain } v_g(\pi, s_0) \text{ in (3)}} + \gamma \sum_{s' \in \mathcal{S}} \underbrace{\sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, p_\pi(s'|s)}_{p_\pi^\star(s'|s_0) \text{ due to stationarity}} v_\gamma(\pi, s'),
\end{aligned}$$

which can be re-arranged to obtain (11). Here, $r_\pi(s) \coloneqq \sum_{a \in \mathcal{A}} \pi(a|s)\, r(s,a)$, whereas $p_\pi(s'|s)$ is defined in (8). It is interesting that any discount factor $\gamma \in [0,1)$ maintains the equality in (11), which was also proved by Sutton and Barto (2018, p254).
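Both identities can be verified numerically on a small chain; the sketch below (illustrative numbers) computes the gain via the stationary distribution, checks (11) for a $\gamma$ far from 1, and checks (10) as $\gamma \to 1$.

```python
import numpy as np

# Induced chain of a fixed policy (illustrative numbers; any aperiodic unichain P_pi works).
P_pi = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.4, 0.5]])
r_pi = np.array([1.0, -0.5, 2.0])
n = len(r_pi)

# Stationary distribution p_star (independent of s_0 in an aperiodic unichain MC).
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
p_star, *_ = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)
gain = p_star @ r_pi                                 # v_g = sum_s p*(s) r_pi(s), cf. Eq (3)

# Identity (11): holds for ANY gamma in [0, 1).
gamma = 0.3
v_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(gain, (1.0 - gamma) * p_star @ v_gamma)        # equal up to round-off

# Identity (10): (1 - gamma) v_gamma(s) -> v_g as gamma -> 1, for every state s.
for gamma in [0.9, 0.99, 0.999]:
    v_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    print(gamma, (1.0 - gamma) * v_gamma)            # each entry approaches the gain
```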

The second relationship pertains to the gradient of the gain when a parameterized policy $\pi(\boldsymbol{\theta})$ is used. By notationally suppressing the policy parameterization $\boldsymbol{\theta}$ and the dependency on $s_0$, as well as using $\nabla \coloneqq \partial/\partial\boldsymbol{\theta}$, this relation can be expressed as

$$\begin{aligned}
\nabla v_g(\pi) &= \lim_{\gamma \to 1} \Big\{ \sum_{s \in \mathcal{S}} \sum_{s' \in \mathcal{S}} p_\pi^\star(s) \underbrace{\sum_{a \in \mathcal{A}} p(s'|s,a)\, \pi(a|s)\, \nabla \log \pi(a|s)}_{\nabla p_\pi(s'|s)}\, v_\gamma(\pi, s') \Big\} \qquad (12) \\
&= \underbrace{\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_\pi^\star(s)\, \pi(a|s) \Big[ q_\gamma(\pi, s, a)\, \nabla \log \pi(a|s) \Big]}_{\text{involving the \emph{discounted} state-action value } q_\gamma} + \underbrace{(1-\gamma) \sum_{s \in \mathcal{S}} p_\pi^\star(s) \Big[ v_\gamma(\pi, s)\, \nabla \log p_\pi^\star(s) \Big]}_{\text{involving the \emph{discounted} state value } v_\gamma}, \qquad (13)
\end{aligned}$$

for all $\gamma \in [0,1)$. Notice that the right-hand sides (RHS's) of (12) and (13) involve the discounted-reward value functions, i.e. the state value function $v_\gamma(\pi, s), \forall s \in \mathcal{S}$ in (4) and the corresponding (state-)action value function $q_\gamma(\pi, s, a), \forall (s,a) \in \mathcal{S} \times \mathcal{A}$. They are related via $v_\gamma(\pi, s) = \mathbb{E}_{A \sim \pi}\big[q_\gamma(\pi, s, A)\big]$. The identity (12) was shown by Baxter and Bartlett (2001, Thm 2), whereas (13) was derived from (11) by Morimura et al. (2010, Appendix A).

Thus, for attaining average-reward optimality, one can maximize $v_\gamma$, but merely as an approximation to $v_g$ because the equality in (10) holds only in the limit, i.e. setting $\gamma$ exactly to 1 is prohibited by definition (4). A similar limiting behaviour applies to (12), where $v_\gamma$ is used to approximately compute the gain gradient $\nabla v_g$, yielding approximately gain-optimal policies. In particular, Jin and Sidford (2021, Lemma 3) proved that an $\epsilon$-gain-optimal policy is equivalent to an $\frac{\epsilon}{3(1-\gamma)}$-discounted-optimal policy with a certain $\gamma$ value that depends on $\epsilon$ and a parameter describing some property of the target MDP. Here, a policy $\pi$ is said to be $\varepsilon$-$x$-optimal, for any positive $\varepsilon$ and a criterion $x$, if its values satisfy $v_x(\pi, s) \geq v_x(\pi_x^*, s) - \varepsilon$ for all $s \in \mathcal{S}$.

The aforementioned approximation of $v_g$ by $v_\gamma$ with $\gamma \to 1$ seems to justify the use of the stationary state distribution $p_\pi^\star$ (instead of the (improper) discounted state distribution $p_\pi^\gamma$ in (14) below) to weight the errors in recurrent states in approximate discounted policy evaluation, see e.g. Dann et al. (2014, Eqn 4). In fact, the identity in (10) implies the following connection,

$$\underbrace{\boldsymbol{P}_{\!\pi}^\star \boldsymbol{r}_{\!\pi}}_{\boldsymbol{v}_g(\pi)} = \lim_{\gamma \to 1} (1-\gamma) \underbrace{\boldsymbol{P}_{\!\pi}^\gamma \boldsymbol{r}_{\!\pi}}_{\boldsymbol{v}_\gamma(\pi)}, \quad \text{such that } \boldsymbol{P}_{\!\pi}^\star = \lim_{\gamma \to 1} (1-\gamma)\, \boldsymbol{P}_{\!\pi}^\gamma, \ \text{hence } p_\pi^\star = \lim_{\gamma \to 1} \underbrace{(1-\gamma)\, p_\pi^\gamma}_{\text{a proper distribution}}, \qquad (14)$$
$$\text{where} \quad \boldsymbol{P}_{\!\pi}^\star = \lim_{t_{\mathrm{max}} \to \infty} \frac{1}{t_{\mathrm{max}}} \sum_{t=0}^{t_{\mathrm{max}}-1} \boldsymbol{P}_{\!\pi}^t, \quad \text{and} \quad \boldsymbol{P}_{\!\pi}^\gamma = \lim_{t_{\mathrm{max}} \to \infty} \sum_{t=0}^{t_{\mathrm{max}}-1} (\gamma \boldsymbol{P}_{\!\pi})^t. \qquad (15)$$

Here, the $s_0$-th row of $\boldsymbol{P}_{\!\pi}^\star$ contains the probability values of the stationary state distribution $p_\pi^\star(\cdot|s_0)$. On the other hand, the $s_0$-th row of $\boldsymbol{P}_{\!\pi}^\gamma$ contains the values of the improper discounted state distribution $p_\pi^\gamma(\cdot|s_0)$, which is improper because $\sum_{s' \in \mathcal{S}} p_\pi^\gamma(s'|s_0) = 1/(1-\gamma) \neq 1$. Importantly, this connection suggests that $p_\pi^\star$ is suitable as the weights in the discounted value approximator's error function only when $\gamma \to 1$; otherwise, $p_\pi^\gamma$ may be more suitable.
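The matrix limits in (14)-(15) can likewise be inspected numerically: $\boldsymbol{P}_{\!\pi}^\gamma$ equals $(\boldsymbol{I} - \gamma\boldsymbol{P}_{\!\pi})^{-1}$ by (15), its rows sum to $1/(1-\gamma)$, and scaling it by $(1-\gamma)$ approaches $\boldsymbol{P}_{\!\pi}^\star$ as $\gamma \to 1$. A short sketch (illustrative chain) follows.

```python
import numpy as np

P_pi = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.4, 0.5]])
n = P_pi.shape[0]

# Limiting (stationary) matrix P_star: Cesaro average of the powers of P_pi, Eq (15).
P_star, Pt, T = np.zeros((n, n)), np.eye(n), 20_000
for _ in range(T):
    P_star += Pt / T
    Pt = Pt @ P_pi

# Improper discounted "distribution" matrix P^gamma = (I - gamma P_pi)^{-1}, Eq (15).
for gamma in [0.9, 0.99, 0.999]:
    P_gamma = np.linalg.inv(np.eye(n) - gamma * P_pi)
    print(gamma,
          P_gamma[0].sum(),                                  # rows sum to 1/(1-gamma): improper
          np.max(np.abs((1.0 - gamma) * P_gamma - P_star)))  # -> 0 as gamma -> 1, cf. Eq (14)
```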

There is also an approximation in (11) whenever $v_\gamma$ is weighted by some initial state distribution $\mathring{p}$ (such as in (18)) or by transient state distributions $p_\pi^t$, which generally differ from $p_\pi^\star$. In (13), the second RHS term is typically ignored since calculating $\nabla \log p_\pi^\star(s)$ is difficult in RL settings (this difficulty of computing the gradient of state distributions motivates the development of the policy gradient theorem (Sutton and Barto, 2018, Ch 13.2)). Consequently, $\nabla v_g(\pi)$ is approximated (closely whenever $\gamma$ is close to 1) by the first RHS term of (13), and then by sampling the state $S \sim p_\pi^\star$ (after the agent interacts long "enough" with its environment).

Moreover, approximately maximizing the average reward via discounting is favourable because the discounting formulation has several mathematical virtues, as described in Sec 3.3.

3.2 A technique for attaining the most selective optimality with $\gamma \in [\gamma_{\mathrm{Bw}}, 1)$

As discussed in the previous Sec 3.1, $\gamma$-discounted optimality approximates gain optimality as $\gamma \to 1$. This is desirable since gain optimality is the most selective in recurrent MDPs. In unichain MDPs, however, gain optimality is generally underselective since the gain ignores the rewards earned in transient states (for an illustrative example, see Fig 1). Consequently, multiple gain-optimal policies may prescribe different action selections (earning different rewards) in transient states. The underselectiveness of gain optimality (equivalent to $(n=-1)$-discount optimality) can be refined up to the most selective optimality by increasing the value of $n$ from $-1$ to $0$ (or higher if needed, up to $n = |\mathcal{S}| - 2$ for unichain MDPs) in the family of $n$-discount optimality criteria (Sec 2.1).

Interestingly, such a remedy towards the most selective criterion can also be achieved by specifying a discount factor $\gamma$ that lies in the Blackwell interval, i.e. $\gamma \in [\gamma_{\mathrm{Bw}}, 1)$, for the $\gamma$-discounted optimality (5). This is because the resulting $\pi^*_{\gamma \in [\gamma_{\mathrm{Bw}}, 1)}$, which is also called a Blackwell optimal policy, is also optimal for all $n = -1, 0, \ldots$ in $n$-discount optimality (Puterman, 1994, Thm 10.1.5). Moreover, Blackwell optimality is always the most selective regardless of the MDP classification, inheriting the classification-invariance property of $\gamma$-discounted optimality. Thus, artificial discounting can be interpreted as a technique to attain the most selective criterion (i.e. Blackwell optimality) whenever $\gamma \in [\gamma_{\mathrm{Bw}}, 1)$, not only in recurrent but also in unichain MDPs, as well as in the most general multichain MDPs.

[Figure 1: a three-state chain diagram with states $s^0, s^1, s^2$ and edge rewards 2, 1, 1, and 0; see the caption below.]
Figure 1: A symbolic diagram of a unichain MDP with $\mathcal{S} = \{s^0, s^1, s^2\}$ and deterministic transitions. There are two stationary policies that differ in their blue or red action selection in the left-most state $s^0$; hence we call them the red and blue policies. Under both policies, the right-most state $s^2$ is a 0-reward (absorbing) recurrent state $s_{\mathrm{zrat}}$, hence this unichain MDP is an $s_{\mathrm{zrat}}$-model. Both policies are gain-optimal and $(n=0)$-discount optimal, but only the blue policy is $(n=1)$-discount optimal. Therefore, $(n = 1 = |\mathcal{S}| - 2)$-discount optimality is the most selective, which is equivalent to Blackwell optimality. In particular, for this simple environment, the Blackwell discount factor is trivially $\gamma_{\mathrm{Bw}} = 0$. Note that the total reward and the $(n=0)$-discount values of both policies are equal, i.e. $v_{\mathrm{tot}}(\mathrm{red}, s^0) = v_{\mathrm{tot}}(\mathrm{blue}, s^0) = v_{n=0}(\mathrm{red}, s^0) = v_{n=0}(\mathrm{blue}, s^0) = 2$, such that the total reward and $(n=0)$-discount criteria are underselective. This example MDP is adopted from Puterman (1994, Fig 10.1.1) and Mahadevan (1996b, Fig 1).
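The claims in the caption can be reproduced with a few lines of code. The sketch below assumes the transition and reward structure that the caption implies (our reading of the figure: the blue action moves from $s^0$ directly to $s^2$ with reward 2, the red action moves from $s^0$ to $s^1$ with reward 1, $s^1$ moves to $s^2$ with reward 1, and $s^2$ is 0-reward absorbing) and compares the two policies' Laurent coefficients and discounted values.

```python
import numpy as np

def laurent_terms(P, r):
    """Gain, bias, and the (n=1)-coefficient of the Laurent expansion (6) for a chain (P, r)."""
    n = len(r)
    P_star = np.linalg.matrix_power(P, 10_000)            # limiting matrix (aperiodic absorbing chain)
    H = np.linalg.inv(np.eye(n) - P + P_star) - P_star    # deviation matrix
    return P_star @ r, H @ r, -(H @ H) @ r                # v_{-1}, v_0, v_1

# Assumed reading of Fig 1: states [s0, s1, s2], with s2 absorbing at reward 0.
P_blue = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1]], dtype=float)  # blue: s0 -> s2, reward 2
r_blue = np.array([2.0, 1.0, 0.0])
P_red = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)   # red: s0 -> s1, reward 1
r_red = np.array([1.0, 1.0, 0.0])

for name, P, r in [("blue", P_blue, r_blue), ("red", P_red, r_red)]:
    gain, bias, v1 = laurent_terms(P, r)
    print(name, "gain:", gain[0], "bias:", bias[0], "v_1:", v1[0])
    for gamma in [0.0, 0.5, 0.9]:
        v_gamma = np.linalg.solve(np.eye(3) - gamma * P, r)
        print("  gamma =", gamma, "v_gamma(s0) =", v_gamma[0])

# Both policies have gain 0 and bias 2; only blue maximizes the (n=1)-coefficient, and
# v_gamma(blue, s0) = 2 >= 1 + gamma = v_gamma(red, s0) for every gamma in [0, 1), so gamma_Bw = 0.
```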

Targeting Blackwell optimality (instead of gain optimality) is imperative, especially for episodic environments, i.e. those with at least one terminal state, where the agent-environment interaction terminates once the agent enters a terminal state. Such environments are commonly modelled as infinite-horizon MDPs, so that the stationary policy set $\Pi_{\mathrm{S}}$ is a sufficient space in which to look for optimal policies. (In finite-horizon MDPs, the optimal policy is generally non-stationary: it depends on the number of remaining timesteps in addition to the state. Intuitively, if we (approximately) had only 1 week to live, would we act the same way as if we (approximately) had 10 years to live? In basketball, a long shot attempt (from beyond half court) is optimal only in the final seconds of the game.) Such modelling is carried out by augmenting the state set with a 0-reward absorbing terminal state (denoted $s_{\mathrm{zrat}}$), as shown in Fig 4(a). This yields a unichain MDP with a non-empty set of transient states. For such $s_{\mathrm{zrat}}$-models, the gain is trivially 0 for all stationary policies, so that gain optimality is underselective. The $(n=0)$-discount optimality improves the selectivity. It may be the most selective (hence equivalent to Blackwell optimality) in some cases; otherwise, it is underselective as well, and some higher $n$-discount optimality criterion should be used, e.g. $n=1$ for the MDP shown in Fig 1. It is also worth noting that in $s_{\mathrm{zrat}}$-models, $(n=0)$-discount optimality is equivalent to total reward optimality, whose $v_{\mathrm{tot}}$ (2) is finite (Puterman, 1994, Prop 10.4.2). The total reward optimality may therefore also be underselective in some cases.

Towards obtaining Blackwell optimal policies in unichain MDPs, the relationship between maximizing the $\gamma$-discounted and $n$-discount criteria can be summarized as follows (see also Fig 6).

$$\underbrace{\operatorname*{arg\,max}_{\pi \in \Pi_{\mathrm{S}}} v_{\gamma \in [\gamma_{\mathrm{Bw}},1)}(\pi, s)}_{\text{Blackwell-optimal policies}} = \begin{cases}
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{\mathrm{S}}} \Big[\underbrace{v_{-1}(\pi,s) = \lim\limits_{\gamma \to 1} (1-\gamma)\, v_\gamma(\pi,s)}_{\text{based on (10)}}\Big] & \text{if MDPs are recurrent (see Sec 3.1)}, \\
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{n=-1}^*} \Big[\underbrace{v_0(\pi,s) = \lim\limits_{\gamma \to 1} \tfrac{(1-\gamma)\, v_\gamma(\pi,s) \,-\, v_{-1}(\pi,s)}{1-\gamma}}_{\text{based on (7)}}\Big] & \text{if MDPs are unichain and $(n=0)$-discount is the most selective}, \\
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{n-1}^*} \big[v_n(\pi,s)\big] \ \text{for } n = 1, 2, \ldots, (|\mathcal{S}|-3) & \text{if MDPs are unichain and $n$-discount is the most selective}, \\
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{n=(|\mathcal{S}|-3)}^*} \big[v_{n=(|\mathcal{S}|-2)}(\pi,s)\big] & \text{if MDPs are unichain},
\end{cases} \qquad (16)$$

for all $s \in \mathcal{S}$, where $\Pi_n^*$ denotes the $n$-discount optimal policy set for $n = -1, 0, \ldots$.

From the second case in the RHS of (16), we know that $v_0$ can be computed by taking the limit of a function involving $v_\gamma$ and $v_{-1} (= v_g)$ as $\gamma$ approaches 1. This means that for unichain MDPs, Blackwell optimal policies may be obtained by setting $\gamma$ very close to 1, similar to the recurrent case (the first case), but not the same, since the function of which the limit is taken differs.

In practice, the limits in the first and second cases in (16) are computed approximately using a discount factor $\gamma$ close to unity, which is likely in the Blackwell interval, i.e. $\gamma_{\mathrm{Bw}} \leq (\gamma \approx 1) < 1$. Paradoxically, however, this does not necessarily attain Blackwell optimality because the finite-precision computation involving $\gamma \approx 1$ yields quite accurate estimates of the limit values: maximizing the first and second cases in the RHS of (16) with a high $\gamma \approx 1$ attains (approximately) gain ($n=-1$) and bias ($n=0$) optimality, respectively, which may be underselective in unichain MDPs. Consequently, in practice, the most selective Blackwell optimality is achieved using a $\gamma$ that is at least as high as $\gamma_{\mathrm{Bw}}$ but not too close to 1. That is, $\gamma_{\mathrm{Bw}} \leq \gamma \leq \bar{\gamma} < 1$ for some practical upper bound $\bar{\gamma}$, which is computation (hence, implementation) dependent.

3.3 Contraction, variance reduction, and independence from MDP classification

Discounted reward optimality is easier to deal with than its average reward counterpart. This can be attributed to three main factors as follows.

Firstly, the discounted-reward theory holds regardless of the classification of the induced MCs (Sec 2.2), whereas that of the average reward involves such classification. Because the transition probability $p(s'|s,a)$ is unknown in RL settings, average-reward algorithms require estimation of, or assumptions about, the chain classification, specifically whether it is unichain or multichain. Nevertheless, note that such an (assumed) classification is needed in order to apply a specific (simpler) class of average-reward algorithms: it leverages the fact that a unichain MDP has a single scalar gain (associated with its single chain) that is constant across all states, whereas a multichain MDP generally has different gain values associated with its multiple chains.

Secondly, the discounted-reward Bellman optimality operator $\mathbb{B}_\gamma^*$ is contractive, where the discount factor $\gamma$ serves as the contraction modulus (hence, $\mathbb{B}_\gamma^*$ is a $\gamma$-contraction). That is,

$$\|\mathbb{B}_\gamma^*[\boldsymbol{v}] - \mathbb{B}_\gamma^*[\boldsymbol{v}']\|_\infty \leq \gamma\, \|\boldsymbol{v} - \boldsymbol{v}'\|_\infty, \quad \text{for any vectors } \boldsymbol{v}, \boldsymbol{v}' \in \mathbb{R}^{|\mathcal{S}|},$$

where, in state-wise form, $\mathbb{B}_\gamma^*[\boldsymbol{v}](s) \coloneqq \max_{a \in \mathcal{A}} \{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v(s') \}$ for all $s \in \mathcal{S}$, and $\|\boldsymbol{v}\|_\infty \coloneqq \max_{s \in \mathcal{S}} |v(s)|$ denotes the maximum norm. This means that $\mathbb{B}_\gamma^*$ brings $\boldsymbol{v}$ and $\boldsymbol{v}'$ closer together by at least a factor of $\gamma$, such that the sequence of iterates $\boldsymbol{v}^{k+1} \leftarrow \mathbb{B}_\gamma^*[\boldsymbol{v}^k]$ converges to the unique fixed point of $\mathbb{B}_\gamma^*$ as $k \to \infty$ from any initial $\boldsymbol{v}^{k=0} \in \mathbb{R}^{|\mathcal{S}|}$, by Banach's fixed-point theorem (Szepesvári, 2010, Thm A.9). In particular, as $k \to \infty$, we obtain $\mathbb{B}_\gamma^*[\boldsymbol{v}^k] = \boldsymbol{v}^k = \boldsymbol{v}_\gamma^*$, where $\boldsymbol{v}_\gamma^*$ denotes the optimal discounted value, i.e. $\boldsymbol{v}_\gamma(\pi_\gamma^*)$. Thus, $\lim_{k \to \infty} \|\boldsymbol{v}^k - \boldsymbol{v}_\gamma^*\|_\infty = 0$. In the absence of $\gamma$, the contraction no longer holds. This is the case for the average-reward Bellman optimality operator $\mathbb{B}_g^*[\boldsymbol{v}](s) \coloneqq \max_{a \in \mathcal{A}} \{ r(s,a) + \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v(s') \}$, $\forall s \in \mathcal{S}$. As a result, the basic value iteration based on $\mathbb{B}_g^*$ is not guaranteed to converge (Mahadevan, 1996a, Sec 2.4.3).
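The geometric convergence implied by the $\gamma$-contraction is easy to observe: iterate the discounted Bellman optimality operator and watch the per-sweep change shrink by (at least) a factor of $\gamma$. A minimal sketch with an illustrative random MDP follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = p(s'|s, a)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r[s, a]

def bellman_optimality(v):
    """B*_gamma[v](s) = max_a { r(s, a) + gamma * sum_s' p(s'|s, a) v(s') }."""
    return np.max(r + gamma * (P @ v), axis=1)

v, prev_change = np.zeros(n_states), None
for k in range(60):
    v_new = bellman_optimality(v)
    change = np.max(np.abs(v_new - v))        # sup-norm distance between successive iterates
    if prev_change is not None:
        # Contraction: successive changes shrink by at least a factor of gamma.
        assert change <= gamma * prev_change + 1e-12
    prev_change, v = change, v_new

print("approximate v*_gamma:", v)
```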

The aforementioned contraction property also applies to the discounted-reward Bellman expectation operator, i.e. $\mathbb{B}_\gamma^\pi[\boldsymbol{v}](s) \coloneqq \mathbb{E}_{A \sim \pi}\big[ r(s,A) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,A)\, v(s') \big]$, $\forall s \in \mathcal{S}$. As a consequence of Banach's fixed-point theorem, we have $\lim_{k \to \infty} \|\boldsymbol{v}^k - \boldsymbol{v}_\gamma^\pi\|_\infty = 0$, where $\boldsymbol{v}^{k+1} \leftarrow \mathbb{B}_\gamma^\pi[\boldsymbol{v}^k]$ is iteratively applied to any initial $\boldsymbol{v}^{k=0} \in \mathbb{R}^{|\mathcal{S}|}$ for $k = 0, 1, \ldots$, whereas $\boldsymbol{v}_\gamma^\pi \coloneqq \boldsymbol{v}_\gamma(\pi)$ is the discounted policy value of a policy $\pi$. In other words, $\boldsymbol{v}_\gamma^\pi$ is the unique fixed point of $\mathbb{B}_\gamma^\pi$, such that $\boldsymbol{v}_\gamma^\pi = \mathbb{B}_\gamma^\pi[\boldsymbol{v}_\gamma^\pi]$, which is known as the discounted-reward Bellman evaluation equation ($\gamma$-BEE). This $\gamma$-BEE is further utilized to derive a TD-based parametric value approximator whose convergence relies on the fact that $\gamma \in [0,1)$. That is, the approximator's error-minimizer formula involves the inverse $(\boldsymbol{I} - \gamma \boldsymbol{P}_{\!\pi})^{-1}$, which exists (Sutton and Barto, 2018, p206). This is in contrast to the matrix $(\boldsymbol{I} - \boldsymbol{P}_{\!\pi})$, whose inverse does not exist (Puterman, 1994, p596).
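Correspondingly, discounted policy evaluation can be carried out either by iterating $\mathbb{B}_\gamma^\pi$ or in closed form via the inverse $(\boldsymbol{I} - \gamma\boldsymbol{P}_{\!\pi})^{-1}$, which exists for $\gamma \in [0,1)$; both routes agree, as the sketch below (illustrative arrays) shows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r[s, a]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a] = pi(a|s)

# Induced chain and expected reward of the policy (Eq (8)).
P_pi = np.einsum('sa,sap->sp', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)

# Closed form: the gamma-BEE v = r_pi + gamma P_pi v solved exactly via (I - gamma P_pi)^{-1}.
v_closed = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Iterative form: repeatedly apply the Bellman expectation operator B^pi_gamma.
v = np.zeros(n_states)
for _ in range(500):
    v = r_pi + gamma * (P_pi @ v)

print(np.max(np.abs(v - v_closed)))   # essentially zero: both routes give v^pi_gamma
```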

In addition, the contractive nature induced by $\gamma$ is also utilized for computing state similarity metrics (Castro, 2020; Ferns et al., 2006). There, $\gamma$ guarantees the convergence of the metric operator to its fixed point (when such an operator is iteratively applied). Note that the notion of state similarity plays an important role in, for example, state aggregation and state representation learning.

Third, discounting can be used to reduce the variance of, for example, policy gradient estimates, at the cost of bias errors (Baxter and Bartlett, 2001, Thm 3; Kakade, 2001, Thm 1). In particular, the variance increases (and the bias decreases) as a function of $1/(1-\gamma)=\sum_{t=0}^{\infty}\gamma^{t}$, because the effective number of timesteps (the effective horizon) is controlled by $\gamma$. This is also related to the fact that the infinite-horizon discounted reward $v_{\gamma}$ (4) can be $\epsilon$-approximated by a finite horizon $\tau$ proportional to $\log_{\gamma}(1-\gamma)$, as noted by Tang and Abbeel, (2010, Footnote 1). That is, the truncation error satisfies

$\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(S_{t},A_{t})\Big]-\mathbb{E}\Big[\sum_{t=0}^{\tau-1}\gamma^{t}r(S_{t},A_{t})\Big]=\mathbb{E}\Big[\sum_{t=\tau}^{\infty}\gamma^{t}r(S_{t},A_{t})\Big]\leq\sum_{t=\tau}^{\infty}\gamma^{t}r_{\mathrm{max}}=\frac{\gamma^{\tau}r_{\mathrm{max}}}{1-\gamma}.$

Hence, the truncation error is at most a tolerance $\epsilon$ whenever $\gamma^{\tau}r_{\mathrm{max}}/(1-\gamma)\leq\epsilon$, which re-arranges to $\gamma^{\tau}\leq((1-\gamma)\epsilon)/r_{\mathrm{max}}$. Taking the logarithm with base $\gamma$ of both sides (which reverses the inequality, since $\log_{\gamma}$ is decreasing for $\gamma\in(0,1)$) yields

$\tau\geq\log_{\gamma}\dfrac{(1-\gamma)\epsilon}{r_{\mathrm{max}}},\quad\text{hence the smallest such finite horizon is}\ \tau=\Big\lceil\log_{\gamma}\dfrac{(1-\gamma)\epsilon}{r_{\mathrm{max}}}\Big\rceil,$

where $\lceil x\rceil$ denotes the smallest integer greater than or equal to $x\in\mathbb{R}$.
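As a concrete illustration (a minimal sketch; the helper name and the example constants are ours), this smallest finite horizon can be computed as follows:

import math

def effective_horizon(gamma, eps, r_max):
    """Smallest horizon tau whose truncation error is at most eps,
    i.e. tau = ceil(log_gamma((1 - gamma) * eps / r_max))."""
    assert 0.0 < gamma < 1.0
    tau = math.log((1.0 - gamma) * eps / r_max) / math.log(gamma)
    return max(0, math.ceil(tau))

# For example, gamma = 0.99, eps = 0.01, r_max = 1 gives tau = 917.
print(effective_horizon(0.99, 0.01, 1.0))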

4 Artificial discount factors are sensitive and troublesome

The artificial discount factor $\gamma$ (which is part of the solution method) is said to be sensitive because the performance of RL methods often depends largely on $\gamma$. Fig 2(a) illustrates this phenomenon using $Q_{\gamma}$-learning with various $\gamma$ values. As can be seen, higher $\gamma$ leads to slower convergence, whereas lower $\gamma$ leads to suboptimal policies (with respect to the most selective criterion, which in this case is gain optimality, since the MDP is recurrent). This trade-off is elaborated further in Secs 4.1 and 4.2. The sensitivity to $\gamma$ has also been observed in the error (and even in the convergence or divergence) of approximate policy evaluation methods with function approximators (Scherrer, 2010, Fig 1; Sutton and Barto, 2018, Example 11.1).

Additionally, Fig 3 shows the effect of various $\gamma$ values on the optimization landscapes on which policy gradient methods look for the maximizer of the discounted value $v_{\gamma}$. It is evident that different $\gamma$ values induce different maximizing policy parameters. From the 2D visualization in Fig 3, we can observe that the landscape of $v_{\gamma}$ begins to resemble that of the gain $v_{g}$ when $\gamma$ is around $\gamma_{\mathrm{Bw}}$, and resembles it more and more closely as $\gamma$ approaches 1. When $\gamma$ is well below $\gamma_{\mathrm{Bw}}$, the maximizer of $v_{\gamma}$ does not coincide with that of $v_{g}$. In such cases, some local maximizer of $v_{\gamma}$ may be desirable (instead of the global one) because of its proximity to the maximizer of $v_{g}$, which represents the optimal point of the most selective criterion for the recurrent MDPs examined in Fig 3.

The artificial $\gamma$ is troublesome because its critical value, i.e. $\gamma_{\mathrm{Bw}}$, is difficult to determine, even in DP where the transition and reward functions are known (Hordijk et al., 1985). This is exacerbated by the fact that $\gamma_{\mathrm{Bw}}$ is specific to each environment instance (even within the same environment family, as shown in Fig 2(b)). Nevertheless, knowing this critical value $\gamma_{\mathrm{Bw}}$ is always desirable. For example, although gain optimality can be attained by setting $\gamma$ very close to 1, setting $\gamma\leftarrow\gamma_{\mathrm{Bw}}$ (or some value around it) leads not only to convergence to the optimal gain (or close to it) but also to faster convergence, as demonstrated by $Q_{\gamma}$-learning (Fig 2(a)). We can also observe visually in Fig 3 that the discounted $v_{\gamma\approx\gamma_{\mathrm{Bw}}}$-landscape already resembles the gain $v_{g}$-landscape. Thus, for obtaining the gain-optimal policy in recurrent MDPs, $\gamma$ does not need to be too close to 1 (as long as it is larger than or equal to $\gamma_{\mathrm{Bw}}$); see also (16).

Apart from that, $\gamma$ is troublesome because some derivations involving it demand extra care, e.g. for handling the improper discounted state distribution $p_{\pi}^{\gamma}$ (14) in discounted-reward policy gradient algorithms (Nota and Thomas, 2020; Thomas, 2014).

(a) Learning curves of $Q_{\gamma}$-learning with varying $\gamma$ values on GridNav-25, which is episodic but modelled using $s_{\mathrm{rst}}$ (Sec 5.5). Empirically, the critical discount factor is $\gamma_{\mathrm{Bw}}\approx 0.83$.
(b) The critical $\gamma_{\mathrm{Bw}}$ as a (non-trivial) function of the number of states, and of a constant in the reward function, on three environment families.
Figure 2: Empirical results illustrating the sensitivity and troublesomeness of artificial discount factors $\gamma$. In the left sub-figure (a), gain optimality is the most selective criterion (since the MDP is recurrent), and the optimal gain $v_{g}(\pi_{g}^{*})$ is indicated by the black solid horizontal line at the top of the plot. For the experimental setup of the environments and learning methods, see Sec 7.

4.1 Higher discount factors lead to slower convergence

According to (10), increasing the discount factor $\gamma$ closer to 1 makes the scaled discounted reward $(1-\gamma)v_{\gamma}$ approximate the average reward $v_{g}$ more closely. This means that a discounted-reward method with such a setting obtains more accurate estimates of gain-optimal policies. However, it suffers from a lower rate of convergence (to the approximate gain optimality), as well as from numerical issues (since, for example, it involves the term $1/(1-\gamma)$, which blows up as $\gamma\to 1$). This becomes unavoidable whenever $\gamma_{\mathrm{Bw}}$ is indeed very close to unity, because the most selective Blackwell optimality (equivalent to gain optimality in recurrent MDPs) requires that $\gamma\geq\gamma_{\mathrm{Bw}}$.

The slow convergence can be explained by examining the effect of the effective horizon induced by $\gamma$. That is, as $\gamma$ approaches 1, the reward information is propagated to more states (Beleznay et al., 1999, Fig 1). From discounted policy gradient methods, we also know that an i.i.d. state sample from the discounted state distribution $p_{\pi}^{\gamma}$ in (14) is the last state of a trajectory whose length is drawn from a geometric distribution $Geo(p=1-\gamma)$; see Kumar et al., (2020, Algo 3). Evidently, the closer $\gamma$ is to 1, the longer the required trajectory. Also recall that such a geometric distribution has a mean of $1/(1-\gamma)$ and a variance of $\gamma/(1-\gamma)^{2}$, which blow up as $\gamma\to 1$.
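The following minimal sketch (our illustration, loosely following the sampling scheme referenced above) draws one state distributed according to $p_{\pi}^{\gamma}$ by truncating a rollout at a geometrically distributed length; the env.reset/env.step interface and the policy callable are assumptions in the style of OpenAI Gym.

import numpy as np

def sample_discounted_state(env, policy, gamma, rng=np.random.default_rng()):
    """Return the last state of a rollout whose length T satisfies
    P(T = t) = (1 - gamma) * gamma**t, so the returned state is a draw
    from the discounted state distribution p_pi^gamma."""
    horizon = rng.geometric(p=1.0 - gamma) - 1  # support {0, 1, 2, ...}
    state = env.reset()
    for _ in range(horizon):
        action = policy(state)
        state, _reward, _done, _info = env.step(action)
    return state

A longer rollout is needed on average as $\gamma$ approaches 1, which is one source of the slower convergence discussed above.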

There are numerous works that prove and demonstrate the slow convergence due to higher $\gamma$. From them, we understand that the error (hence, the iteration/sample complexity) essentially grows as a function of $1/(1-\gamma)$. Those works include (Thrun and Schwartz, 1993, Fig 3; Melo and Ribeiro, 2007, Thm 1; Fan et al., 2020, Thm 4.4) for Q-learning with function approximators, (Sutton and Barto, 2018, Eqn 9.14, 12.8) for TD learning, and (Agarwal et al., 2019, Table 1, 2) for policy gradient methods. We note that for a specific environment type and with an additional hyperparameter, Devraj and Meyn, (2020) proposed a variant of Q-learning whose sample complexity is independent of $\gamma$.

4.2 Lower discount factors likely lead to suboptimal policies

Setting $\gamma$ further from 1 such that $\gamma<\gamma_{\mathrm{Bw}}$ yields $\gamma$-discounted optimal policies that are suboptimal with respect to the most selective criterion (see Fig 2(a)). From the gain optimality standpoint, lower $\gamma$ makes $(1-\gamma)v_{\gamma}$ deviate from $v_{g}$ in the order of $\mathcal{O}(1-\gamma)$, as shown by Baxter and Bartlett, (2001). More generally, based on (16), $\gamma<\gamma_{\mathrm{Bw}}$ induces an optimal policy $\pi_{\gamma<\gamma_{\mathrm{Bw}}}^{*}$ that is not Blackwell optimal (hence, $\pi_{\gamma<\gamma_{\mathrm{Bw}}}^{*}$ is also not gain optimal in recurrent MDPs). This raises the question: is it ethical to run a suboptimal policy (due to misspecifying the optimality criterion) in perpetuity?

For a parameterized policy in recurrent MDPs, $\gamma<\gamma_{\mathrm{Bw}}$ induces a discounted $v_{\gamma}$-landscape that is different from the gain $v_{g}$-landscape. Fig 3 shows such a discrepancy, which becomes more significant as $\gamma$ is set further below 1. Therefore, the maximizing parameters do not coincide, i.e. $\operatorname*{\arg\max}_{\boldsymbol{\theta}}v_{\gamma<\gamma_{\mathrm{Bw}}}(\pi(\boldsymbol{\theta}))\neq\operatorname*{\arg\max}_{\boldsymbol{\theta}}v_{g}(\pi(\boldsymbol{\theta}))\eqqcolon\boldsymbol{\theta}_{g}^{*}$, where $\boldsymbol{\theta}\in\Theta$ denotes the policy parameter in some parameter set. Interestingly, $v_{\gamma<\gamma_{\mathrm{Bw}}}(\pi(\boldsymbol{\theta}_{g}^{*}))$ is a local maximum in the $v_{\gamma}$-landscape, so that the $(\gamma<\gamma_{\mathrm{Bw}})$-discounted-reward optimization is ill-posed in the sense that its (global) maximum is not what we desire in terms of obtaining an optimal policy with respect to the most selective criterion (that is, the gain optimal policy $\pi(\boldsymbol{\theta}_{g}^{*})$ in recurrent MDPs).

Petrik and Scherrer, (2008, Thm 2) established the following error bound due to misspecifying a discount factor $\gamma<\gamma_{\mathrm{Bw}}$. That is,

$\|\boldsymbol{v}_{\gamma_{\mathrm{Bw}}}^{*}-\boldsymbol{v}_{\gamma}^{*}\|_{\infty}\leq\dfrac{(\gamma_{\mathrm{Bw}}-\gamma)\,r_{\mathrm{max}}}{(1-\gamma)(1-\gamma_{\mathrm{Bw}})},\quad\text{for }\boldsymbol{v}_{\gamma_{\mathrm{Bw}}}^{*},\boldsymbol{v}_{\gamma}^{*}\in\mathbb{R}^{|\mathcal{S}|}\text{, and }\boldsymbol{v}_{\gamma}^{*}\coloneqq\boldsymbol{v}_{\gamma}(\pi_{\gamma}^{*})\text{ for any }\gamma\in[0,1).$

Subsequently, Jiang et al., (2016) refined the above error bound by taking into account the transition and reward functions. They also highlighted that such an error, as a function of $\gamma$, is not monotonically decreasing (with increasing $\gamma$). This is consistent with what Smallwood, (1966) observed, i.e. multiple disconnected $\gamma$-intervals that share the same $\gamma$-discounted optimal policy. We note that specifically for sparse-reward environments (where non-zero rewards are not received at every timestep), a lower discount factor $\gamma<\gamma_{\mathrm{Bw}}$ is likely to improve the performance of RL algorithms (Petrik and Scherrer, 2008, Thm 10). They argue that lowering $\gamma$ decreases the value approximation error $\|\boldsymbol{v}_{\gamma}^{*}-\hat{\boldsymbol{v}}_{\gamma}^{*}\|_{\infty}$ more significantly than it increases the $\gamma$-misspecification error $\|\boldsymbol{v}_{\gamma_{\mathrm{Bw}}}^{*}-\boldsymbol{v}_{\gamma}^{*}\|_{\infty}$. Here, $\hat{\boldsymbol{v}}_{\gamma}^{*}$ denotes an approximation to $\boldsymbol{v}_{\gamma}^{*}$.
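For intuition about how quickly the bound above loosens, it is easy to evaluate numerically (a minimal sketch; the function name and example values are ours):

def misspecification_bound(gamma, gamma_bw, r_max):
    """Upper bound on ||v*_{gamma_Bw} - v*_gamma||_inf from Petrik and Scherrer (2008, Thm 2)."""
    assert 0.0 <= gamma <= gamma_bw < 1.0
    return (gamma_bw - gamma) * r_max / ((1.0 - gamma) * (1.0 - gamma_bw))

# With gamma_Bw = 0.95 and r_max = 1, lowering gamma from 0.9 to 0.5
# loosens the bound from 10.0 to 18.0.
print(misspecification_bound(0.9, 0.95, 1.0), misspecification_bound(0.5, 0.95, 1.0))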

Panels (left to right per row): for Chain (top row), $\gamma=0.00,0.35,0.50,0.70,0.85\approx\gamma_{\mathrm{Bw}},0.95$, and the gain; for Taxicab (middle row), $\gamma=0.00,0.35,0.50,0.80\approx\gamma_{\mathrm{Bw}},0.85,0.95$, and the gain; for Torus (bottom row), $\gamma=0.35,0.50,0.55\approx\gamma_{\mathrm{Bw}},0.60,0.75,0.95$, and the gain.
Figure 3: Policy-value landscapes as a function of two policy parameters $\boldsymbol{\theta}\in\mathbb{R}^{2}$ (i.e. the horizontal and vertical axes) on three environments (from top to bottom rows: Chain, Taxicab, and Torus, which all induce recurrent MDPs where gain optimality is the most selective). All columns, except the rightmost (the gain $v_{g}$), correspond to the scaled discounted reward $(1-\gamma)v_{\gamma}$ of randomized policies $\pi(\boldsymbol{\theta})$, measured from a single initial state. The colour maps the lowest value to dark blue and the highest value to yellow. These lowest and highest values correspond, respectively, to the "worst" and the optimal deterministic stationary policies with respect to the corresponding criterion (so that different columns, hence subplots, have different lowest and highest values). As anticipated, the scaled $v_{\gamma}$-landscape becomes more and more similar to its gain counterpart as $\gamma$ approaches 1. In particular, $\gamma\geq\gamma_{\mathrm{Bw}}$ induces a scaled $v_{\gamma}$-landscape whose global maximum coordinate coincides with that of the gain. Note that the gain landscape in the bottom row only has a yellowish region because there is a significant difference between the maximum gain $v_{g}^{*}$ of the optimal deterministic policy and the maximum gain $\hat{v}_{g}^{*}\approx v_{g}^{*}$ across the shown landscape (which covers only a portion of the policy parameter space). Their absolute relative difference, i.e. $|(\hat{v}_{g}^{*}-v_{g}^{*})/v_{g}^{*}|$, is $7.070\%$, where $v_{g}^{*}=0.224$ and $\hat{v}_{g}^{*}=0.208$. For the experimental setup, see Sec 7.

5 Benefits of maximizing the average reward in recurrent MDPs

In this Section, we enumerate the benefits of directly maximizing the average reward (gain) optimality criterion in recurrent MDPs. Loosely speaking, such a recurrent structure is found in continuing environments (those with no terminal state; cf. episodic environments, Footnote 10) that exhibit cyclical events across all states. For episodic environments, we can obtain their recurrent MDPs by explicitly modelling the episode repetition, as explained in Sec 5.5. For a non-exhaustive list of continuing and episodic environments, refer to (Dewanto et al., 2020).

The combination of recurrent MDPs and gain optimality is advantageous. As a by-product, there is no discount factor $\gamma$ involved, removing the difficulties due to artificial discounting (Sec 4). Other benefits are described in the following sub-sections.

5.1 Unconditionally the most selective criterion

Gain optimality is the most selective criterion for recurrent MDPs. This is because all states are recurrent so that the gain is all that is needed to quantitatively measure the quality of any stationary policy from those states (such a gain quantity turns out to be constant for all states in recurrent or more generally unichain MDPs). Recall that the gain is concerned with the long-run rewards (3), and recurrent states are states that are re-visited infinitely many times in the long-run (Sec 2.2).

Gain optimality is therefore unconditionally equivalent to Blackwell optimality, and immediately so, in the sense that there is no need to tune a hyperparameter of the optimality criterion (which fundamentally determines the optimization objective function, hence the overall optimization). This is in contrast to $\gamma$-discounted optimality, which is equivalent to Blackwell optimality only if $\gamma\geq\gamma_{\mathrm{Bw}}$, as in (5). Moreover, since $\gamma_{\mathrm{Bw}}$ is unknown, tuning $\gamma$ is necessary (Sec 3.2).

5.2 Uniformly optimal policies

Since recurrent (up to unichain) MDPs have only a single recurrent class (chain), the gain of any policy is constant across all states. As a result, average-reward policy gradient methods maximize an objective (17) that is independent of the initial states, or more generally of the initial state distribution. The resulting gain-optimal policies are said to be uniformly optimal because they are optimal for all initial states or all initial state distributions (Altman, 1999, Def 2.1).

In contrast, the discounted-reward counterpart maximizes an objective (18) that is defined with respect to some initial state distribution $\mathring{p}$ (since the discounted value $v_{\gamma}$ depends on the state from which the value is measured, as in (4)). Consequently, the resulting $\gamma$-discounted optimal policies may not be optimal for all initial states; they are said to be non-uniformly optimal, as noted by Bacon, (2018, p41). This non-uniform optimality can be interpreted as a relaxation of the uniform optimality in DP, which requires that the superiority of optimal policies $\pi_{\gamma}^{*}$ holds in every state, i.e. $v_{\gamma}(\pi_{\gamma}^{*},s_{0})\geq v_{\gamma}(\pi,s_{0})$ for all states $s_{0}\in\mathcal{S}$ and all policies $\pi\in\Pi_{\mathrm{S}}$; see Sec 2.1.

The objectives of average- and discounted-reward policy gradient methods are as follows.

Average-reward policy gradient objective: $\operatorname*{\arg\max}_{\boldsymbol{\theta}\in\Theta}v_{g}(\pi(\boldsymbol{\theta}))$, (17)
Discounted-reward policy gradient objective: $\operatorname*{\arg\max}_{\boldsymbol{\theta}\in\Theta}\mathbb{E}_{\mathring{p}}[v_{\gamma}(\pi(\boldsymbol{\theta}),S_{0})]$, (18)

where $\boldsymbol{\theta}\in\Theta=\mathbb{R}^{\dim(\boldsymbol{\theta})}$ is the policy parameter and $S_{0}\sim\mathring{p}$ is the initial state random variable.

5.3 Potentially higher convergence rates

Without delving into specific algorithms, there are at least two reasons for faster convergence of average-reward methods (compared to their discounted-reward counterparts), as hinted by Schwartz, (1993, Sec 6) and Van Roy, (1998, p114).

First is the common gain across states. Such commonality eases gain approximation in that no generalization across states is required: a single gain estimate suffices for the one true gain shared by all states in unichain MDPs.

Second, average-reward methods optimize solely the gain term in the Laurent series expansion of $v_{\gamma}$ in (6). On the other hand, their discounted-reward counterparts optimize the gain, bias, and higher-order terms simultaneously; the implication of this becomes more substantial as $\gamma$ is set further below 1, as can be observed in (7).

5.4 Less discrepancy between the learning objective and the performance metric

Since there is a finite number of timesteps (as the training budget) and no inherent notion of discounting, the final learning performance metric $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ in RL is typically measured by the average return over the last several experiment-episodes (Machado et al., 2018, Sec 5.2; Henderson et al., 2018, Fig 2). Here, the term "experiment-episode" refers to one trial (run, roll-out, or simulation), which produces one sample trajectory (of states, actions, and rewards); it applies to both episodic and continuing environments, whereas the term "episode" only makes sense for episodic environments. That is,

$\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}\coloneqq\dfrac{1}{n_{\mathrm{xep}}^{\mathrm{last}}}\sum_{i=j+1}^{n_{\mathrm{xep}}}\underbrace{\Bigg[\sum_{t=0}^{\hat{t}_{\mathrm{max}}^{\mathrm{xep}(i)}}r_{t+1}^{(i)}\Big|\hat{\pi}^{*}\Bigg]}_{\text{Return}}\quad(\text{with }j=n_{\mathrm{xep}}-n_{\mathrm{xep}}^{\mathrm{last}})$
$\approx\underbrace{\mathbb{E}\Bigg[\sum_{t=0}^{\hat{t}_{\mathrm{max}}^{\mathrm{xep}}}R_{t+1}\Big|\hat{\pi}^{*}\Bigg]}_{\text{Finite-horizon total reward}}\propto\underbrace{\dfrac{1}{\hat{t}_{\mathrm{max}}^{\mathrm{xep}}+1}\mathbb{E}\Bigg[\sum_{t=0}^{\hat{t}_{\mathrm{max}}^{\mathrm{xep}}}R_{t+1}\Big|\hat{\pi}^{*}\Bigg]}_{\text{Finite-horizon average reward}}\approx\underbrace{v_{g}(\hat{\pi}^{*})}_{\text{Gain}},\qquad(19)$

where $n_{\mathrm{xep}}$ denotes the number of i.i.d. experiment-episodes, from which we pick the last $n_{\mathrm{xep}}^{\mathrm{last}}$ experiment-episodes once the learning is deemed to have converged to a final learned policy $\hat{\pi}^{*}$. Each $i$-th experiment-episode runs from $t=0$ to a finite $\hat{t}_{\mathrm{max}}^{\mathrm{xep}(i)}\approx t_{\mathrm{max}}$, where the reward realization at every timestep $t$ is denoted by $r_{t+1}^{(i)}$. The expectation in (19) of the reward random variable $R_{t+1}\coloneqq r(S_{t},A_{t})$ is with respect to $S_{0}\sim\mathring{p}$, $A_{t}\sim\hat{\pi}^{*}(\cdot|s_{t})$ and $S_{t+1}\sim p(\cdot|s_{t},a_{t})$.
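In code, this metric is simply the mean of the last few undiscounted returns logged during training (a minimal sketch; the function and variable names, and the illustrative numbers, are ours):

import numpy as np

def final_performance(returns, n_last):
    """Mean undiscounted return over the last n_last experiment-episodes,
    where returns[i] is the total reward collected in the i-th experiment-episode."""
    return float(np.mean(returns[-n_last:]))

# Illustrative usage: average the last 3 of 5 logged returns.
print(final_performance([10.0, 12.0, 15.0, 14.0, 16.0], n_last=3))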

We argue that the learning performance metric $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ has less discrepancy with respect to the gain $v_{g}$ than with respect to the discounted reward $v_{\gamma}$. In other words, whenever the criterion is $v_{g}$, what is measured during training (i.e. the learning performance metric) is more aligned with what the agent learns and optimizes (i.e. the learning objective); in some cases, this alignment may be traded off for more tractable computation, training, or analysis. There are three arguments for this, as follows.

First, since the experiment-episodes are i.i.d., the metric $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ in (19) converges, as $n_{\mathrm{xep}}^{\mathrm{last}}\to\infty$, to the expected finite-horizon total reward. This is proportional to the expected finite-horizon average reward, which approximates the gain $v_{g}(\hat{\pi}^{*})$ up to the finiteness of $\hat{t}_{\mathrm{max}}^{\mathrm{xep}}$.

Second is the fact that $\hat{t}_{\mathrm{max}}^{\mathrm{xep}}$ is typically fixed for all experiment-episodes during training; one common practice is to set it to some constant far less than the training budget so as to obtain multiple experiment-episodes, implemented for instance as the 'max_episode_steps' variable at https://github.com/openai/gym/blob/master/gym/envs/__init__.py of a seminal and popular RL environment codebase, OpenAI Gym (Brockman et al., 2016). It is not randomly sampled from a geometric distribution $Geo(p=1-\gamma)$, even when maximizing $v_{\gamma}$ (seemingly because there is no inherent notion of discounting). If it were sampled from such a geometric distribution, then the expectation of $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ would converge to $v_{\gamma}$, following the identity in (9). This, however, would inherently pose the risk of misspecifying $\gamma$ (Sec 4.2). This metric gap due to artificial discounting is highlighted by Schwartz, (1993, Sec 3), van Hasselt, (2011, p25), Dabney, (2014, p47), and Van Seijen et al., (2019, Sec 2).

Third is because the induced Markov chain may reach stationarity (or close to it) within $\hat{t}_{\mathrm{max}}^{\mathrm{xep}}<\infty$ in some environments. That is, $\boldsymbol{P}_{\pi}^{t}\boldsymbol{P}_{\pi}=\boldsymbol{P}_{\pi}^{t+1}=\boldsymbol{P}_{\pi}^{t}=\boldsymbol{P}_{\pi}^{\star}$ for several timesteps $t\leq\hat{t}_{\mathrm{max}}^{\mathrm{xep}}$; see (15). If this happens, then some number of rewards are generated from states sampled from the stationary state distribution $p_{\pi}^{\star}$ (although these are Markovian, not independent, state samples). This makes the approximation more accurate for the gain $v_{g}$ (although it is inherently biased due to the non-i.i.d. state samples) than for the discounted $v_{\gamma}$, since $v_{g}(\pi)=\mathbb{E}_{S\sim p_{\pi}^{\star}}[r_{\pi}(S)]$ with $r_{\pi}(s)\coloneqq\mathbb{E}_{A\sim\pi}[r(s,A)]$, which can be shown to be equivalent to (3).

5.5 Modelling the episode repetition explicitly in episodic environments

Figure 4: Diagrams of (a) a unichain $s_{\mathrm{zrat}}$-model and (b) a recurrent $s_{\mathrm{rst}}$-model of an episodic environment with an (original) state set $\mathcal{S}=\{s^{0},s^{1},s^{2}\}$. The red edge indicates a deterministic transition via a single self-loop action $a_{\mathrm{zrat}}$ in a 0-reward absorbing terminal state $s_{\mathrm{zrat}}$. The blue edges indicate possible transitions via a single reset action $a_{\mathrm{rst}}$ in a resetting state $s_{\mathrm{rst}}$; this $a_{\mathrm{rst}}$ leads to a next state that is distributed according to the initial state distribution $\mathring{p}$ (which here has all states $s\in\mathcal{S}$ as its support). Transitions via $a_{\mathrm{zrat}}$ and $a_{\mathrm{rst}}$ yield a zero reward, as indicated by the red and blue edge labels. The other black unlabelled edges represent examples of transitions with positive probabilities and arbitrary reward values. For the formal $s_{\mathrm{zrat}}$- to $s_{\mathrm{rst}}$-model conversion, refer to Sec 5.5.

In order to obtain an infinite-horizon recurrent MDP of an episodic environment, we can model its episode repetition explicitly. This is in contrast to modelling an episodic environment as an infinite-horizon MDP with a 0-reward absorbing terminal state (denoted as $s_{\mathrm{zrat}}$, which is recurrent under every policy). This modelling (called $s_{\mathrm{zrat}}$-modelling) induces a unichain MDP, where all states but $s_{\mathrm{zrat}}$ are transient. In the $s_{\mathrm{zrat}}$-model, the gain is trivially 0 for all stationary policies, rendering gain optimality underselective (Sec 3.2). Fig 4(a) shows the diagram of an $s_{\mathrm{zrat}}$-model.

To explicitly model episode repetition, we augment the original state set with a resetting terminal state $s_{\mathrm{rst}}$ (instead of $s_{\mathrm{zrat}}$) that has a single available reset action $a_{\mathrm{rst}}$ (instead of a self-loop action $a_{\mathrm{zrat}}$). This action $a_{\mathrm{rst}}$ is responsible for a transition from $s_{\mathrm{rst}}$ to an initial state $S_{0}\sim\mathring{p}$, which yields a reward of 0. We call this approach $s_{\mathrm{rst}}$-modelling, whose example diagram is shown in Fig 4(b). The conversion from a unichain $s_{\mathrm{zrat}}$-model to a recurrent $s_{\mathrm{rst}}$-model is as follows.

Let $\mathcal{S}$ and $\mathcal{A}$ denote the original state and action sets of an episodic environment, respectively, before the augmentation of a 0-reward absorbing terminal state $s_{\mathrm{zrat}}$. Given a unichain $s_{\mathrm{zrat}}$-model with a state set $\mathcal{S}_{\mathrm{zrat}}^{+}=\mathcal{S}\cup\{s_{\mathrm{zrat}}\}$, an action set $\mathcal{A}_{\mathrm{zrat}}^{+}=\mathcal{A}\cup\{a_{\mathrm{zrat}}\}$, an initial state distribution $\mathring{p}_{\mathrm{zrat}}$, a state transition distribution $p_{\mathrm{zrat}}$, and a reward function $r_{\mathrm{zrat}}$, the corresponding recurrent $s_{\mathrm{rst}}$-model has the following components.

  • A state set $\mathcal{S}_{\mathrm{rst}}^{+}=\mathcal{S}\cup\{s_{\mathrm{rst}}\}$, an action set $\mathcal{A}_{\mathrm{rst}}^{+}=\mathcal{A}\cup\{a_{\mathrm{rst}}\}$, and an initial state distribution $\mathring{p}_{\mathrm{rst}}=\mathring{p}_{\mathrm{zrat}}$, where $\mathring{p}_{\mathrm{rst}}(s_{\mathrm{rst}})=\mathring{p}_{\mathrm{zrat}}(s_{\mathrm{zrat}})=0$.

  • A state transition distribution $p_{\mathrm{rst}}$ and a reward function $r_{\mathrm{rst}}$, where
    • $p_{\mathrm{rst}}(s^{\prime}|s,a)=p_{\mathrm{zrat}}(s^{\prime}|s,a)$ and $r_{\mathrm{rst}}(s,a,s^{\prime})=r_{\mathrm{zrat}}(s,a,s^{\prime})$, for all $s,s^{\prime}\in\mathcal{S}$ and all $a\in\mathcal{A}$,
    • $p_{\mathrm{rst}}(s_{\mathrm{rst}}|s,a)=p_{\mathrm{zrat}}(s_{\mathrm{zrat}}|s,a)$ and $r_{\mathrm{rst}}(s,a,s_{\mathrm{rst}})=r_{\mathrm{zrat}}(s,a,s_{\mathrm{zrat}})$, for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}$, as well as
    • $p_{\mathrm{rst}}(s|s_{\mathrm{rst}},a_{\mathrm{rst}})=\mathring{p}_{\mathrm{zrat}}(s)$ and $r_{\mathrm{rst}}(s_{\mathrm{rst}},a_{\mathrm{rst}},s)=0$, for all $s\in\mathcal{S}$.

The above conversion requires that i) reaching $s_{\mathrm{zrat}}$ is inevitable with probability 1 under all stationary policies (this is equivalent to inevitable termination in an episodic environment), and ii) there is no inherent notion of discounting that operates from $t=0$ until the end of an episode. For example diagrams of both $s_{\mathrm{zrat}}$- and $s_{\mathrm{rst}}$-models, refer to Fig 4.
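To make the conversion concrete, here is a minimal sketch for tabular models (our illustration; the array-based representation and the reuse of the terminal-state index as the resetting state are assumptions, not the paper's formal construction):

import numpy as np

def zrat_to_rst(P, R, p0, s_zrat):
    """Convert a tabular s_zrat-model into an s_rst-model.

    P[s, a, s'] : transition probabilities, R[s, a, s'] : rewards,
    p0[s]       : initial state distribution (with p0[s_zrat] = 0).
    The absorbing terminal state's index is reused as the resetting state
    s_rst, whose (single) reset action sends the agent to S0 ~ p0 with
    zero reward; all other transitions and rewards are kept unchanged.
    """
    P_rst, R_rst = P.copy(), R.copy()
    P_rst[s_zrat, :, :] = p0[np.newaxis, :]  # every action in s_rst acts as a_rst
    R_rst[s_zrat, :, :] = 0.0                # resetting yields zero reward
    return P_rst, R_rst, p0.copy()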

The $s_{\mathrm{rst}}$-modelling suits the fact that in practice, an RL agent is expected to run over multiple episodes, e.g. to play some game repeatedly. By using $s_{\mathrm{rst}}$-modelling, we can train an agent operating on episodic environments in the same practical way as if it were operating on continuing environments. Similar to the zero state-value of the terminal state (i.e. $v_{\gamma}^{\pi}(s_{\mathrm{zrat}})=0$) in discounted-reward $s_{\mathrm{zrat}}$-modelling, we may have a zero state-value of the resetting state (i.e. $v_{b}^{\pi}(s_{\mathrm{rst}})=0$) in average-reward $s_{\mathrm{rst}}$-modelling, where $v_{b}$ denotes the relative value of the bias (7).

Figure 5: Learning curves of $Q_{\mathrm{tot}}$-learning on the $s_{\mathrm{zrat}}$-model and $Q_{b}$-learning on the $s_{\mathrm{rst}}$-model, evaluated on two episodic environments: (a) GridNav-25 and (b) Taxi-15. $Q_{\mathrm{tot}}$-learning maximizes the total reward criterion, whereas $Q_{b}$-learning maximizes the average-reward criterion. For the experimental setup, see Sec 7.

We compare $s_{\mathrm{zrat}}$- and $s_{\mathrm{rst}}$-modelling by running experiments with the following two training schemes.

  • Scheme-A uses an $s_{\mathrm{zrat}}$-model and the total reward criterion (hence, $Q_{\mathrm{tot}}$-learning).

  • Scheme-B uses an $s_{\mathrm{rst}}$-model and the average reward criterion (hence, $Q_{b}$-learning).

Fig 5 depicts the learning curves of $Q_{x}$-learning trained under Scheme-A and Scheme-B on two episodic environments. Both schemes are trained with the same experiment-episode length (during training on episodic environments, an agent is always transported back to an initial state after an episode ends, regardless of the training scheme; this transportation to initial states is captured by $s_{\mathrm{rst}}$-modelling, but not by $s_{\mathrm{zrat}}$-modelling, see Fig 4), and gauged by the same performance metric, i.e. the finite-time average reward (19). As can be observed, both schemes converge to the same value, but Scheme-B empirically attains a higher rate of convergence. This indicates that both the $s_{\mathrm{zrat}}$- and $s_{\mathrm{rst}}$-models induce the same optimal policy (on the original states $s\in\mathcal{S}$), and that the $s_{\mathrm{zrat}}$- to $s_{\mathrm{rst}}$-model conversion is sound. More importantly, conversion to an $s_{\mathrm{rst}}$-model enables obtaining the optimal policy using an average-reward method, since such a model is recurrent, for which gain optimality is the most selective criterion.

Mahadevan, (1996a, Sec 3.1) mentioned the idea of episode repetition in an episodic grid-navigation environment. He, however, did not provide a formal conversion. Pardo et al., (2018) also used a conception similar to episode repetition when proposing a technique that bootstraps from the value of the state at the end of each experiment-episode for random-horizon episodic environments. Another important related work is that of White, (2017), who introduced a unification of episodic and continuing environment specifications. Such a unification is carried out via transition-based discounting, where the discount factor of the transition from terminal to initial states is set to 0. She noted that transition-based discounting breaks the Laurent-expansion-based connection between the discounted- and average-reward criteria in (6).

6 Discussions

The route of using the discounted reward to approximately maximize the average reward in RL seems to follow Blackwell's $\gamma$-discounted approach (1962) to Howard's average reward (1960) in DP; recall that both approaches were aimed at dealing with the infiniteness of the total reward in infinite-horizon MDPs (Sec 2.1). As explained in Sec 2.1, the major successive development of Blackwell's approach came from Veinott, who introduced a discounting-free family of $n$-discount optimality criteria (1969). He then developed an algorithm for obtaining $n$-discount optimal policies, from which the policies considered nearly-optimal and optimal by Blackwell emerge as special cases, namely the $(n=0)$- and $(n=\infty)$-discount optimal policies, respectively. Both imply average reward optimality.

As in DP, the route in RL should be completed by devising methods based on Veinott's optimality criteria. To this end, average-reward RL methods constitute the first step, in the following senses. First, they yield a set of approximately $(n=-1)$-discount optimal policies, from which higher $n$-discount optimal policies are sought, as illustrated in Fig 6(b). This follows from the hierarchical property of $n$-discount optimality, whose selectivity increases as $n$ increases. Note that such a property is exploited in the $n$-discount policy iteration in DP (Puterman, 1994, Ch 10.3). Second, average-reward RL methods evaluate the gain $\boldsymbol{v}_{g}\equiv\boldsymbol{v}_{-1}$ and the bias $\boldsymbol{v}_{0}$, which are likely to be useful for attaining higher $n$-discount criteria. This is because the expansion coefficients $\boldsymbol{v}_{n}$ in (6) are interconnected. For instance, the first three coefficients satisfy the three nested equations below,

i) $\boldsymbol{v}_{-1}=\boldsymbol{P}\boldsymbol{v}_{-1}$, ii) $\boldsymbol{v}_{0}=\boldsymbol{r}-\boldsymbol{v}_{-1}+\boldsymbol{P}\boldsymbol{v}_{0}$, iii) $\boldsymbol{v}_{1}=-\boldsymbol{v}_{0}+\boldsymbol{P}\boldsymbol{v}_{1}$, (20)

where $\boldsymbol{r}$ is the reward vector, $\boldsymbol{P}$ is the one-step transition matrix, and $\boldsymbol{v}_{1}$ is the policy evaluation entity for $(n=1)$-discount optimality. Thus, advances in gain and bias estimation are transferable to higher $n$-discount methods towards obtaining Blackwell optimal policies.
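As a minimal sketch of how the first two coefficients can be obtained when $\boldsymbol{P}$ and $\boldsymbol{r}$ are known (our illustration for a unichain policy, using the common normalization that pins the bias of a reference state at zero; the function name and the 2-state example are ours):

import numpy as np

def gain_and_bias(P, r, ref=0):
    """Solve equations i) and ii) of (20) for a unichain policy:
    the constant gain g and the bias vector v0 satisfy
    (I - P) v0 + g * 1 = r, with the normalization v0[ref] = 0."""
    n = P.shape[0]
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, :n] = np.eye(n) - P  # (I - P) v0
    A[:n, n] = 1.0             # + g * 1
    A[n, ref] = 1.0            # normalization row: v0[ref] = 0
    b[:n] = r
    x = np.linalg.solve(A, b)
    return x[n], x[:n]         # gain, bias

# A 2-state chain alternating deterministically between rewards 0 and 2 has gain 1.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
r = np.array([0.0, 2.0])
print(gain_and_bias(P, r))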

(a) $\gamma$-discounted optimality for $\gamma\in[0,1)$, whose optimal policy sets are denoted as $\Pi_{\gamma}^{*}$. Theoretically, there is a finite number of intervals. For simplicity, here we assume that there are four intervals, namely $[0=\gamma_{3},\gamma_{2})$, $[\gamma_{2},\gamma_{1})$, $[\gamma_{1},\gamma_{0}=\gamma_{\mathrm{Bw}})$, and $[\gamma_{\mathrm{Bw}},\gamma_{-1}=1)$, with $\Pi^{*}_{\gamma_{\mathrm{Bw}}\leq\gamma<1}\coloneqq\Pi^{*}_{\mathrm{Bw}}$ (and $\Pi^{*}_{0\leq\gamma<\gamma_{2}}$ possibly empty).
(b) $n$-discount optimality for $n=-1,0,\ldots,|\mathcal{S}|-2$, whose optimal policy sets are denoted by $\Pi_{n}^{*}$. In recurrent MDPs, $\Pi_{-1}^{*}=\Pi^{*}_{\mathrm{Bw}}$. When transient states are present, such an equivalence to $\Pi^{*}_{\mathrm{Bw}}$ may not be achieved until $n=|\mathcal{S}|-2$ at most in unichain MDPs (although $\Pi_{n}^{*}=\Pi^{*}_{\mathrm{Bw}}$ may already hold for smaller $n$).
Figure 6: Two Venn diagrams of the optimal policy sets $\Pi^{*}$ of the $\gamma$-discounted and $n$-discount optimality criteria, applied to unichain MDPs. Only one green circle is present in both Venn diagrams: it indicates the subset $\Pi^{*}_{\mathrm{Bw}}$ containing stationary policies (within $\Pi_{\mathrm{S}}$) that are guaranteed to be optimal with respect to the most selective Blackwell criterion. Note that the sizes of the circles in both diagrams are not to scale. The left Venn diagram of $\gamma$-discounted optimality is invariant to the MDP classification and is based on (Blackwell, 1962; Smallwood, 1966; Jiang et al., 2016); in particular, we conjecture that $\Pi^{*}_{\gamma_{\mathrm{Bw}}\leq\gamma<1}$ is always disjoint from the other optimal policy subsets. The right Venn diagram of $n$-discount optimality is based on (Puterman, 1994, Thm 10.1.5, 10.3.6; Mahadevan, 1996b, Fig 2).

Directly maximizing the average reward brings several benefits, as described in Sec 5. It is also evidently free from any $\gamma$-related difficulties (Sec 4). It is interesting now to discuss whether those benefits outweigh the loss of all the virtues of $\gamma$ (Sec 3).

Approximating the average reward via discounting (Sec 3.1) is not without caveats. Taking $\gamma$ very close to 1 slows the convergence and increases the variance of, for example, policy gradient estimates. On the other hand, lowering $\gamma$ poses the risk of suboptimality (with respect to the most selective criterion). This trade-off can potentially be guided by the critical discount factor $\gamma_{\mathrm{Bw}}$ (Sec 3.2). Specifically in RL, we conjecture that it is not the exact value of $\gamma_{\mathrm{Bw}}$ that is needed, but some value around it, because of the interplay among $\gamma$-dependent approximation layers, such as those related to policy value and gradient estimation. Thus, fine-tuning $\gamma$ is always a crucial yet non-trivial task, dealing with possibly non-concave and non-differentiable learning performance functions with respect to $\gamma$ (even though the domain of such a univariate function is limited to $[0,1)$). Research on this front is ongoing, with some progress (Zahavy et al., 2020; Paul et al., 2019; Xu et al., 2018). In that regard, Veinott's discounting-free criteria (including average-reward optimality) can be viewed as alternatives towards achieving the most selective Blackwell optimality in RL.

There are several ways to compensate for the other merits of discounting (Sec 3.3), which are lost when directly maximizing the average reward. We describe some of them below.

The chain-classification independence merit:

The need for chain-type determination (or assumption) can be thought of as a way to exploit the chain structural property in order to apply simpler average-reward methods. That is, for a constant gain across all states, we impose a unichain assumption, which is relatively unrestrictive in that it already generalizes the recurrent type. Nevertheless, the ultimate goal remains: an average-reward RL method that can handle non-constant gains across states, as in multichain MDPs. Because of its generality (hence, sophistication), such a method would not require prior determination of the chain type.

The chain-type assumption can also be interpreted as an attempt to break down the difficulty of attaining Blackwell-optimal policies in one go (regardless of the chain structure), as in $\gamma$-discounted optimality; see Fig 6(a). Recall that under such a criterion, only $(\gamma\geq\gamma_{\mathrm{Bw}})$-discounted optimal policies are Blackwell optimal (but the critical $\gamma_{\mathrm{Bw}}$ is generally unknown; otherwise this one-go technique would be favourable). On the other hand, the average reward ($(n=-1)$-discount) optimality is already equivalent to Blackwell optimality whenever the MDPs are recurrent. For more general unichain MDPs, $(n=0)$-discount optimality is equivalent to Blackwell optimality, but only for some transition and reward configurations (for other configurations, higher $n$-discount criteria should be applied, as shown in Fig 6(b)).

The contractive merit:

The relative value iteration (VI) has been shown to alleviate the non-contractive nature of basic VI in average rewards (Abounadi et al., 2001). There are also policy gradient methods whose convergence generally follows that of stochastic gradient ascent. Their policy evaluation part can be carried out via stochastic gradient descent, that is, by minimizing some estimation loss derived, for instance, from the average-reward Bellman evaluation equation, which involves the subtraction of the gain (similar to that of the relative VI).

The variance reduction merit:

The variance of trajectory-related estimation can be controlled by directly varying (truncating) the experiment-episode length (as an alternative to varying the artificial discount factor). One may also apply baseline-subtraction techniques for variance reduction. In general, however, lower variance induces higher bias errors, leading to a trade-off.

We conclude that directly maximizing the average reward has a number of benefits that make it worthwhile to use and to investigate further. It is the root for approaching Blackwell optimality through Veinott's criteria, which are discounting-free (eliminating any complication due to artificial discounting). Future work includes an examination of exploration strategies: to what extent do strategies developed for discounted rewards apply to RL aimed at discounting-free criteria?

7 Experimental setup

In this Section, we describe the experimental setups used to produce Figs 2, 3, and 5. They are about environments (Sec 7.1) and learning methods (Sec 7.2).

7.1 Environments

GridNav-$n$ (episodic)

refers to an $n$-by-$n$ grid-world with no obstacles. An agent has to navigate from an initial state to the goal state. There are $n^{2}-1$ states, excluding the goal state. Every state has four available actions, i.e. moving in the North, East, South, and West compass directions. The state-transition stochasticity is governed by an action-slip parameter. That is, the agent actually moves according to the chosen (intended) directional action with probability $q$. Otherwise, it stays put or moves in one of the three other directions; hence there are four alternatives due to slipping, each with probability $(1-q)/4$. If an action would bring the agent beyond the grid-world, the agent stays at the current grid cell. There is a reward of $+1$ for transitioning to the goal; any other movement costs $-1$.
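A minimal sketch of the slip dynamics described above (our illustration; the coordinate representation and names are assumptions, and rewards are omitted):

import numpy as np

MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def gridnav_step(pos, action, n, q, rng=np.random.default_rng()):
    """One GridNav transition: follow `action` with probability q; otherwise
    slip uniformly to staying put or one of the three other directions."""
    if rng.random() < q:
        move = MOVES[action]
    else:
        alternatives = [(0, 0)] + [m for a, m in MOVES.items() if a != action]
        move = alternatives[rng.integers(len(alternatives))]
    row, col = pos[0] + move[0], pos[1] + move[1]
    if not (0 <= row < n and 0 <= col < n):  # moving off the grid keeps the agent in place
        return pos
    return (row, col)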

Taxi-$n$ (episodic)

refers to the n𝑛n-by-n𝑛n grid Taxi environment, which is adopted from OpenAI Gym (Brockman et al.,, 2016). The number of obstacles is increased accordingly with respect to n𝑛n, whereas the number of passengers and pickup/drop-off locations is kept the same.

The environment families 1, 2, and 3

are adopted from the access-control queueing task (Sutton and Barto, 2018, p252), the Chain problem (Strens, 2000, Fig 1), and the torus MDP (Morimura et al., 2010, Fig 2), respectively. In Fig 2(b), the first two were used as Families 1 and 2, whose numbers of states are varied, whereas the third was used as Family 3, whose reward constant is varied.

7.2 Learning methods

The learning method used for the experiments in Figs 2 and 5 is $Q$-learning. It is an iterative method that relies on the Bellman optimality equation (BOE) to produce iterates approximating the optimal action value (denoted by $q^{*}$). Different optimality criteria have different BOEs, inducing different types of $Q$-learning as follows.

Discounted rewards ($Q_{\gamma}$-learning): $\hat{q}_{\gamma}^{*}(s_{t},a_{t})\leftarrow(1-\alpha_{t})\hat{q}_{\gamma}^{*}(s_{t},a_{t})+\alpha_{t}\{r_{t+1}+\gamma\max_{a\in\mathcal{A}}\hat{q}_{\gamma}^{*}(s_{t+1},a)\}$,
Average rewards ($Q_{b}$-learning): $\hat{q}_{b}^{*}(s_{t},a_{t})\leftarrow(1-\alpha_{t})\hat{q}_{b}^{*}(s_{t},a_{t})+\alpha_{t}\{r_{t+1}-\hat{v}_{g}^{*}+\max_{a\in\mathcal{A}}\hat{q}_{b}^{*}(s_{t+1},a)\}$,
Total rewards ($Q_{\mathrm{tot}}$-learning): $\hat{q}_{\mathrm{tot}}^{*}(s_{t},a_{t})\leftarrow(1-\alpha_{t})\hat{q}_{\mathrm{tot}}^{*}(s_{t},a_{t})+\alpha_{t}\{r_{t+1}+\max_{a\in\mathcal{A}}\hat{q}_{\mathrm{tot}}^{*}(s_{t+1},a)\}$,

with a positive learning rate $\alpha_{t}$ (here, we used a fine-tuned constant $\alpha$). The estimate $\hat{q}^{*}$ is initialized optimistically to large values to encourage exploration at the outset of learning. For $Q_{b}$-learning, we set the optimal gain estimate $\hat{v}_{g}^{*}\leftarrow\max_{a\in\mathcal{A}}\hat{q}_{b}^{*}(s_{\mathrm{ref}},a)$ with a prescribed (arbitrary but fixed) reference state $s_{\mathrm{ref}}$, following the relative-VI (RVI) technique of Abounadi et al., (2001, Sec 2.2). Note that $Q_{\mathrm{tot}}$-learning converges as long as the total reward is finite (Schwartz, 1993, p2).
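As a minimal sketch of these tabular updates (our illustration; the array layout and function name are assumptions), the three variants differ only in their bootstrap targets:

import numpy as np

def q_update(q, s, a, r, s_next, alpha, criterion, gamma=0.99, s_ref=0):
    """One tabular Q-learning update on an |S| x |A| array q;
    `criterion` selects which Bellman optimality target is used."""
    if criterion == "discounted":    # Q_gamma-learning
        target = r + gamma * np.max(q[s_next])
    elif criterion == "average":     # Q_b-learning (RVI): subtract the gain estimate
        target = r - np.max(q[s_ref]) + np.max(q[s_next])
    elif criterion == "total":       # Q_tot-learning
        target = r + np.max(q[s_next])
    else:
        raise ValueError(criterion)
    q[s, a] = (1.0 - alpha) * q[s, a] + alpha * target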

References

  • Abounadi et al., (2001) Abounadi, J., Bertsekas, D., and Borkar, V. S. (2001). Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3).
  • Agarwal et al., (2019) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv: 1908.00261.
  • Altman, (1999) Altman, E. (1999). Constrained Markov Decision Processes. Taylor & Francis.
  • Bacon, (2018) Bacon, P.-L. (2018). Temporal Representation Learning. PhD thesis, School of Computer Science, McGill University.
  • Baxter and Bartlett, (2001) Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15(1).
  • Beleznay et al., (1999) Beleznay, F., Grobler, T., and Szepesvari, C. (1999). Comparing value-function estimation algorithms in undiscounted problems. Technical report.
  • Bellman, (1957) Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  • Blackwell, (1962) Blackwell, D. (1962). Discrete dynamic programming. The Annals of Mathematical Statistics, 33(2).
  • Brockman et al., (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv:1606.01540.
  • Castro, (2020) Castro, P. S. (2020). Scalable methods for computing state similarity in deterministic Markov decision processes. Proceedings of the AAAI Conference on Artificial Intelligence.
  • Chang et al., (2013) Chang, H. S., Hu, J., Fu, M. C., and Marcus, S. I. (2013). Simulation-Based Algorithms for Markov Decision Processes. Springer, 2nd edition.
  • Dabney, (2014) Dabney, W. C. (2014). Adaptive step-sizes for reinforcement learning. PhD thesis, Computer Science Department, University of Massachusetts Amherst.
  • Dann et al., (2014) Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(24).
  • Devraj and Meyn, (2020) Devraj, A. M. and Meyn, S. P. (2020). Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. arXiv: 2002.10301.
  • Dewanto et al., (2020) Dewanto, V., Dunn, G., Eshragh, A., Gallagher, M., and Roosta, F. (2020). Average-reward model-free reinforcement learning: a systematic review and literature mapping. arXiv: 2010.08920.
  • Douc et al., (2018) Douc, R., Moulines, E., Priouret, P., and Soulier, P. (2018). Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer International Publishing.
  • Duan et al., (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In Proceedings of The 33rd International Conference on Machine Learning.
  • Fan et al., (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). A theoretical analysis of deep Q-learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control.
  • Feinberg and Shwartz, (2002) Feinberg, E. A. and Shwartz, A. (2002). Handbook of Markov Decision Processes: Methods and Applications, volume 40. Springer US.
  • Ferns et al., (2006) Ferns, N., Castro, P. S., Precup, D., and Panangaden, P. (2006). Methods for computing state similarity in Markov decision processes. In Conference in Uncertainty in Artificial Intelligence.
  • Henderson et al., (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Hordijk et al., (1985) Hordijk, A., Dekker, R., and Kallenberg, L. C. M. (1985). Sensitivity analysis in discounted Markovian decision problems. Operations Research Spektrum, 7.
  • Howard, (1960) Howard, R. A. (1960). Dynamic Programming and Markov Processes. Technology Press of the Massachusetts Institute of Technology.
  • Hutter, (2006) Hutter, M. (2006). General discounting versus average reward. In Algorithmic Learning Theory. Springer Berlin Heidelberg.
  • Jiang et al., (2016) Jiang, N., Singh, S., and Tewari, A. (2016). On structural properties of MDPs that bound loss due to shallow planning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence.
  • Jin and Sidford, (2021) Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs. In Proceedings of the 38th International Conference on Machine Learning.
  • Kakade, (2001) Kakade, S. (2001). Optimizing average reward using discounted rewards. In Proceedings of the 14th Annual Conference on Computational Learning Theory.
  • Kumar et al., (2020) Kumar, H., Kalogerias, D. S., Pappas, G. J., and Ribeiro, A. (2020). Zeroth-order deterministic policy gradient. arXiv: 2006.07314.
  • Lattimore and Hutter, (2014) Lattimore, T. and Hutter, M. (2014). General time consistent discounting. Theoretical Computer Science, 519.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
  • Machado et al., (2018) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research.
  • Mahadevan, (1994) Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the 11th International Conference on Machine Learning.
  • Mahadevan, (1996a) Mahadevan, S. (1996a). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning.
  • Mahadevan, (1996b) Mahadevan, S. (1996b). Sensitive discount optimality: Unifying discounted and average reward reinforcement learning. In Proceedings of the 13th International Conference on Machine Learning.
  • Melo and Ribeiro, (2007) Melo, F. S. and Ribeiro, M. I. (2007). Q-learning with linear function approximation. In Computational Learning Theory.
  • Mischel et al., (1972) Mischel, W., Ebbesen, E. B., and Zeiss, A. R. (1972). Cognitive and attentional mechanisms in delay of gratification. Journal of Personality and Social Psychology, 21(2).
  • Morimura et al., (2010) Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation, 22(2).
  • Nota and Thomas, (2020) Nota, C. and Thomas, P. S. (2020). Is the policy gradient a gradient? In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems.
  • Pardo et al., (2018) Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. (2018). Time limits in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning.
  • Paul et al., (2019) Paul, S., Kurin, V., and Whiteson, S. (2019). Fast efficient hyperparameter tuning for policy gradient methods. In Advances in Neural Information Processing Systems 32.
  • Petrik and Scherrer, (2008) Petrik, M. and Scherrer, B. (2008). Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems 21.
  • Puterman, (1994) Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition.
  • Samuelson, (1937) Samuelson, P. A. (1937). A note on measurement of utility. Review of Economic Studies, 4.
  • Scherrer, (2010) Scherrer, B. (2010). Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. In Proceedings of the 27th International Conference on Machine Learning.
  • Schwartz, (1993) Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the 10th International Conference on Machine Learning.
  • Singh et al., (1994) Singh, S. P., Jaakkola, T. S., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the 11th International Conference on Machine Learning.
  • Smallwood, (1966) Smallwood, R. D. (1966). Optimum policy regions for Markov processes with discounting. Operations Research, 14.
  • Strens, (2000) Strens, M. J. A. (2000). A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
  • Szepesvári, (2010) Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool.
  • Tang and Abbeel, (2010) Tang, J. and Abbeel, P. (2010). On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems 23.
  • Thomas, (2014) Thomas, P. (2014). Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning.
  • Thrun and Schwartz, (1993) Thrun, S. and Schwartz, A. (1993). Issues in using function approximation for reinforcement learning. In Proceedings of Connectionist Models Summer School.
  • Tsitsiklis and Van Roy, (2002) Tsitsiklis, J. N. and Van Roy, B. (2002). On average versus discounted reward temporal-difference learning. Machine Learning, 49.
  • van Hasselt, (2011) van Hasselt, H. P. (2011). Insights in Reinforcement Learning: formal analysis and empirical evaluation of temporal-difference learning algorithms. PhD thesis, Utrecht University.
  • Van Roy, (1998) Van Roy, B. (1998). Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, MIT.
  • Van Seijen et al., (2019) Van Seijen, H., Fatemi, M., and Tavakoli, A. (2019). Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems 32.
  • Veinott, (1969) Veinott, A. F. (1969). Discrete Dynamic Programming with Sensitive Discount Optimality Criteria. The Annals of Mathematical Statistics, 40(5).
  • White, (2017) White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.
  • Xu et al., (2018) Xu, Z., van Hasselt, H. P., and Silver, D. (2018). Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems 31.
  • Zahavy et al., (2020) Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. (2020). A self-tuning actor-critic algorithm. In Advances in Neural Information Processing Systems 33.