
Examining average and discounted reward optimality criteria in reinforcement learning

Vektor Dewanto, Marcus Gallagher
School of Information Technology and Electrical Engineering
University of Queensland, Australia
[email protected], [email protected]
Abstract

In reinforcement learning (RL), the goal is to obtain an optimal policy, for which the optimality criterion is fundamentally important. Two major optimality criteria are average and discounted rewards. While the latter is more popular, it is problematic to apply in environments without an inherent notion of discounting. This motivates us to revisit a) the progression of optimality criteria in dynamic programming, b) justifications for and complications of an artificial discount factor, and c) benefits of directly maximizing the average reward criterion, which is discounting-free. Our contributions include a thorough examination of the relationship between average and discounted rewards, as well as a discussion of their pros and cons in RL. We emphasize that average-reward RL methods possess the ingredients and mechanisms for applying a family of discounting-free optimality criteria (Veinott, 1969) to RL.

1 Introduction

Reinforcement learning (RL) is concerned with sequential decision making, where a decision maker has to choose an action based on its current state. Determining the best actions amounts to finding an optimal mapping from every state to a probability distribution over actions available at that state. Thus, one fundamental component of any RL method is the optimality criterion, by which we define what we mean by such an optimal mapping.

The most popular optimality criterion in RL is the discounted reward (Duan et al., 2016; Machado et al., 2018; Henderson et al., 2018). On the other hand, there is growing interest in average reward optimality, as surveyed by Mahadevan (1996a) and Dewanto et al. (2020). In this paper, we discuss both criteria in order to obtain a comprehensive understanding of their properties, relationships, and differences. This is important because the choice of optimality criterion affects almost every aspect of RL methods, including the policy evaluation function, the policy gradient formulation, and the resulting optimal policy, where the term policy refers to the above-mentioned mapping. Thus, the choice of optimality criterion eventually impacts the performance of an RL system (and the choices made within it, e.g. approximation techniques and hyperparameters).

This paper presents a thorough examination of the connection between average and discounted rewards, as well as a discussion of their pros and cons in RL. Our examination is devised through the broader lens of refined optimality criteria (which generalize average and discounted rewards), inspired by the seminal work of Mahadevan (1996b). It is also broader in the sense of algorithmic styles, covering value- and policy-iteration methods, as well as tabular and function-approximation settings in RL. In this paper, we also attempt to compile and unify various justifications for deviating from the common practice of maximizing the discounted reward criterion in RL.

There are a number of papers that specifically compare and contrast average and discounted rewards in RL. For example, Mahadevan (1994) empirically investigated average- versus discounted-reward Q-learning (in the context of value-iteration RL). Tsitsiklis and Van Roy (2002) theoretically compared average- versus discounted-reward temporal-difference (TD) methods for policy evaluation (in the context of policy-iteration RL). We provide updates and extensions to those existing comparisons in order to obtain a comprehensive view of discounting-based and discounting-free RL.

Since RL is approximate dynamic programming (DP), we begin by reviewing optimality criteria in DP, model classification (which plays an important role in average-reward optimality), and the interpretation of discounting in Sec 2. We then provide justifications for discounting in environments without any inherent notion of discounting (Sec 3). This is followed by the difficulties that arise from introducing an artificial discount factor (Sec 4). We discuss the benefits of maximizing the average-reward criterion in Sec 5, as well as our findings and viewpoint in Sec 6.

2 Preliminaries

Sequential decision making is often formulated as a Markov decision process (MDP) with a state set $\mathcal{S}$, an action set $\mathcal{A}$, a reward set $\mathcal{R}$, and a decision-epoch set $\mathcal{T}$. Here, all but $\mathcal{T}$ are finite sets, yielding an infinite-horizon finite MDP. At the current decision epoch $t \in \mathcal{T}$, a decision maker (henceforth, an agent) is in a current state $s_t \in \mathcal{S}$, and chooses to then execute a current action $a_t \in \mathcal{A}$. Consequently, at the next decision epoch $t+1$, it arrives in the next state $s_{t+1} \in \mathcal{S}$ and earns an (immediate) scalar reward $r_{t+1} \in \mathbb{R}$. The decision epochs occur every single time unit (i.e. timestep) from $t=0$ until the maximum timestep (denoted $t_{\mathrm{max}}$); hence, we have a discrete-time MDP.

For $t = 0, 1, \ldots, t_{\mathrm{max}} = \infty$, the agent experiences a sequence (trajectory) of states $s_t$, actions $a_t$, and rewards $r_{t+1}$; that is, $s_0, a_0, r_1, s_1, a_1, r_2, \ldots, s_{t_{\mathrm{max}}}$. The initial state, the next state, and the next reward are governed by the environment dynamics, which is fully specified by three time-homogeneous (time-invariant) entities: i) the initial state distribution $\mathring{p}$, from which $S_{t=0} \sim \mathring{p}$, ii) the one-(time)step state transition distribution $p(\cdot|s_t, a_t)$, from which $S_{t+1} | S_t = s_t, A_t = a_t \sim p(\cdot|s_t, a_t)$, and iii) the reward function $r(s_t, a_t) = \mathbb{E}_{p(\cdot|s_t,a_t)}\big[\sum_{r \in \mathcal{R}} \Pr\{r | s_t, a_t, S_{t+1}\} \cdot r\big] \eqqcolon r_{t+1}$, where $S_t$ and $A_t$ denote the state and action random variables, respectively. The existence of a probability space that holds this infinite sequence of random variables $(S_0, A_0, S_1, A_1, \ldots)$ can be shown using the Ionescu-Tulcea theorem (Lattimore and Szepesvári, 2020, p513). We assume that the rewards are bounded, i.e. $|r(s,a)| \leq r_{\mathrm{max}} < \infty$ for all $(s,a) \in \mathcal{S} \times \mathcal{A}$, and that the MDP is unichain and aperiodic (see Sec 2.2 for MDP classification).
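To make this notation concrete, the following minimal Python/NumPy sketch (with illustrative numbers and hypothetical array names) encodes a finite MDP through its initial state distribution $\mathring{p}$, transition distribution $p(\cdot|s,a)$, and reward function $r(s,a)$, and samples a trajectory under a stationary randomized policy.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny finite MDP with |S| = 3 states and |A| = 2 actions (illustrative numbers).
n_states, n_actions = 3, 2
p0 = np.array([1.0, 0.0, 0.0])                                     # initial state distribution
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P[s, a, s'] = p(s'|s, a)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))             # r[s, a] = r(s, a)

# A stationary randomized policy: pi[s, a] = pi(a|s).
pi = rng.dirichlet(np.ones(n_actions), size=n_states)

def sample_trajectory(p0, P, r, pi, t_max, rng):
    """Sample (s_0, a_0, r_1, s_1, a_1, r_2, ...) for t_max steps under a stationary policy."""
    s = rng.choice(len(p0), p=p0)
    states, actions, rewards = [s], [], []
    for _ in range(t_max):
        a = rng.choice(P.shape[1], p=pi[s])           # A_t ~ pi(.|s_t)
        s_next = rng.choice(P.shape[2], p=P[s, a])    # S_{t+1} ~ p(.|s_t, a_t)
        actions.append(a)
        rewards.append(r[s, a])                       # expected immediate reward r(s_t, a_t)
        states.append(s_next)
        s = s_next
    return states, actions, rewards

states, actions, rewards = sample_trajectory(p0, P, r, pi, t_max=10, rng=rng)
print(rewards)
```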

The solution to a sequential decision-making problem is an optimal mapping from every state to a probability distribution over the (available) action set $\mathcal{A}$ in that state (in general, different states may have different action sets). It is optimal with respect to some optimality criterion, as discussed later in Sec 2.1. Any such mapping (regardless of whether it is optimal) is called a policy and generally depends on the timestep $t$. It is denoted as $\pi_t: \mathcal{S} \mapsto [0,1]^{|\mathcal{A}|}$, or alternatively $\pi_t: \mathcal{S} \times \mathcal{A} \mapsto [0,1]$, where $\pi_t(a_t|s_t) \coloneqq \Pr\{A_t = a_t | S_t = s_t\}$ indicates the probability of selecting action $a_t \in \mathcal{A}$ given the state $s_t \in \mathcal{S}$ at timestep $t \in \mathcal{T}$. Thus, each action is sampled from a conditional action distribution, i.e. $A_t | S_t = s_t \sim \pi_t(\cdot|s_t)$.

The most specific solution space is the stationary and deterministic policy set $\Pi_{\mathrm{SD}}$, whose policies $\pi \in \Pi_{\mathrm{SD}}$ are stationary (time-invariant), i.e. $\pi \coloneqq \pi_0 = \pi_1 = \ldots = \pi_{t_{\mathrm{max}}-1}$, as well as deterministic, i.e. $\pi(\cdot|s_t)$ has a single action support (hence the mapping reduces to $\pi: \mathcal{S} \mapsto \mathcal{A}$). In this work, we consider a more general policy set, namely the stationary policy set $\Pi_{\mathrm{S}}$. It includes the stationary randomized (stochastic) set $\Pi_{\mathrm{SR}}$ and its degenerate counterpart, the stationary deterministic set $\Pi_{\mathrm{SD}}$.

2.1 Optimality criteria

In a basic notion of optimality, a policy with the largest value is optimal. That is,

$$v_x(\pi_x^*) \geq v_x(\pi), \quad \forall \pi \in \Pi_{\mathrm{S}}. \qquad (1)$$

Here, the function $v_x$ measures the value (utility) of a policy $\pi$ based on the infinite reward sequence that is earned by an agent following $\pi$. The subscript $x$ indicates the specific type of value function, which induces a specific $x$-optimality criterion, hence the $x$-optimal policy, denoted $\pi_x^*$.

One intuitive value function is the expected total reward (this intuition seems to lead to the so-called reward hypothesis (Sutton and Barto, 2018, p53)). That is,

$$v_{\mathrm{tot}}(\pi, s) \coloneqq \lim_{t_{\mathrm{max}} \to \infty} \mathbb{E}_{A_t \sim \pi(\cdot|s_t),\, S_{t+1} \sim p(\cdot|s_t,a_t)}\left[\sum_{t=0}^{t_{\mathrm{max}}-1} r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right], \quad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}. \qquad (2)$$

However, $v_{\mathrm{tot}}$ may be infinite (unbounded, divergent, non-summable); if a limit is equal to infinity, we assert that the limit does not exist. No comparison can be made between finite and infinite policy values, nor between infinite policy values. Howard (1960) therefore examined the expected average reward (also referred to as the gain), defined as

$$v_g(\pi, s) \coloneqq \lim_{t_{\mathrm{max}} \to \infty} \frac{1}{t_{\mathrm{max}}}\, \mathbb{E}_{A_t \sim \pi(\cdot|s_t),\, S_{t+1} \sim p(\cdot|s_t,a_t)}\left[\sum_{t=0}^{t_{\mathrm{max}}-1} r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right], \qquad (3)$$

which is finite for all $\pi \in \Pi_{\mathrm{S}}$ and all $s \in \mathcal{S}$. For more details, including the interpretation of the gain, we refer the reader to Dewanto et al. (2020).
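As a quick sanity check of (3), the gain of a stationary policy can be estimated by simply averaging the rewards along a sufficiently long trajectory; a minimal sketch (illustrative MDP, hypothetical helper) follows.

```python
import numpy as np

def estimate_gain(p0, P, r, pi, t_max, rng):
    """Monte-Carlo estimate of the gain v_g(pi, s_0): the long-run average reward per timestep."""
    s = rng.choice(len(p0), p=p0)
    total = 0.0
    for _ in range(t_max):
        a = rng.choice(P.shape[1], p=pi[s])
        total += r[s, a]
        s = rng.choice(P.shape[2], p=P[s, a])
    return total / t_max   # approximates the Cesaro limit in Eq (3) for large t_max

# Illustrative 2-state, 2-action MDP and a uniform random policy.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.3, 0.7]]])     # P[s, a, s']
r = np.array([[1.0, 0.0], [-1.0, 2.0]])      # r[s, a]
p0 = np.array([1.0, 0.0])
pi = np.full((2, 2), 0.5)

print(estimate_gain(p0, P, r, pi, t_max=50_000, rng=np.random.default_rng(1)))
```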

Alternatively, Blackwell (1962, Sec 4) attempted to tackle the infiniteness of (2) through the expected total discounted reward, which was studied earlier by Bellman (1957). (There are other discounting schemes, e.g. $1/(t^2+t)$, $1/(1+\kappa t)$ for $\kappa > 0$ (hyperbolic), and $t^{-\kappa}$ for $\kappa > 1$ (Hutter, 2006; Lattimore and Hutter, 2014), but they do not exhibit the advantageous properties of the geometric discounting (4), such as its decomposition via (6); hence, those other schemes are not considered here.) That is,

$$v_\gamma(\pi, s) \coloneqq \lim_{t_{\mathrm{max}} \to \infty} \mathbb{E}_{A_t \sim \pi(\cdot|s_t),\, S_{t+1} \sim p(\cdot|s_t,a_t)}\left[\sum_{t=0}^{t_{\mathrm{max}}-1} \gamma^t\, r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right], \quad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}, \qquad (4)$$

with a discount factor $\gamma \in [0,1)$. In particular, according to what is later known as the truncated Laurent series expansion (7), Blackwell suggested finding policies that are $\gamma$-discounted optimal for all discount factors $\gamma$ sufficiently close to 1, and established their existence in finite MDPs. Subsequently, Smallwood (1966) identified that the discount-factor interval can be divided into a finite number of intervals (see also Hordijk et al. (1985) and Feinberg and Shwartz (2002, Thm 2.16)), i.e. $[0 = \gamma_m, \gamma_{m-1}), [\gamma_{m-1}, \gamma_{m-2}), \ldots, [\gamma_0, \gamma_{-1} = 1)$, in such a way that there exist policies $\pi^*_{\gamma_i}$ for $0 \leq i \leq m$ that are $\gamma$-discounted optimal for all $\gamma \in [\gamma_i, \gamma_{i-1})$. This leads to the concept of Blackwell optimality. A policy $\pi^*_{\mathrm{Bw}}$ is Blackwell optimal if there exists a critical discount factor $\gamma_{\mathrm{Bw}} \in [0,1)$ (critical in that it specifies how close to 1 is sufficient for attaining Blackwell optimal policies) such that

$$v_\gamma(\pi^*_{\mathrm{Bw}}, s) \geq v_\gamma(\pi, s), \quad \text{for } \gamma_{\mathrm{Bw}} \leq \gamma < 1, \text{ and for all } \pi \in \Pi_{\mathrm{S}} \text{ and all } s \in \mathcal{S}. \qquad (5)$$

Note that whenever the policy value function $v$ depends not only on the policy $\pi$ but also on the state $s$ from which the value is measured, the basic optimality (1) requires that the optimal policy has a value greater than or equal to every other policy's value in every state $s \in \mathcal{S}$.

In order to obtain Blackwell optimal policies, Veinott (1969, Eqn 27) introduced a family of new optimality criteria. That is, a policy $\pi_n^*$ is $n$-discount optimal, for $n = -1, 0, \ldots$, if

$$\lim_{\gamma \to 1} \frac{1}{(1-\gamma)^n} \Big( v_\gamma(\pi_n^*, s) - v_\gamma(\pi, s) \Big) \geq 0, \qquad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}.$$

He showed that $(n = |\mathcal{S}|)$-discount optimality is equivalent to Blackwell optimality because the selectivity increases with $n$ (hence, $n_i$-discount optimality implies $n_j$-discount optimality for all $i > j$). (This is eventually refined to $(n = |\mathcal{S}| - n_{\mathrm{re}}(\pi^*_{\mathrm{Bw}}))$-discount optimality, where $n_{\mathrm{re}}(\pi^*_{\mathrm{Bw}})$ is the number of recurrent classes under the Blackwell optimal policy (Feinberg and Shwartz, 2002, Thm 8.3, Sec 8.1.4).) He then developed a policy-iteration algorithm (for finding $n$-discount optimal policies) that utilizes the full Laurent series expansion of $\boldsymbol{v}_\gamma(\pi)$ for any $\pi \in \Pi_{\mathrm{S}}$, where the policy value vector $\boldsymbol{v}_x(\pi) \in \mathbb{R}^{|\mathcal{S}|}$ is obtained by stacking the state-wise values $v_x(\pi, s)$ for all $s \in \mathcal{S}$. That is,

$$\begin{aligned}
\boldsymbol{v}_\gamma(\pi) &= \frac{1}{\gamma}\Big(\frac{\gamma}{1-\gamma}\boldsymbol{v}_{-1}(\pi) + \boldsymbol{v}_0(\pi) + \sum_{n=1}^{\infty}\Big(\frac{1-\gamma}{\gamma}\Big)^n \boldsymbol{v}_n(\pi)\Big) && \text{(Full expansion)} \qquad (6) \\
&= \frac{1}{1-\gamma}\boldsymbol{v}_{-1}(\pi) + \boldsymbol{v}_0(\pi) + \Big\{\frac{1-\gamma}{\gamma}\boldsymbol{v}_0(\pi) + \frac{1}{\gamma}\sum_{n=1}^{\infty}\Big(\frac{1-\gamma}{\gamma}\Big)^n \boldsymbol{v}_n(\pi)\Big\} && \text{(Using the additive identity $\boldsymbol{v}_0 - \boldsymbol{v}_0 = \boldsymbol{0}$)} \\
&= \frac{1}{1-\gamma}\boldsymbol{v}_{-1}(\pi) + \boldsymbol{v}_0(\pi) + \big\{\boldsymbol{e}(\pi, \gamma)\big\}, && \text{(Truncated expansion)} \qquad (7)
\end{aligned}$$

where $\boldsymbol{v}_n \in \mathbb{R}^{|\mathcal{S}|}$ for $n = -1, 0, \ldots$ denotes the expansion coefficients. The truncated form appeared first (before the full one) in Blackwell (1962, Thm 4a), where $\boldsymbol{v}_{-1}$ is equivalent to the gain from all states, $\boldsymbol{v}_g \in \mathbb{R}^{|\mathcal{S}|}$, $\boldsymbol{v}_0$ is equivalent to the so-called bias, and the residual $\boldsymbol{e}(\pi, \gamma) = \mathcal{O}(1-\gamma)$ converges to $\boldsymbol{0}$ as $\gamma$ approaches 1.
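For a concrete handle on $\boldsymbol{v}_{-1}$ (the gain) and $\boldsymbol{v}_0$ (the bias), the sketch below computes them for an aperiodic unichain MC using the standard limiting-matrix and deviation-matrix formulas of Puterman (1994, Ch 8), and then checks the truncated expansion (7) numerically; the transition matrix and rewards are illustrative choices of ours.

```python
import numpy as np

# Induced chain of a fixed stationary policy: P_pi[s, s'] (cf. Eq (8) in Sec 2.2) and r_pi[s].
# Illustrative numbers for a 3-state aperiodic unichain MC.
P_pi = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.4, 0.5]])
r_pi = np.array([1.0, -0.5, 2.0])
n = len(r_pi)

# Stationary distribution p*: solve p* P_pi = p*, sum(p*) = 1; P_star has p* in every row.
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
b = np.concatenate([np.zeros(n), [1.0]])
p_star, *_ = np.linalg.lstsq(A, b, rcond=None)
P_star = np.tile(p_star, (n, 1))

# Laurent coefficients (Puterman, 1994, Ch 8): gain v_{-1} = P* r, bias v_0 = H r,
# with the deviation matrix H = (I - P_pi + P*)^{-1} - P*.
Z = np.linalg.inv(np.eye(n) - P_pi + P_star)
v_gain = P_star @ r_pi
v_bias = (Z - P_star) @ r_pi

# Check the truncated expansion (7): v_gamma ~= v_gain / (1 - gamma) + v_bias + O(1 - gamma).
for gamma in [0.9, 0.99, 0.999]:
    v_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    residual = v_gamma - (v_gain / (1.0 - gamma) + v_bias)
    print(gamma, np.max(np.abs(residual)))   # shrinks roughly linearly in (1 - gamma)
```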

2.2 MDP classification

A stationary policy $\pi$ of a finite MDP induces a stationary finite Markov chain (MC) with an $|\mathcal{S}|$-by-$|\mathcal{S}|$ one-step transition (stochastic) matrix $\boldsymbol{P}_{\!\pi}$, whose $[s,s']$-th entry indicates

$$p_\pi(s'|s) \coloneqq \Pr\{S_{t+1} = s' | S_t = s; \pi\} = \sum_{a \in \mathcal{A}} p(s'|s,a)\, \pi(a|s), \quad \forall s, s' \in \mathcal{S}. \qquad (8)$$

Therefore, an MDP can be classified on the basis of the set of MCs induced by its stationary policies. To this end, we need to review the classification of states, and then of MCs, below; mainly following Puterman (1994, Appendix A) and Douc et al. (2018, Ch 6.2).

State classification:

Let $\boldsymbol{P}_{\!\pi}^t$ denote the $t$-th power of $\boldsymbol{P}_{\!\pi}$ of a Markov chain. This $\boldsymbol{P}_{\!\pi}^t$ is another $|\mathcal{S}|$-by-$|\mathcal{S}|$ stochastic matrix, whose $[s,s']$-entry indicates $p_\pi^t(s'|s) \coloneqq \Pr\{S_t = s' | S_0 = s; \pi\}$, i.e. the probability of being in state $s' \in \mathcal{S}$ in $t$ timesteps starting from a state $s \in \mathcal{S}$.

A state $s' \in \mathcal{S}$ is accessible from state $s \in \mathcal{S}$, denoted $s \to s'$, if $p_\pi^t(s'|s) > 0$ for some $t \geq 0$. Furthermore, $s$ and $s'$ communicate if $s \to s'$ and $s' \to s$.

A subset $\tilde{\mathcal{S}}$ of $\mathcal{S}$ is called a closed set if no state outside $\tilde{\mathcal{S}}$ is accessible from any state in $\tilde{\mathcal{S}}$. Furthermore, a subset $\tilde{\mathcal{S}}$ of $\mathcal{S}$ is called a recurrent class (or a recurrent chain) if all states in $\tilde{\mathcal{S}}$ communicate and $\tilde{\mathcal{S}}$ is closed. In a finite MC, there exists at least one recurrent class.

Let $N_{\mathrm{vis}}^{s,\pi}$ denote the number of visits to a state $s$ under a policy $\pi$, i.e. $N_{\mathrm{vis}}^{s,\pi} \coloneqq \sum_{t=0}^{\infty} \mathbb{I}[S_t = s\, |\, \pi]$, where $\mathbb{I}$ denotes the indicator function. Then, the expected number of visits to a state $s \in \mathcal{S}$, starting from the state $s$ itself under a policy $\pi$, is given by $\mathbb{E}\big[N_{\mathrm{vis}}^{s,\pi}\big] = \sum_{t=0}^{\infty} p_\pi^t(s|s)$.

A state $s \in \mathcal{S}$ is recurrent if and only if $\mathbb{E}\big[N_{\mathrm{vis}}^{s,\pi}\big] = \infty$. This means that a recurrent state will be re-visited infinitely often. Equivalently, starting from a recurrent state $s \in \mathcal{S}$, the probability of returning to $s$ itself for the first time in finite time is 1.

A state $s \in \mathcal{S}$ is transient if and only if $\mathbb{E}\big[N_{\mathrm{vis}}^{s,\pi}\big] < \infty$. This means that a transient state will never be re-visited again after some point in time. Equivalently, starting from a transient state $s \in \mathcal{S}$, the probability of returning to $s$ itself for the first time in finite time is less than 1. A transient state is a non-recurrent state (hence, every state is either recurrent or transient).

The period of a state $s \in \mathcal{S}$ is defined as the greatest common divisor $t_{\mathrm{gcd}}$ of all $t \geq 1$ for which $p_\pi^t(s|s) > 0$. Whenever the period $t_{\mathrm{gcd}} > 1$, we classify $s$ as periodic; otherwise, $s$ is aperiodic (not periodic). Intuitively, if returning to a state $s$ occurs at irregular times, then $s$ is aperiodic. Note that periodicity is a class property, implying that all states in a recurrent class have the same period.

An ergodic class (or an ergodic chain) is a class that is both recurrent and aperiodic. Its recurrent and aperiodic states are referred to as ergodic states. Note, however, that the term "ergodic" is also used in ergodic theory (mathematics), where its definition does not involve the notion of aperiodicity.

Markov chain (MC) classification:

An MC is irreducible if the set of all its states forms a single recurrent class. An irreducible MC is aperiodic if all its states are aperiodic. Otherwise, it is periodic. An MC is reducible if it has both recurrent states and transient states. The state set of a reducible MC can be partitioned into one or more disjoint recurrent classes, plus a set of transient states.

An MC is unichain if it consists of a single (uni-) recurrent class (chain), plus a (possibly empty) set of transient states. Otherwise, it is multichain.
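The above chain classification can be checked mechanically from $\boldsymbol{P}_{\!\pi}$; the following sketch (a hypothetical helper of ours, based on reachability) identifies the recurrent classes and transient states of a finite MC and reports whether it is unichain.

```python
import numpy as np

def classify_mc(P_pi, tol=1e-12):
    """Classify a finite MC given its transition matrix P_pi[s, s'].

    Returns (recurrent_classes, transient_states, is_unichain)."""
    n = P_pi.shape[0]
    # reach[s, s'] = True iff s' is accessible from s in zero or more steps.
    reach = np.eye(n, dtype=bool) | (P_pi > tol)
    for _ in range(n):   # transitive closure by repeated "squaring"
        reach = reach | ((reach.astype(int) @ reach.astype(int)) > 0)
    # A state s is recurrent iff every state accessible from s can access s back.
    recurrent = np.array([all(reach[t, s] for t in range(n) if reach[s, t]) for s in range(n)])
    transient_states = [s for s in range(n) if not recurrent[s]]
    # Group recurrent states into communicating (recurrent) classes.
    classes, seen = [], set()
    for s in range(n):
        if recurrent[s] and s not in seen:
            cls = {t for t in range(n) if reach[s, t] and reach[t, s]}
            classes.append(sorted(cls))
            seen |= cls
    return classes, transient_states, len(classes) == 1

P_example = np.array([[0.5, 0.5, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.3, 0.0, 0.7]])   # illustrative: state 1 absorbing, 0 and 2 transient
print(classify_mc(P_example))             # ([[1]], [0, 2], True) -> unichain
```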

MDP classification based on the pattern of MCs induced by all stationary policies:

An MDP is unichain if the induced MC corresponding to every stationary policy is unichain. Otherwise, it is multichain (i.e. at least one induced MC is multichain, containing two or more recurrent classes). This is a restrictive definition of a multichain MDP; in the literature, a looser definition is also used, in which multichain simply means a general MDP.

A recurrent MDP is a special case of a unichain MDP in which the MC corresponding to every stationary policy is irreducible. In other words, a recurrent MDP is a unichain MDP all of whose induced MCs have an empty transient state set.

MDP classification based on the pattern of state accessibility under some stationary policy:

An MDP is communicating (also termed strongly connected) if, for every pair of states $s$ and $s'$ in the state set $\mathcal{S}$, there exists a stationary policy (which may depend on $s$ and $s'$) under which $s'$ is accessible from $s$. A recurrent MDP is always communicating, but a communicating MDP may not be recurrent; it may be unichain or multichain.

An MDP is weakly communicating (also termed simply connected) if there exists a closed state set $\tilde{\mathcal{S}} \subseteq \mathcal{S}$ for which there exists a stationary policy under which $s$ is accessible from $s'$ for every pair of states $s, s' \in \tilde{\mathcal{S}}$, plus a (possibly empty) set of states that are transient under every stationary policy. The weakly communicating classification is more general than the communicating one in that any communicating MDP is weakly communicating, with no state that is transient under every stationary policy. Unichain MDPs are always weakly communicating.

There exists an MDP that is not weakly communicating. Such a non-weakly-communicating MDP is always multichain, but a multichain MDP may also be weakly communicating (or communicating).

2.3 Discounting in environments with an inherent notion of discounting

An environment with an inherent notion of discounting has a discount factor $\gamma$ that encodes one of the following entities; thus, $\gamma$ is part of the environment specification (definition, description).

Firstly, the time value of rewards, i.e. the value of a unit reward $t$ timesteps in the future is $\gamma^t$: this is related to psychological concepts. For example, some people prefer rewards now rather than later (Mischel et al., 1972), hence they assign greater values to early rewards through a small $\gamma$ (being shortsighted). It is also natural to believe that there is more certainty about the near future than the far future, because immediate rewards are (exponentially more) likely due to recent actions. The time preference is also well motivated in economics (Samuelson, 1937). This includes using $\gamma$ to account for the decreasing value of money (because of inflation), as well as the interpretation of $(1-\gamma)/\gamma$ as a positive interest rate. Moreover, commercial activities have failure (abandonment) risk due to changing government regulations and consumer preferences over time.

Secondly, the uncertainty about random termination independent of the agent's actions: such termination comes from external control beyond the agent, e.g. someone shutting down a robot, engine failure (due to weather or natural disaster), or the death of a living organism.

In particular, whenever the random termination time $T_{\mathrm{max}}$ follows a geometric distribution $\mathrm{Geo}(p = 1 - \gamma)$, we have the following identity between the total and discounted rewards,

$$v_{T_{\mathrm{max}}}(\pi, s) \coloneqq \mathbb{E}_{S_t, A_t}\left[\mathbb{E}_{T_{\mathrm{max}}}\left[\sum_{t=0}^{T_{\mathrm{max}}-1} r(S_t, A_t) \,\Big|\, S_0 = s, \pi\right]\right] = \underbrace{v_\gamma(\pi, s)}_{\text{see (4)}}, \quad \forall \pi \in \Pi_{\mathrm{S}}, \forall s \in \mathcal{S}, \qquad (9)$$

where the discount factor $\gamma$ plays the role of the geometric distribution parameter (Puterman, 1994, Prop 5.3.1). This discounting implies that at every timestep (for any state-action pair), the agent enters the 0-reward absorbing terminal state with probability $(1-\gamma)$, see Fig 4(a). Note that because $\gamma$ is invariant to states and actions (as well as time), this way of capturing the randomness of $T_{\mathrm{max}}$ may be inaccurate in cases where termination depends on states, actions, or both.
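The identity (9) is easy to verify by simulation: terminate each rollout at a random time $T_{\mathrm{max}} \sim \mathrm{Geo}(1-\gamma)$, average the undiscounted returns, and compare with the analytic discounted value. A sketch with an illustrative two-state chain (our own numbers) follows.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.8

# Induced chain of a fixed policy: P_pi[s, s'] and r_pi[s] (illustrative numbers).
P_pi = np.array([[0.7, 0.3], [0.4, 0.6]])
r_pi = np.array([1.0, -2.0])
s0 = 0

# Analytic discounted value, Eq (4): v_gamma = (I - gamma P_pi)^{-1} r_pi.
v_gamma = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Monte Carlo estimate of Eq (9): total (undiscounted) reward up to a random termination
# time T_max ~ Geometric(p = 1 - gamma), i.e. P(T_max = k) = (1 - gamma) gamma^(k-1), k >= 1.
returns = []
for _ in range(50_000):
    t_max = rng.geometric(1.0 - gamma)
    s, total = s0, 0.0
    for _ in range(t_max):
        total += r_pi[s]
        s = rng.choice(2, p=P_pi[s])
    returns.append(total)

print(v_gamma[s0], np.mean(returns))   # agree up to Monte Carlo error
```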

3 Discounting without an inherent notion of discounting

From now on, we focus on environments without an inherent notion of discounting, where $\gamma$ is not part of the environment specification (cf. Sec 2.3). We emphasize the qualification "inherent" since any MDP can always be thought of as having some notion of discounting from the Blackwell optimality point of view (5). This is because a Blackwell optimal policy is guaranteed to exist in finite MDPs (Puterman, 1994, Thm 10.1.4), implying the existence of its discount factor $\gamma_{\mathrm{Bw}} \in [0,1)$ and of a (potentially very long) finite-horizon MDP model that gives exactly the same Blackwell-optimal policy as its infinite-horizon counterpart (Chang et al., 2013, Ch 1.3).

When there is no inherent notion of discounting, the discount factor $\gamma$ is imposed for bounded sums (Sec 2.1) and becomes part of the solution method (algorithm). This is what we refer to as artificial discounting, which induces artificial interpretations such as those described in Sec 2.3. The $\gamma_{\mathrm{Bw}}$ mentioned in the previous paragraph is one such artificial discount factor. In addition to bounding the sum, we observe other justifications that have been made for introducing an artificial $\gamma$. They are explained in the following Secs 3.1, 3.2 and 3.3.

3.1 Approximation to the average reward (the gain) as $\gamma$ approaches 1

For recurrent MDPs, gain optimality is the most selective criterion (for an intuitive explanation of the selectivity of optimality criteria, refer to Fig 1) because there are no transient states (Feinberg and Shwartz, 2002, Sec 3.1). This implies that a gain optimal policy is also Blackwell optimal in recurrent MDPs, for which one should therefore target the gain optimality criterion. Nonetheless, the following relationships exist between the average-reward value $v_g$ and the discounted-reward value $v_\gamma$ of a policy $\pi \in \Pi_{\mathrm{S}}$.

Firstly, for every state $s_0 \in \mathcal{S}$,

$$\begin{aligned}
v_g(\pi, s_0) &= \lim_{\gamma \to 1}\ (1-\gamma)\, v_\gamma(\pi, s_0) && \text{(Puterman, 1994, Corollary 8.2.5)} \qquad (10) \\
&= (1-\gamma) \sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, v_\gamma(\pi, s), \quad \forall \gamma \in [0,1), && \text{(Singh et al., 1994, Sec 5.3)} \qquad (11)
\end{aligned}$$

where $p_\pi^\star(s|s_0)$ denotes the stationary probability of a state $s$, i.e. the long-run (steady-state) probability of being in state $s$ when the MC begins in $s_0$. Here, (10) is obtained by multiplying the left- and right-hand sides of (7) by $(1-\gamma)$, then taking the limit of both sides as $\gamma \to 1$. The derivation of (11) begins with taking the expectation of $v_\gamma(\pi, S)$ with respect to $p_\pi^\star$, then utilizes the discounted-reward Bellman (expectation) equation and the stationarity of $p_\pi^\star$. That is,

$$\begin{aligned}
\sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, \Big\{ v_\gamma(\pi, s) \Big\}
&= \sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0) \Big\{ \underbrace{r_\pi(s) + \gamma \sum_{s' \in \mathcal{S}} p_\pi(s'|s)\, v_\gamma(\pi, s')}_{v_\gamma(\pi,s) \text{ via the Bellman equation}} \Big\} \\
&= \underbrace{\sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, r_\pi(s)}_{\text{gain } v_g(\pi, s_0) \text{ in (3)}} + \gamma \sum_{s' \in \mathcal{S}} \underbrace{\sum_{s \in \mathcal{S}} p_\pi^\star(s|s_0)\, p_\pi(s'|s)}_{p_\pi^\star(s'|s_0) \text{ due to stationarity}} v_\gamma(\pi, s'),
\end{aligned}$$

which can be re-arranged to obtain (11). Here, $r_\pi(s) \coloneqq \sum_{a \in \mathcal{A}} \pi(a|s)\, r(s,a)$, whereas $p_\pi(s'|s)$ is defined in (8). It is interesting that any discount factor $\gamma \in [0,1)$ maintains the equality in (11), which was also proved by Sutton and Barto (2018, p254).
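Both identities can be verified numerically on a small chain; the sketch below (illustrative numbers) computes the gain via the stationary distribution, checks (11) for a $\gamma$ far from 1, and checks (10) as $\gamma \to 1$.

```python
import numpy as np

# Induced chain of a fixed policy (illustrative numbers; any aperiodic unichain P_pi works).
P_pi = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.4, 0.5]])
r_pi = np.array([1.0, -0.5, 2.0])
n = len(r_pi)

# Stationary distribution p_star (independent of s_0 in an aperiodic unichain MC).
A = np.vstack([P_pi.T - np.eye(n), np.ones(n)])
p_star, *_ = np.linalg.lstsq(A, np.concatenate([np.zeros(n), [1.0]]), rcond=None)
gain = p_star @ r_pi                                 # v_g = sum_s p*(s) r_pi(s), cf. Eq (3)

# Identity (11): holds for ANY gamma in [0, 1).
gamma = 0.3
v_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
print(gain, (1.0 - gamma) * p_star @ v_gamma)        # equal up to round-off

# Identity (10): (1 - gamma) v_gamma(s) -> v_g as gamma -> 1, for every state s.
for gamma in [0.9, 0.99, 0.999]:
    v_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)
    print(gamma, (1.0 - gamma) * v_gamma)            # each entry approaches the gain
```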

The second relationship pertains to the gradient of the gain when a parameterized policy $\pi(\boldsymbol{\theta})$ is used. By notationally suppressing the policy parameterization $\boldsymbol{\theta}$ and the dependency on $s_0$, as well as using $\nabla \coloneqq \partial/\partial\boldsymbol{\theta}$, this relation can be expressed as

$$\begin{aligned}
\nabla v_g(\pi) &= \lim_{\gamma \to 1} \Big\{ \sum_{s \in \mathcal{S}} \sum_{s' \in \mathcal{S}} p_\pi^\star(s) \underbrace{\sum_{a \in \mathcal{A}} p(s'|s,a)\, \pi(a|s)\, \nabla \log \pi(a|s)}_{\nabla p_\pi(s'|s)}\, v_\gamma(\pi, s') \Big\} \qquad (12) \\
&= \underbrace{\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} p_\pi^\star(s)\, \pi(a|s) \Big[ q_\gamma(\pi, s, a)\, \nabla \log \pi(a|s) \Big]}_{\text{involving the \emph{discounted} state-action value } q_\gamma} + \underbrace{(1-\gamma) \sum_{s \in \mathcal{S}} p_\pi^\star(s) \Big[ v_\gamma(\pi, s)\, \nabla \log p_\pi^\star(s) \Big]}_{\text{involving the \emph{discounted} state value } v_\gamma}, \qquad (13)
\end{aligned}$$

for all $\gamma \in [0,1)$. Notice that the right-hand sides (RHS's) of (12) and (13) involve the discounted-reward value functions, i.e. the state value function $v_\gamma(\pi, s), \forall s \in \mathcal{S}$ in (4) and the corresponding (state-)action value function $q_\gamma(\pi, s, a), \forall (s,a) \in \mathcal{S} \times \mathcal{A}$. They are related via $v_\gamma(\pi, s) = \mathbb{E}_{A \sim \pi}\big[q_\gamma(\pi, s, A)\big]$. The identity (12) was shown by Baxter and Bartlett (2001, Thm 2), whereas (13) was derived from (11) by Morimura et al. (2010, Appendix A).

Thus, for attaining average-reward optimality, one can maximize $v_\gamma$, but merely as an approximation to $v_g$ because the equality in (10) holds only in the limit, i.e. setting $\gamma$ exactly to 1 is prohibited by definition (4). A similar limiting behaviour applies to (12), where $v_\gamma$ is used to approximately compute the gain gradient $\nabla v_g$, yielding approximately gain-optimal policies. In particular, Jin and Sidford (2021, Lemma 3) proved that an $\epsilon$-gain-optimal policy is equivalent to an $\frac{\epsilon}{3(1-\gamma)}$-discounted-optimal policy with a certain $\gamma$ value that depends on $\epsilon$ and a parameter describing some property of the target MDP. Here, a policy $\pi$ is said to be $\varepsilon$-$x$-optimal, for any positive $\varepsilon$ and a criterion $x$, if its values satisfy $v_x(\pi, s) \geq v_x(\pi_x^*, s) - \varepsilon$ for all $s \in \mathcal{S}$.

The aforementioned approximation of $v_g$ by $v_\gamma$ with $\gamma \to 1$ seems to justify the use of the stationary state distribution $p_\pi^\star$ (instead of the (improper) discounted state distribution $p_\pi^\gamma$ in (14) below) to weight the errors in recurrent states in approximate discounted policy evaluation, see e.g. Dann et al. (2014, Eqn 4). In fact, the identity in (10) implies the following connection,

$$\underbrace{\boldsymbol{P}_{\!\pi}^\star \boldsymbol{r}_{\!\pi}}_{\boldsymbol{v}_g(\pi)} = \lim_{\gamma \to 1} (1-\gamma) \underbrace{\boldsymbol{P}_{\!\pi}^\gamma \boldsymbol{r}_{\!\pi}}_{\boldsymbol{v}_\gamma(\pi)}, \quad \text{such that } \boldsymbol{P}_{\!\pi}^\star = \lim_{\gamma \to 1} (1-\gamma)\, \boldsymbol{P}_{\!\pi}^\gamma, \ \text{hence } p_\pi^\star = \lim_{\gamma \to 1} \underbrace{(1-\gamma)\, p_\pi^\gamma}_{\text{a proper distribution}}, \qquad (14)$$
$$\text{where} \quad \boldsymbol{P}_{\!\pi}^\star = \lim_{t_{\mathrm{max}} \to \infty} \frac{1}{t_{\mathrm{max}}} \sum_{t=0}^{t_{\mathrm{max}}-1} \boldsymbol{P}_{\!\pi}^t, \quad \text{and} \quad \boldsymbol{P}_{\!\pi}^\gamma = \lim_{t_{\mathrm{max}} \to \infty} \sum_{t=0}^{t_{\mathrm{max}}-1} (\gamma \boldsymbol{P}_{\!\pi})^t. \qquad (15)$$

Here, the $s_0$-th row of $\boldsymbol{P}_{\!\pi}^\star$ contains the probability values of the stationary state distribution $p_\pi^\star(\cdot|s_0)$. On the other hand, the $s_0$-th row of $\boldsymbol{P}_{\!\pi}^\gamma$ contains the values of the improper discounted state distribution $p_\pi^\gamma(\cdot|s_0)$, which is improper because $\sum_{s' \in \mathcal{S}} p_\pi^\gamma(s'|s_0) = 1/(1-\gamma) \neq 1$. Importantly, this connection suggests that $p_\pi^\star$ is suitable as the weights in the discounted value approximator's error function only when $\gamma \to 1$; otherwise, $p_\pi^\gamma$ may be more suitable.
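The matrix limits in (14)-(15) can likewise be inspected numerically: $\boldsymbol{P}_{\!\pi}^\gamma$ equals $(\boldsymbol{I} - \gamma\boldsymbol{P}_{\!\pi})^{-1}$ by (15), its rows sum to $1/(1-\gamma)$, and scaling it by $(1-\gamma)$ approaches $\boldsymbol{P}_{\!\pi}^\star$ as $\gamma \to 1$. A short sketch (illustrative chain) follows.

```python
import numpy as np

P_pi = np.array([[0.6, 0.3, 0.1],
                 [0.2, 0.5, 0.3],
                 [0.1, 0.4, 0.5]])
n = P_pi.shape[0]

# Limiting (stationary) matrix P_star: Cesaro average of the powers of P_pi, Eq (15).
P_star, Pt, T = np.zeros((n, n)), np.eye(n), 20_000
for _ in range(T):
    P_star += Pt / T
    Pt = Pt @ P_pi

# Improper discounted "distribution" matrix P^gamma = (I - gamma P_pi)^{-1}, Eq (15).
for gamma in [0.9, 0.99, 0.999]:
    P_gamma = np.linalg.inv(np.eye(n) - gamma * P_pi)
    print(gamma,
          P_gamma[0].sum(),                                  # rows sum to 1/(1-gamma): improper
          np.max(np.abs((1.0 - gamma) * P_gamma - P_star)))  # -> 0 as gamma -> 1, cf. Eq (14)
```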

There is also an approximation in (11) whenever $v_\gamma$ is weighted by some initial state distribution $\mathring{p}$ (such as in (18)) or by transient state distributions $p_\pi^t$, which generally differ from $p_\pi^\star$. In (13), the second RHS term is typically ignored since calculating $\nabla \log p_\pi^\star(s)$ is difficult in RL settings (this difficulty of computing the gradient of state distributions motivates the development of the policy gradient theorem (Sutton and Barto, 2018, Ch 13.2)). Consequently, $\nabla v_g(\pi)$ is approximated (closely whenever $\gamma$ is close to 1) by the first RHS term of (13), and then by sampling the state $S \sim p_\pi^\star$ (after the agent interacts long "enough" with its environment).

Moreover, approximately maximizing the average reward via discounting is favourable because the discounting formulation has several mathematical virtues, as described in Sec 3.3.

3.2 A technique for attaining the most selective optimality with $\gamma \in [\gamma_{\mathrm{Bw}}, 1)$

As discussed in the previous Sec 3.1, $\gamma$-discounted optimality approximates gain optimality as $\gamma \to 1$. This is desirable since gain optimality is the most selective in recurrent MDPs. In unichain MDPs, however, gain optimality is generally underselective since the gain ignores the rewards earned in transient states (for an illustrative example, see Fig 1). Consequently, multiple gain-optimal policies may prescribe different action selections (earning different rewards) in transient states. The underselectiveness of gain optimality (equivalent to $(n=-1)$-discount optimality) can be refined up to the most selective optimality by increasing the value of $n$ from $-1$ to $0$ (or higher if needed, up to $n = |\mathcal{S}| - 2$ for unichain MDPs) in the family of $n$-discount optimality criteria (Sec 2.1).

Interestingly, such a remedy towards the most selective criterion can also be achieved by specifying a discount factor $\gamma$ that lies in the Blackwell interval, i.e. $\gamma \in [\gamma_{\mathrm{Bw}}, 1)$, for the $\gamma$-discounted optimality (5). This is because the resulting $\pi^*_{\gamma \in [\gamma_{\mathrm{Bw}}, 1)}$, which is also called a Blackwell optimal policy, is also optimal for all $n = -1, 0, \ldots$ in $n$-discount optimality (Puterman, 1994, Thm 10.1.5). Moreover, Blackwell optimality is always the most selective regardless of the MDP classification, inheriting the classification-invariance property of $\gamma$-discounted optimality. Thus, artificial discounting can be interpreted as a technique to attain the most selective criterion (i.e. Blackwell optimality) whenever $\gamma \in [\gamma_{\mathrm{Bw}}, 1)$, not only in recurrent but also in unichain MDPs, as well as in the most general multichain MDPs.

[Figure 1: a three-state chain diagram with states $s^0, s^1, s^2$ and edge rewards 2, 1, 1, and 0; see the caption below.]
Figure 1: A symbolic diagram of a unichain MDP with $\mathcal{S} = \{s^0, s^1, s^2\}$ and deterministic transitions. There are two stationary policies that differ in their blue or red action selection in the left-most state $s^0$; hence we call them the red and blue policies. Under both policies, the right-most state $s^2$ is a 0-reward (absorbing) recurrent state $s_{\mathrm{zrat}}$, hence this unichain MDP is an $s_{\mathrm{zrat}}$-model. Both policies are gain-optimal and $(n=0)$-discount optimal, but only the blue policy is $(n=1)$-discount optimal. Therefore, $(n = 1 = |\mathcal{S}| - 2)$-discount optimality is the most selective, which is equivalent to Blackwell optimality. In particular, for this simple environment, the Blackwell discount factor is trivially $\gamma_{\mathrm{Bw}} = 0$. Note that the total reward and the $(n=0)$-discount values of both policies are equal, i.e. $v_{\mathrm{tot}}(\mathrm{red}, s^0) = v_{\mathrm{tot}}(\mathrm{blue}, s^0) = v_{n=0}(\mathrm{red}, s^0) = v_{n=0}(\mathrm{blue}, s^0) = 2$, such that the total reward and $(n=0)$-discount criteria are underselective. This example MDP is adopted from Puterman (1994, Fig 10.1.1) and Mahadevan (1996b, Fig 1).
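The claims in the caption can be reproduced with a few lines of code. The sketch below assumes the transition and reward structure that the caption implies (our reading of the figure: the blue action moves from $s^0$ directly to $s^2$ with reward 2, the red action moves from $s^0$ to $s^1$ with reward 1, $s^1$ moves to $s^2$ with reward 1, and $s^2$ is 0-reward absorbing) and compares the two policies' Laurent coefficients and discounted values.

```python
import numpy as np

def laurent_terms(P, r):
    """Gain, bias, and the (n=1)-coefficient of the Laurent expansion (6) for a chain (P, r)."""
    n = len(r)
    P_star = np.linalg.matrix_power(P, 10_000)            # limiting matrix (aperiodic absorbing chain)
    H = np.linalg.inv(np.eye(n) - P + P_star) - P_star    # deviation matrix
    return P_star @ r, H @ r, -(H @ H) @ r                # v_{-1}, v_0, v_1

# Assumed reading of Fig 1: states [s0, s1, s2], with s2 absorbing at reward 0.
P_blue = np.array([[0, 0, 1], [0, 0, 1], [0, 0, 1]], dtype=float)  # blue: s0 -> s2, reward 2
r_blue = np.array([2.0, 1.0, 0.0])
P_red = np.array([[0, 1, 0], [0, 0, 1], [0, 0, 1]], dtype=float)   # red: s0 -> s1, reward 1
r_red = np.array([1.0, 1.0, 0.0])

for name, P, r in [("blue", P_blue, r_blue), ("red", P_red, r_red)]:
    gain, bias, v1 = laurent_terms(P, r)
    print(name, "gain:", gain[0], "bias:", bias[0], "v_1:", v1[0])
    for gamma in [0.0, 0.5, 0.9]:
        v_gamma = np.linalg.solve(np.eye(3) - gamma * P, r)
        print("  gamma =", gamma, "v_gamma(s0) =", v_gamma[0])

# Both policies have gain 0 and bias 2; only blue maximizes the (n=1)-coefficient, and
# v_gamma(blue, s0) = 2 >= 1 + gamma = v_gamma(red, s0) for every gamma in [0, 1), so gamma_Bw = 0.
```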

Targeting Blackwell optimality (instead of gain optimality) is imperative, especially for episodic environments, i.e. those with at least one terminal state, where the agent-environment interaction terminates once the agent enters a terminal state. Such environments are commonly modelled as infinite-horizon MDPs, so that the stationary policy set $\Pi_{\mathrm{S}}$ is a sufficient space in which to look for optimal policies. (In finite-horizon MDPs, the optimal policy is generally non-stationary: it depends on the number of remaining timesteps in addition to the state. Intuitively, if we (approximately) had only 1 week to live, would we act the same way as if we (approximately) had 10 years to live? In basketball, a long shot attempt (from beyond half court) is optimal only in the final seconds of the game.) Such modelling is carried out by augmenting the state set with a 0-reward absorbing terminal state (denoted $s_{\mathrm{zrat}}$), as shown in Fig 4(a). This yields a unichain MDP with a non-empty set of transient states. For such $s_{\mathrm{zrat}}$-models, the gain is trivially 0 for all stationary policies, so that gain optimality is underselective. The $(n=0)$-discount optimality improves the selectivity. It may be the most selective (hence equivalent to Blackwell optimality) in some cases; otherwise, it is underselective as well, and some higher $n$-discount optimality criterion should be used, e.g. $n=1$ for the MDP shown in Fig 1. It is also worth noting that in $s_{\mathrm{zrat}}$-models, $(n=0)$-discount optimality is equivalent to total reward optimality, whose $v_{\mathrm{tot}}$ (2) is finite (Puterman, 1994, Prop 10.4.2). The total reward optimality may therefore also be underselective in some cases.

Towards obtaining Blackwell optimal policies in unichain MDPs, the relationship between maximizing the $\gamma$-discounted and $n$-discount criteria can be summarized as follows (see also Fig 6).

$$\underbrace{\operatorname*{arg\,max}_{\pi \in \Pi_{\mathrm{S}}} v_{\gamma \in [\gamma_{\mathrm{Bw}},1)}(\pi, s)}_{\text{Blackwell-optimal policies}} = \begin{cases}
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{\mathrm{S}}} \Big[\underbrace{v_{-1}(\pi,s) = \lim\limits_{\gamma \to 1} (1-\gamma)\, v_\gamma(\pi,s)}_{\text{based on (10)}}\Big] & \text{if MDPs are recurrent (see Sec 3.1)}, \\
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{n=-1}^*} \Big[\underbrace{v_0(\pi,s) = \lim\limits_{\gamma \to 1} \tfrac{(1-\gamma)\, v_\gamma(\pi,s) \,-\, v_{-1}(\pi,s)}{1-\gamma}}_{\text{based on (7)}}\Big] & \text{if MDPs are unichain and $(n=0)$-discount is the most selective}, \\
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{n-1}^*} \big[v_n(\pi,s)\big] \ \text{for } n = 1, 2, \ldots, (|\mathcal{S}|-3) & \text{if MDPs are unichain and $n$-discount is the most selective}, \\
\operatorname*{arg\,max}\limits_{\pi \in \Pi_{n=(|\mathcal{S}|-3)}^*} \big[v_{n=(|\mathcal{S}|-2)}(\pi,s)\big] & \text{if MDPs are unichain},
\end{cases} \qquad (16)$$

for all $s \in \mathcal{S}$, where $\Pi_n^*$ denotes the $n$-discount optimal policy set for $n = -1, 0, \ldots$.

From the second case in the RHS of (16), we know that $v_0$ can be computed by taking the limit of a function involving $v_\gamma$ and $v_{-1} (= v_g)$ as $\gamma$ approaches 1. This means that for unichain MDPs, Blackwell optimal policies may be obtained by setting $\gamma$ very close to 1, similar to the recurrent case (the first case), but not the same, since the function of which the limit is taken differs.

In practice, the limits in the first and second cases in (16) are computed approximately using a discount factor $\gamma$ close to unity, which is likely in the Blackwell interval, i.e. $\gamma_{\mathrm{Bw}} \leq (\gamma \approx 1) < 1$. Paradoxically, however, this does not necessarily attain Blackwell optimality because the finite-precision computation involving $\gamma \approx 1$ yields quite accurate estimates of the limit values: maximizing the first and second cases in the RHS of (16) with a high $\gamma \approx 1$ attains (approximately) gain ($n=-1$) and bias ($n=0$) optimality, respectively, which may be underselective in unichain MDPs. Consequently, in practice, the most selective Blackwell optimality is achieved using a $\gamma$ that is at least as high as $\gamma_{\mathrm{Bw}}$ but not too close to 1. That is, $\gamma_{\mathrm{Bw}} \leq \gamma \leq \bar{\gamma} < 1$ for some practical upper bound $\bar{\gamma}$, which is computation (hence, implementation) dependent.

3.3 Contraction, variance reduction, and independence from MDP classification

Discounted reward optimality is easier to deal with than its average reward counterpart. This can be attributed to three main factors as follows.

Firstly, the discounted-reward theory holds regardless of the classification of the induced MCs (Sec 2.2), whereas that of the average reward involves such classification. Because the transition probability $p(s'|s,a)$ is unknown in RL settings, average-reward algorithms require estimation of, or assumptions about, the chain classification, specifically whether it is unichain or multichain. Nevertheless, note that such an (assumed) classification is needed in order to apply a specific (simpler) class of average-reward algorithms: it leverages the fact that a unichain MDP has a single scalar gain (associated with its single chain) that is constant across all states, whereas a multichain MDP generally has different gain values associated with its multiple chains.

Secondly, the discounted-reward Bellman optimality operator $\mathbb{B}_\gamma^*$ is contractive, where the discount factor $\gamma$ serves as the contraction modulus (hence, $\mathbb{B}_\gamma^*$ is a $\gamma$-contraction). That is,

$$\|\mathbb{B}_\gamma^*[\boldsymbol{v}] - \mathbb{B}_\gamma^*[\boldsymbol{v}']\|_\infty \leq \gamma\, \|\boldsymbol{v} - \boldsymbol{v}'\|_\infty, \quad \text{for any vectors } \boldsymbol{v}, \boldsymbol{v}' \in \mathbb{R}^{|\mathcal{S}|},$$

where, in state-wise form, $\mathbb{B}_\gamma^*[\boldsymbol{v}](s) \coloneqq \max_{a \in \mathcal{A}} \{ r(s,a) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v(s') \}$ for all $s \in \mathcal{S}$, and $\|\boldsymbol{v}\|_\infty \coloneqq \max_{s \in \mathcal{S}} |v(s)|$ denotes the maximum norm. This means that $\mathbb{B}_\gamma^*$ brings $\boldsymbol{v}$ and $\boldsymbol{v}'$ closer together by at least a factor of $\gamma$, such that the sequence of iterates $\boldsymbol{v}^{k+1} \leftarrow \mathbb{B}_\gamma^*[\boldsymbol{v}^k]$ converges to the unique fixed point of $\mathbb{B}_\gamma^*$ as $k \to \infty$ from any initial $\boldsymbol{v}^{k=0} \in \mathbb{R}^{|\mathcal{S}|}$, by Banach's fixed-point theorem (Szepesvári, 2010, Thm A.9). In particular, as $k \to \infty$, we obtain $\mathbb{B}_\gamma^*[\boldsymbol{v}^k] = \boldsymbol{v}^k = \boldsymbol{v}_\gamma^*$, where $\boldsymbol{v}_\gamma^*$ denotes the optimal discounted value, i.e. $\boldsymbol{v}_\gamma(\pi_\gamma^*)$. Thus, $\lim_{k \to \infty} \|\boldsymbol{v}^k - \boldsymbol{v}_\gamma^*\|_\infty = 0$. In the absence of $\gamma$, the contraction no longer holds. This is the case for the average-reward Bellman optimality operator $\mathbb{B}_g^*[\boldsymbol{v}](s) \coloneqq \max_{a \in \mathcal{A}} \{ r(s,a) + \sum_{s' \in \mathcal{S}} p(s'|s,a)\, v(s') \}$, $\forall s \in \mathcal{S}$. As a result, the basic value iteration based on $\mathbb{B}_g^*$ is not guaranteed to converge (Mahadevan, 1996a, Sec 2.4.3).
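The geometric convergence implied by the $\gamma$-contraction is easy to observe: iterate the discounted Bellman optimality operator and watch the per-sweep change shrink by (at least) a factor of $\gamma$. A minimal sketch with an illustrative random MDP follows.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s'] = p(s'|s, a)
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r[s, a]

def bellman_optimality(v):
    """B*_gamma[v](s) = max_a { r(s, a) + gamma * sum_s' p(s'|s, a) v(s') }."""
    return np.max(r + gamma * (P @ v), axis=1)

v, prev_change = np.zeros(n_states), None
for k in range(60):
    v_new = bellman_optimality(v)
    change = np.max(np.abs(v_new - v))        # sup-norm distance between successive iterates
    if prev_change is not None:
        # Contraction: successive changes shrink by at least a factor of gamma.
        assert change <= gamma * prev_change + 1e-12
    prev_change, v = change, v_new

print("approximate v*_gamma:", v)
```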

The aforementioned contraction property also applies to the discounted-reward Bellman expectation operator, i.e. $\mathbb{B}_\gamma^\pi[\boldsymbol{v}](s) \coloneqq \mathbb{E}_{A \sim \pi}\big[ r(s,A) + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,A)\, v(s') \big]$, $\forall s \in \mathcal{S}$. As a consequence of Banach's fixed-point theorem, we have $\lim_{k \to \infty} \|\boldsymbol{v}^k - \boldsymbol{v}_\gamma^\pi\|_\infty = 0$, where $\boldsymbol{v}^{k+1} \leftarrow \mathbb{B}_\gamma^\pi[\boldsymbol{v}^k]$ is iteratively applied to any initial $\boldsymbol{v}^{k=0} \in \mathbb{R}^{|\mathcal{S}|}$ for $k = 0, 1, \ldots$, whereas $\boldsymbol{v}_\gamma^\pi \coloneqq \boldsymbol{v}_\gamma(\pi)$ is the discounted policy value of a policy $\pi$. In other words, $\boldsymbol{v}_\gamma^\pi$ is the unique fixed point of $\mathbb{B}_\gamma^\pi$, such that $\boldsymbol{v}_\gamma^\pi = \mathbb{B}_\gamma^\pi[\boldsymbol{v}_\gamma^\pi]$, which is known as the discounted-reward Bellman evaluation equation ($\gamma$-BEE). This $\gamma$-BEE is further utilized to derive a TD-based parametric value approximator whose convergence relies on the fact that $\gamma \in [0,1)$. That is, the approximator's error-minimizer formula involves the inverse $(\boldsymbol{I} - \gamma \boldsymbol{P}_{\!\pi})^{-1}$, which exists (Sutton and Barto, 2018, p206). This is in contrast to the matrix $(\boldsymbol{I} - \boldsymbol{P}_{\!\pi})$, whose inverse does not exist (Puterman, 1994, p596).
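Correspondingly, discounted policy evaluation can be carried out either by iterating $\mathbb{B}_\gamma^\pi$ or in closed form via the inverse $(\boldsymbol{I} - \gamma\boldsymbol{P}_{\!\pi})^{-1}$, which exists for $\gamma \in [0,1)$; both routes agree, as the sketch below (illustrative arrays) shows.

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))            # r[s, a]
pi = rng.dirichlet(np.ones(n_actions), size=n_states)             # pi[s, a] = pi(a|s)

# Induced chain and expected reward of the policy (Eq (8)).
P_pi = np.einsum('sa,sap->sp', pi, P)
r_pi = np.einsum('sa,sa->s', pi, r)

# Closed form: the gamma-BEE v = r_pi + gamma P_pi v solved exactly via (I - gamma P_pi)^{-1}.
v_closed = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Iterative form: repeatedly apply the Bellman expectation operator B^pi_gamma.
v = np.zeros(n_states)
for _ in range(500):
    v = r_pi + gamma * (P_pi @ v)

print(np.max(np.abs(v - v_closed)))   # essentially zero: both routes give v^pi_gamma
```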

In addition, the contractive nature induced by $\gamma$ is also utilized for computing state similarity metrics (Castro, 2020; Ferns et al., 2006). There, $\gamma$ guarantees the convergence of the metric operator to its fixed point (when such an operator is iteratively applied). Note that the notion of state similarity plays an important role in, for example, state aggregation and state representation learning.

Third, discounting can be used to reduce the variance of, for example, policy gradient estimates, at the cost of bias errors (Baxter and Bartlett, 2001, Thm 3; Kakade, 2001, Thm 1). In particular, the variance increases (and the bias decreases) as a function of $1/(1-\gamma)=\sum_{t=0}^{\infty}\gamma^{t}$, because the effective number of timesteps (the effective horizon) is controlled by $\gamma$. This is also related to the fact that the infinite-horizon discounted reward $v_{\gamma}$ (4) can be $\epsilon$-approximated by a finite horizon $\tau$ proportional to $\log_{\gamma}(1-\gamma)$, as noted by Tang and Abbeel, (2010, Footnote 1). That is, the truncation error satisfies

$\mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^{t}r(S_{t},A_{t})\Big]-\mathbb{E}\Big[\sum_{t=0}^{\tau-1}\gamma^{t}r(S_{t},A_{t})\Big]=\mathbb{E}\Big[\sum_{t=\tau}^{\infty}\gamma^{t}r(S_{t},A_{t})\Big]\leq\sum_{t=\tau}^{\infty}\gamma^{t}r_{\mathrm{max}}=\frac{\gamma^{\tau}r_{\mathrm{max}}}{1-\gamma}.$

Hence, the truncation error is at most a tolerance $\epsilon$ whenever $\gamma^{\tau}r_{\mathrm{max}}/(1-\gamma)\leq\epsilon$, which re-arranges to $\gamma^{\tau}\leq((1-\gamma)\epsilon)/r_{\mathrm{max}}$. Taking the logarithm with base $\gamma$ of both sides (which reverses the inequality, since $\log_{\gamma}$ is decreasing for $\gamma\in(0,1)$) yields

$\tau\geq\log_{\gamma}\dfrac{(1-\gamma)\epsilon}{r_{\mathrm{max}}},\quad\text{hence the smallest such finite horizon is}\ \tau=\Big\lceil\log_{\gamma}\dfrac{(1-\gamma)\epsilon}{r_{\mathrm{max}}}\Big\rceil,$

where $\lceil x\rceil$ denotes the smallest integer greater than or equal to $x\in\mathbb{R}$.
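As a concrete illustration (a minimal sketch; the helper name and the example constants are ours), this smallest finite horizon can be computed as follows:

import math

def effective_horizon(gamma, eps, r_max):
    """Smallest horizon tau whose truncation error is at most eps,
    i.e. tau = ceil(log_gamma((1 - gamma) * eps / r_max))."""
    assert 0.0 < gamma < 1.0
    tau = math.log((1.0 - gamma) * eps / r_max) / math.log(gamma)
    return max(0, math.ceil(tau))

# For example, gamma = 0.99, eps = 0.01, r_max = 1 gives tau = 917.
print(effective_horizon(0.99, 0.01, 1.0))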

4 Artificial discount factors are sensitive and troublesome

The artificial discount factor $\gamma$ (which is part of the solution method) is said to be sensitive because the performance of RL methods often depends largely on $\gamma$. Fig 2(a) illustrates this phenomenon using $Q_{\gamma}$-learning with various $\gamma$ values. As can be seen, higher $\gamma$ leads to slower convergence, whereas lower $\gamma$ leads to suboptimal policies (with respect to the most selective criterion, which in this case is gain optimality, since the MDP is recurrent). This trade-off is elaborated further in Secs 4.1 and 4.2. The sensitivity to $\gamma$ has also been observed in the error (and even in the convergence or divergence) of approximate policy evaluation methods with function approximators (Scherrer, 2010, Fig 1; Sutton and Barto, 2018, Example 11.1).

Additionally, Fig 3 shows the effect of various $\gamma$ values on the optimization landscapes on which policy gradient methods look for the maximizer of the discounted value $v_{\gamma}$. It is evident that different $\gamma$ values induce different maximizing policy parameters. From the 2D visualization in Fig 3, we can observe that the landscape of $v_{\gamma}$ begins to resemble that of the gain $v_{g}$ when $\gamma$ is around $\gamma_{\mathrm{Bw}}$, and resembles it more and more closely as $\gamma$ approaches 1. When $\gamma$ is well below $\gamma_{\mathrm{Bw}}$, the maximizer of $v_{\gamma}$ does not coincide with that of $v_{g}$. In such cases, some local maximizer of $v_{\gamma}$ may be desirable (instead of the global one) because of its proximity to the maximizer of $v_{g}$, which represents the optimal point of the most selective criterion for the recurrent MDPs examined in Fig 3.

The artificial $\gamma$ is troublesome because its critical value, i.e. $\gamma_{\mathrm{Bw}}$, is difficult to determine, even in DP where the transition and reward functions are known (Hordijk et al., 1985). This is exacerbated by the fact that $\gamma_{\mathrm{Bw}}$ is specific to each environment instance (even within the same environment family, as shown in Fig 2(b)). Nevertheless, knowing this critical value $\gamma_{\mathrm{Bw}}$ is always desirable. For example, although gain optimality can be attained by setting $\gamma$ very close to 1, setting $\gamma\leftarrow\gamma_{\mathrm{Bw}}$ (or some value around it) leads not only to convergence to the optimal gain (or close to it) but also to faster convergence, as demonstrated by $Q_{\gamma}$-learning (Fig 2(a)). We can also observe visually in Fig 3 that the discounted $v_{\gamma\approx\gamma_{\mathrm{Bw}}}$-landscape already resembles the gain $v_{g}$-landscape. Thus, for obtaining the gain-optimal policy in recurrent MDPs, $\gamma$ does not need to be too close to 1 (as long as it is larger than or equal to $\gamma_{\mathrm{Bw}}$); see also (16).

Apart from that, $\gamma$ is troublesome because some derivations involving it demand extra care, e.g. for handling the improper discounted state distribution $p_{\pi}^{\gamma}$ (14) in discounted-reward policy gradient algorithms (Nota and Thomas, 2020; Thomas, 2014).

(a) Learning curves of $Q_{\gamma}$-learning with varying $\gamma$ values on GridNav-25, which is episodic but modelled using $s_{\mathrm{rst}}$ (Sec 5.5). Empirically, the critical discount factor is $\gamma_{\mathrm{Bw}}\approx 0.83$.
(b) The critical $\gamma_{\mathrm{Bw}}$ as a (non-trivial) function of the number of states, and of a constant in the reward function, on three environment families.
Figure 2: Empirical results illustrating the sensitivity and troublesomeness of artificial discount factors $\gamma$. In the left sub-figure (a), gain optimality is the most selective criterion (since the MDP is recurrent), and the optimal gain $v_{g}(\pi_{g}^{*})$ is indicated by the black solid horizontal line at the top of the plot. For the experimental setup of the environments and learning methods, see Sec 7.

4.1 Higher discount factors lead to slower convergence

According to (10), increasing the discount factor $\gamma$ closer to 1 makes the scaled discounted reward $(1-\gamma)v_{\gamma}$ approximate the average reward $v_{g}$ more closely. This means that a discounted-reward method with such a setting obtains more accurate estimates of gain-optimal policies. However, it suffers from a lower rate of convergence (to the approximate gain optimality), as well as from numerical issues (since, for example, it involves the term $1/(1-\gamma)$, which blows up as $\gamma\to 1$). This becomes unavoidable whenever $\gamma_{\mathrm{Bw}}$ is indeed very close to unity, because the most selective Blackwell optimality (equivalent to gain optimality in recurrent MDPs) requires that $\gamma\geq\gamma_{\mathrm{Bw}}$.

The slow convergence can be explained by examining the effect of the effective horizon induced by $\gamma$. That is, as $\gamma$ approaches 1, the reward information is propagated to more states (Beleznay et al., 1999, Fig 1). From discounted policy gradient methods, we also know that an i.i.d. state sample from the discounted state distribution $p_{\pi}^{\gamma}$ in (14) is the last state of a trajectory whose length is drawn from a geometric distribution $Geo(p=1-\gamma)$; see Kumar et al., (2020, Algo 3). Evidently, the closer $\gamma$ is to 1, the longer the required trajectory. Also recall that such a geometric distribution has a mean of $1/(1-\gamma)$ and a variance of $\gamma/(1-\gamma)^{2}$, which blow up as $\gamma\to 1$.
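The following minimal sketch (our illustration, loosely following the sampling scheme referenced above) draws one state distributed according to $p_{\pi}^{\gamma}$ by truncating a rollout at a geometrically distributed length; the env.reset/env.step interface and the policy callable are assumptions in the style of OpenAI Gym.

import numpy as np

def sample_discounted_state(env, policy, gamma, rng=np.random.default_rng()):
    """Return the last state of a rollout whose length T satisfies
    P(T = t) = (1 - gamma) * gamma**t, so the returned state is a draw
    from the discounted state distribution p_pi^gamma."""
    horizon = rng.geometric(p=1.0 - gamma) - 1  # support {0, 1, 2, ...}
    state = env.reset()
    for _ in range(horizon):
        action = policy(state)
        state, _reward, _done, _info = env.step(action)
    return state

A longer rollout is needed on average as $\gamma$ approaches 1, which is one source of the slower convergence discussed above.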

There are numerous works that prove and demonstrate the slow convergence due to higher $\gamma$. From them, we understand that the error (hence, the iteration/sample complexity) essentially grows as a function of $1/(1-\gamma)$. Those works include (Thrun and Schwartz, 1993, Fig 3; Melo and Ribeiro, 2007, Thm 1; Fan et al., 2020, Thm 4.4) for Q-learning with function approximators, (Sutton and Barto, 2018, Eqn 9.14, 12.8) for TD learning, and (Agarwal et al., 2019, Table 1, 2) for policy gradient methods. We note that for a specific environment type and with an additional hyperparameter, Devraj and Meyn, (2020) proposed a variant of Q-learning whose sample complexity is independent of $\gamma$.

4.2 Lower discount factors likely lead to suboptimal policies

Setting $\gamma$ further from 1 such that $\gamma<\gamma_{\mathrm{Bw}}$ yields $\gamma$-discounted optimal policies that are suboptimal with respect to the most selective criterion (see Fig 2(a)). From the gain optimality standpoint, lower $\gamma$ makes $(1-\gamma)v_{\gamma}$ deviate from $v_{g}$ in the order of $\mathcal{O}(1-\gamma)$, as shown by Baxter and Bartlett, (2001). More generally, based on (16), $\gamma<\gamma_{\mathrm{Bw}}$ induces an optimal policy $\pi_{\gamma<\gamma_{\mathrm{Bw}}}^{*}$ that is not Blackwell optimal (hence, $\pi_{\gamma<\gamma_{\mathrm{Bw}}}^{*}$ is also not gain optimal in recurrent MDPs). This raises the question: is it ethical to run a suboptimal policy (due to misspecifying the optimality criterion) in perpetuity?

For a parameterized policy in recurrent MDPs, $\gamma<\gamma_{\mathrm{Bw}}$ induces a discounted $v_{\gamma}$-landscape that is different from the gain $v_{g}$-landscape. Fig 3 shows such a discrepancy, which becomes more significant as $\gamma$ is set further below 1. Therefore, the maximizing parameters do not coincide, i.e. $\operatorname*{\arg\max}_{\boldsymbol{\theta}}v_{\gamma<\gamma_{\mathrm{Bw}}}(\pi(\boldsymbol{\theta}))\neq\operatorname*{\arg\max}_{\boldsymbol{\theta}}v_{g}(\pi(\boldsymbol{\theta}))\eqqcolon\boldsymbol{\theta}_{g}^{*}$, where $\boldsymbol{\theta}\in\Theta$ denotes the policy parameter in some parameter set. Interestingly, $v_{\gamma<\gamma_{\mathrm{Bw}}}(\pi(\boldsymbol{\theta}_{g}^{*}))$ is a local maximum in the $v_{\gamma}$-landscape, so that the $(\gamma<\gamma_{\mathrm{Bw}})$-discounted-reward optimization is ill-posed in the sense that its (global) maximum is not what we desire in terms of obtaining an optimal policy with respect to the most selective criterion (that is, the gain optimal policy $\pi(\boldsymbol{\theta}_{g}^{*})$ in recurrent MDPs).

Petrik and Scherrer, (2008, Thm 2) established the following error bound due to misspecifying a discount factor $\gamma<\gamma_{\mathrm{Bw}}$. That is,

$\|\boldsymbol{v}_{\gamma_{\mathrm{Bw}}}^{*}-\boldsymbol{v}_{\gamma}^{*}\|_{\infty}\leq\dfrac{(\gamma_{\mathrm{Bw}}-\gamma)\,r_{\mathrm{max}}}{(1-\gamma)(1-\gamma_{\mathrm{Bw}})},\quad\text{for }\boldsymbol{v}_{\gamma_{\mathrm{Bw}}}^{*},\boldsymbol{v}_{\gamma}^{*}\in\mathbb{R}^{|\mathcal{S}|}\text{, and }\boldsymbol{v}_{\gamma}^{*}\coloneqq\boldsymbol{v}_{\gamma}(\pi_{\gamma}^{*})\text{ for any }\gamma\in[0,1).$

Subsequently, Jiang et al., (2016) refined the above error bound by taking into account the transition and reward functions. They also highlighted that such an error, as a function of $\gamma$, is not monotonically decreasing (with increasing $\gamma$). This is consistent with what Smallwood, (1966) observed, i.e. multiple disconnected $\gamma$-intervals that share the same $\gamma$-discounted optimal policy. We note that specifically for sparse-reward environments (where non-zero rewards are not received at every timestep), a lower discount factor $\gamma<\gamma_{\mathrm{Bw}}$ is likely to improve the performance of RL algorithms (Petrik and Scherrer, 2008, Thm 10). They argue that lowering $\gamma$ decreases the value approximation error $\|\boldsymbol{v}_{\gamma}^{*}-\hat{\boldsymbol{v}}_{\gamma}^{*}\|_{\infty}$ more significantly than it increases the $\gamma$-misspecification error $\|\boldsymbol{v}_{\gamma_{\mathrm{Bw}}}^{*}-\boldsymbol{v}_{\gamma}^{*}\|_{\infty}$. Here, $\hat{\boldsymbol{v}}_{\gamma}^{*}$ denotes an approximation to $\boldsymbol{v}_{\gamma}^{*}$.
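For intuition about how quickly the bound above loosens, it is easy to evaluate numerically (a minimal sketch; the function name and example values are ours):

def misspecification_bound(gamma, gamma_bw, r_max):
    """Upper bound on ||v*_{gamma_Bw} - v*_gamma||_inf from Petrik and Scherrer (2008, Thm 2)."""
    assert 0.0 <= gamma <= gamma_bw < 1.0
    return (gamma_bw - gamma) * r_max / ((1.0 - gamma) * (1.0 - gamma_bw))

# With gamma_Bw = 0.95 and r_max = 1, lowering gamma from 0.9 to 0.5
# loosens the bound from 10.0 to 18.0.
print(misspecification_bound(0.9, 0.95, 1.0), misspecification_bound(0.5, 0.95, 1.0))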

Panels (left to right per row): for Chain (top row), $\gamma=0.00,0.35,0.50,0.70,0.85\approx\gamma_{\mathrm{Bw}},0.95$, and the gain; for Taxicab (middle row), $\gamma=0.00,0.35,0.50,0.80\approx\gamma_{\mathrm{Bw}},0.85,0.95$, and the gain; for Torus (bottom row), $\gamma=0.35,0.50,0.55\approx\gamma_{\mathrm{Bw}},0.60,0.75,0.95$, and the gain.
Figure 3: Policy-value landscapes as a function of two policy parameters $\boldsymbol{\theta}\in\mathbb{R}^{2}$ (i.e. the horizontal and vertical axes) on three environments (from top to bottom rows: Chain, Taxicab, and Torus, which all induce recurrent MDPs where gain optimality is the most selective). All columns, except the rightmost (the gain $v_{g}$), correspond to the scaled discounted reward $(1-\gamma)v_{\gamma}$ of randomized policies $\pi(\boldsymbol{\theta})$, measured from a single initial state. The colour maps the lowest value to dark blue and the highest value to yellow. These lowest and highest values correspond, respectively, to the "worst" and the optimal deterministic stationary policies with respect to the corresponding criterion (so that different columns, hence subplots, have different lowest and highest values). As anticipated, the scaled $v_{\gamma}$-landscape becomes more and more similar to its gain counterpart as $\gamma$ approaches 1. In particular, $\gamma\geq\gamma_{\mathrm{Bw}}$ induces a scaled $v_{\gamma}$-landscape whose global maximum coordinate coincides with that of the gain. Note that the gain landscape in the bottom row only has a yellowish region because there is a significant difference between the maximum gain $v_{g}^{*}$ of the optimal deterministic policy and the maximum gain $\hat{v}_{g}^{*}\approx v_{g}^{*}$ across the shown landscape (which covers only a portion of the policy parameter space). Their absolute relative difference, i.e. $|(\hat{v}_{g}^{*}-v_{g}^{*})/v_{g}^{*}|$, is $7.070\%$, where $v_{g}^{*}=0.224$ and $\hat{v}_{g}^{*}=0.208$. For the experimental setup, see Sec 7.

5 Benefits of maximizing the average reward in recurrent MDPs

In this Section, we enumerate the benefits of directly maximizing the average reward (gain) optimality criterion in recurrent MDPs. Loosely speaking, such a recurrent structure is found in continuing environments (those with no terminal state; cf. episodic environments, Footnote 10) that exhibit cyclical events across all states. For episodic environments, we can obtain their recurrent MDPs by explicitly modelling the episode repetition, as explained in Sec 5.5. For a non-exhaustive list of continuing and episodic environments, refer to (Dewanto et al., 2020).

The combination of recurrent MDPs and gain optimality is advantageous. As a by-product, there is no discount factor $\gamma$ involved, removing the difficulties due to artificial discounting (Sec 4). Other benefits are described in the following sub-sections.

5.1 Unconditionally the most selective criterion

Gain optimality is the most selective criterion for recurrent MDPs. This is because all states are recurrent so that the gain is all that is needed to quantitatively measure the quality of any stationary policy from those states (such a gain quantity turns out to be constant for all states in recurrent or more generally unichain MDPs). Recall that the gain is concerned with the long-run rewards (3), and recurrent states are states that are re-visited infinitely many times in the long-run (Sec 2.2).

Gain optimality is therefore unconditionally equivalent to Blackwell optimality, and immediately so, in the sense that there is no need to tune a hyperparameter of the optimality criterion (which fundamentally determines the optimization objective function, hence the overall optimization). This is in contrast to $\gamma$-discounted optimality, which is equivalent to Blackwell optimality only if $\gamma\geq\gamma_{\mathrm{Bw}}$, as in (5). Moreover, since $\gamma_{\mathrm{Bw}}$ is unknown, tuning $\gamma$ is necessary (Sec 3.2).

5.2 Uniformly optimal policies

Since recurrent (up to unichain) MDPs have only a single recurrent class (chain), the gain of any policy is constant across all states. As a result, average-reward policy gradient methods maximize an objective (17) that is independent of the initial states, or more generally of the initial state distribution. The resulting gain-optimal policies are said to be uniformly optimal because they are optimal for all initial states or all initial state distributions (Altman, 1999, Def 2.1).

In contrast, the discounted-reward counterpart maximizes an objective (18) that is defined with respect to some initial state distribution $\mathring{p}$ (since the discounted value $v_{\gamma}$ depends on the state from which the value is measured, as in (4)). Consequently, the resulting $\gamma$-discounted optimal policies may not be optimal for all initial states; they are said to be non-uniformly optimal, as noted by Bacon, (2018, p41). This non-uniform optimality can be interpreted as a relaxation of the uniform optimality in DP, which requires that the superiority of optimal policies $\pi_{\gamma}^{*}$ holds in every state, i.e. $v_{\gamma}(\pi_{\gamma}^{*},s_{0})\geq v_{\gamma}(\pi,s_{0})$ for all states $s_{0}\in\mathcal{S}$ and all policies $\pi\in\Pi_{\mathrm{S}}$; see Sec 2.1.

The objectives of average- and discounted-reward policy gradient methods are as follows.

Average-reward policy gradient objective: $\operatorname*{\arg\max}_{\boldsymbol{\theta}\in\Theta}v_{g}(\pi(\boldsymbol{\theta}))$, (17)
Discounted-reward policy gradient objective: $\operatorname*{\arg\max}_{\boldsymbol{\theta}\in\Theta}\mathbb{E}_{\mathring{p}}[v_{\gamma}(\pi(\boldsymbol{\theta}),S_{0})]$, (18)

where $\boldsymbol{\theta}\in\Theta=\mathbb{R}^{\dim(\boldsymbol{\theta})}$ is the policy parameter and $S_{0}\sim\mathring{p}$ is the initial state random variable.

5.3 Potentially higher convergence rates

Without delving into specific algorithms, there are at least two reasons for faster convergence of average-reward methods (compared to their discounted-reward counterparts), as hinted by Schwartz, (1993, Sec 6) and Van Roy, (1998, p114).

First is the common gain across states. Such commonality eases gain approximation in that no generalization across states is required: a single gain estimate suffices for the one true gain shared by all states in unichain MDPs.

Second, average-reward methods optimize solely the gain term in the Laurent series expansion of $v_{\gamma}$ in (6). On the other hand, their discounted-reward counterparts optimize the gain, bias, and higher-order terms simultaneously; the implication of this becomes more substantial as $\gamma$ is set further below 1, as can be observed in (7).

5.4 Less discrepancy between the learning objective and the performance metric

Since there is a finite number of timesteps (as the training budget) and no inherent notion of discounting, the final learning performance metric $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ in RL is typically measured by the average return over the last several experiment-episodes (Machado et al., 2018, Sec 5.2; Henderson et al., 2018, Fig 2). Here, the term "experiment-episode" refers to one trial (run, roll-out, or simulation), which produces one sample trajectory (of states, actions, and rewards); it applies to both episodic and continuing environments, whereas the term "episode" only makes sense for episodic environments. That is,

$\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}\coloneqq\dfrac{1}{n_{\mathrm{xep}}^{\mathrm{last}}}\sum_{i=j+1}^{n_{\mathrm{xep}}}\underbrace{\Bigg[\sum_{t=0}^{\hat{t}_{\mathrm{max}}^{\mathrm{xep}(i)}}r_{t+1}^{(i)}\Big|\hat{\pi}^{*}\Bigg]}_{\text{Return}}\quad(\text{with }j=n_{\mathrm{xep}}-n_{\mathrm{xep}}^{\mathrm{last}})$
$\approx\underbrace{\mathbb{E}\Bigg[\sum_{t=0}^{\hat{t}_{\mathrm{max}}^{\mathrm{xep}}}R_{t+1}\Big|\hat{\pi}^{*}\Bigg]}_{\text{Finite-horizon total reward}}\propto\underbrace{\dfrac{1}{\hat{t}_{\mathrm{max}}^{\mathrm{xep}}+1}\mathbb{E}\Bigg[\sum_{t=0}^{\hat{t}_{\mathrm{max}}^{\mathrm{xep}}}R_{t+1}\Big|\hat{\pi}^{*}\Bigg]}_{\text{Finite-horizon average reward}}\approx\underbrace{v_{g}(\hat{\pi}^{*})}_{\text{Gain}},\qquad(19)$

where $n_{\mathrm{xep}}$ denotes the number of i.i.d. experiment-episodes, from which we pick the last $n_{\mathrm{xep}}^{\mathrm{last}}$ experiment-episodes once the learning is deemed to have converged to a final learned policy $\hat{\pi}^{*}$. Each $i$-th experiment-episode runs from $t=0$ to a finite $\hat{t}_{\mathrm{max}}^{\mathrm{xep}(i)}\approx t_{\mathrm{max}}$, where the reward realization at every timestep $t$ is denoted by $r_{t+1}^{(i)}$. The expectation in (19) of the reward random variable $R_{t+1}\coloneqq r(S_{t},A_{t})$ is with respect to $S_{0}\sim\mathring{p}$, $A_{t}\sim\hat{\pi}^{*}(\cdot|s_{t})$ and $S_{t+1}\sim p(\cdot|s_{t},a_{t})$.
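In code, this metric is simply the mean of the last few undiscounted returns logged during training (a minimal sketch; the function and variable names, and the illustrative numbers, are ours):

import numpy as np

def final_performance(returns, n_last):
    """Mean undiscounted return over the last n_last experiment-episodes,
    where returns[i] is the total reward collected in the i-th experiment-episode."""
    return float(np.mean(returns[-n_last:]))

# Illustrative usage: average the last 3 of 5 logged returns.
print(final_performance([10.0, 12.0, 15.0, 14.0, 16.0], n_last=3))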

We argue that the learning performance metric $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ has less discrepancy with respect to the gain $v_{g}$ than with respect to the discounted reward $v_{\gamma}$. In other words, whenever the criterion is $v_{g}$, what is measured during training (i.e. the learning performance metric) is more aligned with what the agent learns and optimizes (i.e. the learning objective); in some cases, this alignment may be traded off for more tractable computation, training, or analysis. There are three arguments for this, as follows.

First, since the experiment-episodes are i.i.d., the metric $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ in (19) converges, as $n_{\mathrm{xep}}^{\mathrm{last}}\to\infty$, to the expected finite-horizon total reward. This is proportional to the expected finite-horizon average reward, which approximates the gain $v_{g}(\hat{\pi}^{*})$ up to the finiteness of $\hat{t}_{\mathrm{max}}^{\mathrm{xep}}$.

Second is the fact that $\hat{t}_{\mathrm{max}}^{\mathrm{xep}}$ is typically fixed for all experiment-episodes during training; one common practice is to set it to some constant far less than the training budget so as to obtain multiple experiment-episodes, implemented for instance as the 'max_episode_steps' variable at https://github.com/openai/gym/blob/master/gym/envs/__init__.py of a seminal and popular RL environment codebase, OpenAI Gym (Brockman et al., 2016). It is not randomly sampled from a geometric distribution $Geo(p=1-\gamma)$, even when maximizing $v_{\gamma}$ (seemingly because there is no inherent notion of discounting). If it were sampled from such a geometric distribution, then the expectation of $\bar{\psi}_{\mathrm{tot}}^{\mathrm{final}}$ would converge to $v_{\gamma}$, following the identity in (9). This, however, would inherently pose the risk of misspecifying $\gamma$ (Sec 4.2). This metric gap due to artificial discounting is highlighted by Schwartz, (1993, Sec 3), van Hasselt, (2011, p25), Dabney, (2014, p47), and Van Seijen et al., (2019, Sec 2).

Third is because the induced Markov chain may reach stationarity (or close to it) within $\hat{t}_{\mathrm{max}}^{\mathrm{xep}}<\infty$ in some environments. That is, $\boldsymbol{P}_{\pi}^{t}\boldsymbol{P}_{\pi}=\boldsymbol{P}_{\pi}^{t+1}=\boldsymbol{P}_{\pi}^{t}=\boldsymbol{P}_{\pi}^{\star}$ for several timesteps $t\leq\hat{t}_{\mathrm{max}}^{\mathrm{xep}}$; see (15). If this happens, then some number of rewards are generated from states sampled from the stationary state distribution $p_{\pi}^{\star}$ (although these are Markovian, not independent, state samples). This makes the approximation more accurate for the gain $v_{g}$ (although it is inherently biased due to the non-i.i.d. state samples) than for the discounted $v_{\gamma}$, since $v_{g}(\pi)=\mathbb{E}_{S\sim p_{\pi}^{\star}}[r_{\pi}(S)]$ with $r_{\pi}(s)\coloneqq\mathbb{E}_{A\sim\pi}[r(s,A)]$, which can be shown to be equivalent to (3).

5.5 Modelling the episode repetition explicitly in episodic environments

Figure 4: Diagrams of (a) a unichain $s_{\mathrm{zrat}}$-model and (b) a recurrent $s_{\mathrm{rst}}$-model of an episodic environment with an (original) state set $\mathcal{S}=\{s^{0},s^{1},s^{2}\}$. The red edge indicates a deterministic transition via a single self-loop action $a_{\mathrm{zrat}}$ in a 0-reward absorbing terminal state $s_{\mathrm{zrat}}$. The blue edges indicate possible transitions via a single reset action $a_{\mathrm{rst}}$ in a resetting state $s_{\mathrm{rst}}$; this $a_{\mathrm{rst}}$ leads to a next state that is distributed according to the initial state distribution $\mathring{p}$ (which here has all states $s\in\mathcal{S}$ as its support). Transitions via $a_{\mathrm{zrat}}$ and $a_{\mathrm{rst}}$ yield a zero reward, as indicated by the red and blue edge labels. The other black unlabelled edges represent examples of transitions with positive probabilities and arbitrary reward values. For the formal $s_{\mathrm{zrat}}$- to $s_{\mathrm{rst}}$-model conversion, refer to Sec 5.5.

In order to obtain an infinite-horizon recurrent MDP of an episodic environment, we can model its episode repetition explicitly. This is in contrast to modelling an episodic environment as an infinite-horizon MDP with a 0-reward absorbing terminal state (denoted as $s_{\mathrm{zrat}}$, which is recurrent under every policy). This modelling (called $s_{\mathrm{zrat}}$-modelling) induces a unichain MDP, where all states but $s_{\mathrm{zrat}}$ are transient. In the $s_{\mathrm{zrat}}$-model, the gain is trivially 0 for all stationary policies, rendering gain optimality underselective (Sec 3.2). Fig 4(a) shows the diagram of an $s_{\mathrm{zrat}}$-model.

To explicitly model episode repetition, we augment the original state set with a resetting terminal state $s_{\mathrm{rst}}$ (instead of $s_{\mathrm{zrat}}$) that has a single available reset action $a_{\mathrm{rst}}$ (instead of a self-loop action $a_{\mathrm{zrat}}$). This action $a_{\mathrm{rst}}$ is responsible for a transition from $s_{\mathrm{rst}}$ to an initial state $S_{0}\sim\mathring{p}$, which yields a reward of 0. We call this approach $s_{\mathrm{rst}}$-modelling, whose example diagram is shown in Fig 4(b). The conversion from a unichain $s_{\mathrm{zrat}}$-model to a recurrent $s_{\mathrm{rst}}$-model is as follows.

Let $\mathcal{S}$ and $\mathcal{A}$ denote the original state and action sets of an episodic environment, respectively, before the augmentation of a 0-reward absorbing terminal state $s_{\mathrm{zrat}}$. Given a unichain $s_{\mathrm{zrat}}$-model with a state set $\mathcal{S}_{\mathrm{zrat}}^{+}=\mathcal{S}\cup\{s_{\mathrm{zrat}}\}$, an action set $\mathcal{A}_{\mathrm{zrat}}^{+}=\mathcal{A}\cup\{a_{\mathrm{zrat}}\}$, an initial state distribution $\mathring{p}_{\mathrm{zrat}}$, a state transition distribution $p_{\mathrm{zrat}}$, and a reward function $r_{\mathrm{zrat}}$, the corresponding recurrent $s_{\mathrm{rst}}$-model has the following components.

  • A state set $\mathcal{S}_{\mathrm{rst}}^{+}=\mathcal{S}\cup\{s_{\mathrm{rst}}\}$, an action set $\mathcal{A}_{\mathrm{rst}}^{+}=\mathcal{A}\cup\{a_{\mathrm{rst}}\}$, and an initial state distribution $\mathring{p}_{\mathrm{rst}}=\mathring{p}_{\mathrm{zrat}}$, where $\mathring{p}_{\mathrm{rst}}(s_{\mathrm{rst}})=\mathring{p}_{\mathrm{zrat}}(s_{\mathrm{zrat}})=0$.

  • A state transition distribution $p_{\mathrm{rst}}$ and a reward function $r_{\mathrm{rst}}$, where
    • $p_{\mathrm{rst}}(s^{\prime}|s,a)=p_{\mathrm{zrat}}(s^{\prime}|s,a)$ and $r_{\mathrm{rst}}(s,a,s^{\prime})=r_{\mathrm{zrat}}(s,a,s^{\prime})$, for all $s,s^{\prime}\in\mathcal{S}$ and all $a\in\mathcal{A}$,
    • $p_{\mathrm{rst}}(s_{\mathrm{rst}}|s,a)=p_{\mathrm{zrat}}(s_{\mathrm{zrat}}|s,a)$ and $r_{\mathrm{rst}}(s,a,s_{\mathrm{rst}})=r_{\mathrm{zrat}}(s,a,s_{\mathrm{zrat}})$, for all $s\in\mathcal{S}$ and all $a\in\mathcal{A}$, as well as
    • $p_{\mathrm{rst}}(s|s_{\mathrm{rst}},a_{\mathrm{rst}})=\mathring{p}_{\mathrm{zrat}}(s)$ and $r_{\mathrm{rst}}(s_{\mathrm{rst}},a_{\mathrm{rst}},s)=0$, for all $s\in\mathcal{S}$.

The above conversion requires that i) reaching $s_{\mathrm{zrat}}$ is inevitable with probability 1 under all stationary policies (this is equivalent to inevitable termination in an episodic environment), and ii) there is no inherent notion of discounting that operates from $t=0$ until the end of an episode. For example diagrams of both $s_{\mathrm{zrat}}$- and $s_{\mathrm{rst}}$-models, refer to Fig 4.
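To make the conversion concrete, here is a minimal sketch for tabular models (our illustration; the array-based representation and the reuse of the terminal-state index as the resetting state are assumptions, not the paper's formal construction):

import numpy as np

def zrat_to_rst(P, R, p0, s_zrat):
    """Convert a tabular s_zrat-model into an s_rst-model.

    P[s, a, s'] : transition probabilities, R[s, a, s'] : rewards,
    p0[s]       : initial state distribution (with p0[s_zrat] = 0).
    The absorbing terminal state's index is reused as the resetting state
    s_rst, whose (single) reset action sends the agent to S0 ~ p0 with
    zero reward; all other transitions and rewards are kept unchanged.
    """
    P_rst, R_rst = P.copy(), R.copy()
    P_rst[s_zrat, :, :] = p0[np.newaxis, :]  # every action in s_rst acts as a_rst
    R_rst[s_zrat, :, :] = 0.0                # resetting yields zero reward
    return P_rst, R_rst, p0.copy()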

The $s_{\mathrm{rst}}$-modelling suits the fact that in practice, an RL agent is expected to run over multiple episodes, e.g. to play some game repeatedly. By using $s_{\mathrm{rst}}$-modelling, we can train an agent operating on episodic environments in the same practical way as if it were operating on continuing environments. Similar to the zero state-value of the terminal state (i.e. $v_{\gamma}^{\pi}(s_{\mathrm{zrat}})=0$) in discounted-reward $s_{\mathrm{zrat}}$-modelling, we may have a zero state-value of the resetting state (i.e. $v_{b}^{\pi}(s_{\mathrm{rst}})=0$) in average-reward $s_{\mathrm{rst}}$-modelling, where $v_{b}$ denotes the relative value of the bias (7).

Figure 5: Learning curves of $Q_{\mathrm{tot}}$-learning on the $s_{\mathrm{zrat}}$-model and $Q_{b}$-learning on the $s_{\mathrm{rst}}$-model, evaluated on two episodic environments: (a) GridNav-25 and (b) Taxi-15. $Q_{\mathrm{tot}}$-learning maximizes the total reward criterion, whereas $Q_{b}$-learning maximizes the average-reward criterion. For the experimental setup, see Sec 7.

We compare $s_{\mathrm{zrat}}$- and $s_{\mathrm{rst}}$-modelling by running experiments with the following two training schemes.

  • Scheme-A uses an $s_{\mathrm{zrat}}$-model and the total reward criterion (hence, $Q_{\mathrm{tot}}$-learning).

  • Scheme-B uses an $s_{\mathrm{rst}}$-model and the average reward criterion (hence, $Q_{b}$-learning).

Fig 5 depicts the learning curves of $Q_{x}$-learning trained under Scheme-A and Scheme-B on two episodic environments. Both schemes are trained with the same experiment-episode length (during training on episodic environments, an agent is always transported back to an initial state after an episode ends, regardless of the training scheme; this transportation to initial states is captured by $s_{\mathrm{rst}}$-modelling, but not by $s_{\mathrm{zrat}}$-modelling, see Fig 4), and gauged by the same performance metric, i.e. the finite-time average reward (19). As can be observed, both schemes converge to the same value, but Scheme-B empirically attains a higher rate of convergence. This indicates that both the $s_{\mathrm{zrat}}$- and $s_{\mathrm{rst}}$-models induce the same optimal policy (on the original states $s\in\mathcal{S}$), and that the $s_{\mathrm{zrat}}$- to $s_{\mathrm{rst}}$-model conversion is sound. More importantly, conversion to an $s_{\mathrm{rst}}$-model enables obtaining the optimal policy using an average-reward method, since such a model is recurrent, for which gain optimality is the most selective criterion.

Mahadevan, (1996a, Sec 3.1) mentioned the idea of episode repetition in an episodic grid-navigation environment. He, however, did not provide a formal conversion. Pardo et al., (2018) also used a conception similar to episode repetition when proposing a technique that bootstraps from the value of the state at the end of each experiment-episode for random-horizon episodic environments. Another important related work is that of White, (2017), who introduced a unification of episodic and continuing environment specifications. Such a unification is carried out via transition-based discounting, where the discount factor of the transition from terminal to initial states is set to 0. She noted that transition-based discounting breaks the Laurent-expansion-based connection between the discounted- and average-reward criteria in (6).

6 Discussions

The route of using the discounted reward to approximately maximize the average reward in RL seems to follow Blackwell's $\gamma$-discounted approach (1962) to Howard's average reward (1960) in DP; recall that both approaches were aimed at dealing with the infiniteness of the total reward in infinite-horizon MDPs (Sec 2.1). As explained in Sec 2.1, the major successive development of Blackwell's approach came from Veinott, who introduced a discounting-free family of $n$-discount optimality criteria (1969). He then developed an algorithm for obtaining $n$-discount optimal policies, from which the policies considered nearly-optimal and optimal by Blackwell emerge as special cases, namely the $(n=0)$- and $(n=\infty)$-discount optimal policies, respectively. Both imply average reward optimality.

As in DP, the route in RL should be completed by devising methods based on Veinott's optimality criteria. To this end, average-reward RL methods constitute the first step, in the following senses. First, they yield a set of approximately $(n=-1)$-discount optimal policies, from which higher $n$-discount optimal policies are sought, as illustrated in Fig 6(b). This follows from the hierarchical property of $n$-discount optimality, whose selectivity increases as $n$ increases. Note that such a property is exploited in the $n$-discount policy iteration in DP (Puterman, 1994, Ch 10.3). Second, average-reward RL methods evaluate the gain $\boldsymbol{v}_{g}\equiv\boldsymbol{v}_{-1}$ and the bias $\boldsymbol{v}_{0}$, which are likely to be useful for attaining higher $n$-discount criteria. This is because the expansion coefficients $\boldsymbol{v}_{n}$ in (6) are interconnected. For instance, the first three coefficients satisfy the three nested equations below,

i) $\boldsymbol{v}_{-1}=\boldsymbol{P}\boldsymbol{v}_{-1}$, ii) $\boldsymbol{v}_{0}=\boldsymbol{r}-\boldsymbol{v}_{-1}+\boldsymbol{P}\boldsymbol{v}_{0}$, iii) $\boldsymbol{v}_{1}=-\boldsymbol{v}_{0}+\boldsymbol{P}\boldsymbol{v}_{1}$, (20)

where $\boldsymbol{r}$ is the reward vector, $\boldsymbol{P}$ is the one-step transition matrix, and $\boldsymbol{v}_{1}$ is the policy evaluation entity for $(n=1)$-discount optimality. Thus, advances in gain and bias estimation are transferable to higher $n$-discount methods towards obtaining Blackwell optimal policies.
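As a minimal sketch of how the first two coefficients can be obtained when $\boldsymbol{P}$ and $\boldsymbol{r}$ are known (our illustration for a unichain policy, using the common normalization that pins the bias of a reference state at zero; the function name and the 2-state example are ours):

import numpy as np

def gain_and_bias(P, r, ref=0):
    """Solve equations i) and ii) of (20) for a unichain policy:
    the constant gain g and the bias vector v0 satisfy
    (I - P) v0 + g * 1 = r, with the normalization v0[ref] = 0."""
    n = P.shape[0]
    A = np.zeros((n + 1, n + 1))
    b = np.zeros(n + 1)
    A[:n, :n] = np.eye(n) - P  # (I - P) v0
    A[:n, n] = 1.0             # + g * 1
    A[n, ref] = 1.0            # normalization row: v0[ref] = 0
    b[:n] = r
    x = np.linalg.solve(A, b)
    return x[n], x[:n]         # gain, bias

# A 2-state chain alternating deterministically between rewards 0 and 2 has gain 1.
P = np.array([[0.0, 1.0], [1.0, 0.0]])
r = np.array([0.0, 2.0])
print(gain_and_bias(P, r))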

(a) $\gamma$-discounted optimality for $\gamma\in[0,1)$, whose optimal policy sets are denoted as $\Pi_{\gamma}^{*}$. Theoretically, there is a finite number of intervals. For simplicity, here we assume that there are four intervals, namely $[0=\gamma_{3},\gamma_{2})$, $[\gamma_{2},\gamma_{1})$, $[\gamma_{1},\gamma_{0}=\gamma_{\mathrm{Bw}})$, and $[\gamma_{\mathrm{Bw}},\gamma_{-1}=1)$, with $\Pi^{*}_{\gamma_{\mathrm{Bw}}\leq\gamma<1}\coloneqq\Pi^{*}_{\mathrm{Bw}}$ (and $\Pi^{*}_{0\leq\gamma<\gamma_{2}}$ possibly empty).
(b) $n$-discount optimality for $n=-1,0,\ldots,|\mathcal{S}|-2$, whose optimal policy sets are denoted by $\Pi_{n}^{*}$. In recurrent MDPs, $\Pi_{-1}^{*}=\Pi^{*}_{\mathrm{Bw}}$. When transient states are present, such an equivalence to $\Pi^{*}_{\mathrm{Bw}}$ may not be achieved until $n=|\mathcal{S}|-2$ at most in unichain MDPs (although $\Pi_{n}^{*}=\Pi^{*}_{\mathrm{Bw}}$ may already hold for smaller $n$).
Figure 6: Two Venn diagrams of the optimal policy sets $\Pi^{*}$ of the $\gamma$-discounted and $n$-discount optimality criteria, applied to unichain MDPs. Only one green circle is present in both Venn diagrams: it indicates the subset $\Pi^{*}_{\mathrm{Bw}}$ containing stationary policies (within $\Pi_{\mathrm{S}}$) that are guaranteed to be optimal with respect to the most selective Blackwell criterion. Note that the sizes of the circles in both diagrams are not to scale. The left Venn diagram of $\gamma$-discounted optimality is invariant to the MDP classification and is based on (Blackwell, 1962; Smallwood, 1966; Jiang et al., 2016); in particular, we conjecture that $\Pi^{*}_{\gamma_{\mathrm{Bw}}\leq\gamma<1}$ is always disjoint from the other optimal policy subsets. The right Venn diagram of $n$-discount optimality is based on (Puterman, 1994, Thm 10.1.5, 10.3.6; Mahadevan, 1996b, Fig 2).

Directly maximizing the average reward brings several benefits, as described in Sec 5. It is also evidently free from any $\gamma$-related difficulties (Sec 4). It is interesting now to discuss whether those benefits outweigh the loss of all the virtues of $\gamma$ (Sec 3).

Approximating the average reward via discounting (Sec 3.1) is not without caveats. Taking $\gamma$ very close to 1 slows the convergence and increases the variance of, for example, policy gradient estimates. On the other hand, lowering $\gamma$ poses the risk of suboptimality (with respect to the most selective criterion). This trade-off can potentially be guided by the critical discount factor $\gamma_{\mathrm{Bw}}$ (Sec 3.2). Specifically in RL, we conjecture that it is not the exact value of $\gamma_{\mathrm{Bw}}$ that is needed, but some value around it, because of the interplay among $\gamma$-dependent approximation layers, such as those related to policy value and gradient estimation. Thus, fine-tuning $\gamma$ is always a crucial yet non-trivial task, dealing with possibly non-concave and non-differentiable learning performance functions with respect to $\gamma$ (even though the domain of such a univariate function is limited to $[0,1)$). Research on this front is ongoing, with some progress (Zahavy et al., 2020; Paul et al., 2019; Xu et al., 2018). In that regard, Veinott's discounting-free criteria (including average-reward optimality) can be viewed as alternatives towards achieving the most selective Blackwell optimality in RL.

There are several ways to compensate for the other merits of discounting (Sec 3.3), which are lost when directly maximizing the average reward. We describe some of them below.

The chain-classification independence merit:

The need for chain-type determination (or assumption) can be thought of as a way to exploit the chain structural property in order to apply simpler average-reward methods. That is, for a constant gain across all states, we impose a unichain assumption, which is relatively unrestrictive in that it already generalizes the recurrent type. Nevertheless, the ultimate goal remains: an average-reward RL method that can handle non-constant gains across states, as in multichain MDPs. Because of its generality (hence, sophistication), such a method would not require prior determination of the chain type.

The chain-type assumption can also be interpreted as an attempt to break down the difficulty of attaining Blackwell-optimal policies in one go (regardless of the chain structure), as in $\gamma$-discounted optimality; see Fig 6(a). Recall that under such a criterion, only $(\gamma\geq\gamma_{\mathrm{Bw}})$-discounted optimal policies are Blackwell optimal (but the critical $\gamma_{\mathrm{Bw}}$ is generally unknown; otherwise this one-go technique would be favourable). On the other hand, the average reward ($(n=-1)$-discount) optimality is already equivalent to Blackwell optimality whenever the MDPs are recurrent. For more general unichain MDPs, $(n=0)$-discount optimality is equivalent to Blackwell optimality, but only for some transition and reward configurations (for other configurations, higher $n$-discount criteria should be applied, as shown in Fig 6(b)).

The contractive merit:

The relative value iteration (VI) has been shown to alleviate the non-contractive nature of basic VI in average rewards (Abounadi et al., 2001). There are also policy gradient methods whose convergence generally follows that of stochastic gradient ascent. Their policy evaluation part can be carried out via stochastic gradient descent, that is, by minimizing some estimation loss derived, for instance, from the average-reward Bellman evaluation equation, which involves the subtraction of the gain (similar to that of the relative VI).

The variance reduction merit:

The variance of trajectory-related estimation can be controlled by directly varying (truncating) the experiment-episode length (as an alternative to varying the artificial discount factor). One may also apply baseline-subtraction techniques for variance reduction. In general, however, lower variance induces higher bias errors, leading to a trade-off.

We conclude that directly maximizing the average reward has a number of benefits that make it worthwhile to use and to investigate further. It is the root for approaching Blackwell optimality through Veinott's criteria, which are discounting-free (eliminating any complication due to artificial discounting). Future work includes an examination of exploration strategies: to what extent do strategies developed for discounted rewards apply to RL aimed at discounting-free criteria?

7 Experimental setup

In this Section, we describe the experimental setups used to produce Figs 2, 3, and 5. They are about environments (Sec 7.1) and learning methods (Sec 7.2).

7.1 Environments

GridNav-$n$ (episodic)

refers to an $n$-by-$n$ grid-world with no obstacles. An agent has to navigate from an initial state to the goal state. There are $n^{2}-1$ states, excluding the goal state. Every state has four available actions, i.e. moving in the North, East, South, and West compass directions. The state-transition stochasticity is governed by an action-slip parameter. That is, the agent actually moves according to the chosen (intended) directional action with probability $q$. Otherwise, it stays put or moves in one of the three other directions; hence there are four alternatives due to slipping, each with probability $(1-q)/4$. If an action would bring the agent beyond the grid-world, the agent stays at the current grid cell. There is a reward of $+1$ for transitioning to the goal; any other movement costs $-1$.
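A minimal sketch of the slip dynamics described above (our illustration; the coordinate representation and names are assumptions, and rewards are omitted):

import numpy as np

MOVES = {"N": (-1, 0), "E": (0, 1), "S": (1, 0), "W": (0, -1)}

def gridnav_step(pos, action, n, q, rng=np.random.default_rng()):
    """One GridNav transition: follow `action` with probability q; otherwise
    slip uniformly to staying put or one of the three other directions."""
    if rng.random() < q:
        move = MOVES[action]
    else:
        alternatives = [(0, 0)] + [m for a, m in MOVES.items() if a != action]
        move = alternatives[rng.integers(len(alternatives))]
    row, col = pos[0] + move[0], pos[1] + move[1]
    if not (0 <= row < n and 0 <= col < n):  # moving off the grid keeps the agent in place
        return pos
    return (row, col)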

Taxi-$n$ (episodic)

refers to the n𝑛n-by-n𝑛n grid Taxi environment, which is adopted from OpenAI Gym (Brockman et al.,, 2016). The number of obstacles is increased accordingly with respect to n𝑛n, whereas the number of passengers and pickup/drop-off locations is kept the same.

The environment families 1, 2, and 3

are adopted from the access-control queueing task (Sutton and Barto, 2018, p252), the Chain problem (Strens, 2000, Fig 1), and the torus MDP (Morimura et al., 2010, Fig 2), respectively. In Fig 2(b), the first two were used as Families 1 and 2, whose numbers of states are varied, whereas the third was used as Family 3, whose reward constant is varied.

7.2 Learning methods

The learning method used for the experiments in Figs 2 and 5 is $Q$-learning. It is an iterative method that relies on the Bellman optimality equation (BOE) to produce iterates approximating the optimal action value (denoted by $q^{*}$). Different optimality criteria have different BOEs, inducing different types of $Q$-learning as follows.

Discounted rewards ($Q_{\gamma}$-learning): $\hat{q}_{\gamma}^{*}(s_{t},a_{t})\leftarrow(1-\alpha_{t})\hat{q}_{\gamma}^{*}(s_{t},a_{t})+\alpha_{t}\{r_{t+1}+\gamma\max_{a\in\mathcal{A}}\hat{q}_{\gamma}^{*}(s_{t+1},a)\}$,
Average rewards ($Q_{b}$-learning): $\hat{q}_{b}^{*}(s_{t},a_{t})\leftarrow(1-\alpha_{t})\hat{q}_{b}^{*}(s_{t},a_{t})+\alpha_{t}\{r_{t+1}-\hat{v}_{g}^{*}+\max_{a\in\mathcal{A}}\hat{q}_{b}^{*}(s_{t+1},a)\}$,
Total rewards ($Q_{\mathrm{tot}}$-learning): $\hat{q}_{\mathrm{tot}}^{*}(s_{t},a_{t})\leftarrow(1-\alpha_{t})\hat{q}_{\mathrm{tot}}^{*}(s_{t},a_{t})+\alpha_{t}\{r_{t+1}+\max_{a\in\mathcal{A}}\hat{q}_{\mathrm{tot}}^{*}(s_{t+1},a)\}$,

with a positive learning rate $\alpha_{t}$ (here, we used a fine-tuned constant $\alpha$). The estimate $\hat{q}^{*}$ is initialized optimistically to large values to encourage exploration at the outset of learning. For $Q_{b}$-learning, we set the optimal gain estimate $\hat{v}_{g}^{*}\leftarrow\max_{a\in\mathcal{A}}\hat{q}_{b}^{*}(s_{\mathrm{ref}},a)$ with a prescribed (arbitrary but fixed) reference state $s_{\mathrm{ref}}$, following the relative-VI (RVI) technique of Abounadi et al., (2001, Sec 2.2). Note that $Q_{\mathrm{tot}}$-learning converges as long as the total reward is finite (Schwartz, 1993, p2).
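As a minimal sketch of these tabular updates (our illustration; the array layout and function name are assumptions), the three variants differ only in their bootstrap targets:

import numpy as np

def q_update(q, s, a, r, s_next, alpha, criterion, gamma=0.99, s_ref=0):
    """One tabular Q-learning update on an |S| x |A| array q;
    `criterion` selects which Bellman optimality target is used."""
    if criterion == "discounted":    # Q_gamma-learning
        target = r + gamma * np.max(q[s_next])
    elif criterion == "average":     # Q_b-learning (RVI): subtract the gain estimate
        target = r - np.max(q[s_ref]) + np.max(q[s_next])
    elif criterion == "total":       # Q_tot-learning
        target = r + np.max(q[s_next])
    else:
        raise ValueError(criterion)
    q[s, a] = (1.0 - alpha) * q[s, a] + alpha * target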

References

  • Abounadi et al., (2001) Abounadi, J., Bertsekas, D., and Borkar, V. S. (2001). Learning algorithms for Markov decision processes with average cost. SIAM Journal on Control and Optimization, 40(3).
  • Agarwal et al., (2019) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). On the theory of policy gradient methods: Optimality, approximation, and distribution shift. arXiv: 1908.00261.
  • Altman, (1999) Altman, E. (1999). Constrained Markov Decision Processes. Taylor & Francis.
  • Bacon, (2018) Bacon, P.-L. (2018). Temporal Representation Learning. PhD thesis, School of Computer Science, McGill University.
  • Baxter and Bartlett, (2001) Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15(1).
  • Beleznay et al., (1999) Beleznay, F., Grobler, T., and Szepesvari, C. (1999). Comparing value-function estimation algorithms in undiscounted problems. Technical report.
  • Bellman, (1957) Bellman, R. (1957). Dynamic Programming. Princeton University Press.
  • Blackwell, (1962) Blackwell, D. (1962). Discrete dynamic programming. The Annals of Mathematical Statistics, 33(2).
  • Brockman et al., (2016) Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J., and Zaremba, W. (2016). OpenAI Gym. arXiv:1606.01540.
  • Castro, (2020) Castro, P. S. (2020). Scalable methods for computing state similarity in deterministic Markov decision processes. Proceedings of the AAAI Conference on Artificial Intelligence.
  • Chang et al., (2013) Chang, H. S., Hu, J., Fu, M. C., and Marcus, S. I. (2013). Simulation-Based Algorithms for Markov Decision Processes. Springer, 2nd edition.
  • Dabney, (2014) Dabney, W. C. (2014). Adaptive step-sizes for reinforcement learning. PhD thesis, Computer Science Department, University of Massachusetts Amherst.
  • Dann et al., (2014) Dann, C., Neumann, G., and Peters, J. (2014). Policy evaluation with temporal differences: A survey and comparison. Journal of Machine Learning Research, 15(24).
  • Devraj and Meyn, (2020) Devraj, A. M. and Meyn, S. P. (2020). Q-learning with uniformly bounded variance: Large discounting is not a barrier to fast learning. arXiv: 2002.10301.
  • Dewanto et al., (2020) Dewanto, V., Dunn, G., Eshragh, A., Gallagher, M., and Roosta, F. (2020). Average-reward model-free reinforcement learning: a systematic review and literature mapping. arXiv: 2010.08920.
  • Douc et al., (2018) Douc, R., Moulines, E., Priouret, P., and Soulier, P. (2018). Markov Chains. Springer Series in Operations Research and Financial Engineering. Springer International Publishing.
  • Duan et al., (2016) Duan, Y., Chen, X., Houthooft, R., Schulman, J., and Abbeel, P. (2016). Benchmarking deep reinforcement learning for continuous control. In Proceedings of The 33rd International Conference on Machine Learning.
  • Fan et al., (2020) Fan, J., Wang, Z., Xie, Y., and Yang, Z. (2020). A theoretical analysis of deep Q-learning. In Proceedings of the 2nd Conference on Learning for Dynamics and Control.
  • Feinberg and Shwartz, (2002) Feinberg, E. A. and Shwartz, A. (2002). Handbook of Markov Decision Processes: Methods and Applications, volume 40. Springer US.
  • Ferns et al., (2006) Ferns, N., Castro, P. S., Precup, D., and Panangaden, P. (2006). Methods for computing state similarity in Markov decision processes. In Conference in Uncertainty in Artificial Intelligence.
  • Henderson et al., (2018) Henderson, P., Islam, R., Bachman, P., Pineau, J., Precup, D., and Meger, D. (2018). Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence.
  • Hordijk et al., (1985) Hordijk, A., Dekker, R., and Kallenberg, L. C. M. (1985). Sensitivity analysis in discounted Markovian decision problems. Operations Research Spektrum, 7.
  • Howard, (1960) Howard, R. A. (1960). Dynamic Programming and Markov Processes. Technology Press of the Massachusetts Institute of Technology.
  • Hutter, (2006) Hutter, M. (2006). General discounting versus average reward. In Algorithmic Learning Theory. Springer Berlin Heidelberg.
  • Jiang et al., (2016) Jiang, N., Singh, S., and Tewari, A. (2016). On structural properties of MDPs that bound loss due to shallow planning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence.
  • Jin and Sidford, (2021) Jin, Y. and Sidford, A. (2021). Towards tight bounds on the sample complexity of average-reward MDPs. In Proceedings of the 38th International Conference on Machine Learning.
  • Kakade, (2001) Kakade, S. (2001). Optimizing average reward using discounted rewards. In Proceedings of the 14th Annual Conference on Computational Learning Theory.
  • Kumar et al., (2020) Kumar, H., Kalogerias, D. S., Pappas, G. J., and Ribeiro, A. (2020). Zeroth-order deterministic policy gradient. arXiv: 2006.07314.
  • Lattimore and Hutter, (2014) Lattimore, T. and Hutter, M. (2014). General time consistent discounting. Theoretical Computer Science, 519.
  • Lattimore and Szepesvári, (2020) Lattimore, T. and Szepesvári, C. (2020). Bandit Algorithms. Cambridge University Press.
  • Machado et al., (2018) Machado, M. C., Bellemare, M. G., Talvitie, E., Veness, J., Hausknecht, M. J., and Bowling, M. (2018). Revisiting the arcade learning environment: Evaluation protocols and open problems for general agents. Journal of Artificial Intelligence Research.
  • Mahadevan, (1994) Mahadevan, S. (1994). To discount or not to discount in reinforcement learning: A case study comparing R-learning and Q-learning. In Proceedings of the 11th International Conference on Machine Learning.
  • Mahadevan, (1996a) Mahadevan, S. (1996a). Average reward reinforcement learning: Foundations, algorithms, and empirical results. Machine Learning.
  • Mahadevan, (1996b) Mahadevan, S. (1996b). Sensitive discount optimality: Unifying discounted and average reward reinforcement learning. In Proceedings of the 13th International Conference on Machine Learning.
  • Melo and Ribeiro, (2007) Melo, F. S. and Ribeiro, M. I. (2007). Q-learning with linear function approximation. In Computational Learning Theory.
  • Mischel et al., (1972) Mischel, W., Ebbesen, E. B., and Zeiss, A. R. (1972). Cognitive and attentional mechanisms in delay of gratification. Journal of Personality and Social Psychology, 21(2).
  • Morimura et al., (2010) Morimura, T., Uchibe, E., Yoshimoto, J., Peters, J., and Doya, K. (2010). Derivatives of logarithmic stationary distributions for policy gradient reinforcement learning. Neural Computation, 22(2).
  • Nota and Thomas, (2020) Nota, C. and Thomas, P. S. (2020). Is the policy gradient a gradient? In Proceedings of the 19th International Conference on Autonomous Agents and MultiAgent Systems.
  • Pardo et al., (2018) Pardo, F., Tavakoli, A., Levdik, V., and Kormushev, P. (2018). Time limits in reinforcement learning. In Proceedings of the 35th International Conference on Machine Learning.
  • Paul et al., (2019) Paul, S., Kurin, V., and Whiteson, S. (2019). Fast efficient hyperparameter tuning for policy gradient methods. In Advances in Neural Information Processing Systems 32.
  • Petrik and Scherrer, (2008) Petrik, M. and Scherrer, B. (2008). Biasing approximate dynamic programming with a lower discount factor. In Advances in Neural Information Processing Systems 21.
  • Puterman, (1994) Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition.
  • Samuelson, (1937) Samuelson, P. A. (1937). A note on measurement of utility. Review of Economic Studies, 4.
  • Scherrer, (2010) Scherrer, B. (2010). Should one compute the Temporal Difference fix point or minimize the Bellman Residual? The unified oblique projection view. In Proceedings of the 27th International Conference on Machine Learning.
  • Schwartz, (1993) Schwartz, A. (1993). A reinforcement learning method for maximizing undiscounted rewards. In Proceedings of the 10th International Conference on Machine Learning.
  • Singh et al., (1994) Singh, S. P., Jaakkola, T. S., and Jordan, M. I. (1994). Learning without state-estimation in partially observable Markovian decision processes. In Proceedings of the 11th International Conference on Machine Learning.
  • Smallwood, (1966) Smallwood, R. D. (1966). Optimum policy regions for Markov processes with discounting. Operations Research, 14.
  • Strens, (2000) Strens, M. J. A. (2000). A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning.
  • Sutton and Barto, (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press, 2nd edition.
  • Szepesvári, (2010) Szepesvári, C. (2010). Algorithms for Reinforcement Learning. Morgan & Claypool.
  • Tang and Abbeel, (2010) Tang, J. and Abbeel, P. (2010). On a connection between importance sampling and the likelihood ratio policy gradient. In Advances in Neural Information Processing Systems 23.
  • Thomas, (2014) Thomas, P. (2014). Bias in natural actor-critic algorithms. In Proceedings of the 31st International Conference on Machine Learning.
  • Thrun and Schwartz, (1993) Thrun, S. and Schwartz, A. (1993). Issues in using function approximation for reinforcement learning. In Proceedings of Connectionist Models Summer School.
  • Tsitsiklis and Van Roy, (2002) Tsitsiklis, J. N. and Van Roy, B. (2002). On average versus discounted reward temporal-difference learning. Machine Learning, 49.
  • van Hasselt, (2011) van Hasselt, H. P. (2011). Insights in Reinforcement Learning: formal analysis and empirical evaluation of temporal-difference learning algorithms. PhD thesis, Utrecht University.
  • Van Roy, (1998) Van Roy, B. (1998). Learning and Value Function Approximation in Complex Decision Processes. PhD thesis, MIT.
  • Van Seijen et al., (2019) Van Seijen, H., Fatemi, M., and Tavakoli, A. (2019). Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems 32.
  • Veinott, (1969) Veinott, A. F. (1969). Discrete Dynamic Programming with Sensitive Discount Optimality Criteria. The Annals of Mathematical Statistics, 40(5).
  • White, (2017) White, M. (2017). Unifying task specification in reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning.
  • Xu et al., (2018) Xu, Z., van Hasselt, H. P., and Silver, D. (2018). Meta-gradient reinforcement learning. In Advances in Neural Information Processing Systems 31.
  • Zahavy et al., (2020) Zahavy, T., Xu, Z., Veeriah, V., Hessel, M., Oh, J., van Hasselt, H. P., Silver, D., and Singh, S. (2020). A self-tuning actor-critic algorithm. In Advances in Neural Information Processing Systems 33.