
Improving the Efficiency of Off-Policy Reinforcement Learning
by Accounting for Past Decisions

Brett Daley
Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115, USA
[email protected]
   Christopher Amato
Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115, USA
[email protected]
Abstract

Off-policy learning from multistep returns is crucial for sample-efficient reinforcement learning, particularly in the experience replay setting now commonly used with deep neural networks. Classically, off-policy estimation bias is corrected in a per-decision manner: past temporal-difference errors are re-weighted by the instantaneous Importance Sampling (IS) ratio (via eligibility traces) after each action. Many important off-policy algorithms such as Tree Backup and Retrace rely on this mechanism along with differing protocols for truncating (“cutting”) the ratios (“traces”) to counteract the excessive variance of the IS estimator. Unfortunately, cutting traces on a per-decision basis is not necessarily efficient; once a trace has been cut according to local information, the effect cannot be reversed later, potentially resulting in the premature truncation of estimated returns and slower learning. In the interest of motivating efficient off-policy algorithms, we propose a multistep operator that permits arbitrary past-dependent traces. We prove that our operator is convergent for policy evaluation, and for optimal control when targeting greedy-in-the-limit policies. Our theorems establish the first convergence guarantees for many existing algorithms including Truncated IS, Non-Markov Retrace, and history-dependent TD($\lambda$). Our theoretical results also provide guidance for the development of new algorithms that jointly consider multiple past decisions for better credit assignment and faster learning.

1 Introduction

Reinforcement learning concerns an agent interacting with its environment through trial and error to maximize its expected cumulative reward [23]. One of the great challenges of reinforcement learning is the temporal credit assignment problem [21]: upon the receipt of a reward, which past action(s) should be held responsible and, hence, be reinforced? Basic temporal-difference (TD) methods [e.g. 29, 18] assign credit to the immediately taken action, bootstrapping from previous experience to learn long-term dependencies. This process requires a large number of repetitions to generate effective behaviors from rewards, motivating research into multistep return estimation in which credit is distributed among multiple (even infinitely many) past actions according to some eligibility rule [e.g. 22, 29, 23]. This can lead to significantly faster learning [23].

One challenge of multistep estimators is that they generally have higher variance than 1-step estimators [7]. This is exacerbated in the off-policy setting, where environment interaction is conducted according to a behavior policy that differs from the target policy for which returns are being estimated. The discrepancy between the two policies manifests mathematically as bias in the return estimation, which can be detrimental to learning if left unaddressed. Despite these challenges, off-policy learning is important for exploration and sample efficiency, especially when combined with experience replay [10], which has gained popularity due to deep reinforcement learning [e.g. 13]. The canonical bias-correction technique is Importance Sampling (IS), wherein the bias due to the differing policies is eliminated by re-weighting the returns with the product of their probability ratios. Although IS theoretically resolves the off-policy bias, it can suffer from extreme variance that makes the algorithm largely untenable in practice.

Directly managing the variance of the IS estimator has been a fruitful avenue for developing efficient off-policy algorithms. Past work has focused on modifying the individual IS ratios to reduce the variance of the full update: e.g. Tree Backup [16], Q($\lambda$) with Off-Policy Corrections [5], Retrace [14], ABQ [12], and C-trace [17]. All of these methods can be implemented online with per-decision rules [16] that determine how much to decay the IS ratio according to the current state-action pair—a process commonly called “trace cutting” in reference to the first method to utilize this technique, Watkins’ Q($\lambda$) [29, 23, 14]. The re-weighted TD error is then broadcast to previous experiences using eligibility traces [8, 1, 21, 23]. These algorithms are Markov in the sense that each iterative off-policy correction depends on only the current state-action pair. One issue with this strategy is that it may lead to suboptimal decisions, since cutting a trace cannot be reversed later, even as new information is ascertained. In contrast, a non-Markov method could examine an entire trajectory of past state-action pairs to make globally better decisions regarding credit assignment.

Indeed, some existing off-policy methods already conduct offline bias correction in a non-Markov manner. Perhaps the simplest example is Truncated IS where the IS ratio products are pre-calculated offline and then clipped to some finite value (see Section 2.2). More recently, [14] suggested a recursive variant of Retrace that automatically relaxes the clipping bound when its historical trace magnitude becomes small; the authors conjecture that this could lead to more efficient learning. To our knowledge, no theoretical analysis has been conducted on non-Markov algorithms such as these, meaning that their convergence properties are unknown, and the space of possible algorithms has not been fully explored.

To better understand the behavior of these non-Markov off-policy algorithms, and to support new discoveries of efficient algorithms, we propose a multistep operator $\mathcal{M}$ that accounts for arbitrary dependencies on past decisions. Our operator is a significant generalization of the $\mathcal{R}$ operator proposed by [14] in that it allows an arbitrary, time-dependent weight for each TD error that may generally depend on the history of the Markov Decision Process (MDP). We prove that our operator converges for policy evaluation and for control provided that the chosen weights never exceed the true IS ratio product. We also remove the assumptions of increasingly greedy policies and pessimistic initialization used by [14] in the control setting. Finally, we discuss practical considerations when implementing algorithms described by our operator. Our results presented here provide new insights into off-policy credit assignment and open avenues for efficient empirical methods—with particularly interesting opportunities for deep reinforcement learning and experience replay.

2 Preliminaries

We consider MDPs of the form $(\mathcal{S},\mathcal{A},P,R,\gamma)$. $\mathcal{S}$ and $\mathcal{A}$ are finite sets of states and actions, respectively. Letting $\Delta X$ denote the set of distributions over a set $X$, then $P\colon\mathcal{S}\times\mathcal{A}\to\Delta\mathcal{S}$ is the transition function, $R\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ is the reward function, and $\gamma\in[0,1]$ is the discount factor.

A policy $\pi\colon\mathcal{S}\to\Delta\mathcal{A}$ determines the agent’s probability of selecting a given action in each state. A Q-function $Q\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$ represents the agent’s estimate of the expected return achievable from each state-action pair. For a policy $\pi$, we define the operator $P_{\pi}$ as

(P_{\pi}Q)(s,a)\coloneqq\sum_{s^{\prime}\in\mathcal{S}}\sum_{a^{\prime}\in\mathcal{A}}P(s^{\prime}|s,a)\pi(a^{\prime}|s^{\prime})Q(s^{\prime},a^{\prime}).

As a shorthand notation, it is useful to represent Q-functions and the reward function as vectors in $\mathbb{R}^{n}$ where $n=|\mathcal{S}\times\mathcal{A}|$. We define $e$ to be the vector whose components are equal to $1$. Linear operators such as $P_{\pi}$ can hence be interpreted as $n\times n$ square matrices that multiply these vectors, with repeated application corresponding to exponentiation: e.g. $P_{\pi}^{t}Q=P_{\pi}(P_{\pi}^{t-1}Q)$. All inequalities in our work should be interpreted element-wise when involving vectors or matrices. Finally, we let $\norm{A}\coloneqq\norm{A}_{\infty}$ for a matrix or vector $A$.

In the policy evaluation setting, we seek to estimate the expected discounted return for policy $\pi$, given by

Q^{\pi}\coloneqq\sum_{t\geq 0}\gamma^{t}P_{\pi}^{t}R.

$Q^{\pi}$ is the unique fixed point of the Bellman operator $T_{\pi}Q\coloneqq R+\gamma P_{\pi}Q$ [2]. In the control setting, we seek to estimate the expected return under the optimal policy $\pi^{*}$, denoted by $Q^{*}$. Letting $\Pi$ be the set of all possible policies, then $Q^{*}$ is the unique fixed point of the Bellman optimality operator $TQ\coloneqq R+\gamma\max_{\pi^{\prime}\in\Pi}P_{\pi^{\prime}}Q$.
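
To make the operator notation concrete, the following is a minimal sketch (our own toy example, not taken from the paper) that builds $P_{\pi}$ as an $n\times n$ matrix over state-action pairs for a small random MDP and recovers $Q^{\pi}$ by repeatedly applying $T_{\pi}$. The sizes, the random MDP, and the variable names are assumptions for illustration only.

# Sketch: compute Q^pi on a small, hypothetical MDP by iterating the Bellman
# operator T_pi Q = R + gamma * P_pi Q (all quantities vectorized over (s, a)).
import numpy as np

n_states, n_actions = 3, 2
n = n_states * n_actions                       # Q-functions live in R^n
rng = np.random.default_rng(0)

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # P(s'|s,a)
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions))             # R(s,a)
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # pi(a|s)
gamma = 0.9

# (P_pi Q)(s,a) = sum_{s',a'} P(s'|s,a) pi(a'|s') Q(s',a') as an n x n matrix.
P_pi = np.zeros((n, n))
for s in range(n_states):
    for a in range(n_actions):
        for s2 in range(n_states):
            for a2 in range(n_actions):
                P_pi[s * n_actions + a, s2 * n_actions + a2] = P[s, a, s2] * pi[s2, a2]

r = R.reshape(n)
Q = np.zeros(n)
for _ in range(1000):                          # repeated application of T_pi
    Q = r + gamma * P_pi @ Q

# Q^pi also solves (I - gamma P_pi) Q = R, so we can verify the fixed point.
Q_exact = np.linalg.solve(np.eye(n) - gamma * P_pi, r)
assert np.allclose(Q, Q_exact, atol=1e-6)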

2.1 Multistep Off-Policy Operators

In our work, we are particularly interested in the off-policy learning case, where trajectories of the form $((s,a),(s_{1},a_{1}),(s_{2},a_{2}),\dots)$, $(s_{k},a_{k})\in\mathcal{S}\times\mathcal{A}$, are generated by interacting with the MDP using a behavior policy $\mu\in\Pi$, $\mu\neq\pi$. Let $\mathcal{F}_{t}$ denote the first $t+1$ terms of this sequence. We define the TD error for policy $\pi$ at timestep $t$ as

\delta^{\pi}_{t}\coloneqq r_{t}+\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s_{t+1})Q(s_{t+1},a^{\prime})-Q(s_{t},a_{t}).

Let $\rho_{k}\coloneqq\frac{\pi(a_{k}|s_{k})}{\mu(a_{k}|s_{k})}$ for brevity. [14] introduced the following off-policy operator:

(\mathcal{R}Q)(s,a)\coloneqq Q(s,a)+\mathbb{E}_{\mu}\Big[\sum_{t\geq 0}\gamma^{t}\Big(\prod_{k=1}^{t}c(s_{k},a_{k})\Big)\delta^{\pi}_{t}\Big],   (1)

where $c(s_{k},a_{k})\in[0,\rho_{k}]$ are arbitrary nonnegative coefficients, or traces. When $c(s_{k},a_{k})<\rho_{k}$, we say that the trace has been (partially) cut. Note that the traces are Markov in the sense that they depend only on the state-action pair $(s_{k},a_{k})$ and are otherwise independent of $\mathcal{F}_{k}$. In other words, the traces for $\mathcal{R}$ can be calculated per decision [16], thereby permitting an efficient online implementation with eligibility traces.

While per-decision traces are convenient from a computational perspective, they require making choices about how much to cut a trace without direct consideration of the past. This can lead to decisions that are only locally optimal; for example, if the trace is cut by setting $c(s_{k},a_{k})=0$ at some timestep, then the effect cannot be reversed later. Regardless of whatever new experiences are encountered by the agent, experiences before timestep $k$ will be ineligible for credit assignment, resulting in an opportunity cost. In fact, this exact phenomenon is why Watkins’ Q($\lambda$) [29] often learns more slowly than Peng’s Q($\lambda$) [15] even though the former avoids off-policy bias [23, 4, 9]. The same effect (but to a lesser extent) impacts Tree Backup and Retrace, where $c(s_{k},a_{k})\leq 1$ always, implying that the eligibilities for past experiences can never increase.

For this reason, it may be desirable to expend some additional computation in exchange for better decisions regarding credit assignment. Our principal contribution is the proposal and analysis of the following off-policy operator that encompasses this possibility:

(\mathcal{M}Q)(s,a)\coloneqq Q(s,a)+\mathbb{E}_{\mu}\Big[\sum_{t\geq 0}\gamma^{t}\beta(\mathcal{F}_{t})\delta^{\pi}_{t}\Big],   (2)

where each $\beta(\mathcal{F}_{t})$ is an arbitrary nonnegative coefficient that generally depends on the history $\mathcal{F}_{t}$. By analogy to TD(0) [22], we define $\beta(\mathcal{F}_{0})\coloneqq 1$. The principal goal of Section 3 is to characterize the values of $\beta(\mathcal{F}_{t})$ for $t\geq 1$ that lead to convergence in the policy evaluation and control settings.
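
For intuition, a sample-based estimate of the bracketed quantity in equation (2) can be computed from a single off-policy trajectory; averaging such estimates over many trajectories started from $(s,a)$ approximates $(\mathcal{M}Q)(s,a)$. The sketch below is our own illustration: the function name, the data layout, and the truncation at the end of a finite trajectory are assumptions, not part of the paper.

import numpy as np

def sampled_m_target(Q, trajectory, rewards, pi, gamma, beta_fn):
    # trajectory: [(s_0, a_0), (s_1, a_1), ...] sampled under mu; rewards[t] follows (s_t, a_t).
    # beta_fn(history) must return 1.0 for a length-1 history, matching beta(F_0) = 1.
    s0, a0 = trajectory[0]
    target = Q[s0, a0]
    for t in range(len(trajectory) - 1):       # a finite trajectory truncates the sum in (2)
        s, a = trajectory[t]
        s_next, _ = trajectory[t + 1]
        # TD error delta_t^pi for the target policy pi.
        delta = rewards[t] + gamma * np.dot(pi[s_next], Q[s_next]) - Q[s, a]
        target += gamma ** t * beta_fn(trajectory[:t + 1]) * delta
    return target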

The major analytical challenge of our operator, and its main novelty, is the complex dependence on the sequence $\mathcal{F}_{t}$. Our operator is therefore inherently non-Markov, in spite of the fact that the MDP dynamics are Markov by definition. Mathematically, this makes the $\mathcal{M}$ operator difficult to analyze, as the terms in the series $1+\beta(\mathcal{F}_{1})+\beta(\mathcal{F}_{2})+\cdots$ generally share no common factors that would allow a per-decision update for eligibility traces. Nevertheless, removing the Markov assumption is necessary to understand important existing algorithms (see Section 2.2) while also paving the way for new credit assignment algorithms.

A special case arises when $\beta(\mathcal{F}_{t})$ does factor into traces: that is, $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}c(s_{k},a_{k})$ for every history $\mathcal{F}_{t}$. Equation (2) therefore reduces to equation (1), taking us back to the Markov setting studied by [14]. As such, our $\mathcal{M}$ operator subsumes the $\mathcal{R}$ operator as a specific case, and our later theoretical developments have implications for Retrace and related off-policy algorithms. We discuss these implications in Section 4.

2.2 Example Algorithms

By weighting each TD error with an arbitrary history-dependent coefficient $\beta(\mathcal{F}_{t})$, the $\mathcal{M}$ operator is remarkably general, but this also makes it abstract. To help concretize it, we list several existing off-policy algorithms below and show how they can be expressed in the notation of our operator. This should aid with understanding and illuminate cases where our operator is necessary to explain convergence.

Importance Sampling (IS): $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\rho_{k}$.
The standard approach for correcting off-policy experience. Although it is the only unbiased estimator in this list, it suffers from high variance when $\mu(a_{k}|s_{k})\approx 0$, which makes it difficult to use effectively in practice.

Q($\lambda$) with Off-Policy Corrections [5]: $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\lambda$.
A straightforward algorithm that decays the TD errors by a fixed constant $\lambda\in[0,1]$. In the on-policy ($\mu=\pi$) policy evaluation case, this is equivalent to the TD($\lambda$) extension [22] of Expected Sarsa [23]. The algorithm does not require explicit knowledge of the behavior policy $\mu$, which is desirable; however, it is not convergent when $\pi$ and $\mu$ differ too much, and this criterion is restrictive in practice.

Tree Backup (TB) [16]: $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\lambda\pi(a_{k}|s_{k})$.
A method that automatically cuts traces according to the product of probabilities under the target policy $\pi$, which forms a conservative lower bound on the IS ratio product. As a result, TB converges for any behavior policy $\mu$, but it is not efficient since traces are cut excessively—especially in the on-policy case.

Retrace [14]: $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\lambda\min(1,\rho_{k})$.
A convergent algorithm for arbitrary policies $\pi$ and $\mu$ that remains efficient in the on-policy case because it does not cut traces (if $\lambda=1$); however, the fact that $\beta(\mathcal{F}_{t})$ is monotone non-increasing can cause the trace products to decay too quickly in practice [12, 17].

Non-Markov Retrace† [14]: $\beta(\mathcal{F}_{t})=\lambda\min(1,\beta(\mathcal{F}_{t-1})\rho_{t})$.
A modification to Retrace proposed by [14] and conjectured to lead to faster learning. It permits trace values larger than $1$ by relaxing the clipping bound when the historical trace product $\beta(\mathcal{F}_{t-1})$ is small (see Section 4.3). The algorithm is recursive, which makes it difficult to analyze, and its convergence for control is an open question.

Truncated Importance Sampling† [6]: $\beta(\mathcal{F}_{t})=\min(d_{t},\prod_{k=1}^{t}\rho_{k})$.
A simple but effective method to combat the variance of the IS estimator. For any value of $d_{t}\geq 0$, the obtained variance is finite. Variations of this algorithm have been applied in the reinforcement learning literature [e.g. 27, 31, 30, 28], but to our knowledge its convergence in an MDP setting has not been studied.

† Denotes a non-Markov algorithm that must be modeled by our $\mathcal{M}$ operator defined in equation (2).
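
For concreteness, the coefficient schedules above can be computed from the per-step probabilities $\pi(a_{k}|s_{k})$ and $\mu(a_{k}|s_{k})$ along a single trajectory. The sketch below is our own code; the names and default values are assumptions for illustration.

import numpy as np

def coefficients(pi_probs, mu_probs, lam=0.9, d=1.0):
    # pi_probs[k] = pi(a_k|s_k), mu_probs[k] = mu(a_k|s_k) for k = 1, ..., T.
    rho = np.asarray(pi_probs) / np.asarray(mu_probs)          # IS ratios rho_k
    betas = {
        "IS":           np.cumprod(rho),
        "Q(lambda)":    np.cumprod(np.full_like(rho, lam)),
        "Tree Backup":  np.cumprod(lam * np.asarray(pi_probs)),
        "Retrace":      np.cumprod(lam * np.minimum(1.0, rho)),
        "Truncated IS": np.minimum(d, np.cumprod(rho)),
    }
    nm = []                                                    # Non-Markov Retrace is recursive
    prev = 1.0                                                 # beta(F_0) = 1
    for r in rho:
        prev = lam * min(1.0, prev * r)
        nm.append(prev)
    betas["Non-Markov Retrace"] = np.array(nm)
    return betas

Note that every schedule here except Q($\lambda$) satisfies $\beta(\mathcal{F}_{t})\leq\prod_{k=1}^{t}\rho_{k}$, previewing the condition that drives the convergence results in Section 3.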

3 Analysis

In this section, we study the convergence properties of the $\mathcal{M}$ operator for policy evaluation and control. It will be convenient to re-express equation (2) in operator notation for the following analysis. First, we define an operator $B_{t}$ such that

(B_{t}Q)(s,a)\coloneqq\beta(\mathcal{F}_{t-1}\cup(s,a))Q(s,a).

Note that $B_{t}$ can be interpreted as a diagonal matrix where each element along the main diagonal is equal to $\beta(\mathcal{F}_{t})$ after hypothetically selecting the corresponding state-action pair $(s,a)$ in history $\mathcal{F}_{t-1}$. Furthermore, by our earlier definition of $\beta(\mathcal{F}_{0})=1$, we must have $B_{0}=I$, the identity matrix. With this, we can write the $\mathcal{M}$ operator as

\mathcal{M}Q=Q+\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}(T_{\pi}Q-Q).   (3)

3.1 Policy Evaluation

In this setting, we seek to estimate $Q^{\pi}$ for a fixed target policy $\pi\in\Pi$ from interactions conducted according to a fixed behavior policy $\mu\in\Pi$. Specifically, our goal is to prove that repeated application of the $\mathcal{M}$ operator to an arbitrarily initialized $Q$ converges to $Q^{\pi}$. We start by proving the following lemma:

Lemma 1.

Let $\mathcal{M}$ be the operator defined in (3). $Q^{\pi}$ is a fixed point of $\mathcal{M}$; the difference between $\mathcal{M}Q$ and $Q^{\pi}$ is given by

\mathcal{M}Q-Q^{\pi}=\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t-1}(B_{t-1}P_{\pi}-P_{\mu}B_{t})(Q-Q^{\pi}).   (4)

We present the proof in Appendix A.1. With this, we can introduce our main theorem for policy evaluation:

Theorem 1.

Suppose that $B_{t}\leq\prod_{k=1}^{t}\rho_{k}$ for any history $\mathcal{F}_{t-1}\cup(s_{t},a_{t})$. The operator $\mathcal{M}$ defined in (3) is a contraction mapping with $Q^{\pi}$ as its unique fixed point. That is,

\norm{\mathcal{M}Q-Q^{\pi}}\leq\gamma\norm{Q-Q^{\pi}},

and consequently $\lim_{k\to\infty}\mathcal{M}^{k}Q=Q^{\pi}$ for any $Q$.

Proof.

Lemma 1 already established that $Q^{\pi}$ is a fixed point of $\mathcal{M}$. We will show that $\mathcal{M}$ is a contraction mapping, which will guarantee that $Q^{\pi}$ is its unique fixed point.

Let $A\coloneqq\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t-1}(B_{t-1}P_{\pi}-P_{\mu}B_{t})$ and rewrite (4) as $\mathcal{M}Q-Q^{\pi}=A(Q-Q^{\pi})$. By our assumption that $B_{t}$ is diagonal and $B_{t}\leq\prod_{k=1}^{t}\rho_{k}$, the matrix $A$ has nonnegative elements (since $B_{t-1}P_{\pi}-P_{\mu}B_{t}\geq 0$). To complete the proof, we can show that the row sums of $A$ are never greater than $\gamma$. Equivalently, we will show that $Ae\leq\gamma e$:

Ae = \sum_{t\geq 1}\gamma^{t}P_{\mu}^{t-1}(B_{t-1}P_{\pi}-P_{\mu}B_{t})e
   = \sum_{t\geq 1}\gamma^{t}P_{\mu}^{t-1}(B_{t-1}e-P_{\mu}B_{t}e)
   = \gamma\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}e-\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t}B_{t}e
   = \gamma(e+S)-S
   = \gamma e-(1-\gamma)S,

where we have let $S\coloneqq\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t}B_{t}e$. Because $B_{t}\geq 0$, we have that $S\geq 0$ and hence $Ae\leq\gamma e$; thus, $A(Q-Q^{\pi})$ is a vector whose components each comprise a nonnegative-weighted combination of the components of $Q-Q^{\pi}$ where the weights add up to at most $\gamma$. This implies that

\norm{\mathcal{M}Q-Q^{\pi}}\leq\gamma\norm{Q-Q^{\pi}}

and the operator $\mathcal{M}$ is a contraction mapping. Its fixed point $Q^{\pi}$ established by Lemma 1 must therefore be unique, which implies that $\lim_{k\to\infty}\mathcal{M}^{k}Q=Q^{\pi}$ for any $Q$ when $\gamma<1$. ∎

We conclude with some remarks about the interpretation of our result. Note that when $\beta(\mathcal{F}_{t})=0$ for $t\geq 1$, we have the slowest contraction rate of $\gamma$, which corresponds to TD(0). When $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\rho_{k}$, we get an optimal contraction rate of $0$, which corresponds to the standard IS estimator. (Of course, this is only in expectation.) Between these two extremes lies a vast spectrum of convergent algorithms, which includes the examples mentioned in Section 2.2 as well as an infinitude of other possibilities. In Section 4.2, we discuss practical considerations for choosing $\beta(\mathcal{F}_{t})$ and how to efficiently implement history-dependent algorithms.

3.2 Control

We now consider the more-difficult setting of control. Given a sequence of target policies $(\pi_{k})_{k\geq 0}$, $\pi_{k}\in\Pi$, and a sequence of behavior policies $(\mu_{k})_{k\geq 0}$, $\mu_{k}\in\Pi$, we aim to show that the sequence of Q-functions $(Q_{k})$ given by $Q_{k+1}\coloneqq\mathcal{M}_{k}Q_{k}$ converges to $Q^{*}$. (Here, $\mathcal{M}_{k}$ is the $\mathcal{M}$ operator defined for $\pi_{k}$ and $\mu_{k}$.)

Unlike [14], we do not assume that the target policies are increasingly greedy with respect to $(Q_{k})$. Instead, we require only that the policies become greedy in the limit. We say that the sequence of policies $(\pi_{k})$ is greedy in the limit if and only if $T_{\pi_{k}}Q_{k}\to TQ_{k}$ as $k\to\infty$. Intuitively, this means that each policy need not be greedier than its predecessors, so long as the sequence of policies eventually does become greedy. We discuss the significance of this assumption more in Section 4.1.

Let $C\coloneqq\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}$. We first rewrite the $\mathcal{M}$ operator in (3) concisely as

\mathcal{M}Q=Q+C(T_{\pi}Q-Q).   (5)

The following lemma will be useful to show that $\mathcal{M}$ remains a contraction mapping in this new setting:

Lemma 2.

For any policy $\pi\in\Pi$,

\norm{I-C(I-\gamma P_{\pi})}\leq\gamma.   (6)

Once again, we present the proof in Appendix A.2. We now arrive at our main theoretical result for control:

Theorem 2.

Consider a sequence of target policies $(\pi_{k})$ and a sequence of arbitrary behavior policies $(\mu_{k})$. Let $Q_{0}$ be an arbitrarily initialized Q-function and define the sequence $Q_{k+1}\coloneqq\mathcal{M}_{k}Q_{k}$ where $\mathcal{M}_{k}$ is the operator defined in (3) for $\pi_{k}$ and $\mu_{k}$. Assume that $(\pi_{k})$ is greedy in the limit and let $\epsilon_{k}\in[0,1]$ be the smallest constant such that $T_{\pi_{k}}Q_{k}\geq TQ_{k}-\epsilon_{k}\norm{Q_{k}}e$. Then,

\norm{Q_{k+1}-Q^{*}}\leq\gamma\norm{Q_{k}-Q^{*}}+\frac{\epsilon_{k}}{1-\gamma}\norm{Q_{k}},

and consequently $\lim_{k\to\infty}Q_{k}=Q^{*}$.

Proof.

Part 1: Upper bound on $Q_{k+1}-Q^{*}$. First, note the following inequality:

T_{\pi_{k}}Q_{k}-TQ^{*}=\gamma P_{\pi_{k}}Q_{k}-\gamma\max_{\pi^{\prime}\in\Pi}P_{\pi^{\prime}}Q^{*}\leq\gamma P_{\pi_{k}}(Q_{k}-Q^{*}).

Then, from (5), we can deduce that

Q_{k+1}-Q^{*} = Q_{k}-Q^{*}+C(T_{\pi_{k}}Q_{k}-Q_{k})
             = (I-C)(Q_{k}-Q^{*})+C(T_{\pi_{k}}Q_{k}-Q^{*})   (7)
             = (I-C)(Q_{k}-Q^{*})+C(T_{\pi_{k}}Q_{k}-TQ^{*})
             \leq (I-C)(Q_{k}-Q^{*})+\gamma CP_{\pi_{k}}(Q_{k}-Q^{*})
             = (I-C(I-\gamma P_{\pi_{k}}))(Q_{k}-Q^{*}).   (8)

Part 2: Lower bound on $Q_{k+1}-Q^{*}$. Note the following inequality:

TQ_{k}-TQ^{*}\geq T_{\pi^{*}}Q_{k}-TQ^{*}=\gamma P_{\pi^{*}}(Q_{k}-Q^{*}).

Additionally, for each policy $\pi_{k}$, there exists some $\epsilon_{k}\in[0,1]$ such that $T_{\pi_{k}}Q_{k}\geq TQ_{k}-\epsilon_{k}\norm{Q_{k}}e$. (The inequality is vacuously true when $\epsilon_{k}=1$, but recall that Theorem 2 defines $\epsilon_{k}$ to be as small as possible.) Starting again from equation (7), and noting that the elements of $C$ are nonnegative,

Q_{k+1}-Q^{*} = (I-C)(Q_{k}-Q^{*})+C(T_{\pi_{k}}Q_{k}-Q^{*})
             \geq (I-C)(Q_{k}-Q^{*})+C(TQ_{k}-Q^{*})-\epsilon_{k}\norm{Q_{k}}Ce
             = (I-C)(Q_{k}-Q^{*})+C(TQ_{k}-TQ^{*})-\epsilon_{k}\norm{Q_{k}}Ce
             \geq (I-C)(Q_{k}-Q^{*})+\gamma CP_{\pi^{*}}(Q_{k}-Q^{*})-\epsilon_{k}\norm{Q_{k}}Ce
             = (I-C(I-\gamma P_{\pi^{*}}))(Q_{k}-Q^{*})-\epsilon_{k}\norm{Q_{k}}Ce.   (9)

Part 3: Conclusion. When $Q_{k+1}-Q^{*}\geq 0$, inequality (8) and Lemma 2 imply

\norm{Q_{k+1}-Q^{*}}\leq\norm{I-C(I-\gamma P_{\pi_{k}})}\norm{Q_{k}-Q^{*}}\leq\gamma\norm{Q_{k}-Q^{*}}.   (10)

Additionally, when $Q_{k+1}-Q^{*}\leq 0$, inequality (9) and Lemma 2 imply

\norm{Q_{k+1}-Q^{*}} \leq \norm{I-C(I-\gamma P_{\pi^{*}})}\norm{Q_{k}-Q^{*}}+\epsilon_{k}\norm{Q_{k}}\norm{C}
                     \leq \gamma\norm{Q_{k}-Q^{*}}+\frac{\epsilon_{k}}{1-\gamma}\norm{Q_{k}},   (11)

because $\norm{C}\leq\sum_{t\geq 0}\gamma^{t}\norm{P_{\mu}^{t}B_{t}}\leq(1-\gamma)^{-1}$. Since (11) is more conservative than (10), its bound holds regardless of the sign of $Q_{k+1}-Q^{*}$. It remains to be shown that this bound implies convergence to $Q^{*}$. Observe that

\gamma\norm{Q_{k}-Q^{*}}+\frac{\epsilon_{k}}{1-\gamma}\norm{Q_{k}} \leq \gamma\norm{Q_{k}-Q^{*}}+\frac{\epsilon_{k}}{1-\gamma}(\norm{Q_{k}-Q^{*}}+\norm{Q^{*}})
                                                                   = \left(\gamma+\frac{\epsilon_{k}}{1-\gamma}\right)\norm{Q_{k}-Q^{*}}+\frac{\epsilon_{k}}{1-\gamma}\norm{Q^{*}}.

Our assumption of greedy-in-the-limit policies tells us that $\epsilon_{k}\to 0$ as $k\to\infty$; there must exist some iteration $k^{\prime}$ such that $\epsilon_{k}\leq\frac{1}{2}(1-\gamma)^{2}$ for all $k\geq k^{\prime}$. Therefore, for $k\geq k^{\prime}$,

\norm{Q_{k+1}-Q^{*}}\leq\frac{1+\gamma}{2}\norm{Q_{k}-Q^{*}}+\frac{\epsilon_{k}}{1-\gamma}\norm{Q^{*}}.

If $\gamma<1$, then $\frac{1}{2}(1+\gamma)<1$. Since $\norm{Q^{*}}$ is finite, we conclude that $Q_{k}\to Q^{*}$ as $\epsilon_{k}\to 0$. ∎

We see that the convergence criterion of $\beta(\mathcal{F}_{t})\leq\prod_{k=1}^{t}\rho_{k}$ in the control setting is the same as that for policy evaluation. In fact, the only additional assumption we needed is that of greedy-in-the-limit target policies. Crucially, the proof allows arbitrary behavior policies and arbitrary initialization of the Q-function, which we discuss further in the next section.

4 Discussion

Despite weighting TD errors by general coefficients $\beta(\mathcal{F}_{t})\in[0,\prod_{k=1}^{t}\rho_{k}]$ that could be chosen by any arbitrary selection criteria, the $\mathcal{M}$ operator converges for both policy evaluation and control with few assumptions. In this section, we summarize our theoretical contributions and discuss the choice of coefficients, practical implementations, and the convergence of specific algorithms from Section 2.2.

4.1 Theoretical Contributions

Removal of Markov assumption. As we stated in the introduction, removing the Markov assumption of the $\mathcal{R}$ operator [14] was our primary theoretical goal. With Markov traces, the trace applied at each timestep does not depend on the history, allowing the sum $C=\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}$ to be reduced to $\sum_{t\geq 0}\gamma^{t}P_{c\mu}^{t}$ for the linear operator $(P_{c\mu}Q)(s,a)\coloneqq\sum_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}P(s^{\prime}|s,a)\mu(a^{\prime}|s^{\prime})c(s^{\prime},a^{\prime})Q(s^{\prime},a^{\prime})$. The resulting geometric series can then be evaluated analytically: $(I-\gamma P_{c\mu})^{-1}$. This technique was used by [14]. In our proofs, we avoided the Markov assumption by directly analyzing the infinite summation $C$, which generally does not have a closed-form expression. We believe our work is the first to do this, and it should have far-reaching consequences for existing on- and off-policy algorithms (see Sections 4.3 and 5) as well as yet-undiscovered algorithms.

Arbitrary initialization of Q-function. Our proof of Theorem 2 permits any initialization of $Q_{0}$ in the control setting. In contrast, [14] made the assumption that $T_{\pi_{0}}Q_{0}-Q_{0}\geq 0$ in order to produce a lower bound on $\mathcal{R}Q_{k+1}-Q^{*}$. This is accomplished in practice by a pessimistic initialization of the Q-function: $Q_{0}(s,a)=-\norm{R}/(1-\gamma)$, $\forall(s,a)\in\mathcal{S}\times\mathcal{A}$. Since $\mathcal{R}$ is a special case of our operator $\mathcal{M}$ where each $\beta(\mathcal{F}_{t})$ factors into Markov traces, we deduce as a corollary that Retrace and all other algorithms described by $\mathcal{R}$ do not require this pessimistic initialization for convergence.

Greedy-in-the-limit policies. Our requirement of greedy-in-the-limit target policies for the control setting is significantly less restrictive than the increasingly greedy policies proposed by [14]. We need only that $\lim_{k\to\infty}T_{\pi_{k}}Q_{k}=TQ_{k}$, and we do not force the sequence of target policies to satisfy $P_{\pi_{k+1}}Q_{k+1}\geq P_{\pi_{k}}Q_{k+1}$. This implies that the agent may target non-greedy policies for an arbitrarily long period of time, as long as the policies do eventually become arbitrarily close to the greedy policy. This could lead to faster learning since targeting greedy policies causes excessive trace cutting (e.g. Watkins’ Q($\lambda$)). Our greedy-in-the-limit policies are closely related to the Greedy-in-the-Limit with Infinite Exploration (GLIE) assumption of [19]; however, we are able to remove the need for infinite exploration by allowing an arbitrary sequence of behavior policies, as did [14]. Once again, our theory implies as a corollary that neither increasingly greedy policies nor the GLIE assumption is strictly necessary for the convergence of Retrace and other Markov algorithms.

Algorithm 1 Non-Markov Eligibility Trace
Initialize $Q(s,a)$ arbitrarily
Select learning rate $\alpha\in(0,1]$
for each episode do
     Initialize environment state $s_{0}$
     $\mathcal{F}\leftarrow\{\}$
     repeat for $t=0,1,2,\dots$
          Take action $a_{t}\sim\mu(s_{t})$, observe reward $r_{t}$ and next state $s_{t+1}$
          $\mathcal{F}\leftarrow\mathcal{F}\cup(s_{t},a_{t})$
          $\delta^{\pi}_{t}\leftarrow r_{t}+\gamma\sum_{a^{\prime}\in\mathcal{A}}\pi(a^{\prime}|s_{t+1})Q(s_{t+1},a^{\prime})-Q(s_{t},a_{t})$
          For $k=0,\ldots,t$: $Q(s_{k},a_{k})\leftarrow Q(s_{k},a_{k})+\alpha\gamma^{t-k}\beta(\mathcal{F}_{k:t})\delta^{\pi}_{t}$ \triangleright Let $\mathcal{F}_{k:t}\coloneqq\mathcal{F}_{t}\setminus\mathcal{F}_{k-1}$, $\beta(\mathcal{F}_{t:t})\coloneqq 1$
     until $s_{t+1}$ is terminal
end for
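
The following is a direct Python rendering of Algorithm 1 for a tabular problem. It is only a sketch: the environment interface (a hypothetical reset()/step() returning a (next_state, reward, done) tuple), the array layout of $\pi$ and $\mu$, and the beta_fn signature are our own assumptions.

import numpy as np

def non_markov_eligibility_trace(env, pi, mu, beta_fn, n_states, n_actions,
                                 gamma=0.99, alpha=0.1, episodes=100):
    # pi, mu: arrays of shape (n_states, n_actions) with action probabilities.
    # beta_fn(history) returns beta(F_{k:t}) for the slice [(s_k, a_k), ..., (s_t, a_t)];
    # by convention it returns 1.0 for a length-1 slice (beta(F_{t:t}) = 1).
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        history = []                            # F: every (s_k, a_k) visited this episode
        t, done = 0, False
        while not done:
            a = np.random.choice(n_actions, p=mu[s])
            s_next, r, done = env.step(a)
            history.append((s, a))
            # TD error for the target policy pi (bootstrap is zero at terminal states).
            bootstrap = 0.0 if done else np.dot(pi[s_next], Q[s_next])
            delta = r + gamma * bootstrap - Q[s, a]
            # Broadcast the error to every earlier visit, weighted by beta(F_{k:t}).
            for k, (s_k, a_k) in enumerate(history):
                beta = 1.0 if k == t else beta_fn(history[k:])
                Q[s_k, a_k] += alpha * gamma ** (t - k) * beta * delta
            s = s_next
            t += 1
    return Q

Because the history is re-scanned at every step, the per-step cost grows linearly with the episode length, consistent with the computational remarks in Section 4.2.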

4.2 Practical Concerns

Implementation with eligibility traces. The fact that the coefficient $\beta(\mathcal{F}_{t})$ does not generally factor into traces $c(s_{1},a_{1}),\dots,c(s_{t},a_{t})$ means that eligibility traces in the traditional sense are not possible. The problem occurs when the same state-action pair appears more than once in $\mathcal{F}_{t}$; afterwards, there does not exist a constant factor by which to decay the eligibility for that state-action pair that would produce the correct updates according to (3). We can rectify this by modifying the eligibility trace to remember each individual occurrence of the state-action pairs: a Non-Markov Eligibility Trace (see Algorithm 1). By tracking a separate eligibility for each repeated visitation, the algorithm is able to generate the correct coefficients $\beta(\mathcal{F}_{t})$ even though they do not factor into traces. In terms of computation, our algorithm is efficient when episodes are not extremely long, and is reminiscent of the practical linked-list implementation of the eligibility trace described by [23].

Online convergence. Algorithm 1 is presented as an online eligibility-trace algorithm. In practice, we expect that the updates will be calculated and accumulated offline, akin to the offline $\lambda$-return algorithm [23], for the sake of efficiency and compatibility with deep function approximation [4]. As such, the analysis of the online variant is beyond the scope of our work and we leave its convergence as an open question. Nevertheless, the fact that the expected $\mathcal{M}$ operator is convergent would make it surprising (although not impossible) that the online version diverges in the presence of zero-mean, finite-variance noise with appropriately annealed stepsizes [3].

Choice of coefficients $\beta(\mathcal{F}_{t})$. Our theorems guarantee convergence provided that $\beta(\mathcal{F}_{t})\leq\prod_{k=1}^{t}\rho_{k}$ for every $\mathcal{F}_{t}$. Of course, not all choices of $\beta$ will perform well in practice. At one extreme, when $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\rho_{k}$ for every $\mathcal{F}_{t}$, we recover the standard IS estimator, which we know suffers from high variance and is often ineffectual. At the other extreme, if $\beta(\mathcal{F}_{t})<\prod_{k=1}^{t}\pi(a_{k}|s_{k})$ for every $\mathcal{F}_{t}$, then we have a method that cuts traces more than Tree Backup does; it seems unlikely that such a method would learn faster than Tree Backup, making it difficult to justify the extra complexity of the non-Markov coefficients in this case.

We therefore see that while it is important to control the overall variance of the coefficients $\beta(\mathcal{F}_{t})$, it is also important to maintain some minimum efficiency by avoiding unnecessary trace cuts, and to leverage the non-Markov capabilities of the operator. Effective algorithms will likely compute the full IS ratio product $\prod_{k=1}^{t}\rho_{k}$ first and then apply some nonlinear transformation (e.g. clipping) to control the variance. This ensures that the coefficients are cut only when necessary.

One final yet pertinent consideration is the discounted sum of coefficients $\sum_{t\geq 0}\gamma^{t}\beta(\mathcal{F}_{t})$. Roughly speaking, this sum represents the potential magnitude of an update to a state-action pair. If its value is extremely large along some trajectories with nonzero probability of occurrence, then the algorithm may not be stable. This suggests that a constant bound on each $\beta(\mathcal{F}_{t})$, or a recency heuristic that guarantees $\beta(\mathcal{F}_{t})\geq\beta(\mathcal{F}_{t+1})$, may be desirable to control the update magnitude.
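
As a small illustration of both considerations (our own heuristic sketch, not a method prescribed by the paper), one can clip the running IS product to obtain coefficients that still satisfy the convergence condition, and monitor the discounted coefficient sum as a rough stability indicator:

import numpy as np

def clipped_is_coefficients(rhos, clip=2.0):
    # rhos: IS ratios rho_1, ..., rho_T along a trajectory.
    return np.minimum(clip, np.cumprod(rhos))     # beta(F_t) <= prod_k rho_k still holds

def discounted_coefficient_sum(betas, gamma):
    t = np.arange(1, len(betas) + 1)
    return 1.0 + np.sum(gamma ** t * betas)       # the leading 1 accounts for beta(F_0) = 1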

4.3 Convergence for Specific Algorithms

In Section 2.2, we mentioned that some algorithms which cannot be modeled by the $\mathcal{R}$ operator can, in fact, be modeled by the $\mathcal{M}$ operator. It remains to show that their coefficients always satisfy our condition of $\beta(\mathcal{F}_{t})\leq\prod_{k=1}^{t}\rho_{k}$ to prove that they converge according to Theorems 1 and 2.

Truncated IS. Clearly, $\min(d_{t},\prod_{k=1}^{t}\rho_{k})\leq\prod_{k=1}^{t}\rho_{k}$ for any choice of $d_{t}$, hence the algorithm converges.

Non-Markov Retrace. [14] define the traces for the algorithm as $c(s_{t},a_{t})=\lambda\min(1/\beta(\mathcal{F}_{t-1}),\rho_{t})$, so $\beta(\mathcal{F}_{t-1})=\prod_{k=1}^{t-1}c(s_{k},a_{k})$. This means that

\beta(\mathcal{F}_{t})=\beta(\mathcal{F}_{t-1})c(s_{t},a_{t})=\lambda\min\left(\frac{1}{\beta(\mathcal{F}_{t-1})},\rho_{t}\right)\beta(\mathcal{F}_{t-1})=\lambda\min\left(1,\beta(\mathcal{F}_{t-1})\rho_{t}\right).

We can now prove by induction that the required bound always holds. The base case, $\beta(\mathcal{F}_{0})=1$ by definition, clearly satisfies the bound. For the inductive step, assume that $\beta(\mathcal{F}_{t-1})\leq\prod_{k=1}^{t-1}\rho_{k}$. Then, by hypothesis,

\beta(\mathcal{F}_{t})=\lambda\min\left(1,\beta(\mathcal{F}_{t-1})\rho_{t}\right)\leq\lambda\min\left(1,\prod_{k=1}^{t}\rho_{k}\right)\leq\prod_{k=1}^{t}\rho_{k},

and the non-Markov variant of Retrace therefore converges.
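
As a quick numerical illustration of this induction (our own sanity check, not part of the paper), the recursion can be unrolled and the bound verified directly:

import numpy as np

rng = np.random.default_rng(1)
lam = 1.0
rhos = rng.uniform(0.1, 5.0, size=50)             # arbitrary positive IS ratios

beta, is_prod = 1.0, 1.0                          # beta(F_0) = 1
for rho in rhos:
    beta = lam * min(1.0, beta * rho)             # Non-Markov Retrace recursion
    is_prod *= rho                                # full IS product
    assert beta <= is_prod + 1e-12                # the bound required by Theorems 1 and 2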

5 Conclusion

Although per-decision traces are convenient from a computational perspective, they are not expressive enough to implement arbitrary off-policy corrections, and their inability to reverse previously cut traces can lead to inefficient learning. By removing the assumption of Markov (i.e. per-decision) traces, our $\mathcal{M}$ operator enables off-policy algorithms that jointly consider all of the agent’s past experiences when assigning credit. This provides myriad opportunities for efficient off-policy algorithms and offers the first convergence guarantees for many previous algorithms.

The $\mathcal{M}$ operator is convergent for both policy evaluation and control. In the latter setting, our proof removes the assumptions of increasingly greedy policies and pessimistic initialization. This has significant implications for Retrace, the other algorithms discussed in Section 2.2, Watkins’ Q($\lambda$), and more.

The generality of the $\mathcal{M}$ operator means that it provides convergence guarantees for a number of TD methods that we were not able to discuss in the main text. These include methods with variable discount factors or $\lambda$-values [e.g. 29, 23, 11, 25, 12, 24], and even history-dependent $\lambda$-values [e.g. 20, 32]. All of these can be expressed in a common form: $\beta(\mathcal{F}_{t})=\prod_{k=1}^{t}\gamma(\mathcal{F}_{k})\lambda(\mathcal{F}_{k})\rho_{k}$ where $\gamma(\mathcal{F}_{k})\in[0,1]$ and $\lambda(\mathcal{F}_{k})\in[0,1]$. Clearly, $\beta(\mathcal{F}_{t})\leq\prod_{k=1}^{t}\rho_{k}$, and we conclude that these methods converge.

An interesting variant of our theory is one where the coefficient-determining function $\beta$ is not fixed but instead varies with time. In this case, the operator $\mathcal{M}$ becomes non-stationary; each diagonal matrix $B_{t}$ in equation (3) depends not only on the number of timesteps $t$ since the action was taken, but also on the time $t^{\prime}$ when the action took place. Such instances may arise, for example, when coefficients depend on information external to the trajectory $\mathcal{F}_{t}$, such as the number of previous visitations to the state-action pair [e.g. 26]. Provided that the same condition $B_{t}\leq\prod_{k=1}^{t}\rho_{k}$ holds for all $t^{\prime}$, the proofs of Theorems 1 and 2 should go through with no modifications.

Important open questions remain. Namely, do the convergence guarantees established here hold for online algorithms? And, more practically, what non-Markov algorithms exist that lead to reliably efficient and stable learning? Interesting possibilities with regard to both of these questions are recursive methods like Non-Markov Retrace, which are easier to implement and to analyze according to our theoretical framework. In conjunction, our earlier considerations for choosing coefficients (Section 4.2) are pertinent. These should be interesting starting points for future developments in off-policy learning.

References

  • [1] Andrew G Barto, Richard S Sutton, and Charles W Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, pages 834–846, 1983.
  • [2] Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • [3] Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • [4] Brett Daley and Christopher Amato. Reconciling $\lambda$-returns with experience replay. In Advances in Neural Information Processing Systems, pages 1133–1142, 2019.
  • [5] Anna Harutyunyan, Marc G Bellemare, Tom Stepleton, and Rémi Munos. Q($\lambda$) with off-policy corrections. In International Conference on Algorithmic Learning Theory, pages 305–320, 2016.
  • [6] Edward L Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.
  • [7] Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In COLT, pages 142–147, 2000.
  • [8] A Harry Klopf. Brain function and adaptive systems: A heterostatic theory. Technical report, Air Force Cambridge Research Labs, Hanscom AFB, MA, 1972.
  • [9] Tadashi Kozuno, Yunhao Tang, Mark Rowland, Rémi Munos, Steven Kapturowski, Will Dabney, Michal Valko, and David Abel. Revisiting Peng’s Q($\lambda$) for modern reinforcement learning. arXiv:2103.00107, 2021.
  • [10] Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and reaching. Machine Learning, 8(3-4):293–321, 1992.
  • [11] Hamid Reza Maei and Richard S Sutton. GQ($\lambda$): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the Third Conference on Artificial General Intelligence, pages 91–96, 2010.
  • [12] Ashique Rupam Mahmood, Huizhen Yu, and Richard S Sutton. Multi-step off-policy learning without importance sampling ratios. arXiv:1702.03006, 2017.
  • [13] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, et al. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [14] Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc G Bellemare. Safe and efficient off-policy reinforcement learning. arXiv:1606.02647, 2016.
  • [15] Jing Peng and Ronald J Williams. Incremental multi-step Q-Learning. Machine Learning, pages 226–232, 1994.
  • [16] Doina Precup, Richard S Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the Seventeenth International Conference on Machine Learning, page 759–766, 2000.
  • [17] Mark Rowland, Will Dabney, and Rémi Munos. Adaptive trade-offs in off-policy learning. In International Conference on Artificial Intelligence and Statistics, pages 34–44, 2020.
  • [18] Gavin A Rummery and Mahesan Niranjan. On-line Q-Learning using connectionist systems. Technical report, Cambridge University, 1994.
  • [19] Satinder Singh, Tommi Jaakkola, Michael L Littman, and Csaba Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine learning, 38(3):287–308, 2000.
  • [20] Satinder P Singh and Richard S Sutton. Reinforcement learning with replacing eligibility traces. Machine learning, 22(1):123–158, 1996.
  • [21] Richard S Sutton. Temporal Credit Assignment in Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 1984.
  • [22] Richard S Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • [23] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1st edition, 1998.
  • [24] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  • [25] Richard S Sutton, Ashique Rupam Mahmood, Doina Precup, and Hado van Hasselt. A new Q($\lambda$) with interim forward view and Monte Carlo equivalence. In International Conference on Machine Learning, pages 568–576, 2014.
  • [26] Richard S Sutton and Satinder P Singh. On step-size and bias in temporal-difference learning. In Proceedings of the Eighth Yale Workshop on Adaptive and Learning Systems, pages 91–96, 1994.
  • [27] Eiji Uchibe and Kenji Doya. Competitive-cooperative-concurrent reinforcement learning with importance sampling. In Proceedings of International Conference on Simulation of Adaptive Behavior: From Animals and Animats, pages 287–296, 2004.
  • [28] Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. arXiv:1611.01224, 2016.
  • [29] Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, Cambridge, 1989.
  • [30] Paweł Wawrzyński. Real-time reinforcement learning by sequential actor-critics and experience replay. Neural Networks, 22(10):1484–1497, 2009.
  • [31] Pawel Wawrzynski and Andrzej Pacut. Truncated importance sampling for reinforcement learning with experience replay. Proc. CSIT Int. Multiconf, pages 305–315, 2007.
  • [32] Huizhen Yu, A Rupam Mahmood, and Richard S Sutton. On generalized Bellman equations and temporal-difference learning. The Journal of Machine Learning Research, 19(1):1864–1912, 2018.

Appendix A Proofs

A.1 Proof of Lemma 1

Proof.

We begin by rewriting the operator $\mathcal{M}$ in (3):

\mathcal{M}Q = Q+\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}(T_{\pi}Q-Q)
             = Q+\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}(T_{\pi}Q-R)+\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}(R-Q)
             = R+\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}(T_{\pi}Q-R)+\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t}B_{t}(R-Q).

It is evident from (3) that $Q^{\pi}$ is a fixed point of $\mathcal{M}$ because $Q=Q^{\pi}\implies T_{\pi}Q-Q=0$. Therefore,

\mathcal{M}Q-Q^{\pi} = \mathcal{M}Q-\mathcal{M}Q^{\pi}
                     = \sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}(T_{\pi}Q-T_{\pi}Q^{\pi})-\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t}B_{t}(Q-Q^{\pi})
                     = \sum_{t\geq 0}\gamma^{t+1}P_{\mu}^{t}B_{t}P_{\pi}(Q-Q^{\pi})-\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t}B_{t}(Q-Q^{\pi})
                     = \sum_{t\geq 1}\gamma^{t}P_{\mu}^{t-1}B_{t-1}P_{\pi}(Q-Q^{\pi})-\sum_{t\geq 1}\gamma^{t}P_{\mu}^{t}B_{t}(Q-Q^{\pi})
                     = \sum_{t\geq 1}\gamma^{t}P_{\mu}^{t-1}(B_{t-1}P_{\pi}-P_{\mu}B_{t})(Q-Q^{\pi}),

which is the desired quantity. ∎

A.2 Proof of Lemma 2

Proof.

We start with the negative of the original matrix expression: $C(I-\gamma P_{\pi})-I=(C-I)-\gamma CP_{\pi}$. This expression is affine in $C$; we can check its two extremal values to determine the norm.

Recall that $C=\sum_{t\geq 0}\gamma^{t}P_{\mu}^{t}B_{t}$, a nonnegative matrix. One extreme occurs when $B_{t}=0$ for $t\geq 1$ and hence $C=I$; substituting this into the expression, we obtain $-\gamma P_{\pi}$ and deduce that $\norm{C(I-\gamma P_{\pi})-I}=\gamma$ when $C=I$.

The other extreme occurs when each $B_{t}$ is maximized. By our criteria for $B_{t}$ from Theorem 1, we must have $P_{\mu}^{t}B_{t}\leq P_{\pi}^{t}$ and therefore $C\leq\sum_{t\geq 0}\gamma^{t}P_{\pi}^{t}=(I-\gamma P_{\pi})^{-1}$. We can also deduce in this case that

C-I\leq(\gamma P_{\pi}+\gamma^{2}P_{\pi}^{2}+\cdots)=\gamma(I+\gamma P_{\pi}+\cdots)P_{\pi}=\gamma(I-\gamma P_{\pi})^{-1}P_{\pi}.

Substituting these into the expression, we obtain $\gamma(I-\gamma P_{\pi})^{-1}P_{\pi}-\gamma(I-\gamma P_{\pi})^{-1}P_{\pi}=0$ when $C=(I-\gamma P_{\pi})^{-1}$.

Combining these two cases, we conclude that $\norm{I-C(I-\gamma P_{\pi})}=\norm{C(I-\gamma P_{\pi})-I}\leq\gamma$. ∎