
Taylor Expansions of Discount Factors

Yunhao Tang    Mark Rowland    Rémi Munos    Michal Valko
Abstract

In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective. In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors. Our analysis suggests new ways for estimating value functions and performing policy optimization updates, which demonstrate empirical performance gains. This framework also leads to new insights on commonly-used deep RL heuristic modifications to policy optimization algorithms.


1 Introduction

One of the most popular models for reinforcement learning (RL) is the Markov decision process (MDP) with exponential discounting over an infinite horizon (Sutton and Barto, 2018; Puterman, 2014), with discounted objectives of the following form

V_{\gamma}^{\pi}(x)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\;\middle|\;x_{0}=x\right].

Discounted models enjoy favorable theoretical properties, and are also the foundation of many practical RL algorithms with strong empirical performance (see, e.g., Mnih et al., 2015; Schulman et al., 2015a; Lillicrap et al., 2015; Schulman et al., 2017). However, in most applications of RL, the objective of interest is the expected undiscounted cumulative return,

\mathbb{E}_{\pi}\left[\sum_{t=0}^{T}r_{t}\;\middle|\;x_{0}=x\right], \qquad (1)

where $T<\infty$ is a (possibly random) evaluation horizon, which usually also marks the end of the trajectory. For example, $T$ could be the first time the MDP enters a terminal state (e.g., a robot falls); when the MDP does not have a natural terminal state, $T$ could be enforced as a deterministic horizon. This creates a technical gap between algorithmic developments and implementations: it is tempting to design algorithms that optimize $V_{\gamma}^{\pi}(x)$; however, further heuristics are often needed to obtain strong practical performance. This issue manifests itself in the policy gradient (PG) theorem (Sutton et al., 2000). Let $\pi_{\theta}$ be a parameterized policy. The policy gradient $\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x)$ is computed as

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x\right]. \qquad (2)

However, practical implementations of PG updates usually omit the discount factors (see, for example, the high-quality open-source packages of Dhariwal et al., 2017; Achiam and OpenAI, 2018), leading to an approximate gradient of the form

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T}Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x\right]. \qquad (3)

Most prior work on PG algorithms relies on this heuristic update to work properly in deep RL applications. The intuitive argument for dropping the factor $\gamma^{t}$ is that Eqn (2) optimizes $V_{\gamma}^{\pi_{\theta}}(x)$, which is very myopic compared to the objective in Eqn (1). Consequently, the exponential discount $\gamma^{t}$ is too aggressive for weighting updates with large $t$. As a concrete example, in many MuJoCo control tasks (Brockman et al., 2016), the most commonly used discount factor is $\gamma=0.99$. This leads to an effective horizon of $\frac{1}{1-\gamma}=100$, which is much smaller than the evaluation horizon $T=1000$. This technical gap between theory and practice has been alluded to previously (e.g., by O’Donoghue et al., 2016) and is explicitly discussed by Nota and Thomas (2019).

To bypass this gap, a straightforward solution would be to naïvely increase the discount factor to $\gamma\geq 1-\frac{1}{T}$ and apply the PG in Eqn (2). In the example above, this implies using $\gamma\geq 0.999$. Unfortunately, this rarely works well in practice, as we will also see in experiments. The failure might be due to the higher variance of the estimation (Schulman et al., 2015b) or the collapse of the action gaps (Lehnert et al., 2018; Laroche and van Seijen, 2018), which is aggravated when combined with function approximation.

Nevertheless, as a theoretical framework, it is insightful to emulate the undiscounted objective in Eqn (1) using the (un)discounted objective $V_{\gamma'}^{\pi}(x)$ with $\gamma'\geq 1-\frac{1}{T}$. To build intuition about this approximation, note that when the time step is small, $t\ll T$, the multiplicative factor $(\gamma')^{t}\approx 1$ and the cumulative rewards are almost undiscounted; even when $t=T$, we have $(\gamma')^{t}\geq(1-\frac{1}{T})^{T}\approx\frac{1}{e}\gg 0$. Overall, this is a much more accurate approximation than $V_{\gamma}^{\pi}(x)$. This naturally prompts us to answer the following general question: how do we evaluate and optimize $V_{\gamma'}^{\pi}(x)$ with estimates built for $V_{\gamma}^{\pi}(x)$, where $0<\gamma<\gamma'\leq 1$?
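
To make this contrast concrete with the numbers above ($\gamma=0.99$, $\gamma'=0.999$, $T=1000$), the discount applied to a reward at the evaluation horizon differs by roughly four orders of magnitude:

\gamma^{T}=0.99^{1000}\approx e^{-10}\approx 4\times 10^{-5},\qquad(\gamma')^{T}=0.999^{1000}\approx e^{-1}\approx 0.37.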

Main idea.

We study the relation between $V_{\gamma}^{\pi}(x)$ and $V_{\gamma'}^{\pi}(x)$ via Taylor expansions. In Section 3, we identify a family of interpolating objectives between the more myopic objective $V_{\gamma}^{\pi}(x)$ and the true objective of interest $V_{\gamma'}^{\pi}(x)$. In Section 4, we start with insights on why the heuristic in Eqn (3) might be useful in practice. Then, we apply Taylor expansions directly to the heuristic updates to arrive at a family of interpolating updates. In Section 5, we build on the theoretical insights to derive improvements to established deep RL algorithms. We show their performance gains in Section 7.

2 Background

Consider the setup of an MDP. At any discrete time $t\geq 0$, the agent is in state $x_{t}\in\mathcal{X}$, takes an action $a_{t}\in\mathcal{A}$, receives an instant reward $r_{t}=r(x_{t},a_{t})\in[0,R_{\text{max}}]$ and transitions to a next state $x_{t+1}\sim p(\cdot|x_{t},a_{t})$. For simplicity, we assume $r(x,a)$ to be deterministic. Let the policy $\pi:\mathcal{X}\rightarrow\mathcal{P}(\mathcal{A})$ be a mapping from states to distributions over actions. Let $\gamma\in[0,1)$ be a discount factor, and define the Q-function $Q_{\gamma}^{\pi}(x,a)\coloneqq\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\;\middle|\;x_{0}=x,a_{0}=a\right]$ and the value function $V_{\gamma}^{\pi}(x)\coloneqq\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\;\middle|\;x_{0}=x\right]$. We also define the advantage function $A_{\gamma}^{\pi}(x,a)\coloneqq Q_{\gamma}^{\pi}(x,a)-V_{\gamma}^{\pi}(x)$. Here, $\mathbb{E}_{\pi}\left[\cdot\right]$ denotes that the trajectories $(x_{t},a_{t},r_{t})_{t=0}^{\infty}$ are generated under policy $\pi$. Throughout the paper, we use the subscript $\gamma$ to emphasize that RL quantities implicitly depend on discount factors.

2.1 Linear programs for reinforcement learning

Henceforth, we assume all vectors to be column vectors. The value function $V_{\gamma}^{\pi}$ satisfies the Bellman equation $V_{\gamma}^{\pi}(x)=\mathbb{E}_{\pi}\left[r(x,a)+\gamma V_{\gamma}^{\pi}(x')\mid x_{0}=x\right]$ (Bellman, 1957). Such equations can be encoded into a linear program (LP) (De Farias and Van Roy, 2003; Puterman, 2014). Let $V\in\mathbb{R}^{\mathcal{X}}$ be the primal variables and consider the following LP,

\max\ \delta_{x}^{T}V,\ \ V=r^{\pi}+\gamma P^{\pi}V, \qquad (4)

where $r^{\pi}\in\mathbb{R}^{\mathcal{X}}$ is the state-dependent reward $r^{\pi}(x')\coloneqq\sum_{a'}\pi(a'|x')r(x',a')$ and $P^{\pi}\in\mathbb{R}^{\mathcal{X}\times\mathcal{X}}$ is the transition matrix under $\pi$. Here, $\delta_{x}\in\mathbb{R}^{\mathcal{X}}$ encodes the one-hot distribution (Dirac) at $x$. Similar results hold for the LP objective $v^{T}V$ with a general distribution $v\in\mathcal{P}(\mathcal{X})$. It then follows that the optimal solution to the above LP is $V^{\ast}=V_{\gamma}^{\pi}$. Now, consider the dual LP to Eqn (4), with dual variables $d\in\mathbb{R}^{\mathcal{X}}$,

\min\ (1-\gamma)^{-1}(r^{\pi})^{T}d,\ \ d=(1-\gamma)\delta_{x}+\gamma(P^{\pi})^{T}d. \qquad (5)

The optimal solution to the dual program has a natural probabilistic interpretation. It is the discounted visitation distribution $d_{x,\gamma}^{\pi}$ under policy $\pi$ with starting state $x$, defined as $d_{x,\gamma}^{\pi}(x')\coloneqq(1-\gamma)\sum_{t\geq 0}\gamma^{t}P_{\pi}(x_{t}=x'|x_{0}=x)$, where $P_{\pi}(x_{t}=x'|x_{0}=x)$ is the probability measure induced by the policy $\pi$ and the MDP transition kernel. By strong duality, the value function can be equivalently written as

V_{\gamma}^{\pi}(x)=\frac{1}{1-\gamma}\mathbb{E}_{x'\sim d_{x,\gamma}^{\pi},a'\sim\pi(\cdot|x')}\left[r(x',a')\right]. \qquad (6)
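
As a quick numerical sanity check of Eqn (6) and the dual LP above, the following sketch builds a small random Markov chain and verifies the identity; the chain size, discount factor and random rewards are illustrative choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
S, gamma = 5, 0.9                     # illustrative state count and discount factor

P = rng.random((S, S))                # P^pi: row-stochastic transition matrix under pi
P /= P.sum(axis=1, keepdims=True)
r = rng.random(S)                     # r^pi: state-dependent rewards under pi

# Primal solution: V_gamma^pi = (I - gamma P^pi)^{-1} r^pi
V = np.linalg.solve(np.eye(S) - gamma * P, r)

# Dual solution for starting state x: d = (1 - gamma) (I - gamma (P^pi)^T)^{-1} delta_x
x = 0
d = (1 - gamma) * np.linalg.solve(np.eye(S) - gamma * P.T, np.eye(S)[x])

# Strong duality, Eqn (6): V_gamma^pi(x) = (1 - gamma)^{-1} E_{x' ~ d}[r^pi(x')]
assert np.isclose(V[x], d @ r / (1 - gamma))
```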

3 Taylor Expansions of Value Functions

Below, we show how to estimate $V_{\gamma'}^{\pi}(x)$ with approximations constructed from value functions $V_{\gamma}^{\pi}(x)$ for $\gamma<\gamma'$. Unless otherwise stated, we always assume $\gamma'<1$ for a more convenient mathematical treatment of the problem.

3.1 Taylor expansions of discount factors

We start with some notation: we overload the notation of value functions $V_{\gamma}^{\pi}\in\mathbb{R}^{\mathcal{X}}$ to refer both to the scalar function and to the corresponding vector. The Bellman equation for the value function is expressed in matrix form (Puterman, 2014) as

V_{\gamma'}^{\pi}=r^{\pi}+\gamma' P^{\pi}V_{\gamma'}^{\pi}. \qquad (7)

Inverting the equation,

V_{\gamma'}^{\pi}=(I-\gamma' P^{\pi})^{-1}r^{\pi}. \qquad (8)

Now, we present the main result on Taylor expansions.

Proposition 3.1.

The following holds for all $K\geq 0$,

V_{\gamma'}^{\pi}=\sum_{k=0}^{K}\left((\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}V_{\gamma}^{\pi}+\underbrace{\left((\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{K+1}V_{\gamma'}^{\pi}}_{\text{residual}}. \qquad (9)

When $\gamma<\gamma'<1$, the norm of the residual converges to $0$ as $K\rightarrow\infty$, which implies

V_{\gamma'}^{\pi}=\sum_{k=0}^{\infty}\left((\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}V_{\gamma}^{\pi}. \qquad (10)

We provide a proof sketch here: note that $\gamma' P^{\pi}=(\gamma'-\gamma)P^{\pi}+\gamma P^{\pi}$ and apply the Woodbury matrix identity to obtain $(I-\gamma' P^{\pi})^{-1}=(I-\gamma P^{\pi})^{-1}+(\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}(I-\gamma' P^{\pi})^{-1}$. We can then recursively expand Eqn (8) $K$ times to arrive at Eqn (9). In particular, by expanding the equation once, we see that $(I-\gamma' P^{\pi})^{-1}$ is equivalent to the following,

(I-\gamma P^{\pi})^{-1}+(\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}(I-\gamma P^{\pi})^{-1}+(\gamma'-\gamma)^{2}\left((I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{2}\underbrace{(I-\gamma' P^{\pi})^{-1}}_{\text{can be expanded further}},

where the last term can be expanded further by plugging in the Woodbury matrix identity. See the complete proof in Appendix A.

Extensions to $\gamma'=1$.

The above result extends to the case $\gamma'=1$. We make two assumptions: A.1, the Markov chain induced by $\pi$ is absorbing and $T$ is the absorption time; A.2, $r^{\pi}(x)=0$ for absorbing states $x$. Under these assumptions, we can interpret such absorbing states as terminal states. As a result, $V_{\gamma'=1}^{\pi}(x)=\mathbb{E}_{\pi}\left[\sum_{t=0}^{T}r_{t}\;\middle|\;x_{0}=x\right]$ is well-defined and Proposition 3.1 still holds; see Appendix A for the complete proof.

In practice, it is infeasible to sum the infinite number of terms in the Taylor expansion. It is then of interest to consider the $K$th-order expansion of $V_{\gamma'}^{\pi}$, which truncates the infinite series. Specifically, we define the $K$th-order expansion as

V_{K,\gamma,\gamma'}^{\pi}\coloneqq\sum_{k=0}^{K}\left((\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}V_{\gamma}^{\pi}. \qquad (11)

As $K$ increases, the $K$th-order expansion becomes increasingly close to the infinite series, which evaluates to $V_{\gamma'}^{\pi}(x)$. This is formalized next.

Proposition 3.2.

The following bound holds for all $K\geq 0$,

\left|V_{\gamma'}^{\pi}(x)-V_{K,\gamma,\gamma'}^{\pi}(x)\right|\leq\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{K+1}\frac{R_{\text{max}}}{1-\gamma'}. \qquad (12)
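
For instance, with the discount factors from Section 1, $\gamma=0.99$ and $\gamma'=0.999$, the contraction factor is $\frac{\gamma'-\gamma}{1-\gamma}=\frac{0.009}{0.01}=0.9$ and $\frac{1}{1-\gamma'}=1000$, so the bound specializes to

\left|V_{\gamma'}^{\pi}(x)-V_{K,\gamma,\gamma'}^{\pi}(x)\right|\leq 0.9^{K+1}\cdot 1000\,R_{\text{max}},

and each additional order of magnitude in accuracy requires roughly $22$ more expansion terms (since $0.9^{22}\approx 0.1$).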

3.2 Sample-based approximations of Taylor expansions

We now describe how to estimate $V_{K,\gamma,\gamma'}^{\pi}(x)$ via samples. First, we build some intuition on the behavior of expansions at different orders $K$ by considering a few special cases.

Zeroth-order expansion.

By setting $K=0$, we see that

V_{0,\gamma,\gamma'}^{\pi}=V_{\gamma}^{\pi}. \qquad (13)

The zeroth-order expansion approximates the value function $V_{\gamma'}^{\pi}(x)$ under discount factor $\gamma'$ with the value function $V_{\gamma}^{\pi}(x)$ under the lower discount factor $\gamma<\gamma'$. This approximation is very straightforward to use, in that no sampling at all is required, but it may not be accurate.

First-order expansion.

When $K=1$, we consider the increment of the expansion,

V_{1,\gamma,\gamma'}^{\pi}-V_{0,\gamma,\gamma'}^{\pi}=(\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}V_{\gamma}^{\pi}. \qquad (14)

To understand the first-order expansion, recall that in the definition of the value function $V_{\gamma}^{\pi}=(I-\gamma P^{\pi})^{-1}r^{\pi}$, immediate rewards $r^{\pi}$ are accumulated via the matrix $(I-\gamma P^{\pi})^{-1}$. In general, for any $X,Y\in\mathbb{R}^{\mathcal{X}}$, we can interpret $X=(I-\gamma P^{\pi})^{-1}Y$ as accumulating $Y$ as rewards to compute $X$ as a value function. By analogy, we can interpret the RHS of Eqn (14) as the value function obtained by treating $(\gamma'-\gamma)P^{\pi}V_{\gamma}^{\pi}$ as immediate rewards. In other words, the first-order expansion bootstraps the zeroth-order expansion $V_{\gamma}^{\pi}$ to form a more accurate approximation. Combined with the zeroth-order expansion, we can also conveniently write the difference of the first- and zeroth-order expansions as an expectation, $V_{1,\gamma,\gamma'}^{\pi}(x)-V_{0,\gamma,\gamma'}^{\pi}(x)=(\gamma'-\gamma)\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}\gamma^{t-1}V_{\gamma}^{\pi}(x_{t})\;\middle|\;x_{0}=x\right]$. Let $\tau\sim\text{Geometric}(1-\gamma)$ be a random time such that $P(\tau=t)=(1-\gamma)\gamma^{t-1},\forall t\in\mathbb{Z}_{\geq 1}$. The difference can also be expressed via this random time,

V_{1,\gamma,\gamma'}^{\pi}(x)-V_{0,\gamma,\gamma'}^{\pi}(x)=\frac{\gamma'-\gamma}{1-\gamma}\mathbb{E}_{\pi,\tau}\left[V_{\gamma}^{\pi}(x_{\tau})\right].

Note that from this expression, we obtain a simple unbiased estimate of $V_{1,\gamma,\gamma'}^{\pi}(x)-V_{0,\gamma,\gamma'}^{\pi}(x)$ using a sampled trajectory and a random time step $\tau$.

General $K$th-order expansion.

We now present results for general $K$. Consider the incremental term,

V_{K,\gamma,\gamma'}^{\pi}-V_{K-1,\gamma,\gamma'}^{\pi}=(\gamma'-\gamma)^{K}\left((I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{K}V_{\gamma}^{\pi}. \qquad (15)

Note that the aggregate matrix $\left((I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{K}$ suggests a recursive procedure that bootstraps from lower-order expansions to construct higher-order expansions. To see why, we can rewrite the right-hand side of Eqn (15) as

(\gamma'-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\left(V_{K-1,\gamma,\gamma'}^{\pi}-V_{K-2,\gamma,\gamma'}^{\pi}\right).

Indeed, we can interpret the difference $V_{K,\gamma,\gamma'}^{\pi}-V_{K-1,\gamma,\gamma'}^{\pi}$ as the value function under the immediate reward $(\gamma'-\gamma)P^{\pi}\left(V_{K-1,\gamma,\gamma'}^{\pi}-V_{K-2,\gamma,\gamma'}^{\pi}\right)$. This generalizes the bootstrap procedure of the first-order expansion as a special case, where we naturally assume $V_{-1,\gamma,\gamma'}^{\pi}=0$. Given $K$ i.i.d. random times $\tau_{i}\sim\text{Geometric}(1-\gamma)$, we can write $V_{K,\gamma,\gamma'}^{\pi}(x)-V_{K-1,\gamma,\gamma'}^{\pi}(x)$ as the expectation

\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{K}\mathbb{E}_{\tau_{i},1\leq i\leq K}\left[V_{\gamma}^{\pi}\left(x_{\tau_{1}+\dots+\tau_{K}}\right)\right].

Based on the above expression, Algorithm 1 provides a subroutine that generates unbiased estimates of $V_{K,\gamma,\gamma'}^{\pi}(x)$ by sub-sampling an infinite trajectory $(x_{t},a_{t},r_{t})_{t=0}^{\infty}$ with the random times.

Practical implementations.

While the above and Algorithm 1 show how to compute one-sample estimates, in practice, we might want to average multiple samples along a single trajectory for variance reduction. See Appendix F for further details on the practical estimates.

Input: a trajectory $(x_{t},a_{t},r_{t})_{t=0}^{\infty}\sim\pi$ and discount factors $\gamma<\gamma'<1$
  1. Compute an unbiased estimate $\widehat{V}_{\gamma}^{\pi}(x_{t})$ for states along the trajectory, e.g., $\widehat{V}_{\gamma}^{\pi}(x_{t})=\sum_{t'\geq t}\gamma^{t'-t}r_{t'}$.
  2. Sample $K$ random times $\{\tau_{i}\}_{1\leq i\leq K}$, all i.i.d. geometrically distributed, $\tau_{i}\sim\text{Geometric}(1-\gamma)$.
  3. Return the unbiased estimate $\sum_{k=0}^{K}\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{k}\widehat{V}_{\gamma}^{\pi}(x_{t_{k}})$, where $t_{k}=\sum_{i=1}^{k}\tau_{i}$.
Algorithm 1: Estimating the $K$th-order expansion
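
A minimal Python sketch of Algorithm 1 is given below. The base estimates $\widehat{V}_{\gamma}^{\pi}$ are computed as truncated Monte Carlo returns, so the estimate carries a small truncation bias on a finite trajectory; the function name and the input format are illustrative.

```python
import numpy as np

def kth_order_estimate(rewards, K, gamma, gamma_p, rng=None):
    """One-sample estimate of V_{K,gamma,gamma'}^pi(x_0) from a single trajectory (Algorithm 1)."""
    rng = rng or np.random.default_rng()
    rewards = np.asarray(rewards, dtype=float)
    T = len(rewards)

    # Step 1: base estimates V_hat_gamma(x_t) = sum_{t' >= t} gamma^{t'-t} r_{t'} (truncated at T).
    v_hat = np.array([np.dot(gamma ** np.arange(T - t), rewards[t:]) for t in range(T)])

    # Step 2: K i.i.d. random times tau_i ~ Geometric(1 - gamma), supported on {1, 2, ...}.
    taus = rng.geometric(1.0 - gamma, size=K)
    # t_k = tau_1 + ... + tau_k, clipped to the trajectory length (the paper assumes an infinite trajectory).
    t_k = np.minimum(np.concatenate(([0], np.cumsum(taus))), T - 1)

    # Step 3: sum_{k=0}^{K} ((gamma' - gamma)/(1 - gamma))^k * V_hat_gamma(x_{t_k}).
    coeffs = ((gamma_p - gamma) / (1.0 - gamma)) ** np.arange(K + 1)
    return float(coeffs @ v_hat[t_k])

# Usage sketch on a dummy reward sequence.
print(kth_order_estimate(np.ones(1000), K=3, gamma=0.2, gamma_p=0.8))
```
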
Interpretation of expansions in the dual space.

Recall that $V_{\gamma'}^{\pi}=(I-\gamma' P^{\pi})^{-1}r^{\pi}=I(I-\gamma' P^{\pi})^{-1}r^{\pi}$, where the identity matrix $I=[\delta_{x}]_{x\in\mathcal{X}}$ concatenates the Dirac delta vectors $\delta_{x},\forall x\in\mathcal{X}$. Since $r^{\pi}$ is a constant vector, Taylor expansions essentially construct approximations to the matrix $(I-\gamma' P^{\pi})^{-1}$. By grouping the matrix with the reward vector (or with the density matrix), we arrive at the primal expansion (or the dual expansion),

I\underbrace{(I-\gamma' P^{\pi})^{-1}r^{\pi}}_{\text{primal expansions of }V_{\gamma'}^{\pi}(x)}=\underbrace{I(I-\gamma' P^{\pi})^{-1}}_{\text{dual expansions of }d_{x,\gamma'}^{\pi}}r^{\pi}.

The derivations above focus on the primal expansion view. We show a parallel theory of dual expansions in Appendix B. The equivalence of the primal and dual views of Taylor expansions suggests connections with seemingly disparate lines of prior work: Janner et al. (2020) propose a density model for the visitation distributions of different $\gamma$ in the context of model-based RL. They show that predictions for large discount factors can be bootstrapped from predictions for small discount factors. This corresponds exactly to the dual-space expansions, which are equivalent to the primal-space expansions.

Extensions to Q-functions.

In Appendix C, we show that it is possible to build approximations to $Q_{\gamma'}^{\pi}$ using $Q_{\gamma}^{\pi}$ as building blocks. The theoretical guarantees and estimation procedures are similar to the case of value functions.

3.3 Approximation errors with finite samples

Proposition 3.2 shows that the expected approximation error decays as $\left|V_{K,\gamma,\gamma'}^{\pi}(x)-V_{\gamma'}^{\pi}(x)\right|=O\left(\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{K+1}\right)$ for $\gamma<\gamma'<1$. This motivates using a high value of $K$ when constructing the approximation. However, in practice, all constituent terms in the $K$th-order expansion are random estimates, each with non-zero variance. This might lead the variance of the overall estimate to increase with $K$. As a result, $K$ mediates a trade-off between bias (expected approximation error) and variance. We formalize these intuitions in Appendix E, where we theoretically analyze the trade-off using the phased TD-learning framework (Kearns and Singh, 2000).

A numerical example.

To get direct intuition about the effect of $K$, we focus on a tabular MDP example. The MDP has $|\mathcal{X}|=10$ states and $|\mathcal{A}|=2$ actions. All entries of the transition table $p(y|x,a)$ are generated from a Dirichlet distribution with parameters $(\alpha,\ldots,\alpha)$ with $\alpha=0.01$. The policy $\pi(a|x)$ is uniformly random. We take $\gamma=0.2$ and $\gamma'=0.8$. The agent generates $N=10$ trajectories $(x_{t},a_{t},r_{t})_{t=0}^{T}$ with a very large horizon $T$ and a fixed starting state $x_{0}$. We assume access to base estimates $\widehat{V}_{\gamma}^{\pi}(x_{t})$, and the Taylor expansion estimates $\widehat{V}_{K,\gamma,\gamma'}^{\pi}(x_{0})$ are computed based on Algorithm 1. We estimate the relative error as $\widehat{E}_{K}(x_{0})=\left|V_{\gamma'}^{\pi}(x_{0})-\widehat{V}_{K,\gamma,\gamma'}^{\pi}(x_{0})\right|$. For further experiment details, see Appendix F.
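
For reference, a sketch of the expected-estimate variant of this example (computing $V_{K,\gamma,\gamma'}^{\pi}$ exactly from the transition table) is given below. The Dirichlet transitions, uniform policy and discount factors follow the setup above, while the reward values are an illustrative choice since they are not specified here.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, alpha = 10, 2, 0.01
gamma, gamma_p = 0.2, 0.8

P_xa = rng.dirichlet(alpha * np.ones(S), size=(S, A))    # p(.|x, a) with Dirichlet(0.01, ..., 0.01) rows
r_xa = rng.random((S, A))                                # illustrative rewards in [0, 1]
pi = np.full((S, A), 1.0 / A)                            # uniformly random policy

P = np.einsum('xa,xay->xy', pi, P_xa)                    # P^pi
r = np.einsum('xa,xa->x', pi, r_xa)                      # r^pi

V_gamma = np.linalg.solve(np.eye(S) - gamma * P, r)
V_gamma_p = np.linalg.solve(np.eye(S) - gamma_p * P, r)

# Exact K-th order expansions (Eqn (11)); the error decays as ((gamma' - gamma)/(1 - gamma))^{K+1}.
M = (gamma_p - gamma) * np.linalg.solve(np.eye(S) - gamma * P, P)
V_K, term = np.zeros(S), V_gamma.copy()
for K in range(10):
    V_K = V_K + term
    print(K, abs(V_gamma_p[0] - V_K[0]) / abs(V_gamma_p[0]))   # relative error at the starting state
    term = M @ term
```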

In Figure 1(a), we show how errors vary as a function of $K$. We study two settings: (1) expected estimates (red), where $\widehat{V}_{K,\gamma,\gamma'}^{\pi}(x_{0})$ is computed analytically through access to the transition tables. In this case, as the theory suggests, the error decays exponentially; (2) sample-based estimates (blue) with base estimates $\widehat{V}_{\gamma}^{\pi}(x_{t})=\sum_{s=0}^{\infty}\gamma^{s}r_{t+s}$. The errors decay initially with $K$ but later increase slightly as $K$ gets large. The optimal $K$ in the middle achieves the best bias-variance trade-off. Note that in this particular example, the estimates do not pay a very big price in variance for large $K$. We speculate this is because increments to the estimates are proportional to $\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{K+1}$, which scales down additional variance terms quickly as $K$ increases.

In Figure 1(b), we study how the optimal expansion order $K^{\ast}$ depends on the noise level of the base estimates. To emulate the noise, we assume access to base estimates $\widehat{V}_{\gamma}^{\pi}(x_{t})=V_{\gamma}^{\pi}(x_{t})+\mathcal{N}(0,\sigma^{2})$ for some noise level $\sigma$. The optimal order is computed as $K^{\ast}=\arg\min_{k}\widehat{E}_{k}(x_{0})$. In general, we observe that as $\sigma$ increases, $K^{\ast}$ decreases. Intuitively, this implies that as the base estimates $\widehat{V}_{\gamma}^{\pi}(x)$ become noisier, we should prefer smaller values of $K$ to control the variance. This result offers insights for practical applications such as downstream policy optimization, where we need to select an optimal $K$ for the task at hand.

[Figure 1 omitted: (a) trade-off of $K$; (b) optimal $K$.]
Figure 1: Comparison of Taylor expansions with different orders. The x-axis shows the order $K$; the y-axis shows the log relative errors of the approximations. The red curve shows the exact computation while the blue curve shows the sample-based estimates. See Appendix F for more details.

4 Taylor Expansions of Gradient Updates

In Section 3, we discussed how to construct approximations to $V_{\gamma'}^{\pi}(x)$. For the purpose of policy optimization, it is of direct interest to study approximations to $\nabla_{\theta}V_{\gamma'}^{\pi_{\theta}}(x)$. As stated in Section 1, a major premise of our work is that in many practical contexts, estimating discounted values under $\gamma'\approx 1$ is difficult. As a result, directly evaluating the full gradient $\nabla_{\theta}V_{\gamma'}^{\pi_{\theta}}(x)$ is challenging, because it requires estimating the Q-functions $Q_{\gamma'}^{\pi_{\theta}}(x,a)$. Below, we start by showing how the decomposition of $\nabla_{\theta}V_{\gamma'}^{\pi_{\theta}}(x)$ motivates a particular form of gradient update, which is generally considered a deep RL heuristic. Then we construct approximations to this update based on Taylor expansions.

4.1 $V_{\gamma'}^{\pi}$ as a weighted mixture of $V_{\gamma}^{\pi}$

We can explicitly rewrite $V_{\gamma'}^{\pi}(x)$ as a weighted mixture of the value functions $V_{\gamma}^{\pi}(x'),x'\in\mathcal{X}$. This result was alluded to in (Romoff et al., 2019) and is formally shown below.

Lemma 4.1.

Assume $\gamma<\gamma'<1$. We can write $V_{\gamma'}^{\pi}(x)=(\rho_{x,\gamma,\gamma'}^{\pi})^{T}V_{\gamma}^{\pi}$, where the weight vector $\rho_{x,\gamma,\gamma'}^{\pi}\in\mathbb{R}^{\mathcal{X}}$ is

\left(I-\gamma(P^{\pi})^{T}\right)\left(I-\gamma'(P^{\pi})^{T}\right)^{-1}\delta_{x}.

We can also rewrite $V_{\gamma'}^{\pi}(x)$, using an expectation, as

V_{\gamma}^{\pi}(x)+\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}(\gamma'-\gamma)(\gamma')^{t-1}V_{\gamma}^{\pi}(x_{t})\;\middle|\;x_{0}=x\right]. \qquad (16)

When $\gamma'=1$, $\rho_{x,\gamma,\gamma'}^{\pi}$ might be undefined. However, Eqn (16) still holds if assumptions A.1 and A.2 are satisfied.
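
Analogous to the earlier check of Eqn (6), Lemma 4.1 can be verified numerically on a small random chain; the chain size and discount factors below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
S, gamma, gamma_p = 6, 0.5, 0.9

P = rng.random((S, S))
P /= P.sum(axis=1, keepdims=True)                                   # P^pi
r = rng.random(S)                                                   # r^pi

V_g = np.linalg.solve(np.eye(S) - gamma * P, r)                     # V_gamma^pi
V_gp = np.linalg.solve(np.eye(S) - gamma_p * P, r)                  # V_gamma'^pi

x = 0
# rho_{x,gamma,gamma'}^pi = (I - gamma (P^pi)^T) (I - gamma' (P^pi)^T)^{-1} delta_x
rho = (np.eye(S) - gamma * P.T) @ np.linalg.solve(np.eye(S) - gamma_p * P.T, np.eye(S)[x])

assert np.isclose(V_gp[x], rho @ V_g)                               # Lemma 4.1
```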

4.2 Decomposing the full gradient $\nabla_{\theta}V_{\gamma'}^{\pi_{\theta}}(x)$

Lemma 4.1 highlights that $V_{\gamma'}^{\pi}(x)$ depends on $\pi$ in two ways: (1) through the value functions $V_{\gamma}^{\pi}(x'),x'\in\mathcal{X}$; (2) through the state-dependent distribution $\rho_{x,\gamma,\gamma'}^{\pi}(x')$. Let $\pi_{\theta}$ be a parameterized policy. For conceptual clarity, we can write $V_{\gamma'}^{\pi_{\theta}}(x)=F(V_{\gamma}^{\pi_{\theta}},\rho_{x,\gamma,\gamma'}^{\pi_{\theta}})$ with a function $F:\mathbb{R}^{\mathcal{X}}\times\mathbb{R}^{\mathcal{X}}\rightarrow\mathbb{R}$. Though this function is essentially the inner product, i.e., $F(V,\rho)=V^{T}\rho$, the notation helps stress that $V_{\gamma'}^{\pi_{\theta}}(x)$ depends on $\theta$ through two vector arguments. Now, we can decompose $\nabla_{\theta}V_{\gamma'}^{\pi_{\theta}}(x)$.

Lemma 4.2.

The full gradient $\nabla_{\theta}V_{\gamma'}^{\pi_{\theta}}(x)$ can be decomposed into the sum of two partial gradients as follows,

\left(\partial_{V}F(V,\rho)\right)^{T}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}+\left(\partial_{\rho}F(V,\rho)\right)^{T}\nabla_{\theta}\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}=\underbrace{\mathbb{E}\left[\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x')\right]}_{\text{first partial gradient}}+\underbrace{\mathbb{E}\left[V_{\gamma}^{\pi_{\theta}}(x')\nabla_{\theta}\log\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}(x')\right]}_{\text{second partial gradient}},

where the above partial gradients are both evaluated at $V=V_{\gamma}^{\pi_{\theta}},\rho=\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}$ and both expectations are with respect to $x'\sim\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}$.

We argue that the second partial gradient introduces most of the challenges in practical optimization. Intuitively, this is because its unbiased estimator is equivalent to a REINFORCE gradient estimator, which requires estimating discounted values that accumulate $V_{\gamma}^{\pi}(x')$ as ‘rewards’ under discount factor $\gamma'$. By the premise of our work, this estimation would be difficult. We detail this discussion in Appendix D.

The following result characterizes the first partial gradient.

Proposition 4.3.

For any $\gamma<\gamma'<1$, the first partial gradient $(\partial_{V}F(V_{\gamma}^{\pi_{\theta}},\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}))^{T}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}$ can be expressed as

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}(\gamma')^{t}Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x\right]. \qquad (17)

When $\gamma'=1$, under assumptions A.1 and A.2, the first partial gradient exists and is expressed as

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T}Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x\right]. \qquad (18)

Connections to a common deep RL heuristic.

Many high-quality deep RL algorithms (see, e.g., Dhariwal et al., 2017; Achiam and OpenAI, 2018) implement parameter updates that are very similar to Eqn (18). As such, Proposition 4.3 provides some insight into why implementing such a heuristic might be useful in practice: though in general Eqn (18) is not a gradient (Nota and Thomas, 2019), it is a partial gradient of $V_{\gamma'=1}^{\pi_{\theta}}(x)$, which is usually the objective of interest at evaluation time. Compared with the vanilla PG formula in Eqn (2), Eqn (18) offsets the over-discounting via a uniform average over states.

However, it is worth noting that in deep RL practice, the definition of the evaluation horizon $T$ might slightly differ from that specified in A.1. In such cases, Proposition 4.3 does not hold. By A.1, $T$ is the absorption time that determines when the MDP enters a terminal absorbing state. In many applications, however, for MDPs without a natural terminal state, $T$ is usually enforced by an external time constraint that does not depend on states. In other words, an environment can terminate even when it does not enter any terminal state (see, e.g., Brockman et al., 2016 for such examples). To bypass this subtle technical gap, one idea is to incorporate time steps as part of the state, $\widetilde{x}\leftarrow[x,t]$. This technique was hinted at in early work such as (Schulman et al., 2015b) and empirically studied in (Pardo et al., 2018). In this case, the random absorption time $T$ depends fully on the augmented states, and Proposition 4.3 holds.

4.3 Taylor expansions of partial gradients

We now consider approximations to the first partial gradient

\left(\partial_{V}F(V_{\gamma}^{\pi_{\theta}},\rho_{x,\gamma,\gamma'}^{\pi_{\theta}})\right)^{T}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}=(\rho_{x,\gamma,\gamma'}^{\pi_{\theta}})^{T}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}.

Since $\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}$ does not depend on $\gamma'$, the approximation is effectively with respect to the weight vector $\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}$. Below, we show results for the $K$th-order approximation.

Proposition 4.4.

Assume $\gamma<\gamma'<1$. For any $x\in\mathcal{X}$, define the $K$th-order Taylor expansion of $\rho_{x,\gamma,\gamma'}^{\pi}$ as

\rho_{x,K,\gamma,\gamma'}^{\pi}=\sum_{k=0}^{K}\left((\gamma'-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{k}\delta_{x}.

It can be shown that $V_{K,\gamma,\gamma'}^{\pi}(x)=(\rho_{x,K,\gamma,\gamma'}^{\pi})^{T}V_{\gamma}^{\pi}$ and $\left\lVert\rho_{x,K,\gamma,\gamma'}^{\pi}-\rho_{x,\gamma,\gamma'}^{\pi}\right\rVert_{\infty}=O\left(\left(\frac{\gamma'-\gamma}{1-\gamma}\right)^{K+1}\right)$.

We now build some intuition about the approximations. Note that in general, we can write the partial gradient as a weighted mixture of local gradients $Q_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})$, where $Q_{t}\coloneqq Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})$,

\mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty}w_{K,\gamma,\gamma'}(t)Q_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x\right], \qquad (19)

for some weight function $w_{K,\gamma,\gamma'}(t)\in\mathbb{R}$. When $K\rightarrow\infty$, $\lim_{K\rightarrow\infty}w_{K,\gamma,\gamma'}(t)=(\gamma')^{t}$ and we recover the original first partial gradient defined in Eqn (17); when $K=0$, $w_{K,\gamma,\gamma'}(t)=\gamma^{t}$ recovers the vanilla PG in Eqn (2). For other values of $K$, we give the analytic weights $w_{K,\gamma,\gamma'}(t)$ in Appendix D. Similar to how $V_{K,\gamma,\gamma'}^{\pi}$ interpolates $V_{\gamma}^{\pi}$ and $V_{\gamma'}^{\pi}$, the $K$th-order expansion of the partial gradient interpolates between the full partial gradient and the vanilla PG. In practice, we might expect an intermediate value of $K$ to achieve the best bias-variance trade-off for the update.
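
The analytic form of $w_{K,\gamma,\gamma'}(t)$ is deferred to Appendix D. As an illustration only, the sketch below computes one closed form that is consistent with the two limiting cases stated above ($K=0$ recovers $\gamma^{t}$, $K\rightarrow\infty$ recovers $(\gamma')^{t}$), namely the binomial expansion of $(\gamma')^{t}=(\gamma+(\gamma'-\gamma))^{t}$ truncated at order $K$; the function name is hypothetical and the expression in Appendix D may be written differently.

```python
from math import comb

def taylor_pg_weight(t, K, gamma, gamma_p):
    """Binomial expansion of (gamma')^t = (gamma + (gamma' - gamma))^t, truncated at order K."""
    return sum(comb(t, k) * (gamma_p - gamma) ** k * gamma ** (t - k)
               for k in range(min(K, t) + 1))

# Sanity checks against the limiting cases stated in the text.
gamma, gamma_p = 0.99, 0.999
assert all(abs(taylor_pg_weight(t, 0, gamma, gamma_p) - gamma ** t) < 1e-12 for t in range(200))
assert all(abs(taylor_pg_weight(t, t, gamma, gamma_p) - gamma_p ** t) < 1e-9 for t in range(200))
```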

5 Policy optimization with Taylor expansions

Based on the theoretical insights of the previous sections, we propose two algorithmic changes to baseline algorithms. Based on Section 3, we propose Taylor expansion Q-function estimation; based on Section 4, we propose Taylor expansion update weighting. It is important to note that other algorithmic changes are possible, which we leave to future work.

5.1 Baseline near on-policy algorithm

We briefly introduce background on near on-policy policy optimization algorithms (Schulman et al., 2015a; Mnih et al., 2016; Schulman et al., 2017; Espeholt et al., 2018). We assume that the data $(x_{t},a_{t},r_{t})_{t=0}^{\infty}\sim\mu$ are collected under a behavior policy $\mu$, which is close to the target policy $\pi_{\theta}$. The on-policyness is ensured by constraining $D(\pi_{\theta},\mu)\leq\varepsilon$ for some divergence $D$ and threshold $\varepsilon>0$. Usually, $\varepsilon$ is chosen to be small so that little off-policy correction is needed for estimating value functions. With data $(x_{t},a_{t},r_{t})_{t=0}^{\infty}$, the algorithms estimate Q-functions $\widehat{Q}_{\gamma}^{\pi_{\theta}}\approx Q_{\gamma}^{\pi_{\theta}}$. The estimates $\widehat{Q}_{\gamma}^{\pi_{\theta}}(x,a)$ are then used as plug-in alternatives to the Q-functions in the definition of gradient updates such as Eqn (2) for sample-based updates.

5.2 Taylor expansion Q-function estimation

In Section 3, we discussed how to construct approximations to $Q_{\gamma'}^{\pi_{\theta}}$ using $Q_{\gamma}^{\pi_{\theta}}$ as building blocks. As the first algorithmic change, we propose to construct the $K$th-order expansion $Q_{K,\gamma,\gamma'}^{\pi_{\theta}}$ as a plug-in alternative to $Q_{\gamma}^{\pi_{\theta}}$ in downstream optimization. Since $Q_{K,\gamma,\gamma'}^{\pi_{\theta}}\approx Q_{\gamma'}^{\pi_{\theta}}$, we expect the optimization subroutine to account for an objective with a longer effective horizon.

In many baseline algorithms, we have access to a value function critic $V_{\phi}(x)$ and a subroutine which produces Q-function estimates $\widehat{Q}_{\gamma}^{\pi_{\theta}}(x,a)$ (e.g., $\widehat{Q}_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})=\sum_{s=0}^{\infty}\gamma^{s}r_{t+s}$). We then construct the $K$th-order expansion $\widehat{Q}_{K,\gamma,\gamma'}^{\pi_{\theta}}(x,a)$ using $\widehat{Q}_{\gamma}^{\pi_{\theta}}$. This procedure is similar to Algorithm 1, and we show the full algorithm in Appendix C. See also Appendix F for further experimental details.

Input: policy $\pi_{\theta}$ with parameter $\theta$ and learning rate $\alpha$
  while not converged do
     1. Collect partial trajectories $(x_{t},a_{t},r_{t})_{t=1}^{T}\sim\mu$.
     2. Estimate Q-functions $\widehat{Q}_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})$.
     3. Construct the $K$th-order Taylor expansion estimator $\widehat{Q}_{K,\gamma,\gamma'}^{\pi_{\theta}}(x_{t},a_{t})$ using $\widehat{Q}_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})$.
     4. Update the parameter via gradient ascent, $\theta\leftarrow\theta+\alpha\sum_{t=1}^{T}\widehat{Q}_{K,\gamma,\gamma'}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})$.
  end while
Algorithm 2: Taylor expansion Q-function estimation

5.3 Taylor expansion update weighting

In Section 4, we discussed the Taylor expansion approximation $\rho_{x,K,\gamma,\gamma'}^{\pi_{\theta}}$ to the weight vector $\rho_{x,\gamma,\gamma'}^{\pi_{\theta}}$. As the second algorithmic change to the baseline algorithm, we update parameters in the direction of the $K$th-order approximation to the partial gradient, $\theta\leftarrow\theta+\alpha\left(\rho_{x,K,\gamma,\gamma'}^{\pi_{\theta}}\right)^{T}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}$. Eqn (19) shows that this update effectively translates into adjusting the weights $w_{t}=w_{K,\gamma,\gamma'}(t)$. When combined with the other components of the algorithm, the pseudocode is shown in Algorithm 3. Under this framework, the common deep RL heuristic is recovered by setting $w_{t}=1$.

Input: policy $\pi_{\theta}$ with parameter $\theta$ and learning rate $\alpha$
  while not converged do
     1. Collect partial trajectories $(x_{t},a_{t},r_{t})_{t=1}^{T}\sim\mu$.
     2. Estimate Q-functions $\widehat{Q}_{t}=\widehat{Q}_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})$.
     3. Compute weights for each state, $w_{t}=w_{x_{0},K,\gamma,\gamma'}(t)$, and average $g_{\theta}=\sum_{t=1}^{T}w_{t}\widehat{Q}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})$.
     4. Update parameters $\theta\leftarrow\theta+\alpha g_{\theta}$.
  end while
Algorithm 3: Taylor expansion update weighting
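
A minimal sketch of steps 3 and 4 of Algorithm 3 is shown below. It assumes the per-step score vectors $\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})$ and Q-function estimates are already available as arrays, and the weights $w_{t}$ are supplied by any implementation of $w_{K,\gamma,\gamma'}(t)$; in deep RL implementations the same weighting is typically applied inside a surrogate loss instead of forming explicit score vectors.

```python
import numpy as np

def weighted_pg_update(theta, alpha, grad_logp, q_hat, w):
    """Steps 3-4 of Algorithm 3: g_theta = sum_t w_t * Q_hat_t * grad_theta log pi(a_t|x_t).

    grad_logp: (T, dim_theta) array of score vectors grad_theta log pi_theta(a_t | x_t)
    q_hat:     (T,) array of Q-function estimates Q_hat_gamma(x_t, a_t)
    w:         (T,) array of weights w_{K,gamma,gamma'}(t); w = 1 recovers the common deep RL heuristic
    """
    g = ((w * q_hat)[:, None] * grad_logp).sum(axis=0)
    return theta + alpha * g
```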

6 Related work

Discount factors in RL.

Discount factors impact RL agents in various ways. A number of works suggest that RL problems with large discount factors are generally more difficult to solve (Jiang et al., 2016), potentially due to the increased complexity of the optimal value functions or collapses of the action gaps (Lehnert et al., 2018; Laroche and van Seijen, 2018). However, optimal policies defined with small discounts can be very sub-optimal for RL objectives with a large discount factor. To ensure numerical stability when using large discounts, prior work has suggested non-linear transformations of the Bellman targets for Q-learning (Pohlen et al., 2018; van Hasselt et al., 2019; Kapturowski et al., 2018; Van Seijen et al., 2019). However, when data is scarce, small discount factors might prove useful due to their implicit regularization effect (Amit et al., 2020).

As such, there is a trade-off mediated by the choice of discount factor. Similar trade-off effects are most well-known in the context of TD($\lambda$), where $\lambda\in[0,1]$ trades off the bias and variance of the TD updates (Sutton and Barto, 2018; Kearns and Singh, 2000).

Adapting discount factors & multiple discount factors.

In general, when selecting a single optimal discount factor for training is difficult, it might be desirable to adjust the discount during training. This could be achieved by human-designed schedules (Prokhorov and Wunsch, 1997; François-Lavet et al., 2015) or blackbox adaptation (Xu et al., 2018). Alternatively, it might also be beneficial to learn with multiple discount factors at the same time, which could improve TD-learning (Sutton, 1995) or representation learning (Fedus et al., 2019). Complementary to all such work, we study the connections between value functions defined with different discounts.

Taylor expansions for RL.

Recently, in (Tang et al., 2020), Taylor expansions were applied to study the relationship between $V_{\gamma}^{\pi}$ and $V_{\gamma}^{\mu}$, i.e., value functions under the same discount factor but different policies $\pi\neq\mu$. This is useful in the context of off-policy learning. Our work is orthogonal and could potentially be combined with this approach.

7 Experiments

In this section, we evaluate the empirical performance of the new algorithmic changes to the baseline algorithms. We focus on robotics control experiments with continuous state and action spaces. The tasks are available in OpenAI gym (Brockman et al., 2016), with backends such as MuJoCo (Todorov et al., 2012) and bullet physics (Coumans, 2015). We label the tasks as gym (G) and bullet (B), respectively. We always compare the undiscounted cumulative rewards evaluated under a default evaluation horizon $T=1000$.

Hyper-parameters.

Throughout the experiments, we use the same hyper-parameters across all algorithms. The learning rate is tuned for the baseline PPO, and fixed across all algorithms. See Appendix F for further details.

7.1 Taylor expansion Q-function estimation

[Figure 2 omitted: panels (a) HalfCheetah(G), (b) Ant(G), (c) Walker2d(G), (d) HalfCheetah(B), (e) Ant(B), (f) Walker2d(B).]
Figure 2: Comparison of Taylor expansion Q-function estimation with other baselines. Each curve shows median $\pm$ std results across 5 seeds. Taylor expansion outperforms PPO baselines with both low and high discount factors.

We use $\widehat{Q}_{K,\gamma,\gamma'}^{\pi_{\theta}}(x,a)$ with $K=1$ as the Q-function estimator plugged into the gradient update. When combined with PPO (Schulman et al., 2017), the resulting algorithm is named PPO($K$). We compare with the baseline PPO and TRPO (Schulman et al., 2015a). In practice, we consider a mixture advantage estimator $\widehat{Q}^{\pi_{\theta}}(x,a)=(1-\eta)\widehat{Q}_{\gamma}^{\pi_{\theta}}(x,a)+\eta\widehat{Q}_{K,\gamma,\gamma'}^{\pi_{\theta}}(x,a)$, with $\eta\in[0,1]$ a constant that interpolates between PPO (i.e., $\eta=0$) and PPO($K$). Note that $\eta$ should be selected to balance the numerical scales of the two extremes; as a result, we usually find $\eta$ to work well when it is small in absolute scale ($\eta=0.01$ works best).

Results.

In Figure 2, we compare a few baselines: (1) PPO with $\gamma=0.99$ (default); (2) PPO with a high discount factor $\gamma=1-\frac{1}{T}=0.999$; (3) PPO with the Taylor expansion based advantage estimator, PPO($K$). Throughout, we use a single hyper-parameter $\eta=0.01$. We see that in general, PPO($K$) leads to better performance (faster learning, better asymptotic performance or smaller variance across 5 seeds). This shows that Taylor expansion Q-function estimation can lead to performance gains across tasks, given that the hyper-parameter $\eta$ is carefully tuned. We provide a detailed ablation study on $\eta$ in Appendix F, where we show how the overall performance across the benchmark tasks varies as $\eta$ changes from small to large values.

A second observation is that simply increasing the discount factor to $\gamma=1-\frac{1}{T}=0.999$ generally degrades the performance. This confirms the instability of directly applying high discount factors, which motivates this work.

We also compare with the open source implementation of (Romoff et al., 2019) in Appendix F, where they estimate $\widehat{Q}_{\gamma'}^{\pi}$ based on recursive bootstraps of Q-function differences. Conceptually, this is similar to Taylor expansions with $K=\infty$. We show that without the careful trade-off mediated by smaller $K$, this algorithm does not improve performance out of the box.

7.2 Taylor expansion update weighting

As introduced in Section 5, we weight local gradients $\widehat{Q}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})$ with the $K$th-order expansion weights $w_{K,\gamma,\gamma'}(t)$. Here, we take $\gamma'=1-\frac{1}{T}$. Note that since $K=\infty$ corresponds to $\lim_{K\rightarrow\infty}w_{K,\gamma,\gamma'}(t)=(\gamma')^{t}\approx 1$, this is very close to the commonly implemented PPO baseline. We hence expect the algorithm to work better with relatively large values of $K$ and set $K=100$ throughout the experiments. In practice, we find the performance to be fairly robust to the choice of $K$. We provide further analysis and an ablation study in Appendix F.

[Figure 3 omitted: panels (a) HalfCheetah(G), (b) Ant(G), (c) Walker2d(G), (d) HalfCheetah(B), (e) Ant(B), (f) Walker2d(B).]
Figure 3: Comparison of Taylor expansion update weighting with other baselines. Each curve shows median $\pm$ std results across 5 seeds. Taylor expansion outperforms the default PPO baseline most stably.
Results.

We compare a few baselines: (1) default PPO; (2) PPO with time limits (Pardo et al., 2018), in which the states are augmented with the time step, $\widetilde{x}\leftarrow[x,t]$, so that the augmented states $\widetilde{x}$ are Markovian; (3) PPO with Taylor expansion update weighting, PPO($K$). In Figure 3, we see that in general, PPO($K$) and PPO with time limits outperform the baseline PPO. We speculate that the performance gains arise from the following empirical motivation: since the evaluation stops at $t=T$, local gradients close to $t=T$ should be weighted down because they do not contribute as much to the final objective. However, the default PPO ignores this effect and weights all updates uniformly. To tackle this issue, PPO($K$) explicitly weights down late updates, while PPO with time limits augments the state space to restore stationarity. Empirically, though in some cases PPO with time limits also outperforms PPO($K$), it behaves fairly unstably in other cases.

Extensions to off-policy algorithms.

Above, we mainly focused on on-policy algorithms. The setup is simpler because the data are collected (near) on-policy. It is possible to extend similar results to off-policy algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). Due to the space limit, we present extended results in Appendix F, where we show how to combine similar techniques to off-policy actor-critic algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) in continuous control domains.

8 Conclusion

We have proposed a family of objectives that interpolate value functions defined with two discount factors. We have shown that similar techniques are applicable to other cumulative quantities defined through discounts, such as PG updates. This framework allowed us to trade off bias and variance when estimating value functions or gradient updates, and led to empirical performance gains.

We also highlighted a new direction for bridging the gap between theory and practice: the gap between a fully discounted objective (in theory) and an undiscounted objective (in practice). By building a better understanding of this gap, we shed light on seemingly opaque heuristics which are necessary to achieve good empirical performance. We expect this framework to be useful for new practical algorithms.

Acknowledgements.

Yunhao thanks Tadashi Kozuno and Shipra Agrawal for discussions on the discrepancy between policy gradient theory and practices. Yunhao acknowledges the support from Google Cloud Platform for computational resources. The authors also thank Pooria Joulani for reviewing a draft of the paper.

References

  • Achiam and OpenAI [2018] Joshua Achiam and OpenAI. Spinning Up in Deep Reinforcement Learning. https://github.com/openai/spinningup, 2018.
  • Amit et al. [2020] Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.
  • Bellman [1957] Richard Bellman. A Markovian decision process. Journal of mathematics and mechanics, pages 679–684, 1957.
  • Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv, 2016.
  • Coumans [2015] Erwin Coumans. Bullet physics simulation. In ACM SIGGRAPH 2015 Courses, page 1. 2015.
  • De Farias and Van Roy [2003] Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations research, 51(6):850–865, 2003.
  • Degris et al. [2012] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
  • Dhariwal et al. [2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI baselines, 2017.
  • Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.
  • Fedus et al. [2019] William Fedus, Carles Gelada, Yoshua Bengio, Marc G Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. arXiv, 2019.
  • Fox et al. [2016] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.
  • François-Lavet et al. [2015] Vincent François-Lavet, Raphael Fonteneau, and Damien Ernst. How to discount deep reinforcement learning: Towards new dynamic strategies. NIPS Deep Reinforcement Learning Workshop, 2015.
  • Fujimoto et al. [2018] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, 2018.
  • Grinstead and Snell [2012] Charles Miller Grinstead and James Laurie Snell. Introduction to probability. American Mathematical Soc., 2012.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 2018.
  • Hasselt [2010] Hado Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 2010.
  • Janner et al. [2020] Michael Janner, Igor Mordatch, and Sergey Levine. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. Advances in Neural Information Processing Systems, 2020.
  • Jiang et al. [2016] Nan Jiang, Satinder P Singh, and Ambuj Tewari. On structural properties of MDPs that bound loss due to shallow planning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2016.
  • Kapturowski et al. [2018] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2018.
  • Kearns and Singh [2000] Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In Proceedings of the Conference on Learning Theory, 2000.
  • Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  • Laroche and van Seijen [2018] Romain Laroche and Harm van Seijen. In reinforcement learning, all objective functions are not equal. 2018.
  • Lehnert et al. [2018] Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2015.
  • Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
  • Nota and Thomas [2019] Chris Nota and Philip S Thomas. Is the policy gradient a gradient? In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2019.
  • O’Donoghue et al. [2016] Brendan O’Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and Q-learning. In Proceedings of the International Conference on Learning Representations, 2016.
  • Pardo et al. [2018] Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
  • Pohlen et al. [2018] Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado Van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, and Olivier Pietquin. Observe and look further: Achieving consistent performance on atari. arXiv, 2018.
  • Prokhorov and Wunsch [1997] Danil V Prokhorov and Donald C Wunsch. Adaptive critic designs. IEEE transactions on Neural Networks, 8(5):997–1007, 1997.
  • Puterman [2014] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Romoff et al. [2019] Joshua Romoff, Peter Henderson, Ahmed Touati, Emma Brunskill, Joelle Pineau, and Yann Ollivier. Separating value functions across time-scales. Proceedings of the International Conference on Machine Learning, 2019.
  • Ross [2014] Sheldon M Ross. Introduction to probability models. Academic press, 2014.
  • Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, 2015.
  • Schulman et al. [2015a] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 2015a.
  • Schulman et al. [2015b] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
  • Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, 2014.
  • Sinha et al. [2020] Samarth Sinha, Jiaming Song, Animesh Garg, and Stefano Ermon. Experience replay with likelihood-free importance weights. arXiv, 2020.
  • Sutton [1995] Richard S Sutton. TD models: Modeling the world at a mixture of time scales. In Proceedings of the International Conference on Machine Learning. 1995.
  • Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
  • Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
  • Tang et al. [2020] Yunhao Tang, Michal Valko, and Rémi Munos. Taylor expansion policy optimization. In Proceedings of the International Conference on Machine Learning, 2020.
  • Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
  • Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
  • van Hasselt et al. [2019] Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, and André Barreto. General non-linear Bellman equations. arXiv, 2019.
  • Van Seijen et al. [2019] Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
  • Xu et al. [2018] Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems, 2018.
  • Ziebart et al. [2008] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2008.

APPENDICES: Taylor Expansions of Discount Factors

Appendix A Proofs

See Proposition 3.1.

Proof.

Recall the Woodbury matrix identity

(IγPπ)1=(IγPπ)1+(γγ)(IγPπ)1Pπ(IγPπ)1.\displaystyle(I-{\gamma^{\prime}}P^{\pi})^{-1}=(I-\gamma P^{\pi})^{-1}+(\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}(I-\gamma^{\prime}P^{\pi})^{-1}.

Recall the equality $V_{\gamma^{\prime}}^{\pi}=(I-\gamma^{\prime}P^{\pi})^{-1}r^{\pi}$. Plugging in the Woodbury matrix identity immediately shows

Vγπ\displaystyle V_{\gamma^{\prime}}^{\pi} =(IγPπ)1rπ+(γγ)(IγPπ)1Pπ(IγPπ)1rπ\displaystyle=(I-\gamma P^{\pi})^{-1}r^{\pi}+(\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}(I-\gamma^{\prime}P^{\pi})^{-1}r^{\pi}
=Vγπ+(γγ)(IγPπ)1PπVγπ.\displaystyle=V_{\gamma}^{\pi}+(\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}V_{\gamma^{\prime}}^{\pi}.

Now, observe that the second term involves VγπV_{\gamma^{\prime}}^{\pi}. We can plug in the definition of Vγπ=(IγPπ)1rπV_{\gamma^{\prime}}^{\pi}=(I-\gamma^{\prime}P^{\pi})^{-1}r^{\pi} and invoke the Woodbury matrix identity again. This produces

Vγπ\displaystyle V_{\gamma^{\prime}}^{\pi} =Vγπ+(γγ)(IγPπ)1PπVγπ+((γγ)(IγPπ)1Pπ)2Vγπ.\displaystyle=V_{\gamma}^{\pi}+(\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}V_{\gamma}^{\pi}+\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{2}V_{\gamma^{\prime}}^{\pi}.

By induction, it is straightforward to show that iterating the above procedure $K\geq 0$ times produces the following equality

V_{\gamma^{\prime}}^{\pi}=\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}V_{\gamma}^{\pi}+\underbrace{\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{K+1}V_{\gamma^{\prime}}^{\pi}}_{\text{residual}}.

Consider the norm of the residual term. Since $P^{\pi}$ is a transition matrix, $\left\lVert P^{\pi}\right\rVert_{\infty}=1$. As a result, $\left\lVert(I-\gamma P^{\pi})^{-1}\right\rVert_{\infty}=\left\lVert\sum_{t=0}^{\infty}\gamma^{t}(P^{\pi})^{t}\right\rVert_{\infty}\leq(1-\gamma)^{-1}$. This implies

\left\lVert\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{K+1}V_{\gamma^{\prime}}^{\pi}\right\rVert_{\infty}\leq\left(\frac{\gamma^{\prime}-\gamma}{1-\gamma}\right)^{K+1}\cdot\frac{R_{\text{max}}}{1-\gamma^{\prime}}.

When $\gamma<\gamma^{\prime}<1$, the residual norm decays exponentially to $0$ as $K\rightarrow\infty$. This implies that the infinite series converges,

Vγπ=k=0((γγ)(IγPπ)1Pπ)kVγπ.\displaystyle V_{\gamma^{\prime}}^{\pi}=\sum_{k=0}^{\infty}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}V_{\gamma}^{\pi}.
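As an illustration (not part of the original argument), the expansion above can be checked numerically on a small randomly generated MDP; the following NumPy sketch, with illustrative variable names, prints the residual norm as the order $K$ grows.

import numpy as np

# Illustrative check of the expansion above on a small random MDP.
rng = np.random.default_rng(0)
n_states = 10
P = rng.dirichlet(np.ones(n_states), size=n_states)    # row-stochastic P^pi
r = rng.uniform(size=n_states)                          # reward vector r^pi
gamma, gamma_p = 0.9, 0.99                              # gamma < gamma' < 1

V_gamma = np.linalg.solve(np.eye(n_states) - gamma * P, r)      # V_gamma^pi
V_gamma_p = np.linalg.solve(np.eye(n_states) - gamma_p * P, r)  # V_{gamma'}^pi

# M = (gamma' - gamma) (I - gamma P)^{-1} P
M = (gamma_p - gamma) * np.linalg.solve(np.eye(n_states) - gamma * P, P)
approx, term = np.zeros(n_states), V_gamma.copy()
for K in range(20):
    approx = approx + term     # K-th order expansion after this update
    term = M @ term
    print(K, np.max(np.abs(V_gamma_p - approx)))  # decays like ((gamma'-gamma)/(1-gamma))^{K+1}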
Additional consideration when γ=1\gamma^{\prime}=1.

When γ=1\gamma^{\prime}=1, in order to ensure finiteness of Vγ=1πV_{\gamma^{\prime}=1}^{\pi}, we assume the following two conditions: (1) The Markov chain induced by π\pi is absorbing; (2) for any absorbing state xx, rπ(x)=0r^{\pi}(x)=0. Without loss of generality, assume there exists a single absorbing state. In general, the transition matrix PπP^{\pi} can be decomposed as follows [Grinstead and Snell, 2012, Ross, 2014],

Pπ=(P~πp~π01),\displaystyle P^{\pi}=\begin{pmatrix}\widetilde{P}^{\pi}&\widetilde{p}^{\pi}\\ 0&1\end{pmatrix},

where $\widetilde{P}^{\pi}\in\mathbb{R}^{(|\mathcal{X}|-1)\times(|\mathcal{X}|-1)}$ and $\widetilde{p}^{\pi}\in\mathbb{R}^{|\mathcal{X}|-1}$. Here, the first $|\mathcal{X}|-1$ states are transient and the last state is absorbing. For convenience, define $\widetilde{r}^{\pi}$ as the reward vector $r^{\pi}$ restricted to the first $|\mathcal{X}|-1$ transient states. We provide a few lemmas below.

Lemma A.1.

The matrix $I-\widetilde{P}^{\pi}$ is invertible and its inverse is $(I-\widetilde{P}^{\pi})^{-1}=\sum_{k=0}^{\infty}(\widetilde{P}^{\pi})^{k}$.

Proof.

Define the matrix $N=\sum_{k=0}^{\infty}(\widetilde{P}^{\pi})^{k}$; then $N[x,y]$ is the expected number of visits to state $y$ starting from $x$ before absorption. Since the chain is absorbing, $N$ is finite. This further shows that $(I-\widetilde{P}^{\pi})$ is invertible, because

N(IP~π)=(IP~π)N=I.\displaystyle N(I-\widetilde{P}^{\pi})=(I-\widetilde{P}^{\pi})N=I.

Lemma A.2.

Let $f(A,B)$ be a matrix polynomial function of matrices $A$ and $B$. Then

f\left(P^{\pi},\left(I-\gamma P^{\pi}\right)^{-1}\right)=\begin{pmatrix}f\left(\widetilde{P}^{\pi},\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\right)&B\\ 0&c\end{pmatrix},

where $B$ is some matrix and $c$ is some scalar.

Proof.

The intuition for the above result is that polynomial transformations preserve the block upper-triangular structure of $P^{\pi}$ and $(I-\gamma P^{\pi})^{-1}$. In general, we can write

f(A,B)=m,nKcm,nAmBn,\displaystyle f(A,B)=\sum_{m,n\leq K}c_{m,n}A^{m}B^{n},

for some K0K\geq 0 and cm,nc_{m,n}\in\mathbb{R} are scalar coefficients. First, note that (Pπ)k,k0(P^{\pi})^{k},k\geq 0 is of the form

(Pπ)k=((P~π)kC01),\displaystyle\left(P^{\pi}\right)^{k}=\begin{pmatrix}\left(\widetilde{P}^{\pi}\right)^{k}&C\\ 0&1\end{pmatrix},

for some matrix $C$. Since $(I-\gamma A)^{-1}=\sum_{k=0}^{\infty}(\gamma A)^{k}$ for $A\in\{P^{\pi},\widetilde{P}^{\pi}\}$, this implies that

(I-\gamma P^{\pi})^{-1}=\sum_{k=0}^{\infty}(\gamma P^{\pi})^{k}=\begin{pmatrix}\sum_{k=0}^{\infty}\left(\gamma\widetilde{P}^{\pi}\right)^{k}&D\\ 0&(1-\gamma)^{-1}\end{pmatrix}=\begin{pmatrix}\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}&D\\ 0&(1-\gamma)^{-1}\end{pmatrix},

for some matrix $D$. The above two results show that polynomials of both $P^{\pi}$ and $(I-\gamma P^{\pi})^{-1}$ are block upper-triangular matrices. It is then straightforward that

\left(P^{\pi}\right)^{m}\left(\left(I-\gamma P^{\pi}\right)^{-1}\right)^{n}=\begin{pmatrix}\left(\widetilde{P}^{\pi}\right)^{m}\left(\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\right)^{n}&E\\ 0&(1-\gamma)^{-n}\end{pmatrix},

for some matrix EE. Finally, since f(Pπ,(IγPπ)1)f(P^{\pi},(I-\gamma P^{\pi})^{-1}) is a linear combination of (Pπ)m((IγPπ)1)n\left(P^{\pi}\right)^{m}\left(\left(I-\gamma P^{\pi}\right)^{-1}\right)^{n}, we conclude the proof. ∎

Lemma A.3.

Under assumptions (1) and (2), one can write the value function $V_{\gamma^{\prime}=1}^{\pi}$ as

V_{\gamma^{\prime}=1}^{\pi}=\sum_{t=0}^{\infty}(P^{\pi})^{t}r^{\pi},

where the infinite series on the RHS converges. In addition, for any transient state xx, Vγ=1π(x)=[k=0(P~π)kr~π](x)V_{\gamma^{\prime}=1}^{\pi}(x)=\left[\sum_{k=0}^{\infty}(\widetilde{P}^{\pi})^{k}\widetilde{r}^{\pi}\right](x).

Proof.

Recall that $V_{\gamma^{\prime}=1}^{\pi}(x)\coloneqq\mathbb{E}\left[\sum_{t=0}^{\infty}r_{t}\;\middle|\;x_{0}=x\right]$. Under assumption (2), for any absorbing state $x$, $V_{\gamma^{\prime}=1}^{\pi}(x)=0=\left[\sum_{t=0}^{\infty}(P^{\pi})^{t}r^{\pi}\right](x)$. We can therefore restrict the Markov chain to the transient states. For any transient state $x$, recalling the definition of $N$ from Lemma A.1, it follows that

V_{\gamma^{\prime}=1}^{\pi}(x)=\sum_{y}N[x,y]\,\widetilde{r}^{\pi}(y)=\left[N\widetilde{r}^{\pi}\right](x)=\left[\left(I-\widetilde{P}^{\pi}\right)^{-1}\widetilde{r}^{\pi}\right](x)=\left[\sum_{k=0}^{\infty}(\widetilde{P}^{\pi})^{k}\widetilde{r}^{\pi}\right](x).

By Lemma A.2, this is equivalent to $\left[\sum_{t=0}^{\infty}(P^{\pi})^{t}r^{\pi}\right](x)$. We thus complete the proof. ∎

Lemma A.4.

The following holds for any γ<1\gamma<1,

(IP~π)1\displaystyle\left(I-\widetilde{P}^{\pi}\right)^{-1} =(IγP~π(1γ)P~π)1\displaystyle=\left(I-\gamma\widetilde{P}^{\pi}-(1-\gamma)\widetilde{P}^{\pi}\right)^{-1}
=k=0K((1γ)(IγP~π)1P~π)k(IγP~π)1+((1γ)(IγP~π)1P~π)K+1(IP~π)1\displaystyle=\sum_{k=0}^{K}\left((1-\gamma)\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{P}^{\pi}\right)^{k}\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}+\left((1-\gamma)\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{P}^{\pi}\right)^{K+1}\left(I-\widetilde{P}^{\pi}\right)^{-1}
=k=0((1γ)(IγP~π)1P~π)k(IγP~π)1.\displaystyle=\sum_{k=0}^{\infty}\left((1-\gamma)\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{P}^{\pi}\right)^{k}\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}. (20)
Proof.

The first two lines follow from a straightforward application of the Woodbury matrix identity to $(I-\widetilde{P}^{\pi})^{-1}$. This is valid because, by Lemma A.1, $(I-\widetilde{P}^{\pi})$ is invertible. The convergence of the infinite series is guaranteed for all $\gamma<1$. To see why, recall that the finiteness of $N=\sum_{k=0}^{\infty}(\widetilde{P}^{\pi})^{k}$ implies $(\widetilde{P}^{\pi})^{K+1}\rightarrow 0$. We can bound the residual,

((1γ)(IγP~π)1P~π)K+1(IP~π)1(P~π)K+1(IP~π)10.\displaystyle\left\lVert\left((1-\gamma)\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{P}^{\pi}\right)^{K+1}\left(I-\widetilde{P}^{\pi}\right)^{-1}\right\rVert_{\infty}\leq\left\lVert(\widetilde{P}^{\pi})^{K+1}\left(I-\widetilde{P}^{\pi}\right)^{-1}\right\rVert_{\infty}\rightarrow 0.

Finally, we combine the results from the above to prove the main claim. First, consider the absorbing state $x$. Due to assumption (2), $V_{\gamma}^{\pi}(x)=0$ for any $\gamma\in[0,1]$, so the matrix equalities in Proposition 3.2 hold trivially in this case.

In the following, we consider any transient state $x$. By Lemmas A.3 and A.4,

V_{\gamma^{\prime}=1}^{\pi}(x) =\left[\sum_{k=0}^{\infty}(\widetilde{P}^{\pi})^{k}\widetilde{r}^{\pi}\right](x)
=[k=0K((1γ)(IγP~π)1P~π)k(IγP~π)1r~π+((1γ)(IγP~π)1P~π)K+1(IP~π)1r~π](x)\displaystyle=\left[\sum_{k=0}^{K}\left((1-\gamma)\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{P}^{\pi}\right)^{k}\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{r}^{\pi}+\left((1-\gamma)\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{P}^{\pi}\right)^{K+1}\left(I-\widetilde{P}^{\pi}\right)^{-1}\widetilde{r}^{\pi}\right](x)

Now, notice that because the last entries of rπ,Vγπ,Vγ=1πr^{\pi},V_{\gamma}^{\pi},V_{\gamma^{\prime}=1}^{\pi} are zero (for the absorbing state),

[(IγP~π)1r~π](x)=[(IγPπ)1rπ](x).\displaystyle\left[\left(I-\gamma\widetilde{P}^{\pi}\right)^{-1}\widetilde{r}^{\pi}\right](x)=\left[\left(I-\gamma P^{\pi}\right)^{-1}r^{\pi}\right](x).

Combining with Lemma A.2,

V_{\gamma^{\prime}=1}^{\pi}(x) =\left[\sum_{k=0}^{K}\left((1-\gamma)\left(I-\gamma P^{\pi}\right)^{-1}P^{\pi}\right)^{k}\left(I-\gamma P^{\pi}\right)^{-1}r^{\pi}+\left((1-\gamma)\left(I-\gamma P^{\pi}\right)^{-1}P^{\pi}\right)^{K+1}V_{\gamma^{\prime}=1}^{\pi}\right](x)
=[k=0K((1γ)(IγPπ)1Pπ)kVγπK-th order expansion+((1γ)(IγPπ)1Pπ)K+1Vγ=1πresidual](x).\displaystyle=\left[\underbrace{\sum_{k=0}^{K}\left((1-\gamma)\left(I-\gamma P^{\pi}\right)^{-1}P^{\pi}\right)^{k}V_{\gamma}^{\pi}}_{\text{K-th\ order\ expansion}}+\underbrace{\left((1-\gamma)\left(I-\gamma P^{\pi}\right)^{-1}P^{\pi}\right)^{K+1}V_{\gamma^{\prime}=1}^{\pi}}_{\text{residual}}\right](x).

The residual term $\rightarrow 0$ as $K\rightarrow\infty$, by arguments similar to those used for Lemma A.4. We hence conclude the proof. ∎

See Proposition 3.2.

Proof.

The proof follows directly from the residual term in Proposition 3.1. Recall that the residual term takes the form

VγπVK,γ,γπ=((γγ)(IγPπ)1Pπ)K+1Vγπ.\displaystyle V_{\gamma^{\prime}}^{\pi}-V_{K,\gamma,\gamma^{\prime}}^{\pi}=\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{K+1}V_{\gamma^{\prime}}^{\pi}.

Its infinity norm can be bounded as $\left(\frac{\gamma^{\prime}-\gamma}{1-\gamma}\right)^{K+1}\frac{R_{\text{max}}}{1-\gamma^{\prime}}$, as shown in the proof of Proposition 3.1. ∎

See Lemma 4.1.

Proof.

We will derive the above result in matrix form. Recall that by applying the Woodbury matrix identity to $(I-\gamma^{\prime}P^{\pi})^{-1}=(I-(\gamma^{\prime}-\gamma)P^{\pi}-\gamma P^{\pi})^{-1}$, we get

(IγPπ)1\displaystyle(I-\gamma^{\prime}P^{\pi})^{-1} =k=0((γγ)(IγPπ)1Pπ)k(IγPπ)1\displaystyle=\sum_{k=0}^{\infty}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}(I-\gamma P^{\pi})^{-1}
=(IγPπ)1+k=1((γγ)(IγPπ)1Pπ)k(IγPπ)1\displaystyle=(I-\gamma P^{\pi})^{-1}+\sum_{k=1}^{\infty}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}(I-\gamma P^{\pi})^{-1}
=(I-\gamma P^{\pi})^{-1}+(\gamma^{\prime}-\gamma)\sum_{k=0}^{\infty}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}\cdot(I-\gamma P^{\pi})^{-1}\cdot P^{\pi}(I-\gamma P^{\pi})^{-1}
=(IγPπ)1+(γγ)(IγPπ)1Pπ(IγPπ)1.\displaystyle=(I-\gamma P^{\pi})^{-1}+(\gamma^{\prime}-\gamma)(I-\gamma^{\prime}P^{\pi})^{-1}\cdot P^{\pi}\cdot(I-\gamma P^{\pi})^{-1}.

Then, right multiply the above equation by rπr^{\pi},

Vγπ\displaystyle V_{\gamma^{\prime}}^{\pi} =Vγπ+(γγ)(IγPπ)1PπVγπ\displaystyle=V_{\gamma}^{\pi}+(\gamma^{\prime}-\gamma)(I-\gamma^{\prime}P^{\pi})^{-1}P^{\pi}V_{\gamma}^{\pi}
=Vγπ+(γγ)t=1(γ)t1(Pπ)tVγπ.\displaystyle=V_{\gamma}^{\pi}+(\gamma^{\prime}-\gamma)\sum_{t=1}^{\infty}(\gamma^{\prime})^{t-1}(P^{\pi})^{t}V_{\gamma}^{\pi}.

By indexing both sides at state xx, we recover the following equality,

Vγπ(x)=Vγπ(x)+𝔼π[t=1(γγ)(γ)t1Vγπ(xt)|x0=x].\displaystyle V_{\gamma^{\prime}}^{\pi}(x)=V_{\gamma}^{\pi}(x)+\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}(\gamma^{\prime}-\gamma)(\gamma^{\prime})^{t-1}V_{\gamma}^{\pi}(x_{t})\;\middle|\;x_{0}=x\right].

To derive the expression for $\rho_{x,\gamma,\gamma^{\prime}}^{\pi}$, note also that

Vγπ=(IγPπ)1rπ=(IγPπ)(IγPπ)1(IγPπ)1rπ=(IγPπ)(IγPπ)1weight matrixWVγπ,\displaystyle V_{\gamma^{\prime}}^{\pi}=(I-\gamma^{\prime}P^{\pi})^{-1}r^{\pi}=(I-\gamma P^{\pi})(I-\gamma^{\prime}P^{\pi})^{-1}(I-\gamma P^{\pi})^{-1}r^{\pi}=\underbrace{(I-\gamma P^{\pi})(I-\gamma^{\prime}P^{\pi})^{-1}}_{\text{weight\ matrix}W}V_{\gamma}^{\pi},

where we use the fact that $(I-\gamma P^{\pi})$ commutes with $(I-\gamma^{\prime}P^{\pi})^{-1}$. Since $\rho_{x,\gamma,\gamma^{\prime}}^{\pi}$ is defined such that $V_{\gamma^{\prime}}^{\pi}(x)=\left(\rho_{x,\gamma,\gamma^{\prime}}^{\pi}\right)^{T}V_{\gamma}^{\pi}$, we can derive the matrix form of $\rho_{x,\gamma,\gamma^{\prime}}^{\pi}$ by indexing the $x$-th row of the weight matrix $W$. This directly leads to the desired result

ρx,γ,γπ=(Iγ(Pπ)T)(Iγ(Pπ)T)1δx.\displaystyle\rho_{x,\gamma,\gamma^{\prime}}^{\pi}=\left(I-\gamma(P^{\pi})^{T}\right)\left(I-\gamma^{\prime}(P^{\pi})^{T}\right)^{-1}\delta_{x}.
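As a quick numerical sanity check (illustrative only, not part of the proof), the weight vector above can be verified to reproduce $V_{\gamma^{\prime}}^{\pi}(x)$ from $V_{\gamma}^{\pi}$ on a small random MDP:

import numpy as np

rng = np.random.default_rng(1)
n = 8
P = rng.dirichlet(np.ones(n), size=n)     # P^pi
r = rng.uniform(size=n)                   # r^pi
gamma, gamma_p = 0.9, 0.99

V_g = np.linalg.solve(np.eye(n) - gamma * P, r)
V_gp = np.linalg.solve(np.eye(n) - gamma_p * P, r)

x = 3
delta_x = np.eye(n)[x]
# rho_x = (I - gamma P^T)(I - gamma' P^T)^{-1} delta_x
rho_x = (np.eye(n) - gamma * P.T) @ np.linalg.solve(np.eye(n) - gamma_p * P.T, delta_x)
assert np.isclose(rho_x @ V_g, V_gp[x])   # V_{gamma'}^pi(x) = rho_x^T V_gamma^pi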

See 4.3

Proof.

First, we assume $\gamma^{\prime}<1$; we will consider the extension to $\gamma^{\prime}=1$ at the end of the proof. Recall that the policy gradient takes the following form,

θVγπθ(x)=𝔼πθ[t=0γtQγπθ(xt,at)θlogπθ(at|xt)|x0=x.]\displaystyle\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x)=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x.\right]

Plugging in the above, the partial derivative $\left(\partial_{V}F(V_{\gamma}^{\pi_{\theta}},\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}})\right)^{T}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}$ evaluates to the following

θVγπθ(x)+𝔼πθ[(γγ)t=1(γ)t1θVγπθ(xt)]\displaystyle\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x)+\mathbb{E}_{\pi_{\theta}}\left[(\gamma^{\prime}-\gamma)\sum_{t=1}^{\infty}(\gamma^{\prime})^{t-1}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x_{t})\right]
=𝔼πθ[t=0γtQγπθ(xt,at)θlogπθ(at|xt)|x0=x]\displaystyle=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x\right]
+\mathbb{E}_{\pi_{\theta}}\left[(\gamma^{\prime}-\gamma)\sum_{t=1}^{\infty}(\gamma^{\prime})^{t-1}\sum_{s=0}^{\infty}\gamma^{s}Q_{\gamma}^{\pi_{\theta}}(x_{t+s},a_{t+s})\nabla_{\theta}\log\pi_{\theta}(a_{t+s}|x_{t+s})\;\middle|\;x_{0}=x\right]
=𝔼πθ[t=0(γt+u=1t(γγ)(γ)u1γtucoefficient at time t)Qγπθ(xt,at)θlogπθ(at|xt)|x0=x.]\displaystyle=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}\left(\underbrace{\gamma^{t}+\sum_{u=1}^{t}(\gamma^{\prime}-\gamma)(\gamma^{\prime})^{u-1}\gamma^{t-u}}_{\text{coefficient\ at\ time\ t}}\right)Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\;\middle|\;x_{0}=x.\right]

In the above, the coefficient at time $t$ is obtained by carefully grouping terms across time steps. The sum telescopes: $\gamma^{t}+\sum_{u=1}^{t}(\gamma^{\prime}-\gamma)(\gamma^{\prime})^{u-1}\gamma^{t-u}=\gamma^{t}+\sum_{u=1}^{t}\left((\gamma^{\prime})^{u}\gamma^{t-u}-(\gamma^{\prime})^{u-1}\gamma^{t-u+1}\right)=(\gamma^{\prime})^{t}$ for all $t\geq 0$. This concludes the proof.

Alternative proof based on matrix notations.

We introduce an alternative proof based on matrix notations as it will make the extension to γ=1\gamma^{\prime}=1 simpler. First, note that

Vγπ=(IγPπ)1rπ=(IγPπ)1(IγPπ)(IγPπ)1rπ=(IγPπ)(IγPπ)1(IγPπ)1rπ,\displaystyle V_{\gamma^{\prime}}^{\pi}=\left(I-\gamma^{\prime}P^{\pi}\right)^{-1}r^{\pi}=\left(I-\gamma^{\prime}P^{\pi}\right)^{-1}\left(I-\gamma P^{\pi}\right)\left(I-\gamma P^{\pi}\right)^{-1}r^{\pi}=\left(I-\gamma P^{\pi}\right)\left(I-\gamma^{\prime}P^{\pi}\right)^{-1}\left(I-\gamma P^{\pi}\right)^{-1}r^{\pi},

where for the last equality we exploit the fact that $\left(I-\gamma P^{\pi}\right)$ commutes with $\left(I-\gamma^{\prime}P^{\pi}\right)^{-1}$. Now, notice that the above rewrites as

Vγπ=(IγPπ)(IγPπ)1Wγ,γVγπ\displaystyle V_{\gamma^{\prime}}^{\pi}=\underbrace{\left(I-\gamma P^{\pi}\right)\left(I-\gamma^{\prime}P^{\pi}\right)^{-1}}_{W_{\gamma,\gamma^{\prime}}}V_{\gamma}^{\pi}

where $W_{\gamma,\gamma^{\prime}}$ is the weight matrix. This matrix is equivalent to the weighting distribution $\rho_{x,\gamma,\gamma^{\prime}}^{\pi}$ through $W_{\gamma,\gamma^{\prime}}[x]=\rho_{x,\gamma,\gamma^{\prime}}^{\pi}$, where $A[x]$ denotes the $x$-th row of a matrix $A$. The first partial gradient corresponds to differentiating $V_{\gamma^{\prime}}^{\pi_{\theta}}$ only through $V_{\gamma}^{\pi_{\theta}}$. To make the derivation clear in matrix notation, let $\theta_{i}$ be the $i$-th component of the parameter $\theta$, and let $\nabla_{\theta_{i}}V_{\gamma}^{\pi_{\theta}}\in\mathbb{R}^{\mathcal{X}}$ denote the vector whose $x$-th entry is $\nabla_{\theta_{i}}V_{\gamma}^{\pi_{\theta}}(x)$. This means the $i$-th component of the first partial gradient across all states is

Wγ,γθiVγπθ𝒳.\displaystyle W_{\gamma,\gamma^{\prime}}\nabla_{\theta_{i}}V_{\gamma}^{\pi_{\theta}}\in\mathbb{R}^{\mathcal{X}}.

Let $G_{\gamma}^{\theta_{i}}\in\mathbb{R}^{\mathcal{X}}$ be the vector of local gradients (for parameter $\theta_{i}$) such that $G_{\gamma}^{\theta_{i}}(x)=\sum_{a}\nabla_{\theta_{i}}\pi_{\theta}(a|x)Q_{\gamma}^{\pi_{\theta}}(x,a)$. Vanilla PG [Sutton et al., 2000] can be expressed as

θiVγπθ=(IγPπ)1Gγθi.\displaystyle\nabla_{\theta_{i}}V_{\gamma}^{\pi_{\theta}}=\left(I-\gamma P^{\pi}\right)^{-1}G_{\gamma}^{\theta_{i}}.

We can finally derive the following,

W_{\gamma,\gamma^{\prime}}\nabla_{\theta_{i}}V_{\gamma}^{\pi_{\theta}} =(I-\gamma P^{\pi})(I-\gamma^{\prime}P^{\pi})^{-1}\nabla_{\theta_{i}}V_{\gamma}^{\pi_{\theta}}
=(IγPπ)(IγPπ)1(IγPπ)1Gγθi\displaystyle=(I-\gamma P^{\pi})(I-\gamma^{\prime}P^{\pi})^{-1}(I-\gamma P^{\pi})^{-1}G_{\gamma}^{\theta_{i}}
=(IγPπ)(IγPπ)1(IγPπ)1Gγθi\displaystyle=(I-\gamma P^{\pi})(I-\gamma P^{\pi})^{-1}(I-\gamma^{\prime}P^{\pi})^{-1}G_{\gamma}^{\theta_{i}}
=(IγPπ)1Gγθi\displaystyle=(I-\gamma^{\prime}P^{\pi})^{-1}G_{\gamma}^{\theta_{i}}

Now, consider the xx-th component of the above vector. We have θ[J(πθ,πt)]πt=πθ\nabla_{\theta}[J(\pi_{\theta},\pi_{t})]_{\pi_{t}=\pi_{\theta}} is equal to

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}(\gamma^{\prime})^{t}\sum_{a}\nabla_{\theta_{i}}\pi_{\theta}(a|x_{t})Q_{\gamma}^{\pi_{\theta}}(x_{t},a)\;\middle|\;x_{0}=x\right]=\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{\infty}(\gamma^{\prime})^{t}\nabla_{\theta_{i}}\log\pi_{\theta}(a_{t}|x_{t})Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\;\middle|\;x_{0}=x\right]

Concatenating the gradients over all components $\theta_{i}$ of $\theta$, we conclude the proof.

Extensions to the case γ=1\gamma^{\prime}=1.

Similar to the arguments made in the proof of Proposition 3.2, under assumptions (1) and (2) stated there, we can decompose the transition matrix $P^{\pi}$ as

Pπ=(P~p~01),\displaystyle P^{\pi}=\begin{pmatrix}\widetilde{P}&\widetilde{p}\\ 0&1\end{pmatrix},

where the last state is assumed to be absorbing. Though $(I-\gamma^{\prime}P^{\pi})$ for $\gamma^{\prime}=1$ is in general not invertible, the matrix $(I-\widetilde{P})$ is invertible. Since $r^{\pi}(x)=0$ for the absorbing state $x$, we deduce that $Q_{\gamma}^{\pi}(x,a)=V_{\gamma}^{\pi}(x)=0$, and accordingly $G_{\gamma}^{\theta_{i}}(x)=0$. As such, though $(I-\gamma^{\prime}P^{\pi})^{-1}$ for $\gamma^{\prime}=1$ might be undefined, the product $(I-\gamma^{\prime}P^{\pi})^{-1}G_{\gamma}^{\theta_{i}}$, interpreted as the series $\sum_{t=0}^{\infty}(\gamma^{\prime}P^{\pi})^{t}G_{\gamma}^{\theta_{i}}$, is well-defined, with the last entry being $0$. Since at time $t=T$ the chain enters the absorbing state, all local gradient terms after $T$ are zero. As a result, the $x$-th component of $(I-\gamma^{\prime}P^{\pi})^{-1}G_{\gamma}^{\theta_{i}}$ is

\mathbb{E}_{\pi_{\theta}}\left[\sum_{t=0}^{T}(\gamma^{\prime})^{t}\nabla_{\theta_{i}}\log\pi_{\theta}(a_{t}|x_{t})Q_{\gamma}^{\pi_{\theta}}(x_{t},a_{t})\;\middle|\;x_{0}=x\right].

See 4.4

Proof.

Recall from Lemma 4.1 that, by construction,

ρx,γ,γπ=(Iγ(Pπ)T)1(Iγ(Pπ)T)δx.\displaystyle\rho_{x,\gamma,\gamma^{\prime}}^{\pi}=\left(I-\gamma^{\prime}(P^{\pi})^{T}\right)^{-1}\left(I-\gamma(P^{\pi})^{T}\right)\delta_{x}.

Similar to the primal-space expansions in Section 3.1, we construct the $K$th order expansion of $\rho_{x,\gamma,\gamma^{\prime}}^{\pi}$ via the expansion of the matrix $\left(I-\gamma^{\prime}(P^{\pi})^{T}\right)^{-1}$. Recall that

(Iγ(Pπ)T)1=k=0((γγ)(Pπ)T(Iγ(Pπ)T)1)k(Iγ(Pπ)T)1.\displaystyle\left(I-\gamma^{\prime}(P^{\pi})^{T}\right)^{-1}=\sum_{k=0}^{\infty}\left((\gamma^{\prime}-\gamma)(P^{\pi})^{T}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\right)^{k}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}.

When truncating the infinite series to the first K+1K+1 terms, we derive the KKth order expansion ρx,K,γ,γπ\rho_{x,K,\gamma,\gamma^{\prime}}^{\pi},

\rho_{x,K,\gamma,\gamma^{\prime}}^{\pi}=\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)(P^{\pi})^{T}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\right)^{k}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\left(I-\gamma(P^{\pi})^{T}\right)\delta_{x}=\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)(P^{\pi})^{T}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\right)^{k}\delta_{x}.

The residual of the truncation can be bounded as

k=K+1((γγ)(Pπ)T(Iγ(Pπ)T)1)k(Iγ(Pπ)T)1(γγ1γ)K+111γ.\displaystyle\left\lVert\sum_{k=K+1}^{\infty}\left((\gamma^{\prime}-\gamma)(P^{\pi})^{T}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\right)^{k}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\right\rVert_{\infty}\leq\left(\frac{\gamma^{\prime}-\gamma}{1-\gamma}\right)^{K+1}\frac{1}{1-\gamma^{\prime}}.

which vanishes as $K\rightarrow\infty$ for $\gamma<\gamma^{\prime}<1$. This concludes the proof. ∎

Appendix B Further results on Taylor expansions in the dual space

The dual representation of the value function $V_{\gamma^{\prime}}^{\pi}(x)$ in Eqn (6) is $V_{\gamma^{\prime}}^{\pi}(x)=(1-\gamma^{\prime})^{-1}(r^{\pi})^{T}d_{x,\gamma^{\prime}}^{\pi}$, where $r^{\pi},d_{x,\gamma^{\prime}}^{\pi}\in\mathbb{R}^{\mathcal{X}}$ are the reward vector and the discounted visitation distribution starting at state $x$. Here, we abuse the notation $d_{x,\gamma}^{\pi}$ to denote both a function and a vector, i.e., $d_{x,\gamma}^{\pi}(x^{\prime})$ can be interpreted as both a function evaluation and a vector indexing. Given such a dual representation, one natural question is whether the $K$th order expansion in the primal space corresponds to some approximation of the discounted visitation distribution, $d_{x,K,\gamma,\gamma^{\prime}}^{\pi}\approx d_{x,\gamma^{\prime}}^{\pi}$. Below, we answer in the affirmative.

Let δx𝒳\mathbf{\delta}_{x}\in\mathbb{R}^{\mathcal{X}} be the one-hot distribution such that [δx]x=1[\mathbf{\delta}_{x}]_{x^{\prime}}=1 only when x=xx^{\prime}=x. The visitation distribution satisfies the following balance equation in matrix form

dx,γπ=(1γ)δx+γ(Pπ)Tdx,γπ.\displaystyle d_{x,\gamma^{\prime}}^{\pi}=(1-\gamma^{\prime})\mathbf{\delta}_{x}+\gamma^{\prime}(P^{\pi})^{T}d_{x,\gamma^{\prime}}^{\pi}. (21)

Inverting the equation, we obtain an explicit expression for the visitation distribution $d_{x,\gamma^{\prime}}^{\pi}=(1-\gamma^{\prime})(I-\gamma^{\prime}(P^{\pi})^{T})^{-1}\mathbf{\delta}_{x}$. Following the techniques used in the derivation of Proposition 3.1, we can derive similar approximation results for the dual variables, as stated below.

Proposition B.1.

The following holds for all K0K\geq 0,

d_{x,\gamma^{\prime}}^{\pi} =\frac{1-\gamma^{\prime}}{1-\gamma}\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{k}d_{x,\gamma}^{\pi}+\underbrace{\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{K+1}d_{x,\gamma^{\prime}}^{\pi}}_{\text{residual}}. (22)

When γ<γ<1\gamma<\gamma^{\prime}<1, the residual norm 0\rightarrow 0, which implies that the following holds

dx,γπ=1γ1γk=0((γγ)(Iγ(Pπ)T)1(Pπ)T)kdx,γπ.\displaystyle d_{x,\gamma^{\prime}}^{\pi}=\frac{1-\gamma^{\prime}}{1-\gamma}\sum_{k=0}^{\infty}((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T})^{k}d_{x,\gamma}^{\pi}. (23)
Proof.

Starting from the fixed-point equation satisfied by $d_{x,\gamma^{\prime}}^{\pi}$, we apply the Woodbury matrix identity,

d_{x,\gamma^{\prime}}^{\pi} =(1-\gamma^{\prime})\left(I-\gamma^{\prime}(P^{\pi})^{T}\right)^{-1}\delta_{x}
=(1-\gamma^{\prime})\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{k}\left(I-\gamma(P^{\pi})^{T}\right)^{-1}\delta_{x}+(1-\gamma^{\prime})\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{K+1}\left(I-\gamma^{\prime}(P^{\pi})^{T}\right)^{-1}\delta_{x}
=\frac{1-\gamma^{\prime}}{1-\gamma}\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{k}d_{x,\gamma}^{\pi}+\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{K+1}d_{x,\gamma^{\prime}}^{\pi}

The norm of the residual term can be bounded as

\left\lVert\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{K+1}d_{x,\gamma^{\prime}}^{\pi}\right\rVert_{1}\leq\left(\frac{\gamma^{\prime}-\gamma}{1-\gamma}\right)^{K+1}\left\lVert d_{x,\gamma^{\prime}}^{\pi}\right\rVert_{1}=\left(\frac{\gamma^{\prime}-\gamma}{1-\gamma}\right)^{K+1}\rightarrow 0,

where we use that $\left\lVert(P^{\pi})^{T}\right\rVert_{1}=1$ and that $d_{x,\gamma^{\prime}}^{\pi}$ is a probability distribution.

With a similar motivation as the expansions in the primal space, we define the $K$th order expansion by truncating the series to its first $K+1$ terms,

d_{x,K,\gamma,\gamma^{\prime}}^{\pi}\coloneqq\frac{1-\gamma^{\prime}}{1-\gamma}\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)\left(I-\gamma(P^{\pi})^{T}\right)^{-1}(P^{\pi})^{T}\right)^{k}d_{x,\gamma}^{\pi} (24)

The following result formalizes the connection between the $K$th order dual approximation to the visitation distribution $d_{x,K,\gamma,\gamma^{\prime}}^{\pi}$ and the primal approximation to the value function at state $x$, $V_{K,\gamma,\gamma^{\prime}}^{\pi}(x)$.

Proposition B.2.

The KKth order primal and dual approximations are related by the following equality for any K0K\geq 0,

VK,γ,γπ(x)=(1γ)1(dx,K,γ,γπ)Trπ\displaystyle V_{K,\gamma,\gamma^{\prime}}^{\pi}(x)=(1-\gamma^{\prime})^{-1}\left(d_{x,K,\gamma,\gamma^{\prime}}^{\pi}\right)^{T}r^{\pi} (25)
Proof.

The proof follows by expanding out the RHS of the equation. Recall the definition of $d_{x,K,\gamma,\gamma^{\prime}}^{\pi}$,

(d_{x,K,\gamma,\gamma^{\prime}}^{\pi})^{T} =\frac{1-\gamma^{\prime}}{1-\gamma}\sum_{k=0}^{K}(d_{x,\gamma}^{\pi})^{T}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}
=(1-\gamma^{\prime})\sum_{k=0}^{K}\delta_{x}^{T}(I-\gamma P^{\pi})^{-1}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}
=(1-\gamma^{\prime})\delta_{x}^{T}\left[\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}\right]\left(I-\gamma P^{\pi}\right)^{-1}.

Now, multiplying on the right by $r^{\pi}$ and recalling that $V_{\gamma}^{\pi}=(I-\gamma P^{\pi})^{-1}r^{\pi}$, we conclude the proof,

\left(d_{x,K,\gamma,\gamma^{\prime}}^{\pi}\right)^{T}r^{\pi}=(1-\gamma^{\prime})\delta_{x}^{T}\left[\sum_{k=0}^{K}\left((\gamma^{\prime}-\gamma)(I-\gamma P^{\pi})^{-1}P^{\pi}\right)^{k}\right]V_{\gamma}^{\pi}=(1-\gamma^{\prime})\delta_{x}^{T}V_{K,\gamma,\gamma^{\prime}}^{\pi}=(1-\gamma^{\prime})V_{K,\gamma,\gamma^{\prime}}^{\pi}(x).

Proposition B.2 shows that indeed, the KKth order approximation of the value function is equivalent to the KKth order approximation of the visitation distribution in the dual space. It is instructive to consider the special case K=1K=1.

Appendix C Details on Taylor expansion Q-function advantage estimation

Proposition C.1.

Let $Q_{\gamma}^{\pi}\in\mathbb{R}^{\mathcal{X}\times\mathcal{A}}$ be the vector of Q-functions. Let $\overline{P}^{\pi}\in\mathbb{R}^{\left(\mathcal{X}\times\mathcal{A}\right)\times\left(\mathcal{X}\times\mathcal{A}\right)}$ be the transition matrix such that $\overline{P}^{\pi}(x,a,x^{\prime},a^{\prime})=\pi(a^{\prime}|x^{\prime})p(x^{\prime}|x,a)$. Define the $K$th order Taylor expansion of the Q-function as $Q_{K,\gamma,\gamma^{\prime}}^{\pi}\coloneqq\sum_{k=0}^{K}((\gamma^{\prime}-\gamma)(I-\gamma\overline{P}^{\pi})^{-1}\overline{P}^{\pi})^{k}Q_{\gamma}^{\pi}$. Then $\lim_{K\rightarrow\infty}Q_{K,\gamma,\gamma^{\prime}}^{\pi}=Q_{\gamma^{\prime}}^{\pi}$ for any $\gamma<\gamma^{\prime}<1$.

0:  A trajectory (xt,at,rt)t=0π(x_{t},a_{t},r_{t})_{t=0}^{\infty}\sim\pi and discount factors γ<γ<1\gamma<\gamma^{\prime}<1
  1. Compute Q-function estimates $\widehat{Q}_{\gamma}^{\pi}(x_{t},a_{t})$ for states on the trajectory. For example, $\widehat{Q}_{\gamma}^{\pi}(x_{t},a_{t})=\sum_{t^{\prime}\geq t}\gamma^{t^{\prime}-t}r_{t^{\prime}}$. One could also apply other alternatives (e.g., [Schulman et al., 2015b]) which potentially reduce the variance of $\widehat{Q}_{\gamma}^{\pi}(x_{t},a_{t})$.
  2. Sample $K$ random times $\tau_{i},1\leq i\leq K$, i.i.d. geometrically distributed $\tau_{i}\sim\text{Geometric}(1-\gamma)$.
  3. Return (γγ)K(1γ)KQ^γπ(xτ,aτ)\frac{(\gamma^{\prime}-\gamma)^{K}}{(1-\gamma)^{K}}\widehat{Q}_{\gamma}^{\pi}(x_{\tau},a_{\tau}), where τ=i=1Kτi\tau=\sum_{i=1}^{K}\tau_{i}.
Algorithm 4 Estimating the KKth term of the expansion (Q-function)
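The following is a minimal Python sketch of Algorithm 4 (illustrative names; the paper's actual implementation is not reproduced here), assuming `q_hat` holds $\widehat{Q}_{\gamma}^{\pi}$ estimates along a single trajectory and, as an additional assumption, that times beyond the recorded trajectory contribute zero.

import numpy as np

def kth_term_estimate(q_hat, K, gamma, gamma_p, rng):
    # One sample of the K-th term of the expansion for the trajectory's first state-action pair.
    if K == 0:
        return q_hat[0]
    # tau_i ~ Geometric(1 - gamma) with support {1, 2, ...}
    tau = int(rng.geometric(1.0 - gamma, size=K).sum())
    if tau >= len(q_hat):
        return 0.0   # assumption: treat times past the recorded trajectory as zero
    return ((gamma_p - gamma) / (1.0 - gamma)) ** K * q_hat[tau]

rng = np.random.default_rng(0)
q_hat = np.zeros(1000)   # placeholder \hat{Q}_gamma^pi estimates along one trajectory
first_order_term = kth_term_estimate(q_hat, K=1, gamma=0.99, gamma_p=0.999, rng=rng)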
Proof.

The proof follows closely that of the Taylor expansion based approximation to value functions in Proposition 3.2. Importantly, notice that here we define $\overline{P}^{\pi}$, which differs from the $P^{\pi}$ used in the derivation for value functions. In particular, $\overline{P}^{\pi}(x,a,y,b)=p(y|x,a)\pi(b|y)$ for any $x,y\in\mathcal{X},a,b\in\mathcal{A}$. Let $r$ be the vector reward function. The Bellman equation for the Q-function is

Qγπ=r+γP¯πQγπ.\displaystyle Q_{\gamma^{\prime}}^{\pi}=r+\gamma^{\prime}\overline{P}^{\pi}Q_{\gamma^{\prime}}^{\pi}.

Inverting the equation and applying the Woodbury inversion identity,

Qγπ\displaystyle Q_{\gamma^{\prime}}^{\pi} =(IγP¯π)1r=k=0((γγ)(IγP¯π)1P¯π)kQγπ\displaystyle=(I-\gamma^{\prime}\overline{P}^{\pi})^{-1}r=\sum_{k=0}^{\infty}\left((\gamma^{\prime}-\gamma)\left(I-\gamma\overline{P}^{\pi}\right)^{-1}\overline{P}^{\pi}\right)^{k}Q_{\gamma}^{\pi}

The above equality holds for all γ<γ<1\gamma<\gamma^{\prime}<1 due to similar convergence argument as in Proposition 3.2. Truncating the infinite series at step KK, we arrive at the KKth order expansion QK,γ,γπQ_{K,\gamma,\gamma^{\prime}}^{\pi}. By construction, limKQK,γ,γπ=Qγπ\lim_{K\rightarrow\infty}Q_{K,\gamma,\gamma^{\prime}}^{\pi}=Q_{\gamma^{\prime}}^{\pi}. ∎

Appendix D Details on Taylor expansion update weighting

Proposition D.1.

The following is true for all K0K\geq 0,

\rho_{x,K,\gamma,\gamma^{\prime}}^{\pi}(x^{\prime})=\mathbb{I}[x^{\prime}=x]+\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}f(K,t,\gamma,\gamma^{\prime})\mathbb{I}[x_{t}=x^{\prime}]\;\middle|\;x_{0}=x\right],

Equivalently, the KKth order Taylor expansion of Vγπ(x)V_{\gamma^{\prime}}^{\pi}(x) is

V_{K,\gamma,\gamma^{\prime}}^{\pi}(x)=V_{\gamma}^{\pi}(x)+\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}f(K,t,\gamma,\gamma^{\prime})V_{\gamma}^{\pi}(x_{t})\;\middle|\;x_{0}=x\right], (26)

where f(K,t,γ,γ)=u=1min(K,t)(γγ)uγtu(t1tu)f(K,t,\gamma,\gamma^{\prime})=\sum_{u=1}^{\min(K,t)}(\gamma^{\prime}-\gamma)^{u}\gamma^{t-u}\binom{t-1}{t-u} is a weight function.

Proof.

We start with a few lemmas.

Lemma D.2.

For any $n\geq 0,k\geq 1$, define the set of $k$-dimensional vectors $\left\{(x_{1},\ldots,x_{k})\,\middle|\,x_{i}\in\mathbb{Z}_{\geq 0},\sum_{i=1}^{k}x_{i}=n\right\}$ and let $F(n,k)$ be the size of this set. Then

F(n,k)=(n+k1k1).\displaystyle F(n,k)=\binom{n+k-1}{k-1}.
Proof.

By construction, the above set can be decomposed into smaller sets by fixing the value of xkx_{k}, i.e.,

{x1,xk|xi0,i=1kxi=n}=s=0n{x1,xk1,xk|xi0,i=1kxi=n,xk=s}\displaystyle\left\{x_{1},...x_{k}|x_{i}\in\mathbb{Z}_{\geq 0},\sum_{i=1}^{k}x_{i}=n\right\}=\cup_{s=0}^{n}\left\{x_{1},...x_{k-1},x_{k}|x_{i}\in\mathbb{Z}_{\geq 0},\sum_{i=1}^{k}x_{i}=n,x_{k}=s\right\}

Since these sets do not overlap, we have a recursive formula, F(n,k)=s=0nF(ns,k1)F(n,k)=\sum_{s=0}^{n}F(n-s,k-1). Starting from the base case F(n,1)=1,n0F(n,1)=1,\forall n\geq 0, it is straightforward to prove by induction that for all n0,k1n\geq 0,k\geq 1

F(n,k)=(n+k1k1).\displaystyle F(n,k)=\binom{n+k-1}{k-1}.

Lemma D.3.

Consider VK+1,γ,γπVK,γ,γπV_{K+1,\gamma,\gamma^{\prime}}^{\pi}-V_{K,\gamma,\gamma^{\prime}}^{\pi} for K0K\geq 0. It can be shown that

V_{K+1,\gamma,\gamma^{\prime}}^{\pi}-V_{K,\gamma,\gamma^{\prime}}^{\pi}=(\gamma^{\prime}-\gamma)^{K+1}\left(\sum_{t=0}^{\infty}F(t,K+1)\left(\gamma P^{\pi}\right)^{t}\right)\left(P^{\pi}\right)^{K+1}V_{\gamma}^{\pi}.
Proof.

Starting with the definition,

VK+1,γ,γπVK,γ,γπ\displaystyle V_{K+1,\gamma,\gamma^{\prime}}^{\pi}-V_{K,\gamma,\gamma^{\prime}}^{\pi} =((γγ)(IγPπ)1Pπ)K+1Vγπ\displaystyle=\left(\left(\gamma^{\prime}-\gamma\right)\left(I-\gamma P^{\pi}\right)^{-1}P^{\pi}\right)^{K+1}V_{\gamma}^{\pi}
=((γγ)(IγPπ)1)K+1(Pπ)K+1Vγπ,\displaystyle=\left(\left(\gamma^{\prime}-\gamma\right)\left(I-\gamma P^{\pi}\right)^{-1}\right)^{K+1}\left(P^{\pi}\right)^{K+1}V_{\gamma}^{\pi},

where for the second equality we use the fact that PπP^{\pi} commutes with (IγPπ)1(I-\gamma P^{\pi})^{-1}. Then consider ((IγPπ)1)K+1\left((I-\gamma P^{\pi})^{-1}\right)^{K+1},

((IγPπ)1)K+1\displaystyle\left(\left(I-\gamma P^{\pi}\right)^{-1}\right)^{K+1} =(t=0(γPπ)t)K+1=s10sK+10(γPπ)i=1K+1si=s=0F(s,K+1)(γPπ)s.\displaystyle=\left(\sum_{t=0}^{\infty}\left(\gamma P^{\pi}\right)^{t}\right)^{K+1}=\sum_{s_{1}\geq 0}...\sum_{s_{K+1}\geq 0}\left(\gamma P^{\pi}\right)^{\sum_{i=1}^{K+1}s_{i}}=\sum_{s=0}^{\infty}F(s,K+1)\left(\gamma P^{\pi}\right)^{s}.

Note that the last equality corresponds to a regrouping of terms in the infinite summation: instead of summing over $s_{1},\ldots,s_{K+1}$ sequentially, we count the number of tuples $(s_{1},\ldots,s_{K+1})$ such that $\sum_{i=1}^{K+1}s_{i}=s$ and then sum over $s$. This count is exactly $F(s,K+1)$ as defined in Lemma D.2. Hence the proof is completed. ∎

With the above lemmas, we are ready to prove the final result. We start by summing up all the differences of expansions,

VK,γ,γπ\displaystyle V_{K,\gamma,\gamma^{\prime}}^{\pi} =V0,γ,γπ+k=0K1(Vk+1,γ,γπVk,γ,γπ)\displaystyle=V_{0,\gamma,\gamma^{\prime}}^{\pi}+\sum_{k=0}^{K-1}\left(V_{k+1,\gamma,\gamma^{\prime}}^{\pi}-V_{k,\gamma,\gamma^{\prime}}^{\pi}\right)
=Vγπ+k=0K1(γγ)k+1(t=0F(t,k+1)(γPπ)t)(Pπ)k+1Vγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{k=0}^{K-1}(\gamma^{\prime}-\gamma)^{k+1}\left(\sum_{t=0}^{\infty}F(t,k+1)\left(\gamma P^{\pi}\right)^{t}\right)\left(P^{\pi}\right)^{k+1}V_{\gamma}^{\pi}
=Vγπ+t=0k=0K1(γγ)k+1γk1F(t,k+1)(γPπ)t+k+1Vγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{t=0}^{\infty}\sum_{k=0}^{K-1}(\gamma^{\prime}-\gamma)^{k+1}\gamma^{-k-1}F(t,k+1)\left(\gamma P^{\pi}\right)^{t+k+1}V_{\gamma}^{\pi}
=Vγπ+t=0u=1K(γγ)uγuF(t,u)(γPπ)t+uVγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{t=0}^{\infty}\sum_{u=1}^{K}(\gamma^{\prime}-\gamma)^{u}\gamma^{-u}F(t,u)\left(\gamma P^{\pi}\right)^{t+u}V_{\gamma}^{\pi}
=Vγπ+s=0u=1K(γγ)uγuF(su,u)(γPπ)sVγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{s=0}^{\infty}\sum_{u=1}^{K}(\gamma^{\prime}-\gamma)^{u}\gamma^{-u}F(s-u,u)\left(\gamma P^{\pi}\right)^{s}V_{\gamma}^{\pi}
=Vγπ+s=1u=1K(γγ)uγuF(su,u)(γPπ)sVγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{s=1}^{\infty}\sum_{u=1}^{K}(\gamma^{\prime}-\gamma)^{u}\gamma^{-u}F(s-u,u)\left(\gamma P^{\pi}\right)^{s}V_{\gamma}^{\pi}
=Vγπ+s=1u=1K(γγ)uγsu(s1u1)(Pπ)sVγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{s=1}^{\infty}\sum_{u=1}^{K}(\gamma^{\prime}-\gamma)^{u}\gamma^{s-u}\binom{s-1}{u-1}\left(P^{\pi}\right)^{s}V_{\gamma}^{\pi}
=Vγπ+s=1u=1min(K,s)(γγ)uγsu(s1u1)(Pπ)sVγπ\displaystyle=V_{\gamma}^{\pi}+\sum_{s=1}^{\infty}\sum_{u=1}^{\min(K,s)}(\gamma^{\prime}-\gamma)^{u}\gamma^{s-u}\binom{s-1}{u-1}\left(P^{\pi}\right)^{s}V_{\gamma}^{\pi}

In the above derivation, we applied the change of variables $u=k+1$ and $s=t+u$, and then adjusted the bounds of the summation using the convention that $F(s-u,u)=0$ whenever $s<u$. Indexing the $x$-th component of the resulting vector, we recover the desired result (note that $\binom{s-1}{u-1}=\binom{s-1}{s-u}$, matching the form of $f$ in the statement). ∎
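As an illustration (not part of the proof), the weight $f(K,t,\gamma,\gamma^{\prime})$ can be checked numerically to recover the limiting weight $(\gamma^{\prime}-\gamma)(\gamma^{\prime})^{t-1}$ from Lemma 4.1 once $K\geq t$; the sketch below uses illustrative names only.

from math import comb

def f(K, t, gamma, gamma_p):
    # f(K, t, gamma, gamma') as defined in Proposition D.1
    return sum((gamma_p - gamma) ** u * gamma ** (t - u) * comb(t - 1, t - u)
               for u in range(1, min(K, t) + 1))

gamma, gamma_p = 0.9, 0.99
for t in range(1, 6):
    limit = (gamma_p - gamma) * gamma_p ** (t - 1)        # weight in Lemma 4.1
    assert abs(f(t, t, gamma, gamma_p) - limit) < 1e-12   # binomial theorem
    print(t, f(1, t, gamma, gamma_p), f(t, t, gamma, gamma_p), limit)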

D.1 Further discussions on the objectives

Recall that the full gradient θVγπθ(x)\nabla_{\theta}V_{\gamma^{\prime}}^{\pi_{\theta}}(x) is

θVγπθ(x)\displaystyle\nabla_{\theta}V_{\gamma^{\prime}}^{\pi_{\theta}}(x) =𝔼xργ,γπθ(;x)[θVγπθ(x)]+𝔼xργ,γπθ(;x)[Vγπθ(x)θlogργ,γπθ(x;x).]second term\displaystyle=\mathbb{E}_{x^{\prime}\sim\rho_{\gamma,\gamma^{\prime}}^{\pi_{\theta}}(\cdot;x)}\left[\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x^{\prime})\right]+\underbrace{\mathbb{E}_{x^{\prime}\sim\rho_{\gamma,\gamma^{\prime}}^{\pi_{\theta}}(\cdot;x)}\left[V_{\gamma}^{\pi_{\theta}}(x^{\prime})\nabla_{\theta}\log\rho_{\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime};x).\right]}_{\text{second\ term}}

Consider the second term. Now, we derive this term in an alternative way which gives more intuition as to why its estimation is challenging. Note that

Vγπθ(x)=Vγπθ(x)+(γγ)𝔼π[t=1(γ)t1Vγπθ(xt)]\displaystyle V_{\gamma^{\prime}}^{\pi_{\theta}}(x)=V_{\gamma}^{\pi_{\theta}}(x)+(\gamma^{\prime}-\gamma)\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}(\gamma^{\prime})^{t-1}V_{\gamma}^{\pi_{\theta}}(x_{t})\right]

The second term of the full gradient is equivalent to differentiating through the above expression, while keeping all $V_{\gamma}^{\pi_{\theta}}(x_{t})$ fixed. This leads to the following gradient

second term=(γγ)(γ)1𝔼π[t=1(γ)tWγ,γπθ(xt)θlogπθ(at|xt)].\displaystyle\text{second\ term}=(\gamma^{\prime}-\gamma)(\gamma^{\prime})^{-1}\mathbb{E}_{\pi}\left[\sum_{t=1}^{\infty}(\gamma^{\prime})^{t}W_{\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x_{t})\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t})\right].

Here, we introduce $W_{\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x_{t})=\mathbb{E}_{\pi}\left[\sum_{s=0}^{\infty}(\gamma^{\prime})^{s}V_{\gamma}^{\pi_{\theta}}(x_{t+s})\right]$, which is equivalent to a value function that treats $V_{\gamma}^{\pi_{\theta}}(x)$ as the reward and uses discount factor $\gamma^{\prime}$. Naturally, constructing an unbiased estimator of the second term of the full gradient requires estimating $W_{\gamma,\gamma^{\prime}}^{\pi_{\theta}}$, which is difficult for at least two reasons: (1) in practice, value functions are themselves estimated, which could introduce additional bias and variance; (2) as a premise of our work, estimating discounted values with the larger discount factor $\gamma^{\prime}$ is challenging, potentially due to high variance.

Appendix E Details on approximation errors with finite samples

Intuitively, as $K$ increases, the $K$th order expansion $V_{K,\gamma,\gamma^{\prime}}^{\pi}$ approximates $V_{\gamma^{\prime}}^{\pi}$ more accurately in expectation. However, in practice, where all constituent terms of the approximation are built from the same batch of data, the variance might negatively impact the accuracy of the estimate.

To formalize such intuitions, we characterize the bias and variance trade-off under the phased TD-learning framework [Kearns and Singh, 2000]. Consider estimating the value function $V_{\gamma}^{\pi}(x)$ under discount $\gamma$ with an estimator $\widehat{V}_{\gamma}^{\pi}(x)$. At each iteration $t$, let $\Delta_{t}^{\gamma}\coloneqq\max_{x\in\mathcal{X}}|V_{\gamma}^{\pi}(x)-\widehat{V}_{\gamma}^{\pi}(x)|$ be the absolute error of the value function estimate $\widehat{V}_{\gamma}^{\pi}$. Assume that from each state $x$, $n$ independent trajectories are generated under $\pi$; [Kearns and Singh, 2000] show that commonly used TD-learning methods (e.g., TD($\lambda$)) satisfy error bounds of the following form with probability $1-\delta$,

ΔtγA(γ,δ)+B(γ)Δt1γ.\displaystyle\Delta_{t}^{\gamma}\leq A(\gamma,\delta)+B(\gamma)\Delta_{t-1}^{\gamma}. (27)

Here, the factor $A(\gamma,\delta)$ is an error term which characterizes the errors arising from the finite sample size $n$; as $n\rightarrow\infty$, $A(\gamma,\delta)\rightarrow 0$. The constant $B(\gamma)$ is a contraction coefficient that characterizes how fast the error decays in expectation; see [Kearns and Singh, 2000] and the discussion at the end of this section for concrete instances.

With the calculations of estimators V^γπ(x)\widehat{V}_{\gamma}^{\pi}(x) as a subroutine, we construct the nn-sample KKth order estimator V^K,γ,γπ(x)\widehat{V}_{K,\gamma,\gamma^{\prime}}^{\pi}(x),

V^K,γ(x0)=k=0K1ni=1n(γγ)kV^γπ(xi,k),\displaystyle\widehat{V}_{K,\gamma}(x_{0})=\sum_{k=0}^{K}\frac{1}{n}\sum_{i=1}^{n}(\gamma^{\prime}-\gamma)^{k}\widehat{V}_{\gamma}^{\pi}(x_{i,k}), (28)

where $x_{i,k}$ is sampled from $(P^{\pi}\cdot d_{\gamma}^{\pi})^{k}(\cdot;x)$. Note that if $K=0$, Eqn (28) reduces to $\frac{1}{n}\sum_{i=1}^{n}\widehat{V}_{\gamma}^{\pi}(x_{0})$, the estimator analyzed by [Kearns and Singh, 2000]. We are interested in the error $\Delta_{K,t}^{\gamma}\coloneqq\max_{x\in\mathcal{X}}|V_{\gamma^{\prime}}^{\pi}(x)-\widehat{V}_{K,\gamma,\gamma^{\prime}}^{\pi}(x)|$, measured against the value function with discount $\gamma^{\prime}$. The following summarizes how errors propagate across iterations.

Proposition E.1.

Assume all samples $x_{i,k}$ are generated independently. Define a factor $\varepsilon\coloneqq\frac{1-(\gamma^{\prime}-\gamma)^{K+1}}{1-(\gamma^{\prime}-\gamma)}$. Then with probability at least $1-2\delta$ if $K\geq 1$ and probability $1-\delta$ if $K=0$, the following holds (the error bounds could be further improved, e.g., by adapting the concentration bounds at different steps $1\leq k\leq K$; their purpose here is to illustrate the bias and variance trade-off induced by the Taylor expansion order $K$),

ΔK,tγ\displaystyle\Delta_{K,t}^{\gamma} ε(A(γ,δ)+U)finite sample error+E(γ,γ,K)expected gap error+εB(γ)contraction coeffΔtγ,\displaystyle\leq\underbrace{\varepsilon(A(\gamma,\delta)+U)}_{\text{finite\ sample\ error}}+\underbrace{E(\gamma,\gamma^{\prime},K)}_{\text{expected\ gap\ error}}+\underbrace{\varepsilon B(\gamma)}_{\text{contraction\ coeff}}\Delta_{t}^{\gamma}, (29)

where $U=\sqrt{2\log\frac{2(K+1)}{\delta}/n}$ for $K\geq 1$ and $U=0$ if $K=0$. The expected gap error $E(\gamma,\gamma^{\prime},K)=\left(\frac{\gamma^{\prime}-\gamma}{1-\gamma}\right)^{K+1}\frac{R_{\text{max}}}{1-\gamma^{\prime}}$ is defined in Proposition 3.2.

Proof.

Recall the results from [Kearns and Singh, 2000]: Let Δtγmaxx𝒳|Vγπ(x)V^γπ(x)|\Delta_{t}^{\gamma}\coloneqq\max_{x\in\mathcal{X}}|V_{\gamma}^{\pi}(x)-\widehat{V}_{\gamma}^{\pi}(x)|. Then with probability at least 1δ1-\delta, the following holds

ΔtγA(γ,δ)+B(γ)Δt1γ.\displaystyle\Delta_{t}^{\gamma}\leq A(\gamma,\delta)+B(\gamma)\Delta_{t-1}^{\gamma}.

In the following, we condition the analysis on the event that the above inequality holds. Now, using $\widehat{V}_{\gamma}^{\pi}(x)$ as a subroutine, define the estimator for the $K$th order Taylor expansion as in Eqn (28),

V^K,γ(x0)=k=0K1ni=1n(γγ)kV^γπ(xi,k).\displaystyle\widehat{V}_{K,\gamma}(x_{0})=\sum_{k=0}^{K}\frac{1}{n}\sum_{i=1}^{n}(\gamma^{\prime}-\gamma)^{k}\widehat{V}_{\gamma}^{\pi}(x_{i,k}).

Define the error ΔK,tγmaxx𝒳|Vγπ(x)V^K,γπ(x)|\Delta_{K,t}^{\gamma}\coloneqq\max_{x\in\mathcal{X}}|V_{\gamma^{\prime}}^{\pi}(x)-\widehat{V}_{K,\gamma}^{\pi}(x)|, which is measured against the value function Vγπ(x)V_{\gamma^{\prime}}^{\pi}(x) with a higher discount factor γ\gamma^{\prime}. Consider for a given starting state x0x_{0},

\left|V_{\gamma^{\prime}}^{\pi}(x_{0})-\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right| \leq\left|V_{\gamma^{\prime}}^{\pi}(x_{0})-V_{K,\gamma}^{\pi}(x_{0})\right|+\left|V_{K,\gamma}^{\pi}(x_{0})-\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right|
\leq E(\gamma,\gamma^{\prime},K)+\underbrace{\left|V_{K,\gamma}^{\pi}(x_{0})-\mathbb{E}\left[\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right]\right|}_{\text{second term}}+\underbrace{\left|\mathbb{E}\left[\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right]-\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right|}_{\text{third term}}.

Now, we bound each term in the equation above. Recall εk=0K(γγ)k=1(γγ)K+11γ+γ\varepsilon\coloneqq\sum_{k=0}^{K}(\gamma^{\prime}-\gamma)^{k}=\frac{1-(\gamma^{\prime}-\gamma)^{K+1}}{1-\gamma^{\prime}+\gamma}. The second term is bounded as follows

\left|V_{K,\gamma}^{\pi}(x_{0})-\mathbb{E}\left[\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right]\right|\leq\varepsilon\Delta_{t}^{\gamma}.

The third term is bounded by applying concentration bounds. Recall that the estimator $\widehat{V}_{K,\gamma}^{\pi}(x_{0})\coloneqq\sum_{k=0}^{K}\frac{1}{n}\sum_{i=1}^{n}(\gamma^{\prime}-\gamma)^{k}\widehat{V}_{\gamma}^{\pi}(x_{i,k})$ decomposes into $K+1$ estimators, each being an average over $n$ i.i.d. samples drawn from the $k$th step visitation distribution $(P^{\pi}\cdot d_{\gamma}^{\pi})^{k},0\leq k\leq K$. Applying the same concentration techniques as in [Kearns and Singh, 2000], we bound each of the $K+1$ terms individually and then take a union bound over all $K+1$ terms. This implies that, with probability at least $1-\delta$, the following holds

\left|\mathbb{E}\left[\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right]-\widehat{V}_{K,\gamma}^{\pi}(x_{0})\right|\leq\varepsilon U=\varepsilon\sqrt{2\log\frac{2(K+1)}{\delta}/n}.

Aggregating all results, we have

|Vγπ(x0)V^K,γπ(x0)|\displaystyle|V_{\gamma^{\prime}}^{\pi}(x_{0})-\widehat{V}_{K,\gamma}^{\pi}(x_{0})| E(γ,γ,K)+εΔtγ+εU\displaystyle\leq E(\gamma,\gamma^{\prime},K)+\varepsilon\Delta_{t}^{\gamma}+\varepsilon U
ε(A(γ,δ)+U)+E(γ,γ,K)+εB(γ)Δt1γ.\displaystyle\leq\varepsilon(A(\gamma,\delta)+U)+E(\gamma,\gamma^{\prime},K)+\varepsilon B(\gamma)\Delta_{t-1}^{\gamma}.

By a union bound over the two events, this holds with probability at least $1-2\delta$. ∎

Bias-variance trade-off via KK.

The error terms come from two parts: the first term contains the errors $A(\gamma,\delta)$ in the subroutine estimator $\widehat{V}_{\gamma}^{\pi}(x)$, and their propagation through the sampling of the $k$th order terms for $1\leq k\leq K$ (shown via the multiplier $\varepsilon$). This first term also contains $U$, a concentration bound that scales with $O(\sqrt{\log K})$, which shows that the variance of the overall estimator grows with $K$. This first error term scales as $O(1/\sqrt{n})$ and vanishes as the number of samples increases. The second term is due to the gap between the expected $K$th order Taylor expansion and $V_{\gamma^{\prime}}^{\pi}(x_{0})$, which decreases with $K$ and does not depend on the sample size $n$. The new contraction coefficient is $\varepsilon B(\gamma)$, where it can be shown that $\varepsilon\in[1,\frac{1}{1-\gamma^{\prime}+\gamma}]$. Since typical estimators have $B(\gamma)\leq\gamma$, in general $\varepsilon B(\gamma)<1$ and the error contracts with respect to $\Delta_{t}$. In general, the contraction becomes slower as $K$ increases. For example, for TD($\lambda$), $B(\gamma)=\frac{(1-\lambda)\gamma}{1-\gamma\lambda}$.

Appendix F Further experiment details

Below, we provide further details on experiment setups along with additional results.

F.1 Further details on the toy example

We presented a toy example that highlighted the trade-off between bias and variance, mediated by the order parameter KK. Here, we provide further details of the experiments.

Toy MDP.

We consider tabular MDPs with $|\mathcal{X}|=10$ states and $|\mathcal{A}|=2$ actions. Each distribution in the transition table $p(\cdot|x,a)$ is drawn from a Dirichlet distribution with parameters $(\alpha,\ldots,\alpha)$ for $\alpha=0.01$. Here, $\alpha$ is chosen small so that the MDP is weakly connected (i.e., each distribution $p(\cdot|x,a)$ concentrates on only a few states). The rewards are randomized as $r(x,a)=\overline{r}(x,a)(1+\varepsilon)$ with $\varepsilon\sim\mathcal{N}(0,0.2^{2})$, where the mean rewards $\overline{r}(x,a)$ are drawn once from $\text{Uniform}(0,1)$ and fixed for the problem.
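For concreteness, the following sketch (illustrative names; the exact experimental code is not reproduced here) constructs such a toy MDP:

import numpy as np

rng = np.random.default_rng(0)
num_states, num_actions, alpha = 10, 2, 0.01

# Transition table p(.|x, a): each distribution drawn from Dirichlet(alpha, ..., alpha).
P = rng.dirichlet(alpha * np.ones(num_states), size=(num_states, num_actions))

# Mean rewards drawn once from Uniform(0, 1) and fixed; sampled rewards carry
# multiplicative Gaussian noise with standard deviation 0.2.
r_bar = rng.uniform(size=(num_states, num_actions))

def sample_reward(x, a):
    return r_bar[x, a] * (1.0 + 0.2 * rng.normal())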

F.2 Deep RL algorithms

Proximal policy optimization (PPO).

PPO [Schulman et al., 2017] implements a stochastic actor $\pi_{\theta}(a|x)$ as a Gaussian distribution $a\sim\mathcal{N}(\mu_{\theta}(x),\sigma^{2}\mathbb{I})$ with state-conditional mean $\mu_{\theta}(x)$ and a global, state-independent covariance $\sigma^{2}\mathbb{I}$; and a value function $V_{\phi}(x)$. The behavior policy $\mu$ is the previous policy iterate. The policy is updated as $\widehat{A}_{\gamma}^{\mu}(x,a)\nabla_{\theta}\text{clip}(\frac{\pi_{\theta}(a|x)}{\mu(a|x)},1-\varepsilon,1+\varepsilon)$ with $\varepsilon=0.2$ (the exact PPO update is more complicated than this; refer to [Schulman et al., 2017] for the exact formula). The advantages $\widehat{A}_{\gamma}^{\mu}(x,a)$ are estimated using generalized advantage estimation (GAE, [Schulman et al., 2015b]) with $\gamma=0.99,\lambda=0.95$. Value functions are trained by minimizing $(V_{\phi}(x)-R(x))^{2}$ with returns $R(x)=V_{\phi^{\prime}}(x)+\widehat{A}_{\gamma}^{\mu}(x,a)$, where $\phi^{\prime}$ denotes the parameters of a previous value function iterate. Both parameters $\theta,\phi$ are trained with the Adam optimizer [Kingma and Ba, 2014] with learning rate $\alpha=3\cdot 10^{-4}$. We adopt the other default hyper-parameters in [Dhariwal et al., 2017]; for details, please refer to the code base.
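For reference, below is a sketch of the standard clipped surrogate of [Schulman et al., 2017], of which the update above is a simplified description; tensor names are illustrative.

import torch

def ppo_policy_loss(logp_new, logp_old, advantages, eps=0.2):
    ratio = torch.exp(logp_new - logp_old)               # pi_theta(a|x) / mu(a|x)
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps)
    return -torch.mean(torch.min(ratio * advantages, clipped * advantages))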

Trust region policy optimization (TRPO).

TRPO [Schulman et al., 2015a] implements the same actor-critic pipeline as PPO; the difference is in the updates. Instead of enforcing a soft clipping constraint, TRPO enforces a strict KL-divergence constraint $\mathbb{E}_{x\sim\mu}\left[\text{KL}(\pi_{\theta}(\cdot|x),\mu(\cdot|x))\right]\leq\varepsilon$ with $\varepsilon=0.01$. The policy gradient is computed as $\widehat{A}_{\gamma}^{\mu}(x,a)\nabla_{\theta}\log\pi_{\theta}(a|x)$, and then the final update is constructed by approximately solving a constrained optimization problem, see [Schulman et al., 2015a] for details. The scale of the final update is found through a line search, to ensure that the KL-divergence constraint is satisfied. The implementations are based on [Achiam and OpenAI, 2018].

F.3 Deep RL architecture

Across all algorithms, the policy $\pi_{\theta}(a|x)=\mathcal{N}(\mu_{\theta}(x),\sigma^{2}\mathbb{I})$ has a parameterized mean $\mu_{\theta}(x)$ and a single state-independent standard deviation $\sigma$. The mean $\mu_{\theta}(x)$ is a 2-layer neural network with $h=64$ hidden units per layer and $f(x)=\text{tanh}(x)$ activation functions; the output layer has no activation. The value function $V_{\phi}(x)$ is likewise a 2-layer neural network with $h=64$ hidden units per layer and $f(x)=\text{tanh}(x)$ activation functions, and its output layer has no activation.
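A minimal PyTorch sketch of these networks is given below; initialization and other details follow library defaults and are assumptions rather than the paper's exact code.

import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=64):
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.Tanh(),
        nn.Linear(hidden, hidden), nn.Tanh(),
        nn.Linear(hidden, out_dim),   # no activation on the output layer
    )

obs_dim, act_dim = 11, 3                        # illustrative dimensions
policy_mean = mlp(obs_dim, act_dim)             # mu_theta(x)
log_std = nn.Parameter(torch.zeros(act_dim))    # state-independent log-std (an assumption on parameterization)
value_fn = mlp(obs_dim, 1)                      # V_phi(x)

def policy_distribution(x):
    return torch.distributions.Normal(policy_mean(x), log_std.exp())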

F.4 Additional deep RL experiment results

F.4.1 Taylor expansion Q-function estimation: ablation study on η\eta

Recall that throughout the experiments, we choose K=1K=1 and construct the new Q-function estimator as a mixture of the default estimator and Taylor expansion Q-function estimator. In particular, the final Q-function estimator is

Q^(x,a)=(1η)Q^γπ(x,a)+ηQ^K,γ,γπ(x,a).\displaystyle\widehat{Q}(x,a)=(1-\eta)\widehat{Q}_{\gamma}^{\pi}(x,a)+\eta\widehat{Q}_{K,\gamma,\gamma^{\prime}}^{\pi}(x,a).

We choose \eta\in[0,1] such that it balances the numerical scales of the two estimators being combined. In our implementation, we find that the algorithm performs more stably when \eta is small in absolute scale. In Figure 5(a)-(b), we show an ablation study on the effect of \eta, where we vary \eta\in[0.01,0.03]. The y-axis shows the performance normalized against the PPO baseline (which corresponds to \eta=0), such that the PPO baseline achieves a normalized performance of 1.

Overall, we see that \eta impacts performance differently across tasks. For example, on HalfCheetah(B), better performance is achieved with larger values of \eta, which is consistent with the observation that PPO with \gamma=0.999 also achieves better performance; on Ant(B), however, as \eta increases from zero, the performance improves marginally before degrading. In Figure 5, we show the median and mean performance across all tasks. In general, the average performance increases as \eta increases from zero, but then starts to decay. Accounting for the performance variance across tasks, we chose \eta=0.01 as the fixed hyper-parameter for all experiments in the main paper.

Further details on computing Q^K,γ,γπ(x,a)\widehat{Q}_{K,\gamma,\gamma^{\prime}}^{\pi}(x,a).

Below we assume K=1K=1. In Algorithm 4, we showed we can construct unbiased estimates of QK,γ,γπ(x,a)Q_{K,\gamma,\gamma^{\prime}}^{\pi}(x,a) using Q^γπ(x,a)\widehat{Q}_{\gamma}^{\pi}(x,a) as building blocks. With a random time τGeometric(1γ)\tau\sim\text{Geometric}(1-\gamma), the estimator takes the following form

\displaystyle\widehat{Q}_{K,\gamma,\gamma^{\prime}}(x_{t},a_{t})=\widehat{Q}_{\gamma}^{\pi}(x_{t},a_{t})+\frac{\gamma^{\prime}-\gamma}{1-\gamma}\widehat{Q}_{\gamma}^{\pi}(x_{t+\tau},a_{t+\tau}).

However, since the estimator is based on a single random time, it can have high variance. To reduce variance, we propose the following procedure: let (x_{t},a_{t}) be the target state-action pair; we then compute the estimate as

\displaystyle\widehat{Q}_{K,\gamma,\gamma^{\prime}}(x_{t},a_{t})=\widehat{Q}_{\gamma}^{\pi}(x_{t},a_{t})+\frac{\gamma^{\prime}-\gamma}{1-\gamma}\sum_{s=1}^{H}\frac{\gamma^{s}}{\sum_{s^{\prime}=1}^{H}\gamma^{s^{\prime}}}\widehat{Q}_{\gamma}^{\pi}(x_{t+s},a_{t+s}).

When H=\infty, the above estimator marginalizes over the random time, which should achieve variance reduction compared to the single-random-time estimate in Algorithm 4. However, the estimate then requires computing cumulative sums over an infinite horizon (or, in general, a horizon of T), which might be computationally expensive. To mitigate this, we truncate the above summation at H=10 steps (see the sketch below). This choice of H aims to trade off computational efficiency against variance. Note that this estimator was previously introduced in [Tang et al., 2020] for off-policy learning.
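
A minimal NumPy sketch of the truncated estimator and of the \eta-mixture used in the experiments is shown below; `q_hat` is assumed to be an array of Q-function estimates \widehat{Q}_{\gamma}^{\pi}(x_{t+s},a_{t+s}) along a sampled trajectory, and the function names are our own.

```python
import numpy as np

def taylor_q_estimate(q_hat, t, gamma=0.99, gamma_prime=0.999, H=10):
    """First-order (K = 1) Taylor expansion estimate with a truncated, weighted correction."""
    H = min(H, len(q_hat) - t - 1)       # do not run past the end of the trajectory
    if H <= 0:
        return q_hat[t]
    weights = gamma ** np.arange(1, H + 1)
    weights = weights / weights.sum()    # gamma^s / sum_{s'} gamma^{s'}
    correction = np.dot(weights, q_hat[t + 1:t + H + 1])
    return q_hat[t] + (gamma_prime - gamma) / (1.0 - gamma) * correction

def mixed_q_estimate(q_hat, t, eta=0.01, **kwargs):
    # Final estimator: (1 - eta) * Q_hat_gamma + eta * Q_hat_{K, gamma, gamma'}.
    return (1.0 - eta) * q_hat[t] + eta * taylor_q_estimate(q_hat, t, **kwargs)
```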

F.4.2 Taylor expansion update weighting: ablation on KK

In Figure 5(c)-(d), we carry out an ablation study on the effect of K for the update weighting. Recall that K interpolates between two extremes: when K=0, it recovers the vanilla PG [Sutton et al., 2000], while when K=\infty, it recovers the deep RL heuristic update. We expect an intermediate value of K to achieve a trade-off between the bias and variance of the overall update.

In Figure 5(c), we see the effect on individual environments; the effect is case-dependent. For HalfCheetah(G), larger K improves performance; for Walker(G), however, the improvement is less prominent over a large range of K. When aggregating the performance metric in Figure 5(d), we see that intermediate values of K indeed peak in performance: on average, both K=10 and K=100 achieve locally optimal mean performance, while K=10 also achieves the locally optimal median performance.

Note on how the practical updates impact the effect of KK.

Based on our theoretical analysis, when K=0 the update should recover the vanilla PG [Sutton et al., 2000], which is generally considered too conservative for the undiscounted objective in Eqn (1). However, in practice, as shown in Figure 5(d), the algorithm does not severely underperform even when K=0. We speculate that this is because practical implementations of PG updates use mini-batches of data instead of full trajectories. This means that the relative weights w(t) of the local gradients \widehat{Q}_{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}|x_{t}) are effectively self-normalized: \widetilde{w}(t)\leftarrow\frac{w(t)}{\sum w(t^{\prime})}, where the summation is over the time steps in a sampled mini-batch. The self-normalized weights \widetilde{w}(t) are larger in absolute scale than w(t), which partly offsets the effect of the initially aggressive discount w(t)=\gamma^{t}.
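
The following toy snippet illustrates this self-normalization effect (a hypothetical example with made-up time indices, not taken from the experiments).

```python
import numpy as np

gamma = 0.99
batch_timesteps = np.array([500, 510, 650, 720, 900])  # hypothetical time indices t in a mini-batch
w = gamma ** batch_timesteps                            # raw discounted weights, all very small
w_tilde = w / w.sum()                                   # self-normalized weights
print(w.max())        # roughly 6.6e-3
print(w_tilde.max())  # order 1 after normalization
```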

F.4.3 Comparison to results in [Romoff et al., 2019]

Recently, Romoff et al. [2019] derived a recursive relation between differences of value functions defined with different discount factors; this relation was shown in Lemma 4.1. Given a sequence of discount factors \gamma_{1}<\gamma_{2}<\ldots<\gamma_{N}<\gamma^{\prime}, they derived an estimator of V_{\gamma^{\prime}}^{\pi}(x) based on recursive bootstraps of the value function differences V_{\gamma_{i}}^{\pi}(x)-V_{\gamma_{i-1}}^{\pi}(x). Because they aim to recover the exact value function, this estimator can be interpreted as analogous to Taylor expansion with K=\infty.

In contrast to their motivation, we focus on the trade-off achieved by intermediate values of K. We argued that with K=0 the estimate might be too conservative, while with K=\infty the estimation might be challenging due to the variance induced by the recursive bootstrapping procedure. Though it is not straightforward to show theoretically, we conjecture that using the Taylor expansion Q-function estimator with K=\infty is as difficult as directly estimating V_{\gamma^{\prime}}^{\pi}(x).

Figure 4: Learning curves generated by running the open source implementation of [Romoff et al., 2019] on Walker2d(G), averaged across 5 runs. The algorithm makes little learning progress on the other benchmark tasks.
Empirical comparison.

The base algorithm of [Romoff et al., 2019] is PPO [Schulman et al., 2017]. Their algorithm uses recursive bootstraps to estimate Q-functions and advantage functions, and the new estimates are used as direct plug-in replacements for the \widehat{Q}_{\gamma}^{\pi}(x,a) and \widehat{A}_{\gamma}^{\pi}(x,a) adopted in the PPO algorithm. We run experiments with the open source implementation of [Romoff et al., 2019] from the original authors (see https://github.com/facebookresearch/td-delta). We evaluate the algorithm's performance over continuous control benchmark tasks, applying the default configurations from the code base with the minimal changes needed to run on continuous problems (note that [Romoff et al., 2019] focused on a few discrete control problems). Overall, we find that the algorithm does not learn stably (see Figure 4).

Figure 5: Ablation study of hyper-parameters. Panels: (a) ablation on \eta (individual tasks); (b) ablation on \eta (average); (c) ablation on K (individual tasks); (d) ablation on K (average). In both cases, we calculate the task-dependent normalized final returns after training for 10^{6} steps; see Appendix F for how such normalized returns are computed. For the \eta ablation, normalized returns are computed with respect to \eta=0 (i.e., the PPO baseline), so that \eta=0 corresponds to a normalized return of 1; for the K ablation, normalized returns are computed with respect to the default PPO baseline, so that a value of 1 implies performance equal to the default PPO baseline. Dashed curves (bullet tasks) and solid curves (gym tasks) are mean scores averaged over 5 seeds.

Appendix G Extensions of update weighting techniques to off-policy algorithms

Below, we show that the techniques developed in this paper can be extended to off-policy learning algorithms. We provide details on the theoretical derivations and algorithms, as well as experimental results.

G.1 Off-policy actor-critic algorithms

Off-policy actor-critic algorithms [Mnih et al., 2015, Lillicrap et al., 2015] maintain a deterministic policy \pi_{\theta}(x) and a Q-function critic Q_{\phi}(x,a). The agent takes exploratory actions in the environment and saves the data (x_{t},a_{t},r_{t}) into a common replay buffer \mathcal{D}. At training time, the algorithm samples data from the replay buffer to update parameters. The policy is updated via the deterministic policy gradient [Silver et al., 2014], \theta\leftarrow\theta+\alpha\nabla_{\theta}\mathbb{E}_{\mu}\left[Q_{\phi}(x,\pi_{\theta}(x))\right], where \mu is implicitly defined by the past behavior policy.

Deep deterministic policy gradient (DDPG).

DDPG [Lillicrap et al., 2015] maintains a deterministic policy network \pi_{\theta}(a|x)\equiv\pi_{\theta}(x) and a Q-function critic Q_{\phi}(x,a). The algorithm explores by executing a perturbed policy a=\varepsilon+\pi_{\theta}(x) where \varepsilon\sim\mathcal{N}(0,\sigma^{2}) with \sigma=0.1, and saves the data (x,a,r,x^{\prime}) into a replay buffer \mathcal{D}. At training time, behavior data are sampled uniformly from the replay buffer, (x_{i},a_{i},r_{i},x_{i}^{\prime})_{i=0}^{B-1}\sim\mathcal{U}(\mathcal{D}) with B=100. The critic is updated via TD(0) by minimizing \frac{1}{B}\sum_{i=0}^{B-1}(Q_{\phi}(x_{i},a_{i})-Q_{\text{target}}(x_{i},a_{i}))^{2}, where Q_{\text{target}}(x_{i},a_{i})=r_{i}+\gamma Q_{\phi^{\prime}}(x_{i}^{\prime},\pi_{\theta^{\prime}}(x_{i}^{\prime})) and \theta^{\prime},\phi^{\prime} are delayed versions of \theta,\phi respectively [Mnih et al., 2015]. The policy is updated by maximizing \frac{1}{B}\sum_{i=0}^{B-1}Q_{\phi}(x_{i},\pi_{\theta}(x_{i})) with respect to \theta. Both parameters \theta,\phi are trained with the Adam optimizer [Kingma and Ba, 2014] with learning rate \alpha=10^{-4}. We adopt the other default hyper-parameters from [Achiam and OpenAI, 2018]; please refer to the code base for details.
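
A minimal sketch of one DDPG training step is given below (assuming PyTorch; `policy`, `critic`, and their target copies are assumed nn.Module networks, with the critic taking the pair (x, a) as input, and the batch tensors coming from uniform replay sampling as described above).

```python
import torch
import torch.nn.functional as F

def ddpg_step(policy, critic, policy_target, critic_target,
              policy_opt, critic_opt, batch, gamma=0.99):
    x, a, r, x_next = batch  # tensors of shape [B, ...]

    # Critic update: TD(0) regression towards the bootstrapped target.
    with torch.no_grad():
        q_target = r + gamma * critic_target(x_next, policy_target(x_next)).squeeze(-1)
    critic_loss = F.mse_loss(critic(x, a).squeeze(-1), q_target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: ascend Q_phi(x, pi_theta(x)) via the deterministic policy gradient.
    policy_loss = -critic(x, policy(x)).mean()
    policy_opt.zero_grad(); policy_loss.backward(); policy_opt.step()
```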

Twin-delayed DDPG (TD3).

TD3 [Fujimoto et al., 2018] adopts the same training pipeline and architectures as DDPG. TD3 also adopts two critic networks Qϕ1(x,a),Qϕ2(x,a)Q_{\phi_{1}}(x,a),Q_{\phi_{2}}(x,a) with parameters ϕ1,ϕ2\phi_{1},\phi_{2}, in order to minimize the over-estimation bias [Hasselt, 2010, Van Hasselt et al., 2016].

Soft actor-critic.

SAC [Haarnoja et al., 2018] adopts a similar training pipeline and architecture as TD3. A major conceptual difference is that SAC is based on the maximum-entropy formulation of RL [Ziebart et al., 2008, Fox et al., 2016]: the Q-function is augmented by an entropy regularization bonus, and the policy is optimized so that it does not collapse to a deterministic policy.

G.2 Architecture

All algorithms share the same architecture. The policy network \pi_{\theta}(x) takes as input the state x and is a 2-layer neural network with h=256 hidden units and \text{relu} activation functions; the output is squashed by \tanh to comply with the action space boundaries. The critic Q_{\phi}(x,a) takes the concatenated vector [x,a] as input and is a 2-layer neural network with h=256 hidden units and \text{relu} activation functions; its output does not have any activation function.

For stochastic policies, the policy network parameterizes, in addition to the Gaussian mean, a log standard deviation vector \log\sigma(x), which is a neural network with the same architecture as above. The stochastic output is a reparameterized function a=\pi_{\theta}(x)+\exp(\log\sigma(x))\cdot\varepsilon where the noise \varepsilon\sim\mathcal{N}(0,1). Finally, the action output is squashed by \tanh to comply with the action boundaries [Haarnoja et al., 2018].
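
A minimal sketch of this reparameterized, tanh-squashed sampling step (assuming PyTorch; `mean_net` and `log_std_net` denote the two networks above) is:

```python
import torch

def sample_action(mean_net, log_std_net, x):
    mean = mean_net(x)
    log_std = log_std_net(x)
    eps = torch.randn_like(mean)                # eps ~ N(0, I)
    pre_tanh = mean + torch.exp(log_std) * eps  # reparameterized Gaussian sample
    return torch.tanh(pre_tanh)                 # squash to the action boundaries
```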

G.3 Algorithm details for update weighting

To derive an update based on update weighting, we start with the undiscounted on-policy objective V_{\gamma^{\prime}}^{\pi_{\theta}}(x)=\mathbb{E}_{x^{\prime}\sim\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}}\left[V_{\gamma}^{\pi_{\theta}}(x^{\prime})\right]. Given behavior data generated under \mu, we abuse notation and also use \mu to denote the state distribution under \mu (usually implicitly defined by sampling from a replay buffer \mathcal{D}). Rewriting the objective with importance sampling (IS),

Vγπθ(x)=𝔼xρx,γ,γπθ[Vγπθ(x)]=𝔼xμ[ρx,γ,γπθ(x)μ(x)Vγπθ(x)],\displaystyle V_{\gamma^{\prime}}^{\pi_{\theta}}(x)=\mathbb{E}_{x^{\prime}\sim\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}}\left[V_{\gamma}^{\pi_{\theta}}(x^{\prime})\right]=\mathbb{E}_{x^{\prime}\sim\mu}\left[\frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})}V_{\gamma}^{\pi_{\theta}}(x^{\prime})\right], (30)

we derive an off-policy learning objective. By dropping certain terms (see [Degris et al., 2012] for the justification of dropping such terms), we can derive the IS-based gradient update

𝔼xμ[ρx,γ,γπθ(x)μ(x)θVγπθ(x)]𝔼xμ[ρx,γ,γπθ(x)μ(x)θQϕ(x,πθ(x))]\displaystyle\mathbb{E}_{x^{\prime}\sim\mu}\left[\frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})}\nabla_{\theta}V_{\gamma}^{\pi_{\theta}}(x^{\prime})\right]\approx\mathbb{E}_{x^{\prime}\sim\mu}\left[\frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})}\nabla_{\theta}Q_{\phi}\left(x^{\prime},\pi_{\theta}(x^{\prime})\right)\right]

To render the update feasible, we need to estimate the ratio \frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})}. Inspired by [Sinha et al., 2020], we propose to maintain a fast replay buffer \mathcal{D}_{f} that contains the most recently sampled data (which implicitly defines \rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}); an estimator w_{\psi} is then trained to estimate the density ratio between \mathcal{D} (which implicitly defines \mu) and \mathcal{D}_{f}. See Appendix F for further details. The full off-policy actor-critic algorithm is summarized in Algorithm 5. In practice, we implement \rho_{x,\gamma,\gamma^{\prime}}^{\pi}(x^{\prime}) with \gamma^{\prime}=1, i.e., an undiscounted uniform distribution. The main motivation is that this distribution is much easier to specify, as it corresponds to sampling from the replay buffer uniformly without discounting, as explained below.

As an important observation for practical implementations, note that

ρx,γ,γπ(x)=γγ𝕀[x0=x]+(γγ)𝔼π[t1(γ)t1𝕀[xt=x]|x0=x]\displaystyle\rho_{x,\gamma,\gamma^{\prime}}^{\pi}(x^{\prime})=\frac{\gamma}{\gamma^{\prime}}\mathbb{I}[x_{0}=x^{\prime}]+(\gamma^{\prime}-\gamma)\mathbb{E}_{\pi}\left[\sum_{t\geq 1}(\gamma^{\prime})^{t-1}\mathbb{I}[x_{t}=x^{\prime}]\;\middle|\;x_{0}=x\right]

When setting \gamma^{\prime}=1, we see that the second term of the distribution is proportional to \mathbb{E}_{\pi}\left[\sum_{t\geq 1}\mathbb{I}[x_{t}=x^{\prime}]\;\middle|\;x_{0}=x\right], which corresponds to a uniform distribution over states on sampled trajectories, without discounting. This makes implementations much simpler, and we will see that it can also lead to performance gains. We leave Taylor-expansion-based extensions of this method to future work.

Details on training the density estimator wψ(x)w_{\psi}(x).

The density estimator wψ(x)w_{\psi}(x) is parameterized with exactly the same architecture as the policy network πθ(x)\pi_{\theta}(x), except that its output activation is replaced by log(1+exp(x))\log(1+\exp(x)) to ensure that wψ(x)>0w_{\psi}(x)>0. The off-policy actor-critic algorithm maintains an original buffer 𝒟\mathcal{D} of size |𝒟|=106|\mathcal{D}|=10^{6}; in addition, we maintain a fast replay buffer 𝒟f\mathcal{D}_{f} with |𝒟f|=104|\mathcal{D}_{f}|=10^{4}, which is used for saving the most recently generated data points. For ease of analysis, assume that the data sampled from 𝒟f\mathcal{D}_{f} come from πθ\pi_{\theta}, while the data sampled from 𝒟\mathcal{D} come from μ\mu.

To learn the ratio ρx,γ,γπθ(x)μ(x)\frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})}, we adopt a simple discriminative loss function as follows

\displaystyle L(\psi)=-\mathbb{E}_{x^{\prime}\sim\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}}\left[\log\frac{w_{\psi}(x^{\prime})}{1+w_{\psi}(x^{\prime})}\right]-\mathbb{E}_{x^{\prime}\sim\mu}\left[\log\frac{1}{1+w_{\psi}(x^{\prime})}\right]\approx-\mathbb{E}_{x^{\prime}\sim\mathcal{D}_{f}}\left[\log\frac{w_{\psi}(x^{\prime})}{1+w_{\psi}(x^{\prime})}\right]-\mathbb{E}_{x^{\prime}\sim\mathcal{D}}\left[\log\frac{1}{1+w_{\psi}(x^{\prime})}\right].

The optimal solution \psi^{\ast}=\arg\min_{\psi}L(\psi) satisfies w_{\psi^{\ast}}(x^{\prime})=\frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})} (assuming enough expressiveness). The density estimator is then used for weighting the policy update: when sampling a batch of B data points from the buffer, the weight w_{\psi}(x_{i}),1\leq i\leq B is computed for each data point x_{i}. The weights are then normalized across the batch, \widetilde{w}_{i}=\frac{w_{\psi}(x_{i})^{\tau}}{\sum_{j=1}^{B}w_{\psi}(x_{j})^{\tau}}, where the inverse temperature is \tau=0.1. Finally, \widetilde{w}_{i} is used for weighting the update, such that the policy is updated as \theta\leftarrow\theta+\alpha\frac{1}{B}\sum_{i=1}^{B}\widetilde{w}_{i}\nabla_{\theta}Q_{\phi}(x_{i},\pi_{\theta}(x_{i})).
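
A minimal sketch of the discriminative loss and the temperature-normalized weights (assuming PyTorch; `w_net` is the assumed density-ratio network with a softplus output so that w_\psi(x) > 0) is given below.

```python
import torch

def ratio_loss(w_net, x_fast, x_slow):
    # x_fast ~ D_f (recent data, proxy for rho); x_slow ~ D (full buffer, proxy for mu).
    w_fast, w_slow = w_net(x_fast), w_net(x_slow)
    loss_fast = -torch.log(w_fast / (1.0 + w_fast)).mean()
    loss_slow = -torch.log(1.0 / (1.0 + w_slow)).mean()
    return loss_fast + loss_slow

def normalized_weights(w_net, x_batch, tau=0.1):
    # Batch-level normalization with inverse temperature tau, as used in the policy update.
    with torch.no_grad():
        w = w_net(x_batch).squeeze(-1) ** tau
        return w / w.sum()
```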

Require: policy \pi_{\theta}(x), Q-function critic Q_{\phi}(x,a), density estimator w_{\psi}(x), and learning rate \alpha\geq 0
  while not converged do
     1. Collect data (xt,at,rt)μ(x_{t},a_{t},r_{t})\sim\mu and save to the buffer 𝒟\mathcal{D} and the fast buffer 𝒟f\mathcal{D}_{f}
     2. Estimate the density by the discriminative loss between 𝒟,𝒟f\mathcal{D},\mathcal{D}_{f}, such that wψ(x)ρx,γ,γπθ(x)/μ(x)w_{\psi}(x^{\prime})\approx\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})/\mu(x^{\prime}), where xx is the initial state of the MDP.
     3. Sample data from (xi,ai,ri)i=1B𝒟(x_{i},a_{i},r_{i})_{i=1}^{B}\sim\mathcal{D}.
     3(a). Update the Q-function critic Qϕ(x,a)Q_{\phi}(x,a) via TD-learning, such that Qϕ(x,a)Qγπθ(x,a)Q_{\phi}(x,a)\approx Q_{\gamma}^{\pi_{\theta}}(x,a).
     3(b). Update the policy parameter with the gradient θθ+αi=1Bwψ(xi)θQϕ(xi,πθ(xi))\theta\leftarrow\theta+\alpha\sum_{i=1}^{B}w_{\psi}(x_{i})\nabla_{\theta}Q_{\phi}(x_{i},\pi_{\theta}(x_{i})).
  end while
Algorithm 5 Update weighting Off-policy actor-critic

We carry out the update in Algorithm 2, where the density estimator wψ(x)w_{\psi}(x) is trained based on a discriminative loss between 𝒟\mathcal{D} and 𝒟f\mathcal{D}_{f}. For any given batch of data {xi}i=1B\{x_{i}\}_{i=1}^{B}, we normalize the prediction w~i=wψ(xi)τ/j=1Bwψ(xj)τ\widetilde{w}_{i}=w_{\psi}(x_{i})^{\tau}/\sum_{j=1}^{B}w_{\psi}(x_{j})^{\tau} with hyper-parameter τ=0.1\tau=0.1 as similarly implemented in [Sinha et al., 2020]. The temperature annealing moves w~i\widetilde{w}_{i} closer to a uniform distribution and tends to stabilize the algorithm. See Appendix F for further details.

Discussion on relations to other algorithms.

Previous work focuses on re-weighting transitions to stabilize the training of critics; for example, prioritized replay [Schaul et al., 2015] prioritizes samples with high Bellman errors. Instead, Algorithm 2 re-weights samples to speed up the training of the policy. Our observation above also implies that when sampling from \mathcal{D},\mathcal{D}_{f} to train the estimate w_{\psi}\approx\frac{\rho_{x,\gamma,\gamma^{\prime}}^{\pi_{\theta}}(x^{\prime})}{\mu(x^{\prime})}, it is not necessary to discount the transitions. This is in clear contrast to prior work such as [Sinha et al., 2020], which proposes to train w_{\psi}(x^{\prime})\approx d_{x,\gamma}^{\pi_{\theta}}(x^{\prime})/d_{x,\gamma}^{\mu}(x^{\prime}), the ratio of fully discounted visitation distributions under \gamma, derived from optimizing the discounted objective V_{\gamma}^{\pi_{\theta}}(x).

Figure 6: Evaluation of off-policy actor-critic algorithms over continuous control domains: (a) HalfCheetah(G), (b) Ant(G), (c) Walker2d(G), (d) Hopper(G). Each curve corresponds to a baseline algorithm averaged over 5 random seeds. TD3(\gamma) (red curve) consistently outperforms or performs similarly to the other baselines.
Figure 7: Evaluation of off-policy actor-critic algorithms over continuous control domains: (a) HalfCheetah(G), (b) Ant(G), (c) Walker2d(G), (d) Hopper(G). Each curve corresponds to a baseline algorithm averaged over 5 random seeds. SAC(\gamma) (red curve) consistently outperforms or performs similarly to the other baselines.
Results.

We build the algorithmic improvements on top of TD3 [Fujimoto et al., 2018] and SAC [Haarnoja et al., 2018], and name the corresponding algorithms TD3(\gamma) and SAC(\gamma) respectively. We compare with TD3, SAC, and DDPG [Lillicrap et al., 2015], all of which are off-policy algorithms.

We first compare TD3(\gamma) with TD3 in Figure 6. To highlight the sample efficiency of off-policy methods, we include PPO as a baseline as well. Across all four presented tasks, we see that TD3(\gamma) performs similarly to or marginally outperforms the TD3 baseline. To make the comparison of final performance concrete, we report the final score \text{mean}\pm 0.5\text{std} of each algorithm in Table 1. As a default baseline, we also show the results of DDPG reported in [Achiam and OpenAI, 2018]. Overall, TD3(\gamma) provides a modest yet consistent boost over the TD3 baseline.

We then compare SAC(\gamma) with SAC in Figure 7 and Table 1. We see that SAC(\gamma) provides marginal performance gains on Walker2d and Ant, while it is slightly outperformed by baseline SAC on HalfCheetah and Hopper. We speculate that this is partly because the hyper-parameters of baseline SAC are well tuned on HalfCheetah, making it difficult to achieve further significant gains without an exhaustive hyper-parameter search. Overall, SAC(\gamma) is competitive with SAC.

Tasks | TD3(\gamma) | TD3 | DDPG-v1
Ant(G) | \mathbf{3601\pm 879} | \mathbf{3269\pm 686} | \approx 1000
HalfCheetah(G) | \mathbf{10350\pm 279} | 9156\pm 718 | \approx 8500
Walker2D(G) | \mathbf{4090\pm 440} | \mathbf{4233\pm 314} | \approx 2000
Hopper(G) | \mathbf{3340\pm 262} | 2626\pm 677 | \approx 1800

Tasks | SAC(\gamma) | SAC | DDPG-v2
Ant(G) | \mathbf{5572\pm 115} | 4886\pm 530 | 706\pm 123
HalfCheetah(G) | 11774\pm 96 | \mathbf{12059\pm 91} | 7957\pm 527
Walker2D(G) | \mathbf{4626\pm 165} | \mathbf{4522\pm 269} | 2261\pm 147
Hopper(G) | 3384\pm 81 | \mathbf{3557\pm 20} | 2024\pm 297
Table 1: Final performance of baseline algorithms over benchmark tasks. The final performance is computed as the mean score over the last 10 iterations of each algorithm, averaged over 5 seeds. When compared with TD3, the performance of DDPG-v1 is taken from [Achiam and OpenAI, 2018]; when compared with SAC, the DDPG-v2 performance is based on re-runs of the DDPG baseline with [Achiam and OpenAI, 2018]. For each task, the best algorithms are highlighted in bold (potentially with ties).