Taylor Expansions of Discount Factors
Abstract
In practical reinforcement learning (RL), the discount factor used for estimating value functions often differs from that used for defining the evaluation objective. In this work, we study the effect that this discrepancy of discount factors has during learning, and discover a family of objectives that interpolate value functions of two distinct discount factors. Our analysis suggests new ways for estimating value functions and performing policy optimization updates, which demonstrate empirical performance gains. This framework also leads to new insights on commonly-used deep RL heuristic modifications to policy optimization algorithms.
1 Introduction
One of the most popular models for reinforcement learning (RL) is the Markov decision process (MDP) with exponential discounting over an infinite horizon (Sutton and Barto, 2018; Puterman, 2014), with discounted objectives of the form $V^{\pi,\gamma}(x_0) := \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_t\big]$ for some discount factor $\gamma \in [0,1)$.
Discounted models enjoy favorable theoretical properties, and are also the foundation of many practical RL algorithms that enjoy empirical success (e.g. see (Mnih et al., 2015; Schulman et al., 2015a; Lillicrap et al., 2015; Schulman et al., 2017)). However, in most applications of RL, the objective of interest is the expected undiscounted cumulative return,
$J(\pi) := \mathbb{E}_\pi\Big[\sum_{t=0}^{T-1} r_t\Big],$   (1)
where $T$ is a (possibly random) evaluation horizon, which usually also denotes the end of the trajectory. For example, $T$ could be the first time the MDP gets into a terminal state (e.g., a robot falls); when the MDP does not have a natural terminal state, $T$ could be enforced as a deterministic horizon. This creates a technical gap between algorithmic developments and implementations: it is tempting to design algorithms that optimize the discounted objective $V^{\pi,\gamma}(x_0)$; however, further heuristics are often needed to get strong practical performance. This issue manifests itself with the policy gradient (PG) theorem (Sutton et al., 2000). Let $\pi_\theta$ be a parameterized policy. The policy gradient (PG) is computed as
$\nabla_\theta V^{\pi_\theta,\gamma}(x_0) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t\, Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\Big].$   (2)
However, the practical implementation of PG updates usually omits the discount factors $\gamma^t$ (see, for example, the high-quality open source packages of Dhariwal et al. (2017) and Achiam and OpenAI (2018)), leading to an approximate gradient of the form
$\hat{g}_\theta = \mathbb{E}_{\pi_\theta}\Big[\sum_{t=0}^{\infty} Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)\Big].$   (3)
Most prior work on PG algorithms relies on this heuristic update to work properly in deep RL applications. The intuitive argument for dropping the factor $\gamma^t$ is that Eqn (2) optimizes $V^{\pi_\theta,\gamma}(x_0)$, which is very myopic compared to the objective in Eqn (1). Consequently, the exponential discount $\gamma^t$ is too aggressive for weighting updates with large $t$. As a concrete example, in many MuJoCo control tasks (Brockman et al., 2016), the most commonly used discount factor is $\gamma = 0.99$. This leads to an effective horizon of $1/(1-\gamma) = 100$, which is much smaller than the typical evaluation horizon of $T = 1000$. This technical gap between theory and practice has been alluded to previously (e.g., by O’Donoghue et al., 2016) and is explicitly discussed by Nota and Thomas (2019).
To bypass this gap, a straightforward solution would be to naïvely increase the discount factor and apply the PG in Eqn (2). In the example above, this implies using a discount factor much closer to one (e.g., $\gamma' \approx 0.999$). Unfortunately, this rarely works well in practice, as we will also see in experiments. The failure might be due to the higher variance of the estimation (Schulman et al., 2015b) or the collapse of the action gaps (Lehnert et al., 2018; Laroche and van Seijen, 2018), which is aggravated when combined with function approximation.
Nevertheless, as a theoretical framework, it is insightful to emulate the undiscounted objective in Eqn (1) using the discounted objective with $\gamma'$ close to one. To build intuition about this approximation, note that when the time step $t$ is small relative to $1/(1-\gamma')$, the multiplicative factor $(\gamma')^t \approx 1$ and the cumulative rewards are almost undiscounted; even when $t = T = 1000$ and $\gamma' = 0.999$, we have $(\gamma')^t \approx 0.37$. Overall, this is a much more accurate approximation than using $\gamma = 0.99$. This naturally prompts us to answer the following general question: How do we evaluate and optimize $V^{\pi,\gamma'}$ with estimates built for $V^{\pi,\gamma}$, where $\gamma < \gamma'$?
Main idea.
We study the relation between $V^{\pi,\gamma}$ and $V^{\pi,\gamma'}$ via Taylor expansions. In Section 3, we identify a family of interpolating objectives between the more myopic objective $V^{\pi,\gamma}$ and the true objective of interest $V^{\pi,\gamma'}$. In Section 4, we start with insights on why the heuristic in Eqn (3) might be useful in practice. Then, we apply Taylor expansions directly to the heuristic updates to arrive at a family of interpolating updates. In Section 5, we build on these theoretical insights to derive improvements to established deep RL algorithms. We show their performance gains in Section 7.
2 Background
Consider the setup of an MDP. At any discrete time $t \ge 0$, the agent is in state $x_t \in \mathcal{X}$, takes an action $a_t \in \mathcal{A}$, receives an instant reward $r_t = r(x_t, a_t)$ and transitions to a next state $x_{t+1} \sim p(\cdot \mid x_t, a_t)$. For simplicity, we assume the reward function $r(x, a)$ to be deterministic. Let policy $\pi: \mathcal{X} \mapsto \mathcal{P}(\mathcal{A})$ be a mapping from states to distributions over actions. Let $\gamma \in [0,1)$ be a discount factor, and define the Q-function $Q^{\pi,\gamma}(x, a) := \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x, a_0 = a\big]$ and value function $V^{\pi,\gamma}(x) := \mathbb{E}_\pi\big[\sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x\big]$. We also define the advantage function $A^{\pi,\gamma}(x, a) := Q^{\pi,\gamma}(x, a) - V^{\pi,\gamma}(x)$. Here, $\mathbb{E}_\pi$ denotes that the trajectories are generated under policy $\pi$. Throughout the paper, we use sub- and superscripts $\gamma$ to emphasize that RL quantities implicitly depend on discount factors.
2.1 Linear programs for reinforcement learning
Henceforth, we assume all vectors to be column vectors. The value functions satisfy the Bellman equations (Bellman, 1957). Such equations can be encoded into a linear program (LP) (De Farias and Van Roy, 2003; Puterman, 2014). Let $V \in \mathbb{R}^{|\mathcal{X}|}$ be the primal variables and consider the following LP,
$\min_{V}\ \delta_{x_0}^\top V \quad \text{subject to} \quad V \ge r^\pi + \gamma P^\pi V,$   (4)
where $r^\pi \in \mathbb{R}^{|\mathcal{X}|}$ is the state-dependent reward, $r^\pi(x) := \sum_a \pi(a \mid x)\, r(x, a)$, and $P^\pi$ is the transition matrix under $\pi$. Here, $\delta_{x_0}$ encodes the one-hot distribution (Dirac) at $x_0$. Similar results hold when the LP objective is defined with a general distribution over starting states. It then follows that the optimal solution to the above LP is $V^\ast = V^{\pi,\gamma}$. Now, consider the dual LP to Eqn (4); let $d \in \mathbb{R}^{|\mathcal{X}|}$ be the dual variables,
$\max_{d \ge 0}\ d^\top r^\pi \quad \text{subject to} \quad d = (1 - \gamma)\, \delta_{x_0} + \gamma\, (P^\pi)^\top d.$   (5)
The optimal solution to the dual program has a natural probabilistic interpretation: it is the discounted visitation distribution under policy $\pi$ with starting state $x_0$, defined as $d^{\pi,\gamma}_{x_0}(x) := (1 - \gamma) \sum_{t \ge 0} \gamma^t\, \mathbb{P}^\pi(x_t = x \mid x_0)$, where $\mathbb{P}^\pi$ is the probability measure induced by the policy and the MDP transition kernel. By strong duality, the value function can be equivalently written as
$V^{\pi,\gamma}(x_0) = (1 - \gamma)^{-1}\, \mathbb{E}_{x \sim d^{\pi,\gamma}_{x_0}}\big[ r^\pi(x) \big].$   (6)
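To make the dual view concrete, the following minimal numpy sketch (our own illustration on a made-up random MDP, not from the paper; the $(1-\gamma)$ normalization follows the convention above) checks that the discounted visitation distribution recovers the value function as in Eqn (6).

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma, x0 = 5, 0.9, 0
P = rng.dirichlet(np.ones(n), size=n)   # illustrative transition matrix P^pi (rows sum to 1)
r = rng.uniform(size=n)                 # illustrative state-dependent reward r^pi

# Discounted visitation distribution d^{pi,gamma}_{x0} = (1-gamma) sum_t gamma^t P(x_t = . | x_0).
d = (1 - gamma) * np.linalg.solve(np.eye(n) - gamma * P.T, np.eye(n)[x0])
assert np.isclose(d.sum(), 1.0)         # d is a probability distribution over states

# Eqn (6): V^{pi,gamma}(x_0) = (1-gamma)^{-1} E_{x ~ d}[r^pi(x)].
V = np.linalg.solve(np.eye(n) - gamma * P, r)
assert np.isclose(V[x0], d @ r / (1 - gamma))
```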
3 Taylor Expansions of Value Functions
Below, we show how to estimate $V^{\pi,\gamma'}$ with approximations constructed from value functions $V^{\pi,\gamma}$ with $\gamma < \gamma'$. Unless otherwise stated, we always assume $\gamma < \gamma' < 1$ for a more convenient mathematical treatment of the problem.
3.1 Taylor expansions of discount factors
We start with some notation: we abuse the notation $V^{\pi,\gamma}$ to refer both to the scalar function and to the corresponding vector in $\mathbb{R}^{|\mathcal{X}|}$. The Bellman equation for the value function is expressed in matrix form as (Puterman, 2014)
$V^{\pi,\gamma} = r^\pi + \gamma P^\pi V^{\pi,\gamma}.$   (7)
Inverting the equation,
$V^{\pi,\gamma} = (I - \gamma P^\pi)^{-1} r^\pi.$   (8)
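As a quick sanity check of Eqns (7)-(8), the following minimal sketch (an arbitrary random MDP of our own, not from the paper) computes the value function by a linear solve.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 10, 0.99
P_pi = rng.dirichlet(np.ones(n), size=n)   # transition matrix under pi (made up)
r_pi = rng.uniform(size=n)                 # expected reward vector r^pi (made up)

# Eqn (8): V^{pi,gamma} = (I - gamma P^pi)^{-1} r^pi, computed as a linear solve.
V_gamma = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Sanity check of the Bellman equation (7): V = r^pi + gamma P^pi V.
assert np.allclose(V_gamma, r_pi + gamma * P_pi @ V_gamma)
```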
Now, we present the main result of Taylor expansions.
Proposition 3.1.
The following holds for all $K \ge 0$:
$V^{\pi,\gamma'} = \sum_{k=0}^{K} \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^k V^{\pi,\gamma} \;+\; \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^{K+1} V^{\pi,\gamma'}.$   (9)
When $\gamma < \gamma' < 1$, the norm of the residual (the second term on the right-hand side of Eqn (9)) converges to $0$ as $K \to \infty$, which implies
$V^{\pi,\gamma'} = \sum_{k=0}^{\infty} \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^k V^{\pi,\gamma}.$   (10)
We provide a proof sketch here: note that $I - \gamma' P^\pi = (I - \gamma P^\pi) - (\gamma' - \gamma) P^\pi$ and apply the Woodbury matrix identity to obtain $(I - \gamma' P^\pi)^{-1} = (I - \gamma P^\pi)^{-1} + (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi (I - \gamma' P^\pi)^{-1}$. We can then recursively expand Eqn (8) $K$ times to arrive at Eqn (9). In particular, by expanding the equation once, we see that $V^{\pi,\gamma'}$ is equivalent to the following,
$V^{\pi,\gamma'} = V^{\pi,\gamma} + (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi\, V^{\pi,\gamma'},$
where the last term can be expanded further by plugging in the Woodbury matrix identity. See the complete proof in Appendix A.
Extensions to $\gamma' = 1$.
The above result extends to the case $\gamma' = 1$. We make two assumptions: (A.1) the Markov chain induced by $\pi$ is absorbing and $T$ is the absorption time; (A.2) $r(x, a) = 0$ for all absorbing states $x$. Under these assumptions, we can interpret such absorbing states as the terminal states. As a result, $V^{\pi,1}$ is well-defined and Proposition 3.1 still holds; see Appendix A for the complete proof.
In practice, it is infeasible to sum up the infinite number of terms in the Taylor expansion. It is then of interest to consider the $K$th-order expansion of $V^{\pi,\gamma'}$, which truncates the infinite series. Specifically, we define the $K$th-order expansion as
$V^{\pi}_{K} := \sum_{k=0}^{K} \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^k V^{\pi,\gamma},$   (11)
where we suppress the dependence on $\gamma$ and $\gamma'$ for notational brevity.
As $K$ increases, the $K$th order expansion becomes increasingly close to the infinite series, which evaluates to $V^{\pi,\gamma'}$. This is formalized next.
Proposition 3.2.
The following bound holds for all $K \ge 0$:
$\big\| V^{\pi,\gamma'} - V^{\pi}_{K} \big\|_\infty \le \Big( \frac{\gamma' - \gamma}{1 - \gamma} \Big)^{K+1} \big\| V^{\pi,\gamma'} \big\|_\infty.$   (12)
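The geometric decay in Eqn (12) is easy to verify numerically. The sketch below (again on a made-up random MDP, with our own variable names) forms the $K$th order expansion of Eqn (11) and compares its error against the bound.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma, gamma_p = 10, 0.99, 0.999
P = rng.dirichlet(np.ones(n), size=n)                  # P^pi (made-up MDP)
r = rng.uniform(size=n)                                # r^pi

V = lambda g: np.linalg.solve(np.eye(n) - g * P, r)    # Eqn (8)
V_lo, V_hi = V(gamma), V(gamma_p)

# Eqn (11): V_K = sum_{k=0}^{K} ((gamma'-gamma)(I - gamma P)^{-1} P)^k V^{pi,gamma}.
M = (gamma_p - gamma) * np.linalg.solve(np.eye(n) - gamma * P, P)
term, V_K = V_lo.copy(), V_lo.copy()
for K in range(1, 9):
    term = M @ term
    V_K = V_K + term
    err = np.max(np.abs(V_hi - V_K))
    bound = ((gamma_p - gamma) / (1 - gamma)) ** (K + 1) * np.max(np.abs(V_hi))
    print(f"K={K}  error={err:.3e}  bound={bound:.3e}")  # error stays below the bound in Eqn (12)
```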
3.2 Sample-based approximations of Taylor expansions
We now describe how to estimate $V^{\pi}_{K}$ via samples. First, we build some intuition about the behavior of expansions at different orders by considering a few special cases.
Zeroth-order expansion.
By setting $K = 0$, we see that
$V^{\pi}_{0} = V^{\pi,\gamma}.$   (13)
The zeroth order expansion approximates the value function of discount factor $\gamma'$ with that of the lower discount factor $\gamma$. This approximation is straightforward to use in that no sampling at all is required, but it may not be accurate.
First-order expansion.
When $K = 1$, we consider the increment of the expansion,
$V^{\pi}_{1} - V^{\pi}_{0} = (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi\, V^{\pi,\gamma}.$   (14)
To understand the first order expansion, recall that in the definition of the value function $V^{\pi,\gamma} = (I - \gamma P^\pi)^{-1} r^\pi$, immediate rewards are accumulated via the matrix $(I - \gamma P^\pi)^{-1}$. In general, for any vector $R \in \mathbb{R}^{|\mathcal{X}|}$, we can interpret $(I - \gamma P^\pi)^{-1} R$ as accumulating $R$ as rewards to compute value functions. By analogy, we can interpret the RHS of Eqn (14) as the value function assuming $(\gamma' - \gamma)\, P^\pi V^{\pi,\gamma}$ as immediate rewards. In other words, the first order expansion bootstraps the zeroth order expansion to form a more accurate approximation. Combined with the zeroth order expansion, we can also conveniently write the difference of the first- and zeroth-order expansions as an expectation over trajectories. Let $\tau$ be a random time such that $\mathbb{P}(\tau = t) = (1 - \gamma)\gamma^t$ for $t \ge 0$ (i.e., $\tau$ is geometrically distributed with parameter $1 - \gamma$). The difference can then be expressed via this random time as $V^{\pi}_{1}(x_0) - V^{\pi}_{0}(x_0) = \frac{\gamma' - \gamma}{1 - \gamma}\, \mathbb{E}_\pi\big[ V^{\pi,\gamma}(x_{\tau + 1}) \big]$.
Note that from this expression, we obtain a simple unbiased estimate of $V^{\pi}_{1}(x_0) - V^{\pi}_{0}(x_0)$ using a sampled trajectory and a single random time step $\tau$.
General $K$th-order expansion.
We now present results for general $K$. Consider the incremental term,
$V^{\pi}_{K} - V^{\pi}_{K-1} = \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^{K} V^{\pi,\gamma}.$   (15)
Note that the structure of the aggregate matrix suggests a recursive procedure to bootstrap from lower order expansions to construct higher order expansions. To see why, we can rewrite the right-hand side of Eqn (15) as
$V^{\pi}_{K} - V^{\pi}_{K-1} = (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi\, \big( V^{\pi}_{K-1} - V^{\pi}_{K-2} \big).$
Indeed, we can interpret the difference $V^{\pi}_{K} - V^{\pi}_{K-1}$ as the value function under the immediate reward $(\gamma' - \gamma)\, P^\pi \big( V^{\pi}_{K-1} - V^{\pi}_{K-2} \big)$. This generalizes the bootstrap procedure of the first order expansion as a special case, where we naturally assume $V^{\pi}_{-1} := 0$. Given i.i.d. random times $\tau_1, \ldots, \tau_K$, each geometrically distributed with parameter $1 - \gamma$, we can write the increment as the expectation
$V^{\pi}_{K}(x_0) - V^{\pi}_{K-1}(x_0) = \Big( \frac{\gamma' - \gamma}{1 - \gamma} \Big)^{K}\, \mathbb{E}_\pi\big[ V^{\pi,\gamma}(x_{t_K}) \big], \quad \text{where } t_K := \sum_{k=1}^{K} (\tau_k + 1).$
Based on the above expression, Algorithm 1 provides a subroutine that generates unbiased estimates of $V^{\pi}_{K}(x_0)$ by sub-sampling an infinite trajectory at the random times $t_1, \ldots, t_K$.
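The following sketch illustrates the sub-sampling estimator in the spirit of Algorithm 1 (which is not reproduced here); the toy MDP, sample size and helper names are our own illustrative assumptions. It checks that the random-time estimate matches the exact $K$th order expansion in expectation.

```python
import numpy as np

rng = np.random.default_rng(2)
n, gamma, gamma_p, K, x0 = 10, 0.8, 0.95, 3, 0
P = rng.dirichlet(np.ones(n), size=n)
r = rng.uniform(size=n)
V_base = np.linalg.solve(np.eye(n) - gamma * P, r)   # base estimates V^{pi,gamma}, treated as given

def one_sample_estimate(x0):
    """One-sample estimate of V_K(x0): walk the chain and read off V^{pi,gamma} at the
    cumulative random times t_k = sum_{i<=k} (tau_i + 1), with tau_i ~ Geometric(1-gamma)."""
    est, x, coeff = V_base[x0], x0, 1.0
    for _ in range(K):
        tau = rng.geometric(1.0 - gamma) - 1          # numpy's geometric is supported on {1, 2, ...}
        for _ in range(tau + 1):                      # advance the trajectory tau_k + 1 steps
            x = rng.choice(n, p=P[x])
        coeff *= (gamma_p - gamma) / (1.0 - gamma)
        est += coeff * V_base[x]
    return est

# Compare the Monte Carlo average with the exact K-th order expansion at x0.
M = (gamma_p - gamma) * np.linalg.solve(np.eye(n) - gamma * P, P)
term, V_K = V_base.copy(), V_base.copy()
for _ in range(K):
    term = M @ term
    V_K = V_K + term
samples = [one_sample_estimate(x0) for _ in range(5000)]
print(np.mean(samples), V_K[x0])                      # should roughly agree (unbiasedness)
```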
Practical implementations.
While the above derivation and Algorithm 1 show how to compute one-sample estimates, in practice we might want to average multiple samples along a single trajectory for variance reduction. See Appendix F for further details on the practical estimates.
Interpretation of expansions in the dual space.
Recall that $V^{\pi,\gamma'}(x) = \delta_x^\top (I - \gamma' P^\pi)^{-1} r^\pi$, where the identity matrix concatenates the Dirac delta vectors $\delta_x$. Since $r^\pi$ is a constant vector, Taylor expansions essentially construct approximations to the matrix $(I - \gamma' P^\pi)^{-1}$. By grouping this matrix with the reward vector (or with the density matrix), we arrive at the primal expansion (or the dual expansion).
The derivations above focus on the primal expansion view. We show a parallel theory of dual expansions in Appendix B. The equivalence of the primal and dual views of Taylor expansions suggests connections with seemingly disparate lines of prior work: Janner et al. (2020) propose a density model for visitation distributions under different discount factors in the context of model-based RL. They show that predictions for large discount factors can be bootstrapped from predictions for small discount factors. This corresponds exactly to the dual space expansion, which is equivalent to the primal space expansion.
Extensions to Q-functions.
In Appendix C, we show that it is possible to build approximations to $Q^{\pi,\gamma'}$ using $Q^{\pi,\gamma}$ as building blocks. The theoretical guarantees and estimation procedures are similar to the case of value functions.
3.3 Approximation errors with finite samples
Proposition 3.2 shows that the expected approximation error decays geometrically, at rate $\big(\frac{\gamma' - \gamma}{1 - \gamma}\big)^{K+1}$, as $K$ increases. This motivates using a high value of $K$ when constructing the approximation. However, in practice, all constituent terms of the $K$th order expansion are random estimates, each with non-zero variance. This might lead the variance of the overall estimate to increase with $K$. As a result, $K$ mediates a trade-off between bias (expected approximation error) and variance. We formalize these intuitions in Appendix E, where we theoretically analyze the trade-off using the phased TD-learning framework (Kearns and Singh, 2000).
A numerical example.
To get direct intuition about the effect of $K$, we focus on a tabular MDP example. The MDP has a small number of states and actions. All entries of the transition table are generated from a Dirichlet distribution. The policy $\pi$ is uniformly random. We take a pair of discount factors $\gamma < \gamma'$. The agent generates trajectories with a very large horizon from a fixed starting state $x_0$. We assume access to base estimates $\hat{V}^{\pi,\gamma}$, and the Taylor expansion estimates $\hat{V}^{\pi}_{K}$ are computed based on Algorithm 1. We measure the relative error $|\hat{V}^{\pi}_{K}(x_0) - V^{\pi,\gamma'}(x_0)| / |V^{\pi,\gamma'}(x_0)|$. For further experiment details, see Appendix F.
In Figure 1(a), we show how errors vary as a function of $K$. We study two settings: (1) expected estimates (red), where $V^{\pi}_{K}$ is computed analytically through access to the transition table. In this case, as the theory suggests, the error decays exponentially in $K$; (2) sample-based estimates (blue) with base estimates $\hat{V}^{\pi,\gamma}$. The errors decay initially with $K$ but later increase slightly as $K$ gets large. The optimal $K$ in the middle achieves the best bias-variance trade-off. Note that in this particular example, the estimates do not pay a very big price in variance for large $K$. We speculate this is because the increments to the estimates are proportional to $\big(\frac{\gamma' - \gamma}{1 - \gamma}\big)^{K}$, which scales down additional variance terms quickly as $K$ increases.
In Figure 1(b), we study how the optimal expansion order $K^\ast$ depends on the noise level of the base estimates. To emulate the noise, we assume access to base estimates corrupted by noise of scale $\sigma$. The optimal order $K^\ast$ is computed as the order that minimizes the relative error. In general, we observe that as $\sigma$ increases, $K^\ast$ decreases. Intuitively, this implies that as the base estimates become noisier, we should prefer smaller values of $K$ to control the variance. This result bears insights for practical applications such as downstream policy optimization, where we need to select an optimal $K$ for the task at hand.
[Figure 1: (a) relative approximation errors as a function of the expansion order K; (b) the optimal order K* as a function of the noise level of the base estimates.]
4 Taylor Expansions of Gradient Updates
In Section 3, we discussed how to construct approximations to $V^{\pi,\gamma'}$. For the purpose of policy optimization, it is of direct interest to study approximations to the gradient $\nabla_\theta V^{\pi_\theta,\gamma'}(x_0)$. As stated in Section 1, a major premise of our work is that in many practical contexts, estimating discounted values under $\gamma'$ is difficult. As a result, directly evaluating the full gradient is challenging, because it requires estimating the Q-functions $Q^{\pi,\gamma'}$. Below, we start by showing how a decomposition of $V^{\pi,\gamma'}(x_0)$ motivates a particular form of gradient update, which is generally considered a deep RL heuristic. Then we construct approximations to this update based on Taylor expansions.
4.1 $V^{\pi,\gamma'}$ as a weighted mixture of $V^{\pi,\gamma}$
We can explicitly rewrite $V^{\pi,\gamma'}(x_0)$ as a weighted mixture of the value functions $V^{\pi,\gamma}(x)$. This result was alluded to in (Romoff et al., 2019) and is formally shown below.
Lemma 4.1.
Assume $\gamma' < 1$. We can write $V^{\pi,\gamma'}(x_0) = w^\top V^{\pi,\gamma}$, where the weight vector is
$w^\top := \delta_{x_0}^\top \big( I + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi \big).$
We can also rewrite this, using an expectation, as:
$V^{\pi,\gamma'}(x_0) = V^{\pi,\gamma}(x_0) + (\gamma' - \gamma)\, \mathbb{E}_\pi\Big[ \sum_{t \ge 0} (\gamma')^t\, V^{\pi,\gamma}(x_{t+1}) \Big].$   (16)
When $\gamma' = 1$, the weight vector $w$ might be undefined. However, Eqn (16) still holds if assumptions A.1 and A.2 are satisfied.
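Lemma 4.1 is straightforward to verify numerically; below is a minimal sketch on a made-up MDP, using our notation $W$ for the weight matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
n, gamma, gamma_p, x0 = 6, 0.9, 0.99, 0
P = rng.dirichlet(np.ones(n), size=n)
r = rng.uniform(size=n)
V = lambda g: np.linalg.solve(np.eye(n) - g * P, r)

# Weight matrix W = I + (gamma' - gamma) (I - gamma' P)^{-1} P; the weight vector w of
# Lemma 4.1 is its x0-th row, so that V^{pi,gamma'}(x0) = w^T V^{pi,gamma}.
W = np.eye(n) + (gamma_p - gamma) * np.linalg.solve(np.eye(n) - gamma_p * P, P)
assert np.isclose(W[x0] @ V(gamma), V(gamma_p)[x0])
```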
4.2 Decomposing the full gradient
Lemma 4.1 highlights that $V^{\pi,\gamma'}(x_0)$ depends on $\pi$ in two ways: (1) through the value functions $V^{\pi,\gamma}$; (2) through the state-dependent distribution of weights $w$. Let $\pi_\theta$ be a parameterized policy. For conceptual clarity, we can write $V^{\pi_\theta,\gamma'}(x_0) = f(w, V^{\pi_\theta,\gamma})$ with a function $f$. Though this function is essentially the inner product, i.e., $f(w, V) = w^\top V$, notationally it helps stress that $V^{\pi_\theta,\gamma'}(x_0)$ depends on $\theta$ through two vector arguments. Now, we can decompose $\nabla_\theta V^{\pi_\theta,\gamma'}(x_0)$.
Lemma 4.2.
The full gradient can be decomposed into the sum of two partial gradients as follows,
$\nabla_\theta V^{\pi_\theta,\gamma'}(x_0) = \underbrace{w^\top\, \nabla_\theta V^{\pi_\theta,\gamma}}_{\text{first partial gradient}} \;+\; \underbrace{\big( \nabla_\theta w \big)^\top V^{\pi_\theta,\gamma}}_{\text{second partial gradient}},$
where the above partial gradients are both evaluated at the current parameter $\theta$, and both expectations (when the partial gradients are written as expectations over trajectories) are with respect to $\pi_\theta$.
We argue that the second partial gradient introduces most of the challenges in practical optimization. Intuitively, this is because its unbiased estimator is equivalent to a REINFORCE gradient estimator which requires estimating discounted values that accumulate $V^{\pi,\gamma}$ as 'reward' under discount factor $\gamma'$. By the premise of our work, this estimation is difficult. We detail the discussion in Appendix D.
The following result characterizes the first partial gradient.
Proposition 4.3.
For any $\gamma' < 1$, the first partial gradient can be expressed as
$\mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{\infty} (\gamma')^t\, Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t) \Big].$   (17)
When $\gamma' = 1$, under assumptions A.1 and A.2, the first partial gradient exists and is expressed as
$\mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{T-1} Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t) \Big].$   (18)
Connections to common deep RL heuristic.
Many high-quality deep RL algorithms (see, e.g., Dhariwal et al., 2017; Achiam and OpenAI, 2018) implement parameter updates that are very similar to Eqn (18). As such, Proposition 4.3 provides some insight on why implementing such a heuristic might be useful in practice: though in general Eqn (18) is not a gradient (Nota and Thomas, 2019), it is a partial gradient of the undiscounted objective in Eqn (1), which is usually the objective of interest at evaluation time. Compared with the formula of the vanilla PG in Eqn (2), Eqn (18) offsets the over-discounting by $\gamma^t$ via a uniform average over states.
However, it is worth noting that in deep RL practice, the definition of the evaluation horizon $T$ might slightly differ from that specified in A.1. In such cases, Proposition 4.3 does not hold exactly. By A.1, $T$ is the absorption time that marks when the MDP enters a terminal absorbing state. In many applications, however, for MDPs without a natural terminal state, $T$ is usually enforced by an external time constraint that does not depend on states. In other words, an environment can terminate even when it does not enter any terminal state (see, e.g., Brockman et al., 2016 for such examples). To bypass this subtle technical gap, one idea is to incorporate the time step as part of the state. This technique was hinted at in early work such as (Schulman et al., 2015b) and empirically studied in (Pardo et al., 2018). In this case, the random absorption time depends fully on the augmented states, and Proposition 4.3 holds.
4.3 Taylor expansions of partial gradients
We now consider approximations to the first partial gradient in Eqn (17). Since $V^{\pi,\gamma}$ does not depend on $\gamma'$, the approximation is effectively with respect to the weight vector $w$. Below, we show results for the $K$th order approximation.
Proposition 4.4.
Assume $\gamma' < 1$. For any $K \ge 0$, define the $K$th Taylor expansion to $w$ as
$w_K^\top := \delta_{x_0}^\top \sum_{k=0}^{K} \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^k.$
It can be shown that $w_0 = \delta_{x_0}$ and $\lim_{K \to \infty} w_K = w$.
We build some intuition about the approximations. Note that, in general, we can write the resulting approximate partial gradient as a weighted mixture of local gradients $g_t := Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t)$,
$w_K^\top\, \nabla_\theta V^{\pi_\theta,\gamma} = \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{\infty} c_K(t)\, g_t \Big],$   (19)
for some weight function $c_K(t)$. When $K \to \infty$, $w_K \to w$ and we recover the original first partial gradient defined in Eqn (17); when $K = 0$, the update recovers the vanilla PG in Eqn (2). For other values of $K$, we give the analytic weights in Appendix D. Similar to how $V^{\pi}_{K}$ interpolates $V^{\pi,\gamma}$ and $V^{\pi,\gamma'}$, here the $K$th order expansion of the partial gradient interpolates between the vanilla PG and the full partial gradient. In practice, we might expect an intermediate value of $K$ to achieve the best bias-variance trade-off for the update.
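To visualize the interpolation, the sketch below computes candidate per-step weights $c_K(t)$. We use the truncated-binomial form $c_K(t) = \sum_{k=0}^{\min(K,t)} \binom{t}{k} \gamma^{t-k} (\gamma' - \gamma)^k$, which is our reading of the analytic weights in Appendix D and should be treated as an assumption; it does satisfy the two limiting cases above, $c_0(t) = \gamma^t$ (vanilla PG) and $c_K(t) \to (\gamma')^t$ as $K \to \infty$ (Eqn (17)).

```python
import numpy as np
from math import comb

def c_K(t, K, gamma, gamma_p):
    """Assumed K-th order weight: the binomial expansion of (gamma')^t around gamma,
    truncated at order K in powers of (gamma' - gamma)."""
    return sum(comb(t, k) * gamma ** (t - k) * (gamma_p - gamma) ** k
               for k in range(min(K, t) + 1))

gamma, gamma_p, ts = 0.99, 1.0, [0, 10, 100, 500, 1000]
for K in [0, 1, 5, 50]:
    print(K, [round(c_K(t, K, gamma, gamma_p), 4) for t in ts])
# K = 0 gives gamma^t (vanilla PG); as K grows, the weights approach (gamma')^t = 1,
# i.e., the uniform weighting of the deep RL heuristic in Eqn (18).
```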
5 Policy optimization with Taylor expansions
Based on theoretical insights of previous sections, we propose two algorithmic changes to baseline algorithms. Based on Section 3, we propose Taylor expansion advantage estimation; based on Section 4, we propose Taylor expansion update weighting. It is important to note that other algorithmic changes are possible, which we leave to future work.
5.1 Baseline near on-policy algorithm
We briefly introduce background on near on-policy policy optimization algorithms (Schulman et al., 2015a; Mnih et al., 2016; Schulman et al., 2017; Espeholt et al., 2018). We assume that the data are collected under a behavior policy $\mu$, which is close to the target policy $\pi$. The on-policyness is ensured by constraining $D(\mu, \pi) \le \epsilon$ for some divergence $D$ and threshold $\epsilon$. Usually, $\epsilon$ is chosen to be small such that few off-policy corrections are needed for estimating value functions. With the collected data, the algorithms estimate Q-functions $\hat{Q}^{\pi,\gamma} \approx Q^{\pi,\gamma}$. These estimates are then used as plug-in alternatives to the Q-functions in the definition of gradient updates such as Eqn (2) for sample-based updates.
5.2 Taylor expansion Q-function estimation
In Section 3, we discussed how to construct approximations to $Q^{\pi,\gamma'}$ using $Q^{\pi,\gamma}$ as building blocks. As the first algorithmic change, we propose to construct the $K$th order expansion $\hat{Q}^{\pi}_{K}$ as a plug-in alternative to $\hat{Q}^{\pi,\gamma}$ when combined with downstream optimization. Since $\hat{Q}^{\pi}_{K} \approx Q^{\pi,\gamma'}$, we expect the optimization subroutine to account for an objective with a longer effective horizon.
In many baseline algorithms, we have access to a value function critic $\hat{V}^{\pi,\gamma}$ and a subroutine which produces Q-function estimates $\hat{Q}^{\pi,\gamma}$ (e.g., via generalized advantage estimation). We then construct the $K$th order expansion $\hat{Q}^{\pi}_{K}$ using these estimates. This procedure is similar to Algorithm 1; we show the full algorithm in Appendix C. See also Appendix F for further experimental details.
5.3 Taylor expansion update weighting
In Section 4, we discussed Taylor expansion approximations to the weight vector $w$. As the second algorithmic change to the baseline algorithm, we update parameters in the direction of the $K$th order approximation to the partial gradient. Eqn (19) shows that this update effectively translates into adjusting the per-time-step weights $c_K(t)$ on the local gradients. When combined with other components of the algorithm, the pseudocode is shown in Algorithm 3. Under this framework, the common deep RL heuristic is recovered by setting $K \to \infty$ (with $\gamma' = 1$).
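As a rough sketch of how the update weighting slots into a PPO-style implementation (schematic only; the actual Algorithm 3 and surrogate-loss details are in the code base, and the weight formula carries over the assumption from the previous snippet):

```python
import numpy as np
from math import comb

def update_weight(t, K, gamma):
    # Assumed per-step weight with gamma' = 1 (same truncated-binomial form as above):
    # equals gamma^t at K = 0 and approaches 1 (the usual PPO weighting) as K grows.
    return sum(comb(t, k) * gamma ** (t - k) * (1.0 - gamma) ** k
               for k in range(min(K, t) + 1))

def weighted_pg_loss(logp, advantages, timesteps, K, gamma=0.99):
    """Schematic weighted policy-gradient surrogate: each per-step term -logp_t * A_t
    is scaled by the K-th order weight c_K(t) before averaging over the mini-batch."""
    w = np.array([update_weight(int(t), K, gamma) for t in timesteps])
    return float(np.mean(-w * logp * advantages))

# Toy usage with made-up batch quantities.
rng = np.random.default_rng(0)
logp, adv = rng.normal(size=128), rng.normal(size=128)
ts = rng.integers(0, 1000, size=128)
print(weighted_pg_loss(logp, adv, ts, K=10))
```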
6 Related work
Discount factors in RL.
Discount factors impact RL agents in various ways. A number of works suggest that RL problems with large discount factors are generally more difficult to solve (Jiang et al., 2016), potentially due to the increased complexity of the optimal value functions or the collapse of action gaps (Lehnert et al., 2018; Laroche and van Seijen, 2018). However, optimal policies defined with small discounts can be very sub-optimal for RL objectives with a large discount factor. To ensure numerical stability when using large discounts, prior work has suggested non-linear transformations of the Bellman targets for Q-learning (Pohlen et al., 2018; van Hasselt et al., 2019; Kapturowski et al., 2018; Van Seijen et al., 2019). However, when data is scarce, small discount factors might prove useful due to their implicit regularization effect (Amit et al., 2020).
Adapting discount factors & multiple discount factors.
In general, when selecting a single optimal discount factor for training is difficult, it might be desirable to adjust the discount during training. This could be achieved by human-designed (Prokhorov and Wunsch, 1997; François-Lavet et al., 2015) or blackbox adaptation (Xu et al., 2018). Alternatively, it might also be beneficial to learn with multiple discount factors at the same time, which could improve TD-learning (Sutton, 1995) or representation learning (Fedus et al., 2019). Complementary to all such work, we study the connections between value functions defined with different discounts.
Taylor expansions for RL.
Recently, in (Tang et al., 2020), Taylor expansions were applied to study the relationship between $V^{\pi,\gamma}$ and $V^{\mu,\gamma}$, i.e., value functions under the same discount factor but different policies $\pi$ and $\mu$. This is useful in the context of off-policy learning. Our work is orthogonal and could potentially be combined with this approach.
7 Experiments
In this section, we evaluate the empirical performance of the new algorithmic changes to the baseline algorithms. We focus on robotics control experiments with continuous state and action spaces. The tasks are available in OpenAI gym (Brockman et al., 2016), with backends such as MuJoCo (Todorov et al., 2012) and bullet physics (Coumans, 2015). We label the tasks as gym (G) and bullet (B) respectively. We always compare the undiscounted cumulative rewards evaluated under a default evaluation horizon of $T = 1000$.
Hyper-parameters.
Throughout the experiments, we use the same hyper-parameters across all algorithms. The learning rate is tuned for the baseline PPO, and fixed across all algorithms. See Appendix F for further details.
7.1 Taylor expansion Q-function estimation
[Figure 2: training curves on continuous control benchmark tasks comparing PPO, PPO with a high discount factor, and PPO(K) with Taylor expansion Q-function estimation.]
We use the $K$th order expansion $\hat{Q}^{\pi}_{K}$ as the Q-function estimator plugged into the gradient update. When combined with PPO (Schulman et al., 2017), the resulting algorithm is named PPO($K$). We compare with the baseline PPO and TRPO (Schulman et al., 2015a). In practice, we consider a mixture advantage estimator with a constant $\eta$ that interpolates between PPO (i.e., $\eta = 0$) and PPO($K$). Note that $\eta$ should be selected such that it balances the numerical scales of the two extremes; as a result, we usually find $\eta$ to work well when it is small in absolute scale.
Results.
In Figure 2, we compare a few baselines: (1) PPO with $\gamma = 0.99$ (default); (2) PPO with a high discount factor $\gamma = 0.999$; (3) PPO with the Taylor expansion based advantage estimator, PPO($K$). Throughout, we use a single hyper-parameter setting. We see that in general, PPO($K$) leads to better performance (faster learning, better asymptotic performance or smaller variance across seeds). This shows that Taylor expansion Q-function estimation can lead to performance gains across tasks, given that the hyper-parameters are carefully tuned. We provide a detailed ablation study on $\eta$ in Appendix F, where we show how the overall performance across the benchmark tasks varies as $\eta$ changes from small to large values.
A second observation is that simply increasing the discount factor to $\gamma = 0.999$ generally degrades performance. This confirms the instability of directly applying high discount factors, which motivates this work.
We also compare with the open source implementation of (Romoff et al., 2019) in Appendix F, where $V^{\pi,\gamma'}$ is estimated based on recursive bootstraps of value function differences. Conceptually, this is similar to Taylor expansions with $K = \infty$. We show that without the careful trade-off mediated by smaller $K$, this algorithm does not improve performance out of the box.
7.2 Taylor expansion update weighting
As introduced in Section 5, we weigh local gradients with the $K$th order expansion weights $c_K(t)$. Here, we take $\gamma' = 1$ and a relatively large $K$. Note that since $K \to \infty$ corresponds to the uniform weighting of the deep RL heuristic, this choice is very close to the commonly implemented PPO baseline. We hence expect the algorithm to work better with relatively large values of $K$, and fix a single such value throughout the experiments. In practice, we find the performance to be fairly robust to the choice of $K$. We provide further analysis and an ablation study in Appendix F.
[Figure 3: training curves on continuous control benchmark tasks comparing default PPO, PPO with time limits, and PPO(K) with Taylor expansion update weighting.]
Results.
We compare a few baselines: (1) default PPO; (2) PPO with time limits (Pardo et al., 2018), in which the states are augmented with the time step such that the augmented states are Markovian; (3) PPO with Taylor expansion update weighting, PPO($K$). In Figure 3, we see that in general, PPO($K$) and PPO with time limits outperform the baseline PPO. We speculate that the performance gains arise from the following empirical motivation: since the evaluation stops at $t = T$, local gradients close to the horizon should be weighed down because they do not contribute as much to the final objective. However, the default PPO ignores this effect and weighs all updates uniformly. To tackle this issue, PPO($K$) explicitly weighs down the updates at large $t$, while PPO with time limits augments the state space to restore stationarity. Empirically, though in some cases PPO with time limits also outperforms PPO($K$), it behaves fairly unstably in other cases.
Extensions to off-policy algorithms.
Above, we mainly focused on on-policy algorithms. The setup is simpler because the data are collected (near) on-policy. It is possible to extend similar results to off-policy algorithms (Mnih et al., 2015; Lillicrap et al., 2015; Fujimoto et al., 2018; Haarnoja et al., 2018). Due to the space limit, we present extended results in Appendix F, where we show how to combine similar techniques to off-policy actor-critic algorithms such as TD3 (Fujimoto et al., 2018) and SAC (Haarnoja et al., 2018) in continuous control domains.
8 Conclusion
We have proposed a family of objectives that interpolate value functions defined with two discount factors. We have shown that similar techniques are applicable to other cumulative quantities defined through discounts, such as PG updates. This framework allowed us to achieve trade-off in estimating value functions or gradient updates, and led to empirical performance gains.
We also highlighted a new direction for bridging the gap between theory and practice: the gap between a fully discounted objective (in theory) and an undiscounted objective (in practice). By building a better understanding of this gap, we shed light on seemingly opaque heuristics which are necessary to achieve good empirical performance. We expect this framework to be useful for new practical algorithms.
Acknowledgements.
Yunhao thanks Tadashi Kozuno and Shipra Agrawal for discussions on the discrepancy between policy gradient theory and practices. Yunhao acknowledges the support from Google Cloud Platform for computational resources. The authors also thank Pooria Joulani for reviewing a draft of the paper.
References
- Achiam and OpenAI [2018] Joshua Achiam and OpenAI. Spinning Up in Deep Reinforcement Learning. https://github.com/openai/spinningup, 2018.
- Amit et al. [2020] Ron Amit, Ron Meir, and Kamil Ciosek. Discount factor as a regularizer in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2020.
- Bellman [1957] Richard Bellman. A Markovian decision process. Journal of mathematics and mechanics, pages 679–684, 1957.
- Brockman et al. [2016] Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI gym. arXiv, 2016.
- Coumans [2015] Erwin Coumans. Bullet physics simulation. In ACM SIGGRAPH 2015 Courses, page 1. 2015.
- De Farias and Van Roy [2003] Daniela Pucci De Farias and Benjamin Van Roy. The linear programming approach to approximate dynamic programming. Operations research, 51(6):850–865, 2003.
- Degris et al. [2012] Thomas Degris, Martha White, and Richard S Sutton. Off-policy actor-critic. arXiv preprint arXiv:1205.4839, 2012.
- Dhariwal et al. [2017] Prafulla Dhariwal, Christopher Hesse, Oleg Klimov, Alex Nichol, Matthias Plappert, Alec Radford, John Schulman, Szymon Sidor, Yuhuai Wu, and Peter Zhokhov. OpenAI baselines, 2017.
- Espeholt et al. [2018] Lasse Espeholt, Hubert Soyer, Remi Munos, Karen Simonyan, Volodymir Mnih, Tom Ward, Yotam Doron, Vlad Firoiu, Tim Harley, Iain Dunning, et al. Impala: Scalable distributed deep-rl with importance weighted actor-learner architectures. In Proceedings of the International Conference on Machine Learning, 2018.
- Fedus et al. [2019] William Fedus, Carles Gelada, Yoshua Bengio, Marc G Bellemare, and Hugo Larochelle. Hyperbolic discounting and learning over multiple horizons. arXiv, 2019.
- Fox et al. [2016] Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, 2016.
- François-Lavet et al. [2015] Vincent François-Lavet, Raphael Fonteneau, and Damien Ernst. How to discount deep reinforcement learning: Towards new dynamic strategies. NIPS Deep Reinforcement Learning Workshop, 2015.
- Fujimoto et al. [2018] Scott Fujimoto, Herke Van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In Proceedings of the International Conference on Machine Learning, 2018.
- Grinstead and Snell [2012] Charles Miller Grinstead and James Laurie Snell. Introduction to probability. American Mathematical Soc., 2012.
- Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of the International Conference on Machine Learning, 2018.
- Hasselt [2010] Hado Hasselt. Double Q-learning. Advances in Neural Information Processing Systems, 2010.
- Janner et al. [2020] Michael Janner, Igor Mordatch, and Sergey Levine. Gamma-models: Generative temporal difference learning for infinite-horizon prediction. Advances in Neural Information Processing Systems, 2020.
- Jiang et al. [2016] Nan Jiang, Satinder P Singh, and Ambuj Tewari. On structural properties of MDPs that bound loss due to shallow planning. In Proceedings of the International Joint Conference on Artificial Intelligence, 2016.
- Kapturowski et al. [2018] Steven Kapturowski, Georg Ostrovski, John Quan, Remi Munos, and Will Dabney. Recurrent experience replay in distributed reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2018.
- Kearns and Singh [2000] Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In Proceedings of the Conference on Learning Theory, 2000.
- Kingma and Ba [2014] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- Laroche and van Seijen [2018] Romain Laroche and Harm van Seijen. In reinforcement learning, all objective functions are not equal. 2018.
- Lehnert et al. [2018] Lucas Lehnert, Romain Laroche, and Harm van Seijen. On value function representation of long horizon problems. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
- Lillicrap et al. [2015] Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In Proceedings of the International Conference on Learning Representations, 2015.
- Mnih et al. [2015] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A Rusu, Joel Veness, Marc G Bellemare, Alex Graves, Martin Riedmiller, Andreas K Fidjeland, Georg Ostrovski, Stig Petersen, Charles Beattie, Amir Sadik, Ioannis Antonoglou, Helen King, Dharshan Kumaran, Daan Wierstra, Shane Legg, and Demis Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- Mnih et al. [2016] Volodymyr Mnih, Adria Puigdomenech Badia, Mehdi Mirza, Alex Graves, Timothy Lillicrap, Tim Harley, David Silver, and Koray Kavukcuoglu. Asynchronous methods for deep reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2016.
- Nota and Thomas [2019] Chris Nota and Philip S Thomas. Is the policy gradient a gradient? In Proceedings of the International Conference on Autonomous Agents and Multiagent Systems, 2019.
- O’Donoghue et al. [2016] Brendan O’Donoghue, Remi Munos, Koray Kavukcuoglu, and Volodymyr Mnih. Combining policy gradient and Q-learning. In Proceedings of the International Conference on Learning Representations, 2016.
- Pardo et al. [2018] Fabio Pardo, Arash Tavakoli, Vitaly Levdik, and Petar Kormushev. Time limits in reinforcement learning. In Proceedings of the International Conference on Machine Learning, 2018.
- Pohlen et al. [2018] Tobias Pohlen, Bilal Piot, Todd Hester, Mohammad Gheshlaghi Azar, Dan Horgan, David Budden, Gabriel Barth-Maron, Hado Van Hasselt, John Quan, Mel Večerík, Matteo Hessel, Rémi Munos, and Olivier Pietquin. Observe and look further: Achieving consistent performance on atari. arXiv, 2018.
- Prokhorov and Wunsch [1997] Danil V Prokhorov and Donald C Wunsch. Adaptive critic designs. IEEE transactions on Neural Networks, 8(5):997–1007, 1997.
- Puterman [2014] Martin L Puterman. Markov decision processes: Discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- Romoff et al. [2019] Joshua Romoff, Peter Henderson, Ahmed Touati, Emma Brunskill, Joelle Pineau, and Yann Ollivier. Separating value functions across time-scales. Proceedings of the International Conference on Machine Learning, 2019.
- Ross [2014] Sheldon M Ross. Introduction to probability models. Academic press, 2014.
- Schaul et al. [2015] Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In Proceedings of the International Conference on Learning Representations, 2015.
- Schulman et al. [2015a] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of the International Conference on Machine Learning, 2015a.
- Schulman et al. [2015b] John Schulman, Philipp Moritz, Sergey Levine, Michael Jordan, and Pieter Abbeel. High-dimensional continuous control using generalized advantage estimation. arXiv preprint arXiv:1506.02438, 2015b.
- Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv, 2017.
- Silver et al. [2014] David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In Proceedings of the International Conference on Machine Learning, 2014.
- Sinha et al. [2020] Samarth Sinha, Jiaming Song, Animesh Garg, and Stefano Ermon. Experience replay with likelihood-free importance weights. arXiv, 2020.
- Sutton [1995] Richard S Sutton. TD models: Modeling the world at a mixture of time scales. In Proceedings of the International Conference on Machine Learning. 1995.
- Sutton and Barto [2018] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT Press, 2018.
- Sutton et al. [2000] Richard S Sutton, David A McAllester, Satinder P Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems, 2000.
- Tang et al. [2020] Yunhao Tang, Michal Valko, and Rémi Munos. Taylor expansion policy optimization. In Proceedings of the International Conference on Machine Learning, 2020.
- Todorov et al. [2012] Emanuel Todorov, Tom Erez, and Yuval Tassa. MuJoCo: A physics engine for model-based control. In International Conference on Intelligent Robots and Systems, 2012.
- Van Hasselt et al. [2016] Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.
- van Hasselt et al. [2019] Hado van Hasselt, John Quan, Matteo Hessel, Zhongwen Xu, Diana Borsa, and André Barreto. General non-linear Bellman equations. arXiv, 2019.
- Van Seijen et al. [2019] Harm Van Seijen, Mehdi Fatemi, and Arash Tavakoli. Using a logarithmic mapping to enable lower discount factors in reinforcement learning. In Advances in Neural Information Processing Systems, 2019.
- Xu et al. [2018] Zhongwen Xu, Hado P van Hasselt, and David Silver. Meta-gradient reinforcement learning. Advances in Neural Information Processing Systems, 2018.
- Ziebart et al. [2008] Brian D. Ziebart, Andrew L. Maas, J. Andrew Bagnell, and Anind K. Dey. Maximum entropy inverse reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2008.
APPENDICES: Taylor Expansions of Discount Factors
Appendix A Proofs
Proposition 3.1 (restated).
Proof.
Recall the Woodbury matrix identity, which in the form used here reads $(A - B)^{-1} = A^{-1} + A^{-1} B (A - B)^{-1}$ whenever the relevant inverses exist.
Recall the equality $V^{\pi,\gamma'} = (I - \gamma' P^\pi)^{-1} r^\pi$. By plugging in the Woodbury matrix identity with $A = I - \gamma P^\pi$ and $B = (\gamma' - \gamma) P^\pi$, this immediately shows
$V^{\pi,\gamma'} = V^{\pi,\gamma} + (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi\, V^{\pi,\gamma'}.$
Now, observe that the second term involves $V^{\pi,\gamma'}$. We can plug in the above expression for $V^{\pi,\gamma'}$ and invoke the Woodbury matrix identity again. This produces
By induction, it is straightforward to show that iterating the above procedure $K$ times produces the equalities in Eqn (9).
Consider the norm of the residual term. Since $P^\pi$ is a transition matrix, $\|P^\pi\|_\infty = 1$. As a result, $\|(I - \gamma P^\pi)^{-1}\|_\infty \le \frac{1}{1 - \gamma}$. This implies
$\big\| \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^{K+1} V^{\pi,\gamma'} \big\|_\infty \le \Big( \frac{\gamma' - \gamma}{1 - \gamma} \Big)^{K+1} \big\| V^{\pi,\gamma'} \big\|_\infty.$
When $\gamma < \gamma' < 1$, the residual norm decays exponentially and converges to $0$ as $K \to \infty$. This implies that the infinite series converges, yielding Eqn (10).
Additional considerations when $\gamma' = 1$.
When $\gamma' = 1$, in order to ensure finiteness of $V^{\pi,1}$, we assume the following two conditions: (1) the Markov chain induced by $\pi$ is absorbing; (2) for any absorbing state $x$, $r(x, a) = 0$. Without loss of generality, assume there exists a single absorbing state. In general, the transition matrix can then be decomposed as follows [Grinstead and Snell, 2012, Ross, 2014],
$P^\pi = \begin{pmatrix} P_{\mathrm{tr}} & b \\ 0 & 1 \end{pmatrix},$
where $P_{\mathrm{tr}}$ contains the transition probabilities among transient states and $b$ the transition probabilities from transient states into the absorbing state. Here, the first $|\mathcal{X}| - 1$ states are transient and the last state is absorbing. For convenience, define $r_{\mathrm{tr}}$ as the reward vector restricted to the transient states. We provide a few lemmas below.
Lemma A.1.
The matrix $I - P_{\mathrm{tr}}$ is invertible and its inverse is $\sum_{t \ge 0} P_{\mathrm{tr}}^t$.
Proof.
Define the matrix $N := \sum_{t \ge 0} P_{\mathrm{tr}}^t$; then $N_{xy}$ is the expected number of visits to transient state $y$ starting from transient state $x$ before absorption. By definition of an absorbing chain, $N$ is finite. This further shows that $I - P_{\mathrm{tr}}$ is invertible, because $(I - P_{\mathrm{tr}}) N = N (I - P_{\mathrm{tr}}) = I$.
∎
Lemma A.2.
Let $f(P^\pi)$ be a matrix polynomial of the transition matrix $P^\pi$. Then
$f(P^\pi) = \begin{pmatrix} f(P_{\mathrm{tr}}) & c \\ 0 & f(1) \end{pmatrix},$
where $c$ is some vector.
Proof.
The intuition for the above result is that polynomial transformations preserve the block upper-triangular structure of $P^\pi$. In general, we can write $f(X) = \sum_{k=0}^{m} a_k X^k$ for some scalar coefficients $a_k$. First, note that, by induction, each power takes the form
$(P^\pi)^k = \begin{pmatrix} P_{\mathrm{tr}}^k & c_k \\ 0 & 1 \end{pmatrix}$
for some vector $c_k$. Taking the linear combination of these powers with coefficients $a_k$ shows that $f(P^\pi)$ is block upper triangular, with $f(P_{\mathrm{tr}})$ in the top-left block and $f(1)$ in the bottom-right entry. Since $f$ is a linear combination of such powers, we conclude the proof. ∎
Lemma A.3.
Under assumptions (1) and (2), one can write the value function as
$V^{\pi,1} = \sum_{t \ge 0} (P^\pi)^t\, r^\pi,$
where the infinite series on the RHS converges. In addition, for any transient state $x$, $V^{\pi,1}(x) = \big[ (I - P_{\mathrm{tr}})^{-1} r_{\mathrm{tr}} \big](x)$.
Proof.
Lemma A.4.
The following holds for any $\gamma \in [0, 1)$,
$(I - P_{\mathrm{tr}})^{-1} = \sum_{k=0}^{\infty} \big( (1 - \gamma)\, (I - \gamma P_{\mathrm{tr}})^{-1} P_{\mathrm{tr}} \big)^k\, (I - \gamma P_{\mathrm{tr}})^{-1}.$   (20)
Proof.
The first two lines derive from a straightforward application of the Woodbury matrix identity to $(I - P_{\mathrm{tr}})^{-1}$. This is valid because, by Lemma A.1, $I - P_{\mathrm{tr}}$ is invertible. The convergence of the infinite series is guaranteed for all $\gamma \in [0, 1)$. To see why, recall that the finiteness of $N = \sum_{t \ge 0} P_{\mathrm{tr}}^t$ implies that $P_{\mathrm{tr}}^t \to 0$ as $t \to \infty$. We can bound the residual,
∎
Finally, we combine the results above to prove the main claim. First, consider the absorbing state. Due to assumption (2), its value is $0$ under any discount factor. The matrix equalities in Proposition 3.2 hold trivially in this case.
In the following, we consider any transient state $x$. By Lemma A.3 and Lemma A.4,
Now, notice that, because the last entry of $r^\pi$ is zero (for the absorbing state),
Combining with Lemma A.2,
The residual term vanishes as $K \to \infty$, by arguments similar to those used for Lemma A.4. We hence conclude the proof. ∎
Proposition 3.2 (restated).
Proof.
The proof follows directly from the residual term in Proposition 3.1. Recall that the residual term takes the form $\big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^{K+1} V^{\pi,\gamma'}$.
Its infinity norm can be bounded as $\big\| \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^{K+1} V^{\pi,\gamma'} \big\|_\infty \le \big( \frac{\gamma' - \gamma}{1 - \gamma} \big)^{K+1} \| V^{\pi,\gamma'} \|_\infty$. ∎
Lemma 4.1 (restated).
Proof.
We will derive the above result in matrix form. Recall that by applying the Woodbury inversion identity to $(I - \gamma' P^\pi)^{-1}$, we get
$(I - \gamma' P^\pi)^{-1} = (I - \gamma P^\pi)^{-1} + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi (I - \gamma P^\pi)^{-1}.$
Then, right-multiplying the above equation by $r^\pi$,
$V^{\pi,\gamma'} = V^{\pi,\gamma} + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi\, V^{\pi,\gamma}.$
By indexing both sides at state $x_0$, we recover the following equality,
$V^{\pi,\gamma'}(x_0) = V^{\pi,\gamma}(x_0) + (\gamma' - \gamma)\, \delta_{x_0}^\top (I - \gamma' P^\pi)^{-1} P^\pi\, V^{\pi,\gamma}.$
To derive the expression for $w$, note that we can also write
$V^{\pi,\gamma'} = \big( I + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi \big)\, V^{\pi,\gamma},$
where we use the fact that $(I - \gamma' P^\pi)^{-1}$ commutes with $P^\pi$. Since we define $w$ such that $V^{\pi,\gamma'}(x_0) = w^\top V^{\pi,\gamma}$, we can derive $w$ by indexing the $x_0$-th row of the weight matrix $W := I + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi$. This directly leads to the desired result.
∎
Proposition 4.3 (restated).
Proof.
First, we assume $\gamma' < 1$; we will consider the extension to $\gamma' = 1$ at the end of the proof. Recall that the policy gradient takes the following form,
$\nabla_\theta V^{\pi_\theta,\gamma}(x) = \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{\infty} \gamma^t\, Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t) \,\Big|\, x_0 = x \Big].$
We plug in the above; the first partial gradient $\sum_x w(x)\, \nabla_\theta V^{\pi_\theta,\gamma}(x)$ then evaluates to an expectation of weighted local gradients along trajectories starting from $x_0$.
In the above, the coefficient at time $t$ can be calculated by carefully grouping terms across different time steps. It can be shown that the coefficient evaluates to $(\gamma')^t$ for all $t \ge 0$. This concludes the proof.
Alternative proof based on matrix notations.
We introduce an alternative proof based on matrix notation, as it will make the extension to $\gamma' = 1$ simpler. First, note that
$V^{\pi,\gamma'}(x_0) = \delta_{x_0}^\top \big( I + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi \big)\, V^{\pi,\gamma},$
where for the second equality we exploit the fact that $(I - \gamma' P^\pi)^{-1}$ commutes with $P^\pi$. Now, notice that the above rewrites as $V^{\pi,\gamma'}(x_0) = \delta_{x_0}^\top W\, V^{\pi,\gamma}$,
where $W = I + (\gamma' - \gamma)\, (I - \gamma' P^\pi)^{-1} P^\pi$ is the weight matrix; the weight vector $w$ is the $x_0$-th row of $W$. The first partial gradient corresponds to differentiating only through $V^{\pi,\gamma}$. To make the derivation clear in matrix notation, let $\theta_i$ be the $i$-th component of the parameter $\theta$. The $i$-th component of the first partial gradient across all states is then
Let $g_i$ be the vector of local gradients (for parameter $\theta_i$), indexed by states. Vanilla PG [Sutton et al., 2000] can be expressed as
We can finally derive the following,
Now, consider the $i$-th component of the above vector; it is equal to
Concatenating the gradients for all components of $\theta$, we conclude the proof.
Extension to the case $\gamma' = 1$.
Similar to the arguments made in the proof of Proposition 3.2, under assumptions A.1 and A.2, we can decompose the transition matrix as
where the last state is assumed to be absorbing. Though $I - \gamma' P^\pi$ with $\gamma' = 1$ is in general not necessarily invertible, the corresponding matrix restricted to the transient states is invertible (Lemma A.1). Since the reward at the absorbing state is zero, the value of the absorbing state is zero under any discount factor. As such, though the weight vector $w$ might be undefined when $\gamma' = 1$, the product $w^\top V^{\pi,\gamma}$ is well defined, with the contribution of the absorbing state being $0$. Since at time $T$ the chain enters the absorbing state, all local gradient terms that come after $T$ are zero. As a result, the first partial gradient coincides with the expression in Eqn (18).
∎
Proposition 4.4 (restated).
Proof.
Truncating the infinite series at the first $K + 1$ terms, we obtain the $K$th order expansion $w_K$,
Note that $w_0 = \delta_{x_0}$, since only the $k = 0$ term remains, and $\lim_{K \to \infty} w_K = w$, since the residual of the truncation vanishes geometrically.
This concludes the proof.
Appendix B Further results on Taylor expansions in the dual space
The dual representation of the value function in Eqn (6) is $V^{\pi,\gamma}(x_0) = (1 - \gamma)^{-1} (d^{\pi,\gamma}_{x_0})^\top r^\pi$, where $r^\pi$ is the vector of rewards and $d^{\pi,\gamma}_{x_0}$ is the visitation distribution starting at state $x_0$. Here, we abuse notation to denote both a function and a vector, i.e., $d^{\pi,\gamma}_{x_0}(x)$ can be interpreted as both a function evaluation and a vector indexing. Given such a dual representation, one natural question is whether the $K$th order expansion in the primal space corresponds to some approximation of the discounted visitation distribution $d^{\pi,\gamma'}_{x_0}$. Below, we answer in the affirmative.
Let $\delta_{x_0}$ be the one-hot distribution such that $\delta_{x_0}(x) = 1$ only when $x = x_0$. The visitation distribution satisfies the following balance equation in matrix form,
$d^{\pi,\gamma}_{x_0} = (1 - \gamma)\, \delta_{x_0} + \gamma\, (P^\pi)^\top d^{\pi,\gamma}_{x_0}.$   (21)
Inverting the equation, we obtain an explicit expression for the visitation distribution, $d^{\pi,\gamma}_{x_0} = (1 - \gamma)\, \big( I - \gamma (P^\pi)^\top \big)^{-1} \delta_{x_0}$. Following the techniques used in the derivation of Proposition 3.1, we can derive similar approximation results for the dual variables.
Proposition B.1.
The following holds for all $K \ge 0$,
$d^{\pi,\gamma'}_{x_0} = \frac{1 - \gamma'}{1 - \gamma} \sum_{k=0}^{K} \big( (\gamma' - \gamma)\, (I - \gamma (P^\pi)^\top)^{-1} (P^\pi)^\top \big)^k d^{\pi,\gamma}_{x_0} \;+\; \big( (\gamma' - \gamma)\, (I - \gamma (P^\pi)^\top)^{-1} (P^\pi)^\top \big)^{K+1} d^{\pi,\gamma'}_{x_0}.$   (22)
When $\gamma < \gamma' < 1$, the residual norm converges to $0$ as $K \to \infty$, which implies that the following holds,
$d^{\pi,\gamma'}_{x_0} = \frac{1 - \gamma'}{1 - \gamma} \sum_{k=0}^{\infty} \big( (\gamma' - \gamma)\, (I - \gamma (P^\pi)^\top)^{-1} (P^\pi)^\top \big)^k d^{\pi,\gamma}_{x_0}.$   (23)
Proof.
Starting from the fixed point equation (21) satisfied by $d^{\pi,\gamma'}_{x_0}$, we can apply the Woodbury inversion identity.
The norm of the residual term can be bounded as in the proof of Proposition 3.1, which gives the geometric decay.
∎
With a similar motivation as the expansions in the primal space, we define the $K$th order dual expansion by truncating to the first $K + 1$ terms,
$d_{K} := \frac{1 - \gamma'}{1 - \gamma} \sum_{k=0}^{K} \big( (\gamma' - \gamma)\, (I - \gamma (P^\pi)^\top)^{-1} (P^\pi)^\top \big)^k d^{\pi,\gamma}_{x_0}.$   (24)
The following result formalizes the connection between the $K$th order dual approximation $d_K$ to the visitation distribution and the primal approximation $V^{\pi}_{K}(x_0)$ to the value function at state $x_0$.
Proposition B.2.
The $K$th order primal and dual approximations are related by the following equality for any $K \ge 0$,
$V^{\pi}_{K}(x_0) = (1 - \gamma')^{-1}\, d_{K}^\top\, r^\pi.$   (25)
Proof.
The proof follows by expanding out the RHS of the equation. Recall the definition of $d_K$ in Eqn (24),
Now, multiplying the RHS by $r^\pi$ and recalling that $d^{\pi,\gamma}_{x_0} = (1 - \gamma)\, (I - \gamma (P^\pi)^\top)^{-1} \delta_{x_0}$, we conclude the proof.
∎
Proposition B.2 shows that, indeed, the $K$th order approximation of the value function is equivalent to the $K$th order approximation of the visitation distribution in the dual space. It is instructive to consider the special case $K = 0$.
Appendix C Details on Taylor expansion Q-function advantage estimation
Proposition C.1.
Let $Q^{\pi,\gamma}$ be the vector of Q-functions and $A^{\pi,\gamma}$ the corresponding advantage functions. Let $\tilde{P}^\pi$ be the state-action transition matrix such that $\tilde{P}^\pi_{(x,a),(x',a')} = p(x' \mid x, a)\, \pi(a' \mid x')$. Define the $K$th order Taylor expansion $Q^{\pi}_{K} := \sum_{k=0}^{K} \big( (\gamma' - \gamma)\, (I - \gamma \tilde{P}^\pi)^{-1} \tilde{P}^\pi \big)^k Q^{\pi,\gamma}$, and the corresponding advantage expansion by subtracting the state value baseline. Then $Q^{\pi}_{K}(x, a) \to Q^{\pi,\gamma'}(x, a)$ for any $(x, a)$ as $K \to \infty$.
Proof.
The proof follows closely that of the Taylor expansion based approximation to value functions in Proposition 3.2. Importantly, notice that here we define the state-action transition matrix $\tilde{P}^\pi$, which differs from the state transition matrix $P^\pi$ used in the derivation for value functions. In particular, $(\tilde{P}^\pi Q)(x, a) = \mathbb{E}_{x' \sim p(\cdot \mid x, a),\, a' \sim \pi(\cdot \mid x')}\big[ Q(x', a') \big]$ for any $Q$. Let $r$ be the vector reward function over state-action pairs. The Bellman equation for the Q-function is
$Q^{\pi,\gamma} = r + \gamma \tilde{P}^\pi Q^{\pi,\gamma}.$
Inverting the equation and applying the Woodbury inversion identity,
The above equality holds for all $\gamma < \gamma' < 1$ due to a similar convergence argument as in Proposition 3.2. Truncating the infinite series at order $K$, we arrive at the $K$th order expansion $Q^{\pi}_{K}$. The advantage expansion follows by construction. ∎
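A numerical sanity check of the state-action analogue (a toy construction of our own; $\tilde{P}^\pi$ is built exactly as defined in the proposition):

```python
import numpy as np

rng = np.random.default_rng(5)
nS, nA, gamma, gamma_p = 4, 3, 0.9, 0.95
p = rng.dirichlet(np.ones(nS), size=(nS, nA))      # p(x'|x,a), made up
pi = rng.dirichlet(np.ones(nA), size=nS)           # pi(a|x), made up
r = rng.uniform(size=(nS, nA)).reshape(-1)         # reward vector over (x,a) pairs

# State-action transition matrix: P_tilde[(x,a),(x',a')] = p(x'|x,a) * pi(a'|x').
P_tilde = np.einsum('xay,yb->xayb', p, pi).reshape(nS * nA, nS * nA)

Q = lambda g: np.linalg.solve(np.eye(nS * nA) - g * P_tilde, r)
M = (gamma_p - gamma) * np.linalg.solve(np.eye(nS * nA) - gamma * P_tilde, P_tilde)

Q_lo = Q(gamma)
Q_K, term = Q_lo.copy(), Q_lo.copy()
for _ in range(30):
    term = M @ term
    Q_K = Q_K + term
print(np.max(np.abs(Q_K - Q(gamma_p))))            # close to 0 for large K
```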
Appendix D Details on Taylor expansion update weighting
Proposition D.1.
The following is true for all $K \ge 0$,
Equivalently, the $K$th order Taylor expansion of the first partial gradient is
$w_K^\top\, \nabla_\theta V^{\pi_\theta,\gamma} = \mathbb{E}_{\pi_\theta}\Big[ \sum_{t=0}^{\infty} c_K(t)\, Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t) \Big],$   (26)
where $c_K(t) := \sum_{k=0}^{\min(K, t)} \binom{t}{k}\, \gamma^{t-k} (\gamma' - \gamma)^k$ is a weight function.
Proof.
We start with a few lemmas.
Lemma D.2.
For any $k \ge 1$ and $t \ge k$, define the set of $k$-dimensional non-negative integer vectors $(t_1, \ldots, t_k)$ such that $\sum_{i=1}^{k} (t_i + 1) \le t$, and let $N_{t,k}$ be the size of this set. Then $N_{t,k} = \binom{t}{k}$.
Proof.
By construction, the above set can be decomposed into smaller sets by fixing the value of $t_k$, i.e.,
Since these sets do not overlap, we have a recursive formula for $N_{t,k}$. Starting from the base case $N_{t,1} = t$, it is straightforward to prove by induction that $N_{t,k} = \binom{t}{k}$ for all $t \ge k$.
∎
Lemma D.3.
Consider $\big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^k$ for $k \ge 1$. It can be shown that
$\delta_{x_0}^\top \big( (\gamma' - \gamma)\, (I - \gamma P^\pi)^{-1} P^\pi \big)^k\, \nabla_\theta V^{\pi_\theta,\gamma} = \mathbb{E}_{\pi_\theta}\Big[ \sum_{t \ge k} \binom{t}{k}\, \gamma^{t-k} (\gamma' - \gamma)^k\, Q^{\pi_\theta,\gamma}(x_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid x_t) \Big].$
Proof.
Starting with the definition,
where for the second equality we use the fact that $(I - \gamma P^\pi)^{-1}$ commutes with $P^\pi$. Then consider the $k$th power,
Note that the last equality corresponds to a regrouping of terms in the infinite summation: instead of summing over the random times sequentially, we count the number of time configurations that land at a given total time $t$ and then sum over $t$. This count is exactly $N_{t,k}$ as defined in Lemma D.2. Hence the proof is complete. ∎
With the above lemmas, we are ready to prove the final result. We start by summing up all the differences of expansions,
In the above derivation, we have applied the transformation given by Lemma D.3. Then we have modified the bounds of the summation using the convention that $\binom{t}{k} = 0$ if $k > t$. If we index the $x_0$-th component of the vector, we recover the desired result.
∎
D.1 Further discussions on the objectives
Recall that the full gradient is
Consider the second term. We now derive this term in an alternative way which imparts more intuition on why its estimation is challenging. Note that
The second term of the full gradient is equivalent to differentiating through the above expression while keeping the value functions $V^{\pi,\gamma}$ fixed. This leads to the following gradient,
Here, we introduce an auxiliary quantity which is equivalent to a value function that treats $V^{\pi,\gamma}$ as rewards and uses discount factor $\gamma'$. Naturally, constructing an unbiased estimator of the second term of the full gradient requires estimating this quantity, which is difficult in at least two respects: (1) in practice, the value functions $V^{\pi,\gamma}$ are themselves estimated, which could introduce additional bias and variance; (2) as a premise of our work, estimating discounted values with discount factor $\gamma'$ is challenging, potentially due to high variance.
Appendix E Details on approximation errors with finite samples
Intuitively, as $K$ increases, the $K$th order expansion approximates $V^{\pi,\gamma'}$ more accurately in expectation. However, in practice, where all constituent terms of the approximation are built from the same batch of data, the variance might negatively impact the accuracy of the estimate.
To formalize such intuitions, we characterize the bias-variance trade-off under the phased TD-learning framework [Kearns and Singh, 2000]. Consider estimating the value function under discount $\gamma$ with estimator $\hat{V}_n^{\pi,\gamma}$. At each iteration $n$, let $\Delta_n$ be the maximum absolute error of the value function estimates. Assume that from each state there are $m$ independent trajectories generated under $\pi$. Kearns and Singh [2000] show that commonly used TD-learning methods (e.g., TD($\lambda$)) have error bounds of the following form, with probability at least $1 - \delta$,
$\Delta_{n+1} \le \beta\, \Delta_n + \epsilon_m.$   (27)
Here, the factor $\epsilon_m$ is an error term which characterizes the errors arising from the finite sample size $m$. As $m \to \infty$, $\epsilon_m \to 0$; the constant $\beta < 1$ is a contraction coefficient that shows how fast the error decays in expectation.
With the calculation of these estimators as a subroutine, we construct the $m$-sample $K$th order estimator,
$\hat{V}_{K,n}(x_0) := \hat{V}_n^{\pi,\gamma}(x_0) + \sum_{k=1}^{K} \Big( \frac{\gamma' - \gamma}{1 - \gamma} \Big)^k \frac{1}{m} \sum_{j=1}^{m} \hat{V}_n^{\pi,\gamma}\big( x^{(j)}_{t_k} \big),$   (28)
where $x^{(j)}_{t_k}$ is sampled at the random time $t_k$, as in Algorithm 1. Note that if $K = 0$, Eqn (28) reduces to $\hat{V}_n^{\pi,\gamma}(x_0)$, the estimator analyzed by [Kearns and Singh, 2000]. We are interested in the error measured against the value function of discount $\gamma'$. The following summarizes how errors propagate across iterations.
Proposition E.1.
Assume all samples are generated independently, and define the contraction factor $\beta_K$ as below. Then, with high probability (obtained by a union bound over the $K + 1$ constituent estimators), the following holds,¹
¹ The error bounds could be further improved, e.g., by adapting the concentration bounds at different steps $k$. Note that the purpose of the bound is to illustrate the bias-variance trade-off induced by the Taylor expansion order $K$.
(29) |
where the constants are defined in the proof below, and the expected gap error is the approximation error bounded in Proposition 3.2.
Proof.
Recall the result from [Kearns and Singh, 2000]: with probability at least $1 - \delta$, the following holds,
In the following, we condition the analysis on the event that the above inequality holds. Now, using the phased TD estimator as a subroutine, define the estimator for the $K$th Taylor expansion as in Eqn (28).
Define the error of this estimator, measured against the value function with the higher discount factor $\gamma'$. Consider, for a given starting state $x_0$,
Now, we bound each term in the equation above. Recall the definition of $\Delta_n$. The second term is bounded as follows,
The third term is bounded by applying concentration bounds. Recall that the estimator decomposes into $K + 1$ constituent estimators, each being an average over $m$ i.i.d. samples drawn from the corresponding step visitation distribution. Applying similarly naive techniques as in [Kearns and Singh, 2000], we bound each of the terms individually and then take a union bound over all terms. This implies that, with the stated probability, the following holds,
Aggregating all results, we have
This holds with the probability stated in the proposition.
∎
Bias-variance trade-off via $K$.
The error terms come from two parts: the first term contains the errors of the subroutine estimator and their propagation through the sampling of the $K$th order approximations (shown via the multiplier). This first term also contains a concentration term that scales with $K$, which shows that the variance of the overall estimator grows with $K$. This first error term scales with the sampling noise and vanishes as the number of samples $m$ increases. The second term is due to the gap between the expected $K$th order Taylor expansion and $V^{\pi,\gamma'}$, which decreases with $K$ and does not depend on the sample size $m$. The new contraction coefficient is $\beta_K$; it can be shown that the error still contracts, although the contraction becomes slower as $K$ increases. For example, for TD($\lambda$), the base contraction coefficient admits a standard closed form.
Appendix F Further experiment details
Below, we provide further details on experiment setups along with additional results.
F.1 Further details on the toy example
We presented a toy example that highlighted the trade-off between bias and variance, mediated by the order parameter $K$. Here, we provide further details of the experiments.
Toy MDP.
We consider tabular MDPs with a small number of states and actions. The transition table $p(\cdot \mid x, a)$ is drawn from a Dirichlet distribution with concentration parameter $\alpha$. Here, $\alpha$ is chosen such that the MDP is not very communicative (i.e., each next-state distribution concentrates only on a few states). The rewards are random, with mean rewards drawn once at the start and fixed for the problem.
F.2 Deep RL algorithms
Proximal policy optimization (PPO).
PPO [Schulman et al., 2017] implements a stochastic actor as a Gaussian distribution with state-conditional mean $\mu_\theta(x)$ and a global standard deviation $\sigma$, and a value function critic $V_\phi$. The behavior policy is the previous policy iterate. The policy is updated by ascending a clipped surrogate objective built from importance ratios and advantage estimates.² The advantages are estimated using generalized advantage estimation (GAE, [Schulman et al., 2015b]). Value functions are trained by minimizing the squared error against empirical returns. Both sets of parameters are trained with the Adam optimizer [Kingma and Ba, 2014]. We adopt the other default hyper-parameters in [Dhariwal et al., 2017]; for details, please refer to the code base.
² The exact PPO update is more complicated than this; refer to [Schulman et al., 2017] for the exact formula.
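For concreteness, a schematic numpy sketch of the clipped surrogate described above (a simplification of the actual update; see the footnote and the code base):

```python
import numpy as np

def ppo_clipped_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Schematic PPO clipped surrogate: mean_t min(rho_t * A_t, clip(rho_t) * A_t),
    where rho_t = exp(logp_new - logp_old) is the importance ratio."""
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return float(np.mean(np.minimum(ratio * advantages, clipped * advantages)))

# Toy usage with made-up per-timestep quantities.
rng = np.random.default_rng(0)
logp_old = rng.normal(size=64)
logp_new = logp_old + 0.05 * rng.normal(size=64)
advantages = rng.normal(size=64)
print(ppo_clipped_objective(logp_new, logp_old, advantages))
```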
Trust region policy optimization (TRPO).
TRPO [Schulman et al., 2015a] implements the same actor-critic pipeline as PPO; the difference is in the updates. Instead of enforcing a soft clipping constraint, TRPO enforces a strict KL-divergence constraint between consecutive policies. The policy gradient is computed from the advantage estimates, and the final update is constructed by approximately solving a constrained optimization problem; see [Schulman et al., 2015a] for details. The scale of the final update is found through a line search, to ensure that the KL-divergence constraint is satisfied. The implementation is based on [Achiam and OpenAI, 2018].
F.3 Deep RL architecture
Across all algorithms, the policy has a parameterized mean $\mu_\theta(x)$ and a single global standard deviation $\sigma$. The mean is a 2-layer neural network; the output layer does not have an activation function. The value function is likewise a 2-layer neural network whose output layer has no activation function.
F.4 Additional deep RL experiment results
F.4.1 Taylor expansion Q-function estimation: ablation study on $\eta$
Recall that throughout the experiments, we construct the new Q-function estimator as a mixture of the default estimator and the Taylor expansion Q-function estimator, controlled by a constant $\eta$. In particular, the final Q-function estimator is
We choose $\eta$ such that it balances the numerical scales of the two combined estimators. In our implementation, we find that the algorithm performs more stably when $\eta$ is small in absolute scale. In Figure 5(a)-(b), we show the ablation study on the effect of $\eta$, where we vary $\eta$ over a range of values. The y-axis shows the performance normalized against the PPO baseline (which is equivalent to $\eta = 0$), so that the PPO baseline corresponds to a constant reference level.
Overall, we see that $\eta$ impacts performance differently on different tasks. For example, on HalfCheetah(B), better performance is achieved with larger values of $\eta$; this is consistent with the observation that PPO with a larger discount factor also achieves better performance on this task. On Ant(B), however, as $\eta$ increases from zero, the performance increases marginally before degrading. In Figure 5, we also show the median and mean performance across all tasks. Note that, in general, the average performance increases as $\eta$ increases from zero, but later starts to decay a bit. When accounting for the performance variance across all tasks, we chose a small fixed value of $\eta$ throughout the experiments in the main paper.
Further details on computing .
Below we assume . In Algorithm 4, we showed that we can construct unbiased estimates using simpler quantities as building blocks. With a random time, the estimator takes the following form
However, since the estimator is based on a single random time, it can have high variance. To reduce variance, we propose the following procedure: given the target state-action pair, we compute the estimate as
The above estimator marginalizes over the random time, which should reduce variance compared to the random-time-based estimate in Algorithm 4. However, the estimate then requires computing cumulative sums over an infinite horizon (or, in general, a long horizon), which might be computationally expensive. To mitigate this, we propose to truncate the above summation after a fixed number of steps, chosen to trade off computational efficiency against variance. Note that this estimator was previously introduced in [Tang et al., 2020] for off-policy learning.
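A minimal sketch of the truncation is shown below; `per_step_terms` stands for the building-block quantities from Algorithm 4 evaluated along the trajectory (their exact form is omitted here), and the geometric weighting and truncation length `K` are assumptions made for illustration.

```python
import numpy as np

def truncated_marginalized_estimate(per_step_terms, gamma, K=16):
    # Replace the (possibly infinite) sum over time by its first K terms.
    terms = np.asarray(per_step_terms[:K], dtype=float)
    # Geometric weights stand in for marginalizing over the random time up
    # to the truncation point (assumed form, see the text above).
    weights = gamma ** np.arange(len(terms))
    return float(np.sum(weights * terms))
```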
F.4.2 Taylor expansion update weighting: ablation on
In Figure 5(c)-(d), we carry out an ablation study on the effect of the interpolation parameter used for update weighting. Recall that this parameter interpolates between two extremes: at one end it recovers the vanilla PG [Sutton et al., 2000], while at the other it recovers the deep RL heuristic update. We expect intermediate values to achieve a trade-off between the bias and variance of the overall update.
In Figure 5(c), we see the effect on individual environments, which is case dependent. For HalfCheetah(G), larger values improve performance; for Walker(G), the improvement is less prominent over a large range of values. When aggregating the performance metric in Figure 5(d), we see that intermediate values indeed peak in performance: on average, two intermediate settings achieve locally optimal mean performance, and one of them also achieves the locally optimal median performance.
Note on how the practical updates impact the effect of .
Based on our theoretical analysis, at one extreme the update should recover the vanilla PG [Sutton et al., 2000], which is generally considered too conservative for the undiscounted objective in Eqn (1). However, in practice, as shown in Figure 5(d), the algorithm does not severely underperform even at this extreme. We speculate that this is because practical implementations of PG updates use mini-batches of data instead of full trajectories. As a result, the relative weights of the local gradients are effectively self-normalized, with the normalization summing over the time steps in a sampled mini-batch. The self-normalized weights are larger in absolute scale than the raw discounted weights, and partly offset the effect of an otherwise aggressive discount.
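The following small numpy sketch illustrates this self-normalization effect on a sampled mini-batch; the discount value and the time steps are illustrative.

```python
import numpy as np

def self_normalized_weights(time_steps, gamma=0.99):
    # Raw discount weights gamma**t for the time steps in the mini-batch,
    # divided by their sum; normalization raises their absolute scale.
    w = gamma ** np.asarray(time_steps, dtype=float)
    return w / np.sum(w)

# Example: deep into a trajectory the raw weights are tiny, but the
# normalized weights remain on the order of 1 / batch_size.
print(self_normalized_weights([400, 500, 600]))
```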
F.4.3 Comparison to results in [Romoff et al., 2019]
Recently, Romoff et al. [2019] derived a recursive relation between the differences of value functions defined with different discount factors, as shown in Lemma 4.1. Given a sequence of discount factors, they derive a value function estimator based on recursive bootstraps of value function differences. Because they aim at recovering the exact value functions, their estimator can be interpreted as a Taylor expansion carried out without truncation.
In contrast to their motivation, we focus on the trade-off achieved by intermediate expansion orders. We argued that truncating too aggressively might make the estimate too conservative, while aiming for the exact value function might be challenging due to the variance induced by the recursive bootstrapping procedure. Though it is not straightforward to show theoretically, we conjecture that using the Taylor expansion Q-function estimator without truncation is as difficult as directly estimating the target value function.

Empirical comparison.
The base algorithm of [Romoff et al., 2019] is PPO [Schulman et al., 2017]. Their algorithm uses recursive bootstraps to estimate Q-functions and advantage functions; the new estimate is used as a direct plug-in replacement in the PPO algorithm. We run experiments with the open source implementation of [Romoff et al., 2019] from the original authors (see https://github.com/facebookresearch/td-delta). We evaluate the algorithm's performance on continuous control benchmark tasks, applying the default configurations from the code base with the minimal changes needed to run on continuous problems (note that [Romoff et al., 2019] focused on a few discrete control problems). Overall, we find that the algorithm does not learn stably (see Figure 4).
Appendix G Extensions of update weighting techniques to off-policy algorithms
Below, we show that the techniques developed in this paper can be extended to off-policy learning algorithms. We provide details of the theoretical derivations and algorithms, as well as experimental results.
G.1 Off-policy actor-critic algorithms
Off-policy actor-critic algorithms [Mnih et al., 2015, Lillicrap et al., 2015] maintain a deterministic policy and a Q-function critic. The agent takes exploratory actions in the environment and saves the data into a common replay buffer. At training time, the algorithm samples data from the replay buffer to update parameters. The policy is updated via the deterministic policy gradient [Silver et al., 2014], where the state distribution is implicitly defined by the past behavior policy.
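A minimal torch sketch of this deterministic policy gradient update is shown below, assuming the critic takes the concatenated state-action vector as input (as in Appendix G.2); the function names and interfaces are illustrative.

```python
import torch

def dpg_policy_loss(policy, critic, states):
    # Ascend the critic's value at the policy's own action; minimizing the
    # negation implements the deterministic policy gradient [Silver et al., 2014].
    actions = policy(states)
    q_values = critic(torch.cat([states, actions], dim=-1))
    return -q_values.mean()
```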
Deep deterministic policy gradient (DDPG).
DDPG [Lillicrap et al., 2015] maintains a deterministic policy network and a Q-function critic. The algorithm explores by executing a noise-perturbed version of the policy, and saves the collected data into a replay buffer. At training time, behavior data is sampled uniformly from the replay buffer. The critic is updated via temporal-difference learning, minimizing a squared Bellman error whose bootstrapped targets are computed with delayed (target) copies of the policy and critic networks [Mnih et al., 2015]. The policy is updated by maximizing the critic's value of the policy's own action with respect to the policy parameters. Both the policy and critic parameters are trained with the Adam optimizer [Kingma and Ba, 2014]. We adopt the other default hyper-parameters in [Achiam and OpenAI, 2018]; for further details, please refer to the code base.
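For concreteness, the following is a hedged torch sketch of the critic targets and target-network updates described above; `gamma` and the soft-update rate `tau` are illustrative, and the soft (Polyak) update shown is the common DDPG convention rather than a quote of our exact implementation.

```python
import torch

@torch.no_grad()
def ddpg_targets(rewards, next_states, dones, target_policy, target_critic, gamma=0.99):
    # Bootstrapped targets computed with delayed (target) copies of the
    # policy and critic networks.
    next_actions = target_policy(next_states)
    next_q = target_critic(torch.cat([next_states, next_actions], dim=-1)).squeeze(-1)
    return rewards + gamma * (1.0 - dones) * next_q

def soft_update(net, target_net, tau=0.005):
    # Polyak averaging of the online parameters into the target parameters.
    for p, tp in zip(net.parameters(), target_net.parameters()):
        tp.data.mul_(1.0 - tau).add_(tau * p.data)
```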
Twin-delayed DDPG (TD3).
TD3 [Fujimoto et al., 2018] follows the same overall training pipeline as DDPG, with stabilizing modifications that include clipped double Q-learning (taking the minimum over two critics when forming targets), delayed policy updates, and target policy smoothing.
Soft actor-critic.
SAC [Haarnoja et al., 2018] adopts a similar training pipeline and architecture as TD3. A major conceptual difference is that SAC is based on the maximum-entropy formulation of RL [Ziebart et al., 2008, Fox et al., 2016]: the Q-function is augmented with an entropy regularization bonus, and the policy is optimized such that it does not collapse to a deterministic policy.
G.2 Architecture
All algorithms share the same architecture. The policy network takes the state as input and is a 2-layer neural network with nonlinear hidden activations; its output is squashed to comply with the action space boundaries. The critic takes the concatenated state-action vector as input and is a 2-layer neural network with nonlinear hidden activations; its output does not have any activation function.
For stochastic policies, the policy network parameterizes a Gaussian: in addition to the mean, it parameterizes a log standard deviation vector with a neural network of the same architecture as above. The stochastic output is a reparameterized function of the mean, the standard deviation, and an external noise draw. Finally, the action output is squashed to comply with the action boundaries [Haarnoja et al., 2018].
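A minimal torch sketch of this reparameterized, squashed Gaussian policy is shown below; the tanh squashing follows [Haarnoja et al., 2018], while the function interface is an assumption.

```python
import torch

def sample_action(mean, log_std):
    # Reparameterized sample: a deterministic function of the network outputs
    # and an external noise draw, so gradients flow through mean and log_std.
    noise = torch.randn_like(mean)
    pre_squash = mean + noise * log_std.exp()
    # Squash to comply with the action space boundaries.
    return torch.tanh(pre_squash)
```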
G.3 Algorithm details for update weighting
To derive an update based on update weighting, we start with the undiscounted on-policy objective. Given behavior data generated under the behavior policy, we abuse notation and also use the same symbol to denote the state distribution under the behavior policy (usually implicitly defined by sampling from a replay buffer). By rewriting the objective with importance sampling (IS),
(30)
we derive an off-policy learning objective. By dropping certain terms (see [Degris et al., 2012] for the justification of dropping such terms), we can derive the IS-based gradient update
To render the update feasible, we need to estimate the density ratio. Inspired by [Sinha et al., 2020], we propose to maintain a fast replay buffer containing the most recently sampled data; an estimator is then trained to estimate the density ratio between the distribution implicitly defined by the fast buffer and the distribution implicitly defined by the original buffer. See Appendix F for further details. The full off-policy actor-critic algorithm is summarized in Algorithm 5. In practice, we implement an undiscounted uniform distribution instead of its discounted counterpart. The main motivation is that this distribution is much easier to specify, as it corresponds to sampling from the replay buffer uniformly without discounting, as explained below.
As an important observation for practical implementations, note that
when the discount parameter is set appropriately, the second term of the distribution is proportional to a uniform distribution over states on sampled trajectories, without discounting. This makes implementations much simpler, and we will see that it can also lead to performance gains. We leave Taylor-expansion-based extensions of this method to future work.
Details on training the density estimator .
The density estimator is parameterized with exactly the same architecture as the policy network, except that its output activation is replaced so that the predicted ratio is properly constrained. The off-policy actor-critic algorithm maintains an original replay buffer; in addition, we maintain a smaller, fast replay buffer, which is used for saving the most recently generated data points. For ease of analysis, assume that the data sampled from the original buffer follow the behavior distribution, while the data sampled from the fast buffer follow the (approximately) on-policy distribution.
To learn the ratio, we adopt a simple discriminative loss function as follows
The optimal solution to this loss recovers the desired ratio (assuming enough expressiveness). The density estimator is then used for weighting the policy update: when sampling a batch of data from the buffer, a weight is computed for each data point, and the weights are normalized across the batch with an inverse temperature. The normalized weights are then used to weight the per-sample gradients, so that the policy is updated with a weighted combination of local updates.
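The following torch sketch illustrates one standard discriminative formulation of this training together with the batch-normalized weighting; the estimator `w_net` (assumed to have a sigmoid output), the fast/original buffers, and the inverse temperature `beta` are stand-ins for the elided symbols, and the exact loss used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def discriminator_loss(w_net, original_states, fast_states):
    # Binary-classification loss: recent (fast-buffer) states are positives,
    # original-buffer states are negatives. The optimal w_net recovers a
    # monotone transform of the density ratio between the two distributions.
    pos = w_net(fast_states)
    neg = w_net(original_states)
    return -(torch.log(pos + 1e-8).mean() + torch.log(1.0 - neg + 1e-8).mean())

def batch_weights(w_net, states, beta=1.0):
    # Normalize per-sample weights across the batch with inverse temperature
    # beta; beta -> 0 anneals the weights toward a uniform distribution.
    logits = beta * torch.log(w_net(states).squeeze(-1) + 1e-8)
    return F.softmax(logits, dim=0)
```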
We carry out the update in Algorithm 2, where the density estimator is trained based on a discriminative loss between the fast and original buffers. For any given batch of data, we normalize the predictions with a temperature hyper-parameter, similarly to the implementation in [Sinha et al., 2020]. Annealing the temperature moves the weights closer to a uniform distribution and tends to stabilize the algorithm. See Appendix F for further details.
Discussion on relations to other algorithms.
Previous work focuses on re-weighting transitions to stabilize the training of critics. For example, prioritized replay [Schaul et al., 2015] prioritizes samples with high Bellman errors. Instead, Algorithm 2 reweights samples to speed up the training of the policy. Our observation above also implies that when sampling from the replay buffer for training the density estimator, it is not necessary to discount the transitions. This is in clear contrast to prior work such as [Sinha et al., 2020], which proposes to train an estimator of the fully discounted visitation distribution under the current policy, based on a derivation that optimizes a discounted objective.
Results.
We build the algorithmic improvements on top of TD3 [Fujimoto et al., 2018] and SAC [Haarnoja et al., 2018], and name the corresponding algorithms TD3() and SAC() respectively. We compare with TD3, SAC, and DDPG [Lillicrap et al., 2015], all of which are off-policy algorithms.
We first compare TD3() with TD3 in Figure 6. To highlight the sample efficiency of off-policy methods, we also include PPO as a baseline. Across all four presented tasks, we see that TD3() performs similarly to, or marginally outperforms, the TD3 baseline. To make the comparison of final performance concrete, we report the final score of each algorithm in Table 1. As a default baseline, we also show the results of DDPG reported in [Achiam and OpenAI, 2018]. Overall, TD3() provides a modest yet consistent boost over baseline TD3.
We then compare SAC() with SAC in Figure 7 and Table 1. SAC() provides marginal performance gains on Walker2d and Ant, while it is slightly outperformed by baseline SAC on HalfCheetah and Hopper. We speculate that this is partly because the hyper-parameters of baseline SAC are well tuned on HalfCheetah, and it is difficult to achieve further significant gains without an exhaustive hyper-parameter search. Overall, SAC() is competitive with SAC.
Tasks | TD3() | TD3 | DDPG-v1 |
---|---|---|---|
Ant(G) | | | |
HalfCheetah(G) | | | |
Walker2D(G) | | | |
Hopper(G) | | | |

Tasks | SAC() | SAC | DDPG-v2 |
---|---|---|---|
Ant(G) | | | |
HalfCheetah(G) | | | |
Walker2D(G) | | | |
Hopper(G) | | | |