Adaptive Tree Backup Algorithms for
Temporal-Difference Reinforcement Learning
Abstract
Q(σ) is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values of the interpolation parameter σ perform better in practice, and therefore it is commonly believed that σ functions as a bias-variance trade-off parameter to achieve these improvements. In our work, we disprove this notion, showing that the choice of σ = 0 minimizes variance without increasing bias. This indicates that σ must have some other effect on learning that is not fully understood. As an alternative, we hypothesize the existence of a new trade-off: larger σ-values help overcome poor initializations of the value function, at the expense of higher statistical variance. To automatically balance these considerations, we propose Adaptive Tree Backup (ATB) methods, whose weighted backups evolve as the agent gains experience. Our experiments demonstrate that adaptive strategies can be more effective than relying on fixed or time-annealed σ-values.
Keywords:
Reinforcement Learning, Temporal-Difference Learning,
Adaptive Methods, Tree Backup, Expected Sarsa
1 Introduction
Unifying temporal-difference (TD), dynamic programming (DP), and Monte Carlo (MC) methods has helped illuminate trade-offs in reinforcement learning and has led to the development of better algorithms. Much work has focused on the length of backups as a unifying axis between TD and MC methods, in which these two classes can be understood as opposite extremes of the TD(λ) algorithm (Sutton, 1988; Sutton and Barto, 1998). In this view, the decay parameter λ serves not only as a conceptual link between these distinct methods, but also as a mechanism for managing the bias-variance trade-off (Kearns and Singh, 2000), where intermediate values of λ often deliver the best empirical performance (Sutton and Barto, 1998).
Recent research has begun to explore an orthogonal unification between TD and DP methods by varying the width of backups. This concept was formally introduced by Sutton and Barto (2018) in the Q(σ) algorithm; the method interpolates linearly, via a parameter σ, between Expected Sarsa (σ = 0) (Sutton and Barto, 1998) and Sarsa (σ = 1) (Rummery and Niranjan, 1994). As such, Q(σ) can be seen as a spectrum of algorithms whose backup widths lie somewhere between those of a sample backup and a full (expected) backup. De Asis et al. (2018) conducted further theoretical and empirical analysis of Q(σ), arriving at two major hypotheses: σ is a bias-variance trade-off parameter, and dynamically adapting σ can lead to faster convergence. These hypotheses appeared to be corroborated by their experiments, where intermediate σ-values performed well, and exponentially decaying σ during training performed even better.
In our work, we re-examine these two hypotheses regarding Q(σ), and ultimately show that they are incorrect. In particular, we prove that the choice of σ = 0 minimizes the variance of Q(σ) without increasing the bias. It follows that the choice of σ does not constitute a bias-variance trade-off, since both bias and variance can be simultaneously optimized without competition. Furthermore, we prove that the contraction rate and fixed point of the Q(σ) operator are independent of σ, suggesting again that the choice of σ = 0 is preferable for its variance-reduction effect. These findings are intriguing given the empirical results obtained by De Asis et al. (2018), which did not demonstrate a clear superiority of σ = 0. This apparent discrepancy between theory and practice suggests that other factors are responsible for the observed benefits of Q(σ), and that our understanding of the algorithm must be revised accordingly.
We therefore advance a new hypothesis to explain the effectiveness of intermediate σ-values: conducting partial backups (σ > 0) helps escape poor initializations of the value function by reducing the chance that previously unvisited state-action pairs are considered during the backup. The value estimates of these pairs could be arbitrarily incorrect, slowing initial learning; however, as training progresses and the majority of state-action pairs have been visited, σ = 0 starts to become more favorable for its variance reduction. This explanation matches the intuition of the time-decayed schedule utilized by De Asis et al. (2018), in which σ = 1 is desirable early in training and σ = 0 is desirable later.
This new hypothesis suggests that backup width should be adjusted based on the number of visitations to each state-action pair, with more emphasis placed on frequently sampled pairs. We therefore develop a new family of TD algorithms that we call Adaptive Tree Backup (ATB) methods, which generalize Tree Backup (Precup et al., 2000) to arbitrary weightings of the value estimates. We highlight two instances of ATB methods (Count-Based and Policy-Based) that we believe to be promising. The proposed methods have the advantage of eliminating the hyperparameter σ, as they adapt automatically to the data experienced by the agent. Finally, our experiments demonstrate that Policy-Based ATB outperforms the exponential-decay schedule for σ introduced by De Asis et al. (2018).
2 Background
We model the environment as a Markov Decision Process (MDP) described by the tuple (S, A, P, R, γ): S is a finite set of environment states, A is a finite set of actions, P is the transition function, and R is the reward function. An agent interacts with the environment by selecting an action a in state s with probability π(a|s), receiving a reward R(s, a) and changing the state to s′ with probability P(s′|s, a). Given the agent’s policy π, the objective is to estimate the expected discounted return Q^π(s, a) = E_π[ Σ_{t=0}^∞ γ^t R(s_t, a_t) | s_0 = s, a_0 = a ], where γ ∈ [0, 1] is the discount factor, since acting greedily with respect to Q^π is known to produce a better policy (Sutton and Barto, 1998).
Q(σ) (Sutton and Barto, 2018) is a method that iteratively improves its estimated value function Q according to

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ ( σ Q(s_{t+1}, a_{t+1}) + (1 − σ) Σ_a π(a|s_{t+1}) Q(s_{t+1}, a) ) − Q(s_t, a_t) ],    (1)
where σ ∈ [0, 1] is a user-specified hyperparameter. This hyperparameter interpolates between the exact on-policy expectation (Expected Sarsa, σ = 0) and noisy sample estimates of it (Sarsa, σ = 1), and has been hypothesized to serve as a bias-variance trade-off parameter that improves the contraction rate of the algorithm (De Asis et al., 2018). In the subsequent section, we analyze the theoretical properties of the Q(σ) update to investigate the validity of this hypothesis.
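To make update (1) concrete, the following is a minimal tabular sketch in Python; the array names, shapes, and function signature are our own illustrative assumptions rather than anything prescribed by Q(σ) itself.

```python
import numpy as np

def q_sigma_update(Q, pi, s, a, r, s_next, a_next, sigma, alpha, gamma):
    """One tabular Q(sigma) update for the transition (s, a, r, s_next, a_next).

    Q:  value estimates, array of shape [num_states, num_actions]
    pi: policy probabilities pi(a|s), array of shape [num_states, num_actions]
    """
    expected = np.dot(pi[s_next], Q[s_next])   # full (Expected Sarsa) backup
    sampled = Q[s_next, a_next]                # sample (Sarsa) backup
    target = r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
    Q[s, a] += alpha * (target - Q[s, a])      # TD step toward the mixed target
    return Q
```

Setting sigma = 0 recovers the Expected Sarsa update, while sigma = 1 recovers Sarsa.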
3 Analysis of Q(σ)
We investigate the effects that σ has on the Q(σ) algorithm in terms of convergence, contraction rate, bias, and variance. Our analysis dispels two misconceptions about Q(σ): that dynamically adjusting σ can achieve a faster contraction rate, and that σ functions as a bias-variance trade-off parameter.
3.1 Convergence and Contraction Rate
For notational conciseness when proving convergence, we represent the Q(σ) update (1) as an operator T_σ, where Q_{t+1} = Q_t + α_t (T_σ Q_t − Q_t). The Bellman operator for a policy π is defined as the affine function T^π Q = R + γ P^π Q, where (P^π Q)(s, a) = Σ_{s′} P(s′|s, a) Σ_{a′} π(a′|s′) Q(s′, a′). The Bellman operator is a contraction mapping (when γ < 1) that admits the unique fixed point Q^π, the unique solution to Q^π = T^π Q^π (Bellman, 1966).
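For reference, here is a vectorized sketch of one application of T^π to a tabular value function; the array names and shapes (R: expected rewards [S, A], P: transition probabilities [S, A, S], pi: policy [S, A]) are our own assumptions.

```python
import numpy as np

def bellman_operator(Q, R, P, pi, gamma):
    """One application of the Bellman operator T^pi to a tabular Q of shape [S, A]."""
    V = (pi * Q).sum(axis=1)    # V(s') = sum_a' pi(a'|s') Q(s', a')
    return R + gamma * P @ V    # (T^pi Q)(s, a) = R(s, a) + gamma * sum_s' P(s'|s, a) V(s')

# Repeated application contracts toward the fixed point Q^pi when gamma < 1:
#   Q = np.zeros_like(R)
#   for _ in range(500):
#       Q = bellman_operator(Q, R, P, pi, gamma)
```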
For policy evaluation, an operator T is considered convergent if and only if lim_{k→∞} T^k Q = Q^π for every Q ∈ ℝ^{|S|×|A|}, where T^k denotes k successive applications of T. Let w be zero-mean noise that represents the randomness due to the environment and policy. Since Q(σ) interpolates linearly between Expected Sarsa and Sarsa, we can express its operator as

T_σ Q = σ (T^π Q + w) + (1 − σ) T^π Q = T^π Q + σ w.    (2)
We can therefore write update (1) in vector notation as
Q_{t+1} = Q_t + α_t ( T^π Q_t + σ w_t − Q_t ).    (3)
Theorem 1.
Assume Σ_t α_t = ∞ and Σ_t α_t² < ∞. For any σ ∈ [0, 1], the sequence (Q_t) defined by (3) converges to Q^π almost surely.
Proof.
Iteration (3) is already in a form where Proposition 4.4 of Bertsekas and Tsitsiklis (1996) is applicable; T^π is a contraction mapping with respect to the maximum norm, and σ w_t is zero-mean noise. Furthermore, Var[σ w_t] ≤ Var[w_t] because σ ∈ [0, 1], and Var[w_t] is bounded by a constant since S and A are finite sets. Consequently, under the assumed stepsize conditions, the sequence converges to Q^π (the fixed point of T^π) almost surely. ∎
From our derived Q(σ) operator (2), we see that σ primarily functions as a noise attenuation parameter; however, it does not affect the expected update, which is inherently related to the Bellman operator. It follows that the contraction rate and fixed point of Q(σ) are identical to those of the Bellman operator, and that these properties are independent of σ.
3.2 Variance
Theorem 1 suggests that small values of σ have a beneficial variance-reduction effect, with no detrimental effects on convergence; however, the abstract noise process w makes it difficult to quantify this effect. In this section, we derive the exact variance of Q(σ) by extending existing results for Sarsa and Expected Sarsa from van Seijen et al. (2009).
Theorem 2.
Let Var_ES and Var_S be the variances of the Expected Sarsa and Sarsa updates, respectively. The variance of Q(σ) is given by the expression

Var_σ = (1 − σ²) Var_ES + σ² Var_S,    (4)

and this quantity is minimized when σ = 0.
Proof.
From the definition of Q(σ) as a linear interpolation, we can deduce that its variance has the form

Var_σ = (1 − σ)² Var_ES + σ² Var_S + 2σ(1 − σ) Cov_{ES,S},    (5)

where Cov_{ES,S} denotes the covariance between the Expected Sarsa and Sarsa updates. The Sarsa and Expected Sarsa updates share the same expected value μ (van Seijen et al., 2009). Using this fact, we show algebraically that Cov_{ES,S} = Var_ES by comparing the exact covariance calculation to the expression for Var_ES derived by van Seijen et al. (2009). Therefore, expression (5) reduces to

Var_σ = (1 − σ)² Var_ES + σ² Var_S + 2σ(1 − σ) Var_ES = (1 − σ²) Var_ES + σ² Var_S,

which is the desired result in expression (4). From van Seijen et al. (2009), Var_S ≥ Var_ES, and hence expression (4) is minimized when σ = 0. ∎
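As a side remark, the covariance identity above can also be obtained from the law of total covariance, assuming (as in update (1)) that the two updates are built from the same sampled reward r and next state s′. Writing the Sarsa and Expected Sarsa targets as X_S and X_ES (our notation), so that X_ES = E[X_S | r, s′]:

```latex
\begin{align*}
\operatorname{Cov}(X_{ES}, X_{S})
  &= \mathbb{E}\big[\operatorname{Cov}(X_{ES}, X_{S} \mid r, s')\big]
   + \operatorname{Cov}\big(\mathbb{E}[X_{ES} \mid r, s'],\, \mathbb{E}[X_{S} \mid r, s']\big) \\
  &= 0 + \operatorname{Cov}(X_{ES}, X_{ES})
   = \operatorname{Var}(X_{ES}),
\end{align*}
```

since X_ES is a deterministic function of (r, s′), which makes the conditional covariance vanish and gives E[X_S | r, s′] = X_ES.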
Theorem 2 shows that the variance of Q(σ) is minimal when σ = 0 and increases monotonically in σ, which is interesting because it supports the view of σ as a noise attenuation parameter from Theorem 1. Furthermore, since the expected update is unaffected by the choice of σ, σ cannot represent a bias-variance trade-off; the variance can be minimized with no apparent drawback. Our theory indicates that there must be a different explanation for the empirical success of Q(σ).
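As an illustration of Theorem 2 (separate from the experiments in Section 5), the following Monte Carlo sketch estimates the variance of the Q(σ) target in a toy one-step problem; all quantities (transition probabilities, rewards, value estimates) are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9
p_next = np.array([0.7, 0.3])              # P(s'|s, a) for a fixed (s, a)
r_mean = np.array([1.0, -1.0])             # deterministic reward per next state
Q_next = np.array([[1.0, -2.0, 0.5],       # current estimates Q(s', .)
                   [0.0,  3.0, -1.0]])
pi = np.array([[0.2, 0.5, 0.3],            # policy pi(.|s')
               [0.6, 0.1, 0.3]])

def target_samples(sigma, n=100_000):
    s2 = rng.choice(2, size=n, p=p_next)
    a2 = np.array([rng.choice(3, p=pi[s]) for s in s2])         # a' ~ pi(.|s')
    sarsa = r_mean[s2] + gamma * Q_next[s2, a2]                 # sampled backup
    expected = r_mean[s2] + gamma * (pi[s2] * Q_next[s2]).sum(axis=1)
    return sigma * sarsa + (1.0 - sigma) * expected             # Q(sigma) target

for sigma in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"sigma={sigma:.2f}  var={target_samples(sigma).var():.4f}")
```

The estimated variance grows monotonically with σ and is smallest at σ = 0, matching expression (4).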
4 Adaptive Tree Backup (ATB) Algorithms
Q(σ) does not address the standard bias-variance trade-off for temporal-difference learning, since the choice of σ = 0 minimizes variance without impacting bias. Why, then, did De Asis et al. (2018) observe empirical improvement when employing an intermediate σ-value? We conjecture that Expected Sarsa (σ = 0) struggles to overcome initialization error early in training (during which the majority of the state-action space is unexplored). Until every action has been sampled at least once in a given state, Expected Sarsa mixes one or more nonsensical value estimates into its full backup, potentially resulting in severe inaccuracies. In contrast, a sample-based method like Sarsa (σ = 1) is likely to visit previously taken actions (simply by definition of a probabilistic policy), thereby bootstrapping sooner from more accurate estimates, but also with additional noise due to sampling. This phenomenon helps explain why the time-decayed schedule for σ proposed by De Asis et al. (2018) is effective; excluding incorrect value estimates (σ = 1) is important when most state-action pairs have not been visited, but minimizing variance (σ = 0) becomes more important once all estimates are relatively accurate.
If our hypothesis is correct, then methods that dynamically increase the effective backup width based on state-action pair visitations should generally perform better than Q(σ) with a fixed σ-value or time-annealed schedule. We call these Adaptive Tree Backup (ATB) methods, as they generalize Tree Backup (Precup et al., 2000) to arbitrary backup weightings. For a given transition (s_t, a_t, r_t, s_{t+1}, a_{t+1}), ATB algorithms can be written in the generic form

Q(s_t, a_t) ← Q(s_t, a_t) + α [ r_t + γ Σ_a c_a Q(s_{t+1}, a) − Q(s_t, a_t) ],    (6)
where each c_a is a nonnegative (possibly random) coefficient. Note that if c_a = σ 1[a = a_{t+1}] + (1 − σ) π(a|s_{t+1}), where 1[·] is the indicator function, then we recover Q(σ) exactly. However, the generality of the coefficients permits myriad possibilities beyond the limited, one-dimensional spectrum spanned by Q(σ).
The backup associated with update (6) admits the correct fixed point Q^π only if the expected backup matches the on-policy expectation, i.e., E[Σ_a c_a Q(s_{t+1}, a)] = Σ_a π(a|s_{t+1}) Q(s_{t+1}, a). Even so, if we ensure that c_a → π(a|s_{t+1}) as t → ∞, then the ATB method will eventually converge to Q^π according to Theorem 1. We subsequently discuss two promising variants that do this.
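A generic implementation of update (6) only needs a coefficient vector for the successor state; the sketch below is our own and assumes the tabular NumPy conventions used earlier.

```python
import numpy as np

def atb_update(Q, s, a, r, s_next, coeffs, alpha, gamma):
    """Generic ATB update (6) with arbitrary nonnegative backup weights.

    coeffs: array of shape [num_actions] holding the coefficients c_a for s_next.
    """
    target = r + gamma * np.dot(coeffs, Q[s_next])
    Q[s, a] += alpha * (target - Q[s, a])
    return Q

def q_sigma_coeffs(pi, s_next, a_next, sigma):
    """Special case recovering Q(sigma): c_a = sigma*1[a = a_next] + (1 - sigma)*pi(a|s_next)."""
    c = (1.0 - sigma) * pi[s_next]
    c[a_next] += sigma
    return c
```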
4.1 Count-Based ATB
To match the intuition that an effective ATB strategy should ignore unvisited state-action pairs, we develop a method that weights state-action pairs according to their relative frequency of appearance. This can be accomplished by tracking a count n(s, a) for each state-action pair and then normalizing it as a proportion. This variant, which we call Count-Based ATB, defines the backup coefficients as

c_a = n(s_{t+1}, a) / Σ_{a′} n(s_{t+1}, a′).    (7)
By the law of large numbers, c_a → π(a|s_{t+1}) after infinitely many visitations to state s_{t+1}, and Theorem 1 (with σ = 0) applies in the limit. Beyond convergence to Q^π, there are several reasons why this particular ATB formulation is appealing. First, the count-based backup amounts to a maximum-likelihood one-step estimate of the on-policy expected backup, and in this sense is the “most reasonable” backup to perform based on previously observed experiences. Furthermore, frequency counts implicitly emphasize value estimates that are more reliable, while excluding those that have not changed since initialization. This means that Count-Based ATB automatically transitions from Sarsa to Expected Sarsa, with a different effective rate for each state, which could not be achieved easily with Q(σ) and does not require a user-specified hyperparameter schedule.
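A sketch of the Count-Based ATB coefficients (7); the uniform fallback for a never-visited successor state is our own assumption, since this edge case is not specified above.

```python
import numpy as np

def count_based_coeffs(counts, s_next):
    """Count-Based ATB coefficients (7): empirical action frequencies at s_next.

    counts: visitation counts n(s, a), array of shape [num_states, num_actions].
    """
    n = counts[s_next].astype(float)
    total = n.sum()
    if total == 0.0:
        # Assumption: s_next has never been visited; fall back to uniform weights.
        return np.full_like(n, 1.0 / len(n))
    return n / total
```

These coefficients can be passed directly to an update of the form (6), with counts[s, a] incremented on every visit.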
Unfortunately, these benefits are not without some drawbacks. Although the algorithm theoretically converges to Q^π in the limit, it is extremely unlikely that c_a = π(a|s_{t+1}) exactly after any finite amount of training. This means that Count-Based ATB may tend towards a fixed point other than Q^π in practice, which is undesirable even if it does represent a maximum-likelihood estimate according to past experience. Another, much more significant, problem occurs when the agent switches its policy from π to a different policy π′ during training. At this point, all of the previously collected counts will no longer reflect the correct on-policy distribution. The agent would either need to reset the counts to zero (thus discarding useful information), or wait many timesteps for the counts to accurately reflect the new policy π′. Both of these options are inefficient, motivating us to search for a better alternative in the next subsection.
4.2 Policy-Based ATB
We can resolve the issues of Count-Based ATB by directly utilizing knowledge of the policy π, as do Q(σ), Expected Sarsa, and Tree Backup. If some state-action pair (s, a) is visited for the first time, then we know that the eventual value of c_a will be π(a|s) according to Count-Based ATB. We can greatly accelerate convergence by immediately assigning this policy-based weight after the first visitation. Let H denote the unit step function, such that H(x) = 1 if x > 0 and H(x) = 0 otherwise. The backup coefficients for this new variant, which we call Policy-Based ATB, become

c_a = π(a|s_{t+1}) H(n(s_{t+1}, a)) / Σ_{a′} π(a′|s_{t+1}) H(n(s_{t+1}, a′)).    (8)
This definition clearly solves the two problems of Count-Based ATB. Unvisited state-action pairs are still ignored, as desired, but after each pair has been sampled at least once, the algorithm becomes equivalent to Expected Sarsa and starts converging to Q^π. This effectively bypasses the infinite-visitation requirement of Count-Based ATB. We can also see that the weighted backup instantly reflects changes to the policy, thanks to the explicit dependency on π in definition (8). For these reasons, we expect Policy-Based ATB to learn significantly faster than Count-Based ATB in a tabular setting; however, Policy-Based ATB may be harder to combine with function approximation in high-dimensional MDPs, where the need for generalization makes it difficult to determine whether a state should be considered visited or unvisited. Additionally, the reliance on an explicit policy model may be restrictive in some settings. We therefore foresee useful roles for both ATB variants.
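A corresponding sketch of the Policy-Based ATB coefficients (8); the fallback when no action at the successor state has been visited is again our own assumption.

```python
import numpy as np

def policy_based_coeffs(pi, counts, s_next):
    """Policy-Based ATB coefficients (8): policy weights restricted to visited actions.

    pi:     policy probabilities, array of shape [num_states, num_actions]
    counts: visitation counts n(s, a), same shape
    """
    visited = (counts[s_next] > 0).astype(float)   # unit step H(n(s_next, a))
    w = pi[s_next] * visited
    total = w.sum()
    if total == 0.0:
        # Assumption: no action at s_next visited yet; fall back to the full on-policy weights.
        return pi[s_next].copy()
    return w / total
```

Once every action at the successor state has been visited, this returns exactly pi[s_next], so the backup coincides with Expected Sarsa from that point onward.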
5 Experiments and Conclusion
To test whether adaptive strategies can outperform fixed σ-values and the handcrafted schedule proposed by De Asis et al. (2018), we compare our two ATB variants against Q(σ) in a 19-state deterministic random walk environment (Sutton and Barto, 2018) and an 11-state stochastic gridworld environment (Russell and Norvig, 2010). All methods use the same fixed step size α and discount factor γ. We train the agents for 200 episodes and plot the root-mean-square (RMS) error of the estimated value function relative to Q^π (Figure 1). We plot the error after each episode, averaging over 50 independent trials, where the shaded regions represent 99% confidence intervals. Policy-Based ATB performs strongly, learning significantly faster than all of the methods except Q(0), and even surpassing the handcrafted, dynamic schedule of De Asis et al. (2018), an impressive feat for an automatic method. Count-Based ATB outperforms Q(1), but its asymptotic performance in the deterministic random walk deteriorates due to the fixed-point bias we discussed earlier; however, this seems to have a less detrimental effect in the stochastic gridworld. Surprisingly, in contrast to the results of De Asis et al. (2018), we find that σ = 0 is the best choice of fixed σ-value in both environments.
The empirical success of our adaptive methods, particularly Policy-Based ATB, has important implications. First, it shows that data-driven adaptive strategies for adjusting the backup width can be more effective than Q(σ) with a fixed σ-value or a pre-specified schedule for annealing σ. In addition, since both of our methods gradually expand the effective backup width as new state-action pairs are visited, they further strengthen our hypothesis that initialization error in the value estimates is an important consideration when determining backup width. This also helps formalize the intuition of De Asis et al. (2018) that backup width should increase over time. Finally, these results show that σ = 0 is a good choice in practice, corroborating our theoretical predictions that it minimizes variance without negatively impacting the bias or contraction rate of Q(σ). Although σ does not represent a bias-variance trade-off in temporal-difference learning, it is clear that σ does play an important role in learning, and that analyzing methods in terms of how well their backups balance initialization error against variance is a promising pathway to further algorithmic improvements.
References
- Bellman (1966) Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
- Bertsekas and Tsitsiklis (1996) Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
- De Asis et al. (2018) Kristopher De Asis, J. Fernando Hernandez-Garcia, G. Zacharias Holland, and Richard S. Sutton. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
- Kearns and Singh (2000) Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In COLT, pages 142–147, 2000.
- Precup et al. (2000) Doina Precup, Richard S Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, pages 759–766, 2000.
- Rummery and Niranjan (1994) G. A. Rummery and M. Niranjan. On-line Q-Learning using connectionist systems. Technical Report TR 166, Cambridge University Engineering Department, Cambridge, England, 1994.
- Russell and Norvig (2010) Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2010.
- Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
- Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1st edition, 1998.
- Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
- van Seijen et al. (2009) Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184, 2009.