
Adaptive Tree Backup Algorithms for
Temporal-Difference Reinforcement Learning

Brett Daley*
Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115
[email protected]
Isaac Chan*
Khoury College of Computer Sciences
Northeastern University
Boston, MA 02115
[email protected]
*Equal contribution.
Abstract

$Q(\sigma)$ is a recently proposed temporal-difference learning method that interpolates between learning from expected backups and sampled backups. It has been shown that intermediate values for the interpolation parameter $\sigma \in [0,1]$ perform better in practice, and therefore it is commonly believed that $\sigma$ functions as a bias-variance trade-off parameter to achieve these improvements. In our work, we disprove this notion, showing that the choice of $\sigma = 0$ minimizes variance without increasing bias. This indicates that $\sigma$ must have some other effect on learning that is not fully understood. As an alternative, we hypothesize the existence of a new trade-off: larger $\sigma$-values help overcome poor initializations of the value function, at the expense of higher statistical variance. To automatically balance these considerations, we propose Adaptive Tree Backup (ATB) methods, whose weighted backups evolve as the agent gains experience. Our experiments demonstrate that adaptive strategies can be more effective than relying on fixed or time-annealed $\sigma$-values.


Keywords: Reinforcement Learning, Temporal-Difference Learning, Adaptive Methods, Tree Backup, Expected Sarsa

1 Introduction

Unifying temporal-difference (TD), dynamic programming (DP), and Monte Carlo (MC) methods has helped illuminate trade-offs in reinforcement learning and has led to the development of better algorithms. Much work has focused on the length of backups as a unifying axis between TD and MC methods, in which these two classes can be understood as opposite extremes of the TD($\lambda$) algorithm (Sutton, 1988; Sutton and Barto, 1998). In this view, the decay parameter $\lambda \in [0,1]$ serves not only as a conceptual link between these distinct methods, but also as a mechanism for managing the bias-variance trade-off (Kearns and Singh, 2000), where intermediate values of $\lambda$ often deliver the best empirical performance (Sutton and Barto, 1998).

Recent research has begun to explore an orthogonal unification between TD and DP methods by varying the width of backups. This concept was formally introduced by Sutton and Barto (2018) in the $Q(\sigma)$ algorithm; the method interpolates linearly, via a parameter $\sigma \in [0,1]$, between Expected Sarsa ($\sigma = 0$) (Sutton and Barto, 1998) and Sarsa ($\sigma = 1$) (Rummery and Niranjan, 1994). As such, $Q(\sigma)$ can be seen as a spectrum of algorithms whose backup widths lie somewhere between those of a sample backup and a full (expected) backup. De Asis et al. (2018) conducted further theoretical and empirical analysis of $Q(\sigma)$, arriving at two major hypotheses: $\sigma$ is a bias-variance trade-off parameter, and dynamically adapting $\sigma$ can lead to faster convergence. These hypotheses appeared to be corroborated by their experiments, where intermediate $\sigma$-values performed well, and exponentially decaying $\sigma$ during training performed even better.

In our work, we re-examine these two hypotheses regarding $Q(\sigma)$, and ultimately show that they are incorrect. In particular, we prove that the choice of $\sigma = 0$ minimizes the variance of $Q(\sigma)$ without increasing the bias. It follows that the choice of $\sigma$ does not constitute a bias-variance trade-off, since both bias and variance can be simultaneously optimized without competition. Furthermore, we prove that the contraction rate and fixed point of the $Q(\sigma)$ operator are independent of $\sigma$, suggesting again that the choice of $\sigma = 0$ is preferable for its variance-reduction effect. These findings are intriguing given the empirical results obtained by De Asis et al. (2018), which did not demonstrate a clear superiority of $\sigma = 0$. This apparent discrepancy between theory and practice suggests that other factors are responsible for the observed benefits of $Q(\sigma)$, and that our understanding of the algorithm must be revised accordingly.

We therefore advance a new hypothesis to explain the effectiveness of intermediate $\sigma$-values: conducting partial backups ($\sigma \neq 0$) helps escape poor initializations of the value function $Q$ by reducing the chance that previously unvisited state-action pairs are considered during the backup. The value estimates of these pairs could be arbitrarily incorrect, slowing initial learning; however, as training progresses and the majority of state-action pairs have been visited, $\sigma = 0$ starts to become more favorable for its variance reduction. This explanation matches the intuition of the time-decayed schedule utilized by De Asis et al. (2018), in which $\sigma = 1$ is desirable early in training and $\sigma = 0$ is desirable later.

This new hypothesis suggests that backup width should be adjusted based on the number of visitations to each state-action pair, with more emphasis placed on frequently sampled pairs. We therefore develop a new family of TD algorithms that we call Adaptive Tree Backup (ATB) methods, which generalizes Tree Backup (Precup et al., 2000) to arbitrary weightings of the value estimates. We highlight two instances of ATB methods (Count-Based and Policy-Based) that we believe to be promising. The proposed methods have the advantage of eliminating the hyperparameter $\sigma$, as they adapt automatically to the data experienced by the agent. Finally, our experiments demonstrate that Policy-Based ATB outperforms the exponential-decay schedule for $\sigma$ introduced by De Asis et al. (2018).

2 Background

We model the environment as a Markov Decision Process (MDP) described by the tuple $(\mathcal{S}, \mathcal{A}, P, R)$: $\mathcal{S}$ is a finite set of environment states, $\mathcal{A}$ is a finite set of actions, $P$ is the transition function, and $R$ is the reward function. An agent interacts with the environment by selecting an action $a_t \in \mathcal{A}$ in state $s_t$ with probability $\pi(a_t|s_t)$, receiving a reward $r_t \coloneqq R(s_t, a_t)$ and changing the state to $s_{t+1}$ with probability $P(s_{t+1}|s_t, a_t)$. Given the agent's policy $\pi$, the objective is to estimate the expected discounted return $Q^\pi(s_t, a_t) \coloneqq \mathbb{E}_\pi[r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots]$, where $\gamma \in [0,1]$, since acting greedily with respect to $Q^\pi$ is known to produce a better policy (Sutton and Barto, 1998).

$Q(\sigma)$ (Sutton and Barto, 2018) is a method that iteratively improves its estimated value function $Q$ according to

$$Q(s_t, a_t) \leftarrow (1-\alpha_t)\, Q(s_t, a_t) + \alpha_t \left( r_t + \gamma \left( \sigma Q(s_{t+1}, a_{t+1}) + (1-\sigma) \sum_{a' \in \mathcal{A}} \pi(a'|s_{t+1})\, Q(s_{t+1}, a') \right) \right),$$ (1)

where $\sigma \in [0,1]$ is a user-specified hyperparameter. This hyperparameter interpolates between the exact on-policy expectation (Expected Sarsa, $\sigma = 0$) and noisy sample estimates of it (Sarsa, $\sigma = 1$), and has been hypothesized to serve as a bias-variance trade-off parameter that improves the contraction rate of the algorithm (De Asis et al., 2018). In the subsequent section, we analyze the theoretical properties of the $Q(\sigma)$ update to investigate the validity of this hypothesis.
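To make the update concrete, here is a minimal tabular sketch of (1); the function name, array layouts, and the omission of terminal-state handling are our own choices rather than details from the paper.

```python
import numpy as np

def q_sigma_update(Q, pi, s, a, r, s_next, a_next, sigma, alpha, gamma):
    """One tabular Q(sigma) update following Eq. (1).

    Q  : (num_states, num_actions) array of action-value estimates
    pi : (num_states, num_actions) array of policy probabilities pi(a|s)
    sigma = 0 recovers Expected Sarsa; sigma = 1 recovers Sarsa.
    """
    expected = np.dot(pi[s_next], Q[s_next])  # full (expected) backup over next actions
    sampled = Q[s_next, a_next]               # sample backup using the sampled next action
    target = r + gamma * (sigma * sampled + (1.0 - sigma) * expected)
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q
```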

3 Analysis of $Q(\sigma)$

We investigate the effects that $\sigma$ has on the $Q(\sigma)$ algorithm in terms of convergence, contraction rate, bias, and variance. Our analysis dispels two misconceptions about $Q(\sigma)$: that dynamically adjusting $\sigma$ can achieve a faster contraction rate, and that $\sigma$ functions as a bias-variance trade-off parameter.

3.1 Convergence and Contraction Rate

For notational conciseness when proving convergence, we represent the $Q(\sigma)$ update (1) as an operator $T_\pi^\sigma \colon \mathbb{R}^n \to \mathbb{R}^n$, where $n \coloneqq |\mathcal{S} \times \mathcal{A}|$. The Bellman operator $T_\pi$ for a policy $\pi$ is defined as the affine function $Q \mapsto R + \gamma P_\pi Q$, where $(P_\pi Q)(s,a) \coloneqq \sum_{s',a'} P(s'|s,a)\, \pi(a'|s')\, Q(s',a')$. The Bellman operator is a contraction mapping (when $\gamma < 1$) that admits the unique fixed point $Q^\pi = \sum_{k=0}^{\infty} (\gamma P_\pi)^k R$, the unique solution to $T_\pi Q = Q$ (Bellman, 1966).
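For reference, the series form of this fixed point follows by solving the affine fixed-point equation directly; because $\gamma < 1$ and $P_\pi$ is a stochastic matrix, $I - \gamma P_\pi$ is invertible and its inverse equals the convergent Neumann series:

$$Q^\pi = T_\pi Q^\pi = R + \gamma P_\pi Q^\pi \;\Longrightarrow\; (I - \gamma P_\pi)\, Q^\pi = R \;\Longrightarrow\; Q^\pi = (I - \gamma P_\pi)^{-1} R = \sum_{k=0}^{\infty} (\gamma P_\pi)^k R.$$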

For policy evaluation, the operator $T_\pi^\sigma$ is considered convergent if and only if $\lim_{k \to \infty} (T_\pi^\sigma)^k Q = Q^\pi$ for every $Q \in \mathbb{R}^n$, where $(T_\pi^\sigma)^k Q \coloneqq (T_\pi^\sigma)^{k-1}(T_\pi^\sigma Q)$. Let $w \in \mathbb{R}^n$ be zero-mean noise that represents the randomness due to the environment and policy. Since $Q(\sigma)$ interpolates linearly between Expected Sarsa and Sarsa, we can express its operator as

$$T_\pi^\sigma Q \coloneqq (1-\sigma)\, T_\pi Q + \sigma (T_\pi Q + w) = T_\pi Q + \sigma w.$$ (2)

We can therefore write update (1) in vector notation as

$$Q_{t+1} = (1-\alpha_t)\, Q_t + \alpha_t (T_\pi Q_t + \sigma w_t).$$ (3)
Theorem 1.

Assume $\sum_{t=0}^{\infty} \alpha_t = \infty$ and $\sum_{t=0}^{\infty} \alpha_t^2 < \infty$. For $\gamma < 1$, the sequence $Q_t$ defined by (3) converges to $Q^\pi$ almost surely.

Proof.

Iteration (3) is already in a form where Proposition 4.4 of Bertsekas and Tsitsiklis (1996) is applicable: $T_\pi$ is a contraction mapping with respect to the maximum norm, and $\sigma w_t$ is zero-mean noise. Furthermore, $\mathbb{E}[(\sigma w_t)^2] \leq \mathbb{E}[w_t^2]$ because $\sigma \in [0,1]$, and $\mathbb{E}[w_t^2]$ is bounded by a constant since $\mathcal{S}$ and $\mathcal{A}$ are finite sets. Consequently, under the assumed stepsize conditions, the sequence $Q_t$ converges to $Q^\pi$ (the fixed point of $T_\pi$) almost surely. ∎

From our derived $Q(\sigma)$ operator (2), we see that $\sigma$ primarily functions as a noise attenuation parameter; however, it does not affect the expected update, which is inherently related to the Bellman operator. It follows that the contraction rate and fixed point of $Q(\sigma)$ are identical to those of the Bellman operator, and that these properties are independent of $\sigma$.

3.2 Variance

Theorem 1 suggests that small values of $\sigma$ have a beneficial variance-reduction effect, with no detrimental effects on convergence; however, the abstract noise process $w_t$ makes it difficult to quantify this effect. In this section, we derive the exact variance of $Q(\sigma)$ by extending existing results for Sarsa and Expected Sarsa from van Seijen et al. (2009).

Theorem 2.

Let $v_t$ and $\hat{v}_t$ denote the Expected Sarsa and Sarsa updates, respectively. The variance of the $Q(\sigma)$ update is given by the expression

$$\operatorname{Var}(v_t) + \sigma^2 \left( \operatorname{Var}(\hat{v}_t) - \operatorname{Var}(v_t) \right),$$ (4)

and this quantity is minimized when $\sigma = 0$.

Proof.

From the definition of $Q(\sigma)$ as a linear interpolation, we can deduce that its variance has the form

$$\sigma^2 \operatorname{Var}(\hat{v}_t) + (1-\sigma)^2 \operatorname{Var}(v_t) + 2\sigma(1-\sigma)\operatorname{Cov}(\hat{v}_t, v_t).$$ (5)

The Sarsa and Expected Sarsa updates share the same expected value (van Seijen et al., 2009); let it be denoted by $\mu_t$. We show algebraically that $\operatorname{Cov}(\hat{v}_t, v_t) = \operatorname{Var}(v_t)$ by comparing the exact calculation $\mathbb{E}[(\hat{v}_t - \mu_t)(v_t - \mu_t)]$ to the expression for $\operatorname{Var}(v_t)$ derived by van Seijen et al. (2009). Therefore, expression (5) reduces to

$$\sigma^2 \operatorname{Var}(\hat{v}_t) + \left[ (1-\sigma)^2 + 2\sigma(1-\sigma) \right] \operatorname{Var}(v_t) = \operatorname{Var}(v_t) + \sigma^2 \left( \operatorname{Var}(\hat{v}_t) - \operatorname{Var}(v_t) \right),$$

which is the desired result in expression (4). From van Seijen et al. (2009), $\operatorname{Var}(\hat{v}_t) - \operatorname{Var}(v_t) \geq 0$, and hence expression (4) is minimized when $\sigma = 0$. ∎

Theorem 2 shows that the variance of $Q(\sigma)$ is minimal when $\sigma = 0$ and increases monotonically with $\sigma$, which is interesting because it supports the view of $\sigma$ as a noise attenuation parameter from Theorem 1. Furthermore, since the expected update is unaffected by the choice of $\sigma$, it cannot represent a bias-variance trade-off; the variance can be minimized with no apparent drawback. Our theory indicates that there must be a different explanation for the empirical success of $Q(\sigma)$.
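Expression (4) is also easy to check numerically. The sketch below uses made-up quantities (the arrays `Q`, `pi`, and `p_next` are hypothetical and not taken from the paper): it samples a next state and a next action, forms the Sarsa and Expected Sarsa bootstrap targets (the part of the update that carries all of the $\sigma$-dependence), and compares the empirical variance of their $\sigma$-mixture against expression (4).

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, n_samples = 3, 4, 100_000
Q = rng.normal(size=(n_states, n_actions))              # hypothetical value estimates
pi = rng.dirichlet(np.ones(n_actions), size=n_states)   # hypothetical policy pi(a|s)
p_next = np.array([0.5, 0.3, 0.2])                      # hypothetical next-state distribution

s_next = rng.choice(n_states, size=n_samples, p=p_next)
a_next = np.array([rng.choice(n_actions, p=pi[s]) for s in s_next])
v_hat = Q[s_next, a_next]                  # Sarsa targets (sampled next action)
v = (pi[s_next] * Q[s_next]).sum(axis=1)   # Expected Sarsa targets (exact expectation)

for sigma in (0.0, 0.5, 1.0):
    empirical = np.var(sigma * v_hat + (1.0 - sigma) * v)
    predicted = np.var(v) + sigma**2 * (np.var(v_hat) - np.var(v))
    print(f"sigma={sigma:.1f}  empirical={empirical:.4f}  predicted={predicted:.4f}")
```

The two columns agree up to sampling error, reflecting the fact that $\operatorname{Cov}(\hat{v}_t, v_t) = \operatorname{Var}(v_t)$ whenever $v_t$ is the conditional expectation of $\hat{v}_t$.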

4 Adaptive Tree Backup (ATB) Algorithms

$Q(\sigma)$ does not address the standard bias-variance trade-off for temporal-difference learning, since the choice of $\sigma = 0$ minimizes variance without impacting bias. Why, then, did De Asis et al. (2018) observe empirical improvement when employing an intermediate $\sigma$-value? We conjecture that Expected Sarsa ($\sigma = 0$) struggles to overcome initialization error early in training (during which the majority of the state-action space is unexplored). Until every action has been sampled at least once in a given state, Expected Sarsa mixes one or more nonsensical value estimates into its full backup, potentially resulting in severe inaccuracies. In contrast, a sample method like Sarsa ($\sigma = 1$) is likely to visit previously taken actions (simply by definition of a probabilistic policy), thereby bootstrapping sooner from more accurate estimates, but also with additional noise due to sampling. This phenomenon helps explain why the time-decayed schedule for $\sigma$ proposed by De Asis et al. (2018) is effective; excluding incorrect value estimates ($\sigma = 1$) is important when most state-action pairs have not been visited, but minimizing variance ($\sigma = 0$) becomes more important once all estimates are relatively accurate.

If our hypothesis is correct, then methods that dynamically increase the effective backup width based on state-action pair visitations should generally perform better than $Q(\sigma)$ with a fixed $\sigma$-value or time-annealed schedule. We call these Adaptive Tree Backup (ATB) methods, as they generalize Tree Backup (Precup et al., 2000) to arbitrary backup weightings. For a given transition $(s_t, a_t, r_t, s_{t+1}, a_{t+1})$, ATB algorithms can be written in the generic form

$$Q(s_t, a_t) \leftarrow (1-\alpha_t)\, Q(s_t, a_t) + \alpha_t \left( r_t + \gamma \sum_{a' \in \mathcal{A}} c_t(s_{t+1}, a')\, Q(s_{t+1}, a') \right), \quad \text{subject to} \quad \sum_{a \in \mathcal{A}} c_t(s, a) = 1, \ \forall s \in \mathcal{S},$$ (6)

where each $c_t(s,a)$ is a nonnegative (possibly random) coefficient. Note that if $c_t(s,a) = (1-\sigma)\pi(a|s) + \sigma \mathbb{I}_{a = a_{t+1}}$, then we recover $Q(\sigma)$ exactly. However, the generality of the coefficients $c_t(s,a)$ permits myriad possibilities beyond the limited, one-dimensional spectrum spanned by $Q(\sigma)$.
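As a minimal illustration (our own naming conventions, not the paper's), the generic update (6) and the coefficient choice that recovers $Q(\sigma)$ can be sketched as follows.

```python
import numpy as np

def atb_update(Q, coeffs, s, a, r, s_next, alpha, gamma):
    """Generic Adaptive Tree Backup update, Eq. (6).

    coeffs : nonnegative backup weights c_t(s_{t+1}, .) over actions, summing to 1.
    """
    target = r + gamma * np.dot(coeffs, Q[s_next])
    Q[s, a] = (1.0 - alpha) * Q[s, a] + alpha * target
    return Q

def q_sigma_coeffs(pi, s_next, a_next, sigma):
    """Coefficients that recover Q(sigma): c_t(s,a) = (1-sigma)*pi(a|s) + sigma*1[a == a_{t+1}]."""
    c = (1.0 - sigma) * pi[s_next]   # arithmetic returns a fresh array, so pi is untouched
    c[a_next] += sigma
    return c
```

The two ATB variants below differ from $Q(\sigma)$ only in how `coeffs` is computed.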

The backup associated with update (6) admits the correct fixed point $Q^\pi$ only if $\mathbb{E}[c_t(s,a)] = \pi(a|s)$. Even so, if we ensure that $\mathbb{E}[c_t(s,a)] \to \pi(a|s)$ as $t \to \infty$, then the ATB method will eventually converge to $Q^\pi$ according to Theorem 1. We subsequently discuss two promising variants that do this.

4.1 Count-Based ATB

To match the intuition that an effective ATB strategy should ignore unvisited state-action pairs, we develop a method that weights state-action pairs according to their relative frequency of appearance. This can be accomplished by tracking a count $n(s,a)$ for each state-action pair $(s,a)$ and then normalizing it as a proportion. This variant, which we call Count-Based ATB, defines the backup coefficients as

$$c_t(s,a) = \frac{n(s,a)}{\sum_{a' \in \mathcal{A}} n(s,a')}.$$ (7)

By the law of large numbers, $c_t(s,a) \to \pi(a|s)$ after infinitely many visitations to state $s$, and Theorem 1 with $\sigma = 0$ applies in the limit. Beyond convergence to $Q^\pi$, there are several reasons why this particular ATB formulation is appealing. First, the count-based backup amounts to a maximum-likelihood 1-step estimate of $Q(s,a)$, and in this sense is the “most reasonable” backup to perform based on previously observed experiences. Furthermore, frequency counts implicitly emphasize value estimates that are more reliable, while excluding those that have not changed since initialization. This means that Count-Based ATB automatically transitions from Sarsa to Expected Sarsa, with a different effective rate for each state, which could not be achieved easily with $Q(\sigma)$ and does not require a user-specified hyperparameter schedule.
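A possible implementation of the Count-Based coefficients (7), intended to be plugged into the `atb_update` sketch above; the uniform fallback for a state with no visited actions is our own assumption, since (7) is undefined when every count is zero.

```python
import numpy as np

def count_based_coeffs(counts, s_next):
    """Count-Based ATB weights, Eq. (7): relative visitation frequencies in s_{t+1}.

    counts : (num_states, num_actions) array of visitation counts n(s, a),
             incremented whenever action a is executed in state s.
    """
    n = counts[s_next].astype(float)
    total = n.sum()
    if total == 0.0:
        # Completely unvisited state: fall back to uniform weights
        # (an assumption, not specified in the text).
        return np.full(len(n), 1.0 / len(n))
    return n / total
```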

Unfortunately, these benefits are not without some drawbacks. Although the algorithm theoretically converges to $Q^\pi$ in the limit, it is extremely unlikely that $c_t(s,a) = \pi(a|s)$ after any finite amount of training. This means that Count-Based ATB may tend towards a fixed point other than $Q^\pi$ in practice, which is undesirable even if it does represent a maximum-likelihood estimate according to past experience. Another, much more significant problem occurs when the agent switches its policy from $\pi$ to a different policy $\pi'$ during training. At this point, all of the previously collected counts $n(s,a)$ will no longer reflect the correct on-policy distribution. The agent would either need to reset the counts to zero (thus discarding useful information), or wait many timesteps for the counts to accurately reflect the new policy $\pi'$. Both of these options are inefficient, motivating us to search for a better alternative in the next subsection.

4.2 Policy-Based ATB

We can resolve the issues of Count-Based ATB by directly utilizing knowledge of the policy $\pi$, as do $Q(\sigma)$, Expected Sarsa, and Tree Backup. If some state-action pair $(s,a)$ is visited for the first time, then we know that the eventual value of $c_t(s,a)$ will be $\pi(a|s)$ according to Count-Based ATB. We can greatly accelerate convergence by immediately weighting the pair in proportion to $\pi(a|s)$ after this first visitation. Let $u$ denote the unit step function such that $u(n) = 1$ if $n > 0$ and $u(n) = 0$ otherwise. The backup coefficients for this new variant, which we call Policy-Based ATB, become

$$c_t(s,a) = \frac{u(n(s,a))\, \pi(a|s)}{\sum_{a' \in \mathcal{A}} u(n(s,a'))\, \pi(a'|s)}.$$ (8)

This definition clearly solves the two problems of Count-Based ATB. Unvisited state-action pairs are still ignored, as desired, but after each pair has been sampled at least once, the algorithm becomes equivalent to Expected Sarsa and starts converging to $Q^\pi$. This effectively bypasses the infinite-visitation requirement of Count-Based ATB. We can also see that the weighted backup instantly reflects changes to the policy, thanks to the explicit dependency on $\pi$ in definition (8). For these reasons, we expect Policy-Based ATB to learn significantly faster than Count-Based ATB in a tabular setting; however, Policy-Based ATB may be harder to combine with function approximation in high-dimensional MDPs, where the need for generalization makes it difficult to determine whether a state should be considered visited or unvisited. Additionally, the reliance on an explicit policy model $\pi(a|s)$ may be restrictive in some settings. We therefore foresee useful roles for both ATB variants.
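A corresponding sketch of the Policy-Based coefficients (8): mask the policy to the visited actions and renormalize. The fallback to the full policy when no action in $s_{t+1}$ has been visited yet is our own assumption.

```python
import numpy as np

def policy_based_coeffs(counts, pi, s_next):
    """Policy-Based ATB weights, Eq. (8): restrict pi(.|s_{t+1}) to visited actions
    and renormalize; once every action has been visited, the weights equal
    pi(.|s_{t+1}) exactly and the update matches Expected Sarsa."""
    visited = (counts[s_next] > 0).astype(float)   # u(n(s_{t+1}, a)) for each action
    weights = visited * pi[s_next]
    total = weights.sum()
    if total == 0.0:
        # No visited actions yet: fall back to the full policy
        # (an assumption, not specified in the text).
        return pi[s_next].copy()
    return weights / total
```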

5 Experiments and Conclusion

Figure 1: Comparison of our ATB methods against $Q(\sigma)$ with fixed $\sigma$-values and the dynamic schedule proposed by De Asis et al. (2018).

To test whether adaptive strategies can outperform fixed $\sigma$-values and the handcrafted schedule proposed by De Asis et al. (2018), we compare our two ATB variants against $Q(\sigma)$ in a 19-state deterministic random walk environment (Sutton and Barto, 2018) and an 11-state stochastic gridworld environment (Russell and Norvig, 2010). For all methods, we fix $\alpha_t = 0.4$ and $\gamma = 1$. We train the agents for 200 episodes and plot the root-mean-square (RMS) error of the estimated value function $Q$ relative to $Q^\pi$ (Figure 1). We plot the error after each episode, averaging over 50 independent trials, where the shaded regions represent 99% confidence intervals. Policy-Based ATB performs strongly, learning significantly faster than all of the methods except $Q(0)$, and even surpassing the handcrafted, dynamic $\sigma$ schedule of De Asis et al. (2018), an impressive feat for an automatic method. Count-Based ATB outperforms $Q(1)$, but its asymptotic performance in the deterministic random walk deteriorates due to the fixed-point bias we discussed earlier; however, this seems to have a less detrimental effect in the stochastic gridworld. Surprisingly, in contrast to the results of De Asis et al. (2018), we find that $\sigma = 0$ is the best choice of fixed $\sigma$-value in both environments.

The empirical success of our adaptive methods, particularly Policy-Based ATB, has important implications. First, it shows that data-driven adaptive strategies for adjusting the backup width can be more effective than $Q(\sigma)$ with a fixed $\sigma$-value or a pre-specified schedule for annealing $\sigma$. In addition, since both of our methods gradually expand the effective backup width as new state-action pairs are visited, they further strengthen our hypothesis that initialization error in the value estimates is an important consideration when determining backup width. This also helps formalize the intuition of De Asis et al. (2018) that backup width should increase over time. Finally, these results show that $\sigma = 0$ is a good choice in practice, corroborating our theoretical predictions that it minimizes variance without negatively impacting the bias or contraction rate of $Q(\sigma)$. Although $\sigma$ does not represent a bias-variance trade-off in temporal-difference learning, it is clear that $\sigma$ does play an important role in learning, and that analyzing methods in terms of how well their backups balance initialization error against variance is a promising pathway to further algorithmic improvements.

References

  • Bellman (1966) Richard Bellman. Dynamic programming. Science, 153(3731):34–37, 1966.
  • Bertsekas and Tsitsiklis (1996) Dimitri P Bertsekas and John N Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
  • De Asis et al. (2018) Kristopher De Asis, J Fernando Hernandez-Garcia, G Zacharias Holland, and Richard S Sutton. Multi-step reinforcement learning: A unifying algorithm. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Kearns and Singh (2000) Michael J Kearns and Satinder P Singh. Bias-variance error bounds for temporal difference updates. In COLT, pages 142–147, 2000.
  • Precup et al. (2000) Doina Precup, Richard S Sutton, and Satinder Singh. Eligibility traces for off-policy policy evaluation. In International Conference on Machine Learning, pages 759–766, 2000.
  • Rummery and Niranjan (1994) G. A. Rummery and M. Niranjan. On-line Q-Learning using connectionist systems. Technical Report TR 166, Cambridge University Engineering Department, Cambridge, England, 1994.
  • Russell and Norvig (2010) Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Prentice Hall, 3rd edition, 2010.
  • Sutton (1988) Richard S Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44, 1988.
  • Sutton and Barto (1998) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 1st edition, 1998.
  • Sutton and Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. MIT Press, 2nd edition, 2018.
  • van Seijen et al. (2009) Harm van Seijen, Hado van Hasselt, Shimon Whiteson, and Marco Wiering. A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pages 177–184, 2009.