
Gradient Descent Temporal Difference-difference Learning

Rong J.B. Zhu
Institute of Science and Technology for Brain-inspired Intelligence
Fudan University, Shanghai 200433, People’s Republic of China
[email protected]
James M. Murray
Institute of Neuroscience
University of Oregon, Eugene, OR 97403, USA
[email protected]
Abstract

Off-policy algorithms, in which a behavior policy differs from the target policy and is used to gain experience for learning, have proven to be of great practical value in reinforcement learning. However, even for simple convex problems such as linear value function approximation, these algorithms are not guaranteed to be stable. To address this, alternative algorithms that are provably convergent in such cases have been introduced, the most well known being gradient descent temporal difference (GTD) learning. This algorithm and others like it, however, tend to converge much more slowly than conventional temporal difference learning. In this paper we propose gradient descent temporal difference-difference (Gradient-DD) learning in order to improve GTD2, a GTD algorithm [1], by introducing second-order differences in successive parameter updates. We investigate this algorithm in the framework of linear value function approximation, theoretically proving its convergence by applying the theory of stochastic approximation. Studying the model empirically on the random walk task, the Boyan-chain task, and Baird’s off-policy counterexample, we find substantial improvement over GTD2 and, in several cases, even better performance than conventional TD learning.

1 Introduction

Off-policy algorithms for value function learning enable an agent to use a behavior policy that differs from the target policy in order to gain experience for learning. However, because off-policy methods learn a value function for a target policy given data generated by a different behavior policy, they are slower to converge than on-policy methods and may even diverge when applied to problems involving function approximation [2, 3].

Two general approaches have been investigated to address the challenge of developing stable and effective off-policy temporal-difference algorithms. One approach is to use importance sampling methods to warp the update distribution so that the expected value is corrected [4, 5]. This approach is useful for the convergence guarantee, but it does not address stability issues. The second main approach to addressing the challenge of off-policy learning is to develop true gradient descent-based methods that are guaranteed to be stable regardless of the update distribution. [6, 1] proposed the first off-policy gradient-descent-based temporal difference algorithms (GTD and GTD2, respectively). These algorithms are guaranteed to be stable, with computational complexity scaling linearly with the size of the function approximator. Empirically, however, their convergence is much slower than conventional temporal difference (TD) learning, limiting their practical utility [7, 8]. Building on this work, extensions to the GTD family of algorithms (see [9] for a review) have allowed for incorporating eligibility traces [10, 11], non-linear function approximation such as with a neural network [12], and reformulation of the optimization as a saddle point problem [13, 14]. However, due to their slow convergence, none of these stable off-policy methods are commonly used in practice.

In this work, we introduce a new gradient descent algorithm for temporal difference learning with linear value function approximation. This algorithm, which we call gradient descent temporal difference-difference (Gradient-DD) learning, is an acceleration technique that employs second-order differences in successive parameter updates. The basic idea of Gradient-DD is to modify the error objective function by additionally considering the prediction error obtained in the last time step, then to derive a gradient-descent algorithm based on this modified objective function. In addition to exploiting the Bellman equation to get the solution, this modified error objective function avoids drastic changes in the value function estimate by encouraging local search around the current estimate. Algorithmically, the Gradient-DD approach only adds an additional term to the update rule of the GTD2 method, and the extra computational cost is negligible. We prove its convergence by applying the theory of stochastic approximation. This result is supported by numerical experiments, which also show that Gradient-DD obtains better convergence in many cases than conventional TD learning.

1.1 Related Work

In related approaches to ours, some previous studies have attempted to improve Gradient-TD algorithms by adding regularization terms to the objective function. These approaches have used $l_{1}$ regularization on weights to learn sparse representations of value functions [15], or $l_{2}$ regularization on weights [7]. Our work differs from these approaches in two ways. First, whereas these previous studies investigated a variant of TD learning with gradient corrections, we take the GTD2 algorithm as our starting point. Second, unlike these previous approaches, our approach modifies the error objective function by using a distance constraint rather than a penalty on weights. The distance constraint works by restricting the search to some region around the estimate obtained in the most recent time step. With this modification, our method provides a learning rule that contains second-order differences in successive parameter updates.

Our approach is similar to trust region policy optimization [16] or relative entropy policy search [17], which penalize large policy changes during policy learning. In these methods, constrained optimization is used to update the policy by constraining some measure of distance between the new policy and the old policy. Here, however, our aim is to find the optimal value function, and the regularization term uses the previous value function estimate to avoid drastic changes in the updating process.

Our approach bears similarity to the natural gradient approach widely used in reinforcement learning [18, 19, 20, 21, 22], which also features a constrained optimization form. However, Gradient-DD is distinct from the natural gradient. The essential difference is that, unlike the natural gradient, Gradient-DD is a trust region method, which defines the trust region according to the difference between the current value estimate and the estimate obtained at the previous step. From the viewpoint of computational cost, unlike natural TD [21], which needs to update an estimate of the metric tensor, the computational cost of Gradient-DD is essentially the same as that of GTD2.

2 Gradient descent method for off-policy temporal difference learning

2.1 Problem definition and background

In this section, we formalize the problem of learning the value function for a given policy under the Markov decision process (MDP) framework. In this framework, the agent interacts with the environment over a sequence of discrete time steps, $t=1,2,\ldots$. At each time step the agent observes a state $s_{t}\in\mathcal{S}$ and selects an action $a_{t}\in\mathcal{A}$. In response, the environment emits a reward $r_{t}\in\mathbb{R}$ and transitions the agent to its next state $s_{t+1}\in\mathcal{S}$. The state and action sets are finite. State transitions are stochastic and dependent on the immediately preceding state and action. Rewards are stochastic and dependent on the preceding state and action, as well as on the next state. The process generating the agent’s actions is termed the behavior policy. In off-policy learning, this behavior policy is in general different from the target policy $\pi:\mathcal{S}\rightarrow\mathcal{A}$. The objective is to learn an approximation to the state-value function under the target policy in a particular environment:

V(s)=\text{E}_{\pi}\left[\sum\nolimits_{t=1}^{\infty}\gamma^{t-1}r_{t}\,\big|\,s_{1}=s\right], \quad (1)

where $\gamma\in[0,1)$ is the discount rate.

In problems for which the state space is large, it is practical to approximate the value function. In this paper we consider linear function approximation, where states are mapped to feature vectors with fewer components than the number of states. Specifically, for each state $s\in\mathcal{S}$ there is a corresponding feature vector $\mathbf{x}(s)\in\mathbb{R}^{p}$, with $p\leq|\mathcal{S}|$, such that the approximate value function is given by

V_{\mathbf{w}}(s):=\mathbf{w}^{\top}\mathbf{x}(s). \quad (2)

The goal is then to learn the parameters $\mathbf{w}$ such that $V_{\mathbf{w}}(s)\approx V(s)$.

2.2 Gradient temporal difference learning

A major breakthrough for the study of the convergence properties of MDP systems came with the introduction of the GTD and GTD2 learning algorithms [6, 1]. We begin by briefly recapitulating the GTD algorithms, which we will then extend in the following sections. To begin, we introduce the Bellman operator $\mathbf{B}$ such that the true value function $\mathbf{V}\in\mathbb{R}^{|\mathcal{S}|}$ satisfies the Bellman equation:

\mathbf{V}=\mathbf{R}+\gamma\mathbf{P}\mathbf{V}=:\mathbf{B}\mathbf{V},

where $\mathbf{R}$ is the reward vector with components $\text{E}(r_{n}|s_{n}=s)$, and $\mathbf{P}$ is the matrix of state transition probabilities under the behavior policy. In temporal difference methods, an appropriate objective function should minimize the difference between the approximate value function and the solution to the Bellman equation.

Having defined the Bellman operator, we next introduce the projection operator $\mathbf{\Pi}$, which takes any value function $\mathbf{V}$ and projects it to the nearest value function within the space of approximate value functions of the form of Eqn. (2). Letting $\mathbf{X}$ be the matrix whose rows are $\mathbf{x}(s)$, the approximate value function can be expressed as $\mathbf{V}_{\mathbf{w}}=\mathbf{X}\mathbf{w}$. The projection operator is then given by

\mathbf{\Pi}=\mathbf{X}(\mathbf{X}^{\top}\mathbf{D}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{D},

where the matrix $\mathbf{D}$ is diagonal, with each diagonal element $d_{s}$ corresponding to the probability of visiting state $s$.

The natural measure of how closely the approximation $\mathbf{V}_{\mathbf{w}}$ satisfies the Bellman equation is the mean-squared Bellman error:

\text{MSBE}(\mathbf{w})=\|\mathbf{V}_{\mathbf{w}}-\mathbf{B}\mathbf{V}_{\mathbf{w}}\|_{\mathbf{D}}^{2}, \quad (3)

where the norm is weighted by $\mathbf{D}$, such that $\|\mathbf{V}\|^{2}_{\mathbf{D}}=\mathbf{V}^{\top}\mathbf{D}\mathbf{V}$. However, because the Bellman operator follows the underlying state dynamics of the Markov chain, irrespective of the structure of the linear function approximator, $\mathbf{B}\mathbf{V}_{\mathbf{w}}$ will typically not be representable as $\mathbf{V}_{\mathbf{w}}$ for any $\mathbf{w}$. An alternative objective function, therefore, is the mean squared projected Bellman error (MSPBE), which we define as

J(\mathbf{w})=\|\mathbf{V}_{\mathbf{w}}-\mathbf{\Pi}\mathbf{B}\mathbf{V}_{\mathbf{w}}\|_{\mathbf{D}}^{2}. \quad (4)
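To make these definitions concrete, the following is a minimal NumPy sketch (ours, not code from the paper) that evaluates the projection operator and the MSPBE of Eqn. (4) for a small, arbitrary MDP; the matrices X, D, P and the vector R below are illustrative placeholders.

import numpy as np

def mspbe(w, X, D, R, P, gamma):
    """Mean squared projected Bellman error (Eqn. 4) for V_w = X w,
    with Bellman operator B V = R + gamma P V and projection
    Pi = X (X^T D X)^{-1} X^T D."""
    V = X @ w                                        # approximate value function
    BV = R + gamma * (P @ V)                         # Bellman backup of V_w
    Pi = X @ np.linalg.solve(X.T @ D @ X, X.T @ D)   # projection operator
    diff = V - Pi @ BV                               # projected Bellman residual
    return diff @ D @ diff                           # squared D-weighted norm

# toy 3-state, 2-feature example with made-up numbers
X = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
D = np.diag([0.4, 0.4, 0.2])
P = np.array([[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
R = np.array([0.0, 0.0, 1.0])
print(mspbe(np.zeros(2), X, D, R, P, gamma=0.9))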

Following [1], our objective is to minimize this error measure. As usual in stochastic gradient descent, the weights at each time step are then updated by $\Delta\mathbf{w}=-\alpha\nabla J(\mathbf{w})$, where $\alpha>0$, and

-\frac{1}{2}\nabla J(\mathbf{w})=-\text{E}[(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})\mathbf{x}_{n}^{\top}][\text{E}(\mathbf{x}_{n}\mathbf{x}_{n}^{\top})]^{-1}\text{E}(\delta_{n}\mathbf{x}_{n}). \quad (5)

For notational simplicity, we have denoted the feature vector associated with $s_{n}$ as $\mathbf{x}_{n}=\mathbf{x}(s_{n})$. We have also introduced the temporal difference error $\delta_{n}=r_{n}+(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})^{\top}\mathbf{w}_{n}$. Let $\boldsymbol{\eta}_{n}$ denote the estimate of $[\text{E}(\mathbf{x}_{n}\mathbf{x}_{n}^{\top})]^{-1}\text{E}(\delta_{n}\mathbf{x}_{n})$ at time step $n$. Because the factors in Eqn. (5) can be directly sampled, the resulting updates in each step are

\delta_{n}= r_{n}+(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})^{\top}\mathbf{w}_{n}
\boldsymbol{\eta}_{n+1}= \boldsymbol{\eta}_{n}+\beta_{n}(\delta_{n}-\mathbf{x}_{n}^{\top}\boldsymbol{\eta}_{n})\mathbf{x}_{n}
\mathbf{w}_{n+1}= \mathbf{w}_{n}-\alpha_{n}(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})(\mathbf{x}_{n}^{\top}\boldsymbol{\eta}_{n}). \quad (6)

These updates define the GTD2 learning algorithm, which we will build upon in the following section.
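As a concrete illustration, a minimal NumPy sketch of one GTD2 update from Eqn. (6) is given below; the function name and argument conventions are ours, not the paper's, and feature vectors are assumed to be NumPy arrays.

import numpy as np

def gtd2_step(w, eta, x, r, x_next, gamma, alpha, beta):
    """One GTD2 update (Eqn. 6).
    w: value-function weights, eta: auxiliary weights,
    x, x_next: feature vectors of the current and next state."""
    delta = r + (gamma * x_next - x) @ w                    # TD error
    w_new = w - alpha * (gamma * x_next - x) * (x @ eta)    # value-weight update
    eta_new = eta + beta * (delta - x @ eta) * x            # auxiliary-weight update
    return w_new, eta_new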

3 Gradient descent temporal difference-difference learning

In this section we modify the objective function by additionally considering the difference between $\mathbf{V}_{\mathbf{w}}$ and $\mathbf{V}_{\mathbf{w}_{n-1}}$, the value function estimate at step $n-1$ of the optimization. We propose a new objective $J_{\text{GDD}}(\mathbf{w}|\mathbf{w}_{n-1})$, where the notation “$\mathbf{w}|\mathbf{w}_{n-1}$” in the parentheses means that the objective is defined given the estimate $\mathbf{V}_{\mathbf{w}_{n-1}}$ of the previous time step $n-1$. Specifically, we modify Eqn. (4) as follows:

J_{\text{GDD}}(\mathbf{w}|\mathbf{w}_{n-1})=J(\mathbf{w})+\kappa\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}, \quad (7)

where $\kappa\geq 0$ is a regularization parameter. We show in Section A.1 of the appendix that minimizing Eqn. (7) is equivalent to the following optimization:

\arg\min\limits_{\mathbf{w}}J(\mathbf{w})\ \text{ s.t. }\ \|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}\leq\mu, \quad (8)

where $\mu>0$ is a parameter that becomes large when $\kappa$ is small, so that the MSPBE objective is recovered as $\mu\to\infty$, equivalent to $\kappa\to 0$ in Eqn. (7).

Figure 1: Schematic diagram of Gradient-DD learning with $\mathbf{w}\in\mathbb{R}^{2}$. Rather than updating $\mathbf{w}$ directly along the gradient of the MSPBE (black arrow), the update rule selects $\mathbf{w}_{n}$ (red arrow) that minimizes the MSPBE while satisfying the constraint $\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}\leq\mu$ (shaded ellipse).

Rather than simply minimizing the projected Bellman error, the agent makes use of the most recent update to look for the solution, choosing a $\mathbf{w}$ that minimizes the MSPBE while satisfying the constraint that the estimated value function should not change too greatly, as illustrated in Fig. 1. In effect, the regularization term encourages searching around the estimate from the previous time step, especially when the state space is large.

Eqn. (8) shows that the regularized objective is a trust region approach, which seeks a direction that attains the best improvement possible subject to the distance constraint. The trust region is defined by the value distance rather than the weight distance, meaning that Gradient-DD also makes use of the natural gradient of the objective around $\mathbf{w}_{n-1}$ rather than around $\mathbf{w}_{n}$ (see Section A.2 of the appendix for details). In this sense, our approach can be interpreted as a trust region method that makes use of natural gradient information to prevent the estimated value function from changing too drastically.

For comparison with related approaches using natural gradients, in Fig. 8 of the appendix we compare the empirical performance of our algorithm with natural GTD2 and natural TDC [21] using the random walk task introduced below in Section 5.

With these considerations in mind, the negative gradient of $J_{\text{GDD}}(\mathbf{w}|\mathbf{w}_{n-1})$ is

-\frac{1}{2}\nabla J_{\text{GDD}}(\mathbf{w}|\mathbf{w}_{n-1})=-\text{E}[(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})\mathbf{x}_{n}^{\top}][\text{E}(\mathbf{x}_{n}\mathbf{x}_{n}^{\top})]^{-1}\text{E}(\delta_{n}\mathbf{x}_{n})-\kappa\text{E}[(\mathbf{x}_{n}^{\top}\mathbf{w}_{n}-\mathbf{x}_{n}^{\top}\mathbf{w}_{n-1})\mathbf{x}_{n}]. \quad (9)

Because the terms in Eqn. (9) can be directly sampled, the stochastic gradient descent updates are given by

\delta_{n}= r_{n}+(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})^{\top}\mathbf{w}_{n}
\boldsymbol{\eta}_{n+1}= \boldsymbol{\eta}_{n}+\alpha_{n}(\delta_{n}-\mathbf{x}_{n}^{\top}\boldsymbol{\eta}_{n})\mathbf{x}_{n}
\mathbf{w}_{n+1}= \mathbf{w}_{n}-\kappa\alpha_{n}(\mathbf{x}_{n}^{\top}\mathbf{w}_{n}-\mathbf{x}_{n}^{\top}\mathbf{w}_{n-1})\mathbf{x}_{n}-\alpha_{n}(\gamma\mathbf{x}_{n+1}-\mathbf{x}_{n})(\mathbf{x}_{n}^{\top}\boldsymbol{\eta}_{n}). \quad (10)

These update equations define the Gradient-DD method, in which the GTD2 update equations (6) are generalized by including a second-order update term in the third update equation, where this term originates from the squared bias term in the objective (7). Since Gradient-DD is not sensitive to the step size used for updating $\boldsymbol{\eta}$ (see Fig. 7 in the appendix), the Gradient-DD updates use only a single shared step size $\alpha_{n}$ rather than the two step sizes $\alpha_{n},\beta_{n}$ used by GTD2 and TDC. It is worth noting that the computational cost of our algorithm is essentially the same as that of GTD2. In the following sections, we shall analytically and numerically investigate the convergence and performance of Gradient-DD learning.
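The following sketch (again with our own naming conventions, not code from the paper) shows how Eqn. (10) changes the GTD2 step above: the value-weight update gains the second-order difference term, a single step size is shared by both updates, and the previous weights are carried along.

import numpy as np

def gradient_dd_step(w, w_prev, eta, x, r, x_next, gamma, alpha, kappa=1.0):
    """One Gradient-DD update (Eqn. 10)."""
    delta = r + (gamma * x_next - x) @ w                     # TD error
    w_new = (w
             - kappa * alpha * (x @ w - x @ w_prev) * x      # second-order difference term
             - alpha * (gamma * x_next - x) * (x @ eta))     # GTD2-style correction
    eta_new = eta + alpha * (delta - x @ eta) * x            # auxiliary-weight update
    return w_new, w, eta_new                                 # w becomes the new w_prev

Relative to the GTD2 step, the only extra work is the inner products $\mathbf{x}_{n}^{\top}\mathbf{w}_{n}$ and $\mathbf{x}_{n}^{\top}\mathbf{w}_{n-1}$, consistent with the claim that the computational cost is essentially unchanged.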

4 Convergence Analysis

In this section we establish the convergence of Gradient-DD learning. Denote

\mathbf{G}_{n}=\left[\begin{array}{cc}\mathbf{x}_{n}\mathbf{x}_{n}^{\top}&\mathbf{x}_{n}(\mathbf{x}_{n}-\gamma\mathbf{x}_{n+1})^{\top}\\ -(\mathbf{x}_{n}-\gamma\mathbf{x}_{n+1})\mathbf{x}_{n}^{\top}&\mathbf{0}\end{array}\right],\quad\text{and}\quad\mathbf{H}_{n}=\left[\begin{array}{cc}\mathbf{0}&\mathbf{0}\\ \mathbf{0}&\mathbf{x}_{n}\mathbf{x}_{n}^{\top}\end{array}\right].

We rewrite the update rules in Eqn. (10) as a single iteration in a combined parameter vector with $2p$ components, $\boldsymbol{\rho}_{n}=(\boldsymbol{\eta}_{n}^{\top},\mathbf{w}_{n}^{\top})^{\top}$, and a new reward-related vector with $2p$ components, $\mathbf{g}_{n+1}=(r_{n}\mathbf{x}_{n}^{\top},\mathbf{0}^{\top})^{\top}$, as follows:

\boldsymbol{\rho}_{n+1}=\boldsymbol{\rho}_{n}-\kappa\alpha_{n}\mathbf{H}_{n}(\boldsymbol{\rho}_{n}-\boldsymbol{\rho}_{n-1})+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}). \quad (11)
Theorem 1.

Consider the update rules (11) with step-size sequence $\alpha_{n}$. Let the TD fixed point be $\mathbf{w}^{*}$, such that $\mathbf{V}_{\mathbf{w}^{*}}=\mathbf{\Pi}\mathbf{B}\mathbf{V}_{\mathbf{w}^{*}}$. Suppose that (A0) $\alpha_{n}\in(0,1)$, $\sum\nolimits_{n=1}^{\infty}\alpha_{n}=\infty$, $\sum\nolimits_{n=1}^{\infty}\alpha_{n}^{2}<\infty$, (A1) $(\mathbf{x}_{n},r_{n},\mathbf{x}_{n+1})$ is an i.i.d. sequence with uniformly bounded second moments, (A2) $\text{E}[(\mathbf{x}_{n}-\gamma\mathbf{x}_{n+1})\mathbf{x}_{n}^{\top}]$ and $\text{E}(\mathbf{x}_{n}\mathbf{x}_{n}^{\top})$ are non-singular, (A3) $\sup_{n}\|\boldsymbol{\rho}_{n+1}-\boldsymbol{\rho}_{n}\|$ is bounded in probability, (A4) $\kappa\in[0,\infty)$. Then as $n\rightarrow\infty$, $\mathbf{w}_{n}\rightarrow\mathbf{w}^{*}$ with probability 1.

Proof sketch.

Due to the second-order difference term in Eqn. (11), the analysis framework in [23] does not directly apply to the Gradient-DD algorithm when (A0) holds, i.e., when the step size is tapered. Likewise, the two-timescale convergence analysis [19] is also not directly applicable. Defining $\mathbf{u}_{n+1}=\boldsymbol{\rho}_{n+1}-\boldsymbol{\rho}_{n}$, we rewrite the iterative process in Eqn. (11) as two parallel processes, which are given by

\boldsymbol{\rho}_{n+1}=\boldsymbol{\rho}_{n}-\kappa\alpha_{n}\mathbf{H}_{n}\mathbf{u}_{n}+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}), \quad (12)
\mathbf{u}_{n+1}=-\kappa\alpha_{n}\mathbf{H}_{n}\mathbf{u}_{n}+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}). \quad (13)

We analyze the parallel processes in Eqns. (12) and (13) instead of directly analyzing Eqn. (11). Our proof has three steps. First, we show that $\sup_{n}\|\boldsymbol{\rho}_{n}\|$ is bounded by applying the stability result for stochastic approximation [23] to the recursion in Eqn. (12). Second, based on this result, we show that $\mathbf{u}_{n}$ goes to 0 in probability by analyzing the recursion in Eqn. (13). Finally, using the result that $\mathbf{u}_{n}$ goes to 0 in probability, we apply the convergence result for stochastic approximation [23] to the recursion in Eqn. (12) to show that $\boldsymbol{\rho}_{n}$ goes to the TD fixed point, which is given by the solution of $\mathbf{G}\boldsymbol{\rho}+\mathbf{g}=0$. The details are provided in Section A.3 of the appendix.

Theorem 1 shows that Gradient-DD retains the convergence guarantee of GTD2 under some mild conditions. Assumptions (A0), (A1), and (A2) are standard conditions in the convergence analysis of gradient TD learning algorithms [6, 1, 12]. Assumption (A3) is weak, since it requires only that the incremental update in each step is bounded in probability. Assumption (A4) requires that $\kappa$ is a constant, meaning $\kappa=O(1)$. Given this assumption, the contribution of the term $\kappa\mathbf{H}_{n}\mathbf{u}_{n}$ is controlled by $\alpha_{n}$ as $n\rightarrow\infty$.

5 Empirical Study

In this section, we assess the practical utility of the Gradient-DD method in numerical experiments. To validate the performance of Gradient-DD learning, we compare it with GTD2 learning, TDC learning (TD with gradient correction [1]), TDRC learning (TDC with regularized correction [7]), and conventional TD learning, using both tabular and linear representations. We conducted three tasks: a simple random walk task, the Boyan-chain task, and Baird’s off-policy counterexample. In each task, we evaluate the performance of a learning algorithm by the empirical root mean-squared (RMS) error, $\sqrt{\sum\nolimits_{s\in\mathcal{S}}d_{s}(V_{\mathbf{w}_{n}}(s)-V(s))^{2}}$. We choose the empirical RMS error rather than the root projected mean-squared error or the other measures used in [9, 7] because it is a direct measure of practical performance. We set $\kappa=1$. We also investigate the sensitivity to $\kappa$ in the appendix, where we show that $\kappa=1$ is a good and natural choice in our empirical studies, although tuning $\kappa$ may be necessary for some tasks in practice.

5.1 Random walk task

As a first test of Gradient-DD learning, we conducted a simple random walk task [3]. The random walk task has a linear arrangement of $m$ states plus an absorbing terminal state at each end. Thus there are $m+2$ sequential states, $S_{0},S_{1},\cdots,S_{m},S_{m+1}$, where $m=10$, 20, or 40. Every walk begins in the center state. At each step, the walk moves to a neighboring state, either to the right or to the left with equal probability. If either edge state ($S_{0}$ or $S_{m+1}$) is entered, the walk terminates. A walk’s outcome is defined to be $r=0$ at $S_{0}$ and $r=1$ at $S_{m+1}$. Our aim is to learn the value of each state $V(s)$, where the true values are $(1,\cdots,m)/(m+1)$. In all cases the approximate value function is initialized to the intermediate value $V_{0}(s)=0.5$. We first consider a tabular representation of the value function; we also consider a linear approximation and obtain similar results, which are reported in Fig. 10 of the appendix. The learning rate $\alpha_{n}$ is tapered according to the schedule $\alpha_{n}=\alpha(10^{3}+1)/(10^{3}+n)$, and we tune $\alpha\in\{10^{-12/4},10^{-11/4},\cdots,10^{-1/4},1\}$. We also consider constant step sizes and obtain similar results, which are reported in Fig. 9 of the appendix. For GTD2 and TDC, we set $\beta_{n}=\zeta\alpha_{n}$ with $\zeta\in\{1/64,1/16,1/4,1,4\}$.
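For concreteness, a minimal sketch of one episode of this random walk (our own implementation, not code from the paper) is given below.

import numpy as np

def random_walk_episode(m, rng):
    """One episode of the random walk with states S_0,...,S_{m+1}.
    Returns the visited non-terminal states and the terminal reward
    (r = 0 at S_0, r = 1 at S_{m+1})."""
    s, visited = (m + 1) // 2, []            # every walk begins in the center state
    while 0 < s < m + 1:
        visited.append(s)
        s += rng.choice([-1, 1])             # move left or right with equal probability
    return visited, 1.0 if s == m + 1 else 0.0

rng = np.random.default_rng(0)
m = 10
true_values = np.arange(1, m + 1) / (m + 1)  # true values (1,...,m)/(m+1)
visited, reward = random_walk_episode(m, rng)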

We first compare the methods by plotting the empirical RMS error, averaged over the final 100 episodes, as a function of step size $\alpha$ in Fig. 2, where 20,000 episodes are used. We also plot the average error over all episodes during training and report these results in Fig. 5 of the appendix. From these figures, we can make several observations. (1) Gradient-DD clearly performs better than the GTD2 and TDC methods. This advantage is consistent across settings and grows as the state space becomes larger. (2) Gradient-DD performs similarly to TDRC and conventional TD learning, with a similar dependence on $\alpha$, although Gradient-DD exhibits greater sensitivity to the value of $\alpha$ in the log domain than these other algorithms. In summary, Gradient-DD exhibits clear advantages over the GTD2 algorithm, and its performance is as good as that of TDRC and conventional TD learning.

Figure 2: The random walk task with tabular representation and tapering step size $\alpha_{n}=\alpha(10^{3}+1)/(10^{3}+n)$. Upper: performance as a function of $\alpha$; Lower: performance over episodes. In each row, the state space size is 10 (left), 20 (middle), or 40 (right). The curves are averaged over 50 runs, with error bars denoting the standard error of the mean, though most are vanishingly small.

Next we look closely at the performance during training in Fig. 2. For each method, we tuned $\alpha\in\{10^{-12/4},\cdots,10^{-1/4},1\}$ by minimizing the average error of the last 100 episodes. We also compare the performance when $\alpha$ is instead tuned by minimizing the average error over all episodes, and report this in Fig. 5 of the appendix. From these results, we draw several observations. (1) For all conditions tested, Gradient-DD converges much more rapidly than GTD2 and TDC. The results also indicate that Gradient-DD converges as fast as TDRC and conventional TD learning. (2) The advantage of Gradient-DD learning over GTD2 grows as the state space increases in size. (3) Gradient-DD performs consistently well under both the constant step size setting and the tapered step size setting. In summary, the Gradient-DD learning curves in this task show improvements over other gradient-based methods and performance that matches conventional TD learning.

Like TDRC, the Gradient-DD updates use only a single shared step size $\alpha_{n}$, i.e., $\beta_{n}=\alpha_{n}$, rather than two independent step sizes $\alpha_{n}$ and $\beta_{n}$ as in the GTD2 and TDC algorithms. A possible concern is that the performance gains of our second-order algorithm could just as easily be obtained with existing methods by adopting a two-timescale approach, in which the value function weights are updated with a smaller step size than the second set of weights. Hence, in addition to investigating the effects of the learning rate, the size of the state space, and the magnitude of the regularization parameter, we also investigate the effect of using distinct values for the two learning rates $\alpha_{n}$ and $\beta_{n}$ in Gradient-DD. To do this, we set $\beta_{n}=\zeta\alpha_{n}$ with $\zeta\in\{1/64,1/16,1/4,1,4\}$, and report the results in Fig. 7 of the appendix. The results show that comparably good performance of Gradient-DD is obtained for these various $\zeta$, providing evidence that the second-order difference term in our approach provides an improvement beyond what can be obtained with previous gradient-based methods using the two-timescale approach.

5.2 Boyan-chain task

We next investigate Gradient-DD learning on the Boyan-chain problem, which is a standard task for testing linear value-function approximation [24]. In this task there are $4p-3$ states, each of which is represented by a $p$-dimensional feature vector, with $p=20$, 50, or 100. The $p$-dimensional representation for every fourth state from the start is $[1,0,\cdots,0]$ for state $s_{1}$, $[0,1,0,\cdots,0]$ for $s_{5}$, $\cdots$, and $[0,0,\cdots,0,1]$ for the terminal state $s_{4p-3}$. The representations for the remaining states are obtained by linearly interpolating between these. The optimal coefficients of the feature vector are $(-4(p-1),-4(p-2),\cdots,0)/5$. In each state, except for the last one before the terminal state, there are two possible actions: move forward one step or move forward two steps, each occurring with probability 0.5. Both actions lead to reward -0.3. The last state before the terminal state has only a single action, moving forward to the terminal state with reward -0.2. We tune $\alpha\in\{10^{-2},10^{-1.5},10^{-1},10^{-3/4},10^{-1/2},10^{-1/4},10^{-1/8},1,10^{1/8},10^{1/4}\}$ for each method by minimizing the average error of the final 100 episodes. The step size is tapered according to the schedule $\alpha_{n}=\alpha(2\times 10^{3}+1)/(2\times 10^{3}+n)$. For GTD2 and TDC, we set $\beta_{n}=\zeta\alpha_{n}$ with $\zeta\in\{1/64,1/16,1/4,1,4\}$. In this task, we set $\gamma=1$.
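A minimal sketch of this feature construction (our own code, following the description above) is:

import numpy as np

def boyan_features(p):
    """Feature matrix for the Boyan chain with 4p-3 states and p features.
    Every fourth state (s_1, s_5, ...) gets a one-hot vector; the states in
    between linearly interpolate between the two neighbouring one-hot vectors."""
    n_states = 4 * p - 3
    X = np.zeros((n_states, p))
    for i in range(n_states):        # i = 0 corresponds to state s_1
        k, r = divmod(i, 4)          # index of the one-hot state at or below s_{i+1}
        X[i, k] = 1.0 - r / 4.0
        if r > 0:
            X[i, k + 1] = r / 4.0
    return X

X = boyan_features(p=20)             # (4*20-3) x 20 = 77 x 20 feature matrix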

We report the performance as a function of $\alpha$ and the performance over episodes in Fig. 3, where $\alpha$ is tuned based on the average error of the last 100 episodes. We also compare the performance based on the average error over all episodes during training and report the results in Fig. 11 of the appendix. These figures lead to conclusions similar to those already drawn for the random walk task. (1) Gradient-DD converges much faster than GTD2 and TDC, and generally converges to better values. (2) Gradient-DD is competitive with TDRC and conventional TD learning, despite being somewhat slower in the early episodes. (3) The improvement over GTD2 or TDC grows as the state space becomes larger. (4) Comparing the performance with constant step size versus tapered step size, Gradient-DD performs better with a tapered step size than with a constant step size.

Figure 3: The Boyan-chain task with linear approximation and tapering step size $\alpha_{n}=\alpha(2\times 10^{3}+1)/(2\times 10^{3}+n)$. Upper: performance as a function of $\alpha$; Lower: performance over episodes. In each row, the feature size is 20 (left), 50 (middle), or 100 (right). The curves are averaged over 50 runs, with error bars denoting the standard error of the mean, though most are vanishingly small across runs.

5.3 Baird’s off-policy counterexample

We also verify the performance of Gradient-DD on Baird’s off-policy counterexample [2, 3], illustrated schematically in Fig. 4, for which TD learning famously diverges. We show results from Baird’s counterexample with $N=7,20$ states. The reward is always zero, and the agent learns a linear value approximation with $N+1$ weights $w_{1},\cdots,w_{N+1}$: the estimated value of the $j$-th state is $2w_{j}+w_{N+1}$ for $j\leq N-1$, and that of the $N$-th state is $w_{N}+2w_{N+1}$. In the task, the importance sampling ratio for the dashed action is $(N-1)/N$, while the ratio for the solid action is $N$. Thus, comparing different state sizes illustrates the impact of the importance sampling ratios on these algorithms. The initial weight values are $(1,\cdots,1,10,1)^{\top}$. A constant step size $\alpha$ is used in this task, and we set $\gamma=0.99$. For TDC and GTD2, we tune $\zeta\in\{4^{-2},4^{-1},1,4^{2},4^{3}\}$; for TDC we additionally tune $\alpha$ over a wider range. For Gradient-DD, we tune $\kappa\in\{4^{-1},1,4\}$. We tune $\alpha$ separately for each algorithm by minimizing the average error over the final 100 episodes.
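The value parameterization above can be written as a feature matrix; a minimal sketch (our code, not the paper's) is:

import numpy as np

def baird_features(N):
    """Feature matrix for the extended Baird counterexample with N states and
    N+1 weights: V(s_j) = 2 w_j + w_{N+1} for j <= N-1, V(s_N) = w_N + 2 w_{N+1}."""
    X = np.zeros((N, N + 1))
    for j in range(N - 1):           # states 1,...,N-1
        X[j, j] = 2.0
        X[j, N] = 1.0
    X[N - 1, N - 1] = 1.0            # state N
    X[N - 1, N] = 2.0
    return X

N = 7
X = baird_features(N)
w0 = np.concatenate([np.ones(N - 1), [10.0, 1.0]])   # initial weights (1,...,1,10,1)
initial_values = X @ w0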

Fig. 4 demonstrates that Gradient-DD works better on this counterexample than GTD2, TDC, and TDRC. It is worth observing that when the state size is 20, TDRC becomes unstable, suggesting that a serious imbalance of the importance sampling ratios may cause TDRC to become unstable. We also note that, because the linear approximation leaves a residual error in the value estimation due to the projection error, the RMS errors of GTD2, TDC, and TDRC in this task do not go to zero. In contrast, the errors of Gradient-DD converge to zero.

(a) Illustration of the extended Baird’s off-policy counterexample. The “solid” action usually goes to the $N$-th state, and the “dashed” action usually goes to one of the other $N-1$ states, each with equal probability.
(b) The performance of various algorithms.
Figure 4: Baird’s off-policy counterexample. Upper in (b): performance as a function of $\alpha$; Lower in (b): performance over episodes. From left to right in (b): 7 states and 20 states.

6 Conclusion and discussion

In this work, we have proposed Gradient-DD learning, a new gradient descent-based TD learning algorithm. The algorithm is based on a modification of the projected Bellman error objective function for value function approximation by introducing a second-order difference term. The algorithm significantly improves upon existing methods for gradient-based TD learning, obtaining better convergence performance than conventional linear TD learning.

Since GTD learning was originally proposed, the Gradient-TD family of algorithms has been extended to incorporate eligibility traces and learning optimal policies [10, 11], as well as to applications with neural networks [12]. Additionally, many variants of the vanilla Gradient-TD methods have been proposed, including HTD [25] and Proximal Gradient-TD [26]. Because Gradient-DD simply modifies the objective of GTD2 by adding a squared-bias term, it may be extended and combined with these other methods, potentially broadening its utility for more complicated tasks.

One potential limitation of our method is that it introduces an additional hyperparameter relative to similar gradient-based algorithms, which increases the computational requirements for hyperparameter optimization. This is somewhat mitigated by our finding that the algorithm’s performance is not particularly sensitive to values of $\kappa$, and that $\kappa\sim 1$ was found to be a good choice for the range of environments that we considered. Another potential limitation is that we have focused on value function prediction in the two simple cases of tabular representations and linear approximation, which has enabled us to provide convergence guarantees, but not yet for the case of nonlinear function approximation. An especially interesting direction for future study will be the application of Gradient-DD learning to tasks requiring more complex representations, including neural network implementations. Such approaches are especially useful in cases where state spaces are large, and indeed we have found in our results that Gradient-DD seems to confer the greatest advantage over other methods in such cases. Intuitively, we expect that this is because the difference between the optimal update direction and that chosen by gradient descent becomes greater in higher-dimensional spaces (cf. Fig. 1). This performance benefit in large state spaces suggests that Gradient-DD may be of practical use for these more challenging cases.

Acknowledgments and Disclosure of Funding

This work was partially supported by the National Natural Science Foundation of China (No.11871459) and by the Shanghai Municipal Science and Technology Major Project (No.2018SHZDZX01). Support for this work was also provided by the National Science Foundation NeuroNex program (DBI-1707398) and the Gatsby Charitable Foundation.


References

  • [1] R.S. Sutton, H.R. Maei, D. Precup, S. Bhatnagar, D. Silver, Cs. Szepesvári, and E. Wiewiora. Fast gradient-descent methods for temporal-difference learning with linear function approximation. In Proceedings of the 26th International Conference on Machine Learning, 2009.
  • [2] L. C. Baird. Residual algorithms: Reinforcement learning with function approximation. In Proceedings of the 12th International Conference on Machine Learning, pages 30–37, 1995.
  • [3] Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. The MIT Press, second edition, 2018.
  • [4] D. Precup, R. S. Sutton, and S. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of the 17th International Conference on Machine Learning, 2000.
  • [5] A. R. Mahmood, H. van Hasselt, and R. S. Sutton. Weighted importance sampling for off-policy learning with linear function approximation. In Advances in Neural Information Processing Systems 27, 2014.
  • [6] R. S. Sutton, Cs. Szepesvári, and H. R. Maei. A convergent O(n) algorithm for off-policy temporal difference learning with linear function approximation. In Advances in Neural Information Processing Systems 21, 2009.
  • [7] S. Ghiassian, A. Patterson, S. Garg, D. Gupta, A. White, and M. White. Gradient temporal-difference learning with regularized corrections. In International Conference on Machine Learning, 2020.
  • [8] A. White and M. White. Investigating practical linear temporal difference learning. In International Conference on Autonomous Agents and Multi-Agent Systems, 2016.
  • [9] S. Ghiassian, A. Patterson, M. White, R.S. Sutton, and A. White. Online off-policy prediction. arXiv:1811.02597, 2018.
  • [10] H.R. Maei and R.S. Sutton. GQ(λ)\text{GQ}(\lambda): A general gradient algorithm for temporal-difference prediction learning with eligibility traces. In Proceedings of the 3rd Conference on Artificial General Intelligence, pages 91–96, 2010.
  • [11] M. Geist and B. Scherrer. Off-policy learning with eligibility traces: A survey. Journal of Machine Learning Research, 15:289–333, 2014.
  • [12] H.R. Maei. Gradient temporal-difference learning algorithms. PhD thesis, University of Alberta, Edmonton, 2011.
  • [13] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Finite-sample analysis of proximal gradient TD algorithms. In Proceedings of the 31st International Conference on Uncertainty in Artificial Intelligence, pages 504–513, 2015.
  • [14] S. S. Du, J. Chen, L. Li, L. Xiao, and D. Zhou. Stochastic variance reduction methods for policy evaluation. In Proceedings of the 34th International Conference on Machine Learning, 2017.
  • [15] B. Liu, S. Mahadevan, and J. Liu. Regularized off-policy TD-learning. In Advances in Neural Information Processing Systems, 2012.
  • [16] John Schulman, Sergey Levine, Philipp Moritz, Michael I. Jordan, and Pieter Abbeel. Trust region policy optimization. In Proceedings of the 32nd International Conference on Machine Learning, 2015.
  • [17] J. Peters, K. Mülling, and Y. Altün. Relative entropy policy search. In AAAI Conference on Artificial Intelligence, 2010.
  • [18] S. Amari. Natural gradient works efficiently in learning. Neural Computation, (10):251–276, 1998.
  • [19] S. Bhatnagar, R.S. Sutton, M. Ghavamzadeh, and M. Lee. Natural actor-critic algorithms. Automatica, (73):2471–2482, 2009.
  • [20] T. Degris, P.M. Pilarski, and R.S. Sutton. Model-free reinforcement learning with continuous action in practice. In Proceedings of the 2012 American Control Conference, 2012.
  • [21] W. Dabney and P. Thomas. Natural temporal difference learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2014.
  • [22] P. Thomas, B.C. Silva, C. Dann, and E. Brunskill. Energetic natural gradient descent. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
  • [23] V.S. Borkar and S.P. Meyn. The ODE method for convergence of stochastic approximation and reinforcement learning. SIAM Journal on Control and Optimization, 38(2):447–469, 2000.
  • [24] Justin A. Boyan. Technical update: least-squares temporal difference learning. Machine Learning, 49:233–246, 2002.
  • [25] L. Hackman. Faster gradient-TD algorithms. Master’s thesis, University of Alberta, Edmonton, 2012.
  • [26] B. Liu, J. Liu, M. Ghavamzadeh, S. Mahadevan, and M. Petrik. Proximal gradient temporal difference learning algorithms. In Proceedings of the 25th International Joint Conference on Artificial Intelligence (IJCAI-16), 2016.
  • [27] P. Thomas. GeNGA: A generalization of natural gradient ascent with positive and negative convergence results. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Appendix A Appendix

A.1 On the equivalence of Eqns. (7) & (8)

The Karush-Kuhn-Tucker conditions of Eqn. (8) are the following system of equations:

\begin{cases}\frac{d}{d\mathbf{w}}J(\mathbf{w})+\kappa\frac{d}{d\mathbf{w}}(\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}-\mu)=0;\\ \kappa(\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}-\mu)=0;\\ \|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}\leq\mu;\\ \kappa\geq 0.\end{cases}

These equations are equivalent to

\begin{cases}\frac{d}{d\mathbf{w}}J(\mathbf{w})+\kappa\frac{d}{d\mathbf{w}}\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}=0\text{ and }\kappa>0,&\text{if }\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}=\mu;\\ \frac{d}{d\mathbf{w}}J(\mathbf{w})=0\text{ and }\kappa=0,&\text{if }\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}<\mu.\end{cases}

Thus, for any $\mu>0$, there exists a $\kappa\geq 0$ such that $\frac{d}{d\mathbf{w}}J(\mathbf{w})+\kappa\frac{d}{d\mathbf{w}}\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}=0$.

A.2 The relation to natural gradients

In this section, we shall show that Gradient-DD is related to, but distinct from, the natural gradient. We thank a reviewer for pointing out the connection between Gradient-DD and the natural gradient.

Following [18] or [27], the natural gradient of $J(\mathbf{w})$ is the direction obtained by solving the following optimization:

\lim_{\epsilon\rightarrow 0}\arg\min\limits_{\Delta}J(\mathbf{w}+\epsilon\Delta)\text{ s.t. }\epsilon^{2}\Delta^{\top}\mathbf{X}^{\top}\mathbf{D}\mathbf{X}\Delta\leq\mu. \quad (A.1)

We can note that this corresponds to the ordinary gradient in the case where the metric tensor $\mathbf{X}^{\top}\mathbf{D}\mathbf{X}$ is proportional to the identity matrix.

Now we rewrite the constraint in Eqn. (8) as

\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}=(\mathbf{w}-\mathbf{w}_{n-1})^{\top}\mathbf{X}^{\top}\mathbf{D}\mathbf{X}(\mathbf{w}-\mathbf{w}_{n-1}).

Denote $\epsilon\Delta=\mathbf{w}-\mathbf{w}_{n-1}$, where $\epsilon$ is the radius of the circle of $\mathbf{w}$ around $\mathbf{w}_{n-1}$ and $\Delta$ is a unit vector. Thus, we have

\|\mathbf{V}_{\mathbf{w}}-\mathbf{V}_{\mathbf{w}_{n-1}}\|_{\mathbf{D}}^{2}=\epsilon^{2}\Delta^{\top}\mathbf{X}^{\top}\mathbf{D}\mathbf{X}\Delta.

For the MSPBE objective, we have

J(\mathbf{w})=J(\mathbf{w}_{n-1}+\mathbf{w}-\mathbf{w}_{n-1})=J(\mathbf{w}_{n-1}+\epsilon\Delta).

Minimizing Eqn. (8) is equivalent to the following optimization:

\arg\min\limits_{\Delta}J(\mathbf{w}_{n-1}+\epsilon\Delta)\text{ s.t. }\epsilon^{2}\Delta^{\top}\mathbf{X}^{\top}\mathbf{D}\mathbf{X}\Delta\leq\mu. \quad (A.2)

In the limit as $\epsilon\rightarrow 0$, the above optimization is equivalent to

\arg\min\limits_{\Delta}\Delta^{\top}\nabla J(\mathbf{w}_{n-1})\text{ s.t. }\epsilon^{2}\Delta^{\top}\mathbf{X}^{\top}\mathbf{D}\mathbf{X}\Delta\leq\mu.

Thus, given the metric tensor $\mathbf{G}=\mathbf{X}^{\top}\mathbf{D}\mathbf{X}$, $-\mathbf{G}^{-1}\nabla J(\mathbf{w}_{n-1})$ is the direction of steepest descent, i.e. the natural gradient, of $J(\mathbf{w}_{n-1})$. The natural gradient of $J(\mathbf{w})$, on the other hand, is the direction of steepest descent at $\mathbf{w}$, rather than at $\mathbf{w}_{n-1}$.

Therefore, our Gradient-DD approach makes use of the natural gradient of the objective around $\mathbf{w}_{n-1}$ rather than around $\mathbf{w}_{n}$ as in Eqn. (A.1). This distinguishes the Gradient-DD updates from the updates obtained by directly applying the natural gradient of the objective at $\mathbf{w}$.

A.3 Proof of Theorem 1

We introduce an ODE result on stochastic approximation in the following lemma, then show the convergence of our GDD approach in Theorem 1 by applying this result.

Lemma 1.

(Theorems 2.1 & 2.2 of [23]) Consider the stochastic approximation algorithm described by the $d$-dimensional recursion

\mathbf{y}_{n+1}=\mathbf{y}_{n}+\alpha_{n}[f(\mathbf{y}_{n})+\mathbf{M}_{n+1}].

Suppose the following conditions hold: (c1) The sequence $\{\alpha_{n}\}$ satisfies $0<\alpha_{n}<1$, $\sum_{n=1}^{\infty}\alpha_{n}=\infty$, $\sum_{n=1}^{\infty}\alpha_{n}^{2}<\infty$. (c2) The function $f$ is Lipschitz, and there exists a function $f_{\infty}$ such that $\lim_{r\rightarrow\infty}f_{r}(\mathbf{y})=f_{\infty}(\mathbf{y})$, where the scaled function $f_{r}:\mathbb{R}^{d}\rightarrow\mathbb{R}^{d}$ is given by $f_{r}(\mathbf{y})=f(r\mathbf{y})/r$. (c3) The sequence $\{\mathbf{M}_{n},\mathcal{F}_{n}\}$, with $\mathcal{F}_{n}=\sigma(\mathbf{y}_{i},\mathbf{M}_{i},i\leq n)$, is a martingale difference sequence. (c4) For some $c_{0}<\infty$ and any initial condition $\mathbf{y}_{0}$, $\text{E}(\|\mathbf{M}_{n+1}\|^{2}|\mathcal{F}_{n})\leq c_{0}(1+\|\mathbf{y}_{n}\|^{2})$. (c5) The ODE $\dot{\mathbf{y}}=f_{\infty}(\mathbf{y})$ has the origin as a globally asymptotically stable equilibrium. (c6) The ODE $\dot{\mathbf{y}}(t)=f(\mathbf{y}(t))$ has a unique globally asymptotically stable equilibrium $\mathbf{y}^{*}$. Then (1) under assumptions (c1-c5), $\sup_{n}\|\mathbf{y}_{n}\|<\infty$ in probability; (2) under assumptions (c1-c6), as $n\rightarrow\infty$, $\mathbf{y}_{n}$ converges to $\mathbf{y}^{*}$ with probability 1.

Now we investigate the stochastic gradient descent updates in Eqn. (11), which is recalled as follows:

\boldsymbol{\rho}_{n+1}=\boldsymbol{\rho}_{n}-\kappa\alpha_{n}\mathbf{H}_{n}(\boldsymbol{\rho}_{n}-\boldsymbol{\rho}_{n-1})+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}). \quad (A.3)

The iterative process in Eqn. (A.3) can be rewritten as

(\boldsymbol{\rho}_{n+1}-\boldsymbol{\rho}_{n})=-\kappa\alpha_{n}\mathbf{H}_{n}(\boldsymbol{\rho}_{n}-\boldsymbol{\rho}_{n-1})+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}). \quad (A.4)

Defining

\mathbf{u}_{n+1}=\boldsymbol{\rho}_{n+1}-\boldsymbol{\rho}_{n},

Eqn. (A.4) becomes

\mathbf{u}_{n+1}=-\kappa\alpha_{n}\mathbf{H}_{n}\mathbf{u}_{n}+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}).

Thus, the iterative process in Eqn. (A.3) is rewritten as two parallel processes that are given by

\boldsymbol{\rho}_{n+1}=\boldsymbol{\rho}_{n}-\kappa\alpha_{n}\mathbf{H}_{n}\mathbf{u}_{n}+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}), \quad (A.5)
\mathbf{u}_{n+1}=-\kappa\alpha_{n}\mathbf{H}_{n}\mathbf{u}_{n}+\alpha_{n}(\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}). \quad (A.6)

Our proof has three steps. First, we show that $\sup_{n}\|\boldsymbol{\rho}_{n}\|$ is bounded by applying the ordinary differential equation approach to stochastic approximation (the first result of Lemma 1) to the recursion in Eqn. (A.5). Second, based on this result, we show that $\mathbf{u}_{n}$ goes to 0 in probability by analyzing the recursion in Eqn. (A.6). Finally, using the result that $\mathbf{u}_{n}$ goes to 0 in probability, we apply the second result of Lemma 1 to the recursion in Eqn. (A.5) to show that $\boldsymbol{\rho}_{n}$ goes to the TD fixed point, which is given by the solution of $\mathbf{G}\boldsymbol{\rho}+\mathbf{g}=0$.

First, we shall show that $\sup_{n}\|\boldsymbol{\rho}_{n}\|$ is bounded. Eqn. (A.5) is rewritten as

\boldsymbol{\rho}_{n+1}=\boldsymbol{\rho}_{n}+\alpha_{n}(f(\boldsymbol{\rho}_{n})+\mathbf{M}_{n+1}), \quad (A.7)

where $f(\boldsymbol{\rho}_{n})=(\mathbf{G}\boldsymbol{\rho}_{n}+\mathbf{g})-\kappa\mathbf{H}\mathbf{u}_{n}$ and $\mathbf{M}_{n+1}=((\mathbf{G}_{n}-\mathbf{G})\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}-\mathbf{g})-\kappa(\mathbf{H}_{n}-\mathbf{H})\mathbf{u}_{n}$. Let $\mathcal{F}_{n}=\sigma(\mathbf{u}_{0},\boldsymbol{\rho}_{0},\mathbf{M}_{0},\mathbf{u}_{1},\boldsymbol{\rho}_{1},\mathbf{M}_{1},\cdots,\mathbf{u}_{n},\boldsymbol{\rho}_{n},\mathbf{M}_{n})$ be the $\sigma$-field generated by the quantities $\mathbf{u}_{i},\boldsymbol{\rho}_{i},\mathbf{M}_{i}$, $i\leq n$.

Now we verify conditions (c1-c5) of Lemma 1. Condition (c1) is satisfied under the assumption on the step sizes. Clearly, $f(\boldsymbol{\rho})$ is Lipschitz and $f_{\infty}(\boldsymbol{\rho})=\mathbf{G}\boldsymbol{\rho}$, meaning Condition (c2) is satisfied. Condition (c3) is satisfied by noting that $(\mathbf{M}_{n},\mathcal{F}_{n})$ is a martingale difference sequence, i.e., $\text{E}(\mathbf{M}_{n+1}|\mathcal{F}_{n})=0$.

We next investigate $\text{E}(\|\mathbf{M}_{n+1}\|^{2}|\mathcal{F}_{n})$. From the triangle inequality, we have that

\|\mathbf{M}_{n+1}\|^{2}\leq 2\|\mathbf{G}_{n}-\mathbf{G}\|^{2}\|\boldsymbol{\rho}_{n}\|^{2}+2\|\kappa(\mathbf{H}_{n}-\mathbf{H})\|^{2}\|\mathbf{u}_{n}\|^{2}. \quad (A.8)

From Assumption (A3) in Theorem 1, $\|\mathbf{u}_{n}\|$ is bounded, and from Assumption (A1), $(\mathbf{x}_{n},r_{n},\mathbf{x}_{n+1})$ is an i.i.d. sequence with uniformly bounded second moments; hence there exists some constant $c_{0}$ such that

\|\mathbf{G}_{n}-\mathbf{G}\|^{2}\leq c_{0}/2,\ \text{ and }\ \|\kappa(\mathbf{H}_{n}-\mathbf{H})\|^{2}\|\mathbf{u}_{n}\|^{2}\leq c_{0}/2.

Thus, Condition (c4) is satisfied.

Note that 𝐆\mathbf{G} is defined here with opposite sign relative to 𝐆\mathbf{G} in [Maei:11]. From [SSM09] and [Maei:11], the eigenvalues of the matrix 𝐆-\mathbf{G} are strictly negative under the Assumption A2. Therefore, Condition (c5) is satisfied. Thus, applying the 1st part of Lemma 1 shows that supn𝝆n\sup_{n}\|\boldsymbol{\rho}_{n}\| is bounded in probability.

Second, we investigate the recursion in Eqn. (A.6). Let $\mathbf{y}_{n+1}=\mathbf{G}_{n}\boldsymbol{\rho}_{n}+\mathbf{g}_{n+1}$. Then

\mathbf{u}_{n+1} =\alpha_{n}[-\kappa\mathbf{H}_{n}\mathbf{u}_{n}+\mathbf{y}_{n+1}]
=\alpha_{n}\mathbf{y}_{n+1}+\alpha_{n}\alpha_{n-1}(-\kappa\mathbf{H}_{n})\mathbf{y}_{n}+\alpha_{n}\alpha_{n-1}\alpha_{n-2}(-\kappa\mathbf{H}_{n})(-\kappa\mathbf{H}_{n-1})\mathbf{y}_{n-1}
\quad+\cdots+\alpha_{n}\prod_{k=0}^{n-1}\alpha_{k}(-\kappa\mathbf{H}_{k+1})\mathbf{y}_{1}+\prod_{k=0}^{n}\alpha_{k}(-\kappa\mathbf{H}_{k})\mathbf{u}_{0}. \quad (A.9)

Note that $\|\mathbf{H}_{n}\|\leq 1/\kappa$, since without loss of generality we assume that $\|\mathbf{x}_{n}\|\leq 1/\kappa$, and that there exists a constant $c$ such that $\|\boldsymbol{\rho}_{n}\|\leq c$, due to the above result that $\sup_{n}\|\boldsymbol{\rho}_{n}\|<\infty$ in probability. Eqn. (A.9) then implies that

\|\mathbf{u}_{n+1}\|\leq c\left(\alpha_{n}+\alpha_{n}\alpha_{n-1}+\alpha_{n}\alpha_{n-1}\alpha_{n-2}+\cdots+\prod_{k=0}^{n}\alpha_{k}\right)+\prod_{k=0}^{n}\alpha_{k}\|\mathbf{u}_{0}\|. \quad (A.10)

Under Assumption (A0), $\alpha_{n}\rightarrow 0$ as $n\rightarrow\infty$. Based on this, Lemma 2 (given in the following section) tells us that $\alpha_{n}+\alpha_{n}\alpha_{n-1}+\alpha_{n}\alpha_{n-1}\alpha_{n-2}+\cdots+\prod_{k=0}^{n}\alpha_{k}\rightarrow 0$ as $n\rightarrow\infty$. Thus, Eqn. (A.10) implies that $\mathbf{u}_{n}\rightarrow 0$ in probability.

Finally, to apply the second part of Lemma 1, we only need to verify Condition (c6). Because $\mathbf{u}_{n}$ goes to 0 with probability 1, Eqn. (A.7) tells us that the associated ODE corresponds to

\mathbf{G}\boldsymbol{\rho}+\mathbf{g}=0.

Thus, Condition (c6) is satisfied. The theorem is proved.

A.4 A Lemma

Lemma 2.

Denote $\epsilon_{n}=\alpha_{n}+\alpha_{n}\alpha_{n-1}+\cdots+\alpha_{n}\alpha_{n-1}\cdots\alpha_{0}$. If $\alpha_{n}\rightarrow 0$ as $n\rightarrow\infty$, then $\epsilon_{n}\rightarrow 0$ as $n\rightarrow\infty$.

Proof.

Because $\alpha_{n}\rightarrow 0$ as $n\rightarrow\infty$, there exists $\alpha\in(0,1)$ and some integer $N$ such that $\alpha_{n}\leq\alpha<1$ when $n\geq N$. Define a sequence $\varepsilon_{n}$ such that

\varepsilon_{n}=1+\alpha\varepsilon_{n-1}\ \text{ for }n\geq N+1;\qquad\varepsilon_{N}=\epsilon_{N}.

Obviously,

\epsilon_{n}\leq\varepsilon_{n},\ \forall n\geq N. \quad (A.11)

Now we investigate the sequence $\varepsilon_{n}$:

\varepsilon_{n}=1+\alpha\varepsilon_{n-1}=1+\alpha(1+\alpha\varepsilon_{n-2})=\cdots=1+\alpha+\cdots+\alpha^{n-N-1}+\alpha^{n-N}\varepsilon_{N}\leq\sum\nolimits_{k=0}^{\infty}\alpha^{k}+\alpha^{n-N}\varepsilon_{N}=1/(1-\alpha)+\alpha^{n-N}\varepsilon_{N}.

Thus, we have that

\sup_{n\geq N}\varepsilon_{n}<\infty. \quad (A.12)

From Eqns. (A.11) & (A.12), we have

\sup_{n\geq 0}\epsilon_{n}<\infty. \quad (A.13)

From the definition of $\epsilon_{n}$, we have that $\epsilon_{n}=\alpha_{n}+\alpha_{n}\epsilon_{n-1}$. It follows that

\alpha_{n}=\frac{\epsilon_{n}}{1+\epsilon_{n-1}}\geq\frac{\epsilon_{n}}{1+\sup_{k\geq 0}\epsilon_{k}}.

From the assumption that $\alpha_{n}\rightarrow 0$ as $n\rightarrow\infty$ and Eqn. (A.13), we have $\epsilon_{n}\rightarrow 0$ as $n\rightarrow\infty$. ∎

A.5 Additional empirical results

Figure 5: The random walk task with tabular representation. The setting is similar to Fig. 2, but the performance is evaluated by the average error over all episodes, and $\alpha$ is tuned by minimizing the average error over all episodes. Upper: performance as a function of $\alpha$; Lower: performance over episodes. From left to right: state space size 10, 20, or 40.
Figure 6: Performance of Gradient-DD in the random walk task with tabular representation and $\kappa\in\{0.25,0.5,1,2,4\}$. From left to right: state space size 10, 20, or 40. In each panel, $\alpha$ is tuned for each algorithm by minimizing the average error of the last 100 episodes. Results are averaged over 50 runs, with error bars denoting the standard error of the mean.
Figure 7: The random walk task with tabular representation. Performance for various $\beta_{n}=\zeta\alpha_{n}$, with $\zeta\in\{4^{-3},4^{-2},4^{-1},1,4\}$. From left to right: the size of the state space is $m=10$, $m=20$, and $m=40$. In each case $\alpha$ is tuned separately for each algorithm by minimizing the average error of the last 100 episodes. Results are averaged over 50 runs, with error bars denoting the standard error of the mean.
Figure 8: Performance of natural TDC and natural GTD2 in the random walk task with tabular representation and $m=20$. Performance for various $\beta_{n}=\zeta\alpha_{n}$, with $\zeta\in\{4^{-3},4^{-2},4^{-1},1,4\}$. In each case $\alpha$ is tuned separately for each algorithm by minimizing the average error of the last 100 episodes. “NGTD” and “NTDC” denote the natural-gradient versions of GTD2 and TDC, respectively. Results are averaged over 50 runs, with error bars denoting the standard error of the mean.
Figure 9: The random walk task with tabular representation and constant step size. The state space size is 20. The curves are averaged over 50 runs, with error bars denoting the standard error of the mean, though most are vanishingly small.
Figure 10: The random walk task with linear value-function approximation, where $\alpha_{n}=\alpha(10^{3}+1)/(10^{3}+n)$. The state space size is $m=20$, and each state is represented by a $p$-dimensional feature vector with $p=5$. The $p$-dimensional representation for every fourth state from the start is $[1,0,\cdots,0]$ for state $s_{1}$, $[0,1,0,\cdots,0]$ for $s_{5}$, $\cdots$, and $[0,0,\cdots,0,1]$ for state $s_{4p+1}$. The representations for the intermediate states are obtained by linearly interpolating between these. The sequential states $S_{1},\cdots,S_{m}$ are obtained by using the first $m$ states above. Left: performance as a function of $\alpha$; Right: performance over episodes. The curves are averaged over 50 runs.
Figure 11: The Boyan-chain task with linear approximation. The setting is similar to Fig. 3, but the performance is evaluated by the average error over all episodes, and $\alpha$ is tuned by minimizing the average error over all episodes. Upper: performance as a function of $\alpha$; Lower: performance over episodes.
Figure 12: Performance of Gradient-DD in the Boyan-chain task with $\kappa\in\{2^{-2},2^{-1},1,2,2^{2}\}$ and tapered $\alpha_{n}$. “GDD($\kappa$)” denotes Gradient-DD with regularization parameter $\kappa$.