
Off-Policy Actor-Critic with Emphatic Weightings

Eric Graves
Ehsan Imani
Raksha Kumaraswamy
Martha White
\addrReinforcement Learning and Artificial Intelligence Laboratory
Department of Computing Science, University of Alberta
Edmonton, Alberta, Canada T6G 2E8
Abstract

A variety of theoretically-sound policy gradient algorithms exist for the on-policy setting due to the policy gradient theorem, which provides a simplified form for the gradient. The off-policy setting, however, has been less clear due to the existence of multiple objectives and the lack of an explicit off-policy policy gradient theorem. In this work, we unify these objectives into one off-policy objective, and provide a policy gradient theorem for this unified objective. The derivation involves emphatic weightings and interest functions. We show multiple strategies to approximate the gradients, in an algorithm called Actor Critic with Emphatic weightings (ACE). We prove in a counterexample that previous (semi-gradient) off-policy actor-critic methods—particularly Off-Policy Actor-Critic (OffPAC) and Deterministic Policy Gradient (DPG)—converge to the wrong solution whereas ACE finds the optimal solution. We also highlight why these semi-gradient approaches can still perform well in practice, suggesting strategies for variance reduction in ACE. We empirically study several variants of ACE on two classic control environments and an image-based environment designed to illustrate the tradeoffs made by each gradient approximation. We find that by approximating the emphatic weightings directly, ACE performs as well as or better than OffPAC in all settings tested.

Keywords: off-policy learning, policy gradient, actor-critic, reinforcement learning

1 Introduction

Policy gradient methods are a general class of algorithms for learning optimal policies for both the on and off-policy settings. In policy gradient methods, a parameterized policy is improved using gradient ascent (Williams, 1992), with seminal work in actor-critic algorithms (Witten, 1977; Barto et al., 1983) and many techniques since proposed to reduce variance of the estimates of this gradient (Konda and Tsitsiklis, 2000; Weaver and Tao, 2001; Greensmith et al., 2004; Peters et al., 2005; Bhatnagar et al., 2007, 2009; Grondman et al., 2012; Gu et al., 2017b). These algorithms rely on a fundamental theoretical result: the policy gradient theorem. This theorem (Sutton et al., 1999a; Marbach and Tsitsiklis, 2001) simplifies estimation of the gradient, which would otherwise require difficult-to-estimate gradients with respect to the stationary distribution of the policy and potentially of the action-values.

These gradients can be sampled on-policy or off-policy. On-policy methods are limited to learning about the agent’s current policy: the policy must be executed to obtain a sample of the gradient. Conversely, in off-policy learning the agent can learn about many policies that are different from the policy being executed. Sampling these gradients is most straightforward in the on-policy setting, and so most policy gradient methods are on-policy.

Off-policy methods, however, are critical in an online setting, where an agent generates a single stream of interaction with its environment. Off-policy methods can learn from data generated by older versions of a policy, known as experience replay, a critical factor in the recent success of deep reinforcement learning (Lin, 1992; Mnih et al., 2015; Schaul et al., 2016). They also enable learning from other forms of suboptimal data, including data generated by human demonstration, non-learning controllers, and even random behaviour. Off-policy methods also enable learning about the optimal policy while executing an exploratory policy (Watkins and Dayan, 1992), thereby addressing the exploration-exploitation tradeoff. Finally, off-policy methods allow an agent to learn about many different policies at once, forming the basis for a predictive understanding of an agent’s environment (Sutton et al., 2011; White, 2015) and enabling the learning of options (Sutton et al., 1999b; Precup, 2000; Klissarov and Precup, 2021). With options, an agent can determine optimal (short) behaviours from its current state.

Off-policy policy gradient methods have been developed, particularly in recent years, where the need for data efficiency and decorrelated samples in deep reinforcement learning requires the use of experience replay and hence off-policy learning. This work began with the Off-Policy Actor-Critic algorithm (OffPAC) (Degris et al., 2012a), where an off-policy policy gradient theorem was provided that parallels the on-policy policy gradient theorem, but only for tabular policy representations. (See the errata (Appendix B) in Degris et al. (2012b), which notes that the theorem only applies to tabular policy representations.) This motivated further development, including a recent actor-critic algorithm proven to converge when the critic uses linear function approximation (Maei, 2018), as well as several methods using the approximate off-policy gradient such as Deterministic Policy Gradient (DPG) (Silver et al., 2014; Lillicrap et al., 2015), Actor-Critic with Experience Replay (ACER) (Wang et al., 2016), and Interpolated Policy Gradient (IPG) (Gu et al., 2017a). However, it remained an open question whether the foundational theorem that underlies these algorithms can be generalized beyond tabular representations.

This question was resolved with the development of an off-policy policy gradient theorem, for general policy parametrization (Imani et al., 2018). The key insight is that the gradient can be simplified if the gradient in each state is weighted with an emphatic weighting. This result, combined with previous methods for incrementally estimating emphatic weightings (Yu, 2015; Sutton et al., 2016), allowed for the design of a new off-policy actor-critic algorithm, called Actor-Critic with Emphatic Weightings (ACE). Afterwards, an algorithm was proposed that directly estimates the emphatic weightings using a gradient temporal difference update, with an associated proof of convergence using the standard two-timescale analysis (Zhang et al., 2020).

However, ACE and the underlying off-policy policy gradient theorem do not obviously resolve the dilemma of computing the off-policy gradient, because ACE—and previously OffPAC—were introduced under what is called the excursions objective. This objective weights states by the visitation distribution of the behaviour policy. This is sensible in the parallel off-policy setting, where many policies are learned in parallel from one stream of experience. The agent reasons about the outcomes of those policies when taking excursions from its current behaviour. In the episodic setting, however, weighting the states by the visitation distribution of the behaviour policy is not appropriate. Instead, an ideal episodic objective should be weighted by the start-state distribution, and this objective should be optimized from off-policy data without changing this weighting.

Potentially because of this mismatch, most recent methods have pursued strategies that directly correct the state weightings, to obtain gradients of either the episodic objective (Liu et al., 2020b) or a finite-horizon objective (Kallus and Uehara, 2020). (A notable exception is an approach that uses a nonparametric Bellman equation, essentially using nonparametric estimates of the environment model (Tosatto et al., 2020); this allows direct estimation of the gradient by taking derivatives through the nonparametric Bellman equation.) This approach involves estimating the distribution of state visitation under the behaviour policy and the discounted state visitation under the target policy to obtain an importance sampling ratio to reweight the updates. One disadvantage of this approach is that the state visitation distribution must be estimated, and these estimates can be extremely biased—both by the initialization and by the limitations of the parametric function class used for the estimates.

In this work, we revisit estimating the off-policy gradient with emphatic weightings. First, we propose a unifying objective for the episodic setting and the excursions objective. This objective facilitates the design of a single, unifying off-policy algorithm pertinent to both settings. Then, we extend the off-policy policy gradient theorem both to this objective and to objectives that incorporate entropy regularization. We further show that the on-policy episodic objective can also be seen as a special case, through the use of interest functions to specify interest in the set of start states. We then highlight the difference from previous off-policy approaches, including OffPAC and DPG, which use a biased gradient that we call the semi-gradient because it omits a component of the gradient. We prove that in a simple three-state counterexample, with two states aliased, the stationary point under semi-gradients is suboptimal, unlike ACE, which converges to the optimal solution.

Though such state aliasing is not pathological, it does seem to contradict the success of methods built on OffPAC in the literature. We show that, under a condition called the strong growth condition (Schmidt and Roux, 2013), which is satisfied when there is sufficient representational capacity, the semi-gradient solution is equivalent to that of ACE. Further, we highlight that even outside this setting, in some cases the semi-gradient can be seen as using the sound update underlying ACE, but with a different state weighting. These two insights help explain why OffPAC has been surprisingly effective and suggest promising directions for a lower-variance version of ACE.

Finally, we discuss several improvements to the algorithm, including using direct estimates of the emphatic weightings to reduce variance, incorporating experience replay, and balancing between the higher-variance full gradient update and the lower-variance semi-gradient update. We empirically investigate variants of ACE in several benchmark environments, and find that those with direct estimates of the emphatic weighting consistently perform as well as or better than OffPAC.

Remark on Contributions: This paper builds on our conference paper (Imani et al., 2018). The conference paper presented a policy gradient theorem for one off-policy objective function, and a counterexample showing that OffPAC and DPG can converge to the wrong solution, with experiments limited to the simple three-state counterexample.

An important limitation of that work is that it only applies to the excursions objective, which does not obviously relate to the standard policy gradient objective. Further, the work primarily investigated the true gradient, and did not provide much insight into how to practically use the algorithm. This paper builds on the conference paper in several substantive ways.

  1.

    We first clarify the confusion surrounding objectives by providing a more general objective that includes both the standard objective and the excursions objective. We provide all the derivations for this more general objective (their Theorems 1 and 2 and Proposition 1, which are our Theorems 2 and 3 and Proposition 6). The result is nearly identical, but reiterated with the more general definition, and slightly different terminology. We also highlight that we can recover a sound on-policy algorithm using interest functions.

  2.

    We prove that the semi-gradient from OffPAC has suboptimal solutions on the counterexample. The conference paper only showed this empirically. To obtain this result, we needed to extend the derivation to include entropy regularization, so that the optimal policy is stochastic and its parameters are bounded. Then we proved that the stationary points for OffPAC on the counterexample do not include the optimal point.

  3.

    We provide insight into why semi-gradient methods can perform well, by showing that in some cases the weighting does not affect the solution and that locally we can characterize the semi-gradient as a gradient update with a different state weighting.

  4.

    We provide several algorithmic improvements to ACE, particularly by directly estimating the emphatic weightings to give a lower variance update. In our experiments, this algorithm outperforms or performs comparably to all others, including OffPAC.

  5.

    We conduct a more thorough empirical investigation of ACE in more settings, including two classic benchmarks and a new partially-observable image-based domain. The experiments are aimed at deeply investigating the role of different components, including (a) the objective used for learning versus evaluation, (b) the algorithm for estimating the emphatic weightings, and (c) the algorithm used for the critic.

2 Problem Formulation

Throughout the paper, we use bolded symbols to represent vectors and matrices, and uppercase italicized symbols to represent random variables. We consider a Markov decision process $(\mathcal{S}, \mathcal{A}, \text{P}, r)$, where $\mathcal{S}$ denotes the set of states, $\mathcal{A}$ denotes the set of actions, $\text{P}: \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$ denotes the one-step state transition dynamics, and $r: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ denotes the transition-based reward function. At each timestep $t = 1, 2, \ldots$, the agent selects an action $A_t$ according to its behaviour policy $\mu$, where $\mu: \mathcal{S} \to \Delta(\mathcal{A})$. The environment responds by transitioning into a new state $S_{t+1}$ according to $\text{P}$, and emits a scalar reward $R_{t+1} = r(S_t, A_t, S_{t+1})$.

The discounted sum of future rewards, given that actions are selected according to some target policy $\pi$, is called the return and is defined as:

G_t \overset{\text{def}}{=} R_{t+1} + \gamma_{t+1} R_{t+2} + \gamma_{t+1}\gamma_{t+2} R_{t+3} + \ldots \qquad (1)
\phantom{G_t} = R_{t+1} + \gamma_{t+1} G_{t+1}.

We use transition-based discounting $\gamma: \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, as it facilitates specifying different tasks (termination) on top of the same MDP (White, 2017). This generality is useful in our setting, where we may be interested in learning multiple (option) policies, off-policy. The state value function for $\pi$ and $\gamma$ is defined as

v_{\pi}(s) \overset{\text{def}}{=} \mathbb{E}_{\pi}[G_t \mid S_t = s] \quad \forall s \in \mathcal{S}, \qquad (2)

where under discrete states and actions this corresponds to

v_{\pi}(s) = \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{s' \in \mathcal{S}} \text{P}(s'|s,a)\bigl[r(s,a,s') + \gamma(s,a,s')\, v_{\pi}(s')\bigr] \quad \forall s \in \mathcal{S}

and under continuous states and actions this corresponds to

v_{\pi}(s) = \int_{\mathcal{A}} \pi(a|s) \int_{\mathcal{S}} \text{P}(s'|s,a)\bigl[r(s,a,s') + \gamma(s,a,s')\, v_{\pi}(s')\bigr] \,\mathrm{d}a \,\mathrm{d}s' \quad \forall s \in \mathcal{S}.

When possible, we will opt for the more generic expectation notation, to address both the finite and continuous cases.
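
To make the finite-state Bellman equation concrete, the following is a minimal sketch (ours, not from the paper) that solves it directly for a small, randomly generated MDP with known dynamics; all variable names are illustrative placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
r = rng.normal(size=(n_states, n_actions, n_states))               # r(s, a, s')
gamma = np.full((n_states, n_actions, n_states), 0.9)               # gamma(s, a, s')
pi = rng.dirichlet(np.ones(n_actions), size=n_states)               # pi(a | s)

# Expected one-step reward and discounted transition matrix under pi.
r_pi = np.einsum('sa,sap,sap->s', pi, P, r)
P_pi_gamma = np.einsum('sa,sap,sap->sp', pi, P, gamma)

# The Bellman equation above is linear in v_pi: solve (I - P_pi_gamma) v = r_pi.
v_pi = np.linalg.solve(np.eye(n_states) - P_pi_gamma, r_pi)
print(v_pi)
```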

In off-policy control, the agent's goal is to learn a target policy $\pi$ while following the behaviour policy $\mu$. The target policy $\pi_{\bm{\theta}}$ is a differentiable function of a weight vector $\bm{\theta} \in \mathbb{R}^{n_{\bm{\theta}}}$, $n_{\bm{\theta}} \in \mathbb{N}$. The goal of the agent is to learn a policy that maximizes total reward. Depending on the setting, however, this goal is specified with different objectives, as we discuss in the next section.

Throughout, we assume that the limiting distribution $d_{\mu}(s)$ under $\mu$ exists, where

d_{\mu}(s) \overset{\text{def}}{=} \lim_{t \to \infty} \text{P}(S_t = s \mid s_0, \mu) \qquad (3)

and $\text{P}(S_t = s \mid s_0, \mu)$ is the probability—or density—that $S_t = s$ when starting in state $s_0$ and executing $\mu$. Similarly, $d_{\pi}$ denotes the limiting distribution of states under $\pi$. (This term is sometimes overloaded to mean the discounted state visitation starting from a single start state $s_0$: $\sum_{t=0}^{\infty} \gamma^t \text{P}(S_t = s \mid s_0, \pi)$. This overloading comes from the definition given in the original policy gradient theorem (Sutton et al., 1999a). We only use it to mean state visitation, i.e., the limiting distribution.) Note that this limiting distribution exists in either the episodic or continuing settings (White, 2017; Huang, 2020). One way to see this is that in the episodic setting under transition-based discounting, there are no explicit terminal states. Rather, the agent transitions from the state right before termination back to a start state, as it would for a continuing problem. The transition-based discount truncates returns, to specify that the goal of the agent from each state is to maximize return for the episode.

3 The Weighted-Excursions Objective for Off-Policy Control

In this section we introduce the weighted-excursions objective that unifies two objectives commonly considered in off-policy control. The objective encompasses the standard on-policy episodic objective as well as the excursions objective that allows option policies to be learned off-policy with different termination conditions. Throughout, we will define both the finite state and continuous state versions.

The standard episodic objective is

J_0(\bm{\theta}) \overset{\text{def}}{=} \sum_{s \in \mathcal{S}} d_0(s)\, v_{\pi_{\bm{\theta}}}(s) \qquad \text{ or } \qquad J_0(\bm{\theta}) \overset{\text{def}}{=} \int_{\mathcal{S}} d_0(s)\, v_{\pi_{\bm{\theta}}}(s) \,\mathrm{d}s \qquad (4)

where $d_0 \in \Delta(\mathcal{S})$ is the start-state distribution. This objective can be optimized in either the on-policy or off-policy settings, with different updates to account for the fact that data is gathered according to $\mu$ in the off-policy setting.

Another objective that has been considered in the off-policy setting is the excursions objective (Degris et al., 2012b)

J_{\text{exc}}(\bm{\theta}) \overset{\text{def}}{=} \sum_{s \in \mathcal{S}} d_{\mu}(s)\, v_{\pi_{\bm{\theta}}}(s), \qquad (5)

where the state weighting is determined by the behaviour policy's limiting state distribution $d_{\mu}$. This objective assumes that the target policy $\pi_{\bm{\theta}}$ will be executed as an excursion from the current behaviour—or that the agent will use the target policy in planning under the current state visitation. In the options framework (Sutton et al., 1999b), for example, the agent learns the optimal policies for a variety of different options, each with different termination conditions. The behaviour policy $\mu$ itself could be learning in a continuing environment, even though the objective for the option policy $\pi_{\bm{\theta}}$ is episodic, in that it has termination conditions.

The excursions objective, however, is not necessarily appropriate for learning an alternative optimal policy. For example, consider an episodic setting where a reasonable behaviour policy is currently being used, and we would like to learn an improved policy. The improved policy will be executed from the set of start states, not from the visitation distribution under the behaviour policy. In this setting, we would like to optimize the standard episodic objective, using off-policy data generated by the current behaviour policy.

These two settings can be unified by considering the weighted-excursions objective

J_{\mu}(\bm{\theta}) \overset{\text{def}}{=} \sum_{s \in \mathcal{S}} d_{\mu}(s)\, i(s)\, v_{\pi_{\bm{\theta}}}(s) \qquad \text{ or } \qquad J_{\mu}(\bm{\theta}) \overset{\text{def}}{=} \int_{\mathcal{S}} d_{\mu}(s)\, i(s)\, v_{\pi_{\bm{\theta}}}(s) \,\mathrm{d}s, \qquad (6)

where the weighting $i: \mathcal{S} \to [0, \infty)$ represents the interest in a state (Sutton et al., 2016). This objective reduces to the episodic (start-state) formulation by setting $i(s) = d_0(s)/d_{\mu}(s)$. This choice is not just hypothetical; we will show how our algorithm designed for the excursions objective can also be used for the start-state formulation by using an easy-to-compute setting of the interest $i(s)$ for each $s \in \mathcal{S}$. This generalization goes beyond unifying the previous two objectives, and also naturally allows us to reweight states based on their importance for the target policy. For example, when learning an option, the interest function could be set to 1 in states from which the policy will be executed (i.e., the initiation set), and zero elsewhere.
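
As a quick numerical sanity check on this reduction (illustrative only, with random vectors standing in for $d_{\mu}$, $d_0$, and $v_{\pi_{\bm{\theta}}}$), setting $i(s) = d_0(s)/d_{\mu}(s)$ in Equation (6) recovers Equation (4):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states = 5
d_mu = rng.dirichlet(np.ones(n_states))  # behaviour policy's limiting distribution
d_0 = rng.dirichlet(np.ones(n_states))   # start-state distribution
v_pi = rng.normal(size=n_states)         # stand-in for the target policy's state values

interest = d_0 / d_mu                     # i(s) = d_0(s) / d_mu(s)
J_mu = np.sum(d_mu * interest * v_pi)     # weighted-excursions objective, Equation (6)
J_0 = np.sum(d_0 * v_pi)                  # episodic objective, Equation (4)
assert np.isclose(J_mu, J_0)
```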

More generally, we could consider the generic objective $\sum_{s \in \mathcal{S}} d(s)\, v_{\pi_{\bm{\theta}}}(s)$ for any state weighting $d$. Then the weighted-excursions objective is simply an instance of this more generic objective, with $d(s) = d_{\mu}(s)\, i(s)$. The reason that we opt for the weighted-excursions objective is that it is sufficiently general to incorporate our two settings of interest but sufficiently restricted to facilitate the development of a practical off-policy algorithm. In fact, the idea of interest comes from the same work on emphatic weightings (Sutton et al., 2016) on which we build our algorithm. This work also provides insight into how to easily incorporate such interest functions in the off-policy setting.

Remark: In this work, we focus on episodic objectives—those with termination—but it is worthwhile to contrast this to the objective used for continuing problems. For continuing problems, there is a compelling case for the average reward objective

J_{\text{ave}}(\bm{\theta}) \overset{\text{def}}{=} \sum_{s \in \mathcal{S}} d_{\pi_{\bm{\theta}}}(s)\, r_{\pi_{\bm{\theta}}}(s) \qquad \text{ or } \qquad J_{\text{ave}}(\bm{\theta}) \overset{\text{def}}{=} \int_{\mathcal{S}} d_{\pi_{\bm{\theta}}}(s)\, r_{\pi_{\bm{\theta}}}(s) \,\mathrm{d}s \qquad (7)

for $r_{\pi}(s) = \mathbb{E}_{\pi}[r(s, A, S')]$. Optimizing $J_{\text{ave}}$ is equivalent to optimizing $\sum_{s \in \mathcal{S}} d_{\pi}(s)\, v_{\pi}(s)$ for a constant $0 \le \gamma < 1$ (Sutton and Barto, 2018). Intuitively, the agent seeks to maximize state values while shifting its state distribution towards high-valued states.

A sharp distinction has previously been made between alternative-life objectives and excursions objectives in policy evaluation (Patterson et al., 2022). In that work, alternative-life objectives use the state weighting $d_{\pi}$, which specifies: learn the best value function as if the data had been gathered under the target policy $\pi$. Such a weighting is obtained by using prior corrections—namely, reweighting entire trajectories with products of importance sampling ratios (Precup et al., 2000; Meuleau et al., 2000)—or by estimating $d_{\pi}/d_{\mu}$ (Hallak and Mannor, 2017; Liu et al., 2018; Gelada and Bellemare, 2019). The excursions objective, on the other hand, uses a weighting of $d_{\mu}$. This distinction is not clearly relevant to control. Instead, on-policy average-reward algorithms are typically developed for the continuing objective—which has weighting $d_{\pi}$. The few off-policy algorithms designed for the continuing, average-reward setting directly estimate $d_{\pi}$ and $d_{\mu}$ (Liu et al., 2019). The episodic objective for control does not include $d_{\pi}$, and so is different from the alternative-life objective in policy evaluation.

In the remainder of this paper we develop an algorithm for optimizing the weighted-excursions objective in Equation (6) from off-policy data.

4 An Off-Policy Policy Gradient Theorem using Emphatic Weightings

In this section, we introduce the off-policy policy gradient theorem for the discrete and continuous settings as well as for deterministic policies. We additionally compare this gradient to the gradient underlying OffPAC, and provide a generalized gradient that interpolates between this gradient—called the semi-gradient—and the true gradient.

4.1 The Off-Policy Policy Gradient Theorem

The policy gradient theorem with function approximation in the on-policy setting was a seminal result (Sutton et al., 1999a, Theorem 1). The first attempt to extend the policy gradient theorem to the off-policy excursions objective was limited to the setting where the policy is tabular (Degris et al., 2012b, Theorem 2). (The statement in that paper is stronger, but in an errata published by the authors, they highlight an error in the proof; consequently, the result is only correct for the tabular setting.) In this section, we show that the policy gradient theorem does hold in the off-policy setting for the weighted-excursions objective, for arbitrary smooth policy parametrizations. The resulting gradient resembles the standard policy gradient, but with emphatic weightings to reweight the states.

Definition 1 (Emphatic Weighting)

The emphatic weighting $m: \mathcal{S} \to [0, \infty)$ for a behaviour policy $\mu$ and target policy $\pi$ can be recursively defined as

m(s') \overset{\text{def}}{=} d_{\mu}(s')\, i(s') + \mathbb{E}[\gamma(S, A, S')\, m(S) \mid S' = s'], \qquad (8)

where the distribution over $S, A$ given the next state $S'$ is under $\text{P}$ and $\pi$.

Under finite states and actions, this expectation is $\sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a|s)\, \text{P}(s'|s,a)\, \gamma(s,a,s')\, m(s)$. Under continuous states and actions, it is $\int_{\mathcal{S}} \int_{\mathcal{A}} \pi(a|s)\, \text{P}(s'|s,a)\, \gamma(s,a,s')\, m(s) \,\mathrm{d}a \,\mathrm{d}s$. If a deterministic policy $\pi: \mathcal{S} \to \mathcal{A}$ is used, then it is $\int_{\mathcal{S}} \text{P}(s'|s,\pi(s))\, \gamma(s,\pi(s),s')\, m(s) \,\mathrm{d}s$.

Notice that these emphatic weightings involve states leading into $s'$; bootstrapping is back-in-time, rather than forward. The emphatic weighting reflects the relative importance of a state $s'$, based on its own interest and weighting—the first term $d_{\mu}(s')\, i(s')$—as well as the discounted importance of states that lead into $s'$. For value estimation, this is sensible because it reflects how much the agent bootstraps off of the values in $s'$, and hence how much it relies on accurate estimates in $s'$. For policy gradients, we have a related role: the importance of a state is due both to its own value and to how much other states that lead into it depend on the value obtained from that state.
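
For a small MDP with known dynamics, the emphatic weighting can be computed exactly by solving the linear system implied by Definition 1 (the vector form $\mathbf{m}^{\top} = \mathbf{i}^{\top}(\mathbf{I} - \mathbf{P}_{\pi,\gamma})^{-1}$ used in the proof of Theorem 2 below). The sketch below is illustrative only; all names are placeholders.

```python
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions = 4, 2
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a, s']
gamma = np.full((n_states, n_actions, n_states), 0.9)              # gamma(s, a, s')
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # target policy pi(a|s)
mu = rng.dirichlet(np.ones(n_actions), size=n_states)              # behaviour policy mu(a|s)
interest = np.ones(n_states)                                       # i(s)

# Limiting distribution d_mu: the stationary distribution of the behaviour Markov chain.
P_mu = np.einsum('sa,sap->sp', mu, P)
evals, evecs = np.linalg.eig(P_mu.T)
d_mu = np.real(evecs[:, np.argmax(np.real(evals))])
d_mu /= d_mu.sum()

# Discounted transition matrix under pi, and the emphatic weighting m.
P_pi_gamma = np.einsum('sa,sap,sap->sp', pi, P, gamma)
i_vec = d_mu * interest                                            # entries d_mu(s) i(s)
m = np.linalg.solve(np.eye(n_states) - P_pi_gamma.T, i_vec)

# m satisfies the recursion in Equation (8): m(s') = d_mu(s') i(s') + sum_s P_pi_gamma(s, s') m(s).
assert np.allclose(m, i_vec + P_pi_gamma.T @ m)
```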

Theorem 2 (Off-Policy Policy Gradient Theorem)

For finite states and actions,

\frac{\partial J_{\mu}(\bm{\theta})}{\partial \bm{\theta}} = \sum_{s \in \mathcal{S}} m(s) \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a) = \sum_{s \in \mathcal{S}} m(s)\, g(s, \bm{\theta}) \qquad (9)

where

g(s, \bm{\theta}) \overset{\text{def}}{=} \mathbb{E}_{\pi_{\bm{\theta}}}\!\left[ \frac{\partial \log \pi(A|S; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(S, A) \,\Big|\, S = s \right] \qquad (10)

is the gradient for a given state, rewritten using the log-likelihood ratio. For continuous states, and either continuous or discrete actions,

\frac{\partial J_{\mu}(\bm{\theta})}{\partial \bm{\theta}} = \int_{\mathcal{S}} m(s)\, g(s, \bm{\theta}) \,\mathrm{d}s. \qquad \text{(9 for continuous states)}

Proof  We first start with the finite state setting. The proof relies on the vector form of the emphatic weighting

\mathbf{m}^{\top} \overset{\text{def}}{=} \mathbf{i}^{\top} (\mathbf{I} - \mathbf{P}_{\pi,\gamma})^{-1},

where the vector $\mathbf{i} \in \mathbb{R}^{|\mathcal{S}|}$ has entries $\mathbf{i}(s) \overset{\text{def}}{=} d_{\mu}(s)\, i(s)$ and $\mathbf{P}_{\pi,\gamma} \in \mathbb{R}^{|\mathcal{S}| \times |\mathcal{S}|}$ is the matrix with entries $\mathbf{P}_{\pi,\gamma}(s, s') \overset{\text{def}}{=} \sum_{a \in \mathcal{A}} \pi(a|s; \bm{\theta})\, \text{P}(s'|s,a)\, \gamma(s,a,s')$. First notice that

\frac{\partial J_{\mu}(\bm{\theta})}{\partial \bm{\theta}} = \frac{\partial \sum_{s \in \mathcal{S}} \mathbf{i}(s)\, v_{\pi}(s)}{\partial \bm{\theta}} = \sum_{s \in \mathcal{S}} \mathbf{i}(s) \frac{\partial v_{\pi}(s)}{\partial \bm{\theta}}.

Therefore, to compute the gradient of $J_{\mu}$, we need to compute the gradient of the value function with respect to the policy parameters. A recursive form of the gradient of the value function can be derived, as we show below. Before starting, for simplicity of notation we will use

\mathbf{g}(s) = \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a),

where $\mathbf{g}: \mathcal{S} \to \mathbb{R}^{d}$. Now let us compute the gradient of the value function:

\frac{\partial v_{\pi}(s)}{\partial \bm{\theta}} = \frac{\partial}{\partial \bm{\theta}} \sum_{a} \pi(a|s; \bm{\theta})\, q_{\pi}(s,a)
= \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a) + \sum_{a} \pi(a|s; \bm{\theta}) \frac{\partial q_{\pi}(s,a)}{\partial \bm{\theta}} \qquad (11)
= \mathbf{g}(s) + \sum_{a} \pi(a|s; \bm{\theta}) \frac{\partial \sum_{s'} \text{P}(s'|s,a)\bigl(r(s,a,s') + \gamma(s,a,s')\, v_{\pi}(s')\bigr)}{\partial \bm{\theta}}
= \mathbf{g}(s) + \sum_{a} \pi(a|s; \bm{\theta}) \sum_{s'} \text{P}(s'|s,a)\, \gamma(s,a,s') \frac{\partial v_{\pi}(s')}{\partial \bm{\theta}}.

We can simplify this more easily using vector form. Let $\dot{\mathbf{v}}_{\pi} \in \mathbb{R}^{|\mathcal{S}| \times d}$ be the matrix of gradients (with respect to the policy parameters $\bm{\theta}$) of $v_{\pi}$ for each state $s$, and $\mathbf{G} \in \mathbb{R}^{|\mathcal{S}| \times d}$ the matrix where each row corresponding to state $s$ is the vector $\mathbf{g}(s)$. Then

\dot{\mathbf{v}}_{\pi} = \mathbf{G} + \mathbf{P}_{\pi,\gamma} \dot{\mathbf{v}}_{\pi} \quad \implies \quad \dot{\mathbf{v}}_{\pi} = (\mathbf{I} - \mathbf{P}_{\pi,\gamma})^{-1} \mathbf{G}. \qquad (12)

Therefore, we obtain

\sum_{s \in \mathcal{S}} \mathbf{i}(s) \frac{\partial v_{\pi}(s)}{\partial \bm{\theta}} = \mathbf{i}^{\top} \dot{\mathbf{v}}_{\pi} = \mathbf{i}^{\top} (\mathbf{I} - \mathbf{P}_{\pi,\gamma})^{-1} \mathbf{G}
= \mathbf{m}^{\top} \mathbf{G}
= \sum_{s \in \mathcal{S}} m(s) \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a).

The proof is similar for the continuous case; we include it in Appendix B.  
We additionally prove the deterministic off-policy policy gradient theorem, for a deterministic policy $\pi: \mathcal{S} \to \mathcal{A}$. The objective remains the same, but the space of possible policies is constrained, resulting in a slightly different gradient.

Theorem 3 (Deterministic Off-policy Policy Gradient Theorem)
\frac{\partial J_{\mu}(\bm{\theta})}{\partial \bm{\theta}} = \int_{\mathcal{S}} m(s) \frac{\partial \pi(s; \bm{\theta})}{\partial \bm{\theta}} \left.\frac{\partial q_{\pi}(s,a)}{\partial a}\right|_{a = \pi(s; \bm{\theta})} \mathrm{d}s \qquad (13)

where

m(s') = d_{\mu}(s')\, i(s') + \int_{\mathcal{S}} \text{P}(s'|s, \pi(s; \bm{\theta}))\, \gamma(s, \pi(s; \bm{\theta}), s')\, m(s) \,\mathrm{d}s.

The proof is presented in Appendix A.

4.2 Connection to OffPAC and the Semi-gradient

The off-policy policy gradient above contrasts with the one used by OffPAC

\sum_{s \in \mathcal{S}} d_{\mu}(s)\, \mathbb{E}_{\pi_{\bm{\theta}}}\!\left[ \frac{\partial \log \pi(A|S; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(S, A) \,\Big|\, S = s \right]. \qquad (14)

The only difference is in the state weighting $d_{\mu}$. This small difference, however, is key to getting the correct gradient. In fact, the OffPAC gradient can be considered a semi-gradient, because it drops a part of the gradient when using the product rule:

\nabla_{\bm{\theta}} J_{\text{exc}}(\bm{\theta}) = \sum_{s \in \mathcal{S}} d_{\mu}(s) \nabla_{\bm{\theta}} \sum_{a} \pi_{\bm{\theta}}(a|s)\, q_{\pi}(s,a)
= \sum_{s \in \mathcal{S}} d_{\mu}(s) \sum_{a} \bigl[ q_{\pi}(s,a) \nabla_{\bm{\theta}} \pi_{\bm{\theta}}(a|s) + \pi_{\bm{\theta}}(a|s) \nabla_{\bm{\theta}} q_{\pi}(s,a) \bigr].

The second term, with $\nabla_{\bm{\theta}} q_{\pi}(s,a)$, is not obviously easy to compute, because it requires reasoning about how changes to the policy parameters affect the action-values. OffPAC opted to drop this term, and justify why this approximation was acceptable. The resulting semi-gradient has proven to be surprisingly effective, though we provide several examples in this work where it performs poorly.

Note that using this correct weighting is only critical due to state aliasing. If the policy representation is tabular, for example, the weighting on the gradients does not affect the asymptotic solution as long as it is non-zero. (The weighting on the gradients could still affect the rate of convergence (Laroche and Tachet des Combes, 2021).) The gradient $\sum_{s \in \mathcal{S}} m(s)\, g(\bm{\theta}, s) = 0$ if and only if $g(\bm{\theta}, s) = 0$ for all $s$, because the gradient vector has an independent entry for each state. Therefore, even if the state weighting in the gradient is changed to some other weighting $d$, we still get $\sum_{s \in \mathcal{S}} d(s)\, g(\bm{\theta}, s) = 0$. This is the reason that the semi-gradient underlying OffPAC converges to a stationary point of this objective in the tabular setting. More generally, this result holds for any sufficiently powerful function approximator that can obtain $g(\bm{\theta}, s) = 0$ for all states. In optimization, this property has been termed the strong growth condition (Schmidt and Roux, 2013). We state this simple result about the irrelevance of the weighting formally in the following proposition, both to highlight it and because, to the best of our knowledge, it has not been formally stated.

Proposition 4 (Irrelevance of Weighting under the Strong Growth Condition)

If $g(\bm{\theta}, s) = 0$ for all states, then $\bm{\theta}$ is a stationary point under any state weighting $d: \mathcal{S} \to \mathbb{R}$.

This result states that with a function approximator that can perfectly maximize value in each state, the choice of state weighting in the gradient computation is not relevant. Both the off-policy gradient with emphatic weighting, and the semi-gradient with weighting $d_{\mu}$, can converge to this stationary point.

Alternatively, we can consider the more generic emphatic weighting that lets us sweep between the gradient and the semi-gradient. The emphatic weightings were introduced for temporal difference (TD) learning with eligibility traces (Mahmood et al., 2015). The weightings depend on the eligibility trace parameter, $\lambda$. For $\lambda = 1$, the weightings effectively become 1 everywhere—implicitly resulting in state weightings of $d_{\mu}$—because there is no bootstrapping. For $\lambda = 0$, the weightings are exactly as defined above.

Definition 5 (Generalized Emphatic Weighting)

For $\eta \in [0, 1]$, the generalized emphatic weighting $m_{\eta}: \mathcal{S} \to [0, \infty)$ for a behaviour policy $\mu$ and target policy $\pi$ is defined as

m_{\eta}(s') \overset{\text{def}}{=} (1 - \eta)\, d_{\mu}(s')\, i(s') + \eta\, m(s') \qquad (15)

For $\eta = 1$, $m_{\eta}(s') = m(s')$.

Note that in the original work, $m(s')$ was called the follow-on weighting, and $m_{\eta}(s')$ the emphatic weighting. Because $m_{\eta}$ is a strict generalization of $m$, we simply call $m$ an emphatic weighting as well. (The original emphatic weightings (Sutton et al., 2016) use $\lambda = 1 - \eta$. This is because their emphatic weightings are designed to balance bias introduced from using $\lambda$ for estimating value functions: larger $\lambda$ means the emphatic weighting plays less of a role. For this setting, we want larger $\eta$ to correspond to the full emphatic weighting (the unbiased emphatic weighting), and smaller $\eta$ to correspond to a more biased estimate, to better match the typical meaning of such trace parameters.)

The parameter $\eta$ can be interpreted as trading off bias and variance in the emphatic weightings. Notice that for $\eta = 0$, and assuming an interest of 1 everywhere, we get that $m_{\eta}(s) = d_{\mu}(s)$. Such a choice would give exactly the semi-gradient weighting. As we increase $\eta$, we interpolate between the semi-gradient and the full gradient. For $\eta = 1$, we get $m_{\eta}(s) = m(s)$ and so we have the full gradient. For $\eta = 0$, therefore, we obtain a biased gradient estimate, but the emphatic weightings themselves are easy to estimate—they are myopic estimates of interest—which could significantly reduce variance when estimating the gradient. Selecting $\eta$ between 0 and 1 could provide a reasonable balance, obtaining a nearly unbiased gradient to enable convergence to a valid stationary point but potentially reducing some variability when estimating the emphatic weighting. We will develop our algorithm for the general emphatic weighting, as this will give us more flexibility in the weighting, and will also allow us to incorporate OffPAC as one extreme (biased) version of the approach.

4.3 Summary

In this section we proved the off-policy policy gradient theorem for the weighted-excursions objective, for both stochastic (Theorem 2) and deterministic (Theorem 3) policies. These gradients are similar to the standard policy gradient, but with states weighted by emphatic weightings. We contrasted this with OffPAC, the (unsound) algorithm on which many off-policy policy gradient algorithms are built, highlighting that the only distinction is in this state weighting. We argued that this state weighting does not always affect the solution (Proposition 4), potentially partially explaining why OffPAC often performs well—especially in deep RL with large neural networks. Finally, we showed how we can use generalized emphatic weightings, with a parameter $\eta$ (Definition 5), that allow us to interpolate between the sound, but harder-to-estimate, emphatic weighting (at $\eta = 1$) and the unsound, but easy-to-obtain, weighting used in OffPAC (at $\eta = 0$).

5 Actor-Critic with Emphatic Weightings

In this section, we develop an incremental actor-critic algorithm with emphatic weightings that uses the above off-policy policy gradient theorem. To perform a gradient ascent update on the policy parameters, the goal is to obtain a sample of the gradient in Equation 9. As discussed above, the only difference from the semi-gradient used by OffPAC, in Equation 14, is in the weighting of the states: the true gradient uses $m(s)$ and the semi-gradient uses $d_{\mu}(s)$. Therefore, we can use standard solutions developed for other actor-critic algorithms to obtain a sample of $g(s, \bm{\theta})$. The key difficulty is in estimating $m(s)$ to reweight this gradient.

5.1 Sampling the Gradient from a State

Before addressing this key difficulty, we provide a brief reminder of how to obtain a sample of $g(s, \bm{\theta})$, with more details in Appendix C. The simplest approach to compute the gradient for a given state is to use what is sometimes called the all-actions gradient, $\sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a)$. To avoid summing over all actions, we instead obtain an unbiased sample of the gradient under the action taken by the behaviour policy, $A \sim \mu(\cdot|s)$: $\rho(s, A; \bm{\theta}) \frac{\partial \log \pi(A|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s, A)$, where the importance sampling ratio $\rho(s, a; \bm{\theta}) \overset{\text{def}}{=} \frac{\pi(a|s; \bm{\theta})}{\mu(a|s)}$ corrects the distribution over actions. The estimated gradient has more variance when we only use a sampled action rather than all actions. The standard strategy to reduce this variance is to subtract a baseline, such as an estimate of the value function $v(s)$. The resulting update is

\rho(s, a; \bm{\theta}) \frac{\partial \log \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} \bigl[ q_{\pi}(s,a) - v(s) \bigr].

In practice, it is difficult to obtain $q_{\pi}(s,a)$ to compute this gradient exactly. Instead, we obtain a (likely biased) sample of the advantage, $\delta_t \approx q_{\pi}(S_t, A_t) - v(S_t)$. An unbiased but high-variance estimate uses a sample of the return $G_t$ from $(S_t, A_t)$, with $\delta_t = G_t - v(S_t)$; then $\mathbb{E}[\delta_t | S_t = s, A_t = a] = q_{\pi}(s,a) - v(s)$. On the other extreme, we can directly approximate the action-values, $\hat{q}(s,a)$. In between, we can use $n$-step returns, or $\lambda$-returns, to balance between using observed rewards and value estimates. In this work, we opt for the simplest choice: $n = 1$ with $\delta_t = R_{t+1} + \gamma_{t+1} v(S_{t+1}) - v(S_t)$.

Finally, we need to estimate the values themselves. Any value approximation algorithm can be used to estimate $v$, such as TD, gradient TD, or even emphatic TD. Given that we already compute the emphatic weighting, it is straightforward to use emphatic TD. We investigate both a gradient TD critic and an emphatic TD critic in the experiments.
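
Putting Section 5.1 together, the following illustrative sketch (ours; a tabular softmax actor and a simple importance-weighted TD critic are assumed purely for brevity) computes the one-step TD error $\delta_t$ and the corresponding importance-weighted, baseline-subtracted gradient sample for a single transition.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def gradient_sample(theta, v, s, a, r, s_next, gamma_next, mu_probs, alpha_v=0.1):
    """Sample of g(s, theta) from one off-policy transition (tabular softmax actor)."""
    pi_s = softmax(theta[s])                    # pi(.|s; theta)
    rho = pi_s[a] / mu_probs[a]                 # importance sampling ratio
    delta = r + gamma_next * v[s_next] - v[s]   # n = 1 advantage estimate
    v[s] += alpha_v * rho * delta               # simple importance-weighted TD(0) critic

    # For a tabular softmax policy, grad of log pi(a|s) is one_hot(a) - pi(.|s).
    grad_log_pi = -pi_s
    grad_log_pi[a] += 1.0
    return rho * delta * grad_log_pi            # rho_t * delta_t * grad log pi(A_t|S_t)
```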

5.2 Estimating the Gradient with Emphatic Weightings

The previous section outlined how to sample the gradient for a given state. We now need to ensure that these updates across states are weighted by the emphatic weighting. Intuitively, if we could estimate $m_{\eta}(s)$ and $d_{\mu}(s)$ for all $s$, then we could simply premultiply the update with $m_{\eta}(s)/d_{\mu}(s)$. However, estimating the state distribution is nontrivial, and actually unnecessary. In this section, we first explain how to obtain a Monte Carlo estimate of this pre-multiplier, and then how to directly estimate it with function approximation.

5.2.1 An Unbiased Estimate of the Gradient with Emphatic Weightings

We can rely on the original work introducing the emphatic weightings to obtain a Monte Carlo estimate with the update

M_t = (1 - \eta)\, i_t + \eta F_t \qquad\qquad F_t = \gamma_t \rho_{t-1} F_{t-1} + i_t \qquad (16)

where $F_0 = 0$ and $i_t$ is the interest observed in $S_t$, with $i(s) = \mathbb{E}[i_t | S_t = s]$. Putting this together with the previous section, the gradient can be sampled by (1) sampling states according to $d_{\mu}$, (2) taking actions according to $\mu$, (3) obtaining an estimate $\delta_t$ of the advantage $q_{\pi}(s,a) - v(s)$, and then (4) weighting the update with $M_t$, to get

\bm{\theta} \leftarrow \bm{\theta} + \alpha \rho_t M_t \frac{\partial \log \pi(A_t|S_t; \bm{\theta})}{\partial \bm{\theta}} \delta_t.

Note that the all-actions gradient update with this emphatic weighting would be

\bm{\theta} \leftarrow \bm{\theta} + \alpha M_t \sum_{b \in \mathcal{A}} \pi(b|S_t; \bm{\theta}) \frac{\partial \log \pi(b|S_t; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(S_t, b),

but as before we will assume that we use one sampled action. We prove that our update is an unbiased estimate of the gradient for a fixed policy in Proposition 6. We assume access to an unbiased estimate $\delta_t$ of the advantage for this result, though in practice $\delta_t$ is likely to have some bias due to using approximate values.
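
The sketch below (illustrative only, with a tabular softmax actor and a simple importance-weighted TD critic standing in for the choices discussed in Section 5.1) shows how the emphatic trace in Equation 16 is folded into this update on each step.

```python
import numpy as np

def softmax(prefs):
    e = np.exp(prefs - prefs.max())
    return e / e.sum()

def ace_step(theta, v, F, rho_prev, gamma_t, i_t, s, a, r, s_next, gamma_next,
             mu_probs, eta=1.0, alpha=0.01, alpha_v=0.1):
    """One incremental actor update weighted by the emphatic trace (Equation 16)."""
    F = gamma_t * rho_prev * F + i_t            # F_t = gamma_t * rho_{t-1} * F_{t-1} + i_t
    M = (1.0 - eta) * i_t + eta * F             # M_t = (1 - eta) * i_t + eta * F_t

    pi_s = softmax(theta[s])
    rho = pi_s[a] / mu_probs[a]                 # rho_t
    delta = r + gamma_next * v[s_next] - v[s]   # one-step advantage estimate delta_t
    v[s] += alpha_v * rho * delta               # critic update (any critic could be used)

    grad_log_pi = -pi_s                         # grad log pi(a|s) for a tabular softmax
    grad_log_pi[a] += 1.0
    theta[s] += alpha * rho * M * delta * grad_log_pi
    return F, rho                               # carry F_t and rho_t into the next step
```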

Proposition 6

For a fixed policy $\pi$, with the conditions on the MDP from (Yu, 2015) and assuming unbiased $\delta_t$, namely that $\mathbb{E}[\delta_t | S_t = s, A_t = a] = q_{\pi}(s,a) - v(s)$,

\mathbb{E}_{S_t \sim d_{\mu}, A_t \sim \mu}\bigl[ \rho_t M_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \bigr] = \sum_{s \in \mathcal{S}} m_{\eta}(s) \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a).

Proof  Emphatic traces $M_t$ have been shown to provide an unbiased estimate of the true emphatic weighting for policy evaluation in Emphatic TD. We use the emphatic weighting differently, but can rely on the proof from (Sutton et al., 2016) to ensure that (a) $d_{\mu}(s) \lim_{t \to \infty} \mathbb{E}_{\mu}[M_t | S_t = s] = m_{\eta}(s)$. Note also that, given $S_t$, the next action does not impact the expectation: (b) $\mathbb{E}_{\mu}[M_t | S_t = s] = \mathbb{E}_{\mu}[M_t | S_t = s, A_t]$. Using these equalities, we obtain

\mathbb{E}_{S_t \sim d_{\mu}, A_t \sim \mu}\bigl[ \rho_t M_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \bigr]
= \sum_{s \in \mathcal{S}} d_{\mu}(s)\, \mathbb{E}_{\mu}\bigl[ \rho_t M_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \mid S_t = s \bigr]
= \sum_{s \in \mathcal{S}} d_{\mu}(s)\, \mathbb{E}_{\mu}\Bigl[ \mathbb{E}_{\mu}\bigl[ \rho_t M_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \mid S_t = s, A_t \bigr] \Bigr] \qquad \triangleright\ \text{law of total expectation}
= \sum_{s \in \mathcal{S}} d_{\mu}(s)\, \mathbb{E}_{\mu}\Bigl[ \mathbb{E}_{\mu}[M_t \mid S_t = s, A_t]\ \mathbb{E}_{\mu}\bigl[ \rho_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \mid S_t = s, A_t \bigr] \Bigr]
= \sum_{s \in \mathcal{S}} d_{\mu}(s)\, \mathbb{E}_{\mu}[M_t \mid S_t = s]\, \mathbb{E}_{\mu}\Bigl[ \mathbb{E}_{\mu}\bigl[ \rho_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \mid S_t = s, A_t \bigr] \Bigr] \qquad \triangleright\ \text{using (b)}
= \sum_{s \in \mathcal{S}} m_{\eta}(s)\, \mathbb{E}_{\mu}\bigl[ \rho_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \mid S_t = s \bigr] \qquad \triangleright\ \text{using (a)}
= \sum_{s \in \mathcal{S}} m_{\eta}(s) \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a),

where the last line follows from the fact that

\mathbb{E}_{\mu}\bigl[ \rho_t \delta_t \nabla_{\bm{\theta}} \log \pi(A_t|S_t; \bm{\theta}) \mid S_t = s \bigr]
= \sum_{a} \mu(a|s)\, \rho(s,a)\, \nabla_{\bm{\theta}} \log \pi(a|s; \bm{\theta})\ \mathbb{E}_{\mu}[\delta_t \mid S_t = s, A_t = a]
= \sum_{a} \mu(a|s)\, \rho(s,a)\, \nabla_{\bm{\theta}} \log \pi(a|s; \bm{\theta})\ \bigl[ q_{\pi}(s,a) - v(s) \bigr] \qquad \triangleright\ \text{by assumption}
= \sum_{a} \frac{\partial \pi(a|s; \bm{\theta})}{\partial \bm{\theta}} q_{\pi}(s,a).

 

5.2.2 A Biased Estimate of the Gradient with Emphatic Weightings

For a fixed policy, the emphatic trace provides an unbiased way to reweight the gradient. However, this is no longer true when the target policy is updated online in an actor-critic algorithm like ACE; the emphatic trace will contain importance sampling ratios from older versions of the policy. If the target policy changes substantially during the learning process—as one would hope—the older importance sampling ratios could bias the emphatic trace's estimates. On the other hand, a constant discount rate $\gamma < 1$ would give exponentially less weight to older importance sampling ratios in the emphatic trace, potentially mitigating the bias. We empirically investigate the impact of this type of bias in Section 9.3.

As a Monte Carlo estimator of the emphatic weightings, the emphatic trace can also yield high-variance estimates. Furthermore, the product of importance sampling ratios in the emphatic trace can lead to even higher-variance estimates, which can slow learning (Precup et al., 2001; Liu et al., 2020a; Ghiassian et al., 2018). As such, several algorithms have been proposed that use parametrized functions to estimate the factors required for reweighting the updates (Hallak and Mannor, 2017; Gelada and Bellemare, 2019; Liu et al., 2018; Zhang et al., 2019). We can similarly approximate the emphatic weighting directly.

For our setting, this involves estimating $\mathbb{E}_{\mu}[M_t | S_t = s] = m_{\eta}(s)/d_{\mu}(s)$. Because $M_t$ is a function of $F_t$, we directly approximate $\mathbb{E}_{\mu}[F_t | S_t = s]$ by learning a parametrized function $f_{\bm{\phi}}(s)$ and use it in place of $F_t$ to compute $M_t$. Notice that the resulting $f_{\bm{\phi}}(s) \approx \mathbb{E}_{\mu}[F_t | S_t = s] = m(s)/d_{\mu}(s)$, and so $(1 - \eta)\, i(s) + \eta f_{\bm{\phi}}(s) \approx \mathbb{E}_{\mu}[M_t | S_t = s] = m_{\eta}(s)/d_{\mu}(s)$. Because we sample $s \sim d_{\mu}$, weighting the update with an estimate of $m_{\eta}(s)/d_{\mu}(s)$ is effectively an importance sampling ratio that ensures the updates are weighted according to $m_{\eta}(s)$.

The direct approximation relies on the recursive equation for the emphatic weighting that allows for a temporal-difference update. Unlike the usual TD update, this update bootstraps off the estimate from the previous step rather than the next step, because the emphatic weightings accumulate interest back in time, leading into the state. Recall the recursive equation

m(s') = d_{\mu}(s')\, i(s') + \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a|s; \bm{\theta})\, \text{P}(s'|s,a)\, \gamma(s,a,s')\, m(s).

This equation is a Bellman equation that specifies the emphatic weighting of state $s'$ in terms of the immediate interest and the states leading into $s'$. We approximate $f_{\bm{\phi}}(s) \approx m(s)/d_{\mu}(s)$, rather than $m(s)$. The recursive formula for an $f(s)$ that equals $m(s)/d_{\mu}(s)$ is

f(s') = i(s') + \frac{1}{d_{\mu}(s')} \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \pi(a|s; \bm{\theta})\, \text{P}(s'|s,a)\, \gamma(s,a,s')\, d_{\mu}(s)\, f(s),

where $i_t + \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1})$ is a sample of this target, as we show in the next proposition.

Proposition 7

Under the conditions stated by Sutton et al. (2016),

\lim_{t \to \infty} \mathbb{E}_{\mu}\bigl[ i_t + \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1}) \mid S_t = s' \bigr] = i(s') + \frac{1}{d_{\mu}(s')} \sum_{s,a} \pi(a|s; \bm{\theta})\, \text{P}(s'|s,a)\, \gamma(s,a,s')\, d_{\mu}(s)\, f_{\bm{\phi}}(s).

Proof  The proof follows the strategy showing the unbiasedness of $F_t$ in Sutton et al. (2016).

\lim_{t \to \infty} \mathbb{E}_{\mu}\bigl[ i_t + \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1}) \mid S_t = s' \bigr]
= i(s') + \lim_{t \to \infty} \mathbb{E}_{\mu}\bigl[ \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1}) \mid S_t = s' \bigr]
= i(s') + \sum_{s,a} \lim_{t \to \infty} \Pr\{S_{t-1} = s, A_{t-1} = a \mid S_t = s'\} \frac{\pi(a|s; \bm{\theta})}{\mu(a|s)} \gamma(s,a,s')\, f_{\bm{\phi}}(s)
= i(s') + \sum_{s,a} \frac{d_{\mu}(s)\, \mu(a|s)\, \text{P}(s'|s,a)}{d_{\mu}(s')} \frac{\pi(a|s; \bm{\theta})}{\mu(a|s)} \gamma(s,a,s')\, f_{\bm{\phi}}(s) \qquad \triangleright\ \text{Bayes' rule}
= i(s') + \frac{1}{d_{\mu}(s')} \sum_{s,a} \pi(a|s; \bm{\theta})\, \text{P}(s'|s,a)\, \gamma(s,a,s')\, d_{\mu}(s)\, f_{\bm{\phi}}(s)

 
The utility of this result is that it provides a way to sample the target for this Bellman equation. The following is a sample update whose target is equal to the above in expectation, when sampling $s \sim d_{\mu}$:

\bm{\phi} \leftarrow \bm{\phi} + \beta_t \bigl[ i_t + \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1}) - f_{\bm{\phi}}(S_t) \bigr] \nabla_{\bm{\phi}} f_{\bm{\phi}}(S_t), \qquad (17)

where $\beta_t$ is a step-size. The update aims to minimize the difference between the current estimate of the emphatic weightings and the target, using a standard semi-gradient TD update. We could alternatively use a gradient TD (Sutton et al., 2009; Sutton and Barto, 2018) update

\bm{\phi} \leftarrow \bm{\phi} + \beta_t \Bigl( \bigl[ i_t + \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1}) - f_{\bm{\phi}}(S_t) \bigr] \nabla_{\bm{\phi}} f_{\bm{\phi}}(S_t) - \gamma_t h(S_{t-1}) \nabla_{\bm{\phi}} f_{\bm{\phi}}(S_{t-1}) \Bigr), \qquad (18)

where the auxiliary function $h(s)$ provides an estimate of $\mathbb{E}_{\mu}\bigl[ i_t + \rho_{t-1} \gamma_t f_{\bm{\phi}}(S_{t-1}) - f_{\bm{\phi}}(S_t) \mid S_t = s \bigr]$ using a regression update.

Once we use an approximation to the emphatic weighting, we may introduce bias into the gradient estimate. As an extreme case, imagine the function is restricted to predict the same value for all states. Using these estimates means replacing $M_t$ in the actor updates with a constant factor that can be subsumed in the step-size, and therefore following the semi-gradient updates in Equation 14. However, the advantage is that these lower-variance estimates do not have to be computed online, unlike $M_t$. The function $f_{\bm{\phi}}$ can be trained by sampling transitions from a replay buffer (Lin, 1993; Mnih et al., 2015) and applying the update in Equation 17.
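
A minimal sketch of the semi-gradient update in Equation 17 for a linear estimator $f_{\bm{\phi}}(s) = \bm{\phi}^{\top}\mathbf{x}(s)$ over state features; the names and the replay usage below are illustrative, not a prescription.

```python
import numpy as np

def update_f(phi, x_prev, x_curr, rho_prev, gamma_t, i_t, beta=0.01):
    """Semi-gradient TD update for f_phi (Equation 17), with linear f_phi(s) = phi^T x(s)."""
    f_prev = phi @ x_prev                         # f_phi(S_{t-1})
    f_curr = phi @ x_curr                         # f_phi(S_t)
    target = i_t + rho_prev * gamma_t * f_prev    # backwards-bootstrapped target
    phi += beta * (target - f_curr) * x_curr      # gradient of f_phi(S_t) w.r.t. phi is x_curr
    return phi

# Usage sketch: store (x_prev, x_curr, rho_prev, gamma_t, i_t) for each transition in a
# replay buffer and apply update_f to sampled tuples; the actor then uses
# (1 - eta) * i(s) + eta * f_phi(s) in place of M_t.
```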

5.3 Summary

In this section, we showed how to operationalize the off-policy policy gradient theorem, by discussing how to sample the gradient. The gradient involves a standard actor-critic update from a state $s$, additionally weighted by the emphatic weighting $m_{\eta}(s)$. We leveraged existing results for sampling the emphatic weighting in Section 5.2.1, and proved that using this unbiased emphatic weighting in our update results in an unbiased update (Proposition 6). These sampled emphatic weightings, however, can be high variance and become biased if the policy changes on each step. We developed a parameterized estimator in Section 5.2.2, by developing a recursive formula for $f_{\bm{\phi}}(s) \approx m(s)/d_{\mu}(s)$ in Proposition 7. Though a similar form was used to show unbiasedness of a sampled weighting in the original work on emphasis (Sutton et al., 2016), that work did not recognize or leverage this recursive form to develop a TD algorithm for learning a parameterized estimator. We need this $f_{\bm{\phi}}(s)$, instead of directly estimating $m(s)$, because states are sampled $s \sim d_{\mu}$, so weighting by $m(s)$ would be incorrect.

We summarize the final algorithm in Algorithm 1, in Section 7. This pseudocode is provided later, after incorporating entropy regularization and showing how to use the algorithm for episodic problems. The algorithm shows both how to use the emphatic trace and directly-estimated emphatic weightings.

6 Incorporating Entropy Regularization

In this section, we discuss how to extend the off-policy policy gradient theorem to incorporate entropy regularization. We focus on discrete states and actions in the main body, and include results for continuous states and actions in the appendix. Entropy regularization is commonly used with policy gradient methods, as it encodes a preference for more stochasticity in the policy. Recall that at each state, the policy induces a probability distribution over actions whose entropy is defined as

\mathcal{H}\bigl{(}\pi(\cdot|s)\bigr{)}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}-\sum_{a\in\mathcal{A}}\pi(a|s)\log\pi(a|s).

This value captures the stochasticity of the policy. A uniform random policy maximizes the entropy, while a nearly deterministic policy has entropy near zero, the minimum for discrete actions (for continuous actions, the differential entropy of a nearly deterministic policy can instead be an arbitrarily large negative value).

Entropy regularization is believed to promote exploration, improve the optimization landscape, and speed up convergence. For exploration, entropy regularization can help make the policy more stochastic and diversify the set of states and actions that the agent learns about (Williams and Peng, 1991; Haarnoja et al., 2018). To find a good policy, the agent needs accurate estimates of the values of different actions in different parts of the state space; a greedy policy only takes actions that are currently deemed optimal, and so learns about only a limited set of trajectories. Entropy regularization has also been shown to help policy optimization by modifying the landscape: the resulting objective is smoother, which allows the use of larger step-sizes and reduces the chance of getting stuck in bad local maxima (Ahmed et al., 2019).

Finally, the use of entropy regularization facilitates convergence analysis to stationary points. Mei et al. (2020) showed that entropy regularization can ensure the existence of stationary points and also improve the rate of convergence in tabular policy optimization. The idea is similar to convex optimization, where \ell_{2} regularization makes the loss function strongly convex and improves the convergence rate from sublinear to linear. In Appendix E, we extend the proof of existence of a stationary point for entropy-regularized policy optimization to state aggregation. Therefore, in addition to extending our algorithm to allow for entropy regularization, we also use this extension to formally establish the counterexample for the semi-gradient updates in Section 8.

Extending the results to entropy regularization is relatively straightforward, as it relies on the already developed machinery of soft action-values that include entropy in the rewards. Entropy regularization augments each reward with a sample of the entropy at the current state so that the return is modified to

\displaystyle\tilde{G}_{t}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}(R_{t+1}-\tau\log\pi(A_{t}|S_{t}))+\gamma_{t+1}(R_{t+2}-\tau\log\pi(A_{t+1}|S_{t+1}))+\ldots
\displaystyle\phantom{\tilde{G}_{t}}=R_{t+1}-\tau\log\pi(A_{t}|S_{t})+\gamma_{t+1}\tilde{G}_{t+1},

where τ\tau is a parameter that controls the amount of regularization. Entropy regularized state values and action values can be defined as (Geist et al., 2019)

\displaystyle\tilde{v}_{\pi}(s)\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\mathbb{E}_{\pi}[\tilde{G}_{t}|S_{t}=s]\quad\forall s\in\mathcal{S}
\displaystyle\tilde{q}_{\pi}(s,a)\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\mathbb{E}_{\pi}[R_{t+1}+\gamma_{t+1}\tilde{G}_{t+1}|S_{t}=s,A_{t}=a]\quad\forall s\in\mathcal{S}\text{ and }a\in\mathcal{A}
\displaystyle\phantom{\tilde{q}_{\pi}(s,a)}=\sum_{s^{\prime}\in\mathcal{S}}\text{P}(s^{\prime}|s,a)[r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})\tilde{v}_{\pi}(s^{\prime})]\quad\forall a\in\mathcal{A},\forall s\in\mathcal{S}.

Here, v~π(s)=𝔼π[q~π(S,A)|S=s]+τ(π(|s))\tilde{v}_{\pi}(s)=\mathbb{E}_{\pi}[\tilde{q}_{\pi}(S,A)|S=s]+\tau\mathcal{H}(\pi(\cdot|s)) for entropy (π(|s))=𝔼π[logπ(A|s)]\mathcal{H}(\pi(\cdot|s))=\mathbb{E}_{\pi}[-\log\pi(A|s)]. The entropy-regularized objective function simply uses these entropy regularized values,

\tilde{J}_{\mu}(\bm{\theta})\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\sum_{s\in\mathcal{S}}d_{\mu}(s)i(s)\tilde{v}_{\pi_{\bm{\theta}}}(s). (19)

The off-policy policy gradient theorem can be extended to the entropy-regularized objective. The result looks slightly different, because the relationship between the soft values and soft action-values differs from the unregularized setting. Notice that -\tau\log\pi(a|s) is not included in the first reward for \tilde{q}_{\pi}(s,a), because this first action is given—not random—and entropy is an expectation over all actions. Further, \log\pi(a|s) can be arbitrarily large in magnitude, though the entropy itself remains nicely bounded. For this reason, we do not define the soft action-values to be \mathbb{E}_{\pi}[\tilde{G}_{t}|S_{t}=s,A_{t}=a], as we typically would for the unregularized setting. Nonetheless, deriving the policy gradient theorem for these soft action-values uses exactly the same steps. Further, we recover the unregularized formulation when \tau=0.

Theorem 8 (Off-policy Policy Gradient Theorem for Entropy Regularization)
\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}}=\sum_{s\in\mathcal{S}}m(s)\left[\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)+\tau\frac{\partial\mathcal{H}(\pi(\cdot|s;{\bm{\theta}}))}{\partial{\bm{\theta}}}\right]
\displaystyle\phantom{\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}}}=\sum_{s\in\mathcal{S}}m(s)\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\left[\tilde{q}_{\pi}(s,a)-\tau\log\pi(a|s;{\bm{\theta}})\right]

The proof is in Appendix D, as is the extension to continuous states, continuous actions, and other regularizers.

This theorem shows that we can use a similar update as in the unregularized setting, simply by adding -\tau\log\pi(a|s;{\bm{\theta}}). Given the true soft action-values \tilde{q}_{\pi}(s,a), an update with an unbiased sample of the gradient is

{\bm{\theta}}\leftarrow{\bm{\theta}}+\alpha\rho_{t}M_{t}\frac{\partial\log\pi(A_{t}|S_{t};{\bm{\theta}})}{\partial{\bm{\theta}}}\left[\tilde{q}_{\pi}(S_{t},A_{t})-\tau\log\pi(A_{t}|S_{t};{\bm{\theta}})\right].

In practice, we subtract a baseline and use an estimate \delta_{t} of the advantage \tilde{q}_{\pi}(s,a)-\tau\log\pi(a|s;{\bm{\theta}})-\tilde{v}_{\pi}(s), by directly estimating only the soft state values v(s)\approx\tilde{v}_{\pi}(s). The one-step sample of this advantage is again a TD error, but with the entropy regularization added to the reward: \delta_{t}=R_{t+1}-\tau\log\pi(A_{t}|S_{t})+\gamma_{t+1}v(S_{t+1})-v(S_{t}).
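To make the update concrete, the following Python sketch computes the entropy-regularized TD error \delta_{t} and the corresponding emphatically weighted actor update for a single transition, assuming a linear-softmax actor, a linear soft-value critic, known behaviour-policy probabilities, and an emphasis M_{t} supplied by the surrounding algorithm. It is a sketch of ours, not the paper's implementation.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def entropy_regularized_actor_step(theta, w, x_t, x_tp1, a_t, r_tp1, gamma_tp1,
                                   mu_probs, M_t, tau=0.01, alpha=0.1):
    # One emphatically weighted actor update with entropy regularization.
    # theta: actor weights (num_actions x num_features) for a linear-softmax policy.
    # w: weights of a linear soft state-value estimate v(x) = w . x.
    # mu_probs: the behaviour policy's action probabilities in x_t (assumed known).
    # M_t: the emphasis for the current state, from the trace or the direct estimator.
    pi_probs = softmax(theta @ x_t)
    rho_t = pi_probs[a_t] / mu_probs[a_t]
    # Entropy-regularized TD error: the reward is augmented with -tau * log pi(a_t | x_t).
    r_tilde = r_tp1 - tau * np.log(pi_probs[a_t])
    delta_t = r_tilde + gamma_tp1 * (w @ x_tp1) - (w @ x_t)
    # Gradient of log pi(a_t | x_t) for a linear-softmax parameterization.
    grad_log_pi = -np.outer(pi_probs, x_t)
    grad_log_pi[a_t] += x_t
    return theta + alpha * rho_t * M_t * delta_t * grad_log_pi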

7 Formulating Episodic Problems as a Special Case

The goal in an episodic problem is to maximize the expected episodic return. Mismatch between the algorithm’s objective and this goal, even in the on-policy setting, can lead the agent to suboptimal policies (Thomas, 2014; Nota and Thomas, 2020). This section describes how we can use the interest function to maximize the right objective.

The objective function in Equation 4 that weights state values by the start state distribution is equal to expected episodic return. With limited function approximation resources this weighting matters. If the parametrized function is incapable of representing a policy that maximizes all state values, the agent has to settle for lower values in some states to maximize the values of more important ones. For example, if the weighting is dμd_{\mu}, the agent may incorrectly prioritize states that appear less often at the start of an episode and more frequently at other points in the behaviour policy’s trajectories, and fail to optimize the episode return.

The interest function in Equation 6 allows us to focus the parametrized function’s resources on states of interest. Recall that we assume that the agent observes interest iti_{t} in state StS_{t}, with i(s)=𝔼[itSt=s]i(s)=\mathbb{E}[i_{t}\mid S_{t}=s]. So far, we have not used the additional flexibility that the interest itself can be random, but for this unification we will use this property. If we could set i(St)=d0(St)/dμ(St)i(S_{t})={d_{0}(S_{t})}/{d_{\mu}(S_{t})} then the objective function would correspond to the episodic objective. However, neither dμd_{\mu} nor d0d_{0} is available to the agent, and we would like to avoid estimating those distributions.

Fortunately, we can show that if the signal is set to one at the beginning of an episode and to zero thereafter, its expectation in the limit of time will be proportional to the correct ratio. Under transition-based discounting, the agent is informed that a new episode has begun whenever it receives a discount factor of zero: \gamma_{t}=0 indicates that the preceding transition (S_{t-1},A_{t-1},S_{t}) resulted in termination, so S_{t} is a start state. Therefore, we can obtain this result by allowing the interest to be a function of the whole transition.

Proposition 9

If the signal iti_{t} is set to

i_{t}\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\begin{cases}1,&\text{if}\ \gamma_{t}=0\text{ (i.e. the beginning of an episode)}\\ 0,&\text{otherwise}\end{cases} (20)

then its expected value under the behaviour policy is

\displaystyle\mathbb{E}_{\mu}[i_{t}\mid S_{t}=s]\propto\frac{d_{0}(s)}{d_{\mu}(s)}.

Proof 

\displaystyle\mathbb{E}_{\mu}[i_{t}\mid S_{t}=s]=\Pr(\gamma_{t}=0\mid S_{t}=s)
\displaystyle=\frac{\Pr(S_{t}=s\mid\gamma_{t}=0)\Pr(\gamma_{t}=0)}{\Pr(S_{t}=s)}
\displaystyle=\frac{d_{0}(s)\Pr(\gamma_{t}=0)}{d_{\mu}(s)}
\displaystyle\propto\frac{d_{0}(s)}{d_{\mu}(s)},

where we used the fact that Pr(St=sγt=0)=d0(s)\Pr(S_{t}=s\mid\gamma_{t}=0)=d_{0}(s) and Pr(St=s)=dμ(s)\Pr(S_{t}=s)=d_{\mu}(s).  
The constant term, \Pr(\gamma_{t}=0), is the same for all states; it simply reflects the probability of termination under the behaviour policy. Since this constant merely rescales the objective and all weight is still on the start states, it does not change the relative ordering between policies. Therefore, the resulting objective is equivalent to the episodic objective.
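Proposition 9 is easy to check by simulation. The following self-contained Python sketch uses a toy episodic chain of our own construction (not from the paper): it sets i_{t}=1 whenever \gamma_{t}=0 and confirms that the empirical \mathbb{E}_{\mu}[i_{t}\mid S_{t}=s] matches \Pr(\gamma_{t}=0)\,d_{0}(s)/d_{\mu}(s).

import numpy as np

rng = np.random.default_rng(0)

def step(s):
    # Behaviour-policy dynamics of a toy episodic chain (illustrative, not from the paper).
    # Returns (next_state, gamma), where gamma is the discount attached to the transition
    # into next_state; gamma == 0 signals termination and the next state is then a start
    # state drawn from d_0 (here, always state 0).
    if s == 0:
        return (1 if rng.random() < 0.5 else 2), 1.0
    if s == 1:
        return (0, 1.0) if rng.random() < 0.5 else (0, 0.0)  # may revisit the start state mid-episode
    return 0, 0.0                                            # state 2 always terminates

T = 200_000
visits = np.zeros(3)
interest_sum = np.zeros(3)
terminations = 0
s, gamma = 0, 0.0                        # the stream begins at an episode start, so gamma_0 = 0
for _ in range(T):
    i_t = 1.0 if gamma == 0.0 else 0.0   # the interest signal from Equation 20
    visits[s] += 1
    interest_sum[s] += i_t
    terminations += int(gamma == 0.0)
    s, gamma = step(s)

d_mu = visits / T
d_0 = np.array([1.0, 0.0, 0.0])
empirical = interest_sum / visits                # estimate of E_mu[i_t | S_t = s]
predicted = (terminations / T) * d_0 / d_mu      # Pr(gamma_t = 0) * d_0(s) / d_mu(s)
print(np.round(empirical, 3), np.round(predicted, 3))   # the two vectors agree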

We can relate the resulting update to the on-policy and off-policy updates used for the episodic setting. In the on-policy setting, if we use the interest function given by Equation 20 and a constant discount factor γ\gamma during the episode, then the update reduces to the unbiased episodic actor-critic algorithm of Sutton and Barto (2018), originally proposed by Thomas (2014).

Sutton et al. (1999a) proved the policy gradient theorem for the episodic setting

\displaystyle\frac{\partial J_{0}({\bm{\theta}})}{\partial{\bm{\theta}}}=\sum_{s\in\mathcal{S}}\bar{d_{\pi}}(s)\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}q_{\pi}(s,a)\hskip 28.45274pt\text{for }\hskip 8.5359pt\bar{d_{\pi}}(s)=\sum_{t=0}^{\infty}\gamma^{t}\Pr(s;t,\pi),

where Pr(s;t,π)\Pr(s;t,\pi) is the probability of going from a start state to state ss in tt steps under π\pi, the weighting in the gradient dπ¯\bar{d_{\pi}} is the discounted state visit distribution, and the agent should discount updates that occur later in the episode to account for this weighting. An algorithm that does not discount later updates—and thus samples according to dπd_{\pi}—results in a biased update.

To fix this problem, Thomas (2014) proposed the unbiased actor-critic algorithm with the update rules

\displaystyle{\bm{\theta}}\leftarrow{\bm{\theta}}+\alpha I\delta\nabla_{{\bm{\theta}}}\log\pi(A|s;{\bm{\theta}})
\displaystyle I\leftarrow\gamma I.

Since II is initialized to 11 and updated after the update to 𝜽{\bm{\theta}}, II will be 11 during the first weight update, and will decay by γ\gamma each timestep thereafter.

The on-policy ACE update rule with ρ=1\rho=1 and η=1{\eta}=1 is

\displaystyle F_{t}\leftarrow\gamma_{t}F_{t-1}+i_{t}\hskip 28.45274pt\text{for }F_{0}=0
\displaystyle{\bm{\theta}}\leftarrow{\bm{\theta}}+\alpha F_{t}\delta\nabla_{{\bm{\theta}}}\log\pi(A|s;{\bm{\theta}}).

Because it=1i_{t}=1 on the first time step and FF is initialized to 0, FF will be 11 on the first weight update and will decay by γ\gamma each time step thereafter. From this inspection it is clear that the ACE update reduces to the unbiased actor-critic update in the on-policy episodic setting. The final ACE algorithm with all the discussed techniques is in Algorithm 1.
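As a quick numerical check of this reduction (a sketch of ours, not from the released code): with \rho=1, \eta=1, a constant discount \gamma within the episode, and episodic interest, the emphatic trace F equals Thomas's decaying factor I=\gamma^{k} at every step k of an episode.

gamma = 0.9
F, I = 0.0, 1.0                        # emphatic trace (F_0 = 0) and Thomas's factor (I = 1)
for k in range(10):                    # k counts steps since the start of the episode
    i_t = 1.0 if k == 0 else 0.0       # episodic interest: one only at the episode start
    F = gamma * F + i_t                # on-policy trace update with rho = 1
    assert abs(F - I) < 1e-12          # the two weightings coincide: F = I = gamma ** k
    I *= gamma                         # I is decayed after each weight update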

Algorithm 1 Online Actor Critic with Emphatic weightings (ACE)
Initialize weights for actor {\bm{\theta}} and critic \mathbf{w}
For the emphatic trace, initialize F_{0}=0; for directly-estimated emphatic weightings, initialize weights {\bm{\phi}} for the approximate emphatic weightings f_{\bm{\phi}}
Suggested (default) settings of parameters: {\eta}=0.1, \tau=0.01
Obtain initial feature vector \mathbf{x}_{0} and set i_{0}\leftarrow 1
repeat
     Choose an action a_{t} according to \mu(\cdot|\mathbf{x}_{t})
     Observe reward r_{t+1}, next feature vector \mathbf{x}_{t+1} and discount \gamma_{t+1}
     For episodic setting: Set i_{t+1}=1 if a new episode has begun; else i_{t+1}=0
     For excursions setting: If no preferences between states, set i_{t+1}=1
     \tilde{r}_{t+1}\leftarrow r_{t+1}-\tau\log\pi(a_{t}|\mathbf{x}_{t};{\bm{\theta}})
     \rho_{t}\leftarrow\frac{\pi(a_{t}|\mathbf{x}_{t};{\bm{\theta}})}{\mu(a_{t}|\mathbf{x}_{t})}
     Update (entropy-regularized) critic v_{\mathbf{w}}
     if using emphatic trace then
          M_{t}\leftarrow(1-{\eta})i_{t}+{\eta}F_{t}
          F_{t+1}\leftarrow\rho_{t}\gamma_{t+1}F_{t}+i_{t+1}
     else
          M_{t}\leftarrow(1-{\eta})i_{t}+{\eta}f_{\bm{\phi}}(\mathbf{x}_{t})
          {\bm{\phi}}\leftarrow{\bm{\phi}}+\beta_{t}\bigl{[}i_{t+1}+\rho_{t}\gamma_{t+1}f_{\bm{\phi}}(\mathbf{x}_{t})-f_{\bm{\phi}}(\mathbf{x}_{t+1})\bigr{]}\nabla_{\bm{\phi}}f_{\bm{\phi}}(\mathbf{x}_{t+1})
     \delta_{t}\leftarrow\tilde{r}_{t+1}+\gamma_{t+1}v_{\mathbf{w}}(\mathbf{x}_{t+1})-v_{\mathbf{w}}(\mathbf{x}_{t})
     \bm{\psi}\leftarrow\nabla_{\bm{\theta}}\log\pi(a_{t}|\mathbf{x}_{t};{\bm{\theta}})
     {\bm{\theta}}\leftarrow{\bm{\theta}}+\alpha_{t}\rho_{t}M_{t}\delta_{t}\bm{\psi}
until the agent is done interacting with the environment
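The following Python sketch mirrors Algorithm 1 for a linear-softmax actor, a linear critic, and a linear direct estimator of the emphatic weightings. It is a minimal illustration under our own assumptions (a generic environment interface, a known behaviour policy, and a plain TD(0) critic in place of the ETD or TDRC critics used in the experiments), not the paper's released implementation.

import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

class RandomMDP:
    # A tiny random episodic environment emitting random feature vectors (purely illustrative).
    def __init__(self, num_features=8, num_actions=3, seed=1):
        self.rng = np.random.default_rng(seed)
        self.num_features, self.num_actions = num_features, num_actions
    def reset(self):
        return self.rng.random(self.num_features)
    def step(self, a):
        terminated = self.rng.random() < 0.1
        x_next = self.rng.random(self.num_features)   # on termination this is the next start state
        return x_next, self.rng.normal(), (0.0 if terminated else 0.99)

def run_ace(env, mu_probs, num_features, num_actions, episodic_interest=True, use_trace=False,
            eta=0.1, tau=0.01, alpha=0.001, alpha_v=0.01, beta=0.01, num_steps=10_000, seed=0):
    # Minimal online ACE loop following Algorithm 1 (a sketch, not the released code).
    # env.reset() -> feature vector x; env.step(a) -> (x_next, reward, gamma_next), with
    # gamma_next == 0 signalling termination. mu_probs(x) gives the behaviour policy's
    # action probabilities. The critic here is plain semi-gradient TD(0) on the
    # entropy-regularized reward.
    rng = np.random.default_rng(seed)
    theta = np.zeros((num_actions, num_features))   # actor (linear-softmax)
    w = np.zeros(num_features)                      # critic
    phi = np.zeros(num_features)                    # direct estimator of the emphatic weightings
    F = 0.0                                         # emphatic trace
    x, i_t = env.reset(), 1.0                       # interest is one in the initial state
    for _ in range(num_steps):
        probs_mu = mu_probs(x)
        a = rng.choice(num_actions, p=probs_mu)
        x_next, r, gamma_next = env.step(a)
        i_next = (1.0 if gamma_next == 0.0 else 0.0) if episodic_interest else 1.0
        pi = softmax(theta @ x)
        rho = pi[a] / probs_mu[a]
        r_tilde = r - tau * np.log(pi[a])           # entropy-regularized reward
        delta = r_tilde + gamma_next * (w @ x_next) - (w @ x)
        w = w + alpha_v * delta * x                 # critic update
        if use_trace:                               # emphasis from the emphatic trace ...
            M = (1 - eta) * i_t + eta * F
            F = rho * gamma_next * F + i_next
        else:                                       # ... or from the direct estimator (Equation 17)
            M = (1 - eta) * i_t + eta * (phi @ x)
            f_err = i_next + rho * gamma_next * (phi @ x) - (phi @ x_next)
            phi = phi + beta * f_err * x_next
        grad_log_pi = -np.outer(pi, x)              # gradient of log pi(a | x) for linear-softmax
        grad_log_pi[a] += x
        theta = theta + alpha * rho * M * delta * grad_log_pi
        x, i_t = x_next, i_next
    return theta

env = RandomMDP()
theta = run_ace(env, lambda x: np.full(env.num_actions, 1.0 / env.num_actions),
                env.num_features, env.num_actions)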

8 Stationary Points under the Semi-gradient and Gradient Updates

In this section, we provide more insight into the difference between the stationary points of the weighted excursions objective—obtained using the true gradient in Equation 9—versus those obtained with the semi-gradient update underlying OffPAC, in Equation 14. We first provide a counterexample showing that all the stationary points for the semi-gradient have poor performance. In fact, initializing at the optimal solution and then updating with the semi-gradient moves the solution away to one of these highly suboptimal stationary points. We then show that the semi-gradient update can actually be seen as a full gradient update, albeit with a different implicit state weighting. This connection illuminates why the semi-gradient does so poorly on the counterexample, but also potentially sheds light on why OffPAC has generally performed reasonably well in practice.

8.1 A Counter-example for the Semi-Gradient Update

Recall that, in the derivation of the policy gradient theorem, the product rule breaks down the gradient of the value function into the two terms shown in Equation 11. The policy gradient theorem in OffPAC only considers the first term, i.e. the gradient of the policy given a fixed value function. The approximated gradient is

\displaystyle\frac{\partial J_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}}\approx\sum_{s\in\mathcal{S}}\mathbf{d}_{\mu}(s)\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}q_{\pi}(s,a).

The approximation above weights the states by the behaviour policy’s state distribution instead of emphatic weightings.

To see the difference between these weightings, consider the MDP in Figure 1(a). For the actor, s0s_{0} has a feature vector [1,0][1,0], and the feature vector for both s1s_{1} and s2s_{2} is [0,1][0,1]. This aliased representation forces the actor to take a similar action in s1s_{1} and s2s_{2}. The behaviour policy takes actions a0a_{0} and a1a_{1} with probabilities 0.250.25 and 0.750.75 in all non-terminal states, so that s0s_{0}, s1s_{1}, and s2s_{2} will have probabilities 0.50.5, 0.1250.125, and 0.3750.375 under 𝐝μ\mathbf{d}_{\mu}. The target policy is initialized to take a0a_{0} and a1a_{1} with probabilities 0.90.9 and 0.10.1 in all non-terminal states, which is close to optimal.

We trained two actors on this MDP, one with semi-gradient updates and one with gradient updates. The actors are initialized to the target policy above and the updates use exact values rather than critic estimates. States and actions are sampled from the behaviour policy. As shown in Figures 1(b) and 1(c), while both methods start close to the highest attainable value of the objective function (that of a deterministic policy that takes a0a_{0} everywhere), semi-gradient updates move the actor towards a suboptimal policy and reduce the objective function along the way. Gradient updates, however, increase the probability of taking a0a_{0} in all states and increase the value of the objective function.

[Figure panels: (a) Counterexample; (b) Learning curves; (c) Optimal action probability]
Figure 1: (1(a)) A counterexample that demonstrates the suboptimal behaviour of semi-gradient updates. The semi-gradients converge for the tabular setting (Degris et al., 2012b), but not necessarily under function approximation—such as with the state aliasing in this MDP. The start state is denoted s0s_{0} and the terminal state is denoted tt. States s1s_{1} and s2s_{2} are aliased to the actor. The interest i(s)i(s) is set to one for all states. (1(b)) Learning curves comparing semi-gradient updates and gradient updates, averaged over 30 runs with negligible standard error bars. The actor has a softmax output on a linear transformation of features and is trained with a step-size of 0.1 (though results were similar across all the stepsizes tested). The dashed line shows the highest attainable objective function under the aliased representation. (1(c)) The probability of taking the optimal action (a0a_{0}) in the aliased states.

The problem with semi-gradient updates boils down to the weighting. In an expected semi-gradient update, each state tries to increase the probability of the action with the highest action-value. There will be a conflict between the aliased states s1s_{1} and s2s_{2} because their highest-valued actions differ. If the states are weighted by dμd_{\mu} in the expected update, s1s_{1} will appear insignificant to the actor, and the update will increase the probability of a1a_{1} in the aliased states. The ratio between qπ(s1,a0)q_{\pi}(s_{1},a_{0}) and qπ(s2,a1)q_{\pi}(s_{2},a_{1}) is not enough to counterbalance this weighting.

However, s1s_{1} has an importance that the semi-gradient update overlooks. Taking a suboptimal action at s1s_{1} will also reduce q(s0,a0)q(s_{0},a_{0}), and after many updates the actor will gradually prefer to take a1a_{1} in s0s_{0}. Eventually, the target policy will be to take a1a_{1} at all states, which has a lower value under the objective function than the initial target policy. This experiment highlights why the weight of a state should depend not only on its own share of dμd_{\mu}, but also on its predecessors. The behaviour policy’s state distribution is not the proper deciding factor in the competition between s1s_{1} and s2s_{2}, even when optimizing the excursions objective.

Proposition 10 formalizes the problem with semi-gradient updates by showing that, under any τ>0\tau>0, semi-gradient updates will not converge to a stationary point of the objective function in the counterexample. The requirement τ>0\tau>0 is only needed to ensure existence of a stationary point; this result holds for τ\tau arbitrarily close to zero. The proof is presented in Appendix E.

Proposition 10

For any \tau>0 where \tau\neq\tau_{i}\approx 0.2779 in the three-state counterexample, semi-gradient updates do not converge to a stationary point of the objective function.

8.2 The Semi-Gradient as a True Gradient with a Different Weighting

In this section, we show that the semi-gradient can locally be seen as a gradient update, with a different state weighting in the objective. Locally for the current weights 𝜽t{\bm{\theta}}_{t} with corresponding policy π\pi, that state weighting is 𝐝=𝐝μ(𝐈𝐏π,γ)\mathbf{d}=\mathbf{d}_{\mu}(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma}), with objective Jd(𝜽)=defs𝒮d(s)vπ𝜽(s)J_{d}({\bm{\theta}})\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\sum_{s\in\mathcal{S}}d(s)v_{\pi_{\bm{\theta}}}(s). The resulting 𝐝\mathbf{d} may not be a distribution. In fact, it may not even be positive! If it is negative, the objective tells the agent to minimize the value for that state. This is precisely what occurs in the above counterexample.

First, to see why this is the implicit state weighting, notice that for the semi-gradient update to be a gradient update, the weighting dμd_{\mu} used in the update has to correspond to an emphatic weighting mdm_{d} for some state weighting dd in the objective. In other words, dμ(s)=md(s)d_{\mu}(s)=m_{d}(s) where 𝐦d=𝐝(𝐈𝐏π,γ)1\mathbf{m}_{d}=\mathbf{d}(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma})^{-1}. This requirement implies that 𝐝=𝐦d(𝐈𝐏π,γ)\mathbf{d}=\mathbf{m}_{d}(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma}). Therefore, by using the weighting dμd_{\mu} in the expected gradient, the semi-gradient update locally around 𝜽t{\bm{\theta}}_{t} can be seen as a gradient update on JdJ_{d}. This weighting dd actually changes as 𝜽t{\bm{\theta}}_{t} changes. In fact, we know that the semi-gradient update cannot be seen as the gradient of a fixed objective function, from our counterexample for OffPAC and the result for the on-policy setting showing that omitting the discount factor results in an update that is not a gradient (Nota and Thomas, 2020). However, even if this interpretation is only valid locally at each update, such a negative weighting can be problematic.

We can compute the implicit weighting d in our counterexample. Let b=\mu(a_{0}|s_{0}) and p=\pi(a_{0}|s_{0}). Then we know that d_{\mu}(s_{0})=0.5, d_{\mu}(s_{1})=0.5b=0.125, and d_{\mu}(s_{2})=0.5(1-b)=0.375. Further, we know that m(s_{0})=d_{\mu}(s_{0}) and

\displaystyle m(s_{1})=d_{\mu}(s_{1})+\pi(a_{0}|s_{0})m(s_{0})=bd_{\mu}(s_{0})+pm(s_{0})=d_{\mu}(s_{0})(b+p)=0.5(b+p)
\displaystyle m(s_{2})=d_{\mu}(s_{2})+(1-p)m(s_{0})=d_{\mu}(s_{0})((1-b)+(1-p))=0.5(2-b-p).

The implicit d for the semi-gradient update, locally around this \pi, must satisfy \tilde{m}(s)=d_{\mu}(s), where \tilde{m} is the emphatic weighting induced by d. Since s_{0} has no predecessors, \tilde{m}(s_{0})=d(s_{0}), and so d(s_{0})=d_{\mu}(s_{0})=0.5 and

\displaystyle\tilde{m}(s_{1})=d(s_{1})+p\tilde{m}(s_{0})=d(s_{1})+pd_{\mu}(s_{0})=d(s_{1})+0.5p
\displaystyle\tilde{m}(s_{2})=d(s_{2})+(1-p)\tilde{m}(s_{0})=d(s_{2})+(1-p)d_{\mu}(s_{0})=d(s_{2})+0.5(1-p).

Using m~(s1)=dμ(s1)=0.5b\tilde{m}(s_{1})=d_{\mu}(s_{1})=0.5b and m~(s2)=dμ(s2)=0.5(1b)\tilde{m}(s_{2})=d_{\mu}(s_{2})=0.5(1-b), we get that

\displaystyle d(s_{1})=\tilde{m}(s_{1})-0.5p=0.5(b-p)
\displaystyle d(s_{2})=\tilde{m}(s_{2})-0.5(1-p)=0.5((1-b)-(1-p))=0.5(p-b).

If b>pb>p, then d(s2)<0d(s_{2})<0; if b<pb<p, then d(s1)<0d(s_{1})<0; and if b=pb=p then both are zero. In our counterexample, we set b<pb<p, making the update move away from increasing the value in s1s_{1}, namely preferring to increase the value in s2s_{2} and decrease the value in s1s_{1}. The iterative updates maintain the condition b<pb<p, even as the policy changes—which changes pp—so the implicit weighting systematically causes convergence to a suboptimal policy.
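These closed-form expressions are easy to verify numerically. The short sketch below (our own, using only the quantities defined above) builds \mathbf{P}_{\!\pi,\gamma} for the three-state counterexample, computes the emphatic weighting m=d_{\mu}(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma})^{-1} and the implicit weighting d=d_{\mu}(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma}), and confirms that d(s_{1})=0.5(b-p) is negative whenever b<p.

import numpy as np

b, p = 0.25, 0.9                                    # behaviour and (initial) target probabilities of a0 in s0
d_mu = np.array([0.5, 0.5 * b, 0.5 * (1 - b)])      # stationary distribution over s0, s1, s2

# State-to-state matrix P_pi_gamma: s0 reaches s1 with probability p and s2 with probability 1 - p;
# s1 and s2 only lead to termination, so their rows are zero.
P = np.array([[0.0, p, 1.0 - p],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])

m = d_mu @ np.linalg.inv(np.eye(3) - P)   # emphatic weighting (interest of one everywhere)
d = d_mu @ (np.eye(3) - P)                # implicit weighting that makes d_mu the emphasis

print(m)   # [0.5, 0.5*(b+p), 0.5*(2-b-p)]
print(d)   # [0.5, 0.5*(b-p), 0.5*(p-b)] -> d(s1) < 0 whenever b < p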

We note that the implicit weighting is independent of the representation. However, we know that the semi-gradient update converges to a stationary point of the excursions objective in the tabular setting, which might seem odd given that the implicit weighting is negative for a state in this counterexample. This is not contradictory. Recall from Section 2 that, in the tabular setting, the condition for a stationary point is that the gradient must be zero at every state, independently. Summing up zeros, even when weighting those zeros with a negative weighting, still results in zero.

Other work has also noted that optimizing under a different state weighting can still be effective. Proposition 6 of Ghosh et al. (2020) shows that a lower bound can be optimized using data generated under the behaviour policy: the action-values of the behaviour policy are used, and only the log-likelihood terms of the target policy are differentiated. This lower bound, however, is only an approximation to the true objective. Our result differs, in that it highlights that equivalent solutions can be obtained even under different state weightings.

This view suggests a direction to reduce variance in the ACE updates: consider appropriate implicit weightings d that allow for low-variance emphatic updates. One potentially promising approach is to only consider emphatic weightings a small number of steps back in time. Another is to err on the side of smaller \eta in the emphatic trace in ACE, knowing that the implicit weighting d may remain reasonable for a broad range of \eta.

9 Experiments: Studying Properties of ACE

In this section we investigate key properties of the ACE algorithm empirically; code is available at: https://github.com/gravesec/actor-critic-with-emphatic-weightings. Specifically, we study the effects of the trade-off parameter and the choice of estimator for the emphatic weightings on the learned policy, as these choices are central to the ACE algorithm. First, we revisit the simple counterexample from Section 8.1 to examine how the trade-off parameter can affect the learned policy. Next, we illustrate some issues with using the emphatic trace in the ACE algorithm by testing it on a modified version of the counterexample. We then move to two classic control environments to study the two estimators discussed in Section 5.2 and their effects on the learned policy. Finally, we test several variants of ACE on a challenging environment designed to illustrate the issues associated with both estimators. Please see Table 1 in Appendix F for a description of each of the algorithms compared.

9.1 The Impact of the Trade-Off Parameter

The parameter η{\eta} can be interpreted as trading off bias and variance. For η=0{\eta}=0, the bias can be significant, as shown in the previous section. A natural question to ask is how the bias changes as η{\eta} ranges from 0 to 1.

To answer this question, we repeated the experiment in Section 8.1, but this time with η{\eta} taking values in {0,0.25,0.5,0.75,1}\{0,0.25,0.5,0.75,1\} and the actor’s step-size taking values in {0.01,0.02,0.05,0.1,0.2,0.5,1}\{0.01,0.02,0.05,0.1,0.2,0.5,1\}, with the best-performing step-size (by total area under the learning curve, averaged over 30 runs) for each value of η{\eta} used in Figures 2(b) and 2(c). To highlight the rate of learning, the actor’s policy was initialized to take each action with equal probability. The results are displayed in Figure 2 with shaded regions depicting standard error.

Figure 2(a) shows the performance of ACE with each value of η{\eta} over the range of step-sizes tested. ACE performed well for a range of step-sizes, and even small values of η{\eta} significantly improved over η=0{\eta}=0 (OffPAC). However, the performance of ACE with η>0{\eta}>0 was more sensitive to the choice of step-size, with performance decreasing as the step-size grew large, while the performance of η=0{\eta}=0 was lower overall but remained steady with larger step-sizes.

Figure 2(b) plots the learning curves for ACE with each value of η{\eta}. For η=0{\eta}=0 (OffPAC), the algorithm decreased the objective function during learning to get to a suboptimal fixed point, while η>0{\eta}>0 always improved the objective function relative to the starting point. For a surprisingly small choice of η=0.5{\eta}=0.5, the actor converged to the optimal solution, and even η=0.25{\eta}=0.25 produced a much more reasonable solution than η=0{\eta}=0.

Figure 2(c) shows the probability of taking the optimal action in the aliased states. The optimal policy is to take a0a_{0} in the aliased states with probability 1. ACE with η=0{\eta}=0 (OffPAC) quickly converged to the suboptimal solution of choosing the best action for s2s_{2} instead of s1s_{1}. Even with η{\eta} just a bit higher than 0, convergence is to a more reasonable solution, choosing the optimal action the majority of the time.

[Figure panels: (a) Actor step-size sensitivity; (b) Learning curves; (c) Optimal action probability]
Figure 2: Performance of ACE with different values of η{\eta} in the counterexample.

To determine whether the η{\eta} parameter has similar effects when ACE is used with a learned critic, we repeated the previous experiment but this time using value estimates from a critic trained with GTD(λ\lambda) (Maei, 2011). We checked all combinations of the following critic step-sizes: {105,104,103,102,101,100}\{10^{-5},10^{-4},10^{-3},10^{-2},10^{-1},10^{0}\}, the following critic trace decay rates: {0,0.5,1.0}\{0,0.5,1.0\}, and the following actor step-sizes: {1010,108,106,104,102}\{10^{-10},10^{-8},10^{-6},10^{-4},10^{-2}\}, and used the best-performing combination (by area under the learning curve) for each value of η{\eta} in Figures 3(b) and 3(c). We averaged the results over 10 runs, and plotted the standard error as shaded regions. Figure 3 shows that, as before, even relatively small values of η{\eta} can achieve close to the optimal solution. However, η=0{\eta}=0 (OffPAC) still finds a suboptimal policy. Overall, the outcomes are similar to the previous experiment, although noisier due to the use of a learned critic rather than the true value function.

[Figure panels: (a) Actor step-size sensitivity; (b) Learning curves; (c) Optimal action probability]
Figure 3: Performance of ACE with a GTD(λ\lambda) critic and different values of η{\eta} in the counterexample.

So far we have only considered ACE with a discrete action policy parameterization. However, an appealing property of actor-critic algorithms is their ability to naturally handle continuous action spaces. To determine if the findings above generalize to continuous action policy parameterizations, we created an environment similar to Figure 1(a), but with one continuous unbounded action. Taking action with value aa at s0s_{0} will result in a transition to s1s_{1} with probability 1σ(a)1-\sigma(a) and a transition to s2s_{2} with probability σ(a)\sigma(a), where σ\sigma denotes the logistic sigmoid function. For all actions from s0s_{0}, the reward is zero. From s1s_{1} and s2s_{2}, the agent can only transition to the terminal state, with reward 2σ(a)2\sigma(-a) and σ(a)\sigma(a) respectively. The behaviour policy takes actions drawn from a Gaussian distribution with mean 1.01.0 and variance 1.01.0.
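For concreteness, the following Python sketch encodes the dynamics and rewards of this continuous-action MDP exactly as described above; the state encoding and behaviour-policy sampler are our own illustrative choices.

import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def step(state, action, rng):
    # One transition of the continuous-action counterexample described above.
    # States: 0 = s0, 1 = s1, 2 = s2, 3 = terminal. Rewards from s0 are zero;
    # from s1 the reward is 2*sigmoid(-a); from s2 it is sigmoid(a).
    if state == 0:
        next_state = 2 if rng.random() < sigmoid(action) else 1
        return next_state, 0.0, False
    if state == 1:
        return 3, 2.0 * sigmoid(-action), True
    if state == 2:
        return 3, sigmoid(action), True
    raise ValueError("terminal state has no transitions")

# The behaviour policy draws actions from a Gaussian with mean 1.0 and variance 1.0.
rng = np.random.default_rng(0)
behaviour_action = lambda: rng.normal(loc=1.0, scale=1.0)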

Because the environment has continuous actions, we can include both stochastic and deterministic policies, and so can include DPG in the comparison. DPG is built on the semi-gradient, like OffPAC (Silver et al., 2014). We include True-DPG with Emphatic weightings (True-DPGE), which uses the true emphatic weightings rather than estimated ones to avoid the issue of estimating the emphatic weightings for a deterministic target policy, and focus the investigation on whether DPG converges to a suboptimal solution in this setting. Estimation of the emphatic weightings for a deterministic target policy is left for future work. The stochastic actor in ACE has a linear output unit and a softplus output unit to represent the mean and the standard deviation of a Gaussian distribution. All actors are initialized with zero weights.

Figure 4 summarizes the results. The first observation is that DPG demonstrates suboptimal behaviour similar to OffPAC. As training goes on, DPG prefers to take positive actions in all states, because s2s_{2} is updated more often. This problem goes away in True-DPGE. The emphatic weightings emphasize updates in s1s_{1} and, thus, the actor gradually prefers negative actions and surpasses DPG in performance. Similarly, True-ACE learns to take negative actions but, being a stochastic policy, it cannot achieve True-DPGE’s performance on this domain. ACE with different η{\eta} values, however, cannot outperform DPG, and this result suggests that an alternative to importance sampling ratios is needed to effectively extend ACE to continuous actions.

[Figure panels: (a) Stepsize sensitivity; (b) Learning curves; (c) Mean action]
Figure 4: Performance of ACE with different values of η{\eta}, True-ACE, DPG, and True-DPGE on the continuous action MDP. The results are averaged over 30 runs. For continuous actions, the methods have even more difficulty getting to the optimal solutions, given by True-DPGE and True-ACE, though the action selection graphs suggest that ACE for higher η{\eta} is staying nearer the optimal action selection than ACE(0) and DPG.

9.2 Challenges in Estimating the Emphatic Weightings

Up to this point we have been using the emphatic trace originally proposed by Sutton et al. (2016) to estimate the emphatic weightings. As discussed in Section 5, there can be multiple sources of inaccuracy from using this Monte Carlo estimate in an actor-critic framework.

However, it is unclear how the issues affecting the emphatic trace manifest in practice, and hence whether introducing a parametrized function to directly estimate the emphatic weightings is worth the added bias—especially if the representation is poor. To study the effects of this choice of estimator on the ACE algorithm, we conducted a series of experiments starting with a modified version of the counterexample in Figure 1(a), moving to two classic control environments, and ending with a challenging environment designed to illustrate the issues associated with both estimators.

Figure 5: An 11-state MDP that makes estimating the emphatic weightings with the emphatic trace more difficult.

The first environment, shown in Figure 5, is an extended version of the counterexample with two long chains before the aliased states. Like the original counterexample, the behaviour policy takes a0a_{0} with probability 0.250.25 and a1a_{1} with probability 0.750.75 in all non-terminal states, and the interest i(s)i(s) is set to 11 for all states. Each state is represented by a unique feature vector except states s9s_{9} and s10s_{10}, which are aliased and appear the same to the actor. The addition of the new states makes trajectories considerably longer, which may exacerbate the issues with the emphatic trace.

We repeated the experiment from Figure 2 on the long counterexample. The following actor step-sizes were tested: {5105,104,2104,5104,103,2103,5103,102}\{5\cdot 10^{-5},10^{-4},2\cdot 10^{-4},5\cdot 10^{-4},10^{-3},2\cdot 10^{-3},5\cdot 10^{-3},10^{-2}\}, with the best-performing value (by total area under the learning curve, averaged over 10 runs) for each value of η{\eta} used in Figures 6(b) and 6(c). The actor was again initialized to take each action with equal probability, and the true state values were again used in the updates in order to isolate the effects of the emphatic trace.

We also trained an actor called True-ACE that uses the true emphatic weightings for the current target policy and behaviour policy, computed at each timestep. The performance of True-ACE is included here for the sake of comparison, as computing the exact emphatic weightings is not generally possible in an unknown environment.

The results in Figure 6 show that, even though performance improves as \eta is increased, there is a significant gap between ACE with \eta=1 and True-ACE. Unlike Figure 2, the methods have more difficulty reaching the optimal solution, although ACE with larger \eta does still find a significantly better solution than \eta=0. Additionally, values of \eta greater than 0 result in high-variance estimates of the emphatic weightings, which lead to more variable performance, as shown by the larger shaded regions representing standard error. Overall, these results show that the inaccuracies pointed out in Section 5 can indeed disturb the updates when trajectories are long.

[Figure panels: (a) Actor step-size sensitivity; (b) Learning curves; (c) Optimal action probability]
Figure 6: Performance of ACE with different values of η{\eta} on the 11-state MDP.

9.3 Estimating the Emphatic Weightings in Classic Control Environments

The results of the previous experiments suggest that using the emphatic trace to estimate the emphatic weightings can significantly impact the performance of ACE. To better understand how the issues with the emphatic trace affect the learning process beyond our counterexample, we tested several variants of ACE on off-policy versions of two classic control environments: Puddle World (Degris et al., 2012b) and Mountain Car (Moore, 1990).

As an off-policy Monte Carlo estimator of the emphatic weightings, the emphatic trace can yield extremely high-variance estimates which can interfere with learning (Ghiassian et al., 2018). To see how the variance of the emphatic trace affects ACE, we tested a variant called ACE-direct that uses the one-step temporal-difference update from Section 5.2.2 to estimate the emphatic weightings. Temporal-difference methods often have lower variance than Monte Carlo methods—at the cost of introducing bias—which makes using a TD-style update to estimate the emphatic weightings an appealing alternative to the emphatic trace (Sutton and Barto, 2018).

As discussed in Section 5.2.2, the emphatic trace can also be biased when used in an actor-critic algorithm where the target policy is changing. To determine how detrimental this bias is to the performance of ACE, we tested a variant called ACE-ideal where all importance sampling ratios in the emphatic trace are re-computed at each time step using the current target policy, yielding an unbiased Monte Carlo estimate of the emphatic weighting for the current state. ACE-ideal is not a practical algorithm, as the computation and memory required grow with the number of time steps, but it allows us to isolate the effects of the bias introduced by using the emphatic trace with a changing target policy.

Other baselines for comparison included OffPAC, and ACE using the emphatic trace (referred to as ACE-trace). To determine the effects of different choices of critic, we included versions of each algorithm using an ETD critic (Sutton et al., 2016) and a TDRC critic (Ghiassian et al., 2020). We also included versions of each algorithm using the uniform interest function (i.e., i=1i=1 for all time steps), as well as the episodic interest function from Section 7—with the exception of OffPAC. OffPAC scales policy updates using only the interest for the current time step, which when combined with the episodic interest function leads to a single policy update on the first time step followed by no updates thereafter.

To get a clear picture of the performance of each algorithm, we conducted a grid search on the free parameters of each algorithm. For the step-size parameter of the actor, critic, and the direct estimator of the emphatic weightings, we tested values of the form (\tfrac{1}{2})^{i} where i ranged from 0 to 15. For the trace decay rate of the critic, we tested values of the form 1-(\tfrac{1}{2})^{j} where j ranged from 0 to 6. The discount factor was 0.95. We sought to establish the performance of each variant for \eta=1, as they all reduce to OffPAC when \eta=0.
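For reference, the hyperparameter grids just described can be generated as follows (a trivial sketch; the variable names are ours):

step_sizes = [0.5 ** i for i in range(16)]            # (1/2)^i for i = 0, ..., 15
trace_decay_rates = [1 - 0.5 ** j for j in range(7)]  # 1 - (1/2)^j for j = 0, ..., 6
discount_factor = 0.95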

Each combination of parameters for each algorithm was run on 5 different trajectories generated by a fixed behaviour policy interacting with the environment for 100,000 time steps. The learned policies were saved every 1,000 time steps and evaluated 50 times using both the episodic and excursions objective functions from Section 3. For the episodic objective function, the policies were evaluated by creating 50 different instances of the environment and executing the target policy from the starting state until termination or 1000 time steps had elapsed. The excursions objective function was evaluated similarly, but with the environment’s starting state drawn from the behaviour policy’s steady state distribution, chosen by running the behaviour policy for 50,000 time steps and saving every thousandth state. The results were averaged, and the best-performing combinations (by area under the learning curve for the appropriate objective function) were re-run on enough different trajectories to reduce the standard error to an acceptable level (100 runs for Puddle World, 30 runs for Mountain Car). The initial parameter sweep was intended to find good parameter settings for each method at a reasonable computational cost, and not necessarily the absolute best performance.

9.3.1 Puddle World

The Puddle World environment is a 2-dimensional continuous gridworld containing a start location, goal location, and puddles through which it is costly to move. While it first appeared in Boyan and Moore (1995), we use the version from Degris et al. (2012b) that includes an additional puddle near the goal state which makes the task more difficult (see Degris et al., 2012b, for a picture of the Puddle World environment). The behaviour policy took the North, East, South, and West actions with probabilities .45, .45, .05, and .05 respectively. The observations were tile coded with a fairly low-resolution tile coder (4 tilings of 2\times 2 tiles plus a bias unit) to generate feature vectors with a large degree of generalization.

Figure 7 presents the results for the Puddle World environment. The left-hand column shows learning curves, while the right-hand column contains sensitivity analyses for the actor’s step size. The first four plots (figures 7(a), 7(b), 7(c), and 7(d)) show results when using ETD as the critic, while the last four plots (figures 7(e), 7(f), 7(g), and 7(h)) show results for a TDRC critic. The first and third rows show results using the excursions objective function, and the second and fourth rows show results for the episodic objective function.

[Figure panels: (a) Excursions learning curves (ETD); (b) Excursions step-size sensitivity (ETD); (c) Episodic learning curves (ETD); (d) Episodic step-size sensitivity (ETD); (e) Excursions learning curves (TDRC); (f) Excursions step-size sensitivity (TDRC); (g) Episodic learning curves (TDRC); (h) Episodic step-size sensitivity (TDRC)]
Figure 7: Results on Puddle World. Shaded regions are 95% confidence intervals.

Comparing the performance of each algorithm under the excursions and episodic objective function, we can see that the algorithms all performed better in the excursions setting than the episodic setting. This is due to the excursions setting being easier than the episodic setting for the classic control problems in this section. In the excursions setting, the agent starts from the behaviour policy’s state distribution, which in the problems studied is closer to the goal state than the start state. This is not necessarily true for all problems, however, and in general there is no strict ordering of objectives in terms of difficulty.

ACE variants that used the episodic interest function (dashed lines) generally performed much worse than their counterparts using the uniform interest function (solid lines), with the exception of ACE-direct. This is consistent with the discussion in Section 7 of Thomas (2014)—although that work deals with the on-policy case—which states that scaling policy updates by an accumulating product of discount factors hurts the sample efficiency of the algorithm. The off-policy case can exacerbate this issue, as the accumulating product includes both discount factors and importance sampling ratios. The fact that the direct method of estimating emphatic weightings allowed ACE to perform well when using the episodic interest function is intriguing and merits further study.

The choice of critic affected the performance of the algorithms to differing degrees. Replacing an ETD critic with a TDRC critic improved the performance of each of the algorithms, with episodic performance showing the greatest improvement (compare figures 7(c) and 7(g)), perhaps due to the excursions objective being easier for this environment as previously mentioned. ACE-direct with uniform interest performed well regardless of the choice of critic, but improved substantially in the episodic setting when using a TDRC critic. Using a TDRC critic instead of an ETD critic also improved the sensitivity of the algorithms to the actor’s step size, allowing a slightly wider range of step sizes to perform well for most methods (compare Figure 7(b) to 7(f), and Figure 7(d) to 7(h)).

When comparing ACE-trace (orange) to ACE-ideal (blue)—where the emphatic trace's importance sampling ratios were computed using the current policy, resulting in an unbiased Monte Carlo estimate of the emphatic weightings—we can see that correcting the bias introduced by using the emphatic trace with a changing policy did not significantly improve performance except in one situation: when performance was evaluated with the excursions objective function (figures 7(a) and 7(e)). When using an ETD critic (Figure 7(a)), ACE-ideal outperformed ACE-trace early in learning before deteriorating to a similar level of performance by the end of the experiment. Conversely, ACE-ideal outperformed ACE-trace throughout the experiment when using a TDRC critic (Figure 7(e)), although the difference was barely significant. In all other situations, the performance of ACE-ideal was not significantly different from the performance of ACE-trace.

Overall, ACE-direct performed better than or equal to ACE-trace, ACE-ideal, and OffPAC across all combinations of critic, interest function, and objective function. However, it diverged for some actor step sizes when used with an ETD critic (figures 7(b) and 7(d)). This could be due to the use of a semi-gradient update rule for the direct method of estimating emphatic weightings (analogous to off-policy TD(0), which is not guaranteed to converge when used with function approximation). To resolve this issue, one could use the gradient-based update rule in Equation 18 of Section 5.2.2. We chose the semi-gradient update to keep both the explanation and experiments simple, as there were already a large number of concepts and algorithmic parameters involved.

9.3.2 Off-policy Mountain Car

In the original Mountain Car environment, the agent attempts to drive an underpowered car out of a valley (Moore, 1990); see Degris et al. (2012b) for a picture of the Mountain Car environment. In the off-policy version, the agent learns from experience generated by a fixed behaviour policy, in this case the uniform random policy. The observations were tile coded using 8 tilings of 4\times 4 tiles plus a bias unit. ACE-ideal was omitted from the Mountain Car experiments due to computational constraints; the uniform random behaviour policy rarely completed an episode, which resulted in extremely long trajectories that prevented ACE-ideal from running in a reasonable amount of time.

Figure 8 contains the results for the Off-policy Mountain Car environment. The left-hand column shows learning curves, while the right-hand column shows sensitivity analyses for the actor’s step size. The first four plots (figures 8(a), 8(b), 8(c), and 8(d)) show results when using ETD as the critic, while the last four plots (figures 8(e), 8(f), 8(g), and 8(h)) show results for a TDRC critic. The first and third rows show results using the excursions objective function, and the second and fourth rows show results for the episodic objective function.

[Figure panels: (a) Excursions learning curves (ETD); (b) Excursions step-size sensitivity (ETD); (c) Episodic learning curves (ETD); (d) Episodic step-size sensitivity (ETD); (e) Excursions learning curves (TDRC); (f) Excursions step-size sensitivity (TDRC); (g) Episodic learning curves (TDRC); (h) Episodic step-size sensitivity (TDRC)]
Figure 8: Results on Off-policy Mountain Car. Shaded regions are 95% confidence intervals.

Across all choices of critic and objective function, algorithms trained using the episodic interest function (dashed lines) performed much worse than their counterparts trained using the uniform interest function (solid lines). This finding is consistent with the results from Puddle World, although much more pronounced. In this specific experiment, the extremely long trajectories generated by the uniform random behaviour policy caused the emphatic trace (and hence the policy updates) of the algorithms that used episodic interest to quickly shrink to the point of irrelevance, whereas in the Puddle World experiment the behaviour policy encounters the goal state more often, limiting trajectory length and shrinking the updates more slowly. Nevertheless, using the direct method to estimate the emphatic weightings allowed ACE to improve the performance of the policy in some cases, whereas using the emphatic trace prevented the policy from meaningfully improving.

Again, the choice of critic played a role in the performance of all the algorithms to varying degrees. Using a TDRC critic instead of an ETD critic improved the performance of each of the algorithms, with OffPAC seeing the greatest improvement, followed by ACE-trace. ACE-direct again performed well regardless of the choice of critic. Using a TDRC critic improved the sensitivity of each of the algorithms to the actor’s step size, allowing a wider range of step sizes to perform well (compare figures 8(b) and 8(f), and 8(d) and 8(h)).

Comparing ACE-direct (red) to ACE-trace (orange) reveals several advantages of using the direct method to estimate the emphatic weightings. The direct method allowed ACE to learn faster, with less variance, and to find a better-performing target policy by the end of the experiment, regardless of the choice of critic or objective function used for evaluation. In addition, the direct method allowed a wider range of actor step sizes to perform well, with performance increasing and decreasing more smoothly and predictably than when using the emphatic trace. Finally, the choice of critic had little effect on the performance of ACE-direct, whereas ACE-trace performed better with a TDRC critic than with an ETD critic. Overall, ACE with the direct method performed better than or equal to OffPAC (green) and ACE-trace for all critics and objective functions, and was less sensitive to the actor step size and choice of critic than OffPAC.

9.4 The Virtual Office Environment

The classic control environments considered in the previous section are well-known environments for testing general off-policy algorithms, but were not designed to probe the weaknesses of the specific algorithms being studied. Hence, both semi-gradient updates and the direct method of estimating emphatic weightings perform well, and the tradeoffs made in the design of each algorithm are not obvious. In this section, we introduce an environment designed to highlight issues with both semi-gradient updates and the direct method of estimating emphatic weightings.

The environment is based on Dung et al. (2007)’s Virtual Office, a grid world consisting of a hallway and two cubicles that are identical except for the goal locations. To illustrate the problem with semi-gradient updates, two key properties of the counterexample from Section 8.1 must be incorporated: the optimal actions in the two identical rooms must be different, and the behaviour policy must visit the suboptimal room more often. To accomplish the first goal, we introduced hidden goal states in the north-east and south-east corners of the two rooms, and assigned rewards of 1 and 0 respectively to the goal states of the north-east room and rewards of 0 and .5 respectively to the goal states of the south-east room. To encourage the behaviour policy to visit the suboptimal room (i.e., the south-east room) more often, we fixed the starting state to the left-most, middle-most state and used a behaviour policy that favoured the South and East actions, taking the North, East, South, and West actions with probabilities .2, .4, .3, and .1 respectively.

While Dung et al. (2007)’s Virtual Office contains some partial observability in the form of the two identical rooms, to fully illustrate the tradeoff made by the Markov assumption used in the derivation of the direct method, we limited the agent’s observations to a 3×33\times 3 grid in front of the agent. The RGB colour values of each square in the agent’s view were used as observations, with the agent unable to see through walls or perceive the goal states.

The final environment was implemented using Minigrid (Chevalier-Boisvert et al., 2018), and is depicted in Figure 9. The agent (red triangle) must navigate to one of the terminal states (the north-east and south-east squares of the green rooms) to obtain the associated reward. The agent observes the RGB colour values of the 3×33\times 3 grid of squares in front of it (the lighter blue squares), and selects the North, East, South, or West action in response.

For this experiment we followed the methodology detailed in the previous section, with minor changes. First, we restricted the critic trace decay rate to be 0, as eligibility traces can alleviate some of the issues caused by partial observability (Loch and Singh, 1998) that we aim to expose. Second, we ran each combination of parameters for 30 runs of 200,000 time steps each. The learned policies were saved every 5,000 time steps and evaluated 50 times using both the episodic and excursions objective functions. The best-performing parameter settings were then re-run 100 times and averaged.

Figure 9: The Virtual Office environment.

Figure 10 contains the results of our experiment on the Virtual Office environment. The left-hand column shows learning curves, while the right-hand column shows sensitivity analyses for the actor’s step size. The first four plots (figures 10(a), 10(b), 10(c), and 10(d)) show results when using ETD as the critic, while the last four plots (figures 10(e), 10(f), 10(g), and 10(h)) show results for a TDRC critic. The first and third rows show results for the excursions objective function, and the second and fourth rows show the episodic objective function.

Figure 10: Results on the Virtual Office. Shaded regions are 95% confidence intervals. Panels: (a) Excursions learning curves (ETD); (b) Excursions step-size sensitivity (ETD); (c) Episodic learning curves (ETD); (d) Episodic step-size sensitivity (ETD); (e) Excursions learning curves (TDRC); (f) Excursions step-size sensitivity (TDRC); (g) Episodic learning curves (TDRC); (h) Episodic step-size sensitivity (TDRC).

Interestingly, algorithms that used the episodic interest function (dashed lines) performed better than or equal to their counterparts that used the uniform interest function (solid lines) for all combinations of objective function and critic, unlike the previous experiments on Mountain Car and Puddle World. In addition, the algorithms that used the episodic interest function worked well for a larger range of step sizes, making it easier to find a value that performs well. This could be due to the partially observable nature of the environment; many of the agent’s observations are similar, which could cause the emphatic weightings with a uniform interest function to quickly become very large, necessitating a very small step size to compensate.

As in the previous experiments, the choice of critic played an important role in the performance of all the algorithms, although this time with mixed results. Using a TDRC critic instead of an ETD critic improved the performance of OffPAC and ACE-direct with uniform interest, but had very little impact on ACE-trace with uniform interest and ACE-direct with episodic interest, and actually reduced the performance of ACE-trace with episodic interest (compare figures 10(a) and 10(e), and 10(c) and 10(g)). Similarly, using a TDRC critic had mixed effects on the sensitivity of each of the algorithms to the actor’s step size. It allowed a wider range of step sizes to perform well for some algorithms, but also led to a narrower range of well-performing step sizes for some algorithms (compare figures 10(b) and 10(f), and 10(d) and 10(h)).

ACE-direct with the episodic interest function performed better than or equal to the other methods across most combinations of critic and objective function, with the possible exception of the episodic objective function. With a TDRC critic, OffPAC learned more quickly than the other methods before eventually being surpassed by both ACE-direct variants. However, OffPAC was able to close the gap near the end of the experiment, with the final learned policy of OffPAC performing comparably to the ACE-direct variants. With an ETD critic, ACE-trace with episodic interest performed surprisingly well, with the final policy possibly outperforming ACE-direct with episodic interest. However, the variance of the policy learned by ACE-trace was much larger than the other methods, rendering the difference in performance not statistically significant. Conversely, ACE-direct with episodic interest learned a low-variance policy on each of the settings tested, performing as well as or better than the other algorithms.

9.5 Summary

In this section we investigated several key properties of the ACE algorithm empirically. First we studied how the parameter $\eta$ trades off bias and variance in the counterexample from Section 8.1 and a continuous-action version, finding that even small values of $\eta$ (i.e., closer to semi-gradient updates) greatly improved the quality of the final learned policy, both when using the true value function and when using a learned critic.

Next we tested the ability of the emphatic trace to accurately estimate the emphatic weightings on an extended version of the counterexample, showing that while higher values of $\eta$ improved performance, a significant gap remained between ACE with $\eta=1$ and ACE with the true emphatic weightings.

Then we moved to two classic control domains and tested several variants of ACE to determine how the choice of estimator for the emphatic weightings (and associated bias and variance properties) and the choice of critic affects the learning process. We found that the emphatic trace’s extreme variance greatly reduced the performance of ACE—with its bias playing somewhat less of a role—and replacing the emphatic trace with the direct method from Section 5.2.2 resulted in ACE performing as well as or better than the other variants. The TDRC critic was the better critic overall, but ACE with the direct method of estimating emphatic weightings was insensitive to the choice of critic and performed well regardless of whether ETD or TDRC was used.

Finally, we compared the performance of several variants of ACE on a new environment designed to highlight the weaknesses of each of the variants, finding that the episodic variant of ACE with the direct method of estimating emphatic weightings performed as well as or better than the other variants in almost all the situations tested, suggesting it should be the default method of estimating the emphatic weightings.

10 Conclusion

In this paper we introduced a generalized objective for learning policies and proved the off-policy policy gradient theorem using emphatic weightings. Using this theorem, we derived an off-policy actor-critic algorithm that follows the gradient of the objective function, as opposed to previous methods like OffPAC and DPG that follow an approximate semi-gradient. We designed a simple MDP to highlight that stationary points for semi-gradients can be highly suboptimal, with semi-gradient updates moving away from the optimal solution. We also showed, though, that semi-gradient methods can produce reasonable solutions, either because of sufficiently rich policy parameterizations or because in some cases they can be seen as using the true gradient update with a different state weighting.

We leverage these insights to improve the practicality of the method empirically. We design Actor-Critic with Emphatic Weightings (ACE) to have a tuneable parameter $\eta$ that sweeps between the full gradient (at $\eta=1$) and the semi-gradient (at $\eta=0$), often obtaining the best performance with a relatively small but non-zero $\eta$ (typically $\eta=0.1$). We additionally reduce variance by directly estimating emphatic weightings with function approximation rather than using the emphatic trace. Empirically, the direct method of estimating the emphatic weightings performed better than the emphatic trace across all objective functions and environments tested. In addition, ACE with the direct method was less sensitive to the actor step size and choice of critic than OffPAC. Overall, ACE with the direct method performed better than or equal to OffPAC in all situations, suggesting that incorporating (low-variance) state reweightings might be a reasonable direction for future policy gradient methods.


Acknowledgments and Disclosure of Funding

We gratefully acknowledge funding from the Natural Sciences and Engineering Research Council of Canada (NSERC), the Canada CIFAR AI Chair program, and the Alberta Machine Intelligence Institute (Amii). This research was enabled in part by support provided by SciNet and the Digital Research Alliance of Canada.

Appendix A Proof of Deterministic Off-Policy Policy Gradient Theorem

A.1 Assumptions

We make the following assumptions on the MDP:

Assumption 1

$\text{P}(s'|s,a)$, $r(s,a,s')$, $\gamma(s,a,s')$, and $\pi(s;\bm{\theta})$ and their derivatives are continuous in all variables $s$, $a$, $s'$, $\bm{\theta}$.

Assumption 2

$\mathcal{S}$ is a compact set in $\mathbb{R}^{d}$, and $\mathcal{A}$ is a compact set in $\mathbb{R}$.

Assumption 3

The policy $\pi$ and discount $\gamma$ are such that the inverse kernel of $\delta(s,s')-\mathbf{P}_{\pi,\gamma}(s,s')$ exists, where $\mathbf{P}_{\pi,\gamma}(s,s')=\int_{\mathcal{A}}\pi(a|s)\gamma(s,a,s')\text{P}(s'|s,a)\,\mathrm{d}a$.

Under Assumption 1, $v_{\pi_{\bm{\theta}}}(s)$ and $\frac{\partial v_{\pi_{\bm{\theta}}}(s)}{\partial\bm{\theta}}$ are continuous functions of $\bm{\theta}$ and $s$. Together, Assumptions 1 and 2 imply that $\bigl\lVert\frac{\partial v_{\pi_{\bm{\theta}}}(s)}{\partial\bm{\theta}}\bigr\rVert$, $\bigl\lVert\frac{\partial\pi(s;\bm{\theta})}{\partial\bm{\theta}}\bigr\rVert$, and $\bigl\lVert\frac{\partial q_{\pi_{\bm{\theta}}}(s,a)}{\partial a}\big\rvert_{a=\pi(s;\bm{\theta})}\bigr\rVert$ are bounded functions of $s$, which allows us to switch the order of integration and differentiation, and the order of multiple integrations.

A.2 Proof of Theorem 3

Proof  We start by deriving a recursive form for the gradient of the value function with respect to the policy parameters:

vπ(s)𝜽\displaystyle\frac{\partial v_{\pi}(s)}{\partial\bm{\mathbf{\theta}}} =𝜽qπ(s,π(s;𝜽))\displaystyle=\frac{\partial}{\partial\bm{\mathbf{\theta}}}q_{\pi}(s,\pi(s;\bm{\mathbf{\theta}})){}
=𝜽𝒮P(s|s,π(s;𝜽))(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s))ds\displaystyle=\frac{\partial}{\partial\bm{\mathbf{\theta}}}\int_{\mathcal{S}}\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}\mathop{}\!\mathrm{d}s^{\prime}{}
=𝒮𝜽(P(s|s,π(s;𝜽))(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s)))ds,\displaystyle=\int_{\mathcal{S}}\frac{\partial}{\partial\bm{\mathbf{\theta}}}\left(\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}\right)\mathop{}\!\mathrm{d}s^{\prime}, (21)

where in Equation 21 we used the Leibniz integral rule to switch the order of integration and differentiation. We proceed with the derivation using the product rule:

=\displaystyle={} 𝒮𝜽P(s|s,π(s;𝜽))(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s))\displaystyle\int_{\mathcal{S}}\frac{\partial}{\partial\bm{\mathbf{\theta}}}\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}
+P(s|s,π(s;𝜽))𝜽(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s))ds\displaystyle+\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\frac{\partial}{\partial\bm{\mathbf{\theta}}}\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}\mathop{}\!\mathrm{d}s^{\prime}
=\displaystyle={} 𝒮π(s;𝜽)𝜽P(s|s,a)a|a=π(s;𝜽)(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s))+P(s|s,π(s;𝜽))\displaystyle\int_{\mathcal{S}}\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\left.\frac{\partial\text{P}(s^{\prime}|s,a)}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}+\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))
(𝜽r(s,π(s;𝜽),s)+𝜽γ(s,π(s;𝜽),s)vπ(s)+γ(s,π(s;𝜽),s)𝜽vπ(s))ds\displaystyle\qquad\left(\frac{\partial}{\partial\bm{\mathbf{\theta}}}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\frac{\partial}{\partial\bm{\mathbf{\theta}}}\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})\frac{\partial}{\partial\bm{\mathbf{\theta}}}v_{\pi}(s^{\prime})\right)\mathop{}\!\mathrm{d}s^{\prime}
=\displaystyle={} 𝒮π(s;𝜽)𝜽P(s|s,a)a|a=π(s;𝜽)(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s))\displaystyle\int_{\mathcal{S}}\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\left.\frac{\partial\text{P}(s^{\prime}|s,a)}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}
+P(s|s,π(s;𝜽))(π(s;𝜽)𝜽r(s,a,s)a|a=π(s;𝜽)+π(s;𝜽)𝜽γ(s,a,s)a|a=π(s;𝜽)vπ(s))ds\displaystyle+\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\left(\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\left.\frac{\partial r(s,a,s^{\prime})}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}+\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\left.\frac{\partial\gamma(s,a,s^{\prime})}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}v_{\pi}(s^{\prime})\right)\mathop{}\!\mathrm{d}s^{\prime}
+𝒮P(s|s,π(s;𝜽))γ(s,π(s;𝜽),s)𝜽vπ(s)ds\displaystyle+\int_{\mathcal{S}}\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})\frac{\partial}{\partial\bm{\mathbf{\theta}}}v_{\pi}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}
=\displaystyle={} 𝒮π(s;𝜽)𝜽(P(s|s,a)a|a=π(s;𝜽)(r(s,π(s;𝜽),s)+γ(s,π(s;𝜽),s)vπ(s))\displaystyle\int_{\mathcal{S}}\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\biggr{(}\left.\frac{\partial\text{P}(s^{\prime}|s,a)}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}\Big{(}r(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})+\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})v_{\pi}(s^{\prime})\Big{)}
+P(s|s,π(s;𝜽))a(r(s,a,s)+γ(s,a,s)vπ(s))|a=π(s;𝜽))ds\displaystyle+\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\left.\frac{\partial}{\partial a}\Big{(}r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})v_{\pi}(s^{\prime})\Big{)}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}\biggr{)}\mathop{}\!\mathrm{d}s^{\prime}
+𝒮P(s|s,π(s;𝜽))γ(s,π(s;𝜽),s)𝜽vπ(s)ds\displaystyle+\int_{\mathcal{S}}\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})\frac{\partial}{\partial\bm{\mathbf{\theta}}}v_{\pi}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}
=\displaystyle={} π(s;𝜽)𝜽𝒮a(P(s|s,a)(r(s,a,s)+γ(s,a,s)vπ(s)))|a=π(s;𝜽)ds\displaystyle\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\int_{\mathcal{S}}\left.\frac{\partial}{\partial a}\left(\text{P}(s^{\prime}|s,a)\Big{(}r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})v_{\pi}(s^{\prime})\Big{)}\right)\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}\mathop{}\!\mathrm{d}s^{\prime}
+𝒮P(s|s,π(s;𝜽))γ(s,π(s;𝜽),s)vπ(s)𝜽ds\displaystyle+\int_{\mathcal{S}}\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}
=\displaystyle={} π(s;𝜽)𝜽qπ(s,a)a|a=π(s;𝜽)+𝒮P(s|s,π(s;𝜽))γ(s,π(s;𝜽),s)vπ(s)𝜽ds.\displaystyle\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\left.\frac{\partial q_{\pi}(s,a)}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}+\int_{\mathcal{S}}\text{P}(s^{\prime}|s,\pi(s;\bm{\mathbf{\theta}}))\gamma(s,\pi(s;\bm{\mathbf{\theta}}),s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}. (22)

For simplicity of notation, we will write Equation 22 as

\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}=g(s)+\int_{\mathcal{S}}\mathbf{P}_{\pi,\gamma}(s,s')\frac{\partial v_{\pi}(s')}{\partial\bm{\theta}}\,\mathrm{d}s', \qquad (23)

since $\text{P}(s'|s,\pi(s;\bm{\theta}))\gamma(s,\pi(s;\bm{\theta}),s')$ is a function of $s$ and $s'$ for a fixed deterministic policy. Then we can write $\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}$ as an integral transform using the delta function:

\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}=\int_{\mathcal{S}}\delta(s,s')\frac{\partial v_{\pi}(s')}{\partial\bm{\theta}}\,\mathrm{d}s'. \qquad (24)

Plugging Equation 24 into the left-hand side of Equation 23, we obtain:

𝒮δ(s,s)vπ(s)𝜽ds=g(s)+𝒮𝐏π,γ(s,s)vπ(s)𝜽ds\displaystyle\int_{\mathcal{S}}\delta(s,s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}=g(s)+\int_{\mathcal{S}}\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}
𝒮(δ(s,s)𝐏π,γ(s,s))vπ(s)𝜽ds=g(s)\displaystyle\implies\int_{\mathcal{S}}\Big{(}\delta(s,s^{\prime})-\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})\Big{)}\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}=g(s)
vπ(s)𝜽=𝒮k(s,s)g(s)ds,\displaystyle\implies\frac{\partial v_{\pi}(s)}{\partial\bm{\mathbf{\theta}}}=\int_{\mathcal{S}}k(s,s^{\prime})g(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}, (25)

where $k(s,s')$ is the inverse kernel of $\delta(s,s')-\mathbf{P}_{\pi,\gamma}(s,s')$. Now, using the continuous version of the weighted excursions objective defined in Equation 6, we have:

Jμ(𝜽)𝜽=\displaystyle\frac{\partial J_{\mu}(\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}={} 𝜽𝒮dμ(s)i(s)vπ(s)ds\displaystyle\frac{\partial}{\partial\bm{\mathbf{\theta}}}\int_{\mathcal{S}}d_{\mu}(s)i(s)v_{\pi}(s)\mathop{}\!\mathrm{d}s
=\displaystyle={} 𝒮vπ(s)𝜽dμ(s)i(s)ds\displaystyle\int_{\mathcal{S}}\frac{\partial v_{\pi}(s)}{\partial\bm{\mathbf{\theta}}}d_{\mu}(s)i(s)\mathop{}\!\mathrm{d}s
=\displaystyle={} 𝒮𝒮k(s,s)g(s)dsdμ(s)i(s)ds\displaystyle\int_{\mathcal{S}}\int_{\mathcal{S}}k(s,s^{\prime})g(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}\,d_{\mu}(s)i(s)\mathop{}\!\mathrm{d}s
=\displaystyle={} 𝒮𝒮k(s,s)dμ(s)i(s)dsg(s)ds,\displaystyle\int_{\mathcal{S}}\int_{\mathcal{S}}k(s,s^{\prime})d_{\mu}(s)i(s)\mathop{}\!\mathrm{d}s\,g(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}, (26)

where in Equation 26 we used Fubini’s theorem to switch the order of integration.

Now, we convert the recursive definition of emphatic weightings for deterministic policies over continuous state-action spaces into a non-recursive form. Recall the definition:

m(s')=d_{\mu}(s')i(s')+\int_{\mathcal{S}}\mathbf{P}_{\pi,\gamma}(s,s')m(s)\,\mathrm{d}s. \qquad (27)

Again, we can write $m(s')$ as an integral transform using the delta function:

m(s')=\int_{\mathcal{S}}\delta(s,s')m(s)\,\mathrm{d}s. \qquad (28)

Plugging Equation 28 into the left-hand side of Equation 27, we obtain:

𝒮δ(s,s)m(s)ds=dμ(s)i(s)+𝒮𝐏π,γ(s,s)m(s)ds\displaystyle\int_{\mathcal{S}}\delta(s,s^{\prime})m(s)\mathop{}\!\mathrm{d}s=d_{\mu}(s^{\prime})i(s^{\prime})+\int_{\mathcal{S}}\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})m(s)\mathop{}\!\mathrm{d}s
𝒮(δ(s,s)𝐏π,γ(s,s))m(s)ds=dμ(s)i(s)\displaystyle\implies\int_{\mathcal{S}}(\delta(s,s^{\prime})-\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime}))m(s)\mathop{}\!\mathrm{d}s=d_{\mu}(s^{\prime})i(s^{\prime})
m(s)=𝒮k(s,s)dμ(s)i(s)ds,\displaystyle\implies m(s^{\prime})=\int_{\mathcal{S}}k(s,s^{\prime})d_{\mu}(s)i(s)\mathop{}\!\mathrm{d}s, (29)

where $k(s,s')$ is again the inverse kernel of $\delta(s,s')-\mathbf{P}_{\pi,\gamma}(s,s')$. Plugging Equation 29 into Equation 26 yields

Jμ(𝜽)𝜽\displaystyle\frac{\partial J_{\mu}(\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}} =𝒮m(s)g(s)ds\displaystyle=\int_{\mathcal{S}}m(s^{\prime})g(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}
=𝒮m(s)π(s;𝜽)𝜽qπ(s,a)a|a=π(s;𝜽)ds.\displaystyle=\int_{\mathcal{S}}m(s)\frac{\partial\pi(s;\bm{\mathbf{\theta}})}{\partial\bm{\mathbf{\theta}}}\left.\frac{\partial q_{\pi}(s,a)}{\partial a}\right\rvert_{a=\pi(s;\bm{\mathbf{\theta}})}\mathop{}\!\mathrm{d}s.

 

Appendix B Continuous State Off-Policy Policy Gradient Theorem

Given the Deterministic Off-Policy Policy Gradient Theorem in Appendix A, we can now provide a proof for the continuous-state version of the Off-Policy Policy Gradient Theorem provided in Theorem 2.

Proof  The gradient of the objective in the continuous-states case is

\frac{\partial J_{\mu}(\bm{\theta})}{\partial\bm{\theta}}=\frac{\partial}{\partial\bm{\theta}}\int_{\mathcal{S}}i(s)v_{\pi}(s)\,\mathrm{d}s=\int_{\mathcal{S}}i(s)\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}\,\mathrm{d}s. \qquad (30)

Therefore, again to compute the gradient of $J_{\mu}$ we need to compute the gradient of the value function with respect to the policy parameters. A recursive form of the gradient of the value function can be derived, as we show below. Before starting, for simplicity of notation, we will again use

\mathbf{g}(s)=\sum_{a}\frac{\partial\pi(s,a;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a),

where $\mathbf{g}:\mathcal{S}\rightarrow\mathbb{R}^{d}$. Now let us compute the gradient of the value function:

vπ(s)𝜽\displaystyle\frac{\partial v_{\pi}(s)}{\partial{\bm{\theta}}} =𝜽aπ(s,a;𝜽)qπ(s,a)\displaystyle=\frac{\partial}{\partial{\bm{\theta}}}\sum_{a}\pi(s,a;{\bm{\theta}})q_{\pi}(s,a)
=aπ(s,a;𝜽)𝜽qπ(s,a)+aπ(s,a;𝜽)qπ(s,a)𝜽\displaystyle=\sum_{a}\frac{\partial\pi(s,a;{\bm{\theta}})}{\partial{\bm{\theta}}}q_{\pi}(s,a)+\sum_{a}\pi(s,a;{\bm{\theta}})\frac{\partial q_{\pi}(s,a)}{\partial{\bm{\theta}}}
=𝐠(s)+aπ(s,a;𝜽)𝒮P(s|s,a)(r(s,a,s)+γ(s,a,s)vπ(s))ds𝜽\displaystyle=\mathbf{g}(s)+\sum_{a}\pi(s,a;{\bm{\theta}})\frac{\partial\int_{\mathcal{S}}\text{P}(s^{\prime}|s,a)(r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})v_{\pi}(s^{\prime}))\mathop{}\!\mathrm{d}s^{\prime}}{\partial{\bm{\theta}}}
=𝐠(s)+aπ(s,a;𝜽)𝒮P(s|s,a)γ(s,a,s)vπ(s)𝜽ds\displaystyle=\mathbf{g}(s)+\sum_{a}\pi(s,a;{\bm{\theta}})\int_{\mathcal{S}}\text{P}(s^{\prime}|s,a)\gamma(s,a,s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}
=𝐠(s)+𝒮aπ(s,a;𝜽)P(s|s,a)γ(s,a,s)vπ(s)𝜽ds.\displaystyle=\mathbf{g}(s)+\int_{\mathcal{S}}\sum_{a}\pi(s,a;{\bm{\theta}})\text{P}(s^{\prime}|s,a)\gamma(s,a,s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}. (31)

For simplicity of notation we will write Equation 31 as

\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}=\mathbf{g}(s)+\int_{\mathcal{S}}\mathbf{P}_{\pi,\gamma}(s,s')\frac{\partial v_{\pi}(s')}{\partial\bm{\theta}}\,\mathrm{d}s', \qquad (32)

where $\mathbf{P}_{\pi,\gamma}(s,s')=\sum_{a}\pi(s,a;\bm{\theta})\text{P}(s'|s,a)\gamma(s,a,s')$. Again, as in the deterministic off-policy policy gradient theorem, we can write $\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}$ as an integral transform using the delta function:

\frac{\partial v_{\pi}(s)}{\partial\bm{\theta}}=\int_{\mathcal{S}}\delta(s,s')\frac{\partial v_{\pi}(s')}{\partial\bm{\theta}}\,\mathrm{d}s'. \qquad (33)

Plugging Equation 33 into the left-hand side of Equation 32, we obtain:

𝒮δ(s,s)vπ(s)𝜽ds=𝐠(s)+𝒮𝐏π,γ(s,s)vπ(s)𝜽ds\displaystyle\int_{\mathcal{S}}\delta(s,s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}=\mathbf{g}(s)+\int_{\mathcal{S}}\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}
𝒮(δ(s,s)𝐏π,γ(s,s))vπ(s)𝜽ds=𝐠(s)\displaystyle\implies\int_{\mathcal{S}}\Big{(}\delta(s,s^{\prime})-\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})\Big{)}\frac{\partial v_{\pi}(s^{\prime})}{\partial\bm{\mathbf{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}=\mathbf{g}(s)
vπ(s)𝜽=𝒮k(s,s)𝐠(s)ds.\displaystyle\implies\frac{\partial v_{\pi}(s)}{\partial\bm{\mathbf{\theta}}}=\int_{\mathcal{S}}k(s,s^{\prime})\mathbf{g}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}. (34)

Plugging this gradient from Equation 34 back into Equation 30, we obtain

Jμ(𝜽)𝜽\displaystyle\frac{\partial J_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}} =𝒮i(s)vπ(s)𝜽ds\displaystyle=\int_{\mathcal{S}}i(s)\frac{\partial v_{\pi}(s)}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s
=𝒮i(s)𝒮k(s,s)𝐠(s)dsds\displaystyle=\int_{\mathcal{S}}i(s)\int_{\mathcal{S}}k(s,s^{\prime})\mathbf{g}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}\mathop{}\!\mathrm{d}s (35)
=𝒮𝒮k(s,s)i(s)ds 𝐠(s)ds\displaystyle=\int_{\mathcal{S}}\int_{\mathcal{S}}k(s,s^{\prime})i(s)\mathop{}\!\mathrm{d}s\text{ }\mathbf{g}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime} (36)
=𝒮m(s)𝐠(s)ds\displaystyle=\int_{\mathcal{S}}m(s^{\prime})\mathbf{g}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}
=𝒮m(s)𝐠(s)ds,\displaystyle=\int_{\mathcal{S}}m(s)\mathbf{g}(s)\mathop{}\!\mathrm{d}s,

where in Equation 35 we used Fubini’s theorem to switch the order of integration, and in Equation 36 we use the non-recursive version of emphasis as shown in Equation 29.

 

Appendix C Standard Approaches for Sampling the Gradient from a State

In this section we review the standard approach to sampling the gradient from a given state, which is used in both ACE and OffPAC. This section reiterates known results, but is included here for completeness.

C.1 Sampling the Gradient for a Given State

For a given state, we can use the same strategies to sample the gradient as OffPAC, and many other actor-critic algorithms. We can use the log-likelihood approach, and incorporate baselines. For continuous actions, we can also use other strategies like the reparameterization trick. For concreteness, we explicitly explain how we sample this gradient here, using a value baseline and the log-likelihood approach, but emphasize that this can be replaced with other choices.

Under finite states, the simplest approach to compute the gradient for a given state is to use what is sometimes called the all-actions gradient

\sum_{a}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a).

If there are many actions or if the action space is continuous, this gradient is not an appropriate choice. Instead, we can obtain a sample estimate of this gradient using the log-likelihood trick:

\sum_{a}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a)=\sum_{a}\pi(a|s;\bm{\theta})\frac{\partial\log\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a),

which follows from the fact that the derivative of $\log y$ is $\frac{1}{y}$. Above we called this log-likelihood form $g(s,\bm{\theta})$. Now we can use one action sampled according to $\pi$ to obtain an unbiased estimate of the all-actions gradient. In the off-policy setting, however, we do not use $\pi$ to select actions. So, additionally, we can incorporate importance sampling ratios $\rho(s,a;\bm{\theta})\overset{\text{def}}{=}\frac{\pi(a|s;\bm{\theta})}{\mu(a|s)}$ to adjust for the fact that the actions are taken according to the behaviour policy $\mu$:

\sum_{a}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a)=\sum_{a}\mu(a|s)\frac{1}{\mu(a|s)}\pi(a|s;\bm{\theta})\frac{\partial\log\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a)
=\sum_{a}\mu(a|s)\rho(s,a;\bm{\theta})\frac{\partial\log\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a).

We can obtain an unbiased sample of the gradient by sampling an action $A\sim\mu(\cdot|s)$ and using $g(A)\overset{\text{def}}{=}\rho(s,A;\bm{\theta})\frac{\partial\log\pi(A|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,A)$.

The estimated gradient has more variance when we only use a sampled action rather than all actions. The standard strategy to reduce this variance is to subtract a baseline. The baseline is a function of state, such as an estimate of the value function v(s)v(s), with modified update

\rho(s,a;\bm{\theta})\frac{\partial\log\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}\bigl[q_{\pi}(s,a)-v(s)\bigr].

It is straightforward to show that the expected value of this update is the same, i.e. that

\sum_{a}\mu(a|s)\rho(s,a;\bm{\theta})\frac{\partial\log\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}\bigl[q_{\pi}(s,a)-v(s)\bigr]=\sum_{a}\mu(a|s)\rho(s,a;\bm{\theta})\frac{\partial\log\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}q_{\pi}(s,a).

Note that this is not the minimal-variance baseline (Dick, 2015), but it is nonetheless one of the most common choices in practice due to its efficacy and simplicity.
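The resulting per-state update can be sketched as follows for a linear-softmax policy. This is a minimal illustration, assuming the features, action-value estimate, and baseline are given; the function and variable names are our own and do not correspond to the experiment code:

```python
import numpy as np

def softmax(z):
    z = z - z.max()            # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def actor_update_direction(theta, x, a, q_sa, v_s, mu_probs):
    """One-sample estimate of sum_a (d pi / d theta) q_pi(s, a) with a value
    baseline, for a linear-softmax policy pi(.|s) = softmax(theta^T x).

    theta:    (d, num_actions) policy parameters
    x:        (d,) feature vector for the current state
    a:        index of the action taken by the behaviour policy
    q_sa:     estimate of q_pi(s, a)
    v_s:      baseline v(s)
    mu_probs: (num_actions,) behaviour-policy probabilities in s
    """
    pi = softmax(theta.T @ x)
    rho = pi[a] / mu_probs[a]              # importance sampling ratio
    # Gradient of log pi(a|s) for the linear-softmax parameterization.
    grad_log_pi = np.outer(x, -pi)
    grad_log_pi[:, a] += x
    return rho * grad_log_pi * (q_sa - v_s)
```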

C.2 Estimating the Advantage

In practice, it is difficult to obtain $q_{\pi}(s,a)$ to compute this gradient exactly. Instead, we obtain a (likely biased) sample of the advantage, $\delta_{t}\approx q_{\pi}(S_{t},A_{t})-v(S_{t})$. An unbiased but high-variance estimate uses a sample of the return $G_{t}$ from $(S_{t},A_{t})$, and sets $\delta_{t}=G_{t}-v(S_{t})$. Then $\mathbb{E}[\delta_{t}|S_{t}=s,A_{t}=a]=q_{\pi}(s,a)-v(s)$. On the other extreme, we can directly approximate the action-values, $\hat{q}(s,a)$. In-between, we can use $n$-step returns, or $\lambda$-returns, to balance between using observed rewards and value estimates.

The simplest choice is to use a 1-step return: $r(s,a)+\gamma(s,a,S_{t+1})v(S_{t+1})$ is used as an approximation of $q_{\pi}(s,a)$. The value function $v$ is typically called the critic, and plays both the role of estimating the return and that of a baseline. The advantage is set to $\delta_{t}=R_{t+1}+\gamma_{t+1}v(S_{t+1})-v(S_{t})$, which is simply the TD error.

More generally, however, we can also use multi-step returns. For an $n$-step return, with data sampled on-policy, and with $\gamma_{t+1:t+n}\overset{\text{def}}{=}\prod_{j=1}^{n}\gamma_{t+j}$ as short-hand for the product of discounts, we would use

\delta_{t}=R_{t+1}+\gamma_{t+1}R_{t+2}+\ldots+\gamma_{t+1:t+n-1}R_{t+n}+\gamma_{t+1:t+n}v(S_{t+n})-v(S_{t}).

With off-policy data, we would need to incorporate importance sampling ratios

\delta_{t}=R_{t+1}+\rho_{t+1}\gamma_{t+1}R_{t+2}+\ldots+\rho_{t+1:t+n-1}\gamma_{t+1:t+n-1}R_{t+n}+\rho_{t+1:t+n}\gamma_{t+1:t+n}v(S_{t+n})-v(S_{t}),

where $\rho_{t+1:t+n}\overset{\text{def}}{=}\prod_{j=1}^{n}\rho_{t+j}$. For sufficiently large $n$, we recover the sampled return and so obtain an unbiased, but likely high-variance, estimate of the advantage. An intermediate $n$ likely provides a reasonable balance between bias and variance, between $n=0$ with a direct estimate of $q_{\pi}(s,a)$ and large $n$ with an unbiased sample of the return.

We can also average across multiple $n$, and obtain $\lambda$-returns (Sutton and Barto, 2018). For larger $\lambda$, more weight is put on the $n$-step returns with larger $n$, and for $\lambda=1$ we once again recover an unbiased sample of the return. Such an approach was used in OffPAC, which requires storing traces of gradients of policy parameters (see Algorithm 1 in Degris et al., 2012b). Note that this $\lambda$-return is used for the policy update. The update to the value estimate (critic) itself can use a different $\lambda$, as it stores a separate trace for the gradient of the value function. The update with eligibility traces is only sound under linear function approximation for the values and linear function approximation for the action-preferences. In this work, therefore, we opt for the simplest choice: $n=1$ with $\delta_{t}=R_{t+1}+\gamma_{t+1}v(S_{t+1})-v(S_{t})$.
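A minimal Python sketch of this importance-sampled $n$-step advantage, following the expression above (the array layout is an illustrative assumption; with $n=1$ and a ratio of one it reduces to the TD error used in this work):

```python
def n_step_advantage(rewards, gammas, rhos, v_bootstrap, v_s):
    """Importance-sampled n-step advantage estimate for one transition.

    rewards:     [R_{t+1}, ..., R_{t+n}]
    gammas:      [gamma_{t+1}, ..., gamma_{t+n}]
    rhos:        [rho_{t+1}, ..., rho_{t+n}] importance sampling ratios
    v_bootstrap: v(S_{t+n})
    v_s:         v(S_t)
    """
    delta = 0.0
    rho_prod, gamma_prod = 1.0, 1.0
    for reward, gamma, rho in zip(rewards, gammas, rhos):
        delta += rho_prod * gamma_prod * reward  # rho_{t+1:t+k-1} gamma_{t+1:t+k-1} R_{t+k}
        rho_prod *= rho
        gamma_prod *= gamma
    return delta + rho_prod * gamma_prod * v_bootstrap - v_s

# 1-step case: delta_t = R_{t+1} + gamma_{t+1} v(S_{t+1}) - v(S_t) when rho = 1.
print(n_step_advantage([1.0], [0.99], [1.0], v_bootstrap=0.5, v_s=0.3))
```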

Finally, we need to estimate the values themselves. Any value approximation algorithm can be used to estimate $v$, such as TD, gradient TD, or even emphatic TD. Given that we already compute the emphasis weighting, it is straightforward to use emphatic TD. We investigate both a gradient TD critic and an emphatic TD critic in the experiments.
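For concreteness, the following is a minimal sketch of a one-step emphatic TD critic with linear values, based on the standard ETD(0) update; the variable names are our own illustrative choices and details of the critic used in the experiments are omitted:

```python
import numpy as np

def etd0_critic_update(w, F_prev, x, x_next, reward, gamma, gamma_next,
                       rho, rho_prev, interest, alpha):
    """One step of an ETD(0)-style linear critic (a sketch, not the exact
    implementation used in the experiments).

    w:        value weights, so v(s) is approximated by w @ x(s)
    F_prev:   follow-on trace from the previous step
    gamma:    discount entering the current state; gamma_next: leaving it
    rho_prev: importance sampling ratio from the previous step
    interest: i(S_t)
    """
    F = rho_prev * gamma * F_prev + interest          # follow-on trace
    delta = reward + gamma_next * w @ x_next - w @ x  # TD error
    w = w + alpha * rho * F * delta * x               # emphasis-weighted TD(0) update
    return w, F
```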

Appendix D Proof of Theorem 8

D.1 Entropy-regularized OPPG theorem with discrete states and actions

Proof  The proof for this theorem is mostly similar to the one for Theorem 2. First, since the interest function does not depend on the parameters,

\frac{\partial\tilde{J}_{\mu}(\bm{\theta})}{\partial\bm{\theta}}=\frac{\partial\sum_{s\in\mathcal{S}}i(s)\tilde{v}_{\pi}(s)}{\partial\bm{\theta}}=\sum_{s\in\mathcal{S}}i(s)\frac{\partial\tilde{v}_{\pi}(s)}{\partial\bm{\theta}}.

Recall that $\tilde{v}_{\pi}(s)=\sum_{a}\pi(a|s;\bm{\theta})\tilde{q}_{\pi}(s,a)+\tau\mathcal{H}(\pi(\cdot|s;\bm{\theta}))$. We define the following notation, which helps with the recursive expression of the gradient:

\tilde{\mathbf{g}}(s)=\sum_{a}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}\tilde{q}_{\pi}(s,a)+\tau\frac{\partial}{\partial\bm{\theta}}\mathcal{H}(\pi(\cdot|s;\bm{\theta})),

where $\tilde{\mathbf{g}}:\mathcal{S}\rightarrow\mathbb{R}^{d}$. The gradient of the entropy-regularized value function is then

v~π(s)𝜽\displaystyle\frac{\partial\tilde{v}_{\pi}(s)}{\partial{\bm{\theta}}} =𝜽[aπ(a|s;𝜽)q~π(s,a)+τ(π(|s;𝜽))]\displaystyle=\frac{\partial}{\partial{\bm{\theta}}}\left[\sum_{a}\pi(a|s;{\bm{\theta}})\tilde{q}_{\pi}(s,a)+\tau\mathcal{H}(\pi(\cdot|s;{\bm{\theta}}))\right]
=\displaystyle= aπ(a|s;𝜽)𝜽q~π(s,a)+aπ(a|s;𝜽)q~π(s,a)𝜽+τ𝜽(π(|s;𝜽))\displaystyle\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)+\sum_{a}\pi(a|s;{\bm{\theta}})\frac{\partial\tilde{q}_{\pi}(s,a)}{\partial{\bm{\theta}}}+\tau\frac{\partial}{\partial{\bm{\theta}}}\mathcal{H}(\pi(\cdot|s;{\bm{\theta}})) (37)
=\displaystyle= 𝐠~(s)+aπ(a|s;𝜽)sp(s|s,a)(r(s,a,s)+γ(s,a,s)v~π(s))𝜽\displaystyle\tilde{\mathbf{g}}(s)+\sum_{a}\pi(a|s;{\bm{\theta}})\frac{\partial\sum_{s^{\prime}}p(s^{\prime}|s,a)(r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})\tilde{v}_{\pi}(s^{\prime}))}{\partial{\bm{\theta}}}
=\displaystyle= 𝐠~(s)+aπ(a|s;𝜽)sp(s|s,a)γ(s,a,s)v~π(s)𝜽.\displaystyle\tilde{\mathbf{g}}(s)+\sum_{a}\pi(a|s;{\bm{\theta}})\sum_{s^{\prime}}p(s^{\prime}|s,a)\gamma(s,a,s^{\prime})\frac{\partial\tilde{v}_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}.

So the gradient of the regularized value function can be written in vector form, similar to the unregularized version. Let $\dot{\tilde{\mathbf{v}}}_{\pi}\in\mathbb{R}^{|\mathcal{S}|\times d}$ be the matrix of gradients of $\tilde{v}_{\pi}$ for each state $s$, and $\mathbf{\tilde{G}}\in\mathbb{R}^{|\mathcal{S}|\times d}$ the matrix whose row for state $s$ is $\tilde{\mathbf{g}}(s)^{\top}$. Then

𝐯~˙π\displaystyle{\dot{\tilde{\mathbf{v}}}}_{\pi} =𝐆~+𝐏π,γ𝐯~˙π\displaystyle=\mathbf{\tilde{G}}+\mathbf{P}_{\!\pi,\gamma}{\dot{\tilde{\mathbf{v}}}}_{\pi} (38)
𝐯~˙π\displaystyle\implies{\dot{\tilde{\mathbf{v}}}}_{\pi} =(𝐈𝐏π,γ)1𝐆~.\displaystyle=(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma})^{-1}\mathbf{\tilde{G}}.

Therefore, we obtain

s𝒮i(s)v~π(s)𝜽\displaystyle\sum_{s\in\mathcal{S}}i(s)\frac{\partial\tilde{v}_{\pi}(s)}{\partial{\bm{\theta}}} =𝐢𝐯~˙π=𝐢(𝐈𝐏π,γ)1𝐆~=𝐦𝐆~\displaystyle=\mathbf{i}^{\top}{\dot{\tilde{\mathbf{v}}}}_{\pi}=\mathbf{i}^{\top}(\mathbf{I}-\mathbf{P}_{\!\pi,\gamma})^{-1}\mathbf{\tilde{G}}=\mathbf{m}^{\top}\mathbf{\tilde{G}}
=s𝒮m(s)[aπ(a|s;𝜽)𝜽q~π(s,a)+τ𝜽(π(|s;𝜽))].\displaystyle=\sum_{s\in\mathcal{S}}m(s)\left[\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)+\tau\frac{\partial}{\partial{\bm{\theta}}}\mathcal{H}(\pi(\cdot|s;{\bm{\theta}}))\right].

We can further simplify this gradient by explicitly computing the gradient of the negative of the entropy:

-(π(|s;𝜽))𝜽\displaystyle\frac{\partial\text{-}\mathcal{H}(\pi(\cdot|s;{\bm{\theta}}))}{\partial{\bm{\theta}}} =aπ(a|s;𝜽)𝜽logπ(a|s;𝜽)+aπ(a|s;𝜽)logπ(a|s;𝜽)𝜽\displaystyle=\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\log\pi(a|s;{\bm{\theta}})+\sum_{a}\pi(a|s;{\bm{\theta}})\frac{\partial\log\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}
=aπ(a|s;𝜽)𝜽logπ(a|s;𝜽)+aπ(a|s;𝜽)𝜽\displaystyle=\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\log\pi(a|s;{\bm{\theta}})+\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}
=aπ(a|s;𝜽)𝜽logπ(a|s;𝜽),\displaystyle=\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\log\pi(a|s;{\bm{\theta}}),

where the last line follows from the fact that $\sum_{a}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}=\frac{\partial}{\partial\bm{\theta}}\sum_{a}\pi(a|s;\bm{\theta})=\frac{\partial}{\partial\bm{\theta}}1=\mathbf{0}$.

This form looks a bit different from the (on-policy) policy gradient update given by Geist et al. (2019, Theorem 6). We can obtain a similar form by rewriting the inner term above. For a specific $s$, using the chain rule on the entropy we obtain

aπ(a|s;𝜽)𝜽q~π(s,a)+τ𝜽(π(|s;𝜽))\displaystyle\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)+\tau\frac{\partial}{\partial{\bm{\theta}}}\mathcal{H}(\pi(\cdot|s;{\bm{\theta}})) =aπ(a|s;𝜽)𝜽q~π(s,a)+τπ(a|s;𝜽)𝜽(π(|s;𝜽))π(a|s;𝜽)\displaystyle=\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)+\tau\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\frac{\partial\mathcal{H}(\pi(\cdot|s;{\bm{\theta}}))}{\partial\pi(a|s;{\bm{\theta}})}
=aπ(a|s;𝜽)𝜽[q~π(s,a)+τ(π(|s;𝜽))π(a|s;𝜽)].\displaystyle=\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\left[\tilde{q}_{\pi}(s,a)+\tau\frac{\partial\mathcal{H}(\pi(\cdot|s;{\bm{\theta}}))}{\partial\pi(a|s;{\bm{\theta}})}\right].
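In the finite-state case, the emphasis vector $\mathbf{m}^{\top}=\mathbf{i}^{\top}(\mathbf{I}-\mathbf{P}_{\pi,\gamma})^{-1}$ used above can be computed exactly, which is convenient for checking the estimators in small MDPs such as the counterexample. A minimal Python sketch, assuming the tabular matrix $\mathbf{P}_{\pi,\gamma}$ and the state weighting vector are available:

```python
import numpy as np

def emphatic_weightings(P_pi_gamma, weighting):
    """Solve m^T = i^T (I - P_{pi,gamma})^{-1} for a finite MDP.

    P_pi_gamma: (n, n) matrix with entries sum_a pi(a|s) P(s'|s,a) gamma(s,a,s')
    weighting:  (n,) vector i(s) (or d_mu(s) i(s) for the weighted excursions objective)
    """
    n = P_pi_gamma.shape[0]
    # Solve the linear system (I - P)^T m = i rather than forming the inverse.
    return np.linalg.solve((np.eye(n) - P_pi_gamma).T, weighting)
```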

D.2 Regularized OPPG theorem with continuous states and actions

Theorem 11
\frac{\partial\tilde{J}_{\mu}(\bm{\theta})}{\partial\bm{\theta}}=\int_{\mathcal{S}}m(s)\left[\int_{\mathcal{A}}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}\tilde{q}_{\pi}(s,a)\,\mathrm{d}a+\tau\frac{\partial}{\partial\bm{\theta}}\Omega(\pi(\cdot|s;\bm{\theta}))\right]\mathrm{d}s

Proof  We extend the result from the previous section to the continuous state and action setting, using a strategy similar to the one we used in Appendix B to extend Theorem 2 to the continuous-state setting. In addition, we opt for a generic regularizer $\Omega(\pi(\cdot|s;\bm{\theta})):\Delta(\mathcal{A})\to\mathbb{R}$, which slightly modifies the definition of the regularized value functions to be

v~π(s)\displaystyle\tilde{v}_{\pi}(s) =𝒜π(a|s;𝜽)q~π(s,a)da+τΩ(π(|s;𝜽))sSS\displaystyle=\int_{\cal A}\pi(a|s;{\bm{\theta}})\tilde{q}_{\pi}(s,a)\mathop{}\!\mathrm{d}a+\tau\Omega(\pi(\cdot|s;{\bm{\theta}}))\qquad\forall s\in\SS
q~π(s,a)\displaystyle\tilde{q}_{\pi}(s,a) =SSP(s|s,a)[r(s,a,s)+γ(s,a,s)v~π(s)]dssSS,a𝒜.\displaystyle=\int_{\SS}P(s^{\prime}|s,a)\big{[}r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})\tilde{v}_{\pi}(s^{\prime})\big{]}\mathop{}\!\mathrm{d}s^{\prime}\qquad\forall s\in\SS,\forall a\in{\cal A}.

We use the following notation to simplify the derivation:

\tilde{\mathbf{g}}(s)=\int_{\mathcal{A}}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}\tilde{q}_{\pi}(s,a)\,\mathrm{d}a+\tau\frac{\partial}{\partial\bm{\theta}}\Omega(\pi(\cdot|s;\bm{\theta})).

Starting with the continuous objective function from equation (6) and using the Leibniz integral rule to differentiate under the integral (which applies because we assumed the state space is a compact set in Assumption 2 from Appendix A.1):

\frac{\partial J_{\mu}(\bm{\theta})}{\partial\bm{\theta}}=\frac{\partial}{\partial\bm{\theta}}\int_{\mathcal{S}}i(s)\tilde{v}_{\pi}(s)\,\mathrm{d}s=\int_{\mathcal{S}}i(s)\frac{\partial\tilde{v}_{\pi}(s)}{\partial\bm{\theta}}\,\mathrm{d}s. \qquad (39)

Next, we find a recursive expression for the gradient of the regularized value function:

v~π(s)𝜽\displaystyle\frac{\partial\tilde{v}_{\pi}(s)}{\partial{\bm{\theta}}} =𝜽𝒜π(a|s;𝜽)q~π(s,a)da+τΩ(π(|s;𝜽))\displaystyle=\frac{\partial}{\partial{\bm{\theta}}}\int_{\cal A}\pi(a|s;{\bm{\theta}})\tilde{q}_{\pi}(s,a)\mathop{}\!\mathrm{d}a+\tau\Omega(\pi(\cdot|s;{\bm{\theta}}))
=𝒜π(a|s;𝜽)𝜽q~π(s,a)da+𝒜π(a|s;𝜽)q~π(s,a)𝜽da+τ𝜽Ω(π(|s;𝜽))\displaystyle=\int_{\cal A}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)\mathop{}\!\mathrm{d}a+\int_{\cal A}\pi(a|s;{\bm{\theta}})\frac{\partial\tilde{q}_{\pi}(s,a)}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}a+\tau\frac{\partial}{\partial{\bm{\theta}}}\Omega(\pi(\cdot|s;{\bm{\theta}}))
=𝐠~(s)+𝒜π(a|s;𝜽)q~π(s,a)𝜽da\displaystyle=\tilde{\mathbf{g}}(s)+\int_{\cal A}\pi(a|s;{\bm{\theta}})\frac{\partial\tilde{q}_{\pi}(s,a)}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}a
=𝐠~(s)+𝒜π(a|s;𝜽)𝜽SSP(s|s,a)[r(s,a,s)+γ(s,a,s)v~π(s)]dsda\displaystyle=\tilde{\mathbf{g}}(s)+\int_{\cal A}\pi(a|s;{\bm{\theta}})\frac{\partial}{\partial{\bm{\theta}}}\int_{\SS}P(s^{\prime}|s,a)\big{[}r(s,a,s^{\prime})+\gamma(s,a,s^{\prime})\tilde{v}_{\pi}(s^{\prime})\big{]}\mathop{}\!\mathrm{d}s^{\prime}\mathop{}\!\mathrm{d}a
=𝐠~(s)+𝒜π(a|s;𝜽)SSP(s|s,a)γ(s,a,s)v~π(s)𝜽dsda\displaystyle=\tilde{\mathbf{g}}(s)+\int_{\cal A}\pi(a|s;{\bm{\theta}})\int_{\SS}P(s^{\prime}|s,a)\gamma(s,a,s^{\prime})\frac{\partial\tilde{v}_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}\mathop{}\!\mathrm{d}a
=𝐠~(s)+SS𝒜π(a|s;𝜽)P(s|s,a)γ(s,a,s)dav~π(s)𝜽ds,\displaystyle=\tilde{\mathbf{g}}(s)+\int_{\SS}\int_{\cal A}\pi(a|s;{\bm{\theta}})P(s^{\prime}|s,a)\gamma(s,a,s^{\prime})\mathop{}\!\mathrm{d}a\frac{\partial\tilde{v}_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime},

where in the last step we used Fubini's Theorem to switch the order of integrals (which we can do because the absolute value of the integral is finite due to Assumptions 1 and 2). Again we will use the shorthand $\mathbf{P}_{\pi,\gamma}(s,s')=\int_{\mathcal{A}}\pi(a|s;\bm{\theta})\text{P}(s'|s,a)\gamma(s,a,s')\,\mathrm{d}a$ to simplify the derivation. Next we write $\frac{\partial\tilde{v}_{\pi}(s)}{\partial\bm{\theta}}$ as the integral transform

\frac{\partial\tilde{v}_{\pi}(s)}{\partial\bm{\theta}}=\int_{\mathcal{S}}\delta(s,s')\frac{\partial\tilde{v}_{\pi}(s')}{\partial\bm{\theta}}\,\mathrm{d}s',

plug it into the previous equation, and solve to get

SSδ(s,s)v~π(s)𝜽ds\displaystyle\int_{\SS}\delta(s,s^{\prime})\frac{\partial\tilde{v}_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime} =𝐠~(s)+SS𝐏π,γ(s,s)v~π(s)𝜽ds\displaystyle=\tilde{\mathbf{g}}(s)+\int_{\SS}\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})\frac{\partial\tilde{v}_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime}
SS(δ(s,s)𝐏π,γ(s,s))v~π(s)𝜽ds\displaystyle\int_{\SS}\left(\delta(s,s^{\prime})-\mathbf{P}_{\!\pi,\gamma}(s,s^{\prime})\right)\frac{\partial\tilde{v}_{\pi}(s^{\prime})}{\partial{\bm{\theta}}}\mathop{}\!\mathrm{d}s^{\prime} =𝐠~(s)\displaystyle=\tilde{\mathbf{g}}(s)
v~π(s)𝜽\displaystyle\implies\frac{\partial\tilde{v}_{\pi}(s)}{\partial{\bm{\theta}}} =SSk(s,s)𝐠~(s)ds,\displaystyle=\int_{\SS}k(s,s^{\prime})\tilde{\mathbf{g}}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime},

where $k(s,s')$ is the inverse kernel of $\delta(s,s')-\mathbf{P}_{\pi,\gamma}(s,s')$. Then plugging the above into Equation (39), interchanging integrals using Fubini's theorem, and simplifying gives

Jμ(𝜽)𝜽\displaystyle\frac{\partial J_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}} =SSi(s)SSk(s,s)𝐠~(s)dsds\displaystyle=\int_{\SS}i(s)\int_{\SS}k(s,s^{\prime})\tilde{\mathbf{g}}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}\mathop{}\!\mathrm{d}s
=SSSSi(s)k(s,s)ds𝐠~(s)ds\displaystyle=\int_{\SS}\int_{\SS}i(s)k(s,s^{\prime})\mathop{}\!\mathrm{d}s\tilde{\mathbf{g}}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}
=SSm(s)𝐠~(s)ds\displaystyle=\int_{\SS}m(s^{\prime})\tilde{\mathbf{g}}(s^{\prime})\mathop{}\!\mathrm{d}s^{\prime}
=SSm(s)𝐠~(s)ds\displaystyle=\int_{\SS}m(s)\tilde{\mathbf{g}}(s)\mathop{}\!\mathrm{d}s
=SSm(s)[𝒜π(a|s;𝜽)𝜽q~π(s,a)da+τ𝜽Ω(π(|s;𝜽))]ds.\displaystyle=\int_{\SS}m(s)\left[\int_{{\cal A}}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}}\tilde{q}_{\pi}(s,a)\mathop{}\!\mathrm{d}a+\tau\frac{\partial}{\partial{\bm{\theta}}}\Omega(\pi(\cdot|s;{\bm{\theta}}))\right]\mathop{}\!\mathrm{d}s.

 

Appendix E Proof of Proposition 10

This section proves Proposition 10. First, a result by Mei et al. (2020) for tabular domains is extended to state aggregation and transition-dependent discounting (which includes the three-state counterexample) to show that entropy-regularized policy gradient converges to a point where the gradient is zero. Then, it is shown that such a point is not a stationary point for semi-gradient updates.

E.1 Entropy-regularized PG under State Aggregation

We assume the policy is a softmax policy, and we specifically characterize the gradient under state aggregation. This characterization facilitates showing that the solution to the objective lies on the interior of the simplex, and hence that a stationary point exists.

We define $\operatorname{alias}(s)$ as the set of states that share their representation with $s$, including $s$ itself. We additionally define $\mathcal{S}_{\text{rep}}\subset\mathcal{S}$ as the set of representative states, one for each bin in the aggregation. For example, in the three-state counterexample $\operatorname{alias}(s_{0})=\{s_{0}\}$ and $\operatorname{alias}(s_{1})=\operatorname{alias}(s_{2})=\{s_{1},s_{2}\}$. We simply need to choose one state in each aliased set, giving $\mathcal{S}_{\text{rep}}=\{s_{0},s_{1}\}$.

For a parameter set $\bm{\theta}$, the policy is a softmax transform defined as follows. For a state $s'$ with representative state $s$, i.e., $s'\in\operatorname{alias}(s)$, we have

\pi(s',a;\bm{\theta})=\frac{\exp\bigl(\bm{\theta}(s,a)\bigr)}{\sum_{a'}\exp\bigl(\bm{\theta}(s,a')\bigr)}. \qquad (40)

Using again the three-state counterexample, $\bm{\theta}$ has four components: $\bm{\theta}(s_{0},a_{0})$ and $\bm{\theta}(s_{0},a_{1})$ specify the policy for $s_{0}$, and $\bm{\theta}(s_{1},a_{0})$ and $\bm{\theta}(s_{1},a_{1})$ specify the policy for both $s_{1}$ and $s_{2}$, since these two states are aliased.
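A minimal sketch of this aliased softmax parameterization in Python (the dictionary representation of $\bm{\theta}$ and the alias map are illustrative choices):

```python
import numpy as np

def softmax_policy(theta, rep_state):
    """pi(.|s') for every state s' aliased to the representative state rep_state."""
    prefs = theta[rep_state]
    prefs = prefs - prefs.max()   # subtract the max for numerical stability
    e = np.exp(prefs)
    return e / e.sum()

# Three-state counterexample: s1 and s2 share the representative state s1.
theta = {"s0": np.array([0.5, -0.5]), "s1": np.array([0.0, 0.0])}
alias = {"s0": "s0", "s1": "s1", "s2": "s1"}
print(softmax_policy(theta, alias["s2"]))  # identical to the policy in s1
```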

The softmax policy has a simple well-known gradient, which we can explicitly write for state-aggregation in the following lemma for easy reference in later proofs.

Lemma 12

Assume the policy uses a softmax distribution, as in Equation 40. For any state $s\in\mathcal{S}_{\text{rep}}$ and any $s'\in\operatorname{alias}(s)$,

\frac{\partial\pi(s',a';\bm{\theta})}{\partial\bm{\theta}(s,a)}=\begin{cases}\pi(s',a;\bm{\theta})\bigl[1-\pi(s',a;\bm{\theta})\bigr]&\text{if }a'=a\\-\pi(s',a;\bm{\theta})\pi(s',a';\bm{\theta})&\text{if }a'\neq a\end{cases} \qquad (41)

and, across all actions, this can be compactly written as

\frac{\partial\pi(s',\cdot\,;\bm{\theta})}{\partial\bm{\theta}(s,\cdot)}=\operatorname{diag}\bigl(\pi(s,\cdot\,;\bm{\theta})\bigr)-\pi(s,\cdot\,;\bm{\theta})\pi(s,\cdot\,;\bm{\theta})^{\top}.

Proof  To obtain this derivative explicitly, notice first that

\log\pi(s',a;\bm{\theta})=\bm{\theta}(s,a)-\log\sum_{a'}\exp\bigl(\bm{\theta}(s,a')\bigr)

and so

π(s,a;𝜽)𝜽(s,a)\displaystyle\frac{\partial\pi(s^{\prime},a^{\prime};{\bm{\theta}})}{\partial{\bm{\theta}}(s,a)} =π(s,a;𝜽)logπ(s,a;𝜽)𝜽(s,a)\displaystyle=\pi(s^{\prime},a^{\prime};{\bm{\theta}})\frac{\partial\log\pi(s^{\prime},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s,a)}
=π(s,a;𝜽)[𝜽(s,a)loga′′exp(𝜽(s,a′′))]𝜽(s,a)\displaystyle=\pi(s^{\prime},a^{\prime};{\bm{\theta}})\frac{\partial[{\bm{\theta}}(s,a^{\prime})-\log\sum_{a^{\prime\prime}}\exp({\bm{\theta}}(s,a^{\prime\prime}))]}{\partial{\bm{\theta}}(s,a)}
=π(s,a;𝜽)[𝜽(s,a)𝜽(s,a)loga′′exp(𝜽(s,a′′))𝜽(s,a)].\displaystyle=\pi(s^{\prime},a^{\prime};{\bm{\theta}})\left[\frac{\partial{\bm{\theta}}(s,a^{\prime})}{\partial{\bm{\theta}}(s,a)}-\frac{\partial\log\sum_{a^{\prime\prime}}\exp({\bm{\theta}}(s,a^{\prime\prime}))}{\partial{\bm{\theta}}(s,a)}\right].

If $a'=a$, then

𝜽(s,a)𝜽(s,a)loga′′exp(𝜽(s,a′′))𝜽(s,a)\displaystyle\frac{\partial{\bm{\theta}}(s,a^{\prime})}{\partial{\bm{\theta}}(s,a)}-\frac{\partial\log\sum_{a^{\prime\prime}}\exp({\bm{\theta}}(s,a^{\prime\prime}))}{\partial{\bm{\theta}}(s,a)} =11a′′exp(𝜽(s,a′′)a′′exp(𝜽(s,a′′))𝜽(s,a)\displaystyle=1-\frac{1}{\sum_{a^{\prime\prime}}\exp({\bm{\theta}}(s,a^{\prime\prime})}\frac{\partial\sum_{a^{\prime\prime}}\exp({\bm{\theta}}(s,a^{\prime\prime}))}{\partial{\bm{\theta}}(s,a)}
=11a′′exp(𝜽(s,a′′))exp(𝜽(s,a)\displaystyle=1-\frac{1}{\sum_{a^{\prime\prime}}\exp({\bm{\theta}}(s,a^{\prime\prime}))}\exp({\bm{\theta}}(s,a)
=1π(a|s;𝜽)\displaystyle=1-\pi(a|s;{\bm{\theta}})
=1π(s,a;𝜽),\displaystyle=1-\pi(s^{\prime},a;{\bm{\theta}}),

where the last step follows from the fact that $\pi(s',a;\bm{\theta})=\pi(a|s;\bm{\theta})$ under state aggregation. If $a'\neq a$, then

\frac{\partial\bm{\theta}(s,a')}{\partial\bm{\theta}(s,a)}-\frac{\partial\log\sum_{a''}\exp\bigl(\bm{\theta}(s,a'')\bigr)}{\partial\bm{\theta}(s,a)}=0-\pi(a|s;\bm{\theta})=-\pi(s',a;\bm{\theta}).

 

The gradient ascent update using the entropy-regularized objective is

\bm{\theta}_{t+1}=\bm{\theta}_{t}+\eta\frac{\partial\tilde{J}_{\mu}(\bm{\theta}_{t})}{\partial\bm{\theta}} \qquad (42)

for stepsize $\eta>0$. For an appropriately small stepsize $\eta$, this will converge to a stationary point if one exists, and otherwise move to a point on the boundary of the simplex, where $\bm{\theta}_{t}(s,a)\rightarrow\infty$ for an action $a$ in state $s$. The standard condition on $\eta$ is that $\eta<1/L$ for the Lipschitz constant $L$ of $\tilde{J}_{\mu}$, which is a common assumption for policy gradient methods. In the next section, we show that this update converges to a stationary point on the interior of the policy simplex.
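A small Python sketch of this update for the aliased softmax policy, computing the gradient component-wise using the emphasis-weighted advantage form derived in Lemma 13 below. All quantities are assumed to be given as tabular dictionaries, and the sketch reuses softmax_policy from the sketch above; it is illustrative, not the implementation used in the experiments:

```python
import numpy as np

def grad_J_tilde(theta, m, q_tilde, v_tilde, alias_sets, softmax_policy):
    """Gradient of the entropy-regularized objective under state aggregation:
    dJ~/dtheta(s, a) = sum_{s' in alias(s)} m(s') pi(s', a) [q~(s', a) - v~(s')].

    theta:      dict mapping representative state -> preference vector
    m:          dict mapping state -> emphatic weighting
    q_tilde:    dict mapping state -> array of regularized action values
    v_tilde:    dict mapping state -> regularized state value
    alias_sets: dict mapping representative state -> list of aliased states
    """
    grad = {}
    for s, prefs in theta.items():
        pi = softmax_policy(theta, s)   # shared by every state aliased to s
        g = np.zeros_like(prefs)
        for s_prime in alias_sets[s]:
            g += m[s_prime] * pi * (q_tilde[s_prime] - v_tilde[s_prime])
        grad[s] = g
    return grad

def ascent_step(theta, grad, eta):
    """One step of Equation 42: theta <- theta + eta * dJ~/dtheta."""
    return {s: theta[s] + eta * grad[s] for s in theta}
```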

We provide one more result that is useful for this characterization. Lemma 13 breaks down the gradient into its components. Mei et al. (2020), in their Lemma 10, proved this result for entropy-regularized policy gradients, assuming tabular policies in the on-policy setting. We extend it to state aggregation, with our off-policy gradient. Their presentation has an extra term $-\tau\log\pi(a|s;\bm{\theta})$. This discrepancy is just a difference in notation: in our formulation, entropy-regularized action values are defined to contain the entropy term, so that the connection between the regularized and unregularized policy gradient is clearer.

Lemma 13

For each state $s\in\mathcal{S}_{\text{rep}}$ and action $a$,

\frac{\partial\tilde{J}_{\mu}(\bm{\theta})}{\partial\bm{\theta}(s,a)}=\sum_{s'\in\operatorname{alias}(s)}m(s')\pi(s',a;\bm{\theta})\bigl[\tilde{q}_{\pi}(s',a)-\tilde{v}_{\pi}(s')\bigr].

Proof  According to Theorem 8

\frac{\partial\tilde{J}_{\mu}(\bm{\theta})}{\partial\bm{\theta}}=\sum_{s\in\mathcal{S}}m(s)\sum_{a}\frac{\partial\pi(a|s;\bm{\theta})}{\partial\bm{\theta}}\tilde{q}_{\pi}(s,a).

Notice that $\frac{\partial\pi(s',a';\bm{\theta})}{\partial\bm{\theta}(s,a)}\neq 0$ only for states $s'\in\operatorname{alias}(s)$ (and any $a'$). We can therefore write the partial derivatives with respect to the parameters of each state and action, using Equation 41:

J~μ(𝜽)𝜽(s,a)=\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s,a)}= salias(s)m(s)aπ(s,a;𝜽)𝜽(s,a)q~π(s,a)\displaystyle\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\sum_{a^{\prime}}\frac{\partial\pi(s^{\prime},a^{\prime};{\bm{\theta}})}{\partial{\bm{\theta}}(s,a)}\tilde{q}_{\pi}(s^{\prime},a^{\prime})
=\displaystyle= salias(s)m(s)[π(s,a;𝜽)(1π(s,a;𝜽))q~π(s,a)\displaystyle\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\Bigg{[}\pi(s^{\prime},a;{\bm{\theta}})(1-\pi(s^{\prime},a;{\bm{\theta}}))\tilde{q}_{\pi}(s^{\prime},a)
aaπ(s,a;𝜽)π(s,a;𝜽)q~π(s,a)]\displaystyle\qquad-\sum_{a^{\prime}\neq a}\pi(s^{\prime},a;{\bm{\theta}})\pi(s^{\prime},a^{\prime};{\bm{\theta}})\tilde{q}_{\pi}(s^{\prime},a^{\prime})\Bigg{]}
=\displaystyle= salias(s)m(s)π(s,a;𝜽)[q~π(s,a)aπ(s,a;𝜽)q~π(s,a)]\displaystyle\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\pi(s^{\prime},a;{\bm{\theta}})\left[\tilde{q}_{\pi}(s^{\prime},a)-\sum_{a^{\prime}}\pi(s^{\prime},a^{\prime};{\bm{\theta}})\tilde{q}_{\pi}(s^{\prime},a^{\prime})\right]
=\displaystyle= salias(s)m(s)π(s,a;𝜽)[q~π(s,a)v~π(s)].\displaystyle\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\pi(s^{\prime},a;{\bm{\theta}})\left[\tilde{q}_{\pi}(s^{\prime},a)-\tilde{v}_{\pi}(s^{\prime})\right].

 

E.2 Existence of stationary points for the entropy-regularized PG objective

We are now ready to prove that a stationary point exists for $\tau>0$. Mei et al. (2020, Lemma 17) proved that, under a tabular representation and softmax parameterization, entropy-regularized policy gradient updates converge to a point where the gradient is zero. We extend their proof to state aggregation and transition-dependent discounting. We additionally avoid assuming that the rewards are non-negative.

Assumption 4 (Lipschitz continuity)

$\tilde{J}_{\mu}$ is Lipschitz continuous with Lipschitz constant $L$.

Assumption 5 (Bounded reward)

$|r(s,a)|\leq R_{max}$.

Assumption 6 (Bounded expected sum of future discounting)

There exists $C\in\mathbb{R}$ such that for any $s\in\mathcal{S}$, $\mathbb{E}_{\pi}[\gamma_{t+1}+\gamma_{t+1}\gamma_{t+2}+\ldots|S_{t}=s]\leq C$. This assumption is satisfied if $\pi$ is proper, or if $\gamma_{t+i}<1$ after a bounded number of steps $i\in\mathbb{N}$.

We first prove that the state values are bounded and show that the action-values are lower bounded by the scaled log of the probabilities. We use this to show the main result in Proposition 15.

Lemma 14

Under Assumptions 5 and 6, for any given policy parameters $\bm{\theta}$,

\tilde{v}_{\pi}(s)\leq CR_{max}+C\tau\log n_{a}
\tilde{q}_{\pi}(s,a)-\tilde{v}_{\pi}(s)\geq-\tau\log\pi(a|s;\bm{\theta})-(2C+1)R_{max}-C\tau\log n_{a}.

Proof  First notice that for any distribution $p$ over actions

\text{Entropy}(p)\overset{\text{def}}{=}-\sum_{a\in\mathcal{A}}p(a)\log p(a)\leq-\sum_{a\in\mathcal{A}}\frac{1}{n_{a}}\log\left(\frac{1}{n_{a}}\right)=\log n_{a},

where $n_{a}$ is the number of actions. The inequality follows from the fact that the uniform distribution has the highest entropy. We use this inequality to bound $\mathbb{E}_{\pi}[-\log\pi(A_{t}|S_{t};\bm{\theta})|S_{t}=s]=-\sum_{a}\pi(a|s;\bm{\theta})\log\pi(a|s;\bm{\theta})$, which is the entropy of $\pi(s,\cdot\,;\bm{\theta})$. Using this bound on the entropy, and the facts that the entropy is nonnegative and $\gamma_{t+i}\leq 1$ for all $i\in\mathbb{N}$, we obtain

v~π(s)\displaystyle\tilde{v}_{\pi}(s) =𝔼π[Gt|St=s]τ𝔼π[logπ(At|St;𝜽)+γt+1logπ(St+1,At+1;𝜽)+|St=s]\displaystyle=\mathbb{E}_{\pi}[G_{t}|S_{t}=s]-\tau\mathbb{E}_{\pi}[\log\pi(A_{t}|S_{t};{\bm{\theta}})+\gamma_{t+1}\log\pi(S_{t+1},A_{t+1};{\bm{\theta}})+\ldots|S_{t}=s]
\displaystyle\leq CRmax+τ𝔼π[logπ(At|St;𝜽)|St=s]+τ𝔼π[γt+1(logπ(St+1,At+1;𝜽))|St=s]+\displaystyle CR_{max}+\tau\mathbb{E}_{\pi}[-\log\pi(A_{t}|S_{t};{\bm{\theta}})|S_{t}=s]+\tau\mathbb{E}_{\pi}[\gamma_{t+1}(-\log\pi(S_{t+1},A_{t+1};{\bm{\theta}}))|S_{t}=s]+\ldots
\displaystyle\leq CRmax+τlogna+τlogna𝔼π[γt+1|St=s]+τlogna𝔼π[γt+2|St=s]+\displaystyle CR_{max}+\tau\log n_{a}+\tau\log n_{a}\mathbb{E}_{\pi}[\gamma_{t+1}|S_{t}=s]+\tau\log n_{a}\mathbb{E}_{\pi}[\gamma_{t+2}|S_{t}=s]+\ldots
\displaystyle\leq CRmax+Cτlogna.\displaystyle CR_{max}+C\tau\log n_{a}.

Finally, we can lower bound the advantage function, using v~π(s)𝔼π[Gt|St=s]CRmax\tilde{v}_{\pi}(s)\geq\mathbb{E}_{\pi}[G_{t}|S_{t}=s]\geq-CR_{max} and

q~π(s,a)\displaystyle\tilde{q}_{\pi}(s,a) =𝔼π[Rt+1+γt+1v~π(St+1)|St=s,At=a]τlogπ(a|s;𝜽)\displaystyle=\mathbb{E}_{\pi}[R_{t+1}+\gamma_{t+1}\tilde{v}_{\pi}(S_{t+1})|S_{t}=s,A_{t}=a]-\tau\log\pi(a|s;{\bm{\theta}})
RmaxCRmaxτlogπ(a|s;𝜽),\displaystyle\geq-R_{max}-CR_{max}-\tau\log\pi(a|s;{\bm{\theta}}),

giving

q~π(s,a)v~π(s)\displaystyle\tilde{q}_{\pi}(s,a)-\tilde{v}_{\pi}(s) RmaxCRmaxτlogπ(a|s;𝜽)v~π(s)\displaystyle\geq-R_{max}-CR_{max}-\tau\log\pi(a|s;{\bm{\theta}})-\tilde{v}_{\pi}(s)
RmaxCRmaxτlogπ(a|s;𝜽)(CRmax+Cτlogna)\displaystyle\geq-R_{max}-CR_{max}-\tau\log\pi(a|s;{\bm{\theta}})-(CR_{max}+C\tau\log n_{a})
=τlogπ(a|s;𝜽)(2C+1)RmaxCτlogna.\displaystyle=-\tau\log\pi(a|s;{\bm{\theta}})-(2C+1)R_{max}-C\tau\log n_{a}.

 
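The entropy inequality at the start of the proof is easy to check numerically. A minimal sketch (assuming NumPy is available; the number of actions and the random distributions are arbitrary illustrative choices):

```python
import numpy as np

# The entropy of any distribution over n_a actions is at most log(n_a),
# the entropy of the uniform distribution.
rng = np.random.default_rng(0)
n_a = 5
for _ in range(1000):
    p = rng.dirichlet(np.ones(n_a))        # a random distribution over n_a actions
    entropy = -np.sum(p * np.log(p))       # -sum_a p(a) log p(a)
    assert entropy <= np.log(n_a) + 1e-12
```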

Proposition 15

Under Assumptions 4, 5 and 6, the entropy-regularized policy update in Equation 42 with τ>0\tau>0 and η<1/L\eta<1/L converges to a point with zero gradient.

Proof  Let {𝜽t}t=1∞ be the trajectory of parameters under the gradient ascent update, where 𝜽t→𝜽∞. This trajectory either converges to a finite 𝜽∞ that gives a policy in the interior of the policy simplex, or it pushes the weights for a subset of actions to infinity so that the policy converges to a point on the boundary of the simplex, where certain actions have zero probability. The gradient is only zero for a solution in the interior, and so to prove the result we simply need to show that this process converges to the interior. We show this is true for every parameter 𝜽t(s,a) in the vector 𝜽t, where s∈𝒮rep and a∈𝒜.

Define 𝒜0(s)\mathcal{A}_{0}(s) and 𝒜+(s)\mathcal{A}_{+}(s) as the sets of actions with zero and nonzero probability under π(s,.;𝜽)\pi(s,.;{\bm{\theta}}_{\infty}). We use a proof by contradiction to show that 𝒜0(s)=\mathcal{A}_{0}(s)=\emptyset for any ss as long as τ>0\tau>0. Suppose there exist ss and a0a_{0} such that a0𝒜0(s)a_{0}\in\mathcal{A}_{0}(s). Note that π(s,a;𝜽)=π(s,a;𝜽)\pi(s^{\prime},a;{\bm{\theta}})=\pi(s,a;{\bm{\theta}}) for all salias(s)s^{\prime}\in\operatorname{alias}(s). We know that π(s,a0;𝜽t)0\pi(s,a_{0};{\bm{\theta}}_{t})\rightarrow 0 and logπ(s,a0;𝜽t)-\log\pi(s,a_{0};{\bm{\theta}}_{t})\rightarrow\infty as tt\rightarrow\infty. Therefore there exists t0>0t_{0}>0 such that for all salias(s)s^{\prime}\in\operatorname{alias}(s) and tt0t\geq t_{0}

logπ(s,a0;𝜽t)(2C+1)Rmax+Cτlognaτ.\displaystyle-\log\pi(s^{\prime},a_{0};{\bm{\theta}}_{t})\geq\frac{(2C+1)R_{max}+C\tau\log n_{a}}{\tau}.

According to Lemma 13, for all tt0t\geq t_{0}

\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a_{0})}=\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\pi(s^{\prime},a_{0};{\bm{\theta}}_{t})\biggl{[}\tilde{q}_{\pi}(s^{\prime},a_{0})-\tilde{v}_{\pi}(s^{\prime})\biggr{]}\ \ \ \ \ \triangleright\text{ applying Lemma \ref{lem_vals_bounded}}
\displaystyle\geq\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\pi(s^{\prime},a_{0};{\bm{\theta}}_{t})\biggl{[}-\tau\log\pi(s^{\prime},a_{0};{\bm{\theta}}_{t})-(2C+1)R_{max}-C\tau\log n_{a}\biggr{]}
\displaystyle\geq\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\pi(s^{\prime},a_{0};{\bm{\theta}}_{t})\biggl{[}\tau\frac{(2C+1)R_{max}+C\tau\log n_{a}}{\tau}-\bigl{(}(2C+1)R_{max}+C\tau\log n_{a}\bigr{)}\biggr{]}
\displaystyle=0. (43)

So 𝜽t(s,a0) is non-decreasing after t0, because the gradient ascent update only adds this non-negative gradient component to 𝜽t(s,a0). This means that 𝜽t(s,a0) is lower bounded by a constant c for all t≥t0, and hence so is 𝜽∞(s,a0):

exp(𝜽(s,a0))>ec>0.\exp({\bm{\theta}}_{\infty}(s,a_{0}))>e^{c}>0.

Next, we bound ∑a exp(𝜽t(s,a)) from above and below. Notice that the sum of the gradient components over all actions at a state is zero:

aJ~μ(𝜽t)𝜽t(s,a)\displaystyle\sum_{a}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a)} =salias(s)m(s)aπ(s,a;𝜽t)[q~π(s,a)v~π(s)]\displaystyle=\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\sum_{a}\pi(s^{\prime},a;{\bm{\theta}}_{t})[\tilde{q}_{\pi}(s^{\prime},a)-\tilde{v}_{\pi}(s^{\prime})]
=\displaystyle= salias(s)m(s)(v~π(s)v~π(s))=0.\displaystyle\sum_{s^{\prime}\in\operatorname{alias}(s)}m(s^{\prime})\bigl{(}\tilde{v}_{\pi}(s^{\prime})-\tilde{v}_{\pi}(s^{\prime})\bigr{)}=0. (44)

Because

aJ~μ(𝜽t)𝜽t(s,a)=a0𝒜0J~μ(𝜽t)𝜽t(s,a0)+a+𝒜+J~μ(𝜽t)𝜽t(s,a+)\sum_{a}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a)}=\sum_{a_{0}\in\mathcal{A}_{0}}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a_{0})}+\sum_{a_{+}\in\mathcal{A}_{+}}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a_{+})}

we get that for all t>t0t>t_{0},

a+𝒜+J~μ(𝜽t)𝜽t(s,a+)\displaystyle\sum_{a_{+}\in\mathcal{A}_{+}}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a_{+})} =aJ~μ(𝜽t)𝜽t(s,a)a0𝒜0J~μ(𝜽t)𝜽t(s,a0)\displaystyle=\sum_{a}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a)}-\sum_{a_{0}\in\mathcal{A}_{0}}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a_{0})}
=0a0𝒜0J~μ(𝜽t)𝜽t(s,a0)0,\displaystyle=0-\sum_{a_{0}\in\mathcal{A}_{0}}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}}_{t})}{\partial{\bm{\theta}}_{t}(s,a_{0})}\leq 0,

where the equality uses Equation 44 and the inequality uses Equation 43, because each summand over 𝒜0(s) is non-negative. This means that the sum of 𝜽t(s,a+) over a+∈𝒜+(s) is non-increasing after t0. For each a+∈𝒜+(s), we know that there exists some constant that lower bounds 𝜽t(s,a+) for all t; otherwise, a+ would not have non-zero probability under π(s,.;𝜽∞). Therefore, this non-increasing sum cannot be due to some 𝜽t(s,a+) approaching −∞ while others approach ∞. Instead, each 𝜽t(s,a+) must also be bounded above. This means there exists some b+>0 such that exp(−b+)<exp(𝜽t(s,a+))<exp(b+) for every a+∈𝒜+(s).

We also know that there exists b0 such that the sum of exp(𝜽t(s,a0)) over a0∈𝒜0 is less than b0 for all t>t0; otherwise, exp(𝜽t(s,a0)) would grow without bound for some a0, and that a0 would not be in 𝒜0. Putting this together with the above, and knowing there is at least one a+∈𝒜+(s),

\displaystyle\sum_{a}\exp\bigl{(}{\bm{\theta}}_{t}(s,a)\bigr{)}\leq n_{a}\exp(b_{+})+b_{0}
\displaystyle\sum_{a}\exp\bigl{(}{\bm{\theta}}_{t}(s,a)\bigr{)}\geq\exp(-b_{+})>0.

Finally, this gives

\displaystyle\pi(s,a_{0};{\bm{\theta}}_{t})=\frac{\exp\bigl{(}{\bm{\theta}}_{t}(s,a_{0})\bigr{)}}{\sum_{a}\exp\bigl{(}{\bm{\theta}}_{t}(s,a)\bigr{)}}\geq\frac{e^{c}}{n_{a}\exp(b_{+})+b_{0}}.

Taking the limit as tt\rightarrow\infty, this lower bound remains positive, implying that π(s,a0;𝜽)>0\pi(s,a_{0};{\bm{\theta}}_{\infty})>0, which is a contradiction.

Therefore, no such a0 can exist and so 𝒜0(s)=∅ for all s∈𝒮. Since the policy converges to the interior of the probability simplex and the objective function is Lipschitz continuous, the gradient has to be zero at the point of convergence (Mei et al., 2020, Lemma 17).
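To illustrate the statement of Proposition 15 in the simplest possible case, the sketch below runs the entropy-regularized softmax update in a one-state (bandit) problem, using the gradient form of Lemma 13 with m(s)=1. The rewards, temperature τ, and step size below are illustrative choices rather than values from the paper; with τ>0 the iterates remain in the interior of the simplex and approach the softmax of r/τ.

```python
import numpy as np

r = np.array([1.0, 0.0, -0.5])   # one reward per action (illustrative)
tau, eta = 0.5, 0.1              # entropy scale and step size (illustrative)
theta = np.zeros_like(r)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for _ in range(5000):
    pi = softmax(theta)
    q = r - tau * np.log(pi)     # soft action-values
    v = np.dot(pi, q)            # soft state value
    theta += eta * pi * (q - v)  # gradient ascent step (Lemma 13 form with m(s) = 1)

print(softmax(theta))            # converges to an interior point ...
print(softmax(r / tau))          # ... namely the softmax of r / tau
```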

E.3 Proof for Three-State Counterexample

In the previous section, we showed that a stationary point for the entropy-regularized PG objective exists. We now show that, in the three-state counterexample, such a point is not a stationary point under semi-gradient updates.

Lemma 16

A point with zero gradient is not a stationary point under semi-gradient updates in the three-state counterexample.

Proof  Suppose 𝜽{\bm{\theta}} is a point with zero gradient, i.e.:

J~μ(𝜽)𝜽=𝟎.\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}}={\bf 0}.

Every component of the gradient vector is zero, thus

J~μ(𝜽)𝜽(s1,.)=salias(s1)m(s)aπ(a|s;𝜽)𝜽(s,.)q~π(s,a)=𝟎.\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}=\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}(s,.)}\tilde{q}_{\pi}(s,a)={\bf 0}.

Since taking any action from s1s_{1} or s2s_{2} results in termination, the following holds for s{s1,s2}s\in\{s_{1},s_{2}\} and any action aa:

q~π(s,a)=\displaystyle\tilde{q}_{\pi}(s,a)= sP(s|s,a)[r(s,a,s)τlogπ(a|s;𝜽)+γ(s,a,s)v~π(s)]\displaystyle\sum_{s^{\prime}}\text{P}(s^{\prime}|s,a)\bigl{[}r(s,a,s^{\prime})-\tau\log\pi(a|s;{\bm{\theta}})+\gamma(s,a,s^{\prime})\tilde{v}_{\pi}(s^{\prime})\bigr{]}
\displaystyle=\sum_{s^{\prime}}\text{P}(s^{\prime}|s,a)\bigl{[}r(s,a,s^{\prime})-\tau\log\pi(a|s;{\bm{\theta}})\bigr{]}\qquad\triangleright\ \gamma(s,a,s^{\prime})\tilde{v}_{\pi}(s^{\prime})=0\text{ since the episode terminates}
=\displaystyle= r(s,a)τlogπ(a|s;𝜽) Recall that r(s,a)=defsP(s|s,a)r(s,a,s).\displaystyle r(s,a)-\tau\log\pi(a|s;{\bm{\theta}})\qquad\triangleright\text{ Recall that }r(s,a)\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\sum_{s^{\prime}}\text{P}(s^{\prime}|s,a)r(s,a,s^{\prime}).

States s1s_{1} and s2s_{2} have the same representation, meaning that π(s1,.;𝜽)𝜽(s1,.)=π(s2,.;𝜽)𝜽(s2,.)\frac{\partial\pi(s_{1},.;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}=\frac{\partial\pi(s_{2},.;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{2},.)} and π(s1,.;𝜽)=π(s2,.;𝜽)\pi(s_{1},.;{\bm{\theta}})=\pi(s_{2},.;{\bm{\theta}}). So

J~μ(𝜽)𝜽(s1,.)=\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}= salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)[r(s,a)τlogπ(s1,a;𝜽)]\displaystyle\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}[r(s,a)-\tau\log\pi(s_{1},a;{\bm{\theta}})]
=\displaystyle= salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)\displaystyle\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)
(salias(s1)m(s))aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽).\displaystyle-\biggl{(}\sum_{s\in\operatorname{alias}(s_{1})}m(s)\biggr{)}\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}}).

For simplicity of notation, define

M(s1)\displaystyle M(s_{1}) =defsalias(s1)m(s)\displaystyle\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\sum_{s\in\operatorname{alias}(s_{1})}m(s)
D(s1)\displaystyle D(s_{1}) =defsalias(s1)dμ(s),\displaystyle\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}\sum_{s\in\operatorname{alias}(s_{1})}d_{\mu}(s),

letting us write the above as

J~μ(𝜽)𝜽(s1,.)=salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)M(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽).\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}=\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)-M(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}}).

Similarly, the semi-gradient update to 𝜽(s1,.){\bm{\theta}}(s_{1},.) becomes

salias(s1)dμ(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽).\displaystyle\sum_{s\in\operatorname{alias}(s_{1})}d_{\mu}(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)-D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}}).

We show that, even though 𝜽{\bm{\theta}} is a stationary point under the true gradient update, this semi-gradient under 𝜽{\bm{\theta}} is not zero. Notice first that

D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)\displaystyle D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}})
=D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)+D(s1)M(s1)J~μ(𝜽)𝜽(s1,.)because J~μ(𝜽)𝜽(s1,.)=0\displaystyle=D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}})+\frac{D(s_{1})}{M(s_{1})}\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\ \ \ \ \triangleright\text{because }\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}=0
=D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)\displaystyle=D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}})
+D(s1)M(s1)[salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)M(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)]\displaystyle+\frac{D(s_{1})}{M(s_{1})}\left[\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)-M(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}})\right]
=D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)+D(s1)M(s1)[salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)]\displaystyle=D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}})+\frac{D(s_{1})}{M(s_{1})}\left[\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)\right]
D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)\displaystyle-D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}})
=D(s1)M(s1)[salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)].\displaystyle=\frac{D(s_{1})}{M(s_{1})}\left[\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)\right].

Therefore,

salias(s1)dμ(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)D(s1)aπ(s1,a;𝜽)𝜽(s1,.)τlogπ(s1,a;𝜽)\displaystyle\sum_{s\in\operatorname{alias}(s_{1})}d_{\mu}(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)-D(s_{1})\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\tau\log\pi(s_{1},a;{\bm{\theta}}) (45)
=\displaystyle= salias(s1)dμ(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)D(s1)M(s1)[salias(s1)m(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a)]\displaystyle\sum_{s\in\operatorname{alias}(s_{1})}d_{\mu}(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)-\frac{D(s_{1})}{M(s_{1})}\left[\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)\right]
=\displaystyle= salias(s1)δ(s)aπ(s1,a;𝜽)𝜽(s1,.)r(s,a) where δ(s)=defdμ(s)salias(s1)dμ(s)salias(s1)m(s)m(s)\displaystyle\sum_{s\in\operatorname{alias}(s_{1})}\delta(s)\sum_{a}\frac{\partial\pi(s_{1},a;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}r(s,a)\ \ \ \ \ \triangleright\text{ where }\delta(s)\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}d_{\mu}(s)-\frac{\sum_{s^{\prime}\in\operatorname{alias}(s_{1})}d_{\mu}(s^{\prime})}{\sum_{s^{\prime}\in\operatorname{alias}(s_{1})}m(s^{\prime})}m(s)
=\displaystyle= δ(s1)π(s1,a0;𝜽)𝜽(s1,.)2+δ(s2)π(s1,a1;𝜽)𝜽(s1,.)1\displaystyle\delta(s_{1})\frac{\partial\pi(s_{1},a_{0};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\cdot 2+\delta(s_{2})\frac{\partial\pi(s_{1},a_{1};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\cdot 1
 because r(s1,a0)=2,r(s2,a1)=1,r(s1,a1)=r(s2,a0)=0\displaystyle\qquad\qquad\qquad\qquad\triangleright\text{ because }r(s_{1},a_{0})=2,r(s_{2},a_{1})=1,r(s_{1},a_{1})=r(s_{2},a_{0})=0
\displaystyle=\bigl{(}2\delta(s_{1})-\delta(s_{2})\bigr{)}\frac{\partial\pi(s_{1},a_{0};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}.\qquad\triangleright\text{ as }\frac{\partial\pi(s_{1},a_{1};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}=-\frac{\partial\pi(s_{1},a_{0};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},.)}\text{ by Lemma \ref{lemma_H}}

The second factor, ∂π(s1,a0;𝜽)/∂𝜽(s1,.), is nonzero due to the softmax parametrization. It therefore suffices to show that the first factor, 2δ(s1)−δ(s2), is not zero.

In each episode in the counterexample, regardless of the policy, the agent spends one step in s0s_{0} and one step in either s1s_{1} or s2s_{2}. Therefore

dμ(s0)=0.5,dμ(s1)+dμ(s2)=0.5.\displaystyle d_{\mu}(s_{0})=0.5,\,d_{\mu}(s_{1})+d_{\mu}(s_{2})=0.5.

Unfolding the emphatic weightings gives

m(s0)\displaystyle m(s_{0}) =dμ(s0)=0.5\displaystyle=d_{\mu}(s_{0})=0.5
m(s1)\displaystyle m(s_{1}) =dμ(s1)+dμ(s0)π(s0,a0;𝜽)=dμ(s1)+0.5π(s0,a0;𝜽)\displaystyle=d_{\mu}(s_{1})+d_{\mu}(s_{0})\pi(s_{0},a_{0};{\bm{\theta}})=d_{\mu}(s_{1})+0.5\pi(s_{0},a_{0};{\bm{\theta}})
m(s2)\displaystyle m(s_{2}) =dμ(s2)+dμ(s0)π(s0,a1;𝜽)=dμ(s2)+0.5π(s0,a1;𝜽)\displaystyle=d_{\mu}(s_{2})+d_{\mu}(s_{0})\pi(s_{0},a_{1};{\bm{\theta}})=d_{\mu}(s_{2})+0.5\pi(s_{0},a_{1};{\bm{\theta}})
=0.5dμ(s1)+0.5(1π(s0,a0;𝜽))=1dμ(s1)0.5π(s0,a0;𝜽).\displaystyle=0.5-d_{\mu}(s_{1})+0.5(1-\pi(s_{0},a_{0};{\bm{\theta}}))=1-d_{\mu}(s_{1})-0.5\pi(s_{0},a_{0};{\bm{\theta}}).

Plugging these values into δ(s)\delta(s) results in

δ(s)\displaystyle\delta(s) =defdμ(s)salias(s1)dμ(s)salias(s1)m(s)m(s)\displaystyle\mathrel{\overset{\makebox[0.0pt]{\mbox{\tiny def}}}{=}}d_{\mu}(s)-\frac{\sum_{s^{\prime}\in\operatorname{alias}(s_{1})}d_{\mu}(s^{\prime})}{\sum_{s^{\prime}\in\operatorname{alias}(s_{1})}m(s^{\prime})}m(s)
=dμ(s)dμ(s1)+dμ(s2)m(s1)+m(s2)m(s)\displaystyle=d_{\mu}(s)-\frac{d_{\mu}(s_{1})+d_{\mu}(s_{2})}{m(s_{1})+m(s_{2})}m(s)
=dμ(s)0.51m(s)=dμ(s)0.5m(s).\displaystyle=d_{\mu}(s)-\frac{0.5}{1}m(s)=d_{\mu}(s)-0.5m(s).

Therefore,

δ(s1)\displaystyle\delta(s_{1}) =dμ(s1)0.5m(s1)=0.5dμ(s1)0.25π(s0,a0;𝜽)\displaystyle=d_{\mu}(s_{1})-0.5m(s_{1})=0.5d_{\mu}(s_{1})-0.25\pi(s_{0},a_{0};{\bm{\theta}})
δ(s2)\displaystyle\delta(s_{2}) =dμ(s2)0.5m(s2)=0.5dμ(s1)0.5+0.5dμ(s1)+0.25π(s0,a0;𝜽)\displaystyle=d_{\mu}(s_{2})-0.5m(s_{2})=0.5-d_{\mu}(s_{1})-0.5+0.5d_{\mu}(s_{1})+0.25\pi(s_{0},a_{0};{\bm{\theta}})
=0.5dμ(s1)+0.25π(s0,a0;𝜽),\displaystyle=-0.5d_{\mu}(s_{1})+0.25\pi(s_{0},a_{0};{\bm{\theta}}),

and we get that

2δ(s1)δ(s2)\displaystyle 2\delta(s_{1})-\delta(s_{2}) =dμ(s1)0.5π(s0,a0;𝜽)+0.5dμ(s1)0.25π(s0,a0;𝜽)\displaystyle=d_{\mu}(s_{1})-0.5\pi(s_{0},a_{0};{\bm{\theta}})+0.5d_{\mu}(s_{1})-0.25\pi(s_{0},a_{0};{\bm{\theta}})
=1.5dμ(s1)0.75π(s0,a0;𝜽)\displaystyle=1.5d_{\mu}(s_{1})-0.75\pi(s_{0},a_{0};{\bm{\theta}})
=0.75(μ(s0,a0)π(s0,a0;𝜽)). because dμ(s1)=0.5μ(s0,a0)\displaystyle=0.75\bigl{(}\mu(s_{0},a_{0})-\pi(s_{0},a_{0};{\bm{\theta}})\bigr{)}.\qquad\triangleright\text{ because }d_{\mu}(s_{1})=0.5\mu(s_{0},a_{0})

As long as π(s0,a0;𝜽)μ(s0,a0)\pi(s_{0},a_{0};{\bm{\theta}})\neq\mu(s_{0},a_{0}), the semi-gradient is not zero. For example, as in Section 8.1, we can choose μ(s0,a0)=0.25\mu(s_{0},a_{0})=0.25. Now we show that π(s0,a0;𝜽)0.25\pi(s_{0},a_{0};{\bm{\theta}})\neq 0.25 for a stationary point under the true gradient.
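As a quick numerical sanity check of the algebra above, the following minimal sketch (assuming NumPy; the helper name semi_gradient_factor is ours) recomputes dμ, the emphatic weightings m, and δ from μ(s0,a0) and π(s0,a0;𝜽), and confirms that 2δ(s1)−δ(s2)=0.75(μ(s0,a0)−π(s0,a0;𝜽)):

```python
import numpy as np

def semi_gradient_factor(mu_a0, pi_a0):
    """2*delta(s1) - delta(s2) in the three-state counterexample.

    mu_a0 is mu(s0, a0) under the behaviour policy; pi_a0 is pi(s0, a0; theta).
    """
    d = {"s1": 0.5 * mu_a0, "s2": 0.5 * (1.0 - mu_a0)}                      # d_mu(s1), d_mu(s2)
    m = {"s1": d["s1"] + 0.5 * pi_a0, "s2": d["s2"] + 0.5 * (1.0 - pi_a0)}  # emphatic weightings
    scale = (d["s1"] + d["s2"]) / (m["s1"] + m["s2"])                       # equals 0.5 here
    delta = {s: d[s] - scale * m[s] for s in ("s1", "s2")}
    return 2.0 * delta["s1"] - delta["s2"]

for mu_a0 in (0.25, 0.5, 0.9):
    for pi_a0 in (0.1, 0.25, 0.7):
        assert np.isclose(semi_gradient_factor(mu_a0, pi_a0), 0.75 * (mu_a0 - pi_a0))
```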

Let us first write the partial derivative with respect to 𝜽(s0,a0), given that π(s0,a0;𝜽)=0.25:

J~μ(𝜽)𝜽(s0,a0)\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{0},a_{0})} =salias(s0)m(s)aπ(a|s;𝜽)𝜽(s0,a0)q~π(s,a)\displaystyle=\sum_{s\in\operatorname{alias}(s_{0})}m(s)\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{0},a_{0})}\tilde{q}_{\pi}(s,a)
=m(s0)[π(s0,a0;𝜽)𝜽(s0,a0)q~π(s0,a0)+π(s0,a1;𝜽)𝜽(s0,a0)q~π(s0,a1)]\displaystyle=m(s_{0})\biggl{[}\frac{\partial\pi(s_{0},a_{0};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{0},a_{0})}\tilde{q}_{\pi}(s_{0},a_{0})+\frac{\partial\pi(s_{0},a_{1};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{0},a_{0})}\tilde{q}_{\pi}(s_{0},a_{1})\biggr{]}
=0.5[π(s0,a0;𝜽)(1π(s0,a0;𝜽))q~π(s0,a0)\displaystyle=0.5\biggl{[}\pi(s_{0},a_{0};{\bm{\theta}})(1-\pi(s_{0},a_{0};{\bm{\theta}}))\tilde{q}_{\pi}(s_{0},a_{0})
π(s0,a1;𝜽)π(s0,a0;𝜽)q~π(s0,a1)]\displaystyle-\pi(s_{0},a_{1};{\bm{\theta}})\pi(s_{0},a_{0};{\bm{\theta}})\tilde{q}_{\pi}(s_{0},a_{1})\biggr{]}
=0.50.250.75[q~π(s0,a0)q~π(s0,a1)].\displaystyle=0.5*0.25*0.75\bigl{[}\tilde{q}_{\pi}(s_{0},a_{0})-\tilde{q}_{\pi}(s_{0},a_{1})\bigr{]}.

For this derivative to be zero, q~π(s0,a0)q~π(s0,a1)\tilde{q}_{\pi}(s_{0},a_{0})-\tilde{q}_{\pi}(s_{0},a_{1}) must be zero.

q~π(s0,a0)q~π(s0,a1)\displaystyle\tilde{q}_{\pi}(s_{0},a_{0})-\tilde{q}_{\pi}(s_{0},a_{1})
=τlogπ(s0,a0;𝜽)+π(s1,a0,𝜽)q~π(s1,a0)+π(s1,a1,𝜽)q~π(s1,a1)\displaystyle=-\tau\log\pi(s_{0},a_{0};{\bm{\theta}})+\pi(s_{1},a_{0},{\bm{\theta}})\tilde{q}_{\pi}(s_{1},a_{0})+\pi(s_{1},a_{1},{\bm{\theta}})\tilde{q}_{\pi}(s_{1},a_{1})
+τlogπ(s0,a1;𝜽)π(s2,a0,𝜽)q~π(s2,a0)π(s2,a1,𝜽)q~π(s2,a1)\displaystyle+\tau\log\pi(s_{0},a_{1};{\bm{\theta}})-\pi(s_{2},a_{0},{\bm{\theta}})\tilde{q}_{\pi}(s_{2},a_{0})-\pi(s_{2},a_{1},{\bm{\theta}})\tilde{q}_{\pi}(s_{2},a_{1})
=τlog0.25+π(s1,a0,𝜽)(2τlogπ(s1,a0,𝜽))+π(s1,a1,𝜽)(τlogπ(s1,a1,𝜽))\displaystyle=-\tau\log 0.25+\pi(s_{1},a_{0},{\bm{\theta}})(2-\tau\log\pi(s_{1},a_{0},{\bm{\theta}}))+\pi(s_{1},a_{1},{\bm{\theta}})(-\tau\log\pi(s_{1},a_{1},{\bm{\theta}}))
+τlog0.75π(s2,a0,𝜽)(τlogπ(s2,a0,𝜽))π(s2,a1,𝜽)(1τlogπ(s2,a1,𝜽))\displaystyle+\tau\log 0.75-\pi(s_{2},a_{0},{\bm{\theta}})(-\tau\log\pi(s_{2},a_{0},{\bm{\theta}}))-\pi(s_{2},a_{1},{\bm{\theta}})(1-\tau\log\pi(s_{2},a_{1},{\bm{\theta}}))

Recall that π(s2,,𝜽)=π(s1,,𝜽)\pi(s_{2},\cdot,{\bm{\theta}})=\pi(s_{1},\cdot,{\bm{\theta}}) and π(s1,a1,𝜽)=1π(s1,a0,𝜽)\pi(s_{1},a_{1},{\bm{\theta}})=1-\pi(s_{1},a_{0},{\bm{\theta}}). Then

q~π(s0,a0)q~π(s0,a1)\displaystyle\tilde{q}_{\pi}(s_{0},a_{0})-\tilde{q}_{\pi}(s_{0},a_{1})
=τlog0.25+π(s1,a0,𝜽)(2τlogπ(s1,a0,𝜽))+π(s1,a1,𝜽)(τlogπ(s1,a1,𝜽))\displaystyle=-\tau\log 0.25+\pi(s_{1},a_{0},{\bm{\theta}})(2-\tau\log\pi(s_{1},a_{0},{\bm{\theta}}))+\pi(s_{1},a_{1},{\bm{\theta}})(-\tau\log\pi(s_{1},a_{1},{\bm{\theta}}))
+τlog0.75π(s1,a0,𝜽)(τlogπ(s1,a0,𝜽))π(s1,a1,𝜽)(1τlogπ(s1,a1,𝜽))\displaystyle+\tau\log 0.75-\pi(s_{1},a_{0},{\bm{\theta}})(-\tau\log\pi(s_{1},a_{0},{\bm{\theta}}))-\pi(s_{1},a_{1},{\bm{\theta}})(1-\tau\log\pi(s_{1},a_{1},{\bm{\theta}}))
=τ(log0.75log0.25)+2π(s1,a0,𝜽)π(s1,a1,𝜽)\displaystyle=\tau(\log 0.75-\log 0.25)+2\pi(s_{1},a_{0},{\bm{\theta}})-\pi(s_{1},a_{1},{\bm{\theta}})
=τ(log3)+3π(s1,a0,𝜽)1,\displaystyle=\tau(\log 3)+3\pi(s_{1},a_{0},{\bm{\theta}})-1, (46)

because log0.75log0.25=log(0.75/0.25)=log3\log 0.75-\log 0.25=\log(0.75/0.25)=\log 3. For small values of τ\tau, π(s1,a0,𝜽)\pi(s_{1},a_{0},{\bm{\theta}}) should be close to 1/31/3 to make 𝜽{\bm{\theta}} a stationary point.
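Equation 46 can also be checked numerically by computing the soft action-values directly, since every action from s1 or s2 terminates the episode. A minimal sketch (the helper name soft_q_gap is ours; the rewards are those of the counterexample, and π(s0,a0;𝜽)=0.25):

```python
import numpy as np

def soft_q_gap(p, tau, pi_s0_a0=0.25):
    """q~(s0,a0) - q~(s0,a1), with p = pi(s1,a0;theta) = pi(s2,a0;theta)."""
    # Actions from s1 and s2 terminate, so their soft action-values are in closed form:
    # r(s1,a0) = 2, r(s2,a1) = 1, and all other rewards are zero.
    q_s1_a0 = 2.0 - tau * np.log(p)
    q_s1_a1 = -tau * np.log(1.0 - p)
    q_s2_a0 = -tau * np.log(p)
    q_s2_a1 = 1.0 - tau * np.log(1.0 - p)
    v_s1 = p * q_s1_a0 + (1.0 - p) * q_s1_a1
    v_s2 = p * q_s2_a0 + (1.0 - p) * q_s2_a1
    q_s0_a0 = -tau * np.log(pi_s0_a0) + v_s1
    q_s0_a1 = -tau * np.log(1.0 - pi_s0_a0) + v_s2
    return q_s0_a0 - q_s0_a1

for p in (0.2, 1.0 / 3.0, 0.6):
    for tau in (0.01, 0.1, 0.5):
        assert np.isclose(soft_q_gap(p, tau), tau * np.log(3.0) + 3.0 * p - 1.0)
```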

Now let us write the partial derivative w.r.t. 𝜽(s1,a0), again assuming π(s0,a0;𝜽)=0.25 and noting that s1 and s2 are aliased:

J~μ(𝜽)𝜽(s1,a0)\displaystyle\frac{\partial\tilde{J}_{\mu}({\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},a_{0})} =salias(s1)m(s)aπ(a|s;𝜽)𝜽(s1,a0)q~π(s,a)\displaystyle=\sum_{s\in\operatorname{alias}(s_{1})}m(s)\sum_{a}\frac{\partial\pi(a|s;{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},a_{0})}\tilde{q}_{\pi}(s,a)
=m(s1)[π(s1,a0;𝜽)𝜽(s1,a0)q~π(s1,a0)+π(s1,a1;𝜽)𝜽(s1,a0)q~π(s1,a1)]\displaystyle=m(s_{1})\biggl{[}\frac{\partial\pi(s_{1},a_{0};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},a_{0})}\tilde{q}_{\pi}(s_{1},a_{0})+\frac{\partial\pi(s_{1},a_{1};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},a_{0})}\tilde{q}_{\pi}(s_{1},a_{1})\biggr{]}
+m(s2)[π(s2,a0;𝜽)𝜽(s1,a0)q~π(s2,a0)+π(s2,a1;𝜽)𝜽(s1,a0)q~π(s2,a1)]\displaystyle+m(s_{2})\biggl{[}\frac{\partial\pi(s_{2},a_{0};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},a_{0})}\tilde{q}_{\pi}(s_{2},a_{0})+\frac{\partial\pi(s_{2},a_{1};{\bm{\theta}})}{\partial{\bm{\theta}}(s_{1},a_{0})}\tilde{q}_{\pi}(s_{2},a_{1})\biggr{]}
=m(s1)[π(s1,a0;𝜽)(1π(s1,a0;𝜽))q~π(s1,a0)\displaystyle=m(s_{1})\biggl{[}\pi(s_{1},a_{0};{\bm{\theta}})(1-\pi(s_{1},a_{0};{\bm{\theta}}))\tilde{q}_{\pi}(s_{1},a_{0})
π(s1,a0;𝜽)(1π(s1,a0;𝜽))q~π(s1,a1)]\displaystyle-\pi(s_{1},a_{0};{\bm{\theta}})(1-\pi(s_{1},a_{0};{\bm{\theta}}))\tilde{q}_{\pi}(s_{1},a_{1})\biggr{]}
+m(s2)[π(s1,a0;𝜽)(1π(s1,a0;𝜽))q~π(s2,a0)\displaystyle+m(s_{2})\biggl{[}\pi(s_{1},a_{0};{\bm{\theta}})(1-\pi(s_{1},a_{0};{\bm{\theta}}))\tilde{q}_{\pi}(s_{2},a_{0})
π(s1,a0;𝜽)(1π(s1,a0;𝜽))q~π(s2,a1)]\displaystyle-\pi(s_{1},a_{0};{\bm{\theta}})(1-\pi(s_{1},a_{0};{\bm{\theta}}))\tilde{q}_{\pi}(s_{2},a_{1})\biggr{]}
=π(s1,a0;𝜽)(1π(s1,a0;𝜽))\displaystyle=\pi(s_{1},a_{0};{\bm{\theta}})(1-\pi(s_{1},a_{0};{\bm{\theta}}))
[m(s1)(q~π(s1,a0)q~π(s1,a1))+m(s2)(q~π(s2,a0)q~π(s2,a1))].\displaystyle\biggl{[}m(s_{1})\bigl{(}\tilde{q}_{\pi}(s_{1},a_{0})-\tilde{q}_{\pi}(s_{1},a_{1})\bigr{)}+m(s_{2})\bigl{(}\tilde{q}_{\pi}(s_{2},a_{0})-\tilde{q}_{\pi}(s_{2},a_{1})\bigr{)}\biggr{]}.

Making the partial derivative zero requires that either π(s1,a0;𝜽), 1−π(s1,a0;𝜽), or the factor in the brackets be zero. The first two options are incompatible with the requirement for making Equation 46 zero, which pins π(s1,a0,𝜽) to (1−τlog3)/3, strictly between 0 and 1 for small τ; so we continue with the third. Note that

m(s1)\displaystyle m(s_{1}) =dμ(s1)+dμ(s0)π(s0,a0;𝜽)=0.125+0.50.25=0.25,\displaystyle=d_{\mu}(s_{1})+d_{\mu}(s_{0})\pi(s_{0},a_{0};{\bm{\theta}})=0.125+0.5*0.25=0.25,
m(s2)\displaystyle m(s_{2}) =dμ(s2)+dμ(s0)π(s0,a1;𝜽)=0.375+0.50.75=0.75,\displaystyle=d_{\mu}(s_{2})+d_{\mu}(s_{0})\pi(s_{0},a_{1};{\bm{\theta}})=0.375+0.5*0.75=0.75,

and the factor in the brackets becomes

m(s1)(q~π(s1,a0)q~π(s1,a1))+m(s2)(q~π(s2,a0)q~π(s2,a1))\displaystyle m(s_{1})\bigl{(}\tilde{q}_{\pi}(s_{1},a_{0})-\tilde{q}_{\pi}(s_{1},a_{1})\bigr{)}+m(s_{2})\bigl{(}\tilde{q}_{\pi}(s_{2},a_{0})-\tilde{q}_{\pi}(s_{2},a_{1})\bigr{)}
=0.25(2τlogπ(s1,a0,𝜽)+τlogπ(s1,a1,𝜽))\displaystyle=0.25\bigl{(}2-\tau\log\pi(s_{1},a_{0},{\bm{\theta}})+\tau\log\pi(s_{1},a_{1},{\bm{\theta}})\bigr{)}
+0.75(τlogπ(s2,a0,𝜽)1+τlogπ(s2,a1,𝜽))\displaystyle+0.75\bigl{(}-\tau\log\pi(s_{2},a_{0},{\bm{\theta}})-1+\tau\log\pi(s_{2},a_{1},{\bm{\theta}})\bigr{)}
=0.25+τ(logπ(s1,a0,𝜽)+logπ(s1,a1,𝜽))\displaystyle=-0.25+\tau\bigl{(}-\log\pi(s_{1},a_{0},{\bm{\theta}})+\log\pi(s_{1},a_{1},{\bm{\theta}})\bigr{)}
=0.25+τ(log(1π(s1,a0,𝜽))logπ(s1,a0,𝜽))\displaystyle=-0.25+\tau\bigl{(}\log(1-\pi(s_{1},a_{0},{\bm{\theta}}))-\log\pi(s_{1},a_{0},{\bm{\theta}})\bigr{)}
=0.25+τlog(1π(s1,a0,𝜽)π(s1,a0,𝜽)).\displaystyle=-0.25+\tau\log\bigl{(}\frac{1-\pi(s_{1},a_{0},{\bm{\theta}})}{\pi(s_{1},a_{0},{\bm{\theta}})}\bigr{)}.

Making this zero is also at odds with the requirement for Equation 46. To see why, let’s consider both conditions, where we use p=π(s1,a0,𝜽)p=\pi(s_{1},a_{0},{\bm{\theta}}) to simplify notation.

τlog3+3p1=0\displaystyle\tau\log 3+3p-1=0
0.25+τlog1pp=0\displaystyle-0.25+\tau\log\frac{1-p}{p}=0

The first equation requires p=(1−τlog3)/3, which is strictly decreasing in τ. The second equation requires p=1/(exp(0.25/τ)+1), which is strictly increasing in τ. The two curves therefore intersect at exactly one point, τi=0.25/log3≈0.2276 (where both give p=0.25); for any other τ they are not equal, meaning we cannot satisfy both criteria. Therefore, for any τ≠τi, a stationary point 𝜽 under the true gradient cannot have π(s0,a0,𝜽)=μ(s0,a0)=0.25 and thus cannot be a stationary point under the semi-gradient.

 
This counterexample shows one environment where the set of stationary points under the true gradient is disjoint from the set of stationary points under the semi-gradient.
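As a final numerical check of the two stationarity conditions derived at the end of the proof (a minimal sketch; the bracketing interval is an arbitrary choice), bisection on the difference of the two curves locates their single crossing at τi=0.25/log3:

```python
import numpy as np

def p_grad(tau):
    # tau*log(3) + 3p - 1 = 0  =>  p = (1 - tau*log 3) / 3, decreasing in tau
    return (1.0 - tau * np.log(3.0)) / 3.0

def p_semi(tau):
    # -0.25 + tau*log((1-p)/p) = 0  =>  p = 1 / (exp(0.25/tau) + 1), increasing in tau
    return 1.0 / (np.exp(0.25 / tau) + 1.0)

lo, hi = 0.05, 1.0               # arbitrary interval that brackets the crossing
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if p_grad(mid) > p_semi(mid):
        lo = mid
    else:
        hi = mid

print(lo, 0.25 / np.log(3.0))    # both approximately 0.2276; at this tau both give p = 0.25
```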

Appendix F Descriptions of Algorithms in Section 9

ACE(η): Actor-Critic with Emphatic weightings. η interpolates between semi-gradient updates (η=0) and gradient updates (η=1).
OffPAC: Off-Policy Actor-Critic (Degris et al., 2012b). Equivalent to ACE(0).
GTD(λ): Gradient Temporal-Difference learning (Sutton et al., 2009). λ is the decay rate of the eligibility trace vector.
ETD(λ): Emphatic Temporal-Difference learning (Sutton et al., 2016). λ is the decay rate of the eligibility trace vector.
TDRC(λ): Temporal-Difference learning with Regularized Corrections (Ghiassian et al., 2020; Patterson et al., 2022). λ is the decay rate of the eligibility trace vector.
DPG: Deterministic Policy Gradient (Silver et al., 2014). Uses a deterministic policy parameterization and semi-gradient updates.
True-DPGE: A version of DPG using true gradient updates (i.e., scaling the update by the true emphatic weightings).
True-ACE: A version of ACE using the true emphatic weightings.
ACE-direct: A version of ACE that uses the direct method from Section 5.2.2 to estimate the emphatic weightings.
ACE-trace: A version of ACE that uses the emphatic trace from Sutton et al. (2016) to estimate the emphatic weightings.
ACE-ideal: A version of ACE that re-computes the emphatic trace on each time step using the current policy to remove the influence of old versions of the policy.
Table 1: Descriptions of the algorithms used in the experiments.

References

  • Ahmed et al. (2019) Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, 2019.
  • Barto et al. (1983) Andrew G. Barto, Richard S. Sutton, and Charles W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5), 1983.
  • Bhatnagar et al. (2007) Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Incremental natural actor–critic algorithms. In Conference on Neural Information Processing Systems, 2007.
  • Bhatnagar et al. (2009) Shalabh Bhatnagar, Richard S. Sutton, Mohammad Ghavamzadeh, and Mark Lee. Natural actor–critic algorithms. Automatica, 45(11), 2009.
  • Boyan and Moore (1995) Justin Boyan and Andrew W. Moore. Generalization in reinforcement learning: Safely approximating the value function. Conference on Neural Information Processing Systems, pages 369–376, 1995.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. OpenAI Gym. arXiv preprint arXiv:1606.01540, 2016.
  • Chevalier-Boisvert et al. (2018) Maxime Chevalier-Boisvert, Lucas Willems, and Suman Pal. Minimalistic gridworld environment for OpenAI Gym. https://github.com/maximecb/gym-minigrid, 2018.
  • Degris et al. (2012a) Thomas Degris, Patrick M. Pilarski, and Richard S. Sutton. Model–free reinforcement learning with continuous action in practice. In American Control Conference, 2012a.
  • Degris et al. (2012b) Thomas Degris, Martha White, and Richard S. Sutton. Off–policy actor–critic. In International Conference on Machine Learning, 2012b.
  • Dick (2015) Travis B. Dick. Policy gradient reinforcement learning without regret. Master’s thesis, University of Alberta, 2015.
  • Dung et al. (2007) Le Tien Dung, Takashi Komeda, and Motoki Takagi. Reinforcement learning in non-Markovian environments using automatic discovery of subgoals. In Society of Instrument and Control Engineers Annual Conference, pages 2601–2605. IEEE, 2007.
  • Harris et al. (2020) Charles R. Harris et al. Array programming with NumPy. Nature, 585(7825):357–362, September 2020. doi: 10.1038/s41586-020-2649-2. URL https://doi.org/10.1038/s41586-020-2649-2.
  • Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169, 2019.
  • Gelada and Bellemare (2019) Carles Gelada and Marc G. Bellemare. Off-policy deep reinforcement learning by bootstrapping the covariate shift. In AAAI Conference on Artificial Intelligence, volume 33, 2019.
  • Ghiassian et al. (2018) Sina Ghiassian, Andrew Patterson, Martha White, Richard S. Sutton, and Adam White. Online off-policy prediction. arXiv:1811.02597, 2018.
  • Ghiassian et al. (2020) Sina Ghiassian, Andrew Patterson, Shivam Garg, Dhawal Gupta, Adam White, and Martha White. Gradient temporal-difference learning with regularized corrections. In International Conference on Machine Learning, pages 3524–3534. PMLR, 2020.
  • Ghosh et al. (2020) Dibya Ghosh, Marlos C. Machado, and Nicolas Le Roux. An operator view of policy gradient methods. In Conference on Neural Information Processing Systems, 2020.
  • Greensmith et al. (2004) Evan Greensmith, Peter L. Bartlett, and Jonathan Baxter. Variance reduction techniques for gradient estimates in reinforcement learning. The Journal of Machine Learning Research, 5, 2004.
  • Grondman et al. (2012) Ivo Grondman, Lucian Busoniu, Gabriel A.D. Lopes, and Robert Babuska. A survey of actor–critic reinforcement learning: Standard and natural policy gradients. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 42(6), 2012.
  • Gu et al. (2017a) Shixiang Gu, Tim Lillicrap, Richard E. Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on–policy and off–policy gradient estimation for deep reinforcement learning. In Conference on Neural Information Processing Systems, 2017a.
  • Gu et al. (2017b) Shixiang Gu, Timothy Lillicrap, Zoubin Ghahramani, Richard E. Turner, and Sergey Levine. Q–prop: Sample–efficient policy gradient with an off–policy critic. In International Conference on Learning Representations, 2017b.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, 2018.
  • Hallak and Mannor (2017) Assaf Hallak and Shie Mannor. Consistent on-line off-policy evaluation. In International Conference on Machine Learning, volume 70, 2017.
  • Huang (2020) Bojun Huang. Steady state analysis of episodic reinforcement learning. In Conference on Neural Information Processing Systems, 2020.
  • Imani et al. (2018) Ehsan Imani, Eric Graves, and Martha White. An off–policy policy gradient theorem using emphatic weightings. In Conference on Neural Information Processing Systems, 2018.
  • Kallus and Uehara (2020) Nathan Kallus and Masatoshi Uehara. Statistically efficient off–policy policy gradients. In International Conference on Machine Learning, 2020.
  • Kirsch (2017) Andreas Kirsch. MDP environments for the OpenAI Gym. Technical report, 2017. URL https://arxiv.org/abs/1709.09069.
  • Klissarov and Precup (2021) Martin Klissarov and Doina Precup. Flexible option learning. Conference on Neural Information Processing Systems, pages 4632–4646, 2021.
  • Konda and Tsitsiklis (2000) Vijay R. Konda and John N. Tsitsiklis. Actor–critic algorithms. In Conference on Neural Information Processing Systems, 2000.
  • Laroche and Tachet des Combes (2021) Romain Laroche and Remi Tachet des Combes. Dr Jekyll & Mr Hyde: the strange case of off-policy policy updates. Conference on Neural Information Processing Systems, 2021.
  • Lillicrap et al. (2015) Timothy P. Lillicrap, Jonathan J. Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. In International Conference on Learning Representations, 2015.
  • Lin (1992) Long-Ji Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4), 1992.
  • Lin (1993) Long-Ji Lin. Reinforcement learning for robots using neural networks. Technical report, Carnegie-Mellon University, 1993.
  • Liu et al. (2018) Qiang Liu, Lihong Li, Ziyang Tang, and Dengyong Zhou. Breaking the curse of horizon: Infinite-horizon off-policy estimation. In Conference on Neural Information Processing Systems, 2018.
  • Liu et al. (2019) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off-policy policy gradient with stationary distribution correction. In Conference on Uncertainty in Artificial Intelligence, 2019.
  • Liu et al. (2020a) Yao Liu, Pierre-Luc Bacon, and Emma Brunskill. Understanding the curse of horizon in off-policy evaluation via conditional importance sampling. In International Conference on Machine Learning, pages 6184–6193. PMLR, 2020a.
  • Liu et al. (2020b) Yao Liu, Adith Swaminathan, Alekh Agarwal, and Emma Brunskill. Off–policy policy gradient with stationary distribution correction. In Conference on Uncertainty in Artificial Intelligence, 2020b.
  • Loch and Singh (1998) John Loch and Satinder P. Singh. Using eligibility traces to find the best memoryless policy in partially observable Markov decision processes. In International Conference on Machine Learning, pages 323–331, 1998.
  • Maei (2011) Hamid Maei. Gradient Temporal–Difference Learning Algorithms. PhD thesis, University of Alberta, 2011.
  • Maei (2018) Hamid Reza Maei. Convergent actor–critic algorithms under off-policy training and function approximation. arXiv:1802.07842, 2018.
  • Mahmood et al. (2015) A. Rupam Mahmood, Huizhen Yu, Martha White, and Richard S. Sutton. Emphatic temporal-difference learning. arXiv:1507.01569, 2015.
  • Marbach and Tsitsiklis (2001) Peter Marbach and John N. Tsitsiklis. Simulation–based optimization of Markov reward processes. IEEE Transactions on Automatic Control, 46(2), 2001.
  • Mei et al. (2020) Jincheng Mei, Chenjun Xiao, Csaba Szepesvari, and Dale Schuurmans. On the global convergence rates of softmax policy gradient methods. In International Conference on Machine Learning, 2020.
  • Meuleau et al. (2000) Nicolas Meuleau, Leonid Peshkin, Leslie P. Kaelbling, and Kee-Eung Kim. Off–policy policy search. Technical report, MIT Artificial Intelligence Laboratory, 2000.
  • Mnih et al. (2015) Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Andrei A. Rusu, Joel Veness, Marc G. Bellemare, Alex Graves, Martin Riedmiller, Andreas K. Fidjeland, Georg Ostrovski, et al. Human–level control through deep reinforcement learning. Nature, 518(7540), 2015.
  • Moore (1990) Andrew William Moore. Efficient memory-based learning for robot control. PhD thesis, University of Cambridge, 1990.
  • Nota and Thomas (2020) Chris Nota and Philip S. Thomas. Is the policy gradient a gradient? In International Conference on Autonomous Agents and Multiagent Systems, 2020.
  • Patterson et al. (2022) Andrew Patterson, Adam White, and Martha White. A generalized projected Bellman error for off-policy value estimation in reinforcement learning. The Journal of Machine Learning Research, 2022.
  • Peters et al. (2005) Jan Peters, Sethu Vijayakumar, and Stefan Schaal. Natural actor-critic. In European Conference on Machine Learning, 2005.
  • Precup (2000) Doina Precup. Temporal Abstraction In Reinforcement Learning. PhD thesis, University of Massachusetts Amherst, 2000.
  • Precup et al. (2000) Doina Precup, Richard S. Sutton, and Satinder P. Singh. Eligibility Traces for Off–Policy Policy Evaluation. In International Conference on Machine Learning, 2000.
  • Precup et al. (2001) Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off–policy temporal–difference learning with function approximation. International Conference on Machine Learning, 2001.
  • Schaul et al. (2016) Tom Schaul, John Quan, Ioannis Antonoglou, and David Silver. Prioritized experience replay. In International Conference on Learning Representations, 2016.
  • Schmidt and Roux (2013) Mark Schmidt and Nicolas Le Roux. Fast convergence of stochastic gradient descent under a strong growth condition. arXiv:1308.6370, 2013.
  • Silver et al. (2014) David Silver, Guy Lever, Nicolas Heess, Thomas Degris, Daan Wierstra, and Martin Riedmiller. Deterministic policy gradient algorithms. In International Conference on Machine Learning, 2014.
  • Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. 2nd edition, 2018.
  • Sutton et al. (1999a) Richard S. Sutton, David McAllester, Satinder Singh, and Yishay Mansour. Policy gradient methods for reinforcement learning with function approximation. In Conference on Neural Information Processing Systems, 1999a.
  • Sutton et al. (1999b) Richard S. Sutton, Doina Precup, and Satinder Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1-2), 1999b.
  • Sutton et al. (2009) Richard S. Sutton, Hamid Reza Maei, Doina Precup, Shalabh Bhatnagar, David Silver, Csaba Szepesvári, and Eric Wiewiora. Fast gradient-descent methods for temporal–difference learning with linear function approximation. In International Conference on Machine Learning, 2009.
  • Sutton et al. (2011) Richard S. Sutton, Joseph Modayil, Michael Delp, Thomas Degris, Patrick M. Pilarski, Adam White, and Doina Precup. Horde: A scalable real–time architecture for learning knowledge from unsupervised sensorimotor interaction. In International Conference on Autonomous Agents and MultiAgent Systems, 2011.
  • Sutton et al. (2016) Richard S. Sutton, A. Rupam Mahmood, and Martha White. An emphatic approach to the problem of off–policy temporal–difference learning. The Journal of Machine Learning Research, 17, 2016.
  • Thomas (2014) Philip Thomas. Bias in natural actor–critic algorithms. In International Conference on Machine Learning, 2014.
  • Tosatto et al. (2020) Samuele Tosatto, Joao Carvalho, Hany Abdulsamad, and Jan Peters. A nonparametric off–policy policy gradient. In International Conference on Artificial Intelligence and Statistics, 2020.
  • Wang et al. (2016) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Rémi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor–critic with experience replay. In International Conference on Learning Representations, 2016.
  • Watkins and Dayan (1992) Christopher J.C.H. Watkins and Peter Dayan. Q–learning. Machine Learning, 8(3-4), 1992.
  • Weaver and Tao (2001) Lex Weaver and Nigel Tao. The optimal reward baseline for gradient–based reinforcement learning. In Conference on Uncertainty in Artificial Intelligence, 2001.
  • White (2015) Adam White. Developing A Predictive Approach To Knowledge. PhD thesis, University of Alberta, 2015.
  • White (2017) Martha White. Unifying task specification in reinforcement learning. In International Conference on Machine Learning, 2017.
  • Williams (1992) Ronald J. Williams. Simple statistical gradient–following algorithms for connectionist reinforcement learning. Machine Learning, 8, 1992.
  • Williams and Peng (1991) Ronald J. Williams and Jing Peng. Function optimization using connectionist reinforcement learning algorithms. Connection Science, 3(3), 1991.
  • Witten (1977) Ian H. Witten. An adaptive optimal controller for discrete–time Markov environments. Information and Control, 34(4), 1977.
  • Yu (2015) Huizhen Yu. On convergence of emphatic temporal–difference learning. In Conference on Learning Theory, 2015.
  • Zhang et al. (2019) Shangtong Zhang, Wendelin Boehmer, and Shimon Whiteson. Generalized off-policy actor-critic. In Conference on Neural Information Processing Systems, 2019.
  • Zhang et al. (2020) Shangtong Zhang, Bo Liu, Hengshuai Yao, and Shimon Whiteson. Provably convergent two–timescale off-policy actor–critic with function approximation. In International Conference on Machine Learning, 2020.