
Cautious Actor-Critic

Lingwei Zhu   [email protected]
Toshinori Kitamura   [email protected]
Takamitsu Matsubara   [email protected]
Nara Institute of Science and Technology, Japan
(Equal contribution)
Abstract

The oscillating performance of off-policy learning and persistent errors in the actor-critic (AC) setting call for algorithms that can learn conservatively and thus better suit stability-critical applications. In this paper, we propose a novel off-policy AC algorithm, cautious actor-critic (CAC). The name cautious comes from its doubly conservative nature: we exploit the classic policy interpolation of conservative policy iteration for the actor and the entropy regularization of conservative value iteration for the critic. Our key observation is that the entropy-regularized critic facilitates and simplifies the unwieldy interpolated actor update while still ensuring robust policy improvement. We compare CAC to state-of-the-art AC methods on a set of challenging continuous control problems and demonstrate that CAC achieves comparable performance while significantly stabilizing learning.

keywords:
Cautious Reinforcement Learning; Policy Oscillation; Monotonic Improvement; Entropy-Regularized Markov Decision Process

1 Introduction

Actor-critic (AC) methods in reinforcement learning (RL) have been attracting increasing interest due to their scalability to large-scale problems: they can learn from both on-policy and off-policy samples, handle continuous action spaces (Lillicrap et al., 2015; Schulman et al., 2015), and operate in both model-free and model-based settings (Haarnoja et al., 2018; Hafner et al., 2020). Recently, the model-free setting has seen a surge of off-policy AC methods (Wang et al., 2017; Haarnoja et al., 2018; Fakoor et al., 2020). While these methods are sample-efficient in exploiting off-policy samples for continuous control, those very samples often bring oscillating performance during learning as a side effect of distribution mismatch. The oscillating performance of off-policy learning and persistent errors in the AC setting (Fujimoto et al., 2018) call for algorithms that can learn conservatively to better suit stability-critical applications.

The performance oscillation and degradation problems have been widely discussed in the approximate dynamic programming (ADP) literature (Wagner, 2011; Bertsekas, 2011), which has motivated learning algorithms that are efficient against various sources of error. The seminal work of Kakade and Langford (2002) proposes a principled approach to tackling performance degradation by leveraging policy interpolation, which is conservative in that it reduces the greediness of the updated policy. However, despite its strong theoretical guarantees, two drawbacks limit its use in the AC setting: (1) it is difficult to obtain a reliable reference policy in high-dimensional continuous state-action spaces; (2) the interpolation is often regarded as inconvenient to use, and it is unclear how to design the interpolation in continuous action spaces. In practice, two popular variants (Schulman et al., 2015, 2017) that sidestep the interpolation and directly approximate the updated policy are more often used in the AC setting. On the other hand, the recently flourishing entropy-regularized ADP literature (Azar et al., 2012; Fox et al., 2016; Kozuno et al., 2019; Vieillard et al., 2020a) also features conservative learning (Kozuno et al., 2019), as these methods average over past value functions (Vieillard et al., 2020a). Though they do not explicitly address performance oscillation, they have been empirically verified to be error-tolerant and yield state-of-the-art performance on a wide range of tasks. Extending this conservative learning to AC has been studied (Nachum et al., 2018; Haarnoja et al., 2018). However, the resulting conservativeness exists only in the critic: in challenging tasks, performance degradation and oscillation still occur.

This paper aims to tackle the performance oscillation problem of off-policy AC by proposing a novel algorithm: cautious actor-critic (CAC), where the name cautious comes from its doubly conservative nature, as we combine a conservative actor leveraging the concept of conservative policy iteration (CPI) (Kakade and Langford, 2002) with a conservative critic exploiting the entropy regularization of conservative value iteration (CVI) (Kozuno et al., 2019). The key observation is that the entropy-regularized critic can provide error-tolerant reference policies and simplifies the unwieldy interpolated actor update while still ensuring robust policy improvement. CAC leverages an automatically adjusted interpolation coefficient to reflect its confidence during learning: when performance oscillation is likely to happen, CAC behaves cautiously and relies more on the validated previous policy than on the new policy. Our novel interpolation design is inspired by a very recent study in the ADP literature (Vieillard et al., 2020b) but adapted to the continuous-action AC setting.

The rest of the paper is organized as follows: we provide a short survey of related work in Section 2, followed by preliminaries on RL and relevant AC methods in Section 3. Section 4 presents CAC; in particular, we discuss our novel design of the interpolation scheme, which is central to CAC. We evaluate CAC in Section 5 on a set of challenging continuous control problems and show that it achieves performance comparable to state-of-the-art AC methods while significantly stabilizing learning. An ablation study is conducted to distinguish CAC from existing methods. Discussion and conclusion are given in Section 6. Due to page limits, derivations and implementation details are presented in the Appendix of the supplemental file.

2 Related Work

It has been noticed that various sources of error, such as approximation error in AC algorithms (Fujimoto et al., 2018; Fu et al., 2019), cause performance degradation and oscillation during learning. In this section, we briefly survey related works that (partially) tackle this problem and outline our contributions.

Robust AC. Algorithms learning from off-policy data are sample-efficient but are also at risk of divergence. One approach to the divergence problem is to incorporate the importance sampling (IS) ratio (Precup et al., 2001). However, the resulting algorithms typically have large variance since the IS ratio is a product of many potentially unbounded terms. Munos et al. (2016) proposed to clip the IS ratio and proved that the resulting algorithm, Retrace($\lambda$), attains the globally optimal policy. Retrace($\lambda$) has motivated recent successful AC methods (Wang et al., 2017; Fakoor et al., 2020) that exploit both on- and off-policy samples for better stability while retaining sample efficiency. The robustness comes from the fact that, thanks to the clipping, any off-policy samples can be used without causing divergence or wild variance. However, one still has to trade off learning speed against learning stability via the user-defined clipping threshold; if stability is favored too strongly, the agent might fail to learn meaningful behaviors.

Entropy-regularized AC. The recently flourishing entropy-regularized ADP literature has established that augmenting the reward with Shannon entropy renders the optimal policy multi-modal and hence robust against adversarial settings (Nachum et al., 2017; Haarnoja et al., 2017; Ahmed et al., 2019). Another popular candidate is the relative entropy, or Kullback-Leibler (KL) divergence, which renders the optimal policy an average of all past value functions (Azar et al., 2012; Fox et al., 2016; Kozuno et al., 2019; Vieillard et al., 2020a). This is more conservative and robust under mild assumptions, such as the sequence of errors being a martingale difference under the natural filtration (Azar et al., 2012). These methods have been extended to the AC setting, including the state-of-the-art method of Haarnoja et al. (2018), which exploits the Shannon entropy, and that of Nachum et al. (2018), which leverages the KL divergence. These methods demonstrate strong empirical performance on a wide range of tasks. However, it should be noted that the conservativeness brought by the entropy-regularized reward augmentation exists only in the critic. Since the average is prone to outliers, performance degradation can still happen if the value function estimates at some iterations are poor.

Conservative Policy Iteration. Tackling performance oscillation has been widely discussed in the ADP literature (Wagner, 2011; Bertsekas, 2011), of which the seminal CPI algorithm (Kakade and Langford, 2002) has inspired many conservative learning schemes with strong theoretical guarantees for per-update improvement (Pirotta et al., 2013; Abbasi-Yadkori et al., 2016). However, CPI has seen limited application in the AC setting due to two main drawbacks: (1) it assumes a good reference policy, which is typically difficult to obtain in high-dimensional continuous state-action spaces; (2) the coefficient interpolating the reference policy and the current policy depends on the horizon of learning, which is typically short in ADP scenarios. In the AC setting, which features long learning horizons, this coefficient becomes vanishingly small and hence significantly hinders learning. A very recent work extended CPI to learning with deep networks and demonstrated good performance on Atari games (Vieillard et al., 2020b); however, it is limited to discrete action spaces, while our method mainly focuses on continuous action spaces and can easily be adapted to the discrete-action setting. The above-mentioned drawbacks render CPI generally perceived as unwieldy (Schulman et al., 2015).

Trust-region Methods. Motivated by the above-mentioned drawbacks of CPI, two popular variants, trust region policy optimization (TRPO) (Schulman et al., 2015) and its improved version proximal policy optimization (PPO) (Schulman et al., 2017), sidestep the interpolation and directly approximate the resulting conservative policy. TRPO and PPO are welcome choices for learning from scratch when the reference policy is unavailable or unreliable, but they ignore this knowledge when a good reference policy is at our disposal. Further, TRPO and PPO require on-policy samples, which are expensive since each sample can be used only once and is then discarded.

Contribution. The main contributions of this paper are:

  • To the best of the authors' knowledge, CAC is the first off-policy AC method that applies the interpolation of CPI-based algorithms to stabilize off-policy learning.

  • A novel interpolation coefficient design suitable for high-dimensional continuous state-action spaces. Previously, only a design for discrete action spaces was available (Vieillard et al., 2020b).

  • An evaluation of CAC on a set of benchmark continuous control problems demonstrating that CAC achieves performance comparable to state-of-the-art AC methods while significantly stabilizing learning.

3 Preliminary

3.1 Reinforcement Learning

RL problems are mostly formulated as Markov Decision Processes (MDPs) defined by the tuple $(\mathcal{S},\mathcal{A},P,\mathcal{R},\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the (possibly continuous) action space, $P(s'\mid s,a)$ is the transition probability from $s$ to $s'$ under action $a$, and $\mathcal{R}$ is the reward function with $\mathcal{R}(s,a)\in[-r_{\text{max}},r_{\text{max}}]$ denoting the immediate reward associated with that transition. We also use $r_t$ as a shorthand for $\mathcal{R}(s_t,a_t)$ at the $t$-th step. $\gamma\in(0,1)$ is the discount factor. We define $\mathbf{P}$ as a left operator such that $(\mathbf{P}V)(s,a)=\sum_{s'}P(s'\mid s,a)V(s')$ for some $V$. A policy $\pi$ maps states to a probability distribution over the action space. We define the stationary state distribution induced by $\pi$ as the unnormalized occupancy measure $d^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=s\mid\pi)$. The goal of RL is to find an optimal policy $\pi^{*}$ that maximizes the long-term discounted reward $J^{\pi}=\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$. Equivalently, this optimal policy also maximizes the state value function for all states $s$:

$$
\pi^{*}=\operatorname*{arg\,max}_{\pi}V^{\pi}(s)=\operatorname*{arg\,max}_{\pi}\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\Big|\,s_{0}=s\right].
$$

The state-action value counterpart $Q^{\pi^{*}}$ is more frequently used in the control context:

$$
Q^{\pi^{*}}(s,a)=\max_{\pi}\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\Big|\,s_{0}=s,a_{0}=a\right].
$$

We define the advantage function for a policy $\pi$ as $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$, and the expected advantage function of policy $\pi'$ over $\pi$ as:

$$
A^{\pi'}_{\pi}(s)=\mathbb{E}^{\pi'}\left[Q^{\pi}(s,a)\right]-V^{\pi}(s).
$$
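To make these quantities concrete, the following minimal sketch (ours, not part of the paper) computes a discounted return and a one-step advantage estimate; the rollout and value estimates are purely hypothetical numbers.

```python
# Minimal illustration (hypothetical numbers) of the discounted return and the
# advantage A^pi(s, a) = Q^pi(s, a) - V^pi(s) defined above.
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 0.5, 1.0])           # r_0, ..., r_3 from one rollout

# Discounted return sum_t gamma^t r_t starting from s_0.
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)

# With (hypothetical) value estimates and a one-step bootstrap Q(s, a) ~ r + gamma * V(s'),
# the advantage is A(s, a) = Q(s, a) - V(s).
v_s, v_s_next, r = 2.0, 1.8, 1.0
advantage = r + gamma * v_s_next - v_s
print(discounted_return, advantage)
```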

3.2 Actor Critic methods

In this section we briefly introduce recent actor-critic algorithms, discuss their pros and cons, and shed light on our proposal in Section 4.

3.2.1 Trust Region Methods

TRPO exploits the policy improvement lemma of Kakade and Langford (2002) to ensure approximately monotonic policy improvement. However, unlike Kakade and Langford (2002), where at the $k$-th iteration the policy is updated as $\pi_{k+1}=\zeta\pi'+(1-\zeta)\pi_{k}$ with $\pi'$ the greedy policy, TRPO directly computes $\pi_{k+1}$ without resorting to $\pi'$. Specifically, TRPO has the following update rule:

$$
\begin{split}
&J^{\text{TRPO}}_{\pi_{k}}(\pi):=\operatorname*{arg\,max}_{\pi}\mathbb{E}^{\pi,d^{\pi_{k}}}\left[A^{\pi_{k}}\right],\\
&\text{subject to}\quad C_{\gamma}\,\Delta_{\pi}\,D^{\text{max}}_{KL}(\pi_{k}\,\|\,\pi)\leq\delta,\\
&\text{with}\quad \Delta_{\pi}=\max_{s,a}|A^{\pi}(s,a)|,
\end{split}\tag{1}
$$

where $C_{\gamma}$ is a horizon-dependent constant, $D_{KL}$ is the KL divergence, and $\delta$ is the trust-region parameter.

As computing $J^{\text{TRPO}}_{\pi_{k}}(\pi)$ requires sampling according to the stationary distribution $d^{\pi_{k}}$, TRPO is inherently an on-policy algorithm, which is not sample-efficient: each sample can be used only once and is then discarded.

3.2.2 Off-policy Maximum Entropy Actor-Critic

As a state-of-the-art model-free off-policy AC algorithm, soft actor-critic (SAC) (Haarnoja et al., 2018) maximizes not only the task reward but also the Shannon entropy of the policy. The entropy term renders the optimal policy multi-modal, as opposed to the deterministic policies of algorithms that solely maximize task reward; this multi-modality is beneficial (Haarnoja et al., 2017) and has demonstrated superior sample efficiency through more effective exploration of the state-action space. Written in the ADP manner, SAC has the following update rule (we drop the state-action pair of the $Q$ function for simplicity):

$$
\begin{split}
&\begin{cases}
\pi_{k+1}\leftarrow\operatorname*{arg\,max}_{\pi}\mathbb{E}^{\pi}\left[Q^{\pi_{k}}_{\mathcal{H}}(s,a)+\kappa\mathcal{H}\left(\pi(\cdot|s_{t})\right)\right]\\
Q^{\pi_{k+1}}_{\mathcal{H}}\leftarrow\mathcal{R}(s,a)+\gamma\left(\mathbf{P}V^{\pi_{k}}_{\mathcal{H}}\right)(s,a)
\end{cases}\\
&\text{where } V^{\pi}_{\mathcal{H}}(s)=\sum_{t\geq 0}\gamma^{t}\mathbb{E}^{\pi}\left[\mathcal{R}(s_{t},a_{t})+\kappa\mathcal{H}\left(\pi(\cdot|s_{t})\right)\,\Big|\,s_{0}=s\right].
\end{split}\tag{2}
$$

$\mathcal{H}(\pi):=-\sum_{a}\pi(a|s)\log\pi(a|s)$ denotes the Shannon entropy of policy $\pi$, $\kappa$ denotes the weighting coefficient, and $V^{\pi}_{\mathcal{H}}$ denotes the soft value function regularized with the Shannon entropy. SAC performs a one-step look-ahead for updating the actor: states are randomly sampled from a replay buffer, and actions are then generated by feeding the states into the policy network (Haarnoja et al., 2018). As such, SAC does not need an IS ratio, but it has been observed that SAC often oscillates wildly in performance.
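For concreteness, below is a minimal PyTorch sketch of the SAC-style one-step look-ahead actor update described above. The networks, batch, and coefficient are hypothetical stand-ins, and practical details such as the tanh squashing and twin critics are omitted.

```python
# Sketch of the Eq. (2) actor step: states from a replay buffer, reparameterized
# actions from the policy, and maximization of E[Q(s, a) + kappa * H(pi(.|s))].
import torch
import torch.nn as nn

state_dim, action_dim, kappa = 8, 2, 0.2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

states = torch.randn(32, state_dim)                 # a batch sampled from the replay buffer
mean, log_std = actor(states).chunk(2, dim=-1)
dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
actions = dist.rsample()                            # reparameterized one-step look-ahead
log_prob = dist.log_prob(actions).sum(-1, keepdim=True)

q = critic(torch.cat([states, actions], dim=-1))
actor_loss = (kappa * log_prob - q).mean()          # minimize -E[Q + kappa * H]
opt.zero_grad(); actor_loss.backward(); opt.step()
```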

4 Cautious Actor Critic

In this section we present CAC, an off-policy actor-critic method capable of learning conservatively against performance oscillation and degradation.

4.1 CAC Algorithm

For ease of understanding, we write CAC in the following approximate policy iteration style (Vieillard et al., 2020a). Specifically, the first step corresponds to the policy (actor) improvement and the last step corresponds to the interpolation:

$$
\begin{split}
&\text{CAC}\;\begin{cases}
\pi_{k+1}\leftarrow\operatorname*{arg\,max}_{\pi}\mathbb{E}^{\pi}\left[Q^{\pi_{k}}_{\mathcal{I}}(s,a)+\mathcal{I}_{\pi_{k}}^{\pi}(s)\right]\\
Q^{\pi_{k+1}}_{\mathcal{I}}\leftarrow\mathcal{R}(s,a)+\gamma\left(\mathbf{P}V^{\pi_{k+1}}_{\mathcal{I}}\right)(s,a)\\
\zeta\leftarrow(\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},\mathcal{B}_{K}}\left[A^{\pi_{k}}(s,a)\right]\right)\\
\tilde{\pi}_{k+1}\leftarrow\zeta\pi_{k+1}+(1-\zeta)\pi_{k}
\end{cases}\\
&\text{with } V^{\pi_{k+1}}_{\mathcal{I}}(s)=\sum_{t\geq 0}\gamma^{t}\mathbb{E}^{\pi}\left[\mathcal{R}(s_{t},a_{t})+\mathcal{I}_{\pi_{k}}^{\pi_{k+1}}(s_{t})\,\Big|\,s_{0}=s\right],\\
&\mathcal{I}_{\pi_{k}}^{\pi_{k+1}}(s)=\mathbb{E}^{\pi_{k+1}}\left[-\kappa\log\pi_{k+1}(a|s)-\tau\log\frac{\pi_{k+1}(a|s)}{\pi_{k}(a|s)}\right],
\end{split}\tag{3}
$$

where $\zeta$ is the interpolation coefficient computed by $\zeta^{*}$ in Eq. (7), $\mathcal{B}_{K}$ denotes the on-policy replay buffer, with $K$ indicating the number of steps up to now, and $\kappa,\tau$ denote the Shannon entropy and KL divergence regularization coefficients, respectively. $Q_{\mathcal{I}},V_{\mathcal{I}}$ (and hence $A_{\mathcal{I}}$) denote the entropy-regularized value functions. For uncluttered notation, in the rest of the paper we drop the subscript $\mathcal{I}$. Except for the computation of $\zeta$, all other steps are computed using samples from the off-policy replay buffer $\mathcal{B}$. For later convenience, we define $\alpha:=\frac{\kappa}{\kappa+\tau}$ and $\beta:=\frac{1}{\kappa+\tau}$. Note that our use of both on- and off-policy replay buffers renders CAC similar in spirit to (Gu et al., 2017; Wang et al., 2017; Fakoor et al., 2020).

We first compute the greedy policy $\pi_{k+1}$. Due to Fenchel conjugacy (Geist et al., 2019), when $\mathcal{I}_{\pi_{k}}^{\pi_{k+1}}$ is included in the $\operatorname*{arg\,max}$, the maximizer policy can be derived analytically as $\pi_{k+1}(a|s)\propto\pi^{\alpha}_{k}(a|s)\exp\left(\beta Q^{\pi_{k}}(s,a)\right)$ (Kozuno et al., 2019). Then the entropy-regularized action value function $Q^{\pi_{k+1}}$ is evaluated. In the third step, we use the on-policy replay buffer $\mathcal{B}_{K}$ to compute $\zeta$ as described in Eq. (7). Finally, the policy that is optimal in the sense of guaranteeing policy improvement is obtained by interpolating $\pi_{k+1}$ with $\pi_{k}$. We present the following theorem on the convergence of CAC in the tabular case.
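As a rough illustration of the four steps of Eq. (3), the following tabular sketch runs one CAC iteration on a random toy MDP. The MDP, the coefficients, and in particular the simplified $\zeta$ estimate are illustrative assumptions, not the authors' implementation (which uses neural networks, replay buffers, and Eq. (7)).

```python
# One tabular CAC iteration, Eq. (3), on a random toy MDP (illustration only).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, kappa, tau = 5, 3, 0.95, 0.1, 0.1
alpha, beta = kappa / (kappa + tau), 1.0 / (kappa + tau)

P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
R = rng.uniform(-1, 1, size=(S, A))                 # R[s, a]
pi_k = np.full((S, A), 1.0 / A)                     # current policy
Q = np.zeros((S, A))                                # current critic estimate

# 1) Actor improvement: pi_{k+1}(a|s) proportional to pi_k(a|s)^alpha * exp(beta * Q(s, a)).
logits = alpha * np.log(pi_k) + beta * Q
pi_next = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_next /= pi_next.sum(axis=1, keepdims=True)

# 2) Critic evaluation of pi_{k+1} with the entropy/KL-augmented reward.
I = -np.sum(pi_next * (kappa * np.log(pi_next) + tau * (np.log(pi_next) - np.log(pi_k))), axis=1)
for _ in range(200):
    V = np.sum(pi_next * Q, axis=1) + I             # regularized state value
    Q = R + gamma * (P @ V)

# 3) Interpolation coefficient from advantage estimates (simplified stand-in for Eq. (7)).
adv = np.sum(pi_next * Q, axis=1) - np.sum(pi_k * Q, axis=1)
zeta = np.clip(adv.mean() / (np.ptp(adv) + 1e-8), 0.0, 1.0)

# 4) Cautious actor update: interpolate the new and previous policies.
pi_tilde = zeta * pi_next + (1 - zeta) * pi_k
print("zeta =", zeta)
```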

Theorem 4.1.

Repeated application of CAC, Eq. (3), to any initial policy $\pi$ converges to the entropy-regularized optimal policy $\pi^{*}(a|s)=\frac{\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}{\int_{a\in\mathcal{A}}\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}$.

Proof 4.2.

See Section A.1 in the Appendix.

From Theorem 4.1 we see that the optimal policy and the corresponding optimum of the MDP are biased by the choice of $\kappa$. If we gradually decay the value of $\kappa$, we recover the optimum of the non-regularized MDP (Vieillard et al., 2020a). In the following sections, we describe the CAC actor and critic in detail, as well as the derivation of the actor gradient expression and the practical design of the interpolation coefficient $\zeta$.

4.1.1 Conservative Actor

As discussed in Section 3.2.1, at the $k$-th iteration TRPO directly constructs a new policy $\pi$ by maximizing $J^{\text{TRPO}}_{\pi_{k}}(\pi)$. This is useful when the agent learns from scratch, but it discards the reference policy when one is available. We instead follow the exact form of Kakade and Langford (2002) and take the reference policy into account, choosing $\pi_{k+1}$ as the reference policy $\pi'$:

$$
\tilde{\pi}_{k+1}=\zeta\pi_{k+1}+(1-\zeta)\pi_{k}.\tag{4}
$$

Our objective function $J_{\pi_{k},\pi_{k+1}}^{\text{CAC}}(\pi)$ explicitly features the knowledge of the reference policy (Pirotta et al., 2013). Specifically, the objective can be lower-bounded as:

$$
\begin{split}
J_{\pi_{k},\pi_{k+1}}^{\text{CAC}}(\pi)&:=\mathbb{E}^{\pi_{k+1},d^{\pi_{k+1}}}\left[A^{\pi_{k}}(s,a)\right]\\
&\geq C'_{\gamma}\,(v\,\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right)^{2},\\
\text{given }\zeta^{*}&=2\,C'_{\gamma}(v\,\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right),\\
\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}&=\max_{s,s'}\left|A^{\pi_{k+1}}_{\pi_{k}}(s)-A^{\pi_{k+1}}_{\pi_{k}}(s')\right|,\\
v&=\max_{s}D_{TV}\left(\pi_{k+1}(\cdot|s)\,\|\,\pi_{k}(\cdot|s)\right),
\end{split}\tag{5}
$$

where $D_{TV}$ denotes the total variation. $v$, $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$, and the expectation with respect to $\pi_{k+1},d^{\pi_{k}}$ require estimation, and $C'_{\gamma}$ absorbs the horizon-dependent constants. Hence, by optimizing the lower bound of $J_{\pi_{k},\pi'}^{\text{CAC}}(\pi)$, we can achieve guaranteed improvement.

In the existing literature (Kakade and Langford, 2002; Pirotta et al., 2013; Abbasi-Yadkori et al., 2016), the difficulties of extending Eq. (5) to large-scale problems are: (1) preparing a reliable reference policy in high-dimensional continuous state-action spaces is difficult; (2) it is hard to accurately estimate $v$, the maximum total variation between two policies, without enforcing a gradual change of policies, which is absent in those works; naively using $v\leq 2$ as suggested by Pirotta et al. (2013) often yields vanishingly small $\zeta$, which significantly hinders learning; (3) the horizon-dependent constant $C'_{\gamma}$ developed in the classic ADP literature is not suitable for learning with deep networks, which features long learning horizons. As demonstrated in the following sections, we tackle the first and second problems by leveraging an entropy-regularized critic, and the third problem via a novel design of $\zeta$ inspired by a very recent work on discrete-action problems (Vieillard et al., 2020b).

4.1.2 Conservative Critic

From Eq. (5) we see that, to obtain a meaningful interpolation coefficient $\zeta$, one is required to accurately estimate the maximum total deviation $v$, which is intractable in high-dimensional continuous action spaces. However, by introducing an entropy-regularized critic, we can leverage the following theorem to avoid estimating $v$:

Theorem 4.3.

(Kozuno et al., 2019, Proposition 3) For any two consecutive entropy-regularized policies $\pi_{k},\pi_{k+1}$ generated by Eq. (3), the following bound on their maximum total deviation holds:

$$
\max_{s}D_{TV}\left(\pi_{k+1}(\cdot|s)\,\|\,\pi_{k}(\cdot|s)\right)\leq\sqrt{4B_{k}+2C_{k}},\quad\text{where } B_{k}:=\frac{1-\gamma^{k}}{1-\gamma}\epsilon\beta,\quad C_{k}:=r_{\text{max}}\beta\sum_{j=0}^{k-1}\alpha^{j}\gamma^{k-j-1},\tag{6}
$$

and $\epsilon$ is the uniform upper bound of the errors.

Recall from Section 4.1 that $\alpha:=\frac{\kappa}{\kappa+\tau}$ and $\beta:=\frac{1}{\kappa+\tau}$. It has been proved in (Kozuno et al., 2019) that this bound is non-improvable, i.e., there exists an MDP for which the inequality becomes an equality.
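For intuition, a small sketch evaluating the bound of Theorem 4.3 for illustrative (assumed) values of $\kappa,\tau,\gamma,r_{\text{max}}$ and the error bound $\epsilon$:

```python
# Evaluate sqrt(4*B_k + 2*C_k) from Theorem 4.3 for illustrative constants.
import numpy as np

kappa, tau, gamma, r_max, eps, k = 0.1, 0.1, 0.99, 1.0, 0.01, 50
alpha, beta = kappa / (kappa + tau), 1.0 / (kappa + tau)

B_k = (1 - gamma ** k) / (1 - gamma) * eps * beta
C_k = r_max * beta * sum(alpha ** j * gamma ** (k - j - 1) for j in range(k))
tv_bound = np.sqrt(4 * B_k + 2 * C_k)
print(B_k, C_k, tv_bound)    # upper-bounds max_s D_TV(pi_{k+1}(.|s) || pi_k(.|s))
```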

By leveraging an entropy-regularized critic, the objective in Eq. (5) becomes:

$$
\begin{split}
J_{\pi_{k},\pi_{k+1}}^{\text{CAC}}(\pi)&\geq C'_{\gamma}C_{k}\,(\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right)^{2},\\
\text{given }\zeta^{*}&=2\,C'_{\gamma}C_{k}(\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right),
\end{split}\tag{7}
$$

where $C'_{\gamma}$ absorbs horizon-dependent constants and $C_{k}$ is from Theorem 4.3. Estimating the optimal $\zeta^{*}$ now requires estimating the expectation with respect to $\pi_{k+1},d^{\pi_{k}}$ and $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$, which has been studied by (Vieillard et al., 2020b).

The KL divergence is also important for generating reasonable reference policies, even for high-dimensional or continuous-action problems. Consider the following upper bound due to (Vieillard et al., 2020a), where the reward is augmented by the KL divergence:

$$
\left\|Q^{*}-Q^{\pi_{k+1}}\right\|_{\infty}\leq\frac{2}{1-\gamma}\left\|\frac{1}{k}\sum_{j=0}^{k}\epsilon_{j}\right\|_{\infty}+\frac{4}{1-\gamma}\frac{V_{\max}}{k},
$$

where $\epsilon_{j}$ are the errors and $V_{\max}=\frac{r_{\text{max}}}{1-\gamma}$. Comparing it with the non-improvable approximate modified policy iteration (AMPI) bound, where the reward is not augmented (Scherrer et al., 2015):

$$
\left\|Q^{*}-Q^{\pi_{k+1}}\right\|_{\infty}\leq\left((1-\gamma)\sum_{j=1}^{k}\left\|\epsilon_{j}\right\|_{\infty}\right)+\frac{2\gamma^{k+1}}{1-\gamma}V_{\max},
$$

we see that the error term in the KL-regularized case is a sup over a sum of errors. Under mild assumptions, such as the $\epsilon_{j}$ being i.i.d. under the natural filtration (Azar et al., 2012), the summation of errors asymptotically cancels out. On the other hand, the error term of AMPI depends on the summation of the per-iteration maxima, which is typically large. Further, the dependence of the error on the horizon is linear in $\frac{1}{1-\gamma}$ rather than quadratic, which is a significant improvement since typically $\gamma\approx 1$.

4.1.3 Network Optimization Perspective

Given the above ADP-style characterization of both the actor and the critic, we now examine Eq. (3) from the optimization perspective. Suppose the critic is parametrized by a network with parameters $\theta$ and the actor by a network with parameters $\phi$. CAC updates the network weights $\theta,\phi$ by solving the following minimization problems:

$$
\begin{aligned}
y&=r+\gamma\left(\mathbb{E}^{a\sim\pi_{\phi}}\left[Q_{\bar{\theta}}(s',a)\right]+\mathcal{I}_{\pi_{\bar{\phi}}}^{\pi_{\phi}}(s')\right), &&\text{(8a)}\\
\theta&\leftarrow\arg\min\mathbb{E}^{\mathcal{B}}\left[\left(Q_{\theta}(s,a)-y\right)^{2}\right], &&\text{(8b)}\\
\phi&\leftarrow\arg\min\mathbb{E}^{\mathcal{B}}\left[D_{KL}\left(\pi_{\phi}\,\|\,(1-\zeta)\pi_{\bar{\phi}}+\zeta\mathcal{G}_{\pi_{\bar{\phi}}}Q_{\theta}\right)\right], &&\text{(8c)}
\end{aligned}
$$

where the update of $\phi$ corresponds to solving an information projection problem. This is because the policies $\pi_{k+1},\pi_{k}$ are Boltzmann softmax (Geist et al., 2019) but their mixture is generally not Boltzmann, which might result in the loss of desirable properties. By the following theorem, in the ideal case we can find a Boltzmann policy $\pi_{\phi}$ that perfectly represents the interpolation by solving the information projection problem.
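A minimal PyTorch sketch of the critic update of Eqs. (8a)-(8b) is given below: a one-sample estimate of the regularized target $y$, followed by squared-error regression. All networks, coefficients, and the batch are hypothetical placeholders rather than the authors' code.

```python
# Critic regression toward the regularized target y of Eq. (8a), sketch only.
import torch
import torch.nn as nn

state_dim, action_dim, gamma, kappa, tau = 8, 2, 0.99, 0.2, 0.2

def gaussian(net, s):
    mean, log_std = net(s).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))
actor_bar = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))   # baseline policy
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_bar = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # target critic
opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

s, a, r, s_next = (torch.randn(32, state_dim), torch.randn(32, action_dim),
                   torch.randn(32, 1), torch.randn(32, state_dim))      # a replay-buffer batch

with torch.no_grad():
    dist, dist_bar = gaussian(actor, s_next), gaussian(actor_bar, s_next)
    a_next = dist.sample()
    logp = dist.log_prob(a_next).sum(-1, keepdim=True)
    logp_bar = dist_bar.log_prob(a_next).sum(-1, keepdim=True)
    reg = -kappa * logp - tau * (logp - logp_bar)                       # one-sample I_{pi_bar}^{pi}(s')
    y = r + gamma * (critic_bar(torch.cat([s_next, a_next], -1)) + reg)

critic_loss = ((critic(torch.cat([s, a], -1)) - y) ** 2).mean()        # Eq. (8b)
opt.zero_grad(); critic_loss.backward(); opt.step()
```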

Theorem 4.4.

(Ziebart, 2010, Theorem 2.8) Let $\pi^{(1)},\pi^{(2)},\dots,\pi^{(n)}$ be an arbitrary sequence of policies and $\zeta_{1},\dots,\zeta_{n}$ be a sequence of numbers such that $\zeta_{i}\geq 0$ for all $i$ and $\sum_{i=1}^{n}\zeta_{i}=1$. Then the policy $\pi'$ defined by:

$$
\pi'(a|s):=\frac{\sum_{i=1}^{n}\zeta_{i}\,P(\mathcal{S}=s,\mathcal{A}=a\,|\,\pi^{(i)})}{\sum_{i=1}^{n}\zeta_{i}\,P(\mathcal{S}=s\,|\,\pi^{(i)})}\tag{9}
$$

has the same expected number of state-action occurrences as the mixture, whenever the denominator is nonzero.

In implementation, since the states are sampled from the replay buffer, there is an error term in this information projection step. Taking the above information projection into account, we elaborate on the gradient expression of the actor via the following proposition:

Proposition 4.5.

Let the actor network be parametrized by weights $\phi$ and the critic by $\theta$. Define $\mathcal{G}Q_{\bar{\phi},\theta}$ as the greedy policy with respect to the CAC critic; the subscript $\bar{\phi}$ comes from the baseline policy introduced by the KL divergence. Then the gradient of the actor update can be expressed as:

$$
\begin{split}
&\nabla_{\phi}\mathbb{E}^{\substack{s\sim\mathcal{B}\\ a\sim\pi_{\phi}}}\left[D^{\phi}_{\bar{\phi}}-\frac{\beta}{1+\mathcal{X}}Q_{\theta}(s,a)\right],\\
&\text{where } D^{\phi}_{\bar{\phi}}=\log\pi_{\phi}(a\mid s)-\frac{\alpha+\mathcal{X}}{1+\mathcal{X}}\log\pi_{\bar{\phi}}(a\mid s),\\
&\mathcal{X}=\frac{1-\zeta}{\zeta}\cdot\frac{\pi_{\bar{\phi}}(a\mid s)}{\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)}.
\end{split}\tag{10}
$$
Proof 4.6.

See Section A.2 in the Appendix.

This gradient expression is similar to that of SAC (Haarnoja et al., 2018) and is off-policy since the states $s$ are sampled from the off-policy replay buffer $\mathcal{B}$. However, CAC has the additional term $\log\pi_{\bar{\phi}}(a\mid s)$ from the KL regularization, and the term $\mathcal{X}$ multiplying both $Q_{\theta}$ and $\log\pi_{\bar{\phi}}(a\mid s)$ involves $\zeta$, which encodes the information guiding the gradient to learn cautiously.
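To make Eq. (10) concrete, the toy sketch below evaluates the objective whose gradient is Eq. (10) for a single state with a one-dimensional action discretized on a grid, so that the greedy-policy density $\mathcal{G}Q_{\bar{\phi},\theta}(a|s)\propto\pi_{\bar{\phi}}^{\alpha}(a|s)\exp(\beta Q_{\theta}(s,a))$ can be normalized numerically; autograd then supplies the gradient. The Gaussian policies, the quadratic stand-in for $Q_{\theta}$, and the value of $\zeta$ are all assumptions for illustration, not the authors' implementation.

```python
# Toy evaluation of the Eq. (10) objective on a discretized 1-D action space.
import torch

kappa, tau, zeta = 0.2, 0.2, 0.5
alpha, beta = kappa / (kappa + tau), 1.0 / (kappa + tau)

grid = torch.linspace(-2.0, 2.0, 401)                   # discretized action space
da = grid[1] - grid[0]
q = -(grid - 0.7) ** 2                                  # stand-in for Q_theta(s, .)

mu_bar = torch.tensor(0.0)                              # baseline policy pi_{phi_bar}: N(0, 1)
logp_bar = torch.distributions.Normal(mu_bar, 1.0).log_prob(grid)

mu = torch.tensor(0.3, requires_grad=True)              # actor pi_phi: N(mu, 1)
logp = torch.distributions.Normal(mu, 1.0).log_prob(grid)
pi_mass = logp.exp() * da                               # probability mass of pi_phi on the grid

g_logits = alpha * logp_bar + beta * q                  # greedy policy, normalized on the grid
g = torch.softmax(g_logits, dim=0) / da

x = (1 - zeta) / zeta * logp_bar.exp() / g              # the X term of Eq. (10)
d = logp - (alpha + x) / (1 + x) * logp_bar             # the D term of Eq. (10)
loss = torch.sum(pi_mass * (d - beta / (1 + x) * q))    # E_{a ~ pi_phi}[D - beta/(1+X) * Q]
loss.backward()                                         # autograd gives the gradient w.r.t. mu
print(mu.grad)
```

In the actual continuous-action implementation, the expectation is instead estimated with reparameterized samples, as in Eq. (24) of the Appendix.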

4.2 Design of Interpolation Coefficient

One of the main difficulties in extending CPI to learning with deep networks is that $\zeta$ becomes vanishingly small due to the typically long horizon in the AC setting. To tackle this problem, (Vieillard et al., 2020b) propose to heuristically design $\zeta$ to be a non-trivial value based on moving averages.



Figure 1: Training curves on the continuous control benchmark problems. The solid curves show the mean and shaded regions the standard deviation over the five independent trials. In all tasks CAC achieved comparable performance but significantly stabilized learning.

Recall that in Eq. (7), $\zeta$ is computed by a function of the form:

$$
\zeta=C'_{\gamma}C_{k}\,\frac{\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]}{\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}},
$$

where $C'_{\gamma}$ absorbs horizon-dependent constants and $C_{k}$ is defined in Eq. (6). We propose to remove $C'_{\gamma}$ since it tends to zero as the horizon increases. Recall also that $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$ is the maximum difference of the expected advantage function defined in Eq. (5). We propose the following novel design of $\zeta$:

$$
\zeta^{\mathrm{CAC}}=\mathrm{clip}\left(\frac{\tilde{\mathbb{A}}}{|\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}|},\,0,\,1\right),\tag{11}
$$

where, following the moving-average concept of (Vieillard et al., 2020b), we update $\tilde{\mathbb{A}}$ and $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$ as:

$$
\begin{split}
&M=\mathbb{E}^{\substack{s\sim\mathcal{B}_{K}\\ a\sim\mathcal{G}Q^{\pi_{k}}}}\left[A^{\pi_{k}}(s,a)\right]\\
&\tilde{\mathbb{A}}\leftarrow\begin{cases}c,&\text{if }M\leq 0\\ (1-\nu_{A})\tilde{\mathbb{A}}+\nu_{A}M,&\text{else}\end{cases}\\
&\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}\leftarrow(1-\nu_{A^{\mathrm{MaxDiff}}})\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}+\nu_{A^{\mathrm{MaxDiff}}}\,M.
\end{split}\tag{12}
$$

Here, $M$ is the current estimate of $\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]$. We introduce the if-else judgement because $M\leq 0$ indicates that the updated policy performs worse than the current one; in that case we set $\tilde{\mathbb{A}}$ to a negative value $c$, which enforces $\zeta=0$. Otherwise this information is incorporated into $\tilde{\mathbb{A}}$ by an exponential moving average with the previous estimates. $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$ approximates the maximum difference $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$. $\mathcal{B}_{K}$ is a FIFO replay buffer storing the $K$ most recent on-policy samples, and $\nu_{A},\nu_{A^{\mathrm{MaxDiff}}}\in[0,1]$ are hyperparameters controlling the averages. Following (Vieillard et al., 2020b), it is beneficial to have $\nu_{A^{\mathrm{MaxDiff}}}\leq\nu_{A}$ for smooth learning.

Computing $\zeta$ in Eq. (11) using the moving averages in Eq. (12) is in spirit similar to (Vieillard et al., 2020b). However, they focus on general stationary policies: as $\tilde{\mathbb{A}}$ is an off-policy estimate of the on-policy term $\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]$, it might corrupt the improvement guarantee. We instead focus on entropy-regularized policies, which allow one to bound the performance loss of leveraging the off-policy estimate $\tilde{\mathbb{A}}$ (Zhu and Matsubara, 2020, Theorem 3).
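A minimal sketch of the update of Eqs. (11)-(12) is given below; the hyperparameter values and the advantage batch are illustrative.

```python
# Moving-average zeta update of Eqs. (11)-(12), with the if-else reset when M <= 0.
import numpy as np

nu_a, nu_max, c = 0.1, 0.01, -1.0        # nu_{A^MaxDiff} <= nu_A, and c < 0
a_tilde, a_maxdiff = 0.0, 1.0            # running estimates

def update_zeta(advantages, a_tilde, a_maxdiff):
    """advantages: A^{pi_k}(s, a) for s ~ B_K, a ~ greedy policy."""
    m = float(np.mean(advantages))       # current estimate M
    a_tilde = c if m <= 0 else (1 - nu_a) * a_tilde + nu_a * m
    a_maxdiff = (1 - nu_max) * a_maxdiff + nu_max * m
    zeta = float(np.clip(a_tilde / abs(a_maxdiff), 0.0, 1.0))
    return zeta, a_tilde, a_maxdiff

zeta, a_tilde, a_maxdiff = update_zeta(np.random.randn(256) * 0.1 + 0.05, a_tilde, a_maxdiff)
print(zeta)
```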

5 Experiments

Table 1: Performance oscillation values of all algorithms on all environments. $\times$ indicates that the algorithm failed to learn meaningful behaviors. CAC recorded the smallest performance oscillation values on all environments. PPO is the only on-policy algorithm in the comparison.

               $\|\mathcal{O}J\|_{\infty}$                        $\|\mathcal{O}J\|_{2}$
               PPO    TD3    SAC    CAC ($\zeta=1$)  CAC          PPO    TD3    SAC    CAC ($\zeta=1$)  CAC
Ant            1979   4979   7793   7160             1811         359    510    642    591              297
HalfCheetah    ×      2337   3717   4200             1870         ×      331    425    397              286
Hopper         1598   3515   2598   2944             1944         318    609    454    394              279
Humanoid       ×      ×      4115   3092             2199         ×      ×      645    436              313
Walker2d       1673   3729   4577   4310             1345         330    461    499    334              183

As CAC combines concepts from the ADP literature, such as KL regularization and conservative learning, that have not seen application in AC, it is interesting to examine the combination against existing AC methods on challenging tasks. We choose a set of high-dimensional continuous control tasks from the OpenAI Gym benchmark suite (Brockman et al., 2016).

We compare CAC with twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018), which comprehensively surveys the factors causing the poor performance of actor-critic methods, to examine the cautious learning mechanism. As CAC is based on CPI, which also inspired TRPO and PPO, we compare it with PPO, the improved version of TRPO (Schulman et al., 2017). Since PPO does not involve computing $\zeta$, we also include the curves of CAC with $\zeta=1$. We further compare with SAC (Haarnoja et al., 2018), which has a similar architecture.

To better illustrate and quantify stability during learning, we follow (Zhu and Matsubara, 2020) and define the following measures of performance oscillation:

$$
\begin{split}
&\forall k \text{ such that } R_{k+1}-R_{k}<0:\\
&\|\mathcal{O}J\|_{\infty}=\max_{k}\left|R_{k+1}-R_{k}\right|,\\
&\|\mathcal{O}J\|_{2}=\sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(R_{k+1}-R_{k}\right)^{2}},
\end{split}\tag{13}
$$

where $N$ is the number of learning steps and $R_{k}$ is the cumulative reward reported at the $k$-th evaluation. Intuitively, $\|\mathcal{O}J\|_{\infty}$ and $\|\mathcal{O}J\|_{2}$ measure the maximum and average degradation during learning, respectively.
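A small sketch computing these measures from a sequence of evaluation returns (made-up numbers); here the average in $\|\mathcal{O}J\|_{2}$ is taken over the degradation steps only, which is one possible reading of Eq. (13).

```python
# Oscillation measures of Eq. (13) from a sequence of evaluation returns.
import numpy as np

returns = np.array([100.0, 250.0, 180.0, 400.0, 390.0, 600.0])   # hypothetical R_k
drops = np.diff(returns)
drops = drops[drops < 0]                                          # steps where R_{k+1} - R_k < 0
oj_inf = np.abs(drops).max() if drops.size else 0.0
oj_2 = np.sqrt(np.mean(drops ** 2)) if drops.size else 0.0
print(oj_inf, oj_2)
```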

5.1 Comparative Evaluation

We run all algorithms with the same set of hyperparameters listed in Section A.4 of the Appendix. All figures are plotted with statistics from 10 different random seeds, with each performing 10 evaluation rollouts every 5000 environment steps.

Figure 1 shows the learning curves of the algorithms. CAC achieved performance comparable to the other AC algorithms while significantly stabilizing the learning curves. PPO's learning speed was the slowest on all environments due to its on-policy, sample-inefficient nature. The other methods were able to leverage off-policy samples to quickly learn meaningful behaviors. However, the fast learning came at a cost: except on the relatively simple HalfCheetah-v2, these off-policy algorithms oscillated wildly, especially on the challenging Humanoid-v2, where both PPO and TD3 failed to learn any meaningful behaviors and the performance of SAC and of CAC with $\zeta=1$ degraded frequently. On the other hand, CAC traded slightly slower learning for stability, exhibiting smooth curves. Indeed, the convergence rate of CAC is $O\!\left(e^{-(1-\gamma)\sum_{j=1}^{k}\zeta_{j}}\right)$ (Vieillard et al., 2020b), which emphasizes stability more as $\zeta\rightarrow 0$.

The stability comparison can be seen in Table 1, which summarizes the values of $\|\mathcal{O}J\|_{\infty}$ and $\|\mathcal{O}J\|_{2}$ for all algorithms. It provides empirical support that CAC showed the least oscillation during learning, in contrast to the other off-policy algorithms, which oscillated wildly. Since CAC with $\zeta=1$ still showed large oscillation, we conclude that the mixture coefficient introduced in CAC is effective in preventing significant policy degradation.

5.2 Ablation study on mixture coefficient

Figure 2: Top: The learning curves of derived methods of CAC on Ant-v2. Bottom: The value of the mixture coefficient ζ\zeta.

In this section we conduct an ablation study of the effectiveness of CAC as well as the proposed $\zeta$ design in Section 4.2. We compare the following setups:

  1. Full. This is CAC with the proposed $\zeta^{\text{CAC}}$ of Eqs. (11) and (12). The curve is the same as in Figure 1.

  2. No KL. We remove the KL regularization from Eq. (3). This corresponds to SAC with the cautious actor.

  3. DSPI. This corresponds to Deep Safe Policy Iteration (Zhu and Matsubara, 2020), which uses the $\zeta$ suggested by (Vieillard et al., 2020b).

  4. No $\mathcal{B}_{K}$. We replace the on-policy replay buffer $\mathcal{B}_{K}$ with the off-policy buffer $\mathcal{B}$, as suggested by (Vieillard et al., 2020b).

  5. No If-Else. This corresponds to removing the if branch in Eq. (12) and learning with only the else condition.

As is clear from Figure 2, while removing the if-else judgement accelerated learning, it ignored the warning from $M\leq 0$ that the updated policy was poorer. The resulting curve oscillated drastically as a consequence of the aggressive $\zeta$ shown in the bottom panel.

Removing the KL regularization induced a learning curve similar to Full at the beginning, but the performance degraded significantly from the middle stage and failed to recover. This is probably because the policies were corrupted by errors.

Using the off-policy replay buffer $\mathcal{B}$ led to stable learning. This is expected, as the agent was forced to learn cautiously by the CAC mechanism. On the other hand, learning was slow since off-policy samples were not as informative as on-policy ones, as reflected by the small $\zeta$ values in the bottom panel.

It is most interesting to examine the DSPI case, where $\zeta$ was set according to the suggestion of (Vieillard et al., 2020b). Though this scheme works well on Atari games, the resulting algorithm failed to learn any meaningful behaviors on the challenging continuous-action control tasks. This is because it estimates with the entire off-policy replay buffer $\mathcal{B}$, which tends to produce very large estimates of $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$, leading to vanishingly small $\zeta^{\text{DSPI}}$ and subsequent poor performance.

6 Conclusion

We have presented CAC, a novel actor-critic algorithm that introduces several concepts from the approximate dynamic programming literature. The cautiousness of CAC lies in its double conservativeness: the actor follows conservative policy iteration (Kakade and Langford, 2002), which ensures monotonic improvement, and the critic exploits conservative value iteration (Kozuno et al., 2019; Vieillard et al., 2020a), which has been shown to yield state-of-the-art guarantees in the ADP literature.

Our key observation was that, by introducing an entropy-regularized critic, the unwieldy interpolated actor update can be simplified significantly while still ensuring robust policy improvement. CAC performed comparably to state-of-the-art AC methods while significantly stabilizing learning on benchmark control problems with high-dimensional continuous state-action spaces.

An interesting future direction is to incorporate other entropies for different purposes. For example, the $\alpha$-divergence could be used to obtain sparse optimal policies.

Acknowledgments

This work is partly supported by JSPS KAKENHI Grant Number 21J15633 and 21H03522.

References

  • Abbasi-Yadkori et al. (2016) Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen J. Wright. A fast and reliable policy improvement algorithm. In International Conference on Artificial Intelligence and Statistics, pages 1338–1346, 2016.
  • Ahmed et al. (2019) Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pages 151–160, 2019.
  • Azar et al. (2012) Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. The Journal of Machine Learning Research, 13(1):3207–3245, 2012.
  • Bertsekas (2011) Dimitri Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9:310–335, 2011.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Fakoor et al. (2020) Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3o: Policy-on policy-off policy optimization. In Conference on Uncertainty in Artificial Intelligence, pages 1017–1027, 2020.
  • Fox et al. (2016) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence, pages 202–211, 2016.
  • Fu et al. (2019) Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep q-learning algorithms. In International Conference on Machine Learning, pages 2021–2030, 2019.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596, 2018.
  • Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169, 2019.
  • Gu et al. (2017) Shixiang Shane Gu, Timothy Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3846–3855, 2017.
  • Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361, 2017.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870, 2018.
  • Hafner et al. (2020) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint, arXiv:1912.01603, 2020.
  • Kakade and Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002.
  • Kozuno et al. (2019) Tadashi Kozuno, Eiji Uchibe, and Kenji Doya. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 2995–3003, 2019.
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
  • Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 2775–2785. 2017.
  • Nachum et al. (2018) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-pcl: An off-policy trust region method for continuous control. In International Conference on Learning Representations, pages 1–11, 2018.
  • Pirotta et al. (2013) Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.
  • Precup et al. (2001) Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal difference learning with function approximation. In International Conference on Machine Learning, page 417–424, 2001.
  • Scherrer and Geist (2014) Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Machine Learning and Knowledge Discovery in Databases, pages 35–50, 2014.
  • Scherrer et al. (2015) Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16(1):1629–1676, 2015.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
  • Vieillard et al. (2020a) Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: an analysis of kl regularization in reinforcement learning. Advances in Neural Information Processing Systems, 33:1–12, 2020a.
  • Vieillard et al. (2020b) Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Deep conservative policy iteration. In AAAI Conference on Artificial Intelligence, pages 6070–6077, 04 2020b.
  • Wagner (2011) Paul Wagner. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. In Advances in Neural Information Processing Systems 24, pages 2573–2581, 2011.
  • Wang et al. (2017) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. In International Conference on Learning Representations, pages 1–13, 2017.
  • Zhu and Matsubara (2020) Lingwei Zhu and Takamitsu Matsubara. Ensuring monotonic policy improvement in entropy-regularized value-based reinforcement learning. arXiv preprint, arXiv:2008.10806, 2020.
  • Ziebart (2010) Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.

Appendix A Appendix

In this appendix we provide the missing proofs and implementation details. Specifically, we present Theorem 1 on CAC convergence in Section A.1, Proposition 4 on the CAC actor gradient expression in Section A.2, and implementation details in Section A.3.

A.1 Proof for the CAC convergence

We prove the convergence of CAC using a policy-iteration-style argument; similar proofs have been used in (Haarnoja et al., 2018, Theorem 4). The following lemmas establish the convergence of the policy evaluation and policy improvement steps of CAC, respectively.

Lemma A.1 (CAC Policy Evaluation).

Given the current policy $\pi$ and a baseline policy $\tilde{\pi}$, the policy evaluation step of CAC is formulated as:

$$
Q_{\pi_{k+1}}\leftarrow\mathcal{R}(s,a)+\gamma\left(\mathbf{P}\mathbb{E}^{\pi}[Q_{\pi_{k}}(s,a)]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)\right).\tag{14}
$$

Consider an initial Q-value $Q_{0}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ with $|\mathcal{A}|<\infty$. With repeated application of Eq. (14), the sequence $Q_{k}$ converges to the following entropy-regularized Q-value $Q_{\tilde{\pi}}^{\pi}$ as $k\rightarrow\infty$:

$$
Q_{\tilde{\pi}}^{\pi}(s,a):=\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_{t}+\mathcal{I}_{\tilde{\pi}}^{\pi}\left(s_{t+1}\right)\big)\,\Big|\,s_{0}=s,a_{0}=a\right].\tag{15}
$$
Proof A.2.

Define the entropy-augmented reward as $\mathcal{R}^{\pi}_{\tilde{\pi}}(s,a)\triangleq\mathcal{R}(s,a)+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)$ and rewrite the update rule as:

$$
Q_{\pi_{k+1}}\leftarrow\mathcal{R}^{\pi}_{\tilde{\pi}}(s,a)+\gamma\mathbf{P}\mathbb{E}^{\pi}[Q_{\pi_{k}}(s,a)].\tag{16}
$$

With the assumption $|\mathcal{A}|<\infty$ and bounded rewards, we can apply the standard convergence results for policy evaluation (Sutton and Barto, 2018).
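A tabular sketch of this policy evaluation recursion is given below, following the augmented-reward reading of Eqs. (14)-(16); the toy MDP and the two policies are random illustrative inputs.

```python
# Tabular entropy-regularized policy evaluation: iterate Q <- R_aug + gamma * P E^pi[Q].
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, kappa, tau = 4, 3, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
R = rng.uniform(-1, 1, size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)              # policy being evaluated
pi_bar = rng.dirichlet(np.ones(A), size=S)          # baseline policy

# Regularizer I(s) of Eq. (3), folded into the augmented reward at the next state.
reg = -np.sum(pi * (kappa * np.log(pi) + tau * (np.log(pi) - np.log(pi_bar))), axis=1)
R_aug = R + gamma * (P @ reg)

Q = np.zeros((S, A))
for _ in range(500):                                # gamma-contraction, converges to Eq. (15)
    Q = R_aug + gamma * (P @ np.sum(pi * Q, axis=1))
print(Q)
```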

Lemma A.3 (CAC Policy Improvement).

Given the current policy $\pi$, a baseline policy $\tilde{\pi}$, and the updated policy $\pi_{\text{new}}$, CAC has the following policy update:

$$
\pi_{\text{new}}=(1-\zeta)\pi+\zeta\hat{\pi},\quad\text{where }\hat{\pi}(a\mid s)=\frac{\tilde{\pi}^{\alpha}(a\mid s)\exp\left(\beta Q_{\tilde{\pi}}^{\pi}(s,a)\right)}{Z(s)},\tag{17}
$$

with $\zeta\in[0,1]$. Then $Q_{\tilde{\pi}}^{\pi_{\text{new}}}(s,a)\geq Q_{\tilde{\pi}}^{\pi}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ with $|\mathcal{A}|<\infty$.

Proof A.4.

Consider a function $f:\zeta\to\mathbb{R}$ with $\zeta\in[0,1]$:

$$
f(\zeta)=\mathbb{E}^{a\sim\pi_{\text{new}}}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi_{\text{new}}}(s).\tag{18}
$$

From the definition $\hat{\pi}=\operatorname*{arg\,max}_{\pi}\;\mathbb{E}^{a\sim\pi}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)$, $f(\zeta)$ takes its maximum value at $\zeta=1$. The first and second derivatives of $f(\zeta)$ with respect to $\zeta$ are:

$$
f'(\zeta)=\sum_{a}\left(\hat{\pi}(a|s)-\pi(a|s)\right)\Big(Q_{\tilde{\pi}}^{\pi}(s,a)+\tau\log\tilde{\pi}(a|s)-(\kappa+\tau)\log\big((1-\zeta)\pi(a|s)+\zeta\hat{\pi}(a|s)\big)\Big),\tag{19}
$$
$$
f''(\zeta)=-(\kappa+\tau)\sum_{a}\frac{\left(\hat{\pi}(a|s)-\pi(a|s)\right)^{2}}{(1-\zeta)\pi(a|s)+\zeta\hat{\pi}(a|s)}\leq 0.\tag{20}
$$

Thus, the function $f$ is concave on $\zeta\in[0,1]$. Since $f(\zeta)$ attains its maximum at $\zeta=1$, $f(\zeta)$ is monotonically increasing in $\zeta$ and $f(0)\leq f(\zeta)$.

Therefore, the following inequality for the entropy-regularized V-value $V_{\tilde{\pi}}^{\pi}(s)$ holds:

$$
V_{\tilde{\pi}}^{\pi}(s)=\mathbb{E}^{a\sim\pi}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)\leq\mathbb{E}^{a\sim\pi_{\text{new}}}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi_{\text{new}}}(s).\tag{21}
$$

By repeatedly applying Eq. (21), we obtain the following inequalities:

$$
\begin{split}
Q_{\tilde{\pi}}^{\pi}(s,a)&=\mathcal{R}(s,a)+\gamma\mathbf{P}\left[\mathbb{E}^{\pi}[Q_{\tilde{\pi}}^{\pi}(s,a)]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)\right]\\
&\leq\mathcal{R}(s,a)+\gamma\mathbf{P}\left[\mathbb{E}^{\pi_{\text{new}}}[Q_{\tilde{\pi}}^{\pi}(s,a)]+\mathcal{I}_{\tilde{\pi}}^{\pi_{\text{new}}}(s)\right]\\
&\;\;\vdots\\
&\leq Q_{\tilde{\pi}}^{\pi_{\mathrm{new}}}(s,a).
\end{split}\tag{22}
$$

Convergence to $Q_{\tilde{\pi}}^{\pi_{\mathrm{new}}}$ follows from Lemma A.1.

Combining the policy evaluation and policy improvement, we are now ready to prove Theorem 1.

Theorem 1 Repeated application of CAC, Eq. (3), to any initial policy $\pi$ converges to the entropy-regularized optimal policy $\pi^{*}(a|s)=\frac{\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}{\int_{a\in\mathcal{A}}\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}$.

Proof A.5.

According to Lemma A.1 and Lemma A.3, the entropy-regularized Q-value at the $k$-th update satisfies $Q_{\pi_{k-1}}^{\pi_{k}}(s,a)\geq Q_{\pi_{k-1}}^{\pi_{k-1}}(s,a)$. Given bounded rewards, $Q_{\pi_{k}}^{\pi_{k}}$ is also bounded from above and the sequence converges to a unique $\pi^{*}$. Note that upon reaching the optimum, the KL regularization term becomes $0$. Hence, using the same iterative argument as in the proof of Lemma A.3, we get $Q_{\pi^{*}}^{\pi^{*}}(s,a)>Q_{\pi}^{\pi}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $\pi$. By Ziebart (2010) and Haarnoja et al. (2018), the optimal policy is entropy-regularized and hence has the softmax form $\pi^{*}(a|s)=\frac{\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}{\int_{a\in\mathcal{A}}\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}$ (recall from Eq. (3) that $\kappa$ is the entropy weighting coefficient). The convergence of the general interpolated policy to the optimal policy follows the argument of Scherrer and Geist (2014).

A.2 Proof for the CAC gradient

In this subsection we derive the gradient expression for CAC. For ease of reading we restate the proposition here:

Proposition 4 Let the actor network be parametrized by weights $\phi$ and the critic by $\theta$. Define $\mathcal{G}Q_{\bar{\phi},\theta}$ as the greedy policy with respect to the CAC critic; the subscript $\bar{\phi}$ comes from the baseline policy introduced by the KL divergence. Then the gradient of the actor update can be expressed as:

\nabla_{\phi}\,\mathbb{E}^{s\sim\mathcal{B},\,a\sim\pi_{\phi}}\left[D^{\phi}_{\bar{\phi}} - \frac{\beta}{1+\mathcal{X}}Q_{\theta}(s,a)\right], \quad (23)
\text{where } D^{\phi}_{\bar{\phi}} = \log\pi_{\phi}(a\mid s) - \frac{\alpha+\mathcal{X}}{1+\mathcal{X}}\log\pi_{\bar{\phi}}(a\mid s), \qquad \mathcal{X} = \frac{1-\zeta}{\zeta}\cdot\frac{\pi_{\bar{\phi}}(a\mid s)}{\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)}.
Proof A.6.

Using the reparameterization trick $a = f_{\phi}(\epsilon;s_{t})$ with $\epsilon$ a noise vector (Haarnoja et al., 2018), the gradient of Eq. (8c) can be expressed as:

\hat{\nabla}_{\phi} J_{\pi}(\phi) = \nabla_{\phi}\log\pi_{\phi}(a\mid s) + \nabla_{a}\log\pi_{\phi}(a\mid s)\,\nabla_{\phi}f_{\phi}(\epsilon;s_{t}) \quad (24)
\qquad\qquad - \nabla_{a}\log\left((1-\zeta)\pi_{\bar{\phi}}(a\mid s) + \zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)\right)\nabla_{\phi}f_{\phi}(\epsilon;s_{t}).

We expand the term $\nabla_{a}\log\left((1-\zeta)\pi_{\bar{\phi}}(a\mid s)+\zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)\right)$ by using the identity $\nabla_{x_{i}}\log\left(\sum_{i}\exp x_{i}\right)=\frac{\exp x_{i}}{\sum_{j}\exp x_{j}}$.
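(As a quick sanity check of this identity, separate from the proof itself, it can be verified numerically with automatic differentiation; the snippet below is only an illustration.)

import torch

# Gradient of log-sum-exp equals the softmax of its inputs.
x = torch.randn(5, requires_grad=True)
torch.logsumexp(x, dim=0).backward()
assert torch.allclose(x.grad, torch.softmax(x, dim=0))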

Let $\exp\left(C_{1}(a)\right)=(1-\zeta)\pi_{\bar{\phi}}(a\mid s)$ and $\exp\left(C_{2}(a)\right)=\zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)$. We have the following transformation:

\nabla_{a}\log\left((1-\zeta)\pi_{\bar{\phi}}(a\mid s)+\zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)\right) \quad (25)
= \nabla_{a}\log\left(\exp\left(C_{1}(a)\right)+\exp\left(C_{2}(a)\right)\right)
= \frac{D\,\frac{\partial}{\partial a}\log\pi_{\bar{\phi}}(a\mid s) + \alpha\,\frac{\partial}{\partial a}\log\pi_{\bar{\phi}}(a\mid s) + \beta\,\frac{\partial}{\partial a}Q_{\theta}(s,a)}{1+D},
\text{where } D = \exp\left(C_{1}(a)-C_{2}(a)\right).

After replacing $D$ with $\mathcal{X}$ and inserting Eq. (25) into Eq. (24), we obtain Eq. (23), i.e., Eq. (10) in the main text.
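To make the actor update concrete, the following is a minimal PyTorch-style sketch of the interpolated objective whose reparameterized gradient is Eq. (24), and hence, after the expansion of Eq. (25), Eq. (23). It is an illustrative sketch only: the function name and tensor layout are our own assumptions, and the greedy-policy log-density is assumed to be computed separately, e.g. using the normalization of Eq. (29).

import torch

def cac_actor_objective(log_pi, log_pi_bar, log_greedy, zeta):
    """Sketch of the interpolated actor objective (cf. Eq. (8c) in the main text).

    All inputs have shape (batch,) and are evaluated at reparameterized actions
    a = f_phi(eps; s), so that autograd through the actions reproduces Eq. (24):
      log_pi     : log pi_phi(a | s) of the current policy
      log_pi_bar : log pi_phi_bar(a | s) of the baseline (target) policy
      log_greedy : log G Q_{phi_bar, theta}(a | s), i.e. the normalized greedy
                   density alpha * log pi_bar + beta * Q_theta - log Z(s)
      zeta       : interpolation coefficient in (0, 1]
    """
    # log((1 - zeta) * pi_bar + zeta * greedy), computed with logsumexp
    # for numerical stability (the same log-sum-exp identity as in the proof).
    log_mix = torch.logsumexp(
        torch.stack([torch.log1p(torch.tensor(-zeta)) + log_pi_bar,
                     torch.log(torch.tensor(zeta)) + log_greedy]), dim=0)
    return (log_pi - log_mix).mean()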

A.3 Implementation details

This section presents implementation details of CAC with deep networks. Pseudo-code is provided in Algorithm 1.

On-policy replay buffer

To keep the algorithm off-policy, we approximate on-policy samples with an on-policy replay buffer $\mathcal{B}_{K}$ that stores the $K$ most recent samples, where $K$ is smaller than the size of the main replay buffer $\mathcal{B}$.
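For illustration, such a buffer can be a simple fixed-length FIFO kept alongside the main buffer; the sketch below is one possible minimal implementation (the class and method names are ours, not those of the paper's code).

import random
from collections import deque

class RecentReplayBuffer:
    """Minimal sketch of the on-policy buffer B_K: a FIFO holding only the
    K most recent transitions, kept alongside the much larger main buffer B."""

    def __init__(self, capacity_k):
        self.storage = deque(maxlen=capacity_k)  # oldest entries are evicted automatically

    def add(self, transition):
        # transition: e.g. a (state, action, reward, next_state, done) tuple
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))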

Advantage estimation

While it is possible to simply use the entropy-regularized advantage function $A_{\mathcal{I}}(s,a)=Q_{\mathcal{I}}(s,a)-V_{\mathcal{I}}(s)$ for computing $\zeta$, we are interested in studying the guidance provided by $\zeta$ when no entropy is involved, since it might give a more informative improvement direction. This corresponds to the case of (Kakade and Langford, 2002; Pirotta et al., 2013). To this end, we train another Q-network $Q_{\omega}$ by solving:

\omega \leftarrow \arg\min\,\mathbb{E}^{\mathcal{B}}\left[\left(Q_{\omega}(s,a)-y\right)^{2}\right], \quad\text{where } y = r + \gamma\,\mathbb{E}^{a\sim\pi_{\phi}(\cdot\mid s')}\left[Q_{\bar{\omega}}(s',a)\right], \quad (26)

where $\bar{\omega}$ is the target network. We then approximate the advantage as $A^{\pi}(s,a)=Q_{\omega}(s,a)-\mathbb{E}^{\pi_{\phi}}\left[Q_{\omega}(s,a)\right]$. While this estimate could be further improved with the recent generalized advantage estimation, we found the simple implementation above sufficient to stabilize learning.
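As a concrete, deliberately simplified illustration of Eq. (26) and the advantage estimate above, both the regression target and the baseline can be approximated with a handful of actions sampled from the current policy. The function names, tensor shapes, and the optional terminal mask below are our own assumptions.

import torch

def q_omega_target(reward, next_q_samples, gamma=0.99, done=None):
    """Sketch of the regression target y in Eq. (26).

    reward         : (B,) rewards r
    next_q_samples : (N, B) values Q_omega_bar(s', a') for N actions a' ~ pi_phi(.|s')
    done           : optional (B,) terminal mask (an assumption; Eq. (26) omits it)
    """
    with torch.no_grad():
        bootstrap = next_q_samples.mean(dim=0)  # Monte-Carlo estimate of E_{a ~ pi_phi}[Q_omega_bar]
        if done is not None:
            bootstrap = bootstrap * (1.0 - done)
        return reward + gamma * bootstrap

def advantage_estimate(q_sa, q_samples):
    """A^pi(s,a) ~ Q_omega(s,a) - E_{a ~ pi_phi}[Q_omega(s,a)].

    q_sa      : (B,) Q_omega at the buffer actions
    q_samples : (N, B) Q_omega at N actions freshly sampled from pi_phi
    """
    return q_sa - q_samples.mean(dim=0)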

Target smoothing

For the target Q-networks $Q_{\bar{\theta}}$ and $Q_{\bar{\omega}}$, we update the parameters using the moving average (Haarnoja et al., 2018):

\bar{\theta} \leftarrow \nu_{\bar{\theta}}\,\theta + (1-\nu_{\bar{\theta}})\,\bar{\theta}, \qquad \bar{\omega} \leftarrow \nu_{\bar{\omega}}\,\omega + (1-\nu_{\bar{\omega}})\,\bar{\omega}, \quad (27)

where $\nu_{\bar{\theta}}$ and $\nu_{\bar{\omega}}$ are the target smoothing coefficients. In the mixing step, we use the previous policy $\pi_{\bar{\phi}}$ rather than the current policy $\pi_{\phi}$ to stabilize training:

\mathbb{E}^{\mathcal{B}}\left[D_{\mathrm{KL}}\left(\pi_{\phi}\,\middle\|\,(1-\zeta)\pi_{\bar{\phi}}+\zeta\hat{\pi}\right)\right].

Thus, the target policy $\pi_{\bar{\phi}}$ corresponds to the monotonically improved policy of the CPI algorithm, which is not updated when performance oscillation happens. To reflect this, we update the weights of the target policy network as:

\bar{\phi} \leftarrow \zeta\nu_{\bar{\phi}}\,\phi + (1-\zeta\nu_{\bar{\phi}})\,\bar{\phi}, \quad (28)

where $\nu_{\bar{\phi}}$ is the target smoothing coefficient.
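The updates of Eqs. (27) and (28) are plain Polyak averaging; a minimal PyTorch-style sketch is given below (the function names are ours). Note that the effective step size of the target policy is $\zeta\nu_{\bar{\phi}}$, so $\pi_{\bar{\phi}}$ barely moves when $\zeta$ is small.

import torch

@torch.no_grad()
def soft_update(target_net, online_net, nu):
    """Moving-average update of Eq. (27): target <- nu * online + (1 - nu) * target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - nu).add_(nu * o_param)

@torch.no_grad()
def target_policy_update(target_actor, online_actor, nu_phi_bar, zeta):
    """Update of Eq. (28): the same averaging with effective step size zeta * nu_phi_bar."""
    soft_update(target_actor, online_actor, zeta * nu_phi_bar)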

Normalization factor estimation

Since the CAC algorithm requires the density of the reference policy $\hat{\pi}(a\mid s)=\pi_{\bar{\phi}}^{\alpha}(a\mid s)\exp\left(\beta Q_{\theta}(s,a)\right)\left(Z(s)\right)^{-1}$, we need to estimate the normalization factor $Z(s)$.

A simple approach to estimating $Z(s)$ is Monte-Carlo sampling with a proposal distribution $q$ that is easy to sample from:

Z(s) = \mathbb{E}^{q}\left[\frac{\pi_{\bar{\phi}}^{\alpha}(a\mid s)\exp\left(\beta Q_{\theta}(s,a)\right)}{q(a\mid s)}\right]. \quad (29)

The closer $q(\cdot\mid s)$ is to the reference policy $\hat{\pi}(\cdot\mid s)$, the better the accuracy of the $Z(s)$ approximation.
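For illustration, the Monte-Carlo estimate of Eq. (29) is conveniently computed in log space; the sketch below uses the current policy $\pi_{\phi}$ as the proposal $q$, in line with the discussion that follows (the function name and tensor layout are our own assumptions).

import math
import torch

def estimate_log_z(log_pi_bar, q_values, log_q, alpha, beta):
    """Log-space Monte-Carlo estimate of Z(s) in Eq. (29).

    For N proposal actions a_i ~ q(.|s) per state in a batch of size B:
      log_pi_bar : (N, B) log pi_phi_bar(a_i | s)
      q_values   : (N, B) Q_theta(s, a_i)
      log_q      : (N, B) log q(a_i | s), e.g. of the current policy pi_phi
    Returns an estimate of log Z(s) with shape (B,).
    """
    n = log_pi_bar.shape[0]
    # log of the importance weights pi_bar^alpha * exp(beta * Q) / q
    log_w = alpha * log_pi_bar + beta * q_values - log_q
    # log( (1/N) * sum_i exp(log_w_i) ), computed stably with logsumexp
    return torch.logsumexp(log_w, dim=0) - math.log(n)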

Theorem 4.3 indicates that, by choosing the current policy $\pi$ as the proposal distribution, we can control the closeness of the two distributions, and hence the accuracy of the MC approximation, by changing the entropy regularization weighting coefficients. We empirically study the effect of entropy regularization on this closeness and accuracy when $\zeta\leq 1$.

We use the pendulum environment from Fu et al. (2019), where the dynamics are discretized so that we can compute oracle values such as the KL divergence between the current and the reference policy. The hyperparameters used in this experiment are listed in Section A.4. Figure 3 shows how the learning behavior of CAC changes as the interpolation coefficient $\zeta$ and the KL regularization weight $\tau$ vary: when KL regularization is present, the approximation quality of $Z(s)$ improves significantly.



Figure 3: Left: the learning curves of CAC with different parameters on pendulum. Middle: the maximum KL divergence over all the states. Right: the maximum error of the log-scaled $Z(s)$ approximation over all the states.
1:  Initialize parameter vectors $\theta$, $\phi$, $\bar{\theta}$, $\bar{\phi}$, $\omega$, $\bar{\omega}$
2:  Initialize variables $\tilde{\mathbb{A}}$ and $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$
3:  for each iteration do
4:     Collect transitions with $\pi_{\phi}$ and add them to $\mathcal{B}$ and $\mathcal{B}_{K}$
5:     for each gradient step do
6:        Update $\theta$ with one step of SGD using Eq. (8b)
7:        Update $\omega$ with one step of SGD using Eq. (26)
8:        Update $\tilde{\mathbb{A}}$ and $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$ using Eq. (12)
9:        Update $\phi$ with one step of SGD using Eq. (8c)
10:       Update $\zeta$ using Eq. (11)
11:       $\bar{\theta}\leftarrow\nu_{\bar{\theta}}\theta+(1-\nu_{\bar{\theta}})\bar{\theta}$
12:       $\bar{\omega}\leftarrow\nu_{\bar{\omega}}\omega+(1-\nu_{\bar{\omega}})\bar{\omega}$
13:       $\bar{\phi}\leftarrow\zeta\nu_{\bar{\phi}}\phi+(1-\zeta\nu_{\bar{\phi}})\bar{\phi}$
14:     end for
15:  end for
Algorithm 1 Cautious Actor-Critic
Table 2: Hyperparameters of off-policy algorithms in MuJoCo tasks
Parameter  Value
Shared
  optimizer  Adam
  learning rate  $10^{-3}$
  discount factor ($\gamma$)  0.99
  replay buffer size ($\mathcal{B}$)  $10^{6}$
  number of hidden layers  2
  number of hidden units per layer  256
  number of samples per minibatch  100
  activations  ReLU
TD3
  stddev for Gaussian noise  0.1
  stddev for target smoothing noise  0.2
  policy delay  2
SAC
  entropy coefficient ($\kappa$)  0.2
  $\bar{\theta}$ smoothing coefficient  0.995
CAC
  entropy coefficient ($\kappa$)  0.2
  KL coefficient ($\tau$)  0.1
  $\bar{\theta}$ smoothing coefficient ($\nu_{\bar{\theta}}$)  0.995
  $\bar{\omega}$ smoothing coefficient ($\nu_{\bar{\omega}}$)  0.995
  $\bar{\phi}$ smoothing coefficient ($\nu_{\bar{\phi}}$)  0.9999
  $\nu_{A}$  0.01
  $\nu_{A^{\mathrm{MaxDiff}}}$  0.001
  size of $\mathcal{B}_{K}$  1000
  if-else update  $c=M$
Table 3: Hyperparameters of PPO in MuJoCo tasks
Parameter  Value
PPO
  optimizer  Adam
  value function learning rate  $10^{-3}$
  policy learning rate  $3\times 10^{-4}$
  discount factor ($\gamma$)  0.99
  number of hidden layers  2
  number of hidden units per layer  256
  number of samples per minibatch  100
  activations  ReLU
  number of samples per update  80
  policy objective clipping coefficient  0.2
Table 4: Hyperparameters of CAC in the pendulum task
Parameter  Value
CAC
  optimizer  Adam
  learning rate  $10^{-3}$
  discount factor ($\gamma$)  0.99
  replay buffer size ($\mathcal{B}$)  $10^{6}$
  number of hidden layers  2
  number of hidden units per layer  256
  number of samples per minibatch  32
  activations  ReLU
  entropy coefficient ($\kappa$)  0.2
  $\bar{\theta}$ smoothing coefficient  0.995
  $\bar{\phi}$ smoothing coefficient  0.995
  size of $\mathcal{B}_{K}$  1000
  if-else update  $c=M$

A.4 Hyperparameters

This section lists the hyperparameters used in the comparative evaluation of Section 5.1.