
Cautious Actor-Critic

Lingwei Zhu   [email protected]
Toshinori Kitamura   [email protected]
Takamitsu Matsubara   [email protected]
Nara Institute of Science and Technology, Japan
(Equal contribution)
Abstract

The oscillating performance of off-policy learning and persistent errors in the actor-critic (AC) setting call for algorithms that can learn conservatively and thus better suit stability-critical applications. In this paper, we propose a novel off-policy AC algorithm, cautious actor-critic (CAC). The name cautious comes from its doubly conservative nature: we exploit the classic policy interpolation of conservative policy iteration for the actor and the entropy regularization of conservative value iteration for the critic. Our key observation is that the entropy-regularized critic facilitates and simplifies the unwieldy interpolated actor update while still ensuring robust policy improvement. We compare CAC to state-of-the-art AC methods on a set of challenging continuous control problems and demonstrate that CAC achieves comparable performance while significantly stabilizing learning.

keywords:
Cautious Reinforcement Learning; Policy Oscillation; Monotonic Improvement; Entropy-Regularized Markov Decision Process

1 Introduction

Actor-critic (AC) methods in reinforcement learning (RL) have been attracting increasing interest due to their scalability to large-scale problems: they can learn from both on-policy and off-policy samples, handle continuous action spaces (Lillicrap et al., 2015; Schulman et al., 2015), and operate in both model-free and model-based settings (Haarnoja et al., 2018; Hafner et al., 2020). Recently, the model-free setting has seen a surge of off-policy AC methods (Wang et al., 2017; Haarnoja et al., 2018; Fakoor et al., 2020). While these methods are sample-efficient in exploiting off-policy samples for continuous control, those very samples often bring oscillating performance during learning as a side effect of distribution mismatch. The oscillating performance of off-policy learning and persistent errors in the AC setting (Fujimoto et al., 2018) call for algorithms that can learn conservatively to better suit stability-critical applications.

The performance oscillation and degradation problems have been widely discussed in the approximate dynamic programming (ADP) literature (Wagner, 2011; Bertsekas, 2011), which has motivated learning algorithms that are efficient against various sources of error. The seminal work of Kakade and Langford (2002) proposes a principled approach to tackling performance degradation by leveraging policy interpolation, which is conservative in that it reduces the greediness of the updated policy. However, despite its strong theoretical guarantees, two drawbacks limit its use in the AC setting: (1) it is difficult to obtain a reliable reference policy in high-dimensional continuous state-action spaces; (2) the interpolation is often regarded as inconvenient to use, and it is unclear how to design the interpolation in continuous action spaces. In practice, two popular variants (Schulman et al., 2015, 2017) that sidestep the interpolation and directly approximate the updated policy are more often used in the AC setting. On the other hand, the recently flourishing entropy-regularized ADP literature (Azar et al., 2012; Fox et al., 2016; Kozuno et al., 2019; Vieillard et al., 2020a) also features conservative learning (Kozuno et al., 2019), as these methods average over past value functions (Vieillard et al., 2020a). Though they do not explicitly address performance oscillation, they have been empirically verified to be error-tolerant and yield state-of-the-art performance on a wide range of tasks. Extending this conservative learning to AC has been studied (Nachum et al., 2018; Haarnoja et al., 2018). However, the resulting conservativeness exists only in the critic: in challenging tasks, performance degradation and oscillation still occur.

This paper aims to tackle the performance oscillation problem of off-policy AC by proposing a novel algorithm: cautious actor-critic (CAC), where the name cautious comes from its doubly conservative nature, as we combine a conservative actor leveraging the concept of conservative policy iteration (CPI) (Kakade and Langford, 2002) with a conservative critic exploiting the entropy regularization of conservative value iteration (CVI) (Kozuno et al., 2019). The key observation is that the entropy-regularized critic can provide error-tolerant reference policies and simplifies the unwieldy interpolated actor update while still ensuring robust policy improvement. CAC leverages an automatically adjusted interpolation coefficient to reflect its confidence during learning: when performance oscillation is likely to happen, CAC behaves cautiously and relies more on the validated previous policy than on the new policy. Our novel interpolation design is inspired by a very recent study in the ADP literature (Vieillard et al., 2020b) but adapted to the continuous-action AC setting.

The rest of the paper is organized as follows: we provide a short survey of related work in Section 2, followed by preliminaries on RL and relevant AC methods in Section 3. Section 4 presents CAC; in particular, we discuss our novel design of the interpolation scheme, which is central to CAC. We evaluate CAC in Section 5 on a set of challenging continuous control problems and show that it achieves performance comparable to state-of-the-art AC methods while significantly stabilizing learning. An ablation study is conducted to distinguish CAC from existing methods. Discussion and conclusion are given in Section 6. Due to page limits, derivations and implementation details are presented in the Appendix of the supplemental file.

2 Related Work

It has been noticed that various sources of error, such as approximation error in AC algorithms (Fujimoto et al., 2018; Fu et al., 2019), cause performance degradation and oscillation during learning. In this section, we briefly survey related works that (partially) tackle this problem and outline our contributions.

Robust AC. Algorithms learning from off-policy data are sample-efficient but are also at risk of divergence. One approach to the divergence problem is to incorporate the importance sampling (IS) ratio (Precup et al., 2001). However, the resulting algorithms typically have large variance since the IS ratio is a product of many potentially unbounded terms. Munos et al. (2016) proposed to clip the IS ratio and proved that the resulting algorithm, Retrace($\lambda$), attains the globally optimal policy. Retrace($\lambda$) has motivated recent successful AC methods (Wang et al., 2017; Fakoor et al., 2020) that exploit both on- and off-policy samples for better stability while retaining sample efficiency. The robustness comes from the fact that, thanks to the clipping, any off-policy samples can be used without causing divergence or wild variance. However, one still has to trade off learning speed against learning stability via the user-defined clipping threshold; if stability is favored too strongly, the agent might fail to learn meaningful behaviors.

Entropy-regularized AC. The recently flourishing entropy-regularized ADP literature has established that augmenting the reward with Shannon entropy renders the optimal policy multi-modal and hence robust against adversarial settings (Nachum et al., 2017; Haarnoja et al., 2017; Ahmed et al., 2019). Another popular candidate is the relative entropy, or Kullback-Leibler (KL) divergence, which renders the optimal policy an average of all past value functions (Azar et al., 2012; Fox et al., 2016; Kozuno et al., 2019; Vieillard et al., 2020a). This is more conservative and robust under mild assumptions, such as the sequence of errors being a martingale difference under the natural filtration (Azar et al., 2012). These methods have been extended to the AC setting, including the state-of-the-art method of Haarnoja et al. (2018), which exploits the Shannon entropy, and that of Nachum et al. (2018), which leverages the KL divergence. These methods demonstrate strong empirical performance on a wide range of tasks. However, it should be noted that the conservativeness brought by the entropy-regularized reward augmentation exists only in the critic. Since the average is prone to outliers, performance degradation can still happen if the value function estimates at some iterations are poor.

Conservative Policy Iteration. Tackling performance oscillation has been widely discussed in the ADP literature (Wagner, 2011; Bertsekas, 2011), of which the seminal CPI algorithm (Kakade and Langford, 2002) has inspired many conservative learning schemes with strong theoretical guarantees for per-update improvement (Pirotta et al., 2013; Abbasi-Yadkori et al., 2016). However, CPI has seen limited application in the AC setting due to two main drawbacks: (1) it assumes a good reference policy, which is typically difficult to obtain in high-dimensional continuous state-action spaces; (2) the coefficient interpolating the reference policy and the current policy depends on the horizon of learning, which is typically short in ADP scenarios. In the AC setting, which features long learning horizons, this coefficient becomes vanishingly small and hence significantly hinders learning. A very recent work extended CPI to learning with deep networks and demonstrated good performance on Atari games (Vieillard et al., 2020b); however, it is limited to discrete action spaces, while our method mainly focuses on continuous action spaces and can easily be adapted to the discrete-action setting. The above-mentioned drawbacks render CPI generally perceived as unwieldy (Schulman et al., 2015).

Trust-region Methods. Motivated by the above-mentioned drawbacks of CPI, two popular variants, trust region policy optimization (TRPO) (Schulman et al., 2015) and its improved version proximal policy optimization (PPO) (Schulman et al., 2017), sidestep the interpolation and directly approximate the resulting conservative policy. TRPO and PPO are welcome choices for learning from scratch when the reference policy is unavailable or unreliable, but they ignore this knowledge when a good reference policy is at our disposal. Further, TRPO and PPO require on-policy samples, which are expensive since each sample can be used only once and is then discarded.

Contribution. The main contributions of this paper are:

  • To the best of the authors' knowledge, CAC is the first off-policy AC method that applies the interpolation of CPI-based algorithms to stabilize off-policy learning.

  • A novel interpolation coefficient design suitable for high-dimensional continuous state-action spaces. Previously, only a design for discrete action spaces was available (Vieillard et al., 2020b).

  • An evaluation of CAC on a set of benchmark continuous control problems demonstrating that CAC achieves performance comparable to state-of-the-art AC methods while significantly stabilizing learning.

3 Preliminary

3.1 Reinforcement Learning

RL problems are mostly formulated as Markov Decision Processes (MDPs) defined by the tuple $(\mathcal{S},\mathcal{A},P,\mathcal{R},\gamma)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the (possibly continuous) action space, $P(s'\mid s,a)$ is the transition probability from $s$ to $s'$ under action $a$, and $\mathcal{R}$ is the reward function with $\mathcal{R}(s,a)\in[-r_{\text{max}},r_{\text{max}}]$ denoting the immediate reward associated with that transition. We also use $r_t$ as a shorthand for $\mathcal{R}(s_t,a_t)$ at the $t$-th step. $\gamma\in(0,1)$ is the discount factor. We define $\mathbf{P}$ as a left operator such that $(\mathbf{P}V)(s,a)=\sum_{s'}P(s'\mid s,a)V(s')$ for some $V$. A policy $\pi$ maps states to a probability distribution over the action space. We define the stationary state distribution induced by $\pi$ as the unnormalized occupancy measure $d^{\pi}(s)=\sum_{t=0}^{\infty}\gamma^{t}P(s_{t}=s\mid\pi)$. The goal of RL is to find an optimal policy $\pi^{*}$ that maximizes the long-term discounted reward $J^{\pi}=\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$. Equivalently, this optimal policy also maximizes the state value function for all states $s$:

$$
\pi^{*}=\operatorname*{arg\,max}_{\pi}V^{\pi}(s)=\operatorname*{arg\,max}_{\pi}\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\Big|\,s_{0}=s\right].
$$

The state-action value counterpart $Q^{\pi^{*}}$ is more frequently used in the control context:

$$
Q^{\pi^{*}}(s,a)=\max_{\pi}\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\,\Big|\,s_{0}=s,a_{0}=a\right].
$$

We define the advantage function for a policy $\pi$ as $A^{\pi}(s,a)=Q^{\pi}(s,a)-V^{\pi}(s)$, and the expected advantage function of policy $\pi'$ over $\pi$ as:

$$
A^{\pi'}_{\pi}(s)=\mathbb{E}^{\pi'}\left[Q^{\pi}(s,a)\right]-V^{\pi}(s).
$$
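To make these quantities concrete, the following minimal sketch (ours, not part of the paper) computes a discounted return and a one-step advantage estimate; the rollout and value estimates are purely hypothetical numbers.

```python
# Minimal illustration (hypothetical numbers) of the discounted return and the
# advantage A^pi(s, a) = Q^pi(s, a) - V^pi(s) defined above.
import numpy as np

gamma = 0.99
rewards = np.array([1.0, 0.0, 0.5, 1.0])           # r_0, ..., r_3 from one rollout

# Discounted return sum_t gamma^t r_t starting from s_0.
discounted_return = np.sum(gamma ** np.arange(len(rewards)) * rewards)

# With (hypothetical) value estimates and a one-step bootstrap Q(s, a) ~ r + gamma * V(s'),
# the advantage is A(s, a) = Q(s, a) - V(s).
v_s, v_s_next, r = 2.0, 1.8, 1.0
advantage = r + gamma * v_s_next - v_s
print(discounted_return, advantage)
```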

3.2 Actor Critic methods

In this section we briefly introduce recent actor-critic algorithms, discuss their pros and cons, and shed light on our proposal in Section 4.

3.2.1 Trust Region Methods

TRPO exploits the policy improvement lemma of Kakade and Langford (2002) to ensure approximately monotonic policy improvement. However, unlike Kakade and Langford (2002), where at the $k$-th iteration the policy is updated as $\pi_{k+1}=\zeta\pi'+(1-\zeta)\pi_{k}$ with $\pi'$ the greedy policy, TRPO directly computes $\pi_{k+1}$ without resorting to $\pi'$. Specifically, TRPO has the following update rule:

$$
\begin{split}
&J^{\text{TRPO}}_{\pi_{k}}(\pi):=\operatorname*{arg\,max}_{\pi}\mathbb{E}^{\pi,d^{\pi_{k}}}\left[A^{\pi_{k}}\right],\\
&\text{subject to}\quad C_{\gamma}\,\Delta_{\pi}\,D^{\text{max}}_{KL}(\pi_{k}\,\|\,\pi)\leq\delta,\\
&\text{with}\quad \Delta_{\pi}=\max_{s,a}|A^{\pi}(s,a)|,
\end{split}\tag{1}
$$

where $C_{\gamma}$ is a horizon-dependent constant, $D_{KL}$ is the KL divergence, and $\delta$ is the trust-region parameter.

As computing $J^{\text{TRPO}}_{\pi_{k}}(\pi)$ requires sampling according to the stationary distribution $d^{\pi_{k}}$, TRPO is inherently an on-policy algorithm, which is not sample-efficient: each sample can be used only once and is then discarded.

3.2.2 Off-policy Maximum Entropy Actor-Critic

As a state-of-the-art model-free off-policy AC algorithm, soft actor-critic (SAC) (Haarnoja et al., 2018) maximizes not only the task reward but also the Shannon entropy of the policy. The entropy term renders the optimal policy multi-modal, as opposed to the deterministic policies of algorithms that solely maximize task reward; this multi-modality is beneficial (Haarnoja et al., 2017) and has demonstrated superior sample efficiency through more effective exploration of the state-action space. Written in the ADP manner, SAC has the following update rule (we drop the state-action pair of the $Q$ function for simplicity):

$$
\begin{split}
&\begin{cases}
\pi_{k+1}\leftarrow\operatorname*{arg\,max}_{\pi}\mathbb{E}^{\pi}\left[Q^{\pi_{k}}_{\mathcal{H}}(s,a)+\kappa\mathcal{H}\left(\pi(\cdot|s_{t})\right)\right]\\
Q^{\pi_{k+1}}_{\mathcal{H}}\leftarrow\mathcal{R}(s,a)+\gamma\left(\mathbf{P}V^{\pi_{k}}_{\mathcal{H}}\right)(s,a)
\end{cases}\\
&\text{where } V^{\pi}_{\mathcal{H}}(s)=\sum_{t\geq 0}\gamma^{t}\mathbb{E}^{\pi}\left[\mathcal{R}(s_{t},a_{t})+\kappa\mathcal{H}\left(\pi(\cdot|s_{t})\right)\,\Big|\,s_{0}=s\right].
\end{split}\tag{2}
$$

$\mathcal{H}(\pi):=-\sum_{a}\pi(a|s)\log\pi(a|s)$ denotes the Shannon entropy of policy $\pi$, $\kappa$ denotes the weighting coefficient, and $V^{\pi}_{\mathcal{H}}$ denotes the soft value function regularized with the Shannon entropy. SAC performs a one-step look-ahead for updating the actor: states are randomly sampled from a replay buffer, and actions are then generated by feeding the states into the policy network (Haarnoja et al., 2018). As such, SAC does not need an IS ratio, but it has been observed that SAC often oscillates wildly in performance.
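For concreteness, below is a minimal PyTorch sketch of the SAC-style one-step look-ahead actor update described above. The networks, batch, and coefficient are hypothetical stand-ins, and practical details such as the tanh squashing and twin critics are omitted.

```python
# Sketch of the Eq. (2) actor step: states from a replay buffer, reparameterized
# actions from the policy, and maximization of E[Q(s, a) + kappa * H(pi(.|s))].
import torch
import torch.nn as nn

state_dim, action_dim, kappa = 8, 2, 0.2
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

states = torch.randn(32, state_dim)                 # a batch sampled from the replay buffer
mean, log_std = actor(states).chunk(2, dim=-1)
dist = torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())
actions = dist.rsample()                            # reparameterized one-step look-ahead
log_prob = dist.log_prob(actions).sum(-1, keepdim=True)

q = critic(torch.cat([states, actions], dim=-1))
actor_loss = (kappa * log_prob - q).mean()          # minimize -E[Q + kappa * H]
opt.zero_grad(); actor_loss.backward(); opt.step()
```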

4 Cautious Actor Critic

In this section we present CAC, an off-policy actor-critic method capable of learning conservatively against performance oscillation and degradation.

4.1 CAC Algorithm

For ease of understanding, we write CAC in the following approximate policy iteration style (Vieillard et al., 2020a). Specifically, the first step corresponds to the policy (actor) improvement and the last step corresponds to the interpolation:

$$
\begin{split}
&\text{CAC}\;\begin{cases}
\pi_{k+1}\leftarrow\operatorname*{arg\,max}_{\pi}\mathbb{E}^{\pi}\left[Q^{\pi_{k}}_{\mathcal{I}}(s,a)+\mathcal{I}_{\pi_{k}}^{\pi}(s)\right]\\
Q^{\pi_{k+1}}_{\mathcal{I}}\leftarrow\mathcal{R}(s,a)+\gamma\left(\mathbf{P}V^{\pi_{k+1}}_{\mathcal{I}}\right)(s,a)\\
\zeta\leftarrow(\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},\mathcal{B}_{K}}\left[A^{\pi_{k}}(s,a)\right]\right)\\
\tilde{\pi}_{k+1}\leftarrow\zeta\pi_{k+1}+(1-\zeta)\pi_{k}
\end{cases}\\
&\text{with } V^{\pi_{k+1}}_{\mathcal{I}}(s)=\sum_{t\geq 0}\gamma^{t}\mathbb{E}^{\pi}\left[\mathcal{R}(s_{t},a_{t})+\mathcal{I}_{\pi_{k}}^{\pi_{k+1}}(s_{t})\,\Big|\,s_{0}=s\right],\\
&\mathcal{I}_{\pi_{k}}^{\pi_{k+1}}(s)=\mathbb{E}^{\pi_{k+1}}\left[-\kappa\log\pi_{k+1}(a|s)-\tau\log\frac{\pi_{k+1}(a|s)}{\pi_{k}(a|s)}\right],
\end{split}\tag{3}
$$

where $\zeta$ is the interpolation coefficient computed by $\zeta^{*}$ in Eq. (7), $\mathcal{B}_{K}$ denotes the on-policy replay buffer, with $K$ indicating the number of steps up to now, and $\kappa,\tau$ denote the Shannon entropy and KL divergence regularization coefficients, respectively. $Q_{\mathcal{I}},V_{\mathcal{I}}$ (and hence $A_{\mathcal{I}}$) denote the entropy-regularized value functions. For uncluttered notation, in the rest of the paper we drop the subscript $\mathcal{I}$. Except for the computation of $\zeta$, all other steps are computed using samples from the off-policy replay buffer $\mathcal{B}$. For later convenience, we define $\alpha:=\frac{\kappa}{\kappa+\tau}$ and $\beta:=\frac{1}{\kappa+\tau}$. Note that our use of both on- and off-policy replay buffers renders CAC similar in spirit to (Gu et al., 2017; Wang et al., 2017; Fakoor et al., 2020).

We first compute the greedy policy $\pi_{k+1}$. Due to Fenchel conjugacy (Geist et al., 2019), when $\mathcal{I}_{\pi_{k}}^{\pi_{k+1}}$ is included in the $\operatorname*{arg\,max}$, the maximizer policy can be derived analytically as $\pi_{k+1}(a|s)\propto\pi^{\alpha}_{k}(a|s)\exp\left(\beta Q^{\pi_{k}}(s,a)\right)$ (Kozuno et al., 2019). Then the entropy-regularized action value function $Q^{\pi_{k+1}}$ is evaluated. In the third step, we use the on-policy replay buffer $\mathcal{B}_{K}$ to compute $\zeta$ as described in Eq. (7). Finally, the policy that is optimal in the sense of guaranteeing policy improvement is obtained by interpolating $\pi_{k+1}$ with $\pi_{k}$. We present the following theorem on the convergence of CAC in the tabular case.
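As a rough illustration of the four steps of Eq. (3), the following tabular sketch runs one CAC iteration on a random toy MDP. The MDP, the coefficients, and in particular the simplified $\zeta$ estimate are illustrative assumptions, not the authors' implementation (which uses neural networks, replay buffers, and Eq. (7)).

```python
# One tabular CAC iteration, Eq. (3), on a random toy MDP (illustration only).
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma, kappa, tau = 5, 3, 0.95, 0.1, 0.1
alpha, beta = kappa / (kappa + tau), 1.0 / (kappa + tau)

P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
R = rng.uniform(-1, 1, size=(S, A))                 # R[s, a]
pi_k = np.full((S, A), 1.0 / A)                     # current policy
Q = np.zeros((S, A))                                # current critic estimate

# 1) Actor improvement: pi_{k+1}(a|s) proportional to pi_k(a|s)^alpha * exp(beta * Q(s, a)).
logits = alpha * np.log(pi_k) + beta * Q
pi_next = np.exp(logits - logits.max(axis=1, keepdims=True))
pi_next /= pi_next.sum(axis=1, keepdims=True)

# 2) Critic evaluation of pi_{k+1} with the entropy/KL-augmented reward.
I = -np.sum(pi_next * (kappa * np.log(pi_next) + tau * (np.log(pi_next) - np.log(pi_k))), axis=1)
for _ in range(200):
    V = np.sum(pi_next * Q, axis=1) + I             # regularized state value
    Q = R + gamma * (P @ V)

# 3) Interpolation coefficient from advantage estimates (simplified stand-in for Eq. (7)).
adv = np.sum(pi_next * Q, axis=1) - np.sum(pi_k * Q, axis=1)
zeta = np.clip(adv.mean() / (np.ptp(adv) + 1e-8), 0.0, 1.0)

# 4) Cautious actor update: interpolate the new and previous policies.
pi_tilde = zeta * pi_next + (1 - zeta) * pi_k
print("zeta =", zeta)
```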

Theorem 4.1.

Repeated application of CAC, Eq. (3), to any initial policy $\pi$ converges to the entropy-regularized optimal policy $\pi^{*}(a|s)=\frac{\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}{\int_{a\in\mathcal{A}}\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}$.

Proof 4.2.

See Section A.1 in the Appendix.

From Theorem 4.1 we see that the optimal policy and the corresponding optimum of the MDP are biased by the choice of $\kappa$. If we gradually decay the value of $\kappa$, we recover the optimum of the non-regularized MDP (Vieillard et al., 2020a). In the following sections, we describe the CAC actor and critic in detail, as well as the derivation of the actor gradient expression and the practical design of the interpolation coefficient $\zeta$.

4.1.1 Conservative Actor

As discussed in Section 3.2.1, at the $k$-th iteration TRPO directly constructs a new policy $\pi$ by maximizing $J^{\text{TRPO}}_{\pi_{k}}(\pi)$. This is useful when the agent learns from scratch, but it discards the reference policy when one is available. We instead follow the exact form of Kakade and Langford (2002) and take the reference policy into account, choosing $\pi_{k+1}$ as the reference policy $\pi'$:

$$
\tilde{\pi}_{k+1}=\zeta\pi_{k+1}+(1-\zeta)\pi_{k}.\tag{4}
$$

Our objective function $J_{\pi_{k},\pi_{k+1}}^{\text{CAC}}(\pi)$ explicitly features the knowledge of the reference policy (Pirotta et al., 2013). Specifically, the objective can be lower-bounded as:

$$
\begin{split}
J_{\pi_{k},\pi_{k+1}}^{\text{CAC}}(\pi)&:=\mathbb{E}^{\pi_{k+1},d^{\pi_{k+1}}}\left[A^{\pi_{k}}(s,a)\right]\\
&\geq C'_{\gamma}\,(v\,\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right)^{2},\\
\text{given }\zeta^{*}&=2\,C'_{\gamma}(v\,\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right),\\
\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}&=\max_{s,s'}\left|A^{\pi_{k+1}}_{\pi_{k}}(s)-A^{\pi_{k+1}}_{\pi_{k}}(s')\right|,\\
v&=\max_{s}D_{TV}\left(\pi_{k+1}(\cdot|s)\,\|\,\pi_{k}(\cdot|s)\right),
\end{split}\tag{5}
$$

where $D_{TV}$ denotes the total variation. $v$, $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$, and the expectation with respect to $\pi_{k+1},d^{\pi_{k}}$ require estimation, and $C'_{\gamma}$ absorbs the horizon-dependent constants. Hence, by optimizing the lower bound of $J_{\pi_{k},\pi'}^{\text{CAC}}(\pi)$, we can achieve guaranteed improvement.

In the existing literature (Kakade and Langford, 2002; Pirotta et al., 2013; Abbasi-Yadkori et al., 2016), the difficulties of extending Eq. (5) to large-scale problems are: (1) preparing a reliable reference policy in high-dimensional continuous state-action spaces is difficult; (2) it is hard to accurately estimate $v$, the maximum total variation between two policies, without enforcing a gradual change of policies, which is absent in those works; naively using $v\leq 2$ as suggested by Pirotta et al. (2013) often yields vanishingly small $\zeta$, which significantly hinders learning; (3) the horizon-dependent constant $C'_{\gamma}$ developed in the classic ADP literature is not suitable for learning with deep networks, which features long learning horizons. As demonstrated in the following sections, we tackle the first and second problems by leveraging an entropy-regularized critic, and the third problem via a novel design of $\zeta$ inspired by a very recent work on discrete-action problems (Vieillard et al., 2020b).

4.1.2 Conservative Critic

From Eq. (5) we see that, to obtain a meaningful interpolation coefficient $\zeta$, one is required to accurately estimate the maximum total deviation $v$, which is intractable in high-dimensional continuous action spaces. However, by introducing an entropy-regularized critic, we can leverage the following theorem to avoid estimating $v$:

Theorem 4.3.

(Kozuno et al., 2019, Proposition 3) For any two consecutive entropy-regularized policies $\pi_{k},\pi_{k+1}$ generated by Eq. (3), the following bound on their maximum total deviation holds:

$$
\max_{s}D_{TV}\left(\pi_{k+1}(\cdot|s)\,\|\,\pi_{k}(\cdot|s)\right)\leq\sqrt{4B_{k}+2C_{k}},\quad\text{where } B_{k}:=\frac{1-\gamma^{k}}{1-\gamma}\epsilon\beta,\quad C_{k}:=r_{\text{max}}\beta\sum_{j=0}^{k-1}\alpha^{j}\gamma^{k-j-1},\tag{6}
$$

and $\epsilon$ is the uniform upper bound of the errors.

Recall from Section 4.1 that $\alpha:=\frac{\kappa}{\kappa+\tau}$ and $\beta:=\frac{1}{\kappa+\tau}$. It has been proved in (Kozuno et al., 2019) that this bound is non-improvable, i.e., there exists an MDP for which the inequality becomes an equality.
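For intuition, a small sketch evaluating the bound of Theorem 4.3 for illustrative (assumed) values of $\kappa,\tau,\gamma,r_{\text{max}}$ and the error bound $\epsilon$:

```python
# Evaluate sqrt(4*B_k + 2*C_k) from Theorem 4.3 for illustrative constants.
import numpy as np

kappa, tau, gamma, r_max, eps, k = 0.1, 0.1, 0.99, 1.0, 0.01, 50
alpha, beta = kappa / (kappa + tau), 1.0 / (kappa + tau)

B_k = (1 - gamma ** k) / (1 - gamma) * eps * beta
C_k = r_max * beta * sum(alpha ** j * gamma ** (k - j - 1) for j in range(k))
tv_bound = np.sqrt(4 * B_k + 2 * C_k)
print(B_k, C_k, tv_bound)    # upper-bounds max_s D_TV(pi_{k+1}(.|s) || pi_k(.|s))
```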

By leveraging an entropy-regularized critic, the objective in Eq. (5) becomes:

$$
\begin{split}
J_{\pi_{k},\pi_{k+1}}^{\text{CAC}}(\pi)&\geq C'_{\gamma}C_{k}\,(\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right)^{2},\\
\text{given }\zeta^{*}&=2\,C'_{\gamma}C_{k}(\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}})^{-1}\left(\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]\right),
\end{split}\tag{7}
$$

where $C'_{\gamma}$ absorbs horizon-dependent constants and $C_{k}$ is from Theorem 4.3. Estimating the optimal $\zeta^{*}$ now requires estimating the expectation with respect to $\pi_{k+1},d^{\pi_{k}}$ and $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$, which has been studied by (Vieillard et al., 2020b).

The KL divergence is also important for generating reasonable reference policies, even for high-dimensional or continuous-action problems. Consider the following upper bound due to (Vieillard et al., 2020a), where the reward is augmented by the KL divergence:

$$
\left\|Q^{*}-Q^{\pi_{k+1}}\right\|_{\infty}\leq\frac{2}{1-\gamma}\left\|\frac{1}{k}\sum_{j=0}^{k}\epsilon_{j}\right\|_{\infty}+\frac{4}{1-\gamma}\frac{V_{\max}}{k},
$$

where $\epsilon_{j}$ are the errors and $V_{\max}=\frac{r_{\text{max}}}{1-\gamma}$. Comparing it with the non-improvable approximate modified policy iteration (AMPI) bound, where the reward is not augmented (Scherrer et al., 2015):

$$
\left\|Q^{*}-Q^{\pi_{k+1}}\right\|_{\infty}\leq\left((1-\gamma)\sum_{j=1}^{k}\left\|\epsilon_{j}\right\|_{\infty}\right)+\frac{2\gamma^{k+1}}{1-\gamma}V_{\max},
$$

we see that the error term in the KL-regularized case is a sup over a sum of errors. Under mild assumptions, such as the $\epsilon_{j}$ being i.i.d. under the natural filtration (Azar et al., 2012), the summation of errors asymptotically cancels out. On the other hand, the error term of AMPI depends on the summation of the per-iteration maxima, which is typically large. Further, the dependence of the error on the horizon is linear in $\frac{1}{1-\gamma}$ rather than quadratic, which is a significant improvement since typically $\gamma\approx 1$.

4.1.3 Network Optimization Perspective

Given the above ADP-style characterization of both the actor and the critic, we now examine Eq. (3) from the optimization perspective. Suppose the critic is parametrized by a network with parameters $\theta$ and the actor by a network with parameters $\phi$. CAC updates the network weights $\theta,\phi$ by solving the following minimization problems:

$$
\begin{aligned}
y&=r+\gamma\left(\mathbb{E}^{a\sim\pi_{\phi}}\left[Q_{\bar{\theta}}(s',a)\right]+\mathcal{I}_{\pi_{\bar{\phi}}}^{\pi_{\phi}}(s')\right), &&\text{(8a)}\\
\theta&\leftarrow\arg\min\mathbb{E}^{\mathcal{B}}\left[\left(Q_{\theta}(s,a)-y\right)^{2}\right], &&\text{(8b)}\\
\phi&\leftarrow\arg\min\mathbb{E}^{\mathcal{B}}\left[D_{KL}\left(\pi_{\phi}\,\|\,(1-\zeta)\pi_{\bar{\phi}}+\zeta\mathcal{G}_{\pi_{\bar{\phi}}}Q_{\theta}\right)\right], &&\text{(8c)}
\end{aligned}
$$

where the update of $\phi$ corresponds to solving an information projection problem. This is because the policies $\pi_{k+1},\pi_{k}$ are Boltzmann softmax (Geist et al., 2019) but their mixture is generally not Boltzmann, which might result in the loss of desirable properties. By the following theorem, in the ideal case we can find a Boltzmann policy $\pi_{\phi}$ that perfectly represents the interpolation by solving the information projection problem.
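A minimal PyTorch sketch of the critic update of Eqs. (8a)-(8b) is given below: a one-sample estimate of the regularized target $y$, followed by squared-error regression. All networks, coefficients, and the batch are hypothetical placeholders rather than the authors' code.

```python
# Critic regression toward the regularized target y of Eq. (8a), sketch only.
import torch
import torch.nn as nn

state_dim, action_dim, gamma, kappa, tau = 8, 2, 0.99, 0.2, 0.2

def gaussian(net, s):
    mean, log_std = net(s).chunk(2, dim=-1)
    return torch.distributions.Normal(mean, log_std.clamp(-5, 2).exp())

actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))
actor_bar = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, 2 * action_dim))   # baseline policy
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
critic_bar = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))  # target critic
opt = torch.optim.Adam(critic.parameters(), lr=3e-4)

s, a, r, s_next = (torch.randn(32, state_dim), torch.randn(32, action_dim),
                   torch.randn(32, 1), torch.randn(32, state_dim))      # a replay-buffer batch

with torch.no_grad():
    dist, dist_bar = gaussian(actor, s_next), gaussian(actor_bar, s_next)
    a_next = dist.sample()
    logp = dist.log_prob(a_next).sum(-1, keepdim=True)
    logp_bar = dist_bar.log_prob(a_next).sum(-1, keepdim=True)
    reg = -kappa * logp - tau * (logp - logp_bar)                       # one-sample I_{pi_bar}^{pi}(s')
    y = r + gamma * (critic_bar(torch.cat([s_next, a_next], -1)) + reg)

critic_loss = ((critic(torch.cat([s, a], -1)) - y) ** 2).mean()        # Eq. (8b)
opt.zero_grad(); critic_loss.backward(); opt.step()
```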

Theorem 4.4.

(Ziebart, 2010, Theorem 2.8) Let $\pi^{(1)},\pi^{(2)},\dots,\pi^{(n)}$ be an arbitrary sequence of policies and $\zeta_{1},\dots,\zeta_{n}$ be a sequence of numbers such that $\zeta_{i}\geq 0$ for all $i$ and $\sum_{i=1}^{n}\zeta_{i}=1$. Then the policy $\pi'$ defined by:

$$
\pi'(a|s):=\frac{\sum_{i=1}^{n}\zeta_{i}\,P(\mathcal{S}=s,\mathcal{A}=a\,|\,\pi^{(i)})}{\sum_{i=1}^{n}\zeta_{i}\,P(\mathcal{S}=s\,|\,\pi^{(i)})}\tag{9}
$$

has the same expected number of state-action occurrences as the mixture, whenever the denominator is nonzero.

In implementation, since the states are sampled from the replay buffer, there is an error term in this information projection step. Taking the above information projection into account, we elaborate on the gradient expression of the actor via the following proposition:

Proposition 4.5.

Let the actor network be parametrized by weights $\phi$ and the critic by $\theta$. Define $\mathcal{G}Q_{\bar{\phi},\theta}$ as the greedy policy with respect to the CAC critic; the subscript $\bar{\phi}$ comes from the baseline policy introduced by the KL divergence. Then the gradient of the actor update can be expressed as:

$$
\begin{split}
&\nabla_{\phi}\mathbb{E}^{\substack{s\sim\mathcal{B}\\ a\sim\pi_{\phi}}}\left[D^{\phi}_{\bar{\phi}}-\frac{\beta}{1+\mathcal{X}}Q_{\theta}(s,a)\right],\\
&\text{where } D^{\phi}_{\bar{\phi}}=\log\pi_{\phi}(a\mid s)-\frac{\alpha+\mathcal{X}}{1+\mathcal{X}}\log\pi_{\bar{\phi}}(a\mid s),\\
&\mathcal{X}=\frac{1-\zeta}{\zeta}\cdot\frac{\pi_{\bar{\phi}}(a\mid s)}{\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)}.
\end{split}\tag{10}
$$
Proof 4.6.

See Section A.2 in the Appendix.

This gradient expression is similar to that of SAC (Haarnoja et al., 2018) and is off-policy since the states $s$ are sampled from the off-policy replay buffer $\mathcal{B}$. However, CAC has the additional term $\log\pi_{\bar{\phi}}(a\mid s)$ from the KL regularization, and the term $\mathcal{X}$ multiplying both $Q_{\theta}$ and $\log\pi_{\bar{\phi}}(a\mid s)$ involves $\zeta$, which encodes the information guiding the gradient to learn cautiously.
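To make Eq. (10) concrete, the toy sketch below evaluates the objective whose gradient is Eq. (10) for a single state with a one-dimensional action discretized on a grid, so that the greedy-policy density $\mathcal{G}Q_{\bar{\phi},\theta}(a|s)\propto\pi_{\bar{\phi}}^{\alpha}(a|s)\exp(\beta Q_{\theta}(s,a))$ can be normalized numerically; autograd then supplies the gradient. The Gaussian policies, the quadratic stand-in for $Q_{\theta}$, and the value of $\zeta$ are all assumptions for illustration, not the authors' implementation.

```python
# Toy evaluation of the Eq. (10) objective on a discretized 1-D action space.
import torch

kappa, tau, zeta = 0.2, 0.2, 0.5
alpha, beta = kappa / (kappa + tau), 1.0 / (kappa + tau)

grid = torch.linspace(-2.0, 2.0, 401)                   # discretized action space
da = grid[1] - grid[0]
q = -(grid - 0.7) ** 2                                  # stand-in for Q_theta(s, .)

mu_bar = torch.tensor(0.0)                              # baseline policy pi_{phi_bar}: N(0, 1)
logp_bar = torch.distributions.Normal(mu_bar, 1.0).log_prob(grid)

mu = torch.tensor(0.3, requires_grad=True)              # actor pi_phi: N(mu, 1)
logp = torch.distributions.Normal(mu, 1.0).log_prob(grid)
pi_mass = logp.exp() * da                               # probability mass of pi_phi on the grid

g_logits = alpha * logp_bar + beta * q                  # greedy policy, normalized on the grid
g = torch.softmax(g_logits, dim=0) / da

x = (1 - zeta) / zeta * logp_bar.exp() / g              # the X term of Eq. (10)
d = logp - (alpha + x) / (1 + x) * logp_bar             # the D term of Eq. (10)
loss = torch.sum(pi_mass * (d - beta / (1 + x) * q))    # E_{a ~ pi_phi}[D - beta/(1+X) * Q]
loss.backward()                                         # autograd gives the gradient w.r.t. mu
print(mu.grad)
```

In the actual continuous-action implementation, the expectation is instead estimated with reparameterized samples, as in Eq. (24) of the Appendix.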

4.2 Design of Interpolation Coefficient

One of the main difficulties in extending CPI to learning with deep networks is that $\zeta$ becomes vanishingly small due to the typically long horizon in the AC setting. To tackle this problem, (Vieillard et al., 2020b) propose to heuristically design $\zeta$ to be a non-trivial value based on moving averages.



Figure 1: Training curves on the continuous control benchmark problems. The solid curves show the mean and shaded regions the standard deviation over the five independent trials. In all tasks CAC achieved comparable performance but significantly stabilized learning.

Recall that in Eq. (7), $\zeta$ is computed by a function of the form:

$$
\zeta=C'_{\gamma}C_{k}\,\frac{\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]}{\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}},
$$

where $C'_{\gamma}$ absorbs horizon-dependent constants and $C_{k}$ is defined in Eq. (6). We propose to remove $C'_{\gamma}$ since it tends to zero as the horizon increases. Recall also that $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$ is the maximum difference of the expected advantage function defined in Eq. (5). We propose the following novel design of $\zeta$:

$$
\zeta^{\mathrm{CAC}}=\mathrm{clip}\left(\frac{\tilde{\mathbb{A}}}{|\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}|},\,0,\,1\right),\tag{11}
$$

where, following the moving-average concept of (Vieillard et al., 2020b), we update $\tilde{\mathbb{A}}$ and $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$ as:

$$
\begin{split}
&M=\mathbb{E}^{\substack{s\sim\mathcal{B}_{K}\\ a\sim\mathcal{G}Q^{\pi_{k}}}}\left[A^{\pi_{k}}(s,a)\right]\\
&\tilde{\mathbb{A}}\leftarrow\begin{cases}c,&\text{if }M\leq 0\\ (1-\nu_{A})\tilde{\mathbb{A}}+\nu_{A}M,&\text{else}\end{cases}\\
&\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}\leftarrow(1-\nu_{A^{\mathrm{MaxDiff}}})\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}+\nu_{A^{\mathrm{MaxDiff}}}\,M.
\end{split}\tag{12}
$$

Here, $M$ is the current estimate of $\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]$. We introduce the if-else judgement because $M\leq 0$ indicates that the updated policy performs worse than the current one; in that case we set $\tilde{\mathbb{A}}$ to a negative value $c$, which enforces $\zeta=0$. Otherwise this information is incorporated into $\tilde{\mathbb{A}}$ by an exponential moving average with the previous estimates. $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$ approximates the maximum difference $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$. $\mathcal{B}_{K}$ is a FIFO replay buffer storing the $K$ most recent on-policy samples, and $\nu_{A},\nu_{A^{\mathrm{MaxDiff}}}\in[0,1]$ are hyperparameters controlling the averages. Following (Vieillard et al., 2020b), it is beneficial to have $\nu_{A^{\mathrm{MaxDiff}}}\leq\nu_{A}$ for smooth learning.

Computing $\zeta$ in Eq. (11) using the moving averages in Eq. (12) is in spirit similar to (Vieillard et al., 2020b). However, they focus on general stationary policies: as $\tilde{\mathbb{A}}$ is an off-policy estimate of the on-policy term $\mathbb{E}^{\pi_{k+1},d^{\pi_{k}}}\left[A^{\pi_{k}}(s,a)\right]$, it might corrupt the improvement guarantee. We instead focus on entropy-regularized policies, which allow one to bound the performance loss of leveraging the off-policy estimate $\tilde{\mathbb{A}}$ (Zhu and Matsubara, 2020, Theorem 3).
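A minimal sketch of the update of Eqs. (11)-(12) is given below; the hyperparameter values and the advantage batch are illustrative.

```python
# Moving-average zeta update of Eqs. (11)-(12), with the if-else reset when M <= 0.
import numpy as np

nu_a, nu_max, c = 0.1, 0.01, -1.0        # nu_{A^MaxDiff} <= nu_A, and c < 0
a_tilde, a_maxdiff = 0.0, 1.0            # running estimates

def update_zeta(advantages, a_tilde, a_maxdiff):
    """advantages: A^{pi_k}(s, a) for s ~ B_K, a ~ greedy policy."""
    m = float(np.mean(advantages))       # current estimate M
    a_tilde = c if m <= 0 else (1 - nu_a) * a_tilde + nu_a * m
    a_maxdiff = (1 - nu_max) * a_maxdiff + nu_max * m
    zeta = float(np.clip(a_tilde / abs(a_maxdiff), 0.0, 1.0))
    return zeta, a_tilde, a_maxdiff

zeta, a_tilde, a_maxdiff = update_zeta(np.random.randn(256) * 0.1 + 0.05, a_tilde, a_maxdiff)
print(zeta)
```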

5 Experiments

Table 1: Performance oscillation values of all algorithms on all environments. $\times$ indicates that the algorithm failed to learn meaningful behaviors. CAC recorded the smallest performance oscillation values on all environments. PPO is the only on-policy algorithm in the comparison.

               $\|\mathcal{O}J\|_{\infty}$                        $\|\mathcal{O}J\|_{2}$
               PPO    TD3    SAC    CAC ($\zeta=1$)  CAC          PPO    TD3    SAC    CAC ($\zeta=1$)  CAC
Ant            1979   4979   7793   7160             1811         359    510    642    591              297
HalfCheetah    ×      2337   3717   4200             1870         ×      331    425    397              286
Hopper         1598   3515   2598   2944             1944         318    609    454    394              279
Humanoid       ×      ×      4115   3092             2199         ×      ×      645    436              313
Walker2d       1673   3729   4577   4310             1345         330    461    499    334              183

As CAC combines concepts from the ADP literature, such as KL regularization and conservative learning, that have not seen application in AC, it is interesting to examine the combination against existing AC methods on challenging tasks. We choose a set of high-dimensional continuous control tasks from the OpenAI Gym benchmark suite (Brockman et al., 2016).

We compare CAC with twin delayed deep deterministic policy gradient (TD3) (Fujimoto et al., 2018), which comprehensively surveys the factors causing the poor performance of actor-critic methods, to examine the cautious learning mechanism. As CAC is based on CPI, which also inspired TRPO and PPO, we compare it with PPO, the improved version of TRPO (Schulman et al., 2017). Since PPO does not involve computing $\zeta$, we also include the curves of CAC with $\zeta=1$. We further compare with SAC (Haarnoja et al., 2018), which has a similar architecture.

To better illustrate and quantify stability during learning, we follow (Zhu and Matsubara, 2020) and define the following measures of performance oscillation:

$$
\begin{split}
&\forall k \text{ such that } R_{k+1}-R_{k}<0:\\
&\|\mathcal{O}J\|_{\infty}=\max_{k}\left|R_{k+1}-R_{k}\right|,\\
&\|\mathcal{O}J\|_{2}=\sqrt{\frac{1}{N}\sum_{k=1}^{N}\left(R_{k+1}-R_{k}\right)^{2}},
\end{split}\tag{13}
$$

where $N$ is the number of learning steps and $R_{k}$ is the cumulative reward reported at the $k$-th evaluation. Intuitively, $\|\mathcal{O}J\|_{\infty}$ and $\|\mathcal{O}J\|_{2}$ measure the maximum and average degradation during learning, respectively.
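A small sketch computing these measures from a sequence of evaluation returns (made-up numbers); here the average in $\|\mathcal{O}J\|_{2}$ is taken over the degradation steps only, which is one possible reading of Eq. (13).

```python
# Oscillation measures of Eq. (13) from a sequence of evaluation returns.
import numpy as np

returns = np.array([100.0, 250.0, 180.0, 400.0, 390.0, 600.0])   # hypothetical R_k
drops = np.diff(returns)
drops = drops[drops < 0]                                          # steps where R_{k+1} - R_k < 0
oj_inf = np.abs(drops).max() if drops.size else 0.0
oj_2 = np.sqrt(np.mean(drops ** 2)) if drops.size else 0.0
print(oj_inf, oj_2)
```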

5.1 Comparative Evaluation

We run all algorithms with the same set of hyperparameters listed in Section A.4 of the Appendix. All figures are plotted with statistics from 10 different random seeds, with each performing 10 evaluation rollouts every 5000 environment steps.

Figure 1 shows the learning curves of the algorithms. CAC achieved performance comparable to the other AC algorithms while significantly stabilizing the learning curves. PPO's learning speed was the slowest on all environments due to its on-policy, sample-inefficient nature. The other methods were able to leverage off-policy samples to quickly learn meaningful behaviors. However, the fast learning came at a cost: except on the relatively simple HalfCheetah-v2, these off-policy algorithms oscillated wildly, especially on the challenging Humanoid-v2, where both PPO and TD3 failed to learn any meaningful behaviors and the performance of SAC and of CAC with $\zeta=1$ degraded frequently. On the other hand, CAC traded slightly slower learning for stability, exhibiting smooth curves. Indeed, the convergence rate of CAC is $O\!\left(e^{-(1-\gamma)\sum_{j=1}^{k}\zeta_{j}}\right)$ (Vieillard et al., 2020b), which emphasizes stability more as $\zeta\rightarrow 0$.

The stability comparison can be seen in Table 1, which summarizes the values of $\|\mathcal{O}J\|_{\infty}$ and $\|\mathcal{O}J\|_{2}$ for all algorithms. It provides empirical support that CAC showed the least oscillation during learning, in contrast to the other off-policy algorithms, which oscillated wildly. Since CAC with $\zeta=1$ still showed large oscillation, we conclude that the mixture coefficient introduced in CAC is effective in preventing significant policy degradation.

5.2 Ablation study on mixture coefficient

Figure 2: Top: The learning curves of derived methods of CAC on Ant-v2. Bottom: The value of the mixture coefficient ζ\zeta.

In this section we conduct an ablation study of the effectiveness of CAC as well as the proposed $\zeta$ design in Section 4.2. We compare the following setups:

  1. Full. This is CAC with the proposed $\zeta^{\text{CAC}}$ of Eqs. (11) and (12). The curve is the same as in Figure 1.

  2. No KL. We remove the KL regularization from Eq. (3). This corresponds to SAC with the cautious actor.

  3. DSPI. This corresponds to Deep Safe Policy Iteration (Zhu and Matsubara, 2020), which uses the $\zeta$ suggested by (Vieillard et al., 2020b).

  4. No $\mathcal{B}_{K}$. We replace the on-policy replay buffer $\mathcal{B}_{K}$ with the off-policy buffer $\mathcal{B}$, as suggested by (Vieillard et al., 2020b).

  5. No If-Else. This corresponds to removing the if branch in Eq. (12) and learning with only the else condition.

As is clear from Figure 2, while removing the if-else judgement accelerated learning, it ignored the warning from $M\leq 0$ that the updated policy was poorer. The resulting curve oscillated drastically as a consequence of the aggressive $\zeta$ shown in the bottom panel.

Removing the KL regularization induced a learning curve similar to Full at the beginning, but the performance degraded significantly from the middle stage and failed to recover. This is probably because the policies were corrupted by errors.

Using the off-policy replay buffer $\mathcal{B}$ led to stable learning. This is expected, as the agent was forced to learn cautiously by the CAC mechanism. On the other hand, learning was slow since off-policy samples were not as informative as on-policy ones, as reflected by the small $\zeta$ values in the bottom panel.

It is most interesting to examine the DSPI case, where $\zeta$ was set according to the suggestion of (Vieillard et al., 2020b). Though this scheme works well on Atari games, the resulting algorithm failed to learn any meaningful behaviors on the challenging continuous-action control tasks. This is because it estimates with the entire off-policy replay buffer $\mathcal{B}$, which tends to produce very large estimates of $\tilde{\Delta}^{\pi_{k+1}}_{\pi_{k}}$, leading to vanishingly small $\zeta^{\text{DSPI}}$ and subsequent poor performance.

6 Conclusion

We have presented CAC, a novel actor-critic algorithm that introduces several concepts from the approximate dynamic programming literature. The cautiousness of CAC lies in its double conservativeness: the actor follows conservative policy iteration (Kakade and Langford, 2002), which ensures monotonic improvement, and the critic exploits conservative value iteration (Kozuno et al., 2019; Vieillard et al., 2020a), which has been shown to yield state-of-the-art guarantees in the ADP literature.

Our key observation was that, by introducing an entropy-regularized critic, the unwieldy interpolated actor update can be simplified significantly while still ensuring robust policy improvement. CAC performed comparably to state-of-the-art AC methods while significantly stabilizing learning on benchmark control problems with high-dimensional continuous state-action spaces.

An interesting future direction is to incorporate other entropies for different purposes. For example, the $\alpha$-divergence could be used to obtain sparse optimal policies.

Acknowledgments

This work is partly supported by JSPS KAKENHI Grant Number 21J15633 and 21H03522.

References

  • Abbasi-Yadkori et al. (2016) Yasin Abbasi-Yadkori, Peter L. Bartlett, and Stephen J. Wright. A fast and reliable policy improvement algorithm. In International Conference on Artificial Intelligence and Statistics, pages 1338–1346, 2016.
  • Ahmed et al. (2019) Zafarali Ahmed, Nicolas Le Roux, Mohammad Norouzi, and Dale Schuurmans. Understanding the impact of entropy on policy optimization. In International Conference on Machine Learning, pages 151–160, 2019.
  • Azar et al. (2012) Mohammad Gheshlaghi Azar, Vicenç Gómez, and Hilbert J Kappen. Dynamic policy programming. The Journal of Machine Learning Research, 13(1):3207–3245, 2012.
  • Bertsekas (2011) Dimitri Bertsekas. Approximate policy iteration: A survey and some new methods. Journal of Control Theory and Applications, 9:310–335, 2011.
  • Brockman et al. (2016) Greg Brockman, Vicki Cheung, Ludwig Pettersson, Jonas Schneider, John Schulman, Jie Tang, and Wojciech Zaremba. Openai gym. arXiv preprint arXiv:1606.01540, 2016.
  • Fakoor et al. (2020) Rasool Fakoor, Pratik Chaudhari, and Alexander J Smola. P3o: Policy-on policy-off policy optimization. In Conference on Uncertainty in Artificial Intelligence, pages 1017–1027, 2020.
  • Fox et al. (2016) Roy Fox, Ari Pakman, and Naftali Tishby. Taming the noise in reinforcement learning via soft updates. In Conference on Uncertainty in Artificial Intelligence, pages 202–211, 2016.
  • Fu et al. (2019) Justin Fu, Aviral Kumar, Matthew Soh, and Sergey Levine. Diagnosing bottlenecks in deep q-learning algorithms. In International Conference on Machine Learning, pages 2021–2030, 2019.
  • Fujimoto et al. (2018) Scott Fujimoto, Herke van Hoof, and David Meger. Addressing function approximation error in actor-critic methods. In International Conference on Machine Learning, pages 1587–1596, 2018.
  • Geist et al. (2019) Matthieu Geist, Bruno Scherrer, and Olivier Pietquin. A theory of regularized Markov decision processes. In International Conference on Machine Learning, pages 2160–2169, 2019.
  • Gu et al. (2017) Shixiang Shane Gu, Timothy Lillicrap, Richard E Turner, Zoubin Ghahramani, Bernhard Schölkopf, and Sergey Levine. Interpolated policy gradient: Merging on-policy and off-policy gradient estimation for deep reinforcement learning. In Advances in Neural Information Processing Systems, pages 3846–3855, 2017.
  • Haarnoja et al. (2017) Tuomas Haarnoja, Haoran Tang, Pieter Abbeel, and Sergey Levine. Reinforcement learning with deep energy-based policies. In International Conference on Machine Learning, pages 1352–1361, 2017.
  • Haarnoja et al. (2018) Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In International Conference on Machine Learning, pages 1861–1870, 2018.
  • Hafner et al. (2020) Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint, arXiv:1912.01603, 2020.
  • Kakade and Langford (2002) Sham Kakade and John Langford. Approximately optimal approximate reinforcement learning. In International Conference on Machine Learning, pages 267–274, 2002.
  • Kozuno et al. (2019) Tadashi Kozuno, Eiji Uchibe, and Kenji Doya. Theoretical analysis of efficiency and robustness of softmax and gap-increasing operators in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 2995–3003, 2019.
  • Lillicrap et al. (2015) Timothy P Lillicrap, Jonathan J Hunt, Alexander Pritzel, Nicolas Heess, Tom Erez, Yuval Tassa, David Silver, and Daan Wierstra. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971, 2015.
  • Munos et al. (2016) Rémi Munos, Tom Stepleton, Anna Harutyunyan, and Marc Bellemare. Safe and efficient off-policy reinforcement learning. In Advances in Neural Information Processing Systems, pages 1054–1062, 2016.
  • Nachum et al. (2017) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Bridging the gap between value and policy based reinforcement learning. In Advances in Neural Information Processing Systems 30, pages 2775–2785. 2017.
  • Nachum et al. (2018) Ofir Nachum, Mohammad Norouzi, Kelvin Xu, and Dale Schuurmans. Trust-pcl: An off-policy trust region method for continuous control. In International Conference on Learning Representations, pages 1–11, 2018.
  • Pirotta et al. (2013) Matteo Pirotta, Marcello Restelli, Alessio Pecorino, and Daniele Calandriello. Safe policy iteration. In International Conference on Machine Learning, pages 307–315, 2013.
  • Precup et al. (2001) Doina Precup, Richard S. Sutton, and Sanjoy Dasgupta. Off-policy temporal difference learning with function approximation. In International Conference on Machine Learning, page 417–424, 2001.
  • Scherrer and Geist (2014) Bruno Scherrer and Matthieu Geist. Local policy search in a convex space and conservative policy iteration as boosted policy search. In Machine Learning and Knowledge Discovery in Databases, pages 35–50, 2014.
  • Scherrer et al. (2015) Bruno Scherrer, Mohammad Ghavamzadeh, Victor Gabillon, Boris Lesner, and Matthieu Geist. Approximate modified policy iteration and its application to the game of tetris. Journal of Machine Learning Research, 16(1):1629–1676, 2015.
  • Schulman et al. (2015) John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In International Conference on Machine Learning, pages 1889–1897, 2015.
  • Schulman et al. (2017) John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sutton and Barto (2018) Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. A Bradford Book, Cambridge, MA, USA, 2018.
  • Vieillard et al. (2020a) Nino Vieillard, Tadashi Kozuno, Bruno Scherrer, Olivier Pietquin, Rémi Munos, and Matthieu Geist. Leverage the average: an analysis of kl regularization in reinforcement learning. Advances in Neural Information Processing Systems, 33:1–12, 2020a.
  • Vieillard et al. (2020b) Nino Vieillard, Olivier Pietquin, and Matthieu Geist. Deep conservative policy iteration. In AAAI Conference on Artificial Intelligence, pages 6070–6077, 04 2020b.
  • Wagner (2011) Paul Wagner. A reinterpretation of the policy oscillation phenomenon in approximate policy iteration. In Advances in Neural Information Processing Systems 24, pages 2573–2581, 2011.
  • Wang et al. (2017) Ziyu Wang, Victor Bapst, Nicolas Heess, Volodymyr Mnih, Remi Munos, Koray Kavukcuoglu, and Nando de Freitas. Sample efficient actor-critic with experience replay. In International Conference on Learning Representations, pages 1–13, 2017.
  • Zhu and Matsubara (2020) Lingwei Zhu and Takamitsu Matsubara. Ensuring monotonic policy improvement in entropy-regularized value-based reinforcement learning. arXiv preprint, arXiv:2008.10806, 2020.
  • Ziebart (2010) Brian D. Ziebart. Modeling Purposeful Adaptive Behavior with the Principle of Maximum Causal Entropy. PhD thesis, Carnegie Mellon University, 2010.

Appendix A Appendix

In this appendix we provide the missing proofs and implementation details. Specifically, we present Theorem 1 on CAC convergence in Section A.1, Proposition 4 on the CAC actor gradient expression in Section A.2, and implementation details in Section A.3.

A.1 Proof for the CAC convergence

We prove the convergence of CAC using a policy-iteration-style argument; similar proofs have been used in (Haarnoja et al., 2018, Theorem 4). The following lemmas establish the convergence of the policy evaluation and policy improvement steps of CAC, respectively.

Lemma A.1 (CAC Policy Evaluation).

Given the current policy $\pi$ and a baseline policy $\tilde{\pi}$, the policy evaluation step of CAC is formulated as:

$$
Q_{\pi_{k+1}}\leftarrow\mathcal{R}(s,a)+\gamma\left(\mathbf{P}\mathbb{E}^{\pi}[Q_{\pi_{k}}(s,a)]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)\right).\tag{14}
$$

Consider an initial Q-value $Q_{0}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$ with $|\mathcal{A}|<\infty$. With repeated application of Eq. (14), the sequence $Q_{k}$ converges to the following entropy-regularized Q-value $Q_{\tilde{\pi}}^{\pi}$ as $k\rightarrow\infty$:

$$
Q_{\tilde{\pi}}^{\pi}(s,a):=\mathbb{E}^{d^{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}\big(r_{t}+\mathcal{I}_{\tilde{\pi}}^{\pi}\left(s_{t+1}\right)\big)\,\Big|\,s_{0}=s,a_{0}=a\right].\tag{15}
$$
Proof A.2.

Define the entropy-augmented reward as $\mathcal{R}^{\pi}_{\tilde{\pi}}(s,a)\triangleq\mathcal{R}(s,a)+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)$ and rewrite the update rule as:

$$
Q_{\pi_{k+1}}\leftarrow\mathcal{R}^{\pi}_{\tilde{\pi}}(s,a)+\gamma\mathbf{P}\mathbb{E}^{\pi}[Q_{\pi_{k}}(s,a)].\tag{16}
$$

With the assumption $|\mathcal{A}|<\infty$ and bounded rewards, we can apply the standard convergence results for policy evaluation (Sutton and Barto, 2018).
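A tabular sketch of this policy evaluation recursion is given below, following the augmented-reward reading of Eqs. (14)-(16); the toy MDP and the two policies are random illustrative inputs.

```python
# Tabular entropy-regularized policy evaluation: iterate Q <- R_aug + gamma * P E^pi[Q].
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, kappa, tau = 4, 3, 0.9, 0.1, 0.1
P = rng.dirichlet(np.ones(S), size=(S, A))          # P[s, a, s']
R = rng.uniform(-1, 1, size=(S, A))
pi = rng.dirichlet(np.ones(A), size=S)              # policy being evaluated
pi_bar = rng.dirichlet(np.ones(A), size=S)          # baseline policy

# Regularizer I(s) of Eq. (3), folded into the augmented reward at the next state.
reg = -np.sum(pi * (kappa * np.log(pi) + tau * (np.log(pi) - np.log(pi_bar))), axis=1)
R_aug = R + gamma * (P @ reg)

Q = np.zeros((S, A))
for _ in range(500):                                # gamma-contraction, converges to Eq. (15)
    Q = R_aug + gamma * (P @ np.sum(pi * Q, axis=1))
print(Q)
```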

Lemma A.3 (CAC Policy Improvement).

Given the current policy $\pi$, a baseline policy $\tilde{\pi}$, and the updated policy $\pi_{\text{new}}$, CAC has the following policy update:

$$
\pi_{\text{new}}=(1-\zeta)\pi+\zeta\hat{\pi},\quad\text{where }\hat{\pi}(a\mid s)=\frac{\tilde{\pi}^{\alpha}(a\mid s)\exp\left(\beta Q_{\tilde{\pi}}^{\pi}(s,a)\right)}{Z(s)},\tag{17}
$$

with $\zeta\in[0,1]$. Then $Q_{\tilde{\pi}}^{\pi_{\text{new}}}(s,a)\geq Q_{\tilde{\pi}}^{\pi}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ with $|\mathcal{A}|<\infty$.

Proof A.4.

Consider a function $f:\zeta\to\mathbb{R}$ with $\zeta\in[0,1]$:

$$
f(\zeta)=\mathbb{E}^{a\sim\pi_{\text{new}}}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi_{\text{new}}}(s).\tag{18}
$$

From the definition $\hat{\pi}=\operatorname*{arg\,max}_{\pi}\;\mathbb{E}^{a\sim\pi}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)$, $f(\zeta)$ takes its maximum value at $\zeta=1$. The first and second derivatives of $f(\zeta)$ with respect to $\zeta$ are:

$$
f'(\zeta)=\sum_{a}\left(\hat{\pi}(a|s)-\pi(a|s)\right)\Big(Q_{\tilde{\pi}}^{\pi}(s,a)+\tau\log\tilde{\pi}(a|s)-(\kappa+\tau)\log\big((1-\zeta)\pi(a|s)+\zeta\hat{\pi}(a|s)\big)\Big),\tag{19}
$$
$$
f''(\zeta)=-(\kappa+\tau)\sum_{a}\frac{\left(\hat{\pi}(a|s)-\pi(a|s)\right)^{2}}{(1-\zeta)\pi(a|s)+\zeta\hat{\pi}(a|s)}\leq 0.\tag{20}
$$

Thus, the function $f$ is concave on $\zeta\in[0,1]$. Since $f(\zeta)$ attains its maximum at $\zeta=1$, $f(\zeta)$ is monotonically increasing in $\zeta$ and $f(0)\leq f(\zeta)$.

Therefore, the following inequality for the entropy-regularized V-value $V_{\tilde{\pi}}^{\pi}(s)$ holds:

$$
V_{\tilde{\pi}}^{\pi}(s)=\mathbb{E}^{a\sim\pi}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)\leq\mathbb{E}^{a\sim\pi_{\text{new}}}\left[Q_{\tilde{\pi}}^{\pi}(s,a)\right]+\mathcal{I}_{\tilde{\pi}}^{\pi_{\text{new}}}(s).\tag{21}
$$

By repeatedly applying Eq. (21), we obtain the following inequalities:

$$
\begin{split}
Q_{\tilde{\pi}}^{\pi}(s,a)&=\mathcal{R}(s,a)+\gamma\mathbf{P}\left[\mathbb{E}^{\pi}[Q_{\tilde{\pi}}^{\pi}(s,a)]+\mathcal{I}_{\tilde{\pi}}^{\pi}(s)\right]\\
&\leq\mathcal{R}(s,a)+\gamma\mathbf{P}\left[\mathbb{E}^{\pi_{\text{new}}}[Q_{\tilde{\pi}}^{\pi}(s,a)]+\mathcal{I}_{\tilde{\pi}}^{\pi_{\text{new}}}(s)\right]\\
&\;\;\vdots\\
&\leq Q_{\tilde{\pi}}^{\pi_{\mathrm{new}}}(s,a).
\end{split}\tag{22}
$$

Convergence to $Q_{\tilde{\pi}}^{\pi_{\mathrm{new}}}$ follows from Lemma A.1.

Combining the policy evaluation and policy improvement, we are now ready to prove Theorem 1.

Theorem 1 Repeated application of CAC, Eq. (3), to any initial policy $\pi$ converges to the entropy-regularized optimal policy $\pi^{*}(a|s)=\frac{\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}{\int_{a\in\mathcal{A}}\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}$.

Proof A.5.

According to Lemma A.1 and Lemma A.3, the entropy-regularized Q-value at the $k$-th update satisfies $Q_{\pi_{k-1}}^{\pi_{k}}(s,a)\geq Q_{\pi_{k-1}}^{\pi_{k-1}}(s,a)$. Given bounded rewards, $Q_{\pi_{k}}^{\pi_{k}}$ is also bounded from above and the sequence converges to a unique $\pi^{*}$. Note that upon reaching the optimum, the KL regularization term becomes $0$. Hence, using the same iterative argument as in the proof of Lemma A.3, we get $Q_{\pi^{*}}^{\pi^{*}}(s,a)>Q_{\pi}^{\pi}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ and any $\pi$. By Ziebart (2010) and Haarnoja et al. (2018), the optimal policy is entropy-regularized and hence has the softmax form $\pi^{*}(a|s)=\frac{\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}{\int_{a\in\mathcal{A}}\exp\left(\frac{1}{\kappa}Q^{*}(s,a)\right)}$ (recall from Eq. (3) that $\kappa$ is the entropy weighting coefficient). The convergence of the general interpolated policy to the optimal policy follows the argument of Scherrer and Geist (2014).

A.2 Proof for the CAC gradient

In this subsection we derive the gradient expression for CAC. For ease of reading we restate the proposition here:

Proposition 4 Let the actor network be parametrized by weights $\phi$ and the critic by $\theta$. Define $\mathcal{G}Q_{\bar{\phi},\theta}$ as the greedy policy with respect to the CAC critic; the subscript $\bar{\phi}$ comes from the baseline policy introduced by the KL divergence. Then the gradient of the actor update can be expressed as:

\nabla_{\phi}\,\mathbb{E}^{s\sim\mathcal{B},\,a\sim\pi_{\phi}}\left[D^{\phi}_{\bar{\phi}} - \frac{\beta}{1+\mathcal{X}}Q_{\theta}(s,a)\right], \quad (23)
\text{where } D^{\phi}_{\bar{\phi}} = \log\pi_{\phi}(a\mid s) - \frac{\alpha+\mathcal{X}}{1+\mathcal{X}}\log\pi_{\bar{\phi}}(a\mid s), \qquad \mathcal{X} = \frac{1-\zeta}{\zeta}\cdot\frac{\pi_{\bar{\phi}}(a\mid s)}{\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)}.
Proof A.6.

Using the reparameterization trick $a = f_{\phi}(\epsilon;s_{t})$ with $\epsilon$ a noise vector (Haarnoja et al., 2018), the gradient of Eq. (8c) can be expressed as:

\hat{\nabla}_{\phi} J_{\pi}(\phi) = \nabla_{\phi}\log\pi_{\phi}(a\mid s) + \nabla_{a}\log\pi_{\phi}(a\mid s)\,\nabla_{\phi}f_{\phi}(\epsilon;s_{t}) \quad (24)
\qquad\qquad - \nabla_{a}\log\left((1-\zeta)\pi_{\bar{\phi}}(a\mid s) + \zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)\right)\nabla_{\phi}f_{\phi}(\epsilon;s_{t}).

We expand the term $\nabla_{a}\log\left((1-\zeta)\pi_{\bar{\phi}}(a\mid s)+\zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)\right)$ by using the identity $\nabla_{x_{i}}\log\left(\sum_{i}\exp x_{i}\right)=\frac{\exp x_{i}}{\sum_{j}\exp x_{j}}$.
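(As a quick sanity check of this identity, separate from the proof itself, it can be verified numerically with automatic differentiation; the snippet below is only an illustration.)

import torch

# Gradient of log-sum-exp equals the softmax of its inputs.
x = torch.randn(5, requires_grad=True)
torch.logsumexp(x, dim=0).backward()
assert torch.allclose(x.grad, torch.softmax(x, dim=0))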

Let $\exp\left(C_{1}(a)\right)=(1-\zeta)\pi_{\bar{\phi}}(a\mid s)$ and $\exp\left(C_{2}(a)\right)=\zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)$. We have the following transformation:

\nabla_{a}\log\left((1-\zeta)\pi_{\bar{\phi}}(a\mid s)+\zeta\,\mathcal{G}Q_{\bar{\phi},\theta}(a\mid s)\right) \quad (25)
= \nabla_{a}\log\left(\exp\left(C_{1}(a)\right)+\exp\left(C_{2}(a)\right)\right)
= \frac{D\,\frac{\partial}{\partial a}\log\pi_{\bar{\phi}}(a\mid s) + \alpha\,\frac{\partial}{\partial a}\log\pi_{\bar{\phi}}(a\mid s) + \beta\,\frac{\partial}{\partial a}Q_{\theta}(s,a)}{1+D},
\text{where } D = \exp\left(C_{1}(a)-C_{2}(a)\right).

After replacing $D$ with $\mathcal{X}$ and inserting Eq. (25) into Eq. (24), we obtain Eq. (23), i.e., Eq. (10) in the main text.
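To make the actor update concrete, the following is a minimal PyTorch-style sketch of the interpolated objective whose reparameterized gradient is Eq. (24), and hence, after the expansion of Eq. (25), Eq. (23). It is an illustrative sketch only: the function name and tensor layout are our own assumptions, and the greedy-policy log-density is assumed to be computed separately, e.g. using the normalization of Eq. (29).

import torch

def cac_actor_objective(log_pi, log_pi_bar, log_greedy, zeta):
    """Sketch of the interpolated actor objective (cf. Eq. (8c) in the main text).

    All inputs have shape (batch,) and are evaluated at reparameterized actions
    a = f_phi(eps; s), so that autograd through the actions reproduces Eq. (24):
      log_pi     : log pi_phi(a | s) of the current policy
      log_pi_bar : log pi_phi_bar(a | s) of the baseline (target) policy
      log_greedy : log G Q_{phi_bar, theta}(a | s), i.e. the normalized greedy
                   density alpha * log pi_bar + beta * Q_theta - log Z(s)
      zeta       : interpolation coefficient in (0, 1]
    """
    # log((1 - zeta) * pi_bar + zeta * greedy), computed with logsumexp
    # for numerical stability (the same log-sum-exp identity as in the proof).
    log_mix = torch.logsumexp(
        torch.stack([torch.log1p(torch.tensor(-zeta)) + log_pi_bar,
                     torch.log(torch.tensor(zeta)) + log_greedy]), dim=0)
    return (log_pi - log_mix).mean()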

A.3 Implementation details

This section presents implementation details of CAC with deep networks. Pseudo-code is provided in Algorithm 1.

On-policy replay buffer

To keep the algorithm off-policy, we approximate on-policy samples with an on-policy replay buffer $\mathcal{B}_{K}$ that stores the $K$ most recent samples, where $K$ is smaller than the size of the main replay buffer $\mathcal{B}$.
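For illustration, such a buffer can be a simple fixed-length FIFO kept alongside the main buffer; the sketch below is one possible minimal implementation (the class and method names are ours, not those of the paper's code).

import random
from collections import deque

class RecentReplayBuffer:
    """Minimal sketch of the on-policy buffer B_K: a FIFO holding only the
    K most recent transitions, kept alongside the much larger main buffer B."""

    def __init__(self, capacity_k):
        self.storage = deque(maxlen=capacity_k)  # oldest entries are evicted automatically

    def add(self, transition):
        # transition: e.g. a (state, action, reward, next_state, done) tuple
        self.storage.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.storage), min(batch_size, len(self.storage)))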

Advantage estimation

While it is possible to simply use the entropy-regularized advantage function $A_{\mathcal{I}}(s,a)=Q_{\mathcal{I}}(s,a)-V_{\mathcal{I}}(s)$ for computing $\zeta$, we are interested in studying the guidance provided by $\zeta$ when no entropy is involved, since it might give a more informative improvement direction. This corresponds to the case of (Kakade and Langford, 2002; Pirotta et al., 2013). To this end, we train another Q-network $Q_{\omega}$ by solving:

\omega \leftarrow \arg\min\,\mathbb{E}^{\mathcal{B}}\left[\left(Q_{\omega}(s,a)-y\right)^{2}\right], \quad\text{where } y = r + \gamma\,\mathbb{E}^{a\sim\pi_{\phi}(\cdot\mid s')}\left[Q_{\bar{\omega}}(s',a)\right], \quad (26)

where $\bar{\omega}$ is the target network. We then approximate the advantage as $A^{\pi}(s,a)=Q_{\omega}(s,a)-\mathbb{E}^{\pi_{\phi}}\left[Q_{\omega}(s,a)\right]$. While this estimate could be further improved with the recent generalized advantage estimation, we found the simple implementation above sufficient to stabilize learning.
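As a concrete, deliberately simplified illustration of Eq. (26) and the advantage estimate above, both the regression target and the baseline can be approximated with a handful of actions sampled from the current policy. The function names, tensor shapes, and the optional terminal mask below are our own assumptions.

import torch

def q_omega_target(reward, next_q_samples, gamma=0.99, done=None):
    """Sketch of the regression target y in Eq. (26).

    reward         : (B,) rewards r
    next_q_samples : (N, B) values Q_omega_bar(s', a') for N actions a' ~ pi_phi(.|s')
    done           : optional (B,) terminal mask (an assumption; Eq. (26) omits it)
    """
    with torch.no_grad():
        bootstrap = next_q_samples.mean(dim=0)  # Monte-Carlo estimate of E_{a ~ pi_phi}[Q_omega_bar]
        if done is not None:
            bootstrap = bootstrap * (1.0 - done)
        return reward + gamma * bootstrap

def advantage_estimate(q_sa, q_samples):
    """A^pi(s,a) ~ Q_omega(s,a) - E_{a ~ pi_phi}[Q_omega(s,a)].

    q_sa      : (B,) Q_omega at the buffer actions
    q_samples : (N, B) Q_omega at N actions freshly sampled from pi_phi
    """
    return q_sa - q_samples.mean(dim=0)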

Target smoothing

For the target Q-networks $Q_{\bar{\theta}}$ and $Q_{\bar{\omega}}$, we update the parameters using the moving average (Haarnoja et al., 2018):

\bar{\theta} \leftarrow \nu_{\bar{\theta}}\,\theta + (1-\nu_{\bar{\theta}})\,\bar{\theta}, \qquad \bar{\omega} \leftarrow \nu_{\bar{\omega}}\,\omega + (1-\nu_{\bar{\omega}})\,\bar{\omega}, \quad (27)

where $\nu_{\bar{\theta}}$ and $\nu_{\bar{\omega}}$ are the target smoothing coefficients. In the mixing step, we use the previous policy $\pi_{\bar{\phi}}$ rather than the current policy $\pi_{\phi}$ to stabilize training:

\mathbb{E}^{\mathcal{B}}\left[D_{\mathrm{KL}}\left(\pi_{\phi}\,\middle\|\,(1-\zeta)\pi_{\bar{\phi}}+\zeta\hat{\pi}\right)\right].

Thus, the target policy $\pi_{\bar{\phi}}$ corresponds to the monotonically improved policy of the CPI algorithm, which is not updated when performance oscillation happens. To reflect this, we update the weights of the target policy network as:

\bar{\phi} \leftarrow \zeta\nu_{\bar{\phi}}\,\phi + (1-\zeta\nu_{\bar{\phi}})\,\bar{\phi}, \quad (28)

where $\nu_{\bar{\phi}}$ is the target smoothing coefficient.
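The updates of Eqs. (27) and (28) are plain Polyak averaging; a minimal PyTorch-style sketch is given below (the function names are ours). Note that the effective step size of the target policy is $\zeta\nu_{\bar{\phi}}$, so $\pi_{\bar{\phi}}$ barely moves when $\zeta$ is small.

import torch

@torch.no_grad()
def soft_update(target_net, online_net, nu):
    """Moving-average update of Eq. (27): target <- nu * online + (1 - nu) * target."""
    for t_param, o_param in zip(target_net.parameters(), online_net.parameters()):
        t_param.mul_(1.0 - nu).add_(nu * o_param)

@torch.no_grad()
def target_policy_update(target_actor, online_actor, nu_phi_bar, zeta):
    """Update of Eq. (28): the same averaging with effective step size zeta * nu_phi_bar."""
    soft_update(target_actor, online_actor, zeta * nu_phi_bar)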

Normalization factor estimation

Since the CAC algorithm requires the density of the reference policy $\hat{\pi}(a\mid s)=\pi_{\bar{\phi}}^{\alpha}(a\mid s)\exp\left(\beta Q_{\theta}(s,a)\right)\left(Z(s)\right)^{-1}$, we need to estimate the normalization factor $Z(s)$.

A simple approach to estimating $Z(s)$ is Monte-Carlo sampling with a proposal distribution $q$ that is easy to sample from:

Z(s) = \mathbb{E}^{q}\left[\frac{\pi_{\bar{\phi}}^{\alpha}(a\mid s)\exp\left(\beta Q_{\theta}(s,a)\right)}{q(a\mid s)}\right]. \quad (29)

The closer $q(\cdot\mid s)$ is to the reference policy $\hat{\pi}(\cdot\mid s)$, the better the accuracy of the $Z(s)$ approximation.
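For illustration, the Monte-Carlo estimate of Eq. (29) is conveniently computed in log space; the sketch below uses the current policy $\pi_{\phi}$ as the proposal $q$, in line with the discussion that follows (the function name and tensor layout are our own assumptions).

import math
import torch

def estimate_log_z(log_pi_bar, q_values, log_q, alpha, beta):
    """Log-space Monte-Carlo estimate of Z(s) in Eq. (29).

    For N proposal actions a_i ~ q(.|s) per state in a batch of size B:
      log_pi_bar : (N, B) log pi_phi_bar(a_i | s)
      q_values   : (N, B) Q_theta(s, a_i)
      log_q      : (N, B) log q(a_i | s), e.g. of the current policy pi_phi
    Returns an estimate of log Z(s) with shape (B,).
    """
    n = log_pi_bar.shape[0]
    # log of the importance weights pi_bar^alpha * exp(beta * Q) / q
    log_w = alpha * log_pi_bar + beta * q_values - log_q
    # log( (1/N) * sum_i exp(log_w_i) ), computed stably with logsumexp
    return torch.logsumexp(log_w, dim=0) - math.log(n)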

Theorem 4.3 indicates that, by choosing the current policy $\pi$ as the proposal distribution, we can control the closeness of the two distributions, and hence the accuracy of the MC approximation, by changing the entropy regularization weighting coefficients. We empirically study the effect of entropy regularization on this closeness and accuracy when $\zeta\leq 1$.

We use the pendulum environment from Fu et al. (2019), where the dynamics are discretized so that we can compute oracle values such as the KL divergence between the current and the reference policy. The hyperparameters used in this experiment are listed in Section A.4. Figure 3 shows how the learning behavior of CAC changes as the interpolation coefficient $\zeta$ and the KL regularization weight $\tau$ vary: when KL regularization is present, the approximation quality of $Z(s)$ improves significantly.



Figure 3: Left: the learning curves of CAC with different parameters on pendulum. Middle: the maximum KL divergence over all the states. Right: the maximum error of the log-scaled $Z(s)$ approximation over all the states.
1:  Initialize parameter vectors $\theta$, $\phi$, $\bar{\theta}$, $\bar{\phi}$, $\omega$, $\bar{\omega}$
2:  Initialize variables $\tilde{\mathbb{A}}$ and $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$
3:  for each iteration do
4:     Collect transitions with $\pi_{\phi}$ and add them to $\mathcal{B}$ and $\mathcal{B}_{K}$
5:     for each gradient step do
6:        Update $\theta$ with one step of SGD using Eq. (8b)
7:        Update $\omega$ with one step of SGD using Eq. (26)
8:        Update $\tilde{\mathbb{A}}$ and $\tilde{\mathbb{A}}^{\mathrm{MaxDiff}}$ using Eq. (12)
9:        Update $\phi$ with one step of SGD using Eq. (8c)
10:       Update $\zeta$ using Eq. (11)
11:       $\bar{\theta}\leftarrow\nu_{\bar{\theta}}\theta+(1-\nu_{\bar{\theta}})\bar{\theta}$
12:       $\bar{\omega}\leftarrow\nu_{\bar{\omega}}\omega+(1-\nu_{\bar{\omega}})\bar{\omega}$
13:       $\bar{\phi}\leftarrow\zeta\nu_{\bar{\phi}}\phi+(1-\zeta\nu_{\bar{\phi}})\bar{\phi}$
14:     end for
15:  end for
Algorithm 1 Cautious Actor-Critic
Table 2: Hyperparameters of off-policy algorithms in MuJoCo tasks
Parameter  Value
Shared
  optimizer  Adam
  learning rate  $10^{-3}$
  discount factor ($\gamma$)  0.99
  replay buffer size ($\mathcal{B}$)  $10^{6}$
  number of hidden layers  2
  number of hidden units per layer  256
  number of samples per minibatch  100
  activations  ReLU
TD3
  stddev for Gaussian noise  0.1
  stddev for target smoothing noise  0.2
  policy delay  2
SAC
  entropy coefficient ($\kappa$)  0.2
  $\bar{\theta}$ smoothing coefficient  0.995
CAC
  entropy coefficient ($\kappa$)  0.2
  KL coefficient ($\tau$)  0.1
  $\bar{\theta}$ smoothing coefficient ($\nu_{\bar{\theta}}$)  0.995
  $\bar{\omega}$ smoothing coefficient ($\nu_{\bar{\omega}}$)  0.995
  $\bar{\phi}$ smoothing coefficient ($\nu_{\bar{\phi}}$)  0.9999
  $\nu_{A}$  0.01
  $\nu_{A^{\mathrm{MaxDiff}}}$  0.001
  size of $\mathcal{B}_{K}$  1000
  if-else update  $c=M$
Table 3: Hyperparameters of PPO in MuJoCo tasks
Parameter  Value
PPO
  optimizer  Adam
  value function learning rate  $10^{-3}$
  policy learning rate  $3\times 10^{-4}$
  discount factor ($\gamma$)  0.99
  number of hidden layers  2
  number of hidden units per layer  256
  number of samples per minibatch  100
  activations  ReLU
  number of samples per update  80
  policy objective clipping coefficient  0.2
Table 4: Hyperparameters of CAC in the pendulum task
Parameter  Value
CAC
  optimizer  Adam
  learning rate  $10^{-3}$
  discount factor ($\gamma$)  0.99
  replay buffer size ($\mathcal{B}$)  $10^{6}$
  number of hidden layers  2
  number of hidden units per layer  256
  number of samples per minibatch  32
  activations  ReLU
  entropy coefficient ($\kappa$)  0.2
  $\bar{\theta}$ smoothing coefficient  0.995
  $\bar{\phi}$ smoothing coefficient  0.995
  size of $\mathcal{B}_{K}$  1000
  if-else update  $c=M$

A.4 Hyperparameters

This section lists the hyperparameters used in the comparative evaluation of Section 5.1.