
Trust Region-Based Safe Distributional
Reinforcement Learning for Multiple Constraints

Dohyeong Kim1, Kyungjae Lee2, and Songhwai Oh1
1Dep. of Electrical and Computer Engineering and ASRI, Seoul National University
2Artificial Intelligence Graduate School, Chung-Ang University
[email protected], [email protected],
[email protected]
Abstract

In safety-critical robotic tasks, potential failures must be reduced, and multiple constraints must be met, such as avoiding collisions, limiting energy consumption, and maintaining balance. Thus, applying safe reinforcement learning (RL) in such robotic tasks requires handling multiple constraints and using risk-averse constraints rather than risk-neutral constraints. To this end, we propose a trust region-based safe RL algorithm for multiple constraints called a safe distributional actor-critic (SDAC). Our main contributions are as follows: 1) introducing a gradient integration method to manage infeasibility issues in multi-constrained problems, ensuring theoretical convergence, and 2) developing a TD($\lambda$) target distribution to estimate risk-averse constraints with low biases. We evaluate SDAC through extensive experiments involving multi- and single-constrained robotic tasks. While maintaining high scores, SDAC shows 1.93 times fewer steps to satisfy all constraints in multi-constrained tasks and 1.78 times fewer constraint violations in single-constrained tasks compared to safe RL baselines. Code is available at: https://github.com/rllab-snu/Safe-Distributional-Actor-Critic.

1 Introduction

Deep reinforcement learning (RL) enables reliable control of complex robots (Merel et al., 2020; Peng et al., 2021; Rudin et al., 2022). In order to successfully apply RL to real-world systems, it is essential to design a proper reward function which reflects safety guidelines, such as collision avoidance and limited energy consumption, as well as the goal of the given task. However, finding a reward function that considers all such factors is a cumbersome and time-consuming process since RL algorithms must be run repeatedly to verify the results of each designed reward function. Instead, safe RL, which handles safety guidelines as constraints, is an appropriate solution. A safe RL problem can be formulated using a constrained Markov decision process (Altman, 1999), where not only the reward but also cost functions are defined to provide the safety guideline signals. By defining constraints using expectations or risk measures of the sum of costs, safe RL aims to maximize returns while satisfying the constraints. Under the safe RL framework, the training process becomes straightforward since there is no need to search for a reward that reflects the safety guidelines.

While various safe RL algorithms have been proposed to deal with safety guidelines, their applicability to general robotic applications remains limited due to insufficiencies in 1) handling multiple constraints and 2) minimizing failures, such as robot breakdowns after collisions. First, many safety-critical applications require multiple constraints, such as maintaining distance from obstacles, limiting operational space, and preventing falls. Lagrange-based safe RL methods (Yang et al., 2021; Zhang and Weng, 2022; Bai et al., 2022), which convert a safe RL problem into a dual problem and update the policy and Lagrange multipliers, are commonly used to solve these multi-constrained problems. However, it is difficult to theoretically guarantee that Lagrangian methods satisfy the constraints during training, and the training process can be unstable due to the multipliers (Stooke et al., 2020). To this end, trust region-based methods (Yang et al., 2020; Kim and Oh, 2022a), which can guarantee return improvement while satisfying the constraints under tabular settings (Achiam et al., 2017), have been proposed as an alternative to stabilize the training process. Still, trust region-based methods have a critical issue. Depending on the initial policy settings, there can be an infeasible starting case, meaning that no policy within the trust region satisfies the constraints. To address this issue, one can sequentially select a violated constraint and update the policy to reduce the selected constraint (Xu et al., 2021). However, this can be inefficient as only one constraint is considered per update. It would be better to handle multiple constraints at once, but finding a policy gradient that reflects several constraints and is guaranteed to reach a feasible policy set remains an open problem.

Second, as RL settings are inherently stochastic, employing risk-neutral measures like expectation to define constraints can lead to frequent failures. Hence, it is crucial to define constraints using risk measures, such as conditional value at risk (CVaR), as they can reduce the potential for massive cost returns by emphasizing tail distributions (Yang et al., 2021; Kim and Oh, 2022a). In safe RL, critics are used to estimate the constraint values. In particular, estimating constraints based on risk measures requires distributional critics (Dabney et al., 2018b), which can be trained using the distributional Bellman update (Bellemare et al., 2017). However, the Bellman update only considers the one-step temporal difference, which can induce a large bias. The estimation bias makes it difficult for critics to judge the policy, which can lead to the policy becoming overly conservative or risky, as shown in Section 5.3. Moreover, when there are multiple constraints, the likelihood of deriving incorrect policy gradients due to estimation errors grows exponentially. Therefore, there is a need for a method that can train distributional critics with low biases.

In this paper, we propose a trust region-based safe RL algorithm called a safe distributional actor-critic (SDAC), designed to effectively manage multiple constraints and estimate risk-averse constraints with low biases. First, to handle the infeasible starting case by considering all constraints simultaneously, we propose a gradient integration method that projects unsafe policies into a feasible policy set by solving a quadratic program (QP) consisting of gradients of all constraints. It is guaranteed to obtain a feasible policy within a finite time under mild technical assumptions, and we experimentally show that it can restore the policy more stably than the existing method (Xu et al., 2021). Furthermore, by updating the policy using the trust region method with the integrated gradient, our approach makes the training process more stable than the Lagrangian method, as demonstrated in Section 5.2. Second, to train critics to estimate constraints with low biases, we propose a TD($\lambda$) target distribution which can adjust the bias-variance trade-off. The target distribution is obtained by merging the quantile regression losses (Dabney et al., 2018b) of multi-step distributions and extracting a unified distribution from the loss. The unified distribution is then projected onto a quantile distribution set in a memory-efficient manner. We experimentally show that the target distribution can trade off the bias and variance of the constraint estimations (see Section 5.3).

We conduct extensive experiments with multi-constrained locomotion tasks and single-constrained Safety Gym tasks (Ray et al., 2019) to evaluate the proposed method. In the locomotion tasks, SDAC shows 1.93 times fewer steps to satisfy all constraints than the second-best baselines. In the Safety Gym tasks, the proposed method shows 1.78 times fewer constraint violations than the second-best methods while achieving high returns when using risk-averse constraints. As a result, it is shown that the proposed method can efficiently handle multiple constraints using the gradient integration method and effectively lower the constraint violations using the low-biased distributional critics.

2 Background

Constrained Markov Decision Processes. We formulate the safe RL problem using constrained Markov decision processes (CMDPs) (Altman, 1999). A CMDP is defined as $(S, A, P, R, C_{1,...,K}, \rho, \gamma)$, where $S$ is a state space, $A$ is an action space, $P:S\times A\times S\mapsto[0,1]$ is a transition model, $R:S\times A\times S\mapsto\mathbb{R}$ is a reward function, $C_{k\in\{1,...,K\}}:S\times A\times S\mapsto\mathbb{R}_{\geq 0}$ are cost functions, $\rho:S\mapsto[0,1]$ is an initial state distribution, and $\gamma\in(0,1)$ is a discount factor. Given a policy $\pi$ from a stochastic policy set $\Pi$, the discounted state distribution is defined as $d^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathrm{Pr}(s_{t}=s|\pi)$, and the return is defined as $Z_{R}^{\pi}(s,a):=\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})$, where $s_{0}=s$, $a_{0}=a$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$. Then, the state value and state-action value functions are defined as $V_{R}^{\pi}(s):=\mathbb{E}\left[Z_{R}^{\pi}(s,a)\,|\,a\sim\pi(\cdot|s)\right]$ and $Q_{R}^{\pi}(s,a):=\mathbb{E}\left[Z_{R}^{\pi}(s,a)\right]$. By substituting the costs for the reward, the cost value functions $V_{C_{k}}^{\pi}(s)$ and $Q_{C_{k}}^{\pi}(s,a)$ are defined. In the remainder of the paper, the cost parts will be omitted since they can be retrieved by replacing the reward with the costs. Then, the safe RL problem is defined as follows with a safety measure $F$:

\mathrm{maximize}_{\pi}\;J(\pi)\;\;\mathbf{s.t.}\;\;F(Z_{C_{k}}^{\pi}(s,a)\,|\,s\sim\rho,a\sim\pi(\cdot|s))\leq d_{k}\;\;\forall k, (1)

where $J(\pi):=\mathbb{E}\left[Z_{R}^{\pi}(s,a)\,|\,s\sim\rho,a\sim\pi(\cdot|s)\right]+\beta\,\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}H(\pi(\cdot|s_{t}))\,|\,\rho,\pi,P\right]$, $\beta$ is an entropy coefficient, $H$ is the Shannon entropy, and $d_{k}$ is the threshold of the $k$th constraint.

Trust-Region Method With a Mean-Std Constraint. Kim and Oh (2022a) have proposed a trust region-based safe RL method with a risk-averse constraint, called a mean-std constraint. The definition of the mean-std constraint function is as follows:

F(Z;\alpha):=\mathbb{E}[Z]+(\phi(\Phi^{-1}(\alpha))/\alpha)\cdot\mathrm{Std}[Z], (2)

where $\alpha\in(0,1]$ adjusts the risk level of the constraint, $\mathrm{Std}[Z]$ is the standard deviation of $Z$, and $\phi$ and $\Phi$ are the probability density function and the cumulative distribution function (CDF) of the standard normal distribution, respectively. In particular, setting $\alpha=1$ causes the standard deviation part to be zero, so the constraint becomes a risk-neutral constraint. Also, the mean-std constraint can effectively reduce the potential for massive cost returns, as shown in Yang et al. (2021); Kim and Oh (2022a, b). In order to calculate the mean-std constraint, it is essential to estimate the standard deviation of the cost return. To this end, Kim and Oh (2022a) define the square value functions:

S_{C_{k}}^{\pi}(s):=\mathbb{E}\left[Z_{C_{k}}^{\pi}(s,a)^{2}\,|\,a\sim\pi(\cdot|s)\right],\quad S_{C_{k}}^{\pi}(s,a):=\mathbb{E}\left[Z_{C_{k}}^{\pi}(s,a)^{2}\right]. (3)

Since $\mathrm{Std}[Z]^{2}=\mathbb{E}[Z^{2}]-\mathbb{E}[Z]^{2}$, the $k$th constraint can be written as follows:

F_{k}(\pi;\alpha)=J_{C_{k}}(\pi)+(\phi(\Phi^{-1}(\alpha))/\alpha)\cdot\sqrt{J_{S_{k}}(\pi)-J_{C_{k}}(\pi)^{2}}\leq d_{k}, (4)

where $J_{C_{k}}(\pi):=\mathbb{E}\left[V_{C_{k}}^{\pi}(s)\,|\,s\sim\rho\right]$ and $J_{S_{k}}(\pi):=\mathbb{E}\left[S_{C_{k}}^{\pi}(s)\,|\,s\sim\rho\right]$. In order to apply the trust region method (Schulman et al., 2015), it is necessary to derive surrogate functions for the objective and constraints. These surrogates can substitute for the objective and constraints within the trust region. Given a behavioral policy $\mu$ and the current policy $\pi_{\mathrm{old}}$, we denote the surrogates as $J^{\mu,\pi_{\mathrm{old}}}(\pi)$ and $F_{k}^{\mu,\pi_{\mathrm{old}}}(\pi;\alpha)$. For the definition and derivation of the surrogates, please refer to Appendix A.7 and Kim and Oh (2022a). Using the surrogates, a policy can be updated by solving the following subproblem:

\mathrm{maximize}_{\pi^{\prime}}\;J^{\mu,\pi}(\pi^{\prime})\;\;\mathbf{s.t.}\;\;D_{\mathrm{KL}}(\pi||\pi^{\prime})\leq\epsilon,\;F_{k}^{\mu,\pi}(\pi^{\prime};\alpha)\leq d_{k}\;\forall k, (5)

where $D_{\mathrm{KL}}(\pi||\pi^{\prime}):=\mathbb{E}_{s\sim d^{\mu}}\left[D_{\mathrm{KL}}(\pi(\cdot|s)||\pi^{\prime}(\cdot|s))\right]$, $D_{\mathrm{KL}}$ is the KL divergence, and $\epsilon$ is the trust region size. This subproblem can be solved through approximation and a line search (see Appendix A.8). However, there may be no policy that satisfies the constraints of (5). To tackle this issue, the policy must be projected onto a feasible policy set that complies with all constraints, yet existing methods for such a projection are lacking. In light of this issue, we introduce an efficient feasibility handling method for multi-constrained RL problems.
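As a concrete illustration of (2) and (4), the following minimal Python sketch evaluates the mean-std constraint from estimated values of $J_{C_{k}}$ and $J_{S_{k}}$. The function name and the numeric inputs are illustrative placeholders, not part of the released implementation.

import numpy as np
from scipy.stats import norm

def mean_std_constraint(j_c, j_s, alpha):
    """Mean-std constraint value F_k(pi; alpha) from Eq. (4).

    j_c:   estimate of J_{C_k}(pi), the expected cost return
    j_s:   estimate of J_{S_k}(pi), the expected squared cost return
    alpha: risk level in (0, 1]; alpha = 1 recovers the risk-neutral constraint
    """
    coeff = norm.pdf(norm.ppf(alpha)) / alpha       # phi(Phi^{-1}(alpha)) / alpha
    std = np.sqrt(max(j_s - j_c ** 2, 0.0))         # Std[Z]^2 = E[Z^2] - E[Z]^2
    return j_c + coeff * std

# The constraint is satisfied when the returned value is at most the threshold d_k.
print(mean_std_constraint(j_c=0.20, j_s=0.09, alpha=0.25))

For $\alpha=1$, norm.ppf(1) is infinite and the coefficient becomes zero, which matches the risk-neutral case described above.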

Distributional Quantile Critic. Dabney et al. (2018b) have proposed a method for approximating the random variable $Z^{\pi}_{R}$ to follow a quantile distribution. Given a parametric model $\theta:S\times A\mapsto\mathbb{R}^{M}$, $Z_{R}^{\pi}$ can be approximated as $Z_{R,\theta}$, called a distributional quantile critic. The probability density function of $Z_{R,\theta}$ is defined as follows:

\mathrm{Pr}(Z_{R,\theta}(s,a)=z):=\sum\nolimits_{m=1}^{M}\delta_{\theta_{m}(s,a)}(z)/M, (6)

where $M$ is the number of atoms, $\theta_{m}(s,a)$ is the $m$th atom, $\delta$ is the Dirac function, and $\delta_{a}(z):=\delta(z-a)$. The percentile value of the $m$th atom is denoted by $\tau_{m}$ ($\tau_{0}=0$, $\tau_{i}=i/M$). In distributional RL, the returns are directly estimated to get value functions, and the target distribution can be calculated from the distributional Bellman operator (Bellemare et al., 2017): $\mathcal{T}^{\pi}Z_{R}(s,a):\overset{D}{=}R(s,a,s^{\prime})+\gamma Z_{R}(s^{\prime},a^{\prime})$, where $s^{\prime}\sim P(\cdot|s,a)$ and $a^{\prime}\sim\pi(\cdot|s^{\prime})$. The above one-step distributional operator can be expanded to the $n$-step one: $\mathcal{T}_{n}^{\pi}Z_{R}(s_{0},a_{0}):\overset{D}{=}\sum_{t=0}^{n-1}\gamma^{t}R(s_{t},a_{t},s_{t+1})+\gamma^{n}Z_{R}(s_{n},a_{n})$, where $a_{t}\sim\pi(\cdot|s_{t})$ for $t=1,...,n$. Then, the critic can be trained to minimize the following quantile regression loss (Dabney et al., 2018b):

\mathcal{L}(\theta)=\sum_{m=1}^{M}\underbrace{\mathbb{E}_{(s,a)\sim D}\left[\mathbb{E}_{Z\sim\mathcal{T}^{\pi}Z_{R,\theta}(s,a)}\left[\rho_{\bar{\tau}_{m}}(Z-\theta_{m}(s,a))\right]\right]}_{=:\mathcal{L}^{\bar{\tau}_{m}}_{\mathrm{QR}}(\theta_{m})},\;\;\mathrm{where}\;\rho_{\tau}(x)=x\cdot(\tau-\mathbf{1}_{x<0}), (7)

where $D$ is a replay buffer, $\bar{\tau}_{m}:=(\tau_{m-1}+\tau_{m})/2$, and $\mathcal{L}^{\bar{\tau}_{m}}_{\mathrm{QR}}(\theta_{m})$ denotes the quantile regression loss for a single atom. The distributional quantile critic can be plugged into existing actor-critic algorithms because only the critic modeling part is changed.
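To make (7) concrete, the following minimal NumPy sketch computes the quantile regression (pinball) loss for a single state-action pair, assuming the critic atoms and samples from the target distribution are given as arrays. The function name and example numbers are illustrative only, not the implementation used in the paper.

import numpy as np

def quantile_regression_loss(atoms, target_samples):
    """Quantile regression loss of Eq. (7) for a single (s, a) pair.

    atoms:          (M,) critic output theta_m(s, a), the quantile atoms
    target_samples: (N,) samples from the target distribution T^pi Z_{R,theta}(s, a)
    """
    M = atoms.shape[0]
    tau_bar = (np.arange(M) + 0.5) / M                  # midpoint quantile levels
    diff = target_samples[None, :] - atoms[:, None]     # residuals Z - theta_m, shape (M, N)
    pinball = diff * (tau_bar[:, None] - (diff < 0))    # rho_tau(x) = x * (tau - 1_{x<0})
    return pinball.mean(axis=1).sum()                   # mean over samples, sum over atoms

# Example with M = 4 atoms and a few target samples:
theta = np.array([0.1, 0.4, 0.9, 1.5])
targets = np.array([0.2, 0.7, 1.1])
print(quantile_regression_loss(theta, targets))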

3 Proposed Method

The proposed method comprises two key components: 1) a feasibility handling method required for multi-constrained safe RL problems and 2) a target distribution designed to minimize estimation bias. This section sequentially presents these components, followed by a detailed explanation of the proposed method.

Figure 1: Left: Gradient integration. Each constraint is truncated to be tangent to the trust region indicated by the ellipse, and the dashed straight lines show the truncated constraints. The solution of (8) is indicated in blue, pointing to the nearest point in the intersection of the truncated constraints. If the solution crosses the trust region, the parameters are updated in the clipped direction, shown in red. Right: Optimization paths of the proposed and naive methods in a toy example. The description is presented in Appendix A.2. The contour graph represents the objective of the toy example. The optimization paths differ because the naive method reflects only one constraint per update, whereas the proposed method considers all constraints at once.

3.1 Feasibility Handling For Multiple Constraints

An optimal safe RL policy can be found by iteratively solving the subproblem (5), but the feasible set of (5) can be empty in the infeasible starting cases. To address the feasibility issue in safe RL with multiple constraints, one of the violated constraints can be selected, and the policy updated to minimize that constraint until the feasible region of (5) is no longer empty (Xu et al., 2021), which we call the naive approach. However, it may be slow to reach the feasibility condition if only one constraint is used at each update step. Therefore, we propose a feasibility handling method which reflects all constraints simultaneously, called a gradient integration method. The main idea is to obtain a gradient that reduces the value of violated constraints while keeping unviolated constraints satisfied. To find such a gradient, the following quadratic program (QP) can be formulated by linearly approximating the constraints:

g^{*}=\underset{g}{\mathrm{argmin}}\;\frac{1}{2}g^{T}Hg\;\;\mathbf{s.t.}\;\;g_{k}^{T}g+c_{k}\leq 0\;\;\forall k,\quad c_{k}:=\mathrm{min}\left(\sqrt{2\epsilon g_{k}^{T}H^{-1}g_{k}},\,F_{k}(\pi_{\psi};\alpha)-d_{k}+\zeta\right), (8)

where $H$ is the Hessian of the KL divergence between the previous policy and the current policy with parameters $\psi$, $g_{k}$ is the gradient of the $k$th constraint, $c_{k}$ is a truncated threshold that makes the $k$th constraint tangent to the trust region, $\epsilon$ is the trust region size, and $\zeta\in\mathbb{R}_{>0}$ is a slack coefficient. The constraints are truncated to make the gradient integration method invariant to the gradient scale; otherwise, constraints with larger gradient scales might produce a dominant policy gradient. Finally, we update the policy parameters using the clipped gradient as follows:

\psi^{*}=\psi+\mathrm{min}\left(1,\sqrt{2\epsilon/(g^{*T}Hg^{*})}\right)g^{*}. (9)

Figure 1 illustrates the proposed gradient integration process. In summary, the policy is updated by solving (5); if there is no solution to (5), it is updated using the gradient integration method. Then, the policy can reach the feasibility condition within finite time steps.

Theorem 3.1.

Assume that the constraints are differentiable and convex, the gradients of the constraints are $L$-Lipschitz continuous, the eigenvalues of the Hessian are equal to or greater than a positive value $R\in\mathbb{R}_{>0}$, and $\{\psi\,|\,F_{k}(\pi_{\psi};\alpha)+\zeta<d_{k},\;\forall k\}\neq\emptyset$. Then, there exists $E\in\mathbb{R}_{>0}$ such that if $0<\epsilon\leq E$ and a policy is updated by the proposed gradient integration method, all constraints are satisfied within finite time steps.

Note that the first two assumptions of Theorem 3.1 are commonly used in multi-task learning (Liu et al., 2021; Yu et al., 2020; Navon et al., 2022), and the assumption on the eigenvalues is used in most trust region-based RL methods (Schulman et al., 2015; Kim and Oh, 2022a), so the assumptions in Theorem 3.1 can be considered reasonable. We provide the proof and show the existence of a solution of (8) in Appendix A.1. The proof shows that the constant $E$ is proportional to $\zeta$, which means that the trust region size should be set smaller as $\zeta$ decreases. We also analyze the worst-case time to satisfy all constraints by comparing the gradient integration method and the naive approach in Appendix A.3. In conclusion, if the policy update rule (5) is not feasible, a finite number of applications of the gradient integration method will make the policy feasible.
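To illustrate the recovery step concretely, the sketch below solves the QP in (8) with a general-purpose solver and then applies the clipped update (9). This is a simplified Python/SciPy illustration that treats the Hessian $H$, the constraint gradients $g_{k}$, and the truncated thresholds $c_{k}$ as small dense arrays; the function name is a placeholder, and in practice $H$ is typically accessed only through Hessian-vector products, so a dual formulation would be preferable.

import numpy as np
from scipy.optimize import minimize

def gradient_integration_step(psi, H, g_list, c_list, eps):
    """One recovery step: solve the QP in Eq. (8), then apply the clipped update (9).

    psi:    (d,)   current policy parameters
    H:      (d, d) Hessian of the KL divergence (assumed positive definite)
    g_list: list of (d,) constraint gradients g_k
    c_list: list of truncated thresholds c_k
    eps:    trust region size epsilon
    """
    d = psi.shape[0]
    # QP: minimize 0.5 * g^T H g  subject to  g_k^T g + c_k <= 0 for all k.
    constraints = [
        {"type": "ineq", "fun": lambda g, gk=gk, ck=ck: -(gk @ g + ck)}
        for gk, ck in zip(g_list, c_list)
    ]
    res = minimize(
        lambda g: 0.5 * g @ H @ g,
        x0=np.zeros(d),
        jac=lambda g: H @ g,
        constraints=constraints,
        method="SLSQP",
    )
    g_star = res.x
    # Clip the step so that it stays inside the trust region (Eq. (9)).
    step_scale = min(1.0, np.sqrt(2.0 * eps / (g_star @ H @ g_star + 1e-8)))
    return psi + step_scale * g_star

A dual formulation of (8) would involve only a system whose size equals the number of constraints, which is consistent with the complexity discussion in Section 6.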

3.2 TD($\lambda$) Target Distribution

The mean-std constraints can be estimated using the distributional quantile critics. Since the estimated constraints obtained from the critics are directly used to update policies in (5), estimating the constraints with low biases is crucial. In order to reduce the estimation bias of the critics, we propose a target distribution by observing that the TD($\lambda$) loss, which is obtained as a weighted sum of several quantile regression losses, is identical to the quantile regression loss with a single distribution. A recursive method is then introduced so that the target distribution can be obtained practically. First, the $n$-step targets for the current policy $\pi$ are estimated as follows, after collecting trajectories $(s_{t},a_{t},s_{t+1},...)$ with a behavioral policy $\mu$:

\hat{Z}^{(n)}_{R}(s_{t},a_{t}):\overset{D}{=}R_{t}+\gamma R_{t+1}+\cdots+\gamma^{n-1}R_{t+n-1}+\gamma^{n}Z_{R,\theta}(s_{t+n},\hat{a}_{t+n}), (10)

where $R_{t}=R(s_{t},a_{t},s_{t+1})$ and $\hat{a}_{t+n}\sim\pi(\cdot|s_{t+n})$. Note that the $n$-step target controls the bias-variance tradeoff through $n$. If $n$ is equal to $1$, the $n$-step target is equivalent to the temporal difference method, which has low variance but high bias. On the contrary, as $n$ increases to infinity, it becomes a Monte Carlo estimate, which has high variance but low bias. However, finding a proper $n$ is another cumbersome task. To alleviate this issue, the TD($\lambda$) method (Sutton, 1988) considers the discounted sum of all $n$-step targets. Similar to TD($\lambda$), we define the TD($\lambda$) loss for the distributional quantile critic as the discounted sum of all quantile regression losses with $n$-step targets. The TD($\lambda$) loss for a single atom is then approximated using importance sampling of the sampled $n$-step targets in (10) as:

\mathcal{L}^{\bar{\tau}_{m}}_{\mathrm{QR}}(\theta_{m})=(1-\lambda)\sum_{i=0}^{\infty}\lambda^{i}\mathbb{E}_{(s_{t},a_{t})\sim D}\left[\mathbb{E}_{Z\sim\mathcal{T}_{i+1}^{\pi}Z_{R,\theta}(s_{t},a_{t})}\left[\rho_{\bar{\tau}_{m}}(Z-\theta_{m}(s_{t},a_{t}))\right]\right] (11)
=(1-\lambda)\sum_{i=0}^{\infty}\lambda^{i}\mathbb{E}_{(s_{t},a_{t})\sim D}\left[\mathbb{E}_{Z\sim\mathcal{T}_{i+1}^{\mu}Z_{R,\theta}(s_{t},a_{t})}\left[\prod_{j=t+1}^{t+i}\frac{\pi(a_{j}|s_{j})}{\mu(a_{j}|s_{j})}\rho_{\bar{\tau}_{m}}(Z-\theta_{m}(s_{t},a_{t}))\right]\right]
\approx(1-\lambda)\sum_{i=0}^{\infty}\lambda^{i}\mathbb{E}_{(s_{t},a_{t})\sim D}\left[\prod_{j=t+1}^{t+i}\frac{\pi(a_{j}|s_{j})}{\mu(a_{j}|s_{j})}\sum_{m=1}^{M}\frac{1}{M}\rho_{\bar{\tau}_{m}}(\hat{Z}^{(i+1)}_{R,m}(s_{t},a_{t})-\theta_{m}(s_{t},a_{t}))\right],

where $\lambda$ is a trace-decay value, and $\hat{Z}^{(i)}_{R,m}$ is the $m$th atom of $\hat{Z}^{(i)}_{R}$. Since $\hat{Z}_{R}^{(i)}(s_{t},a_{t})\overset{D}{=}R_{t}+\gamma\hat{Z}_{R}^{(i-1)}(s_{t+1},a_{t+1})$ is satisfied, (11) is the same as the quantile regression loss with the following single distribution $\hat{Z}_{t}^{\mathrm{tot}}$, called a TD($\lambda$) target distribution:

\mathrm{Pr}(\hat{Z}_{t}^{\mathrm{tot}}=z):=\frac{1}{\mathcal{N}_{t}}(1-\lambda)\sum_{i=0}^{\infty}\lambda^{i}\prod_{j=t+1}^{t+i}\frac{\pi(a_{j}|s_{j})}{\mu(a_{j}|s_{j})}\sum_{m=1}^{M}\frac{1}{M}\delta_{\hat{Z}^{(i+1)}_{R,m}(s_{t},a_{t})}(z) (12)
=\frac{1-\lambda}{\mathcal{N}_{t}}\bigg(\sum_{m=1}^{M}\frac{1}{M}\delta_{\hat{Z}^{(1)}_{t,m}}(z)+\lambda\frac{\pi(a_{t+1}|s_{t+1})}{\mu(a_{t+1}|s_{t+1})}\sum_{i=0}^{\infty}\lambda^{i}\prod_{j=t+2}^{t+1+i}\frac{\pi(a_{j}|s_{j})}{\mu(a_{j}|s_{j})}\sum_{m=1}^{M}\frac{1}{M}\delta_{\hat{Z}^{(i+2)}_{R,m}(s_{t},a_{t})}(z)\bigg)
=\frac{1-\lambda}{\mathcal{N}_{t}}\underbrace{\sum_{m=1}^{M}\frac{1}{M}\delta_{\hat{Z}^{(1)}_{t,m}}(z)}_{\text{(a)}}+\lambda\frac{\mathcal{N}_{t+1}}{\mathcal{N}_{t}}\frac{\pi(a_{t+1}|s_{t+1})}{\mu(a_{t+1}|s_{t+1})}\underbrace{\mathrm{Pr}(R_{t}+\gamma\hat{Z}^{\mathrm{tot}}_{t+1}=z)}_{\text{(b)}},

where $\mathcal{N}_{t}$ is a normalization factor. If the target for time step $t+1$ is obtained, the target distribution for time step $t$ becomes the weighted sum of (a) the current one-step TD target and (b) the shifted previous target distribution, so it can be obtained recursively, as shown in (12). Since the definition requires infinite sums, the recursive way is more practical for computing the target. Nevertheless, to obtain the target in that recursive way, we need to store all quantile positions and weights for all time steps, which is not memory-efficient. Therefore, we propose to project the target distribution onto a quantile distribution with a specific number of atoms, $M^{\prime}$ (we set $M^{\prime}>M$ to reduce information loss). The overall process to get the TD($\lambda$) target distribution is illustrated in Figure 2, and the pseudocode is given in Appendix A.5. Furthermore, we can show that a distribution trained with the proposed target converges to the distribution of $Z_{R}^{\pi}$.

Theorem 3.2.

Define a distributional operator $\mathcal{T}_{\lambda}^{\mu,\pi}$ whose probability density function is:

\mathrm{Pr}(\mathcal{T}_{\lambda}^{\mu,\pi}Z(s,a)=z)\propto (13)
\sum_{i=0}^{\infty}\mathbb{E}_{\mu}\left[\lambda^{i}\prod_{j=1}^{i}\frac{\pi(a_{j}|s_{j})}{\mu(a_{j}|s_{j})}\underset{a^{\prime}\sim\pi(\cdot|s_{i+1})}{\mathbb{E}}\left[\mathrm{Pr}\left(\sum_{t=0}^{i}\gamma^{t}R_{t}+\gamma^{i+1}Z(s_{i+1},a^{\prime})=z\right)\right]\bigg|\,s_{0}=s,a_{0}=a\right].

Then, a sequence $Z_{k+1}(s,a)=\mathcal{T}_{\lambda}^{\mu,\pi}Z_{k}(s,a)\;\forall(s,a)$ converges to $Z_{R}^{\pi}$.

The TD($\lambda$) target is a quantile distribution version of the distributional operator $\mathcal{T}_{\lambda}^{\mu,\pi}$ in Theorem 3.2. Consequently, a distribution updated by minimizing the quantile regression loss with the TD($\lambda$) target converges to the distribution of $Z_{R}^{\pi}$ if the number of atoms is infinite, according to Theorem 3.2. The proof of Theorem 3.2 is provided in Appendix A.4. After calculating the target distribution for all time steps, the critic can be trained to reduce the quantile regression loss with the target distribution. To provide more insight, we experiment with a toy example in Appendix A.6, and the results show that the proposed target distribution can trade off bias and variance through the trace-decay $\lambda$.

Figure 2: Procedure for constructing the target distribution. First, multiply the target at step $t+1$ by $\gamma$ and add $R_{t}$. Next, weight-combine the shifted previous target and the one-step target at step $t$ and restore the CDF of the combined target. The CDF can be restored by sorting the positions of the atoms and then accumulating the weights at each atom position. Finally, the projected target can be obtained by finding the positions of the atoms corresponding to the $M^{\prime}$ quantiles in the CDF. Using the projected target, the target at step $t-1$ can be found recursively.
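The recursion in (12) and the projection procedure of Figure 2 can be sketched in a few lines of NumPy, shown below for a single trajectory. The function assumes that the atoms of $Z_{R,\theta}(s_{t+1},\hat{a}_{t+1})$ and the importance ratios have been precomputed; the array names and shapes are illustrative, and this is not the released implementation.

import numpy as np

def td_lambda_targets(rewards, next_atoms, rhos, gamma=0.99, lam=0.9, M_proj=64):
    """Recursive TD(lambda) target distributions for one trajectory (a sketch of Eq. (12)).

    rewards:    (T,)   rewards R_t
    next_atoms: (T, M) atoms of Z_theta(s_{t+1}, a_hat_{t+1}) for each step t
    rhos:       (T,)   importance ratios pi(a_t|s_t) / mu(a_t|s_t)
    Returns a (T, M_proj) array of projected target atoms, each with weight 1 / M_proj.
    """
    T, M = next_atoms.shape
    tau_bar = (np.arange(M_proj) + 0.5) / M_proj       # quantile midpoints for the projection
    targets = np.zeros((T, M_proj))
    prev = None                                        # projected target at step t + 1
    for t in reversed(range(T)):
        pos = rewards[t] + gamma * next_atoms[t]       # (a) one-step target atoms
        w = np.full(M, (1.0 - lam) / M)
        if prev is not None:
            shifted = rewards[t] + gamma * prev        # (b) shifted previous target
            w_shift = np.full(M_proj, lam * rhos[t + 1] / M_proj)
            pos = np.concatenate([pos, shifted])
            w = np.concatenate([w, w_shift])
        w /= w.sum()                                   # normalization factor N_t
        order = np.argsort(pos)                        # restore the CDF by sorting atoms
        pos, cdf = pos[order], np.cumsum(w[order])
        idx = np.clip(np.searchsorted(cdf, tau_bar), 0, len(pos) - 1)
        prev = pos[idx]                                # project onto M_proj quantiles
        targets[t] = prev
    return targets

Setting $\lambda=0$ recovers the one-step distributional Bellman target, while $\lambda$ close to 1 approaches a Monte Carlo estimate, matching the bias-variance discussion above.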

3.3 Safe Distributional Actor-Critic

Finally, we describe the proposed method, safe distributional actor-critic (SDAC). After collecting trajectories, the policy is updated by solving (5) through a line search (for more details, see Appendix A.8). The cost value and cost square value functions in (4) can be obtained using the distributional critics as follows:

Q_{C}^{\pi}(s,a)=\int_{-\infty}^{\infty}z\,\mathrm{Pr}(Z^{\pi}_{C}(s,a)=z)\,dz\approx\frac{1}{M}\sum_{m=1}^{M}\theta_{m}(s,a), (14)
S_{C}^{\pi}(s,a)=\int_{-\infty}^{\infty}z^{2}\,\mathrm{Pr}(Z^{\pi}_{C}(s,a)=z)\,dz\approx\frac{1}{M}\sum_{m=1}^{M}\theta_{m}(s,a)^{2}.
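As a minimal sketch of (14), the first and second moments of the cost return can be read off directly from the critic atoms and then plugged into the mean-std constraint (4). The atom values below are assumed placeholders rather than outputs of the actual critics.

import numpy as np

def cost_moments_from_atoms(atoms):
    """Estimate Q_C(s, a) and S_C(s, a) from the M quantile atoms, as in Eq. (14)."""
    q_c = atoms.mean()           # first moment:  (1/M) * sum_m theta_m(s, a)
    s_c = (atoms ** 2).mean()    # second moment: (1/M) * sum_m theta_m(s, a)^2
    return q_c, s_c

# Example with M = 5 assumed cost-return atoms:
atoms = np.array([0.0, 0.1, 0.2, 0.4, 0.8])
q_c, s_c = cost_moments_from_atoms(atoms)
print(q_c, s_c, np.sqrt(max(s_c - q_c ** 2, 0.0)))   # mean, second moment, std of the cost return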

If a solution of (5) does not exist, the policy is projected into a feasible region through the proposed gradient integration method. The critics are updated by minimizing the quantile regression loss (7) with the target distribution obtained from (12). The proposed method is summarized in Algorithm 1.

Algorithm 1 Safe Distributional Actor-Critic
  Input: Policy network $\pi_{\psi}$, reward and cost critic networks $Z_{R,\theta}^{\pi}$, $Z_{C_{k},\theta}^{\pi}$, and replay buffer $\mathcal{D}$.
  Initialize network parameters $\psi,\theta$ and replay buffer $\mathcal{D}$.
  for epochs $=1$ to $E$ do
     for $t=1$ to $T$ do
        Sample $a_{t}\sim\pi_{\psi}(\cdot|s_{t})$ and get $s_{t+1}$, $r_{t}=R(s_{t},a_{t},s_{t+1})$, and $c_{k,t}=C_{k}(s_{t},a_{t},s_{t+1})\;\forall k$.
        Store $(s_{t},a_{t},\pi_{\psi}(a_{t}|s_{t}),r_{t},c_{\{1,...,K\},t},s_{t+1})$ in $\mathcal{D}$.
     end for
     Calculate the TD($\lambda$) target distribution (Section 3.2) using $\mathcal{D}$ and update the critics to minimize the quantile loss defined in (7).
     Calculate the surrogates for the objective and constraints defined in (37) using $\mathcal{D}$.
     Update the policy by solving (5), but if (5) has no solution, take a recovery step (Section 3.1).
  end for

4 Related Work

Safe Reinforcement Learning. Garcıa and Fernández (2015) and Gu et al. (2022) have surveyed and categorized safe RL methodologies from various perspectives. In this paper, we group safe RL methods by how they update policies to reflect safety constraints. First, trust region-based safe RL methods (Achiam et al., 2017; Yang et al., 2020; Kim and Oh, 2022a) find policy update directions by approximating the safe RL problem as a linear-quadratic constrained linear program and update policies through a line search. Yang et al. (2020) also employ projection to meet a constraint; however, their method is limited to a single constraint and is not shown to satisfy the constraint in the infeasible starting case. Second, Lagrangian-based methods (Stooke et al., 2020; Yang et al., 2021; Liu et al., 2020) convert the safe RL problem to a dual problem and update the policy and dual variables simultaneously. Last, expectation-maximization (EM) based methods (Liu et al., 2022; Zhang et al., 2022) find non-parametric policy distributions by solving the safe RL problem in E-steps and fit parametric policies to the found non-parametric distributions in M-steps. There are also ways to reflect safety other than through policy updates. Qin et al. (2021); Lee et al. (2022) find optimal state or state-action distributions that satisfy constraints, and Bharadhwaj et al. (2021); Thananjeyan et al. (2021) reflect safety during exploration by executing only safe action candidates. In the experiments, only the safe RL methods of the policy update approach are compared with the proposed method.

Distributional TD($\lambda$). TD($\lambda$) (Precup et al., 2000) can be extended to the distributional critic to trade off bias and variance. Gruslys et al. (2018) have proposed a method to obtain target distributions by mixing $n$-step distributions, but the method is applicable only in discrete action spaces. Nam et al. (2021) have proposed a method to obtain target distributions using sampling to apply to continuous action spaces, but this is only for on-policy settings. A method proposed by Tang et al. (2022) updates the critics using newly defined distributional TD errors rather than target distributions. This method is applicable for off-policy settings but has the disadvantage that memory usage increases linearly with the number of TD error steps. In contrast to these methods, the proposed method is memory-efficient and applicable for continuous action spaces under off-policy settings.

Gradient Integration. The proposed feasibility handling method utilizes a gradient integration method, which is widely used in multi-task learning (MTL). The gradient integration method finds a single gradient that improves all tasks by using the gradients of all tasks. Yu et al. (2020) have proposed a projection-based gradient integration method, which is guaranteed to converge to Pareto-stationary sets. A method proposed by Liu et al. (2021) can reflect user preference, and Navon et al. (2022) proposed a gradient-scale invariant method to prevent the training process from being biased toward a few tasks. The proposed method can be viewed as a mixture of the projection and scale-invariant methods, as gradients are clipped and projected onto a trust region.

5 Experiments

We evaluate the safety performance of the proposed method in single- and multi-constrained robotic tasks. For single constraints, the agent performs four tasks provided by Safety Gym (Ray et al., 2019), and for multiple constraints, it performs bipedal and quadrupedal locomotion tasks.

5.1 Safety Gym

Tasks. We employ two robots, point and car, to perform goal and button tasks in the Safety Gym. The goal task is to control a robot toward a randomly spawned goal without passing through hazard regions. The button task is to click a randomly designated button using a robot, where not only hazard regions but also dynamic obstacles exist. Agents get a cost when touching undesignated buttons and obstacles or entering hazard regions. There is only one constraint for the Safety Gym tasks, and it is defined using (4) with the sum of costs. Constraint violations (CVs) are counted when the cost sum exceeds the threshold. For more details, see Appendix B.

Baselines. Safe RL methods based on various types of policy updates are used as baselines. For the trust region-based methods, we use constrained policy optimization (CPO) (Achiam et al., 2017) and off-policy trust-region CVaR (OffTRC) (Kim and Oh, 2022a), which extends CPO to an off-policy and mean-std constrained version. For the Lagrangian-based method, distributional worst-case soft actor-critic (WCSAC) (Yang et al., 2022) is used, and constrained variational policy optimization (CVPO) (Liu et al., 2022) based on the EM method is used. Specifically, WCSAC, OffTRC, and the proposed method, SDAC, use risk-averse constraints, so we experiment with those for $\alpha=0.25$ and $1.0$ (when $\alpha=1.0$, the constraint is identical to the risk-neutral constraint).

(a) Results of the final reward sum, cost rate, and total number of CVs. The number after the algorithm name in the legend indicates the $\alpha$ used for the risk-averse constraint.
(b) Training curves of the point goal task according to the trace-decay $\lambda$. The solid line represents the average value, and the shaded area shows half of the std value.
Figure 3: Safety Gym task results. The cost rates show the cost sums divided by the episode length, and the dashed black lines indicate the threshold of the constraint. All methods are trained with five random seeds.

Results. The graphs of the final reward sum, cost rate, and total number of CVs are shown in Figure 3(a), and the training curves are provided in Appendix C.1. We can interpret the results as good if the reward sum is high and the cost rate and total CVs are low. SDAC with $\alpha=0.25$ (risk-averse constraints) satisfies the constraints in all tasks and shows an average of 1.78 times fewer total CVs than the second-best algorithm. Nevertheless, since the reward sums are also in the middle or upper ranks, its safety performance is of high quality. SDAC with $\alpha=1.0$ (risk-neutral constraints) shows cost rates almost equal to the thresholds except for the car button task. In the car button task, the constraint is not satisfied, but by setting $\alpha=0.25$, SDAC achieves the lowest total CVs and the highest reward sum compared to the other methods. As for the reward sum, SDAC is the highest in the point goal and car button tasks, and WCSAC is the highest in the rest. However, WCSAC appears to lose its risk-averse properties, as the cost rates do not change significantly with $\alpha$. This is because WCSAC does not define constraints as risk measures of cost returns but as expectations of risk measures (Yang et al., 2022). OffTRC has lower safety performance than SDAC in most cases because, unlike SDAC, it does not use distributional critics. Finally, CVPO and CPO are on-policy methods, so they are less efficient than the other methods.

5.2 Locomotion Tasks

Tasks. The locomotion tasks are to train robots to follow $xy$-directional linear and $z$-directional angular velocity commands. Mini-Cheetah from MIT (Katz et al., 2019) and Laikago from Unitree (Wang, 2018) are used as quadrupedal robots, and Cassie from Agility Robotics (Xie et al., 2018) is used as a bipedal robot. In order to perform the locomotion tasks, robots should keep balancing, standing, and stamping their feet so that they can move in any direction. Therefore, we define three constraints. The first is to keep balance so that the body angle does not deviate from zero, the second is to keep the height of the CoM above a threshold, and the third is to match the current foot contact state with a predefined foot contact timing. The reward is defined as the negative $l^{2}$-norm of the difference between the commanded and current velocities. CVs are counted when the cost sum of at least one constraint exceeds its threshold. For more details, see Appendix B.

Baselines. The baseline methods are identical to those in the Safety Gym tasks, except that CVPO is excluded because it is technically challenging to scale to multiple constraints. We set $\alpha$ to $1$ for the risk-averse constrained methods (OffTRC, WCSAC, and SDAC) to focus on measuring multi-constraint handling performance.

(a) Training curves of the Cassie task.
(b) Training curves of the Laikago task.
(c) Training curves of the Mini-Cheetah task.
(d) Training curves of the naive and proposed methods for the Cassie task.
Figure 4: Locomotion task results. The black dashed lines indicate the thresholds, and the dotted lines in (d) represent the thresholds $+\,0.025$. The shaded area represents half of the standard deviation, and all methods are trained with five random seeds.

Results. Figure 4(a-c) presents the training curves. SDAC shows the highest reward sums and the lowest total CVs in all tasks. In particular, the number of steps required to satisfy all constraints is 1.93 times fewer than the second-best algorithm on average. The trust region methods (OffTRC, CPO) stably satisfy the constraints, but they are not efficient since they handle constraints by the naive approach. WCSAC, a Lagrangian method, fails to keep the constraints and shows the lowest reward sums. This is because the Lagrange multipliers can hinder training stability due to their concurrent update with the policy (Stooke et al., 2020).

5.3 Ablation Study

We conduct ablation studies to show whether the proposed target distribution lowers the estimation bias and whether the proposed gradient integration quickly converges to the feasibility condition. In Figure 3(b), the number of CVs is reduced as $\lambda$ increases, which means that the bias of the constraint estimation decreases. However, the score also decreases due to large variance, showing that $\lambda$ can adjust the bias-variance tradeoff. In Figure 4(d), the proposed gradient integration method is compared with the naive approach, which minimizes the constraints in order from the first to the third constraint, as described in Section 3.1. The proposed method reaches the feasibility condition faster than the naive approach and shows stable training curves because it reflects all constraints concurrently. Additionally, we analyze the distributional critics in Appendix C.2 and the hyperparameters, such as the trust region size, in Appendix C.3. Furthermore, we analyze the sensitivity of traditional RL algorithms to the reward configuration in Appendix D, emphasizing the advantage of safe RL that does not require reward tuning.

6 Limitation

A limitation of the proposed method is that the computational complexity of the gradient integration is proportional to the square of the number of constraints; a qualitative analysis is presented in Appendix E.1. We also conducted quantitative analyses in Appendix E.2 by measuring wall-clock training time. In the Mini-Cheetah task, which has three constraints, the training time of SDAC is the third fastest among the four safe RL algorithms. Gradient integration is not applied when the policy satisfies the constraints, so it may not constitute a significant proportion of training time. However, its influence can become dominant as the number of constraints increases. To resolve this limitation, the calculation can be sped up by stochastically selecting a subset of constraints (Liu et al., 2021) or by reducing the frequency of policy updates (Navon et al., 2022). The other limitation is that the mean-std measure defined in (2) is not a coherent risk measure. As a result, mean-std constraints serve to reduce uncertainty rather than risk, although we experimentally showed that constraint violations are efficiently reduced. To resolve this, we can use the CVaR constraint, which can be estimated using an auxiliary variable, as done by Chow et al. (2017). However, this solution can destabilize the training process due to the auxiliary variable, as observed in the experiments of Kim and Oh (2022b). Hence, a stabilization technique should be developed to employ the CVaR constraint.

7 Conclusion

We have presented a trust region-based safe distributional RL method, called SDAC. Through the locomotion tasks, it is verified that the proposed method efficiently satisfies multiple constraints using the gradient integration method. Moreover, constraints can be stably satisfied in various tasks due to the low-biased distributional critics trained using the proposed target distributions. In addition, the proposed method is analyzed from multiple perspectives through various ablation studies. However, to address the computational complexity, future work will devise efficient methods for handling large numbers of constraints.

Acknowledgments and Disclosure of Funding

This work was partly supported by Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2019-0-01190, [SW Star Lab] Robot Learning: Efficient, Safe, and Socially-Acceptable Machine Learning, 34%), Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2022R1A2C2008239, General-Purpose Deep Reinforcement Learning Using Metaverse for Real World Applications, 33%), and Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2021-0-01341, AI Graduate School Program, CAU, 33%).

References

  • Achiam et al. [2017] J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In Proceedings of International Conference on Machine Learning, pages 22–31, 2017.
  • Agarwal et al. [2021] A. Agarwal, S. M. Kakade, J. D. Lee, and G. Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1):4431–4506, 2021.
  • Altman [1999] E. Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • Bai et al. [2022] Q. Bai, A. S. Bedi, M. Agarwal, A. Koppel, and V. Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. Proceedings of the AAAI Conference on Artificial Intelligence, 36(4), 2022.
  • Bellemare et al. [2017] M. G. Bellemare, W. Dabney, and R. Munos. A distributional perspective on reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 449–458, 2017.
  • Bellemare et al. [2023] M. G. Bellemare, W. Dabney, and M. Rowland. Distributional Reinforcement Learning. MIT Press, 2023. http://www.distributional-rl.org.
  • Bharadhwaj et al. [2021] H. Bharadhwaj, A. Kumar, N. Rhinehart, S. Levine, F. Shkurti, and A. Garg. Conservative safety critics for exploration. In Proceedings of International Conference on Learning Representations, 2021.
  • Biewald [2020] L. Biewald. Experiment tracking with weights and biases, 2020. URL https://www.wandb.com/. Software available from wandb.com.
  • Chow et al. [2017] Y. Chow, M. Ghavamzadeh, L. Janson, and M. Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(1):6070–6120, 2017.
  • Dabney et al. [2018a] W. Dabney, G. Ostrovski, D. Silver, and R. Munos. Implicit quantile networks for distributional reinforcement learning. In Proceedings of International conference on machine learning, pages 1096–1105, 2018a.
  • Dabney et al. [2018b] W. Dabney, M. Rowland, M. Bellemare, and R. Munos. Distributional reinforcement learning with quantile regression. Proceedings of the AAAI Conference on Artificial Intelligence, 32(1), 2018b.
  • Dennis and Schnabel [1996] J. E. Dennis and R. B. Schnabel. Numerical Methods for Unconstrained Optimization and Nonlinear Equations. Society for Industrial and Applied Mathematics, 1996.
  • Garcıa and Fernández [2015] J. Garcıa and F. Fernández. A comprehensive survey on safe reinforcement learning. Journal of Machine Learning Research, 16(1):1437–1480, 2015.
  • Gruslys et al. [2018] A. Gruslys, W. Dabney, M. G. Azar, B. Piot, M. Bellemare, and R. Munos. The reactor: A fast and sample-efficient actor-critic agent for reinforcement learning. In Proceedings of International Conference on Learning Representations, 2018.
  • Gu et al. [2022] S. Gu, L. Yang, Y. Du, G. Chen, F. Walter, J. Wang, Y. Yang, and A. Knoll. A review of safe reinforcement learning: Methods, theory and applications. arXiv preprint arXiv:2205.10330, 2022.
  • Haarnoja et al. [2018] T. Haarnoja, A. Zhou, P. Abbeel, and S. Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of International Conference on Machine Learning, pages 1861–1870, 2018.
  • Katz et al. [2019] B. Katz, J. D. Carlo, and S. Kim. Mini cheetah: A platform for pushing the limits of dynamic quadruped control. In Proceedings of International Conference on Robotics and Automation, pages 6295–6301, 2019.
  • Kim and Oh [2022a] D. Kim and S. Oh. Efficient off-policy safe reinforcement learning using trust region conditional value at risk. IEEE Robotics and Automation Letters, 7(3):7644–7651, 2022a.
  • Kim and Oh [2022b] D. Kim and S. Oh. TRC: Trust region conditional value at risk for safe reinforcement learning. IEEE Robotics and Automation Letters, 7(2):2621–2628, 2022b.
  • Kuznetsov et al. [2020] A. Kuznetsov, P. Shvechikov, A. Grishin, and D. Vetrov. Controlling overestimation bias with truncated mixture of continuous distributional quantile critics. In Proceedings International Conference on Machine Learning, pages 5556–5566, 2020.
  • Lee et al. [2020] J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning quadrupedal locomotion over challenging terrain. Science Robotics, 5(47):eabc5986, 2020.
  • Lee et al. [2022] J. Lee, C. Paduraru, D. J. Mankowitz, N. Heess, D. Precup, K.-E. Kim, and A. Guez. COptiDICE: Offline constrained reinforcement learning via stationary distribution correction estimation. In Proceedings of International Conference on Learning Representations, 2022.
  • Liu et al. [2021] B. Liu, X. Liu, X. Jin, P. Stone, and Q. Liu. Conflict-averse gradient descent for multi-task learning. In Advances in Neural Information Processing Systems, pages 18878–18890, 2021.
  • Liu et al. [2020] Y. Liu, J. Ding, and X. Liu. IPO: Interior-point policy optimization under constraints. Proceedings of the AAAI Conference on Artificial Intelligence, 34(04):4940–4947, 2020.
  • Liu et al. [2022] Z. Liu, Z. Cen, V. Isenbaev, W. Liu, S. Wu, B. Li, and D. Zhao. Constrained variational policy optimization for safe reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 13644–13668, 2022.
  • Meng et al. [2022] W. Meng, Q. Zheng, Y. Shi, and G. Pan. An off-policy trust region policy optimization method with monotonic improvement guarantee for deep reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 33(5):2223–2235, 2022.
  • Merel et al. [2020] J. Merel, S. Tunyasuvunakool, A. Ahuja, Y. Tassa, L. Hasenclever, V. Pham, T. Erez, G. Wayne, and N. Heess. Catch & carry: Reusable neural controllers for vision-guided whole-body tasks. ACM Transactions on Graphics, 39(4), 2020.
  • Miki et al. [2022] T. Miki, J. Lee, J. Hwangbo, L. Wellhausen, V. Koltun, and M. Hutter. Learning robust perceptive locomotion for quadrupedal robots in the wild. Science Robotics, 7(62):eabk2822, 2022.
  • Nam et al. [2021] D. W. Nam, Y. Kim, and C. Y. Park. GMAC: A distributional perspective on actor-critic framework. In Proceedings of International Conference on Machine Learning, pages 7927–7936, 2021.
  • Navon et al. [2022] A. Navon, A. Shamsian, I. Achituve, H. Maron, K. Kawaguchi, G. Chechik, and E. Fetaya. Multi-task learning as a bargaining game. arXiv preprint arXiv:2202.01017, 2022.
  • Peng et al. [2021] X. B. Peng, Z. Ma, P. Abbeel, S. Levine, and A. Kanazawa. AMP: Adversarial motion priors for stylized physics-based character control. ACM Transactions on Graphics, 40(4), 2021.
  • Precup et al. [2000] D. Precup, R. S. Sutton, and S. P. Singh. Eligibility traces for off-policy policy evaluation. In Proceedings of International Conference on Machine Learning, pages 759–766, 2000.
  • Qin et al. [2021] Z. Qin, Y. Chen, and C. Fan. Density constrained reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 8682–8692, 2021.
  • Ray et al. [2019] A. Ray, J. Achiam, and D. Amodei. Benchmarking Safe Exploration in Deep Reinforcement Learning. 2019.
  • Rowland et al. [2018] M. Rowland, M. Bellemare, W. Dabney, R. Munos, and Y. W. Teh. An analysis of categorical distributional reinforcement learning. In Proceedings of International Conference on Artificial Intelligence and Statistics, pages 29–37, 2018.
  • Rudin et al. [2022] N. Rudin, D. Hoeller, P. Reist, and M. Hutter. Learning to walk in minutes using massively parallel deep reinforcement learning. In Proceedings of Conference on Robot Learning, pages 91–100, 2022.
  • Schulman et al. [2015] J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In Proceedings of International Conference on Machine Learning, pages 1889–1897, 2015.
  • Stooke et al. [2020] A. Stooke, J. Achiam, and P. Abbeel. Responsive safety in reinforcement learning by PID lagrangian methods. In Proceedings of International Conference on Machine Learning, pages 9133–9143, 2020.
  • Sutton [1988] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine learning, 3(1):9–44, 1988.
  • Tang et al. [2022] Y. Tang, R. Munos, M. Rowland, B. Avila Pires, W. Dabney, and M. Bellemare. The nature of temporal difference errors in multi-step distributional reinforcement learning. In Advances in Neural Information Processing Systems, pages 30265–30276, 2022.
  • Thananjeyan et al. [2021] B. Thananjeyan, A. Balakrishna, S. Nair, M. Luo, K. Srinivasan, M. Hwang, J. E. Gonzalez, J. Ibarz, C. Finn, and K. Goldberg. Recovery RL: Safe reinforcement learning with learned recovery zones. IEEE Robotics and Automation Letters, 6(3):4915–4922, 2021.
  • Todorov et al. [2012] E. Todorov, T. Erez, and Y. Tassa. MuJoCo: A physics engine for model-based control. In Proceedings of International Conference on Intelligent Robots and Systems, pages 5026–5033, 2012.
  • Wang [2018] X. Wang. Unitree-Laikago Pro. http://www.unitree.cc/e/action/ShowInfo.php?classid=6&id=355, 2018.
  • Xie et al. [2018] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne. Feedback control for cassie with deep reinforcement learning. In Proceedings of International Conference on Intelligent Robots and Systems, pages 1241–1246, 2018.
  • Xu et al. [2021] T. Xu, Y. Liang, and G. Lan. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In Proceedings of International Conference on Machine Learning, pages 11480–11491, 2021.
  • Yang et al. [2021] Q. Yang, T. D. Simão, S. H. Tindemans, and M. T. J. Spaan. WCSAC: Worst-case soft actor critic for safety-constrained reinforcement learning. Proceedings of the AAAI Conference on Artificial Intelligence, 35(12):10639–10646, 2021.
  • Yang et al. [2022] Q. Yang, T. D. Simão, S. H. Tindemans, and M. T. J. Spaan. Safety-constrained reinforcement learning with a distributional safety critic. Machine Learning, 112:859–887, 2022.
  • Yang et al. [2020] T.-Y. Yang, J. Rosca, K. Narasimhan, and P. J. Ramadge. Projection-based constrained policy optimization. In Proceedings of International Conference on Learning Representations, 2020.
  • Yu et al. [2020] T. Yu, S. Kumar, A. Gupta, S. Levine, K. Hausman, and C. Finn. Gradient surgery for multi-task learning. In Advances in Neural Information Processing Systems, pages 5824–5836, 2020.
  • Zhang et al. [2022] H. Zhang, Y. Lin, S. Han, S. Wang, and K. Lv. Conservative distributional reinforcement learning with safety constraints. arXiv preprint arXiv:2201.07286, 2022.
  • Zhang and Weng [2022] J. Zhang and P. Weng. Safe distributional reinforcement learning. In J. Chen, J. Lang, C. Amato, and D. Zhao, editors, Proceedings of International Conference on Distributed Artificial Intelligence, pages 107–128, 2022.

Appendix A Algorithm Details

A.1 Proof of Theorem 3.1

We denote the policy parameter space as $\Psi\subseteq\mathbb{R}^{d}$, the parameter at the $t$th iteration as $\psi_{t}\in\Psi$, the Hessian matrix as $H(\psi_{t})=\nabla_{\psi}^{2}D_{\mathrm{KL}}(\pi_{\psi_{t}}||\pi_{\psi})|_{\psi=\psi_{t}}$, and the $k$th constraint as $F_{k}(\psi_{t})=F_{k}(\pi_{\psi_{t}};\alpha)$. As we focus on the $t$th iteration, the following notations are used for brevity: $H=H(\psi_{t})$ and $g_{k}=\nabla F_{k}(\psi_{t})$. The proposed gradient integration at the $t$th iteration is defined as the following quadratic program (QP):

g_{t}=\underset{g}{\mathrm{argmin}}\;\frac{1}{2}g^{T}Hg\quad\mathrm{s.t.}\;\;g_{k}^{T}g+c_{k}\leq 0\;\;\forall k, (15)

where $c_{k}=\mathrm{min}(\sqrt{2\epsilon g_{k}^{T}H^{-1}g_{k}},F_{k}(\pi_{\psi})-d_{k}+\zeta)$. In the remainder of this section, we introduce the assumptions and new definitions, discuss the existence of a solution of (15), show the convergence to the feasibility condition for varying step sizes, and provide the proof of Theorem 3.1.

Assumption. 1) Each $F_{k}$ is differentiable and convex, 2) $\nabla F_{k}$ is $L$-Lipschitz continuous, 3) all eigenvalues of the Hessian matrix $H(\psi)$ are equal to or greater than $R\in\mathbb{R}_{>0}$ for $\forall\psi\in\Psi$, and 4) $\{\psi\,|\,F_{k}(\psi)+\zeta<d_{k}\;\mathrm{for}\;\forall k\}\neq\emptyset$.

Definition. Using the Cholesky decomposition, the Hessian matrix can be expressed as H=BBTH=B\cdot B^{T} where BB is a lower triangular matrix. By introducing new terms, g¯k:=B1gk\bar{g}_{k}:=B^{-1}g_{k} and bt:=BTgtb_{t}:=B^{T}g_{t}, the following is satisfied: gkTH1gk=g¯k22g_{k}^{T}H^{-1}g_{k}=||\bar{g}_{k}||_{2}^{2}. Additionally, we define the in-boundary and out-boundary sets as:

IBk:={ψ|Fk(ψ)dk+ζ2ϵFk(ψ)TH1(ψ)Fk(ψ)},\displaystyle\mathrm{IB}_{k}:=\left\{\psi|F_{k}(\psi)-d_{k}+\zeta\leq\sqrt{2\epsilon\nabla F_{k}(\psi)^{T}H^{-1}(\psi)\nabla F_{k}(\psi)}\right\},
OBk:={ψ|Fk(ψ)dk+ζ2ϵFk(ψ)TH1(ψ)Fk(ψ)}.\displaystyle\mathrm{OB}_{k}:=\left\{\psi|F_{k}(\psi)-d_{k}+\zeta\geq\sqrt{2\epsilon\nabla F_{k}(\psi)^{T}H^{-1}(\psi)\nabla F_{k}(\psi)}\right\}.

The minimum of g¯k||\bar{g}_{k}|| in OBk\mathrm{OB}_{k} is denoted as mkm_{k}, and the maximum of g¯k||\bar{g}_{k}|| in IBk\mathrm{IB}_{k} is denoted as MkM_{k}. Also, minkmk\mathrm{min}_{k}m_{k} and maxkMk\mathrm{max}_{k}M_{k} are denoted as mm and MM, respectively; mm is positive, as shown in Lemma A.1 below.

Lemma A.1.

For all kk, mkm_{k} is positive.

Proof.

Assume that there exists k{1,,K}k\in\{1,...,K\} such that mkm_{k} is equal to zero at a policy parameter ψOBk\psi^{*}\in\mathrm{OB}_{k}, i.e., Fk(ψ)=0||\nabla F_{k}(\psi^{*})||=0. Since FkF_{k} is convex, ψ\psi^{*} is a global minimum point of FkF_{k}, and the fourth assumption then implies minψFk(ψ)=Fk(ψ)<dkζ\mathrm{min}_{\psi}F_{k}(\psi)=F_{k}(\psi^{*})<d_{k}-\zeta. However, Fk(ψ)dkζF_{k}(\psi^{*})\geq d_{k}-\zeta as ψOBk\psi^{*}\in\mathrm{OB}_{k}, which is a contradiction. Hence, mkm_{k} is positive for every kk, and the minimum of mkm_{k} over kk is also positive. ∎

Lemma A.2.

A solution of (15) always exists.

Proof.

There exists a policy parameter ψ^{ψ|Fk(ψ)+ζ<dkfork}\hat{\psi}\in\{\psi|F_{k}(\psi)+\zeta<d_{k}\;\mathrm{for}\;\forall k\} due to the assumptions. Let g=ψψtg=\psi-\psi_{t}. Then, the following inequality holds.

gkT(ψψt)+ck\displaystyle g_{k}^{T}(\psi-\psi_{t})+c_{k} gkT(ψψt)+Fk(ψt)+ζdkFk(ψ)+ζdk.\displaystyle\leq g_{k}^{T}(\psi-\psi_{t})+F_{k}(\psi_{t})+\zeta-d_{k}\leq F_{k}(\psi)+\zeta-d_{k}. (Fk is convex.)\displaystyle(\because F_{k}\text{ is convex.})
gkT(ψ^ψt)+ck\displaystyle\Rightarrow g_{k}^{T}(\hat{\psi}-\psi_{t})+c_{k} Fk(ψ^)+ζdk<0 for k.\displaystyle\leq F_{k}(\hat{\psi})+\zeta-d_{k}<0\text{ for }\forall k.

Since ψ^ψt\hat{\psi}-\psi_{t} satisfies all constraints of (15), the feasible set is non-empty and convex. Also, HH is positive definite, so the QP has a unique solution. ∎

Lemma A.2 shows the existence of a solution of (15). Now, we show the convergence of the proposed gradient integration method in the case of varying step sizes.

Lemma A.3.

If 2ϵMζ\sqrt{2\epsilon}M\leq\zeta and a policy is updated by ψt+1=ψt+βtgt\psi_{t+1}=\psi_{t}+\beta_{t}g_{t}, where 0<βt<22ϵmRLbt20<\beta_{t}<\frac{2\sqrt{2\epsilon}mR}{L||b_{t}||^{2}} and βt1\beta_{t}\leq 1, the policy satisfies Fk(ψ)dkF_{k}(\psi)\leq d_{k}\; for k\forall k within a finite time.

Proof.

We can reformulate the step size as βt=22ϵmRLbt2βt\beta_{t}=\frac{2\sqrt{2\epsilon}mR}{L||b_{t}||^{2}}\beta^{\prime}_{t}, where βtLbt222ϵmR\beta^{\prime}_{t}\leq\frac{L||b_{t}||^{2}}{2\sqrt{2\epsilon}mR} and 0<βt<10<\beta^{\prime}_{t}<1. Since the eigenvalues of HH are greater than or equal to RR and HH is symmetric and positive definite, 1RIH1\frac{1}{R}I-H^{-1} is positive semi-definite. Hence, xTH1x1Rx2x^{T}H^{-1}x\leq\frac{1}{R}||x||^{2} is satisfied. Using this fact, the following inequality holds:

Fk(ψt+βtgt)Fk(ψt)\displaystyle F_{k}(\psi_{t}+\beta_{t}g_{t})-F_{k}(\psi_{t}) βtFk(ψt)Tgt+L2βtgt2\displaystyle\leq\beta_{t}\nabla F_{k}(\psi_{t})^{T}g_{t}+\frac{L}{2}||\beta_{t}g_{t}||^{2} (Fkis L-Lipschitz continuous.)\displaystyle(\because\nabla F_{k}\;\text{is $L$-Lipschitz continuous.})
=βtgkTgt+L2βt2gt2\displaystyle=\beta_{t}g_{k}^{T}g_{t}+\frac{L}{2}\beta_{t}^{2}||g_{t}||^{2}
=βtgkTgt+L2βt2btTH1bt\displaystyle=\beta_{t}g_{k}^{T}g_{t}+\frac{L}{2}\beta_{t}^{2}b_{t}^{T}H^{-1}b_{t} (gt=BTbt)\displaystyle(\because g_{t}=B^{-T}b_{t})
βtck+L2Rβt2bt2.\displaystyle\leq-\beta_{t}c_{k}+\frac{L}{2R}\beta_{t}^{2}||b_{t}||^{2}. (gkTgt+ck0)\displaystyle(\because g_{k}^{T}g_{t}+c_{k}\leq 0)

Now, we will show that ψ\psi enters IBk\mathrm{IB}_{k} in a finite time for ψOBk\forall\psi\in\mathrm{OB}_{k} and that the kkth constraint is satisfied for ψIBk\forall\psi\in\mathrm{IB}_{k}. Thus, we divide into two cases, 1) ψtOBk\psi_{t}\in\mathrm{OB}_{k} and 2) ψtIBk\psi_{t}\in\mathrm{IB}_{k}. For the first case, ck=2ϵg¯kc_{k}=\sqrt{2\epsilon}||\bar{g}_{k}||, so the following inequality holds:

Fk(ψt+βtgt)Fk(ψt)\displaystyle F_{k}(\psi_{t}+\beta_{t}g_{t})-F_{k}(\psi_{t}) βt(2ϵg¯k+L2Rβtbt2)\displaystyle\leq\beta_{t}\left(-\sqrt{2\epsilon}||\bar{g}_{k}||+\frac{L}{2R}\beta_{t}||b_{t}||^{2}\right) (16)
βt2ϵ(g¯k+mβt)\displaystyle\leq\beta_{t}\sqrt{2\epsilon}\left(-||\bar{g}_{k}||+m\beta^{\prime}_{t}\right)
βt2ϵm(βt1)<0.\displaystyle\leq\beta_{t}\sqrt{2\epsilon}m(\beta^{\prime}_{t}-1)<0.

According to (16), the value of FkF_{k} strictly decreases with each update step. Hence, ψt\psi_{t} can reach IBk\mathrm{IB}_{k} by repeatedly updating the policy. We now check whether the constraint is satisfied in the second case, for which the following inequality holds by applying ck=Fk(ψt)dk+ζc_{k}=F_{k}(\psi_{t})-d_{k}+\zeta:

Fk(ψt+βtgt)Fk(ψt)βtdkβtFk(ψt)βtζ+L2Rβt2bt2\displaystyle F_{k}(\psi_{t}+\beta_{t}g_{t})-F_{k}(\psi_{t})\leq\beta_{t}d_{k}-\beta_{t}F_{k}(\psi_{t})-\beta_{t}\zeta+\frac{L}{2R}\beta_{t}^{2}||b_{t}||^{2}
\displaystyle\Rightarrow Fk(ψt+βtgt)dk(1βt)(Fk(ψt)dk)+βt(ζ+2ϵmβt).\displaystyle F_{k}(\psi_{t}+\beta_{t}g_{t})-d_{k}\leq(1-\beta_{t})(F_{k}(\psi_{t})-d_{k})+\beta_{t}(-\zeta+\sqrt{2\epsilon}m\beta^{\prime}_{t}).

Since ψtIBk\psi_{t}\in\mathrm{IB}_{k},

Fk(ψt)dk2ϵg¯kζ2ϵMζ0.F_{k}(\psi_{t})-d_{k}\leq\sqrt{2\epsilon}||\bar{g}_{k}||-\zeta\leq\sqrt{2\epsilon}M-\zeta\leq 0.

Since mMm\leq M and βt<1\beta^{\prime}_{t}<1,

ζ+2ϵmβt<ζ+2ϵM0.-\zeta+\sqrt{2\epsilon}m\beta^{\prime}_{t}<-\zeta+\sqrt{2\epsilon}M\leq 0.

Hence, Fk(ψt+βtgt)dkF_{k}(\psi_{t}+\beta_{t}g_{t})\leq d_{k}, which means that the kkth constraint is satisfied if ψtIBk\psi_{t}\in\mathrm{IB}_{k}. As ψt\psi_{t} reaches IBk\mathrm{IB}_{k} for k\forall k within a finite time according to (16), the policy can satisfy all constraints within a finite time. ∎

Lemma A.3 shows the convergence to the feasibility condition in the case of varying step sizes. We next introduce a lemma showing that bt||b_{t}|| is bounded by a constant multiple of ϵ\sqrt{\epsilon} and finally provide the proof of Theorem 3.1, which can be considered a special case of varying step sizes.

Lemma A.4.

There exists T>0T\in\mathbb{R}_{>0} such that btTϵ||b_{t}||\leq T\sqrt{\epsilon}.

Proof.

Let us define the following sets:

I\displaystyle I :={k|Fk(ψt)+ζdk<0},O:={k|Fk(ψt)+ζdk>0},\displaystyle:=\{k|F_{k}(\psi_{t})+\zeta-d_{k}<0\},\;O:=\{k|F_{k}(\psi_{t})+\zeta-d_{k}>0\}, (17)
U\displaystyle U :={k|Fk(ψt)+ζdk=0},C(ϵ):={g|gkTg+ck(ϵ)0k},\displaystyle:=\{k|F_{k}(\psi_{t})+\zeta-d_{k}=0\},\;C(\epsilon):=\{g|g_{k}^{T}g+c_{k}(\epsilon)\leq 0\;\forall k\},
IG\displaystyle I_{G} :={g|gkTg+Fk(ψt)+ζdk0kI},\displaystyle:=\{g|g_{k}^{T}g+F_{k}(\psi_{t})+\zeta-d_{k}\leq 0\;\forall k\in I\},

where ck(ϵ)=min(2ϵgkTH1gk,Fk(πψ)dk+ζ)c_{k}(\epsilon)=\mathrm{min}(\sqrt{2\epsilon g_{k}^{T}H^{-1}g_{k}},F_{k}(\pi_{\psi})-d_{k}+\zeta). Using these sets, the following vectors can be defined: g(ϵ):=argmingC(ϵ)12gTHgg(\epsilon):=\mathrm{argmin}_{g\in C(\epsilon)}\frac{1}{2}g^{T}Hg, b(ϵ):=BTg(ϵ).b(\epsilon):=B^{T}g(\epsilon). Now, we will show that b(ϵ)||b(\epsilon)|| is bounded above and b(ϵ)ϵ||b(\epsilon)||\propto\sqrt{\epsilon} for sufficiently small ϵ>0\epsilon>0.

First, the following is satisfied for a sufficiently large ϵ\epsilon:

C(ϵ)={g|gkTg+Fk(ψt)+ζdk0k}.C(\epsilon)=\{g|g_{k}^{T}g+F_{k}(\psi_{t})+\zeta-d_{k}\leq 0\;\forall k\}. (18)

Since ψ^ψtC(ϵ)\hat{\psi}-\psi_{t}\in C(\epsilon) for ϵ\forall\epsilon, where ψ^\hat{\psi} is defined in Lemma A.2, we have 12b(ϵ)2=12g(ϵ)THg(ϵ)12(ψ^ψt)TH(ψ^ψt)\frac{1}{2}||b(\epsilon)||^{2}=\frac{1}{2}g(\epsilon)^{T}Hg(\epsilon)\leq\frac{1}{2}(\hat{\psi}-\psi_{t})^{T}H(\hat{\psi}-\psi_{t}). Therefore, b(ϵ)||b(\epsilon)|| is bounded above.

Second, let us define the following trust region size:

ϵ^:=12(minkOFk(ψt)+ζdkgk¯)2>0.\hat{\epsilon}:=\frac{1}{2}\left(\mathrm{min}_{k\in O}\frac{F_{k}(\psi_{t})+\zeta-d_{k}}{||\bar{g_{k}}||}\right)^{2}>0. (19)

If ϵϵ^\epsilon\leq\hat{\epsilon}, the following is satisfied:

C(ϵ)=IG{g|gkTg+2ϵgk¯0kO,gkTg0kU}ϕ.C(\epsilon)=I_{G}\cap\{g|g_{k}^{T}g+\sqrt{2\epsilon}||\bar{g_{k}}||\leq 0\;\forall k\in O,\;g_{k}^{T}g\leq 0\;\forall k\in U\}\neq\phi. (20)

Thus, OG(ϵ):={g|gkTg+2ϵgk¯0kO,gkTg0kU}O_{G}(\epsilon):=\{g|g_{k}^{T}g+\sqrt{2\epsilon}||\bar{g_{k}}||\leq 0\;\forall k\in O,\;g_{k}^{T}g\leq 0\;\forall k\in U\} is not empty. If we define g^(ϵ):=argmingOG(ϵ)12gTHg\hat{g}(\epsilon):=\mathrm{argmin}_{g\in O_{G}(\epsilon)}\frac{1}{2}g^{T}Hg, the following is satisfied:

g^(ϵ)=ϵϵ^g^(ϵ^)for 0ϵϵ^.\hat{g}(\epsilon)=\sqrt{\frac{\epsilon}{\hat{\epsilon}}}\hat{g}(\hat{\epsilon})\;\text{for}\;0\leq\epsilon\leq\hat{\epsilon}. (21)

Then, if ϵϵ^(minkI(dkζFk(ψt))/(gkTg^(ϵ^)))2\epsilon\leq\hat{\epsilon}\left(\mathrm{min}_{k\in I}(d_{k}-\zeta-F_{k}(\psi_{t}))/(g_{k}^{T}\hat{g}(\hat{\epsilon}))\right)^{2}, g^(ϵ)=g(ϵ)\hat{g}(\epsilon)=g(\epsilon) since g^(ϵ)IG\hat{g}(\epsilon)\in I_{G}. Consequently, by defining a trust region size:

ϵ:=ϵ^min(1,(minkIdkζFk(ψt)gkTg^(ϵ^))2)>0,\epsilon^{*}:=\hat{\epsilon}\cdot\mathrm{min}\left(1,\left(\mathrm{min}_{k\in I}\frac{d_{k}-\zeta-F_{k}(\psi_{t})}{g_{k}^{T}\hat{g}(\hat{\epsilon})}\right)^{2}\right)>0, (22)

g(ϵ)=ϵ/ϵ^g^(ϵ^)g(\epsilon)=\sqrt{\epsilon/\hat{\epsilon}}\hat{g}(\hat{\epsilon}) for ϵϵ\epsilon\leq\epsilon^{*}. Therefore, b(ϵ)ϵ||b(\epsilon)||\propto\sqrt{\epsilon} if ϵϵ\epsilon\leq\epsilon^{*}.

Finally, since b(ϵ)||b(\epsilon)|| is bounded above and proportional to ϵ\sqrt{\epsilon} for sufficiently small ϵ\epsilon, there exists a constant TT such that b(ϵ)Tϵ||b(\epsilon)||\leq T\sqrt{\epsilon}. ∎

We are now ready to prove Theorem 3.1.

Proof.

The proposed step size is βt=min(1,2ϵ/bt)\beta_{t}=\mathrm{min}(1,\sqrt{2\epsilon}/||b_{t}||), and the sufficient conditions that guarantee convergence according to Lemma A.3 are the following:

2ϵMζand 0<βt1andβt<22ϵmRLbt2.\sqrt{2\epsilon}M\leq\zeta\;\mathrm{and}\;0<\beta_{t}\leq 1\;\mathrm{and}\;\beta_{t}<\frac{2\sqrt{2\epsilon}mR}{L||b_{t}||^{2}}.

The second condition is self-evident. To satisfy the third condition, the proposed step size βt\beta_{t} should satisfy the following:

2ϵbt<22ϵmRLbt2bt<2mRL.\displaystyle\frac{\sqrt{2\epsilon}}{||b_{t}||}<\frac{2\sqrt{2\epsilon}mR}{L||b_{t}||^{2}}\;\Leftrightarrow\;||b_{t}||<\frac{2mR}{L}.

If ϵ<4((mR)/(LT))2\epsilon<4((mR)/(LT))^{2}, the following inequality holds:

ϵ<2mRLTbtTϵ<2mRL.\displaystyle\sqrt{\epsilon}<\frac{2mR}{LT}\;\Rightarrow\;||b_{t}||\leq T\sqrt{\epsilon}<\frac{2mR}{L}. (Lemma A.4.)\displaystyle(\because\;\text{Lemma \ref{lemma:b_k}}.)

Hence, if ϵE=12min(ζ22M2,4(mRLT)2)\epsilon\leq E=\frac{1}{2}\mathrm{min}(\frac{\zeta^{2}}{2M^{2}},4(\frac{mR}{LT})^{2}), the sufficient conditions are satisfied. ∎

A.2 Toy Example for Gradient Integration Method

The problem of the toy example in Figure 1 is defined as:

minimizex1,x2(3x1+x2+2)2+4(x13x2+4)2𝐬.𝐭.x10,x12x20,\displaystyle\underset{x_{1},x_{2}}{\mathrm{minimize}}\sqrt{(\sqrt{3}x_{1}+x_{2}+2)^{2}+4(x_{1}-\sqrt{3}x_{2}+4)^{2}}\quad\mathbf{s.t.}\;x_{1}\geq 0,\;x_{1}-2x_{2}\leq 0, (23)

where there are two linear constraints. Both the naive and the gradient integration methods start from the initial point x1=2.5x_{1}=-2.5 and x2=3.0x_{2}=-3.0, which does not satisfy the two constraints. We use the identity matrix as the Hessian of the trust region and set the trust region size to 0.50.5 in both methods. The naive method reduces the violated constraints one at a time, in order from the first to the second constraint.
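For concreteness, the following is a minimal Python sketch of applying the gradient integration step (15) with the step size of Theorem 3.1 to this toy example. The slack coefficient value and the use of SciPy's SLSQP solver for the QP are assumptions made for illustration.

import numpy as np
from scipy.optimize import minimize

# Toy problem (23): two linear constraints F1(x) = -x1 <= 0 and F2(x) = x1 - 2*x2 <= 0,
# i.e., thresholds d_k = 0. The objective is not needed for the feasibility-recovery step.
grads = np.array([[-1.0, 0.0],    # gradient of F1
                  [1.0, -2.0]])   # gradient of F2

def constraint_values(x):
    return np.array([-x[0], x[0] - 2.0 * x[1]])

H = np.eye(2)     # Hessian of the trust region (identity in this toy example)
eps = 0.5         # trust region size
zeta = 1e-3       # slack coefficient (assumed; any small positive value works here)

def gradient_integration_step(x):
    F = constraint_values(x)
    # c_k = min(sqrt(2 * eps * g_k^T H^{-1} g_k), F_k - d_k + zeta) as in (15).
    c = np.minimum(np.sqrt(2.0 * eps * np.einsum('ki,ij,kj->k', grads, np.linalg.inv(H), grads)),
                   F + zeta)
    # Solve QP (15): min 0.5 g^T H g  s.t.  g_k^T g + c_k <= 0 for all k.
    res = minimize(lambda g: 0.5 * g @ H @ g, np.zeros(2), jac=lambda g: H @ g,
                   method='SLSQP',
                   constraints=[{'type': 'ineq',
                                 'fun': lambda g: -(grads @ g + c),
                                 'jac': lambda g: -grads}])
    g_t = res.x
    # Step size from Theorem 3.1: beta_t = min(1, sqrt(2 * eps) / ||b_t||).
    b_norm = np.sqrt(g_t @ H @ g_t)
    return min(1.0, np.sqrt(2.0 * eps) / max(b_norm, 1e-12)) * g_t

x = np.array([-2.5, -3.0])              # infeasible initial point
while (constraint_values(x) > 0.0).any():
    x = x + gradient_integration_step(x)
print(x, constraint_values(x))          # both constraint values are now <= 0

The naive method can be obtained from the same sketch by keeping only one violated constraint in the QP at each update.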

A.3 Analysis of Worst-Case Time to Satisfy All Constraints

To analyze the sample complexity, we consider a tabular MDP and use softmax policy parameterization as follows (for more details, see [Xu et al., 2021]):

πψ(a|s):=expψ(s,a)aexpψ(s,a)(s,a)S×A.\pi_{\psi}(a|s):=\frac{\exp{\psi(s,a)}}{\sum_{a^{\prime}}\exp{\psi(s,a^{\prime})}}\;\forall(s,a)\in S\times A. (24)

According to Agarwal et al. [2021], the natural policy gradient (NPG) update is as follows:

ψt+1=ψt+βAπψt,πψt+1(a|s)=πψt(a|s)exp(βAπψt(s,a))Zt(s),\psi_{t+1}=\psi_{t}+\beta A^{\pi_{\psi_{t}}},\;\pi_{\psi_{t+1}}(a|s)=\pi_{\psi_{t}}(a|s)\frac{\exp(\beta A^{\pi_{\psi_{t}}}(s,a))}{Z_{t}(s)}, (25)

where β\beta is a step size, Aπψt|S||A|A^{\pi_{\psi_{t}}}\in\mathbb{R}^{|S||A|} is the vector expression of the advantage function, and Zt(s)=aπψt(a|s)exp(βAπψt(s,a))/(1γ)Z_{t}(s)=\sum_{a}\pi_{\psi_{t}}(a|s)\exp{(\beta A^{\pi_{\psi_{t}}}(s,a))/(1-\gamma)}. Analyzing the sample complexity of trust region-based methods is challenging since their step size is not fixed, so we modify the gradient integration method to use the NPG as follows:

g=argming12gTHgs.t.gkTg+ck0k{k|Fk(πψt;α)>dk},\displaystyle g^{*}=\mathrm{argmin}_{g}\frac{1}{2}g^{T}Hg\;\mathrm{s.t.}\;g_{k}^{T}g+c_{k}\leq 0\;\forall k\in\{k|F_{k}(\pi_{\psi_{t}};\alpha)>d_{k}\}, (26)
ψt+1=ψt+βg/g2.\displaystyle\psi_{t+1}=\psi_{t}+\beta g^{*}/||g^{*}||_{2}.

In the remainder, we abbreviate πψt\pi_{\psi_{t}}, AπψtA^{\pi_{\psi_{t}}}, and Fk(πψt;α)F_{k}(\pi_{\psi_{t}};\alpha) as πt\pi_{t}, AtA^{t}, and Fk(πt)F_{k}(\pi_{t}), respectively. Since gg^{*} always exists due to Lemma A.2, we can write the policy update using Lagrange multipliers λkt0\lambda_{k}^{t}\geq 0 as follows:

ψt+1=ψtβkλktACkt/Wt,Wt:=kλktACkt2,\displaystyle\psi_{t+1}=\psi_{t}-\beta\sum_{k}\lambda^{t}_{k}A_{C_{k}}^{t}/W_{t},\;W_{t}:=||\sum_{k}\lambda^{t}_{k}A_{C_{k}}^{t}||_{2}, (27)
πt+1(a|s)=πt(a|s)exp(βWtkλktACkt(s,a))/Zt(s),\displaystyle\pi_{t+1}(a|s)=\pi_{t}(a|s)\exp{\left(-\frac{\beta}{W_{t}}\sum_{k}\lambda^{t}_{k}A_{C_{k}}^{t}(s,a)\right)}/Z_{t}(s),

where Zt(s)Z_{t}(s) is a normalization factor, and λkt=0\lambda_{k}^{t}=0 for Fk(πt)dkF_{k}(\pi_{t})\leq d_{k}. The naive approach can also be written as above, except that λt\lambda^{t} is a one-hot vector whose iith entry λit\lambda_{i}^{t} is one only for the randomly selected constraint. Then, we can get the following:

kλkt\displaystyle\sum_{k}\lambda^{t}_{k} (Fk(πt+1)Fk(πt))/Wt=11γ𝔼sdπt+1[aπt+1(a|s)kλktACkt(s,a)/Wt]\displaystyle(F_{k}(\pi_{t+1})-F_{k}(\pi_{t}))/W_{t}=\frac{1}{1-\gamma}\mathbb{E}_{s\sim d^{\pi_{t+1}}}\left[\sum_{a}\pi_{t+1}(a|s)\sum_{k}\lambda^{t}_{k}A_{C_{k}}^{t}(s,a)/W_{t}\right] (28)
=1β(1γ)𝔼sdπt+1[aπt+1(a|s)logπt+1(a|s)Zt(s)πt(a|s)]\displaystyle=-\frac{1}{\beta(1-\gamma)}\mathbb{E}_{s\sim d^{\pi_{t+1}}}\left[\sum_{a}\pi_{t+1}(a|s)\log\frac{\pi_{t+1}(a|s)Z_{t}(s)}{\pi_{t}(a|s)}\right]
=1β(1γ)𝔼sdπt+1[DKL(πt+1(|s)||πt(|s))+logZt(s)]\displaystyle=-\frac{1}{\beta(1-\gamma)}\mathbb{E}_{s\sim d^{\pi_{t+1}}}\left[D_{\mathrm{KL}}(\pi_{t+1}(\cdot|s)||\pi_{t}(\cdot|s))+\log Z_{t}(s)\right]
1β(1γ)𝔼sdπt+1[logZt(s)](DKL(π||π)0)\displaystyle\leq-\frac{1}{\beta(1-\gamma)}\mathbb{E}_{s\sim d^{\pi_{t+1}}}\left[\log{Z_{t}(s)}\right]\;\;\;\quad\qquad\qquad\qquad\qquad(\because D_{\mathrm{KL}}(\pi^{\prime}||\pi)\geq 0)
1β𝔼sρ[logZt(s)],(dπ(s)(1γ)ρ(s) and logZt(s)0)\displaystyle\leq-\frac{1}{\beta}\mathbb{E}_{s\sim\rho}\left[\log{Z_{t}(s)}\right],\qquad\qquad\qquad\qquad(\because d^{\pi}(s)\geq(1-\gamma)\rho(s)\text{ and }\log Z_{t}(s)\geq 0)

We can also get the following by using Lemma 7 in Xu et al. [2021]:

kλkt(Fk(π)Fk(πt))/Wt\displaystyle\sum_{k}\lambda^{t}_{k}(F_{k}(\pi_{*})-F_{k}(\pi_{t}))/W_{t}\geq 1β(1γ)𝔼sd[DKL(π||πt)DKL(π||πt+1)]\displaystyle-\frac{1}{\beta(1-\gamma)}\mathbb{E}_{s\sim d^{*}}\left[D_{\mathrm{KL}}(\pi_{*}||\pi_{t})-D_{\mathrm{KL}}(\pi_{*}||\pi_{t+1})\right] (29)
kλkt2βCmax(1γ)2Wt,\displaystyle-\sum_{k}\lambda^{t}_{k}\frac{2\beta C_{\mathrm{max}}}{(1-\gamma)^{2}W_{t}},

where π\pi_{*} is an optimal policy, and CmaxC_{\mathrm{max}} is the maximum value of the costs. If λkt>0\lambda^{t}_{k}>0, then Fk(πt)Fk(π)>ζF_{k}(\pi_{t})-F_{k}(\pi_{*})>\zeta. Thus, kλkt(Fk(π)Fk(πt))ζkλkt\sum_{k}\lambda^{t}_{k}(F_{k}(\pi_{*})-F_{k}(\pi_{t}))\leq-\zeta\sum_{k}\lambda^{t}_{k}. If the policy does not satisfy the constraints until step TT, the following inequality holds by summing the above inequalities from t=0t=0 to TT:

β(1γ)(ζ2βCmax(1γ)2)t=0Tiλit/Wt𝔼sd[DKL(π||π0)].\beta(1-\gamma)(\zeta-\frac{2\beta C_{\mathrm{max}}}{(1-\gamma)^{2}})\sum_{t=0}^{T}\sum_{i}\lambda^{t}_{i}/W_{t}\leq\mathbb{E}_{s\sim d^{*}}\left[D_{\mathrm{KL}}(\pi_{*}||\pi_{0})\right]. (30)

Let us denote 1Tt=0Tiλit/Wt\frac{1}{T}\sum_{t=0}^{T}\sum_{i}\lambda^{t}_{i}/W_{t} as 𝔼t[iλit/Wt]\mathbb{E}_{t}[\sum_{i}\lambda^{t}_{i}/W_{t}], and note that Wt=kλktACkt2kλkt2|S||A|Cmax/(1γ)W_{t}=||\sum_{k}\lambda^{t}_{k}A_{C_{k}}^{t}||_{2}\leq\sum_{k}\lambda^{t}_{k}2|S||A|C_{\mathrm{max}}/(1-\gamma). Then, the maximum TT can be expressed as:

TDKLβ(1γ)(ζ2βCmax(1γ)2)𝔼t[iλit/Wt]2|S||A|CmaxDKLβ(1γ)2ζ2β2Cmax=:Tmax,T\leq\frac{D_{\mathrm{KL}}}{\beta(1-\gamma)(\zeta-\frac{2\beta C_{\mathrm{max}}}{(1-\gamma)^{2}})\mathbb{E}_{t}[\sum_{i}\lambda^{t}_{i}/W_{t}]}\leq\frac{2|S||A|C_{\mathrm{max}}D_{\mathrm{KL}}}{\beta(1-\gamma)^{2}\zeta-2\beta^{2}C_{\mathrm{max}}}=:T_{\mathrm{max}}, (31)

where we abbreviate 𝔼sd[DKL(π||π0)]\mathbb{E}_{s\sim d^{*}}\left[D_{\mathrm{KL}}(\pi_{*}||\pi_{0})\right] as DKLD_{\mathrm{KL}}. Finally, the policy can reach the feasible region within TmaxT_{\mathrm{max}} steps.

The worst-case time of the naive approach has the same form as the above bound, except for the λ\lambda part. In the naive approach, λt\lambda^{t} is a one-hot vector, as mentioned earlier, so only the term 𝔼t[iλit/Wt]=𝔼t[iλit/iλitACkt2]\mathbb{E}_{t}[\sum_{i}\lambda_{i}^{t}/W_{t}]=\mathbb{E}_{t}[\sum_{i}\lambda_{i}^{t}/||\sum_{i}\lambda_{i}^{t}A_{C_{k}}^{t}||_{2}] differs. Let us assume that the advantage vector follows a normal distribution. Then, the variance of iλitACkt\sum_{i}\lambda_{i}^{t}A_{C_{k}}^{t} is smaller when λt\lambda^{t} has distributed values than when it is one-hot, so the reciprocal of the 2-norm becomes larger, which decreases the worst-case time. From this perspective, the gradient integration method benefits over the naive approach by reducing the variance of the weighted advantage vector. Even though we cannot formally claim that the worst-case time of the proposed method is smaller than that of the naive method, since the advantage vector does not follow a normal distribution in general, this argument conveys our insight into the benefit of the gradient integration method.
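For reference, a minimal numpy sketch of the tabular softmax NPG update (25) used in this analysis is given below; the array shapes are illustrative assumptions.

import numpy as np

def npg_softmax_update(pi, advantage, beta):
    """One tabular NPG step for a softmax policy, following (25).

    pi:        (|S|, |A|) array of current policy probabilities pi_t(a|s).
    advantage: (|S|, |A|) array of advantage estimates A^{pi_t}(s, a).
    beta:      step size.
    """
    logits = np.log(pi) + beta * advantage
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability only
    new_pi = np.exp(logits)
    return new_pi / new_pi.sum(axis=1, keepdims=True)   # normalization by Z_t(s)

For the modified update (27), the advantage argument would be replaced by the negatively weighted cost advantages, i.e., by the vector with entries -sum_k lambda_k^t A_{C_k}^t(s, a) / W_t.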

A.4 Proof of Theorem 3.2

In this section, we show that a sequence, Zk+1=𝒯λμ,πZkZ_{k+1}=\mathcal{T}_{\lambda}^{\mu,\pi}Z_{k}, converges to ZRπZ^{\pi}_{R}. First, we rewrite the operator 𝒯λμ,π\mathcal{T}_{\lambda}^{\mu,\pi} for random variables as an operator for distributions and show that the operator is contractive. Finally, we show that ZRπZ_{R}^{\pi} is the unique fixed point.

Before starting the proof, we introduce useful notions and distance metrics. As the return ZRπ(s,a)Z_{R}^{\pi}(s,a) is a random variable, we define the distribution of ZRπ(s,a)Z_{R}^{\pi}(s,a) as νRπ(s,a)\nu_{R}^{\pi}(s,a). Let η\eta be the distribution of a random variable XX. Then, the distribution of the affine transformation aX+baX+b of the random variable can be expressed as (fa,b)#(η)(f_{a,b})_{\#}(\eta) using the pushforward operator defined by Rowland et al. [2018]. To measure a distance between two distributions, Bellemare et al. [2023] define the distance lpl_{p} as follows:

lp(η1,η2):=(|Fη1(x)Fη2(x)|p𝑑x)1/p,l_{p}(\eta_{1},\eta_{2}):=\left(\int_{\mathbb{R}}\left|F_{\eta_{1}}(x)-F_{\eta_{2}}(x)\right|^{p}dx\right)^{1/p}, (32)

where Fη(x)F_{\eta}(x) is the cumulative distribution function. This distance is 1/p1/p-homogeneous, regular, and pp-convex (see Section 4 of Bellemare et al. [2023] for more details). For functions that map state-action pairs to distributions, a distance can be defined as [Bellemare et al., 2023]: l¯p(ν1,ν2):=sup(s,a)S×Alp(ν1(s,a),ν2(s,a))\bar{l}_{p}(\nu_{1},\nu_{2}):=\mathrm{sup}_{(s,a)\in S\times A}l_{p}(\nu_{1}(s,a),\nu_{2}(s,a)). Then, we can rewrite the operator 𝒯λμ,π\mathcal{T}_{\lambda}^{\mu,\pi} for random variables in (13) as an operator for distributions as below.

𝒯λμ,πν(s,a):=1λ𝒩i=0λi\displaystyle\mathcal{T}^{\mu,\pi}_{\lambda}\nu(s,a):=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i} (33)
×𝔼μ[(j=1iη(sj,aj))𝔼aπ(|si+1)[(fγi+1,t=0iγtrt)#(ν(si+1,a))]|s0=s,a0=a],\displaystyle\times\mathbb{E}_{\mu}\left[\left(\prod_{j=1}^{i}\eta(s_{j},a_{j})\right)\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s_{i+1})}\left[(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu(s_{i+1},a^{\prime}))\right]\Big{|}s_{0}=s,a_{0}=a\right],

where η(s,a)=π(a|s)μ(a|s)\eta(s,a)=\frac{\pi(a|s)}{\mu(a|s)} and 𝒩\mathcal{N} is a normalization factor. Since the random variable Z(s,a)Z(s,a) and its distribution ν(s,a)\nu(s,a) are equivalent representations, the operators in (13) and (33) are also equivalent. Hence, we prove Theorem 3.2 using (33) instead of (13). We first show that the operator 𝒯λμ,π\mathcal{T}^{\mu,\pi}_{\lambda} has a contraction property.

Lemma A.5.

Under the distance l¯p\bar{l}_{p} and the assumption that the state, action, and reward spaces are finite, 𝒯λμ,π\mathcal{T}^{\mu,\pi}_{\lambda} is γ1/p\gamma^{1/p}-contractive.

Proof.

First, the operator can be rewritten using summation as follows.

𝒯λμ,πν(s,a)\displaystyle\mathcal{T}^{\mu,\pi}_{\lambda}\nu(s,a) =1λ𝒩i=0λiaA(s0,a0,r0,,si+1)Prμ(s0,a0,r0,,si+1=:τ)(j=1iη(sj,aj))\displaystyle=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i}\sum_{a^{\prime}\in A}\sum_{(s_{0},a_{0},r_{0},...,s_{i+1})}\mathrm{Pr}_{\mu}(\underbrace{s_{0},a_{0},r_{0},...,s_{i+1}}_{=:\tau})\left(\prod_{j=1}^{i}\eta(s_{j},a_{j})\right) (34)
×π(a|si+1)(fγi+1,t=0iγtrt)#(ν(si+1,a))\displaystyle\quad\quad\quad\times\pi(a^{\prime}|s_{i+1})(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu(s_{i+1},a^{\prime}))
=1λ𝒩i=0λiaAτPrμ(τ)(j=1iη(sj,aj))π(a|si+1)sS𝟏s=si+1\displaystyle=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i}\sum_{a^{\prime}\in A}\sum_{\tau}\mathrm{Pr}_{\mu}(\tau)\left(\prod_{j=1}^{i}\eta(s_{j},a_{j})\right)\pi(a^{\prime}|s_{i+1})\sum_{s^{\prime}\in S}\mathbf{1}_{s^{\prime}=s_{i+1}}
×r0:i(k=0i𝟏rk=rk)(fγi+1,t=0iγtrt)#(ν(s,a))\displaystyle\quad\quad\quad\times\sum_{r^{\prime}_{0:i}}\left(\prod_{k=0}^{i}\mathbf{1}_{r^{\prime}_{k}=r_{k}}\right)(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r^{\prime}_{t}})_{\#}(\nu(s^{\prime},a^{\prime}))
=1λ𝒩i=0λiaAsSr0:i(fγi+1,t=0iγtrt)#(ν(s,a))\displaystyle=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i}\sum_{a^{\prime}\in A}\sum_{s^{\prime}\in S}\sum_{r^{\prime}_{0:i}}(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r^{\prime}_{t}})_{\#}(\nu(s^{\prime},a^{\prime}))
×𝔼μ[(j=1iη(sj,aj))π(a|si+1)𝟏s=si+1(k=0i𝟏rk=rk)]=:ws,a,r0:i\displaystyle\quad\quad\quad\times\underbrace{\mathbb{E}_{\mu}\left[\left(\prod_{j=1}^{i}\eta(s_{j},a_{j})\right)\pi(a^{\prime}|s_{i+1})\mathbf{1}_{s^{\prime}=s_{i+1}}\left(\prod_{k=0}^{i}\mathbf{1}_{r^{\prime}_{k}=r_{k}}\right)\right]}_{=:w_{s^{\prime},a^{\prime},r^{\prime}_{0:i}}}
=1λ𝒩i=0sSaAr0:iλiws,a,r0:i(fγi+1,t=0iγtrt)#(ν(s,a)).\displaystyle=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\sum_{s^{\prime}\in S}\sum_{a^{\prime}\in A}\sum_{r^{\prime}_{0:i}}\lambda^{i}w_{s^{\prime},a^{\prime},r^{\prime}_{0:i}}(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r^{\prime}_{t}})_{\#}(\nu(s^{\prime},a^{\prime})).

Since the sum of weights of distributions should be one, we can find the normalization factor 𝒩=(1λ)i=0sSaAr0:iλiws,a,r0:i\mathcal{N}=(1-\lambda)\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\lambda^{i}w_{s,a,r_{0:i}}. Then, the following inequality can be derived using the homogeneity, regularity, and convexity of lpl_{p}:

lpp\displaystyle l^{p}_{p} (𝒯λμ,πν1(s,a),𝒯λμ,πν2(s,a))\displaystyle(\mathcal{T}^{\mu,\pi}_{\lambda}\nu_{1}(s,a),\mathcal{T}^{\mu,\pi}_{\lambda}\nu_{2}(s,a)) (35)
=lpp(1λ𝒩i=0sSaAr0:iλiws,a,r0:i(fγi+1,t=0iγtrt)#(ν1(s,a)),\displaystyle=l^{p}_{p}\left(\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\lambda^{i}w_{s,a,r_{0:i}}(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu_{1}(s,a)),\right.
1λ𝒩i=0sSaAr0:iλiws,a,r0:i(fγi+1,t=0iγtrt)#(ν2(s,a)))\displaystyle\qquad\qquad\qquad\left.\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\lambda^{i}w_{s,a,r_{0:i}}(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu_{2}(s,a))\right)
i=0sSaAr0:i(1λ)λiws,a,r0:i𝒩lpp((fγi+1,t=0iγtrt)#(ν1(s,a)),\displaystyle\leq\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\frac{(1-\lambda)\lambda^{i}w_{s,a,r_{0:i}}}{\mathcal{N}}l^{p}_{p}\left((f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu_{1}(s,a)),\right.
(fγi+1,t=0iγtrt)#(ν2(s,a)))\displaystyle\left.\qquad\qquad\qquad(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu_{2}(s,a))\right)
i=0sSaAr0:i(1λ)λiws,a,r0:i𝒩lpp((fγi+1,0)#(ν1(s,a)),(fγi+1,0)#(ν2(s,a)))\displaystyle\leq\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\frac{(1-\lambda)\lambda^{i}w_{s,a,r_{0:i}}}{\mathcal{N}}l^{p}_{p}\left((f_{\gamma^{i+1},0})_{\#}(\nu_{1}(s,a)),(f_{\gamma^{i+1},0})_{\#}(\nu_{2}(s,a))\right)
=i=0sSaAr0:i(1λ)λiws,a,r0:i𝒩γi+1lpp(ν1(s,a),ν2(s,a))\displaystyle=\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\frac{(1-\lambda)\lambda^{i}w_{s,a,r_{0:i}}}{\mathcal{N}}\gamma^{i+1}l^{p}_{p}\left(\nu_{1}(s,a),\nu_{2}(s,a)\right)
i=0sSaAr0:i(1λ)λiws,a,r0:i𝒩γi+1(l¯p(ν1,ν2))p\displaystyle\leq\sum_{i=0}^{\infty}\sum_{s\in S}\sum_{a\in A}\sum_{r_{0:i}}\frac{(1-\lambda)\lambda^{i}w_{s,a,r_{0:i}}}{\mathcal{N}}\gamma^{i+1}\left(\bar{l}_{p}\left(\nu_{1},\nu_{2}\right)\right)^{p}
γ(l¯p(ν1,ν2))p.\displaystyle\leq\gamma\left(\bar{l}_{p}\left(\nu_{1},\nu_{2}\right)\right)^{p}.

Therefore, l¯p(𝒯λμ,πν1,𝒯λμ,πν2)γ1/pl¯p(ν1,ν2)\bar{l}_{p}\left(\mathcal{T}^{\mu,\pi}_{\lambda}\nu_{1},\mathcal{T}^{\mu,\pi}_{\lambda}\nu_{2}\right)\leq\gamma^{1/p}\bar{l}_{p}\left(\nu_{1},\nu_{2}\right). ∎

By Banach’s fixed-point theorem, the operator 𝒯λμ,π\mathcal{T}^{\mu,\pi}_{\lambda} has a unique fixed distribution. We now show that the fixed distribution is νRπ\nu^{\pi}_{R}.

Lemma A.6.

The fixed distribution of the operator 𝒯λμ,π\mathcal{T}^{\mu,\pi}_{\lambda} is νRπ\nu^{\pi}_{R}.

Proof.

From the definition of ZRπZ_{R}^{\pi}, the following equality holds [Rowland et al., 2018]: νRπ(s,a)=𝔼π[(fγ,r)#(νRπ(s,a))]\nu_{R}^{\pi}(s,a)=\mathbb{E}_{\pi}\left[(f_{\gamma,r})_{\#}(\nu_{R}^{\pi}(s^{\prime},a^{\prime}))\right]. Then, it can be shown that νRπ\nu_{R}^{\pi} is the fixed distribution by applying the operator 𝒯λμ,π\mathcal{T}^{\mu,\pi}_{\lambda} to νRπ\nu_{R}^{\pi}:

𝒯λμ,πνRπ(s,a)=1λ𝒩i=0λi\displaystyle\mathcal{T}^{\mu,\pi}_{\lambda}\nu_{R}^{\pi}(s,a)=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i} (36)
×𝔼μ[(j=1iη(sj,aj))𝔼aπ(|si+1)[(fγi+1,t=0iγtrt)#(νRπ(si+1,a))]|s0=s,a0=a]\displaystyle\times\mathbb{E}_{\mu}\left[\left(\prod_{j=1}^{i}\eta(s_{j},a_{j})\right)\mathbb{E}_{a^{\prime}\sim\pi(\cdot|s_{i+1})}\left[(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu_{R}^{\pi}(s_{i+1},a^{\prime}))\right]\Big{|}s_{0}=s,a_{0}=a\right]
=1λ𝒩i=0λi𝔼π[(fγi+1,t=0iγtrt)#(νRπ(si+1,ai+1))|s0=s,a0=a]\displaystyle=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i}\mathbb{E}_{\pi}\left[(f_{\gamma^{i+1},\sum_{t=0}^{i}\gamma^{t}r_{t}})_{\#}(\nu_{R}^{\pi}(s_{i+1},a_{i+1}))\Big{|}s_{0}=s,a_{0}=a\right]
=1λ𝒩i=0λiνRπ(s,a)=νRπ(s,a).\displaystyle=\frac{1-\lambda}{\mathcal{N}}\sum_{i=0}^{\infty}\lambda^{i}\nu_{R}^{\pi}(s,a)=\nu_{R}^{\pi}(s,a).

We are now ready to prove Theorem 3.2.

Proof.

The operator 𝒯λμ,π\mathcal{T}^{\mu,\pi}_{\lambda} is γ1/p\gamma^{1/p}-contractive under the distance l¯p\bar{l}_{p} according to Lemma A.5. Also, the fixed distribution of the operator is νRπ\nu_{R}^{\pi}, which is equivalent to ZRπZ_{R}^{\pi}, according to Lemma A.6. By Banach’s fixed-point theorem, the sequence, Zk+1(s,a)=𝒯λμ,πZk(s,a)(s,a)Z_{k+1}(s,a)=\mathcal{T}^{\mu,\pi}_{\lambda}Z_{k}(s,a)\;\forall(s,a), converges to the fixed distribution of the operator, ZRπZ^{\pi}_{R}. ∎

A.5 Pseudocode of TD(λ\lambda) Target Distribution

We provide the pseudocode for calculating the TD(λ\lambda) target distribution for the reward critic in Algorithm 2. The target distribution for the cost critics can also be obtained by simply replacing the rewards with the costs.

Algorithm 2 TD(λ\lambda) Target Distribution
  Input: Policy network πψ\pi_{\psi}, critic network ZθπZ_{\theta}^{\pi}, and trajectory {(st,at,μ(at|st),rt,dt,st+1)}t=1T\{(s_{t},a_{t},\mu(a_{t}|s_{t}),r_{t},d_{t},s_{t+1})\}_{t=1}^{T}.
  Sample an action aT+1πψ(sT+1)a^{\prime}_{T+1}\sim\pi_{\psi}(s_{T+1}) and get Z^Ttot=rT+(1dT)γZθπ(sT+1,aT+1)\hat{Z}_{T}^{\mathrm{tot}}=r_{T}+(1-d_{T})\gamma Z_{\theta}^{\pi}(s_{T+1},a^{\prime}_{T+1}).
  Initialize the total weight wtot=λw_{\mathrm{tot}}=\lambda.
  for t=Tt=T to 11 do
     Sample an action at+1πψ(st+1)a^{\prime}_{t+1}\sim\pi_{\psi}(s_{t+1}) and get Z^t(1)=rt+(1dt)γZθπ(st+1,at+1)\hat{Z}_{t}^{(1)}=r_{t}+(1-d_{t})\gamma Z_{\theta}^{\pi}(s_{t+1},a^{\prime}_{t+1}).
     Set the current weight w=1λw=1-\lambda.
     Combine the two targets, (Z^t(1),w)(\hat{Z}_{t}^{(1)},w) and (Z^t(tot),wtot)(\hat{Z}_{t}^{(\mathrm{tot})},w_{\mathrm{tot}}), and sort the combined target according to the positions of atoms.
     Build the CDF of the combined target by accumulating the weights at each atom.
     Project the combined target into a quantile distribution with MM^{\prime} atoms, which is Z^t(proj)\hat{Z}_{t}^{(\mathrm{proj})}, using the CDF (find the atom positions corresponding to each quantile).
     Update Z^t1(tot)=rt1+(1dt1)γZ^t(proj)\hat{Z}_{t-1}^{(\mathrm{tot})}=r_{t-1}+(1-d_{t-1})\gamma\hat{Z}_{t}^{(\mathrm{proj})} and wtot=λπψ(at|st)μ(at|st)(1dt1)(1λ+wtot)w_{\mathrm{tot}}=\lambda\frac{\pi_{\psi}(a_{t}|s_{t})}{\mu(a_{t}|s_{t})}(1-d_{t-1})(1-\lambda+w_{\mathrm{tot}}).
  end for
  Return {Z^t(proj)}t=1T\{\hat{Z}_{t}^{(\mathrm{proj})}\}_{t=1}^{T}.
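A minimal numpy sketch of Algorithm 2 is given below. The array-based representation of quantile distributions (uniform weights over atom positions), the zero-based time indexing, and the helper for the projection step are assumptions made for illustration.

import numpy as np

def project_to_quantiles(atoms, weights, num_atoms):
    """Sort weighted atoms, build their CDF, and read off the atom positions
    at the quantile midpoints (i + 0.5) / num_atoms (the projection step)."""
    order = np.argsort(atoms)
    atoms, weights = atoms[order], weights[order]
    cdf = np.cumsum(weights) / np.sum(weights)
    taus = (np.arange(num_atoms) + 0.5) / num_atoms
    idx = np.clip(np.searchsorted(cdf, taus), 0, len(atoms) - 1)
    return atoms[idx]

def td_lambda_targets(rewards, dones, mus, pis, next_atoms,
                      gamma=0.99, lam=0.97, num_target_atoms=50):
    """Sketch of Algorithm 2 (zero-based time). next_atoms[t] holds the critic
    atoms of Z_theta(s_{t+1}, a'_{t+1}) for a freshly sampled action a'_{t+1};
    mus and pis are the behavioral and current policy densities of a_t."""
    T = len(rewards)
    z_tot = rewards[T - 1] + (1.0 - dones[T - 1]) * gamma * next_atoms[T - 1]
    w_tot = lam
    targets = [None] * T
    for t in range(T - 1, -1, -1):
        # One-step target with weight (1 - lambda), mixed with the running
        # multi-step target with weight w_tot, then re-projected to M' atoms.
        z_one = rewards[t] + (1.0 - dones[t]) * gamma * next_atoms[t]
        atoms = np.concatenate([z_one, z_tot])
        weights = np.concatenate([np.full(len(z_one), (1.0 - lam) / len(z_one)),
                                  np.full(len(z_tot), w_tot / len(z_tot))])
        targets[t] = project_to_quantiles(atoms, weights, num_target_atoms)
        if t > 0:
            z_tot = rewards[t - 1] + (1.0 - dones[t - 1]) * gamma * targets[t]
            w_tot = lam * (pis[t] / mus[t]) * (1.0 - dones[t - 1]) * (1.0 - lam + w_tot)
    return targets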

A.6 Quantitative Analysis on TD(λ\lambda) Target Distribution

We experiment with a toy example to measure the bias and variance of the reward estimation according to λ\lambda. The toy example has two states, s1s_{1} and s2s_{2}; the state distribution is uniform; the reward function is defined as r(s1)𝒩(0.005,0.02)r(s_{1})\sim\mathcal{N}(-0.005,0.02) and r(s2)𝒩(0.005,0.03)r(s_{2})\sim\mathcal{N}(0.005,0.03). We train parameterized reward distributions by minimizing the quantile regression loss with the TD(λ\lambda) target distribution for λ=0,0.5,0.9\lambda=0,0.5,0.9, and 1.01.0. The experimental results are presented in the table below.

Table 1: Experimental results of the toy example.

λ\lambda 5th iteration 10th iteration 15th iteration 20th iteration 25th iteration
λ=0.0\lambda=0.0 4.813 (0.173) 4.024 (0.253) 3.498 (0.085) 3.131 (0.103) 2.835 (0.070)
λ=0.5\lambda=0.5 4.621 (0.185) 3.688 (0.273) 2.925 (0.183) 2.379 (0.134) 2.057 (0.070)
λ=0.9\lambda=0.9 4.141 (0.461) 2.237 (0.402) 1.389 (0.132) 1.058 (0.031) 0.923 (0.019)
λ=1.0\lambda=1.0 2.886 (0.767) 1.733 (0.365) 1.509 (0.514) 1.142 (0.325) 1.109 (0.476)

The values in the table are the mean and standard deviation of the past five values of the Wasserstein distance between the true reward return distribution and the estimated distribution. Looking at the fifth iteration, it is clear that the larger the λ\lambda value, the smaller the mean and the higher the standard deviation. At the 25th iteration, the run with λ=0.9\lambda=0.9 has the lowest mean and standard deviation, indicating that training has converged. On the other hand, the run with λ=1.0\lambda=1.0 has the largest standard deviation, and its mean is greater than that of λ=0.9\lambda=0.9, indicating that the large variance hinders training. In conclusion, the toy example quantitatively measures bias and variance, and the results align well with our claim that λ\lambda trades off bias and variance.
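The parameterized reward distributions in this toy example are fitted by quantile regression against the TD(λ) target distribution; the following is a minimal numpy sketch of a quantile Huber regression loss, where placing the predicted atoms at quantile midpoints and the Huber threshold value are assumptions.

import numpy as np

def quantile_huber_loss(pred_atoms, target_atoms, kappa=1.0):
    """Quantile Huber loss between M predicted atoms (at quantile midpoints)
    and M' target atoms from the TD(lambda) target distribution."""
    M = len(pred_atoms)
    taus = (np.arange(M) + 0.5) / M
    diff = target_atoms[None, :] - pred_atoms[:, None]              # shape (M, M')
    huber = np.where(np.abs(diff) <= kappa,
                     0.5 * diff ** 2,
                     kappa * (np.abs(diff) - 0.5 * kappa))
    weight = np.abs(taus[:, None] - (diff < 0).astype(np.float64))  # |tau - 1{diff < 0}|
    return np.mean(weight * huber)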

A.7 Surrogate Functions

In this section, we introduce the surrogate functions for the objective and constraints. First, Kim and Oh [2022a] define a doubly discounted state distribution: d2π(s):=(1γ2)t=0γ2tPr(st=s|π)d_{2}^{\pi}(s):=(1-\gamma^{2})\sum_{t=0}^{\infty}\gamma^{2t}\mathrm{Pr}(s_{t}=s|\pi). Then, the surrogates for the objective and constraints are defined as follows [Kim and Oh, 2022a]:

Jμ,π(π)\displaystyle J^{\mu,\pi}(\pi^{\prime}) :=𝔼s0ρ[Vπ(s0)]+11γ(β𝔼dπ[H(π(|s))]+𝔼dμ,π[QRπ(s,a)]),\displaystyle:=\underset{s_{0}\sim\rho}{\mathbb{E}}\left[V^{\pi}(s_{0})\right]+\frac{1}{1-\gamma}\Big{(}\beta\underset{d^{\pi}}{\mathbb{E}}\left[H(\pi^{\prime}(\cdot|s))\right]+\underset{d^{\mu},\pi^{\prime}}{\mathbb{E}}\left[Q_{R}^{\pi}(s,a)\right]\Big{)}, (37)
JCkμ,π(π)\displaystyle J_{C_{k}}^{\mu,\pi}(\pi^{\prime}) :=𝔼s0ρ[VCkπ(s0)]+11γ𝔼dμ,π[QCkπ(s,a)],\displaystyle:=\underset{s_{0}\sim\rho}{\mathbb{E}}\left[V_{C_{k}}^{\pi}(s_{0})\right]+\frac{1}{1-\gamma}\underset{d^{\mu},\pi^{\prime}}{\mathbb{E}}\left[Q_{C_{k}}^{\pi}(s,a)\right],
JSkμ,π(π)\displaystyle J^{\mu,\pi}_{S_{k}}(\pi^{\prime}) :=𝔼s0ρ[SCkπ(s0)]+11γ2𝔼d2μ,π[SCkπ(s,a)],\displaystyle:=\underset{s_{0}\sim\rho}{\mathbb{E}}\left[S_{C_{k}}^{\pi}(s_{0})\right]+\frac{1}{1-\gamma^{2}}\underset{d_{2}^{\mu},\pi^{\prime}}{\mathbb{E}}\left[S_{C_{k}}^{\pi}(s,a)\right],
Fkμ,π(π;α)\displaystyle F_{k}^{\mu,\pi}(\pi^{\prime};\alpha) :=JCkμ,π(π)+ϕ(Φ1(α))αJSkμ,π(π)(JCkμ,π(π))2,\displaystyle:=J^{\mu,\pi}_{C_{k}}(\pi^{\prime})+\frac{\phi(\Phi^{-1}(\alpha))}{\alpha}\sqrt{J^{\mu,\pi}_{S_{k}}(\pi^{\prime})-(J^{\mu,\pi}_{C_{k}}(\pi^{\prime}))^{2}},

where μ,π,π\mu,\pi,\pi^{\prime} are the behavioral, current, and next policies, respectively. According to Theorem 1 in [Kim and Oh, 2022a], the gap between each constraint and its surrogate is bounded by a term involving DKL(π,π)D_{\mathrm{KL}}(\pi,\pi^{\prime}). We also show in Appendix A.9 that the gap between the objective and its surrogate is bounded by DKL(π,π)D_{\mathrm{KL}}(\pi,\pi^{\prime}). As a result, the gradients of the objective function and constraints coincide with those of the surrogates, and the surrogates can substitute for the objective and constraints within the trust region.
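To make the last line of (37) concrete, the following sketch computes the mean-std constraint surrogate from estimates of the cost surrogate and the square surrogate; the function interface and the clipping of the variance term are assumptions.

import numpy as np
from scipy.stats import norm

def mean_std_surrogate(j_cost, j_square, alpha):
    """F_k^{mu,pi}(pi'; alpha) = J_C + phi(Phi^{-1}(alpha)) / alpha * sqrt(J_S - J_C^2),
    following the last line of (37)."""
    coeff = norm.pdf(norm.ppf(alpha)) / alpha
    variance = max(j_square - j_cost ** 2, 0.0)   # clipped for numerical safety
    return j_cost + coeff * np.sqrt(variance)

For α=1, the coefficient ϕ(Φ^{-1}(α))/α is zero, so the surrogate reduces to the risk-neutral cost surrogate.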

A.8 Policy Update Rule

To solve the constrained optimization problem (5), we find a policy update direction by linearly approximating the objective and safety constraints and quadratically approximating the trust region constraint, as done by Achiam et al. [2017]. After finding the direction, we update the policy using a line search method. Given the current policy parameter ψtΨ\psi_{t}\in\Psi, the approximated problem can be expressed as follows:

x=argmaxxΨgTxs.t.12xTHxϵ,bkTx+ck0k,x^{*}=\underset{x\in\Psi}{\mathrm{argmax}}\;g^{T}x\quad\mathrm{s.t.}\;\frac{1}{2}x^{T}Hx\leq\epsilon,\;b_{k}^{T}x+c_{k}\leq 0\;\forall k, (38)

where g=ψJμ,π(πψ)|ψ=ψtg=\nabla_{\psi}J^{\mu,\pi}(\pi_{\psi})|_{\psi=\psi_{t}}, H=ψ2DKL(πψt||πψ)|ψ=ψtH=\nabla_{\psi}^{2}D_{\mathrm{KL}}(\pi_{\psi_{t}}||\pi_{\psi})|_{\psi=\psi_{t}}, bk=ψFkμ,π(πψ;α)|ψ=ψtb_{k}=\nabla_{\psi}F_{k}^{\mu,\pi}(\pi_{\psi};\alpha)|_{\psi=\psi_{t}}, and ck=Fk(πψ;α)dkc_{k}=F_{k}(\pi_{\psi};\alpha)-d_{k}. Since (38) is convex, we can use an existing convex optimization solver. However, the search space, which is the policy parameter space Ψ\Psi, is excessively large, so we reduce the space by converting (38) to a dual problem as follows:

g(λ,ν)\displaystyle g(\lambda,\nu) =minxL(x,λ,ν)=minx{gTx+ν(12xTHxϵ)+λT(Bx+c)}\displaystyle=\mathrm{min}_{x}L(x,\lambda,\nu)=\mathrm{min}_{x}\{-g^{T}x+\nu(\frac{1}{2}x^{T}Hx-\epsilon)+\lambda^{T}(Bx+c)\} (39)
=12ν(gTH1g=:q2gTH1BT=:rTλ+λTBH1BT=:Sλ)+λTcνϵ\displaystyle=\frac{-1}{2\nu}\left(\underbrace{g^{T}H^{-1}g}_{=:q}-2\underbrace{g^{T}H^{-1}B^{T}}_{=:r^{T}}\lambda+\lambda^{T}\underbrace{BH^{-1}B^{T}}_{=:S}\lambda\right)+\lambda^{T}c-\nu\epsilon
=12ν(q2rTλ+λTSλ)+λTcνϵ,\displaystyle=\frac{-1}{2\nu}(q-2r^{T}\lambda+\lambda^{T}S\lambda)+\lambda^{T}c-\nu\epsilon,

where B=(b1,..,bK)B=(b_{1},..,b_{K}), c=(c1,,cK)Tc=(c_{1},...,c_{K})^{T}, and λK0\lambda\in\mathbb{R}^{K}\geq 0 and ν0\nu\in\mathbb{R}\geq 0 are Lagrange multipliers. Then, the optimal λ\lambda and ν\nu can be obtained by a convex optimization solver. After obtaining the optimal values, (λ,ν)=argmax(λ,ν)g(λ,ν)(\lambda^{*},\nu^{*})=\mathrm{argmax}_{(\lambda,\nu)}g(\lambda,\nu), the policy update direction xx^{*} is calculated by 1νH1(gBTλ)\frac{1}{\nu^{*}}H^{-1}(g-B^{T}\lambda^{*}). Then, the policy is updated by ψt+1=ψt+βx\psi_{t+1}=\psi_{t}+\beta x^{*}, where β\beta is a step size, which can be found through a backtracking method (please refer to Section 6.3.2 of Dennis and Schnabel [1996]).
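A minimal sketch of this update rule is given below, assuming access to products with H^{-1} and using a generic bounded optimizer for the dual (39); the helper names and the small lower bound on ν are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

def policy_update_direction(g, B, c, H_inv_dot, eps):
    """Solve the dual (39) of the approximated problem (38) and recover
    x* = H^{-1}(g - B^T lambda*) / nu*.

    g: objective gradient, B: (K, d) matrix whose rows are the constraint
    gradients b_k, c: (K,) constraint values, H_inv_dot(v): the product
    H^{-1} v, eps: trust region size.
    """
    Hg = H_inv_dot(g)                          # H^{-1} g
    HB = np.stack([H_inv_dot(b) for b in B])   # rows are H^{-1} b_k
    q, r, S = g @ Hg, B @ Hg, B @ HB.T         # q, r, and S as defined in (39)

    def neg_dual(x):                           # negative of g(lambda, nu) in (39)
        lam, nu = x[:-1], max(x[-1], 1e-8)
        return (q - 2.0 * r @ lam + lam @ S @ lam) / (2.0 * nu) - lam @ c + nu * eps

    K = len(c)
    res = minimize(neg_dual, np.ones(K + 1), bounds=[(0.0, None)] * (K + 1))
    lam, nu = res.x[:-1], max(res.x[-1], 1e-8)
    return (Hg - HB.T @ lam) / nu              # policy update direction x*

For small problems, H_inv_dot can simply be lambda v: np.linalg.solve(H, v); with neural network policies, products with H^{-1} are typically computed by conjugate gradient on Hessian-vector products. A backtracking line search over the step size β then checks the trust region and constraints before applying the update.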

Before using the above policy update rule, we should note that the existing trust-region method with the risk-averse constraint [Kim and Oh, 2022a] and the equations (1, 37, 5) are slightly different. There are two differences: 1) the objective is augmented with an entropy bonus, and 2) the surrogates are expressed with Q-functions instead of value functions. To use the entropy-regularized objective in the trust-region method, it is required to show that the gap between the objective and its surrogate is bounded by the KL divergence; we present this bound in Appendix A.9. Next, there is no problem in using the Q-functions because the original surrogates [Kim and Oh, 2022a] and the new ones expressed with Q-functions (37) are mathematically equivalent. However, we experimentally show in Appendix A.10 that using the Q-functions has advantages in off-policy settings.

A.9 Bound of Entropy-Augmented Objective

In this section, we show that the entropy-regularized objective in (1) has a bound expressed by the KL divergence. Before showing the boundedness, we present a new function and a lemma. A value difference function is defined as follows:

δπ(s):=𝔼[R(s,a,s)+γVπ(s)Vπ(s)|aπ(|s),sP(|s,a)]=𝔼aπ[Aπ(s,a)],\delta^{\pi^{\prime}}(s):=\mathbb{E}\left[R(s,a,s^{\prime})+\gamma V^{\pi}(s^{\prime})-V^{\pi}(s)\;|\;a\sim\pi^{\prime}(\cdot|s),s^{\prime}\sim P(\cdot|s,a)\right]=\underset{a\sim\pi^{\prime}}{\mathbb{E}}\left[A^{\pi}(s,a)\right],

where Aπ(s,a):=Qπ(s,a)Vπ(s)A^{\pi}(s,a):=Q^{\pi}(s,a)-V^{\pi}(s).

Lemma A.7.

The maximum of |δπ(s)δπ(s)||\delta^{\pi^{\prime}}(s)-\delta^{\pi}(s)| is equal or less than ϵR2DKLmax(π||π)\epsilon_{R}\sqrt{2D_{\mathrm{KL}}^{\mathrm{max}}(\pi||\pi^{\prime})}, where ϵR=maxs,a|Aπ(s,a)|\epsilon_{R}=\underset{s,a}{\mathrm{max}}\left|A^{\pi}(s,a)\right|.

Proof.

The value difference can be expressed in a vector form,

δπ(s)δπ(s)=a(π(a|s)π(a|s))Aπ(s,a)=π(|s)π(|s),Aπ(s,).\delta^{\pi^{\prime}}(s)-\delta^{\pi}(s)=\sum_{a}(\pi^{\prime}(a|s)-\pi(a|s))A^{\pi}(s,a)=\langle\pi^{\prime}(\cdot|s)-\pi(\cdot|s),A^{\pi}(s,\cdot)\rangle.

Using Hölder’s inequality, the following inequality holds:

|δπ(s)δπ(s)|\displaystyle|\delta^{\pi^{\prime}}(s)-\delta^{\pi}(s)| ||π(|s)π(|s)||1||Aπ(s,)||\displaystyle\leq||\pi^{\prime}(\cdot|s)-\pi(\cdot|s)||_{1}\cdot||A^{\pi}(s,\cdot)||_{\infty}
=2DTV(π(|s)||π(|s))maxaAπ(s,a).\displaystyle=2D_{\mathrm{TV}}(\pi^{\prime}(\cdot|s)||\pi(\cdot|s))\mathrm{max}_{a}A^{\pi}(s,a).
||δπδπ||=maxs|δπ(s)δπ(s)|2ϵRmaxsDTV(π(|s)||π(|s)).\Rightarrow||\delta^{\pi^{\prime}}-\delta^{\pi}||_{\infty}=\mathrm{max}_{s}|\delta^{\pi^{\prime}}(s)-\delta^{\pi}(s)|\leq 2\epsilon_{R}\mathrm{max}_{s}D_{\mathrm{TV}}(\pi(\cdot|s)||\pi^{\prime}(\cdot|s)).

Using Pinsker’s inequality, δπδπϵR2DKLmax(π||π)||\delta^{\pi^{\prime}}-\delta^{\pi}||_{\infty}\leq\epsilon_{R}\sqrt{2D_{\mathrm{KL}}^{\mathrm{max}}(\pi||\pi^{\prime})}. ∎

Theorem A.8.

Let us assume that maxsH(π(|s))<\mathrm{max}_{s}H(\pi(\cdot|s))<\infty for πΠ\forall\pi\in\Pi. The difference between the objective and surrogate functions is bounded by a term consisting of KL divergence as:

|J(π)Jμ,π(π)|2γ(1γ)2DKLmax(π||π)(βϵH+ϵR2DKLmax(μ||π)),\displaystyle\left|J(\pi^{\prime})-J^{\mu,\pi}(\pi^{\prime})\right|\leq\frac{\sqrt{2}\gamma}{(1-\gamma)^{2}}\sqrt{D_{\mathrm{KL}}^{\mathrm{max}}(\pi||\pi^{\prime})}\left(\beta\epsilon_{H}+\epsilon_{R}\sqrt{2D_{\mathrm{KL}}^{\mathrm{max}}(\mu||\pi^{\prime})}\right), (40)

where ϵH=max𝑠|H(π(|s))|\epsilon_{H}=\underset{s}{\mathrm{max}}\left|H(\pi^{\prime}(\cdot|s))\right|, DKLmax(π||π)=max𝑠DKL(π(|s)||π(|s))D_{\mathrm{KL}}^{\mathrm{max}}(\pi||\pi^{\prime})=\underset{s}{\mathrm{max}}\;D_{\mathrm{KL}}(\pi(\cdot|s)||\pi^{\prime}(\cdot|s)), and the equality holds when π=π\pi^{\prime}=\pi.

Proof.

The surrogate function can be expressed in vector form as follows:

Jμ,π(π)=ρ,Vπ+11γ(dμ,δπ+βdπ,Hπ),J^{\mu,\pi}(\pi^{\prime})=\langle\rho,V^{\pi}\rangle+\frac{1}{1-\gamma}\left(\langle d^{\mu},\delta^{\pi^{\prime}}\rangle+\beta\langle d^{\pi},H^{\pi^{\prime}}\rangle\right),

where Hπ(s)=H(π(|s))H^{\pi^{\prime}}(s)=H(\pi^{\prime}(\cdot|s)). The objective function of π\pi^{\prime} can also be expressed in a vector form using Lemma 1 from Achiam et al. [2017],

J(π)\displaystyle J(\pi^{\prime}) =11γ𝔼[R(s,a,s)+βHπ(s)|sdπ,aπ(|s),sP(|s,a)]\displaystyle=\frac{1}{1-\gamma}\mathbb{E}\left[R(s,a,s^{\prime})+\beta H^{\pi^{\prime}}(s)\;|\;s\sim d^{\pi^{\prime}},a\sim\pi^{\prime}(\cdot|s),s^{\prime}\sim P(\cdot|s,a)\right]
=11γ𝔼sdπ[δπ(s)+βHπ(s)]+𝔼sρ[Vπ(s)]\displaystyle=\frac{1}{1-\gamma}\underset{s\sim d^{\pi^{\prime}}}{\mathbb{E}}\left[\delta^{\pi^{\prime}}(s)+\beta H^{\pi^{\prime}}(s)\right]+\underset{s\sim\rho}{\mathbb{E}}\left[V^{\pi}(s)\right]
=ρ,Vπ+11γdπ,δπ+βHπ.\displaystyle=\langle\rho,V^{\pi}\rangle+\frac{1}{1-\gamma}\langle d^{\pi^{\prime}},\delta^{\pi^{\prime}}+\beta H^{\pi^{\prime}}\rangle.

By Lemma 3 from Achiam et al. [2017], dπdπ1γ1γ2DKLmax(π||π)||d^{\pi}-d^{\pi^{\prime}}||_{1}\leq\frac{\gamma}{1-\gamma}\sqrt{2D^{\mathrm{max}}_{\mathrm{KL}}(\pi||\pi^{\prime})}. Then, the following inequality is satisfied:

|(1\displaystyle|(1- γ)(Jμ,π(π)J(π))|\displaystyle\gamma)(J^{\mu,\pi}(\pi^{\prime})-J(\pi^{\prime}))|
=|dπdμ,δπ+βdπdπ,Hπ|\displaystyle=|\langle d^{\pi^{\prime}}-d^{\mu},\delta^{\pi^{\prime}}\rangle+\beta\langle d^{\pi}-d^{\pi^{\prime}},H^{\pi^{\prime}}\rangle|
|dπdμ,δπ|+β|dπdπ,Hπ|\displaystyle\leq|\langle d^{\pi^{\prime}}-d^{\mu},\delta^{\pi^{\prime}}\rangle|+\beta|\langle d^{\pi}-d^{\pi^{\prime}},H^{\pi^{\prime}}\rangle|
=|dπdμ,δπδπ|+β|dπdπ,Hπ|\displaystyle=|\langle d^{\pi^{\prime}}-d^{\mu},\delta^{\pi^{\prime}}-\delta^{\pi}\rangle|+\beta|\langle d^{\pi}-d^{\pi^{\prime}},H^{\pi^{\prime}}\rangle| (δπ=0)\displaystyle(\because\delta^{\pi}=0)
dπdμ1δπδπ+βdπdπ1Hπ\displaystyle\leq||d^{\pi^{\prime}}-d^{\mu}||_{1}||\delta^{\pi^{\prime}}-\delta^{\pi}||_{\infty}+\beta||d^{\pi}-d^{\pi^{\prime}}||_{1}||H^{\pi^{\prime}}||_{\infty} (Hölder’s inequality)\displaystyle(\because\text{\small H\"{o}lder's inequality})
2ϵRγ1γDKLmax(μ||π)DKLmax(π||π)+βγϵH1γ2DKLmax(π||π)\displaystyle\leq\frac{2\epsilon_{R}\gamma}{1-\gamma}\sqrt{D_{\mathrm{KL}}^{\mathrm{max}}(\mu||\pi^{\prime})D_{\mathrm{KL}}^{\mathrm{max}}(\pi||\pi^{\prime})}+\frac{\beta\gamma\epsilon_{H}}{1-\gamma}\sqrt{2D^{\mathrm{max}}_{\mathrm{KL}}(\pi||\pi^{\prime})} (Lemma A.7)\displaystyle(\because\text{Lemma \ref{lemma:delta difference}})
=γ1γDKLmax(π||π)(2βϵH+2ϵRDKLmax(μ||π)).\displaystyle=\frac{\gamma}{1-\gamma}\sqrt{D_{\mathrm{KL}}^{\mathrm{max}}(\pi||\pi^{\prime})}\left(\sqrt{2}\beta\epsilon_{H}+2\epsilon_{R}\sqrt{D_{\mathrm{KL}}^{\mathrm{max}}(\mu||\pi^{\prime})}\right).

If π=π\pi^{\prime}=\pi, the KL divergence term becomes zero, so equality holds. ∎

A.10 Comparison of Q-Function and Value Function-Based Surrogates

The original surrogate is defined as follows:

Jμ,π(π):=J(π)+11γ𝔼dμ,μ[π(a|s)μ(a|s)Aπ(s,a)],J^{\mu,\pi}(\pi^{\prime}):=J(\pi)+\frac{1}{1-\gamma}\underset{d^{\mu},\mu}{\mathbb{E}}\left[\frac{\pi^{\prime}(a|s)}{\mu(a|s)}A^{\pi}(s,a)\right], (41)

where Aπ(s,a):=Qπ(s,a)Vπ(s)A^{\pi}(s,a):=Q^{\pi}(s,a)-V^{\pi}(s), and the surrogate is the same as that of OffTRPO [Meng et al., 2022] and OffTRC [Kim and Oh, 2022a]. An entropy-regularized version can be derived as:

Jμ,π(π)=J(π)+11γ(β𝔼dπ[H(π(|s))]+𝔼dμ,μ[π(a|s)μ(a|s)Aπ(s,a)]).J^{\mu,\pi}(\pi^{\prime})=J(\pi)+\frac{1}{1-\gamma}\Big{(}\beta\underset{d^{\pi}}{\mathbb{E}}\left[H(\pi^{\prime}(\cdot|s))\right]+\underset{d^{\mu},\mu}{\mathbb{E}}\left[\frac{\pi^{\prime}(a|s)}{\mu(a|s)}A^{\pi}(s,a)\right]\Big{)}. (42)

Then, the surrogate expressed by Q-functions in (37), called SAC-style version, can be rewritten as:

Jμ,π(π)=J(π)+11γ(β𝔼dπ[H(π(|s))]+𝔼dμ,π[Qπ(s,a)]).J^{\mu,\pi}(\pi^{\prime})=J(\pi)+\frac{1}{1-\gamma}\Big{(}\beta\underset{d^{\pi}}{\mathbb{E}}\left[H(\pi^{\prime}(\cdot|s))\right]+\underset{d^{\mu},\pi^{\prime}}{\mathbb{E}}\left[Q^{\pi}(s,a)\right]\Big{)}. (43)

In this section, we evaluate the original, entropy-regularized, and SAC-style versions in the continuous control tasks of the MuJoCo simulator [Todorov et al., 2012]. We use neural networks with two hidden layers with (512, 512) nodes and ReLU for the activation function. The output of a value network is linear, but the input is different; the original and entropy-regularized versions use states, and the SAC-style version uses state-action pairs. The input of a policy network is the state, the output is mean μ\mu and std σ\sigma, and actions are squashed into tanh(μ+ϵσ)\mathrm{tanh}(\mu+\epsilon\sigma), ϵ𝒩(0,1)\epsilon\sim\mathcal{N}(0,1) as in SAC [Haarnoja et al., 2018]. The entropy coefficient β\beta in the entropy-regularized and SAC-style versions is adaptively adjusted to keep the entropy above a threshold (set as d-d given AdA\subseteq\mathbb{R}^{d}). The hyperparameters for all versions are summarized in Table 2.

Table 2: Hyperparameters for all versions.

Parameter Value
Discount factor γ\gamma 0.99
Trust region size ϵ\epsilon 0.001
Length of replay buffer 10510^{5}
Critic learning rate 0.00030.0003
Trace-decay λ\lambda 0.970.97
Initial entropy coefficient β\beta 1.0
β\beta learning rate 0.01
Figure 5: MuJoCo training curves: (a) Ant-v3, (b) HalfCheetah-v3, (c) Hopper-v3, (d) Humanoid-v3, (e) Swimmer-v3, (f) Walker2d-v3.

The training curves are presented in Figure 5. All methods are trained with five different random seeds. Although the entropy-regularized version (42) and SAC-style version (43) are mathematically equivalent, it can be observed that the performance of the SAC-style version is superior to the regularized version. It can be inferred that this is due to the variance of importance sampling. In the off-policy setting, the sampling probabilities of the behavioral and current policies can be significantly different, so the variance of the importance ratio is huge. The increased variance prevents estimating the objective accurately, so significant performance degradation can happen. As a result, using the Q-function-based surrogates has an advantage for efficient learning.

Appendix B Experimental Settings

Figure 6: (a) Point goal and (b) car button are Safety Gym tasks; (c) Cassie, (d) Laikago, and (e) Mini-Cheetah are locomotion tasks.

Safety Gym. We use the goal and button tasks with the point and car robots in the Safety Gym environment [Ray et al., 2019], as shown in Figure 6(a) and 6(b). The environmental setting for the goal task is the same as in Kim and Oh [2022b]. Eight hazard regions and one goal are randomly spawned at the beginning of each episode, and a robot gets a reward and cost as follows:

R(s,a,s)\displaystyle R(s,a,s^{\prime}) =Δdgoal+𝟏dgoal0.3,\displaystyle=-\Delta d_{\mathrm{goal}}+\mathbf{1}_{d_{\mathrm{goal}}\leq 0.3}, (44)
C(s,a,s)\displaystyle C(s,a,s^{\prime}) =Sigmoid(10(0.2dhazard)),\displaystyle=\mathrm{Sigmoid}(10\cdot(0.2-d_{\mathrm{hazard}})),

where dgoald_{\mathrm{goal}} is the distance to the goal, and dhazardd_{\mathrm{hazard}} is the minimum distance to hazard regions. If dgoald_{\mathrm{goal}} is less than or equal to 0.30.3, a goal is respawned. The state consists of the relative goal position, goal distance, linear and angular velocities, acceleration, and LiDAR values. The action space is two-dimensional, consisting of xyxy-directional forces for the point robot and wheel velocities for the car robot.
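A minimal per-step sketch of the reward and cost in (44) is shown below; the function interface, which takes the previous and current goal distances, is an assumption for illustration.

import numpy as np

def goal_task_reward_cost(d_goal_prev, d_goal, d_hazard):
    """Reward and cost of the goal task, following (44)."""
    reward = -(d_goal - d_goal_prev) + float(d_goal <= 0.3)    # -Delta d_goal + goal bonus
    cost = 1.0 / (1.0 + np.exp(-10.0 * (0.2 - d_hazard)))      # Sigmoid(10 * (0.2 - d_hazard))
    return reward, cost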

The environmental settings for the button task are the same as in Liu et al. [2022]. There are five hazard regions, four dynamic obstacles, and four buttons, and all components are fixed throughout the training. The initial position of a robot and an activated button are randomly placed at the beginning of each episode. The reward function is the same as in (44), but the cost is different since there is no dense signal for contacts. We define the cost function for the button task as an indicator function that outputs one if the robot makes contact with an obstacle or an inactive button or enters a hazardous region. We add LiDAR values of buttons and obstacles to the state of the goal task, and actions are the same as the goal task. The length of the episode is 1000 steps without early termination.

Locomotion Tasks. We use three different legged robots, Mini-Cheetah, Laikago, and Cassie, for the locomotion tasks, as shown in Figure 6(e), 6(d), and 6(c). The tasks aim to control robots to follow a velocity command on flat terrain. A velocity command is given by (vxcmd,vycmd,ωzcmd)(v_{x}^{\mathrm{cmd}},v_{y}^{\mathrm{cmd}},\omega_{z}^{\mathrm{cmd}}), where vxcmd𝒰(1.0,1.0)v_{x}^{\mathrm{cmd}}\sim\mathcal{U}(-1.0,1.0) for Cassie and 𝒰(1.0,2.0)\mathcal{U}(-1.0,2.0) otherwise, vycmd=0v_{y}^{\mathrm{cmd}}=0, and ωzcmd𝒰(0.5,0.5)\omega_{z}^{\mathrm{cmd}}\sim\mathcal{U}(-0.5,0.5). To lower the task complexity, we set the yy-directional linear velocity to zero but can scale to any non-zero value. As in other locomotion studies [Lee et al., 2020, Miki et al., 2022], central phases are introduced to produce periodic motion, which are defined as ϕi(t)=ϕi,0+ft\phi_{i}(t)=\phi_{i,0}+f\cdot t for i{1,,nlegs}\forall i\in\{1,...,n_{\mathrm{legs}}\}, where ff is a frequency coefficient and is set to 1010, and ϕi,0\phi_{i,0} is an initial phase. Actuators of robots are controlled by PD control towards target positions given by actions. The state consists of velocity command, orientation of the robot frame, linear and angular velocities of the robot, positions and speeds of the actuators, central phases, history of positions and speeds of the actuators (past two steps), and history of actions (past two steps). A foot contact timing ξ\xi can be defined as follows:

ξi(s)=1+2𝟏sin(ϕi)0i{1,,nlegs},\xi_{i}(s)=-1+2\cdot\mathbf{1}_{\sin(\phi_{i})\leq 0}\;\;\forall i\in\{1,...,n_{\mathrm{legs}}\}, (45)

where a value of -1 means that the iith foot is on the ground; otherwise, the foot is in the air. For the quadrupedal robots, Mini-Cheetah and Laikago, we use the initial phases as ϕ0={0,π,π,0}\phi_{0}=\{0,\pi,\pi,0\}, which generates trot gaits. For the bipedal robot, Cassie, the initial phases are defined as ϕ0={0,π}\phi_{0}=\{0,\pi\}, which generates walk gaits. Then, the reward and cost functions are defined as follows:

R(s,a,s)=0.1(vx,ybasevx,ycmd22+ωzbaseωzcmd22+103Rpower),\displaystyle\qquad\qquad R(s,a,s^{\prime})=-0.1\cdot(||v_{x,y}^{\mathrm{base}}-v_{x,y}^{\mathrm{cmd}}||_{2}^{2}+||\omega_{z}^{\mathrm{base}}-\omega_{z}^{\mathrm{cmd}}||_{2}^{2}+10^{-3}\cdot R_{\mathrm{power}}), (46)
C1(s,a,s)=𝟏anglea,C2(s,a,s)=𝟏heightb,C3(s,a,s)=i=1nlegs(1ξiξ^i)/(2nlegs),\displaystyle C_{1}(s,a,s^{\prime})=\mathbf{1}_{\mathrm{angle}\geq a},\;C_{2}(s,a,s^{\prime})=\mathbf{1}_{\mathrm{height}\leq b},\;C_{3}(s,a,s^{\prime})=\sum_{i=1}^{n_{\mathrm{legs}}}(1-\xi_{i}\cdot\hat{\xi}_{i})/(2\cdot n_{\mathrm{legs}}),

where the power consumption Rpower=i|τivi|R_{\mathrm{power}}=\sum_{i}|\tau_{i}v_{i}|, the sum of the torque times the actuator speed, is added to the reward as a regularization term, vx,ybasev_{x,y}^{\mathrm{base}} is the xyxy-directional linear velocity of the base frame of robots, ωzbase\omega_{z}^{\mathrm{base}} is the zz-directional angular velocity of the base frame, and ξ^{1,1}nlegs\hat{\xi}\in\{-1,1\}^{n_{\mathrm{legs}}} is the current feet contact vector. For balancing, the first cost indicates whether the angle between the zz-axis vector of the robot base and the world is greater than a threshold (a=15a=15^{\circ} for all robots). For standing, the second cost indicates whether the height of the CoM is less than a threshold (b=0.3,0.35,0.7b=0.3,0.35,0.7 for Mini-Cheetah, Laikago, and Cassie, respectively), and the last cost checks whether the current feet contact vector ξ^\hat{\xi} matches the pre-defined timing ξ\xi. The length of the episode is 500 steps. There is no early termination, but if a robot falls to the ground, the state is frozen until the end of the episode.
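As a concrete reading of (45) and the cost terms in (46), the following is a minimal sketch; the array interface and the torque and actuator-speed inputs for the power term are assumptions.

import numpy as np

def locomotion_reward(v_xy, v_cmd_xy, w_z, w_cmd_z, torques, joint_vels):
    """Reward from (46) with the power-consumption regularizer R_power."""
    r_power = np.sum(np.abs(torques * joint_vels))
    return -0.1 * (np.sum((v_xy - v_cmd_xy) ** 2) + (w_z - w_cmd_z) ** 2 + 1e-3 * r_power)

def contact_timing_cost(phases, foot_contacts):
    """C_3 from (46): mismatch between the desired timing xi of (45) and the measured
    contacts; entries of foot_contacts are -1 for a foot on the ground and +1 in the air."""
    xi = -1.0 + 2.0 * (np.sin(phases) <= 0.0)
    return np.sum(1.0 - xi * foot_contacts) / (2.0 * len(phases))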

Hyperparameter Settings. The structure of neural networks consists of two hidden layers with (512,512)(512,512) nodes and ReLU activation for all baselines and the proposed method. The input of value networks is state-action pairs, and the output is the positions of atoms. The input of policy networks is the state, the output is mean μ\mu and std σ\sigma, and actions are squashed into tanh(μ+ϵσ)\mathrm{tanh}(\mu+\epsilon\sigma), ϵ𝒩(0,1)\epsilon\sim\mathcal{N}(0,1). We use a fixed entropy coefficient β\beta. The trust region size ϵ\epsilon is set to 0.0010.001 for all trust region-based methods. The overall hyperparameters for the proposed method are summarized in Table 3.

Table 3: Hyperparameter settings for the Safety Gym and locomotion tasks.

Parameter Safety Gym Locomotion
Discount factor γ\gamma 0.99 0.99
Trust region size ϵ\epsilon 0.001 0.001
Length of replay buffer 10510^{5} 10510^{5}
Critic learning rate 0.00030.0003 0.00030.0003
Trace-decay λ\lambda 0.970.97 0.970.97
Entropy coefficient β\beta 0.0 0.001
The number of critic atoms MM 25 25
The number of target atoms MM^{\prime} 50 50
Constraint risk level α\alpha 0.25, 0.5, and 1.0 1.0
Threshold dkd_{k} 0.025/(1γ)0.025/(1-\gamma) [0.025,0.025,0.4]/(1γ)[0.025,0.025,0.4]/(1-\gamma)
Slack coefficient ζ\zeta - minkdk=0.025/(1γ)\mathrm{min}_{k}d_{k}=0.025/(1-\gamma)

Since the range of the cost is [0,1][0,1], the maximum discounted cost sum is 1/(1γ)1/(1-\gamma). Thus, the threshold is set to the target cost rate times 1/(1γ)1/(1-\gamma). For the locomotion tasks, the third cost in (46) is designed for foot stamping, which is not essential to safety. Hence, we set the threshold near the maximum (if a robot does not stamp at all, the cost rate becomes 0.5). In addition, baseline safe RL methods use multiple critic networks for the cost function, such as target networks [Yang et al., 2021] or square value networks [Kim and Oh, 2022a]. To match the number of network parameters, we use two critics as an ensemble, as in Kuznetsov et al. [2020].

Tips for Hyperparameter Tuning.

  • Discount factor γ\gamma, Critic learning rate: Since these are commonly used hyperparameters, we do not discuss them here.

  • Trace-decay λ\lambda, Trust region size ϵ\epsilon: The ablation studies on these hyperparameters are presented in Appendix C.3. From the results, we recommend setting the trace-decay to 0.950.990.95\sim 0.99 as in other TD(λ\lambda)-based methods [Precup et al., 2000]. Also, the results show that the performance is not sensitive to the trust region size. However, if the trust region size is too large, the approximation error increases, so it is better to set it below 0.0030.003.

  • Entropy coefficient β\beta: This value is fixed in our experiments, but it can be adjusted automatically as done in SAC [Haarnoja et al., 2018].

  • The number of atoms M,MM,M^{\prime}: Although experiments on the number of atoms were not performed, performance is expected to increase as the number of atoms increases, as in other distributional RL methods [Dabney et al., 2018a].

  • Length of replay buffer: The effect of the length of the replay buffer can be confirmed through the experimental results of an off-policy safe RL method [Kim and Oh, 2022a]. According to those results, the length does not impact performance unless it is too short. We recommend setting it to 10 to 100 times the collected trajectory length.

  • Constraint risk level α\alpha, threshold dkd_{k}: If the cost sum follows a Gaussian distribution, the mean-std constraint is identical to the CVaR constraint. Then, the probability of the worst case can be controlled by adjusting α\alpha. For example, if we set α=0.125\alpha=0.125 and d=0.03/(1γ)d=0.03/(1-\gamma), the mean-std constraint enforces the probability that the average cost during an episode is less than 0.03 to be greater than 95%=Φ(ϕ(Φ1(α))/α)95\%=\Phi(\phi(\Phi^{-1}(\alpha))/\alpha) (this calculation can be checked with the short snippet after this list). With this interpretation, proper α\alpha and dkd_{k} can be found.

  • Slack coefficient ζ\zeta: As mentioned at the end of Section 3.1, it is recommended to set this coefficient as large as possible. Since dkζd_{k}-\zeta should be positive, we recommend setting ζ\zeta to minkdk\min_{k}d_{k}.
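The probability calculation mentioned in the tip on α and dk above can be checked in a few lines (using SciPy's standard normal distribution functions):

from scipy.stats import norm

alpha = 0.125
coeff = norm.pdf(norm.ppf(alpha)) / alpha   # mean-std coefficient, about 1.65
print(norm.cdf(coeff))                      # about 0.95: satisfaction probability under a Gaussian cost sum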

In conclusion, most hyperparameters are not sensitive, so few need to be optimized. It seems that α\alpha and dkd_{k} need to be set based on the meaning described above. Additionally, if the approximation error of critics is significant, the trust region size should be set smaller.

Appendix C Experimental Results

C.1 Safety Gym

In this section, we present the training curves of the Safety Gym tasks separately according to the risk level of the constraints for better readability. Figure 7 shows the training results of the risk-neutral constrained algorithms and the risk-averse constrained algorithms with α=1.0. Figures 8 and 9 show the training results of the risk-averse constrained algorithms with α=0.5 and 0.25, respectively.

Figure 7: Training curves of risk-neutral constrained algorithms for the Safety Gym tasks. The solid line and shaded area represent the average and std values, respectively. The black dashed lines in the second row indicate thresholds. All methods are trained with five random seeds.
Figure 8: Training curves of risk-averse constrained algorithms with α=0.5\alpha=0.5 for the Safety Gym.
Figure 9: Training curves of risk-averse constrained algorithms with α=0.25\alpha=0.25 for the Safety Gym.

C.2 Ablation Study on Components of SDAC

There are three main differences between SDAC and the existing trust region-based safe RL algorithm for mean-std constraints [Kim and Oh, 2022a], called OffTRC: 1) the feasibility handling method in multi-constraint settings, 2) the use of distributional critics, and 3) the use of Q-functions instead of advantage functions, as explained in Appendices A.8 and A.10. Since the ablation study on feasibility handling is conducted in Section 5.3, we perform ablation studies on the distributional critic and the Q-function in this section. We call SDAC with only distributional critics SDAC-Dist and SDAC with only Q-functions SDAC-Q. If all components are absent, SDAC is identical to OffTRC [Kim and Oh, 2022a]. The variants are trained on the point goal task of the Safety Gym, and the training results are shown in Figure 10. SDAC-Q lowers the cost rate quickly but shows the lowest score. SDAC-Dist shows scores similar to SDAC, but its cost rate converges above the threshold of 0.025. In conclusion, SDAC efficiently satisfies the safety constraints through the use of Q-functions and improves score performance through the distributional critics.

Figure 10: Training curves of variants of SDAC for the point goal task.

C.3 Ablation Study on Hyperparameters

To check the effects of the hyperparameters, we conduct ablation studies on the trust region size ε and the entropy coefficient β. The results for the entropy coefficient are presented in Figure 11(a), showing that the score decreases significantly when β is 0.01. This indicates that policies with high entropy fail to improve score performance since they focus on satisfying the constraints. Thus, the entropy coefficient should be adjusted cautiously, or it may be better to set it to zero. The results for the trust region size are shown in Figure 11(b), where the results do not change significantly regardless of the trust region size. However, the score converges most slowly for ε=0.01 because the estimation error of the surrogate increases as the trust region size increases, according to Theorem A.8.

(a) Entropy coefficient β\beta.
(b) Trust region size ϵ\epsilon.
Figure 11: Training curves of SDAC with different hyperparameters for the point goal task.

Appendix D Comparison with RL Algorithms

In this section, we compare the proposed safe RL algorithm with a traditional RL algorithm in the locomotion tasks and show that safe RL has the advantage of not requiring reward tuning. We use the truncated quantile critic (TQC) [Kuznetsov et al., 2020], a state-of-the-art algorithm on existing RL benchmarks [Todorov et al., 2012], as the traditional RL baseline. To apply the same experiments to traditional RL, it is necessary to design a reward that reflects safety. We construct the reward through a weighted sum as \bar{R}=(R-\sum_{i=1}^{3}w_{i}C_{i})/(1+\sum_{i=1}^{3}w_{i}), where R and C_{1,2,3} are used to train the safe RL methods and are defined in Appendix B, and R is called the true reward. The weights of the reward function, w_{1,2,3}, are searched by a Bayesian optimization tool (we use Sweeps from Weights & Biases [Biewald, 2020]) to maximize the true reward of TQC in the Mini-Cheetah task. Among the 63 weight sets searched through Bayesian optimization, the top five are listed in Table 4.
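For reference, the shaped reward can be computed as in the short sketch below; this is illustrative only, with the default weights taken from row #1 of Table 4 and the reward and costs defined in Appendix B.

```python
# Shaped reward for TQC: R_bar = (R - sum_i w_i * C_i) / (1 + sum_i w_i).
def shaped_reward(reward, costs, weights=(1.588, 0.299, 0.174)):
    penalty = sum(w * c for w, c in zip(weights, costs))
    return (reward - penalty) / (1.0 + sum(weights))
```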

Table 4: Weights of the reward function for the Mini-Cheetah task.

Reward weights w_1 w_2 w_3
#1 1.588 0.299 0.174
#2 1.340 0.284 0.148
#3 1.841 0.545 0.951
#4 6.560 0.187 4.920
#5 1.603 0.448 0.564

Figure 12 shows the training curves of the Mini-Cheetah task experiments, where TQC is trained using the weight pairs listed in Table 4. The graph shows that it is difficult for TQC to lower the second cost below the threshold, while all costs of SDAC are below their thresholds. In particular, TQC with the fifth weight pair shows the lowest second cost rate, but its true reward sum is also the lowest. This shows that it is challenging to obtain good task performance while satisfying the constraints through reward tuning.

Figure 12: Training curves of the Mini-Cheetah task. The black dashed lines show the thresholds used for the safe RL method. The solid line represents the average value, and the shaded area shows one-fifth of the std value. The number after TQC in the legend indicates which of the reward weights in Table 4 is used. All methods are trained with five different random seeds.

Appendix E Computational Cost Analysis

E.1 Complexity of Gradient Integration Method

In this section, we analyze the computational cost of the gradient integration method. The proposed gradient integration method has three subparts. First, it is required to calculate the policy gradient of each cost surrogate, g_k, and H^{-1}g_k for all k ∈ {1, 2, ..., K}, where H is the Hessian matrix of the KL divergence. H^{-1}g_k can be computed using the conjugate gradient method, which requires only a constant number of back-propagations through the cost surrogate, so the computational cost can be expressed as K·O(BackProp).

Second, the quadratic problem in Section 3.1 is transformed into a dual problem, where the transformation process requires inner products between g_k and H^{-1}g_m for all k, m ∈ {1, 2, ..., K}. The computational cost can be expressed as K^2·O(InnerProd).

Finally, the transformed quadratic problem is solved in the dual space ℝ^K using a quadratic programming solver. Since K is usually much smaller than the number of policy parameters, this computational cost is almost negligible compared to the others. Then, the total cost of the gradient integration is K·O(BackProp) + K^2·O(InnerProd) + C. Since both the back-propagation and the inner products are proportional to the number of policy parameters |ψ|, the computational cost can be simplified as O(K^2·|ψ|).
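To make this cost accounting concrete, the sketch below (ours, not the released implementation) shows where the K conjugate-gradient solves and K^2 inner products arise, assuming the cost-surrogate gradients g_k are given as NumPy vectors together with a Hessian-vector product function for the KL Hessian; the small K-dimensional dual QP is left to an off-the-shelf solver.

```python
# Cost structure of the gradient integration step: K conjugate-gradient solves
# for H^{-1} g_k, followed by K^2 inner products g_k^T H^{-1} g_m.
import numpy as np

def conjugate_gradient(hvp, g, iters=10, tol=1e-8):
    """Approximately solve H x = g given a Hessian-vector product `hvp`."""
    x = np.zeros_like(g)
    r = g.copy()            # residual g - H x (x = 0 initially)
    p = r.copy()
    rs_old = r @ r
    for _ in range(iters):
        Hp = hvp(p)                      # one Hessian-vector product per iteration
        alpha = rs_old / (p @ Hp + 1e-12)
        x += alpha * p
        r -= alpha * Hp
        rs_new = r @ r
        if rs_new < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x

def gram_matrix(hvp, grads):
    """Return S with S[k, m] = g_k^T H^{-1} g_m, used to build the dual problem."""
    Hinv_g = [conjugate_gradient(hvp, g) for g in grads]   # K CG solves
    K = len(grads)
    S = np.empty((K, K))
    for k in range(K):
        for m in range(K):
            S[k, m] = grads[k] @ Hinv_g[m]                 # K^2 inner products
    return S
```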

E.2 Quantitative Analysis

Table 5: Training time of safe RL algorithms (in hours). The training time of each algorithm is measured as the average time required for training with five random seeds. The total numbers of training steps are 5·10^6 and 3·10^6 for the point goal task and the Mini-Cheetah task, respectively.

Task SDAC (proposed) OffTRC WCSAC CPO CVPO
Point goal (Safety Gym) 7.96 4.86 19.07 2.61 47.43
Mini-Cheetah (Locomotion) 8.36 6.54 16.41 1.99 -

We analyze the computational cost of the proposed method quantitatively. To do this, we measure the training time of the proposed method, SDAC, and the safe RL baselines. We use a workstation with an Intel Xeon E5-2650 v3 CPU and an NVIDIA GeForce GTX TITAN X GPU. The results are presented in Table 5. While CPO is the fastest algorithm, its performance, such as the sum of rewards, is relatively poor compared to the other algorithms. The main reason why CPO shows the fastest computation time is that CPO is an on-policy algorithm; hence, it does not require insertion into (and deletion from) a replay memory or batch sampling. SDAC shows the third fastest computation time among all algorithms and the second fastest among the off-policy algorithms. In particular, SDAC is only slightly slower than OffTRC, the fastest off-policy algorithm. This result shows the benefit of SDAC since SDAC outperforms OffTRC in terms of returns and CV, while its training time is not significantly increased over OffTRC. WCSAC, which is based on SAC, has a slower training time because it updates networks more frequently than the other algorithms. CVPO, an EM-based safe RL algorithm, has the slowest training time. In the E-step of CVPO, a non-parametric policy is optimized to solve a local subproblem, and the optimization process requires discretizing the action space and solving a non-linear convex optimization for all batch states. Because of this, CVPO takes the longest to train an RL agent.