
Spectral-Risk Safe Reinforcement Learning
with Convergence Guarantees

Dohyeong Kim1  Taehyun Cho1  Seungyub Han1
Hojun Chung1  Kyungjae Lee2∗  Songhwai Oh1
1Department of Electrical and Computer Engineering, Seoul National University
2Artificial Intelligence Graduate School, Chung-Ang University
∗Corresponding authors: [email protected], [email protected].
Abstract

The field of risk-constrained reinforcement learning (RCRL) has been developed to effectively reduce the likelihood of worst-case scenarios by explicitly handling risk-measure-based constraints. However, the nonlinearity of risk measures makes it challenging to achieve convergence and optimality. To overcome the difficulties posed by the nonlinearity, we propose a spectral risk measure-constrained RL algorithm, spectral-risk-constrained policy optimization (SRCPO), a bilevel optimization approach that utilizes the duality of spectral risk measures. In the bilevel optimization structure, the outer problem involves optimizing dual variables derived from the risk measures, while the inner problem involves finding an optimal policy given these dual variables. The proposed method, to the best of our knowledge, is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the proposed method has been evaluated on continuous control tasks and achieved the best performance among RCRL algorithms that satisfy the constraints.

1 Introduction

Safe reinforcement learning (Safe RL) has been extensively researched [2, 15, 25], particularly for its applications in safety-critical areas such as robotic manipulation [3, 17] and legged robot locomotion [28]. In safe RL, constraints are often defined by the expectation of the discounted sum of costs [2, 25], where cost functions are defined to capture safety signals. These expectation-based constraints can leverage existing RL techniques since the value functions for costs follow the Bellman equation. However, they inadequately reduce the likelihood of significant costs, which can result in worst-case outcomes, because expectations do not capture information on the tail distribution. In real-world applications, avoiding such worst-case scenarios is particularly crucial, as they can lead to irrecoverable damage.

In order to avoid worst-case scenarios, various safe RL methods [10, 26] attempt to define constraints using risk measures, such as conditional value at risk (CVaR) [20] and its generalized form, spectral risk measures [1]. Yang et al. [26] and Zhang and Weng [30] have employed a distributional critic to estimate the risk constraints and calculate policy gradients. Zhang et al. [32] and Chow et al. [10] have introduced auxiliary variables to estimate CVaR. These approaches have effectively prevented worst-case outcomes by implementing risk constraints. However, the nonlinearity of risk measures makes it difficult to develop methods that ensure optimality. To the best of our knowledge, even for well-known risk measures such as CVaR, existing methods only guarantee local convergence [10, 32]. This highlights the need for a method that ensures convergence to a global optimum.

In this paper, we introduce a spectral risk measure-constrained RL (RCRL) algorithm called spectral-risk-constrained policy optimization (SRCPO). Spectral risk measures, including CVaR, have a dual form expressed as the expectation over a random variable [7]. Leveraging this duality, we propose a bilevel optimization framework designed to alleviate the challenges caused by the nonlinearity of risk measures. This framework consists of two levels: the outer problem, which involves optimizing a dual variable that originates from the dual form of spectral risk measures, and the inner problem, which aims to find an optimal policy for a given dual variable. For the inner problem, we define novel risk value functions that exhibit linearity in performance difference and propose a policy gradient method that ensures convergence to an optimal policy in tabular settings. The outer problem is potentially non-convex, so gradient-based optimization techniques are not suitable for global convergence. Instead, we propose a method to model and update a distribution over the dual variable, which also ensures finding an optimal dual variable in tabular settings. As a result, to the best of our knowledge, we present the first algorithm that guarantees convergence to an optimum for the RCRL problem. Moreover, the proposed method has been evaluated on continuous control tasks with both single and multiple constraints, demonstrating the highest reward performance among other RCRL algorithms satisfying constraints in all tasks. Our contributions are summarized as follows:

  • We propose a novel bilevel optimization framework for RCRL, effectively addressing the complexities caused by the nonlinearity of risk measures.

  • The proposed method demonstrates convergence and optimality in tabular settings.

  • The proposed method has been evaluated on continuous control tasks with both single and multiple constraints, consistently outperforming existing RCRL algorithms.

2 Related Work

RCRL algorithms can be classified according to how constraints are handled and how risks are estimated. There are two main approaches to handling constraints: the Lagrangian [29, 26, 32] and the primal [15, 30] approaches. The Lagrangian approach deals with dual problems by jointly updating dual variables and policies, which can be easily integrated into the existing RL framework. In contrast, the primal approach directly tackles the original problem by constructing a subproblem according to the current policy and solving it. Subsequently, risk estimation can be categorized into three types. The first type [29, 32] uses an auxiliary variable to precisely estimate CVaR through its dual form. Another one [26, 30] approximates risks as the expectation of the state-wise risks calculated using a distributional critic [8]. The third one [15] approximates CVaR by assuming that cost returns follow Gaussians. Despite these diverse methods, they only guarantee convergence to local optima. In the field of risk-sensitive RL, Bäuerle and Glauner [7] have proposed an algorithm for spectral risk measures that ensures convergence to an optimum. They introduced a bilevel optimization approach, like our method, but did not provide a method for solving the outer problem. Furthermore, their approach to the inner problem is value-based, making it complicated to extend to RCRL, which typically requires a policy gradient approach.

3 Background

3.1 Constrained Markov Decision Processes

A constrained Markov decision process (CMDP) is defined as $\langle S,A,P,\rho,\gamma,R,\{C_{i}\}_{i=1}^{N}\rangle$, where $S$ is a state space, $A$ is an action space, $P$ is a transition model, $\rho$ is an initial state distribution, $\gamma$ is a discount factor, $R:S\times A\times S\mapsto[-R_{\mathrm{max}},R_{\mathrm{max}}]$ is a reward function, $C_{i}:S\times A\times S\mapsto[0,C_{\mathrm{max}}]$ is a cost function, and $N$ is the number of cost functions. A policy $\pi:S\mapsto\mathcal{P}(A)$ is defined as a mapping from the state space to the action distribution. Then, the state distribution given $\pi$ is defined as $d_{\rho}^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbb{P}(s_{t}=s)$, where $s_{0}\sim\rho$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for all $t$. The state-action return of $\pi$ is defined as:

Z_{R}^{\pi}(s,a):=\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})\;\Big|\;s_{0}=s,\;a_{0}=a,\;a_{t}\sim\pi(\cdot|s_{t}),\;s_{t+1}\sim P(\cdot|s_{t},a_{t})\;\forall t.

The state return and the return are defined, respectively, as:

Y_{R}^{\pi}(s):=Z_{R}^{\pi}(s,a)\;\big|\;a\sim\pi(\cdot|s),\quad G_{R}^{\pi}:=Y_{R}^{\pi}(s)\;\big|\;s\sim\rho.

The returns for a cost, $Z_{C_{i}}^{\pi}$, $Y_{C_{i}}^{\pi}$, and $G_{C_{i}}^{\pi}$, are defined by replacing the reward with the $i$th cost. Then, a safe RL problem can be defined as maximizing the reward return while keeping the cost returns below predefined thresholds to ensure safety.
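To make the above definitions concrete, the following minimal sketch estimates $\mathbb{E}[G_{R}^{\pi}]$ and $\mathbb{E}[G_{C}^{\pi}]$ by Monte Carlo rollouts in a small randomly generated CMDP; the environment, policy, and horizon are illustrative placeholders and not part of the paper's benchmarks.

```python
# Minimal sketch: Monte Carlo estimates of the reward return G_R and a cost
# return G_C in a toy tabular CMDP with a fixed Markov policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition model
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_states))   # reward function
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))    # single cost function
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # fixed Markov policy

def rollout(horizon=300):
    """Return one truncated sample of (G_R, G_C) from a uniform initial state."""
    s, g_r, g_c, disc = rng.integers(n_states), 0.0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        g_r += disc * R[s, a, s_next]
        g_c += disc * C[s, a, s_next]
        disc *= gamma
        s = s_next
    return g_r, g_c

samples = np.array([rollout() for _ in range(1000)])
print("E[G_R] ~", samples[:, 0].mean(), " E[G_C] ~", samples[:, 1].mean())
```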

3.2 Spectral Risk Measures

In order to reduce the likelihood of encountering the worst-case scenario, it is required to incorporate risk measures into constraints. Conditional value at risk (CVaR) is a widely used risk measure in finance [20], and it has been further developed into a spectral risk measure in [1], which provides a more comprehensive framework. A spectral risk measure with spectrum $\sigma$ is defined as:

\mathcal{R}_{\sigma}(X):=\int_{0}^{1}F_{X}^{-1}(u)\sigma(u)du,

where $\sigma$ is an increasing function, $\sigma\geq 0$, and $\int_{0}^{1}\sigma(u)du=1$. By appropriately defining the function $\sigma$, various risk measures can be established, and examples are as follows:

\mathrm{CVaR}_{\alpha}:\sigma(u)=\mathbf{1}_{u\geq\alpha}/(1-\alpha),\quad\mathrm{Pow}_{\alpha}:\sigma(u)=u^{\alpha/(1-\alpha)}/(1-\alpha), (1)

where $\alpha\in[0,1)$ represents a risk level, and the measures become risk neutral when $\alpha=0$. Furthermore, it is known that the spectral risk measure has the following dual form [7]:

\mathcal{R}_{\sigma}(X)=\inf_{g}\mathbb{E}[g(X)]+\int_{0}^{1}g^{*}(\sigma(u))du, (2)

where $g:\mathbb{R}\mapsto\mathbb{R}$ is an increasing convex function, and $g^{*}(y):=\sup_{x}xy-g(x)$ is the convex conjugate function of $g$. For example, $\mathrm{CVaR}$ is expressed as $\mathrm{CVaR}_{\alpha}(X)=\inf_{\beta}\mathbb{E}[\frac{1}{1-\alpha}(X-\beta)_{+}]+\beta$, where $g(x)=(x-\beta)_{+}/(1-\alpha)$, and the integral part of the conjugate function becomes $\beta$. We define the sub-risk measure corresponding to the function $g$ as follows:

\mathcal{R}_{\sigma}^{g}(X):=\mathbb{E}[g(X)]+\int_{0}^{1}g^{*}(\sigma(u))du. (3)

Then, $\mathcal{R}_{\sigma}(X)=\inf_{g}\mathcal{R}^{g}_{\sigma}(X)$. Note that the integral term is independent of $X$ and only serves as a constant offset. As a result, the sub-risk measure can be expressed using an expectation, so it provides computational advantages over the original risk measure in many operations. We will utilize this advantage to show the optimality of the proposed method.
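As a small numeric illustration of the dual form (2)-(3) (not taken from the paper), the sketch below evaluates the CVaR sub-risk measure $\mathcal{R}^{g}_{\sigma}(X)=\mathbb{E}[(X-\beta)_{+}/(1-\alpha)]+\beta$ on a grid of $\beta$ and checks that its infimum matches the primal tail-average definition; the distribution of $X$ is an arbitrary choice.

```python
# Numeric check of the CVaR dual form: inf over beta of E[(X - beta)_+/(1-alpha)] + beta.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.75
x = rng.exponential(scale=1.0, size=50_000)        # samples of a cost return X (illustrative)

# Primal definition with the CVaR spectrum in (1): mean of the worst (1 - alpha) tail.
var_alpha = np.quantile(x, alpha)
cvar_primal = x[x >= var_alpha].mean()

# Dual (sub-risk) form evaluated on a grid of beta; the infimum is attained near VaR_alpha.
betas = np.linspace(0.0, x.max(), 400)
sub_risk = np.array([np.maximum(x - b, 0.0).mean() / (1 - alpha) + b for b in betas])
print("primal CVaR :", cvar_primal)
print("dual   CVaR :", sub_risk.min(), "attained near beta =", betas[sub_risk.argmin()])
```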

3.3 Augmented State Spaces

As mentioned by Zhang et al. [32], a risk-constrained RL problem is non-Markovian, so a history-conditioned policy is required to solve it. Also, Bastani et al. [6] showed that an optimal history-conditioned policy can be expressed as a Markov policy defined on a newly augmented state space, which incorporates the discounted sum of costs as part of the state. Following these results, we define an augmented state as follows:

\bar{s}_{t}:=(s_{t},\{e_{i,t}\}_{i=1}^{N},b_{t}),\;\text{where}\;e_{i,0}=0,\;e_{i,t+1}=(C_{i}(s_{t},a_{t},s_{t+1})+e_{i,t})/\gamma,\;b_{t}=\gamma^{t}. (4)

Then, the CMDP is modified as follows:

\bar{s}_{t+1}=(s_{t+1},\{(c_{i,t}+e_{i,t})/\gamma\}_{i=1}^{N},\gamma b_{t}),\;\text{where}\;s_{t+1}\sim P(\cdot|s_{t},a_{t}),\;c_{i,t}=C_{i}(s_{t},a_{t},s_{t+1}),

and the policy $\pi(\cdot|\bar{s})$ is defined on the augmented state space.
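A minimal sketch of the augmented-state bookkeeping in (4) is given below, assuming a single cost and an abstract environment state; it also checks the identity $b_{t}e_{i,t}=\sum_{j<t}\gamma^{j}c_{i,j}$ used later in Section 5.1.

```python
# Sketch of the augmented transition (4): e tracks the discounted cost-so-far
# rescaled by gamma^{-t}, and b tracks gamma^t.
from dataclasses import dataclass

@dataclass(frozen=True)
class AugState:
    s: object        # original environment state
    e: float         # e_t = sum_{j<t} gamma^j c_j / gamma^t (single cost for brevity)
    b: float         # b_t = gamma^t

def augment_reset(s0):
    return AugState(s=s0, e=0.0, b=1.0)

def augment_step(aug, s_next, cost, gamma):
    """Augmented transition: s_bar_{t+1} = (s_{t+1}, (c_t + e_t)/gamma, gamma * b_t)."""
    return AugState(s=s_next, e=(cost + aug.e) / gamma, b=gamma * aug.b)

# Sanity check: b_t * e_t always equals the discounted cost accumulated so far.
gamma, costs = 0.9, [0.5, 0.0, 1.0, 0.2]
aug, acc, disc = augment_reset("s0"), 0.0, 1.0
for t, c in enumerate(costs):
    aug = augment_step(aug, f"s{t + 1}", c, gamma)
    acc += disc * c
    disc *= gamma
    assert abs(aug.b * aug.e - acc) < 1e-12
print("b*e =", aug.b * aug.e, " discounted cost sum =", acc)
```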

4 Proposed Method

The risk measure-constrained RL (RCRL) problem is defined as follows:

\max_{\pi}\mathbb{E}[G_{R}^{\pi}]\;\;\mathbf{s.t.}\;\mathcal{R}_{\sigma_{i}}(G_{C_{i}}^{\pi})\leq d_{i}\;\forall i\in\{1,2,...,N\}, (5)

where $d_{i}$ is the threshold of the $i$th constraint. Due to the nonlinearity of the risk measure, achieving an optimal policy through policy gradient techniques is challenging. To address this issue, we propose solving the RCRL problem by decomposing it into two separate problems. Using a feasibility function $\mathcal{F}$, defined as $\mathcal{F}(x):=0$ if $x\leq 0$ and $\infty$ otherwise, the RCRL problem can be rewritten as follows:

\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\mathcal{F}(\mathcal{R}_{\sigma_{i}}(G_{C_{i}}^{\pi})-d_{i})=\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\inf_{g_{i}}\mathcal{F}(\mathcal{R}^{g_{i}}_{\sigma_{i}}(G_{C_{i}}^{\pi})-d_{i}) (6)
=\underbrace{\sup_{\{g_{i}\}_{i=1}^{N}}\underbrace{\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\mathcal{F}(\mathcal{R}^{g_{i}}_{\sigma_{i}}(G_{C_{i}}^{\pi})-d_{i})}_{\textbf{(a)}}}_{\textbf{(b)}},

where (a) is the inner problem, (b) is the outer problem, and $\{g_{i}\}_{i=1}^{N}$ is referred to as the dual variable. It is known that if the cost functions are bounded and there exists a policy that satisfies the constraints of (5), an optimal dual variable exists [7], which turns the supremum in (6) into a maximum. Then, the RCRL problem can be solved by 1) finding optimal policies for various dual variables and 2) selecting the dual variable whose corresponding policy achieves the maximum expected reward return while satisfying the constraints.

The inner problem is then a safe RL problem with sub-risk measure constraints. However, although the sub-risk measure is expressed using an expectation as in (3), the function $g$ introduces nonlinearity, which makes it difficult to develop a policy gradient method. To address this issue, we propose novel risk value functions that exhibit linearity in the performance difference between two policies. Through these risk value functions, we propose a policy update rule with convergence guarantees for the inner problem in Section 5.2.

The search space of the outer problem is a function space, rendering the problem impractical to solve directly. To address this challenge, we parameterize the function $g_{i}$ using a parameter $\beta_{i}\in B\subset\mathbb{R}^{M-1}$ in Section 6.1. The outer problem could then be solved through a brute-force search over the parameter space, but this is computationally prohibitive, requiring time exponential in the number of constraints and the dimension of the parameter space. Consequently, an efficient method for solving the outer problem is necessary. Given the difficulty of directly identifying an optimal $\beta^{*}$, we instead model a distribution $\xi$ that estimates the probability that a given $\beta$ is optimal. We then propose a method in Section 6.2 that ensures convergence to an optimal distribution $\xi^{*}$, which deterministically samples an optimal $\beta^{*}$.

By integrating the proposed methods to solve the inner and outer problems, we finally introduce an RCRL algorithm called spectral-risk-constrained policy optimization (SRCPO). The pseudo-code for SRCPO is presented in Algorithm 1. In the following sections, we will describe how to solve the inner problem and the outer problem in detail.

Algorithm 1 Pseudo-Code of Spectral-Risk-Constrained Policy Optimization (SRCPO)
  Input: Spectrum function $\sigma_{i}$ for $i\in\{1,...,N\}$.
  Parameterize $g_{i}$ of $\sigma_{i}$, defined in (3), using $\beta_{i}\in B\subset\mathbb{R}^{M-1}$.  // Parameterization, Section 6.1.
  for $t=1$ to $T$ do
     for $\boldsymbol{\beta}=\{\beta_{1},...,\beta_{N}\}\in B^{N}$ do
        Update $\pi_{\boldsymbol{\beta},t}$ using the proposed policy gradient method.  // Inner problem, Section 5.2.
     end for
     Calculate the target distribution of $\xi$ using $\pi_{\boldsymbol{\beta},t}$, and update $\xi_{t}$.  // Outer problem, Section 6.2.
  end for
  Output: $\pi_{\boldsymbol{\beta},T}$, where $\boldsymbol{\beta}\sim\xi_{T}$.

5 Inner Problem: Safe RL with Sub-Risk Measure

In this section, we propose a policy gradient method to solve the inner problem, defined as:

\max_{\pi}\mathbb{E}[G_{R}^{\pi}]\;\;\mathbf{s.t.}\;\mathcal{R}^{g_{i}}_{\sigma_{i}}(G_{C_{i}}^{\pi})\leq d_{i}\;\forall i, (7)

where constraints consist of the sub-risk measures, defined in (3). In the remainder of this section, we first introduce risk value functions to calculate the policy gradient of the sub-risk measures, propose a policy update rule, and finally demonstrate its convergence to an optimum in tabular settings.

5.1 Risk Value Functions

In order to derive the policy gradient of the sub-risk measure, we first define risk value functions. To this end, using the fact that $e_{i,t}=\sum_{j=0}^{t-1}\gamma^{j}c_{i,j}/\gamma^{t}$ and $b_{t}=\gamma^{t}$, where $\{e_{i,t}\}_{i=1}^{N}$ and $b_{t}$ are defined in the augmented state $\bar{s}_{t}$, we expand the sub-risk measure of the $i$th cost as follows:

\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})-\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du=\int_{-\infty}^{\infty}g_{i}(z)f_{G_{C_{i}}^{\pi}}(z)dz=\mathbb{E}_{\bar{s}_{0}\sim\rho}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{Y_{C_{i}}^{\pi}(\bar{s}_{0})}(z)dz\right]
=\underset{\rho,\pi,P}{\mathbb{E}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{\sum_{j=0}^{t-1}\gamma^{j}c_{i,j}+\gamma^{t}Y_{C_{i}}^{\pi}(\bar{s}_{t})}(z)dz\right]=\underset{\rho,\pi,P}{\mathbb{E}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{b_{t}(e_{i,t}+Y_{C_{i}}^{\pi}(\bar{s}_{t}))}(z)dz\right]\;\forall t,

where $f_{X}$ is the probability density function (pdf) of a random variable $X$. To capture the value of the sub-risk measure, we can define risk value functions as follows:

V_{i,g}^{\pi}(\bar{s}):=\int_{-\infty}^{\infty}g(z)f_{b(e_{i}+Y_{C_{i}}^{\pi}(\bar{s}))}(z)dz,\quad Q_{i,g}^{\pi}(\bar{s},a):=\int_{-\infty}^{\infty}g(z)f_{b(e_{i}+Z_{C_{i}}^{\pi}(\bar{s},a))}(z)dz.
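Since $V_{i,g}^{\pi}(\bar{s})=\mathbb{E}[g(b(e_{i}+Y_{C_{i}}^{\pi}(\bar{s})))]$, a simple Monte Carlo estimate is possible whenever samples of the state cost return are available. The sketch below illustrates this with an arbitrary sample distribution and a CVaR-type $g$; it is not the estimator used in the paper, which relies on the distributional critic of Section 7.

```python
# Monte Carlo sketch of V_{i,g}(s_bar) = E[g(b * (e_i + Y))] from samples of Y.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.75, 2.0
g = lambda z: np.maximum(z - beta, 0.0) / (1 - alpha)        # an increasing convex g (CVaR-type)

e_i, b = 1.3, 0.9 ** 5                                        # entries of the augmented state s_bar
y_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)     # placeholder samples of Y_C^pi(s_bar)
v_risk = g(b * (e_i + y_samples)).mean()
print("V_{i,g}(s_bar) ~", v_risk)
```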

Using the convexity of $g$, we present the following lemma on the bounds of the risk value functions:

Lemma 5.1.

Assuming that the function $g$ is differentiable, $V^{\pi}_{i,g}(\bar{s})$ and $Q^{\pi}_{i,g}(\bar{s},a)$ are bounded within $[g(e_{i}b),\,g(e_{i}b+bC_{\mathrm{max}}/(1-\gamma))]$, and $|Q^{\pi}_{i,g}(\bar{s},a)-V^{\pi}_{i,g}(\bar{s})|\leq bC$, where $C=\frac{C_{\mathrm{max}}}{1-\gamma}g^{\prime}(\frac{C_{\mathrm{max}}}{1-\gamma})$ and $\bar{s}=(s,\{e_{i}\}_{i=1}^{N},b)$.

The proof is provided in Appendix A. Lemma 5.1 shows that the magnitude of $Q^{\pi}_{i,g}(\bar{s},a)-V^{\pi}_{i,g}(\bar{s})$ scales with $b$. To eliminate this scale, we define a risk advantage function as follows:

A_{i,g}^{\pi}(\bar{s},a):=(Q_{i,g}^{\pi}(\bar{s},a)-V_{i,g}^{\pi}(\bar{s}))/b.

Now, we introduce a theorem on the difference in the sub-risk measure between two policies.

Theorem 5.2.

Given two policies, $\pi$ and $\pi^{\prime}$, the difference in the sub-risk measure is:

\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi^{\prime}})-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})=\mathbb{E}_{d_{\rho}^{\pi^{\prime}},\pi^{\prime}}\left[A_{i,g_{i}}^{\pi}(\bar{s},a)\right]/(1-\gamma). (8)

The proof is provided in Appendix A. Since Theorem 5.2 also holds for the optimal policy, it is beneficial for showing optimality, as done in policy gradient algorithms of traditional RL [4]. Additionally, to calculate the advantage function, we use a distributional critic instead of directly estimating the risk value functions. This process is described in detail in Section 7.

5.2 Policy Update Rule

In this section, we derive the policy gradient of the sub-risk measure and introduce a policy update rule to solve the inner problem. Before that, we first parameterize the policy as $\pi_{\theta}$, where $\theta\in\Theta$, and denote the objective function as $\mathbb{E}[G_{R}^{\pi_{\theta}}]=J_{R}(\pi)=J_{R}(\theta)$ and the $i$th constraint function as $\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi_{\theta}})=J_{C_{i}}(\pi)=J_{C_{i}}(\theta)$. Then, using (8) and the fact that $\mathbb{E}_{a\sim\pi(\cdot|\bar{s})}[A_{i,g}^{\pi}(\bar{s},a)]=0$, the policy gradient of the $i$th constraint function is derived as follows:

\nabla_{\theta}J_{C_{i}}(\theta)=\mathbb{E}_{d_{\rho}^{\pi_{\theta}},\pi_{\theta}}[A_{i,g_{i}}^{\pi_{\theta}}(\bar{s},a)\nabla_{\theta}\log(\pi_{\theta}(a|\bar{s}))]/(1-\gamma). (9)

Now, we propose a policy update rule as follows:

\theta_{t+1}=\begin{cases}\theta_{t}+\alpha_{t}F^{\dagger}(\theta_{t})\nabla_{\theta}\left(J_{R}(\theta)-\alpha_{t}\sum_{i=1}^{N}\lambda_{t,i}J_{C_{i}}(\theta)\right)\big|_{\theta=\theta_{t}}&\text{if the constraints are satisfied},\\ \theta_{t}+\alpha_{t}F^{\dagger}(\theta_{t})\nabla_{\theta}\left(\alpha_{t}\nu_{t}J_{R}(\theta)-\sum_{i=1}^{N}\lambda_{t,i}J_{C_{i}}(\theta)\right)\big|_{\theta=\theta_{t}}&\text{otherwise},\end{cases}

where $\alpha_{t}$ is a learning rate, $F(\theta)$ is the Fisher information matrix, $A^{\dagger}$ is the Moore-Penrose inverse of $A$, $\{\lambda_{t,1},...,\lambda_{t,N},\nu_{t}\}\subset[0,\lambda_{\mathrm{max}}]$ are weights, and, when the constraints are not satisfied, $\lambda_{t,i}=0$ for $i\in\{i\,|\,J_{C_{i}}(\theta)\leq d_{i}\}$ and $\sum_{i}\lambda_{t,i}=1$. There are various possible strategies for determining $\lambda_{t,i}$ and $\nu_{t}$, which allows the proposed method to generalize many primal approach-based safe RL algorithms [2, 25, 15]. Several options for determining $\lambda_{t,i}$ and $\nu_{t}$, as well as our strategy, are described in Appendix B.
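The following sketch illustrates the direction-selection logic of the update rule, assuming gradient estimates of $J_{R}$ and $J_{C_{i}}$ from (9) are already available and using the identity matrix in place of $F^{\dagger}(\theta_{t})$; the choice of $\lambda_{t,i}$ in the feasible case is one illustrative option among the strategies discussed in Appendix B.

```python
# Sketch of the update-direction logic, with F^dagger replaced by the identity.
import numpy as np

def srcpo_direction(grad_R, grad_C, J_C, d, alpha_t, nu_t=1.0):
    """Return the update direction for one step of the proposed rule."""
    violated = [i for i in range(len(J_C)) if J_C[i] > d[i]]
    if not violated:
        # All constraints satisfied: ascend the reward with a small cost penalty.
        # This uniform lambda is one illustrative choice, not the paper's strategy.
        lam = np.full(len(J_C), 1.0 / len(J_C))
        return grad_R - alpha_t * np.sum(lam[:, None] * grad_C, axis=0)
    # Some constraint violated: descend the violated costs, keep a small reward term,
    # with lambda_i = 0 for satisfied constraints and sum_i lambda_i = 1.
    lam = np.zeros(len(J_C))
    lam[violated] = 1.0 / len(violated)
    return alpha_t * nu_t * grad_R - np.sum(lam[:, None] * grad_C, axis=0)

# Illustrative usage with random gradient estimates for two constraints.
rng = np.random.default_rng(0)
grad_R, grad_C = rng.normal(size=8), rng.normal(size=(2, 8))
theta = rng.normal(size=8)
direction = srcpo_direction(grad_R, grad_C, J_C=[0.3, 0.8], d=[0.5, 0.5], alpha_t=0.1)
theta = theta + 0.1 * direction        # theta_{t+1} = theta_t + alpha_t * direction
print(direction)
```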

5.3 Convergence Analysis

We analyze the convergence of the proposed policy update rule in a tabular setting where both the augmented state space and action space are finite. To demonstrate convergence in this setting, we use the softmax policy parameterization [4] as follows:

\pi_{\theta}(a|\bar{s}):=\exp(\theta(\bar{s},a))\big/\left(\sum_{a^{\prime}}\exp(\theta(\bar{s},a^{\prime}))\right)\;\forall(\bar{s},a)\in\bar{S}\times A.

Then, we can show that the policy converges to an optimal policy if updated by the proposed method.

Theorem 5.3.

Assume that there exists a feasible policy $\pi_{f}$ such that $J_{C_{i}}(\pi_{f})\leq d_{i}-\eta$ for all $i$, where $\eta>0$. Then, if a policy is updated by the proposed update rule with a learning rate $\alpha_{t}$ that follows the Robbins-Monro condition [19], it will converge to an optimal policy of the inner problem.

The proof is provided in Appendix A.1, and the assumption of the existence of a feasible policy is commonly used in convergence analysis in safe RL [12, 5, 15]. Through this result, we can also demonstrate convergence of existing primal approach-based safe RL methods, such as CPO [2], PCPO [27], and P3O [31] (note that these safe RL algorithms are designed to handle expectation-based constraints, so they are not suitable for RCRL), by identifying proper $\lambda_{t,i}$ and $\nu_{t}$ strategies for these methods.

6 Outer Problem: Dual Optimization for Spectral Risk Measure

Figure 1: Discretization of spectrum.

In this section, we propose a method for solving the outer problem that can be performed concurrently with the policy update of the inner problem. To this end, since the domain of the outer problem is a function space, we first introduce a procedure to parameterize the functions, allowing the problem to be practically solved. Subsequently, we propose a method to find an optimal solution for the parameterized outer problem.

6.1 Parameterization

Searching the function space of $g$ is challenging. To address this issue, we discretize the spectrum function as illustrated in Figure 1, since the discretization allows $g$ to be expressed using a finite number of parameters. The discretized spectrum is then formulated as:

\sigma(u)\approx\tilde{\sigma}(u):=\eta_{1}+\sum_{i=1}^{M-1}(\eta_{i+1}-\eta_{i})\mathbf{1}_{u\geq\alpha_{i}}, (10)

where $0\leq\eta_{i}\leq\eta_{i+1}$, $0\leq\alpha_{i}\leq\alpha_{i+1}\leq 1$, and $M$ is the number of discretization levels. Proper values of $\eta_{i}$ and $\alpha_{i}$ can be obtained by minimizing the $L_{1}$ distance between the original and discretized spectra:

\min_{\{\eta_{i}\}_{i=1}^{M},\{\alpha_{i}\}_{i=1}^{M-1}}\int_{0}^{1}|\sigma(u)-\tilde{\sigma}(u)|du\quad\mathbf{s.t.}\;\int_{0}^{1}\tilde{\sigma}(u)du=1. (11)
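As a rough illustration (not the optimizer used for (11)), the sketch below builds a step approximation $\tilde{\sigma}$ of the $\mathrm{Pow}_{\alpha}$ spectrum on a uniform grid of breakpoints, rescales the step heights so that $\tilde{\sigma}$ integrates to one, and reports the resulting $L_{1}$ error.

```python
# Heuristic discretization of the Pow_alpha spectrum into an M-level step function.
import numpy as np

alpha_risk, M = 0.75, 8
p = alpha_risk / (1 - alpha_risk)
sigma = lambda u: (u ** p) / (1 - alpha_risk)        # Pow_alpha spectrum from (1)

alphas = np.linspace(0.0, 1.0, M + 1)[1:-1]          # M-1 breakpoints on a uniform grid
edges = np.concatenate(([0.0], alphas, [1.0]))
# eta_i: average of sigma over each piece, rescaled so that sigma_tilde integrates to 1.
eta = np.array([sigma(np.linspace(a, b, 200)).mean() for a, b in zip(edges[:-1], edges[1:])])
eta /= np.sum(eta * np.diff(edges))

u = np.linspace(0.0, 1.0, 20_001)
sigma_tilde = eta[np.searchsorted(alphas, u, side="right")]
print("L1 error |sigma - sigma_tilde|:", np.mean(np.abs(sigma(u) - sigma_tilde)))
```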

Now, we can show that the difference between the original risk measure and its discretized version is bounded by the inverse of the number of discretizations.

Lemma 6.1 (Approximation Error).

The difference between the original risk measure and its discretized version is bounded as:

|\mathcal{R}_{\sigma}(X)-\mathcal{R}_{\tilde{\sigma}}(X)|\leq C_{\mathrm{max}}\sigma(1)/((1-\gamma)M).

The proof is provided in Appendix A. Lemma 6.1 means that as the number of discretization levels increases, the discretized risk measure converges to the original one. Note that the discretized risk measure is also a spectral risk measure, indicating that the discretization process projects the original risk measure onto a specific class of spectral risk measures. Now, we introduce a parameterization method as detailed in the following theorem:

Theorem 6.2.

Let us parameterize the function $g$ using a parameter $\beta\in B\subset\mathbb{R}^{M-1}$ as:

g_{\beta}(x):=\eta_{1}x+\sum_{i=1}^{M-1}(\eta_{i+1}-\eta_{i})(x-\beta[i])_{+},

where $\beta[i]\leq\beta[i+1]$ for $i\in\{1,...,M-2\}$. Then, the following is satisfied:

\mathcal{R}_{\tilde{\sigma}}(X)=\inf_{\beta}\mathcal{R}^{\beta}_{\tilde{\sigma}}(X),\;\text{where}\;\mathcal{R}^{\beta}_{\tilde{\sigma}}(X):=\mathbb{E}[g_{\beta}(X)]+\int_{0}^{1}g_{\beta}^{*}(\tilde{\sigma}(u))du.

According to Remark 2.7 in [7], the infimum function in (2) exists and can be expressed as an integral of the spectrum function if the reward and cost functions are bounded. By using this fact, Theorem 6.2 can be proved, and details are provided in Appendix A. Through this parameterization, the RCRL problem is now expressed as follows:

\sup_{\{\beta_{i}\}_{i=1}^{N}}\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\mathcal{F}(\mathcal{R}^{\beta_{i}}_{\tilde{\sigma}_{i}}(G_{C_{i}}^{\pi})-d_{i}), (12)

where $\beta_{i}\in B\subset\mathbb{R}^{M-1}$ is the parameter of the $i$th constraint. In the remainder of the paper, we denote the constraint function $\mathcal{R}^{\beta_{i}}_{\tilde{\sigma}_{i}}(G_{C_{i}}^{\pi})$ as $J_{C_{i}}(\pi;\boldsymbol{\beta})$, where $\boldsymbol{\beta}=\{\beta_{i}\}_{i=1}^{N}$, and the policy for $\boldsymbol{\beta}$ as $\pi_{\boldsymbol{\beta}}$.
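The identity in Theorem 6.2 can be checked numerically on an empirical sample. In the sketch below, $\mathcal{R}^{\beta}_{\tilde{\sigma}}$ is evaluated piece by piece as in the CVaR example after (2), so that each kink of $g_{\beta}$ contributes $(\eta_{i+1}-\eta_{i})\big(\mathbb{E}[(X-\beta[i])_{+}]+(1-\alpha_{i})\beta[i]\big)$, and the infimum is attained at $\beta[i]=F_{X}^{-1}(\alpha_{i})$; the step spectrum and the sample distribution are illustrative choices.

```python
# Numeric check of Theorem 6.2 for an empirical sample and a normalized step spectrum.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.gamma(2.0, 1.0, size=200_000))     # empirical sample of a cost return X
eta = np.array([0.2, 0.7, 1.7, 3.7])               # nondecreasing step heights
alphas = np.array([0.4, 0.7, 0.9])                 # breakpoints; sigma_tilde integrates to 1 here

def R_beta(beta):
    """R^beta(X) = E[g_beta(X)] + conjugate term, evaluated kink by kink."""
    val = eta[0] * x.mean()
    for d_eta, a, b in zip(np.diff(eta), alphas, beta):
        val += d_eta * (np.maximum(x - b, 0.0).mean() + (1 - a) * b)
    return val

# Primal value: integral of F_X^{-1}(u) * sigma_tilde(u) du, with sorted x as empirical quantiles.
u = (np.arange(x.size) + 0.5) / x.size
sigma_tilde = eta[0] + np.diff(eta) @ (u[None, :] >= alphas[:, None]).astype(float)
primal = np.mean(x * sigma_tilde)

beta_star = np.quantile(x, alphas)                  # infimum at beta[i] = F_X^{-1}(alpha_i)
print("primal:", primal, " dual at beta*:", R_beta(beta_star))
```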

6.2 Optimization

To obtain the supremum over $\boldsymbol{\beta}$ in (12), a brute-force search can be used, but it demands time exponential in the dimension of $\boldsymbol{\beta}$. Alternatively, $\boldsymbol{\beta}$ can be optimized directly via gradient-based methods; however, since the convexity of the outer problem is difficult to confirm, convergence to an optimum cannot be assured. To resolve this issue, we instead propose to find a distribution over $\boldsymbol{\beta}$, called a sampler $\xi\in\mathcal{P}(B)^{N}$, that outputs the likelihood that a given $\boldsymbol{\beta}$ is optimal.

To describe the sampler $\xi$ in detail, the probability of sampling $\boldsymbol{\beta}$ is expressed as $\xi(\boldsymbol{\beta})=\prod_{i=1}^{N}\xi[i](\beta_{i})$, where $\xi[i]$ is the $i$th element of $\xi$, and $\beta_{i}$ is $\boldsymbol{\beta}[i]$. The sampling process is denoted by $\boldsymbol{\beta}\sim\xi$, where each component is sampled according to $\beta_{i}\sim\xi[i]$. The implementation of this sampling process resembles the stick-breaking process [23] due to the condition $\beta_{i}[j]\leq\beta_{i}[j+1]$ for $j\in\{1,...,M-2\}$. Initially, $\beta_{i}[1]$ is sampled within $[0,C_{\mathrm{max}}/(1-\gamma)]$, and subsequent values $\beta_{i}[j+1]$ are sampled within $[\beta_{i}[j],C_{\mathrm{max}}/(1-\gamma)]$. Then, our target is to find the following optimal distribution:

\xi^{*}(\boldsymbol{\beta})\begin{cases}\geq 0&\text{if}\;\boldsymbol{\beta}\in\{\boldsymbol{\beta}\,|\,J_{R}(\pi_{\boldsymbol{\beta}}^{*})=J_{R}(\pi_{\boldsymbol{\beta}^{*}}^{*})\},\\ =0&\text{otherwise,}\end{cases} (13)

where $\boldsymbol{\beta}^{*}$ is an optimal solution of the outer problem (12). Once the optimal distribution is obtained, $\boldsymbol{\beta}^{*}$ can be obtained by sampling from $\xi^{*}$.

In order to obtain the optimal distribution, we propose a novel loss function and update rule. Before that, let us parameterize the sampler as $\xi_{\phi}$, where $\phi\in\Phi$, and define the following function:

J(\pi;\boldsymbol{\beta}):=J_{R}(\pi)-K\sum_{i}(J_{C_{i}}(\pi;\boldsymbol{\beta})-d_{i})_{+}.

It is known that for sufficiently large $K>0$, the optimal policy of the inner problem, $\pi_{\boldsymbol{\beta}}^{*}$, is also the solution of $\max_{\pi}J(\pi;\boldsymbol{\beta})$, which means $J_{R}(\pi_{\boldsymbol{\beta}}^{*})=\max_{\pi}J(\pi;\boldsymbol{\beta})$ [31]. Using this fact, we build a target distribution for the sampler as $\bar{\xi}_{t}(\boldsymbol{\beta})\propto\exp(J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta}))$, where $\pi_{\boldsymbol{\beta},t}$ is the policy for $\boldsymbol{\beta}$ at time-step $t$. Then, we define a loss function using the cross-entropy ($H$) as follows:

\mathcal{L}_{t}(\phi):=H(\xi_{\phi},\bar{\xi}_{t})=-\mathbb{E}_{\boldsymbol{\beta}\sim\xi_{\phi}}[\log\bar{\xi}_{t}(\boldsymbol{\beta})]=-\mathbb{E}_{\boldsymbol{\beta}\sim\xi_{\phi}}[J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})],
\nabla_{\phi}\mathcal{L}_{t}(\phi)=-\mathbb{E}_{\boldsymbol{\beta}\sim\xi_{\phi}}[\nabla_{\phi}\log(\xi_{\phi}(\boldsymbol{\beta}))J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})].

Finally, we present an update rule along with its convergence property in the following theorem.

Theorem 6.3.

Let us assume that the space of $\boldsymbol{\beta}$ is finite and update the sampler according to the following equation:

\phi_{t+1}=\phi_{t}-\alpha F^{\dagger}(\phi_{t})\nabla_{\phi}\mathcal{L}_{t}(\phi)|_{\phi=\phi_{t}}, (14)

where $\alpha$ is a learning rate, and $F^{\dagger}(\phi)$ is the pseudo-inverse of the Fisher information matrix of $\xi_{\phi}$. Then, under a softmax parameterization, the sampler converges to the optimal distribution defined in (13).

The proof is provided in Appendix A. As the loss function is defined with the current policy $\pi_{\boldsymbol{\beta},t}$, the sampler can be updated simultaneously with the policy.
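In the tabular case, the natural-gradient step (14) with a softmax sampler reduces to $\phi_{t+1}(\boldsymbol{\beta})=\phi_{t}(\boldsymbol{\beta})+\alpha J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})$ (see the proof of Theorem 6.3 in Appendix A). The toy sketch below applies this update with a fixed placeholder score $J$ per $\boldsymbol{\beta}$ in place of a trained policy, showing the sampler mass concentrating on the best $\boldsymbol{\beta}$.

```python
# Tabular sketch of the sampler update with a softmax parameterization over a finite beta grid.
import numpy as np

rng = np.random.default_rng(0)
n_beta, alpha_lr, T = 20, 0.5, 200
J = rng.normal(size=n_beta)          # placeholder for J(pi_{beta,t}; beta), held fixed over t here
phi = np.zeros(n_beta)

for t in range(T):
    phi += alpha_lr * J              # tabular natural-gradient step on the sampler logits

xi = np.exp(phi - phi.max())
xi /= xi.sum()
print("argmax J:", int(J.argmax()), " sampler mode:", int(xi.argmax()), " mass:", float(xi.max()))
```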

7 Practical Implementation

In order to calculate the policy gradient defined in (9), the risk value functions must be estimated. Instead of modeling them directly, we use quantile distributional critics [11], $Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta}):\bar{S}\times A\times B^{N}\mapsto\mathbb{R}^{L}$, which approximate the pdf of $Z_{C_{i}}^{\pi_{\boldsymbol{\beta}}}(\bar{s},a)$ using $L$ Dirac delta functions and are parameterized by $\psi\in\Psi$. Then, the risk value function can be approximated as:

Q_{i,g}^{\pi_{\boldsymbol{\beta}}}(\bar{s},a)\approx\sum_{l=1}^{L}g(be_{i}+bZ_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta})[l])/L,

and $V_{i,g}^{\pi_{\boldsymbol{\beta}}}(\bar{s})$ can be obtained from $\mathbb{E}_{a\sim\pi_{\boldsymbol{\beta}}(\cdot|\bar{s})}[Q_{i,g}^{\pi_{\boldsymbol{\beta}}}(\bar{s},a)]$. The distributional critics are trained to minimize the quantile regression loss [11]; details are provided in Appendix C. Additionally, to implement $\xi_{\phi}$, we use a truncated normal distribution $\mathcal{N}(\mu,\sigma^{2},a,b)$ [9], where $[a,b]$ is the sampling range. Then, the sampling process of $\xi_{\phi}$ can be implemented as follows:

\beta_{i}[j]=\sum_{k=1}^{j}\Delta\beta_{i}[k],\;\text{where}\;\Delta\beta_{i}[j]\sim\mathcal{N}(\mu_{i,\phi}[j],\sigma^{2}_{i,\phi}[j],0,C_{\mathrm{max}}/(1-\gamma))\;\forall i\;\text{and}\;\forall j.

Additional details of the practical implementation are provided in Appendix C.
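The sketch below illustrates, under illustrative placeholders, the two practical components described above: approximating $Q_{i,g}^{\pi_{\boldsymbol{\beta}}}$ from $L$ quantile atoms and sampling $\beta_{i}$ through cumulative truncated-normal increments (plain rejection sampling stands in for the truncated normal here).

```python
# Sketch of Section 7: quantile-based risk value and stick-breaking-style beta sampling.
import numpy as np

rng = np.random.default_rng(0)

# (i) Quantile-based approximation: Q_{i,g}(s_bar, a) ~ (1/L) * sum_l g(b*e_i + b*Z[l]).
alpha, beta_kink = 0.75, 2.0
g = lambda z: np.maximum(z - beta_kink, 0.0) / (1 - alpha)   # CVaR-type sub-risk function
L, e_i, b = 32, 1.3, 0.9 ** 5
Z_atoms = np.sort(rng.gamma(2.0, 1.0, size=L))               # placeholder for Z_{C_i,psi}(s_bar,a)[l]
Q_risk = g(b * e_i + b * Z_atoms).mean()
print("Q_{i,g}(s_bar, a) ~", Q_risk)

# (ii) Sampling beta[1] <= ... <= beta[M-1] via cumulative truncated-normal increments.
def trunc_normal(mu, sig, lo, hi):
    while True:                                               # simple rejection sampling
        draw = rng.normal(mu, sig)
        if lo <= draw <= hi:
            return draw

M, c_bound = 4, 10.0                                          # c_bound stands for C_max / (1 - gamma)
mu_phi, sig_phi = np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5, 0.5])
delta = np.array([trunc_normal(mu_phi[j], sig_phi[j], 0.0, c_bound) for j in range(M - 1)])
beta = np.cumsum(delta)                                       # nondecreasing by construction
print("sampled beta:", beta)
```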

8 Experiments and Results

Tasks. The experiments are conducted on the Safety Gymnasium tasks [14] with a single constraint and the legged robot locomotion tasks [15] with multiple constraints. In the Safety Gymnasium, two robots—point and car—are used to perform two tasks: a goal task, which involves controlling the robot to reach a target location, and a button task, which involves controlling the robot to press a designated button. In these tasks, a cost is incurred when the robot collides with an obstacle. In the legged robot locomotion tasks, bipedal and quadrupedal robots are controlled to track a target velocity while satisfying three constraints related to body balance, body height, and foot contact timing. For more details, please refer to Appendix D.

Baselines. Many RCRL algorithms use CVaR to define risk constraints [32, 15, 29]. Accordingly, the proposed method is evaluated and compared with other algorithms under the CVaR constraints with $\alpha=0.75$. Experiments on other risk measures are performed in Section 8.1. The baseline algorithms are categorized into three types based on their approach to estimating risk measures. First, CVaR-CPO [32] and CPPO [29] utilize auxiliary variables to estimate CVaR. Second, WCSAC-Dist [26] and SDPO [30] approximate the risk measure $\mathcal{R}_{\sigma}(G_{C}^{\pi})$ with an expected value $\mathbb{E}_{s\sim\mathcal{D}}[\mathcal{R}_{\sigma}(Y_{C}^{\pi}(s))]$. Finally, SDAC [15] approximates $G_{C}^{\pi}$ as a Gaussian distribution and uses the mean and standard deviation of $G_{C}^{\pi}$ to estimate CVaR. The hyperparameters and network structure of each algorithm are detailed in Appendix D. Note that both CVaR-CPO and the proposed method, SRCPO, use the augmented state space. However, for a fair comparison with other algorithms, the policy and critic networks of these methods are modified to operate on the original state space.

Figure 2: Training curves of the legged robot locomotion tasks. The upper graph shows results for the quadrupedal robot, and the lower one is for the bipedal robot. The solid line in each graph represents the average of each metric, and the shaded area indicates the standard deviation scaled by 0.5. The results are obtained by training each algorithm with five random seeds.
Figure 3: Training curves of the Safety Gymnasium tasks. The results for each task are displayed in columns, titled with the task name. The solid line represents the average of each metric, and the shaded area indicates the standard deviation scaled by 0.2. The results are obtained by training each algorithm with five random seeds.

Results. Figures 2 and 3 show the training curves for the locomotion tasks and the Safety Gymnasium tasks, respectively. The reward sum in the figures refers to the sum of rewards within an episode, while the cost rate is calculated as the sum of costs divided by the episode length. In all tasks, the proposed method achieves the highest rewards among the methods whose cost rates are below the specified thresholds. These results are likely because only the proposed method guarantees optimality. Specifically, in the locomotion tasks, an initial policy often struggles to keep the robot balanced, resulting in high costs from the start. Because of this, the locomotion tasks are more prone to local optima than the other tasks, which enables the proposed method to outperform the baseline methods. Note that WCSAC-Dist shows the highest rewards in the Safety Gymnasium tasks, but its cost rates exceed the specified thresholds. This issue seems to arise from how it estimates the risk constraints: WCSAC-Dist estimates the constraints using the expected risk over states, which is lower than the original risk measure and thus leads to constraint violations.

8.1 Study on Various Risk Measures

Figure 4: (Left) A correlation graph between cost rate and reward sum for policies trained in the point goal task under various risk measure constraints. The results are achieved by training policies with five random seeds for each risk measure and risk level. The center and radius of each ellipse show the average and standard deviation of the results from the five seeds, respectively. (Middle) Distribution graphs of the cost rate under different risk measure constraints. Locations of several percentiles (from the 50th to the 99th) are marked on the plot. The risk level of each risk measure is selected to have a similar cost rate. After training a policy in the point goal task, cost distributions have been collected by rolling out the trained policy across 500 episodes. (Right) Distribution graphs of the cost rate with different risk levels, $\alpha$, under the CVaR constraint.

In this section, we analyze the results when constraints are defined using various risk measures. To this end, we train policies in the point goal task under constraints based on the $\mathrm{CVaR}_{\alpha}$ and $\mathrm{Pow}_{\alpha}$ risk measures defined in (1), as well as the Wang risk measure [24]. Although the Wang risk measure is not a spectral but a distortion risk measure, our parameterization method introduced in Section 6.1 enables it to be approximated as a spectral risk measure, and a visualization of this process is provided in Appendix E. We conduct experiments with three risk levels for each risk measure and set the constraint threshold to 0.025. Evaluation results are presented in Figure 4, and training curves are provided in Appendix E. Figure 4 (Right) shows the intuitive result that increasing the risk level effectively reduces the likelihood of incurring high costs. Similarly, Figure 4 (Left) presents the same trend across all risk measures, indicating that higher risk levels correspond to lower cost rates and decreased reward performance. Finally, Figure 4 (Middle) exhibits the differences in how each risk measure addresses worst-case scenarios. In the spectrum formulation defined in (1), $\mathrm{CVaR}$ applies a uniform penalty to the tail of the cost distribution above a specified percentile, whereas measures like $\mathrm{Wang}$ and $\mathrm{Pow}$ impose a heavier penalty at higher percentiles. As a result, because $\mathrm{CVaR}$ imposes a relatively milder penalty on the worst-case outcomes than the other risk measures, it shows the largest intervals between the 50th and 99th percentiles.

9 Conclusions and Limitations

In this work, we introduced a spectral-risk-constrained RL algorithm that ensures convergence and optimality in a tabular setting. Specifically, in the inner problem, we proposed a generalized policy update rule that can facilitate the development of a new safe RL algorithm with convergence guarantees. For the outer problem, we introduced a notion called sampler, which enhances training efficiency by concurrently training with the inner problem. Through experiments in continuous control tasks, we empirically demonstrated the superior performance of the proposed method and its capability to handle various risk measures. However, convergence to an optimum is shown only in a tabular setting, so future research may focus on extending these results to linear MDPs or function approximation settings. Furthermore, since our approach can be applied to risk-sensitive RL, future work can also implement the proposed method in this area.

References

  • Acerbi [2002] Carlo Acerbi. Spectral measures of risk: A coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.
  • Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of International Conference on Machine Learning, pages 22–31, 2017.
  • Adjei et al. [2024] Patrick Adjei, Norman Tasfi, Santiago Gomez-Rosero, and Miriam AM Capretz. Safe reinforcement learning for arm manipulation with constrained Markov decision process. Robotics, 13(4):63, 2024.
  • Agarwal et al. [2021] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
  • Bai et al. [2022] Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3682–3689, 2022.
  • Bastani et al. [2022] Osbert Bastani, Jason Yecheng Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning. In Advances in Neural Information Processing Systems, pages 36259–36269, 2022.
  • Bäuerle and Glauner [2021] Nicole Bäuerle and Alexander Glauner. Minimizing spectral risk measures applied to Markov decision processes. Mathematical Methods of Operations Research, 94(1):35–69, 2021.
  • Bellemare et al. [2017] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 449–458, 2017.
  • Burkardt [2014] John Burkardt. The truncated normal distribution. Department of Scientific Computing Website, Florida State University, 1(35):58, 2014.
  • Chow et al. [2017] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(1):6070–6120, 2017.
  • Dabney et al. [2018] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Ding et al. [2020] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained Markov decision processes. In Advances in Neural Information Processing Systems, pages 8378–8390, 2020.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of International Conference on Machine Learning, pages 1861–1870, 2018.
  • Ji et al. [2023] Jiaming Ji, Borong Zhang, Xuehai Pan, Jiayi Zhou, Juntao Dai, and Yaodong Yang. Safety-gymnasium. GitHub repository, 2023.
  • Kim et al. [2023] Dohyeong Kim, Kyungjae Lee, and Songhwai Oh. Trust region-based safe distributional reinforcement learning for multiple constraints. In Advances in Neural Information Processing Systems, 2023.
  • Kim et al. [2024] Dohyeong Kim, Mineui Hong, Jeongho Park, and Songhwai Oh. Scale-invariant gradient aggregation for constrained multi-objective reinforcement learning. arXiv preprint arXiv:2403.00282, 2024.
  • Liu et al. [2024] Puze Liu, Haitham Bou-Ammar, Jan Peters, and Davide Tateo. Safe reinforcement learning on the constraint manifold: Theory and applications. arXiv preprint arXiv:2404.09080, 2024.
  • Liu et al. [2020] Yongshuai Liu, Jiaxin Ding, and Xin Liu. IPO: Interior-point policy optimization under constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4940–4947, 2020.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • Rockafellar et al. [2000] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sethuraman [1994] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.
  • Wang [2000] Shaun S Wang. A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, pages 15–36, 2000.
  • Xu et al. [2021] Tengyu Xu, Yingbin Liang, and Guanghui Lan. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In Proceedings of International Conference on Machine Learning, pages 11480–11491, 2021.
  • Yang et al. [2023] Qisong Yang, Thiago D Simão, Simon H Tindemans, and Matthijs TJ Spaan. Safety-constrained reinforcement learning with a distributional safety critic. Machine Learning, 112(3):859–887, 2023.
  • Yang et al. [2020] Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J. Ramadge. Projection-based constrained policy optimization. In Proceedings of International Conference on Learning Representations, 2020.
  • Yang et al. [2022] Tsung-Yen Yang, Tingnan Zhang, Linda Luu, Sehoon Ha, Jie Tan, and Wenhao Yu. Safe reinforcement learning for legged locomotion. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2454–2461, 2022.
  • Ying et al. [2022] Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, and Jun Zhu. Towards safe reinforcement learning via constraining conditional value-at-risk. In Proceedings of International Joint Conference on Artificial Intelligence, 2022.
  • Zhang and Weng [2022] Jianyi Zhang and Paul Weng. Safe distributional reinforcement learning. In Proceedings of Distributed Artificial Intelligence, pages 107–128, 2022.
  • Zhang et al. [2022] Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, and Dacheng Tao. Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv:2205.11814, 2022.
  • Zhang et al. [2024] Qiyuan Zhang, Shu Leng, Xiaoteng Ma, Qihan Liu, Xueqian Wang, Bin Liang, Yu Liu, and Jun Yang. CVaR-constrained policy optimization for safe reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2024.

Appendix A Proofs

See 5.1

Proof.

Since $0\leq Y_{C_{i}}^{\pi}(\bar{s}),Z_{C_{i}}^{\pi}(\bar{s},a)\leq C_{\mathrm{max}}/(1-\gamma)$,

f_{be_{i}+bY_{C_{i}}^{\pi}(\bar{s})}(z)=f_{be_{i}+bZ_{C_{i}}^{\pi}(\bar{s},a)}(z)=0,

when $z<be_{i}$ or $z>be_{i}+bC_{\mathrm{max}}/(1-\gamma)$. Using the fact that $g$ is an increasing and convex function,

g(be_{i})\leq V_{i,g}^{\pi}(\bar{s})=\int_{be_{i}}^{be_{i}+b\frac{C_{\mathrm{max}}}{1-\gamma}}g(z)f_{be_{i}+bY_{C_{i}}^{\pi}(\bar{s})}(z)dz\leq g(be_{i}+bC_{\mathrm{max}}/(1-\gamma)).

Likewise, $Q_{i,g}^{\pi}(\bar{s},a)$ also satisfies the above property. As a result,

V_{i,g}^{\pi}(\bar{s}),\;Q_{i,g}^{\pi}(\bar{s},a)\in[g(be_{i}),g(be_{i}+bC_{\mathrm{max}}/(1-\gamma))].

Then, the maximum of $|Q_{i,g}^{\pi}(\bar{s},a)-V_{i,g}^{\pi}(\bar{s})|$ is $g(be_{i}+bC_{\mathrm{max}}/(1-\gamma))-g(be_{i})$. Since $b=\gamma^{t}$ and $e_{i}=\sum_{k=0}^{t-1}\gamma^{k}C_{i}(s_{k},a_{k},s_{k+1})/\gamma^{t}$, where $t$ is the current time step,

be_{i}+bC_{\mathrm{max}}/(1-\gamma)\leq C_{\mathrm{max}}/(1-\gamma).

Consequently,

|Q_{i,g}^{\pi}(\bar{s},a)-V_{i,g}^{\pi}(\bar{s})|\leq g(be_{i}+bC_{\mathrm{max}}/(1-\gamma))-g(be_{i})\leq b\frac{C_{\mathrm{max}}}{1-\gamma}g^{\prime}\left(be_{i}+b\frac{C_{\mathrm{max}}}{1-\gamma}\right)\leq b\frac{C_{\mathrm{max}}}{1-\gamma}g^{\prime}\left(\frac{C_{\mathrm{max}}}{1-\gamma}\right).

See 5.2

Proof.

Since the following equation is satisfied:

\int_{-\infty}^{\infty}g(z)f_{be_{i}+bZ_{C_{i}}^{\pi}(\bar{s},a)}(z)dz=\mathbb{E}_{\bar{s}^{\prime}\sim P(\cdot|\bar{s},a)}\left[\int_{-\infty}^{\infty}g(z)f_{b(e_{i}+C_{i}(s,a,s^{\prime}))+\gamma bY_{C_{i}}^{\pi}(\bar{s}^{\prime})}(z)dz\right]=\mathbb{E}_{\bar{s}^{\prime}\sim P(\cdot|\bar{s},a)}\left[\int_{-\infty}^{\infty}g(z)f_{b^{\prime}e_{i}^{\prime}+b^{\prime}Y_{C_{i}}^{\pi}(\bar{s}^{\prime})}(z)dz\right],

we can say $Q_{i,g}^{\pi}(\bar{s},a)=\mathbb{E}_{\bar{s}^{\prime}\sim P(\cdot|\bar{s},a)}[V_{i,g}^{\pi}(\bar{s}^{\prime})]$. Using this property,

\mathbb{E}_{d_{\rho}^{\pi^{\prime}},\pi^{\prime}}\left[A_{i,g_{i}}^{\pi}(\bar{s},a)\right]/(1-\gamma)=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{i,g_{i}}^{\pi}(\bar{s}_{t},a_{t})\right] (15)
=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\left(Q_{i,g_{i}}^{\pi}(\bar{s}_{t},a_{t})-V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right)\right]
=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\left(V_{i,g_{i}}^{\pi}(\bar{s}_{t+1})-V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right)\right]
=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]-\mathbb{E}_{\bar{s}_{0}\sim\rho}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{0})\right]
=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})+\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du.

Using the definition of $V_{i,g_{i}}^{\pi}(\bar{s}_{t})$ and the dominated convergence theorem,

\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{b_{t}e_{i,t}+b_{t}Y_{C_{i}}^{\pi}(\bar{s}_{t})}(z)dz\right] (16)
=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{\sum_{k=0}^{t-1}\gamma^{k}C_{i}(s_{k},a_{k},s_{k+1})+\gamma^{t}Y_{C_{i}}^{\pi}(\bar{s}_{t})}(z)dz\right]
=\mathbb{E}_{\pi^{\prime}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{G_{C_{i}}^{\pi^{\prime}}}(z)dz\right]=\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi^{\prime}})-\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du.

By combining (15) and (16),

\mathbb{E}_{d_{\rho}^{\pi^{\prime}},\pi^{\prime}}\left[A_{i,g_{i}}^{\pi}(\bar{s},a)\right]/(1-\gamma)=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})+\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du=\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi^{\prime}})-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi}).

See 6.1

Proof.

The difference is:

|\mathcal{R}_{\sigma}(X)-\mathcal{R}_{\tilde{\sigma}}(X)|\leq\int_{0}^{1}|F_{X}^{-1}(u)||\sigma(u)-\tilde{\sigma}(u)|du\leq||\sigma-\tilde{\sigma}||_{1}||F_{X}^{-1}||_{\infty}=||\sigma-\tilde{\sigma}||_{1}\frac{C_{\mathrm{max}}}{1-\gamma}.

The optimal value of (11) is $||\sigma-\tilde{\sigma}||_{1}$ and is no larger than the optimal value of the following problem:

\min_{\eta_{i},\alpha_{i}}\int_{0}^{1}|\sigma(u)-\tilde{\sigma}(u)|du\quad\mathbf{s.t.}\;\int_{0}^{1}\tilde{\sigma}(u)du=1,\;\frac{\sigma(1)(i-1)}{M}\leq\eta_{i}\leq\frac{\sigma(1)i}{M}\;\forall i, (17)

since its search space is smaller than the original one. As the value of (17) is bounded by $\max_{u}|\sigma(u)-\tilde{\sigma}(u)|\leq\sigma(1)/M$, we have $||\sigma-\tilde{\sigma}||_{1}\leq\sigma(1)/M$. Consequently,

|\mathcal{R}_{\sigma}(X)-\mathcal{R}_{\tilde{\sigma}}(X)|\leq||\sigma-\tilde{\sigma}||_{1}\frac{C_{\mathrm{max}}}{1-\gamma}\leq\frac{C_{\mathrm{max}}\sigma(1)}{(1-\gamma)M}.

See 6.2

Proof.

Since $g_{\beta}(x)$ is an increasing convex function, $\mathcal{R}_{\tilde{\sigma}}(X)\leq\mathcal{R}_{\tilde{\sigma}}^{g_{\beta}}(X)=:\mathcal{R}_{\tilde{\sigma}}^{\beta}(X)$. In addition, according to Remark 2.7 in [7], there exists $\tilde{g}$ such that $\tilde{g}=\arg\inf_{g}\mathcal{R}^{g}_{\tilde{\sigma}}(X)$, and its derivative is expressed as $\tilde{g}^{\prime}(x)=\tilde{\sigma}(F_{X}(x))$. Using this fact, we can formulate $\tilde{g}$ by integrating its derivative as follows:

\tilde{g}(x)=\eta_{1}x+\sum_{i=1}^{M-1}(\eta_{i+1}-\eta_{i})(x-F_{X}^{-1}(\alpha_{i}))_{+}+C,

where $C$ is a constant. Since $\mathcal{R}^{\tilde{g}}_{\tilde{\sigma}}(X)$ has the same value for any constant $C$ due to the integral part of $\tilde{g}^{*}$ in (3), we set $C$ to zero. Also, we can express $\tilde{g}(x)$ as $g_{\tilde{\beta}}(x)$, where ${\tilde{\beta}}[i]=F_{X}^{-1}(\alpha_{i})$, due to the definition of $g_{\beta}$. As a result, the following is satisfied:

\mathcal{R}_{\tilde{\sigma}}(X)=\mathcal{R}^{\tilde{g}}_{\tilde{\sigma}}(X)=\mathcal{R}^{\tilde{\beta}}_{\tilde{\sigma}}(X)\leq\mathcal{R}_{\tilde{\sigma}}^{\beta}(X)\;\;\Rightarrow\;\;\mathcal{R}_{\tilde{\sigma}}(X)=\inf_{\beta}\mathcal{R}_{\tilde{\sigma}}^{\beta}(X).

See 6.3

Proof.

Since we assume that the space of $\boldsymbol{\beta}$ is finite, the sampler can be expressed using the softmax parameterization as follows:

\xi_{\phi}(\boldsymbol{\beta}):=\frac{\exp(\phi(\boldsymbol{\beta}))}{\sum_{\boldsymbol{\beta}^{\prime}}\exp(\phi(\boldsymbol{\beta}^{\prime}))}.

If we update the parameter of the sampler as described in (14), the following is satisfied as in the natural policy gradient:

\phi_{t+1}(\boldsymbol{\beta})=\phi_{t}(\boldsymbol{\beta})+\alpha J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})=\sum_{k=1}^{t}\alpha J(\pi_{\boldsymbol{\beta},k};\boldsymbol{\beta}).

Furthermore, due to Theorem 5.3, if each policy $\pi_{\boldsymbol{\beta},t}$ is updated by the proposed method, it converges to an optimal policy $\pi^{*}_{\boldsymbol{\beta}}$. Then, if $\boldsymbol{\beta}$ is not optimal,

\lim_{T\to\infty}\sum_{t=1}^{T}\alpha\left(J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*},t};\boldsymbol{\beta}^{*})\right)=-\infty,

since $\lim_{t\to\infty}J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*},t};\boldsymbol{\beta}^{*})=J(\pi_{\boldsymbol{\beta}}^{*};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*}}^{*};\boldsymbol{\beta}^{*})<0$. As a result,

\lim_{t\to\infty}\xi_{\phi_{t}}(\boldsymbol{\beta})=\lim_{t\to\infty}\exp(\phi_{t}(\boldsymbol{\beta})-\phi_{t}(\boldsymbol{\beta}^{*}))\Big/\sum_{\boldsymbol{\beta}^{\prime}}\exp(\phi_{t}(\boldsymbol{\beta}^{\prime})-\phi_{t}(\boldsymbol{\beta}^{*}))
=\lim_{t\to\infty}\exp\left(\sum_{k=1}^{t-1}\alpha\left(J(\pi_{\boldsymbol{\beta},k};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*},k};\boldsymbol{\beta}^{*})\right)\right)\Big/\sum_{\boldsymbol{\beta}^{\prime}}\exp(\phi_{t}(\boldsymbol{\beta}^{\prime})-\phi_{t}(\boldsymbol{\beta}^{*}))=0.

Thus, $\xi_{\phi_{t}}$ converges to an optimal sampler. ∎

A.1 Proof of Theorem 5.3

Our derivation is based on the policy gradient works by Agarwal et al. [4] and Xu et al. [25]. Due to the softmax policy parameterization, the following is satisfied:

\pi_{t+1}(a|\bar{s})=\pi_{t}(a|\bar{s})\frac{\exp(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))}{Z_{t}(\bar{s})}, \quad (18)

where Z_{t}(\bar{s}):=\sum_{a\in A}\pi_{t}(a|\bar{s})\exp(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)). By the natural policy gradient,

\theta_{t+1}=\begin{cases}\theta_{t}+\alpha_{t}(A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t})/(1-\gamma)&\text{if}\;J_{C_{i}}(\theta_{t})\leq d_{i}\;\forall i,\\\theta_{t}+\alpha_{t}(\alpha_{t}\nu_{t}A_{R}^{t}-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t})/(1-\gamma)&\text{otherwise}.\end{cases} \quad (19)
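For readers who prefer code, the following is a minimal tabular sketch of the update rule (19), assuming the advantage tables A_{R}^{t} and A_{i,g_{i}}^{t} and the weights \lambda_{t,i}, \nu_{t} have already been computed; the function and variable names are illustrative and this is not the exact implementation used in the experiments.

import numpy as np

def npg_step(theta, A_R, A_C, lam, nu, J_C, d, alpha, gamma):
    """One softmax natural-policy-gradient step following Eq. (19).

    theta: (S, A) logits; A_R: (S, A) reward advantage;
    A_C: (N, S, A) risk-value advantages for the N cost constraints;
    lam: (N,) multipliers; nu: scalar; J_C, d: (N,) cost returns and thresholds.
    """
    penalty = np.tensordot(lam, A_C, axes=1)          # sum_i lam_i * A_{i,g_i}
    if np.all(J_C <= d):                              # constraints satisfied
        direction = A_R - alpha * penalty
    else:                                             # constraints violated
        direction = alpha * nu * A_R - penalty
    return theta + alpha * direction / (1.0 - gamma)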

Before proving Theorem 5.3, we first present several useful lemmas.

Lemma A.1.

\log Z_{t}(\bar{s}) is non-negative.

Proof.

Due to the advantage functions in (19), \sum_{a\in A}\pi_{t}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))=0. Using the fact that the logarithmic function is concave,

\log Z_{t}(\bar{s})=\log\sum_{a\in A}\pi_{t}(a|\bar{s})\exp(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))\geq\sum_{a\in A}\pi_{t}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))=0.

As a result, \log Z_{t}(\bar{s})\geq 0. ∎

Lemma A.2.

D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))\leq\max_{a}|\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)|\leq||\theta_{t+1}-\theta_{t}||_{\infty}.

Proof.

Using Pinsker's inequality,

D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}(\pi_{t+1}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))}
=\sqrt{\frac{1}{2}\sum_{a}\pi_{t+1}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})/\pi_{t}(a|\bar{s}))}
=\sqrt{\frac{1}{2}\Big(\sum_{a}\pi_{t+1}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))-\log Z_{t}(\bar{s})\Big)}
\overset{\text{(i)}}{\leq}\sqrt{\frac{1}{2}\sum_{a}\pi_{t+1}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))}
\overset{\text{(ii)}}{=}\sqrt{\frac{1}{2}\sum_{a}(\pi_{t+1}(a|\bar{s})-\pi_{t}(a|\bar{s}))(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))}
\leq\sqrt{\max_{a}|\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)|\,D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))},

where (i) follows from Lemma A.1, and (ii) follows from the fact that \sum_{a\in A}\pi_{t}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))=0. As a result,

D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))\leq\max_{a}|\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)|\leq||\theta_{t+1}-\theta_{t}||_{\infty}. ∎

See 5.3

Proof.

We divide the analysis into two cases: 1) when the constraints are satisfied and 2) when the constraints are violated. First, let us consider the case where the constraints are satisfied. In this case, the policy is updated as follows:

\theta_{t+1}=\theta_{t}+\frac{\alpha_{t}}{1-\gamma}\left(A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\right). \quad (20)

Using Theorem 5.2,

J_{R}(\theta_{t+1})-J_{R}(\theta_{t})+\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (21)
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}\pi_{t+1}(a|\bar{s})(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}(\pi_{t+1}(a|\bar{s})-\pi_{t}(a|\bar{s}))(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
\overset{\text{(i)}}{\leq}\frac{2}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))
\overset{\text{(ii)}}{\leq}\frac{2||\theta_{t+1}-\theta_{t}||_{\infty}}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}
\leq\frac{2||\theta_{t+1}-\theta_{t}||_{\infty}}{1-\gamma}\big{|}\big{|}A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\big{|}\big{|}_{\infty}=\frac{2\alpha_{t}}{(1-\gamma)^{2}}\big{|}\big{|}A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\big{|}\big{|}_{\infty}^{2}
\leq 2\alpha_{t}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}/(1-\gamma)^{4},

where (i) follows from Hölder's inequality, and (ii) follows from Lemma A.2. Using (18) and (20),

J_{R}(\theta_{t+1})-J_{R}(\theta_{t})+\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (22)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})(D_{\mathrm{KL}}(\pi_{t+1}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\geq\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\log Z_{t}(\bar{s})\overset{\text{(i)}}{\geq}\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s}),

where (i) is from the fact that d_{\rho}^{\pi}(\bar{s})=(1-\gamma)\sum_{t}\gamma^{t}\mathbb{P}(\bar{s}_{t}=\bar{s})\geq(1-\gamma)\rho(\bar{s}). By combining (21) and (22), the following is satisfied for any \rho:

\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s})\leq\frac{2\alpha_{t}}{(1-\gamma)^{4}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}. \quad (23)

Then, the following is satisfied for any policy \hat{\pi}:

J_{R}(\hat{\pi})-J_{R}(\pi_{t})+\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{t})-J_{C_{i}}(\hat{\pi})) \quad (24)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\overset{\text{(i)}}{\leq}\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s})))
+\frac{2\alpha_{t}}{(1-\gamma)^{5}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2},

where (i) follows from (23) by substituting d_{\rho}^{\hat{\pi}} for \rho. Now, let us consider the second case, where the constraints are violated. In this case, the policy is updated as follows:

\theta_{t+1}=\theta_{t}+\frac{\alpha_{t}}{1-\gamma}(\alpha_{t}\nu_{t}A_{R}^{t}-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}). \quad (25)

As with (21) in the first case,

\alpha_{t}\nu_{t}(J_{R}(\theta_{t+1})-J_{R}(\theta_{t}))+\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (26)
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}\pi_{t+1}(a|\bar{s})(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}(\pi_{t+1}(a|\bar{s})-\pi_{t}(a|\bar{s}))(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
\leq\frac{2}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))
\leq\frac{2||\theta_{t+1}-\theta_{t}||_{\infty}}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}
\leq\frac{2\alpha_{t}}{(1-\gamma)^{2}}\big{|}\big{|}\alpha_{t}\nu_{t}A_{R}^{t}-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\big{|}\big{|}_{\infty}^{2}\leq 2\alpha_{t}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}/(1-\gamma)^{4}.

Using (18) and (25),

\alpha_{t}\nu_{t}(J_{R}(\theta_{t+1})-J_{R}(\theta_{t}))+\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (27)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})(D_{\mathrm{KL}}(\pi_{t+1}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\geq\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\log Z_{t}(\bar{s})\geq\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s}).

By combining (26) and (27), the following is satisfied for any \rho:

\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s})\leq\frac{2\alpha_{t}}{(1-\gamma)^{4}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}. \quad (28)

As in (24), the following is satisfied for any policy \hat{\pi}:

\alpha_{t}\nu_{t}(J_{R}(\hat{\pi})-J_{R}(\pi_{t}))+\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{t})-J_{C_{i}}(\hat{\pi})) \quad (29)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\leq\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s})))
+\frac{2\alpha_{t}}{(1-\gamma)^{5}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}.

Now, the main inequalities for both cases have been derived. Let us denote by \mathcal{N} the set of time steps at which the constraints are not violated; that is, if t\in\mathcal{N}, then J_{C_{i}}(\pi_{t})\leq d_{i}\;\forall i. Then, using (24) and (29), we have:

\sum_{t\in\mathcal{N}}\left(\alpha_{t}(J_{R}(\hat{\pi})-J_{R}(\pi_{t}))-\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\hat{\pi})-J_{C_{i}}(\pi_{t}))\right) \quad (30)
+\sum_{t\notin\mathcal{N}}\left(\alpha_{t}^{2}\nu_{t}(J_{R}(\hat{\pi})-J_{R}(\pi_{t}))-\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\hat{\pi})-J_{C_{i}}(\pi_{t}))\right)
\leq\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{0}(\cdot|\bar{s}))+\sum_{t\in\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}
+\sum_{t\notin\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}.

If t\notin\mathcal{N} and \lambda_{t,i}>0, then J_{C_{i}}(\pi_{t})>d_{i}, and \sum_{i}\lambda_{t,i}=1. Thus, \sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{f})-J_{C_{i}}(\pi_{t}))\leq-\eta. By substituting \pi_{f} for \hat{\pi} in (30),

\sum_{t\in\mathcal{N}}\left(\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t}))-\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{f})-J_{C_{i}}(\pi_{t}))\right)
+\sum_{t\notin\mathcal{N}}(\alpha_{t}\eta+\alpha_{t}^{2}\nu_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})))
\leq\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{f}}(\bar{s})D_{\mathrm{KL}}(\pi_{f}(\cdot|\bar{s})||\pi_{0}(\cdot|\bar{s}))
+\sum_{t}\alpha_{t}^{2}\underbrace{\frac{2\max(\alpha_{0},1)^{2}}{(1-\gamma)^{5}}(\max(\lambda_{\mathrm{max}},1)R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}}_{=:K_{1}}.

For brevity, we denote D_{\mathrm{KL}}(\pi_{f}||\pi_{0})=\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{f}}(\bar{s})D_{\mathrm{KL}}(\pi_{f}(\cdot|\bar{s})||\pi_{0}(\cdot|\bar{s})). Since \sum_{t\notin\mathcal{N}}\alpha_{t}=\sum_{t}\alpha_{t}-\sum_{t\in\mathcal{N}}\alpha_{t}, (30) can be modified as follows:

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})-\eta)+\eta\sum_{t}\alpha_{t}
\leq D_{\mathrm{KL}}(\pi_{f}||\pi_{0})+K_{1}\sum_{t}\alpha_{t}^{2}+\sum_{t\in\mathcal{N}}\left(\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{f})-J_{C_{i}}(\pi_{t}))\right)
-\sum_{t\notin\mathcal{N}}\left(\alpha_{t}^{2}\nu_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t}))\right)
\leq D_{\mathrm{KL}}(\pi_{f}||\pi_{0})+\sum_{t}\alpha_{t}^{2}\underbrace{\left(K_{1}+2N\lambda_{\mathrm{max}}\frac{C_{\mathrm{max}}}{1-\gamma}+2\lambda_{\mathrm{max}}\frac{R_{\mathrm{max}}}{1-\gamma}\right)}_{=:K_{2}}.

It can be simplified as follows:

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})-\eta)+\eta\sum_{t}\alpha_{t}\leq D_{\mathrm{KL}}(\pi_{f}||\pi_{0})+K_{2}\sum_{t}\alpha_{t}^{2}.

If \pi_{0} assigns positive probability to all actions, D_{\mathrm{KL}}(\hat{\pi}||\pi_{0}) is bounded for all \hat{\pi}. Additionally, due to the Robbins-Monro condition, the RHS converges to some real number as T goes to infinity. Since \eta\sum_{t}\alpha_{t} on the LHS goes to infinity, \sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})-\eta) must go to minus infinity. This means that \sum_{t\in\mathcal{N}}\alpha_{t}=\infty since |J_{R}(\pi)|\leq R_{\mathrm{max}}/(1-\gamma). By substituting \pi^{*} for \hat{\pi} in (30),

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi^{*})-J_{R}(\pi_{t}))-\sum_{t\notin\mathcal{N}}\alpha_{t}\left(\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi^{*})-J_{C_{i}}(\pi_{t}))\right)
\leq D_{\mathrm{KL}}(\pi^{*}||\pi_{0})+\sum_{t\in\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}
+\sum_{t\notin\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}
+\sum_{t\in\mathcal{N}}\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi^{*})-J_{C_{i}}(\pi_{t}))-\sum_{t\notin\mathcal{N}}\alpha_{t}^{2}\nu_{t}(J_{R}(\pi^{*})-J_{R}(\pi_{t}))
\leq D_{\mathrm{KL}}(\pi^{*}||\pi_{0})+K_{2}\sum_{t}\alpha_{t}^{2}.

Since \sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi^{*})-J_{C_{i}}(\pi_{t}))\leq 0 for t\notin\mathcal{N}, the above inequality can be rewritten as follows:

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi^{*})-J_{R}(\pi_{t}))\leq D_{\mathrm{KL}}(\pi^{*}||\pi_{0})+K_{2}\sum_{t}\alpha_{t}^{2}.

Since \sum_{t\in\mathcal{N}}\alpha_{t}=\infty and the Robbins-Monro condition is satisfied, the following must hold:

J_{R}(\pi^{*})-\lim_{t\to\infty}J_{R}(\pi_{\mathcal{N}[t]})=0,

where \mathcal{N}[t] is the t-th element of \mathcal{N}. Consequently, the policy converges to an optimal policy. ∎

Appendix B Weight Selection Strategy

In this section, we first identify \nu_{t} and \lambda_{t,i} for existing primal approach-based safe RL algorithms and then present our strategy. For the existing methods, we analyze several safe RL algorithms, including constrained policy optimization (CPO, [2]), projection-based constrained policy optimization (PCPO, [27]), constraint-rectified policy optimization (CRPO, [25]), penalized proximal policy optimization (P3O, [31]), interior-point policy optimization (IPO, [18]), and safe distributional actor-critic (SDAC, [15]). Since the policy gradient is obtained differently depending on whether the constraints are satisfied (CS) or violated (CV), we analyze the existing methods for each of the two cases.

Case 1: Constraints are satisfied (CS). First, CPO [2] and SDAC [15] find a direction of the policy gradient by solving the following problem with a linear objective and linear and quadratic constraints (LQCLP):

\max_{g}\nabla J_{R}(\theta_{t})^{T}g\;\mathbf{s.t.}\;\nabla J_{C_{i}}(\theta_{t})^{T}g+J_{C_{i}}(\theta_{t})\leq d_{i}\;\forall i,\;\frac{1}{2}g^{T}F(\theta_{t})g\leq\epsilon,

where \epsilon is a trust region size. The LQCLP can be addressed by solving the following dual problem:

\lambda^{*},\nu^{*}=\underset{\lambda\geq 0,\nu\geq 0}{\arg\max}-\nabla J_{R}(\theta_{t})^{T}g^{*}(\lambda,\nu)+\sum_{i=1}^{N}\lambda_{i}(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda,\nu)+J_{C_{i}}(\theta_{t})-d_{i})
+\nu(\frac{1}{2}g^{*}(\lambda,\nu)^{T}F(\theta_{t})g^{*}(\lambda,\nu)-\epsilon),

where g^{*}(\lambda,\nu)=\frac{1}{\nu}F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\sum_{i}\lambda_{i}\nabla J_{C_{i}}(\theta_{t})\right). Then, the policy is updated in the following direction:

g_{t}=F^{\dagger}(\theta_{t})(\nabla J_{R}(\theta_{t})-\alpha_{t}\sum_{i}\min(\lambda_{i}^{*}/\alpha_{t},\lambda_{\mathrm{max}})\nabla J_{C_{i}}(\theta_{t})), \quad (31)

which results in \lambda_{t,i}=\min(\lambda_{i}^{*}/\alpha_{t},\lambda_{\mathrm{max}}). P3O [31], PCPO [27], and CRPO [25] update the policy using only the objective function in this case as follows:

g_{t}=F^{\dagger}(\theta_{t})\nabla J_{R}(\theta_{t})\Rightarrow\lambda_{t,i}=0.

IPO [18] updates the policy in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\kappa\sum_{i}\nabla J_{C_{i}}(\theta_{t})/(d_{i}-J_{C_{i}}(\theta_{t}))\right),

where \kappa is a penalty coefficient. As a result, \lambda_{t,i} is computed as \min(\kappa/((d_{i}-J_{C_{i}}(\theta_{t}))\alpha_{t}),\lambda_{\mathrm{max}}).

Case 2: Constraints are violated (CV). CPO, PCPO, and IPO do not handle constraint violation in multiple-constraint settings, so we exclude them in this case. First, SDAC finds a recovery direction by solving the following quadratic program (QP):

\min_{g}\frac{1}{2}g^{T}F(\theta_{t})g\;\mathbf{s.t.}\;\nabla J_{C_{i}}(\theta_{t})^{T}g+J_{C_{i}}(\theta_{t})\leq d_{i}\;\text{for}\;i\in\{i|J_{C_{i}}(\theta_{t})>d_{i}\},

where we denote \{i|J_{C_{i}}(\theta_{t})>d_{i}\} by I_{\mathrm{CV}}. The QP can be addressed by solving the following dual problem:

\lambda^{*}=\underset{\lambda\geq 0}{\arg\max}\frac{1}{2}g^{*}(\lambda)^{T}F(\theta_{t})g^{*}(\lambda)+\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda)+J_{C_{i}}(\theta_{t})-d_{i}),

where g^{*}(\lambda)=-F^{\dagger}(\theta_{t})\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}\nabla J_{C_{i}}(\theta_{t}). Then, the policy is updated in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\sum_{i\in I_{\mathrm{CV}}}-\frac{\lambda_{i}^{*}}{\sum_{j\in I_{\mathrm{CV}}}\lambda_{j}^{*}}\nabla J_{C_{i}}(\theta_{t})\right)\Rightarrow\nu_{t}=0,\;\lambda_{t,i}=\frac{\lambda_{i}^{*}}{\sum_{j\in I_{\mathrm{CV}}}\lambda_{j}^{*}}\;\text{if}\;i\in I_{\mathrm{CV}}\;\text{else}\;0.

If there is only a single constraint, PCPO is compatible with SDAC. Next, CRPO first selects a violated constraint, whose index is denoted by k, and calculates the policy gradient to minimize the selected constraint as follows:

g_{t}=F^{\dagger}(\theta_{t})\left(-\nabla J_{C_{k}}(\theta_{t})\right)\Rightarrow\nu_{t}=0,\;\lambda_{t,i}=1\;\text{if}\;i=k\;\text{else}\;0.

If there is only a single constraint, CPO is compatible with CRPO. Finally, P3O updates the policy in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\kappa\sum_{i\in I_{\mathrm{CV}}}\nabla J_{C_{i}}(\theta_{t})\right),

which results in \nu_{t}=\min(1/(\alpha_{t}\kappa|I_{\mathrm{CV}}|),\lambda_{\mathrm{max}}) and \lambda_{t,i}=1/|I_{\mathrm{CV}}| if i\in I_{\mathrm{CV}} else 0.

Implementation of the trust region method. To this point, we have obtained the direction of the policy gradient g_{t}, allowing the policy to be updated to \theta_{t}+\alpha_{t}g_{t}. If we want to update the policy using a trust region method, as introduced in TRPO [21], we can adjust the learning rate \alpha_{t} as follows:

\alpha_{t}=\frac{\epsilon_{t}}{\mathrm{Clip}(\sqrt{g_{t}^{T}F(\theta_{t})g_{t}},g_{\mathrm{min}},g_{\mathrm{max}})}, \quad (32)

where \epsilon_{t} is a trust region size following the Robbins-Monro condition, g_{\mathrm{min}} and g_{\mathrm{max}} are hyperparameters, and \mathrm{Clip}(x,a,b)=\min(\max(x,a),b). By adjusting the learning rate, the policy can be updated mostly within the trust region, defined as \{\theta_{t}+g|g^{T}F(\theta_{t})g\leq\epsilon_{t}^{2}\}, while the learning rate still satisfies the Robbins-Monro condition. Accordingly, we also need to modify \lambda_{t,i} for the CS case and \nu_{t} for the CV case, as they are expressed with the learning rate \alpha_{t}. We explain this modification process using SDAC as an example. The direction of the policy gradient of SDAC for the CS case is expressed as follows:

g_{t}=F^{\dagger}(\theta_{t})(\nabla J_{R}(\theta_{t})-\epsilon_{t}\sum_{i}\min(\lambda_{i}^{*}/\epsilon_{t},\lambda_{\mathrm{max}})\nabla J_{C_{i}}(\theta_{t})),

where \alpha_{t} in (31) is replaced with \epsilon_{t}. The learning rate \alpha_{t} is then calculated using g_{t} and (32). Now, we need to express g_{t} as F^{\dagger}(\theta_{t})(\nabla J_{R}(\theta_{t})-\alpha_{t}\sum_{i}\lambda_{t,i}\nabla J_{C_{i}}(\theta_{t})). To this end, we can use \alpha_{t}=\epsilon_{t}/C, where C=\mathrm{Clip}(\sqrt{g_{t}^{T}F(\theta_{t})g_{t}},g_{\mathrm{min}},g_{\mathrm{max}}). Consequently, \lambda_{t,i} is expressed as follows:

\lambda_{t,i}=C\min(\lambda_{i}^{*}/\epsilon_{t},\lambda_{\mathrm{max}}),

and the maximum of \lambda_{t,i} and \nu_{t} is adjusted to g_{\mathrm{max}}\lambda_{\mathrm{max}}.
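As an illustration, the following is a minimal Python sketch of the learning-rate adjustment in (32), assuming the gradient direction g_t and a routine for Fisher-vector products are available; the helper names are placeholders rather than the actual implementation.

import numpy as np

def trust_region_lr(g_t, fisher_vec_prod, eps_t, g_min=1e-3, g_max=10.0):
    """Eq. (32): alpha_t = eps_t / Clip(sqrt(g_t^T F g_t), g_min, g_max).

    g_t: policy-gradient direction (1-D array);
    fisher_vec_prod: callable returning F(theta_t) @ v for a vector v.
    """
    g_norm = np.sqrt(np.dot(g_t, fisher_vec_prod(g_t)))  # sqrt(g^T F g)
    return eps_t / np.clip(g_norm, g_min, g_max)

# Hypothetical usage with an identity Fisher matrix:
# alpha_t = trust_region_lr(g_t, lambda v: v, eps_t=1e-3)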

Proposed strategy. Now, we introduce the proposed strategy for determining \lambda_{t,i} and \nu_{t}. We first consider the CV case. Kim et al. [15] empirically showed that processing multiple constraints simultaneously at each update step effectively projects the current policy onto a feasible set of policies. Therefore, we adopt the same strategy as SDAC, which handles the constraints simultaneously, and it is written as follows:

\nu_{t}=0,\;\lambda_{t,i}=\begin{cases}\lambda_{i}^{*}/\left(\sum_{j\in I_{\mathrm{CV}}}\lambda_{j}^{*}\right)&\text{if}\;i\in I_{\mathrm{CV}},\\0&\text{otherwise,}\end{cases}

where \lambda^{*}={\arg\max}_{\lambda\geq 0}\frac{1}{2}g^{*}(\lambda)^{T}F(\theta_{t})g^{*}(\lambda)+\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}\left(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda)+J_{C_{i}}(\theta_{t})-d_{i}\right) and g^{*}(\lambda)=-F^{\dagger}(\theta_{t})\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}\nabla J_{C_{i}}(\theta_{t}). Next, let us consider the CS case. We hypothesize that concurrently addressing both the constraints and the objective when calculating the policy gradient can lead to more stable constraint satisfaction. Therefore, CPO, SDAC, and IPO are suitable candidates. However, CPO and SDAC require solving the LQCLP, whose dual problem is nonlinear, and IPO can introduce computational errors when J_{C_{i}}(\theta_{t}) is close to d_{i}. To address these challenges, we propose to calculate the policy gradient direction by solving a QP instead of an LQCLP. By applying the concept of transforming objectives into constraints [16] to the safe RL problem, we can formulate the following QP:

\min_{g}\frac{1}{2}g^{T}F(\theta_{t})g\;\mathbf{s.t.}\;\nabla J_{R}(\theta_{t})^{T}g\geq e,\;\nabla J_{C_{i}}(\theta_{t})^{T}g+J_{C_{i}}(\theta_{t})\leq d_{i}\;\forall i,

where e:=\epsilon_{t}\sqrt{\nabla J_{R}(\theta_{t})^{T}F^{\dagger}(\theta_{t})\nabla J_{R}(\theta_{t})}. According to [16], e represents the maximum value of \nabla J_{R}(\theta_{t})^{T}g when g lies within the trust region, \{g|g^{T}F(\theta_{t})g\leq\epsilon_{t}^{2}\}. Therefore, the transformed constraint, \nabla J_{R}(\theta_{t})^{T}g\geq e, has the same effect of improving the objective function J_{R}(\theta) within the trust region. The QP is then addressed by solving the following dual problem:

\lambda^{*},\nu^{*}=\underset{\lambda\geq 0,\nu\geq 0}{\arg\max}\frac{1}{2}g^{*}(\lambda,\nu)^{T}F(\theta_{t})g^{*}(\lambda,\nu)+\nu(e-\nabla J_{R}(\theta_{t})^{T}g^{*}(\lambda,\nu))
+\sum_{i=1}^{N}\lambda_{i}(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda,\nu)+J_{C_{i}}(\theta_{t})-d_{i}),

where g^{*}(\lambda,\nu)=F^{\dagger}(\theta_{t})(\nu\nabla J_{R}(\theta_{t})-\sum_{i=1}^{N}\lambda_{i}\nabla J_{C_{i}}(\theta_{t})). Then, the policy is updated in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\epsilon_{t}\sum_{i}\min\left(\frac{\lambda_{i}^{*}}{\nu^{*}\epsilon_{t}},\lambda_{\mathrm{max}}\right)\nabla J_{C_{i}}(\theta_{t})\right),

and the learning rate is calculated using (32) to ensure that the policy is updated within the trust region. Finally, \lambda_{t,i} is expressed as C\min(\lambda_{i}^{*}/(\nu^{*}\epsilon_{t}),\lambda_{\mathrm{max}}), where C=\mathrm{Clip}(\sqrt{g_{t}^{T}F(\theta_{t})g_{t}},g_{\mathrm{min}},g_{\mathrm{max}}).
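To make the CS-case computation concrete, the following is a rough Python sketch that obtains the direction above by running projected gradient ascent on the dual of the QP; it is an illustration under simplifying assumptions (e.g., an explicit approximation of F^{\dagger} and a fixed ascent scheme), not the authors' exact solver, and all names are placeholders.

import numpy as np

def cs_direction(grad_R, grad_C, J_C, d, F_inv, eps_t, lam_max, iters=500, lr=1e-2):
    """Sketch of the CS-case QP via projected gradient ascent on its dual.

    grad_R: (P,) reward gradient; grad_C: (N, P) cost gradients;
    J_C, d: (N,) cost returns and thresholds; F_inv: (P, P) stand-in for F^dagger.
    Returns the update direction g_t.
    """
    N = grad_C.shape[0]
    e = eps_t * np.sqrt(grad_R @ F_inv @ grad_R)       # objective-improvement target
    lam, nu = np.zeros(N), 1.0

    def g_star(lam, nu):
        return F_inv @ (nu * grad_R - grad_C.T @ lam)  # minimizer of the Lagrangian

    for _ in range(iters):                             # dual ascent with projection
        g = g_star(lam, nu)
        lam = np.clip(lam + lr * (grad_C @ g + J_C - d), 0.0, None)
        nu = max(nu + lr * (e - grad_R @ g), 1e-8)

    weights = np.minimum(lam / (nu * eps_t), lam_max)  # lambda_{t,i} before scaling by C
    return F_inv @ (grad_R - eps_t * (grad_C.T @ weights))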

Appendix C Implementation Details

Algorithm 2 Spectral-Risk-Constrained Policy Optimization (SRCPO)
  Input: Policy parameter \theta, sampler parameter \phi, critic parameter \psi, and replay buffer D.
  for epochs = 1 to P do
     for episodes = 1 to E do
        if u\leq\epsilon, where u\sim U[0,1], then
           Sample \boldsymbol{\beta} from a uniform distribution: \beta_{i}[j]\sim U(\beta_{i}[j-1],C_{\mathrm{max}}/(1-\gamma)) \forall i and \forall j, where \beta_{i}[0]=0.
        else
           Sample \boldsymbol{\beta}\sim\xi_{\phi}.
        end if
        Collect a trajectory \tau=\{\bar{s}_{t},a_{t},r_{t},\{c_{i,t}\}_{i=1}^{N}\}_{t=0}^{T} using \pi_{\theta}(a|\bar{s};\boldsymbol{\beta}), and store (\boldsymbol{\beta},\tau) in D.
     end for
     Set G^{\theta}=\{\} and G^{\phi}=\{\}.
     for updates = 1 to U do
        Sample (\boldsymbol{\beta},\tau)\sim D.
        Update the critics, Z_{R,\psi}(\bar{s},a;\boldsymbol{\beta}) and Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta}), using the quantile regression loss.
        Calculate the policy gradient of \pi_{\theta}(a|\bar{s};\boldsymbol{\beta}) using the update rule described in Section 5.2, and store it in G^{\theta}.
        Calculate the gradient of the sampler \xi_{\phi} using the update rule described in (14), and store it in G^{\phi}.
     end for
     Update the policy parameter with the average of G^{\theta}.
     Update the sampler parameter with the average of G^{\phi}.
  end for
  Output: \pi_{\theta}(\cdot|\cdot;\boldsymbol{\beta}), where \boldsymbol{\beta}\sim\xi_{\phi}.

In this section, we present a practical algorithm that utilizes function approximators, such as neural networks, to tackle more complex tasks. The comprehensive procedure is presented in Algorithm 2.

First, it is necessary to train individual policies for various \boldsymbol{\beta}, so we model the policy network to be conditioned on \boldsymbol{\beta}, expressed as \pi_{\theta}(a|\bar{s};\boldsymbol{\beta}). This network structure enables the implementation of policies with various behaviors by conditioning the policy on different \boldsymbol{\beta} values. In order to train this \boldsymbol{\beta}-conditioned policy across a broad range of the \boldsymbol{\beta} space, an \epsilon-greedy strategy can be used. At the beginning of each episode, \boldsymbol{\beta} is sampled uniformly with probability \epsilon; otherwise, it is sampled using the sampler \xi_{\phi}. In our implementation, we disabled the uniform sampling process by setting \epsilon=0 to reduce fine-tuning efforts, but it can still be used to enhance practical performance.
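For concreteness, a minimal Python sketch of this episode-level \boldsymbol{\beta} sampling is given below, assuming N constraints, M-1 components per constraint, and a sampler object exposing a sample() method; the names are placeholders rather than the actual code.

import numpy as np

def sample_beta(sampler, eps, N, M, c_max, gamma, rng=np.random.default_rng()):
    """Episode-level beta sampling of Algorithm 2 (sketch).

    With probability eps, beta_i[j] is drawn uniformly with the ordering
    beta_i[0]=0 <= beta_i[1] <= ... <= C_max/(1-gamma); otherwise beta ~ sampler.
    """
    if rng.uniform() <= eps:
        upper = c_max / (1.0 - gamma)
        beta = np.zeros((N, M - 1))
        for i in range(N):
            prev = 0.0
            for j in range(M - 1):
                prev = rng.uniform(prev, upper)   # beta_i[j] ~ U(beta_i[j-1], upper)
                beta[i, j] = prev
        return beta
    return sampler.sample()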

After collecting rollouts with the sampled \boldsymbol{\beta}, the critic, policy, and sampler need to be updated. For the critic, a distributional target is calculated using the TD(\lambda) method [15], and we update the critic networks to minimize the quantile regression loss [11], defined as follows:

\mathcal{L}(\psi):=\sum_{l=1}^{L}\underset{(\bar{s},a,\boldsymbol{\beta})\sim D}{\mathbb{E}}\left[\underset{z\sim\hat{Z}_{C_{i}}(\bar{s},a;\boldsymbol{\beta})}{\mathbb{E}}\bigg{[}\rho_{\frac{2l-1}{2L}}\Big{(}z-Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta})[l]\Big{)}\bigg{]}\right],

where \rho_{l}(u):=u\cdot(l-\mathbf{1}_{u<0}), and \hat{Z}_{C_{i}}(\bar{s},a;\boldsymbol{\beta}) is the target distribution of Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta}). The policy gradient for a specific \boldsymbol{\beta} can be calculated according to the update rule described in Section 5.2. Consequently, the policy is updated using the policy gradient calculated with \boldsymbol{\beta} and rollouts, which are extracted from the replay buffer. Finally, the gradient of the sampler for a given \boldsymbol{\beta} can be calculated as in (14), so the sampler is updated with the average of gradients, which are calculated across multiple \boldsymbol{\beta}.
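As an illustration of the loss above, the following is a minimal numpy sketch of the quantile regression loss for one state-action pair, assuming the target samples and the predicted quantile values are given as arrays; it is meant to clarify the formula rather than reproduce the exact training code.

import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples):
    """Quantile regression loss with midpoints tau_l = (2l-1)/(2L).

    pred_quantiles: (L,) predicted quantile values Z_psi[l];
    target_samples: (K,) samples from the target distribution Z_hat.
    """
    L = pred_quantiles.shape[0]
    taus = (2.0 * np.arange(1, L + 1) - 1.0) / (2.0 * L)     # (L,)
    u = target_samples[None, :] - pred_quantiles[:, None]    # (L, K) residuals z - Z[l]
    loss = u * (taus[:, None] - (u < 0.0).astype(float))     # rho_tau(u)
    return loss.mean()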

Appendix D Experimental Details

In this section, we provide details on tasks, network structure, and hyperparameter settings.

Figure 5: Rendered images of the Safety Gymnasium and the legged robot locomotion tasks: (a) point goal task, (b) car button task, (c) quadrupedal robot, and (d) bipedal robot.

D.1 Task Description

There are two environments, legged robot locomotion [15] and Safety Gymnasium [14], whose rendered images are presented in Figure 5.

Legged Robot Locomotion. The legged robot locomotion tasks aim to control bipedal and quadrupedal robots to maintain a specified target velocity, which consists of linear velocity in the X and Y directions and rotational velocity in the Z direction, without falling. Each robot takes actions as PD target joint positions at a frequency of 50 Hz, and a PD controller in each joint operates at 500 Hz. The dimension of the action space corresponds to the number of motors: ten for the bipedal robot and 12 for the quadrupedal robot. The state space includes the command, gravity vector, current joint positions and velocities, foot contact phase, and a history of the joint positions and velocities. The reward function is defined as the negative of the squared Euclidean distance between the current velocity and the target velocity. There are three cost functions to prevent robots from falling. The first cost is to balance the robot, which is formulated as \mathbf{1}_{g[2]/||g||_{2}\geq\cos(\pi/12)}, where g is the gravity vector. This function is activated when the base frame of the robot is tilted more than 15 degrees. The second cost is to maintain the height of the robot above a predefined threshold, which is expressed as \mathbf{1}_{h\leq b}, where h is the height of the robot, and b is set to 0.7 for the bipedal robot and 0.35 for the quadrupedal robot. The third cost is to match the foot contact state with the desired timing. It is defined as the average number of feet that do not touch the floor at the desired time. We set the thresholds for these cost functions to 0.025/(1-\gamma), 0.025/(1-\gamma), and 0.4/(1-\gamma), respectively, where the factor 1/(1-\gamma) converts the threshold from an average per-step scale to an infinite-horizon scale.
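The following is a small Python sketch of these three cost signals, written only to make the indicator definitions concrete; the sign convention for the gravity vector and the contact-mismatch form of the third cost are assumptions, and the function is not taken from the released code.

import numpy as np

def locomotion_costs(gravity_vec, height, feet_off_ground, feet_should_touch, robot="biped"):
    """Sketch of the three locomotion cost signals (illustrative assumptions only).

    gravity_vec: (3,) gravity vector in the base frame; height: base height;
    feet_off_ground, feet_should_touch: boolean arrays over the feet.
    """
    # Cost 1: balance indicator, active when the base tilts beyond 15 degrees
    # (the direction of the inequality depends on the gravity-vector convention).
    c_balance = float(gravity_vec[2] / np.linalg.norm(gravity_vec) >= np.cos(np.pi / 12))
    # Cost 2: height indicator with threshold b = 0.7 (biped) or 0.35 (quadruped).
    b = 0.7 if robot == "biped" else 0.35
    c_height = float(height <= b)
    # Cost 3: average number of feet off the ground at times they should be touching.
    c_contact = float(np.mean(feet_off_ground & feet_should_touch))
    return c_balance, c_height, c_contact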

Safety Gymnasium. We use the goal and button tasks in the Safety Gymnasium [14], and both tasks employ point and car robots. The goal task is to control the robot to navigate to a target goal position without passing through hazard areas. The button task is to control the robot to push a designated button among several buttons without passing through hazard areas or hitting undesignated buttons. The dimension of the action space is two for all tasks. The state space dimensions are 60, 72, 76, and 88 for the point goal, car goal, point button, and car button tasks, respectively. For more details on the actions and states, please refer to [14]. The reward function is defined to reduce the distance between the robot and either the goal or the designated button, with a bonus awarded at the moment of task completion. When the robot traverses a hazard area or collides with an obstacle, it incurs a cost of one, and the threshold for this cost is set to 0.025/(1-\gamma).

D.2 Network Structure

Across all algorithms, we use the same network architecture for the policy and critics. The policy network is structured to output the mean and standard deviation of a normal distribution, which is then squashed using the \mathrm{Tanh} function, as done in SAC [13]. The critic network is based on the quantile distributional critic [11]. Details of these structures are presented in Table 1. For the sampler \xi_{\phi}, we define a trainable variable \phi\in\mathbb{R}^{N\times(M-1)} and represent the mean of the truncated normal distribution as \mu_{i,\phi}[j]=\exp(\phi[i,j]). To reduce fine-tuning efforts, we fix the standard deviation of the truncated normal distribution to 0.05.
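A minimal PyTorch sketch of the policy head described above is given below; the network widths follow Table 1, while the way \boldsymbol{\beta} is concatenated to the state and the log-std clamping range are assumptions made for illustration.

import torch
import torch.nn as nn

class SquashedGaussianPolicy(nn.Module):
    """Sketch of the policy network in Table 1: two 512-unit LeakyReLU layers,
    a linear head for the mean and log-std, and a Tanh squashing as in SAC."""

    def __init__(self, obs_dim, beta_dim, act_dim, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + beta_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs, beta):
        h = self.body(torch.cat([obs, beta], dim=-1))    # beta-conditioned features
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-5.0, 2.0)  # assumed clamping range
        dist = torch.distributions.Normal(mean, log_std.exp())
        return torch.tanh(dist.rsample())                # squashed action sample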

Table 1: Details of network structures.
Network                Parameter        Value
Policy network         Hidden layer     (512, 512)
                       Activation       LeakyReLU
                       Last activation  Linear
Reward critic network  Hidden layer     (512, 512)
                       Activation       LeakyReLU
                       Last activation  Linear
                       # of quantiles   25
                       # of ensembles   2
Cost critic network    Hidden layer     (512, 512)
                       Activation       LeakyReLU
                       Last activation  SoftPlus
                       # of quantiles   25
                       # of ensembles   2

D.3 Hyperparameter Settings

The hyperparameter settings for all algorithms applied to all tasks are summarized in Table 2.

Table 2: Description of hyperparameter settings.
Hyperparameter                          SRCPO (Ours)  CPPO [29]  CVaR-CPO [32]  SDAC [15]  SDPO [30]  WCSAC-D [26]
Discount factor                         0.99          0.99       0.99           0.99       0.99       0.99
Policy learning rate                    -             3e-5       3e-5           -          3e-5       3e-4
Critic learning rate                    3e-4          3e-4       3e-4           3e-4       3e-4       3e-4
# of target quantiles                   50            50         50             50         50         -
Coefficient of TD(lambda)               0.97          0.97       0.97           0.97       0.97       -
Soft update ratio                       -             -          -              -          -          0.995
Batch size                              10000         5000       5000           10000      5000       256
Steps per update                        1000          5000       5000           1000       5000       100
# of policy update iterations           10            20         20             -          20         10
# of critic update iterations           40            40         40             40         40         10
Trust region size                       0.001         0.001      0.001          0.001      0.001      -
Line search decay                       -             -          -              0.8        -          -
PPO clip ratio [22]                     -             0.2        0.2            -          0.2        -
Length of replay buffer                 100000        5000       5000           100000     5000       1000000
Learning rate for auxiliary variables   -             0.05       0.01           -          -          -
Learning rate for Lagrange multipliers  -             0.05       0.01           -          -          0.001
Log penalty coefficient                 -             -          -              -          5.0        -
Entropy threshold                       -             -          -              -          -          -|A|
Entropy learning rate                   -             -          -              -          -          0.001
Entropy coefficient                     0.001         -          -              0.001      -          -
Sampler learning rate                   1e-3          -          -              -          -          -
K for sampler target                    10            -          -              -          -          -

Appendix E Additional Experimental Results

E.1 Computational Resources

All experiments were conducted on a PC whose CPU and GPU are an Intel Xeon E5-2680 and NVIDIA TITAN Xp, respectively. The training time for each algorithm on the point goal task is reported in Table 3.

Table 3: Training time for the point goal task averaged over five runs.
SRCPO (Ours) CPPO CVaR-CPO SDAC WCSAC-D SDPO
Averaged training time 9h 51m 54s 6h 17m 16s 4h 43m 38s 8h 46m 8s 14h 18m 20s 6h 50m 39s

E.2 Experiments on Various Risk Measures

We have conducted studies on various risk measures in the main text. In this section, we first describe the definition of the Wang risk measure [24], show the visualization of the discretization process, and finally present the training curves of the experiments conducted in Section 8.1.

The Wang risk measure is defined as follows [24]:

\mathrm{Wang}_{\alpha}(X):=\int_{0}^{1}F_{X}^{-1}(u)d\zeta_{\alpha}(u),\quad\text{where}\;\zeta_{\alpha}(u):=\Phi(\Phi^{-1}(u)-\alpha),

\Phi is the cumulative distribution function (CDF) of the standard normal distribution, and \zeta_{\alpha} is called a distortion function. It can be expressed in the form of a spectral risk measure as follows:

\mathrm{Wang}_{\alpha}(X)=\int_{0}^{1}F_{X}^{-1}(u)\sigma_{\alpha}(u)du,\quad\text{where}\;\sigma_{\alpha}(u):=\frac{\phi(\Phi^{-1}(u)-\alpha)}{\phi(\Phi^{-1}(u))},

and \phi is the probability density function (PDF) of the standard normal distribution. Since \sigma_{\alpha}(u) goes to infinity as u approaches one, \sigma_{\alpha}(u) is not defined at u=1. Thus, \mathrm{Wang}_{\alpha} is not a spectral risk measure. Nevertheless, our parameterization method introduced in Section 6.1 can project \mathrm{Wang}_{\alpha} onto a discretized class of spectral risk measures.
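For reference, the spectrum \sigma_{\alpha} can be evaluated numerically with a few lines of Python using SciPy's standard normal PDF and quantile function; this snippet is only illustrative.

import numpy as np
from scipy.stats import norm

def wang_spectrum(u, alpha):
    """Spectrum sigma_alpha(u) = phi(Phi^{-1}(u) - alpha) / phi(Phi^{-1}(u)) of Wang_alpha."""
    x = norm.ppf(u)                      # Phi^{-1}(u)
    return norm.pdf(x - alpha) / norm.pdf(x)

u = np.linspace(0.01, 0.99, 99)
sigma = wang_spectrum(u, alpha=1.0)      # grows without bound as u approaches 1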

The discretization process can be implemented by minimizing (11), and the results are shown in Figure 6. In this process, the number of discretizations M is set to five. Additionally, visualizations of other risk measures are provided in Figure 7, and the optimized values of the parameters of the discretized spectrum, which are introduced in (10), are reported in Table 4. Note that CVaR is already an element of the discretized class of spectral risk measures, as observed in Figure 7. Therefore, CVaR can be precisely expressed using a parameter with M=1.
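A rough Python sketch of such a discretization is shown below: it fits an M-step spectrum to a target spectrum by minimizing the L1 distance on a grid with a generic optimizer. It is a stand-in for minimizing (11) under simplifying assumptions (an unconstrained reparameterization of the levels and breakpoints), not the exact procedure used in the paper.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def step_spectrum(u, etas, alphas):
    """Increasing step function sigma_tilde with levels etas and breakpoints alphas."""
    idx = np.searchsorted(alphas, u, side="right")      # which step each u falls into
    return np.asarray(etas)[idx]

def discretize(sigma, M=5, grid=np.linspace(1e-3, 1 - 1e-3, 2000)):
    """Fit an M-step spectrum to sigma by minimizing the L1 distance on a grid."""
    target = sigma(grid)

    def l1_loss(x):
        etas = np.sort(np.abs(x[:M]))                   # non-decreasing levels
        alphas = np.sort(1.0 / (1.0 + np.exp(-x[M:])))  # breakpoints in (0, 1)
        return np.trapz(np.abs(target - step_spectrum(grid, etas, alphas)), grid)

    x0 = np.concatenate([np.linspace(0.5, 2.0, M), np.zeros(M - 1)])
    res = minimize(l1_loss, x0, method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-8})
    return res.x

# Example: discretize the Wang spectrum with alpha = 1.0 defined above.
params = discretize(lambda u: norm.pdf(norm.ppf(u) - 1.0) / norm.pdf(norm.ppf(u)))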

Finally, we conducted experiments on the point goal task to check how the results change with the risk level and the type of risk measure. The training curves for these experiments are shown in Figure 8.

Table 4: Discretization results.
\mathrm{Wang}_{\alpha}, \alpha=0.5:  \{\eta_{i}\}_{i=1}^{M}=[0.515,0.790,1.091,1.493,2.191],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.263,0.541,0.770,0.926]
\mathrm{Wang}_{\alpha}, \alpha=1.0:  \{\eta_{i}\}_{i=1}^{M}=[0.294,0.734,1.417,2.640,5.517],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.409,0.701,0.878,0.968]
\mathrm{Wang}_{\alpha}, \alpha=0.5:  \{\eta_{i}\}_{i=1}^{M}=[0.180,0.834,2.253,5.678,16.419],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.579,0.834,0.945,0.989]
\mathrm{Pow}_{\alpha}, \alpha=0.5:   \{\eta_{i}\}_{i=1}^{M}=[0.200,0.600,1.000,1.400,1.800],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.200,0.400,0.600,0.800]
\mathrm{Pow}_{\alpha}, \alpha=0.75:  \{\eta_{i}\}_{i=1}^{M}=[0.046,0.574,1.347,2.308,3.424],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.417,0.615,0.765,0.890]
\mathrm{Pow}_{\alpha}, \alpha=0.9:   \{\eta_{i}\}_{i=1}^{M}=[0.003,0.947,2.705,5.216,8.383],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.701,0.821,0.898,0.955]
Figure 6: Visualization of the spectrum function of the Wang risk measure and the discretized results.
Figure 7: Visualization of the spectrum function of the Pow and CVaR risk measures and the discretized results.
Figure 8: Training curves of the point goal task with different risk levels and risk measures. Each column shows the results on the risk measure whose name appears in the plot title. The solid line in each graph represents the average of each metric, and the shaded area indicates the standard deviation scaled by 0.2. The results are obtained by training algorithms with five random seeds.