
Spectral-Risk Safe Reinforcement Learning
with Convergence Guarantees

Dohyeong Kim1  Taehyun Cho1  Seungyub Han1
Hojun Chung1  Kyungjae Lee2∗  Songhwai Oh1
1Department of Electrical and Computer Engineering, Seoul National University
2Artificial Intelligence Graduate School, Chung-Ang University
∗Corresponding authors: [email protected], [email protected].
Abstract

The field of risk-constrained reinforcement learning (RCRL) has been developed to effectively reduce the likelihood of worst-case scenarios by explicitly handling risk-measure-based constraints. However, the nonlinearity of risk measures makes it challenging to achieve convergence and optimality. To overcome the difficulties posed by the nonlinearity, we propose a spectral risk measure-constrained RL algorithm, spectral-risk-constrained policy optimization (SRCPO), a bilevel optimization approach that utilizes the duality of spectral risk measures. In the bilevel optimization structure, the outer problem involves optimizing dual variables derived from the risk measures, while the inner problem involves finding an optimal policy given these dual variables. The proposed method, to the best of our knowledge, is the first to guarantee convergence to an optimum in the tabular setting. Furthermore, the proposed method has been evaluated on continuous control tasks and achieved the best performance among RCRL algorithms that satisfy the constraints.

1 Introduction

Safe reinforcement learning (Safe RL) has been extensively researched [2, 15, 25], particularly for its applications in safety-critical areas such as robotic manipulation [3, 17] and legged robot locomotion [28]. In safe RL, constraints are often defined by the expectation of the discounted sum of costs [2, 25], where cost functions are defined to capture safety signals. These expectation-based constraints can leverage existing RL techniques since the value functions for costs follow the Bellman equation. However, they inadequately reduce the likelihood of significant costs, which can result in worst-case outcomes, because expectations do not capture information on the tail distribution. In real-world applications, avoiding such worst-case scenarios is particularly crucial, as they can lead to irrecoverable damage.

In order to avoid worst-case scenarios, various safe RL methods [10, 26] attempt to define constraints using risk measures, such as conditional value at risk (CVaR) [20] and its generalized form, spectral risk measures [1]. Yang et al. [26] and Zhang and Weng [30] have employed a distributional critic to estimate the risk constraints and calculate policy gradients. Zhang et al. [32] and Chow et al. [10] have introduced auxiliary variables to estimate CVaR. These approaches have effectively prevented worst-case outcomes by implementing risk constraints. However, the nonlinearity of risk measures makes it difficult to develop methods that ensure optimality. To the best of our knowledge, even for well-known risk measures such as CVaR, existing methods only guarantee local convergence [10, 32]. This highlights the need for a method that ensures convergence to a global optimum.

In this paper, we introduce a spectral risk measure-constrained RL (RCRL) algorithm called spectral-risk-constrained policy optimization (SRCPO). Spectral risk measures, including CVaR, have a dual form expressed as the expectation over a random variable [7]. Leveraging this duality, we propose a bilevel optimization framework designed to alleviate the challenges caused by the nonlinearity of risk measures. This framework consists of two levels: the outer problem, which involves optimizing a dual variable that originates from the dual form of spectral risk measures, and the inner problem, which aims to find an optimal policy for a given dual variable. For the inner problem, we define novel risk value functions that exhibit linearity in performance difference and propose a policy gradient method that ensures convergence to an optimal policy in tabular settings. The outer problem is potentially non-convex, so gradient-based optimization techniques are not suitable for global convergence. Instead, we propose a method to model and update a distribution over the dual variable, which also ensures finding an optimal dual variable in tabular settings. As a result, to the best of our knowledge, we present the first algorithm that guarantees convergence to an optimum for the RCRL problem. Moreover, the proposed method has been evaluated on continuous control tasks with both single and multiple constraints, demonstrating the highest reward performance among other RCRL algorithms satisfying constraints in all tasks. Our contributions are summarized as follows:

  • We propose a novel bilevel optimization framework for RCRL, effectively addressing the complexities caused by the nonlinearity of risk measures.

  • The proposed method demonstrates convergence and optimality in tabular settings.

  • The proposed method has been evaluated on continuous control tasks with both single and multiple constraints, consistently outperforming existing RCRL algorithms.

2 Related Work

RCRL algorithms can be classified according to how constraints are handled and how risks are estimated. There are two main approaches to handling constraints: the Lagrangian [29, 26, 32] and the primal [15, 30] approaches. The Lagrangian approach deals with dual problems by jointly updating dual variables and policies, which can be easily integrated into the existing RL framework. In contrast, the primal approach directly tackles the original problem by constructing a subproblem according to the current policy and solving it. Subsequently, risk estimation can be categorized into three types. The first type [29, 32] uses an auxiliary variable to precisely estimate CVaR through its dual form. Another one [26, 30] approximates risks as the expectation of the state-wise risks calculated using a distributional critic [8]. The third one [15] approximates CVaR by assuming that cost returns follow Gaussians. Despite these diverse methods, they only guarantee convergence to local optima. In the field of risk-sensitive RL, Bäuerle and Glauner [7] have proposed an algorithm for spectral risk measures that ensures convergence to an optimum. They introduced a bilevel optimization approach, like our method, but did not provide a method for solving the outer problem. Furthermore, their approach to the inner problem is value-based, making it complicated to extend to RCRL, which typically requires a policy gradient approach.

3 Background

3.1 Constrained Markov Decision Processes

A constrained Markov decision process (CMDP) is defined as $\langle S,A,P,\rho,\gamma,R,\{C_{i}\}_{i=1}^{N}\rangle$, where $S$ is a state space, $A$ is an action space, $P$ is a transition model, $\rho$ is an initial state distribution, $\gamma$ is a discount factor, $R:S\times A\times S\mapsto[-R_{\mathrm{max}},R_{\mathrm{max}}]$ is a reward function, $C_{i}:S\times A\times S\mapsto[0,C_{\mathrm{max}}]$ is a cost function, and $N$ is the number of cost functions. A policy $\pi:S\mapsto\mathcal{P}(A)$ is defined as a mapping from the state space to the action distribution. Then, the state distribution given $\pi$ is defined as $d_{\rho}^{\pi}(s):=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}\mathbb{P}(s_{t}=s)$, where $s_{0}\sim\rho$, $a_{t}\sim\pi(\cdot|s_{t})$, and $s_{t+1}\sim P(\cdot|s_{t},a_{t})$ for all $t$. The state-action return of $\pi$ is defined as:

Z_{R}^{\pi}(s,a):=\sum_{t=0}^{\infty}\gamma^{t}R(s_{t},a_{t},s_{t+1})\;\Big|\;s_{0}=s,\;a_{0}=a,\;a_{t}\sim\pi(\cdot|s_{t}),\;s_{t+1}\sim P(\cdot|s_{t},a_{t})\;\forall t.

The state return and the return are defined, respectively, as:

Y_{R}^{\pi}(s):=Z_{R}^{\pi}(s,a)\;\big|\;a\sim\pi(\cdot|s),\quad G_{R}^{\pi}:=Y_{R}^{\pi}(s)\;\big|\;s\sim\rho.

The returns for a cost, $Z_{C_{i}}^{\pi}$, $Y_{C_{i}}^{\pi}$, and $G_{C_{i}}^{\pi}$, are defined by replacing the reward with the $i$th cost. Then, a safe RL problem can be defined as maximizing the reward return while keeping the cost returns below predefined thresholds to ensure safety.
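To make the above definitions concrete, the following minimal sketch estimates $\mathbb{E}[G_{R}^{\pi}]$ and $\mathbb{E}[G_{C}^{\pi}]$ by Monte Carlo rollouts in a small randomly generated CMDP; the environment, policy, and horizon are illustrative placeholders and not part of the paper's benchmarks.

```python
# Minimal sketch: Monte Carlo estimates of the reward return G_R and a cost
# return G_C in a toy tabular CMDP with a fixed Markov policy.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 4, 2, 0.99
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))   # transition model
R = rng.uniform(-1.0, 1.0, size=(n_states, n_actions, n_states))   # reward function
C = rng.uniform(0.0, 1.0, size=(n_states, n_actions, n_states))    # single cost function
pi = rng.dirichlet(np.ones(n_actions), size=n_states)              # fixed Markov policy

def rollout(horizon=300):
    """Return one truncated sample of (G_R, G_C) from a uniform initial state."""
    s, g_r, g_c, disc = rng.integers(n_states), 0.0, 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(n_actions, p=pi[s])
        s_next = rng.choice(n_states, p=P[s, a])
        g_r += disc * R[s, a, s_next]
        g_c += disc * C[s, a, s_next]
        disc *= gamma
        s = s_next
    return g_r, g_c

samples = np.array([rollout() for _ in range(1000)])
print("E[G_R] ~", samples[:, 0].mean(), " E[G_C] ~", samples[:, 1].mean())
```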

3.2 Spectral Risk Measures

In order to reduce the likelihood of encountering the worst-case scenario, it is required to incorporate risk measures into constraints. Conditional value at risk (CVaR) is a widely used risk measure in finance [20], and it has been further developed into a spectral risk measure in [1], which provides a more comprehensive framework. A spectral risk measure with spectrum $\sigma$ is defined as:

\mathcal{R}_{\sigma}(X):=\int_{0}^{1}F_{X}^{-1}(u)\sigma(u)du,

where $\sigma$ is an increasing function, $\sigma\geq 0$, and $\int_{0}^{1}\sigma(u)du=1$. By appropriately defining the function $\sigma$, various risk measures can be established, and examples are as follows:

\mathrm{CVaR}_{\alpha}:\sigma(u)=\mathbf{1}_{u\geq\alpha}/(1-\alpha),\quad\mathrm{Pow}_{\alpha}:\sigma(u)=u^{\alpha/(1-\alpha)}/(1-\alpha), (1)

where $\alpha\in[0,1)$ represents a risk level, and the measures become risk neutral when $\alpha=0$. Furthermore, it is known that the spectral risk measure has the following dual form [7]:

\mathcal{R}_{\sigma}(X)=\inf_{g}\mathbb{E}[g(X)]+\int_{0}^{1}g^{*}(\sigma(u))du, (2)

where $g:\mathbb{R}\mapsto\mathbb{R}$ is an increasing convex function, and $g^{*}(y):=\sup_{x}xy-g(x)$ is the convex conjugate function of $g$. For example, $\mathrm{CVaR}$ is expressed as $\mathrm{CVaR}_{\alpha}(X)=\inf_{\beta}\mathbb{E}[\frac{1}{1-\alpha}(X-\beta)_{+}]+\beta$, where $g(x)=(x-\beta)_{+}/(1-\alpha)$, and the integral part of the conjugate function becomes $\beta$. We define the sub-risk measure corresponding to the function $g$ as follows:

\mathcal{R}_{\sigma}^{g}(X):=\mathbb{E}[g(X)]+\int_{0}^{1}g^{*}(\sigma(u))du. (3)

Then, $\mathcal{R}_{\sigma}(X)=\inf_{g}\mathcal{R}^{g}_{\sigma}(X)$. Note that the integral term is independent of $X$ and only serves as a constant offset. As a result, the sub-risk measure can be expressed using an expectation, so it provides computational advantages over the original risk measure in many operations. We will utilize this advantage to show the optimality of the proposed method.
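As a small numeric illustration of the dual form (2)-(3) (not taken from the paper), the sketch below evaluates the CVaR sub-risk measure $\mathcal{R}^{g}_{\sigma}(X)=\mathbb{E}[(X-\beta)_{+}/(1-\alpha)]+\beta$ on a grid of $\beta$ and checks that its infimum matches the primal tail-average definition; the distribution of $X$ is an arbitrary choice.

```python
# Numeric check of the CVaR dual form: inf over beta of E[(X - beta)_+/(1-alpha)] + beta.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.75
x = rng.exponential(scale=1.0, size=50_000)        # samples of a cost return X (illustrative)

# Primal definition with the CVaR spectrum in (1): mean of the worst (1 - alpha) tail.
var_alpha = np.quantile(x, alpha)
cvar_primal = x[x >= var_alpha].mean()

# Dual (sub-risk) form evaluated on a grid of beta; the infimum is attained near VaR_alpha.
betas = np.linspace(0.0, x.max(), 400)
sub_risk = np.array([np.maximum(x - b, 0.0).mean() / (1 - alpha) + b for b in betas])
print("primal CVaR :", cvar_primal)
print("dual   CVaR :", sub_risk.min(), "attained near beta =", betas[sub_risk.argmin()])
```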

3.3 Augmented State Spaces

As mentioned by Zhang et al. [32], a risk-constrained RL problem is non-Markovian, so a history-conditioned policy is required to solve it. Also, Bastani et al. [6] showed that an optimal history-conditioned policy can be expressed as a Markov policy defined on a newly augmented state space, which incorporates the discounted sum of costs as part of the state. Following these results, we define an augmented state as follows:

\bar{s}_{t}:=(s_{t},\{e_{i,t}\}_{i=1}^{N},b_{t}),\;\text{where}\;e_{i,0}=0,\;e_{i,t+1}=(C_{i}(s_{t},a_{t},s_{t+1})+e_{i,t})/\gamma,\;b_{t}=\gamma^{t}. (4)

Then, the CMDP is modified as follows:

\bar{s}_{t+1}=(s_{t+1},\{(c_{i,t}+e_{i,t})/\gamma\}_{i=1}^{N},\gamma b_{t}),\;\text{where}\;s_{t+1}\sim P(\cdot|s_{t},a_{t}),\;c_{i,t}=C_{i}(s_{t},a_{t},s_{t+1}),

and the policy $\pi(\cdot|\bar{s})$ is defined on the augmented state space.
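A minimal sketch of the augmented-state bookkeeping in (4) is given below, assuming a single cost and an abstract environment state; it also checks the identity $b_{t}e_{i,t}=\sum_{j<t}\gamma^{j}c_{i,j}$ used later in Section 5.1.

```python
# Sketch of the augmented transition (4): e tracks the discounted cost-so-far
# rescaled by gamma^{-t}, and b tracks gamma^t.
from dataclasses import dataclass

@dataclass(frozen=True)
class AugState:
    s: object        # original environment state
    e: float         # e_t = sum_{j<t} gamma^j c_j / gamma^t (single cost for brevity)
    b: float         # b_t = gamma^t

def augment_reset(s0):
    return AugState(s=s0, e=0.0, b=1.0)

def augment_step(aug, s_next, cost, gamma):
    """Augmented transition: s_bar_{t+1} = (s_{t+1}, (c_t + e_t)/gamma, gamma * b_t)."""
    return AugState(s=s_next, e=(cost + aug.e) / gamma, b=gamma * aug.b)

# Sanity check: b_t * e_t always equals the discounted cost accumulated so far.
gamma, costs = 0.9, [0.5, 0.0, 1.0, 0.2]
aug, acc, disc = augment_reset("s0"), 0.0, 1.0
for t, c in enumerate(costs):
    aug = augment_step(aug, f"s{t + 1}", c, gamma)
    acc += disc * c
    disc *= gamma
    assert abs(aug.b * aug.e - acc) < 1e-12
print("b*e =", aug.b * aug.e, " discounted cost sum =", acc)
```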

4 Proposed Method

The risk measure-constrained RL (RCRL) problem is defined as follows:

\max_{\pi}\mathbb{E}[G_{R}^{\pi}]\;\;\mathbf{s.t.}\;\mathcal{R}_{\sigma_{i}}(G_{C_{i}}^{\pi})\leq d_{i}\;\forall i\in\{1,2,...,N\}, (5)

where $d_{i}$ is the threshold of the $i$th constraint. Due to the nonlinearity of the risk measure, achieving an optimal policy through policy gradient techniques is challenging. To address this issue, we propose solving the RCRL problem by decomposing it into two separate problems. Using a feasibility function $\mathcal{F}$, defined as $\mathcal{F}(x):=0$ if $x\leq 0$ and $\infty$ otherwise, the RCRL problem can be rewritten as follows:

\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\mathcal{F}(\mathcal{R}_{\sigma_{i}}(G_{C_{i}}^{\pi})-d_{i})=\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\inf_{g_{i}}\mathcal{F}(\mathcal{R}^{g_{i}}_{\sigma_{i}}(G_{C_{i}}^{\pi})-d_{i}) (6)
=\underbrace{\sup_{\{g_{i}\}_{i=1}^{N}}\underbrace{\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\mathcal{F}(\mathcal{R}^{g_{i}}_{\sigma_{i}}(G_{C_{i}}^{\pi})-d_{i})}_{\textbf{(a)}}}_{\textbf{(b)}},

where (a) is the inner problem, (b) is the outer problem, and $\{g_{i}\}_{i=1}^{N}$ is referred to as the dual variable. It is known that if the cost functions are bounded and there exists a policy that satisfies the constraints of (5), an optimal dual variable exists [7], which turns the supremum in (6) into a maximum. Then, the RCRL problem can be solved by 1) finding optimal policies for various dual variables and 2) selecting the dual variable whose corresponding policy achieves the maximum expected reward return while satisfying the constraints.

The inner problem is then a safe RL problem with sub-risk measure constraints. However, although the sub-risk measure is expressed using an expectation as in (3), the function $g$ introduces nonlinearity, which makes it difficult to develop a policy gradient method. To address this issue, we propose novel risk value functions that exhibit linearity in the performance difference between two policies. Through these risk value functions, we propose a policy update rule with convergence guarantees for the inner problem in Section 5.2.

The search space of the outer problem is a function space, rendering the problem impractical to solve directly. To address this challenge, we parameterize the function $g_{i}$ using a parameter $\beta_{i}\in B\subset\mathbb{R}^{M-1}$ in Section 6.1. The outer problem could then be solved through a brute-force search over the parameter space, but this is computationally prohibitive, requiring time exponential in the number of constraints and the dimension of the parameter space. Consequently, an efficient method for solving the outer problem is necessary. Given the difficulty of directly identifying an optimal $\beta^{*}$, we instead model a distribution $\xi$ that estimates the probability that a given $\beta$ is optimal. We then propose a method in Section 6.2 that ensures convergence to an optimal distribution $\xi^{*}$, which deterministically samples an optimal $\beta^{*}$.

By integrating the proposed methods to solve the inner and outer problems, we finally introduce an RCRL algorithm called spectral-risk-constrained policy optimization (SRCPO). The pseudo-code for SRCPO is presented in Algorithm 1. In the following sections, we will describe how to solve the inner problem and the outer problem in detail.

Algorithm 1 Pseudo-Code of Spectral-Risk-Constrained Policy Optimization (SRCPO)
  Input: Spectrum function $\sigma_{i}$ for $i\in\{1,...,N\}$.
  Parameterize $g_{i}$ of $\sigma_{i}$, defined in (3), using $\beta_{i}\in B\subset\mathbb{R}^{M-1}$.  // Parameterization, Section 6.1.
  for $t=1$ to $T$ do
     for $\boldsymbol{\beta}=\{\beta_{1},...,\beta_{N}\}\in B^{N}$ do
        Update $\pi_{\boldsymbol{\beta},t}$ using the proposed policy gradient method.  // Inner problem, Section 5.2.
     end for
     Calculate the target distribution of $\xi$ using $\pi_{\boldsymbol{\beta},t}$, and update $\xi_{t}$.  // Outer problem, Section 6.2.
  end for
  Output: $\pi_{\boldsymbol{\beta},T}$, where $\boldsymbol{\beta}\sim\xi_{T}$.

5 Inner Problem: Safe RL with Sub-Risk Measure

In this section, we propose a policy gradient method to solve the inner problem, defined as:

\max_{\pi}\mathbb{E}[G_{R}^{\pi}]\;\;\mathbf{s.t.}\;\mathcal{R}^{g_{i}}_{\sigma_{i}}(G_{C_{i}}^{\pi})\leq d_{i}\;\forall i, (7)

where constraints consist of the sub-risk measures, defined in (3). In the remainder of this section, we first introduce risk value functions to calculate the policy gradient of the sub-risk measures, propose a policy update rule, and finally demonstrate its convergence to an optimum in tabular settings.

5.1 Risk Value Functions

In order to derive the policy gradient of the sub-risk measure, we first define risk value functions. To this end, using the fact that $e_{i,t}=\sum_{j=0}^{t-1}\gamma^{j}c_{i,j}/\gamma^{t}$ and $b_{t}=\gamma^{t}$, where $\{e_{i,t}\}_{i=1}^{N}$ and $b_{t}$ are defined in the augmented state $\bar{s}_{t}$, we expand the sub-risk measure of the $i$th cost as follows:

\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})-\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du=\int_{-\infty}^{\infty}g_{i}(z)f_{G_{C_{i}}^{\pi}}(z)dz=\mathbb{E}_{\bar{s}_{0}\sim\rho}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{Y_{C_{i}}^{\pi}(\bar{s}_{0})}(z)dz\right]
=\underset{\rho,\pi,P}{\mathbb{E}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{\sum_{j=0}^{t-1}\gamma^{j}c_{i,j}+\gamma^{t}Y_{C_{i}}^{\pi}(\bar{s}_{t})}(z)dz\right]=\underset{\rho,\pi,P}{\mathbb{E}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{b_{t}(e_{i,t}+Y_{C_{i}}^{\pi}(\bar{s}_{t}))}(z)dz\right]\;\forall t,

where $f_{X}$ is the probability density function (pdf) of a random variable $X$. To capture the value of the sub-risk measure, we can define risk value functions as follows:

V_{i,g}^{\pi}(\bar{s}):=\int_{-\infty}^{\infty}g(z)f_{b(e_{i}+Y_{C_{i}}^{\pi}(\bar{s}))}(z)dz,\quad Q_{i,g}^{\pi}(\bar{s},a):=\int_{-\infty}^{\infty}g(z)f_{b(e_{i}+Z_{C_{i}}^{\pi}(\bar{s},a))}(z)dz.
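Since $V_{i,g}^{\pi}(\bar{s})=\mathbb{E}[g(b(e_{i}+Y_{C_{i}}^{\pi}(\bar{s})))]$, a simple Monte Carlo estimate is possible whenever samples of the state cost return are available. The sketch below illustrates this with an arbitrary sample distribution and a CVaR-type $g$; it is not the estimator used in the paper, which relies on the distributional critic of Section 7.

```python
# Monte Carlo sketch of V_{i,g}(s_bar) = E[g(b * (e_i + Y))] from samples of Y.
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.75, 2.0
g = lambda z: np.maximum(z - beta, 0.0) / (1 - alpha)        # an increasing convex g (CVaR-type)

e_i, b = 1.3, 0.9 ** 5                                        # entries of the augmented state s_bar
y_samples = rng.gamma(shape=2.0, scale=1.0, size=100_000)     # placeholder samples of Y_C^pi(s_bar)
v_risk = g(b * (e_i + y_samples)).mean()
print("V_{i,g}(s_bar) ~", v_risk)
```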

Using the convexity of $g$, we present the following lemma on the bounds of the risk value functions:

Lemma 5.1.

Assuming that the function $g$ is differentiable, $V^{\pi}_{i,g}(\bar{s})$ and $Q^{\pi}_{i,g}(\bar{s},a)$ are bounded within $[g(e_{i}b),\,g(e_{i}b+bC_{\mathrm{max}}/(1-\gamma))]$, and $|Q^{\pi}_{i,g}(\bar{s},a)-V^{\pi}_{i,g}(\bar{s})|\leq bC$, where $C=\frac{C_{\mathrm{max}}}{1-\gamma}g^{\prime}(\frac{C_{\mathrm{max}}}{1-\gamma})$ and $\bar{s}=(s,\{e_{i}\}_{i=1}^{N},b)$.

The proof is provided in Appendix A. Lemma 5.1 shows that the magnitude of $Q^{\pi}_{i,g}(\bar{s},a)-V^{\pi}_{i,g}(\bar{s})$ scales with $b$. To eliminate this scale, we define a risk advantage function as follows:

A_{i,g}^{\pi}(\bar{s},a):=(Q_{i,g}^{\pi}(\bar{s},a)-V_{i,g}^{\pi}(\bar{s}))/b.

Now, we introduce a theorem on the difference in the sub-risk measure between two policies.

Theorem 5.2.

Given two policies, $\pi$ and $\pi^{\prime}$, the difference in the sub-risk measure is:

\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi^{\prime}})-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})=\mathbb{E}_{d_{\rho}^{\pi^{\prime}},\pi^{\prime}}\left[A_{i,g_{i}}^{\pi}(\bar{s},a)\right]/(1-\gamma). (8)

The proof is provided in Appendix A. Since Theorem 5.2 also holds for the optimal policy, it is beneficial for showing optimality, as done in policy gradient algorithms of traditional RL [4]. Additionally, to calculate the advantage function, we use a distributional critic instead of directly estimating the risk value functions. This process is described in detail in Section 7.

5.2 Policy Update Rule

In this section, we derive the policy gradient of the sub-risk measure and introduce a policy update rule to solve the inner problem. Before that, we first parameterize the policy as $\pi_{\theta}$, where $\theta\in\Theta$, and denote the objective function as $\mathbb{E}[G_{R}^{\pi_{\theta}}]=J_{R}(\pi)=J_{R}(\theta)$ and the $i$th constraint function as $\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi_{\theta}})=J_{C_{i}}(\pi)=J_{C_{i}}(\theta)$. Then, using (8) and the fact that $\mathbb{E}_{a\sim\pi(\cdot|\bar{s})}[A_{i,g}^{\pi}(\bar{s},a)]=0$, the policy gradient of the $i$th constraint function is derived as follows:

\nabla_{\theta}J_{C_{i}}(\theta)=\mathbb{E}_{d_{\rho}^{\pi_{\theta}},\pi_{\theta}}[A_{i,g_{i}}^{\pi_{\theta}}(\bar{s},a)\nabla_{\theta}\log(\pi_{\theta}(a|\bar{s}))]/(1-\gamma). (9)

Now, we propose a policy update rule as follows:

\theta_{t+1}=\begin{cases}\theta_{t}+\alpha_{t}F^{\dagger}(\theta_{t})\nabla_{\theta}\left(J_{R}(\theta)-\alpha_{t}\sum_{i=1}^{N}\lambda_{t,i}J_{C_{i}}(\theta)\right)\big|_{\theta=\theta_{t}}&\text{if the constraints are satisfied},\\ \theta_{t}+\alpha_{t}F^{\dagger}(\theta_{t})\nabla_{\theta}\left(\alpha_{t}\nu_{t}J_{R}(\theta)-\sum_{i=1}^{N}\lambda_{t,i}J_{C_{i}}(\theta)\right)\big|_{\theta=\theta_{t}}&\text{otherwise},\end{cases}

where $\alpha_{t}$ is a learning rate, $F(\theta)$ is the Fisher information matrix, $A^{\dagger}$ is the Moore-Penrose inverse of $A$, $\{\lambda_{t,1},...,\lambda_{t,N},\nu_{t}\}\subset[0,\lambda_{\mathrm{max}}]$ are weights, and, when the constraints are not satisfied, $\lambda_{t,i}=0$ for $i\in\{i\,|\,J_{C_{i}}(\theta)\leq d_{i}\}$ and $\sum_{i}\lambda_{t,i}=1$. There are various possible strategies for determining $\lambda_{t,i}$ and $\nu_{t}$, which allows the proposed method to generalize many primal approach-based safe RL algorithms [2, 25, 15]. Several options for determining $\lambda_{t,i}$ and $\nu_{t}$, as well as our strategy, are described in Appendix B.
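The following sketch illustrates the direction-selection logic of the update rule, assuming gradient estimates of $J_{R}$ and $J_{C_{i}}$ from (9) are already available and using the identity matrix in place of $F^{\dagger}(\theta_{t})$; the choice of $\lambda_{t,i}$ in the feasible case is one illustrative option among the strategies discussed in Appendix B.

```python
# Sketch of the update-direction logic, with F^dagger replaced by the identity.
import numpy as np

def srcpo_direction(grad_R, grad_C, J_C, d, alpha_t, nu_t=1.0):
    """Return the update direction for one step of the proposed rule."""
    violated = [i for i in range(len(J_C)) if J_C[i] > d[i]]
    if not violated:
        # All constraints satisfied: ascend the reward with a small cost penalty.
        # This uniform lambda is one illustrative choice, not the paper's strategy.
        lam = np.full(len(J_C), 1.0 / len(J_C))
        return grad_R - alpha_t * np.sum(lam[:, None] * grad_C, axis=0)
    # Some constraint violated: descend the violated costs, keep a small reward term,
    # with lambda_i = 0 for satisfied constraints and sum_i lambda_i = 1.
    lam = np.zeros(len(J_C))
    lam[violated] = 1.0 / len(violated)
    return alpha_t * nu_t * grad_R - np.sum(lam[:, None] * grad_C, axis=0)

# Illustrative usage with random gradient estimates for two constraints.
rng = np.random.default_rng(0)
grad_R, grad_C = rng.normal(size=8), rng.normal(size=(2, 8))
theta = rng.normal(size=8)
direction = srcpo_direction(grad_R, grad_C, J_C=[0.3, 0.8], d=[0.5, 0.5], alpha_t=0.1)
theta = theta + 0.1 * direction        # theta_{t+1} = theta_t + alpha_t * direction
print(direction)
```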

5.3 Convergence Analysis

We analyze the convergence of the proposed policy update rule in a tabular setting where both the augmented state space and action space are finite. To demonstrate convergence in this setting, we use the softmax policy parameterization [4] as follows:

\pi_{\theta}(a|\bar{s}):=\exp(\theta(\bar{s},a))\big/\left(\sum_{a^{\prime}}\exp(\theta(\bar{s},a^{\prime}))\right)\;\forall(\bar{s},a)\in\bar{S}\times A.

Then, we can show that the policy converges to an optimal policy if updated by the proposed method.

Theorem 5.3.

Assume that there exists a feasible policy $\pi_{f}$ such that $J_{C_{i}}(\pi_{f})\leq d_{i}-\eta$ for all $i$, where $\eta>0$. Then, if a policy is updated by the proposed update rule with a learning rate $\alpha_{t}$ that follows the Robbins-Monro condition [19], it will converge to an optimal policy of the inner problem.

The proof is provided in Appendix A.1, and the assumption of the existence of a feasible policy is commonly used in convergence analysis in safe RL [12, 5, 15]. Through this result, we can also demonstrate convergence of existing primal approach-based safe RL methods, such as CPO [2], PCPO [27], and P3O [31] (note that these safe RL algorithms are designed to handle expectation-based constraints, so they are not suitable for RCRL), by identifying proper $\lambda_{t,i}$ and $\nu_{t}$ strategies for these methods.

6 Outer Problem: Dual Optimization for Spectral Risk Measure

Figure 1: Discretization of spectrum.

In this section, we propose a method for solving the outer problem that can be performed concurrently with the policy update of the inner problem. To this end, since the domain of the outer problem is a function space, we first introduce a procedure to parameterize the functions, allowing the problem to be practically solved. Subsequently, we propose a method to find an optimal solution for the parameterized outer problem.

6.1 Parameterization

Searching the function space of $g$ is challenging. To address this issue, we discretize the spectrum function as illustrated in Figure 1, since the discretization allows $g$ to be expressed using a finite number of parameters. The discretized spectrum is then formulated as:

\sigma(u)\approx\tilde{\sigma}(u):=\eta_{1}+\sum_{i=1}^{M-1}(\eta_{i+1}-\eta_{i})\mathbf{1}_{u\geq\alpha_{i}}, (10)

where $0\leq\eta_{i}\leq\eta_{i+1}$, $0\leq\alpha_{i}\leq\alpha_{i+1}\leq 1$, and $M$ is the number of discretization levels. Proper values of $\eta_{i}$ and $\alpha_{i}$ can be obtained by minimizing the $L_{1}$ distance between the original and discretized spectra:

\min_{\{\eta_{i}\}_{i=1}^{M},\{\alpha_{i}\}_{i=1}^{M-1}}\int_{0}^{1}|\sigma(u)-\tilde{\sigma}(u)|du\quad\mathbf{s.t.}\;\int_{0}^{1}\tilde{\sigma}(u)du=1. (11)
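As a rough illustration (not the optimizer used for (11)), the sketch below builds a step approximation $\tilde{\sigma}$ of the $\mathrm{Pow}_{\alpha}$ spectrum on a uniform grid of breakpoints, rescales the step heights so that $\tilde{\sigma}$ integrates to one, and reports the resulting $L_{1}$ error.

```python
# Heuristic discretization of the Pow_alpha spectrum into an M-level step function.
import numpy as np

alpha_risk, M = 0.75, 8
p = alpha_risk / (1 - alpha_risk)
sigma = lambda u: (u ** p) / (1 - alpha_risk)        # Pow_alpha spectrum from (1)

alphas = np.linspace(0.0, 1.0, M + 1)[1:-1]          # M-1 breakpoints on a uniform grid
edges = np.concatenate(([0.0], alphas, [1.0]))
# eta_i: average of sigma over each piece, rescaled so that sigma_tilde integrates to 1.
eta = np.array([sigma(np.linspace(a, b, 200)).mean() for a, b in zip(edges[:-1], edges[1:])])
eta /= np.sum(eta * np.diff(edges))

u = np.linspace(0.0, 1.0, 20_001)
sigma_tilde = eta[np.searchsorted(alphas, u, side="right")]
print("L1 error |sigma - sigma_tilde|:", np.mean(np.abs(sigma(u) - sigma_tilde)))
```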

Now, we can show that the difference between the original risk measure and its discretized version is bounded by the inverse of the number of discretizations.

Lemma 6.1 (Approximation Error).

The difference between the original risk measure and its discretized version is bounded as:

|\mathcal{R}_{\sigma}(X)-\mathcal{R}_{\tilde{\sigma}}(X)|\leq C_{\mathrm{max}}\sigma(1)/((1-\gamma)M).

The proof is provided in Appendix A. Lemma 6.1 means that as the number of discretization levels increases, the discretized risk measure converges to the original one. Note that the discretized risk measure is also a spectral risk measure, indicating that the discretization process projects the original risk measure onto a specific class of spectral risk measures. Now, we introduce a parameterization method as detailed in the following theorem:

Theorem 6.2.

Let us parameterize the function $g$ using a parameter $\beta\in B\subset\mathbb{R}^{M-1}$ as:

g_{\beta}(x):=\eta_{1}x+\sum_{i=1}^{M-1}(\eta_{i+1}-\eta_{i})(x-\beta[i])_{+},

where $\beta[i]\leq\beta[i+1]$ for $i\in\{1,...,M-2\}$. Then, the following is satisfied:

\mathcal{R}_{\tilde{\sigma}}(X)=\inf_{\beta}\mathcal{R}^{\beta}_{\tilde{\sigma}}(X),\;\text{where}\;\mathcal{R}^{\beta}_{\tilde{\sigma}}(X):=\mathbb{E}[g_{\beta}(X)]+\int_{0}^{1}g_{\beta}^{*}(\tilde{\sigma}(u))du.

According to Remark 2.7 in [7], the infimum function in (2) exists and can be expressed as an integral of the spectrum function if the reward and cost functions are bounded. By using this fact, Theorem 6.2 can be proved, and details are provided in Appendix A. Through this parameterization, the RCRL problem is now expressed as follows:

\sup_{\{\beta_{i}\}_{i=1}^{N}}\max_{\pi}\mathbb{E}[G_{R}^{\pi}]-\sum_{i=1}^{N}\mathcal{F}(\mathcal{R}^{\beta_{i}}_{\tilde{\sigma}_{i}}(G_{C_{i}}^{\pi})-d_{i}), (12)

where $\beta_{i}\in B\subset\mathbb{R}^{M-1}$ is the parameter of the $i$th constraint. In the remainder of the paper, we denote the constraint function $\mathcal{R}^{\beta_{i}}_{\tilde{\sigma}_{i}}(G_{C_{i}}^{\pi})$ as $J_{C_{i}}(\pi;\boldsymbol{\beta})$, where $\boldsymbol{\beta}=\{\beta_{i}\}_{i=1}^{N}$, and the policy for $\boldsymbol{\beta}$ as $\pi_{\boldsymbol{\beta}}$.
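The identity in Theorem 6.2 can be checked numerically on an empirical sample. In the sketch below, $\mathcal{R}^{\beta}_{\tilde{\sigma}}$ is evaluated piece by piece as in the CVaR example after (2), so that each kink of $g_{\beta}$ contributes $(\eta_{i+1}-\eta_{i})\big(\mathbb{E}[(X-\beta[i])_{+}]+(1-\alpha_{i})\beta[i]\big)$, and the infimum is attained at $\beta[i]=F_{X}^{-1}(\alpha_{i})$; the step spectrum and the sample distribution are illustrative choices.

```python
# Numeric check of Theorem 6.2 for an empirical sample and a normalized step spectrum.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.gamma(2.0, 1.0, size=200_000))     # empirical sample of a cost return X
eta = np.array([0.2, 0.7, 1.7, 3.7])               # nondecreasing step heights
alphas = np.array([0.4, 0.7, 0.9])                 # breakpoints; sigma_tilde integrates to 1 here

def R_beta(beta):
    """R^beta(X) = E[g_beta(X)] + conjugate term, evaluated kink by kink."""
    val = eta[0] * x.mean()
    for d_eta, a, b in zip(np.diff(eta), alphas, beta):
        val += d_eta * (np.maximum(x - b, 0.0).mean() + (1 - a) * b)
    return val

# Primal value: integral of F_X^{-1}(u) * sigma_tilde(u) du, with sorted x as empirical quantiles.
u = (np.arange(x.size) + 0.5) / x.size
sigma_tilde = eta[0] + np.diff(eta) @ (u[None, :] >= alphas[:, None]).astype(float)
primal = np.mean(x * sigma_tilde)

beta_star = np.quantile(x, alphas)                  # infimum at beta[i] = F_X^{-1}(alpha_i)
print("primal:", primal, " dual at beta*:", R_beta(beta_star))
```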

6.2 Optimization

To obtain the supremum over $\boldsymbol{\beta}$ in (12), a brute-force search can be used, but it demands time exponential in the dimension of $\boldsymbol{\beta}$. Alternatively, $\boldsymbol{\beta}$ can be optimized directly via gradient-based methods; however, since the convexity of the outer problem is difficult to confirm, convergence to an optimum cannot be assured. To resolve this issue, we instead propose to find a distribution over $\boldsymbol{\beta}$, called a sampler $\xi\in\mathcal{P}(B)^{N}$, that outputs the likelihood that a given $\boldsymbol{\beta}$ is optimal.

To describe the sampler $\xi$ in detail, the probability of sampling $\boldsymbol{\beta}$ is expressed as $\xi(\boldsymbol{\beta})=\prod_{i=1}^{N}\xi[i](\beta_{i})$, where $\xi[i]$ is the $i$th element of $\xi$, and $\beta_{i}$ is $\boldsymbol{\beta}[i]$. The sampling process is denoted by $\boldsymbol{\beta}\sim\xi$, where each component is sampled according to $\beta_{i}\sim\xi[i]$. The implementation of this sampling process resembles the stick-breaking process [23] due to the condition $\beta_{i}[j]\leq\beta_{i}[j+1]$ for $j\in\{1,...,M-2\}$. Initially, $\beta_{i}[1]$ is sampled within $[0,C_{\mathrm{max}}/(1-\gamma)]$, and subsequent values $\beta_{i}[j+1]$ are sampled within $[\beta_{i}[j],C_{\mathrm{max}}/(1-\gamma)]$. Then, our target is to find the following optimal distribution:

\xi^{*}(\boldsymbol{\beta})\begin{cases}\geq 0&\text{if}\;\boldsymbol{\beta}\in\{\boldsymbol{\beta}\,|\,J_{R}(\pi_{\boldsymbol{\beta}}^{*})=J_{R}(\pi_{\boldsymbol{\beta}^{*}}^{*})\},\\ =0&\text{otherwise,}\end{cases} (13)

where $\boldsymbol{\beta}^{*}$ is an optimal solution of the outer problem (12). Once the optimal distribution is obtained, $\boldsymbol{\beta}^{*}$ can be obtained by sampling from $\xi^{*}$.

In order to obtain the optimal distribution, we propose a novel loss function and update rule. Before that, let us parameterize the sampler as $\xi_{\phi}$, where $\phi\in\Phi$, and define the following function:

J(\pi;\boldsymbol{\beta}):=J_{R}(\pi)-K\sum_{i}(J_{C_{i}}(\pi;\boldsymbol{\beta})-d_{i})_{+}.

It is known that for sufficiently large $K>0$, the optimal policy of the inner problem, $\pi_{\boldsymbol{\beta}}^{*}$, is also the solution of $\max_{\pi}J(\pi;\boldsymbol{\beta})$, which means $J_{R}(\pi_{\boldsymbol{\beta}}^{*})=\max_{\pi}J(\pi;\boldsymbol{\beta})$ [31]. Using this fact, we build a target distribution for the sampler as $\bar{\xi}_{t}(\boldsymbol{\beta})\propto\exp(J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta}))$, where $\pi_{\boldsymbol{\beta},t}$ is the policy for $\boldsymbol{\beta}$ at time-step $t$. Then, we define a loss function using the cross-entropy ($H$) as follows:

\mathcal{L}_{t}(\phi):=H(\xi_{\phi},\bar{\xi}_{t})=-\mathbb{E}_{\boldsymbol{\beta}\sim\xi_{\phi}}[\log\bar{\xi}_{t}(\boldsymbol{\beta})]=-\mathbb{E}_{\boldsymbol{\beta}\sim\xi_{\phi}}[J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})],
\nabla_{\phi}\mathcal{L}_{t}(\phi)=-\mathbb{E}_{\boldsymbol{\beta}\sim\xi_{\phi}}[\nabla_{\phi}\log(\xi_{\phi}(\boldsymbol{\beta}))J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})].

Finally, we present an update rule along with its convergence property in the following theorem.

Theorem 6.3.

Let us assume that the space of $\boldsymbol{\beta}$ is finite and update the sampler according to the following equation:

\phi_{t+1}=\phi_{t}-\alpha F^{\dagger}(\phi_{t})\nabla_{\phi}\mathcal{L}_{t}(\phi)|_{\phi=\phi_{t}}, (14)

where $\alpha$ is a learning rate, and $F^{\dagger}(\phi)$ is the pseudo-inverse of the Fisher information matrix of $\xi_{\phi}$. Then, under a softmax parameterization, the sampler converges to the optimal distribution defined in (13).

The proof is provided in Appendix A. As the loss function is defined with the current policy $\pi_{\boldsymbol{\beta},t}$, the sampler can be updated simultaneously with the policy.
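In the tabular case, the natural-gradient step (14) with a softmax sampler reduces to $\phi_{t+1}(\boldsymbol{\beta})=\phi_{t}(\boldsymbol{\beta})+\alpha J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})$ (see the proof of Theorem 6.3 in Appendix A). The toy sketch below applies this update with a fixed placeholder score $J$ per $\boldsymbol{\beta}$ in place of a trained policy, showing the sampler mass concentrating on the best $\boldsymbol{\beta}$.

```python
# Tabular sketch of the sampler update with a softmax parameterization over a finite beta grid.
import numpy as np

rng = np.random.default_rng(0)
n_beta, alpha_lr, T = 20, 0.5, 200
J = rng.normal(size=n_beta)          # placeholder for J(pi_{beta,t}; beta), held fixed over t here
phi = np.zeros(n_beta)

for t in range(T):
    phi += alpha_lr * J              # tabular natural-gradient step on the sampler logits

xi = np.exp(phi - phi.max())
xi /= xi.sum()
print("argmax J:", int(J.argmax()), " sampler mode:", int(xi.argmax()), " mass:", float(xi.max()))
```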

7 Practical Implementation

In order to calculate the policy gradient defined in (9), the risk value functions must be estimated. Instead of modeling them directly, we use quantile distributional critics [11], $Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta}):\bar{S}\times A\times B^{N}\mapsto\mathbb{R}^{L}$, which approximate the pdf of $Z_{C_{i}}^{\pi_{\boldsymbol{\beta}}}(\bar{s},a)$ using $L$ Dirac delta functions and are parameterized by $\psi\in\Psi$. Then, the risk value function can be approximated as:

Q_{i,g}^{\pi_{\boldsymbol{\beta}}}(\bar{s},a)\approx\sum_{l=1}^{L}g(be_{i}+bZ_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta})[l])/L,

and $V_{i,g}^{\pi_{\boldsymbol{\beta}}}(\bar{s})$ can be obtained from $\mathbb{E}_{a\sim\pi_{\boldsymbol{\beta}}(\cdot|\bar{s})}[Q_{i,g}^{\pi_{\boldsymbol{\beta}}}(\bar{s},a)]$. The distributional critics are trained to minimize the quantile regression loss [11]; details are provided in Appendix C. Additionally, to implement $\xi_{\phi}$, we use a truncated normal distribution $\mathcal{N}(\mu,\sigma^{2},a,b)$ [9], where $[a,b]$ is the sampling range. Then, the sampling process of $\xi_{\phi}$ can be implemented as follows:

\beta_{i}[j]=\sum_{k=1}^{j}\Delta\beta_{i}[k],\;\text{where}\;\Delta\beta_{i}[j]\sim\mathcal{N}(\mu_{i,\phi}[j],\sigma^{2}_{i,\phi}[j],0,C_{\mathrm{max}}/(1-\gamma))\;\forall i\;\text{and}\;\forall j.

Additional details of the practical implementation are provided in Appendix C.
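The sketch below illustrates, under illustrative placeholders, the two practical components described above: approximating $Q_{i,g}^{\pi_{\boldsymbol{\beta}}}$ from $L$ quantile atoms and sampling $\beta_{i}$ through cumulative truncated-normal increments (plain rejection sampling stands in for the truncated normal here).

```python
# Sketch of Section 7: quantile-based risk value and stick-breaking-style beta sampling.
import numpy as np

rng = np.random.default_rng(0)

# (i) Quantile-based approximation: Q_{i,g}(s_bar, a) ~ (1/L) * sum_l g(b*e_i + b*Z[l]).
alpha, beta_kink = 0.75, 2.0
g = lambda z: np.maximum(z - beta_kink, 0.0) / (1 - alpha)   # CVaR-type sub-risk function
L, e_i, b = 32, 1.3, 0.9 ** 5
Z_atoms = np.sort(rng.gamma(2.0, 1.0, size=L))               # placeholder for Z_{C_i,psi}(s_bar,a)[l]
Q_risk = g(b * e_i + b * Z_atoms).mean()
print("Q_{i,g}(s_bar, a) ~", Q_risk)

# (ii) Sampling beta[1] <= ... <= beta[M-1] via cumulative truncated-normal increments.
def trunc_normal(mu, sig, lo, hi):
    while True:                                               # simple rejection sampling
        draw = rng.normal(mu, sig)
        if lo <= draw <= hi:
            return draw

M, c_bound = 4, 10.0                                          # c_bound stands for C_max / (1 - gamma)
mu_phi, sig_phi = np.array([1.0, 2.0, 3.0]), np.array([0.5, 0.5, 0.5])
delta = np.array([trunc_normal(mu_phi[j], sig_phi[j], 0.0, c_bound) for j in range(M - 1)])
beta = np.cumsum(delta)                                       # nondecreasing by construction
print("sampled beta:", beta)
```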

8 Experiments and Results

Tasks. The experiments are conducted on the Safety Gymnasium tasks [14] with a single constraint and the legged robot locomotion tasks [15] with multiple constraints. In the Safety Gymnasium, two robots—point and car—are used to perform two tasks: a goal task, which involves controlling the robot to reach a target location, and a button task, which involves controlling the robot to press a designated button. In these tasks, a cost is incurred when the robot collides with an obstacle. In the legged robot locomotion tasks, bipedal and quadrupedal robots are controlled to track a target velocity while satisfying three constraints related to body balance, body height, and foot contact timing. For more details, please refer to Appendix D.

Baselines. Many RCRL algorithms use CVaR to define risk constraints [32, 15, 29]. Accordingly, the proposed method is evaluated and compared with other algorithms under the CVaR constraints with $\alpha=0.75$. Experiments on other risk measures are performed in Section 8.1. The baseline algorithms are categorized into three types based on their approach to estimating risk measures. First, CVaR-CPO [32] and CPPO [29] utilize auxiliary variables to estimate CVaR. Second, WCSAC-Dist [26] and SDPO [30] approximate the risk measure $\mathcal{R}_{\sigma}(G_{C}^{\pi})$ with an expected value $\mathbb{E}_{s\sim\mathcal{D}}[\mathcal{R}_{\sigma}(Y_{C}^{\pi}(s))]$. Finally, SDAC [15] approximates $G_{C}^{\pi}$ as a Gaussian distribution and uses the mean and standard deviation of $G_{C}^{\pi}$ to estimate CVaR. The hyperparameters and network structure of each algorithm are detailed in Appendix D. Note that both CVaR-CPO and the proposed method, SRCPO, use the augmented state space. However, for a fair comparison with other algorithms, the policy and critic networks of these methods are modified to operate on the original state space.

Figure 2: Training curves of the legged robot locomotion tasks. The upper graph shows results for the quadrupedal robot, and the lower one is for the bipedal robot. The solid line in each graph represents the average of each metric, and the shaded area indicates the standard deviation scaled by 0.5. The results are obtained by training each algorithm with five random seeds.
Figure 3: Training curves of the Safety Gymnasium tasks. The results for each task are displayed in columns, titled with the task name. The solid line represents the average of each metric, and the shaded area indicates the standard deviation scaled by 0.2. The results are obtained by training each algorithm with five random seeds.

Results. Figures 2 and 3 show the training curves for the locomotion tasks and the Safety Gymnasium tasks, respectively. The reward sum in the figures refers to the sum of rewards within an episode, while the cost rate is calculated as the sum of costs divided by the episode length. In all tasks, the proposed method achieves the highest rewards among the methods whose cost rates are below the specified thresholds. These results are likely because only the proposed method guarantees optimality. Specifically, in the locomotion tasks, an initial policy often struggles to keep the robot balanced, resulting in high costs from the start. Because of this, the locomotion tasks are more prone to local optima than the other tasks, which enables the proposed method to outperform the baseline methods. Note that WCSAC-Dist shows the highest rewards in the Safety Gymnasium tasks, but its cost rates exceed the specified thresholds. This issue seems to arise from how it estimates the risk constraints: WCSAC-Dist estimates the constraints using the expected risk over states, which is lower than the original risk measure and thus leads to constraint violations.

8.1 Study on Various Risk Measures

Figure 4: (Left) A correlation graph between cost rate and reward sum for policies trained in the point goal task under various risk measure constraints. The results are achieved by training policies with five random seeds for each risk measure and risk level. The center and radius of each ellipse show the average and standard deviation of the results from the five seeds, respectively. (Middle) Distribution graphs of the cost rate under different risk measure constraints. Locations of several percentiles (from the 50th to the 99th) are marked on the plot. The risk level of each risk measure is selected to have a similar cost rate. After training a policy in the point goal task, cost distributions have been collected by rolling out the trained policy across 500 episodes. (Right) Distribution graphs of the cost rate with different risk levels, $\alpha$, under the CVaR constraint.

In this section, we analyze the results when constraints are defined using various risk measures. To this end, we train policies in the point goal task under constraints based on the $\mathrm{CVaR}_{\alpha}$ and $\mathrm{Pow}_{\alpha}$ risk measures defined in (1), as well as the Wang risk measure [24]. Although the Wang risk measure is not a spectral but a distortion risk measure, our parameterization method introduced in Section 6.1 enables it to be approximated as a spectral risk measure, and a visualization of this process is provided in Appendix E. We conduct experiments with three risk levels for each risk measure and set the constraint threshold to 0.025. Evaluation results are presented in Figure 4, and training curves are provided in Appendix E. Figure 4 (Right) shows the intuitive result that increasing the risk level effectively reduces the likelihood of incurring high costs. Similarly, Figure 4 (Left) presents the same trend across all risk measures, indicating that higher risk levels correspond to lower cost rates and decreased reward performance. Finally, Figure 4 (Middle) exhibits the differences in how each risk measure addresses worst-case scenarios. In the spectrum formulation defined in (1), $\mathrm{CVaR}$ applies a uniform penalty to the tail of the cost distribution above a specified percentile, whereas measures like $\mathrm{Wang}$ and $\mathrm{Pow}$ impose a heavier penalty at higher percentiles. As a result, because $\mathrm{CVaR}$ imposes a relatively milder penalty on the worst-case outcomes than the other risk measures, it shows the largest intervals between the 50th and 99th percentiles.

9 Conclusions and Limitations

In this work, we introduced a spectral-risk-constrained RL algorithm that ensures convergence and optimality in a tabular setting. Specifically, in the inner problem, we proposed a generalized policy update rule that can facilitate the development of a new safe RL algorithm with convergence guarantees. For the outer problem, we introduced a notion called sampler, which enhances training efficiency by concurrently training with the inner problem. Through experiments in continuous control tasks, we empirically demonstrated the superior performance of the proposed method and its capability to handle various risk measures. However, convergence to an optimum is shown only in a tabular setting, so future research may focus on extending these results to linear MDPs or function approximation settings. Furthermore, since our approach can be applied to risk-sensitive RL, future work can also implement the proposed method in this area.

References

  • Acerbi [2002] Carlo Acerbi. Spectral measures of risk: A coherent representation of subjective risk aversion. Journal of Banking & Finance, 26(7):1505–1518, 2002.
  • Achiam et al. [2017] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In Proceedings of International Conference on Machine Learning, pages 22–31, 2017.
  • Adjei et al. [2024] Patrick Adjei, Norman Tasfi, Santiago Gomez-Rosero, and Miriam AM Capretz. Safe reinforcement learning for arm manipulation with constrained Markov decision process. Robotics, 13(4):63, 2024.
  • Agarwal et al. [2021] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. Journal of Machine Learning Research, 22(98):1–76, 2021.
  • Bai et al. [2022] Qinbo Bai, Amrit Singh Bedi, Mridul Agarwal, Alec Koppel, and Vaneet Aggarwal. Achieving zero constraint violation for constrained reinforcement learning via primal-dual approach. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 3682–3689, 2022.
  • Bastani et al. [2022] Osbert Bastani, Jason Yecheng Ma, Estelle Shen, and Wanqiao Xu. Regret bounds for risk-sensitive reinforcement learning. In Advances in Neural Information Processing Systems, pages 36259–36269, 2022.
  • Bäuerle and Glauner [2021] Nicole Bäuerle and Alexander Glauner. Minimizing spectral risk measures applied to Markov decision processes. Mathematical Methods of Operations Research, 94(1):35–69, 2021.
  • Bellemare et al. [2017] Marc G. Bellemare, Will Dabney, and Rémi Munos. A distributional perspective on reinforcement learning. In Proceedings of International Conference on Machine Learning, pages 449–458, 2017.
  • Burkardt [2014] John Burkardt. The truncated normal distribution. Department of Scientific Computing Website, Florida State University, 1(35):58, 2014.
  • Chow et al. [2017] Yinlam Chow, Mohammad Ghavamzadeh, Lucas Janson, and Marco Pavone. Risk-constrained reinforcement learning with percentile risk criteria. Journal of Machine Learning Research, 18(1):6070–6120, 2017.
  • Dabney et al. [2018] Will Dabney, Mark Rowland, Marc Bellemare, and Rémi Munos. Distributional reinforcement learning with quantile regression. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
  • Ding et al. [2020] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained Markov decision processes. In Advances in Neural Information Processing Systems, pages 8378–8390, 2020.
  • Haarnoja et al. [2018] Tuomas Haarnoja, Aurick Zhou, Pieter Abbeel, and Sergey Levine. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. In Proceedings of International Conference on Machine Learning, pages 1861–1870, 2018.
  • Ji et al. [2023] Jiaming Ji, Borong Zhang, Xuehai Pan, Jiayi Zhou, Juntao Dai, and Yaodong Yang. Safety-gymnasium. GitHub repository, 2023.
  • Kim et al. [2023] Dohyeong Kim, Kyungjae Lee, and Songhwai Oh. Trust region-based safe distributional reinforcement learning for multiple constraints. In Advances in Neural Information Processing Systems, 2023.
  • Kim et al. [2024] Dohyeong Kim, Mineui Hong, Jeongho Park, and Songhwai Oh. Scale-invariant gradient aggregation for constrained multi-objective reinforcement learning. arXiv preprint arXiv:2403.00282, 2024.
  • Liu et al. [2024] Puze Liu, Haitham Bou-Ammar, Jan Peters, and Davide Tateo. Safe reinforcement learning on the constraint manifold: Theory and applications. arXiv preprint arXiv:2404.09080, 2024.
  • Liu et al. [2020] Yongshuai Liu, Jiaxin Ding, and Xin Liu. IPO: Interior-point policy optimization under constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 4940–4947, 2020.
  • Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
  • Rockafellar et al. [2000] R Tyrrell Rockafellar, Stanislav Uryasev, et al. Optimization of conditional value-at-risk. Journal of Risk, 2:21–42, 2000.
  • Schulman et al. [2015] John Schulman, Sergey Levine, Pieter Abbeel, Michael Jordan, and Philipp Moritz. Trust region policy optimization. In Proceedings of International Conference on Machine Learning, pages 1889–1897. PMLR, 2015.
  • Schulman et al. [2017] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
  • Sethuraman [1994] Jayaram Sethuraman. A constructive definition of Dirichlet priors. Statistica Sinica, pages 639–650, 1994.
  • Wang [2000] Shaun S Wang. A class of distortion operators for pricing financial and insurance risks. Journal of Risk and Insurance, pages 15–36, 2000.
  • Xu et al. [2021] Tengyu Xu, Yingbin Liang, and Guanghui Lan. CRPO: A new approach for safe reinforcement learning with convergence guarantee. In Proceedings of International Conference on Machine Learning, pages 11480–11491, 2021.
  • Yang et al. [2023] Qisong Yang, Thiago D Simão, Simon H Tindemans, and Matthijs TJ Spaan. Safety-constrained reinforcement learning with a distributional safety critic. Machine Learning, 112(3):859–887, 2023.
  • Yang et al. [2020] Tsung-Yen Yang, Justinian Rosca, Karthik Narasimhan, and Peter J. Ramadge. Projection-based constrained policy optimization. In Proceedings of International Conference on Learning Representations, 2020.
  • Yang et al. [2022] Tsung-Yen Yang, Tingnan Zhang, Linda Luu, Sehoon Ha, Jie Tan, and Wenhao Yu. Safe reinforcement learning for legged locomotion. In Proceedings of IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 2454–2461, 2022.
  • Ying et al. [2022] Chengyang Ying, Xinning Zhou, Hang Su, Dong Yan, Ning Chen, and Jun Zhu. Towards safe reinforcement learning via constraining conditional value-at-risk. In Proceedings of International Joint Conference on Artificial Intelligence, 2022.
  • Zhang and Weng [2022] Jianyi Zhang and Paul Weng. Safe distributional reinforcement learning. In Proceedings of Distributed Artificial Intelligence, pages 107–128, 2022.
  • Zhang et al. [2022] Linrui Zhang, Li Shen, Long Yang, Shixiang Chen, Bo Yuan, Xueqian Wang, and Dacheng Tao. Penalized proximal policy optimization for safe reinforcement learning. arXiv preprint arXiv:2205.11814, 2022.
  • Zhang et al. [2024] Qiyuan Zhang, Shu Leng, Xiaoteng Ma, Qihan Liu, Xueqian Wang, Bin Liang, Yu Liu, and Jun Yang. CVaR-constrained policy optimization for safe reinforcement learning. IEEE Transactions on Neural Networks and Learning Systems, 2024.

Appendix A Proofs

See 5.1

Proof.

Since $0\leq Y_{C_{i}}^{\pi}(\bar{s}),Z_{C_{i}}^{\pi}(\bar{s},a)\leq C_{\mathrm{max}}/(1-\gamma)$,

f_{be_{i}+bY_{C_{i}}^{\pi}(\bar{s})}(z)=f_{be_{i}+bZ_{C_{i}}^{\pi}(\bar{s},a)}(z)=0,

when $z<be_{i}$ or $z>be_{i}+bC_{\mathrm{max}}/(1-\gamma)$. Using the fact that $g$ is an increasing and convex function,

g(be_{i})\leq V_{i,g}^{\pi}(\bar{s})=\int_{be_{i}}^{be_{i}+b\frac{C_{\mathrm{max}}}{1-\gamma}}g(z)f_{be_{i}+bY_{C_{i}}^{\pi}(\bar{s})}(z)dz\leq g(be_{i}+bC_{\mathrm{max}}/(1-\gamma)).

Likewise, $Q_{i,g}^{\pi}(\bar{s},a)$ also satisfies the above property. As a result,

V_{i,g}^{\pi}(\bar{s}),\;Q_{i,g}^{\pi}(\bar{s},a)\in[g(be_{i}),g(be_{i}+bC_{\mathrm{max}}/(1-\gamma))].

Then, the maximum of $|Q_{i,g}^{\pi}(\bar{s},a)-V_{i,g}^{\pi}(\bar{s})|$ is $g(be_{i}+bC_{\mathrm{max}}/(1-\gamma))-g(be_{i})$. Since $b=\gamma^{t}$ and $e_{i}=\sum_{k=0}^{t-1}\gamma^{k}C_{i}(s_{k},a_{k},s_{k+1})/\gamma^{t}$, where $t$ is the current time step,

be_{i}+bC_{\mathrm{max}}/(1-\gamma)\leq C_{\mathrm{max}}/(1-\gamma).

Consequently,

|Q_{i,g}^{\pi}(\bar{s},a)-V_{i,g}^{\pi}(\bar{s})|\leq g(be_{i}+bC_{\mathrm{max}}/(1-\gamma))-g(be_{i})\leq b\frac{C_{\mathrm{max}}}{1-\gamma}g^{\prime}\left(be_{i}+b\frac{C_{\mathrm{max}}}{1-\gamma}\right)\leq b\frac{C_{\mathrm{max}}}{1-\gamma}g^{\prime}\left(\frac{C_{\mathrm{max}}}{1-\gamma}\right).

See 5.2

Proof.

Since the following equation is satisfied:

\int_{-\infty}^{\infty}g(z)f_{be_{i}+bZ_{C_{i}}^{\pi}(\bar{s},a)}(z)dz=\mathbb{E}_{\bar{s}^{\prime}\sim P(\cdot|\bar{s},a)}\left[\int_{-\infty}^{\infty}g(z)f_{b(e_{i}+C_{i}(s,a,s^{\prime}))+\gamma bY_{C_{i}}^{\pi}(\bar{s}^{\prime})}(z)dz\right]=\mathbb{E}_{\bar{s}^{\prime}\sim P(\cdot|\bar{s},a)}\left[\int_{-\infty}^{\infty}g(z)f_{b^{\prime}e_{i}^{\prime}+b^{\prime}Y_{C_{i}}^{\pi}(\bar{s}^{\prime})}(z)dz\right],

we can say $Q_{i,g}^{\pi}(\bar{s},a)=\mathbb{E}_{\bar{s}^{\prime}\sim P(\cdot|\bar{s},a)}[V_{i,g}^{\pi}(\bar{s}^{\prime})]$. Using this property,

\mathbb{E}_{d_{\rho}^{\pi^{\prime}},\pi^{\prime}}\left[A_{i,g_{i}}^{\pi}(\bar{s},a)\right]/(1-\gamma)=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{i,g_{i}}^{\pi}(\bar{s}_{t},a_{t})\right] (15)
=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\left(Q_{i,g_{i}}^{\pi}(\bar{s}_{t},a_{t})-V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right)\right]
=\mathbb{E}_{\pi^{\prime}}\left[\sum_{t=0}^{\infty}\left(V_{i,g_{i}}^{\pi}(\bar{s}_{t+1})-V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right)\right]
=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]-\mathbb{E}_{\bar{s}_{0}\sim\rho}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{0})\right]
=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})+\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du.

Using the definition of $V_{i,g_{i}}^{\pi}(\bar{s}_{t})$ and the dominated convergence theorem,

\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{b_{t}e_{i,t}+b_{t}Y_{C_{i}}^{\pi}(\bar{s}_{t})}(z)dz\right] (16)
=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{\sum_{k=0}^{t-1}\gamma^{k}C_{i}(s_{k},a_{k},s_{k+1})+\gamma^{t}Y_{C_{i}}^{\pi}(\bar{s}_{t})}(z)dz\right]
=\mathbb{E}_{\pi^{\prime}}\left[\int_{-\infty}^{\infty}g_{i}(z)f_{G_{C_{i}}^{\pi^{\prime}}}(z)dz\right]=\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi^{\prime}})-\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du.

By combining (15) and (16),

\mathbb{E}_{d_{\rho}^{\pi^{\prime}},\pi^{\prime}}\left[A_{i,g_{i}}^{\pi}(\bar{s},a)\right]/(1-\gamma)=\lim_{t\to\infty}\mathbb{E}_{\pi^{\prime}}\left[V_{i,g_{i}}^{\pi}(\bar{s}_{t})\right]-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi})+\int_{0}^{1}{g_{i}}^{*}(\sigma_{i}(u))du=\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi^{\prime}})-\mathcal{R}_{\sigma_{i}}^{g_{i}}(G_{C_{i}}^{\pi}).

See 6.1

Proof.

The difference is:

|\mathcal{R}_{\sigma}(X)-\mathcal{R}_{\tilde{\sigma}}(X)|\leq\int_{0}^{1}|F_{X}^{-1}(u)||\sigma(u)-\tilde{\sigma}(u)|du\leq||\sigma-\tilde{\sigma}||_{1}||F_{X}^{-1}||_{\infty}=||\sigma-\tilde{\sigma}||_{1}\frac{C_{\mathrm{max}}}{1-\gamma}.

The optimal value of (11) is $||\sigma-\tilde{\sigma}||_{1}$ and is no larger than the optimal value of the following problem:

\min_{\eta_{i},\alpha_{i}}\int_{0}^{1}|\sigma(u)-\tilde{\sigma}(u)|du\quad\mathbf{s.t.}\;\int_{0}^{1}\tilde{\sigma}(u)du=1,\;\frac{\sigma(1)(i-1)}{M}\leq\eta_{i}\leq\frac{\sigma(1)i}{M}\;\forall i, (17)

since its search space is smaller than the original one. As the value of (17) is bounded by $\max_{u}|\sigma(u)-\tilde{\sigma}(u)|\leq\sigma(1)/M$, we have $||\sigma-\tilde{\sigma}||_{1}\leq\sigma(1)/M$. Consequently,

|\mathcal{R}_{\sigma}(X)-\mathcal{R}_{\tilde{\sigma}}(X)|\leq||\sigma-\tilde{\sigma}||_{1}\frac{C_{\mathrm{max}}}{1-\gamma}\leq\frac{C_{\mathrm{max}}\sigma(1)}{(1-\gamma)M}.

See 6.2

Proof.

Since $g_{\beta}(x)$ is an increasing convex function, $\mathcal{R}_{\tilde{\sigma}}(X)\leq\mathcal{R}_{\tilde{\sigma}}^{g_{\beta}}(X)=:\mathcal{R}_{\tilde{\sigma}}^{\beta}(X)$. In addition, according to Remark 2.7 in [7], there exists $\tilde{g}$ such that $\tilde{g}=\arg\inf_{g}\mathcal{R}^{g}_{\tilde{\sigma}}(X)$, and its derivative is expressed as $\tilde{g}^{\prime}(x)=\tilde{\sigma}(F_{X}(x))$. Using this fact, we can formulate $\tilde{g}$ by integrating its derivative as follows:

\tilde{g}(x)=\eta_{1}x+\sum_{i=1}^{M-1}(\eta_{i+1}-\eta_{i})(x-F_{X}^{-1}(\alpha_{i}))_{+}+C,

where $C$ is a constant. Since $\mathcal{R}^{\tilde{g}}_{\tilde{\sigma}}(X)$ has the same value for any constant $C$ due to the integral part of $\tilde{g}^{*}$ in (3), we set $C$ to zero. Also, we can express $\tilde{g}(x)$ as $g_{\tilde{\beta}}(x)$, where ${\tilde{\beta}}[i]=F_{X}^{-1}(\alpha_{i})$, due to the definition of $g_{\beta}$. As a result, the following is satisfied:

\mathcal{R}_{\tilde{\sigma}}(X)=\mathcal{R}^{\tilde{g}}_{\tilde{\sigma}}(X)=\mathcal{R}^{\tilde{\beta}}_{\tilde{\sigma}}(X)\leq\mathcal{R}_{\tilde{\sigma}}^{\beta}(X)\;\;\Rightarrow\;\;\mathcal{R}_{\tilde{\sigma}}(X)=\inf_{\beta}\mathcal{R}_{\tilde{\sigma}}^{\beta}(X).

See 6.3

Proof.

Since we assume that the space of $\boldsymbol{\beta}$ is finite, the sampler can be expressed using the softmax parameterization as follows:

\xi_{\phi}(\boldsymbol{\beta}):=\frac{\exp(\phi(\boldsymbol{\beta}))}{\sum_{\boldsymbol{\beta}^{\prime}}\exp(\phi(\boldsymbol{\beta}^{\prime}))}.

If we update the parameter of the sampler as described in (14), the following is satisfied as in the natural policy gradient:

\phi_{t+1}(\boldsymbol{\beta})=\phi_{t}(\boldsymbol{\beta})+\alpha J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})=\sum_{k=1}^{t}\alpha J(\pi_{\boldsymbol{\beta},k};\boldsymbol{\beta}).

Furthermore, due to Theorem 5.3, if each policy $\pi_{\boldsymbol{\beta},t}$ is updated by the proposed method, it converges to an optimal policy $\pi^{*}_{\boldsymbol{\beta}}$. Then, if $\boldsymbol{\beta}$ is not optimal,

\lim_{T\to\infty}\sum_{t=1}^{T}\alpha\left(J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*},t};\boldsymbol{\beta}^{*})\right)=-\infty,

since $\lim_{t\to\infty}J(\pi_{\boldsymbol{\beta},t};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*},t};\boldsymbol{\beta}^{*})=J(\pi_{\boldsymbol{\beta}}^{*};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*}}^{*};\boldsymbol{\beta}^{*})<0$. As a result,

\lim_{t\to\infty}\xi_{\phi_{t}}(\boldsymbol{\beta})=\lim_{t\to\infty}\exp(\phi_{t}(\boldsymbol{\beta})-\phi_{t}(\boldsymbol{\beta}^{*}))\Big/\sum_{\boldsymbol{\beta}^{\prime}}\exp(\phi_{t}(\boldsymbol{\beta}^{\prime})-\phi_{t}(\boldsymbol{\beta}^{*}))
=\lim_{t\to\infty}\exp\left(\sum_{k=1}^{t-1}\alpha\left(J(\pi_{\boldsymbol{\beta},k};\boldsymbol{\beta})-J(\pi_{\boldsymbol{\beta}^{*},k};\boldsymbol{\beta}^{*})\right)\right)\Big/\sum_{\boldsymbol{\beta}^{\prime}}\exp(\phi_{t}(\boldsymbol{\beta}^{\prime})-\phi_{t}(\boldsymbol{\beta}^{*}))=0.

Thus, $\xi_{\phi_{t}}$ converges to an optimal sampler. ∎

A.1 Proof of Theorem 5.3

Our derivation is based on the policy gradient works by Agarwal et al. [4] and Xu et al. [25]. Due to the softmax policy parameterization, the following is satisfied:

\pi_{t+1}(a|\bar{s})=\pi_{t}(a|\bar{s})\frac{\exp(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))}{Z_{t}(\bar{s})}, \quad (18)

where Z_{t}(\bar{s}):=\sum_{a\in A}\pi_{t}(a|\bar{s})\exp(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)). By the natural policy gradient,

\theta_{t+1}=\begin{cases}\theta_{t}+\alpha_{t}(A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t})/(1-\gamma)&\text{if}\;J_{C_{i}}(\theta_{t})\leq d_{i}\;\forall i,\\\theta_{t}+\alpha_{t}(\alpha_{t}\nu_{t}A_{R}^{t}-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t})/(1-\gamma)&\text{otherwise}.\end{cases} \quad (19)
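For readers who prefer code, the following is a minimal tabular sketch of the update rule (19), assuming the advantage tables A_{R}^{t} and A_{i,g_{i}}^{t} and the weights \lambda_{t,i}, \nu_{t} have already been computed; the function and variable names are illustrative and this is not the exact implementation used in the experiments.

import numpy as np

def npg_step(theta, A_R, A_C, lam, nu, J_C, d, alpha, gamma):
    """One softmax natural-policy-gradient step following Eq. (19).

    theta: (S, A) logits; A_R: (S, A) reward advantage;
    A_C: (N, S, A) risk-value advantages for the N cost constraints;
    lam: (N,) multipliers; nu: scalar; J_C, d: (N,) cost returns and thresholds.
    """
    penalty = np.tensordot(lam, A_C, axes=1)          # sum_i lam_i * A_{i,g_i}
    if np.all(J_C <= d):                              # constraints satisfied
        direction = A_R - alpha * penalty
    else:                                             # constraints violated
        direction = alpha * nu * A_R - penalty
    return theta + alpha * direction / (1.0 - gamma)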

Before proving Theorem 5.3, we first present several useful lemmas.

Lemma A.1.

\log Z_{t}(\bar{s}) is non-negative.

Proof.

Due to the advantage functions in (19), \sum_{a\in A}\pi_{t}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))=0. Using the fact that the logarithmic function is concave,

\log Z_{t}(\bar{s})=\log\sum_{a\in A}\pi_{t}(a|\bar{s})\exp(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))\geq\sum_{a\in A}\pi_{t}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))=0.

As a result, \log Z_{t}(\bar{s})\geq 0. ∎

Lemma A.2.

D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))\leq\max_{a}|\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)|\leq||\theta_{t+1}-\theta_{t}||_{\infty}.

Proof.

Using Pinsker's inequality,

D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))\leq\sqrt{\frac{1}{2}D_{\mathrm{KL}}(\pi_{t+1}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))}
=\sqrt{\frac{1}{2}\sum_{a}\pi_{t+1}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})/\pi_{t}(a|\bar{s}))}
=\sqrt{\frac{1}{2}\Big(\sum_{a}\pi_{t+1}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))-\log Z_{t}(\bar{s})\Big)}
\overset{\text{(i)}}{\leq}\sqrt{\frac{1}{2}\sum_{a}\pi_{t+1}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))}
\overset{\text{(ii)}}{=}\sqrt{\frac{1}{2}\sum_{a}(\pi_{t+1}(a|\bar{s})-\pi_{t}(a|\bar{s}))(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))}
\leq\sqrt{\max_{a}|\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)|\,D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))},

where (i) follows from Lemma A.1, and (ii) follows from the fact that \sum_{a\in A}\pi_{t}(a|\bar{s})(\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a))=0. As a result,

D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))\leq\max_{a}|\theta_{t+1}(\bar{s},a)-\theta_{t}(\bar{s},a)|\leq||\theta_{t+1}-\theta_{t}||_{\infty}. ∎

See 5.3

Proof.

We divide the analysis into two cases: 1) when the constraints are satisfied and 2) when the constraints are violated. First, let us consider the case where the constraints are satisfied. In this case, the policy is updated as follows:

\theta_{t+1}=\theta_{t}+\frac{\alpha_{t}}{1-\gamma}\left(A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\right). \quad (20)

Using Theorem 5.2,

J_{R}(\theta_{t+1})-J_{R}(\theta_{t})+\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (21)
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}\pi_{t+1}(a|\bar{s})(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}(\pi_{t+1}(a|\bar{s})-\pi_{t}(a|\bar{s}))(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
\overset{\text{(i)}}{\leq}\frac{2}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))
\overset{\text{(ii)}}{\leq}\frac{2||\theta_{t+1}-\theta_{t}||_{\infty}}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}
\leq\frac{2||\theta_{t+1}-\theta_{t}||_{\infty}}{1-\gamma}\big{|}\big{|}A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\big{|}\big{|}_{\infty}=\frac{2\alpha_{t}}{(1-\gamma)^{2}}\big{|}\big{|}A_{R}^{t}-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\big{|}\big{|}_{\infty}^{2}
\leq 2\alpha_{t}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}/(1-\gamma)^{4},

where (i) follows from Hölder's inequality, and (ii) follows from Lemma A.2. Using (18) and (20),

J_{R}(\theta_{t+1})-J_{R}(\theta_{t})+\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (22)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})(D_{\mathrm{KL}}(\pi_{t+1}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\geq\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\log Z_{t}(\bar{s})\overset{\text{(i)}}{\geq}\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s}),

where (i) is from the fact that d_{\rho}^{\pi}(\bar{s})=(1-\gamma)\sum_{t}\gamma^{t}\mathbb{P}(\bar{s}_{t}=\bar{s})\geq(1-\gamma)\rho(\bar{s}). By combining (21) and (22), the following is satisfied for any \rho:

\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s})\leq\frac{2\alpha_{t}}{(1-\gamma)^{4}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}. \quad (23)

Then, the following is satisfied for any policy \hat{\pi}:

J_{R}(\hat{\pi})-J_{R}(\pi_{t})+\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{t})-J_{C_{i}}(\hat{\pi})) \quad (24)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})(A_{R}^{t}(\bar{s},a)-\alpha_{t}\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\overset{\text{(i)}}{\leq}\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s})))
+\frac{2\alpha_{t}}{(1-\gamma)^{5}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2},

where (i) follows from (23) by substituting d_{\rho}^{\hat{\pi}} for \rho. Now, let us consider the second case, where the constraints are violated. In this case, the policy is updated as follows:

\theta_{t+1}=\theta_{t}+\frac{\alpha_{t}}{1-\gamma}(\alpha_{t}\nu_{t}A_{R}^{t}-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}). \quad (25)

As with (21) in the first case,

\alpha_{t}\nu_{t}(J_{R}(\theta_{t+1})-J_{R}(\theta_{t}))+\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (26)
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}\pi_{t+1}(a|\bar{s})(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a}(\pi_{t+1}(a|\bar{s})-\pi_{t}(a|\bar{s}))(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
\leq\frac{2}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}D_{\mathrm{TV}}(\pi_{t+1}(\cdot|\bar{s}),\pi_{t}(\cdot|\bar{s}))
\leq\frac{2||\theta_{t+1}-\theta_{t}||_{\infty}}{1-\gamma}\sum_{\bar{s}}d_{\rho}^{\pi_{t+1}}(\bar{s})\max_{a}\big{|}\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a)\big{|}
\leq\frac{2\alpha_{t}}{(1-\gamma)^{2}}\big{|}\big{|}\alpha_{t}\nu_{t}A_{R}^{t}-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}\big{|}\big{|}_{\infty}^{2}\leq 2\alpha_{t}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}/(1-\gamma)^{4}.

Using (18) and (25),

\alpha_{t}\nu_{t}(J_{R}(\theta_{t+1})-J_{R}(\theta_{t}))+\sum_{i}\lambda_{t,i}(J_{C_{i}}(\theta_{t})-J_{C_{i}}(\theta_{t+1})) \quad (27)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\sum_{a\in A}\pi_{t+1}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})(D_{\mathrm{KL}}(\pi_{t+1}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\geq\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{t+1}}(\bar{s})\log Z_{t}(\bar{s})\geq\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s}).

By combining (26) and (27), the following is satisfied for any \rho:

\frac{1-\gamma}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}\rho(\bar{s})\log Z_{t}(\bar{s})\leq\frac{2\alpha_{t}}{(1-\gamma)^{4}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}. \quad (28)

As in (24), the following is satisfied for any policy \hat{\pi}:

\alpha_{t}\nu_{t}(J_{R}(\hat{\pi})-J_{R}(\pi_{t}))+\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{t})-J_{C_{i}}(\hat{\pi})) \quad (29)
=\frac{1}{1-\gamma}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})(\alpha_{t}\nu_{t}A_{R}^{t}(\bar{s},a)-\sum_{i}\lambda_{t,i}A_{i,g_{i}}^{t}(\bar{s},a))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})\sum_{a\in A}\hat{\pi}(a|\bar{s})\log(\pi_{t+1}(a|\bar{s})Z_{t}(\bar{s})/\pi_{t}(a|\bar{s}))
=\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s}))+\log Z_{t}(\bar{s}))
\leq\frac{1}{\alpha_{t}}\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})(D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t}(\cdot|\bar{s}))-D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{t+1}(\cdot|\bar{s})))
+\frac{2\alpha_{t}}{(1-\gamma)^{5}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}.

Now, the main inequalities for both cases have been derived. Let us denote by \mathcal{N} the set of time steps at which the constraints are not violated; that is, if t\in\mathcal{N}, then J_{C_{i}}(\pi_{t})\leq d_{i}\;\forall i. Then, using (24) and (29), we have:

\sum_{t\in\mathcal{N}}\left(\alpha_{t}(J_{R}(\hat{\pi})-J_{R}(\pi_{t}))-\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\hat{\pi})-J_{C_{i}}(\pi_{t}))\right) \quad (30)
+\sum_{t\notin\mathcal{N}}\left(\alpha_{t}^{2}\nu_{t}(J_{R}(\hat{\pi})-J_{R}(\pi_{t}))-\alpha_{t}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\hat{\pi})-J_{C_{i}}(\pi_{t}))\right)
\leq\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\hat{\pi}}(\bar{s})D_{\mathrm{KL}}(\hat{\pi}(\cdot|\bar{s})||\pi_{0}(\cdot|\bar{s}))+\sum_{t\in\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}
+\sum_{t\notin\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}.

If t\notin\mathcal{N} and \lambda_{t,i}>0, then J_{C_{i}}(\pi_{t})>d_{i}, and \sum_{i}\lambda_{t,i}=1. Thus, \sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{f})-J_{C_{i}}(\pi_{t}))\leq-\eta. By substituting \pi_{f} for \hat{\pi} in (30),

\sum_{t\in\mathcal{N}}\left(\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t}))-\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{f})-J_{C_{i}}(\pi_{t}))\right)
+\sum_{t\notin\mathcal{N}}(\alpha_{t}\eta+\alpha_{t}^{2}\nu_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})))
\leq\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{f}}(\bar{s})D_{\mathrm{KL}}(\pi_{f}(\cdot|\bar{s})||\pi_{0}(\cdot|\bar{s}))
+\sum_{t}\alpha_{t}^{2}\underbrace{\frac{2\max(\alpha_{0},1)^{2}}{(1-\gamma)^{5}}(\max(\lambda_{\mathrm{max}},1)R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}}_{=:K_{1}}.

For brevity, we denote D_{\mathrm{KL}}(\pi_{f}||\pi_{0})=\sum_{\bar{s}\in\bar{S}}d_{\rho}^{\pi_{f}}(\bar{s})D_{\mathrm{KL}}(\pi_{f}(\cdot|\bar{s})||\pi_{0}(\cdot|\bar{s})). Since \sum_{t\notin\mathcal{N}}\alpha_{t}=\sum_{t}\alpha_{t}-\sum_{t\in\mathcal{N}}\alpha_{t}, (30) can be modified as follows:

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})-\eta)+\eta\sum_{t}\alpha_{t}
\leq D_{\mathrm{KL}}(\pi_{f}||\pi_{0})+K_{1}\sum_{t}\alpha_{t}^{2}+\sum_{t\in\mathcal{N}}\left(\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi_{f})-J_{C_{i}}(\pi_{t}))\right)
-\sum_{t\notin\mathcal{N}}\left(\alpha_{t}^{2}\nu_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t}))\right)
\leq D_{\mathrm{KL}}(\pi_{f}||\pi_{0})+\sum_{t}\alpha_{t}^{2}\underbrace{\left(K_{1}+2N\lambda_{\mathrm{max}}\frac{C_{\mathrm{max}}}{1-\gamma}+2\lambda_{\mathrm{max}}\frac{R_{\mathrm{max}}}{1-\gamma}\right)}_{=:K_{2}}.

It can be simplified as follows:

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})-\eta)+\eta\sum_{t}\alpha_{t}\leq D_{\mathrm{KL}}(\pi_{f}||\pi_{0})+K_{2}\sum_{t}\alpha_{t}^{2}.

If \pi_{0} assigns positive probability to all actions, D_{\mathrm{KL}}(\hat{\pi}||\pi_{0}) is bounded for all \hat{\pi}. Additionally, due to the Robbins-Monro condition, the RHS converges to some real number as T goes to infinity. Since \eta\sum_{t}\alpha_{t} on the LHS goes to infinity, \sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi_{f})-J_{R}(\pi_{t})-\eta) must go to minus infinity. This means that \sum_{t\in\mathcal{N}}\alpha_{t}=\infty since |J_{R}(\pi)|\leq R_{\mathrm{max}}/(1-\gamma). By substituting \pi^{*} for \hat{\pi} in (30),

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi^{*})-J_{R}(\pi_{t}))-\sum_{t\notin\mathcal{N}}\alpha_{t}\left(\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi^{*})-J_{C_{i}}(\pi_{t}))\right)
\leq D_{\mathrm{KL}}(\pi^{*}||\pi_{0})+\sum_{t\in\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(R_{\mathrm{max}}+\alpha_{t}N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}
+\sum_{t\notin\mathcal{N}}\frac{2\alpha_{t}^{2}}{(1-\gamma)^{5}}(\alpha_{t}\lambda_{\mathrm{max}}R_{\mathrm{max}}+N\lambda_{\mathrm{max}}C_{\mathrm{max}})^{2}
+\sum_{t\in\mathcal{N}}\alpha_{t}^{2}\sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi^{*})-J_{C_{i}}(\pi_{t}))-\sum_{t\notin\mathcal{N}}\alpha_{t}^{2}\nu_{t}(J_{R}(\pi^{*})-J_{R}(\pi_{t}))
\leq D_{\mathrm{KL}}(\pi^{*}||\pi_{0})+K_{2}\sum_{t}\alpha_{t}^{2}.

Since \sum_{i}\lambda_{t,i}(J_{C_{i}}(\pi^{*})-J_{C_{i}}(\pi_{t}))\leq 0 for t\notin\mathcal{N}, the above inequality can be rewritten as follows:

\sum_{t\in\mathcal{N}}\alpha_{t}(J_{R}(\pi^{*})-J_{R}(\pi_{t}))\leq D_{\mathrm{KL}}(\pi^{*}||\pi_{0})+K_{2}\sum_{t}\alpha_{t}^{2}.

Since \sum_{t\in\mathcal{N}}\alpha_{t}=\infty and the Robbins-Monro condition is satisfied, the following must hold:

J_{R}(\pi^{*})-\lim_{t\to\infty}J_{R}(\pi_{\mathcal{N}[t]})=0,

where \mathcal{N}[t] is the t-th element of \mathcal{N}. Consequently, the policy converges to an optimal policy. ∎

Appendix B Weight Selection Strategy

In this section, we first identify \nu_{t} and \lambda_{t,i} for existing primal approach-based safe RL algorithms and then present our strategy. For the existing methods, we analyze several safe RL algorithms, including constrained policy optimization (CPO, [2]), projection-based constrained policy optimization (PCPO, [27]), constraint-rectified policy optimization (CRPO, [25]), penalized proximal policy optimization (P3O, [31]), interior-point policy optimization (IPO, [18]), and safe distributional actor-critic (SDAC, [15]). Since the policy gradient is obtained differently depending on whether the constraints are satisfied (CS) or violated (CV), we analyze the existing methods for each of the two cases.

Case 1: Constraints are satisfied (CS). First, CPO [2] and SDAC [15] find a direction of the policy gradient by solving the following problem with a linear objective and linear and quadratic constraints (LQCLP):

\max_{g}\nabla J_{R}(\theta_{t})^{T}g\;\mathbf{s.t.}\;\nabla J_{C_{i}}(\theta_{t})^{T}g+J_{C_{i}}(\theta_{t})\leq d_{i}\;\forall i,\;\frac{1}{2}g^{T}F(\theta_{t})g\leq\epsilon,

where \epsilon is a trust region size. The LQCLP can be addressed by solving the following dual problem:

\lambda^{*},\nu^{*}=\underset{\lambda\geq 0,\nu\geq 0}{\arg\max}-\nabla J_{R}(\theta_{t})^{T}g^{*}(\lambda,\nu)+\sum_{i=1}^{N}\lambda_{i}(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda,\nu)+J_{C_{i}}(\theta_{t})-d_{i})
+\nu(\frac{1}{2}g^{*}(\lambda,\nu)^{T}F(\theta_{t})g^{*}(\lambda,\nu)-\epsilon),

where g^{*}(\lambda,\nu)=\frac{1}{\nu}F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\sum_{i}\lambda_{i}\nabla J_{C_{i}}(\theta_{t})\right). Then, the policy is updated in the following direction:

g_{t}=F^{\dagger}(\theta_{t})(\nabla J_{R}(\theta_{t})-\alpha_{t}\sum_{i}\min(\lambda_{i}^{*}/\alpha_{t},\lambda_{\mathrm{max}})\nabla J_{C_{i}}(\theta_{t})), \quad (31)

which results in \lambda_{t,i}=\min(\lambda_{i}^{*}/\alpha_{t},\lambda_{\mathrm{max}}). P3O [31], PCPO [27], and CRPO [25] update the policy using only the objective function in this case as follows:

g_{t}=F^{\dagger}(\theta_{t})\nabla J_{R}(\theta_{t})\Rightarrow\lambda_{t,i}=0.

IPO [18] updates the policy in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\kappa\sum_{i}\nabla J_{C_{i}}(\theta_{t})/(d_{i}-J_{C_{i}}(\theta_{t}))\right),

where \kappa is a penalty coefficient. As a result, \lambda_{t,i} is computed as \min(\kappa/((d_{i}-J_{C_{i}}(\theta_{t}))\alpha_{t}),\lambda_{\mathrm{max}}).

Case 2: Constraints are violated (CV). CPO, PCPO, and IPO do not handle constraint violation in multiple-constraint settings, so we exclude them in this case. First, SDAC finds a recovery direction by solving the following quadratic program (QP):

\min_{g}\frac{1}{2}g^{T}F(\theta_{t})g\;\mathbf{s.t.}\;\nabla J_{C_{i}}(\theta_{t})^{T}g+J_{C_{i}}(\theta_{t})\leq d_{i}\;\text{for}\;i\in\{i|J_{C_{i}}(\theta_{t})>d_{i}\},

where we denote \{i|J_{C_{i}}(\theta_{t})>d_{i}\} by I_{\mathrm{CV}}. The QP can be addressed by solving the following dual problem:

\lambda^{*}=\underset{\lambda\geq 0}{\arg\max}\frac{1}{2}g^{*}(\lambda)^{T}F(\theta_{t})g^{*}(\lambda)+\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda)+J_{C_{i}}(\theta_{t})-d_{i}),

where g^{*}(\lambda)=-F^{\dagger}(\theta_{t})\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}\nabla J_{C_{i}}(\theta_{t}). Then, the policy is updated in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\sum_{i\in I_{\mathrm{CV}}}-\frac{\lambda_{i}^{*}}{\sum_{j\in I_{\mathrm{CV}}}\lambda_{j}^{*}}\nabla J_{C_{i}}(\theta_{t})\right)\Rightarrow\nu_{t}=0,\;\lambda_{t,i}=\frac{\lambda_{i}^{*}}{\sum_{j\in I_{\mathrm{CV}}}\lambda_{j}^{*}}\;\text{if}\;i\in I_{\mathrm{CV}}\;\text{else}\;0.

If there is only a single constraint, PCPO is compatible with SDAC. Next, CRPO first selects a violated constraint, whose index is denoted by k, and calculates the policy gradient to minimize the selected constraint as follows:

g_{t}=F^{\dagger}(\theta_{t})\left(-\nabla J_{C_{k}}(\theta_{t})\right)\Rightarrow\nu_{t}=0,\;\lambda_{t,i}=1\;\text{if}\;i=k\;\text{else}\;0.

If there is only a single constraint, CPO is compatible with CRPO. Finally, P3O updates the policy in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\kappa\sum_{i\in I_{\mathrm{CV}}}\nabla J_{C_{i}}(\theta_{t})\right),

which results in \nu_{t}=\min(1/(\alpha_{t}\kappa|I_{\mathrm{CV}}|),\lambda_{\mathrm{max}}) and \lambda_{t,i}=1/|I_{\mathrm{CV}}| if i\in I_{\mathrm{CV}} else 0.

Implementation of the trust region method. To this point, we have obtained the direction of the policy gradient g_{t}, allowing the policy to be updated to \theta_{t}+\alpha_{t}g_{t}. If we want to update the policy using a trust region method, as introduced in TRPO [21], we can adjust the learning rate \alpha_{t} as follows:

\alpha_{t}=\frac{\epsilon_{t}}{\mathrm{Clip}(\sqrt{g_{t}^{T}F(\theta_{t})g_{t}},g_{\mathrm{min}},g_{\mathrm{max}})}, \quad (32)

where \epsilon_{t} is a trust region size following the Robbins-Monro condition, g_{\mathrm{min}} and g_{\mathrm{max}} are hyperparameters, and \mathrm{Clip}(x,a,b)=\min(\max(x,a),b). By adjusting the learning rate, the policy can be updated mostly within the trust region, defined as \{\theta_{t}+g|g^{T}F(\theta_{t})g\leq\epsilon_{t}^{2}\}, while the learning rate still satisfies the Robbins-Monro condition. Accordingly, we also need to modify \lambda_{t,i} for the CS case and \nu_{t} for the CV case, as they are expressed with the learning rate \alpha_{t}. We explain this modification process using SDAC as an example. The direction of the policy gradient of SDAC for the CS case is expressed as follows:

g_{t}=F^{\dagger}(\theta_{t})(\nabla J_{R}(\theta_{t})-\epsilon_{t}\sum_{i}\min(\lambda_{i}^{*}/\epsilon_{t},\lambda_{\mathrm{max}})\nabla J_{C_{i}}(\theta_{t})),

where \alpha_{t} in (31) is replaced with \epsilon_{t}. The learning rate \alpha_{t} is then calculated using g_{t} and (32). Now, we need to express g_{t} as F^{\dagger}(\theta_{t})(\nabla J_{R}(\theta_{t})-\alpha_{t}\sum_{i}\lambda_{t,i}\nabla J_{C_{i}}(\theta_{t})). To this end, we can use \alpha_{t}=\epsilon_{t}/C, where C=\mathrm{Clip}(\sqrt{g_{t}^{T}F(\theta_{t})g_{t}},g_{\mathrm{min}},g_{\mathrm{max}}). Consequently, \lambda_{t,i} is expressed as follows:

\lambda_{t,i}=C\min(\lambda_{i}^{*}/\epsilon_{t},\lambda_{\mathrm{max}}),

and the maximum of \lambda_{t,i} and \nu_{t} is adjusted to g_{\mathrm{max}}\lambda_{\mathrm{max}}.
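As an illustration, the following is a minimal Python sketch of the learning-rate adjustment in (32), assuming the gradient direction g_t and a routine for Fisher-vector products are available; the helper names are placeholders rather than the actual implementation.

import numpy as np

def trust_region_lr(g_t, fisher_vec_prod, eps_t, g_min=1e-3, g_max=10.0):
    """Eq. (32): alpha_t = eps_t / Clip(sqrt(g_t^T F g_t), g_min, g_max).

    g_t: policy-gradient direction (1-D array);
    fisher_vec_prod: callable returning F(theta_t) @ v for a vector v.
    """
    g_norm = np.sqrt(np.dot(g_t, fisher_vec_prod(g_t)))  # sqrt(g^T F g)
    return eps_t / np.clip(g_norm, g_min, g_max)

# Hypothetical usage with an identity Fisher matrix:
# alpha_t = trust_region_lr(g_t, lambda v: v, eps_t=1e-3)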

Proposed strategy. Now, we introduce the proposed strategy for determining \lambda_{t,i} and \nu_{t}. We first consider the CV case. Kim et al. [15] empirically showed that processing multiple constraints simultaneously at each update step effectively projects the current policy onto a feasible set of policies. Therefore, we adopt the same strategy as SDAC, which handles the constraints simultaneously, and it is written as follows:

\nu_{t}=0,\;\lambda_{t,i}=\begin{cases}\lambda_{i}^{*}/\left(\sum_{j\in I_{\mathrm{CV}}}\lambda_{j}^{*}\right)&\text{if}\;i\in I_{\mathrm{CV}},\\0&\text{otherwise,}\end{cases}

where \lambda^{*}={\arg\max}_{\lambda\geq 0}\frac{1}{2}g^{*}(\lambda)^{T}F(\theta_{t})g^{*}(\lambda)+\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}\left(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda)+J_{C_{i}}(\theta_{t})-d_{i}\right) and g^{*}(\lambda)=-F^{\dagger}(\theta_{t})\sum_{i\in I_{\mathrm{CV}}}\lambda_{i}\nabla J_{C_{i}}(\theta_{t}). Next, let us consider the CS case. We hypothesize that concurrently addressing both the constraints and the objective when calculating the policy gradient can lead to more stable constraint satisfaction. Therefore, CPO, SDAC, and IPO are suitable candidates. However, CPO and SDAC require solving the LQCLP, whose dual problem is nonlinear, and IPO can introduce computational errors when J_{C_{i}}(\theta_{t}) is close to d_{i}. To address these challenges, we propose to calculate the policy gradient direction by solving a QP instead of an LQCLP. By applying the concept of transforming objectives into constraints [16] to the safe RL problem, we can formulate the following QP:

\min_{g}\frac{1}{2}g^{T}F(\theta_{t})g\;\mathbf{s.t.}\;\nabla J_{R}(\theta_{t})^{T}g\geq e,\;\nabla J_{C_{i}}(\theta_{t})^{T}g+J_{C_{i}}(\theta_{t})\leq d_{i}\;\forall i,

where e:=\epsilon_{t}\sqrt{\nabla J_{R}(\theta_{t})^{T}F^{\dagger}(\theta_{t})\nabla J_{R}(\theta_{t})}. According to [16], e represents the maximum value of \nabla J_{R}(\theta_{t})^{T}g when g lies within the trust region, \{g|g^{T}F(\theta_{t})g\leq\epsilon_{t}^{2}\}. Therefore, the transformed constraint, \nabla J_{R}(\theta_{t})^{T}g\geq e, has the same effect of improving the objective function J_{R}(\theta) within the trust region. The QP is then addressed by solving the following dual problem:

\lambda^{*},\nu^{*}=\underset{\lambda\geq 0,\nu\geq 0}{\arg\max}\frac{1}{2}g^{*}(\lambda,\nu)^{T}F(\theta_{t})g^{*}(\lambda,\nu)+\nu(e-\nabla J_{R}(\theta_{t})^{T}g^{*}(\lambda,\nu))
+\sum_{i=1}^{N}\lambda_{i}(\nabla J_{C_{i}}(\theta_{t})^{T}g^{*}(\lambda,\nu)+J_{C_{i}}(\theta_{t})-d_{i}),

where g^{*}(\lambda,\nu)=F^{\dagger}(\theta_{t})(\nu\nabla J_{R}(\theta_{t})-\sum_{i=1}^{N}\lambda_{i}\nabla J_{C_{i}}(\theta_{t})). Then, the policy is updated in the following direction:

g_{t}=F^{\dagger}(\theta_{t})\left(\nabla J_{R}(\theta_{t})-\epsilon_{t}\sum_{i}\min\left(\frac{\lambda_{i}^{*}}{\nu^{*}\epsilon_{t}},\lambda_{\mathrm{max}}\right)\nabla J_{C_{i}}(\theta_{t})\right),

and the learning rate is calculated using (32) to ensure that the policy is updated within the trust region. Finally, \lambda_{t,i} is expressed as C\min(\lambda_{i}^{*}/(\nu^{*}\epsilon_{t}),\lambda_{\mathrm{max}}), where C=\mathrm{Clip}(\sqrt{g_{t}^{T}F(\theta_{t})g_{t}},g_{\mathrm{min}},g_{\mathrm{max}}).
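To make the CS-case computation concrete, the following is a rough Python sketch that obtains the direction above by running projected gradient ascent on the dual of the QP; it is an illustration under simplifying assumptions (e.g., an explicit approximation of F^{\dagger} and a fixed ascent scheme), not the authors' exact solver, and all names are placeholders.

import numpy as np

def cs_direction(grad_R, grad_C, J_C, d, F_inv, eps_t, lam_max, iters=500, lr=1e-2):
    """Sketch of the CS-case QP via projected gradient ascent on its dual.

    grad_R: (P,) reward gradient; grad_C: (N, P) cost gradients;
    J_C, d: (N,) cost returns and thresholds; F_inv: (P, P) stand-in for F^dagger.
    Returns the update direction g_t.
    """
    N = grad_C.shape[0]
    e = eps_t * np.sqrt(grad_R @ F_inv @ grad_R)       # objective-improvement target
    lam, nu = np.zeros(N), 1.0

    def g_star(lam, nu):
        return F_inv @ (nu * grad_R - grad_C.T @ lam)  # minimizer of the Lagrangian

    for _ in range(iters):                             # dual ascent with projection
        g = g_star(lam, nu)
        lam = np.clip(lam + lr * (grad_C @ g + J_C - d), 0.0, None)
        nu = max(nu + lr * (e - grad_R @ g), 1e-8)

    weights = np.minimum(lam / (nu * eps_t), lam_max)  # lambda_{t,i} before scaling by C
    return F_inv @ (grad_R - eps_t * (grad_C.T @ weights))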

Appendix C Implementation Details

Algorithm 2 Spectral-Risk-Constrained Policy Optimization (SRCPO)
  Input: Policy parameter \theta, sampler parameter \phi, critic parameter \psi, and replay buffer D.
  for epochs = 1 to P do
     for episodes = 1 to E do
        if u\leq\epsilon, where u\sim U[0,1], then
           Sample \boldsymbol{\beta} from a uniform distribution: \beta_{i}[j]\sim U(\beta_{i}[j-1],C_{\mathrm{max}}/(1-\gamma)) \forall i and \forall j, where \beta_{i}[0]=0.
        else
           Sample \boldsymbol{\beta}\sim\xi_{\phi}.
        end if
        Collect a trajectory \tau=\{\bar{s}_{t},a_{t},r_{t},\{c_{i,t}\}_{i=1}^{N}\}_{t=0}^{T} using \pi_{\theta}(a|\bar{s};\boldsymbol{\beta}), and store (\boldsymbol{\beta},\tau) in D.
     end for
     Set G^{\theta}=\{\} and G^{\phi}=\{\}.
     for updates = 1 to U do
        Sample (\boldsymbol{\beta},\tau)\sim D.
        Update the critics, Z_{R,\psi}(\bar{s},a;\boldsymbol{\beta}) and Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta}), using the quantile regression loss.
        Calculate the policy gradient of \pi_{\theta}(a|\bar{s};\boldsymbol{\beta}) using the update rule described in Section 5.2, and store it in G^{\theta}.
        Calculate the gradient of the sampler \xi_{\phi} using the update rule described in (14), and store it in G^{\phi}.
     end for
     Update the policy parameter with the average of G^{\theta}.
     Update the sampler parameter with the average of G^{\phi}.
  end for
  Output: \pi_{\theta}(\cdot|\cdot;\boldsymbol{\beta}), where \boldsymbol{\beta}\sim\xi_{\phi}.

In this section, we present a practical algorithm that utilizes function approximators, such as neural networks, to tackle more complex tasks. The comprehensive procedure is presented in Algorithm 2.

First, it is necessary to train individual policies for various \boldsymbol{\beta}, so we model the policy network to be conditioned on \boldsymbol{\beta}, expressed as \pi_{\theta}(a|\bar{s};\boldsymbol{\beta}). This network structure enables the implementation of policies with various behaviors by conditioning the policy on different \boldsymbol{\beta} values. In order to train this \boldsymbol{\beta}-conditioned policy across a broad range of the \boldsymbol{\beta} space, an \epsilon-greedy strategy can be used. At the beginning of each episode, \boldsymbol{\beta} is sampled uniformly with probability \epsilon; otherwise, it is sampled using the sampler \xi_{\phi}. In our implementation, we disabled the uniform sampling process by setting \epsilon=0 to reduce fine-tuning efforts, but it can still be used to enhance practical performance.
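For concreteness, a minimal Python sketch of this episode-level \boldsymbol{\beta} sampling is given below, assuming N constraints, M-1 components per constraint, and a sampler object exposing a sample() method; the names are placeholders rather than the actual code.

import numpy as np

def sample_beta(sampler, eps, N, M, c_max, gamma, rng=np.random.default_rng()):
    """Episode-level beta sampling of Algorithm 2 (sketch).

    With probability eps, beta_i[j] is drawn uniformly with the ordering
    beta_i[0]=0 <= beta_i[1] <= ... <= C_max/(1-gamma); otherwise beta ~ sampler.
    """
    if rng.uniform() <= eps:
        upper = c_max / (1.0 - gamma)
        beta = np.zeros((N, M - 1))
        for i in range(N):
            prev = 0.0
            for j in range(M - 1):
                prev = rng.uniform(prev, upper)   # beta_i[j] ~ U(beta_i[j-1], upper)
                beta[i, j] = prev
        return beta
    return sampler.sample()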

After collecting rollouts with the sampled \boldsymbol{\beta}, the critic, policy, and sampler need to be updated. For the critic, a distributional target is calculated using the TD(\lambda) method [15], and we update the critic networks to minimize the quantile regression loss [11], defined as follows:

\mathcal{L}(\psi):=\sum_{l=1}^{L}\underset{(\bar{s},a,\boldsymbol{\beta})\sim D}{\mathbb{E}}\left[\underset{z\sim\hat{Z}_{C_{i}}(\bar{s},a;\boldsymbol{\beta})}{\mathbb{E}}\bigg{[}\rho_{\frac{2l-1}{2L}}\Big{(}z-Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta})[l]\Big{)}\bigg{]}\right],

where \rho_{l}(u):=u\cdot(l-\mathbf{1}_{u<0}), and \hat{Z}_{C_{i}}(\bar{s},a;\boldsymbol{\beta}) is the target distribution of Z_{C_{i},\psi}(\bar{s},a;\boldsymbol{\beta}). The policy gradient for a specific \boldsymbol{\beta} can be calculated according to the update rule described in Section 5.2. Consequently, the policy is updated using the policy gradient calculated with \boldsymbol{\beta} and rollouts, which are extracted from the replay buffer. Finally, the gradient of the sampler for a given \boldsymbol{\beta} can be calculated as in (14), so the sampler is updated with the average of gradients, which are calculated across multiple \boldsymbol{\beta}.
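As an illustration of the loss above, the following is a minimal numpy sketch of the quantile regression loss for one state-action pair, assuming the target samples and the predicted quantile values are given as arrays; it is meant to clarify the formula rather than reproduce the exact training code.

import numpy as np

def quantile_regression_loss(pred_quantiles, target_samples):
    """Quantile regression loss with midpoints tau_l = (2l-1)/(2L).

    pred_quantiles: (L,) predicted quantile values Z_psi[l];
    target_samples: (K,) samples from the target distribution Z_hat.
    """
    L = pred_quantiles.shape[0]
    taus = (2.0 * np.arange(1, L + 1) - 1.0) / (2.0 * L)     # (L,)
    u = target_samples[None, :] - pred_quantiles[:, None]    # (L, K) residuals z - Z[l]
    loss = u * (taus[:, None] - (u < 0.0).astype(float))     # rho_tau(u)
    return loss.mean()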

Appendix D Experimental Details

In this section, we provide details on tasks, network structure, and hyperparameter settings.

Figure 5: Rendered images of the Safety Gymnasium and the legged robot locomotion tasks: (a) point goal task, (b) car button task, (c) quadrupedal robot, and (d) bipedal robot.

D.1 Task Description

There are two environments, legged robot locomotion [15] and Safety Gymnasium [14], whose rendered images are presented in Figure 5.

Legged Robot Locomotion. The legged robot locomotion tasks aim to control bipedal and quadrupedal robots to maintain a specified target velocity, which consists of linear velocity in the X and Y directions and rotational velocity in the Z direction, without falling. Each robot takes actions as PD target joint positions at a frequency of 50 Hz, and a PD controller in each joint operates at 500 Hz. The dimension of the action space corresponds to the number of motors: ten for the bipedal robot and 12 for the quadrupedal robot. The state space includes the command, gravity vector, current joint positions and velocities, foot contact phase, and a history of the joint positions and velocities. The reward function is defined as the negative of the squared Euclidean distance between the current velocity and the target velocity. There are three cost functions to prevent robots from falling. The first cost is to balance the robot, which is formulated as \mathbf{1}_{g[2]/||g||_{2}\geq\cos(\pi/12)}, where g is the gravity vector. This function is activated when the base frame of the robot is tilted more than 15 degrees. The second cost is to maintain the height of the robot above a predefined threshold, which is expressed as \mathbf{1}_{h\leq b}, where h is the height of the robot, and b is set to 0.7 for the bipedal robot and 0.35 for the quadrupedal robot. The third cost is to match the foot contact state with the desired timing. It is defined as the average number of feet that do not touch the floor at the desired time. We set the thresholds for these cost functions to 0.025/(1-\gamma), 0.025/(1-\gamma), and 0.4/(1-\gamma), respectively, where the factor 1/(1-\gamma) converts the threshold from an average per-step scale to an infinite-horizon scale.
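The following is a small Python sketch of these three cost signals, written only to make the indicator definitions concrete; the sign convention for the gravity vector and the contact-mismatch form of the third cost are assumptions, and the function is not taken from the released code.

import numpy as np

def locomotion_costs(gravity_vec, height, feet_off_ground, feet_should_touch, robot="biped"):
    """Sketch of the three locomotion cost signals (illustrative assumptions only).

    gravity_vec: (3,) gravity vector in the base frame; height: base height;
    feet_off_ground, feet_should_touch: boolean arrays over the feet.
    """
    # Cost 1: balance indicator, active when the base tilts beyond 15 degrees
    # (the direction of the inequality depends on the gravity-vector convention).
    c_balance = float(gravity_vec[2] / np.linalg.norm(gravity_vec) >= np.cos(np.pi / 12))
    # Cost 2: height indicator with threshold b = 0.7 (biped) or 0.35 (quadruped).
    b = 0.7 if robot == "biped" else 0.35
    c_height = float(height <= b)
    # Cost 3: average number of feet off the ground at times they should be touching.
    c_contact = float(np.mean(feet_off_ground & feet_should_touch))
    return c_balance, c_height, c_contact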

Safety Gymnasium. We use the goal and button tasks in the Safety Gymnasium [14], and both tasks employ point and car robots. The goal task is to control the robot to navigate to a target goal position without passing through hazard areas. The button task is to control the robot to push a designated button among several buttons without passing through hazard areas or hitting undesignated buttons. The dimension of the action space is two for all tasks. The state space dimensions are 60, 72, 76, and 88 for the point goal, car goal, point button, and car button tasks, respectively. For more details on the actions and states, please refer to [14]. The reward function is defined to reduce the distance between the robot and either the goal or the designated button, with a bonus awarded at the moment of task completion. When the robot traverses a hazard area or collides with an obstacle, it incurs a cost of one, and the threshold for this cost is set to 0.025/(1-\gamma).

D.2 Network Structure

Across all algorithms, we use the same network architecture for the policy and critics. The policy network is structured to output the mean and standard deviation of a normal distribution, which is then squashed using the \mathrm{Tanh} function, as done in SAC [13]. The critic network is based on the quantile distributional critic [11]. Details of these structures are presented in Table 1. For the sampler \xi_{\phi}, we define a trainable variable \phi\in\mathbb{R}^{N\times(M-1)} and represent the mean of the truncated normal distribution as \mu_{i,\phi}[j]=\exp(\phi[i,j]). To reduce fine-tuning efforts, we fix the standard deviation of the truncated normal distribution to 0.05.
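A minimal PyTorch sketch of the policy head described above is given below; the network widths follow Table 1, while the way \boldsymbol{\beta} is concatenated to the state and the log-std clamping range are assumptions made for illustration.

import torch
import torch.nn as nn

class SquashedGaussianPolicy(nn.Module):
    """Sketch of the policy network in Table 1: two 512-unit LeakyReLU layers,
    a linear head for the mean and log-std, and a Tanh squashing as in SAC."""

    def __init__(self, obs_dim, beta_dim, act_dim, hidden=512):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim + beta_dim, hidden), nn.LeakyReLU(),
            nn.Linear(hidden, hidden), nn.LeakyReLU(),
        )
        self.mean_head = nn.Linear(hidden, act_dim)
        self.log_std_head = nn.Linear(hidden, act_dim)

    def forward(self, obs, beta):
        h = self.body(torch.cat([obs, beta], dim=-1))    # beta-conditioned features
        mean = self.mean_head(h)
        log_std = self.log_std_head(h).clamp(-5.0, 2.0)  # assumed clamping range
        dist = torch.distributions.Normal(mean, log_std.exp())
        return torch.tanh(dist.rsample())                # squashed action sample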

Table 1: Details of network structures.
Network                Parameter        Value
Policy network         Hidden layer     (512, 512)
                       Activation       LeakyReLU
                       Last activation  Linear
Reward critic network  Hidden layer     (512, 512)
                       Activation       LeakyReLU
                       Last activation  Linear
                       # of quantiles   25
                       # of ensembles   2
Cost critic network    Hidden layer     (512, 512)
                       Activation       LeakyReLU
                       Last activation  SoftPlus
                       # of quantiles   25
                       # of ensembles   2

D.3 Hyperparameter Settings

The hyperparameter settings for all algorithms applied to all tasks are summarized in Table 2.

Table 2: Description of hyperparameter settings.
Hyperparameter                          SRCPO (Ours)  CPPO [29]  CVaR-CPO [32]  SDAC [15]  SDPO [30]  WCSAC-D [26]
Discount factor                         0.99          0.99       0.99           0.99       0.99       0.99
Policy learning rate                    -             3e-5       3e-5           -          3e-5       3e-4
Critic learning rate                    3e-4          3e-4       3e-4           3e-4       3e-4       3e-4
# of target quantiles                   50            50         50             50         50         -
Coefficient of TD(lambda)               0.97          0.97       0.97           0.97       0.97       -
Soft update ratio                       -             -          -              -          -          0.995
Batch size                              10000         5000       5000           10000      5000       256
Steps per update                        1000          5000       5000           1000       5000       100
# of policy update iterations           10            20         20             -          20         10
# of critic update iterations           40            40         40             40         40         10
Trust region size                       0.001         0.001      0.001          0.001      0.001      -
Line search decay                       -             -          -              0.8        -          -
PPO clip ratio [22]                     -             0.2        0.2            -          0.2        -
Length of replay buffer                 100000        5000       5000           100000     5000       1000000
Learning rate for auxiliary variables   -             0.05       0.01           -          -          -
Learning rate for Lagrange multipliers  -             0.05       0.01           -          -          0.001
Log penalty coefficient                 -             -          -              -          5.0        -
Entropy threshold                       -             -          -              -          -          -|A|
Entropy learning rate                   -             -          -              -          -          0.001
Entropy coefficient                     0.001         -          -              0.001      -          -
Sampler learning rate                   1e-3          -          -              -          -          -
K for sampler target                    10            -          -              -          -          -

Appendix E Additional Experimental Results

E.1 Computational Resources

All experiments were conducted on a PC whose CPU and GPU are an Intel Xeon E5-2680 and NVIDIA TITAN Xp, respectively. The training time for each algorithm on the point goal task is reported in Table 3.

Table 3: Training time for the point goal task averaged over five runs.
SRCPO (Ours) CPPO CVaR-CPO SDAC WCSAC-D SDPO
Averaged training time 9h 51m 54s 6h 17m 16s 4h 43m 38s 8h 46m 8s 14h 18m 20s 6h 50m 39s

E.2 Experiments on Various Risk Measures

We have conducted studies on various risk measures in the main text. In this section, we first describe the definition of the Wang risk measure [24], show the visualization of the discretization process, and finally present the training curves of the experiments conducted in Section 8.1.

The Wang risk measure is defined as follows [24]:

\mathrm{Wang}_{\alpha}(X):=\int_{0}^{1}F_{X}^{-1}(u)d\zeta_{\alpha}(u),\quad\text{where}\;\zeta_{\alpha}(u):=\Phi(\Phi^{-1}(u)-\alpha),

\Phi is the cumulative distribution function (CDF) of the standard normal distribution, and \zeta_{\alpha} is called a distortion function. It can be expressed in the form of a spectral risk measure as follows:

\mathrm{Wang}_{\alpha}(X)=\int_{0}^{1}F_{X}^{-1}(u)\sigma_{\alpha}(u)du,\quad\text{where}\;\sigma_{\alpha}(u):=\frac{\phi(\Phi^{-1}(u)-\alpha)}{\phi(\Phi^{-1}(u))},

and \phi is the probability density function (PDF) of the standard normal distribution. Since \sigma_{\alpha}(u) goes to infinity as u approaches one, \sigma_{\alpha}(u) is not defined at u=1. Thus, \mathrm{Wang}_{\alpha} is not a spectral risk measure. Nevertheless, our parameterization method introduced in Section 6.1 can project \mathrm{Wang}_{\alpha} onto a discretized class of spectral risk measures.
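For reference, the spectrum \sigma_{\alpha} can be evaluated numerically with a few lines of Python using SciPy's standard normal PDF and quantile function; this snippet is only illustrative.

import numpy as np
from scipy.stats import norm

def wang_spectrum(u, alpha):
    """Spectrum sigma_alpha(u) = phi(Phi^{-1}(u) - alpha) / phi(Phi^{-1}(u)) of Wang_alpha."""
    x = norm.ppf(u)                      # Phi^{-1}(u)
    return norm.pdf(x - alpha) / norm.pdf(x)

u = np.linspace(0.01, 0.99, 99)
sigma = wang_spectrum(u, alpha=1.0)      # grows without bound as u approaches 1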

The discretization process can be implemented by minimizing (11), and the results are shown in Figure 6. In this process, the number of discretizations M is set to five. Additionally, visualizations of other risk measures are provided in Figure 7, and the optimized values of the parameters of the discretized spectrum, which are introduced in (10), are reported in Table 4. Note that CVaR is already an element of the discretized class of spectral risk measures, as observed in Figure 7. Therefore, CVaR can be precisely expressed using a parameter with M=1.
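A rough Python sketch of such a discretization is shown below: it fits an M-step spectrum to a target spectrum by minimizing the L1 distance on a grid with a generic optimizer. It is a stand-in for minimizing (11) under simplifying assumptions (an unconstrained reparameterization of the levels and breakpoints), not the exact procedure used in the paper.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def step_spectrum(u, etas, alphas):
    """Increasing step function sigma_tilde with levels etas and breakpoints alphas."""
    idx = np.searchsorted(alphas, u, side="right")      # which step each u falls into
    return np.asarray(etas)[idx]

def discretize(sigma, M=5, grid=np.linspace(1e-3, 1 - 1e-3, 2000)):
    """Fit an M-step spectrum to sigma by minimizing the L1 distance on a grid."""
    target = sigma(grid)

    def l1_loss(x):
        etas = np.sort(np.abs(x[:M]))                   # non-decreasing levels
        alphas = np.sort(1.0 / (1.0 + np.exp(-x[M:])))  # breakpoints in (0, 1)
        return np.trapz(np.abs(target - step_spectrum(grid, etas, alphas)), grid)

    x0 = np.concatenate([np.linspace(0.5, 2.0, M), np.zeros(M - 1)])
    res = minimize(l1_loss, x0, method="Nelder-Mead",
                   options={"maxiter": 20000, "xatol": 1e-6, "fatol": 1e-8})
    return res.x

# Example: discretize the Wang spectrum with alpha = 1.0 defined above.
params = discretize(lambda u: norm.pdf(norm.ppf(u) - 1.0) / norm.pdf(norm.ppf(u)))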

Finally, we conducted experiments on the point goal task to check how the results change with the risk level and the type of risk measure. The training curves for these experiments are shown in Figure 8.

Table 4: Discretization results.
\mathrm{Wang}_{\alpha}, \alpha=0.5:  \{\eta_{i}\}_{i=1}^{M}=[0.515,0.790,1.091,1.493,2.191],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.263,0.541,0.770,0.926]
\mathrm{Wang}_{\alpha}, \alpha=1.0:  \{\eta_{i}\}_{i=1}^{M}=[0.294,0.734,1.417,2.640,5.517],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.409,0.701,0.878,0.968]
\mathrm{Wang}_{\alpha}, \alpha=0.5:  \{\eta_{i}\}_{i=1}^{M}=[0.180,0.834,2.253,5.678,16.419],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.579,0.834,0.945,0.989]
\mathrm{Pow}_{\alpha}, \alpha=0.5:   \{\eta_{i}\}_{i=1}^{M}=[0.200,0.600,1.000,1.400,1.800],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.200,0.400,0.600,0.800]
\mathrm{Pow}_{\alpha}, \alpha=0.75:  \{\eta_{i}\}_{i=1}^{M}=[0.046,0.574,1.347,2.308,3.424],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.417,0.615,0.765,0.890]
\mathrm{Pow}_{\alpha}, \alpha=0.9:   \{\eta_{i}\}_{i=1}^{M}=[0.003,0.947,2.705,5.216,8.383],  \{\alpha_{i}\}_{i=1}^{M-1}=[0.701,0.821,0.898,0.955]
Figure 6: Visualization of the spectrum function of the Wang risk measure and the discretized results.
Figure 7: Visualization of the spectrum function of the Pow and CVaR risk measures and the discretized results.
Figure 8: Training curves of the point goal task with different risk levels and risk measures. Each column shows the results on the risk measure whose name appears in the plot title. The solid line in each graph represents the average of each metric, and the shaded area indicates the standard deviation scaled by 0.2. The results are obtained by training algorithms with five random seeds.