
Provably Safe Reinforcement Learning with Step-wise Violation Constraints

Nuoya Xiong
IIIS, Tsinghua University
[email protected]
   Yihan Du
IIIS, Tsinghua University
[email protected]
   Longbo Huang
IIIS, Tsinghua University
[email protected]
Abstract

We investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we focus on stricter step-wise violation constraints and do not assume the existence of safe actions, making our formulation more suitable for safety-critical applications that need to ensure safety in all decision steps but may not always possess safe actions, e.g., robot control and autonomous driving. We propose an efficient algorithm SUCBVI, which guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ or gap-dependent $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ step-wise violation and $\widetilde{\mathcal{O}}(\sqrt{H^{3}SAT})$ regret. Lower bounds are provided to validate the optimality in both violation and regret performance with respect to the number of states $S$ and the total number of steps $T$. Moreover, we further study an innovative safe reward-free exploration problem with step-wise violation constraints. For this problem, we design algorithm SRF-UCRL to find a near-optimal safe policy, which achieves nearly state-of-the-art sample complexity $\widetilde{\mathcal{O}}((\frac{S^{2}AH^{2}}{\varepsilon}+\frac{H^{4}SA}{\varepsilon^{2}})(\log(\frac{1}{\delta})+S))$, and guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation during exploration. Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results.

1 Introduction

In recent years, reinforcement learning (RL) (Sutton & Barto, 2018) has become a powerful framework for decision making and learning in unknown environments. Despite the ground-breaking success of RL in games (Lanctot et al., 2019), recommendation systems (Afsar et al., 2022) and complex tasks in simulation environments (Zhao et al., 2020), most existing RL algorithms focus on optimizing the cumulative reward and do not take the risk aspect into consideration, e.g., that the agent may run into catastrophic situations during control. The lack of strong safety guarantees hinders the application of RL to broader safety-critical scenarios such as autonomous driving, robotics and healthcare. For example, for robotic control in complex environments, it is crucial to prevent the robot from getting into dangerous situations, e.g., hitting walls or falling into water pools, at all times.

To handle the safety requirement, a common approach is to formulate safety as a long-term expected violation constraint in each episode. This approach focuses on seeking a policy whose cumulative expected violation in each episode is below a certain threshold. However, for applications where an agent needs to avoid disastrous situations throughout the decision process, e.g., a robot needs to avoid hitting obstacles at each step, merely reducing the long-term expected violation is not sufficient to guarantee safety.

Motivated by this fact, we investigate safe reinforcement learning with a more fine-grained constraint, called the step-wise violation constraint, which aggregates the nonnegative violations at every step, so that positive and negative violations cannot offset each other. We name this problem Safe-RL-SW. Our step-wise violation constraint differs from the prior expected violation constraint (Wachi & Sui, 2020; Efroni et al., 2020; Kalagarla et al., 2021) in two aspects: (i) Minimizing the step-wise violation enables the agent to learn an optimal policy that avoids unsafe regions deterministically, while reducing the expected violation only guarantees to find a policy with low expected violation, instead of a per-step zero-violation policy. (ii) Reducing the aggregated nonnegative violation allows us to control the risk at each step, while a small cumulative expected violation can still result in a large cost at some individual step and cause danger, if other steps with smaller costs offset the huge cost.

Our problem faces two unique challenges. First, the step-wise violation requires us to guarantee a small violation at each step, which demands very different algorithm design and analysis from that for the expected violation (Wachi & Sui, 2020; Efroni et al., 2020; Kalagarla et al., 2021). Second, in safety-critical scenarios, the agent needs to identify not only unsafe states but also potentially unsafe states, which are states that may appear to be safe but will ultimately lead to unsafe regions with a non-zero probability. For example, a self-driving car needs to learn to slow down or change directions early, foreseeing the potential danger in advance in order to ensure safe driving (Thomas et al., 2021). Existing safe RL works focus mainly on the expected violation (Wachi & Sui, 2020; Liu et al., 2021b; Wei et al., 2022), or assume no potentially unsafe states by imposing the prior knowledge of a safe action for each state (Amani et al., 2021). Hence, techniques in previous works cannot be applied to handle step-wise violations.

To systematically handle these two challenges, we formulate safety as an unknown cost function for each state without assuming safe actions, and consider minimizing the step-wise violation instead of the expected violation. We propose a general algorithmic framework called Safe UCBVI (SUCBVI). Specifically, in each episode, we first estimate the transition kernel and cost function in an optimistic manner, tending to regard a state as safe at the beginning of learning. After that, we introduce a novel dynamic programming procedure to identify potentially unsafe states and determine safe actions, based on our estimated transitions and costs. Finally, we employ the identified safe actions to conduct value iteration. This mechanism can adaptively update dangerous regions, and help the agent plan for the future, which keeps her away from all states that may lead to unsafe states. As our estimation becomes more accurate over time, the safety violation becomes smaller and eventually converges to zero. Note that without the assumption of safe actions, the agent knows nothing about the environment at the beginning. Thus, it is impossible for her to achieve an absolute zero violation. Nevertheless, we show that SUCBVI can achieve sub-linear $\widetilde{\mathcal{O}}(\sqrt{ST})$ cumulative violation or an $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ gap-dependent violation that is independent of $T$. This violation bound implies that as the RL game proceeds, the agent eventually learns how to avoid unsafe states. We also provide a matching lower bound to demonstrate the optimality of SUCBVI in both violation and regret.

Furthermore, we apply our step-wise safe RL framework to the reward-free exploration (Jin et al., 2020) setting. In this novel safe reward-free exploration problem, the agent needs to guarantee small step-wise violations during exploration, and also output a near-optimal safe policy. Our algorithm achieves $\varepsilon$ cumulative step-wise violation during exploration, and also identifies an $\varepsilon$-optimal and safe policy. See the definitions of $\varepsilon$-optimal and $\varepsilon$-safe policies in Section 6.1. Another interesting application of our framework is safe zero-sum Markov games, which we discuss in Appendix C.

The main contributions of our paper are as follows.

  • We formulate the safe RL with step-wise violation constraint problem (Safe-RL-SW), which models safety as a cost function for states and aims to minimize the cumulative step-wise violation. Our formulation is particularly useful for safety-critical applications where avoiding disastrous situations at each decision step is desirable, e.g., autonomous driving and robotics.

  • We provide a general algorithmic framework SUCBVI, which is equipped with an innovative dynamic programming procedure to identify potentially unsafe states and distinguish safe actions. We establish $\widetilde{\mathcal{O}}(\sqrt{H^{3}SAT})$ regret and $\widetilde{\mathcal{O}}(\sqrt{ST})$ or gap-dependent $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ violation guarantees, which exhibit the capability of SUCBVI in attaining high rewards while maintaining small violation.

  • We further establish $\Omega(\sqrt{HST})$ regret and $\Omega(\sqrt{ST})$ violation lower bounds for Safe-RL-SW. The lower bounds demonstrate the optimality of algorithm SUCBVI in both regret minimization and safety guarantee, with respect to the factors $S$ and $T$.

  • We consider step-wise safety constraints in the reward-free exploration setting, which we call the Safe-RFE-SW problem. In this setting, we design an efficient algorithm SRF-UCRL, which ensures $\varepsilon$ step-wise violation during exploration and plans an $\varepsilon$-optimal and $\varepsilon$-safe policy for any reward function with probability at least $1-\delta$. We obtain an $\widetilde{\mathcal{O}}((\frac{S^{2}AH^{2}}{\varepsilon}+\frac{H^{4}SA}{\varepsilon^{2}})(\log(\frac{1}{\delta})+S))$ sample complexity and an $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation guarantee for SRF-UCRL, which shows the efficiency of SRF-UCRL in sampling and danger avoidance even without reward signals. To the best of our knowledge, this work is the first to study step-wise violation constraints in the RFE setting.

2 Related Work

Safe RL.

Safety is an important topic in RL and has been extensively studied. Constrained Markov decision process (CMDP)-based approaches handle safety via cost functions, and aim to minimize the expected episode-wise violation, e.g., (Yu et al., 2019; Wachi & Sui, 2020; Qiu et al., 2020; Efroni et al., 2020; Turchetta et al., 2020; Ding et al., 2020; Singh et al., 2020; Kalagarla et al., 2021; Simão et al., 2021; Ding et al., 2021), or achieve zero episode-wise violation, e.g., (Liu et al., 2021b; Bura et al., 2021; Wei et al., 2022; Sootla et al., 2022). Apart from CMDP-based approaches, there are also other works that tackle safe RL via control-based approaches (Berkenkamp et al., 2017; Chow et al., 2018; Dalal et al., 2018; Wang et al., 2022), policy optimization (Uchibe & Doya, 2007; Achiam et al., 2017; Tessler et al., 2018; Liu et al., 2020; Stooke et al., 2020) and safety shields (Alshiekh et al., 2018). In recent years, some works have also studied step-wise violation under additional assumptions such as safe actions (Amani et al., 2021) or a known state-action subgraph (Shi et al., 2023). More detailed comparisons are provided in Appendix A.

Reward-free Exploration with Safety.

Motivated by sparse reward signals in realistic applications, reward-free exploration (RFE) has received extensive attention (Jin et al., 2020; Kaufmann et al., 2021; Ménard et al., 2021). Recently, Miryoosefi & Jin (2022); Huang et al. (2022) also study RFE with safety constraints. However, Miryoosefi & Jin (2022) only considers the safety constraint after the exploration phase. Huang et al. (2022) considers the episode-wise constraint during the exploration phase, requiring an assumption of a known baseline policy. The detailed comparison is provided in Appendix A.

3 The Safe MDP Model

Episodic MDP. In this paper, we consider the finite-horizon episodic Markov Decision Process (MDP), represented by a tuple $(\mathcal{S},\mathcal{A},H,\mathbb{P},r)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $H$ is the length of each episode. $\mathbb{P}=\{\mathbb{P}_{h}:\mathcal{S}\times\mathcal{A}\mapsto\triangle_{\mathcal{S}}\}_{h\in[H]}$ is the transition kernel, and $\mathbb{P}_{h}(s^{\prime}|s,a)$ gives the transition probability from $(s,a)$ to $s^{\prime}$ at step $h$. $r=\{r_{h}:\mathcal{S}\times\mathcal{A}\mapsto[0,1]\}_{h\in[H]}$ is the reward function, and $r_{h}(s,a)$ gives the reward of taking action $a$ in state $s$ at step $h$. A policy $\pi=\{\pi_{h}:\mathcal{S}\mapsto\mathcal{A}\}_{h\in[H]}$ consists of $H$ mappings from the state space to the action space. In each episode $k$, the agent first chooses a policy $\pi^{k}$. At each step $h\in[H]$, the agent observes a state $s_{h}^{k}$, takes an action $a_{h}^{k}$, and then transitions to a next state $s_{h+1}^{k}$ with probability $\mathbb{P}_{h}(s_{h+1}^{k}\mid s_{h}^{k},a_{h}^{k})$. The algorithm executes $T=HK$ steps in total. Moreover, the state value function $V^{\pi}_{h}(s)$ and state-action value function $Q_{h}^{\pi}(s,a)$ for a policy $\pi$ are defined as

$$V^{\pi}_{h}(s):=\mathbb{E}_{\pi}\Bigg[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},\pi_{h^{\prime}}(s_{h^{\prime}}))\,\Bigg|\,s_{h}=s\Bigg],\qquad
Q^{\pi}_{h}(s,a):=\mathbb{E}_{\pi}\Bigg[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},\pi_{h^{\prime}}(s_{h^{\prime}}))\,\Bigg|\,s_{h}=s,a_{h}=a\Bigg].$$

Safety Constraint. To model unsafe regions in the environment, similar to Wachi & Sui (2020); Yu et al. (2022), we define a safety cost function $c:\mathcal{S}\mapsto[0,1]$. Let $\tau\in[0,1]$ denote the safety threshold. A state $s$ is called safe if $c(s)\leq\tau$, and called unsafe if $c(s)>\tau$. Similar to Efroni et al. (2020); Amani et al. (2021), when the agent arrives in a state $s$, she will receive a cost signal $z(s)=c(s)+\zeta$, where $\zeta$ is an independent, zero-mean and 1-sub-Gaussian noise. Denote $(x)_{+}=\max\{x,0\}$. The violation in state $s$ is defined as $(c(s)-\tau)_{+}$, and the cumulative step-wise violation till episode $K$ is

$$C(K)=\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}. \tag{1}$$

Eq. (1) represents the accumulated step-wise violation during training. When the agent arrives in state $s_{h}^{k}$ at step $h$ in episode $k$, she suffers violation $(c(s_{h}^{k})-\tau)_{+}$. This violation setting is significantly different from the previous CMDP setting (Qiu et al., 2020; Ding et al., 2020; Efroni et al., 2020; Ding et al., 2021; Wachi & Sui, 2020; Liu et al., 2021a; Kalagarla et al., 2021), which studies the episode-wise expected violation $C^{\prime}(K)=\sum_{k=1}^{K}(\mathbb{E}[\sum_{h=1}^{H}c(s_{h}^{k},a_{h}^{k})]-\mu)$. There are two main differences between the step-wise violation and the episode-wise expected violation: (i) First, the expected violation constraint allows the agent to get into unsafe states occasionally, whereas the step-wise violation constraint requires the agent to stay in safe regions at all times. (ii) Second, in the episode-wise constraint, the violation $c(s_{h},a_{h})-\tau$ at each step is allowed to be positive or negative, so violations can cancel out within one episode. Instead, we consider a nonnegative function $(c(s)-\tau)_{+}$ for each step in our step-wise violation, which imposes a stricter constraint.
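To make the contrast concrete, the following minimal sketch computes both quantities on a single toy episode; the cost values, the threshold, and the use of $\tau H$ as the per-episode budget are illustrative assumptions, not values from the paper.

```python
import numpy as np

tau = 0.5
costs = np.array([0.1, 0.9, 0.2, 0.8, 0.0])  # illustrative c(s_h) along one episode

# Step-wise violation as in Eq. (1): nonnegative excesses are summed, no cancellation.
step_wise = np.sum(np.maximum(costs - tau, 0.0))   # 0.4 + 0.3 = 0.7

# Episode-wise (CMDP-style) violation: signed excesses can cancel within the episode.
episode_wise = np.sum(costs) - tau * len(costs)    # 2.0 - 2.5 = -0.5, i.e., no violation
print(step_wise, episode_wise)
```

The episode as a whole satisfies the expected constraint even though two individual steps exceed the threshold, which is exactly the cancellation effect ruled out by the step-wise criterion.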

Define $\mathcal{U}:=\{s\in\mathcal{S}\mid c(s)>\tau\}$ as the set of all unsafe states. Let $\iota=\{s_{1},a_{1},\cdots,s_{H},a_{H}\}$ denote a trajectory. Since a feasible policy needs to satisfy the constraint at each step, we define the set of feasible policies as $\Pi=\{\pi\mid\Pr\{\exists\ h\in[H],s_{h}\in\mathcal{U}\mid\iota\sim\pi\}=0\}$. The feasible policy set $\Pi$ consists of all policies under which one never reaches any unsafe state in an episode.

Learning Objective. In this paper, we consider the regret minimization objective. Specifically, define $\pi^{*}=\operatorname{argmax}_{\pi\in\Pi}V_{1}^{\pi}$, $V^{*}=V^{\pi^{*}}$ and $Q^{*}=Q^{\pi^{*}}$. The regret after $K$ episodes is then defined as

$$R(K)=\sum_{k=1}^{K}\big(V^{*}_{1}(s_{1})-V^{\pi^{k}}_{1}(s_{1})\big),$$

where $\pi^{k}$ is the policy taken in episode $k$. Our objective is to minimize $R(K)$ to achieve a good performance, and minimize the violation $C(K)$ to guarantee safety at the same time.

4 Safe RL with Step-wise Violation Constraints

4.1 Assumptions and Problem Features

Before introducing our algorithms, we first state the important assumptions and problem features for Safe-RL-SW.

Suppose $\pi$ is a feasible policy. Then, if the agent arrives at $s_{H-1}$ at step $H-1$, $\pi$ needs to select an action that guarantees $s_{H}\notin\mathcal{U}$. Define the transition set $\Delta_{h}(s,a)=\{s^{\prime}\mid\mathbb{P}_{h}(s^{\prime}\mid s,a)>0\}$ for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$, which represents the set of possible next states after taking action $a$ in state $s$ at step $h$. Then, at the previous step $H-1$, a state $s$ is potentially unsafe if it satisfies $\Delta_{H-1}(s,a)\cap\mathcal{U}\neq\emptyset$ for all $a\in\mathcal{A}$ (i.e., no matter which action is taken, there is a positive probability of transitioning to an unsafe next state). Therefore, we can recursively define the set of potentially unsafe states at step $h$ as

$$\mathcal{U}_{h}=\mathcal{U}_{h+1}\cup\{s\mid\forall\ a\in\mathcal{A},\ \Delta_{h}(s,a)\cap\mathcal{U}_{h+1}\neq\emptyset\}, \tag{2}$$

where $\mathcal{U}_{H}=\mathcal{U}$. Intuitively, if we are in a state $s_{h}\in\mathcal{U}_{h}$ at step $h$, no action can be taken to completely avoid reaching a potentially unsafe state $s_{h+1}\in\mathcal{U}_{h+1}$ at step $h+1$. Thus, in order to completely prevent the agent from getting into unsafe states $\mathcal{U}$ throughout all steps, one needs to avoid the potentially unsafe states in $\mathcal{U}_{h}$ at each step $h$. From the above argument, we have that $s_{1}\notin\mathcal{U}_{1}$ is equivalent to the existence of feasible policies. The detailed proof is provided in Appendix E. Thus, we make the following necessary assumption.

Assumption 4.1 (Existence of feasible policies).

The initial state $s_{1}$ satisfies $s_{1}\notin\mathcal{U}_{1}$.

For any $s\in\mathcal{S}$ and $h\in[H-1]$, we define the set of safe actions for state $s$ at step $h$ as

$$A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\Delta_{h}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}, \tag{3}$$

and let $A_{H}^{safe}(s)=\mathcal{A}$. $A_{h}^{safe}(s)$ stands for the set of all actions at step $h$ which will not lead to potentially unsafe states in $\mathcal{U}_{h+1}$. Here, $\{\mathcal{U}_{h}\}_{h\in[H]}$ and $\{A_{h}^{safe}(s)\}_{h\in[H]}$ are defined by dynamic programming: if we know all possible next-state sets $\{\Delta_{h}(s,a)\}_{h\in[H]}$ and the unsafe state set $\mathcal{U}=\mathcal{U}_{H}$, we can calculate all potentially unsafe state sets $\{\mathcal{U}_{h}\}_{h\in[H]}$ and safe action sets $\{A_{h}^{safe}(s)\}_{h\in[H]}$, and choose feasible policies to completely avoid unsafe states.
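A minimal sketch of this backward recursion is given below, assuming the transition supports $\Delta_h(s,a)$ and the unsafe set $\mathcal{U}_H$ are stored as Python sets indexed by step and state-action pair; all container names are illustrative.

```python
def potentially_unsafe_and_safe_actions(states, actions, H, Delta, U_H):
    """Backward recursion of Eqs. (2)-(3): returns the potentially unsafe
    state sets U[h] and the safe action sets A_safe[h][s] for h = 1, ..., H."""
    U = {H: set(U_H)}
    A_safe = {H: {s: set(actions) for s in states}}   # A_H^safe(s) = A
    for h in range(H - 1, 0, -1):
        # Eq. (3): actions whose possible next states avoid U_{h+1}.
        A_safe[h] = {s: {a for a in actions
                         if not (Delta[h][(s, a)] & U[h + 1])}
                     for s in states}
        # Eq. (2): s is potentially unsafe if it already was at step h+1,
        # or if no action keeps every possible next state outside U_{h+1}.
        U[h] = U[h + 1] | {s for s in states if not A_safe[h][s]}
    return U, A_safe
```

Assumption 4.1 then corresponds to checking that the initial state does not appear in `U[1]`.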

4.2 Algorithm SUCBVI

Now we present our main algorithm Safe UCBVI (SUCBVI), which is based on the classic RL algorithm UCBVI (Azar et al., 2017) and equipped with a novel dynamic programming procedure to identify potentially unsafe states and safe actions. The pseudo-code is shown in Algorithm 1. First, we provide some intuition about how SUCBVI works. In each episode, we first estimate all the unsafe states based on the historical data. Then, we perform the dynamic programming procedure introduced in Section 4.1 and calculate the safe action set $A_{h}^{safe}(s)$ for each state $s$. Finally, we perform value iteration over the estimated safe action sets. As the estimation becomes more accurate, SUCBVI eventually avoids potentially unsafe states and achieves both sublinear regret and violation.

Now we begin to introduce the algorithm in detail. In the beginning, we initialize $\Delta_{h}(s,a)=\emptyset$ for all $(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}$. This means that the agent considers all actions to be safe at first, because no action leads to unsafe states from the agent's perspective. In each episode, we first estimate the empirical cost $\hat{c}(s)$ based on historical cost feedback $z(s)$, and regard state $s$ as unsafe only if $\bar{c}(s)=\hat{c}(s)-\beta>\tau$ for some bonus term $\beta$, which aims to guarantee $\bar{c}(s)\leq c(s)$ (Line 4). Then, we calculate the estimated unsafe state set $\mathcal{U}_{H}^{k}$, which is a subset of the true unsafe state set $\mathcal{U}=\mathcal{U}_{H}$ with high probability by optimism. With $\mathcal{U}_{H}^{k}$ and $\Delta_{h}^{k}(s,a)$, we can estimate the potentially unsafe state sets $\mathcal{U}_{h}^{k}$ for all $h\in[H]$ by applying Eq. (2) recursively.

Then, we perform value iteration to compute the optimistically estimated optimal policy. Specifically, for any hypothesized safe state $s\notin\mathcal{U}_{h}^{k}$, we take the greedy policy over the estimated safe actions, i.e., $\pi_{h}^{k}(s)=\operatorname{argmax}_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a)$ (Line 13). On the other hand, for any hypothesized unsafe state $s\in\mathcal{U}_{h}^{k}$, since no action can completely avoid unsafe states, we ignore safety costs and simply update the policy by $\pi_{h}^{k}(s)=\operatorname{argmax}_{a\in\mathcal{A}}Q_{h}^{k}(s,a)$ (Line 15). After that, we obtain the estimated optimal policy $\pi^{k}$ for episode $k$, and the agent follows $\pi^{k}$ and collects a trajectory. Then, we update $\Delta_{h}^{k+1}(s,a)$ by incorporating the observed state $s_{h+1}^{k}$ into the set $\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$. Under this updating rule, it holds that $\Delta_{h}^{k}(s,a)\subseteq\Delta_{h}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.

1:  Initialize: $\Delta_{h}^{1}(s,a)=\emptyset$, $N_{h}^{1}(s,a)=N_{h}^{1}(s,a,s^{\prime})=0$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, $s^{\prime}\in\mathcal{S}$ and $h\in[H]$.
2:  for $k=1,2,\cdots,K$ do
3:     $\triangleright$ Update the optimistic estimates of cost.
4:     Update the empirical cost $\hat{c}(s)$ and calculate $\bar{c}(s)=\hat{c}(s)-\beta(N^{k}(s),\delta)$ for all $s\in\mathcal{S}$.
5:     Define $\mathcal{U}_{H}^{k}=\{s\mid\bar{c}(s)>\tau\}$ and $\mathcal{U}_{h}^{k}=\mathcal{U}_{h+1}^{k}\cup\{s\mid\forall a\in\mathcal{A},\Delta_{h}^{k}(s,a)\cap\mathcal{U}_{h+1}^{k}\neq\emptyset\}$ for all $h\in[H]$ recursively. Calculate $\{A_{h}^{k,safe}(s)\}_{h\in[H]}$ by Eq. (3) with $\{\mathcal{U}_{h}^{k}\}_{h\in[H]}$. $\triangleright$ Perform value iteration with previous estimates.
6:     for $h=H,H-1,\cdots,1$ do
7:        for $s\in\mathcal{S}$ do
8:           for $a\in\mathcal{A}$ do
9:              Compute $\hat{\mathbb{P}}^{k}_{h}(s^{\prime}\mid s,a)=\frac{N_{h}^{k}(s,a,s^{\prime})}{N_{h}^{k}(s,a)}$.
10:              $Q_{h}^{k}(s,a)=\min\{H,r_{h}(s,a)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a)V_{h+1}^{k}(s^{\prime})+\alpha(N_{h}^{k}(s,a))\}$.
11:           end for
12:           if $s\notin\mathcal{U}_{h}^{k}$ then
13:              $V_{h}^{k}(s)=\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a)$, $\pi_{h}^{k}(s)=\arg\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a)$.
14:           else
15:              $V_{h}^{k}(s)=\max_{a\in\mathcal{A}}Q_{h}^{k}(s,a)$, $\pi_{h}^{k}(s)=\arg\max_{a\in\mathcal{A}}Q_{h}^{k}(s,a)$.
16:           end if
17:        end for
18:     end for
19:     for $h=1,2,\cdots,H$ do
20:        Take action $a_{h}^{k}=\pi_{h}^{k}(s_{h}^{k})$ and observe state $s_{h+1}^{k}$. $\triangleright$ Update the estimates of $\Delta(s,a)$.
21:        $\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k})=\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})\cup\{s_{h+1}^{k}\}$. Increase $N_{h}^{k+1}(s_{h}^{k},a_{h}^{k})$, $N_{h}^{k+1}(s_{h}^{k},a_{h}^{k},s_{h+1}^{k})$ by $1$.
22:     end for
23:  end for
Algorithm 1 SUCBVI
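As a concrete reading of Lines 9-15, the sketch below performs one backward step of the optimistic value iteration with the greedy action restricted to the estimated safe set; the array layout, the `bonus` callable, and all variable names are illustrative assumptions, with the Hoeffding-style bonus taken from the form of $\alpha(n,\delta)$ in Theorem 4.2.

```python
import numpy as np

def hoeffding_bonus(n, H, S, A, K, delta):
    # alpha(n, delta) = 7H * sqrt(ln(5SAHK/delta) / n) as stated in Theorem 4.2.
    return 7 * H * np.sqrt(np.log(5 * S * A * H * K / delta) / max(n, 1))

def backup_step(h, S, A, H, r_h, N_h, N_next_h, V_next, U_k_h, A_safe_k_h, bonus):
    """One backward step (Lines 9-15): r_h and N_h are (S, A) arrays, N_next_h is
    (S, A, S), V_next is the length-S value vector of step h+1, U_k_h is the set
    of hypothesized unsafe states, and A_safe_k_h maps each state to its safe actions."""
    Q = np.zeros((S, A))
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for s in range(S):
        for a in range(A):
            n = max(N_h[s, a], 1)
            P_hat = N_next_h[s, a] / n                         # empirical transition row
            Q[s, a] = min(H, r_h[s, a] + P_hat @ V_next + bonus(n))
        # Restrict the greedy step to estimated safe actions for hypothesized
        # safe states (Line 13); otherwise use all actions (Line 15).
        cand = sorted(A_safe_k_h[s]) if s not in U_k_h else list(range(A))
        pi[s] = cand[int(np.argmax(Q[s, cand]))]
        V[s] = Q[s, pi[s]]
    return Q, V, pi
```

Running this step backward from $h=H$ to $h=1$ with $V_{H+1}^{k}=0$, e.g. with `bonus = lambda n: hoeffding_bonus(n, H, S, A, K, delta)`, recovers the planning phase of each episode.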

The performance of Algorithm 1 is summarized below in Theorem 4.2.

Theorem 4.2.

Let $\alpha(n,\delta)=7H\sqrt{\frac{\ln(5SAHK/\delta)}{n}}$ and $\beta(n,\delta)=\sqrt{\frac{2}{n}\log(SK/\delta)}$. With probability at least $1-\delta$, the regret and step-wise violation of Algorithm 1 are bounded by

$$R(K)=\widetilde{\mathcal{O}}(\sqrt{H^{2}SAT}),\qquad C(K)=\widetilde{\mathcal{O}}(\sqrt{ST}+S^{2}AH^{2}).$$

Moreover, if $\mathcal{C}_{\mathrm{gap}}\triangleq\min_{s\in\mathcal{U}}(c(s)-\tau)_{+}>0$, we have $C(K)=\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$.

Theorem 4.2 shows that SUCBVI achieves both sublinear regret and violation. Moreover, when all the unsafe states have a large cost compared to the safety threshold $\tau$, i.e., $\mathcal{C}_{\mathrm{gap}}=\min_{s\in\mathcal{U}}(c(s)-\tau)_{+}>0$ is a constant, we can distinguish the unsafe states easily and obtain a violation bound that is independent of $T$. In particular, when $\tau=1$, Safe-RL-SW degenerates to the unconstrained MDP and Theorem 4.2 maintains the same regret $\widetilde{\mathcal{O}}(\sqrt{H^{2}SAT})$ as UCBVI-CH (Azar et al., 2017), while CMDP algorithms (Efroni et al., 2020; Liu et al., 2021a) suffer a larger $\widetilde{\mathcal{O}}(\sqrt{H^{3}S^{3}AT})$ regret.

We provide the main analysis ideas of Theorem 4.2 here. First, by the updating rule of $\Delta_{h}^{k}(s,a)$, we show that $\mathcal{U}_{h}^{k}$ and $A_{h}^{k,safe}(s)$ have the following crucial properties: $\mathcal{U}_{h}^{k}\subseteq\mathcal{U}_{h}$ and $A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s)$. Based on these properties, we can prove that $Q^{*}_{h}(s,a)\leq Q_{h}^{k}(s,a)$ and $V^{*}_{h}(s)\leq V_{h}^{k}(s)$ for all $(s,a)$, and then apply the regret decomposition techniques to derive the regret bound.

Recall that $\pi^{k}$ is a feasible policy with respect to the estimated unsafe state set $\mathcal{U}_{H}^{k}$ and transition sets $\{\Delta_{h}^{k}(s,a)\}_{h\in[H]}$. If the agent takes policy $\pi^{k}$ and the transitions follow $\Delta_{h}^{k}(s,a)$, i.e., $s_{h+1}^{k}\in\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$ for all $h$, the agent never arrives at any estimated unsafe state $s\in\mathcal{U}_{H}^{k}$ in this episode. Hence the agent suffers at most $\widetilde{\mathcal{O}}(\sqrt{ST})$ step-wise violation or $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ gap-dependent violation. Yet, the condition $s_{h+1}^{k}\in\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$ does not always hold for all $h\in[H]$. In the case when $s_{h+1}^{k}\notin\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$, we add the newly observed state $s_{h+1}^{k}$ into $\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k})$ (Line 21). We can show that this case occurs at most $S^{2}AH$ times, and thus incurs $O(S^{2}AH^{2})$ additional violation. Combining the above two cases, the total violation can be upper bounded.

5 Lower Bounds for Safe-RL-SW

In this section, we provide a matching lower bound for the Safe-RL-SW problem in Section 4. The lower bound shows that if an algorithm always achieves a sublinear regret in Safe-RL-SW, it must incur an $\Omega(\sqrt{ST})$ violation. This result matches our upper bound in Theorem 4.2, showing that SUCBVI achieves the optimal violation performance.

Theorem 5.1.

If an algorithm has an expected regret $\mathbb{E}_{\pi}[R(K)]\leq\frac{HK}{24}$ for all MDP instances, there exists an MDP instance in which the algorithm suffers expected violation $\mathbb{E}_{\pi}[C(K)]=\Omega(\sqrt{ST})$.

Now we validate the optimality in terms of regret. Note that if we do not consider safety constraints, the lower bound for classic RL (Osband & Van Roy, 2016) can be applied to our setting. Thus, we also have an $\Omega(\sqrt{T})$ regret lower bound. To understand the essential hardness brought by safety constraints, we further investigate whether safety constraints will lead to $\Omega(\sqrt{T})$ regret, given that we can achieve an $o(\sqrt{T})$ regret on some benign instances without the safety constraints.

Theorem 5.2.

For any $\alpha\in(0,1)$, there exist a parameter $n$ and $n$ MDPs $M_{1},\dots,M_{n}$ satisfying that:

1. If we do not consider any constraint, there is an algorithm which achieves $\widetilde{\mathcal{O}}(T^{(1-\alpha)/2})$ regret compared to the unconstrained optimal policy on all MDPs.

2. If we consider the safety constraint, any algorithm with $O(T^{1-\alpha})$ expected violation will suffer $\Omega(\sqrt{HST})$ regret compared to the constrained optimal policy on at least one of the MDPs.

Intuitively, Theorem 5.2 shows that if one achieves sublinear violation, she must suffer at least $\Omega(\sqrt{T})$ regret even if she can achieve $o(\sqrt{T})$ regret without considering constraints. This theorem demonstrates the hardness particularly brought by the step-wise constraint, and corroborates the optimality of our results. Combined with Theorem 5.1, the two lower bounds show an essential trade-off between the violation and performance.

6 Safe Reward-Free Exploration with Step-wise Violation Constraints

6.1 Formulation of Safe-RFE-SW

In this section, we consider safe RL in the reward-free exploration (RFE) setting (Jin et al., 2020; Kaufmann et al., 2021; Ménard et al., 2021), called Safe-RFE-SW, to show the generality of our proposed framework. In the RFE setting, the agent does not have access to reward signals and only receives random safety cost feedback $z(s)=c(s)+\zeta$. To impose safety requirements, Safe-RFE-SW requires the agent to keep small safety violations during exploration, and to output a near-optimal safe policy after receiving the reward function.

Definition 6.1 ($(\varepsilon,\delta)$-optimal safe algorithm for Safe-RFE-SW).

An algorithm is $(\varepsilon,\delta)$-optimal safe for Safe-RFE-SW if it outputs a triple $(\hat{\mathbb{P}},\hat{\Delta}(s,a),\hat{\mathcal{U}}_{H})$ such that for any reward function $r$, with probability at least $1-\delta$,

$$V_{1}^{*}(s_{1};r)-V_{1}^{\hat{\pi}^{*}}(s_{1};r)\leq\varepsilon,\qquad\mathbb{E}_{\hat{\pi}^{*}}\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]\leq\varepsilon, \tag{4}$$

where $\hat{\pi}^{*}$ is the optimal feasible policy with respect to $(\hat{\mathbb{P}},\hat{\Delta}(s,a),\hat{\mathcal{U}}_{H},r)$, and $V(s;r)$ is the value function under reward function $r$. We say that a policy is $\varepsilon$-optimal if it satisfies the left inequality in Eq. (4) and $\varepsilon$-safe if it satisfies the right inequality in Eq. (4). We measure the performance by the number of episodes used before the algorithm terminates, i.e., the sample complexity. Moreover, the cumulative step-wise violation till episode $K$ is defined as $C(K)=\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}$. In Safe-RFE-SW, our goal is to design an $(\varepsilon,\delta)$-optimal safe algorithm, and minimize both the sample complexity and the violation.
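As a concrete check of the right-hand condition in Eq. (4), the sketch below estimates the expected step-wise violation of a fixed policy by Monte Carlo rollouts; the simulator interface (`reset`, `cost`, `step`) and all names are illustrative assumptions, not the paper's API.

```python
def estimate_expected_violation(env, policy, H, tau, n_rollouts=10000):
    """Monte Carlo estimate of E_pi[ sum_{h=1}^H (c(s_h) - tau)_+ ] for a fixed
    policy. env.reset(), env.cost(s) and env.step(h, s, a) form an assumed
    simulator interface, where cost(s) returns the true cost c(s)."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()
        for h in range(1, H + 1):
            total += max(env.cost(s) - tau, 0.0)   # violation suffered in state s_h
            s = env.step(h, s, policy[h][s])       # transition to s_{h+1}
    return total / n_rollouts
```

Up to Monte Carlo error, a policy is $\varepsilon$-safe in the sense of Definition 6.1 when this estimate is at most $\varepsilon$.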

6.2 Algorithm SRF-UCRL

The Safe-RFE-SW problem requires us to handle extra safety constraints both during the exploration phase and for the final output policy, which calls for new algorithm design and techniques compared to previous RFE algorithms. Also, the techniques for Safe-RL-SW in Section 4.2 are not sufficient for guaranteeing the safety of the output policy, because SUCBVI only bounds the step-wise violation during exploration.

We design an efficient algorithm Safe RF-UCRL (SRF-UCRL), which builds upon the previous RFE algorithm RF-UCRL (Kaufmann et al., 2021). SRF-UCRL distinguishes potentially unsafe states and safe actions by backward iteration, and introduces a new uncertainty function to guarantee the safety of the output policy. Algorithm 2 illustrates the procedure of SRF-UCRL. Specifically, in each episode $k$, we first execute the policy $\pi^{k}$ computed from previous episodes, and then update the estimated next-state sets $\{\Delta^{k}_{h}(s,a)\}_{h\in[H]}$ and the unsafe state set $\mathcal{U}_{H}^{k}$ by optimistic estimation. Then, we use Eq. (2) to calculate the potentially unsafe state sets $\mathcal{U}_{h}^{k}$ for all steps $h\in[H]$. After that, we update the uncertainty function $\overline{W}^{k}$ defined in Eq. (5) below and compute the policy $\pi^{k+1}$ that maximizes the uncertainty to encourage more exploration in the next episode.

Now we provide the definition of the uncertainty function, which measures the estimation error between the empirical MDP and the true MDP. For any safe state-action pair $s\notin\mathcal{U}_{h}$, $a\in A_{h}^{k,safe}(s)$, we define

$$\overline{W}_{h}^{k}(s,a)=\min\left\{H,\ M(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a)\max_{b\in A^{k,safe}_{h+1}(s^{\prime})}\overline{W}_{h+1}^{k}(s^{\prime},b)\right\}, \tag{5}$$

where $M(N_{h}^{k}(s,a),\delta)=2H\sqrt{\frac{2\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}}+\frac{SH\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}$, and $\gamma(n,\delta)$ is a logarithmic term that is formally defined in Theorem 6.2. For other state-action pairs $(s,a)$, we relax the restriction $b\in A_{h+1}^{k,safe}(s^{\prime})$ to $b\in\mathcal{A}$ in the $\max$ operator. Our algorithm stops when $\overline{W}^{K}_{1}(s_{1},\pi^{K}(s_{1}))$ falls below $\varepsilon/2$. Compared to previous RFE works (Kaufmann et al., 2021; Ménard et al., 2021), our uncertainty function has two distinctions. First, for any safe state-action pair $(s,a)\in(\mathcal{S}\setminus\mathcal{U}_{h},A_{h}^{k,safe}(s))$, Eq. (5) considers only safe actions $A_{h+1}^{k,safe}(s^{\prime})$, which guarantees that the agent focuses on safe policies. Second, Eq. (5) incorporates an additional term $SH\gamma(N_{h}^{k}(s,a),\delta)/N_{h}^{k}(s,a)$ to control the expected violation of feasible policies. Now we present our result for Safe-RFE-SW.
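The backward computation of Eq. (5) can be sketched as follows; the data layout ($\hat{\mathbb{P}}$ and counts as arrays, $\mathcal{U}_h^k$ and $A_h^{k,safe}$ as sets, `gamma` as a callable) is an illustrative assumption.

```python
import numpy as np

def uncertainty(S, A, H, P_hat, N, U_k, A_safe_k, gamma, delta):
    """Backward recursion for the uncertainty function of Eq. (5).
    P_hat[h] has shape (S, A, S) and N[h] has shape (S, A); U_k[h] is a set of
    states and A_safe_k[h][s] a set of actions; W[H+1] = 0 is the terminal layer."""
    W = np.zeros((H + 2, S, A))
    for h in range(H, 0, -1):
        for s in range(S):
            for a in range(A):
                n = max(N[h][s, a], 1)
                g = gamma(n, delta)
                M = 2 * H * np.sqrt(2 * g / n) + S * H * g / n
                exp_next = 0.0
                for s2 in range(S):
                    if h == H:
                        best = 0.0                      # W_{H+1} = 0
                    else:
                        # Restrict the inner max to safe next actions only when
                        # (s, a) itself is a hypothesized safe pair.
                        safe_pair = (s not in U_k[h]) and (a in A_safe_k[h][s])
                        acts = A_safe_k[h + 1][s2] if safe_pair and A_safe_k[h + 1][s2] else range(A)
                        best = max(W[h + 1][s2, b] for b in acts)
                    exp_next += P_hat[h][s, a, s2] * best
                W[h][s, a] = min(H, M + exp_next)
    return W
```

The stopping rule of Algorithm 2 then checks whether `W[1][s_1, pi_1(s_1)]` has dropped below $\varepsilon/2$.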

Theorem 6.2.

Let $\gamma(n,\delta)=2\big(\log(2SAH/\delta)+(S-1)\log(e(1+n/(S-1)))\big)$. Algorithm 2 is an $(\varepsilon,\delta)$-optimal safe algorithm with sample complexity at most (here $\widetilde{\mathcal{O}}(\cdot)$ ignores all $\log S$, $\log A$, $\log H$, $\log(1/\varepsilon)$ and $\log(\log(1/\delta))$ terms)

$$K=\widetilde{\mathcal{O}}\left(\left(\frac{S^{2}AH^{2}}{\varepsilon}+\frac{H^{4}SA}{\varepsilon^{2}}\right)\left(\log\left(\frac{1}{\delta}\right)+S\right)\right).$$

The step-wise violation of Algorithm 2 during exploration is $C(K)=\widetilde{\mathcal{O}}(S^{2}AH^{2}+\sqrt{ST})$.

Compared to the previous work (Kaufmann et al., 2021) with $\widetilde{\mathcal{O}}((H^{4}SA/\varepsilon^{2})(\log(1/\delta)+S))$ sample complexity, our result has an additional term $\widetilde{\mathcal{O}}((S^{2}AH^{2}/\varepsilon)(\log(1/\delta)+S))$. This term is due to the additional safety requirement on the final output policy, which was not considered in previous RFE algorithms (Kaufmann et al., 2021; Ménard et al., 2021). When $\varepsilon$ and $\delta$ are sufficiently small, the leading term is $\widetilde{\mathcal{O}}((H^{4}SA/\varepsilon^{2})\log(1/\delta))$, which implies that our algorithm satisfies the safety constraint without suffering additional regret. We note that Ménard et al. (2021) improve the result by a factor of $H$ and replace $\log(1/\delta)+S$ by $\log(1/\delta)$ via a Bernstein-type inequality. While they do not consider safety constraints, we believe that a similar improvement can also be applied to our framework without significant changes in our analysis. However, since our paper mainly focuses on tackling the safety constraint for reward-free exploration, we use the Hoeffding-type inequality to keep the statements succinct.

1:  Initialize: $k=1$, $\overline{W}_{h}^{0}(s,a)=H$, $\pi_{h}^{1}(s)=a_{1}$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$ and $h\in[H]$.
2:  while $\overline{W}_{1}^{k-1}(s_{1},\pi_{1}^{k}(s_{1}))>\varepsilon/2$ do
3:     for $h=1,\cdots,H$ do
4:        Observe state $s_{h}^{k}$. Take action $a_{h}^{k}=\pi_{h}^{k}(s_{h}^{k})$ and observe $s_{h+1}^{k}$. $\triangleright$ Update the optimistic estimate of cost.
5:        Calculate the empirical cost $\hat{c}(s)$ and $\bar{c}(s)=\hat{c}(s)-\beta(N^{k}(s),\delta)$ for all $s\in\mathcal{S}$. $\triangleright$ Update the estimates of $\Delta_{h}(s,a)$ and $\mathcal{U}_{h}$.
6:        Update $\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})=\Delta_{h}^{k-1}(s_{h}^{k},a_{h}^{k})\cup\{s_{h+1}^{k}\}$ and $\{\mathcal{U}_{h}^{k}\}_{h\in[H]}$ by Eq. (2).
7:        Calculate $A_{h}^{k,safe}(s)$ for all $s\in\mathcal{S}$ by Eq. (3).
8:     end for
9:     Update $N_{h}^{k}(s,a,s^{\prime})$, $N_{h}^{k}(s,a)$ and $\hat{\mathbb{P}}_{h}^{k}$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, $s^{\prime}\in\mathcal{S}$ and $h\in[H]$.
10:     Compute $\overline{W}^{k}$ according to Eq. (5). $\triangleright$ Calculate the greedy policy.
11:     for $h\in[H]$, $s\in\mathcal{S}$ do
12:        if $A_{h}^{k,safe}(s)\neq\emptyset$ then
13:           $\pi_{h}^{k+1}(s)=\arg\max_{a\in A_{h}^{k,safe}(s)}\overline{W}_{h}^{k}(s,a)$.
14:        else
15:           $\pi_{h}^{k+1}(s)=\arg\max_{a\in\mathcal{A}}\overline{W}_{h}^{k}(s,a)$.
16:        end if
17:     end for
18:     Set $k=k+1$.
19:  end while
20:  return the tuple $(\hat{\mathbb{P}}_{h}^{k-1},\Delta_{h}^{k-1}(s,a),\mathcal{U}_{H}^{k-1})$.
Algorithm 2 SRF-UCRL

6.3 Analysis for Algorithm SRF-UCRL

For the analysis of the step-wise violation (Eq. (1)), similar to algorithm SUCBVI, algorithm SRF-UCRL estimates the next-state sets $\Delta_{h}(s,a)$ and the potentially unsafe state sets $\mathcal{U}_{h}$, which guarantees an $\widetilde{\mathcal{O}}(\sqrt{ST})$ step-wise violation. Now we give a proof sketch for the $\varepsilon$-safe property of the output policy (Eq. (4)).

First, if $\pi$ is a feasible policy for $(\hat{\mathbb{P}}^{k},\Delta^{k}(s,a),\{\mathcal{U}_{h}^{k}\}_{h\in[H]})$ and $s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})$ for all $h\in[H]$, the agent who follows policy $\pi$ only visits the estimated safe states. Since each estimated safe state suffers only $O(1/\sqrt{N_{t}(s_{h},\pi_{h}(s_{h}))})$ violation, the violation incurred in this case is bounded by $\overline{W}_{1}^{k}(s_{1},\pi_{1}(s_{1}))$. Next, we bound the probability that $s_{h+1}\notin\Delta^{K}_{h}(s_{h},a_{h})$ for some step $h\in[H]$. For any state-action pair $(s,a)$, if the agent transitions from $(s,a)$ to a next state $s^{\prime}$ at step $h$ with probability $\mathbb{P}(s^{\prime}\mid s,a)\geq\widetilde{\mathcal{O}}(\log(S/\delta)/N_{h}^{K}(s,a))$, then the state $s^{\prime}$ is put into $\Delta_{h}^{K}(s,a)$ with probability at least $1-\delta/S$. Hence, we can expect that all such states are put into our estimated next-state set $\Delta_{h}^{K}(s,a)$. Thus, the probability that $s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h})$ is no larger than $\widetilde{\mathcal{O}}(S\log(S/\delta)/N_{h}^{K}(s,a))$ by a union bound over all possible next states $s^{\prime}$. Based on this argument, $\overline{W}_{h}^{k}(s_{h},a_{h})$ is an upper bound for the total probability that $s_{h+1}\notin\Delta^{K}_{h}(s_{h},a_{h})$ for some step $h$, which leads to at most $\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1}))$ additional expected violation. Hence the expected violation of the output policy is upper bounded by $2\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))\leq\varepsilon$. The complete proof is provided in Appendix B.
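Schematically, the two cases above combine into the following bound, which is a restatement of the proof sketch with constants and logarithmic factors suppressed:

$$\mathbb{E}_{\pi}\Bigg[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\Bigg]\leq\underbrace{\mathbb{E}_{\pi}\Big[\text{violation when }s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})\ \forall h\Big]}_{\leq\,\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))}+\underbrace{\mathbb{E}_{\pi}\Big[\text{violation when }\exists h:\ s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h})\Big]}_{\leq\,\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))}\leq 2\,\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))\leq\varepsilon.$$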

7 Experiments

In this section, we provide experiments for Safe-RL-SW and Safe-RFE-SW to validate our theoretical results. For Safe-RL-SW, we compare our algorithm SUCBVI with the classical RL algorithm UCBVI (Azar et al., 2017) and three state-of-the-art CMDP algorithms, OptCMDP-bonus (Efroni et al., 2020), Optpess (Liu et al., 2021a) and Triple-Q (Wei et al., 2022), and report the average reward in each episode and the cumulative step-wise violation. For Safe-RFE-SW, we compare our algorithm SRF-UCRL with the state-of-the-art RFE algorithm RF-UCRL (Kaufmann et al., 2021). We do not plot the regret because unconstrained MDP or CMDP algorithms do not guarantee step-wise violation; applying them to the step-wise constrained setting can lead to negative or large regret and large violations, making the results meaningless. The detailed experiment setup is given in Appendix F.

Figure 1: Experimental results for Safe-RL-SW and Safe-RFE-SW. The left two figures show the average rewards and step-wise violations of algorithms SUCBVI, UCBVI (Azar et al., 2017), OptCMDP-bonus (Efroni et al., 2020), Triple-Q (Wei et al., 2022) and Optpess (Liu et al., 2021a). The right two figures show the reward and expected violation of the policies outputted by algorithms SRF-UCRL and RF-UCRL (Kaufmann et al., 2021).

As shown in Figure 1, the rewards of SUCBVI and SRF-UCRL converge to the optimal rewards under safety constraints (denoted by "safe optimal reward"). In contrast, the rewards of UCBVI and RF-UCRL converge to the optimal rewards without safety constraints (denoted by "unconstrained optimal reward"), and those of the CMDP algorithms converge to the reward of a policy with low expected violation (denoted by "constrained optimal reward"). For Safe-RFE-SW, the expected violation of the output policy of SRF-UCRL converges to zero while that of RF-UCRL does not. This corroborates the ability of SRF-UCRL to find policies that simultaneously achieve safety and high rewards.

8 Conclusion

In this paper, we investigate a novel safe reinforcement learning problem with step-wise safety constraints. We first provide an algorithmic framework SUCBVI that achieves both $\widetilde{\mathcal{O}}(\sqrt{H^{3}SAT})$ regret and $\widetilde{\mathcal{O}}(\sqrt{ST})$ step-wise violation, or an $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ gap-dependent violation that is independent of $T$. Then, we provide two lower bounds to validate the optimality of SUCBVI in both violation and regret in terms of $S$ and $T$. Further, we extend our framework to safe RFE with step-wise violation constraints and provide an algorithm SRF-UCRL that identifies a near-optimal safe policy for any given reward function $r$ and guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation during exploration.

References

  • Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International conference on machine learning, pp.  22–31. PMLR, 2017.
  • Afsar et al. (2022) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1–38, 2022.
  • Alshiekh et al. (2018) Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Amani et al. (2021) Sanae Amani, Christos Thrampoulidis, and Lin Yang. Safe reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 243–253. PMLR, 2021.
  • Auer et al. (2008) Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
  • Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. Advances in neural information processing systems, 30, 2017.
  • Bura et al. (2021) Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. Safe exploration for constrained reinforcement learning with provable guarantees. arXiv preprint arXiv:2112.00885, 2021.
  • Chow et al. (2018) Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Dalal et al. (2018) Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
  • Ding et al. (2020) Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020.
  • Ding et al. (2021) Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pp.  3304–3312. PMLR, 2021.
  • Efroni et al. (2020) Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189, 2020.
  • Hoeffding (1994) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pp.  409–426. Springer, 1994.
  • Huang et al. (2022) Ruiquan Huang, Jing Yang, and Yingbin Liang. Safe exploration incurs nearly no additional sample complexity for reward-free rl. arXiv preprint arXiv:2206.14057, 2022.
  • Jin et al. (2020) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870–4879. PMLR, 2020.
  • Kalagarla et al. (2021) Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  8030–8037, 2021.
  • Kaufmann et al. (2021) Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pp.  865–891. PMLR, 2021.
  • Lanctot et al. (2019) Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
  • Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. PMLR, 2019.
  • Liu et al. (2021a) Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021a.
  • Liu et al. (2021b) Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021b.
  • Liu et al. (2020) Yongshuai Liu, Jiaxin Ding, and Xin Liu. Ipo: Interior-point policy optimization under constraints. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  4940–4947, 2020.
  • Ménard et al. (2021) Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021.
  • Miryoosefi & Jin (2022) Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pp. 15666–15698. PMLR, 2022.
  • Osband & Van Roy (2016) Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
  • Qiu et al. (2020) Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. Upper confidence primal-dual reinforcement learning for cmdp with adversarial loss. Advances in Neural Information Processing Systems, 33:15277–15287, 2020.
  • Shi et al. (2023) Ming Shi, Yingbin Liang, and Ness Shroff. A near-optimal algorithm for safe reinforcement learning under instantaneous hard constraints. arXiv preprint arXiv:2302.04375, 2023.
  • Simão et al. (2021) Thiago D Simão, Nils Jansen, and Matthijs TJ Spaan. Alwayssafe: Reinforcement learning without safety constraint violations during training. 2021.
  • Simchowitz & Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32, 2019.
  • Singh et al. (2020) Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • Sootla et al. (2022) Aivar Sootla, Alexander Cowen-Rivers, Jun Wang, and Haitham Bou Ammar. Enhancing safe exploration using safety state augmentation. Advances in Neural Information Processing Systems, 35:34464–34477, 2022.
  • Stooke et al. (2020) Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. In International Conference on Machine Learning, pp. 9133–9143. PMLR, 2020.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tessler et al. (2018) Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
  • Thomas et al. (2021) Garrett Thomas, Yuping Luo, and Tengyu Ma. Safe reinforcement learning by imagining the near future. Advances in Neural Information Processing Systems, 34:13859–13869, 2021.
  • Turchetta et al. (2020) Matteo Turchetta, Andrey Kolobov, Shital Shah, Andreas Krause, and Alekh Agarwal. Safe reinforcement learning via curriculum induction. Advances in Neural Information Processing Systems, 33:12151–12162, 2020.
  • Uchibe & Doya (2007) Eiji Uchibe and Kenji Doya. Constrained reinforcement learning from intrinsic and extrinsic rewards. In 2007 IEEE 6th International Conference on Development and Learning, pp.  163–168. IEEE, 2007.
  • Wachi & Sui (2020) Akifumi Wachi and Yanan Sui. Safe reinforcement learning in constrained markov decision processes. In International Conference on Machine Learning, pp. 9797–9806. PMLR, 2020.
  • Wang et al. (2022) Yixuan Wang, Simon Sinong Zhan, Ruochen Jiao, Zhilu Wang, Wanxin Jin, Zhuoran Yang, Zhaoran Wang, Chao Huang, and Qi Zhu. Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments. arXiv preprint arXiv:2209.15090, 2022.
  • Wei et al. (2022) Honghao Wei, Xin Liu, and Lei Ying. Triple-q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In International Conference on Artificial Intelligence and Statistics, pp.  3274–3307. PMLR, 2022.
  • Yu et al. (2022) Dongjie Yu, Haitong Ma, Shengbo Li, and Jianyu Chen. Reachability constrained reinforcement learning. In International Conference on Machine Learning, pp. 25636–25655. PMLR, 2022.
  • Yu et al. (2019) Ming Yu, Zhuoran Yang, Mladen Kolar, and Zhaoran Wang. Convergent policy optimization for safe reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Yu et al. (2021) Tiancheng Yu, Yi Tian, Jingzhao Zhang, and Suvrit Sra. Provably efficient algorithms for multi-objective competitive rl. In International Conference on Machine Learning, pp. 12167–12176. PMLR, 2021.
  • Zhao et al. (2020) Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp.  737–744. IEEE, 2020.

Appendix


Appendix A Related Works

Safe RL beyond CMDP-based Approaches

Apart from the CMDP-based approaches (Yu et al., 2019; Qiu et al., 2020; Efroni et al., 2020; Ding et al., 2021; Kalagarla et al., 2021; Simão et al., 2021; Liu et al., 2021b; Bura et al., 2021; Wei et al., 2022), there are several other ways to achieve safety in reinforcement learning. Control-based approaches (Berkenkamp et al., 2017; Chow et al., 2018; Liu et al., 2020; Wang et al., 2022) adopt barrier functions to keep the agent away from unsafe regions and impose hard constraints. Policy-optimization style approaches are also commonly studied to solve constrained RL problems (Uchibe & Doya, 2007; Achiam et al., 2017; Tessler et al., 2018; Liu et al., 2020; Stooke et al., 2020).

There are some other papers investigating the safety of RL problems. Alshiekh et al. (2018) represent safe states via a reactive system, and use shielding to compute and completely restrict the agent to a safe trajectory. The main difference between their work and ours is that we need to dynamically update the estimated safe states, while they require knowledge of the system mechanism and states to compute the shield. Dalal et al. (2018) restrict actions by projecting each action onto the closest safe action, which they achieve by solving a convex optimization problem over a continuous action set. However, they do not consider the situation where a state has no safe actions, i.e., where the convex optimization problem has no solution. Le et al. (2019) consider the decision-making problem with a pre-collected dataset. Turchetta et al. (2020) and Sootla et al. (2022) both consider cumulative cost constraints rather than step-wise constraints; the former uses a teacher that intervenes to keep the agent away from the unsafe region, while the latter encourages safe exploration by augmenting the state with a safety variable that measures safety during training.

RFE with Safety

Motivated by sparse reward signals in realistic applications, reward-free exploration (RFE) was proposed in Jin et al. (2020) and further developed in Kaufmann et al. (2021); Ménard et al. (2021). In RFE, the agent explores the environment without reward signals. After sufficient exploration, the reward function is given, and the agent needs to plan a near-optimal policy based on the knowledge collected during exploration. Safety is also important in RFE: we need to not only guarantee that the output policy is safe, but also ensure small violations during exploration.

Recently, Huang et al. (2022) studied RFE with safety constraints. Our work differs from theirs in the following aspects: (i) They allow different reward functions during and after exploration, and their agent can directly observe the true costs during training. In our work, the cost function is the same during and after exploration, but the agent can only observe the noisy cost of a state when she arrives at that state. (ii) They require prior knowledge of safe baseline policies, while we do not need such an assumption. (iii) They consider zero expected violation during exploration, while we focus on keeping small step-wise violations.

Appendix B Proofs of Main Results

B.1 Proof of Theorem 4.2

We start the proof with the confidence bound of optimistic estimation:

Lemma B.1.

With probability at least 1-\delta, the optimistic empirical estimate of the cost function, \overline{c}^{k}(s)=\hat{c}^{k}(s)-\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}, is no greater than the true cost c(s) for every episode k\in[K] and state s\in\mathcal{S}. In other words, defining the event

1={k[K]ands𝒮,c¯k(s)c(s)},\displaystyle\mathcal{E}_{1}=\left\{\forall\ k\in[K]\ \mbox{and}\ s\in\mathcal{S},\overline{c}^{k}(s)\leq c(s)\right\}, (6)

then Pr(1c)δ.\Pr(\mathcal{E}_{1}^{c})\leq\delta.

Proof.

Since we observe c(s)+εc(s)+\varepsilon with 1-subgaussian noise ε\varepsilon, by Hoeffding’s inequality Hoeffding (1994), for a fixed episode kk and state ss, the statement

|c^k(s)c(s)|2Nk(s)log(SKδ)\displaystyle|\hat{c}^{k}(s)-c(s)|\leq\sqrt{\frac{2}{N^{k}(s)}\log\left(\frac{SK}{\delta}\right)}

holds with probability at least 1δSK1-\frac{\delta}{SK}. Taking a union bound on all s𝒮s\in\mathcal{S} and k[K]k\in[K], we have

c(s)c^k(s)2Nk(s)log(SKδ)=c¯k(s),c(s)\geq\hat{c}^{k}(s)-\sqrt{\frac{2}{N^{k}(s)}\log\left(\frac{SK}{\delta}\right)}=\overline{c}^{k}(s),

which proves Lemma B.1. ∎
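For concreteness, the optimistic cost estimate and the induced unsafe-state test \overline{c}^{k}(s)>\tau can be computed as in the following minimal Python sketch (the array names cost_sum and visit_count are illustrative and not part of the algorithm's specification):

```python
import numpy as np

def optimistic_cost(cost_sum, visit_count, S, K, delta):
    """Optimistic cost estimate c_bar^k(s) = c_hat^k(s) - sqrt(2 log(SK/delta) / N^k(s))."""
    N = np.maximum(visit_count, 1)              # guard against unvisited states
    c_hat = cost_sum / N                        # empirical mean of the noisy cost observations
    bonus = np.sqrt(2.0 * np.log(S * K / delta) / N)
    return c_hat - bonus

def estimated_unsafe_states(cost_sum, visit_count, S, K, delta, tau):
    """U_H^k = { s : c_bar^k(s) > tau }; under the event E_1 this is a subset of the true unsafe set."""
    c_bar = optimistic_cost(cost_sum, visit_count, S, K, delta)
    return set(np.flatnonzero(c_bar > tau))
```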

Lemma B.2.

Define the events \mathcal{E}_{2} and \mathcal{E}_{3} as

2\displaystyle\mathcal{E}_{2} ={s,a,s,k,h,|^h(ss,a)h(ss,a)|\displaystyle=\left\{\forall\ s,a,s^{\prime},k,h,|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a)|\right.
2h(ss,a)Nhk(s,a)log(2S2AHKδ)+23Nhk(s,a)log(2S2AHKδ)}\displaystyle\left.\hskip 45.00008pt\leq\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s,a)}{N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}AHK}{\delta}\right)}+\frac{2}{3N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}AHK}{\delta}\right)\right\}
3\displaystyle\mathcal{E}_{3} ={k=1Kh=1H(Ph(Vh+1kVh+1πk)(shk,ahk)(Vh+1kVh+1πk)(sh+1k))H3K2log(1/δ)}\displaystyle=\left\{\sum_{k=1}^{K}\sum_{h=1}^{H}\left(P_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})-(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k})\right)\leq\sqrt{\frac{H^{3}K}{2}\log(1/\delta)}\right\}

Then Pr(2c)δ\Pr(\mathcal{E}_{2}^{c})\leq\delta, Pr(3c)δ\Pr(\mathcal{E}_{3}^{c})\leq\delta.

Proof.

This lemma is standard in the analysis of the UCBVI algorithm: \Pr(\mathcal{E}_{2}^{c})\leq\delta follows from Bernstein's inequality, and \Pr(\mathcal{E}_{3}^{c})\leq\delta follows from the Azuma–Hoeffding inequality. ∎

The following lemma shows the optimistic property we discussed in the paper.

Lemma B.3.

Under the event \mathcal{E}_{1}, we have \mathcal{U}_{h}^{k}\subseteq\mathcal{U}_{h} and A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s) for all s\notin\mathcal{U}_{h}^{k} and h\in[H].

Proof.

Under the event \mathcal{E}_{1}, for all s\in\mathcal{U}_{H}^{k} we have c(s)\geq\overline{c}^{k}(s)>\tau, so s\in\mathcal{U}_{H}. Moreover, from the definition of \Delta_{h}^{K}(s,a), we have \Delta^{K}_{h}(s,a)\subseteq\Delta_{h}(s,a) for every step h\in[H], because we only add observed next states to the estimated set \Delta_{h}^{K}(s,a) one by one.

Now we prove the lemma by induction. When h=H, \mathcal{U}_{H}^{k}\subseteq\mathcal{U}_{H}. Assume that \mathcal{U}_{h+1}^{k}\subseteq\mathcal{U}_{h+1}; then

A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\Delta_{h}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}
\subseteq\{a\in\mathcal{A}\mid\Delta_{h}^{K}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}
\subseteq\{a\in\mathcal{A}\mid\Delta_{h}^{K}(s,a)\cap\mathcal{U}_{h+1}^{k}=\emptyset\}
=A_{h}^{k,safe}(s),

and

\mathcal{U}_{h}^{k}=\mathcal{U}_{h+1}^{k}\cup\{s\mid\forall\ a\in\mathcal{A},\ \Delta_{h}^{K}(s,a)\cap\mathcal{U}_{h+1}^{k}\neq\emptyset\}
\subseteq\mathcal{U}_{h+1}\cup\{s\mid\forall\ a\in\mathcal{A},\ \Delta_{h}(s,a)\cap\mathcal{U}_{h+1}\neq\emptyset\}
\subseteq\mathcal{U}_{h}.

Hence we complete the proof by induction. ∎
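The backward construction underlying Lemma B.3 can be summarized by the following Python sketch, which computes the estimated unsafe sets \mathcal{U}_{h}^{k} and safe action sets A_{h}^{k,safe}(s) from the estimated reachable sets (the data layout is an assumption made for illustration):

```python
def backward_safe_sets(unsafe_H, delta_hat, S, A, H):
    """Backward recursion: a state joins U_h^k if every action can reach U_{h+1}^k.
    delta_hat[h][(s, a)] holds the next states observed so far, a subset of the true Delta_h(s, a)."""
    U = {H: set(unsafe_H)}                       # level-H estimated unsafe states {s : c_bar^k(s) > tau}
    safe_actions = {}
    for h in range(H - 1, 0, -1):
        U[h] = set(U[h + 1])
        for s in range(S):
            safe = [a for a in range(A)
                    if not (delta_hat[h].get((s, a), set()) & U[h + 1])]
            safe_actions[(h, s)] = safe
            if not safe:                         # no action avoids U_{h+1}^k, so s is estimated unsafe
                U[h].add(s)
    return U, safe_actions
```

Because delta_hat under-approximates the true reachable sets and U[H] under-approximates the true unsafe states, the returned sets are optimistic in the sense of Lemma B.3.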

Similarly to the UCBVI algorithm, we prove that Q_{h}^{k}(s,a) and V_{h}^{k}(s) are always no smaller than Q_{h}^{*}(s,a) and V_{h}^{*}(s).

Lemma B.4.

With probability at least 1δ1-\delta, choosing b(n,δ)=7Hln(5SAT/δ)nb(n,\delta)=7H\sqrt{\frac{\ln(5SAT/\delta)}{n}}, we will have

Qhk(s,a)Qh(s,a),Vhk(s)Vh(s),\displaystyle Q_{h}^{k}(s,a)\geq Q_{h}^{*}(s,a),\ \ V_{h}^{k}(s)\geq V_{h}^{*}(s),

for any state s\in\mathcal{S}, action a\in\mathcal{A}, episode k\in[K] and step h\in[H].

Proof.

We prove this lemma by induction. First, consider h=H+1: then Q_{H+1}^{*}(s,a)=Q_{H+1}^{k}(s,a)=0 and V_{H+1}^{*}(s)=V_{H+1}^{k}(s)=0 for all k\in[K], so the statement holds for h=H+1. Now fix an episode k and assume the statement holds for step h+1. If Q_{h}^{k}(s,a)=H, the claim is immediate since Q_{h}^{*}(s,a)\leq H. Otherwise Q_{h}^{k}(s,a)<H, and

Q_{h}^{k}(s,a)-Q_{h}^{*}(s,a)=\hat{\mathbb{P}}_{h}^{k}(V_{h+1}^{k}-V_{h+1}^{*})(s,a)+(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a)+b(N_{h}^{k}(s,a),\delta)
\geq(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a)+b(N_{h}^{k}(s,a),\delta)
\geq 0,

where the first inequality follows from the induction hypothesis V_{h+1}^{k}\geq V_{h+1}^{*}, and the second inequality follows from the Chernoff–Hoeffding inequality, which gives (\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a)\geq-b(N_{h}^{k}(s,a),\delta). Then for s\in\mathcal{U}_{h}^{k},

V_{h}^{k}(s)=\max_{a\in\mathcal{A}}Q_{h}^{k}(s,a)\geq\max_{a\in\mathcal{A}}Q_{h}^{*}(s,a)\geq V_{h}^{*}(s).

For s\notin\mathcal{U}_{h}^{k}, Lemma B.3 gives A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s). Then

Vhk(s)=maxaAhk,safe(s)Qhk(s,a)\displaystyle V_{h}^{k}(s)=\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a) maxaAhk,safe(s)Qh(s,a)\displaystyle\geq\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{*}(s,a)
maxaAhsafe(s)Qh(s,a)\displaystyle\geq\max_{a\in A_{h}^{safe}(s)}Q_{h}^{*}(s,a) (7)
=Vh(s).\displaystyle=V_{h}^{*}(s).

where Eq. (7) holds because A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s), i.e., by our optimistic exploration mechanism (Lemma B.3). ∎

Regret

Now we can start to prove our main theorem.

Proof.

Given the preceding lemmas, the proof follows the standard analysis of the UCBVI algorithm. Note that the regret can be bounded via the optimistic value functions:

R(K)=k=1K(V1(s1)V1πk(s1))k=1K(V1k(s1)V1πk(s1)).\displaystyle R(K)=\sum_{k=1}^{K}(V_{1}^{*}(s_{1})-V_{1}^{\pi^{k}}(s_{1}))\leq\sum_{k=1}^{K}(V_{1}^{k}(s_{1})-V_{1}^{\pi^{k}}(s_{1})).

Because the action taken by \pi^{k} at state s_{h}^{k} coincides with the action a_{h}^{k} taken at episode k, we have

Vhk(shk)Vhπk(shk)=(QhkQhπk)(shk,ahk),\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})=(Q_{h}^{k}-Q_{h}^{\pi^{k}})(s_{h}^{k},a_{h}^{k}),

where ahk=πhk(shk)a_{h}^{k}=\pi_{h}^{k}(s_{h}^{k}). Then

V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})=(Q_{h}^{k}-Q_{h}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})
\leq(\hat{\mathbb{P}}_{h}^{k}(V_{h+1}^{k})-\mathbb{P}_{h}(V_{h+1}^{\pi^{k}}))(s_{h}^{k},a_{h}^{k})+b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta)
=(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})+(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})
\ \ \ \ \ \ \ +\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})+b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta). (8)

First, by the Chernoff–Hoeffding bound, (\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})\leq b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta). Second, by the standard analysis of UCBVI (Theorem 1 therein), under the event \mathcal{E}_{2} we have

(^hkh)(Vh+1kVh+1)(shk,ahk)\displaystyle\ \ \ \ \ (\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})
s𝒮(2h(sshk,ahk)ιNhk(shk,ahk))+2ι3Nhk(shk,ahk))(Vh+1kVh+1)(s)\displaystyle\leq\sum_{s^{\prime}\in\mathcal{S}}\left(\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s_{h}^{k},a_{h}^{k})\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k}))}}+\frac{2\iota}{3N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
s𝒮(h(sshk,ahk)H+Hι2Nhk(shk,ahk)+2ι3Nhk(shk,ahk))(Vh+1kVh+1)(s)\displaystyle\leq\sum_{s^{\prime}\in\mathcal{S}}\left(\frac{\mathbb{P}_{h}(s^{\prime}\mid s_{h}^{k},a_{h}^{k})}{H}+\frac{H\iota}{2N_{h}^{k}(s_{h}^{k},a_{h}^{k})}+\frac{2\iota}{3N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
1Hh(Vh+1kVh+1)(shk,ahk)+2H2SιNhk(shk,ahk)\displaystyle\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}
1Hh(Vh+1kVh+1πk)(shk,ahk)+2H2SιNhk(shk,ahk),\displaystyle\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})},

where the first inequality holds under the event \mathcal{E}_{2}, the second inequality uses \sqrt{ab}\leq\frac{a+b}{2}, and the last inequality holds because V_{h+1}^{\pi^{k}}\leq V_{h+1}^{*}. Combining with Eq. (8), we get

Vhk(shk)Vhπk(shk)(1+1H)h(Vh+1kVh+1πk)(shk,ahk)+2b(Nhk(shk,ahk),δ)+2H2SιNhk(shk,ahk).\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})\leq\left(1+\frac{1}{H}\right)\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})+2b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}.

Denote

\alpha_{h}^{k}=\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})-(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k}).
βhk\displaystyle\beta_{h}^{k} =H2SιNhk(shk,ahk).\displaystyle=\frac{H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}.
γhk\displaystyle\gamma_{h}^{k} =b(Nhk(shk,ahk),δ).\displaystyle=b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta).

we have

Vhk(shk)Vhπk(shk)\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k}) (1+1H)(Vh+1kVh+1πk)(sh+1k)+(1+1H)αhk+2βhk+2γhk\displaystyle\leq\left(1+\frac{1}{H}\right)(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k})+\left(1+\frac{1}{H}\right)\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k}
(1+1H)(Vh+1kVh+1πk)(sh+1k)+2αhk+2βhk+2γhk,\displaystyle\leq\left(1+\frac{1}{H}\right)(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k})+2\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k},

then by recursion,

V1k(s1)V1πk(s1)\displaystyle V_{1}^{k}(s_{1})-V_{1}^{\pi^{k}}(s_{1}) h=1H(1+1H)H(2αhk+2βhk+2γhk)\displaystyle\leq\sum_{h=1}^{H}\left(1+\frac{1}{H}\right)^{H}(2\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k})
eh=1H(2αhk+2βhk+2γhk).\displaystyle\leq e\cdot\sum_{h=1}^{H}(2\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k}).

Under the event \mathcal{E}_{3} (which holds by the Azuma–Hoeffding inequality), \sum_{k=1}^{K}\sum_{h=1}^{H}\alpha_{h}^{k}\leq\sqrt{\frac{H^{3}K}{2}\log(1/\delta)}, and

k=1Kh=1Hβhk=k=1Kh=1HH2SιNhk(shk,ahk)H3S2Aιlog(HK).\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\beta_{h}^{k}=\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\leq H^{3}S^{2}A\iota\log(HK). (9)

Also, for γhk\gamma_{h}^{k}, we can get

k=1Kh=1Hγhk\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\gamma_{h}^{k} =k=1Kh=1H7HιNhk(shk,ahk)\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H}7H\sqrt{\frac{\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}}
\leq 7H\sqrt{\iota}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\sum_{i=1}^{N^{K}(s,a)}\sqrt{1/i}
\leq 14H\sqrt{\iota}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\sqrt{N^{K}(s,a)}
\leq 14H\sqrt{\iota SAT}.

Then the regret is bounded by \widetilde{\mathcal{O}}(\sqrt{H^{2}SAT}).
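As a quick numerical sanity check of the two elementary facts used above, namely (1+1/H)^{H}\leq e and \sum_{i=1}^{N}i^{-1/2}\leq 2\sqrt{N}, one may run the following snippet (the parameter values are arbitrary):

```python
import math

for H in (2, 5, 10, 100):
    assert (1 + 1 / H) ** H <= math.e            # factor from unrolling the recursion over h

for N in (1, 10, 1000, 100000):
    partial = sum(1 / math.sqrt(i) for i in range(1, N + 1))
    assert partial <= 2 * math.sqrt(N)           # used when summing 1/sqrt(N) over visit counts
print("checks passed")
```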

Violation

Now we turn to consider the violation during the process.

First, suppose that the agent follows \pi^{k} at episode k and that s_{h+1}^{k}\in\Delta_{h}^{K}(s_{h}^{k},a_{h}^{k}) for all steps h\in[H-1]. Then, from the definition of \{\mathcal{U}_{h}^{k}\}_{1\leq h\leq H+1} and \Delta_{h}^{K}(s_{h}^{k},a_{h}^{k}), all visited states satisfy s_{1}^{k},s_{2}^{k},\cdots,s_{H}^{k}\notin\mathcal{U}_{H}^{k}. Since \mathcal{U}_{H}^{k}=\{s\mid\overline{c}^{k}(s)>\tau\}, for each state s\in\{s_{1}^{k},s_{2}^{k},\cdots,s_{H}^{k}\} we have

c(s)τc¯k(s)+22Nk(s)log(SK/δ)τ22Nk(s)log(SK/δ),c(s)-\tau\leq\overline{c}^{k}(s)+2\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}-\tau\leq 2\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)},

where the first inequality follows from the two-sided Hoeffding bound used in the proof of Lemma B.1, i.e., c(s)\leq\hat{c}^{k}(s)+\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}=\overline{c}^{k}(s)+2\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}, and the second inequality holds because s\notin\mathcal{U}_{H}^{k} implies \overline{c}^{k}(s)\leq\tau. Then

h=1H(c(shk)τ)+h=1H22Nk(shk)log(SK/δ).\displaystyle\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}\leq\sum_{h=1}^{H}2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}.

Next, consider the situation where s_{h+1}^{k}\notin\Delta_{h}^{K}(s_{h}^{k},a_{h}^{k}) for some step h\in[H] at episode k\in[K]. In this case we can only bound the violation of this episode by H, because the subsequent states are no longer under control. Fortunately, in this situation at least one new state s_{h+1}^{k} is added to \Delta_{h}^{k}(s_{h}^{k},a_{h}^{k}), so |\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k})|\geq|\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})|+1, and the summation satisfies

(h,s,a)[H]×𝒮×𝒜|Δhk+1(s,a)|(h,s,a)[H]×𝒮×𝒜|Δhk(s,a)|+1.\sum_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}|\Delta_{h}^{k+1}(s,a)|\geq\sum_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}|\Delta_{h}^{k}(s,a)|+1.

Note that for any episode kk, the summation has a trivial upper bound

(h,s,a)[H]×𝒮×𝒜|Δhk(s,a)|S2AH,\sum_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}|\Delta_{h}^{k}(s,a)|\leq S^{2}AH,

so this situation occurs in at most S^{2}AH episodes and contributes at most S^{2}AH^{2} extra violation in total.
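The counting argument in the preceding paragraph can be viewed as a potential-function argument: every episode containing a "surprise" transition strictly enlarges one of the HSA estimated reachable sets, each of which contains at most S states. The following hypothetical Python sketch makes the bookkeeping explicit (the trajectory format is assumed for illustration only):

```python
def count_surprise_episodes(trajectories, H, S, A):
    """Each episode with an observed next state missing from Delta_h^k(s, a) adds at least
    one new element; the total size of all sets is at most S^2 * A * H, which caps the count."""
    delta_hat = {}                       # (h, s, a) -> set of observed next states
    surprise_episodes = 0
    for episode in trajectories:         # episode: list of tuples (h, s, a, s_next)
        surprised = False
        for (h, s, a, s_next) in episode:
            reachable = delta_hat.setdefault((h, s, a), set())
            if s_next not in reachable:
                surprised = True
                reachable.add(s_next)
        surprise_episodes += surprised
    assert surprise_episodes <= S * S * A * H
    return surprise_episodes
```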

Hence the total violation can be upper bounded by

k=1Kh=1H(c(shk)τ)+\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+} S2AH2+k=1Kh=1H(22Nk(shk)log(SK/δ)1)\displaystyle\leq S^{2}AH^{2}+\sum_{k=1}^{K}\sum_{h=1}^{H}\left(2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}\wedge 1\right)
S2AH2+22log(SK/δ)k=1Kh=1H(1Nk(shk)1)\displaystyle\leq S^{2}AH^{2}+2\sqrt{2\log(SK/\delta)}\sum_{k=1}^{K}\sum_{h=1}^{H}\left(\sqrt{\frac{1}{N^{k}(s_{h}^{k})\vee 1}}\right)
S2AH2+22log(SK/δ)s𝒮i=1K(Ni+1(s)Ni(s)Ni(s)1)\displaystyle\leq S^{2}AH^{2}+2\sqrt{2\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sum_{i=1}^{K}\left(\frac{N^{i+1}(s)-N^{i}(s)}{\sqrt{N^{i}(s)\vee 1}}\right) (10)
S2AH2+28log(SK/δ)s𝒮NK(s)\displaystyle\leq S^{2}AH^{2}+2\sqrt{8\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sqrt{N^{K}(s)}
S2AH2+28log(SK/δ)ST\displaystyle\leq S^{2}AH^{2}+2\sqrt{8\log(SK/\delta)}\sqrt{ST}
=𝒪~(S2AH2+ST)\displaystyle=\widetilde{\mathcal{O}}(S^{2}AH^{2}+\sqrt{ST})

with probability at least 1-3\delta. The inequality following Eq. (10) is derived from Lemma 19 in Auer et al. (2008). Replacing \delta with \delta/3 completes the proof of the violation bound.

Moreover, consider the gap-dependent bound. Recall that d=\min_{s\in\mathcal{U}}(c(s)-\tau)>0, where \mathcal{U}=\{s\mid c(s)>\tau\} denotes the set of unsafe states. For the state s_{h}^{k}, if s_{h}^{k}\notin\mathcal{U}_{H}^{k}, we have

c(shk)τc¯k(shk)+22Nk(shk)log(SK/δ)τ22Nk(shk)log(SK/δ).\displaystyle c(s_{h}^{k})-\tau\leq\overline{c}^{k}(s_{h}^{k})+2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}-\tau\leq 2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}.

Thus, whenever c(s_{h}^{k})>\tau, we get

N^{k}(s_{h}^{k})\leq\frac{8}{(c(s_{h}^{k})-\tau)^{2}}\log(SK/\delta).

Now let \mathcal{K} denote the set of episodes k such that s_{h+1}^{k}\in\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k}) for all h\in[H-1]. For each s\in\mathcal{U}, let k_{s}=\max\{k:\exists\ h\in[H],s=s_{h}^{k}\} be the last episode in which s is visited. Then the total violation can be bounded by

\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}\leq S^{2}AH^{2}+\sum_{k\in\mathcal{K}}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}
\leq S^{2}AH^{2}+\sum_{s\in\mathcal{U}}N^{k_{s}}(s)(c(s)-\tau)
\leq S^{2}AH^{2}+\sum_{s\in\mathcal{U}}\frac{8}{(c(s)-\tau)^{2}}(c(s)-\tau)\log(SK/\delta)
\leq S^{2}AH^{2}+\frac{8S}{d}\log(SK/\delta).

B.2 Proof of Theorem 6.2

First, applying the same argument as in Appendix B.1, the violation during the exploration phase can be bounded by

k=1Kh=1H(c(shk)τ)+\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+} S2AH2+k=1Kh=1H2Nk(shk)log(SK/δ)\displaystyle\leq S^{2}AH^{2}+\sum_{k=1}^{K}\sum_{h=1}^{H}\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}
S2AH2+2log(SK/δ)s𝒮i=1NK(s)1i\displaystyle\leq S^{2}AH^{2}+\sqrt{2\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sum_{i=1}^{N^{K}(s)}\sqrt{\frac{1}{i}}
S2AH2+8log(SK/δ)s𝒮NK(s)\displaystyle\leq S^{2}AH^{2}+\sqrt{8\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sqrt{N^{K}(s)}
=𝒪~(S2AH2+ST).\displaystyle=\widetilde{\mathcal{O}}(S^{2}AH^{2}+\sqrt{ST}).

Now we prove that Algorithm 2 returns an \varepsilon-safe and \varepsilon-optimal policy. Lemma B.5 shows that \overline{W}_{1}^{k}(s_{1},\pi_{1}^{k+1}(s_{1})) bounds the estimation error for any reward function r.

Lemma B.5.

For any feasible policy \pi\in\Pi^{k}, step h, episode k and reward function r, define the estimation error as \hat{e}^{k,\pi}_{h}(s,a;r)=|\hat{Q}_{h}^{k,\pi}(s,a;r)-Q_{h}^{k,\pi}(s,a;r)|. Then, with probability at least 1-\delta, we have

e^hk,π(s,a;r)W¯hk(s,a).\hat{e}^{k,\pi}_{h}(s,a;r)\leq\overline{W}_{h}^{k}(s,a).

We first prove Lemma B.5. The following lemma provides the high-probability concentration event that we need.

Lemma B.6 (Kaufmann et al. (2021)).

Define the event as

\mathcal{E}_{4}=\left\{\forall\ k,h,s,a:\ \mathrm{KL}\big(\hat{\mathbb{P}}_{h}^{k}(\cdot\mid s,a),\mathbb{P}_{h}(\cdot\mid s,a)\big)\leq\frac{\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\right\},

where \gamma(n,\delta)=2\log(SAHK/\delta)+(S-1)\log(e(1+n/(S-1))). Then \Pr(\mathcal{E}_{4}^{c})\leq\delta.
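For concreteness, the threshold function \gamma(n,\delta) and the resulting KL confidence radius in \mathcal{E}_{4} can be evaluated as follows (a direct transcription of the formula above; the sketch assumes S\geq 2):

```python
import math

def gamma_threshold(n, S, A, H, K, delta):
    """gamma(n, delta) = 2 log(SAHK / delta) + (S - 1) log(e (1 + n / (S - 1)))."""
    return 2 * math.log(S * A * H * K / delta) \
        + (S - 1) * math.log(math.e * (1 + n / (S - 1)))

def kl_radius(n, S, A, H, K, delta):
    """KL confidence radius gamma(N_h^k(s,a), delta) / N_h^k(s,a) used in the event E_4."""
    return gamma_threshold(n, S, A, H, K, delta) / max(n, 1)
```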

By Bellman equations,

Q^hπ(s,a;r)Qhπ(s,a;r)=s(^h(ss,a)h(ss,a))Qh+1π(s,π(s);r)\displaystyle\hat{Q}_{h}^{\pi}(s,a;r)-Q_{h}^{\pi}(s,a;r)=\sum_{s^{\prime}}(\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a))Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)
+s^h(ss,a)(Q^h+1π(s,π(s);r)Qh+1π(s,π(s);r)).\displaystyle\ \ \ +\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)(\hat{Q}_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)-Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)).

By Pinsker's inequality and the event \mathcal{E}_{4},

e^hπ(s,a;r)\displaystyle\hat{e}_{h}^{\pi}(s,a;r)\leq s|^h(ss,a)h(ss,a)|Qh+1π(s,π(s);r)\displaystyle\sum_{s^{\prime}}|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a)|Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)
+s^h(ss,a)|Q^h+1π(s,π(s);r)Qh+1π(s,π(s);r)|\displaystyle\ \ \ +\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)|\hat{Q}_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)-Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)|
\leq H\|\hat{\mathbb{P}}_{h}(\cdot\mid s,a)-\mathbb{P}_{h}(\cdot\mid s,a)\|_{1}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r)
\leq H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r).

We proceed by induction. If \hat{e}_{h+1}^{\pi}(s,a;r)\leq\overline{W}_{h+1}(s,a) for all (s,a), then for s\notin\mathcal{U}_{h}^{k} and a\in A_{h}^{k,safe}(s) we have

\hat{e}_{h}^{\pi}(s,a;r)\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r)\right\}
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAh+1k,safe(s)e^h+1π(s,b;r)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A_{h+1}^{k,safe}(s^{\prime})}\hat{e}_{h+1}^{\pi}(s^{\prime},b;r)\right\} (11)
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAh+1k,safe(s)W¯h+1(s,b)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A_{h+1}^{k,safe}(s^{\prime})}\overline{W}_{h+1}(s^{\prime},b)\right\}
W¯h(s,a).\displaystyle\leq\overline{W}_{h}(s,a).

The second inequality holds because \hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)=0 for all s^{\prime}\in\mathcal{U}_{h+1}^{k}, and for s^{\prime}\notin\mathcal{U}_{h+1}^{k} the feasible policy satisfies \pi_{h+1}(s^{\prime})\in A_{h+1}^{k,safe}(s^{\prime}). The third inequality holds by the induction hypothesis, and the last inequality holds by the definition of \overline{W}_{h}(s,a).

For any other state-action pair (s,a), we have

\hat{e}_{h}^{\pi}(s,a;r)\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r)\right\}
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAe^h+1π(s,b;r)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A}\hat{e}_{h+1}^{\pi}(s^{\prime},b;r)\right\}
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAW¯h+1(s,b)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A}\overline{W}_{h+1}(s^{\prime},b)\right\}
W¯h(s,a).\displaystyle\leq\overline{W}_{h}(s,a).

This proves Lemma B.5. Now assume the algorithm terminates at episode K+1. For any feasible policy \pi\in\Pi^{K}, writing \hat{V} for the value function under the estimated model \hat{M}, we get

|V1π(s1)V^1π(s1)|\displaystyle|V_{1}^{\pi}(s_{1})-\hat{V}_{1}^{\pi}(s_{1})| =|Q1π(s1,π1(s1))Q^1π(s1,π1(s1))|\displaystyle=|Q_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))-\hat{Q}_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))|
\leq\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1}))
\leq\overline{W}_{1}^{K}(s_{1},\pi^{K+1}_{1}(s_{1}))
ε/2.\displaystyle\leq\varepsilon/2.

where the second inequality holds because \pi^{K+1} is the greedy policy with respect to \overline{W}_{1}^{K} and \pi lies in the feasible policy set \Pi^{K} at episode K. Then, for any reward function r, let \hat{\pi}^{*}\in\Pi^{K} denote the optimal feasible policy in the estimated MDP \hat{M}. Recalling that \pi^{*}\in\Pi^{K}\subseteq\Pi^{*}, we have

V1π(s1)V1π^(s1)\displaystyle V_{1}^{\pi^{*}}(s_{1})-V_{1}^{\hat{\pi^{*}}}(s_{1}) V^1π(s1)V1π^(s1)+ε/2\displaystyle\leq\hat{V}_{1}^{\pi^{*}}(s_{1})-V_{1}^{\hat{\pi^{*}}}(s_{1})+\varepsilon/2
\leq\hat{V}_{1}^{\hat{\pi}^{*}}(s_{1})-V_{1}^{\hat{\pi}^{*}}(s_{1})+\varepsilon/2
ε.\displaystyle\leq\varepsilon.

Now we consider the expected violation. We show that, for every feasible policy \pi\in\Pi^{K}, the expected violation is upper bounded by 2\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1})).

We first consider the event for feasible policy π\pi

𝒫π={h[H1],sh+1ΔhK(sh,ah)sh,ahπ}.\displaystyle{\mathcal{P}}_{\pi}=\{\forall\ h\in[H-1],s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})\mid s_{h},a_{h}\sim\pi\}.

Assume that this event holds. Because \pi\in\Pi^{K} is feasible with respect to \Delta_{h}^{K}(s,a), we have s_{h}\notin\mathcal{U}_{h}^{K} for all h\in[H]. Then

(c(s_{h})-\tau)_{+}\leq\sqrt{\frac{2}{N^{K}(s_{h})}\log(SK/\delta)},

and under the event 𝒫π{\mathcal{P}}_{\pi}

𝔼π[h=1H(c(sh)τ)+]𝔼π[h=1H2NK(sh)log(SKδ)].\displaystyle\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]\leq\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}\sqrt{\frac{2}{N^{K}(s_{h})}\log\left(\frac{SK}{\delta}\right)}\right]. (12)

Now define

\overline{F}_{h}^{\pi}(s,a)
=\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{F}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}.

Since \sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}\leq H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}, we have \overline{F}_{h}^{\pi}(s,a)\leq\overline{W}_{h}(s,a) by recursion; the argument is the same as in the proof of Lemma B.5.

Also denote

F_{h}^{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H}\sqrt{\frac{2}{N^{K}(s_{h^{\prime}})}\log\left(\frac{SK}{\delta}\right)}\ \Bigg{|}\ s_{h}=s,a_{h}=a\right].

Next we prove that \overline{F}_{h}^{\pi}(s,a)\geq F_{h}^{\pi}(s,a), by induction. When h=H+1, F_{H+1}^{\pi}(s,a)=\overline{F}_{H+1}^{\pi}(s,a)=0. Now assume that \overline{F}_{h+1}^{\pi}(s,a)\geq F_{h+1}^{\pi}(s,a); then

F¯hπ(s,a)\displaystyle\ \ \ \ \ \overline{F}_{h}^{\pi}(s,a)
=\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{F}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
\geq\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)F_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
\geq\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\|\hat{\mathbb{P}}_{h}(\cdot\mid s,a)-\mathbb{P}_{h}(\cdot\mid s,a)\|_{1}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)F_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
\geq\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)F_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
=Fhπ(s,a).\displaystyle=F_{h}^{\pi}(s,a).

Thus, by induction, F_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))\leq\overline{F}_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))\leq\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1})), where the last inequality again uses \sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}\leq H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}.

So far we have shown that, under the event \mathcal{P}_{\pi}, the expected violation is upper bounded by \overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1})). Next we show that \Pr(\mathcal{P}_{\pi}^{c}) is small. Recalling that \mathcal{P}_{\pi}=\{\forall\ h\in[H-1],s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})\mid\pi\}, we define

G_{h}^{\pi}(s,a)=\Pr\{\exists\ h^{\prime}\in[h,H-1],s_{h^{\prime}+1}\notin\Delta_{h^{\prime}}^{K}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a\}

and

G¯hπ(s,a)\displaystyle\ \ \ \ \ \overline{G}_{h}^{\pi}(s,a)
=min{H,SHlog(S2HA/δ)Nh(s,a)+Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)G¯h+1π(s,πh+1(s))}.\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}.

Then by recursion, we can easily show that G¯hπ(s,a)W¯hK(s,a)\overline{G}_{h}^{\pi}(s,a)\leq\overline{W}_{h}^{K}(s,a) by log(S2HA/δ)γ(Nh(s,a),δ)\log(S^{2}HA/\delta)\leq\gamma(N_{h}(s,a),\delta).

Lemma B.7.

For all h[H]h\in[H] and feasible policy π\pi, we have G¯hπ(s,a)HGhπ(s,a)\overline{G}_{h}^{\pi}(s,a)\geq HG_{h}^{\pi}(s,a).

Proof.

Define G~hπ(s,a)\widetilde{G}_{h}^{\pi}(s,a) as

G~hπ(s,a)=min{H,SHlog(S2HA/δ)Nh(s,a)+sh(ss,a)G~h+1π(s,πh+1(s))}.\displaystyle\widetilde{G}_{h}^{\pi}(s,a)=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}.

We will show that \overline{G}_{h}^{\pi}(s,a)\geq\widetilde{G}_{h}^{\pi}(s,a)\geq HG_{h}^{\pi}(s,a), and we prove both inequalities by induction. First, for h=H+1, both inequalities hold trivially. Now assume that they hold for h+1; then

G¯hπ(s,a)\displaystyle\ \ \ \ \ \overline{G}_{h}^{\pi}(s,a)
=min{H,SHlog(S2HA/δ)Nh(s,a)+Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)G¯h+1π(s,πh+1(s))}\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
min{H,SHlog(S2HA/δ)Nh(s,a)+Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)G~h+1π(s,πh+1(s))}\displaystyle\geq\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
min{H,SHlog(S2HA/δ)Nh(s,a)+H^h(ss,a)h(ss,a)1\displaystyle\geq\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a)\|_{1}\right.
+s^h(ss,a)G~h+1π(s,πh+1(s))}\displaystyle\hskip 200.0003pt\left.+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
min{H,SHlog(S2HA/δ)Nh(s,a)+sh(ss,a)G~h+1π(s,πh+1(s))}\displaystyle\geq\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
=G~hπ(s,a).\displaystyle=\widetilde{G}_{h}^{\pi}(s,a).

Now to prove that G~hπ(s,a)HGhπ(s,a)\widetilde{G}_{h}^{\pi}(s,a)\geq HG_{h}^{\pi}(s,a), we need the following lemma.

Lemma B.8.

Fix a step h\in[H] and a state-action pair (s,a)\in\mathcal{S}\times\mathcal{A}. Then, with probability at least 1-\delta, every s^{\prime}\in\Delta_{h}(s,a) with P(s^{\prime}\mid s,a)\geq\frac{\log(S/\delta)}{N_{h}(s,a)} satisfies s^{\prime}\in\Delta^{K}_{h}(s,a).

Proof.

Since we draw N_{h}(s,a) samples from P_{h}(\cdot\mid s,a), the state s^{\prime} can lie outside \Delta^{K}_{h}(s,a) only if s^{\prime} is never sampled. The probability that s^{\prime} is never sampled is (1-p)^{N_{h}(s,a)}, where p=\mathbb{P}_{h}(s^{\prime}\mid s,a). When p\geq\frac{\log(S/\delta)}{N_{h}(s,a)}, this probability is at most

(1p)Nh(s,a)(1log(S/δ)Nh(s,a))Nh(s,a)elog(S/δ)=δS.(1-p)^{N_{h}(s,a)}\leq\left(1-\frac{\log(S/\delta)}{N_{h}(s,a)}\right)^{N_{h}(s,a)}\leq e^{-\log(S/\delta)}=\frac{\delta}{S}.

Taking a union bound over all possible next states s^{\prime} proves the lemma. ∎
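The bound in Lemma B.8 can be checked numerically: if a next state has probability p\geq\log(S/\delta)/N_{h}(s,a), the chance that it is never observed among N_{h}(s,a) samples is at most \delta/S. A small Monte Carlo check with arbitrary parameters:

```python
import math
import random

def prob_never_sampled(p, n, trials=100000, seed=0):
    """Monte Carlo estimate of (1 - p)^n, the probability that an event of probability p
    never occurs in n independent draws."""
    rng = random.Random(seed)
    misses = sum(all(rng.random() >= p for _ in range(n)) for _ in range(trials))
    return misses / trials

S, delta, n = 10, 0.1, 50
p = math.log(S / delta) / n                 # threshold probability from Lemma B.8
assert (1 - p) ** n <= delta / S            # exact bound: (1-p)^n <= exp(-log(S/delta)) = delta/S
print(prob_never_sampled(p, n))             # empirical estimate, close to (1 - p)^n
```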

By the Lemma B.8, define the event as

\mathcal{E}_{5}=\left\{\forall\ h,s,a\ \mbox{and}\ s^{\prime}\in\Delta_{h}(s,a)\ \mbox{with}\ P(s^{\prime}\mid s,a)\geq\frac{\log(S^{2}HA/\delta)}{N_{h}(s,a)}:s^{\prime}\in\Delta_{h}^{K}(s,a)\right\}.

Then Pr(5c)δ\Pr(\mathcal{E}_{5}^{c})\leq\delta. Under the event 5\mathcal{E}_{5}, at step h[H]h\in[H] and state-action pair (sh,ah)(s_{h},a_{h}), we have

(sh+1ΔhK(sh,ah))Slog(S2HA/δ)Nh(sh,ah).\displaystyle\mathbb{P}(s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h}))\leq\frac{S\log(S^{2}HA/\delta)}{N_{h}(s_{h},a_{h})}. (13)

This inequality holds because, under the event \mathcal{E}_{5}, the event s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h}) can only be caused by next states s^{\prime} with P(s^{\prime}\mid s_{h},a_{h})\leq\frac{\log(S^{2}HA/\delta)}{N_{h}(s_{h},a_{h})}, and there are at most S such states. We first decompose G_{h}^{\pi}(s,a) as

H\cdot G_{h}^{\pi}(s,a)=H\cdot\Pr\{\exists\ h^{\prime}\in[h,H-1],s_{h^{\prime}+1}\notin\Delta_{h^{\prime}}^{K}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a,\pi\}
\leq H\sum_{h^{\prime}=h}^{H-1}\Pr\{s_{h^{\prime}+1}\notin\Delta_{h^{\prime}}^{K}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a,\pi\}
\leq\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\right].

The first inequality decomposes the probability into a summation over steps via the union bound, and the second inequality follows from Eq. (13). Now denote g_{h}^{\pi}(s,a)=\min\left\{H,\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\right]\right\}; then g_{h}^{\pi}(s,a)\leq\widetilde{G}_{h}^{\pi}(s,a) by recursion. In fact,

ghπ(s,a)\displaystyle\ \ \ \ \ g_{h}^{\pi}(s,a)
=min{H,𝔼π[h=hH1SHlog(S2HA/δ)Nh(sh,ah)|(sh,ah)=(s,a)]}\displaystyle=\min\left\{H,\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\ \Bigg{|}\ (s_{h},a_{h})=(s,a)\right]\right\}
=min{H,SHlog(S2HA/δ)Nh(s,a)+𝔼π[h=h+1H1SHlog(S2HA/δ)Nh(sh,ah)|(sh,ah)=(s,a)]}\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h+1}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\ \Bigg{|}\ (s_{h},a_{h})=(s,a)\right]\right\}
=min{H,SHlog(S2HA/δ)Nh(s,a)\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}\right.
+min{H,𝔼π[h=h+1H1SHlog(S2HA/δ)Nh(sh,ah)|(sh,ah)=(s,a)]}}\displaystyle\hskip 50.00008pt\left.+\min\left\{H,\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h+1}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\ \Bigg{|}\ (s_{h},a_{h})=(s,a)\right]\right\}\right\}
=min{H,SHlog(S2HA/δ)Nh(s,a)+s𝒮h(ss,a)gh+1π(s,πh+1(s))}\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\sum_{s^{\prime}\in\mathcal{S}}\mathbb{P}_{h}(s^{\prime}\mid s,a)g_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}

which is the same recursive equation as that of \widetilde{G}_{h}^{\pi}(s,a). Since g_{H}^{\pi}(s,a)=0\leq\widetilde{G}_{H}^{\pi}(s,a), by induction we get g_{h}^{\pi}(s,a)\leq\widetilde{G}_{h}^{\pi}(s,a), and then

HGhπ(s,a)ghπ(s,a)G~hπ(s,a)\displaystyle H\cdot G_{h}^{\pi}(s,a)\leq g_{h}^{\pi}(s,a)\leq\widetilde{G}_{h}^{\pi}(s,a)

holds directly, which proves Lemma B.7. ∎

Now we return to the main theorem. Consider a policy \pi\in\Pi^{K} that is feasible at episode K. For each possible trajectory \mathscr{T}_{H}=\{s_{1},a_{1},\cdots,s_{H},a_{H}\}, let \Pr_{\pi}(\mathscr{T}_{H}) denote the probability of generating \mathscr{T}_{H} when following policy \pi under the true transition kernel. If the event s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h}) happens for some h, we call the trajectory \mathscr{T}_{H} “unsafe”. Denoting the set of all unsafe trajectories by \mathcal{U}, we have

𝔼π[h=1H(c(sh)τ)+]\displaystyle\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]
=𝒯HPrπ(𝒯H)[h=1H(c(sh)τ)+]\displaystyle\quad=\sum_{\mathscr{T}_{H}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]
𝒯H𝒰Prπ(𝒯H)[h=1H(c(sh)τ)+]+𝒯H𝒰Prπ(𝒯H)[h=1H(c(sh)τ)+]\displaystyle\quad\leq\sum_{\mathscr{T}_{H}\in\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]+\sum_{\mathscr{T}_{H}\notin\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]
H𝒯H𝒰Prπ(𝒯H)+𝒯H𝒰Prπ(𝒯H)[h=1H2NK(sh)log(SK/δ)]\displaystyle\quad\leq H\sum_{\mathscr{T}_{H}\in\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})+\sum_{\mathscr{T}_{H}\notin\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}\sqrt{\frac{2}{N^{K}(s_{h})}\log(SK/\delta)}\right] (14)
\leq H\Pr_{\pi}\{\mathcal{P}_{\pi}^{c}\}+\sum_{\mathscr{T}_{H}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}\sqrt{\frac{2}{N^{K}(s_{h})}\log(SK/\delta)}\right]
\leq H\cdot G_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))+F_{1}^{\pi}(s_{1},\pi_{1}(s_{1})) (15)
2W¯1K(s1,π1(s1))\displaystyle\quad\leq 2\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1}))
2W¯1K(s1,π1K+1(s1))\displaystyle\quad\leq 2\overline{W}_{1}^{K}(s_{1},\pi^{K+1}_{1}(s_{1}))
ε.\displaystyle\quad\leq\varepsilon.

Eq. (14) holds because the step-wise violation of any trajectory is at most H, while \mathscr{T}_{H}\notin\mathcal{U} implies that the event \mathcal{P}_{\pi} holds along the trajectory, so the per-state bound leading to Eq. (12) applies. Eq. (15) follows from the definitions of F_{1}^{\pi} and G_{1}^{\pi}. So far we have shown that, if the algorithm terminates, all the requirements are satisfied. Now we turn to bounding the sample complexity. We first assume that K\geq SA.

Define q_{h}^{k}=\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)\overline{W}_{h}^{k}(s,a). Since \pi^{k+1} is the greedy policy with respect to \overline{W}_{h}^{k}(s,a), we have

W¯hk(s,a)\displaystyle\overline{W}_{h}^{k}(s,a)
=\min\left\{H,M(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi^{k+1}_{h+1}(s^{\prime}))\right\}
\leq\min\left\{H,M(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi^{k+1}_{h+1}(s^{\prime}))+H\|\hat{\mathbb{P}}_{h}^{k}(\cdot\mid s,a)-\mathbb{P}_{h}(\cdot\mid s,a)\|_{1}\right\}
\leq M^{\prime}(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi^{k+1}_{h+1}(s^{\prime}))

where M^{\prime}(n,\delta)=3H\left(\sqrt{\frac{2\gamma(n,\delta)}{n}}\wedge 1\right)+SH\left(\frac{\gamma(n,\delta)}{n}\wedge 1\right). Now
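The recursion for \overline{W}_{h}^{k} displayed above can be computed by a single backward pass. The following Python sketch is illustrative only (the data layout, the bonus function M, and the dictionary of estimated safe action sets are assumptions; it is not the paper's implementation):

```python
import numpy as np

def uncertainty_backward_pass(P_hat, N, safe_actions, bonus, H, S, A):
    """Backward recursion W_{H+1} = 0 and
    W_h(s,a) = min(H, bonus(N_h(s,a)) + sum_s' P_hat_h(s'|s,a) * max_{b safe at s'} W_{h+1}(s',b)),
    together with the greedy policy pi_h(s) = argmax_{a safe at s} W_h(s,a).
    P_hat[h] is an (S, A, S) array, N[h] an (S, A) array, and safe_actions[(h, s)] a list of
    actions for h = 1, ..., H + 1 (level H + 1 may be empty)."""
    W_next = np.zeros((S, A))
    policy = {}
    for h in range(H, 0, -1):
        W = np.zeros((S, A))
        # value of the best estimated-safe action at each next state under W_{h+1}
        best_next = np.array([max((W_next[s2, b] for b in safe_actions[(h + 1, s2)]), default=0.0)
                              for s2 in range(S)])
        for s in range(S):
            for a in range(A):
                W[s, a] = min(H, bonus(N[h][s, a]) + P_hat[h][s, a] @ best_next)
            acts = safe_actions[(h, s)]
            policy[(h, s)] = max(acts, key=lambda a: W[s, a]) if acts else None
        W_next = W
    return W_next, policy   # exploration stops once W_1(s_1, pi_1(s_1)) <= eps / 2
```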

qhk\displaystyle q_{h}^{k} =(s,a)hπk+1(s,a)W¯hk(s,a)\displaystyle=\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)\overline{W}_{h}^{k}(s,a)
(s,a)hπk+1(s,a)M(Nhk(s,a),δ)+(s,s,a)hπk+1(s,a)hk(ss,a)W¯h+1k(s,πh+1k+1(s))\displaystyle\leq\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta)+\sum_{(s^{\prime},s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)\mathbb{P}_{h}^{k}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi_{h+1}^{k+1}(s^{\prime}))
=(s,a)hπk+1(s,a)M(Nhk(s,a),δ)+qh+1k.\displaystyle=\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta)+q_{h+1}^{k}.

Thus

ε/2<q1kh=1H(s,a)hπk+1(s,a)M(Nhk(s,a),δ),k[K1],\displaystyle\varepsilon/2<q_{1}^{k}\leq\sum_{h=1}^{H}\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta),\forall\ k\in[K-1],

where the first inequality holds because q_{1}^{k}=\overline{W}_{1}^{k}(s_{1},\pi_{1}^{k+1}(s_{1}))>\varepsilon/2 for every episode before termination. Summing over k\in[K-1], we get

Kε/2k=0K1h=1H(s,a)hπk+1(s,a)M(Nhk(s,a),δ).\displaystyle K\varepsilon/2\leq\sum_{k=0}^{K-1}\sum_{h=1}^{H}\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta).

Define the pseudo-count as N¯hk(s,a)=i=1khπi(s,a)\overline{N}_{h}^{k}(s,a)=\sum_{i=1}^{k}\mathbb{P}_{h}^{\pi^{i}}(s,a) and the event about the pseudo-count like Kaufmann et al. (2021):

cnt={k,h,s,a:Nhk(s,a)12N¯hk(s,a)log(SAH/δ)}.\displaystyle\mathcal{E}_{cnt}=\left\{\forall\ k,h,s,a:N_{h}^{k}(s,a)\geq\frac{1}{2}\overline{N}_{h}^{k}(s,a)-\log(SAH/\delta)\right\}.

Then \Pr\{\mathcal{E}_{cnt}^{c}\}\leq\delta; the proof is provided in Kaufmann et al. (2021). Moreover, Lemma 7 in Kaufmann et al. (2021) shows that we can replace N_{h}^{k}(s,a) with \overline{N}_{h}^{k}(s,a) up to a constant factor.

Lemma B.9 (Kaufmann et al. (2021)).

Under the event \mathcal{E}_{cnt}, for any s\in\mathcal{S}, a\in\mathcal{A} and k, if \gamma(x,\delta)/x is non-increasing for x\geq 1 and \gamma(x,\delta) is increasing, we have

γ(Nhk(s,a),δ)Nhk(s,a)14γ(N¯hk(s,a),δ)N¯hk(s,a)1.\displaystyle\frac{\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\wedge 1\leq\frac{4\gamma(\overline{N}_{h}^{k}(s,a),\delta)}{\overline{N}_{h}^{k}(s,a)\vee 1}.

The proof of Lemma B.9 can be found in Kaufmann et al. (2021). Then we can have

Kε/2\displaystyle K\varepsilon/2 k=0K1h=1H(s,a)hπk+1(s,a)M(Nhk(s,a),δ)\displaystyle\leq\sum_{k=0}^{K-1}\sum_{h=1}^{H}\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta) (16)
3Hh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))2γ(Nhk(s,a),δ)Nhk(s,a)1(a)\displaystyle\leq\underbrace{3H\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\sqrt{\frac{2\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\wedge 1}}_{\text{(a)}}
+SHh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))(γ(Nhk(s,a),δ)Nhk(s,a)1)(b).\displaystyle\ \ \ \ \ \ \ \ \ \ +\underbrace{SH\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\left(\frac{\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\wedge 1\right)}_{\text{(b)}}. (17)

For part (a)(a), we upper bound it by

(a) 3Hh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))8γ(N¯hk(s,a),δ)N¯hk(s,a)1\displaystyle\leq 3H\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\sqrt{\frac{8\gamma(\overline{N}_{h}^{k}(s,a),\delta)}{\overline{N}_{h}^{k}(s,a)\vee 1}}
32Hγ(K,δ)h=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))N¯hk(s,a)1\displaystyle\leq 3\sqrt{2}H\sqrt{\gamma(K,\delta)}\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}\frac{(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))}{\sqrt{\overline{N}_{h}^{k}(s,a)\vee 1}} (18)
3(2+2)Hγ(K,δ)h=1H(s,a)N¯hK(s,a)1\displaystyle\leq 3(2+\sqrt{2})H\sqrt{\gamma(K,\delta)}\sum_{h=1}^{H}\sum_{(s,a)}\sqrt{\overline{N}_{h}^{K}(s,a)\vee 1}
3(2+2)Hγ(K,δ)h=1H(s,a)(1+N¯hK(s,a))\displaystyle\leq 3(2+\sqrt{2})H\sqrt{\gamma(K,\delta)}\sum_{h=1}^{H}\sum_{(s,a)}(1+\sqrt{\overline{N}_{h}^{K}(s,a)})
3(2+2)H2γ(K,δ)(SA+SAK)\displaystyle\leq 3(2+\sqrt{2})H^{2}\sqrt{\gamma(K,\delta)}(SA+\sqrt{SAK}) (19)
6(2+2)H2γ(K,δ)SAK,\displaystyle\leq 6(2+\sqrt{2})H^{2}\sqrt{\gamma(K,\delta)}\sqrt{SAK},

where the first inequality uses Lemma B.9, the step from Eq. (18) to the next line uses Lemma 19 in Auer et al. (2008), Eq. (19) follows from \sum_{(s,a)}\overline{N}_{h}^{K}(s,a)=K together with the Cauchy–Schwarz inequality, and the last inequality uses the assumption K\geq SA.

Also for part (b)(b),

(b) SHh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))4γ(N¯hk(s,a),δ)N¯hk(s,a)1\displaystyle\leq SH\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\frac{4\gamma(\overline{N}_{h}^{k}(s,a),\delta)}{\overline{N}_{h}^{k}(s,a)\vee 1}
SHγ(K,δ)h=1H(s,a)k=0K1(N¯hk+1(s,a)N¯hk(s,a))N¯hk(s,a)1\displaystyle\leq SH\gamma(K,\delta)\sum_{h=1}^{H}\sum_{(s,a)}\sum_{k=0}^{K-1}\frac{(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))}{\overline{N}_{h}^{k}(s,a)\vee 1}
SHγ(K,δ)h=1H(s,a)4log(N¯hK(s,a)+1)\displaystyle\leq SH\gamma(K,\delta)\sum_{h=1}^{H}\sum_{(s,a)}4\log(\overline{N}_{h}^{K}(s,a)+1) (20)
4S2AH2γ(K,δ)log(K+1),\displaystyle\leq 4S^{2}AH^{2}\gamma(K,\delta)\log(K+1),

where Eq. (20) follows from Lemma 9 in Ménard et al. (2021). Combining the upper bounds on (a) and (b), we get

Kε/26(2+2)H2γ(K,δ)SAK+4S2AH2γ(K,δ)log(K+1).\displaystyle K\varepsilon/2\leq 6(2+\sqrt{2})H^{2}\sqrt{\gamma(K,\delta)}\sqrt{SAK}+4S^{2}AH^{2}\gamma(K,\delta)\log(K+1).

Recall that \gamma(n,\delta)=2\log(SAHK/\delta)+(S-1)\log(e(1+n/(S-1)))\geq 1. Solving the above inequality for K via Lemma D.1, we get

K=𝒪~((H4SAε2+S2AH2ε)(log(1δ)+S)),\displaystyle K=\widetilde{\mathcal{O}}\left(\left(\frac{H^{4}SA}{\varepsilon^{2}}+\frac{S^{2}AH^{2}}{\varepsilon}\right)\left(\log\left(\frac{1}{\delta}\right)+S\right)\right),

under the events \mathcal{E}_{cnt},\mathcal{E}_{1},\mathcal{E}_{4},\mathcal{E}_{5}. Hence the theorem holds with probability at least 1-4\delta, and we complete the proof by replacing \delta with \delta/4.

B.3 Proof of Theorem 5.1

Proof.

We construct n MDP instances with the same transition kernel but different safety costs. In each instance, the states form a tree of degree A and depth H/3. For each state inside the tree (i.e., each non-leaf state), action a_{i} leads to the i-th child. The leaf nodes are absorbing states, which we call s_{1},s_{2},\cdots,s_{n}. Then n=A^{H/3}=\frac{S(A-1)+1}{A}\geq\frac{S}{2}.

Figure 2: MDP instance \tau_{1}.
Figure 3: MDP instance \tau_{i}\ (i=2).

We first define the instance \tau_{1} as follows. For each state inside the tree, the safety cost and reward are both 0. For the leaf nodes, the cost function is c(s_{i})=1/2+\Delta\mathbb{I}\{i\neq 1\} and the reward function is r(s_{i})=\mathbb{I}\{i\neq 1\}, i.e., only state s_{1} has reward 0 while the other states s_{2},\cdots,s_{n} have reward 1. Next we define the instances \tau_{j}\ (2\leq j\leq n). For states inside the tree, the safety cost and reward equal those of \tau_{1}. For the leaf nodes, the cost function is c(s_{i})=1/2+\Delta\mathbb{I}\{i\neq 1,j\}, with r(s_{i}) unchanged. Then, in \tau_{j}, the leaf states s_{1} and s_{j} (as well as the internal states and the initial state s_{0}) are safe, so the optimal safe policy follows the path to s_{j} and collects reward 1 at every remaining step, giving safe optimal reward 2H/3. The MDP instances are shown in Figure 2 and Figure 3.
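For illustration, the leaf costs and rewards of these instances can be generated directly from the description above (Delta is the gap parameter, and the safety threshold is \tau=1/2):

```python
def leaf_cost_reward(n, j, Delta):
    """Leaf costs and rewards of instance tau_j (j = 1 recovers tau_1).
    In tau_1 only the zero-reward leaf s_1 is safe; in tau_j (j >= 2) the rewarding leaf s_j
    is also safe, which is what forces the regret/violation trade-off."""
    cost = [0.5 + Delta * (i != 1 and i != j) for i in range(1, n + 1)]
    reward = [float(i != 1) for i in range(1, n + 1)]
    return cost, reward
```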

Let H^{\prime}=\frac{2}{3}H denote the number of steps remaining after the agent reaches one of the leaf states s_{1},\cdots,s_{n}. Let R_{K}(\pi,\tau) denote the regret up to episode K for instance \tau and algorithm \pi, and let C_{K}(\pi,\tau) denote the corresponding violation. For \tau_{1}, we have C_{K}(\pi,\tau_{1})\geq\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)\cdot\frac{KH^{\prime}\Delta}{2}, where t_{i}(K) denotes the number of episodes in which the path to s_{i} is chosen. Also, R_{K}(\pi,\tau_{i})\geq\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\frac{KH^{\prime}}{2}.

Applying the Bretagnolle–Huber inequality, we have

τ1(t1(K)K/2)+τi(t1(K)>K/2)12exp{D(τ1,τi)}.\displaystyle\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)+\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\geq\frac{1}{2}\exp\{-D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\}.

We choose i=\arg\min_{i\neq 1}\mathbb{E}_{\tau_{1}}[t_{i}(K)]; then, since \sum_{i\neq 1}\mathbb{E}_{\tau_{1}}[t_{i}(K)]\leq K,

𝔼τ1[ti(K)]Kn1.\mathbb{E}_{\tau_{1}}[t_{i}(K)]\leq\frac{K}{n-1}.

By the definition of τ1\mathbb{P}_{\tau_{1}}and τi\mathbb{P}_{\tau_{i}},

D(τ1,τi)\displaystyle D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}}) =H𝔼τ1[ti(K)]D(𝒩(1/2+Δ,1),𝒩(1/2,1))\displaystyle=H^{\prime}\mathbb{E}_{\tau_{1}}[t_{i}(K)]D(\mathcal{N}(1/2+\Delta,1),\mathcal{N}(1/2,1))
H𝔼τ1[ti(K)]4Δ2\displaystyle\leq H^{\prime}\mathbb{E}_{\tau_{1}}[t_{i}(K)]\cdot 4\Delta^{2}
4KHΔ2n1.\displaystyle\leq\frac{4KH^{\prime}\Delta^{2}}{n-1}.

Thus, choosing \Delta=\sqrt{\frac{n-1}{4KH^{\prime}}}\leq\frac{1}{2} (which requires n\leq T), we get

τ1(t1(K)K/2)+τi(t1(K)>K/2)12exp{D(τ1,τi)}12e.\displaystyle\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)+\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\geq\frac{1}{2}\exp\{-D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\}\geq\frac{1}{2e}.

If \mathbb{E}_{\pi}[R_{K}(\pi,\tau)]\leq\frac{T}{24}=\frac{KH}{24}=\frac{KH^{\prime}}{16} for all \tau, then

τi(t1(K)>K/2)2RK(π,τi)KH18.\displaystyle\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\leq\frac{2R_{K}(\pi,\tau_{i})}{KH^{\prime}}\leq\frac{1}{8}.

Thus

τ1(t1(K)K/2)12e18:=C\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)\geq\frac{1}{2e}-\frac{1}{8}:=C

and

C_{K}(\pi,\tau_{1})\geq\frac{CKH^{\prime}\Delta}{2}=\frac{C\sqrt{(n-1)KH^{\prime}}}{4}=\Omega(\sqrt{SHK})=\Omega(\sqrt{ST}),

since n\geq\frac{S}{2} and T=KH. ∎

B.4 Proof of Theorem 5.2

Proof.

We construct MDP instances \tau_{1},\tau_{2},\cdots,\tau_{n} as follows. In each instance, the states form a tree of degree A-1 and depth H/3. For each state inside the tree (i.e., each non-leaf state), action a_{i}\ (1\leq i\leq A-1) leads to the i-th child. Denote the leaf nodes by s_{1},s_{2},\cdots,s_{n}, where n=(A-1)^{H/3}=\frac{(S-3)(A-2)+1}{A-1}=\Omega(S); from these leaf states the process transitions to the two absorbing states s_{A} and s_{B}, with rewards r(s_{A})=0 and r(s_{B})=1 respectively. In addition, in every state except the two absorbing states, action a_{A} leads the agent to the unique unsafe state s_{U} with r(s_{U})=0.5+\Delta^{\prime} and c(s_{U})=1, for some parameter \Delta^{\prime}. For all states except the unsafe state, the safety cost is 0.

Now, for the instance \tau_{1}, the transition probabilities from the leaf states to the absorbing states are

\left\{\begin{array}{l}p(s_{A}\mid s_{1},a)=0.5-\Delta\\ p(s_{B}\mid s_{1},a)=0.5+\Delta\\ p(s_{A}\mid s_{i},a)=p(s_{B}\mid s_{i},a)=0.5\ \ \ \ i\geq 2\end{array}\right.

For instance τj\tau_{j}, the transition probability are

\left\{\begin{array}{l}p(s_{A}\mid s_{1},a)=0.5-\Delta\\ p(s_{B}\mid s_{1},a)=0.5+\Delta\\ p(s_{A}\mid s_{j},a)=0.5-2\Delta\\ p(s_{B}\mid s_{j},a)=0.5+2\Delta\\ p(s_{A}\mid s_{i},a)=p(s_{B}\mid s_{i},a)=0.5\ \ \ \ i\neq 1,j\end{array}\right.
Figure 4: MDP instance \tau_{1}.
Figure 5: MDP instance \tau_{i}.

If we ignore the safety constraint, choosing a_{A} and entering the unsafe state s_{U} is always the optimal action. Define \mbox{gap}_{h,s,a}=V_{h}^{*}(s)-Q_{h}^{*}(s,a); then the minimal gap (Simchowitz & Jamieson, 2019) satisfies

\mbox{gap}_{\min}=\min_{h,s,a}\{\mbox{gap}_{h,s,a}:\mbox{gap}_{h,s,a}>0\}\geq H^{\prime}\big((0.5+\Delta^{\prime})-(0.5+2\Delta)\big)=H^{\prime}(\Delta^{\prime}-2\Delta).

Thus, by the classical gap-dependent upper bound (Simchowitz & Jamieson, 2019), the regret is at most O\left(\frac{S^{2}A\cdot poly(H)\log T}{\Delta^{\prime}-\Delta}\right). As we show below, we choose \Delta=\Theta(T^{-1/2}) and \Delta^{\prime}=\Theta(T^{\frac{\alpha-1}{2}}), hence the regret is

O(S2Apoly(H)logTΔΔ)=𝒪~(T1α2).\displaystyle O\left(\frac{S^{2}A\cdot poly(H)\log T}{\Delta^{\prime}-\Delta}\right)=\widetilde{\mathcal{O}}\left(T^{\frac{1-\alpha}{2}}\right).

Now consider the safety threshold \tau=0, so that the violation equals the number of visits to the unsafe state. Consider any algorithm that achieves O(T^{1-\alpha}) violation; it visits the unsafe state at most O(T^{1-\alpha}) times. Denote by T_{i}(K)\ (1\leq i\leq n+1) the number of times the agent reaches state s_{i} up to episode K, where s_{n+1}:=s_{U}. Then \mathbb{E}_{\pi}[T_{n+1}(K)]\leq CT^{1-\alpha}. Since in each episode the agent reaches at least one of s_{i}\ (1\leq i\leq n+1), we have \sum_{i=1}^{n+1}T_{i}(K)\geq K, and \sum_{i=1}^{n}T_{i}(K)\geq K-C\cdot T^{1-\alpha}\geq K/2 when K\geq(2C)^{1/\alpha}H^{\frac{1-\alpha}{\alpha}}.

Now, similarly to the proof of Theorem 5.1, let R_{K}(\pi,\tau) and C_{K}(\pi,\tau) denote the regret and the violation of algorithm \pi=\{\pi_{1},\cdots,\pi_{K}\} on instance \tau, and let H^{\prime}=\frac{2}{3}H. For \tau_{1} and \tau_{i}, we have

RK(π,τ1)\displaystyle R_{K}(\pi,\tau_{1}) τ1(T1(K)K/4)KHΔ4(ΔΔ)𝔼π[Tn+1(K)]\displaystyle\geq\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-\Delta)\mathbb{E}_{\pi}[T_{n+1}(K)]
τ1(T1(K)K/4)KHΔ4(ΔΔ)CT1α\displaystyle\geq\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-\Delta)CT^{1-\alpha}
RK(π,τi)\displaystyle R_{K}(\pi,\tau_{i}) τi(T1(K)>K/4)KHΔ4(Δ2Δ)𝔼π[Tn+1(K)]\displaystyle\geq\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-2\Delta)\mathbb{E}_{\pi}[T_{n+1}(K)]
τi(T1(K)>K/4)KHΔ4(Δ2Δ)CT1α\displaystyle\geq\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-2\Delta)CT^{1-\alpha}

Also, by the Bretagnolle–Huber inequality, we have

τ1(T1(K)K/4)+τi(T1(K)>K/4)12exp{D(τ1,τi)}\displaystyle\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)+\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\geq\frac{1}{2}\exp\{-D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\}

Now, without loss of generality, we have \mathbb{E}_{\tau_{1}}[T_{i}(K)]\leq\frac{K}{2(n-1)}, so that

D(τ1,τi)𝔼τ1[Ti(K)]2Δ2K2(n1)2Δ2=KΔ2n1\displaystyle D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\leq\mathbb{E}_{\tau_{1}}[T_{i}(K)]2\Delta^{2}\leq\frac{K}{2(n-1)}\cdot 2\Delta^{2}=\frac{K\Delta^{2}}{n-1}

Choose Δ=n1K\Delta=\sqrt{\frac{n-1}{K}}, then

τ1(T1(K)K/4)+τi(T1(K)>K/4)12e.\displaystyle\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)+\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\geq\frac{1}{2e}.

Then at least one of \tau_{1} and \tau_{i} (say \tau_{1}) satisfies

RK(π,τ1)HK(n1)8eC(Δ2Δ)T1α\displaystyle R_{K}(\pi,\tau_{1})\geq\frac{H^{\prime}\sqrt{K(n-1)}}{8e}-C(\Delta^{\prime}-2\Delta)T^{1-\alpha}

Choose \Delta^{\prime}=T^{\frac{\alpha-1}{2}}, and assume T\geq 2^{\frac{2}{1-\alpha}} so that \Delta^{\prime}\leq\frac{1}{2}. Then

RK(π,τ1)HK(n1)8eCT(1α)/2=Ω(HSK)=Ω(HST).\displaystyle R_{K}(\pi,\tau_{1})\geq\frac{H^{\prime}\sqrt{K(n-1)}}{8e}-CT^{(1-\alpha)/2}=\Omega(H\sqrt{SK})=\Omega(\sqrt{HST}).

Appendix C Safe Markov Games

C.1 Definitions

In this section, we extend our general framework for Safe-RL-SW to safe zero-sum two-player Markov games (Yu et al., 2021).

In zero-sum two-player Markov games, there is an agent and an adversary; the adversary tries to drive the agent into the unsafe region and to minimize the reward, while the agent aims to maximize the reward and avoid the unsafe region despite the adversary. More specifically, let \mathcal{A} and \mathcal{B} denote the action spaces of the agent and the adversary, respectively. In each episode the agent starts at an initial state s_{1}. At each step h\in[H], the agent and the adversary observe the state s_{h} and simultaneously take actions a_{h}\in\mathcal{A} and b_{h}\in\mathcal{B} drawn from their strategies \mu_{h}(\cdot\mid s) and \upsilon_{h}(\cdot\mid s). Then the agent receives a reward r_{h}(s_{h},a_{h},b_{h}), and the environment transitions to the next state s_{h+1}\sim\mathbb{P}_{h}(\cdot\mid s_{h},a_{h},b_{h}) according to the transition kernel \mathbb{P}.

One motivation for this setting is that, when a robot operates in a dangerous environment, it should learn a completely safe policy even under disturbances from the environment. As in the single-player MDP, we define the value function and Q-value function:

Vhμ,υ(s)\displaystyle V_{h}^{\mu,\upsilon}(s) =𝔼π[h=hHrh(sh,ah,bh)|sh=s,μ,υ].\displaystyle=\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}})\Bigg{|}s_{h}=s,\mu,\upsilon\right].
Qhμ,υ(s,a,b)\displaystyle Q_{h}^{\mu,\upsilon}(s,a,b) =𝔼π[h=hHrh(sh,ah,bh)|sh=s,ah=a,bh=b,μ,υ].\displaystyle=\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}})\Bigg{|}s_{h}=s,a_{h}=a,b_{h}=b,\mu,\upsilon\right].

The cost function is a bounded function on the states with c(s)\in[0,1] for all s\in\mathcal{S}, and the unsafe states are defined as \mathcal{U}_{H}=\{s\mid c(s)>\tau\} for some fixed threshold \tau. Similar to Section 4, we define the set of possible next states for state s, step h, and action pair (a,b) as

Δh(s,a,b)={sh(ss,a,b)>0}.\displaystyle\Delta_{h}(s,a,b)=\{s^{\prime}\mid\mathbb{P}_{h}(s^{\prime}\mid s,a,b)>0\}.

Hence we can define the unsafe states for step hh recursively as:

\displaystyle\mathcal{U}_{h}=\mathcal{U}_{h+1}\cup\{s\mid\forall\,a\in\mathcal{A},\ \exists\,b\in\mathcal{B},\ \Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}\neq\emptyset\}.

Then, to stay safe, the agent must not visit any state s\in\mathcal{U}_{h} at steps 1,2,\cdots,h. The safe action set for the agent, A_{h}^{safe}(s), consists of the actions that avoid the unsafe region no matter what action the adversary takes:

Ahsafe(s)={a𝒜b,Δh(s,a,b)𝒰h+1=}.\displaystyle A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\ \forall\ b\in\mathcal{B},\Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}=\emptyset\}.
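Both recursions above can be evaluated exactly by backward induction once the supports \Delta_{h}(s,a,b) and the terminal unsafe set \mathcal{U}_{H} are known. The following minimal sketch assumes integer-indexed states and actions and Python sets Delta[h][s][a][b] as hypothetical inputs (the safe action sets at the last step are unconstrained and omitted).

def unsafe_sets_and_safe_actions(Delta, U_H, S, A, B, H):
    # U[h]: unsafe states at (0-indexed) step h; A_safe[h][s]: safe agent actions at s.
    U = [set() for _ in range(H)]
    U[H - 1] = set(U_H)                        # terminal unsafe states {s : c(s) > tau}
    A_safe = [dict() for _ in range(H)]
    for h in range(H - 2, -1, -1):
        U[h] = set(U[h + 1])
        for s in range(S):
            # a is safe iff no adversary action b can reach U_{h+1} from (s, a)
            safe = [a for a in range(A)
                    if all(Delta[h][s][a][b].isdisjoint(U[h + 1]) for b in range(B))]
            A_safe[h][s] = safe
            if not safe:                       # for every a, some b reaches U_{h+1}
                U[h].add(s)
    return U, A_safe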

Hence, a feasible policy should keep the agent in a safe state s\notin\mathcal{U}_{h} at each step h, and always choose a\in A_{h}^{safe}(s) to avoid the unsafe region regardless of the adversary. Analogous to the safe RL setting in Section 4, the optimal Bellman equation can be written as

\displaystyle Q^{*}_{h}(s,a,b)=r_{h}(s,a,b)+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a,b)V^{*}_{h+1}(s^{\prime}).
Vh(s)\displaystyle V^{*}_{h}(s) =maxaAhsafe(s)minbQh(s,a,b).\displaystyle=\max_{a\in A_{h}^{safe}(s)}\min_{b\in\mathcal{B}}Q^{*}_{h}(s,a,b).

If we set r_{h}(s,a,b)=-\infty for unsafe states s\in\mathcal{U}_{H}, then for a fixed adversary strategy \upsilon, the best response of the agent can be defined as \arg\max_{\mu}V_{1}^{\mu,\upsilon}(s_{1}). Similarly, the best response of the adversary to the agent's strategy \mu can be defined as \arg\min_{\upsilon}V_{1}^{\mu,\upsilon}(s_{1}). By the minimax theorem, for any step h\in[H] and s\in\mathcal{S},

maxμminυVhμ,υ(s)=minυmaxμVhμ,υ(s).\displaystyle\max_{\mu}\min_{\upsilon}V_{h}^{\mu,\upsilon}(s)=\min_{\upsilon}\max_{\mu}V_{h}^{\mu,\upsilon}(s).

A policy pair (\mu^{*},\upsilon^{*}) satisfying V_{h}^{\mu^{*},\upsilon^{*}}(s)=\max_{\mu}\min_{\upsilon}V_{h}^{\mu,\upsilon}(s) for all steps h\in[H] is called a Nash equilibrium, and the value V_{h}^{\mu^{*},\upsilon^{*}}(s) is called the minimax value. It is known that the minimax value is unique for Markov games, and the agent's goal is to make its reward as close as possible to the minimax value.

Assume the agent and the adversary execute the policies (\mu_{1},\upsilon_{1}),(\mu_{2},\upsilon_{2}),\cdots,(\mu_{K},\upsilon_{K}) in episodes 1,2,\cdots,K. The regret is defined as

R(K)=k=1KV1μ,υ(s1)V1μk,υk(s1).\displaystyle R(K)=\sum_{k=1}^{K}V_{1}^{\mu^{*},\upsilon^{*}}(s_{1})-V_{1}^{\mu_{k},\upsilon_{k}}(s_{1}).

Also, the actual step-wise violation is defined as

C(K)=k=1Kh=1H(c(shk)τ)+.\displaystyle C(K)=\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}.

C.2 Algorithms and Results

The main algorithm is presented in Algorithm 3. The key idea is similar to that of Algorithm 1. However, in each episode we need to compute a policy \pi^{k} via the sub-procedure “Planning”, which computes a Nash equilibrium of the estimated Markov game. Note that a mixed Nash equilibrium strategy (\mu,\upsilon) always exists, and the ZERO-SUM-NASH function in the “Planning” procedure computes the Nash equilibrium of the zero-sum matrix game induced at each state and step.

In the sub-procedure “Planning”, we set (Q_{h}^{k})^{\prime}(s,a,\cdot)=-\infty if there exists at least one adversary action b such that \Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}\neq\emptyset. Indeed, the agent cannot take such actions at this step, so by setting (Q_{h}^{k})^{\prime}(s,a,\cdot)=-\infty for these actions a, the support of the Nash equilibrium policy \mu_{h}^{k} will not contain a. If every action a is masked in this way, the minimax value is -\infty, which means that s is now in \mathcal{U}_{h}^{k}; in this case the agent can choose its action arbitrarily.

1:  Input: \Delta_{h}^{1}(s,a,b)=\emptyset,\ N_{h}^{1}(s,a,b)=0,\ N_{h}^{1}(s,a,b,s^{\prime})=0.
2:  for k=1,2,,Kk=1,2,\cdots,K do
3:     Receive the initial state s1s_{1}.
4:     Update the estimate \hat{c}(s) based on the history and compute the optimistic estimate \bar{c}(s)=\hat{c}(s)-\beta(N^{k}(s),\delta) for each state s, where \beta is the bonus function.
5:     Define 𝒰Hk={sc¯(s)>τ}\mathcal{U}_{H}^{k}=\{s\mid\bar{c}(s)>\tau\}.
6:     Based on \Delta_{h}^{k}(s,a,b), calculate \mathcal{U}_{h}^{k}=\mathcal{U}_{h+1}^{k}\cup\{s\mid\forall a\in\mathcal{A},\exists b\in\mathcal{B},\Delta_{h}^{k}(s,a,b)\cap\mathcal{U}_{h+1}^{k}\neq\emptyset\}.
7:     ^h(ss,a,b)=Nhk(s,a,b,s)Nhk(s,a,b)\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a,b)=\frac{N_{h}^{k}(s,a,b,s^{\prime})}{N_{h}^{k}(s,a,b)}.
8:     \pi_{h}^{k}=(\mu_{h}^{k},\upsilon_{h}^{k})=\text{Planning}(k,r,\{\mathcal{U}^{k}_{h}\}_{h\in[H]},\Delta_{h}^{k}(s,a,b),N_{h}^{k}(s,a,b),\hat{\mathbb{P}}).
9:     for h=1,2,,Hh=1,2,\cdots,H do
10:        Take action a_{h}^{k}\sim\mu_{h}^{k}(\cdot\mid s_{h}^{k}), observe the adversary's action b_{h}^{k}, and arrive at state s_{h+1}^{k}\sim\mathbb{P}_{h}(\cdot\mid s_{h}^{k},a_{h}^{k},b_{h}^{k}).
11:        Δhk+1(shk,ahk,bhk)=Δhk(shk,ahk,bhk){sh+1k}\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k},b_{h}^{k})=\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})\cup\{s_{h+1}^{k}\}.
12:     end for
13:     Update Nhk+1(s,a,b)N_{h}^{k+1}(s,a,b), Nhk+1(s,a,b,s),Δhk+1(s,a,b)N_{h}^{k+1}(s,a,b,s^{\prime}),\Delta_{h}^{k+1}(s,a,b) for all (s,a,b)(s,a,b).
14:  end for
Algorithm 3 Safe Markov Games
1:  Input k,r,{𝒰h},Δ(s,a,b),N,^k,r,\{\mathcal{U}_{h}\},\Delta(s,a,b),N,\hat{\mathbb{P}}.
2:  for h=H,H1,,1h=H,H-1,\cdots,1 do
3:     for (s,a,b)𝒮×𝒜×(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B} do
4:        β(Nhk(s,a,b),δ)7Hlog(5SABT/δ)/Nhk(s,a,b)\beta(N_{h}^{k}(s,a,b),\delta)\leftarrow 7H\sqrt{\log(5SABT/\delta)/N_{h}^{k}(s,a,b)}.
5:        Q_{h}^{k}(s,a,b)=\min\{r_{h}(s,a,b)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a,b)V_{h+1}^{k}(s^{\prime})+\beta(N_{h}^{k}(s,a,b),\delta),H\}.
6:     end for
7:     for sSs\in S do
8:        for aAa\in A do
9:           if b,Δh(s,a,b)𝒰h+1\exists b,\Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}\neq\emptyset then
10:              (Qhk)(s,a,)=(Q_{h}^{k})^{\prime}(s,a,\cdot)=-\infty.
11:           else
12:              (Q_{h}^{k})^{\prime}(s,a,\cdot)=Q_{h}^{k}(s,a,\cdot), and add a to A_{h}^{k}(s).
13:           end if
14:        end for
15:        μhk,υhk=ZERO-SUM-NASH((Qhk)(s,,)).\mu_{h}^{k},\upsilon_{h}^{k}=\mbox{ZERO-SUM-NASH}((Q_{h}^{k})^{\prime}(s,\cdot,\cdot)).
16:        V_{h}^{k}(s)=\mathbb{E}_{a\sim\mu_{h}^{k}(\cdot\mid s),b\sim\upsilon_{h}^{k}(\cdot\mid s)}\left[Q_{h}^{k}(s,a,b)\right].
17:     end for
18:  end for
Algorithm 4 Planning(k,r,{𝒰h}h[H],Δ(s,a,b),N,^)(k,r,\{\mathcal{U}_{h}\}_{h\in[H]},\Delta(s,a,b),N,\hat{\mathbb{P}})
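For concreteness, the ZERO-SUM-NASH step in Algorithm 4 amounts to solving a small zero-sum matrix game at each state. The sketch below is one possible implementation using scipy.optimize.linprog (our choice of solver, not prescribed by the algorithm); it returns the max-player's mixed strategy and the game value, excluding the actions masked to -\infty as described above. The adversary's equilibrium strategy can be obtained analogously from the column player's linear program.

import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(Q):
    # Q: (A, B) payoff matrix for the max-player; rows containing -inf are masked actions.
    A, B = Q.shape
    safe = ~np.isinf(Q).any(axis=1)                  # actions with no -inf entry
    if not safe.any():                               # state is unsafe: act arbitrarily
        return np.ones(A) / A, -np.inf
    Qs = Q[safe]
    n = Qs.shape[0]
    # Variables x = (mu_1, ..., mu_n, v); maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # For every adversary action b:  v - sum_a mu_a Q[a, b] <= 0.
    A_ub = np.hstack([-Qs.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    mu = np.zeros(A)
    mu[safe] = res.x[:n]
    return mu, res.x[-1]                             # mixed strategy and minimax value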

Our main results are presented in the following theorem, which shows that our algorithm achieves both O(\sqrt{T}) regret and O(\sqrt{ST}) step-wise violation.

Theorem C.1.

With probability at least 1-\delta, Algorithm 3 has regret and step-wise violation bounded by

R(T)\displaystyle R(T) =O(H3SABT)\displaystyle=O(H^{3}\sqrt{SABT})
C(T)\displaystyle C(T) =O(ST+S2ABH2),\displaystyle=O(\sqrt{ST}+S^{2}ABH^{2}),

where A=|𝒜|A=|\mathcal{A}|, B=||B=|\mathcal{B}| and S=|𝒮|.S=|\mathcal{S}|.

C.3 Proofs

Proof.

We first prove that, with probability at least 1-\delta, Q_{h}^{k}(s,a,b)\geq Q_{h}^{*}(s,a,b) and V_{h}^{k}(s)\geq V_{h}^{*}(s) for all episodes k\in[K]. We prove this fact by induction. Denote \mathbb{D}_{\mu\times\upsilon}Q(s)=\mathbb{E}_{a\sim\mu(\cdot\mid s),b\sim\upsilon(\cdot\mid s)}Q(s,a,b). The statement holds trivially for h=H+1. Suppose the bounds hold for the Q-values at step h+1; we now consider step h.

\displaystyle V_{h+1}^{k}(s) =\mathbb{D}_{\pi_{h+1}^{k}}Q_{h+1}^{k}(s)
=\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{k,safe}(s)}\mathbb{D}_{\mu\times\upsilon_{h+1}^{k}}Q_{h+1}^{k}(s)
\geq\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{k,safe}(s)}\mathbb{D}_{\mu\times\upsilon_{h+1}^{k}}Q_{h+1}^{*}(s)
\geq\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{*}(s)}\mathbb{D}_{\mu\times\upsilon_{h+1}^{k}}Q_{h+1}^{*}(s)
\geq\min_{\upsilon}\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{*}(s)}\mathbb{D}_{\mu\times\upsilon}Q_{h+1}^{*}(s)
=V_{h+1}^{*}(s),

where A_{h}^{k,safe}(s) denotes the estimated set of safe actions for the max-player at episode k and step h, A_{h}^{*}(s) denotes the true set of safe actions, and \mathrm{supp}(\mu)=\{a\in\mathcal{A}\mid\mu(a)>0\}. From our algorithm, we know A_{h}^{*}(s)\subseteq A_{h}^{k,safe}(s) for all k whenever the confidence event holds.

Then

Qhk(s,a,b)Qh(s,a,b)\displaystyle Q_{h}^{k}(s,a,b)-Q_{h}^{*}(s,a,b) =(^hkVh+1khVh+1+βhk)(s,a,b)\displaystyle=(\hat{\mathbb{P}}_{h}^{k}V_{h+1}^{k}-\mathbb{P}_{h}V_{h+1}^{*}+\beta_{h}^{k})(s,a,b)
(^hkh)(Vh+1)(s,a,b)+β(Nhk(s,a,b),δ)\displaystyle\geq(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a,b)+\beta(N_{h}^{k}(s,a,b),\delta)
0.\displaystyle\geq 0.

The last inequality follows from the Chernoff-Hoeffding inequality. Thus, define the event

6={k,h,s,a,b,Qhk(s,a,b)Qh(s,a,b),Vhk(s)Vh(s)}.\mathcal{E}_{6}=\left\{\ \forall k,h,s,a,b,\ Q_{h}^{k}(s,a,b)\geq Q_{h}^{*}(s,a,b),V_{h}^{k}(s)\geq V_{h}^{*}(s)\right\}.

Then Pr{6c}δ.\Pr\{\mathcal{E}_{6}^{c}\}\leq\delta.

Now we start to bound the regret.

R(K)\displaystyle R(K) =k=1KV1μ,υ(s1)V1μk,υk(s1)\displaystyle=\sum_{k=1}^{K}V_{1}^{\mu^{*},\upsilon^{*}}(s_{1})-V_{1}^{\mu_{k},\upsilon_{k}}(s_{1})
k=1KV1k(s1)V1μk,υk(s1).\displaystyle\leq\sum_{k=1}^{K}V_{1}^{k}(s_{1})-V_{1}^{\mu_{k},\upsilon_{k}}(s_{1}).

We first assume that \upsilon_{k} is the best response to \mu_{k}; this is without loss of generality, since replacing the actual adversary strategy by the best response only decreases V_{1}^{\mu_{k},\upsilon_{k}}(s_{1}) and hence only increases the regret, so any upper bound derived under this assumption also bounds the actual regret. We bound the per-episode regret by

Vhk(shk)Vhμk,υk(shk)\displaystyle\ \ \ \ \ V_{h}^{k}(s_{h}^{k})-V_{h}^{\mu_{k},\upsilon_{k}}(s_{h}^{k})
=𝔻πk(Qhk)(shk)𝔻μk×υkQhμk×υk(shk)\displaystyle=\mathbb{D}_{\pi^{k}}(Q_{h}^{k})(s_{h}^{k})-\mathbb{D}_{\mu_{k}\times\upsilon_{k}}Q_{h}^{\mu_{k}\times\upsilon_{k}}(s_{h}^{k})
𝔻μk×υk(Qhk)(shk)𝔻μk×υkQhμk×υk(shk)\displaystyle\leq\mathbb{D}_{\mu_{k}\times\upsilon_{k}}(Q_{h}^{k})(s_{h}^{k})-\mathbb{D}_{\mu_{k}\times\upsilon_{k}}Q_{h}^{\mu_{k}\times\upsilon_{k}}(s_{h}^{k})
=𝔻μk×υk(QhkQhμk×υk)(shk)\displaystyle=\mathbb{D}_{\mu_{k}\times\upsilon_{k}}(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k})
\displaystyle=(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k},a_{h}^{k},b_{h}^{k})+\gamma_{h}^{k}
\displaystyle\leq(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})V_{h+1}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})+\mathbb{P}_{h}\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h}^{k},a_{h}^{k},b_{h}^{k})+\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\gamma_{h}^{k}
(Vh+1kVh+1μk×υk)(sh+1k)+αhk+γhk+2β(Nhk(shk,ahk,bhk),δ)\displaystyle\leq\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h+1}^{k})+\alpha_{h}^{k}+\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)
+(^hh)(Vh+1kVh+1)(shk,ahk,bhk),\displaystyle\hskip 200.0003pt+(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k},b_{h}^{k}),

where

αhk=h(Vh+1kVh+1μk×υk)(shk,ahk,bhk)(Vh+1kVh+1μk×υk)(sh+1k),\alpha_{h}^{k}=\mathbb{P}_{h}\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h}^{k},a_{h}^{k},b_{h}^{k})-\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h+1}^{k}),
\gamma_{h}^{k}=\mathbb{D}_{\mu_{k}\times\upsilon_{k}}(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k})-(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k},a_{h}^{k},b_{h}^{k}).

The last inequality holds because (\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})V_{h+1}^{*}(s_{h}^{k},a_{h}^{k},b_{h}^{k})\leq\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta) by the Chernoff-Hoeffding inequality.

Now, by an analysis similar to that in the previous section, define the event

7\displaystyle\mathcal{E}_{7} ={s,a,b,s,k,h,|^h(ss,a,b)h(ss,a,b)|\displaystyle=\left\{\ \forall s,a,b,s^{\prime},k,h,|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a,b)-\mathbb{P}_{h}(s^{\prime}\mid s,a,b)|\right.
2h(ss,a,b)Nhk(s,a)log(2S2ABHKδ)+23Nhk(s,a)log(2S2ABHKδ)}.\displaystyle\hskip 50.00008pt\left.\leq\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s,a,b)}{N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}ABHK}{\delta}\right)}+\frac{2}{3N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}ABHK}{\delta}\right)\right\}.

This event holds with probability at least 1-\delta by Bernstein's inequality. Then, for \iota=\log\left(\frac{2S^{2}ABHK}{\delta}\right),

(^hh)(Vh+1kVh+1)(shk,ahk,bhk)\displaystyle(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k},b_{h}^{k})
s(2h(ss,a,b)Nhk(s,a)ι+23Nhk(s,a)ι)(Vh+1kVh+1)(s)\displaystyle\quad\leq\sum_{s^{\prime}}\left(\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s,a,b)}{N_{h}^{k}(s,a)}\iota}+\frac{2}{3N_{h}^{k}(s,a)}\iota\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
s(h(sshk,ahk)H+Hι2Nhk(shk,ahk)+2ι3Nhk(shk,ahk))(Vh+1kVh+1)(s)\displaystyle\quad\leq\sum_{s^{\prime}}\left(\frac{\mathbb{P}_{h}(s^{\prime}\mid s_{h}^{k},a_{h}^{k})}{H}+\frac{H\iota}{2N_{h}^{k}(s_{h}^{k},a_{h}^{k})}+\frac{2\iota}{3N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
1Hh(Vh+1kVh+1)(shk,ahk)+2H2SιNhk(shk,ahk)\displaystyle\quad\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}
1Hh(Vh+1kVh+1μk×υk)(shk,ahk)+2H2SιNhk(shk,ahk).\displaystyle\quad\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}.

The last inequality holds because \upsilon_{k} is the best response to \mu_{k}, so V_{h+1}^{*}(s^{\prime})\geq V_{h+1}^{\mu_{k}\times\upsilon_{k}}(s^{\prime}) for all s^{\prime}. Now we can get

\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\mu_{k}\times\upsilon_{k}}(s_{h}^{k})
\quad\leq\left(1+\frac{1}{H}\right)\left(V_{h+1}^{k}(s_{h+1}^{k})-V_{h+1}^{\mu_{k}\times\upsilon_{k}}(s_{h+1}^{k})\right)+2\alpha_{h}^{k}+2\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}

and then

\displaystyle V_{1}^{k}(s_{1})-V_{1}^{\mu_{k}\times\upsilon_{k}}(s_{1}) \leq\sum_{h=1}^{H}\left(1+\frac{1}{H}\right)^{H}\left(2\alpha_{h}^{k}+2\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}\right)
\leq\sum_{h=1}^{H}e\left(2\alpha_{h}^{k}+2\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}\right).

By the Azuma-Hoeffding inequality, with probability at least 1-2\delta, \sum_{k=1}^{K}\sum_{h=1}^{H}\alpha_{h}^{k}\leq O(\sqrt{H^{3}K}), \sum_{k=1}^{K}\sum_{h=1}^{H}\gamma_{h}^{k}=O(\sqrt{H^{3}K}) and

k=1Kh=1H2H2SιNhk(shk,ahk)=O(logK).\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}=O(\log K).

Note that

k=1Kh=1Hβ(Nhk(shk,ahk,bhk),δ)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta) =k=1Kh=1H7HιNhk(shk,ahk,bhk)\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H}7H\sqrt{\frac{\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}}
O(Hι(h,s,a,b)[H]×𝒮×𝒜×Nhk(shk,ahk,bhk)\displaystyle\leq O(H\iota\sum_{(h,s,a,b)\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}}\sqrt{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}
𝒪~(H3SABT).\displaystyle\leq\widetilde{\mathcal{O}}(\sqrt{H^{3}SABT}).

For the violation, the argument is similar to that of Theorem 4.2. During the learning process, the event s_{h+1}^{k}\notin\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}) can occur at most S^{2}ABH times, because each occurrence increases the summation

(h,s,a,b)[H]×𝒮×𝒜×|ΔhK(s,a,b)|\sum_{(h,s,a,b)\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}}|\Delta_{h}^{K}(s,a,b)|

by at least 1, and this summation is upper bounded by S^{2}ABH. Thus such events contribute at most S^{2}ABH^{2} violation. In all other situations, the total violation can be upper bounded by \widetilde{\mathcal{O}}(\sqrt{ST}). Thus the final violation bound is \widetilde{O}(S^{2}ABH^{2}+\sqrt{ST}). ∎

Appendix D Technical Lemmas

D.1 Lemma D.1

Lemma D.1.

If

Kε/2aKγ(K,δ)+bγ(K,δ)log(K+1),\displaystyle K\varepsilon/2\leq a\sqrt{K}\sqrt{\gamma(K,\delta)}+b\gamma(K,\delta)\log(K+1),

where γ(K,δ)=2log(SAHK/δ)+(S1)log(e(1+K/(S1)))\gamma(K,\delta)=2\log(SAHK/\delta)+(S-1)\log(e(1+K/(S-1))), then

K=𝒪~((a2ε2+bε)log(1δ)+Sbε),\displaystyle K=\widetilde{\mathcal{O}}\left(\left(\frac{a^{2}}{\varepsilon^{2}}+\frac{b}{\varepsilon}\right)\log\left(\frac{1}{\delta}\right)+\frac{Sb}{\varepsilon}\right),

where \widetilde{\mathcal{O}}(\cdot) ignores the terms \log S,\log A,\log H,\log(1/\varepsilon) and \log\log(1/\delta).

Proof.

If K\leq 4, the lemma is trivial. Now assume K\geq 4; then e(1+K)\leq 4K and

γ(K,δ)\displaystyle\gamma(K,\delta) 2log(SAHK/δ)+Slog(e(1+K))\displaystyle\leq 2\log(SAHK/\delta)+S\log(e(1+K))
2log(SAHK/δ)+Slog(4K)\displaystyle\leq 2\log(SAHK/\delta)+S\log(4K)
2Slog(4K)+log(SAH/δ).\displaystyle\leq 2S\log(4K)+\log(SAH/\delta).

Then

Kε/2\displaystyle K\varepsilon/2 aKγ(K,δ)+bγ(K,δ)log(K+1)\displaystyle\leq a\sqrt{K}\sqrt{\gamma(K,\delta)}+b\gamma(K,\delta)\log(K+1)
2max{aKγ(K,δ),bγ(K,δ)log(K+1)}.\displaystyle\leq 2\max\{a\sqrt{K}\sqrt{\gamma(K,\delta)},b\gamma(K,\delta)\log(K+1)\}.

Case 1:

If Kε/22aKγ(K,δ)K\varepsilon/2\leq 2a\sqrt{K}\sqrt{\gamma(K,\delta)}, we can get

K\displaystyle K 16a2ε2(2Slog(4K)+log(SAH/δ))\displaystyle\leq\frac{16a^{2}}{\varepsilon^{2}}(2S\log(4K)+\log(SAH/\delta))
2max{32Sa2ε2log(4K),16a2ε2log(SAH/δ)}.\displaystyle\leq 2\max\left\{\frac{32Sa^{2}}{\varepsilon^{2}}\log(4K),\frac{16a^{2}}{\varepsilon^{2}}\log(SAH/\delta)\right\}.

Subcase 1:

If K\leq\frac{64Sa^{2}}{\varepsilon^{2}}\log(4K), then by Lemma D.2 we can get K=\widetilde{\mathcal{O}}(\frac{Sa^{2}}{\varepsilon^{2}}).

Subcase 2:

If K\leq\frac{32a^{2}}{\varepsilon^{2}}\log(SAH/\delta), we complete the proof.

Case 2:

If Kε/22bγ(K,δ)log(K+1)K\varepsilon/2\leq 2b\gamma(K,\delta)\log(K+1), we can get

K\displaystyle K 4bεlog(4K)(2Slog(4K)+log(SAH/δ))\displaystyle\leq\frac{4b}{\varepsilon}\log(4K)(2S\log(4K)+\log(SAH/\delta))
2max{4bεlog(4K)2Slog(4K),4bεlog(4K)log(SAH/δ)}.\displaystyle\leq 2\max\left\{\frac{4b}{\varepsilon}\log(4K)2S\log(4K),\frac{4b}{\varepsilon}\log(4K)\log(SAH/\delta)\right\}.

Subcase 3:

If K\leq\frac{16bS}{\varepsilon}\log^{2}(4K), then by Lemma D.2 we can get K=\widetilde{\mathcal{O}}(\frac{bS}{\varepsilon}).

Subcase 4:

If K\leq\frac{8b}{\varepsilon}\log(4K)\log(SAH/\delta), then by Lemma D.2 we can get K=\widetilde{\mathcal{O}}(\frac{b}{\varepsilon}\log(1/\delta)).

Combining the four subcases completes the proof of Lemma D.1. ∎
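As a quick numerical sanity check of Lemma D.1 (illustrative only, not part of the proof), one can search for the largest K that still satisfies the inequality and compare it with the claimed order; the parameter values in the commented example are arbitrary placeholders, and S\geq 2 is assumed.

import math

def largest_feasible_K(a, b, eps, delta, S, A, H, K_max=10**8):
    # gamma(K, delta) as defined in Lemma D.1 (requires S >= 2)
    def gamma(K):
        return 2 * math.log(S * A * H * K / delta) + (S - 1) * math.log(math.e * (1 + K / (S - 1)))
    def holds(K):
        return K * eps / 2 <= a * math.sqrt(K * gamma(K)) + b * gamma(K) * math.log(K + 1)
    K = 1
    while K < K_max and holds(K):
        K *= 2
    return K // 2          # last power of two at which the inequality still held

# Example with placeholder values; compare against (a^2/eps^2 + b/eps)*log(1/delta) + S*b/eps.
# print(largest_feasible_K(a=1.0, b=1.0, eps=0.1, delta=0.005, S=10, A=4, H=5))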

D.2 Lemma D.2

Lemma D.2.

If KQlog2(4K)K\leq Q\log^{2}(4K) with Q4Q\geq 4, then we have K=𝒪~(Q)K=\widetilde{\mathcal{O}}(Q).

Proof.

Define f(x)=\frac{x}{\log^{2}(4x)}; then f^{\prime}(x)=\frac{\log^{2}(4x)-2\log(4x)}{\log^{4}(4x)}>0 whenever \log(4x)>2, so f(x) is increasing over x\geq 2 (and if K\leq 2 the lemma is trivial since Q\geq 4). Then we only need to prove that there exists a constant c_{2}>0 such that

f(c2Qlog2Q)Q.\displaystyle f(c_{2}Q\log^{2}Q)\geq Q.

In fact, if this inequality holds, any KK such that KQlog2(4K)K\leq Q\log^{2}(4K) should have f(K)Qf(c2Qlog2Q)f(K)\leq Q\leq f(c_{2}Q\log^{2}Q), and then Kc2Qlog2Q=𝒪~(Q)K\leq c_{2}Q\log^{2}Q=\widetilde{\mathcal{O}}(Q). To prove this inequality, observe that

Qlog2(4c2Qlog2Q)\displaystyle Q\log^{2}(4c_{2}Q\log^{2}Q) =Q(log4c2Q+loglog2Q)2\displaystyle=Q(\log 4c_{2}Q+\log\log^{2}Q)^{2}
2Qlog2(4c2Q)+2Q(loglog2Q)2\displaystyle\leq 2Q\log^{2}(4c_{2}Q)+2Q(\log\log^{2}Q)^{2}
4Qlog2(4c2)+4Qlog2(Q)+2Qlog2Q\displaystyle\leq 4Q\log^{2}(4c_{2})+4Q\log^{2}(Q)+2Q\log^{2}Q
(4log2(4c2)+6)Qlog2Q.\displaystyle\leq(4\log^{2}(4c_{2})+6)Q\log^{2}Q.

where the second inequality is because \log^{2}Q\leq Q for Q\geq 4, and the last inequality is because \log^{2}Q\geq 1 for Q\geq 4. So if we choose c_{2} to satisfy c_{2}\geq 4\log^{2}(4c_{2})+6, we have Q\log^{2}(4c_{2}Q\log^{2}Q)\leq c_{2}Q\log^{2}Q, which implies f(c_{2}Q\log^{2}Q)\geq Q. ∎

Appendix E Detailed Analysis of Assumption 4.1

In this section, we provide a more detailed analysis of the equivalence between s_{1}\notin\mathcal{U}_{1} and the existence of a feasible policy. Indeed, if the agent reaches some state s\in\mathcal{U}_{h} at step h, then every action may lead her to an unsafe next state s^{\prime}\in\mathcal{U}_{h+1} with positive probability. Recursively, every action sequence a_{h},a_{h+1},\cdots,a_{H-1} may lead the agent to unsafe states s_{h+1}\in\mathcal{U}_{h+1},\cdots,s_{H}\in\mathcal{U}_{H} with positive probability. As a result, the agent cannot be guaranteed to avoid unsafe states. Therefore, every feasible policy \pi must satisfy that, at any step h, the probability of being in a state s\in\mathcal{U}_{h} is zero under policy \pi:

hπ(s)=0,s𝒰h.\displaystyle\mathbb{P}_{h}^{\pi}(s)=0,\ \ \forall\ s\in\mathcal{U}_{h}. (21)

Recall that

Ahsafe(s)={a𝒜Δh(s,a)𝒰h+1=}.\displaystyle A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\Delta_{h}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}.

From these definitions, a feasible policy \pi must satisfy \pi_{h}(s)\in A_{h}^{safe}(s) for every safe state s\notin\mathcal{U}_{h}. Moreover, for any safe state s_{h}\notin\mathcal{U}_{h}, there exists at least one safe action a_{h}\in A_{h}^{safe}(s_{h}), and all possible next states s_{h+1}\in\Delta_{h}(s_{h},a_{h}) satisfy s_{h+1}\notin\mathcal{U}_{h+1}. Recursively, there exists at least one feasible action sequence a_{h}\in A_{h}^{safe}(s_{h}),\cdots,a_{H}\in A_{H}^{safe}(s_{H}) such that s_{h^{\prime}}\notin\mathcal{U}_{h^{\prime}} for all h+1\leq h^{\prime}\leq H. Hence, the assumption on the existence of a feasible policy can be simplified to s_{1}\notin\mathcal{U}_{1}.
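The argument above is constructive: once the sets \mathcal{U}_{h} and A_{h}^{safe}(s) have been computed backward, a feasible policy can be read off by picking any safe action at every safe state. A minimal single-player sketch, assuming the supports Delta[h][s][a] and the terminal unsafe set are given as Python sets with integer-indexed states, is as follows.

def extract_feasible_policy(Delta, U_H, s1, S, A, H):
    U = [set() for _ in range(H)]
    U[H - 1] = set(U_H)                    # unsafe states at the last step
    policy = [dict() for _ in range(H)]
    for h in range(H - 2, -1, -1):
        U[h] = set(U[h + 1])
        for s in range(S):
            safe = [a for a in range(A) if Delta[h][s][a].isdisjoint(U[h + 1])]
            if safe:
                policy[h][s] = safe[0]     # any safe action keeps the agent outside U_{h+1}
            else:
                U[h].add(s)                # no safe action: s becomes unsafe at step h
    return policy if s1 not in U[0] else None   # a feasible policy exists iff s1 is not in U_1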

Appendix F Experiment Setup

We run each algorithm for 20000 episodes in a grid environment (Chow et al., 2018; Wei et al., 2022) with 25 states and 4 actions (the four directions), and report the average reward and violation across runs. In the experiments, we note that the algorithm OptPess-bonus (Efroni et al., 2020) needs to solve a linear program of size O(SAH) in each episode, which is extremely slow. Thus we compare all algorithms in a relatively small environment. In the Safe-RL-SW experiments, we choose \delta=0.005 and a binary 0-1 cost function. We set the safety threshold to 0.5 for both the CMDP algorithms and the SUCBVI algorithm.
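For reference, a minimal sketch of a grid environment of this size with a binary 0-1 cost is given below; the layout, goal, and unsafe cells are hypothetical placeholders, since the exact instance used in the experiments is not fully specified here.

N = 5                                          # 5x5 grid -> 25 states
UNSAFE = {(1, 2), (2, 2), (3, 2)}              # hypothetical unsafe cells (cost 1)
GOAL = (4, 4)                                  # hypothetical goal cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TAU = 0.5                                      # safety threshold

def step(state, action_idx):
    # One deterministic transition, clipping at the walls.
    r, c = state
    dr, dc = ACTIONS[action_idx]
    next_state = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    reward = 1.0 if next_state == GOAL else 0.0
    cost = 1.0 if next_state in UNSAFE else 0.0    # binary 0-1 cost
    violation = max(cost - TAU, 0.0)               # step-wise violation (c(s) - tau)_+
    return next_state, reward, violation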

For the Safe-RFE-SW experiments, we compare our algorithm SRF-UCRL with the state-of-the-art RFE algorithm RF-UCRL (Kaufmann et al., 2021) in an MDP with 11 states and 5 actions. We choose \delta=0.005 and set the safety threshold of SRF-UCRL to 0.5. We run 500 episodes for 100 independent trials, and report the average reward in each episode and the cumulative step-wise violation. At each episode, we also calculate the expected violation of the output policy and plot the corresponding curves. All experiments are conducted on an AMD Ryzen 7 5800H with Radeon Graphics and 16GB of unified memory.