
Provably Safe Reinforcement Learning with Step-wise Violation Constraints

Nuoya Xiong
IIIS, Tsinghua University
[email protected]
   Yihan Du
IIIS, Tsinghua University
[email protected]
   Longbo Huang
IIIS, Tsinghua University
[email protected]
Abstract

We investigate a novel safe reinforcement learning problem with step-wise violation constraints. Our problem differs from existing works in that we focus on stricter step-wise violation constraints and do not assume the existence of safe actions, making our formulation more suitable for safety-critical applications that need to ensure safety in all decision steps but may not always possess safe actions, e.g., robot control and autonomous driving. We propose an efficient algorithm SUCBVI, which guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ or gap-dependent $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ step-wise violation and $\widetilde{\mathcal{O}}(\sqrt{H^{3}SAT})$ regret. Lower bounds are provided to validate the optimality in both violation and regret performance with respect to the number of states $S$ and the total number of steps $T$. Moreover, we further study an innovative safe reward-free exploration problem with step-wise violation constraints. For this problem, we design algorithm SRF-UCRL to find a near-optimal safe policy, which achieves nearly state-of-the-art sample complexity $\widetilde{\mathcal{O}}((\frac{S^{2}AH^{2}}{\varepsilon}+\frac{H^{4}SA}{\varepsilon^{2}})(\log(\frac{1}{\delta})+S))$, and guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation during exploration. Experimental results demonstrate the superiority of our algorithms in safety performance and corroborate our theoretical results.

1 Introduction

In recent years, reinforcement learning (RL) (Sutton & Barto, 2018) has become a powerful framework for decision making and learning in unknown environments. Despite the ground-breaking success of RL in games (Lanctot et al., 2019), recommendation systems (Afsar et al., 2022) and complex tasks in simulation environments (Zhao et al., 2020), most existing RL algorithms focus on optimizing the cumulative reward and do not take the risk aspect into consideration, e.g., that the agent may run into catastrophic situations during control. The lack of strong safety guarantees hinders the application of RL to broader safety-critical scenarios such as autonomous driving, robotics and healthcare. For example, for robotic control in complex environments, it is crucial to prevent the robot from getting into dangerous situations, e.g., hitting walls or falling into water pools, at all times.

To handle the safety requirement, a common approach is to formulate safety as a long-term expected violation constraint in each episode. This approach focuses on seeking a policy whose cumulative expected violation in each episode is below a certain threshold. However, for applications where an agent needs to avoid disastrous situations throughout the decision process, e.g., a robot needs to avoid hitting obstacles at each step, merely reducing the long-term expected violation is not sufficient to guarantee safety.

Motivated by this fact, we investigate safe reinforcement learning with a more fine-grained constraint, called the step-wise violation constraint, which aggregates the nonnegative violations at every step, so that positive and negative violations cannot offset each other. We name this problem Safe-RL-SW. Our step-wise violation constraint differs from the prior expected violation constraint (Wachi & Sui, 2020; Efroni et al., 2020; Kalagarla et al., 2021) in two aspects: (i) Minimizing the step-wise violation enables the agent to learn an optimal policy that avoids unsafe regions deterministically, while reducing the expected violation only guarantees to find a policy with low expected violation, instead of a per-step zero-violation policy. (ii) Reducing the aggregated nonnegative violation allows us to control the risk at each step, while a small cumulative expected violation can still result in a large cost at some individual step and cause danger, if other steps with smaller costs offset the huge cost.

Our problem faces two unique challenges. First, the step-wise violation requires us to guarantee a small violation at each step, which demands very different algorithm design and analysis from that for the expected violation (Wachi & Sui, 2020; Efroni et al., 2020; Kalagarla et al., 2021). Second, in safety-critical scenarios, the agent needs to identify not only unsafe states but also potentially unsafe states, which are states that may appear to be safe but will ultimately lead to unsafe regions with a non-zero probability. For example, a self-driving car needs to learn to slow down or change directions early, foreseeing the potential danger in advance in order to ensure safe driving (Thomas et al., 2021). Existing safe RL works focus mainly on the expected violation (Wachi & Sui, 2020; Liu et al., 2021b; Wei et al., 2022), or assume no potentially unsafe states by imposing the prior knowledge of a safe action for each state (Amani et al., 2021). Hence, techniques in previous works cannot be applied to handle step-wise violations.

To systematically handle these two challenges, we formulate safety as an unknown cost function for each state without assuming safe actions, and consider minimizing the step-wise violation instead of the expected violation. We propose a general algorithmic framework called Safe UCBVI (SUCBVI). Specifically, in each episode, we first estimate the transition kernel and cost function in an optimistic manner, tending to regard a state as safe at the beginning of learning. After that, we introduce a novel dynamic programming procedure to identify potentially unsafe states and determine safe actions, based on our estimated transitions and costs. Finally, we employ the identified safe actions to conduct value iteration. This mechanism can adaptively update dangerous regions, and help the agent plan for the future, which keeps her away from all states that may lead to unsafe states. As our estimation becomes more accurate over time, the safety violation becomes smaller and eventually converges to zero. Note that without the assumption of safe actions, the agent knows nothing about the environment at the beginning. Thus, it is impossible for her to achieve an absolute zero violation. Nevertheless, we show that SUCBVI can achieve sub-linear $\widetilde{\mathcal{O}}(\sqrt{ST})$ cumulative violation or an $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ gap-dependent violation that is independent of $T$. This violation bound implies that as the RL game proceeds, the agent eventually learns how to avoid unsafe states. We also provide a matching lower bound to demonstrate the optimality of SUCBVI in both violation and regret.

Furthermore, we apply our step-wise safe RL framework to the reward-free exploration (Jin et al., 2020) setting. In this novel safe reward-free exploration problem, the agent needs to guarantee small step-wise violations during exploration, and also output a near-optimal safe policy. Our algorithm achieves $\varepsilon$ cumulative step-wise violation during exploration, and also identifies an $\varepsilon$-optimal and safe policy. See the definitions of $\varepsilon$-optimal and $\varepsilon$-safe policies in Section 6.1. Another interesting application of our framework is safe zero-sum Markov games, which we discuss in Appendix C.

The main contributions of our paper are as follows.

  • We formulate the safe RL with step-wise violation constraint problem (Safe-RL-SW), which models safety as a cost function for states and aims to minimize the cumulative step-wise violation. Our formulation is particularly useful for safety-critical applications where avoiding disastrous situations at each decision step is desirable, e.g., autonomous driving and robotics.

  • We provide a general algorithmic framework SUCBVI, which is equipped with an innovative dynamic programming procedure to identify potentially unsafe states and distinguish safe actions. We establish $\widetilde{\mathcal{O}}(\sqrt{H^{3}SAT})$ regret and $\widetilde{\mathcal{O}}(\sqrt{ST})$ or gap-dependent $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ violation guarantees, which exhibit the capability of SUCBVI in attaining high rewards while maintaining small violation.

  • We further establish $\Omega(\sqrt{HST})$ regret and $\Omega(\sqrt{ST})$ violation lower bounds for Safe-RL-SW. The lower bounds demonstrate the optimality of algorithm SUCBVI in both regret minimization and safety guarantee, with respect to the factors $S$ and $T$.

  • We consider step-wise safety constraints in the reward-free exploration setting, which we call the Safe-RFE-SW problem. In this setting, we design an efficient algorithm SRF-UCRL, which ensures $\varepsilon$ step-wise violation during exploration and plans an $\varepsilon$-optimal and $\varepsilon$-safe policy for any reward function with probability at least $1-\delta$. We obtain an $\widetilde{\mathcal{O}}((\frac{S^{2}AH^{2}}{\varepsilon}+\frac{H^{4}SA}{\varepsilon^{2}})(\log(\frac{1}{\delta})+S))$ sample complexity and an $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation guarantee for SRF-UCRL, which shows the efficiency of SRF-UCRL in sampling and danger avoidance even without reward signals. To the best of our knowledge, this work is the first to study step-wise violation constraints in the RFE setting.

2 Related Work

Safe RL.

Safety is an important topic in RL and has been extensively studied. Constrained Markov decision process (CMDP)-based approaches handle safety via cost functions, and aim to minimize the expected episode-wise violation, e.g., (Yu et al., 2019; Wachi & Sui, 2020; Qiu et al., 2020; Efroni et al., 2020; Turchetta et al., 2020; Ding et al., 2020; Singh et al., 2020; Kalagarla et al., 2021; Simão et al., 2021; Ding et al., 2021), or achieve zero episode-wise violation, e.g., (Liu et al., 2021b; Bura et al., 2021; Wei et al., 2022; Sootla et al., 2022). Apart from CMDP-based approaches, there are also other works that tackle safe RL via control-based approaches (Berkenkamp et al., 2017; Chow et al., 2018; Dalal et al., 2018; Wang et al., 2022), policy optimization (Uchibe & Doya, 2007; Achiam et al., 2017; Tessler et al., 2018; Liu et al., 2020; Stooke et al., 2020) and safety shields (Alshiekh et al., 2018). In recent years, some works have also studied step-wise violation under additional assumptions such as safe actions (Amani et al., 2021) or a known state-action subgraph (Shi et al., 2023). More detailed comparisons are provided in Appendix A.

Reward-free Exploration with Safety.

Motivated by sparse reward signals in realistic applications, reward-free exploration (RFE) has received extensive attention (Jin et al., 2020; Kaufmann et al., 2021; Ménard et al., 2021). Recently, Miryoosefi & Jin (2022); Huang et al. (2022) also study RFE with safety constraints. However, Miryoosefi & Jin (2022) only considers the safety constraint after the exploration phase. Huang et al. (2022) considers the episode-wise constraint during the exploration phase, requiring an assumption of a known baseline policy. The detailed comparison is provided in Appendix A.

3 The Safe MDP Model

Episodic MDP. In this paper, we consider the finite-horizon episodic Markov Decision Process (MDP), represented by a tuple $(\mathcal{S},\mathcal{A},H,\mathbb{P},r)$. Here $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space, and $H$ is the length of each episode. $\mathbb{P}=\{\mathbb{P}_{h}:\mathcal{S}\times\mathcal{A}\mapsto\triangle_{\mathcal{S}}\}_{h\in[H]}$ is the transition kernel, and $\mathbb{P}_{h}(s^{\prime}|s,a)$ gives the transition probability from $(s,a)$ to $s^{\prime}$ at step $h$. $r=\{r_{h}:\mathcal{S}\times\mathcal{A}\mapsto[0,1]\}_{h\in[H]}$ is the reward function, and $r_{h}(s,a)$ gives the reward of taking action $a$ in state $s$ at step $h$. A policy $\pi=\{\pi_{h}:\mathcal{S}\mapsto\mathcal{A}\}_{h\in[H]}$ consists of $H$ mappings from the state space to the action space. In each episode $k$, the agent first chooses a policy $\pi^{k}$. At each step $h\in[H]$, the agent observes a state $s_{h}^{k}$, takes an action $a_{h}^{k}$, and then transitions to a next state $s_{h+1}^{k}$ with probability $\mathbb{P}_{h}(s_{h+1}^{k}\mid s_{h}^{k},a_{h}^{k})$. The algorithm executes $T=HK$ steps in total. Moreover, the state value function $V^{\pi}_{h}(s)$ and state-action value function $Q_{h}^{\pi}(s,a)$ for a policy $\pi$ are defined as

$$V^{\pi}_{h}(s):=\mathbb{E}_{\pi}\Bigg[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},\pi_{h^{\prime}}(s_{h^{\prime}}))\,\Bigg|\,s_{h}=s\Bigg],\qquad
Q^{\pi}_{h}(s,a):=\mathbb{E}_{\pi}\Bigg[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},\pi_{h^{\prime}}(s_{h^{\prime}}))\,\Bigg|\,s_{h}=s,a_{h}=a\Bigg].$$

Safety Constraint. To model unsafe regions in the environment, similar to Wachi & Sui (2020); Yu et al. (2022), we define a safety cost function $c:\mathcal{S}\mapsto[0,1]$. Let $\tau\in[0,1]$ denote the safety threshold. A state $s$ is called safe if $c(s)\leq\tau$, and called unsafe if $c(s)>\tau$. Similar to Efroni et al. (2020); Amani et al. (2021), when the agent arrives in a state $s$, she will receive a cost signal $z(s)=c(s)+\zeta$, where $\zeta$ is an independent, zero-mean and 1-sub-Gaussian noise. Denote $(x)_{+}=\max\{x,0\}$. The violation in state $s$ is defined as $(c(s)-\tau)_{+}$, and the cumulative step-wise violation till episode $K$ is

$$C(K)=\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}. \tag{1}$$

Eq. (1) represents the accumulated step-wise violation during training. When the agent arrives in state $s_{h}^{k}$ at step $h$ in episode $k$, she suffers violation $(c(s_{h}^{k})-\tau)_{+}$. This violation setting is significantly different from the previous CMDP setting (Qiu et al., 2020; Ding et al., 2020; Efroni et al., 2020; Ding et al., 2021; Wachi & Sui, 2020; Liu et al., 2021a; Kalagarla et al., 2021), which studies the episode-wise expected violation $C^{\prime}(K)=\sum_{k=1}^{K}(\mathbb{E}[\sum_{h=1}^{H}c(s_{h}^{k},a_{h}^{k})]-\mu)$. There are two main differences between the step-wise violation and the episode-wise expected violation: (i) First, the expected violation constraint allows the agent to get into unsafe states occasionally, whereas the step-wise violation constraint requires the agent to stay in safe regions at all times. (ii) Second, in the episode-wise constraint, the violation $c(s_{h},a_{h})-\tau$ at each step is allowed to be positive or negative, so violations can cancel out within one episode. Instead, we consider a nonnegative function $(c(s)-\tau)_{+}$ for each step in our step-wise violation, which imposes a stricter constraint.
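To make the contrast concrete, the following minimal sketch computes both quantities on a single toy episode; the cost values, the threshold, and the use of $\tau H$ as the per-episode budget are illustrative assumptions, not values from the paper.

```python
import numpy as np

tau = 0.5
costs = np.array([0.1, 0.9, 0.2, 0.8, 0.0])  # illustrative c(s_h) along one episode

# Step-wise violation as in Eq. (1): nonnegative excesses are summed, no cancellation.
step_wise = np.sum(np.maximum(costs - tau, 0.0))   # 0.4 + 0.3 = 0.7

# Episode-wise (CMDP-style) violation: signed excesses can cancel within the episode.
episode_wise = np.sum(costs) - tau * len(costs)    # 2.0 - 2.5 = -0.5, i.e., no violation
print(step_wise, episode_wise)
```

The episode as a whole satisfies the expected constraint even though two individual steps exceed the threshold, which is exactly the cancellation effect ruled out by the step-wise criterion.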

Define $\mathcal{U}:=\{s\in\mathcal{S}\mid c(s)>\tau\}$ as the set of all unsafe states. Let $\iota=\{s_{1},a_{1},\cdots,s_{H},a_{H}\}$ denote a trajectory. Since a feasible policy needs to satisfy the constraint at each step, we define the set of feasible policies as $\Pi=\{\pi\mid\Pr\{\exists\ h\in[H],s_{h}\in\mathcal{U}\mid\iota\sim\pi\}=0\}$. The feasible policy set $\Pi$ consists of all policies under which one never reaches any unsafe state in an episode.

Learning Objective. In this paper, we consider the regret minimization objective. Specifically, define $\pi^{*}=\operatorname{argmax}_{\pi\in\Pi}V_{1}^{\pi}$, $V^{*}=V^{\pi^{*}}$ and $Q^{*}=Q^{\pi^{*}}$. The regret after $K$ episodes is then defined as

$$R(K)=\sum_{k=1}^{K}\big(V^{*}_{1}(s_{1})-V^{\pi^{k}}_{1}(s_{1})\big),$$

where $\pi^{k}$ is the policy taken in episode $k$. Our objective is to minimize $R(K)$ to achieve a good performance, and minimize the violation $C(K)$ to guarantee safety at the same time.

4 Safe RL with Step-wise Violation Constraints

4.1 Assumptions and Problem Features

Before introducing our algorithms, we first state the important assumptions and problem features for Safe-RL-SW.

Suppose $\pi$ is a feasible policy. Then, if the agent arrives at $s_{H-1}$ at step $H-1$, $\pi$ needs to select an action that guarantees $s_{H}\notin\mathcal{U}$. Define the transition set $\Delta_{h}(s,a)=\{s^{\prime}\mid\mathbb{P}_{h}(s^{\prime}\mid s,a)>0\}$ for any $(s,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]$, which represents the set of possible next states after taking action $a$ in state $s$ at step $h$. Then, at the previous step $H-1$, a state $s$ is potentially unsafe if it satisfies $\Delta_{H-1}(s,a)\cap\mathcal{U}\neq\emptyset$ for all $a\in\mathcal{A}$ (i.e., no matter which action is taken, there is a positive probability of transitioning to an unsafe next state). Therefore, we can recursively define the set of potentially unsafe states at step $h$ as

$$\mathcal{U}_{h}=\mathcal{U}_{h+1}\cup\{s\mid\forall\ a\in\mathcal{A},\ \Delta_{h}(s,a)\cap\mathcal{U}_{h+1}\neq\emptyset\}, \tag{2}$$

where $\mathcal{U}_{H}=\mathcal{U}$. Intuitively, if we are in a state $s_{h}\in\mathcal{U}_{h}$ at step $h$, no action can be taken to completely avoid reaching a potentially unsafe state $s_{h+1}\in\mathcal{U}_{h+1}$ at step $h+1$. Thus, in order to completely prevent the agent from getting into unsafe states $\mathcal{U}$ throughout all steps, one needs to avoid the potentially unsafe states in $\mathcal{U}_{h}$ at each step $h$. From the above argument, we have that $s_{1}\notin\mathcal{U}_{1}$ is equivalent to the existence of feasible policies. The detailed proof is provided in Appendix E. Thus, we make the following necessary assumption.

Assumption 4.1 (Existence of feasible policies).

The initial state $s_{1}$ satisfies $s_{1}\notin\mathcal{U}_{1}$.

For any $s\in\mathcal{S}$ and $h\in[H-1]$, we define the set of safe actions for state $s$ at step $h$ as

$$A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\Delta_{h}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}, \tag{3}$$

and let $A_{H}^{safe}(s)=\mathcal{A}$. $A_{h}^{safe}(s)$ stands for the set of all actions at step $h$ which will not lead to potentially unsafe states in $\mathcal{U}_{h+1}$. Here, $\{\mathcal{U}_{h}\}_{h\in[H]}$ and $\{A_{h}^{safe}(s)\}_{h\in[H]}$ are defined by dynamic programming: if we know all possible next-state sets $\{\Delta_{h}(s,a)\}_{h\in[H]}$ and the unsafe state set $\mathcal{U}=\mathcal{U}_{H}$, we can calculate all potentially unsafe state sets $\{\mathcal{U}_{h}\}_{h\in[H]}$ and safe action sets $\{A_{h}^{safe}(s)\}_{h\in[H]}$, and choose feasible policies to completely avoid unsafe states.
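A minimal sketch of this backward recursion is given below, assuming the transition supports $\Delta_h(s,a)$ and the unsafe set $\mathcal{U}_H$ are stored as Python sets indexed by step and state-action pair; all container names are illustrative.

```python
def potentially_unsafe_and_safe_actions(states, actions, H, Delta, U_H):
    """Backward recursion of Eqs. (2)-(3): returns the potentially unsafe
    state sets U[h] and the safe action sets A_safe[h][s] for h = 1, ..., H."""
    U = {H: set(U_H)}
    A_safe = {H: {s: set(actions) for s in states}}   # A_H^safe(s) = A
    for h in range(H - 1, 0, -1):
        # Eq. (3): actions whose possible next states avoid U_{h+1}.
        A_safe[h] = {s: {a for a in actions
                         if not (Delta[h][(s, a)] & U[h + 1])}
                     for s in states}
        # Eq. (2): s is potentially unsafe if it already was at step h+1,
        # or if no action keeps every possible next state outside U_{h+1}.
        U[h] = U[h + 1] | {s for s in states if not A_safe[h][s]}
    return U, A_safe
```

Assumption 4.1 then corresponds to checking that the initial state does not appear in `U[1]`.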

4.2 Algorithm SUCBVI

Now we present our main algorithm Safe UCBVI (SUCBVI), which is based on the classic RL algorithm UCBVI (Azar et al., 2017) and equipped with a novel dynamic programming procedure to identify potentially unsafe states and safe actions. The pseudo-code is shown in Algorithm 1. First, we provide some intuition about how SUCBVI works. In each episode, we first estimate all the unsafe states based on the historical data. Then, we perform the dynamic programming procedure introduced in Section 4.1 and calculate the safe action set $A_{h}^{safe}(s)$ for each state $s$. Finally, we perform value iteration over the estimated safe action sets. As the estimation becomes more accurate, SUCBVI eventually avoids potentially unsafe states and achieves both sublinear regret and violation.

Now we begin to introduce the algorithm in detail. In the beginning, we initialize $\Delta_{h}(s,a)=\emptyset$ for all $(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}$. This means that the agent considers all actions to be safe at first, because no action leads to unsafe states from the agent's perspective. In each episode, we first estimate the empirical cost $\hat{c}(s)$ based on historical cost feedback $z(s)$, and regard state $s$ as unsafe only if $\bar{c}(s)=\hat{c}(s)-\beta>\tau$ for some bonus term $\beta$, which aims to guarantee $\bar{c}(s)\leq c(s)$ (Line 4). Then, we calculate the estimated unsafe state set $\mathcal{U}_{H}^{k}$, which is a subset of the true unsafe state set $\mathcal{U}=\mathcal{U}_{H}$ with high probability by optimism. With $\mathcal{U}_{H}^{k}$ and $\Delta_{h}^{k}(s,a)$, we can estimate the potentially unsafe state sets $\mathcal{U}_{h}^{k}$ for all $h\in[H]$ by applying Eq. (2) recursively.

Then, we perform value iteration to compute the optimistically estimated optimal policy. Specifically, for any hypothesized safe state $s\notin\mathcal{U}_{h}^{k}$, we take the greedy policy over the estimated safe actions, i.e., $\pi_{h}^{k}(s)=\operatorname{argmax}_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a)$ (Line 13). On the other hand, for any hypothesized unsafe state $s\in\mathcal{U}_{h}^{k}$, since no action can completely avoid unsafe states, we ignore safety costs and simply update the policy by $\pi_{h}^{k}(s)=\operatorname{argmax}_{a\in\mathcal{A}}Q_{h}^{k}(s,a)$ (Line 15). After that, we obtain the estimated optimal policy $\pi^{k}$ for episode $k$, and the agent follows $\pi^{k}$ and collects a trajectory. Then, we update $\Delta_{h}^{k+1}(s,a)$ by incorporating the observed state $s_{h+1}^{k}$ into the set $\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$. Under this updating rule, it holds that $\Delta_{h}^{k}(s,a)\subseteq\Delta_{h}(s,a)$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$.

1:  Initialize: $\Delta_{h}^{1}(s,a)=\emptyset$, $N_{h}^{1}(s,a)=N_{h}^{1}(s,a,s^{\prime})=0$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, $s^{\prime}\in\mathcal{S}$ and $h\in[H]$.
2:  for $k=1,2,\cdots,K$ do
3:     $\triangleright$ Update the optimistic estimates of cost.
4:     Update the empirical cost $\hat{c}(s)$ and calculate $\bar{c}(s)=\hat{c}(s)-\beta(N^{k}(s),\delta)$ for all $s\in\mathcal{S}$.
5:     Define $\mathcal{U}_{H}^{k}=\{s\mid\bar{c}(s)>\tau\}$ and $\mathcal{U}_{h}^{k}=\mathcal{U}_{h+1}^{k}\cup\{s\mid\forall a\in\mathcal{A},\Delta_{h}^{k}(s,a)\cap\mathcal{U}_{h+1}^{k}\neq\emptyset\}$ for all $h\in[H]$ recursively. Calculate $\{A_{h}^{k,safe}(s)\}_{h\in[H]}$ by Eq. (3) with $\{\mathcal{U}_{h}^{k}\}_{h\in[H]}$. $\triangleright$ Perform value iteration with previous estimates.
6:     for $h=H,H-1,\cdots,1$ do
7:        for $s\in\mathcal{S}$ do
8:           for $a\in\mathcal{A}$ do
9:              Compute $\hat{\mathbb{P}}^{k}_{h}(s^{\prime}\mid s,a)=\frac{N_{h}^{k}(s,a,s^{\prime})}{N_{h}^{k}(s,a)}$.
10:              $Q_{h}^{k}(s,a)=\min\{H,r_{h}(s,a)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a)V_{h+1}^{k}(s^{\prime})+\alpha(N_{h}^{k}(s,a))\}$.
11:           end for
12:           if $s\notin\mathcal{U}_{h}^{k}$ then
13:              $V_{h}^{k}(s)=\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a)$, $\pi_{h}^{k}(s)=\arg\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a)$.
14:           else
15:              $V_{h}^{k}(s)=\max_{a\in\mathcal{A}}Q_{h}^{k}(s,a)$, $\pi_{h}^{k}(s)=\arg\max_{a\in\mathcal{A}}Q_{h}^{k}(s,a)$.
16:           end if
17:        end for
18:     end for
19:     for $h=1,2,\cdots,H$ do
20:        Take action $a_{h}^{k}=\pi_{h}^{k}(s_{h}^{k})$ and observe state $s_{h+1}^{k}$. $\triangleright$ Update the estimates of $\Delta(s,a)$.
21:        $\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k})=\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})\cup\{s_{h+1}^{k}\}$. Increase $N_{h}^{k+1}(s_{h}^{k},a_{h}^{k})$, $N_{h}^{k+1}(s_{h}^{k},a_{h}^{k},s_{h+1}^{k})$ by $1$.
22:     end for
23:  end for
Algorithm 1 SUCBVI
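As a concrete reading of Lines 9-15, the sketch below performs one backward step of the optimistic value iteration with the greedy action restricted to the estimated safe set; the array layout, the `bonus` callable, and all variable names are illustrative assumptions, with the Hoeffding-style bonus taken from the form of $\alpha(n,\delta)$ in Theorem 4.2.

```python
import numpy as np

def hoeffding_bonus(n, H, S, A, K, delta):
    # alpha(n, delta) = 7H * sqrt(ln(5SAHK/delta) / n) as stated in Theorem 4.2.
    return 7 * H * np.sqrt(np.log(5 * S * A * H * K / delta) / max(n, 1))

def backup_step(h, S, A, H, r_h, N_h, N_next_h, V_next, U_k_h, A_safe_k_h, bonus):
    """One backward step (Lines 9-15): r_h and N_h are (S, A) arrays, N_next_h is
    (S, A, S), V_next is the length-S value vector of step h+1, U_k_h is the set
    of hypothesized unsafe states, and A_safe_k_h maps each state to its safe actions."""
    Q = np.zeros((S, A))
    V = np.zeros(S)
    pi = np.zeros(S, dtype=int)
    for s in range(S):
        for a in range(A):
            n = max(N_h[s, a], 1)
            P_hat = N_next_h[s, a] / n                         # empirical transition row
            Q[s, a] = min(H, r_h[s, a] + P_hat @ V_next + bonus(n))
        # Restrict the greedy step to estimated safe actions for hypothesized
        # safe states (Line 13); otherwise use all actions (Line 15).
        cand = sorted(A_safe_k_h[s]) if s not in U_k_h else list(range(A))
        pi[s] = cand[int(np.argmax(Q[s, cand]))]
        V[s] = Q[s, pi[s]]
    return Q, V, pi
```

Running this step backward from $h=H$ to $h=1$ with $V_{H+1}^{k}=0$, e.g. with `bonus = lambda n: hoeffding_bonus(n, H, S, A, K, delta)`, recovers the planning phase of each episode.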

The performance of Algorithm 1 is summarized below in Theorem 4.2.

Theorem 4.2.

Let $\alpha(n,\delta)=7H\sqrt{\frac{\ln(5SAHK/\delta)}{n}}$ and $\beta(n,\delta)=\sqrt{\frac{2}{n}\log(SK/\delta)}$. With probability at least $1-\delta$, the regret and step-wise violation of Algorithm 1 are bounded by

$$R(K)=\widetilde{\mathcal{O}}(\sqrt{H^{2}SAT}),\qquad C(K)=\widetilde{\mathcal{O}}(\sqrt{ST}+S^{2}AH^{2}).$$

Moreover, if $\mathcal{C}_{\mathrm{gap}}\triangleq\min_{s\in\mathcal{U}}(c(s)-\tau)_{+}>0$, we have $C(K)=\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$.

Theorem 4.2 shows that SUCBVI achieves both sublinear regret and violation. Moreover, when all the unsafe states have a large cost compared to the safety threshold $\tau$, i.e., $\mathcal{C}_{\mathrm{gap}}=\min_{s\in\mathcal{U}}(c(s)-\tau)_{+}>0$ is a constant, we can distinguish the unsafe states easily and obtain a violation bound that is independent of $T$. In particular, when $\tau=1$, Safe-RL-SW degenerates to the unconstrained MDP and Theorem 4.2 maintains the same regret $\widetilde{\mathcal{O}}(\sqrt{H^{2}SAT})$ as UCBVI-CH (Azar et al., 2017), while CMDP algorithms (Efroni et al., 2020; Liu et al., 2021a) suffer a larger $\widetilde{\mathcal{O}}(\sqrt{H^{3}S^{3}AT})$ regret.

We provide the main analysis ideas of Theorem 4.2 here. First, by the updating rule of $\Delta_{h}^{k}(s,a)$, we show that $\mathcal{U}_{h}^{k}$ and $A_{h}^{k,safe}(s)$ have the following crucial properties: $\mathcal{U}_{h}^{k}\subseteq\mathcal{U}_{h}$ and $A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s)$. Based on these properties, we can prove that $Q^{*}_{h}(s,a)\leq Q_{h}^{k}(s,a)$ and $V^{*}_{h}(s)\leq V_{h}^{k}(s)$ for all $(s,a)$, and then apply the regret decomposition techniques to derive the regret bound.

Recall that $\pi^{k}$ is a feasible policy with respect to the estimated unsafe state set $\mathcal{U}_{H}^{k}$ and transition sets $\{\Delta_{h}^{k}(s,a)\}_{h\in[H]}$. If the agent takes policy $\pi^{k}$ and the transitions follow $\Delta_{h}^{k}(s,a)$, i.e., $s_{h+1}^{k}\in\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$ for all $h$, the agent never arrives at any estimated unsafe state $s\in\mathcal{U}_{H}^{k}$ in this episode. Hence the agent suffers at most $\widetilde{\mathcal{O}}(\sqrt{ST})$ step-wise violation or $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ gap-dependent violation. Yet, the condition $s_{h+1}^{k}\in\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$ does not always hold for all $h\in[H]$. In the case when $s_{h+1}^{k}\notin\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})$, we add the newly observed state $s_{h+1}^{k}$ into $\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k})$ (Line 21). We can show that this case occurs at most $S^{2}AH$ times, and thus incurs $O(S^{2}AH^{2})$ additional violation. Combining the above two cases, the total violation can be upper bounded.

5 Lower Bounds for Safe-RL-SW

In this section, we provide a matching lower bound for the Safe-RL-SW problem in Section 4. The lower bound shows that if an algorithm always achieves a sublinear regret in Safe-RL-SW, it must incur an $\Omega(\sqrt{ST})$ violation. This result matches our upper bound in Theorem 4.2, showing that SUCBVI achieves the optimal violation performance.

Theorem 5.1.

If an algorithm has an expected regret $\mathbb{E}_{\pi}[R(K)]\leq\frac{HK}{24}$ for all MDP instances, there exists an MDP instance in which the algorithm suffers expected violation $\mathbb{E}_{\pi}[C(K)]=\Omega(\sqrt{ST})$.

Now we validate the optimality in terms of regret. Note that if we do not consider safety constraints, the lower bound for classic RL (Osband & Van Roy, 2016) can be applied to our setting. Thus, we also have an $\Omega(\sqrt{T})$ regret lower bound. To understand the essential hardness brought by safety constraints, we further investigate whether safety constraints will lead to $\Omega(\sqrt{T})$ regret, given that we can achieve an $o(\sqrt{T})$ regret on some benign instances without the safety constraints.

Theorem 5.2.

For any $\alpha\in(0,1)$, there exist a parameter $n$ and $n$ MDPs $M_{1},\dots,M_{n}$ satisfying that:

1. If we do not consider any constraint, there is an algorithm which achieves $\widetilde{\mathcal{O}}(T^{(1-\alpha)/2})$ regret compared to the unconstrained optimal policy on all MDPs.

2. If we consider the safety constraint, any algorithm with $O(T^{1-\alpha})$ expected violation will suffer $\Omega(\sqrt{HST})$ regret compared to the constrained optimal policy on at least one of the MDPs.

Intuitively, Theorem 5.2 shows that if one achieves sublinear violation, she must suffer at least $\Omega(\sqrt{T})$ regret even if she can achieve $o(\sqrt{T})$ regret without considering constraints. This theorem demonstrates the hardness particularly brought by the step-wise constraint, and corroborates the optimality of our results. Combined with Theorem 5.1, the two lower bounds show an essential trade-off between the violation and performance.

6 Safe Reward-Free Exploration with Step-wise Violation Constraints

6.1 Formulation of Safe-RFE-SW

In this section, we consider safe RL in the reward-free exploration (RFE) setting (Jin et al., 2020; Kaufmann et al., 2021; Ménard et al., 2021), called Safe-RFE-SW, to show the generality of our proposed framework. In the RFE setting, the agent does not have access to reward signals and only receives random safety cost feedback $z(s)=c(s)+\zeta$. To impose safety requirements, Safe-RFE-SW requires the agent to keep small safety violations during exploration, and to output a near-optimal safe policy after receiving the reward function.

Definition 6.1 ($(\varepsilon,\delta)$-optimal safe algorithm for Safe-RFE-SW).

An algorithm is $(\varepsilon,\delta)$-optimal safe for Safe-RFE-SW if it outputs a triple $(\hat{\mathbb{P}},\hat{\Delta}(s,a),\hat{\mathcal{U}}_{H})$ such that for any reward function $r$, with probability at least $1-\delta$,

$$V_{1}^{*}(s_{1};r)-V_{1}^{\hat{\pi}^{*}}(s_{1};r)\leq\varepsilon,\qquad\mathbb{E}_{\hat{\pi}^{*}}\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]\leq\varepsilon, \tag{4}$$

where $\hat{\pi}^{*}$ is the optimal feasible policy with respect to $(\hat{\mathbb{P}},\hat{\Delta}(s,a),\hat{\mathcal{U}}_{H},r)$, and $V(s;r)$ is the value function under reward function $r$. We say that a policy is $\varepsilon$-optimal if it satisfies the left inequality in Eq. (4) and $\varepsilon$-safe if it satisfies the right inequality in Eq. (4). We measure the performance by the number of episodes used before the algorithm terminates, i.e., the sample complexity. Moreover, the cumulative step-wise violation till episode $K$ is defined as $C(K)=\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}$. In Safe-RFE-SW, our goal is to design an $(\varepsilon,\delta)$-optimal safe algorithm, and minimize both the sample complexity and the violation.
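As a concrete check of the right-hand condition in Eq. (4), the sketch below estimates the expected step-wise violation of a fixed policy by Monte Carlo rollouts; the simulator interface (`reset`, `cost`, `step`) and all names are illustrative assumptions, not the paper's API.

```python
def estimate_expected_violation(env, policy, H, tau, n_rollouts=10000):
    """Monte Carlo estimate of E_pi[ sum_{h=1}^H (c(s_h) - tau)_+ ] for a fixed
    policy. env.reset(), env.cost(s) and env.step(h, s, a) form an assumed
    simulator interface, where cost(s) returns the true cost c(s)."""
    total = 0.0
    for _ in range(n_rollouts):
        s = env.reset()
        for h in range(1, H + 1):
            total += max(env.cost(s) - tau, 0.0)   # violation suffered in state s_h
            s = env.step(h, s, policy[h][s])       # transition to s_{h+1}
    return total / n_rollouts
```

Up to Monte Carlo error, a policy is $\varepsilon$-safe in the sense of Definition 6.1 when this estimate is at most $\varepsilon$.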

6.2 Algorithm SRF-UCRL

The Safe-RFE-SW problem requires us to handle extra safety constraints both during the exploration phase and for the final output policy, which calls for new algorithm design and techniques compared to previous RFE algorithms. Also, the techniques for Safe-RL-SW in Section 4.2 are not sufficient for guaranteeing the safety of the output policy, because SUCBVI only bounds the step-wise violation during exploration.

We design an efficient algorithm Safe RF-UCRL (SRF-UCRL), which builds upon the previous RFE algorithm RF-UCRL (Kaufmann et al., 2021). SRF-UCRL distinguishes potentially unsafe states and safe actions by backward iteration, and introduces a new uncertainty function to guarantee the safety of the output policy. Algorithm 2 illustrates the procedure of SRF-UCRL. Specifically, in each episode $k$, we first execute the policy $\pi^{k}$ computed from previous episodes, and then update the estimated next-state sets $\{\Delta^{k}_{h}(s,a)\}_{h\in[H]}$ and the unsafe state set $\mathcal{U}_{H}^{k}$ by optimistic estimation. Then, we use Eq. (2) to calculate the potentially unsafe state sets $\mathcal{U}_{h}^{k}$ for all steps $h\in[H]$. After that, we update the uncertainty function $\overline{W}^{k}$ defined in Eq. (5) below and compute the policy $\pi^{k+1}$ that maximizes the uncertainty to encourage more exploration in the next episode.

Now we provide the definition of the uncertainty function, which measures the estimation error between the empirical MDP and the true MDP. For any safe state-action pair $s\notin\mathcal{U}_{h}$, $a\in A_{h}^{k,safe}(s)$, we define

$$\overline{W}_{h}^{k}(s,a)=\min\left\{H,\ M(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a)\max_{b\in A^{k,safe}_{h+1}(s^{\prime})}\overline{W}_{h+1}^{k}(s^{\prime},b)\right\}, \tag{5}$$

where $M(N_{h}^{k}(s,a),\delta)=2H\sqrt{\frac{2\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}}+\frac{SH\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}$, and $\gamma(n,\delta)$ is a logarithmic term that is formally defined in Theorem 6.2. For other state-action pairs $(s,a)$, we relax the restriction $b\in A_{h+1}^{k,safe}(s^{\prime})$ to $b\in\mathcal{A}$ in the $\max$ operator. Our algorithm stops when $\overline{W}^{K}_{1}(s_{1},\pi^{K}(s_{1}))$ falls below $\varepsilon/2$. Compared to previous RFE works (Kaufmann et al., 2021; Ménard et al., 2021), our uncertainty function has two distinctions. First, for any safe state-action pair $(s,a)\in(\mathcal{S}\setminus\mathcal{U}_{h},A_{h}^{k,safe}(s))$, Eq. (5) considers only safe actions $A_{h+1}^{k,safe}(s^{\prime})$, which guarantees that the agent focuses on safe policies. Second, Eq. (5) incorporates an additional term $SH\gamma(N_{h}^{k}(s,a),\delta)/N_{h}^{k}(s,a)$ to control the expected violation of feasible policies. Now we present our result for Safe-RFE-SW.
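The backward computation of Eq. (5) can be sketched as follows; the data layout ($\hat{\mathbb{P}}$ and counts as arrays, $\mathcal{U}_h^k$ and $A_h^{k,safe}$ as sets, `gamma` as a callable) is an illustrative assumption.

```python
import numpy as np

def uncertainty(S, A, H, P_hat, N, U_k, A_safe_k, gamma, delta):
    """Backward recursion for the uncertainty function of Eq. (5).
    P_hat[h] has shape (S, A, S) and N[h] has shape (S, A); U_k[h] is a set of
    states and A_safe_k[h][s] a set of actions; W[H+1] = 0 is the terminal layer."""
    W = np.zeros((H + 2, S, A))
    for h in range(H, 0, -1):
        for s in range(S):
            for a in range(A):
                n = max(N[h][s, a], 1)
                g = gamma(n, delta)
                M = 2 * H * np.sqrt(2 * g / n) + S * H * g / n
                exp_next = 0.0
                for s2 in range(S):
                    if h == H:
                        best = 0.0                      # W_{H+1} = 0
                    else:
                        # Restrict the inner max to safe next actions only when
                        # (s, a) itself is a hypothesized safe pair.
                        safe_pair = (s not in U_k[h]) and (a in A_safe_k[h][s])
                        acts = A_safe_k[h + 1][s2] if safe_pair and A_safe_k[h + 1][s2] else range(A)
                        best = max(W[h + 1][s2, b] for b in acts)
                    exp_next += P_hat[h][s, a, s2] * best
                W[h][s, a] = min(H, M + exp_next)
    return W
```

The stopping rule of Algorithm 2 then checks whether `W[1][s_1, pi_1(s_1)]` has dropped below $\varepsilon/2$.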

Theorem 6.2.

Let $\gamma(n,\delta)=2\big(\log(2SAH/\delta)+(S-1)\log(e(1+n/(S-1)))\big)$. Algorithm 2 is an $(\varepsilon,\delta)$-optimal safe algorithm with sample complexity at most (here $\widetilde{\mathcal{O}}(\cdot)$ ignores all $\log S$, $\log A$, $\log H$, $\log(1/\varepsilon)$ and $\log(\log(1/\delta))$ terms)

$$K=\widetilde{\mathcal{O}}\left(\left(\frac{S^{2}AH^{2}}{\varepsilon}+\frac{H^{4}SA}{\varepsilon^{2}}\right)\left(\log\left(\frac{1}{\delta}\right)+S\right)\right).$$

The step-wise violation of Algorithm 2 during exploration is $C(K)=\widetilde{\mathcal{O}}(S^{2}AH^{2}+\sqrt{ST})$.

Compared to the previous work (Kaufmann et al., 2021) with $\widetilde{\mathcal{O}}((H^{4}SA/\varepsilon^{2})(\log(1/\delta)+S))$ sample complexity, our result has an additional term $\widetilde{\mathcal{O}}((S^{2}AH^{2}/\varepsilon)(\log(1/\delta)+S))$. This term is due to the additional safety requirement on the final output policy, which was not considered in previous RFE algorithms (Kaufmann et al., 2021; Ménard et al., 2021). When $\varepsilon$ and $\delta$ are sufficiently small, the leading term is $\widetilde{\mathcal{O}}((H^{4}SA/\varepsilon^{2})\log(1/\delta))$, which implies that our algorithm satisfies the safety constraint without suffering additional regret. We note that Ménard et al. (2021) improve the result by a factor of $H$ and replace $\log(1/\delta)+S$ by $\log(1/\delta)$ via a Bernstein-type inequality. While they do not consider safety constraints, we believe that a similar improvement can also be applied to our framework without significant changes in our analysis. However, since our paper mainly focuses on tackling the safety constraint for reward-free exploration, we use the Hoeffding-type inequality to keep the statements succinct.

1:  Initialize: $k=1$, $\overline{W}_{h}^{0}(s,a)=H$, $\pi_{h}^{1}(s)=a_{1}$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$ and $h\in[H]$.
2:  while $\overline{W}_{1}^{k-1}(s_{1},\pi_{1}^{k}(s_{1}))>\varepsilon/2$ do
3:     for $h=1,\cdots,H$ do
4:        Observe state $s_{h}^{k}$. Take action $a_{h}^{k}=\pi_{h}^{k}(s_{h}^{k})$ and observe $s_{h+1}^{k}$. $\triangleright$ Update the optimistic estimate of cost.
5:        Calculate the empirical cost $\hat{c}(s)$ and $\bar{c}(s)=\hat{c}(s)-\beta(N^{k}(s),\delta)$ for all $s\in\mathcal{S}$. $\triangleright$ Update the estimates of $\Delta_{h}(s,a)$ and $\mathcal{U}_{h}$.
6:        Update $\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})=\Delta_{h}^{k-1}(s_{h}^{k},a_{h}^{k})\cup\{s_{h+1}^{k}\}$ and $\{\mathcal{U}_{h}^{k}\}_{h\in[H]}$ by Eq. (2).
7:        Calculate $A_{h}^{k,safe}(s)$ for all $s\in\mathcal{S}$ by Eq. (3).
8:     end for
9:     Update $N_{h}^{k}(s,a,s^{\prime})$, $N_{h}^{k}(s,a)$ and $\hat{\mathbb{P}}_{h}^{k}$ for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, $s^{\prime}\in\mathcal{S}$ and $h\in[H]$.
10:     Compute $\overline{W}^{k}$ according to Eq. (5). $\triangleright$ Calculate the greedy policy.
11:     for $h\in[H]$, $s\in\mathcal{S}$ do
12:        if $A_{h}^{k,safe}(s)\neq\emptyset$ then
13:           $\pi_{h}^{k+1}(s)=\arg\max_{a\in A_{h}^{k,safe}(s)}\overline{W}_{h}^{k}(s,a)$.
14:        else
15:           $\pi_{h}^{k+1}(s)=\arg\max_{a\in\mathcal{A}}\overline{W}_{h}^{k}(s,a)$.
16:        end if
17:     end for
18:     Set $k=k+1$.
19:  end while
20:  return the tuple $(\hat{\mathbb{P}}_{h}^{k-1},\Delta_{h}^{k-1}(s,a),\mathcal{U}_{H}^{k-1})$.
Algorithm 2 SRF-UCRL

6.3 Analysis for Algorithm SRF-UCRL

For the analysis of the step-wise violation (Eq. (1)), similar to algorithm SUCBVI, algorithm SRF-UCRL estimates the next-state sets $\Delta_{h}(s,a)$ and the potentially unsafe state sets $\mathcal{U}_{h}$, which guarantees an $\widetilde{\mathcal{O}}(\sqrt{ST})$ step-wise violation. Now we give a proof sketch for the $\varepsilon$-safe property of the output policy (Eq. (4)).

First, if $\pi$ is a feasible policy for $(\hat{\mathbb{P}}^{k},\Delta^{k}(s,a),\{\mathcal{U}_{h}^{k}\}_{h\in[H]})$ and $s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})$ for all $h\in[H]$, the agent who follows policy $\pi$ only visits the estimated safe states. Since each estimated safe state suffers only $O(1/\sqrt{N_{t}(s_{h},\pi_{h}(s_{h}))})$ violation, the violation incurred in this case is bounded by $\overline{W}_{1}^{k}(s_{1},\pi_{1}(s_{1}))$. Next, we bound the probability that $s_{h+1}\notin\Delta^{K}_{h}(s_{h},a_{h})$ for some step $h\in[H]$. For any state-action pair $(s,a)$, if the agent transitions from $(s,a)$ to a next state $s^{\prime}$ at step $h$ with probability $\mathbb{P}(s^{\prime}\mid s,a)\geq\widetilde{\mathcal{O}}(\log(S/\delta)/N_{h}^{K}(s,a))$, then the state $s^{\prime}$ is put into $\Delta_{h}^{K}(s,a)$ with probability at least $1-\delta/S$. Hence, we can expect that all such states are put into our estimated next-state set $\Delta_{h}^{K}(s,a)$. Thus, the probability that $s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h})$ is no larger than $\widetilde{\mathcal{O}}(S\log(S/\delta)/N_{h}^{K}(s,a))$ by a union bound over all possible next states $s^{\prime}$. Based on this argument, $\overline{W}_{h}^{k}(s_{h},a_{h})$ is an upper bound for the total probability that $s_{h+1}\notin\Delta^{K}_{h}(s_{h},a_{h})$ for some step $h$, which leads to at most $\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1}))$ additional expected violation. Hence the expected violation of the output policy is upper bounded by $2\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))\leq\varepsilon$. The complete proof is provided in Appendix B.
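Schematically, the two cases above combine into the following bound, which is a restatement of the proof sketch with constants and logarithmic factors suppressed:

$$\mathbb{E}_{\pi}\Bigg[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\Bigg]\leq\underbrace{\mathbb{E}_{\pi}\Big[\text{violation when }s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})\ \forall h\Big]}_{\leq\,\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))}+\underbrace{\mathbb{E}_{\pi}\Big[\text{violation when }\exists h:\ s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h})\Big]}_{\leq\,\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))}\leq 2\,\overline{W}_{1}^{K}(s_{1},\pi_{1}^{K}(s_{1}))\leq\varepsilon.$$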

7 Experiments

In this section, we provide experiments for Safe-RL-SW and Safe-RFE-SW to validate our theoretical results. For Safe-RL-SW, we compare our algorithm SUCBVI with the classical RL algorithm UCBVI (Azar et al., 2017) and three state-of-the-art CMDP algorithms, OptCMDP-bonus (Efroni et al., 2020), Optpess (Liu et al., 2021a) and Triple-Q (Wei et al., 2022), and report the average reward in each episode and the cumulative step-wise violation. For Safe-RFE-SW, we compare our algorithm SRF-UCRL with the state-of-the-art RFE algorithm RF-UCRL (Kaufmann et al., 2021). We do not plot the regret because unconstrained MDP or CMDP algorithms do not guarantee step-wise violation; applying them to the step-wise constrained setting can lead to negative or large regret and large violations, making the results meaningless. The detailed experiment setup is given in Appendix F.

Figure 1: Experimental results for Safe-RL-SW and Safe-RFE-SW. The left two figures show the average rewards and step-wise violations of algorithms SUCBVI, UCBVI (Azar et al., 2017), OptCMDP-bonus (Efroni et al., 2020), Triple-Q (Wei et al., 2022) and Optpess (Liu et al., 2021a). The right two figures show the reward and expected violation of the policies outputted by algorithms SRF-UCRL and RF-UCRL (Kaufmann et al., 2021).

As shown in Figure 1, the rewards of SUCBVI and SRF-UCRL converge to the optimal rewards under safety constraints (denoted by "safe optimal reward"). In contrast, the rewards of UCBVI and RF-UCRL converge to the optimal rewards without safety constraints (denoted by "unconstrained optimal reward"), and those of the CMDP algorithms converge to the reward of a policy with low expected violation (denoted by "constrained optimal reward"). For Safe-RFE-SW, the expected violation of the output policy of SRF-UCRL converges to zero while that of RF-UCRL does not. This corroborates the ability of SRF-UCRL to find policies that simultaneously achieve safety and high rewards.

8 Conclusion

In this paper, we investigate a novel safe reinforcement learning problem with step-wise safety constraints. We first provide an algorithmic framework SUCBVI that achieves both $\widetilde{\mathcal{O}}(\sqrt{H^{3}SAT})$ regret and $\widetilde{\mathcal{O}}(\sqrt{ST})$ step-wise violation, or an $\widetilde{\mathcal{O}}(S/\mathcal{C}_{\mathrm{gap}}+S^{2}AH^{2})$ gap-dependent violation that is independent of $T$. Then, we provide two lower bounds to validate the optimality of SUCBVI in both violation and regret in terms of $S$ and $T$. Further, we extend our framework to safe RFE with step-wise violation constraints and provide an algorithm SRF-UCRL that identifies a near-optimal safe policy for any given reward function $r$ and guarantees $\widetilde{\mathcal{O}}(\sqrt{ST})$ violation during exploration.

References

  • Achiam et al. (2017) Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International conference on machine learning, pp.  22–31. PMLR, 2017.
  • Afsar et al. (2022) M Mehdi Afsar, Trafford Crump, and Behrouz Far. Reinforcement learning based recommender systems: A survey. ACM Computing Surveys, 55(7):1–38, 2022.
  • Alshiekh et al. (2018) Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, and Ufuk Topcu. Safe reinforcement learning via shielding. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.
  • Amani et al. (2021) Sanae Amani, Christos Thrampoulidis, and Lin Yang. Safe reinforcement learning with linear function approximation. In International Conference on Machine Learning, pp. 243–253. PMLR, 2021.
  • Auer et al. (2008) Peter Auer, Thomas Jaksch, and Ronald Ortner. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21, 2008.
  • Azar et al. (2017) Mohammad Gheshlaghi Azar, Ian Osband, and Rémi Munos. Minimax regret bounds for reinforcement learning. In International Conference on Machine Learning, pp. 263–272. PMLR, 2017.
  • Berkenkamp et al. (2017) Felix Berkenkamp, Matteo Turchetta, Angela Schoellig, and Andreas Krause. Safe model-based reinforcement learning with stability guarantees. Advances in neural information processing systems, 30, 2017.
  • Bura et al. (2021) Archana Bura, Aria HasanzadeZonuzy, Dileep Kalathil, Srinivas Shakkottai, and Jean-Francois Chamberland. Safe exploration for constrained reinforcement learning with provable guarantees. arXiv preprint arXiv:2112.00885, 2021.
  • Chow et al. (2018) Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. Advances in neural information processing systems, 31, 2018.
  • Dalal et al. (2018) Gal Dalal, Krishnamurthy Dvijotham, Matej Vecerik, Todd Hester, Cosmin Paduraru, and Yuval Tassa. Safe exploration in continuous action spaces. arXiv preprint arXiv:1801.08757, 2018.
  • Ding et al. (2020) Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33:8378–8390, 2020.
  • Ding et al. (2021) Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pp.  3304–3312. PMLR, 2021.
  • Efroni et al. (2020) Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189, 2020.
  • Hoeffding (1994) Wassily Hoeffding. Probability inequalities for sums of bounded random variables. In The collected works of Wassily Hoeffding, pp.  409–426. Springer, 1994.
  • Huang et al. (2022) Ruiquan Huang, Jing Yang, and Yingbin Liang. Safe exploration incurs nearly no additional sample complexity for reward-free rl. arXiv preprint arXiv:2206.14057, 2022.
  • Jin et al. (2020) Chi Jin, Akshay Krishnamurthy, Max Simchowitz, and Tiancheng Yu. Reward-free exploration for reinforcement learning. In International Conference on Machine Learning, pp. 4870–4879. PMLR, 2020.
  • Kalagarla et al. (2021) Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pp.  8030–8037, 2021.
  • Kaufmann et al. (2021) Emilie Kaufmann, Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Edouard Leurent, and Michal Valko. Adaptive reward-free exploration. In Algorithmic Learning Theory, pp.  865–891. PMLR, 2021.
  • Lanctot et al. (2019) Marc Lanctot, Edward Lockhart, Jean-Baptiste Lespiau, Vinicius Zambaldi, Satyaki Upadhyay, Julien Pérolat, Sriram Srinivasan, Finbarr Timbers, Karl Tuyls, Shayegan Omidshafiei, et al. Openspiel: A framework for reinforcement learning in games. arXiv preprint arXiv:1908.09453, 2019.
  • Le et al. (2019) Hoang Le, Cameron Voloshin, and Yisong Yue. Batch policy learning under constraints. In International Conference on Machine Learning, pp. 3703–3712. PMLR, 2019.
  • Liu et al. (2021a) Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021a.
  • Liu et al. (2021b) Tao Liu, Ruida Zhou, Dileep Kalathil, Panganamala Kumar, and Chao Tian. Learning policies with zero or bounded constraint violation for constrained mdps. Advances in Neural Information Processing Systems, 34:17183–17193, 2021b.
  • Liu et al. (2020) Yongshuai Liu, Jiaxin Ding, and Xin Liu. Ipo: Interior-point policy optimization under constraints. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  4940–4947, 2020.
  • Ménard et al. (2021) Pierre Ménard, Omar Darwiche Domingues, Anders Jonsson, Emilie Kaufmann, Edouard Leurent, and Michal Valko. Fast active learning for pure exploration in reinforcement learning. In International Conference on Machine Learning, pp. 7599–7608. PMLR, 2021.
  • Miryoosefi & Jin (2022) Sobhan Miryoosefi and Chi Jin. A simple reward-free approach to constrained reinforcement learning. In International Conference on Machine Learning, pp. 15666–15698. PMLR, 2022.
  • Osband & Van Roy (2016) Ian Osband and Benjamin Van Roy. On lower bounds for regret in reinforcement learning. arXiv preprint arXiv:1608.02732, 2016.
  • Qiu et al. (2020) Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. Upper confidence primal-dual reinforcement learning for cmdp with adversarial loss. Advances in Neural Information Processing Systems, 33:15277–15287, 2020.
  • Shi et al. (2023) Ming Shi, Yingbin Liang, and Ness Shroff. A near-optimal algorithm for safe reinforcement learning under instantaneous hard constraints. arXiv preprint arXiv:2302.04375, 2023.
  • Simão et al. (2021) Thiago D Simão, Nils Jansen, and Matthijs TJ Spaan. Alwayssafe: Reinforcement learning without safety constraint violations during training. 2021.
  • Simchowitz & Jamieson (2019) Max Simchowitz and Kevin G Jamieson. Non-asymptotic gap-dependent regret bounds for tabular mdps. Advances in Neural Information Processing Systems, 32, 2019.
  • Singh et al. (2020) Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • Sootla et al. (2022) Aivar Sootla, Alexander Cowen-Rivers, Jun Wang, and Haitham Bou Ammar. Enhancing safe exploration using safety state augmentation. Advances in Neural Information Processing Systems, 35:34464–34477, 2022.
  • Stooke et al. (2020) Adam Stooke, Joshua Achiam, and Pieter Abbeel. Responsive safety in reinforcement learning by pid lagrangian methods. In International Conference on Machine Learning, pp. 9133–9143. PMLR, 2020.
  • Sutton & Barto (2018) Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • Tessler et al. (2018) Chen Tessler, Daniel J Mankowitz, and Shie Mannor. Reward constrained policy optimization. arXiv preprint arXiv:1805.11074, 2018.
  • Thomas et al. (2021) Garrett Thomas, Yuping Luo, and Tengyu Ma. Safe reinforcement learning by imagining the near future. Advances in Neural Information Processing Systems, 34:13859–13869, 2021.
  • Turchetta et al. (2020) Matteo Turchetta, Andrey Kolobov, Shital Shah, Andreas Krause, and Alekh Agarwal. Safe reinforcement learning via curriculum induction. Advances in Neural Information Processing Systems, 33:12151–12162, 2020.
  • Uchibe & Doya (2007) Eiji Uchibe and Kenji Doya. Constrained reinforcement learning from intrinsic and extrinsic rewards. In 2007 IEEE 6th International Conference on Development and Learning, pp.  163–168. IEEE, 2007.
  • Wachi & Sui (2020) Akifumi Wachi and Yanan Sui. Safe reinforcement learning in constrained markov decision processes. In International Conference on Machine Learning, pp. 9797–9806. PMLR, 2020.
  • Wang et al. (2022) Yixuan Wang, Simon Sinong Zhan, Ruochen Jiao, Zhilu Wang, Wanxin Jin, Zhuoran Yang, Zhaoran Wang, Chao Huang, and Qi Zhu. Enforcing hard constraints with soft barriers: Safe reinforcement learning in unknown stochastic environments. arXiv preprint arXiv:2209.15090, 2022.
  • Wei et al. (2022) Honghao Wei, Xin Liu, and Lei Ying. Triple-q: A model-free algorithm for constrained reinforcement learning with sublinear regret and zero constraint violation. In International Conference on Artificial Intelligence and Statistics, pp.  3274–3307. PMLR, 2022.
  • Yu et al. (2022) Dongjie Yu, Haitong Ma, Shengbo Li, and Jianyu Chen. Reachability constrained reinforcement learning. In International Conference on Machine Learning, pp. 25636–25655. PMLR, 2022.
  • Yu et al. (2019) Ming Yu, Zhuoran Yang, Mladen Kolar, and Zhaoran Wang. Convergent policy optimization for safe reinforcement learning. Advances in Neural Information Processing Systems, 32, 2019.
  • Yu et al. (2021) Tiancheng Yu, Yi Tian, Jingzhao Zhang, and Suvrit Sra. Provably efficient algorithms for multi-objective competitive rl. In International Conference on Machine Learning, pp. 12167–12176. PMLR, 2021.
  • Zhao et al. (2020) Wenshuai Zhao, Jorge Peña Queralta, and Tomi Westerlund. Sim-to-real transfer in deep reinforcement learning for robotics: a survey. In 2020 IEEE Symposium Series on Computational Intelligence (SSCI), pp.  737–744. IEEE, 2020.

Appendix


Appendix A Related Works

Safe RL beyond CMDP-based Approaches

Apart from the CMDP-based approaches (Yu et al., 2019; Qiu et al., 2020; Efroni et al., 2020; Ding et al., 2021; Kalagarla et al., 2021; Simão et al., 2021; Liu et al., 2021b; Bura et al., 2021; Wei et al., 2022), there are several other ways to achieve safety in reinforcement learning. Control-based approaches (Berkenkamp et al., 2017; Chow et al., 2018; Liu et al., 2020; Wang et al., 2022) adopt barrier functions to keep the agent away from unsafe regions and impose hard constraints. Policy-optimization style approaches are also commonly studied to solve constrained RL problems (Uchibe & Doya, 2007; Achiam et al., 2017; Tessler et al., 2018; Liu et al., 2020; Stooke et al., 2020).

There are some other papers investigating the safety of RL problems. Alshiekh et al. (2018) represent safe states via a reactive system, and use shielding to compute and completely restrict the agent to a safe trajectory. The main difference between their work and ours is that we need to dynamically update the estimated safe states, while they require knowledge of the system mechanism and states to compute the shield. Dalal et al. (2018) restrict actions by projecting each action onto the closest safe action, which they achieve by solving a convex optimization problem over a continuous action set. However, they do not consider the situation where a state has no safe actions, i.e., where the convex optimization problem has no solution. Le et al. (2019) consider the decision-making problem with a pre-collected dataset. Turchetta et al. (2020) and Sootla et al. (2022) both consider cumulative cost constraints rather than step-wise constraints; the former uses a teacher that intervenes to keep the agent away from the unsafe region, while the latter encourages safe exploration by augmenting the state with a safety variable that measures safety during training.

RFE with Safety

Motivated by sparse reward signals in realistic applications, reward-free exploration (RFE) was proposed in Jin et al. (2020) and further developed in Kaufmann et al. (2021); Ménard et al. (2021). In RFE, the agent explores the environment without reward signals. After sufficient exploration, the reward function is given, and the agent needs to plan a near-optimal policy based on the knowledge collected during exploration. Safety is also important in RFE: we need to not only guarantee that the output policy is safe, but also ensure small violations during exploration.

Recently, Huang et al. (2022) studied RFE with safety constraints. Our work differs from theirs in the following aspects: (i) They allow different reward functions during and after exploration, and their agent can directly observe the true costs during training. In our work, the cost function is the same during and after exploration, but the agent can only observe the noisy cost of a state when she arrives at that state. (ii) They require prior knowledge of safe baseline policies, while we do not need such an assumption. (iii) They consider zero expected violation during exploration, while we focus on keeping small step-wise violations.

Appendix B Proofs of Main Results

B.1 Proof of Theorem 4.2

We start the proof with the confidence bound of optimistic estimation:

Lemma B.1.

With probability at least 1-\delta, the optimistic empirical estimate of the cost function, \overline{c}^{k}(s)=\hat{c}^{k}(s)-\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}, is no greater than the true cost c(s) for every episode k\in[K] and state s\in\mathcal{S}. In other words, defining the event

1={k[K]ands𝒮,c¯k(s)c(s)},\displaystyle\mathcal{E}_{1}=\left\{\forall\ k\in[K]\ \mbox{and}\ s\in\mathcal{S},\overline{c}^{k}(s)\leq c(s)\right\}, (6)

then Pr(1c)δ.\Pr(\mathcal{E}_{1}^{c})\leq\delta.

Proof.

Since we observe c(s)+εc(s)+\varepsilon with 1-subgaussian noise ε\varepsilon, by Hoeffding’s inequality Hoeffding (1994), for a fixed episode kk and state ss, the statement

|c^k(s)c(s)|2Nk(s)log(SKδ)\displaystyle|\hat{c}^{k}(s)-c(s)|\leq\sqrt{\frac{2}{N^{k}(s)}\log\left(\frac{SK}{\delta}\right)}

holds with probability at least 1δSK1-\frac{\delta}{SK}. Taking a union bound on all s𝒮s\in\mathcal{S} and k[K]k\in[K], we have

c(s)c^k(s)2Nk(s)log(SKδ)=c¯k(s),c(s)\geq\hat{c}^{k}(s)-\sqrt{\frac{2}{N^{k}(s)}\log\left(\frac{SK}{\delta}\right)}=\overline{c}^{k}(s),

which proves Lemma B.1. ∎
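For concreteness, the optimistic cost estimate and the induced unsafe-state test \overline{c}^{k}(s)>\tau can be computed as in the following minimal Python sketch (the array names cost_sum and visit_count are illustrative and not part of the algorithm's specification):

```python
import numpy as np

def optimistic_cost(cost_sum, visit_count, S, K, delta):
    """Optimistic cost estimate c_bar^k(s) = c_hat^k(s) - sqrt(2 log(SK/delta) / N^k(s))."""
    N = np.maximum(visit_count, 1)              # guard against unvisited states
    c_hat = cost_sum / N                        # empirical mean of the noisy cost observations
    bonus = np.sqrt(2.0 * np.log(S * K / delta) / N)
    return c_hat - bonus

def estimated_unsafe_states(cost_sum, visit_count, S, K, delta, tau):
    """U_H^k = { s : c_bar^k(s) > tau }; under the event E_1 this is a subset of the true unsafe set."""
    c_bar = optimistic_cost(cost_sum, visit_count, S, K, delta)
    return set(np.flatnonzero(c_bar > tau))
```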

Lemma B.2.

Define the events \mathcal{E}_{2} and \mathcal{E}_{3} as

2\displaystyle\mathcal{E}_{2} ={s,a,s,k,h,|^h(ss,a)h(ss,a)|\displaystyle=\left\{\forall\ s,a,s^{\prime},k,h,|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a)|\right.
2h(ss,a)Nhk(s,a)log(2S2AHKδ)+23Nhk(s,a)log(2S2AHKδ)}\displaystyle\left.\hskip 45.00008pt\leq\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s,a)}{N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}AHK}{\delta}\right)}+\frac{2}{3N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}AHK}{\delta}\right)\right\}
3\displaystyle\mathcal{E}_{3} ={k=1Kh=1H(Ph(Vh+1kVh+1πk)(shk,ahk)(Vh+1kVh+1πk)(sh+1k))H3K2log(1/δ)}\displaystyle=\left\{\sum_{k=1}^{K}\sum_{h=1}^{H}\left(P_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})-(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k})\right)\leq\sqrt{\frac{H^{3}K}{2}\log(1/\delta)}\right\}

Then Pr(2c)δ\Pr(\mathcal{E}_{2}^{c})\leq\delta, Pr(3c)δ\Pr(\mathcal{E}_{3}^{c})\leq\delta.

Proof.

This lemma is standard in the analysis of the UCBVI algorithm: \Pr(\mathcal{E}_{2}^{c})\leq\delta follows from Bernstein's inequality, and \Pr(\mathcal{E}_{3}^{c})\leq\delta follows from the Azuma–Hoeffding inequality. ∎

The following lemma shows the optimistic property we discussed in the paper.

Lemma B.3.

Under the event \mathcal{E}_{1}, we have \mathcal{U}_{h}^{k}\subseteq\mathcal{U}_{h} and A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s) for all s\notin\mathcal{U}_{h}^{k} and h\in[H].

Proof.

Under the event \mathcal{E}_{1}, for all s\in\mathcal{U}_{H}^{k} we have c(s)\geq\overline{c}^{k}(s)>\tau, so s\in\mathcal{U}_{H}. Moreover, from the definition of \Delta_{h}^{K}(s,a), we have \Delta^{K}_{h}(s,a)\subseteq\Delta_{h}(s,a) for every step h\in[H], because we only add observed next states to the estimated set \Delta_{h}^{K}(s,a) one by one.

Now we prove the lemma by induction. When h=H, \mathcal{U}_{H}^{k}\subseteq\mathcal{U}_{H}. Assume that \mathcal{U}_{h+1}^{k}\subseteq\mathcal{U}_{h+1}; then

A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\Delta_{h}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}
\subseteq\{a\in\mathcal{A}\mid\Delta_{h}^{K}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}
\subseteq\{a\in\mathcal{A}\mid\Delta_{h}^{K}(s,a)\cap\mathcal{U}_{h+1}^{k}=\emptyset\}
=A_{h}^{k,safe}(s),

and

\mathcal{U}_{h}^{k}=\mathcal{U}_{h+1}^{k}\cup\{s\mid\forall\ a\in\mathcal{A},\ \Delta_{h}^{K}(s,a)\cap\mathcal{U}_{h+1}^{k}\neq\emptyset\}
\subseteq\mathcal{U}_{h+1}\cup\{s\mid\forall\ a\in\mathcal{A},\ \Delta_{h}(s,a)\cap\mathcal{U}_{h+1}\neq\emptyset\}
\subseteq\mathcal{U}_{h}.

Hence we complete the proof by induction. ∎
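The backward construction underlying Lemma B.3 can be summarized by the following Python sketch, which computes the estimated unsafe sets \mathcal{U}_{h}^{k} and safe action sets A_{h}^{k,safe}(s) from the estimated reachable sets (the data layout is an assumption made for illustration):

```python
def backward_safe_sets(unsafe_H, delta_hat, S, A, H):
    """Backward recursion: a state joins U_h^k if every action can reach U_{h+1}^k.
    delta_hat[h][(s, a)] holds the next states observed so far, a subset of the true Delta_h(s, a)."""
    U = {H: set(unsafe_H)}                       # level-H estimated unsafe states {s : c_bar^k(s) > tau}
    safe_actions = {}
    for h in range(H - 1, 0, -1):
        U[h] = set(U[h + 1])
        for s in range(S):
            safe = [a for a in range(A)
                    if not (delta_hat[h].get((s, a), set()) & U[h + 1])]
            safe_actions[(h, s)] = safe
            if not safe:                         # no action avoids U_{h+1}^k, so s is estimated unsafe
                U[h].add(s)
    return U, safe_actions
```

Because delta_hat under-approximates the true reachable sets and U[H] under-approximates the true unsafe states, the returned sets are optimistic in the sense of Lemma B.3.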

Similarly to the UCBVI algorithm, we prove that Q_{h}^{k}(s,a) and V_{h}^{k}(s) are always no smaller than Q_{h}^{*}(s,a) and V_{h}^{*}(s).

Lemma B.4.

With probability at least 1δ1-\delta, choosing b(n,δ)=7Hln(5SAT/δ)nb(n,\delta)=7H\sqrt{\frac{\ln(5SAT/\delta)}{n}}, we will have

Qhk(s,a)Qh(s,a),Vhk(s)Vh(s),\displaystyle Q_{h}^{k}(s,a)\geq Q_{h}^{*}(s,a),\ \ V_{h}^{k}(s)\geq V_{h}^{*}(s),

for any state s\in\mathcal{S}, action a\in\mathcal{A}, episode k\in[K] and step h\in[H].

Proof.

We prove this lemma by induction. First, consider h=H+1: then Q_{H+1}^{*}(s,a)=Q_{H+1}^{k}(s,a)=0 and V_{H+1}^{*}(s)=V_{H+1}^{k}(s)=0 for all k\in[K], so the statement holds for h=H+1. Now fix an episode k and assume the statement holds for step h+1. If Q_{h}^{k}(s,a)=H, the claim is immediate since Q_{h}^{*}(s,a)\leq H. Otherwise Q_{h}^{k}(s,a)<H, and

Q_{h}^{k}(s,a)-Q_{h}^{*}(s,a)=\hat{\mathbb{P}}_{h}^{k}(V_{h+1}^{k}-V_{h+1}^{*})(s,a)+(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a)+b(N_{h}^{k}(s,a),\delta)
\geq(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a)+b(N_{h}^{k}(s,a),\delta)
\geq 0,

where the first inequality follows from the induction hypothesis V_{h+1}^{k}\geq V_{h+1}^{*}, and the second inequality follows from the Chernoff–Hoeffding inequality, which gives (\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a)\geq-b(N_{h}^{k}(s,a),\delta). Then for s\in\mathcal{U}_{h}^{k},

V_{h}^{k}(s)=\max_{a\in\mathcal{A}}Q_{h}^{k}(s,a)\geq\max_{a\in\mathcal{A}}Q_{h}^{*}(s,a)\geq V_{h}^{*}(s).

For s\notin\mathcal{U}_{h}^{k}, Lemma B.3 gives A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s). Then

Vhk(s)=maxaAhk,safe(s)Qhk(s,a)\displaystyle V_{h}^{k}(s)=\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{k}(s,a) maxaAhk,safe(s)Qh(s,a)\displaystyle\geq\max_{a\in A_{h}^{k,safe}(s)}Q_{h}^{*}(s,a)
maxaAhsafe(s)Qh(s,a)\displaystyle\geq\max_{a\in A_{h}^{safe}(s)}Q_{h}^{*}(s,a) (7)
=Vh(s).\displaystyle=V_{h}^{*}(s).

where Eq. (7) holds because A_{h}^{safe}(s)\subseteq A_{h}^{k,safe}(s), i.e., by our optimistic exploration mechanism (Lemma B.3). ∎

Regret

Now we can start to prove our main theorem.

Proof.

Given the preceding lemmas, the proof follows the standard analysis of the UCBVI algorithm. Note that the regret can be bounded via the optimistic value functions:

R(K)=k=1K(V1(s1)V1πk(s1))k=1K(V1k(s1)V1πk(s1)).\displaystyle R(K)=\sum_{k=1}^{K}(V_{1}^{*}(s_{1})-V_{1}^{\pi^{k}}(s_{1}))\leq\sum_{k=1}^{K}(V_{1}^{k}(s_{1})-V_{1}^{\pi^{k}}(s_{1})).

Because the action taken by \pi^{k} at state s_{h}^{k} coincides with the action a_{h}^{k} taken at episode k, we have

Vhk(shk)Vhπk(shk)=(QhkQhπk)(shk,ahk),\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})=(Q_{h}^{k}-Q_{h}^{\pi^{k}})(s_{h}^{k},a_{h}^{k}),

where ahk=πhk(shk)a_{h}^{k}=\pi_{h}^{k}(s_{h}^{k}). Then

V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})=(Q_{h}^{k}-Q_{h}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})
\leq(\hat{\mathbb{P}}_{h}^{k}(V_{h+1}^{k})-\mathbb{P}_{h}(V_{h+1}^{\pi^{k}}))(s_{h}^{k},a_{h}^{k})+b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta)
=(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})+(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})
\ \ \ \ \ \ \ +\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})+b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta). (8)

First, by the Chernoff–Hoeffding bound, (\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})\leq b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta). Second, by the standard analysis of UCBVI (Theorem 1 therein), under the event \mathcal{E}_{2} we have

(^hkh)(Vh+1kVh+1)(shk,ahk)\displaystyle\ \ \ \ \ (\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})
s𝒮(2h(sshk,ahk)ιNhk(shk,ahk))+2ι3Nhk(shk,ahk))(Vh+1kVh+1)(s)\displaystyle\leq\sum_{s^{\prime}\in\mathcal{S}}\left(\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s_{h}^{k},a_{h}^{k})\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k}))}}+\frac{2\iota}{3N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
s𝒮(h(sshk,ahk)H+Hι2Nhk(shk,ahk)+2ι3Nhk(shk,ahk))(Vh+1kVh+1)(s)\displaystyle\leq\sum_{s^{\prime}\in\mathcal{S}}\left(\frac{\mathbb{P}_{h}(s^{\prime}\mid s_{h}^{k},a_{h}^{k})}{H}+\frac{H\iota}{2N_{h}^{k}(s_{h}^{k},a_{h}^{k})}+\frac{2\iota}{3N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
1Hh(Vh+1kVh+1)(shk,ahk)+2H2SιNhk(shk,ahk)\displaystyle\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}
1Hh(Vh+1kVh+1πk)(shk,ahk)+2H2SιNhk(shk,ahk),\displaystyle\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})},

where the first inequality holds under the event \mathcal{E}_{2}, the second inequality uses \sqrt{ab}\leq\frac{a+b}{2}, and the last inequality holds because V_{h+1}^{\pi^{k}}\leq V_{h+1}^{*}. Combining with Eq. (8), we get

Vhk(shk)Vhπk(shk)(1+1H)h(Vh+1kVh+1πk)(shk,ahk)+2b(Nhk(shk,ahk),δ)+2H2SιNhk(shk,ahk).\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k})\leq\left(1+\frac{1}{H}\right)\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})+2b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}.

Denote

\alpha_{h}^{k}=\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h}^{k},a_{h}^{k})-(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k}).
βhk\displaystyle\beta_{h}^{k} =H2SιNhk(shk,ahk).\displaystyle=\frac{H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}.
γhk\displaystyle\gamma_{h}^{k} =b(Nhk(shk,ahk),δ).\displaystyle=b(N_{h}^{k}(s_{h}^{k},a_{h}^{k}),\delta).

we have

Vhk(shk)Vhπk(shk)\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\pi^{k}}(s_{h}^{k}) (1+1H)(Vh+1kVh+1πk)(sh+1k)+(1+1H)αhk+2βhk+2γhk\displaystyle\leq\left(1+\frac{1}{H}\right)(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k})+\left(1+\frac{1}{H}\right)\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k}
(1+1H)(Vh+1kVh+1πk)(sh+1k)+2αhk+2βhk+2γhk,\displaystyle\leq\left(1+\frac{1}{H}\right)(V_{h+1}^{k}-V_{h+1}^{\pi^{k}})(s_{h+1}^{k})+2\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k},

then by recursion,

V1k(s1)V1πk(s1)\displaystyle V_{1}^{k}(s_{1})-V_{1}^{\pi^{k}}(s_{1}) h=1H(1+1H)H(2αhk+2βhk+2γhk)\displaystyle\leq\sum_{h=1}^{H}\left(1+\frac{1}{H}\right)^{H}(2\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k})
eh=1H(2αhk+2βhk+2γhk).\displaystyle\leq e\cdot\sum_{h=1}^{H}(2\alpha_{h}^{k}+2\beta_{h}^{k}+2\gamma_{h}^{k}).

Under the event \mathcal{E}_{3} (which holds by the Azuma–Hoeffding inequality), \sum_{k=1}^{K}\sum_{h=1}^{H}\alpha_{h}^{k}\leq\sqrt{\frac{H^{3}K}{2}\log(1/\delta)}, and

k=1Kh=1Hβhk=k=1Kh=1HH2SιNhk(shk,ahk)H3S2Aιlog(HK).\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\beta_{h}^{k}=\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\leq H^{3}S^{2}A\iota\log(HK). (9)

Also, for γhk\gamma_{h}^{k}, we can get

k=1Kh=1Hγhk\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\gamma_{h}^{k} =k=1Kh=1H7HιNhk(shk,ahk)\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H}7H\sqrt{\frac{\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}}
\leq 7H\sqrt{\iota}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\sum_{i=1}^{N^{K}(s,a)}\sqrt{1/i}
\leq 14H\sqrt{\iota}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}}\sqrt{N^{K}(s,a)}
\leq 14H\sqrt{\iota SAT}.

Then the regret is bounded by \widetilde{\mathcal{O}}(\sqrt{H^{2}SAT}).
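As a quick numerical sanity check of the two elementary facts used above, namely (1+1/H)^{H}\leq e and \sum_{i=1}^{N}i^{-1/2}\leq 2\sqrt{N}, one may run the following snippet (the parameter values are arbitrary):

```python
import math

for H in (2, 5, 10, 100):
    assert (1 + 1 / H) ** H <= math.e            # factor from unrolling the recursion over h

for N in (1, 10, 1000, 100000):
    partial = sum(1 / math.sqrt(i) for i in range(1, N + 1))
    assert partial <= 2 * math.sqrt(N)           # used when summing 1/sqrt(N) over visit counts
print("checks passed")
```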

Violation

Now we turn to consider the violation during the process.

First, suppose that the agent follows \pi^{k} at episode k and that s_{h+1}^{k}\in\Delta_{h}^{K}(s_{h}^{k},a_{h}^{k}) for all steps h\in[H-1]. Then, from the definition of \{\mathcal{U}_{h}^{k}\}_{1\leq h\leq H+1} and \Delta_{h}^{K}(s_{h}^{k},a_{h}^{k}), all visited states satisfy s_{1}^{k},s_{2}^{k},\cdots,s_{H}^{k}\notin\mathcal{U}_{H}^{k}. Since \mathcal{U}_{H}^{k}=\{s\mid\overline{c}^{k}(s)>\tau\}, for each state s\in\{s_{1}^{k},s_{2}^{k},\cdots,s_{H}^{k}\} we have

c(s)τc¯k(s)+22Nk(s)log(SK/δ)τ22Nk(s)log(SK/δ),c(s)-\tau\leq\overline{c}^{k}(s)+2\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}-\tau\leq 2\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)},

where the first inequality follows from the two-sided Hoeffding bound used in the proof of Lemma B.1, i.e., c(s)\leq\hat{c}^{k}(s)+\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}=\overline{c}^{k}(s)+2\sqrt{\frac{2}{N^{k}(s)}\log(SK/\delta)}, and the second inequality holds because s\notin\mathcal{U}_{H}^{k} implies \overline{c}^{k}(s)\leq\tau. Then

h=1H(c(shk)τ)+h=1H22Nk(shk)log(SK/δ).\displaystyle\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}\leq\sum_{h=1}^{H}2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}.

Next, consider the situation where s_{h+1}^{k}\notin\Delta_{h}^{K}(s_{h}^{k},a_{h}^{k}) for some step h\in[H] at episode k\in[K]. In this case we can only bound the violation of this episode by H, because the subsequent states are no longer under control. Fortunately, in this situation at least one new state s_{h+1}^{k} is added to \Delta_{h}^{k}(s_{h}^{k},a_{h}^{k}), so |\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k})|\geq|\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k})|+1, and the summation satisfies

(h,s,a)[H]×𝒮×𝒜|Δhk+1(s,a)|(h,s,a)[H]×𝒮×𝒜|Δhk(s,a)|+1.\sum_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}|\Delta_{h}^{k+1}(s,a)|\geq\sum_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}|\Delta_{h}^{k}(s,a)|+1.

Note that for any episode kk, the summation has a trivial upper bound

(h,s,a)[H]×𝒮×𝒜|Δhk(s,a)|S2AH,\sum_{(h,s,a)\in[H]\times\mathcal{S}\times\mathcal{A}}|\Delta_{h}^{k}(s,a)|\leq S^{2}AH,

so this situation occurs in at most S^{2}AH episodes and contributes at most S^{2}AH^{2} extra violation in total.
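The counting argument in the preceding paragraph can be viewed as a potential-function argument: every episode containing a "surprise" transition strictly enlarges one of the HSA estimated reachable sets, each of which contains at most S states. The following hypothetical Python sketch makes the bookkeeping explicit (the trajectory format is assumed for illustration only):

```python
def count_surprise_episodes(trajectories, H, S, A):
    """Each episode with an observed next state missing from Delta_h^k(s, a) adds at least
    one new element; the total size of all sets is at most S^2 * A * H, which caps the count."""
    delta_hat = {}                       # (h, s, a) -> set of observed next states
    surprise_episodes = 0
    for episode in trajectories:         # episode: list of tuples (h, s, a, s_next)
        surprised = False
        for (h, s, a, s_next) in episode:
            reachable = delta_hat.setdefault((h, s, a), set())
            if s_next not in reachable:
                surprised = True
                reachable.add(s_next)
        surprise_episodes += surprised
    assert surprise_episodes <= S * S * A * H
    return surprise_episodes
```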

Hence the total violation can be upper bounded by

k=1Kh=1H(c(shk)τ)+\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+} S2AH2+k=1Kh=1H(22Nk(shk)log(SK/δ)1)\displaystyle\leq S^{2}AH^{2}+\sum_{k=1}^{K}\sum_{h=1}^{H}\left(2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}\wedge 1\right)
S2AH2+22log(SK/δ)k=1Kh=1H(1Nk(shk)1)\displaystyle\leq S^{2}AH^{2}+2\sqrt{2\log(SK/\delta)}\sum_{k=1}^{K}\sum_{h=1}^{H}\left(\sqrt{\frac{1}{N^{k}(s_{h}^{k})\vee 1}}\right)
S2AH2+22log(SK/δ)s𝒮i=1K(Ni+1(s)Ni(s)Ni(s)1)\displaystyle\leq S^{2}AH^{2}+2\sqrt{2\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sum_{i=1}^{K}\left(\frac{N^{i+1}(s)-N^{i}(s)}{\sqrt{N^{i}(s)\vee 1}}\right) (10)
S2AH2+28log(SK/δ)s𝒮NK(s)\displaystyle\leq S^{2}AH^{2}+2\sqrt{8\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sqrt{N^{K}(s)}
S2AH2+28log(SK/δ)ST\displaystyle\leq S^{2}AH^{2}+2\sqrt{8\log(SK/\delta)}\sqrt{ST}
=𝒪~(S2AH2+ST)\displaystyle=\widetilde{\mathcal{O}}(S^{2}AH^{2}+\sqrt{ST})

with probability at least 1-3\delta. The inequality following Eq. (10) is derived from Lemma 19 in Auer et al. (2008). Replacing \delta with \delta/3 completes the proof of the violation bound.

Moreover, consider the gap-dependent bound. Recall that d=\min_{s\in\mathcal{U}}(c(s)-\tau)>0, where \mathcal{U}=\{s\mid c(s)>\tau\} denotes the set of unsafe states. For the state s_{h}^{k}, if s_{h}^{k}\notin\mathcal{U}_{H}^{k}, we have

c(shk)τc¯k(shk)+22Nk(shk)log(SK/δ)τ22Nk(shk)log(SK/δ).\displaystyle c(s_{h}^{k})-\tau\leq\overline{c}^{k}(s_{h}^{k})+2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}-\tau\leq 2\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}.

Thus, whenever c(s_{h}^{k})>\tau, we get

N^{k}(s_{h}^{k})\leq\frac{8}{(c(s_{h}^{k})-\tau)^{2}}\log(SK/\delta).

Now let \mathcal{K} denote the set of episodes k such that s_{h+1}^{k}\in\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k}) for all h\in[H-1]. For each s\in\mathcal{U}, let k_{s}=\max\{k:\exists\ h\in[H],s=s_{h}^{k}\} be the last episode in which s is visited. Then the total violation can be bounded by

\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}\leq S^{2}AH^{2}+\sum_{k\in\mathcal{K}}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}
\leq S^{2}AH^{2}+\sum_{s\in\mathcal{U}}N^{k_{s}}(s)(c(s)-\tau)
\leq S^{2}AH^{2}+\sum_{s\in\mathcal{U}}\frac{8}{(c(s)-\tau)^{2}}(c(s)-\tau)\log(SK/\delta)
\leq S^{2}AH^{2}+\frac{8S}{d}\log(SK/\delta).

B.2 Proof of Theorem 6.2

First, applying the same argument as in Appendix B.1, the violation during the exploration phase can be bounded by

k=1Kh=1H(c(shk)τ)+\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+} S2AH2+k=1Kh=1H2Nk(shk)log(SK/δ)\displaystyle\leq S^{2}AH^{2}+\sum_{k=1}^{K}\sum_{h=1}^{H}\sqrt{\frac{2}{N^{k}(s_{h}^{k})}\log(SK/\delta)}
S2AH2+2log(SK/δ)s𝒮i=1NK(s)1i\displaystyle\leq S^{2}AH^{2}+\sqrt{2\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sum_{i=1}^{N^{K}(s)}\sqrt{\frac{1}{i}}
S2AH2+8log(SK/δ)s𝒮NK(s)\displaystyle\leq S^{2}AH^{2}+\sqrt{8\log(SK/\delta)}\sum_{s\in\mathcal{S}}\sqrt{N^{K}(s)}
=𝒪~(S2AH2+ST).\displaystyle=\widetilde{\mathcal{O}}(S^{2}AH^{2}+\sqrt{ST}).

Now we prove that Algorithm 2 returns an \varepsilon-safe and \varepsilon-optimal policy. Lemma B.5 shows that \overline{W}_{1}^{k}(s_{1},\pi_{1}^{k+1}(s_{1})) bounds the estimation error for any reward function r.

Lemma B.5.

For any feasible policy \pi\in\Pi^{k}, step h, episode k and reward function r, define the estimation error as \hat{e}^{k,\pi}_{h}(s,a;r)=|\hat{Q}_{h}^{k,\pi}(s,a;r)-Q_{h}^{k,\pi}(s,a;r)|. Then, with probability at least 1-\delta, we have

e^hk,π(s,a;r)W¯hk(s,a).\hat{e}^{k,\pi}_{h}(s,a;r)\leq\overline{W}_{h}^{k}(s,a).

We first prove Lemma B.5. The following lemma provides the high-probability concentration event that we need.

Lemma B.6 (Kaufmann et al. (2021)).

Define the event as

\mathcal{E}_{4}=\left\{\forall\ k,h,s,a:\ \mathrm{KL}\big(\hat{\mathbb{P}}_{h}^{k}(\cdot\mid s,a),\mathbb{P}_{h}(\cdot\mid s,a)\big)\leq\frac{\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\right\},

where \gamma(n,\delta)=2\log(SAHK/\delta)+(S-1)\log(e(1+n/(S-1))). Then \Pr(\mathcal{E}_{4}^{c})\leq\delta.
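For concreteness, the threshold function \gamma(n,\delta) and the resulting KL confidence radius in \mathcal{E}_{4} can be evaluated as follows (a direct transcription of the formula above; the sketch assumes S\geq 2):

```python
import math

def gamma_threshold(n, S, A, H, K, delta):
    """gamma(n, delta) = 2 log(SAHK / delta) + (S - 1) log(e (1 + n / (S - 1)))."""
    return 2 * math.log(S * A * H * K / delta) \
        + (S - 1) * math.log(math.e * (1 + n / (S - 1)))

def kl_radius(n, S, A, H, K, delta):
    """KL confidence radius gamma(N_h^k(s,a), delta) / N_h^k(s,a) used in the event E_4."""
    return gamma_threshold(n, S, A, H, K, delta) / max(n, 1)
```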

By Bellman equations,

Q^hπ(s,a;r)Qhπ(s,a;r)=s(^h(ss,a)h(ss,a))Qh+1π(s,π(s);r)\displaystyle\hat{Q}_{h}^{\pi}(s,a;r)-Q_{h}^{\pi}(s,a;r)=\sum_{s^{\prime}}(\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a))Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)
+s^h(ss,a)(Q^h+1π(s,π(s);r)Qh+1π(s,π(s);r)).\displaystyle\ \ \ +\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)(\hat{Q}_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)-Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)).

By Pinsker's inequality and the event \mathcal{E}_{4},

e^hπ(s,a;r)\displaystyle\hat{e}_{h}^{\pi}(s,a;r)\leq s|^h(ss,a)h(ss,a)|Qh+1π(s,π(s);r)\displaystyle\sum_{s^{\prime}}|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a)|Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)
+s^h(ss,a)|Q^h+1π(s,π(s);r)Qh+1π(s,π(s);r)|\displaystyle\ \ \ +\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)|\hat{Q}_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)-Q_{h+1}^{\pi}(s^{\prime},\pi(s^{\prime});r)|
\leq H\|\hat{\mathbb{P}}_{h}(\cdot\mid s,a)-\mathbb{P}_{h}(\cdot\mid s,a)\|_{1}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r)
\leq H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r).

We proceed by induction. If \hat{e}_{h+1}^{\pi}(s,a;r)\leq\overline{W}_{h+1}(s,a) for all (s,a), then for s\notin\mathcal{U}_{h}^{k} and a\in A_{h}^{k,safe}(s) we have

\hat{e}_{h}^{\pi}(s,a;r)\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r)\right\}
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAh+1k,safe(s)e^h+1π(s,b;r)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A_{h+1}^{k,safe}(s^{\prime})}\hat{e}_{h+1}^{\pi}(s^{\prime},b;r)\right\} (11)
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAh+1k,safe(s)W¯h+1(s,b)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A_{h+1}^{k,safe}(s^{\prime})}\overline{W}_{h+1}(s^{\prime},b)\right\}
W¯h(s,a).\displaystyle\leq\overline{W}_{h}(s,a).

The second inequality holds because \hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)=0 for all s^{\prime}\in\mathcal{U}_{h+1}^{k}, and for s^{\prime}\notin\mathcal{U}_{h+1}^{k} the feasible policy satisfies \pi_{h+1}(s^{\prime})\in A_{h+1}^{k,safe}(s^{\prime}). The third inequality holds by the induction hypothesis, and the last inequality holds by the definition of \overline{W}_{h}(s,a).

For any other state-action pair (s,a), we have

\hat{e}_{h}^{\pi}(s,a;r)\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\hat{e}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime});r)\right\}
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAe^h+1π(s,b;r)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A}\hat{e}_{h+1}^{\pi}(s^{\prime},b;r)\right\}
min{H,Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)maxbAW¯h+1(s,b)}\displaystyle\leq\min\left\{H,H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\max_{b\in A}\overline{W}_{h+1}(s^{\prime},b)\right\}
W¯h(s,a).\displaystyle\leq\overline{W}_{h}(s,a).

This proves Lemma B.5. Now assume the algorithm terminates at episode K+1. For any feasible policy \pi\in\Pi^{K}, writing \hat{V} for the value function under the estimated model \hat{M}, we get

|V1π(s1)V^1π(s1)|\displaystyle|V_{1}^{\pi}(s_{1})-\hat{V}_{1}^{\pi}(s_{1})| =|Q1π(s1,π1(s1))Q^1π(s1,π1(s1))|\displaystyle=|Q_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))-\hat{Q}_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))|
\leq\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1}))
\leq\overline{W}_{1}^{K}(s_{1},\pi^{K+1}_{1}(s_{1}))
ε/2.\displaystyle\leq\varepsilon/2.

where the second inequality holds because \pi^{K+1} is the greedy policy with respect to \overline{W}_{1}^{K} and \pi lies in the feasible policy set \Pi^{K} at episode K. Then, for any reward function r, let \hat{\pi}^{*}\in\Pi^{K} denote the optimal feasible policy in the estimated MDP \hat{M}. Recalling that \pi^{*}\in\Pi^{K}\subseteq\Pi^{*}, we have

V1π(s1)V1π^(s1)\displaystyle V_{1}^{\pi^{*}}(s_{1})-V_{1}^{\hat{\pi^{*}}}(s_{1}) V^1π(s1)V1π^(s1)+ε/2\displaystyle\leq\hat{V}_{1}^{\pi^{*}}(s_{1})-V_{1}^{\hat{\pi^{*}}}(s_{1})+\varepsilon/2
\leq\hat{V}_{1}^{\hat{\pi}^{*}}(s_{1})-V_{1}^{\hat{\pi}^{*}}(s_{1})+\varepsilon/2
ε.\displaystyle\leq\varepsilon.

Now we consider the expected violation. We show that, for every feasible policy \pi\in\Pi^{K}, the expected violation is upper bounded by 2\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1})).

We first consider the event for feasible policy π\pi

𝒫π={h[H1],sh+1ΔhK(sh,ah)sh,ahπ}.\displaystyle{\mathcal{P}}_{\pi}=\{\forall\ h\in[H-1],s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})\mid s_{h},a_{h}\sim\pi\}.

Assume that this event holds. Because \pi\in\Pi^{K} is feasible with respect to \Delta_{h}^{K}(s,a), we have s_{h}\notin\mathcal{U}_{h}^{K} for all h\in[H]. Then

(c(s_{h})-\tau)_{+}\leq\sqrt{\frac{2}{N^{K}(s_{h})}\log(SK/\delta)},

and under the event 𝒫π{\mathcal{P}}_{\pi}

𝔼π[h=1H(c(sh)τ)+]𝔼π[h=1H2NK(sh)log(SKδ)].\displaystyle\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]\leq\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}\sqrt{\frac{2}{N^{K}(s_{h})}\log\left(\frac{SK}{\delta}\right)}\right]. (12)

Now define

\overline{F}_{h}^{\pi}(s,a)
=\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{F}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}.

Since \sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}\leq H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}, we have \overline{F}_{h}^{\pi}(s,a)\leq\overline{W}_{h}(s,a) by recursion; the argument is the same as in the proof of Lemma B.5.

Also denote

F_{h}^{\pi}(s,a)=\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H}\sqrt{\frac{2}{N^{K}(s_{h^{\prime}})}\log\left(\frac{SK}{\delta}\right)}\ \Bigg{|}\ s_{h}=s,a_{h}=a\right].

Next we prove that \overline{F}_{h}^{\pi}(s,a)\geq F_{h}^{\pi}(s,a), by induction. When h=H+1, F_{H+1}^{\pi}(s,a)=\overline{F}_{H+1}^{\pi}(s,a)=0. Now assume that \overline{F}_{h+1}^{\pi}(s,a)\geq F_{h+1}^{\pi}(s,a); then

F¯hπ(s,a)\displaystyle\ \ \ \ \ \overline{F}_{h}^{\pi}(s,a)
=\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{F}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
\geq\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)F_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
\geq\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+H\|\hat{\mathbb{P}}_{h}(\cdot\mid s,a)-\mathbb{P}_{h}(\cdot\mid s,a)\|_{1}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)F_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
\geq\min\left\{H,\sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)F_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
=Fhπ(s,a).\displaystyle=F_{h}^{\pi}(s,a).

Thus, by induction, F_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))\leq\overline{F}_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))\leq\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1})), where the last inequality again uses \sqrt{\frac{2}{N^{K}(s)}\log\left(\frac{SK}{\delta}\right)}\leq H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}.

So far we have shown that, under the event \mathcal{P}_{\pi}, the expected violation is upper bounded by \overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1})). Next we show that \Pr(\mathcal{P}_{\pi}^{c}) is small. Recalling that \mathcal{P}_{\pi}=\{\forall\ h\in[H-1],s_{h+1}\in\Delta_{h}^{K}(s_{h},a_{h})\mid\pi\}, we define

G_{h}^{\pi}(s,a)=\Pr\{\exists\ h^{\prime}\in[h,H-1],s_{h^{\prime}+1}\notin\Delta_{h^{\prime}}^{K}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a\}

and

G¯hπ(s,a)\displaystyle\ \ \ \ \ \overline{G}_{h}^{\pi}(s,a)
=min{H,SHlog(S2HA/δ)Nh(s,a)+Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)G¯h+1π(s,πh+1(s))}.\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}.

Then by recursion, we can easily show that G¯hπ(s,a)W¯hK(s,a)\overline{G}_{h}^{\pi}(s,a)\leq\overline{W}_{h}^{K}(s,a) by log(S2HA/δ)γ(Nh(s,a),δ)\log(S^{2}HA/\delta)\leq\gamma(N_{h}(s,a),\delta).

Lemma B.7.

For all h[H]h\in[H] and feasible policy π\pi, we have G¯hπ(s,a)HGhπ(s,a)\overline{G}_{h}^{\pi}(s,a)\geq HG_{h}^{\pi}(s,a).

Proof.

Define G~hπ(s,a)\widetilde{G}_{h}^{\pi}(s,a) as

G~hπ(s,a)=min{H,SHlog(S2HA/δ)Nh(s,a)+sh(ss,a)G~h+1π(s,πh+1(s))}.\displaystyle\widetilde{G}_{h}^{\pi}(s,a)=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}.

We will show that \overline{G}_{h}^{\pi}(s,a)\geq\widetilde{G}_{h}^{\pi}(s,a)\geq HG_{h}^{\pi}(s,a), and we prove both inequalities by induction. First, for h=H+1, both inequalities hold trivially. Now assume that they hold for h+1; then

G¯hπ(s,a)\displaystyle\ \ \ \ \ \overline{G}_{h}^{\pi}(s,a)
=min{H,SHlog(S2HA/δ)Nh(s,a)+Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)G¯h+1π(s,πh+1(s))}\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\overline{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
min{H,SHlog(S2HA/δ)Nh(s,a)+Hγ(Nh(s,a),δ)Nh(s,a)+s^h(ss,a)G~h+1π(s,πh+1(s))}\displaystyle\geq\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\sqrt{\frac{\gamma(N_{h}(s,a),\delta)}{N_{h}(s,a)}}+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
min{H,SHlog(S2HA/δ)Nh(s,a)+H^h(ss,a)h(ss,a)1\displaystyle\geq\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+H\|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)-\mathbb{P}_{h}(s^{\prime}\mid s,a)\|_{1}\right.
+s^h(ss,a)G~h+1π(s,πh+1(s))}\displaystyle\hskip 200.0003pt\left.+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
min{H,SHlog(S2HA/δ)Nh(s,a)+sh(ss,a)G~h+1π(s,πh+1(s))}\displaystyle\geq\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\widetilde{G}_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}
=G~hπ(s,a).\displaystyle=\widetilde{G}_{h}^{\pi}(s,a).

Now to prove that G~hπ(s,a)HGhπ(s,a)\widetilde{G}_{h}^{\pi}(s,a)\geq HG_{h}^{\pi}(s,a), we need the following lemma.

Lemma B.8.

Fix a step h\in[H] and a state-action pair (s,a)\in\mathcal{S}\times\mathcal{A}. Then, with probability at least 1-\delta, every s^{\prime}\in\Delta_{h}(s,a) with P(s^{\prime}\mid s,a)\geq\frac{\log(S/\delta)}{N_{h}(s,a)} satisfies s^{\prime}\in\Delta^{K}_{h}(s,a).

Proof.

Since we draw N_{h}(s,a) samples from P_{h}(\cdot\mid s,a), the state s^{\prime} can lie outside \Delta^{K}_{h}(s,a) only if s^{\prime} is never sampled. The probability that s^{\prime} is never sampled is (1-p)^{N_{h}(s,a)}, where p=\mathbb{P}_{h}(s^{\prime}\mid s,a). When p\geq\frac{\log(S/\delta)}{N_{h}(s,a)}, this probability is at most

(1p)Nh(s,a)(1log(S/δ)Nh(s,a))Nh(s,a)elog(S/δ)=δS.(1-p)^{N_{h}(s,a)}\leq\left(1-\frac{\log(S/\delta)}{N_{h}(s,a)}\right)^{N_{h}(s,a)}\leq e^{-\log(S/\delta)}=\frac{\delta}{S}.

Taking a union bound over all possible next states s^{\prime} proves the lemma. ∎
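The bound in Lemma B.8 can be checked numerically: if a next state has probability p\geq\log(S/\delta)/N_{h}(s,a), the chance that it is never observed among N_{h}(s,a) samples is at most \delta/S. A small Monte Carlo check with arbitrary parameters:

```python
import math
import random

def prob_never_sampled(p, n, trials=100000, seed=0):
    """Monte Carlo estimate of (1 - p)^n, the probability that an event of probability p
    never occurs in n independent draws."""
    rng = random.Random(seed)
    misses = sum(all(rng.random() >= p for _ in range(n)) for _ in range(trials))
    return misses / trials

S, delta, n = 10, 0.1, 50
p = math.log(S / delta) / n                 # threshold probability from Lemma B.8
assert (1 - p) ** n <= delta / S            # exact bound: (1-p)^n <= exp(-log(S/delta)) = delta/S
print(prob_never_sampled(p, n))             # empirical estimate, close to (1 - p)^n
```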

By the Lemma B.8, define the event as

\mathcal{E}_{5}=\left\{\forall\ h,s,a\ \mbox{and}\ s^{\prime}\in\Delta_{h}(s,a)\ \mbox{with}\ P(s^{\prime}\mid s,a)\geq\frac{\log(S^{2}HA/\delta)}{N_{h}(s,a)}:s^{\prime}\in\Delta_{h}^{K}(s,a)\right\}.

Then Pr(5c)δ\Pr(\mathcal{E}_{5}^{c})\leq\delta. Under the event 5\mathcal{E}_{5}, at step h[H]h\in[H] and state-action pair (sh,ah)(s_{h},a_{h}), we have

(sh+1ΔhK(sh,ah))Slog(S2HA/δ)Nh(sh,ah).\displaystyle\mathbb{P}(s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h}))\leq\frac{S\log(S^{2}HA/\delta)}{N_{h}(s_{h},a_{h})}. (13)

This inequality holds because, under the event \mathcal{E}_{5}, the event s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h}) can only be caused by next states s^{\prime} with P(s^{\prime}\mid s_{h},a_{h})\leq\frac{\log(S^{2}HA/\delta)}{N_{h}(s_{h},a_{h})}, and there are at most S such states. We first decompose G_{h}^{\pi}(s,a) as

H\cdot G_{h}^{\pi}(s,a)=H\cdot\Pr\{\exists\ h^{\prime}\in[h,H-1],s_{h^{\prime}+1}\notin\Delta_{h^{\prime}}^{K}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a,\pi\}
\leq H\sum_{h^{\prime}=h}^{H-1}\Pr\{s_{h^{\prime}+1}\notin\Delta_{h^{\prime}}^{K}(s_{h^{\prime}},a_{h^{\prime}})\mid s_{h}=s,a_{h}=a,\pi\}
\leq\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\right].

The first inequality decomposes the probability into a summation over steps via the union bound, and the second inequality follows from Eq. (13). Now denote g_{h}^{\pi}(s,a)=\min\left\{H,\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\right]\right\}; then g_{h}^{\pi}(s,a)\leq\widetilde{G}_{h}^{\pi}(s,a) by recursion. In fact,

ghπ(s,a)\displaystyle\ \ \ \ \ g_{h}^{\pi}(s,a)
=min{H,𝔼π[h=hH1SHlog(S2HA/δ)Nh(sh,ah)|(sh,ah)=(s,a)]}\displaystyle=\min\left\{H,\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\ \Bigg{|}\ (s_{h},a_{h})=(s,a)\right]\right\}
=min{H,SHlog(S2HA/δ)Nh(s,a)+𝔼π[h=h+1H1SHlog(S2HA/δ)Nh(sh,ah)|(sh,ah)=(s,a)]}\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h+1}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\ \Bigg{|}\ (s_{h},a_{h})=(s,a)\right]\right\}
=min{H,SHlog(S2HA/δ)Nh(s,a)\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}\right.
+min{H,𝔼π[h=h+1H1SHlog(S2HA/δ)Nh(sh,ah)|(sh,ah)=(s,a)]}}\displaystyle\hskip 50.00008pt\left.+\min\left\{H,\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h+1}^{H-1}\frac{SH\log(S^{2}HA/\delta)}{N_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}})}\ \Bigg{|}\ (s_{h},a_{h})=(s,a)\right]\right\}\right\}
=min{H,SHlog(S2HA/δ)Nh(s,a)+s𝒮h(ss,a)gh+1π(s,πh+1(s))}\displaystyle=\min\left\{H,\frac{SH\log(S^{2}HA/\delta)}{N_{h}(s,a)}+\sum_{s^{\prime}\in\mathcal{S}}\mathbb{P}_{h}(s^{\prime}\mid s,a)g_{h+1}^{\pi}(s^{\prime},\pi_{h+1}(s^{\prime}))\right\}

which is the same recursive equation as that of \widetilde{G}_{h}^{\pi}(s,a). Since g_{H}^{\pi}(s,a)=0\leq\widetilde{G}_{H}^{\pi}(s,a), by induction we get g_{h}^{\pi}(s,a)\leq\widetilde{G}_{h}^{\pi}(s,a), and then

HGhπ(s,a)ghπ(s,a)G~hπ(s,a)\displaystyle H\cdot G_{h}^{\pi}(s,a)\leq g_{h}^{\pi}(s,a)\leq\widetilde{G}_{h}^{\pi}(s,a)

holds directly, which proves Lemma B.7. ∎

Now we return to the main theorem. Consider a policy \pi\in\Pi^{K} that is feasible at episode K. For each possible trajectory \mathscr{T}_{H}=\{s_{1},a_{1},\cdots,s_{H},a_{H}\}, let \Pr_{\pi}(\mathscr{T}_{H}) denote the probability of generating \mathscr{T}_{H} when following policy \pi under the true transition kernel. If the event s_{h+1}\notin\Delta_{h}^{K}(s_{h},a_{h}) happens for some h, we call the trajectory \mathscr{T}_{H} “unsafe”. Denoting the set of all unsafe trajectories by \mathcal{U}, we have

𝔼π[h=1H(c(sh)τ)+]\displaystyle\mathbb{E}_{\pi}\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]
=𝒯HPrπ(𝒯H)[h=1H(c(sh)τ)+]\displaystyle\quad=\sum_{\mathscr{T}_{H}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]
𝒯H𝒰Prπ(𝒯H)[h=1H(c(sh)τ)+]+𝒯H𝒰Prπ(𝒯H)[h=1H(c(sh)τ)+]\displaystyle\quad\leq\sum_{\mathscr{T}_{H}\in\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]+\sum_{\mathscr{T}_{H}\notin\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}(c(s_{h})-\tau)_{+}\right]
H𝒯H𝒰Prπ(𝒯H)+𝒯H𝒰Prπ(𝒯H)[h=1H2NK(sh)log(SK/δ)]\displaystyle\quad\leq H\sum_{\mathscr{T}_{H}\in\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})+\sum_{\mathscr{T}_{H}\notin\mathcal{U}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}\sqrt{\frac{2}{N^{K}(s_{h})}\log(SK/\delta)}\right] (14)
\leq H\Pr_{\pi}\{\mathcal{P}_{\pi}^{c}\}+\sum_{\mathscr{T}_{H}}\Pr_{\pi}(\mathscr{T}_{H})\left[\sum_{h=1}^{H}\sqrt{\frac{2}{N^{K}(s_{h})}\log(SK/\delta)}\right]
\leq H\cdot G_{1}^{\pi}(s_{1},\pi_{1}(s_{1}))+F_{1}^{\pi}(s_{1},\pi_{1}(s_{1})) (15)
2W¯1K(s1,π1(s1))\displaystyle\quad\leq 2\overline{W}_{1}^{K}(s_{1},\pi_{1}(s_{1}))
2W¯1K(s1,π1K+1(s1))\displaystyle\quad\leq 2\overline{W}_{1}^{K}(s_{1},\pi^{K+1}_{1}(s_{1}))
ε.\displaystyle\quad\leq\varepsilon.

Eq. (14) holds because the step-wise violation of any trajectory is at most H, while \mathscr{T}_{H}\notin\mathcal{U} implies that the event \mathcal{P}_{\pi} holds along the trajectory, so the per-state bound leading to Eq. (12) applies. Eq. (15) follows from the definitions of F_{1}^{\pi} and G_{1}^{\pi}. So far we have shown that, if the algorithm terminates, all the requirements are satisfied. Now we turn to bounding the sample complexity. We first assume that K\geq SA.

Define q_{h}^{k}=\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)\overline{W}_{h}^{k}(s,a). Since \pi^{k+1} is the greedy policy with respect to \overline{W}_{h}^{k}(s,a), we have

W¯hk(s,a)\displaystyle\overline{W}_{h}^{k}(s,a)
=\min\left\{H,M(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi^{k+1}_{h+1}(s^{\prime}))\right\}
\leq\min\left\{H,M(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi^{k+1}_{h+1}(s^{\prime}))+H\|\hat{\mathbb{P}}_{h}^{k}(\cdot\mid s,a)-\mathbb{P}_{h}(\cdot\mid s,a)\|_{1}\right\}
\leq M^{\prime}(N_{h}^{k}(s,a),\delta)+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi^{k+1}_{h+1}(s^{\prime}))

where M^{\prime}(n,\delta)=3H\left(\sqrt{\frac{2\gamma(n,\delta)}{n}}\wedge 1\right)+SH\left(\frac{\gamma(n,\delta)}{n}\wedge 1\right). Now
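The recursion for \overline{W}_{h}^{k} displayed above can be computed by a single backward pass. The following Python sketch is illustrative only (the data layout, the bonus function M, and the dictionary of estimated safe action sets are assumptions; it is not the paper's implementation):

```python
import numpy as np

def uncertainty_backward_pass(P_hat, N, safe_actions, bonus, H, S, A):
    """Backward recursion W_{H+1} = 0 and
    W_h(s,a) = min(H, bonus(N_h(s,a)) + sum_s' P_hat_h(s'|s,a) * max_{b safe at s'} W_{h+1}(s',b)),
    together with the greedy policy pi_h(s) = argmax_{a safe at s} W_h(s,a).
    P_hat[h] is an (S, A, S) array, N[h] an (S, A) array, and safe_actions[(h, s)] a list of
    actions for h = 1, ..., H + 1 (level H + 1 may be empty)."""
    W_next = np.zeros((S, A))
    policy = {}
    for h in range(H, 0, -1):
        W = np.zeros((S, A))
        # value of the best estimated-safe action at each next state under W_{h+1}
        best_next = np.array([max((W_next[s2, b] for b in safe_actions[(h + 1, s2)]), default=0.0)
                              for s2 in range(S)])
        for s in range(S):
            for a in range(A):
                W[s, a] = min(H, bonus(N[h][s, a]) + P_hat[h][s, a] @ best_next)
            acts = safe_actions[(h, s)]
            policy[(h, s)] = max(acts, key=lambda a: W[s, a]) if acts else None
        W_next = W
    return W_next, policy   # exploration stops once W_1(s_1, pi_1(s_1)) <= eps / 2
```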

qhk\displaystyle q_{h}^{k} =(s,a)hπk+1(s,a)W¯hk(s,a)\displaystyle=\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)\overline{W}_{h}^{k}(s,a)
(s,a)hπk+1(s,a)M(Nhk(s,a),δ)+(s,s,a)hπk+1(s,a)hk(ss,a)W¯h+1k(s,πh+1k+1(s))\displaystyle\leq\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta)+\sum_{(s^{\prime},s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)\mathbb{P}_{h}^{k}(s^{\prime}\mid s,a)\overline{W}_{h+1}^{k}(s^{\prime},\pi_{h+1}^{k+1}(s^{\prime}))
=(s,a)hπk+1(s,a)M(Nhk(s,a),δ)+qh+1k.\displaystyle=\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta)+q_{h+1}^{k}.

Thus

ε/2<q1kh=1H(s,a)hπk+1(s,a)M(Nhk(s,a),δ),k[K1],\displaystyle\varepsilon/2<q_{1}^{k}\leq\sum_{h=1}^{H}\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta),\forall\ k\in[K-1],

where the first inequality holds because q_{1}^{k}=\overline{W}_{1}^{k}(s_{1},\pi_{1}^{k+1}(s_{1}))>\varepsilon/2 for every episode before termination. Summing over k\in[K-1], we get

Kε/2k=0K1h=1H(s,a)hπk+1(s,a)M(Nhk(s,a),δ).\displaystyle K\varepsilon/2\leq\sum_{k=0}^{K-1}\sum_{h=1}^{H}\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta).

Define the pseudo-count as N¯hk(s,a)=i=1khπi(s,a)\overline{N}_{h}^{k}(s,a)=\sum_{i=1}^{k}\mathbb{P}_{h}^{\pi^{i}}(s,a) and the event about the pseudo-count like Kaufmann et al. (2021):

cnt={k,h,s,a:Nhk(s,a)12N¯hk(s,a)log(SAH/δ)}.\displaystyle\mathcal{E}_{cnt}=\left\{\forall\ k,h,s,a:N_{h}^{k}(s,a)\geq\frac{1}{2}\overline{N}_{h}^{k}(s,a)-\log(SAH/\delta)\right\}.

Then \Pr\{\mathcal{E}_{cnt}^{c}\}\leq\delta; the proof is provided in Kaufmann et al. (2021). Moreover, Lemma 7 in Kaufmann et al. (2021) shows that we can replace N_{h}^{k}(s,a) with \overline{N}_{h}^{k}(s,a) up to a constant factor.

Lemma B.9 (Kaufmann et al. (2021)).

Under the event \mathcal{E}_{cnt}, for any s\in\mathcal{S}, a\in\mathcal{A} and k, if \gamma(x,\delta)/x is non-increasing for x\geq 1 and \gamma(x,\delta) is increasing, we have

γ(Nhk(s,a),δ)Nhk(s,a)14γ(N¯hk(s,a),δ)N¯hk(s,a)1.\displaystyle\frac{\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\wedge 1\leq\frac{4\gamma(\overline{N}_{h}^{k}(s,a),\delta)}{\overline{N}_{h}^{k}(s,a)\vee 1}.

The proof of Lemma B.9 can be found in Kaufmann et al. (2021). Then we can have

Kε/2\displaystyle K\varepsilon/2 k=0K1h=1H(s,a)hπk+1(s,a)M(Nhk(s,a),δ)\displaystyle\leq\sum_{k=0}^{K-1}\sum_{h=1}^{H}\sum_{(s,a)}\mathbb{P}_{h}^{\pi^{k+1}}(s,a)M^{\prime}(N_{h}^{k}(s,a),\delta) (16)
3Hh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))2γ(Nhk(s,a),δ)Nhk(s,a)1(a)\displaystyle\leq\underbrace{3H\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\sqrt{\frac{2\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\wedge 1}}_{\text{(a)}}
+SHh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))(γ(Nhk(s,a),δ)Nhk(s,a)1)(b).\displaystyle\ \ \ \ \ \ \ \ \ \ +\underbrace{SH\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\left(\frac{\gamma(N_{h}^{k}(s,a),\delta)}{N_{h}^{k}(s,a)}\wedge 1\right)}_{\text{(b)}}. (17)

For part (a)(a), we upper bound it by

(a) 3Hh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))8γ(N¯hk(s,a),δ)N¯hk(s,a)1\displaystyle\leq 3H\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\sqrt{\frac{8\gamma(\overline{N}_{h}^{k}(s,a),\delta)}{\overline{N}_{h}^{k}(s,a)\vee 1}}
32Hγ(K,δ)h=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))N¯hk(s,a)1\displaystyle\leq 3\sqrt{2}H\sqrt{\gamma(K,\delta)}\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}\frac{(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))}{\sqrt{\overline{N}_{h}^{k}(s,a)\vee 1}} (18)
3(2+2)Hγ(K,δ)h=1H(s,a)N¯hK(s,a)1\displaystyle\leq 3(2+\sqrt{2})H\sqrt{\gamma(K,\delta)}\sum_{h=1}^{H}\sum_{(s,a)}\sqrt{\overline{N}_{h}^{K}(s,a)\vee 1}
3(2+2)Hγ(K,δ)h=1H(s,a)(1+N¯hK(s,a))\displaystyle\leq 3(2+\sqrt{2})H\sqrt{\gamma(K,\delta)}\sum_{h=1}^{H}\sum_{(s,a)}(1+\sqrt{\overline{N}_{h}^{K}(s,a)})
3(2+2)H2γ(K,δ)(SA+SAK)\displaystyle\leq 3(2+\sqrt{2})H^{2}\sqrt{\gamma(K,\delta)}(SA+\sqrt{SAK}) (19)
6(2+2)H2γ(K,δ)SAK,\displaystyle\leq 6(2+\sqrt{2})H^{2}\sqrt{\gamma(K,\delta)}\sqrt{SAK},

where the first inequality uses Lemma B.9, the step from Eq. (18) to the next line uses Lemma 19 in Auer et al. (2008), Eq. (19) follows from \sum_{(s,a)}\overline{N}_{h}^{K}(s,a)=K together with the Cauchy–Schwarz inequality, and the last inequality uses the assumption K\geq SA.

Also for part (b)(b),

(b) SHh=1Hk=0K1(s,a)(N¯hk+1(s,a)N¯hk(s,a))4γ(N¯hk(s,a),δ)N¯hk(s,a)1\displaystyle\leq SH\sum_{h=1}^{H}\sum_{k=0}^{K-1}\sum_{(s,a)}(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))\frac{4\gamma(\overline{N}_{h}^{k}(s,a),\delta)}{\overline{N}_{h}^{k}(s,a)\vee 1}
SHγ(K,δ)h=1H(s,a)k=0K1(N¯hk+1(s,a)N¯hk(s,a))N¯hk(s,a)1\displaystyle\leq SH\gamma(K,\delta)\sum_{h=1}^{H}\sum_{(s,a)}\sum_{k=0}^{K-1}\frac{(\overline{N}_{h}^{k+1}(s,a)-\overline{N}_{h}^{k}(s,a))}{\overline{N}_{h}^{k}(s,a)\vee 1}
SHγ(K,δ)h=1H(s,a)4log(N¯hK(s,a)+1)\displaystyle\leq SH\gamma(K,\delta)\sum_{h=1}^{H}\sum_{(s,a)}4\log(\overline{N}_{h}^{K}(s,a)+1) (20)
4S2AH2γ(K,δ)log(K+1),\displaystyle\leq 4S^{2}AH^{2}\gamma(K,\delta)\log(K+1),

where Eq. (20) follows from Lemma 9 in Ménard et al. (2021). Combining the upper bounds on (a) and (b), we get

Kε/26(2+2)H2γ(K,δ)SAK+4S2AH2γ(K,δ)log(K+1).\displaystyle K\varepsilon/2\leq 6(2+\sqrt{2})H^{2}\sqrt{\gamma(K,\delta)}\sqrt{SAK}+4S^{2}AH^{2}\gamma(K,\delta)\log(K+1).

Recall that \gamma(n,\delta)=2\log(SAHK/\delta)+(S-1)\log(e(1+n/(S-1)))\geq 1. Solving the above inequality for K via Lemma D.1, we get

K=𝒪~((H4SAε2+S2AH2ε)(log(1δ)+S)),\displaystyle K=\widetilde{\mathcal{O}}\left(\left(\frac{H^{4}SA}{\varepsilon^{2}}+\frac{S^{2}AH^{2}}{\varepsilon}\right)\left(\log\left(\frac{1}{\delta}\right)+S\right)\right),

under the events \mathcal{E}_{cnt},\mathcal{E}_{1},\mathcal{E}_{4},\mathcal{E}_{5}. Hence the theorem holds with probability at least 1-4\delta, and we complete the proof by replacing \delta with \delta/4.

B.3 Proof of Theorem 5.1

Proof.

We construct n MDP instances with the same transition kernel but different safety costs. In each instance, the states form a tree of degree A and depth H/3. For each state inside the tree (i.e., each non-leaf state), action a_{i} leads to the i-th child. The leaf nodes are absorbing states, which we call s_{1},s_{2},\cdots,s_{n}. Then n=A^{H/3}=\frac{S(A-1)+1}{A}\geq\frac{S}{2}.

Figure 2: MDP instance \tau_{1}.
Figure 3: MDP instance \tau_{i}\ (i=2).

We first define the instance \tau_{1} as follows. For each state inside the tree, the safety cost and reward are both 0. For the leaf nodes, the cost function is c(s_{i})=1/2+\Delta\mathbb{I}\{i\neq 1\} and the reward function is r(s_{i})=\mathbb{I}\{i\neq 1\}, i.e., only state s_{1} has reward 0 while the other states s_{2},\cdots,s_{n} have reward 1. Next we define the instances \tau_{j}\ (2\leq j\leq n). For states inside the tree, the safety cost and reward equal those of \tau_{1}. For the leaf nodes, the cost function is c(s_{i})=1/2+\Delta\mathbb{I}\{i\neq 1,j\}, with r(s_{i}) unchanged. Then, in \tau_{j}, the leaf states s_{1} and s_{j} (as well as the internal states and the initial state s_{0}) are safe, so the optimal safe policy follows the path to s_{j} and collects reward 1 at every remaining step, giving safe optimal reward 2H/3. The MDP instances are shown in Figure 2 and Figure 3.
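For illustration, the leaf costs and rewards of these instances can be generated directly from the description above (Delta is the gap parameter, and the safety threshold is \tau=1/2):

```python
def leaf_cost_reward(n, j, Delta):
    """Leaf costs and rewards of instance tau_j (j = 1 recovers tau_1).
    In tau_1 only the zero-reward leaf s_1 is safe; in tau_j (j >= 2) the rewarding leaf s_j
    is also safe, which is what forces the regret/violation trade-off."""
    cost = [0.5 + Delta * (i != 1 and i != j) for i in range(1, n + 1)]
    reward = [float(i != 1) for i in range(1, n + 1)]
    return cost, reward
```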

Let H^{\prime}=\frac{2}{3}H denote the number of steps remaining after the agent reaches one of the leaf states s_{1},\cdots,s_{n}. Let R_{K}(\pi,\tau) denote the regret up to episode K for instance \tau and algorithm \pi, and let C_{K}(\pi,\tau) denote the corresponding violation. For \tau_{1}, we have C_{K}(\pi,\tau_{1})\geq\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)\cdot\frac{KH^{\prime}\Delta}{2}, where t_{i}(K) denotes the number of episodes in which the path to s_{i} is chosen. Also, R_{K}(\pi,\tau_{i})\geq\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\frac{KH^{\prime}}{2}.

Applying the Bretagnolle–Huber inequality, we have

τ1(t1(K)K/2)+τi(t1(K)>K/2)12exp{D(τ1,τi)}.\displaystyle\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)+\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\geq\frac{1}{2}\exp\{-D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\}.

We choose i=\arg\min_{i\neq 1}\mathbb{E}_{\tau_{1}}[t_{i}(K)]; then, since \sum_{i\neq 1}\mathbb{E}_{\tau_{1}}[t_{i}(K)]\leq K,

𝔼τ1[ti(K)]Kn1.\mathbb{E}_{\tau_{1}}[t_{i}(K)]\leq\frac{K}{n-1}.

By the definition of τ1\mathbb{P}_{\tau_{1}}and τi\mathbb{P}_{\tau_{i}},

D(τ1,τi)\displaystyle D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}}) =H𝔼τ1[ti(K)]D(𝒩(1/2+Δ,1),𝒩(1/2,1))\displaystyle=H^{\prime}\mathbb{E}_{\tau_{1}}[t_{i}(K)]D(\mathcal{N}(1/2+\Delta,1),\mathcal{N}(1/2,1))
H𝔼τ1[ti(K)]4Δ2\displaystyle\leq H^{\prime}\mathbb{E}_{\tau_{1}}[t_{i}(K)]\cdot 4\Delta^{2}
4KHΔ2n1.\displaystyle\leq\frac{4KH^{\prime}\Delta^{2}}{n-1}.

Thus, choosing \Delta=\sqrt{\frac{n-1}{4KH^{\prime}}}\leq\frac{1}{2} (which requires n\leq T), we get

τ1(t1(K)K/2)+τi(t1(K)>K/2)12exp{D(τ1,τi)}12e.\displaystyle\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)+\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\geq\frac{1}{2}\exp\{-D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\}\geq\frac{1}{2e}.

If \mathbb{E}_{\pi}[R_{K}(\pi,\tau)]\leq\frac{T}{24}=\frac{KH}{24}=\frac{KH^{\prime}}{16} for all \tau, then

τi(t1(K)>K/2)2RK(π,τi)KH18.\displaystyle\mathbb{P}_{\tau_{i}}(t_{1}(K)>K/2)\leq\frac{2R_{K}(\pi,\tau_{i})}{KH^{\prime}}\leq\frac{1}{8}.

Thus

τ1(t1(K)K/2)12e18:=C\mathbb{P}_{\tau_{1}}(t_{1}(K)\leq K/2)\geq\frac{1}{2e}-\frac{1}{8}:=C

and

C_{K}(\pi,\tau_{1})\geq\frac{CKH^{\prime}\Delta}{2}=\frac{C\sqrt{(n-1)KH^{\prime}}}{4}=\Omega(\sqrt{SHK})=\Omega(\sqrt{ST}),

since n\geq\frac{S}{2} and T=KH. ∎

B.4 Proof of Theorem 5.2

Proof.

We construct MDP instances \tau_{1},\tau_{2},\cdots,\tau_{n} as follows. In each instance, the states form a tree of degree A-1 and depth H/3. For each state inside the tree (i.e., each non-leaf state), action a_{i}\ (1\leq i\leq A-1) leads to the i-th child. Denote the leaf nodes by s_{1},s_{2},\cdots,s_{n}, where n=(A-1)^{H/3}=\frac{(S-3)(A-2)+1}{A-1}=\Omega(S); from these leaf states the process transitions to the two absorbing states s_{A} and s_{B}, with rewards r(s_{A})=0 and r(s_{B})=1 respectively. In addition, in every state except the two absorbing states, action a_{A} leads the agent to the unique unsafe state s_{U} with r(s_{U})=0.5+\Delta^{\prime} and c(s_{U})=1, for some parameter \Delta^{\prime}. For all states except the unsafe state, the safety cost is 0.

Now, for the instance \tau_{1}, the transition probabilities from the leaf states to the absorbing states are

\left\{\begin{array}{l}p(s_{A}\mid s_{1},a)=0.5-\Delta\\ p(s_{B}\mid s_{1},a)=0.5+\Delta\\ p(s_{A}\mid s_{i},a)=p(s_{B}\mid s_{i},a)=0.5\ \ \ \ i\geq 2\end{array}\right.

For instance τj\tau_{j}, the transition probability are

\left\{\begin{array}{l}p(s_{A}\mid s_{1},a)=0.5-\Delta\\ p(s_{B}\mid s_{1},a)=0.5+\Delta\\ p(s_{A}\mid s_{j},a)=0.5-2\Delta\\ p(s_{B}\mid s_{j},a)=0.5+2\Delta\\ p(s_{A}\mid s_{i},a)=p(s_{B}\mid s_{i},a)=0.5\ \ \ \ i\neq 1,j\end{array}\right.
Figure 4: MDP instance \tau_{1}.
Figure 5: MDP instance \tau_{i}.

If we ignore the safety constraint, choosing a_{A} and entering the unsafe state s_{U} is always the optimal action. Define \mbox{gap}_{h,s,a}=V_{h}^{*}(s)-Q_{h}^{*}(s,a); then the minimal gap (Simchowitz & Jamieson, 2019) satisfies

\mbox{gap}_{\min}=\min_{h,s,a}\{\mbox{gap}_{h,s,a}:\mbox{gap}_{h,s,a}>0\}\geq H^{\prime}\big((0.5+\Delta^{\prime})-(0.5+2\Delta)\big)=H^{\prime}(\Delta^{\prime}-2\Delta).

Thus, by the classical gap-dependent upper bound (Simchowitz & Jamieson, 2019), the regret is at most O\left(\frac{S^{2}A\cdot poly(H)\log T}{\Delta^{\prime}-\Delta}\right). As we show below, we choose \Delta=\Theta(T^{-1/2}) and \Delta^{\prime}=\Theta(T^{\frac{\alpha-1}{2}}), hence the regret is

O(S2Apoly(H)logTΔΔ)=𝒪~(T1α2).\displaystyle O\left(\frac{S^{2}A\cdot poly(H)\log T}{\Delta^{\prime}-\Delta}\right)=\widetilde{\mathcal{O}}\left(T^{\frac{1-\alpha}{2}}\right).

Now consider the safety threshold \tau=0, so that the violation equals the number of visits to the unsafe state. Consider any algorithm that achieves O(T^{1-\alpha}) violation; it visits the unsafe state at most O(T^{1-\alpha}) times. Denote by T_{i}(K)\ (1\leq i\leq n+1) the number of times the agent reaches state s_{i} up to episode K, where s_{n+1}:=s_{U}. Then \mathbb{E}_{\pi}[T_{n+1}(K)]\leq CT^{1-\alpha}. Since in each episode the agent reaches at least one of s_{i}\ (1\leq i\leq n+1), we have \sum_{i=1}^{n+1}T_{i}(K)\geq K, and \sum_{i=1}^{n}T_{i}(K)\geq K-C\cdot T^{1-\alpha}\geq K/2 when K\geq(2C)^{1/\alpha}H^{\frac{1-\alpha}{\alpha}}.

Now, similarly to the proof of Theorem 5.1, let R_{K}(\pi,\tau) and C_{K}(\pi,\tau) denote the regret and the violation of algorithm \pi=\{\pi_{1},\cdots,\pi_{K}\} on instance \tau, and let H^{\prime}=\frac{2}{3}H. For \tau_{1} and \tau_{i}, we have

RK(π,τ1)\displaystyle R_{K}(\pi,\tau_{1}) τ1(T1(K)K/4)KHΔ4(ΔΔ)𝔼π[Tn+1(K)]\displaystyle\geq\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-\Delta)\mathbb{E}_{\pi}[T_{n+1}(K)]
τ1(T1(K)K/4)KHΔ4(ΔΔ)CT1α\displaystyle\geq\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-\Delta)CT^{1-\alpha}
RK(π,τi)\displaystyle R_{K}(\pi,\tau_{i}) τi(T1(K)>K/4)KHΔ4(Δ2Δ)𝔼π[Tn+1(K)]\displaystyle\geq\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-2\Delta)\mathbb{E}_{\pi}[T_{n+1}(K)]
τi(T1(K)>K/4)KHΔ4(Δ2Δ)CT1α\displaystyle\geq\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\frac{KH^{\prime}\Delta}{4}-(\Delta^{\prime}-2\Delta)CT^{1-\alpha}

Also, by the Bretagnolle–Huber inequality, we have

τ1(T1(K)K/4)+τi(T1(K)>K/4)12exp{D(τ1,τi)}\displaystyle\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)+\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\geq\frac{1}{2}\exp\{-D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\}

Now, without loss of generality, we have \mathbb{E}_{\tau_{1}}[T_{i}(K)]\leq\frac{K}{2(n-1)}, so that

D(τ1,τi)𝔼τ1[Ti(K)]2Δ2K2(n1)2Δ2=KΔ2n1\displaystyle D(\mathbb{P}_{\tau_{1}},\mathbb{P}_{\tau_{i}})\leq\mathbb{E}_{\tau_{1}}[T_{i}(K)]2\Delta^{2}\leq\frac{K}{2(n-1)}\cdot 2\Delta^{2}=\frac{K\Delta^{2}}{n-1}

Choose Δ=n1K\Delta=\sqrt{\frac{n-1}{K}}, then

τ1(T1(K)K/4)+τi(T1(K)>K/4)12e.\displaystyle\mathbb{P}_{\tau_{1}}(T_{1}(K)\leq K/4)+\mathbb{P}_{\tau_{i}}(T_{1}(K)>K/4)\geq\frac{1}{2e}.

Then at least one of \tau_{1} and \tau_{i} (say \tau_{1}) satisfies

RK(π,τ1)HK(n1)8eC(Δ2Δ)T1α\displaystyle R_{K}(\pi,\tau_{1})\geq\frac{H^{\prime}\sqrt{K(n-1)}}{8e}-C(\Delta^{\prime}-2\Delta)T^{1-\alpha}

Choose \Delta^{\prime}=T^{\frac{\alpha-1}{2}}, and assume T\geq 2^{\frac{2}{1-\alpha}} so that \Delta^{\prime}\leq\frac{1}{2}. Then

RK(π,τ1)HK(n1)8eCT(1α)/2=Ω(HSK)=Ω(HST).\displaystyle R_{K}(\pi,\tau_{1})\geq\frac{H^{\prime}\sqrt{K(n-1)}}{8e}-CT^{(1-\alpha)/2}=\Omega(H\sqrt{SK})=\Omega(\sqrt{HST}).

Appendix C Safe Markov Games

C.1 Definitions

In this section, we extend our general framework for Safe-RL-SW to safe zero-sum two-player Markov games (Yu et al., 2021).

In zero-sum two-player Markov games, there is an agent and an adversary; the adversary tries to drive the agent into the unsafe region and to minimize the reward, while the agent aims to maximize the reward and avoid the unsafe region despite the adversary. More specifically, let \mathcal{A} and \mathcal{B} denote the action spaces of the agent and the adversary, respectively. In each episode the agent starts at an initial state s_{1}. At each step h\in[H], the agent and the adversary observe the state s_{h} and simultaneously take actions a_{h}\in\mathcal{A} and b_{h}\in\mathcal{B} drawn from their strategies \mu_{h}(\cdot\mid s) and \upsilon_{h}(\cdot\mid s). Then the agent receives a reward r_{h}(s_{h},a_{h},b_{h}), and the environment transitions to the next state s_{h+1}\sim\mathbb{P}_{h}(\cdot\mid s_{h},a_{h},b_{h}) according to the transition kernel \mathbb{P}.

One motivation for this setting is that, when a robot operates in a dangerous environment, it should learn a completely safe policy even under disturbances from the environment. As in the single-player MDP, we define the value function and Q-value function:

Vhμ,υ(s)\displaystyle V_{h}^{\mu,\upsilon}(s) =𝔼π[h=hHrh(sh,ah,bh)|sh=s,μ,υ].\displaystyle=\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}})\Bigg{|}s_{h}=s,\mu,\upsilon\right].
Qhμ,υ(s,a,b)\displaystyle Q_{h}^{\mu,\upsilon}(s,a,b) =𝔼π[h=hHrh(sh,ah,bh)|sh=s,ah=a,bh=b,μ,υ].\displaystyle=\mathbb{E}_{\pi}\left[\sum_{h^{\prime}=h}^{H}r_{h^{\prime}}(s_{h^{\prime}},a_{h^{\prime}},b_{h^{\prime}})\Bigg{|}s_{h}=s,a_{h}=a,b_{h}=b,\mu,\upsilon\right].

The cost function is a bounded function on the states with c(s)\in[0,1] for all s\in\mathcal{S}, and the unsafe states are defined as \mathcal{U}_{H}=\{s\mid c(s)>\tau\} for some fixed threshold \tau. Similar to Section 4, we define the set of possible next states for state s, step h, and action pair (a,b) as

Δh(s,a,b)={sh(ss,a,b)>0}.\displaystyle\Delta_{h}(s,a,b)=\{s^{\prime}\mid\mathbb{P}_{h}(s^{\prime}\mid s,a,b)>0\}.

Hence we can define the unsafe states for step hh recursively as:

\displaystyle\mathcal{U}_{h}=\mathcal{U}_{h+1}\cup\{s\mid\forall\,a\in\mathcal{A},\ \exists\,b\in\mathcal{B},\ \Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}\neq\emptyset\}.

Then, to stay safe, the agent must not visit any state s\in\mathcal{U}_{h} at steps 1,2,\cdots,h. The safe action set for the agent, A_{h}^{safe}(s), consists of the actions that avoid the unsafe region no matter what action the adversary takes:

Ahsafe(s)={a𝒜b,Δh(s,a,b)𝒰h+1=}.\displaystyle A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\ \forall\ b\in\mathcal{B},\Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}=\emptyset\}.
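Both recursions above can be evaluated exactly by backward induction once the supports \Delta_{h}(s,a,b) and the terminal unsafe set \mathcal{U}_{H} are known. The following minimal sketch assumes integer-indexed states and actions and Python sets Delta[h][s][a][b] as hypothetical inputs (the safe action sets at the last step are unconstrained and omitted).

def unsafe_sets_and_safe_actions(Delta, U_H, S, A, B, H):
    # U[h]: unsafe states at (0-indexed) step h; A_safe[h][s]: safe agent actions at s.
    U = [set() for _ in range(H)]
    U[H - 1] = set(U_H)                        # terminal unsafe states {s : c(s) > tau}
    A_safe = [dict() for _ in range(H)]
    for h in range(H - 2, -1, -1):
        U[h] = set(U[h + 1])
        for s in range(S):
            # a is safe iff no adversary action b can reach U_{h+1} from (s, a)
            safe = [a for a in range(A)
                    if all(Delta[h][s][a][b].isdisjoint(U[h + 1]) for b in range(B))]
            A_safe[h][s] = safe
            if not safe:                       # for every a, some b reaches U_{h+1}
                U[h].add(s)
    return U, A_safe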

Hence, a feasible policy should keep the agent in a safe state s\notin\mathcal{U}_{h} at each step h, and always choose a\in A_{h}^{safe}(s) to avoid the unsafe region regardless of the adversary. Analogous to the safe RL setting in Section 4, the optimal Bellman equation can be written as

\displaystyle Q^{*}_{h}(s,a,b)=r_{h}(s,a,b)+\sum_{s^{\prime}}\mathbb{P}_{h}(s^{\prime}\mid s,a,b)V^{*}_{h+1}(s^{\prime}).
Vh(s)\displaystyle V^{*}_{h}(s) =maxaAhsafe(s)minbQh(s,a,b).\displaystyle=\max_{a\in A_{h}^{safe}(s)}\min_{b\in\mathcal{B}}Q^{*}_{h}(s,a,b).

If we set r_{h}(s,a,b)=-\infty for unsafe states s\in\mathcal{U}_{H}, then for a fixed adversary strategy \upsilon, the best response of the agent can be defined as \arg\max_{\mu}V_{1}^{\mu,\upsilon}(s_{1}). Similarly, the best response of the adversary to the agent's strategy \mu can be defined as \arg\min_{\upsilon}V_{1}^{\mu,\upsilon}(s_{1}). By the minimax theorem, for any step h\in[H] and s\in\mathcal{S},

maxμminυVhμ,υ(s)=minυmaxμVhμ,υ(s).\displaystyle\max_{\mu}\min_{\upsilon}V_{h}^{\mu,\upsilon}(s)=\min_{\upsilon}\max_{\mu}V_{h}^{\mu,\upsilon}(s).

A policy pair (\mu^{*},\upsilon^{*}) satisfying V_{h}^{\mu^{*},\upsilon^{*}}(s)=\max_{\mu}\min_{\upsilon}V_{h}^{\mu,\upsilon}(s) for all steps h\in[H] is called a Nash equilibrium, and the value V_{h}^{\mu^{*},\upsilon^{*}}(s) is called the minimax value. It is known that the minimax value is unique for Markov games, and the agent's goal is to make its reward as close as possible to the minimax value.

Assume the agent and the adversary execute the policies (\mu_{1},\upsilon_{1}),(\mu_{2},\upsilon_{2}),\cdots,(\mu_{K},\upsilon_{K}) in episodes 1,2,\cdots,K. The regret is defined as

R(K)=k=1KV1μ,υ(s1)V1μk,υk(s1).\displaystyle R(K)=\sum_{k=1}^{K}V_{1}^{\mu^{*},\upsilon^{*}}(s_{1})-V_{1}^{\mu_{k},\upsilon_{k}}(s_{1}).

Also, the actual step-wise violation is defined as

C(K)=k=1Kh=1H(c(shk)τ)+.\displaystyle C(K)=\sum_{k=1}^{K}\sum_{h=1}^{H}(c(s_{h}^{k})-\tau)_{+}.

C.2 Algorithms and Results

The main algorithm is presented in Algorithm 3. The key idea is similar to that of Algorithm 1. However, in each episode we need to compute a policy \pi^{k} via the sub-procedure “Planning”, which computes a Nash equilibrium of the estimated Markov game. Note that a mixed Nash equilibrium strategy (\mu,\upsilon) always exists, and the ZERO-SUM-NASH function in the “Planning” procedure computes the Nash equilibrium of the zero-sum matrix game induced at each state and step.

In the sub-procedure “Planning”, we set (Q_{h}^{k})^{\prime}(s,a,\cdot)=-\infty if there exists at least one adversary action b such that \Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}\neq\emptyset. Indeed, the agent cannot take such actions at this step, so by setting (Q_{h}^{k})^{\prime}(s,a,\cdot)=-\infty for these actions a, the support of the Nash equilibrium policy \mu_{h}^{k} will not contain a. If every action a is masked in this way, the minimax value is -\infty, which means that s is now in \mathcal{U}_{h}^{k}; in this case the agent can choose its action arbitrarily.

1:  Input: \Delta_{h}^{1}(s,a,b)=\emptyset,\ N_{h}^{1}(s,a,b)=0,\ N_{h}^{1}(s,a,b,s^{\prime})=0.
2:  for k=1,2,,Kk=1,2,\cdots,K do
3:     Receive the initial state s1s_{1}.
4:     Update the estimate \hat{c}(s) based on the history and compute the optimistic estimate \bar{c}(s)=\hat{c}(s)-\beta(N^{k}(s),\delta) for each state s, where \beta is the bonus function.
5:     Define 𝒰Hk={sc¯(s)>τ}\mathcal{U}_{H}^{k}=\{s\mid\bar{c}(s)>\tau\}.
6:     Based on \Delta_{h}^{k}(s,a,b), calculate \mathcal{U}_{h}^{k}=\mathcal{U}_{h+1}^{k}\cup\{s\mid\forall a\in\mathcal{A},\exists b\in\mathcal{B},\Delta_{h}^{k}(s,a,b)\cap\mathcal{U}_{h+1}^{k}\neq\emptyset\}.
7:     ^h(ss,a,b)=Nhk(s,a,b,s)Nhk(s,a,b)\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a,b)=\frac{N_{h}^{k}(s,a,b,s^{\prime})}{N_{h}^{k}(s,a,b)}.
8:     \pi_{h}^{k}=(\mu_{h}^{k},\upsilon_{h}^{k})=\text{Planning}(k,r,\{\mathcal{U}^{k}_{h}\}_{h\in[H]},\Delta_{h}^{k}(s,a,b),N_{h}^{k}(s,a,b),\hat{\mathbb{P}}).
9:     for h=1,2,,Hh=1,2,\cdots,H do
10:        Take action a_{h}^{k}\sim\mu_{h}^{k}(\cdot\mid s_{h}^{k}), observe the adversary's action b_{h}^{k}, and arrive at state s_{h+1}^{k}\sim\mathbb{P}_{h}(\cdot\mid s_{h}^{k},a_{h}^{k},b_{h}^{k}).
11:        Δhk+1(shk,ahk,bhk)=Δhk(shk,ahk,bhk){sh+1k}\Delta_{h}^{k+1}(s_{h}^{k},a_{h}^{k},b_{h}^{k})=\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})\cup\{s_{h+1}^{k}\}.
12:     end for
13:     Update Nhk+1(s,a,b)N_{h}^{k+1}(s,a,b), Nhk+1(s,a,b,s),Δhk+1(s,a,b)N_{h}^{k+1}(s,a,b,s^{\prime}),\Delta_{h}^{k+1}(s,a,b) for all (s,a,b)(s,a,b).
14:  end for
Algorithm 3 Safe Markov Games
1:  Input k,r,{𝒰h},Δ(s,a,b),N,^k,r,\{\mathcal{U}_{h}\},\Delta(s,a,b),N,\hat{\mathbb{P}}.
2:  for h=H,H1,,1h=H,H-1,\cdots,1 do
3:     for (s,a,b)𝒮×𝒜×(s,a,b)\in\mathcal{S}\times\mathcal{A}\times\mathcal{B} do
4:        β(Nhk(s,a,b),δ)7Hlog(5SABT/δ)/Nhk(s,a,b)\beta(N_{h}^{k}(s,a,b),\delta)\leftarrow 7H\sqrt{\log(5SABT/\delta)/N_{h}^{k}(s,a,b)}.
5:        Q_{h}^{k}(s,a,b)=\min\{r_{h}(s,a,b)+\sum_{s^{\prime}}\hat{\mathbb{P}}_{h}^{k}(s^{\prime}\mid s,a,b)V_{h+1}^{k}(s^{\prime})+\beta(N_{h}^{k}(s,a,b),\delta),H\}.
6:     end for
7:     for sSs\in S do
8:        for aAa\in A do
9:           if b,Δh(s,a,b)𝒰h+1\exists b,\Delta_{h}(s,a,b)\cap\mathcal{U}_{h+1}\neq\emptyset then
10:              (Qhk)(s,a,)=(Q_{h}^{k})^{\prime}(s,a,\cdot)=-\infty.
11:           else
12:              (Q_{h}^{k})^{\prime}(s,a,\cdot)=Q_{h}^{k}(s,a,\cdot), and add a to A_{h}^{k}(s).
13:           end if
14:        end for
15:        μhk,υhk=ZERO-SUM-NASH((Qhk)(s,,)).\mu_{h}^{k},\upsilon_{h}^{k}=\mbox{ZERO-SUM-NASH}((Q_{h}^{k})^{\prime}(s,\cdot,\cdot)).
16:        V_{h}^{k}(s)=\mathbb{E}_{a\sim\mu_{h}^{k}(\cdot\mid s),b\sim\upsilon_{h}^{k}(\cdot\mid s)}\left[Q_{h}^{k}(s,a,b)\right].
17:     end for
18:  end for
Algorithm 4 Planning(k,r,{𝒰h}h[H],Δ(s,a,b),N,^)(k,r,\{\mathcal{U}_{h}\}_{h\in[H]},\Delta(s,a,b),N,\hat{\mathbb{P}})
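For concreteness, the ZERO-SUM-NASH step in Algorithm 4 amounts to solving a small zero-sum matrix game at each state. The sketch below is one possible implementation using scipy.optimize.linprog (our choice of solver, not prescribed by the algorithm); it returns the max-player's mixed strategy and the game value, excluding the actions masked to -\infty as described above. The adversary's equilibrium strategy can be obtained analogously from the column player's linear program.

import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(Q):
    # Q: (A, B) payoff matrix for the max-player; rows containing -inf are masked actions.
    A, B = Q.shape
    safe = ~np.isinf(Q).any(axis=1)                  # actions with no -inf entry
    if not safe.any():                               # state is unsafe: act arbitrarily
        return np.ones(A) / A, -np.inf
    Qs = Q[safe]
    n = Qs.shape[0]
    # Variables x = (mu_1, ..., mu_n, v); maximize v  <=>  minimize -v.
    c = np.concatenate([np.zeros(n), [-1.0]])
    # For every adversary action b:  v - sum_a mu_a Q[a, b] <= 0.
    A_ub = np.hstack([-Qs.T, np.ones((B, 1))])
    b_ub = np.zeros(B)
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]   # probabilities sum to 1
    b_eq = np.array([1.0])
    bounds = [(0.0, 1.0)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    mu = np.zeros(A)
    mu[safe] = res.x[:n]
    return mu, res.x[-1]                             # mixed strategy and minimax value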

Our main results are presented in the following theorem, which shows that our algorithm achieves both O(\sqrt{T}) regret and O(\sqrt{ST}) step-wise violation.

Theorem C.1.

With probability at least 1-\delta, Algorithm 3 has regret and step-wise violation bounded by

R(T)\displaystyle R(T) =O(H3SABT)\displaystyle=O(H^{3}\sqrt{SABT})
C(T)\displaystyle C(T) =O(ST+S2ABH2),\displaystyle=O(\sqrt{ST}+S^{2}ABH^{2}),

where A=|𝒜|A=|\mathcal{A}|, B=||B=|\mathcal{B}| and S=|𝒮|.S=|\mathcal{S}|.

C.3 Proofs

Proof.

We first prove that, with probability at least 1-\delta, Q_{h}^{k}(s,a,b)\geq Q_{h}^{*}(s,a,b) and V_{h}^{k}(s)\geq V_{h}^{*}(s) for all episodes k\in[K]. We prove this fact by induction. Denote \mathbb{D}_{\mu\times\upsilon}Q(s)=\mathbb{E}_{a\sim\mu(\cdot\mid s),b\sim\upsilon(\cdot\mid s)}Q(s,a,b). The statement holds trivially for h=H+1. Suppose the bounds hold for the Q-values at step h+1; we now consider step h.

\displaystyle V_{h+1}^{k}(s) =\mathbb{D}_{\pi_{h+1}^{k}}Q_{h+1}^{k}(s)
=\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{k,safe}(s)}\mathbb{D}_{\mu\times\upsilon_{h+1}^{k}}Q_{h+1}^{k}(s)
\geq\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{k,safe}(s)}\mathbb{D}_{\mu\times\upsilon_{h+1}^{k}}Q_{h+1}^{*}(s)
\geq\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{*}(s)}\mathbb{D}_{\mu\times\upsilon_{h+1}^{k}}Q_{h+1}^{*}(s)
\geq\min_{\upsilon}\max_{\mu:\,\mathrm{supp}(\mu)\subseteq A_{h+1}^{*}(s)}\mathbb{D}_{\mu\times\upsilon}Q_{h+1}^{*}(s)
=V_{h+1}^{*}(s),

where A_{h}^{k,safe}(s) denotes the estimated set of safe actions for the max-player at episode k and step h, A_{h}^{*}(s) denotes the true set of safe actions, and \mathrm{supp}(\mu)=\{a\in\mathcal{A}\mid\mu(a)>0\}. From our algorithm, we know A_{h}^{*}(s)\subseteq A_{h}^{k,safe}(s) for all k whenever the confidence event holds.

Then

Qhk(s,a,b)Qh(s,a,b)\displaystyle Q_{h}^{k}(s,a,b)-Q_{h}^{*}(s,a,b) =(^hkVh+1khVh+1+βhk)(s,a,b)\displaystyle=(\hat{\mathbb{P}}_{h}^{k}V_{h+1}^{k}-\mathbb{P}_{h}V_{h+1}^{*}+\beta_{h}^{k})(s,a,b)
(^hkh)(Vh+1)(s,a,b)+β(Nhk(s,a,b),δ)\displaystyle\geq(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})(V_{h+1}^{*})(s,a,b)+\beta(N_{h}^{k}(s,a,b),\delta)
0.\displaystyle\geq 0.

The last inequality follows from the Chernoff-Hoeffding inequality. Thus, define the event

6={k,h,s,a,b,Qhk(s,a,b)Qh(s,a,b),Vhk(s)Vh(s)}.\mathcal{E}_{6}=\left\{\ \forall k,h,s,a,b,\ Q_{h}^{k}(s,a,b)\geq Q_{h}^{*}(s,a,b),V_{h}^{k}(s)\geq V_{h}^{*}(s)\right\}.

Then Pr{6c}δ.\Pr\{\mathcal{E}_{6}^{c}\}\leq\delta.

Now we start to bound the regret.

R(K)\displaystyle R(K) =k=1KV1μ,υ(s1)V1μk,υk(s1)\displaystyle=\sum_{k=1}^{K}V_{1}^{\mu^{*},\upsilon^{*}}(s_{1})-V_{1}^{\mu_{k},\upsilon_{k}}(s_{1})
k=1KV1k(s1)V1μk,υk(s1).\displaystyle\leq\sum_{k=1}^{K}V_{1}^{k}(s_{1})-V_{1}^{\mu_{k},\upsilon_{k}}(s_{1}).

We first assume that \upsilon_{k} is the best response to \mu_{k}; this is without loss of generality, since replacing the actual adversary strategy by the best response only decreases V_{1}^{\mu_{k},\upsilon_{k}}(s_{1}) and hence only increases the regret, so any upper bound derived under this assumption also bounds the actual regret. We bound the per-episode regret by

Vhk(shk)Vhμk,υk(shk)\displaystyle\ \ \ \ \ V_{h}^{k}(s_{h}^{k})-V_{h}^{\mu_{k},\upsilon_{k}}(s_{h}^{k})
=𝔻πk(Qhk)(shk)𝔻μk×υkQhμk×υk(shk)\displaystyle=\mathbb{D}_{\pi^{k}}(Q_{h}^{k})(s_{h}^{k})-\mathbb{D}_{\mu_{k}\times\upsilon_{k}}Q_{h}^{\mu_{k}\times\upsilon_{k}}(s_{h}^{k})
𝔻μk×υk(Qhk)(shk)𝔻μk×υkQhμk×υk(shk)\displaystyle\leq\mathbb{D}_{\mu_{k}\times\upsilon_{k}}(Q_{h}^{k})(s_{h}^{k})-\mathbb{D}_{\mu_{k}\times\upsilon_{k}}Q_{h}^{\mu_{k}\times\upsilon_{k}}(s_{h}^{k})
=𝔻μk×υk(QhkQhμk×υk)(shk)\displaystyle=\mathbb{D}_{\mu_{k}\times\upsilon_{k}}(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k})
\displaystyle=(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k},a_{h}^{k},b_{h}^{k})+\gamma_{h}^{k}
\displaystyle\leq(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})V_{h+1}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})+\mathbb{P}_{h}\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h}^{k},a_{h}^{k},b_{h}^{k})+\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\gamma_{h}^{k}
(Vh+1kVh+1μk×υk)(sh+1k)+αhk+γhk+2β(Nhk(shk,ahk,bhk),δ)\displaystyle\leq\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h+1}^{k})+\alpha_{h}^{k}+\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)
+(^hh)(Vh+1kVh+1)(shk,ahk,bhk),\displaystyle\hskip 200.0003pt+(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k},b_{h}^{k}),

where

αhk=h(Vh+1kVh+1μk×υk)(shk,ahk,bhk)(Vh+1kVh+1μk×υk)(sh+1k),\alpha_{h}^{k}=\mathbb{P}_{h}\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h}^{k},a_{h}^{k},b_{h}^{k})-\left(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}}\right)(s_{h+1}^{k}),
\gamma_{h}^{k}=\mathbb{D}_{\mu_{k}\times\upsilon_{k}}(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k})-(Q_{h}^{k}-Q_{h}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k},a_{h}^{k},b_{h}^{k}).

The last inequality holds because (\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})V_{h+1}^{*}(s_{h}^{k},a_{h}^{k},b_{h}^{k})\leq\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta) by the Chernoff-Hoeffding inequality.

Now, by an analysis similar to that in the previous section, define the event

7\displaystyle\mathcal{E}_{7} ={s,a,b,s,k,h,|^h(ss,a,b)h(ss,a,b)|\displaystyle=\left\{\ \forall s,a,b,s^{\prime},k,h,|\hat{\mathbb{P}}_{h}(s^{\prime}\mid s,a,b)-\mathbb{P}_{h}(s^{\prime}\mid s,a,b)|\right.
2h(ss,a,b)Nhk(s,a)log(2S2ABHKδ)+23Nhk(s,a)log(2S2ABHKδ)}.\displaystyle\hskip 50.00008pt\left.\leq\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s,a,b)}{N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}ABHK}{\delta}\right)}+\frac{2}{3N_{h}^{k}(s,a)}\log\left(\frac{2S^{2}ABHK}{\delta}\right)\right\}.

This event holds with probability at least 1-\delta by Bernstein's inequality. Then, for \iota=\log\left(\frac{2S^{2}ABHK}{\delta}\right),

(^hh)(Vh+1kVh+1)(shk,ahk,bhk)\displaystyle(\hat{\mathbb{P}}_{h}-\mathbb{P}_{h})(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k},b_{h}^{k})
s(2h(ss,a,b)Nhk(s,a)ι+23Nhk(s,a)ι)(Vh+1kVh+1)(s)\displaystyle\quad\leq\sum_{s^{\prime}}\left(\sqrt{\frac{2\mathbb{P}_{h}(s^{\prime}\mid s,a,b)}{N_{h}^{k}(s,a)}\iota}+\frac{2}{3N_{h}^{k}(s,a)}\iota\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
s(h(sshk,ahk)H+Hι2Nhk(shk,ahk)+2ι3Nhk(shk,ahk))(Vh+1kVh+1)(s)\displaystyle\quad\leq\sum_{s^{\prime}}\left(\frac{\mathbb{P}_{h}(s^{\prime}\mid s_{h}^{k},a_{h}^{k})}{H}+\frac{H\iota}{2N_{h}^{k}(s_{h}^{k},a_{h}^{k})}+\frac{2\iota}{3N_{h}^{k}(s_{h}^{k},a_{h}^{k})}\right)(V_{h+1}^{k}-V_{h+1}^{*})(s^{\prime})
1Hh(Vh+1kVh+1)(shk,ahk)+2H2SιNhk(shk,ahk)\displaystyle\quad\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{*})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}
1Hh(Vh+1kVh+1μk×υk)(shk,ahk)+2H2SιNhk(shk,ahk).\displaystyle\quad\leq\frac{1}{H}\mathbb{P}_{h}(V_{h+1}^{k}-V_{h+1}^{\mu_{k}\times\upsilon_{k}})(s_{h}^{k},a_{h}^{k})+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}.

The last inequality holds because \upsilon_{k} is the best response to \mu_{k}, so V_{h+1}^{*}(s^{\prime})\geq V_{h+1}^{\mu_{k}\times\upsilon_{k}}(s^{\prime}) for all s^{\prime}. Now we can get

\displaystyle V_{h}^{k}(s_{h}^{k})-V_{h}^{\mu_{k}\times\upsilon_{k}}(s_{h}^{k})
\quad\leq\left(1+\frac{1}{H}\right)\left(V_{h+1}^{k}(s_{h+1}^{k})-V_{h+1}^{\mu_{k}\times\upsilon_{k}}(s_{h+1}^{k})\right)+2\alpha_{h}^{k}+2\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}

and then

\displaystyle V_{1}^{k}(s_{1})-V_{1}^{\mu_{k}\times\upsilon_{k}}(s_{1}) \leq\sum_{h=1}^{H}\left(1+\frac{1}{H}\right)^{H}\left(2\alpha_{h}^{k}+2\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}\right)
\leq\sum_{h=1}^{H}e\left(2\alpha_{h}^{k}+2\gamma_{h}^{k}+2\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta)+\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}\right).

By the Azuma-Hoeffding inequality, with probability at least 1-2\delta, \sum_{k=1}^{K}\sum_{h=1}^{H}\alpha_{h}^{k}\leq O(\sqrt{H^{3}K}), \sum_{k=1}^{K}\sum_{h=1}^{H}\gamma_{h}^{k}=O(\sqrt{H^{3}K}) and

k=1Kh=1H2H2SιNhk(shk,ahk)=O(logK).\sum_{k=1}^{K}\sum_{h=1}^{H}\frac{2H^{2}S\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k})}=O(\log K).

Note that

k=1Kh=1Hβ(Nhk(shk,ahk,bhk),δ)\displaystyle\sum_{k=1}^{K}\sum_{h=1}^{H}\beta(N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}),\delta) =k=1Kh=1H7HιNhk(shk,ahk,bhk)\displaystyle=\sum_{k=1}^{K}\sum_{h=1}^{H}7H\sqrt{\frac{\iota}{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}}
O(Hι(h,s,a,b)[H]×𝒮×𝒜×Nhk(shk,ahk,bhk)\displaystyle\leq O(H\iota\sum_{(h,s,a,b)\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}}\sqrt{N_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k})}
𝒪~(H3SABT).\displaystyle\leq\widetilde{\mathcal{O}}(\sqrt{H^{3}SABT}).

For the violation, the argument is similar to that of Theorem 4.2. During the learning process, the event s_{h+1}^{k}\notin\Delta_{h}^{k}(s_{h}^{k},a_{h}^{k},b_{h}^{k}) can occur at most S^{2}ABH times, because each occurrence increases the summation

(h,s,a,b)[H]×𝒮×𝒜×|ΔhK(s,a,b)|\sum_{(h,s,a,b)\in[H]\times\mathcal{S}\times\mathcal{A}\times\mathcal{B}}|\Delta_{h}^{K}(s,a,b)|

by at least 1, and this summation is upper bounded by S^{2}ABH. Thus such events contribute at most S^{2}ABH^{2} violation. In all other situations, the total violation can be upper bounded by \widetilde{\mathcal{O}}(\sqrt{ST}). Thus the final violation bound is \widetilde{O}(S^{2}ABH^{2}+\sqrt{ST}). ∎

Appendix D Technical Lemmas

D.1 Lemma D.1

Lemma D.1.

If

Kε/2aKγ(K,δ)+bγ(K,δ)log(K+1),\displaystyle K\varepsilon/2\leq a\sqrt{K}\sqrt{\gamma(K,\delta)}+b\gamma(K,\delta)\log(K+1),

where γ(K,δ)=2log(SAHK/δ)+(S1)log(e(1+K/(S1)))\gamma(K,\delta)=2\log(SAHK/\delta)+(S-1)\log(e(1+K/(S-1))), then

K=𝒪~((a2ε2+bε)log(1δ)+Sbε),\displaystyle K=\widetilde{\mathcal{O}}\left(\left(\frac{a^{2}}{\varepsilon^{2}}+\frac{b}{\varepsilon}\right)\log\left(\frac{1}{\delta}\right)+\frac{Sb}{\varepsilon}\right),

where \widetilde{\mathcal{O}}(\cdot) ignores the terms \log S,\log A,\log H,\log(1/\varepsilon) and \log\log(1/\delta).

Proof.

If K\leq 4, the lemma is trivial. Now assume K\geq 4; then e(1+K)\leq 4K and

γ(K,δ)\displaystyle\gamma(K,\delta) 2log(SAHK/δ)+Slog(e(1+K))\displaystyle\leq 2\log(SAHK/\delta)+S\log(e(1+K))
2log(SAHK/δ)+Slog(4K)\displaystyle\leq 2\log(SAHK/\delta)+S\log(4K)
2Slog(4K)+log(SAH/δ).\displaystyle\leq 2S\log(4K)+\log(SAH/\delta).

Then

Kε/2\displaystyle K\varepsilon/2 aKγ(K,δ)+bγ(K,δ)log(K+1)\displaystyle\leq a\sqrt{K}\sqrt{\gamma(K,\delta)}+b\gamma(K,\delta)\log(K+1)
2max{aKγ(K,δ),bγ(K,δ)log(K+1)}.\displaystyle\leq 2\max\{a\sqrt{K}\sqrt{\gamma(K,\delta)},b\gamma(K,\delta)\log(K+1)\}.

Case 1:

If Kε/22aKγ(K,δ)K\varepsilon/2\leq 2a\sqrt{K}\sqrt{\gamma(K,\delta)}, we can get

K\displaystyle K 16a2ε2(2Slog(4K)+log(SAH/δ))\displaystyle\leq\frac{16a^{2}}{\varepsilon^{2}}(2S\log(4K)+\log(SAH/\delta))
2max{32Sa2ε2log(4K),16a2ε2log(SAH/δ)}.\displaystyle\leq 2\max\left\{\frac{32Sa^{2}}{\varepsilon^{2}}\log(4K),\frac{16a^{2}}{\varepsilon^{2}}\log(SAH/\delta)\right\}.

Subcase 1:

If K\leq\frac{64Sa^{2}}{\varepsilon^{2}}\log(4K), then by Lemma D.2 we can get K=\widetilde{\mathcal{O}}(\frac{Sa^{2}}{\varepsilon^{2}}).

Subcase 2:

If K\leq\frac{32a^{2}}{\varepsilon^{2}}\log(SAH/\delta), we complete the proof.

Case 2:

If Kε/22bγ(K,δ)log(K+1)K\varepsilon/2\leq 2b\gamma(K,\delta)\log(K+1), we can get

K\displaystyle K 4bεlog(4K)(2Slog(4K)+log(SAH/δ))\displaystyle\leq\frac{4b}{\varepsilon}\log(4K)(2S\log(4K)+\log(SAH/\delta))
2max{4bεlog(4K)2Slog(4K),4bεlog(4K)log(SAH/δ)}.\displaystyle\leq 2\max\left\{\frac{4b}{\varepsilon}\log(4K)2S\log(4K),\frac{4b}{\varepsilon}\log(4K)\log(SAH/\delta)\right\}.

Subcase 3:

If K\leq\frac{16bS}{\varepsilon}\log^{2}(4K), then by Lemma D.2 we can get K=\widetilde{\mathcal{O}}(\frac{bS}{\varepsilon}).

Subcase 4:

If K\leq\frac{8b}{\varepsilon}\log(4K)\log(SAH/\delta), then by Lemma D.2 we can get K=\widetilde{\mathcal{O}}(\frac{b}{\varepsilon}\log(1/\delta)).

Combining the four subcases completes the proof of Lemma D.1. ∎
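As a quick numerical sanity check of Lemma D.1 (illustrative only, not part of the proof), one can search for the largest K that still satisfies the inequality and compare it with the claimed order; the parameter values in the commented example are arbitrary placeholders, and S\geq 2 is assumed.

import math

def largest_feasible_K(a, b, eps, delta, S, A, H, K_max=10**8):
    # gamma(K, delta) as defined in Lemma D.1 (requires S >= 2)
    def gamma(K):
        return 2 * math.log(S * A * H * K / delta) + (S - 1) * math.log(math.e * (1 + K / (S - 1)))
    def holds(K):
        return K * eps / 2 <= a * math.sqrt(K * gamma(K)) + b * gamma(K) * math.log(K + 1)
    K = 1
    while K < K_max and holds(K):
        K *= 2
    return K // 2          # last power of two at which the inequality still held

# Example with placeholder values; compare against (a^2/eps^2 + b/eps)*log(1/delta) + S*b/eps.
# print(largest_feasible_K(a=1.0, b=1.0, eps=0.1, delta=0.005, S=10, A=4, H=5))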

D.2 Lemma D.2

Lemma D.2.

If KQlog2(4K)K\leq Q\log^{2}(4K) with Q4Q\geq 4, then we have K=𝒪~(Q)K=\widetilde{\mathcal{O}}(Q).

Proof.

Define f(x)=\frac{x}{\log^{2}(4x)}; then f^{\prime}(x)=\frac{\log^{2}(4x)-2\log(4x)}{\log^{4}(4x)}>0 whenever \log(4x)>2, so f(x) is increasing over x\geq 2 (and if K\leq 2 the lemma is trivial since Q\geq 4). Then we only need to prove that there exists a constant c_{2}>0 such that

f(c2Qlog2Q)Q.\displaystyle f(c_{2}Q\log^{2}Q)\geq Q.

In fact, if this inequality holds, any KK such that KQlog2(4K)K\leq Q\log^{2}(4K) should have f(K)Qf(c2Qlog2Q)f(K)\leq Q\leq f(c_{2}Q\log^{2}Q), and then Kc2Qlog2Q=𝒪~(Q)K\leq c_{2}Q\log^{2}Q=\widetilde{\mathcal{O}}(Q). To prove this inequality, observe that

Qlog2(4c2Qlog2Q)\displaystyle Q\log^{2}(4c_{2}Q\log^{2}Q) =Q(log4c2Q+loglog2Q)2\displaystyle=Q(\log 4c_{2}Q+\log\log^{2}Q)^{2}
2Qlog2(4c2Q)+2Q(loglog2Q)2\displaystyle\leq 2Q\log^{2}(4c_{2}Q)+2Q(\log\log^{2}Q)^{2}
4Qlog2(4c2)+4Qlog2(Q)+2Qlog2Q\displaystyle\leq 4Q\log^{2}(4c_{2})+4Q\log^{2}(Q)+2Q\log^{2}Q
(4log2(4c2)+6)Qlog2Q.\displaystyle\leq(4\log^{2}(4c_{2})+6)Q\log^{2}Q.

where the second inequality is because \log^{2}Q\leq Q for Q\geq 4, and the last inequality is because \log^{2}Q\geq 1 for Q\geq 4. So if we choose c_{2} to satisfy c_{2}\geq 4\log^{2}(4c_{2})+6, we have Q\log^{2}(4c_{2}Q\log^{2}Q)\leq c_{2}Q\log^{2}Q, which implies f(c_{2}Q\log^{2}Q)\geq Q. ∎

Appendix E Detailed Analysis of Assumption 4.1

In this section, we provide a more detailed analysis of the equivalence between s_{1}\notin\mathcal{U}_{1} and the existence of a feasible policy. Indeed, if the agent reaches some state s\in\mathcal{U}_{h} at step h, then every action may lead her to an unsafe next state s^{\prime}\in\mathcal{U}_{h+1} with positive probability. Recursively, every action sequence a_{h},a_{h+1},\cdots,a_{H-1} may lead the agent to unsafe states s_{h+1}\in\mathcal{U}_{h+1},\cdots,s_{H}\in\mathcal{U}_{H} with positive probability. As a result, the agent cannot be guaranteed to avoid unsafe states. Therefore, every feasible policy \pi must satisfy that, at any step h, the probability of being in a state s\in\mathcal{U}_{h} is zero under policy \pi:

hπ(s)=0,s𝒰h.\displaystyle\mathbb{P}_{h}^{\pi}(s)=0,\ \ \forall\ s\in\mathcal{U}_{h}. (21)

Recall that

Ahsafe(s)={a𝒜Δh(s,a)𝒰h+1=}.\displaystyle A_{h}^{safe}(s)=\{a\in\mathcal{A}\mid\Delta_{h}(s,a)\cap\mathcal{U}_{h+1}=\emptyset\}.

From these definitions, a feasible policy \pi must satisfy \pi_{h}(s)\in A_{h}^{safe}(s) for every safe state s\notin\mathcal{U}_{h}. Moreover, for any safe state s_{h}\notin\mathcal{U}_{h}, there exists at least one safe action a_{h}\in A_{h}^{safe}(s_{h}), and all possible next states s_{h+1}\in\Delta_{h}(s_{h},a_{h}) satisfy s_{h+1}\notin\mathcal{U}_{h+1}. Recursively, there exists at least one feasible action sequence a_{h}\in A_{h}^{safe}(s_{h}),\cdots,a_{H}\in A_{H}^{safe}(s_{H}) such that s_{h^{\prime}}\notin\mathcal{U}_{h^{\prime}} for all h+1\leq h^{\prime}\leq H. Hence, the assumption on the existence of a feasible policy can be simplified to s_{1}\notin\mathcal{U}_{1}.
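The argument above is constructive: once the sets \mathcal{U}_{h} and A_{h}^{safe}(s) have been computed backward, a feasible policy can be read off by picking any safe action at every safe state. A minimal single-player sketch, assuming the supports Delta[h][s][a] and the terminal unsafe set are given as Python sets with integer-indexed states, is as follows.

def extract_feasible_policy(Delta, U_H, s1, S, A, H):
    U = [set() for _ in range(H)]
    U[H - 1] = set(U_H)                    # unsafe states at the last step
    policy = [dict() for _ in range(H)]
    for h in range(H - 2, -1, -1):
        U[h] = set(U[h + 1])
        for s in range(S):
            safe = [a for a in range(A) if Delta[h][s][a].isdisjoint(U[h + 1])]
            if safe:
                policy[h][s] = safe[0]     # any safe action keeps the agent outside U_{h+1}
            else:
                U[h].add(s)                # no safe action: s becomes unsafe at step h
    return policy if s1 not in U[0] else None   # a feasible policy exists iff s1 is not in U_1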

Appendix F Experiment Setup

We run each algorithm for 20000 episodes in a grid environment (Chow et al., 2018; Wei et al., 2022) with 25 states and 4 actions (the four directions), and report the average reward and violation across runs. In the experiments, we note that the algorithm OptPess-bonus (Efroni et al., 2020) needs to solve a linear program of size O(SAH) in each episode, which is extremely slow. Thus we compare all algorithms in a relatively small environment. In the Safe-RL-SW experiments, we choose \delta=0.005 and a binary 0-1 cost function. We set the safety threshold to 0.5 for both the CMDP algorithms and the SUCBVI algorithm.
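For reference, a minimal sketch of a grid environment of this size with a binary 0-1 cost is given below; the layout, goal, and unsafe cells are hypothetical placeholders, since the exact instance used in the experiments is not fully specified here.

N = 5                                          # 5x5 grid -> 25 states
UNSAFE = {(1, 2), (2, 2), (3, 2)}              # hypothetical unsafe cells (cost 1)
GOAL = (4, 4)                                  # hypothetical goal cell
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
TAU = 0.5                                      # safety threshold

def step(state, action_idx):
    # One deterministic transition, clipping at the walls.
    r, c = state
    dr, dc = ACTIONS[action_idx]
    next_state = (min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1))
    reward = 1.0 if next_state == GOAL else 0.0
    cost = 1.0 if next_state in UNSAFE else 0.0    # binary 0-1 cost
    violation = max(cost - TAU, 0.0)               # step-wise violation (c(s) - tau)_+
    return next_state, reward, violation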

For the Safe-RFE-SW experiments, we compare our algorithm SRF-UCRL with the state-of-the-art RFE algorithm RF-UCRL (Kaufmann et al., 2021) in an MDP with 11 states and 5 actions. We choose \delta=0.005 and set the safety threshold of SRF-UCRL to 0.5. We run 500 episodes for 100 independent trials, and report the average reward in each episode and the cumulative step-wise violation. At each episode, we also calculate the expected violation of the output policy and plot the corresponding curves. All experiments are conducted on an AMD Ryzen 7 5800H with Radeon Graphics and 16GB of unified memory.