
Regret Guarantees for Model-Based Reinforcement Learning with Long-Term Average Constraints

Mridul Agarwal School of Electrical and Computer Engineering.
Purdue University
West Lafayette, Indiana, USA
Qinbo Bai School of Electrical and Computer Engineering.
Purdue University
West Lafayette, Indiana, USA
Vaneet Aggarwal School of Industrial Engineering.
Purdue University
West Lafayette, Indiana, USA
School of Electrical and Computer Engineering.
Purdue University
West Lafayette, Indiana, USA
Abstract

We consider the problem of constrained Markov Decision Processes (CMDP) where an agent interacts with an ergodic Markov Decision Process. At every interaction, the agent obtains a reward and incurs $K$ costs. The agent aims to maximize the long-term average reward while simultaneously keeping the $K$ long-term average costs lower than certain thresholds. In this paper, we propose CMDP-PSRL, a posterior sampling based algorithm using which the agent can learn optimal policies to interact with the CMDP. We show that with the assumption of slackness, characterized by $\kappa$, the optimization problem is feasible for the sampled MDPs. Further, for an MDP with $S$ states, $A$ actions, and mixing time $T_{M}$, we prove that following the CMDP-PSRL algorithm, the agent can bound the regret of not accumulating rewards from the optimal policy by $\tilde{O}(T_{M}S\sqrt{AT})$. Further, we show that the violation of any of the $K$ constraints is also bounded by $\tilde{O}(T_{M}S\sqrt{AT})$. To the best of our knowledge, this is the first work that obtains $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling method.

1 Introduction

Consider a wireless sensor network where the devices aim to update a server with sensor values. At time $t$, a device can choose to send a packet and obtain a reward of $1$ unit, or to queue the packet and obtain $0$ reward. However, communicating a packet results in a power consumption of $p_{t}$. At time $t$, if the wireless channel condition $s_{t}$ is weak and the device chooses to send a packet, the resulting instantaneous power consumption $p_{t}$ is high. The goal is to send as many packets as possible while keeping the average power consumption, $\sum_{t=1}^{T}p_{t}/T$, within some limit, say $C$. This environment has state $(s_{t},q_{t})$ consisting of the channel condition and the queue length at time $t$. To limit the power consumption, the agent may choose to send packets when the channel condition is good or when the queue length grows beyond a certain threshold. The agent aims to learn such policies in an online manner, which requires efficiently balancing exploration of the state space and exploitation of the estimated system dynamics [Singh et al., 2020].

Similar to the example above, many applications require keeping some costs low while simultaneously maximizing the rewards [Altman, 1999]. Owing to the importance of this problem, in this paper we consider the problem of constrained Markov Decision Processes (CMDP). We aim to develop a reinforcement learning algorithm following which an agent can bound both the constraint violations and the regret in obtaining the optimal reward by $o(T)$.

The problem setup where the system dynamics are known has been studied extensively [Altman, 1999]. For a constrained setup, the optimal policy is possibly stochastic [Altman, 1999, Puterman, 2014]. In the domain where the agent learns the system dynamics and aims to learn good policies online, there is work that shows asymptotic convergence to optimal policies [Gattami et al., 2021], or provides regret guarantees when the MDP is episodic [Zheng and Ratliff, 2020, Ding et al., 2021]. Recently, [Singh et al., 2020] considered the problem of online optimization of infinite-horizon communicating Markov Decision Processes with long-term average constraints. They provide an optimism based algorithm where confidence intervals are constructed on each transition probability $P(s^{\prime}|s,a)$. Using this, they obtain a regret bound of $\tilde{O}\left(\sqrt{SAT}+T_{M}T^{2/3}\right)$, where $\tilde{O}(\cdot)$ hides logarithmic terms. Additionally, finding the optimistic policy is a computationally intensive task, as the number of optimization variables becomes $S^{2}\times A$ for an MDP with $S$ states and $A$ actions.

In this paper, we consider reinforcement learning in an infinite-horizon ergodic MDP [Tarbouriech and Lazaric, 2019, Gattami et al., 2021] with long-term average constraints. We use $\ell_{1}$ deviation bounds [Jaksch et al., 2010] and a Bellman error analysis to bound the reward regret of the MDP as $\tilde{O}(T_{M}S\sqrt{AT})$. Additionally, we also bound the constraint violations as $\tilde{O}(T_{M}S\sqrt{AT})$. We propose a posterior sampling based algorithm, which achieves this regret bound, where we sample the transition dynamics from a Dirichlet distribution [Osband et al., 2013].

Unlike optimistic algorithms, the constrained optimization problem may be infeasible for the sampled MDP. Hence, we consider slackness characterized by Slater's parameter Ding et al. [2020], which allows us to prove that the optimization problem is feasible even for the sampled MDPs. Posterior sampling also reduces the number of optimization variables to only $S\times A$, since we only need to find the optimal policy for the sampled MDP. Finally, we provide numerical examples where the algorithm converges to the calculated optimal policies. To the best of our knowledge, this is the first work to obtain $O(\sqrt{T})$ regret guarantees for the infinite horizon long-term average constraint setup with posterior sampling.

2 Related Work

Stochastic optimization using Markov Decision Processes has very rich roots [Howard, 1960]. There has been work on understanding the convergence of algorithms that find optimal policies for known MDPs [Bertsekas and Tsitsiklis, 1996, Altman, 1999]. Also, when the MDP is not known, there are algorithms with asymptotic guarantees for learning the optimal policies [Watkins and Dayan, 1992] that maximize an objective without any constraints. Recent algorithms even achieve finite-time near-optimal regret bounds with respect to the number of interactions with the environment [Jaksch et al., 2010, Osband et al., 2013, Agrawal and Jia, 2017, Jin et al., 2018]. Jaksch et al. [2010] use the optimism principle for minimizing regret for weakly communicating infinite horizon MDPs with bounded diameter. Osband et al. [2013] extended the analysis of Jaksch et al. [2010] to posterior sampling for episodic MDPs and bounded the Bayesian regret, and Osband and Van Roy [2017] further improved the regret bounds. Agrawal and Jia [2017] use a posterior sampling based approach and obtain a frequentist regret bound for infinite horizon MDPs with bounded diameter.

In many reinforcement learning settings, the agent not only wants to maximize the rewards but must also satisfy certain cost constraints [Altman, 1999]. Early works in this area were pioneered by [Altman and Schwartz, 1991]. They provided an algorithm which combined forced exploration with following policies optimized on empirical estimates to obtain asymptotic convergence. [Borkar, 2005] studied the constrained RL problem using an actor-critic and a two time-scale framework [Borkar, 1997] to obtain asymptotic performance guarantees. Very recently, [Gattami et al., 2021] analyzed the asymptotic performance of Lagrangian based algorithms for infinite-horizon long-term average constraints.

Inspired by the finite-time performance analysis of reinforcement learning algorithms for unconstrained problems, there has been a significant thrust in understanding the finite-time performance of constrained MDP algorithms. [Zheng and Ratliff, 2020] considered an episodic CMDP and used an optimism based algorithm to bound the constraint violation as $\tilde{O}(\sqrt{T^{1.5}})$ with high probability. [Kalagarla et al., 2020] also considered the episodic setup and obtained PAC-style bounds for an optimism based algorithm. [Ding et al., 2021] considered the setup of episodic CMDPs with episode length $H$ and $d$-dimensional linear function approximation, and bounded the constraint violations as $\tilde{O}(d\sqrt{H^{5}T})$ by mixing the optimal policy with an exploration policy. [Efroni et al., 2020] proposed a linear-programming and primal-dual policy optimization algorithm to bound the regret as $O(S\sqrt{H^{3}T})$. [Qiu et al., 2020] proposed an algorithm which obtains a regret bound of $\tilde{O}(S\sqrt{AH^{2}T})$ for the problem of adversarial stochastic shortest path. Compared to these works, we focus on the setting with infinite horizon long-term average constraints.

Following a better understanding of policy gradient algorithms [Agarwal et al., 2020], there has also been theoretical work on model-free policy gradient algorithms for constrained MDPs and safe reinforcement learning. [Xu et al., 2020] consider an infinite horizon discounted setup with constraints and obtain global convergence using policy gradient algorithms. [Ding et al., 2020] also consider an infinite horizon discounted setup. They use a natural policy gradient to update the primal variable and sub-gradient descent to update the dual variable.

Recently, [Singh et al., 2020] considered the setup of infinite-horizon CMDPs with long-term average constraints and obtained a regret bound of $\tilde{O}(T^{2/3})$ using an optimism based algorithm and forced exploration. We consider a similar setting with an ergodic CMDP and propose a posterior sampling based algorithm to bound the regret as $\tilde{O}(\mathrm{poly}(DSA)\sqrt{T})$, using exploration assisted by the ergodicity of the MDP.

3 Problem Formulation

We consider an infinite horizon Markov decision process (MDP) $\mathcal{M}$, defined by the tuple $\left(\mathcal{S},\mathcal{A},P,r,c^{1},\cdots,c^{K}\right)$. $\mathcal{S}$ denotes a finite state space with $|\mathcal{S}|=S$, and $\mathcal{A}$ denotes a finite action set with $|\mathcal{A}|=A$. $P:\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$ denotes the probability $P(s^{\prime}|s,a)$ of transitioning to state $s^{\prime}$ from state $s$ after taking action $a$. $r:\mathcal{S}\times\mathcal{A}\to[0,1]$ denotes the average reward in state $s$ after taking action $a$. $c^{k}:\mathcal{S}\times\mathcal{A}\to[0,1]$ denotes the average cost incurred by the agent for constraint $k\in[K]=\{1,2,\cdots,K\}$ after taking action $a$ in state $s$. We use a stochastic policy $\pi:\mathcal{S}\to\Delta(\mathcal{A})$, such that given state $s$, $\pi(a|s)$ is the probability of selecting action $a$.

Note that a policy $\pi$ induces a Markov chain over the state space of the MDP. Pertaining to the Markov chains generated by the policies for $\mathcal{M}$, we now define the mixing time of the MDP.

Definition 1 (Mixing Time).

Consider the Markov chain induced by the policy $\pi$ on the MDP $\mathcal{M}$. Let $T_{s\to s^{\prime}}^{\pi}$ be a random variable that denotes the first time step at which this Markov chain enters state $s^{\prime}$ starting from state $s$. Then, the mixing time of the MDP $\mathcal{M}$ is defined as:

$T_{M}=\max_{s^{\prime}\neq s}\max_{\pi}\mathbb{E}\left[T_{s\to s^{\prime}}^{\pi}\right]$ (1)
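As an illustration, for a fixed policy the expected hitting times $\mathbb{E}[T_{s\to s^{\prime}}^{\pi}]$ can be computed exactly from the induced transition matrix by solving a linear system; the following sketch computes $\max_{s\neq s^{\prime}}\mathbb{E}[T_{s\to s^{\prime}}^{\pi}]$ for one policy (the 3-state chain P_pi below is hypothetical), so that $T_M$ is the maximum of this quantity over policies.

import numpy as np

def expected_hitting_times(P_pi):
    """For a Markov chain with transition matrix P_pi (S x S), return
    H[s, sp] = E[first time the chain started in s enters sp]."""
    S = P_pi.shape[0]
    H = np.zeros((S, S))
    for target in range(S):
        keep = [s for s in range(S) if s != target]
        Q = P_pi[np.ix_(keep, keep)]            # chain restricted to non-target states
        h = np.linalg.solve(np.eye(S - 1) - Q, np.ones(S - 1))   # (I - Q) h = 1
        H[keep, target] = h
    return H

# hypothetical 3-state chain induced by some policy pi
P_pi = np.array([[0.5, 0.3, 0.2],
                 [0.2, 0.6, 0.2],
                 [0.3, 0.3, 0.4]])
print(expected_hitting_times(P_pi).max())       # max over s != s' for this policy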

Similar to Singh et al. [2020], let $P_{\pi}^{t}(s)$ be the $t$-step state distribution starting from state $s$ following policy $\pi$, and let $P_{\pi}$ be the steady-state distribution generated by policy $\pi$.

Our first assumption on the MDP allows any policy to reach any state $s^{\prime}$ starting from any state $s$, and to converge to a steady state. We formalize this in the following assumption:

Assumption 1.

The MDP $\mathcal{M}$ is ergodic, i.e., $\|P^{t}_{\pi}(s)-P_{\pi}\|_{TV}\leq C\rho^{t}$, where $P_{\pi}$ is the long-term steady state distribution induced by policy $\pi$, and $C>0$ and $\rho<1$ are problem specific constants. Further, the mixing time of the MDP $\mathcal{M}$ is finite, i.e., $T_{M}<\infty$.

After discussing the transition dynamics of the system, we move to the rewards and costs of the MDP $\mathcal{M}$.

Assumption 2.

The reward function $r(s,a)$ and the costs $c^{k}(s,a),\ k\in[K]$, are known to the agent.

We note that in most of the problems, rewards are engineered. Hence, Assumption 2 is justified in many setups. However, the system dynamics are stochastic and typically not known.

Following a policy $\pi$, the expected long-term average costs are given by $\zeta_{\pi}^{P,k}$, $k\in[K]$. Also, we denote the long-term average reward by $\lambda_{\pi}^{P,r}$. Formally, we have:

$\zeta_{\pi}^{P,k} =\mathbb{E}_{s_{0},a_{0},s_{1},a_{1},\cdots}\left[\lim_{\tau\to\infty}\frac{1}{\tau}\sum_{t=0}^{\tau}c^{k}\left(s_{t},a_{t}\right)\right]$ (2)
$\lambda_{\pi}^{P,r} =\mathbb{E}_{s_{0},a_{0},s_{1},a_{1},\cdots}\left[\lim_{\tau\to\infty}\frac{1}{\tau}\sum_{t=0}^{\tau}r\left(s_{t},a_{t}\right)\right]$ (3)
$s_{0} \sim\rho_{0}(s_{0}),\ a_{t}\sim\pi(a_{t}|s_{t}),\ s_{t+1}\sim P(s_{t+1}|s_{t},a_{t})$

For brevity, in the rest of the paper, $\mathbb{E}_{s_{t},a_{t},s_{t+1};t\geq 0}[\cdot]$ will be denoted as $\mathbb{E}_{\rho,\pi,P}[\cdot]$, where $s_{0}\sim\rho_{0}(s_{0}),\ a_{t}\sim\pi(a_{t}|s_{t}),\ s_{t+1}\sim P(s_{t+1}|s_{t},a_{t})$. Both $\zeta_{\pi}^{P,k}$ and $\lambda_{\pi}^{P,r}$ satisfy the following form of the Bellman equation:

$\lambda_{\pi}^{P,r}+h_{\pi}^{P,r}(s) =\sum_{a}\pi(a|s)r(s,a)+\sum_{s^{\prime}}\sum_{a}\pi(a|s)P(s^{\prime}|s,a)h_{\pi}^{P,r}(s^{\prime})$ (4)
$\zeta_{\pi}^{P,k}+h_{\pi}^{P,k}(s) =\sum_{a}\pi(a|s)c^{k}(s,a)+\sum_{s^{\prime}}\sum_{a}\pi(a|s)P(s^{\prime}|s,a)h_{\pi}^{P,k}(s^{\prime})$ (5)

where $h_{\pi}^{P,r}(s)$ is the bias for the reward and $h_{\pi}^{P,k}(s)$ is the bias for cost $k\in[K]$.
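For known dynamics and a fixed policy, the long-term averages in Equations (2)-(3) can be evaluated numerically from the stationary distribution of the induced chain; a minimal sketch of this evaluation follows (the arrays P, pi, r, and costs are assumed inputs, not part of the learning algorithm).

import numpy as np

def average_reward_and_costs(P, pi, r, costs):
    """P: (S, A, S) transitions, pi: (S, A) policy, r: (S, A) rewards,
    costs: list of (S, A) cost matrices. Returns (lambda_pi, [zeta_pi^k])."""
    P_pi = np.einsum('sa,sap->sp', pi, P)            # induced chain over states
    # stationary distribution: left eigenvector of P_pi for eigenvalue 1
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu = mu / mu.sum()
    d = mu[:, None] * pi                             # steady-state state-action distribution
    lam = float((d * r).sum())
    zetas = [float((d * c).sum()) for c in costs]
    return lam, zetas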

The objective is to find a policy $\pi^{*}$ that solves the following optimization problem:

$\max_{\pi}\ \lambda_{\pi}^{P,r}\quad\text{s.t.}$ (7)
$\zeta_{\pi}^{P,k}\leq C_{k}\ \ \forall\ k\in[K]$ (8)

where $C_{k},\ k\in[K]$, are the bounds on the average costs which the agent needs to satisfy.

After formulating the optimization problem, we now state our next assumption characterizing the slackness.

Assumption 3.

There exists a policy $\pi$ and a constant $\kappa\geq 2ST_{M}\sqrt{14A\log(AT)/\sqrt{T}}+CST_{M}/((1-\rho)\sqrt{T})$ such that

$\zeta_{\pi}^{P,k}\leq C_{k}-\kappa$ (9)

The slackness assumption is mild because, in various applications, some a priori knowledge about a strictly feasible policy is available. Hence, this assumption is a standard assumption in the constrained RL literature Efroni et al. [2020], Ding et al. [2021, 2020]. $\kappa$ is referred to as Slater's constant. Ding et al. [2021] assume that Slater's constant $\kappa$ is known.

Any online algorithm starting with no prior knowledge needs to obtain estimates of the transition probabilities $P$ and observe the reward $r$ and costs $c^{k},\ \forall\ k\in[K]$, for each state-action pair. Initially, when the algorithm does not have good estimates of the model, it accumulates regret and also violates constraints, as it does not know the optimal policy. We define the reward regret $R(T)$ as the difference between the expected reward from running the optimal policy $\pi^{*}$ for $T$ steps and the cumulative reward obtained, or

$R(T) =T\lambda_{\pi^{*}}^{P,r}-\sum_{t=1}^{T}r(s_{t},a_{t})$ (10)

Additionally, we define the constraint regret $R^{k}(T)$ for each constraint $k\in[K]$ as the gap between the cumulative cost incurred and the constraint bound, or

$R^{k}(T) =\left(\sum_{t=1}^{T}c^{k}(s_{t},a_{t})-TC_{k}\right)_{+},$ (11)

where $(x)_{+}=\max(0,x)$.
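Given a logged trajectory, the regret quantities in Equations (10)-(11) are straightforward cumulative sums; a minimal sketch, assuming the optimal average reward lambda_star has been computed offline and arrays rewards (length T) and costs (K x T) are available:

import numpy as np

def regrets(rewards, costs, lambda_star, C):
    """rewards: (T,), costs: (K, T), C: (K,) thresholds."""
    T = len(rewards)
    reward_regret = T * lambda_star - rewards.sum()                              # Equation (10)
    constraint_regret = np.maximum(costs.sum(axis=1) - T * np.asarray(C), 0.0)   # Equation (11)
    return reward_regret, constraint_regret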

In the following section, we present a model-based algorithm to obtain this policy $\pi^{*}$, and bound the reward regret and the constraint regret accumulated by the algorithm.

4 The CMDP-PSRL Algorithm

For infinite horizon optimization problems (i.e., $\tau\to\infty$), we can use the steady state distribution of the states to obtain the expected long-term rewards or costs [Puterman, 2014]. We use

$\zeta_{\pi}^{P,k} =\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}c^{k}(s,a)d_{\pi}^{P}(s,a),\ \ \forall\ k\in[K]$ (12)
$\lambda_{\pi}^{P,r} =\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a)d_{\pi}^{P}(s,a)$ (13)

where $d_{\pi}^{P}(s,a)$ is the steady state joint distribution of states and actions under policy $\pi$.

Based on the above formulation, we can solve the joint optimization problem of the following form

$\max_{d(s,a)}\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}r(s,a)d(s,a)$ (14)

with the following set of constraints,

$\sum_{a\in\mathcal{A}}d(s^{\prime},a) =\sum_{s\in\mathcal{S},a\in\mathcal{A}}P(s^{\prime}|s,a)d(s,a)$ (15)
$\sum_{s\in\mathcal{S},a\in\mathcal{A}}d(s,a)=1,\ \ d(s,a)\geq 0$ (16)
$\sum_{s\in\mathcal{S}}\sum_{a\in\mathcal{A}}c^{k}(s,a)d(s,a) \leq C_{k}\ \ \forall\ k\in[K]$ (17)

for all $s^{\prime}\in\mathcal{S}$, $s\in\mathcal{S}$, and $a\in\mathcal{A}$. Equation (15) denotes the constraint on the transition structure of the underlying Markov process. Equation (16) ensures that the solution is a valid probability distribution. Finally, Equation (17) contains the constraints of the constrained MDP setup which the policy must satisfy.

Note that the objective in Equation (14) is linear and the constraints in Equations (15)-(17) are linear, so this is a linear programming problem. Since linear programs can be solved in polynomial time [Potra and Wright, 2000], we can use standard approaches to solve Equation (14). After solving the optimization problem, we obtain the optimal policy from the obtained steady state distribution $d^{*}(s,a)$ as,

$\pi^{*}(a|s)=\frac{\mathbb{P}(s,a)}{\mathbb{P}(s)}=\frac{d^{*}(s,a)}{\sum_{b\in\mathcal{A}}d^{*}(s,b)}\ \ \ \forall\ s\in\mathcal{S}$ (18)

Since we assumed that the CMDP is ergodic, the Markov chain induced by any policy $\pi$ is ergodic. Hence, every state is reachable following the policy $\pi^{*}$, so we have $\mathbb{P}(s)>0$ and Equation (18) is well defined for all states $s\in\mathcal{S}$.
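A minimal sketch of the occupancy-measure LP in Equations (14)-(17) and the policy recovery in Equation (18), written with scipy.optimize.linprog; the arrays P, r, c (with c of shape (K, S, A)) and the thresholds C are assumed inputs, and this is only one possible implementation.

import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, r, c, C):
    """P: (S, A, S), r: (S, A), c: (K, S, A), C: (K,).
    Returns an optimal occupancy measure d (S, A) and policy pi (S, A)."""
    S, A = r.shape
    n = S * A
    # Flow-conservation constraints (15): one row per next state s'
    A_eq = np.zeros((S + 1, n))
    for sp in range(S):
        for s in range(S):
            for a in range(A):
                A_eq[sp, s * A + a] -= P[s, a, sp]
        for a in range(A):
            A_eq[sp, sp * A + a] += 1.0
    A_eq[S, :] = 1.0                          # normalization constraint (16)
    b_eq = np.zeros(S + 1); b_eq[S] = 1.0
    A_ub = c.reshape(len(C), n)               # cost constraints (17)
    res = linprog(-r.reshape(n), A_ub=A_ub, b_ub=np.asarray(C),
                  A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * n)
    d = res.x.reshape(S, A)
    pi = d / np.maximum(d.sum(axis=1, keepdims=True), 1e-12)   # Equation (18)
    return d, pi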

Further, since we assumed that the induced Markov chain is irreducible for all stationary policies, we use a Dirichlet distribution as the prior for the state transition probabilities $P(s^{\prime}|s,a)$. The Dirichlet distribution is a standard prior in the literature [Agrawal and Jia, 2017, Osband et al., 2013]. Further, a steady state distribution exists when the transition probabilities are sampled from a Dirichlet distribution [Agarwal et al., 2022, Proposition 1].

The complete constrained posterior sampling based algorithm, which we name CMDP-PSRL, is described in Algorithm 1. The algorithm proceeds in epochs, and a new epoch is started whenever the visitation count in epoch $e$, $\nu_{e}(s,a)$, is at least the total number of visitations before epoch $e$, $N_{e}(s,a)$, for any state-action pair (Line 8). In Line 9, we sample transition probabilities $\tilde{P}_{e}$ using the updated posterior, and in Line 10, we update the policy using the optimization problem specified in Equations (14)-(17) for $P=\tilde{P}_{e}$. Further, if the sampled MDP does not satisfy the cost constraints in Equation (17), we ignore those constraints for that epoch; we will show in the analysis that the cumulative constraint violations are still bounded.

Algorithm 1 CMDP-PSRL
1:Input: $\mathcal{S},\mathcal{A},r,c^{1},\cdots,c^{K}$
2:Initialize $N(s,a,s^{\prime})=1\ \forall(s,a,s^{\prime})\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}$, $\pi_{e}(a|s)=\frac{1}{|\mathcal{A}|}\ \forall\ (a,s)\in\mathcal{A}\times\mathcal{S}$, $e=0$, $\nu_{e}(s,a)=N_{e}(s,a)=0\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}$
3:for time index $t=1,2,\cdots$ do
4:    Observe state $s$
5:    Play action $a\sim\pi_{e}(\cdot|s)$
6:    Observe the reward $r(s,a)$, the costs $c^{k}(s,a)$, and the next state $s^{\prime}$
7:    $\nu_{e}(s,a)\mathrel{+}=1$, $N(s,a,s^{\prime})\mathrel{+}=1$
8:    if $\nu_{e}(s,a)\geq\max(1,N_{e}(s,a))$ for any $s,a$ then
9:         $\tilde{P}_{e}(\cdot|s,a)\sim Dir(N(s,a,\cdot))\ \forall\ (s,a)$
10:         Solve for the steady state distribution $d(s,a)$ as the solution of the optimization problem in Equations (14)-(17) for $\tilde{P}_{e}$.
11:         Obtain the optimal policy for the next epoch, $e+1$, as
$\pi_{e+1}(a|s)=\frac{d(s,a)}{\sum_{a^{\prime}\in\mathcal{A}}d(s,a^{\prime})}$
12:         $e=e+1$
13:         $t_{e}=t$
14:         $\nu_{e}(s,a)=0,\ N_{e}(s,a)=\sum_{e^{\prime}<e}\nu_{e^{\prime}}(s,a)\ \forall(s,a)$
15:    end if
16:end for
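A compact illustrative sketch of Algorithm 1, assuming a hypothetical environment object env with reset() and step(s, a) methods and the solve_cmdp_lp helper sketched above; this is only one possible implementation of the epoch-doubling and posterior-sampling steps (Lines 8-14), not the exact code used in our experiments.

import numpy as np

def cmdp_psrl(env, S, A, r, c, C, T, seed=0):
    rng = np.random.default_rng(seed)
    N = np.ones((S, A, S))                        # Dirichlet prior/posterior counts
    nu = np.zeros((S, A)); N_epoch = np.zeros((S, A))
    pi = np.full((S, A), 1.0 / A)                 # initial uniform policy
    s = env.reset()
    for t in range(T):
        a = rng.choice(A, p=pi[s])
        s_next = env.step(s, a)
        nu[s, a] += 1.0; N[s, a, s_next] += 1.0
        if nu[s, a] >= max(1.0, N_epoch[s, a]):   # start a new epoch (Line 8)
            P_tilde = np.array([[rng.dirichlet(N[x, b]) for b in range(A)]
                                for x in range(S)])          # posterior sample (Line 9)
            try:
                _, pi = solve_cmdp_lp(P_tilde, r, c, C)      # Lines 10-11
            except Exception:
                # sampled MDP infeasible: ignore constraint (17) for this epoch
                _, pi = solve_cmdp_lp(P_tilde, r, np.zeros_like(c), np.ones(len(C)))
            N_epoch += nu; nu[:] = 0.0            # Line 14
        s = s_next
    return pi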

5 Analysis

We first establish the feasibility of the optimization problem in Equations (14)-(17) for the sampled MDP. We note that we assumed slackness for the true MDP with transition probabilities $P$. Hence, if the sampled MDP is close to the true MDP, the deviation in the costs is small and there is a policy which satisfies the constraints in Equation (17). We formalize this intuition in the following result.

Lemma 1.

Following Algorithm 1, if $t_{e+1}-t_{e}\geq\sqrt{T}$ and $\|\tilde{P}_{e}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(2At)}{N_{e}(s,a)}}\ \forall\ s,a$, then there exists a policy $\pi$ which satisfies,

$\zeta_{\pi}^{\tilde{P}_{e},k}\leq C_{k}\ \forall\ k\in[K],$ (19)

and the optimization problem in Equations (14)-(17) is feasible, where $t_{e}$ is the start time of epoch $e$.

Proof Outline.

We consider the policy $\pi$ which satisfies Slater's condition in Equation (9). We then consider the Bellman error of taking one step in the MDP with transition probabilities $\tilde{P}_{e}$ and then following policy $\pi$ on the MDP with transition probabilities $P$. Now, using [Agarwal et al., 2022, Lemma 1], which relates the average costs of policy $\pi$ under $P$ and under $\tilde{P}_{e}$ ($\zeta_{\pi}^{P,k}$ and $\zeta_{\pi}^{\tilde{P}_{e},k}$ for all $k\in[K]$, respectively) through the Bellman error, gives the required result. The complete proof is provided in the supplementary material. ∎

After obtaining a feasible policy $\pi_{e}$ maximizing the reward for the sampled MDP, we now quantify its regret. We note that when optimizing for long-term average rewards under long-term average constraints, we want to simultaneously minimize the reward regret and the constraint regrets. Further, even if we knew the optimal policy $\pi^{*}$ beforehand, the deviations resulting from the stochasticity of the process could still result in some constraint violations. Also, since we sample an MDP, a policy which is feasible for the sampled MDP may violate constraints on the true MDP. We want to bound this gap between the $K$ costs for the two MDPs as well.

We aim to quantify the regret from (R.1) the deviation of the long-term average reward of the optimal policy because of incorrect knowledge of the MDP ($\lambda_{\pi^{*}}^{P,r}-\lambda_{\pi_{e}}^{\tilde{P}_{e},r}$), (R.2) the deviation between the long-term average reward generated by the optimal policy of the sampled MDP on the sampled MDP and the long-term average reward generated by that policy on the true MDP ($\lambda_{\pi_{e}}^{\tilde{P}_{e},r}-\lambda_{\pi_{e}}^{P,r}$), and (R.3) the deviation of the obtained rewards from the expected rewards of following the optimal policy of the sampled MDP ($\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})$).

Similarly, the constraint violations for each $k\in[K]$ are incurred from (C.1) the deviation of the long-term average cost of the policy for the sampled MDP from the threshold ($C_{k}-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}$), (C.2) the deviation between the long-term average cost generated by the optimal policy of the sampled MDP on the sampled MDP and the long-term average cost generated by that policy on the true MDP ($\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-\zeta_{\pi_{e}}^{P,k}$), and (C.3) the deviation of the incurred costs from the expected costs of following the optimal policy of the sampled MDP ($\zeta_{\pi_{e}}^{P,k}-c^{k}(s_{t},a_{t})$).

We now prove the regret bounds for Algorithm 1. We first give the high level ideas used in obtaining the bounds on the regret. We divide the regret into the regret incurred in each epoch $e$. Then, we use the posterior sampling lemma [Osband et al., 2013, Lemma 1] to obtain the equivalence between the long-term average reward of the optimal policy for the true MDP $\mathcal{M}$ and the optimal long-term average reward of the sampled MDP $\widehat{\mathcal{M}}$. This step allows us to deal with the regret from (R.1). Then we use the Bellman error formulation to relate the average rewards of the policy $\pi_{e}$ under $P$ and under $\tilde{P}_{e}$ [Agarwal et al., 2022]. Combining this with Azuma's concentration inequality for martingales allows us to bound the regret from (R.2) and (R.3).

Bounding the constraint violations requires similar considerations for (C.2) and (C.3). Further, (C.1) becomes zero if Equation (17) is feasible for the sampled MDP. However, if Equation (17) is not feasible, the cost may be as high as $1$ ($c^{k}(s,a)\leq 1\ \forall\ k\in[K]$). We bound these violations by bounding the number of time steps for which the optimal policy of the unconstrained optimization runs.

To obtain bounds on the regret, we first note that the total number of epochs $E$ for which Algorithm 1 runs is bounded by $O(1+2SA+SA\log(T))$ from [Jaksch et al., 2010, Proposition 1].

We formally state the regret bounds and constraint violation bounds in Theorem 1, which we prove rigorously in the supplementary material.

Theorem 1.

The expected reward regret $\mathbb{E}\left[R(T)\right]$ and the expected constraint regrets $\mathbb{E}\left[R^{k}(T)\right]\ \forall\ k\in[K]$ of Algorithm 1 are bounded as

$\mathbb{E}\left[R(T)\right] \leq O\left(T_{M}S\sqrt{AT\log(AT)}+\frac{CS^{2}A\log T}{1-\rho}\right)$
$\mathbb{E}\left[R^{k}(T)\right] \leq O\left(T_{M}S\sqrt{AT\log(AT)}+\frac{CS^{2}A\log T}{1-\rho}\right)$
Proof Outline.

We break the cumulative regret into the regret incurred in each epoch $e$. This gives us:

$\mathbb{E}\left[R(T)\right] =\mathbb{E}\left[\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi^{*}}^{P,r}-r(s_{t},a_{t})\right)\right]$ (20)
$=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi^{*}}^{P,r}-r(s_{t},a_{t})\right)\right]$ (21)
$=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e},r}-r(s_{t},a_{t})\right)\right]$ (22)
$=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e},r}-\lambda_{\pi_{e}}^{P,r}+\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\right]$
$=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{\tilde{P}_{e},r}-\lambda_{\pi_{e}}^{P,r}\right)\right]+\mathbb{E}\left[\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\right]$ (23)

Equation (22) follows from applying [Osband et al., 2013, Lemma 1] to the regret in each epoch in Equation (21). Proceeding from Equation (23) requires additional consideration. Typical proof techniques for bounding the regret require a bounded bias-span ($\max_{s,s^{\prime}}(h_{\pi}^{\tilde{P}_{e},r}(s)-h_{\pi}^{\tilde{P}_{e},r}(s^{\prime}))$), which may be large for the sampled MDP. For this, we consider an MDP whose transition probabilities $P_{e}^{r}$ satisfy

$\lambda_{\pi_{e}}^{P_{e}^{r},r} \geq\max_{P^{\prime}\in\mathcal{P}_{t_{e}}}\lambda_{\pi_{e}}^{P^{\prime},r},\ \text{where}$ (24)
$\mathcal{P}_{t_{e}} =\Big{\{}P^{\prime}:\|P^{\prime}(\cdot|s,a)-\bar{P}_{t_{e}}(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(AT)}{N_{e}(s,a)}}\Big{\}}\ \forall\ s,a$

where $\bar{P}_{t_{e}}(\cdot|s,a)$ is the estimated transition probability for the pair $(s,a)$ at time $t_{e}$. We now have,

$R(T) \leq\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P_{e}^{r},r}-\lambda_{\pi_{e}}^{P,r}\right)\right]+\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\right]$ (25)

The first term of Equation (25) is bounded by bounding the expected Bellman error. The second term is converted to a martingale difference sequence by conditioning it on the state $s_{t_{e}}$, and is bounded using the ergodicity of the MDP $\mathcal{M}$ and Azuma's concentration inequality. The complete proof of the regret bound is provided in the supplementary material.

Regarding the constraint violations, for each $k\in[K]$, we want to bound,

$\mathbb{E}\left[R^{k}(T)\right] =\mathbb{E}\left[\left(\sum_{t=1}^{T}c^{k}(s_{t},a_{t})-TC_{k}\right)_{+}\right]$ (26)

We divide the constraint violation regret into regret over epochs as well. Now, for each epoch, we know that the constraint is satisfied by the policy for the sampled MDP. This allows us to obtain:

$\mathbb{E}\left[R^{k}(T)\right] =\mathbb{E}\left[\left(\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c^{k}(s_{t},a_{t})-C_{k}\right)\right)_{+}\right]$ (27)
$=\mathbb{E}\Big{[}\Big{(}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\Big{(}\left(c^{k}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P,k}\right)+\left(\zeta_{\pi_{e}}^{P,k}-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\right)+\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\Big{)}\Big{)}_{+}\Big{]}$ (28)
$\leq\mathbb{E}\Big{[}\Big{(}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c^{k}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P,k}\right)\Big{)}_{+}+\Big{(}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{P,k}-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\right)\Big{)}_{+}+\Big{(}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\Big{)}_{+}\Big{]}$ (29)
$\leq\mathbb{E}\Big{[}\Big{|}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(c^{k}(s_{t},a_{t})-\zeta_{\pi_{e}}^{P,k}\right)\Big{|}+\Big{|}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{P,k}-\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\right)\Big{|}+\Big{(}\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\Big{)}_{+}\Big{]}$ (30)

The first term in Equation (28) denotes the difference between the incurred costs and the expected costs from following policy $\pi_{e}$. The second term denotes the difference between the expected costs of policy $\pi_{e}$ on the true MDP and on the sampled MDP. The third term denotes the violation of the policy $\pi_{e}$, which would be zero if the policy $\pi_{e}$ satisfies the constraint in Equation (17) for the sampled MDP. Equation (29) is obtained from the fact that $\max(0,x+y)\leq\max(0,x)+\max(0,y)$, and Equation (30) is obtained from the fact that $\max(0,x)\leq|x|$.

The first and second terms in Equation (30) are bounded similarly to Equation (23), so we focus our attention on the third term. If the optimization problem in Equations (14)-(17) is feasible, then $(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k})\leq 0$, and if it is infeasible, the term is upper bounded by $1$ since $C_{k}\geq 0$ and $\zeta_{\pi_{e}}^{\tilde{P}_{e},k}\leq 1$. Hence, we get:

$\left(\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\right)_{+} \leq\sum_{e}\left(\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\right)_{+}$ (31)
$=\sum_{e}\left(\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\right)_{+}\bm{1}\left\{t_{e+1}-t_{e}>\sqrt{T}\right\}+\sum_{e}\left(\sum_{t=t_{e}}^{t_{e+1}-1}\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\right)_{+}\bm{1}\left\{t_{e+1}-t_{e}\leq\sqrt{T}\right\}$ (32)
$\leq\sum_{e}\sum_{t=t_{e}}^{t_{e+1}-1}\bm{1}\left\{t_{e+1}-t_{e}\leq\sqrt{T}\right\}$ (33)
$\leq\sum_{e}\sqrt{T}=E\sqrt{T}$ (34)
$\leq(1+2SA+SA\log_{2}(T/SA))\sqrt{T}$ (35)

where Equation (31) follows from the fact that the total violation is at most the sum of the per-epoch violations, Equation (33) follows from Lemma 1 since $\left(\zeta_{\pi_{e}}^{\tilde{P}_{e},k}-C_{k}\right)\leq 0$ when $t_{e+1}-t_{e}>\sqrt{T}$, and Equation (35) follows from [Jaksch et al., 2010, Proposition 1]. ∎

We note that for the fundamental setup of unconstrained optimization ($K=0$), our bound is looser than that of the UCRL2 algorithm Jaksch et al. [2010]. This is because we use a stochastic policy instead of a deterministic policy. Recall that the optimal policy for the CMDP setup is possibly stochastic Altman [1999].

6 Evaluation of the Proposed Algorithm

Figure 1: Reward and regret performance of the proposed CMDP-PSRL algorithm on a flow and service control problem for a single queue ((a) reward growth w.r.t. time, (b) regret w.r.t. time). The algorithm is compared against the optimistic algorithm of Singh et al. [2020], compared to which our algorithm performs extremely well.
Figure 2: Constraint violation performance of the proposed CMDP-PSRL algorithm on a flow and service control problem for a single queue ((a) service constraints w.r.t. time, (b) flow constraints w.r.t. time). The average constraint violations become zero as the algorithm proceeds; however, they never cross zero to increase the reward further.

To validate the performance of the proposed CMDP-PSRL algorithm and the understanding from our analysis, we run simulations on the flow and service control problem for a single-server queue, introduced in [Altman and Schwartz, 1991]. A discrete-time single-server queue with a buffer of finite size $L$ is considered. The number of customers waiting in the queue is the state of the problem, and thus $|\mathcal{S}|=L+1$. Two kinds of actions, service and flow, are considered, and they control the number of customers together. The action space for service is a finite subset $A$ of $[a_{min},a_{max}]$, where $0<a_{min}\leq a_{max}<1$. Given a specific service action $a$, the service of a customer is successfully completed with probability $a$. If the service is successful, the length of the queue reduces by $1$. Similarly, the space for flow is a finite subset $B$ of $[b_{min},b_{max}]$. In contrast to the service action, a flow action $b$ increases the queue length by $1$ with probability $b$. Also, we assume that no customer arrives when the queue is full. The overall action space is the Cartesian product of $A$ and $B$. From the service and flow probabilities, the transition probabilities can be computed and are given in Table 1.

Table 1: Transition probabilities of the queue system
Current state | $P(x_{t+1}=x_{t}-1)$ | $P(x_{t+1}=x_{t})$ | $P(x_{t+1}=x_{t}+1)$
$1\leq x_{t}\leq L-1$ | $a(1-b)$ | $ab+(1-a)(1-b)$ | $(1-a)b$
$x_{t}=L$ | $a$ | $1-a$ | $0$
$x_{t}=0$ | $0$ | $1-b(1-a)$ | $b(1-a)$

With the reward function $r(s,a,b)$, the constraint functions for service and flow $c^{1}(s,a,b)$ and $c^{2}(s,a,b)$, respectively, and stationary policies for service and flow $\pi_{a}$ and $\pi_{b}$, respectively, the problem can be defined as

$\max_{\pi_{a},\pi_{b}}\ \lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}r(s_{t},\pi_{a}(s_{t}),\pi_{b}(s_{t}))\quad\text{s.t.}\quad\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}c^{1}(s_{t},\pi_{a}(s_{t}),\pi_{b}(s_{t}))\geq 0,\quad\lim_{T\rightarrow\infty}\frac{1}{T}\sum_{t=1}^{T}c^{2}(s_{t},\pi_{a}(s_{t}),\pi_{b}(s_{t}))\geq 0$ (36)

Following the discussion in [Altman and Schwartz, 1991], we define the reward function as $r(s,a,b)=5-s$, a decreasing function that depends only on the state. It is reasonable to give a higher reward when the number of customers waiting in the queue is small. For the constraint functions, we define $c^{1}(s,a,b)=-10a+6$ and $c^{2}(s,a,b)=-8(1-b)^{2}+2$, which depend only on the service and flow actions, respectively. A higher constraint value is given when the service probability is low and when the flow probability is high, respectively.

In the simulation, the length of the buffer is set to $L=5$. The service action space is set to $[0.2,0.4,0.6,0.8]$ and the flow action space is set to $[0.4,0.5,0.6,0.7]$. We use a horizon of length $T=50000$ and run $50$ independent simulations of the proposed CMDP-PSRL algorithm. We also plot the standard deviation around the mean value as a shaded region to show the random error. In order to compare this result to the optimum, we assume that full information of the transition dynamics is known and use linear programming to solve the problem. The optimal long-term average reward from the LP is $4.47$. The reward performance of the CMDP-PSRL algorithm is shown in Figure 1, where we observe that the reward converges towards the optimal value. We also plot the constraint violations in Figure 2. The service and flow constraints converge to $0$, as expected. We note that the reward of the proposed CMDP-PSRL algorithm approaches the optimal reward as the algorithm proceeds, and it does not violate the constraints to increase the reward further.
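As an illustration, the transition probabilities of Table 1 and the reward and cost functions above can be assembled as follows; this is only a sketch of the model construction under the parameter choices of this section, not the learning algorithm itself.

import numpy as np
from itertools import product

L = 5
serv = [0.2, 0.4, 0.6, 0.8]                  # service action space
flow = [0.4, 0.5, 0.6, 0.7]                  # flow action space
actions = list(product(serv, flow))
S, A = L + 1, len(actions)

P = np.zeros((S, A, S))
for s in range(S):
    for i, (a, b) in enumerate(actions):
        if 1 <= s <= L - 1:                  # interior states (Table 1, row 1)
            P[s, i, s - 1] = a * (1 - b)
            P[s, i, s]     = a * b + (1 - a) * (1 - b)
            P[s, i, s + 1] = (1 - a) * b
        elif s == L:                          # full buffer (row 2)
            P[s, i, s - 1] = a
            P[s, i, s]     = 1 - a
        else:                                 # empty queue, s == 0 (row 3)
            P[s, i, s]     = 1 - b * (1 - a)
            P[s, i, s + 1] = b * (1 - a)

r  = np.array([[5 - s for _ in actions] for s in range(S)], dtype=float)
c1 = np.array([[-10 * a + 6 for (a, b) in actions] for _ in range(S)], dtype=float)
c2 = np.array([[-8 * (1 - b) ** 2 + 2 for (a, b) in actions] for _ in range(S)], dtype=float)
# Equation (36) uses constraints of the form average(c^k) >= 0, i.e. average(-c^k) <= 0.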

We also compared our algorithm against the optimistic algorithm of Singh et al. [2020]. We note that their algorithm performs significantly worse than our algorithm. We attribute this poor performance to two factors. First, an optimistic algorithm does not find a policy for transition probabilities close to $P$ for a significantly long time. Second, they consider a confidence interval for each $P(s^{\prime}|s,a)$ individually. This also shows in their analysis, where they obtain an $O(T^{2/3})$ regret bound. Further, the optimization problem takes significantly more time to solve in the optimistic setup. However, the variance of their optimistic algorithm is significantly lower than the variance of our CMDP-PSRL algorithm.

7 Conclusion

This paper considers the setup of reinforcement learning in ergodic infinite-horizon constrained Markov Decision Processes with $K$ long-term average constraints. We propose a posterior sampling based algorithm, CMDP-PSRL, which proceeds in epochs. At every epoch, we sample a new CMDP and generate a solution for the constrained optimization problem. A major advantage of the posterior sampling based algorithm over an optimistic approach is that it reduces the complexity of solving for the optimal solution of the constrained problem. We also study the proposed CMDP-PSRL algorithm from a regret perspective. We bound the regret of the reward collected by the CMDP-PSRL algorithm as $\tilde{O}(T_{M}S\sqrt{AT}+CS^{2}A/(1-\rho))$. Further, we bound the gap between the long-term average costs of the sampled MDP and the true MDP to bound the $K$ constraint violations as $\tilde{O}(T_{M}S\sqrt{AT}+CS^{2}A/(1-\rho))$. Finally, we evaluate the proposed CMDP-PSRL algorithm on a flow control problem for a single queue and show that the proposed algorithm performs well empirically. This paper is the first work that obtains $\tilde{O}(\sqrt{T})$ regret bounds for ergodic MDPs with long-term average constraints using a posterior sampling algorithm. A model-free algorithm that obtains similar regret bounds for infinite horizon long-term average constraints remains an open problem.

References

  • Agarwal et al. [2020] Alekh Agarwal, Sham M Kakade, Jason D Lee, and Gaurav Mahajan. Optimality and approximation with policy gradient methods in markov decision processes. In Jacob Abernethy and Shivani Agarwal, editors, Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, pages 64–66. PMLR, 09–12 Jul 2020. URL http://proceedings.mlr.press/v125/agarwal20a.html.
  • Agarwal et al. [2022] Mridul Agarwal, Vaneet Aggarwal, and Tian Lan. Multi-objective reinforcement learning with non-linear scalarization. In Proceedings of the 21st International Conference on Autonomous Agents and Multiagent Systems, pages 9–17, 2022.
  • Agrawal and Jia [2017] Shipra Agrawal and Randy Jia. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. In Advances in Neural Information Processing Systems, pages 1184–1194, 2017.
  • Altman and Schwartz [1991] E. Altman and A. Schwartz. Adaptive control of constrained markov chains. IEEE Transactions on Automatic Control, 36(4):454–462, 1991. 10.1109/9.75103.
  • Altman [1999] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • Bertsekas and Tsitsiklis [1996] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-dynamic programming. Athena Scientific, Belmont, MA, 1996.
  • Borkar [1997] Vivek S. Borkar. Stochastic approximation with two time scales. Systems & Control Letters, 29(5):291–294, 1997. ISSN 0167-6911. https://doi.org/10.1016/S0167-6911(97)90015-3. URL https://www.sciencedirect.com/science/article/pii/S0167691197900153.
  • Borkar [2005] V.S. Borkar. An actor-critic algorithm for constrained markov decision processes. Systems & Control Letters, 54(3):207–213, 2005. ISSN 0167-6911. https://doi.org/10.1016/j.sysconle.2004.08.007. URL https://www.sciencedirect.com/science/article/pii/S0167691104001276.
  • Ding et al. [2020] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
  • Ding et al. [2021] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In International Conference on Artificial Intelligence and Statistics, pages 3304–3312. PMLR, 2021.
  • Efroni et al. [2020] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189, 2020.
  • Gattami et al. [2021] Ather Gattami, Qinbo Bai, and Vaneet Aggarwal. Reinforcement learning for constrained markov decision processes. In International Conference on Artificial Intelligence and Statistics, pages 2656–2664. PMLR, 2021.
  • Howard [1960] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge, MA, 1960.
  • Jaksch et al. [2010] Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
  • Jin et al. [2018] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc., 2018. URL https://proceedings.neurips.cc/paper/2018/file/d3b1fb02964aa64e257f9f26a31f72cf-Paper.pdf.
  • Kalagarla et al. [2020] Krishna C Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon mdp with constraints. arXiv preprint arXiv:2009.11348, 2020.
  • Osband and Van Roy [2017] Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning? In International conference on machine learning, pages 2701–2710. PMLR, 2017.
  • Osband et al. [2013] Ian Osband, Daniel Russo, and Benjamin Van Roy. (more) efficient reinforcement learning via posterior sampling. In Advances in Neural Information Processing Systems, pages 3003–3011, 2013.
  • Potra and Wright [2000] Florian A Potra and Stephen J Wright. Interior-point methods. Journal of Computational and Applied Mathematics, 124(1-2):281–302, 2000.
  • Puterman [2014] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • Qiu et al. [2020] Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. Upper confidence primal-dual reinforcement learning for cmdp with adversarial loss. In H. Larochelle, M. Ranzato, R. Hadsell, M. F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 15277–15287. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/ae95296e27d7f695f891cd26b4f37078-Paper.pdf.
  • Singh et al. [2020] Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • Tarbouriech and Lazaric [2019] Jean Tarbouriech and Alessandro Lazaric. Active exploration in markov decision processes. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 974–982. PMLR, 2019.
  • Watkins and Dayan [1992] Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3):279–292, May 1992. ISSN 1573-0565. 10.1007/BF00992698. URL https://doi.org/10.1007/BF00992698.
  • Weissman et al. [2003] Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. Inequalities for the l1 deviation of the empirical distribution. 2003.
  • Xu et al. [2020] Tengyu Xu, Yingbin Liang, and Guanghui Lan. A primal approach to constrained policy optimization: Global optimality and finite-time analysis. arXiv preprint arXiv:2011.05869, 2020.
  • Zheng and Ratliff [2020] Liyuan Zheng and Lillian Ratliff. Constrained upper confidence reinforcement learning. In Learning for Dynamics and Control, pages 620–629. PMLR, 2020.

Appendix A Proof for Regret Bounds

We now complete the proof of Theorem 1 here.

A.1 Variable Definitions

We first define some important variables required for the proof.

We define the value functions $V_{\gamma,\pi}^{P,r},V_{\gamma,\pi}^{P,k}$ for the reward $r$ and cost $c^{k}$ as:

$V_{\gamma,\pi}^{P,r}(s) =\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s\right]$ (37)
$V_{\gamma,\pi}^{P,k}(s) =\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}c^{k}(s_{t},a_{t})|s_{0}=s\right]$ (38)

We also define the Q-value functions $Q_{\gamma,\pi}^{P,r},Q_{\gamma,\pi}^{P,k}$ for the reward $r$ and cost $c^{k}$ as:

$Q_{\gamma,\pi}^{P,r}(s,a) =\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}r(s_{t},a_{t})|s_{0}=s,a_{0}=a\right]$ (39)
$Q_{\gamma,\pi}^{P,k}(s,a) =\mathbb{E}\left[\sum_{t=0}^{\infty}\gamma^{t}c^{k}(s_{t},a_{t})|s_{0}=s,a_{0}=a\right]$ (40)

Based on these, we define the Bellman error functions $B_{\pi}^{P^{\prime},r},B_{\pi}^{P^{\prime},k}$ for the reward $r$ and cost $c^{k}$ as:

$B_{\pi}^{P^{\prime},r}(s,a) =\lim_{\gamma\to 1}\left(Q_{\gamma,\pi}^{P^{\prime},r}(s,a)-r(s,a)-\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{P^{\prime},r}(s^{\prime})\right)$ (41)
$B_{\pi}^{P^{\prime},k}(s,a) =\lim_{\gamma\to 1}\left(Q_{\gamma,\pi}^{P^{\prime},k}(s,a)-c^{k}(s,a)-\gamma\sum_{s^{\prime}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{P^{\prime},k}(s^{\prime})\right)$ (42)

A.2 Auxiliary Lemmas

We now state and prove various lemmas required to complete the proof of Theorem 1.

The first lemma obtains concentration bounds for the estimated transition probabilities. We have:

Lemma 2.

The probability that the event

$\mathcal{E}_{t}=\left\{\|\bar{P}_{t}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(2AT)}{\max\{1,n_{t}(s,a)\}}}\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right\}$ (43)

fails to occur for any $t\leq T$ is bounded by $\frac{1}{T^{5}}$.

Proof Outline.

From the result of Weissman et al. [2003], the $\ell_{1}$ distance between a probability distribution over $S$ events and its empirical estimate from $n(s,a)$ samples is bounded as:

$\mathbb{P}\left(\|P(\cdot|s,a)-\bar{P}_{t}(\cdot|s,a)\|_{1}\geq\epsilon\right) \leq(2^{S}-2)\exp{\left(-\frac{n(s,a)\epsilon^{2}}{2}\right)}\leq(2^{S})\exp{\left(-\frac{n(s,a)\epsilon^{2}}{2}\right)}$ (44)

Thus, for $\epsilon=\sqrt{\frac{2}{n(s,a)}\log(2^{S}20SAT^{7})}\leq\sqrt{\frac{14S}{n(s,a)}\log(2AT)}$, we have

$\mathbb{P}\left(\|P(\cdot|s,a)-\bar{P}_{t}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{n(s,a)}\log(2AT)}\right) \leq(2^{S})\exp{\left(-\frac{n(s,a)}{2}\frac{2}{n(s,a)}\log(2^{S}20SAT^{7})\right)}$ (45)
$=2^{S}\frac{1}{2^{S}20SAT^{7}}$ (46)
$=\frac{1}{20SAT^{7}}$ (47)

We sum over all possible values of $n(s,a)$ up to time-step $t$ to bound the probability that the event $\mathcal{E}_{t}$ does not occur for a particular $(s,a)$ as:

$\sum_{n(s,a)=1}^{t}\frac{1}{20SAT^{7}}\leq\frac{1}{20SAT^{6}}$ (48)

Finally, applying a union bound over all $s,a$, we get

$\mathbb{P}\left(\|P(\cdot|s,a)-\bar{P}_{t}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{n(s,a)}\log(2AT)}\ \text{for some}\ s,a\right)\leq\frac{1}{20T^{6}}$ (49)

Further, using union bounds and summing over all the tTt\leq T, we get

$\mathbb{P}\left(\|P(\cdot|s,a)-\bar{P}_{t}(\cdot|s,a)\|_{1}\geq\sqrt{\frac{14S}{n(s,a)}\log(2AT)}\ \text{for some}\ s,a,\ t\leq T\right) \leq\sum_{t=1}^{T}\frac{1}{20T^{6}}$ (50)
$\leq\frac{1}{T^{5}}$ (51)
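As an illustration, the concentration event in Equation (43) can be checked empirically for a single state-action pair by Monte Carlo; the values S=6, A=10, T=10^4, and n=500 in this sketch are hypothetical.

import numpy as np

def l1_deviation_check(S=6, A=10, T=10**4, n=500, trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    p = rng.dirichlet(np.ones(S))                       # a fixed true distribution
    radius = np.sqrt(14 * S * np.log(2 * A * T) / n)    # confidence radius of Equation (43)
    violations = 0
    for _ in range(trials):
        p_hat = rng.multinomial(n, p) / n               # empirical estimate from n samples
        if np.abs(p_hat - p).sum() > radius:
            violations += 1
    print(violations / trials)                          # empirical failure probability

l1_deviation_check()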

The next lemma relates the difference between the average per-step reward $\lambda_{\pi}^{P,r}$ (or cost $\zeta_{\pi}^{P,k}$) for following policy $\pi$ on the true MDP with transition probabilities $P$ and the average per-step reward $\lambda_{\pi}^{\tilde{P},r}$ for following policy $\pi$ on the MDP with transition probabilities $\tilde{P}$ to the Bellman error $B_{\pi}^{\tilde{P},r}(s,a)$ as:

Lemma 3.

The difference between the long-term average reward for running the policy $\pi$ on the MDP with transition probabilities $\tilde{P}$, $\lambda_{\pi}^{\tilde{P},r}$, and the long-term average reward for running the policy $\pi$ on the true MDP, $\lambda_{\pi}^{P,r}$, is the long-term average Bellman error:

$\lambda_{\pi}^{\tilde{P},r}-\lambda_{\pi}^{P,r}=\sum_{s,a}d_{\pi}^{P}(s,a)B_{\pi}^{\tilde{P},r}(s,a)=\mathbb{E}_{\pi,P}\left[B_{\pi}^{\tilde{P},r}(s,a)\right].$ (52)
Proof.

Note that for all $s\in\mathcal{S}$, we have:

$V_{\gamma,\pi}^{\tilde{P},r}(s) =\mathbb{E}_{a\sim\pi}\left[Q_{\gamma,\pi}^{\tilde{P},r}(s,a)\right]$ (53)
$=\mathbb{E}_{a\sim\pi}\left[B_{\gamma,\pi}^{\tilde{P},r}(s,a)+r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})\right]$ (54)

where Equation (54) follows from the definition of the Bellman error for the state-action pair $(s,a)$.

Similarly, for the true MDP, we have,

$V_{\gamma,\pi}^{P,r}(s) =\mathbb{E}_{a\sim\pi}\left[Q_{\gamma,\pi}^{P,r}(s,a)\right]$ (55)
$=\mathbb{E}_{a\sim\pi}\left[r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{P,r}(s^{\prime})\right]$ (56)

Subtracting Equation (56) from Equation (54), we get:

$V_{\gamma,\pi}^{\tilde{P},r}(s)-V_{\gamma,\pi}^{P,r}(s) =\mathbb{E}_{a\sim\pi}\left[B_{\gamma,\pi}^{\tilde{P},r}(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)\left(V_{\gamma,\pi}^{\tilde{P},r}-V_{\gamma,\pi}^{P,r}\right)(s^{\prime})\right]$ (57)
$=\mathbb{E}_{a\sim\pi}\left[B_{\gamma,\pi}^{\tilde{P},r}(s,a)\right]+\gamma\sum_{s^{\prime}\in\mathcal{S}}P_{\pi}(s^{\prime}|s)\left(V_{\gamma,\pi}^{\tilde{P},r}-V_{\gamma,\pi}^{P,r}\right)(s^{\prime})$ (58)

Using the vector format for the value functions, we have,

$\bar{V}_{\gamma,\pi}^{\tilde{P},r}-\bar{V}_{\gamma,\pi}^{P,r} =\left(I-\gamma P_{\pi}\right)^{-1}\bar{B}_{\gamma,\pi}^{\tilde{P},r}$ (59)
where $\bar{B}_{\gamma,\pi}^{\tilde{P},r}(s)=\mathbb{E}_{a\sim\pi}\left[B_{\gamma,\pi}^{\tilde{P},r}(s,a)\right]$.

Now, converting the value function to average per-step reward we have,

$\lambda_{\pi}^{\tilde{P},r}\bm{1}_{S}-\lambda_{\pi}^{P,r}\bm{1}_{S} =\lim_{\gamma\to 1}(1-\gamma)\left(\bar{V}_{\gamma,\pi}^{\tilde{P},r}-\bar{V}_{\gamma,\pi}^{P,r}\right)$ (60)
$=\lim_{\gamma\to 1}(1-\gamma)\left(I-\gamma P_{\pi}\right)^{-1}\bar{B}_{\gamma,\pi}^{\tilde{P},r}$ (61)
$=\left(\sum_{s,a}d_{\pi}^{P}(s,a)B_{\pi}^{\tilde{P},r}(s,a)\right)\bm{1}_{S}$ (62)

where the last equation follows from the definition of occupancy measures by Puterman [2014], and the existence of the limit limγ1Bγ,πP~,r\lim_{\gamma\to 1}B_{\gamma,\pi}^{\tilde{P},r} in Equation (72). ∎
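
The identity in Lemma 3 can be verified numerically on a small random MDP. The sketch below is illustrative only; the MDP sizes, kernels, rewards, and policy are randomly generated assumptions. It computes both sides of Equation (52), with the bias obtained from the Poisson equation, and checks that they coincide.

```python
import numpy as np

# Numerical check of Lemma 3: lambda^{Ptilde,r} - lambda^{P,r} equals the
# expectation of the Bellman error under the occupancy measure of pi on P.
rng = np.random.default_rng(1)
S, A = 4, 3                                        # assumed toy sizes

P  = rng.dirichlet(np.ones(S), size=(S, A))        # true kernel P(s'|s,a)
Pt = rng.dirichlet(np.ones(S), size=(S, A))        # sampled kernel Ptilde(s'|s,a)
r  = rng.random((S, A))                            # rewards in [0,1]
pi = rng.dirichlet(np.ones(A), size=S)             # stochastic policy pi(a|s)

def induced(P, pi, r):
    """Policy-induced chain, stationary distribution, gain, and bias."""
    P_pi = np.einsum('sa,sap->sp', pi, P)
    r_pi = (pi * r).sum(axis=1)
    evals, evecs = np.linalg.eig(P_pi.T)
    mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))])
    mu = mu / mu.sum()
    gain = mu @ r_pi
    # Bias h solves (I - P_pi + 1 mu^T) h = r_pi - gain, with mu @ h = 0.
    h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), mu), r_pi - gain)
    return mu, gain, h

mu_P, lam_P, _ = induced(P, pi, r)
_,    lam_Pt, h = induced(Pt, pi, r)

B = np.einsum('sap,p->sa', Pt - P, h)              # Bellman error, Equation (72)
d = mu_P[:, None] * pi                             # occupancy measure d_pi^P(s,a)

assert np.isclose(lam_Pt - lam_P, (d * B).sum())   # the two sides of Equation (52)
print(lam_Pt - lam_P, (d * B).sum())
```
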

After relating the gap between the long-term average rewards of policy $\pi_{e}$ on the two MDPs, we now want to bound the sum of the Bellman errors over an epoch. For this, we first bound the Bellman error for a particular state-action pair $(s,a)$ in the following lemma.

Lemma 4.

For an MDP with rewards $r(s,a)$ and transition probabilities $\tilde{P}(s^{\prime}|s,a)$ such that $\|\tilde{P}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\leq\epsilon_{s,a}$, the Bellman error $B_{\pi}^{\tilde{P},r}(s,a)$ for state-action pair $(s,a)$ is upper bounded as

BπP~,r(s,a)P~(|s,a)P(|s,a)1hπP~,r()\displaystyle B_{\pi}^{\tilde{P},r}(s,a)\leq\big{\|}\tilde{P}(\cdot|s,a)-P(\cdot|s,a)\big{\|}_{1}\|h_{\pi}^{\tilde{P},r}(\cdot)\|_{\infty} (63)

where hπP~,r()\|h_{\pi}^{\tilde{P},r}(\cdot)\|_{\infty} is the bias-span of the MDP with transition probability P~\tilde{P}.

Proof.

Starting with the definition of Bellman error in Equation (41), we get

BπP~,r(s,a)\displaystyle B_{\pi}^{\tilde{P},r}(s,a) =limγ1Bγ,πP~,r(s,a)\displaystyle=\lim_{\gamma\to 1}B_{\gamma,\pi}^{\tilde{P},r}(s,a) (64)
=limγ1(Qγ,πP~,r(s,a)(r(s,a)+γs𝒮P(s|s,a)Vγ,πP~,r))\displaystyle=\lim_{\gamma\to 1}\left(Q_{\gamma,\pi}^{\tilde{P},r}(s,a)-\left(r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},r}\right)\right) (65)
=limγ1((r(s,a)+γs𝒮P~(s|s,a)Vγ,πP~,r(s))(r(s,a)+γs𝒮P(s|s,a)Vγ,πP~,r(s)))\displaystyle=\lim_{\gamma\to 1}\left(\left(r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}\tilde{P}(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})\right)-\left(r(s,a)+\gamma\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})\right)\right) (66)
=limγ1γs𝒮(P~(s|s,a)P(s|s,a))Vγ,πP~,r(s)\displaystyle=\lim_{\gamma\to 1}\gamma\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime}) (67)
=limγ1γ(s𝒮(P~(s|s,a)P(s|s,a))Vγ,πP~,r(s)+Vγ,πP~,r(s)Vγ,πP~,r(s))\displaystyle=\lim_{\gamma\to 1}\gamma\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})+V_{\gamma,\pi}^{\tilde{P},r}(s)-V_{\gamma,\pi}^{\tilde{P},r}(s)\right) (68)
=limγ1γ(s𝒮(P~(s|s,a)P(s|s,a))Vγ,πP~,r(s)s𝒮P~(s|s,a)Vγ,πP~,r(s)\displaystyle=\lim_{\gamma\to 1}\gamma\Big{(}\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})-\sum_{s^{\prime}\in\mathcal{S}}\tilde{P}(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},r}(s)
+s𝒮P(s|s,a)Vγ,πP~,r(s))\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ \leavevmode\nobreak\ +\sum_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},r}(s)\Big{)} (69)
=limγ1γ(s𝒮(P~(s|s,a)P(s|s,a))(Vγ,πP~,r(s)Vγ,πP~,r(s)))\displaystyle=\lim_{\gamma\to 1}\gamma\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)\left(V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})-V_{\gamma,\pi}^{\tilde{P},r}(s)\right)\right) (70)
=(s𝒮(P~(s|s,a)P(s|s,a))limγ1γ(Vγ,πP~,r(s)Vγ,πP~,r(s)))\displaystyle=\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)\lim_{\gamma\to 1}\gamma\left(V_{\gamma,\pi}^{\tilde{P},r}(s^{\prime})-V_{\gamma,\pi}^{\tilde{P},r}(s)\right)\right) (71)
=(s𝒮(P~(s|s,a)P(s|s,a))hπP~,r(s))\displaystyle=\left(\sum_{s^{\prime}\in\mathcal{S}}\left(\tilde{P}(s^{\prime}|s,a)-P(s^{\prime}|s,a)\right)h_{\pi}^{\tilde{P},r}(s^{\prime})\right) (72)
P~(|s,a)P(|s,a)1hπP~,r()\displaystyle\leq\Big{\|}\tilde{P}(\cdot|s,a)-P(\cdot|s,a)\Big{\|}_{1}\|h_{\pi}^{\tilde{P},r}(\cdot)\|_{\infty} (73)
ϵs,aT~M\displaystyle\leq\epsilon_{s,a}\tilde{T}_{M} (74)

where Equation (67) follows since the rewards are known to the agent and hence $r(s,a)$ cancels in the two terms. Equation (71) follows from the fact that the difference between the value functions at any two states is bounded. Equation (72) comes from the definition of the bias term [Puterman, 2014], where $h$ is the bias of policy $\pi$ when run on the sampled MDP. Equation (73) follows from Hölder's inequality. In Equation (74), the $\ell_{1}$ norm of the difference of the probability vectors is bounded by $\epsilon_{s,a}$ from the definition, and the bias is bounded by $\tilde{T}_{M}$.

Additionally, note that the $\ell_{1}$ norm in Equation (73) is bounded by $2$. Thus the Bellman error is loosely upper bounded by $2\|h_{\pi}^{\tilde{P},r}(\cdot)\|_{\infty}$ for all state-action pairs. ∎
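
As a quick numerical illustration (not part of the proof), the Hölder step in Equations (72)-(73) and the loose bound $2\|h\|_{\infty}$ can be checked for random probability vectors and an arbitrary stand-in bias vector; the sizes and ranges below are assumed.

```python
import numpy as np

# Check of the Holder step (72)-(73): (ptilde - p) @ h <= ||ptilde - p||_1 ||h||_inf,
# and of the loose bound <= 2 ||h||_inf. Sizes and ranges are assumed toy values.
rng = np.random.default_rng(2)
S = 5
for _ in range(10_000):
    p, pt = rng.dirichlet(np.ones(S)), rng.dirichlet(np.ones(S))  # stand-ins for P, Ptilde at one (s,a)
    h = rng.uniform(-10, 10, size=S)                              # stand-in for the bias h_pi^{Ptilde,r}
    bellman_err = (pt - p) @ h                                    # Equation (72)
    assert bellman_err <= np.abs(pt - p).sum() * np.abs(h).max() + 1e-12  # Equation (73)
    assert bellman_err <= 2 * np.abs(h).max() + 1e-12                     # loose bound above
print("Holder bound held in all trials")
```
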

Lemma 5 (Bounded Span of optimal MDP in confidence interval).

For an MDP with rewards $r(s,a)$ and transition probabilities $P_{e}^{r}=\arg\max_{P_{e}\in\mathcal{P}_{t_{e}}}\lambda_{\pi_{e}}^{P_{e},r}$, the difference between the bias values of any two states $s$ and $s^{\prime}$ under policy $\pi_{e}$ is upper bounded by the mixing time $T_{M}$ of the true MDP as:

hπePer,r(s)hπePer,r(s)TMs,s𝒮\displaystyle h_{\pi_{e}}^{P_{e}^{r},r}(s)-h_{\pi_{e}}^{P_{e}^{r},r}(s^{\prime})\leq T_{M}\leavevmode\nobreak\ \forall\leavevmode\nobreak\ s,s^{\prime}\in\mathcal{S} (75)
Proof.

Note that λπePer,rλπeP,r\lambda_{\pi_{e}}^{P_{e}^{r},r}\geq\lambda_{\pi_{e}}^{P^{\prime},r} for all P𝒫teP^{\prime}\in\mathcal{P}_{t_{e}}. Now, consider the following Bellman equation:

\displaystyle h_{\pi_{e}}^{P_{e}^{r},r}(s)=r_{\pi_{e}}(s)-\lambda_{\pi_{e}}^{P_{e}^{r},r}+<P_{\pi_{e},e}^{r}(\cdot|s),h_{\pi_{e}}^{P_{e}^{r},r}>
=ThπePer,r(s)\displaystyle=Th_{\pi_{e}}^{P_{e}^{r},r}(s) (76)

where $r_{\pi_{e}}(s)=\sum_{a}\pi_{e}(a|s)r(s,a)$ and $P_{\pi_{e},e}^{r}(s^{\prime}|s)=\sum_{a}\pi_{e}(a|s)P_{e}^{r}(s^{\prime}|s,a)$.

Consider two states s,s𝒮s,s^{\prime}\in\mathcal{S}. Also, let τ\tau be a random variable defined as:

τ=min{t1:st=s,s1=s}\displaystyle\tau=\min\{t\geq 1:s_{t}=s^{\prime},s_{1}=s\} (77)

We also define another operator,

T¯h(s)={mins,ar(s,a)λπePer,r+<Pπe(|s),h>,sshπePer,r(s),s=s\displaystyle\bar{T}h(s)=\begin{cases}\min_{s,a}r(s,a)-\lambda_{\pi_{e}}^{P_{e}^{r},r}+<P_{\pi_{e}}(\cdot|s),h>,&s\neq s^{\prime}\\ h_{\pi_{e}}^{P_{e}^{r},r}(s^{\prime}),&s=s^{\prime}\end{cases} (78)

where $P_{\pi_{e}}(s^{\prime}|s)=\sum_{a}\pi_{e}(a|s)P(s^{\prime}|s,a)$.

Now, note that

h(s)\displaystyle h(s) =Th(s)\displaystyle=Th(s) (79)
=maxP𝒫te(rπer(s)λπePer,r+<Pπe,h>)\displaystyle=\max_{P^{\prime}\in\mathcal{P}_{t_{e}}}\left(r_{\pi_{e}}^{r}(s)-\lambda_{\pi_{e}}^{P_{e}^{r},r}+<P_{\pi_{e}}^{\prime},h>\right) (80)
rπer(s)λπePer,r+<Pπe,h>\displaystyle\geq r_{\pi_{e}}^{r}(s)-\lambda_{\pi_{e}}^{P_{e}^{r},r}+<P_{\pi_{e}},h> (81)
mins,ar(s,a)λπePer,r+<Pπe,h>\displaystyle\geq\min_{s,a}r(s,a)-\lambda_{\pi_{e}}^{P_{e}^{r},r}+<P_{\pi_{e}},h> (82)
=T¯h(s)\displaystyle=\bar{T}h(s) (83)

Further, for any two vectors $u,w$ such that every element of $u$ is not smaller than the corresponding element of $w$, we have $\bar{T}u\geq\bar{T}w$. Hence, we have $\bar{T}^{n}h_{\pi}^{P,r}(s)\leq h_{\pi}^{P,r}(s)$ for all $s$. Unrolling the recurrence, we have

hπPer,r(s)T¯nhπPer,r(s)=𝔼[(λπPer,rmins,ar(s,a))(nτ)+hπPer,r(snτ)]\displaystyle h_{\pi}^{P_{e}^{r},r}(s)\geq\bar{T}^{n}h_{\pi}^{P_{e}^{r},r}(s)=\mathbb{E}\left[-(\lambda_{\pi}^{P_{e}^{r},r}-\min_{s,a}r(s,a))(n\wedge\tau)+h_{\pi}^{P_{e}^{r},r}(s_{n\wedge\tau})\right] (84)

Taking the limit $n\to\infty$, and noting that $\lambda_{\pi}^{P_{e}^{r},r}-\min_{s,a}r(s,a)\leq 1$ since the rewards lie in $[0,1]$ and that the expected hitting time $\mathbb{E}[\tau]$ of $s^{\prime}$ from $s$ is at most $T_{M}$, we have $h_{\pi}^{P_{e}^{r},r}(s)\geq h_{\pi}^{P_{e}^{r},r}(s^{\prime})-T_{M}$, completing the proof. ∎
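
The span bound of Lemma 5 can be illustrated numerically for a fixed policy: on a random ergodic policy-induced chain with rewards in $[0,1]$, the bias gap $h(s)-h(s^{\prime})$ never exceeds the expected hitting time of $s^{\prime}$ from $s$, which the paper upper bounds by $T_{M}$. This sketch is illustrative only; the chain, rewards, and sizes are assumed toy values.

```python
import numpy as np

# Numerical sketch of the span bound: h(s) - h(s') <= E_s[time to hit s'].
rng = np.random.default_rng(3)
S = 5                                                 # assumed toy number of states
P_pi = rng.dirichlet(np.ones(S), size=S)              # policy-induced chain
r_pi = rng.random(S)                                  # per-state reward in [0,1]

# Stationary distribution, gain, and bias (Poisson equation).
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1))]); mu /= mu.sum()
lam = mu @ r_pi
h = np.linalg.solve(np.eye(S) - P_pi + np.outer(np.ones(S), mu), r_pi - lam)

# Expected hitting times m[s, s'] = E_s[min{t >= 0 : X_t = s'}].
m = np.zeros((S, S))
for tgt in range(S):
    idx = [s for s in range(S) if s != tgt]
    Q = P_pi[np.ix_(idx, idx)]                        # chain restricted to non-target states
    m[idx, tgt] = np.linalg.solve(np.eye(S - 1) - Q, np.ones(S - 1))

# Check h(s) - h(s') <= E_s[hitting time of s'] for every ordered pair.
gap = h[:, None] - h[None, :]
assert np.all(gap <= m + 1e-8)
print("max bias gap:", gap.max(), " max expected hitting time:", m.max())
```
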

A.3 Proof of results from main text

After stating the necessary lemmas, we can now prove Lemma 1 and Theorem 1.

Proof of Theorem 1.

We continue our proof from Equation (25). We had:

R(T)\displaystyle R(T) e=1E𝔼[t=tete+11(λπePer,rλπeP,r)]+e=1E𝔼[t=tete+11(λπeP,rr(st,at))]\displaystyle\leq\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P_{e}^{r},r}-\lambda_{\pi_{e}}^{P,r}\right)\right]+\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\right] (85)
=R1(T)+R2(T)\displaystyle=R_{1}(T)+R_{2}(T) (86)

where R1(T)R_{1}(T) and R2(T)R_{2}(T) are:

R1(T)\displaystyle R_{1}(T) =e=1E𝔼[t=tete+11(λπePer,rλπeP,r)]\displaystyle=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P_{e}^{r},r}-\lambda_{\pi_{e}}^{P,r}\right)\right] (87)
R2(T)\displaystyle R_{2}(T) =e=1E𝔼[t=tete+11(λπeP,rr(st,at))]\displaystyle=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\right] (88)

We first consider the $R_{2}(T)$ term. We start by defining the filtration $\mathcal{H}_{t}=\{s_{0},a_{0},\cdots,s_{t},a_{t}\}$ as the set of observed states and played actions up to time $t$. Further, we have $\lambda_{\pi_{e}}^{P,r}$ as

λπeP,r=𝔼(s,a)πe,P[r(s,a)]\displaystyle\lambda_{\pi_{e}}^{P,r}=\mathbb{E}_{(s,a)\sim\pi_{e},P}[r(s,a)] (89)

We have,

𝔼(s,a)πe,P[r(s,a)]\displaystyle\mathbb{E}_{(s,a)\sim\pi_{e},P}[r(s,a)] =𝔼(s,a)πe,P[r(s,a)]±𝔼(st,at)πe,P[r(st,at)|te1]\displaystyle=\mathbb{E}_{(s,a)\sim\pi_{e},P}[r(s,a)]\pm\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}] (90)
=𝔼(st,at)πe,P[r(st,at)|te1]+(𝔼(s,a)πe,P[r(s,a)]𝔼(st,at)πe,P[r(st,at)|te1])\displaystyle=\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]+\left(\mathbb{E}_{(s,a)\sim\pi_{e},P}[r(s,a)]-\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]\right) (91)
\displaystyle\leq\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]+2\left(\|\pi_{e}(a|s)d_{\pi_{e}}(s)-\pi_{e}(a|s)P_{\pi,s_{t_{e}-1}}^{t-t_{e}+1}(s)\|_{TV}\right) (92)
𝔼(st,at)πe,P[r(st,at)|te1]+2CSρtte\displaystyle\leq\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]+2CS\rho^{t-t_{e}} (93)

Hence, we have,

t=tete+11(λπeP,rr(st,at))\displaystyle\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right) =t=tete+11(𝔼(s,a)πe,P[r(s,a)]r(st,at))\displaystyle=\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}_{(s,a)\sim\pi_{e},P}[r(s,a)]-r(s_{t},a_{t})\right) (94)
t=tete+11(𝔼(st,at)πe,P[r(st,at)|te1]+2CSρtter(st,at))\displaystyle\leq\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]+2CS\rho^{t-t_{e}}-r(s_{t},a_{t})\right) (95)
t=tete+11(𝔼(st,at)πe,P[r(st,at)|te1]r(st,at))+t=te2CSρtte\displaystyle\leq\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]-r(s_{t},a_{t})\right)+\sum_{t=t_{e}}^{\infty}2CS\rho^{t-t_{e}} (96)
t=tete+11(𝔼(st,at)πe,P[r(st,at)|te1]r(st,at))+2CS1ρ\displaystyle\leq\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]-r(s_{t},a_{t})\right)+\frac{2CS}{1-\rho} (97)

Using Azuma-Hoeffding’s inequality, we get,

t=tete+11(𝔼(st,at)πe,P[r(st,at)|te1]r(st,at))\displaystyle\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]-r(s_{t},a_{t})\right) 2(te+1te)log(2T)\displaystyle\leq 2\sqrt{(t_{e+1}-t_{e})\log(2T)} (98)

with probability at least 11/T1-1/T. Summing over all the epochs and using Cauchy-Schwarz inequality, we get:

\displaystyle\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\leq\sum_{e=1}^{E}\left(\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}_{(s_{t},a_{t})\sim\pi_{e},P}[r(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}]-r(s_{t},a_{t})\right)+\frac{2CS}{1-\rho}\right) (99)
e=1E2(te+1te)log(2T)+2CSE1ρ\displaystyle\leq\sum_{e=1}^{E}2\sqrt{(t_{e+1}-t_{e})\log(2T)}+\frac{2CSE}{1-\rho} (100)
2Ee=1E(te+1te)log(2T)+2CSE1ρ\displaystyle\leq 2\sqrt{E\sum_{e=1}^{E}(t_{e+1}-t_{e})\log(2T)}+\frac{2CSE}{1-\rho} (101)
=2ETlog(2T)+2CSE1ρ\displaystyle=2\sqrt{ET\log(2T)}+\frac{2CSE}{1-\rho} (102)

with probability at least $1-E/T$. Further, on the complementary event, which occurs with probability at most $E/T$, the sum is trivially bounded by $T$, which gives,

𝔼[R2(T)]\displaystyle\mathbb{E}\left[R_{2}(T)\right] =e=1E𝔼[t=tete+11(λπeP,rr(st,at))]\displaystyle=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\left(\lambda_{\pi_{e}}^{P,r}-r(s_{t},a_{t})\right)\right] (103)
\displaystyle\leq 4\sqrt{ET\log(2T)}+\frac{2CSE}{1-\rho}+\frac{E}{T}T (104)
=E+4ETlog(2T)+2CSE1ρ\displaystyle=E+4\sqrt{ET\log(2T)}+\frac{2CSE}{1-\rho} (105)

We can now focus on the R1(T)R_{1}(T) term. We have:

\displaystyle R_{1}(T)=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}(\lambda_{\pi_{e}}^{P_{e}^{r},r}-\lambda_{\pi_{e}}^{P,r})\right] (106)
\displaystyle=\sum_{e=1}^{E}\mathbb{E}\left[\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{s,a\sim\pi_{e},P}\left[B_{\pi_{e}}^{P_{e}^{r},r}(s,a)\right]\right] (107)

Similar to Equations (90)-(93), we have:

e=1Et=tete+11𝔼s,aπ,P[BπePer,r(s,a)]\displaystyle\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{s,a\sim\pi,P}\left[B_{\pi_{e}}^{P_{e}^{r},r}(s,a)\right] e=1Et=tete+11𝔼s,aπ,P[BπePer,r(s,a)|te1]+e=1Et=tete+112CTMSρtte\displaystyle\leq\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{s,a\sim\pi,P}\left[B_{\pi_{e}}^{P_{e}^{r},r}(s,a)|\mathcal{H}_{t_{e}-1}\right]+\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}2CT_{M}S\rho^{t-t_{e}} (108)

Again, using Azuma-Hoeffding’s inequality, with probability at least 11/T1-1/T we have:

\displaystyle\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{s,a\sim\pi,P}\left[B_{\pi_{e}}^{P_{e}^{r},r}(s,a)|\mathcal{H}_{t_{e}-1}\right]\leq\sum_{t=t_{e}}^{t_{e+1}-1}B_{\pi_{e}}^{P_{e}^{r},r}(s_{t},a_{t})+2T_{M}\sqrt{(t_{e+1}-t_{e})\log(2T)} (109)

Summing over all the epochs, we get, with probability at least 1E/T1-E/T:

\displaystyle\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\mathbb{E}_{s,a\sim\pi,P}\left[B_{\pi_{e}}^{P_{e}^{r},r}(s,a)|\mathcal{H}_{t_{e}-1}\right]\leq\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}B_{\pi_{e}}^{P_{e}^{r},r}(s_{t},a_{t})+\sum_{e=1}^{E}2T_{M}\sqrt{(t_{e+1}-t_{e})\log(2T)} (110)
\displaystyle\leq\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}B_{\pi_{e}}^{P_{e}^{r},r}(s_{t},a_{t})+2T_{M}\sqrt{E\sum_{e=1}^{E}(t_{e+1}-t_{e})\log(2T)} (111)
\displaystyle=\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}B_{\pi_{e}}^{P_{e}^{r},r}(s_{t},a_{t})+2T_{M}\sqrt{ET\log(2T)} (112)
\displaystyle\leq\sum_{e=1}^{E}\sum_{t=t_{e}}^{t_{e+1}-1}\Big{\|}P_{e}^{r}(\cdot|s_{t},a_{t})-P(\cdot|s_{t},a_{t})\Big{\|}_{1}\|h_{\pi_{e}}^{P_{e}^{r},r}(\cdot)\|_{\infty}+2T_{M}\sqrt{ET\log(2T)} (113)
\displaystyle\leq\sum_{e=1}^{E}\sum_{s,a}\nu_{e}(s,a)2\sqrt{\frac{14S\log(2AT)}{N_{e}(s,a)}}\|h_{\pi_{e}}^{P_{e}^{r},r}(\cdot)\|_{\infty}+2T_{M}\sqrt{ET\log(2T)} (114)
2TM14Slog(2AT)s,ae=1Eνe(s,a)Ne(s,a)+2TMETlog(2T)\displaystyle\leq 2T_{M}\sqrt{14S\log(2AT)}\sum_{s,a}\sum_{e=1}^{E}\frac{\nu_{e}(s,a)}{\sqrt{N_{e}(s,a)}}+2T_{M}\sqrt{ET\log(2T)} (115)
2(2+1)TM14Slog(2AT)s,aN(s,a)+2TMETlog(2T)\displaystyle\leq 2(\sqrt{2}+1)T_{M}\sqrt{14S\log(2AT)}\sum_{s,a}\sqrt{N(s,a)}+2T_{M}\sqrt{ET\log(2T)} (116)
2(2+1)TM14Slog(2AT)SAT+2TMETlog(2T)\displaystyle\leq 2(\sqrt{2}+1)T_{M}\sqrt{14S\log(2AT)}\sqrt{SAT}+2T_{M}\sqrt{ET\log(2T)} (117)

where Equation (113) follows from Lemma 4 (applied with $\tilde{P}=P_{e}^{r}$). Equation (114) follows from Lemma 2, which holds with probability at least $1-1/T^{5}$. Equation (115) comes from Lemma 5. Equation (116) follows from [Jaksch et al., 2010, Lemma 19], and Equation (117) follows from the Cauchy-Schwarz inequality.
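
The step from Equation (115) to Equation (116) is the standard pigeonhole bound of [Jaksch et al., 2010, Lemma 19]: if every epoch satisfies $\nu_{e}(s,a)\leq\max(1,N_{e}(s,a))$ and $N_{e+1}(s,a)=N_{e}(s,a)+\nu_{e}(s,a)$, then $\sum_{e}\nu_{e}(s,a)/\sqrt{\max(1,N_{e}(s,a))}\leq(\sqrt{2}+1)\sqrt{\max(1,N(s,a))}$. The sketch below (illustrative only, with randomly generated epoch counts satisfying this condition) checks the inequality.

```python
import numpy as np

# Numerical check of the pigeonhole bound used between Equations (115) and (116),
# i.e. [Jaksch et al., 2010, Lemma 19], for random epoch visit counts.
rng = np.random.default_rng(4)
for _ in range(1000):
    N, lhs = 0, 0.0
    for _ in range(rng.integers(1, 50)):          # a random number of epochs
        nu = rng.integers(0, max(1, N) + 1)       # epoch visits, capped at max(1, N_e)
        lhs += nu / np.sqrt(max(1, N))            # nu_e / sqrt(max(1, N_e))
        N += nu                                   # N_{e+1} = N_e + nu_e
    assert lhs <= (np.sqrt(2) + 1) * np.sqrt(max(1, N)) + 1e-9
print("Lemma 19 bound held in all random trials")
```
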

Together with Equation (108), we get:

\displaystyle R_{1}(T)\leq 2(\sqrt{2}+1)T_{M}S\sqrt{AT\log(AT)}+2T_{M}\sqrt{ET\log(2T)}+\frac{2CT_{M}SE}{1-\rho}+E+\sqrt{T} (118)

Combining $R_{1}(T)$ and $R_{2}(T)$, we get the required bound on the regret. The bound on the constraint violations follows similarly. ∎

Proof of Lemma 1.

We begin by considering the policy $\pi$ from Assumption 3. We prove the result for a single $k\in[K]$; the argument is identical for every $k\in[K]$. Consider an MDP with transition dynamics $P_{e}^{k}$ that maximizes $\zeta_{\pi}^{P^{\prime},k}$ over all $P^{\prime}$ satisfying $\|P^{\prime}(\cdot|s,a)-P(\cdot|s,a)\|_{1}\leq\sqrt{\frac{14S\log(2At)}{N_{e}(s,a)}}$ for all $s,a$. Consider the difference between the average cost $k$ incurred by following policy $\pi$ on the MDP with transition probabilities $P_{e}^{k}$ and the average cost $k$ incurred by following policy $\pi$ on the MDP with the true transition probabilities $P$; using Lemma 3, we have:

ζπP~e,kζπP,k\displaystyle\zeta_{\pi}^{\tilde{P}_{e},k}-\zeta_{\pi}^{P,k} ζπPek,kζπP,k\displaystyle\leq\zeta_{\pi}^{P_{e}^{k},k}-\zeta_{\pi}^{P,k} (119)
=s,adπP(s,a)BπPek,k(s,a)\displaystyle=\sum_{s,a}d_{\pi}^{P}(s,a)B_{\pi}^{P_{e}^{k},k}(s,a) (120)
=𝔼[BπPek,k(s,a)]\displaystyle=\mathbb{E}\left[B_{\pi}^{P_{e}^{k},k}(s,a)\right] (121)

where the Bellman error $B_{\pi}^{P_{e}^{k},k}(s,a)$ is of the following form,

\displaystyle B_{\pi}^{\tilde{P}_{e},k}(s,a)=\lim_{\gamma\to 1}\left(Q_{\gamma,\pi}^{\tilde{P},k}(s,a)-c^{k}(s,a)-\gamma\sum\nolimits_{s^{\prime}\in\mathcal{S}}P(s^{\prime}|s,a)V_{\gamma,\pi}^{\tilde{P},k}(s^{\prime})\right),

and the value function $V_{\gamma,\pi}^{\tilde{P},k}(s)$ and the $Q$-value function $Q_{\gamma,\pi}^{\tilde{P},k}(s,a)$ become:

\displaystyle V_{\gamma,\pi}^{\tilde{P},k}(s)=\sum_{t=1}^{\infty}\gamma^{t-1}\mathbb{E}_{a_{t}\sim\pi,s_{t+1}\sim\tilde{P}}\left[c^{k}(s_{t},a_{t})|s_{1}=s\right]
\displaystyle Q_{\gamma,\pi}^{\tilde{P},k}(s,a)=\sum_{t=1}^{\infty}\gamma^{t-1}\mathbb{E}_{a_{t}\sim\pi,s_{t+1}\sim\tilde{P}}\left[c^{k}(s_{t},a_{t})|s_{1}=s,a_{1}=a\right].

We bound the expectation using the Azuma-Hoeffding inequality as follows:

𝔼[BπPek,k(s,a)]\displaystyle\mathbb{E}\left[B_{\pi}^{P_{e}^{k},k}(s,a)\right] =𝔼[BπPek,k(st,at)|te1]+ChπPe,k()ρtte\displaystyle=\mathbb{E}\left[B_{\pi}^{P_{e}^{k},k}(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}\right]+C\|h_{\pi}^{P_{e},k}(\cdot)\|_{\infty}\rho^{t-t_{e}} (122)
=1te+1tet=tete+11(𝔼[BπPek,k(st,at)|te1]+ChπPe,k()ρtte)\displaystyle=\frac{1}{t_{e+1}-t_{e}}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}\left[B_{\pi}^{P_{e}^{k},k}(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}\right]+C\|h_{\pi}^{P_{e},k}(\cdot)\|_{\infty}\rho^{t-t_{e}}\right) (123)
1te+1tet=tete+11(𝔼[BπPek,k(st,at)|te1])+CShπPe,k()(1ρ)(te+1te)\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\sum_{t=t_{e}}^{t_{e+1}-1}\left(\mathbb{E}\left[B_{\pi}^{P_{e}^{k},k}(s_{t},a_{t})|\mathcal{H}_{t_{e}-1}\right]\right)+\frac{CS\|h_{\pi}^{P_{e},k}(\cdot)\|_{\infty}}{(1-\rho)(t_{e+1}-t_{e})} (124)
1te+1te(TM14SlogATs,aνe(s,a)Ne(s,a)+4TM7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(T_{M}\sqrt{14S\log AT}\sum_{s,a}\frac{\nu_{e}(s,a)}{\sqrt{N_{e}(s,a)}}+4T_{M}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSTM(1ρ)(te+1te)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\frac{CST_{M}}{(1-\rho)(t_{e+1}-t_{e})} (125)
1te+1te(TM14SlogATs,aνe(s,a)+4TM7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(T_{M}\sqrt{14S\log AT}\sum_{s,a}\sqrt{\nu_{e}(s,a)}+4T_{M}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSTM(1ρ)(te+1te)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\frac{CST_{M}}{(1-\rho)(t_{e+1}-t_{e})} (126)
1te+1te(TMS14AlogATs,aνe(s,a)+4TM7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(T_{M}S\sqrt{14A\log AT}\sqrt{\sum_{s,a}\nu_{e}(s,a)}+4T_{M}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSTM(1ρ)(te+1te)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\frac{CST_{M}}{(1-\rho)(t_{e+1}-t_{e})} (127)
1te+1te(TMS14AlogAT(te+1te)+4TM7(te+1te)log(te+1te))\displaystyle\leq\frac{1}{t_{e+1}-t_{e}}\left(T_{M}S\sqrt{14A\log AT}\sqrt{(t_{e+1}-t_{e})}+4T_{M}\sqrt{7(t_{e+1}-t_{e})\log(t_{e+1}-t_{e})}\right)
+CSTM(1ρ)(te+1te)\displaystyle\leavevmode\nobreak\ \leavevmode\nobreak\ +\frac{CST_{M}}{(1-\rho)(t_{e+1}-t_{e})} (128)
(TMS14AlogAT(te+1te)+4TM7log(te+1te)(te+1te))+CSTM(1ρ)(te+1te)\displaystyle\leq\left(T_{M}S\sqrt{\frac{14A\log AT}{(t_{e+1}-t_{e})}}+4T_{M}\sqrt{\frac{7\log(t_{e+1}-t_{e})}{(t_{e+1}-t_{e})}}\right)+\frac{CST_{M}}{(1-\rho)(t_{e+1}-t_{e})} (129)

where Equation (123) is obtained by averaging both sides over the epoch, i.e., summing from $t=t_{e}$ to $t=t_{e+1}-1$ and dividing by the epoch length. Equation (124) is obtained by summing the geometric series with ratio $\rho$. Equation (125) comes from the analysis used in the proof of Theorem 1. Equation (126) comes from the fact that $N_{e}(s,a)\geq\nu_{e}(s,a)$ for all $s,a$, and replacing $N_{e}(s,a)$ by this lower bound. Equation (127) follows from the Cauchy-Schwarz inequality. Equation (128) follows from the fact that the epoch length $t_{e+1}-t_{e}$ equals the total number of visits to all state-action pairs in the epoch.

Combining Equation (129) with Equation (121), we obtain the required result as follows:

ζπP~e,k\displaystyle\zeta_{\pi}^{\tilde{P}_{e},k} ζπPek,kζπP,k+ζπP,k\displaystyle\leq\zeta_{\pi}^{P_{e}^{k},k}-\zeta_{\pi}^{P,k}+\zeta_{\pi}^{P,k} (130)
(TMS14AlogAT(te+1te)+4TM7log(te+1te)(te+1te))+CSTM(1ρ)(te+1te)+ζπP,k\displaystyle\leq\left(T_{M}S\sqrt{\frac{14A\log AT}{(t_{e+1}-t_{e})}}+4T_{M}\sqrt{\frac{7\log(t_{e+1}-t_{e})}{(t_{e+1}-t_{e})}}\right)+\frac{CST_{M}}{(1-\rho)(t_{e+1}-t_{e})}+\zeta_{\pi}^{P,k} (131)
(TMS14AlogATT+4TM7log(T)T)+CSTM(1ρ)T+ζπP,k\displaystyle\leq\left(T_{M}S\sqrt{\frac{14A\log AT}{\sqrt{T}}}+4T_{M}\sqrt{\frac{7\log(\sqrt{T})}{\sqrt{T}}}\right)+\frac{CST_{M}}{(1-\rho)\sqrt{T}}+\zeta_{\pi}^{P,k} (132)
(TMS14AlogATT+4TM7log(T)T)+CSTM(1ρ)T+Ckκ\displaystyle\leq\left(T_{M}S\sqrt{\frac{14A\log AT}{\sqrt{T}}}+4T_{M}\sqrt{\frac{7\log(\sqrt{T})}{\sqrt{T}}}\right)+\frac{CST_{M}}{(1-\rho)\sqrt{T}}+C_{k}-\kappa (133)
Ck\displaystyle\leq C_{k} (134)

where Equation (132) uses the fact that the epoch length satisfies $t_{e+1}-t_{e}\geq\sqrt{T}$, Equation (133) follows from Assumption 3, and Equation (134) follows from the value of Slater's constant $\kappa$ in Assumption 3. Replicating the analysis for every $k\in[K]$, the policy $\pi$ satisfies $\zeta_{\pi}^{\tilde{P}_{e},k}\leq C_{k}$ for all $k\in[K]$, and hence the optimization problem in Equations (14)-(17) is feasible. ∎
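
The threshold argument in Equations (131)-(134) can be made concrete with a short computation: for assumed values of $T_{M}$, $S$, $A$, the mixing constants $C$ and $\rho$, and the Slater constant $\kappa$ (none of which are taken from the paper), the sketch below evaluates the perturbation term on the left-hand side of Equation (133) and reports the first horizon $T$ on a coarse grid for which it drops below $\kappa$, so that feasibility of the sampled optimization problem is guaranteed.

```python
import numpy as np

# Illustrative evaluation of the feasibility condition in Equations (131)-(134).
# All constants below are assumed toy values, not values from the paper.
T_M, S, A = 3, 5, 3            # assumed mixing time and MDP sizes
C, rho, kappa = 2.0, 0.9, 0.5  # assumed mixing constants and Slater constant

def perturbation(T):
    """Perturbation term of Equation (133), with epoch length t_{e+1} - t_e >= sqrt(T)."""
    ell = np.sqrt(T)
    return (T_M * S * np.sqrt(14 * A * np.log(A * T) / ell)
            + 4 * T_M * np.sqrt(7 * np.log(ell) / ell)
            + C * S * T_M / ((1 - rho) * ell))

# Find the first horizon on a coarse grid where the perturbation falls below kappa,
# i.e. where zeta_pi^{Ptilde_e,k} <= C_k is guaranteed. The bound is conservative,
# so the required horizon is large.
for T in 10.0 ** np.arange(4, 21):
    if perturbation(T) <= kappa:
        print(f"perturbation(T) <= kappa for T >= {T:.0e}")
        break
```
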