A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
Abstract
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility of the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on a pseudo-Q-value that is a combination of the three “Q” values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves regret that is sublinear in the total number of episodes $K$, with a bound that also depends on the number of steps $H$ in each episode, the number of states $S$, the number of actions $A$, and Slater’s constant $\delta$. Furthermore, Triple-Q guarantees zero constraint violation, both in expectation and with a high probability, when $K$ is sufficiently large. Finally, the computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so the algorithm is computationally efficient.
1 Introduction
Reinforcement learning (RL), with its success in gaming and robotics, has been widely viewed as one of the most important technologies for next-generation, AI-driven complex systems such as autonomous driving, digital healthcare, and smart cities. However, despite the significant advances (such as deep RL) over the last few decades, a major obstacle in applying RL in practice is the lack of “safety” guarantees. Here “safety” refers to a broad range of operational constraints. The objective of a traditional RL problem is to maximize the expected cumulative reward, but in practice, many applications need to be operated under a variety of constraints, such as collision avoidance in robotics and autonomous driving [18, 13, 12], legal and business restrictions in financial engineering [1], and resource and budget constraints in healthcare systems [31]. These applications with operational constraints can often be modeled as Constrained Markov Decision Processes (CMDPs), in which the agent’s goal is to learn a policy that maximizes the expected cumulative reward subject to the constraints.
Earlier studies on CMDPs assume the model is known. A comprehensive study of these early results can be found in [3]. RL for unknown CMDPs has been a topic of great interest recently because of its importance in Artificial Intelligence (AI) and Machine Learning (ML). The most noticeable advances recently are model-based RL for CMDPs, where the transition kernels are learned and used to solve the linear programming (LP) problem for the CMDP [23, 6, 15, 11], or the LP problem in the primal component of a primal-dual algorithm [21, 11]. If the transition kernel is linear, then it can be learned in a sample efficient manner even for infinite state and action spaces, and then be used in the policy evaluation and improvement in a primal-dual algorithm [9]. [9] also proposes a model-based algorithm (Algorithm 3) for the tabular setting (without assuming a linear transition kernel).
Table 1: Regret and constraint violation of RL algorithms for CMDPs.

| | Algorithm | Regret | Constraint Violation |
|---|---|---|---|
| Model-based | OPDOP [9] | | |
| | OptDual-CMDP [11] | | |
| | OptPrimalDual-CMDP [11] | | |
| Model-free | Triple-Q | | 0 |
The performance of a model-based RL algorithm depends on how accurately a model can be estimated. For some complex environments, building accurate models is challenging both computationally and data-wise [25]. For such environments, model-free RL algorithms are often more desirable. However, there has been little development on model-free RL algorithms for CMDPs with provable optimality or regret guarantees, with the exceptions of [10, 29, 7], all of which require simulators. In particular, the sample-based NPG-PD algorithm in [10] requires a simulator that can simulate the MDP from any initial state, and the algorithms in [29, 7] both require a simulator for policy evaluation. It has been argued in [4, 5, 14] that with a perfect simulator, exploration is not needed and sample efficiency can be easily achieved because the agent can query any (state, action) pair as it wishes. Unfortunately, for complex environments, building a perfect simulator is often as difficult as deriving the model for the CMDP. For those environments, sample efficiency and the exploration-exploitation tradeoff are critical and are among the most important considerations in RL algorithm design.
1.1 Main Contributions
In this paper, we consider the online learning problem of an episodic CMDP with a model-free approach and without a simulator. We develop the first model-free RL algorithm for CMDPs with sublinear regret and zero constraint violation (for large $K$). The algorithm is named Triple-Q because it has three key components: (i) a Q-function (also called action-value function) for the expected cumulative reward, (ii) a Q-function for the expected cumulative utility of the constraint, and (iii) a virtual-Queue, which (over)estimates the cumulative constraint violation so far. At each step of the current episode, after observing the current state, the agent selects an action based on a pseudo-Q-value that is a combination of the three “Q” values:
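(in assumed notation, with $Q_h$, $C_h$, $Z$, and $\eta$ as our own labels for the reward Q-function, the utility Q-function, the virtual-Queue length, and a positive scaling constant, the combination takes the form)

$$ a_h \in \arg\max_{a} \left\{ Q_h(x_h, a) + \frac{Z}{\eta}\, C_h(x_h, a) \right\}, $$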
where $\eta$ is a positive constant. Triple-Q uses UCB-exploration when learning the Q-values, where the UCB bonus and the learning rate at each update both depend on the visit count to the corresponding (state, action) pair (as in [14]). Different from the optimistic Q-learning algorithms for unconstrained MDPs (e.g. [14, 26, 28]), the learning rates in Triple-Q need to be periodically reset at the beginning of each frame, where a frame consists of a fixed number of consecutive episodes. The value of the virtual-Queue (the dual variable) is updated once in every frame. So Triple-Q can be viewed as a two-time-scale algorithm: the virtual-Queue is updated at a slow time scale, and the pseudo-Q-values for a fixed virtual-Queue are learned at a fast time scale within each frame. Furthermore, it is critical to update the two Q-functions (reward and utility) following a rule similar to SARSA [22] instead of Q-learning [27]; in other words, using the Q-values of the action that is actually taken instead of the maximum over actions.
We prove that Triple-Q achieves sublinear reward regret and guarantees zero constraint violation when the total number of episodes $K$ exceeds a threshold (which involves a factor logarithmic in $K$). Therefore, in terms of constraint violation, our bound is sharp for large $K$. To the best of our knowledge, this is the first model-free, simulator-free RL algorithm with sublinear regret and zero constraint violation. For model-based approaches, it has been shown that a model-based algorithm can achieve both sublinear regret and sublinear constraint violation (see, e.g. [11]). It remains open what the fundamental lower bound on the regret under model-free algorithms for CMDPs is, and whether the regret bound under Triple-Q is order-wise sharp or can be further improved. Table 1 summarizes the key results on the exploration-exploitation tradeoff of CMDPs in the literature.
As with many other model-free RL algorithms, a major advantage of Triple-Q is its low computational complexity. The computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so it retains both the effectiveness and the efficiency of SARSA while solving a much harder problem. While we consider a tabular setting in this paper, Triple-Q can easily incorporate function approximations (linear function approximations or neural networks) by replacing the Q-functions with their function-approximation versions, making the algorithm a very appealing approach for solving complex CMDPs in practice. We will demonstrate this by applying Deep Triple-Q, i.e. Triple-Q with neural networks, to the Dynamic Gym benchmark [30] in Section 6.
2 Problem Formulation
We consider an episodic CMDP with a finite state space of size $S$, a finite action space of size $A$, $H$ steps in each episode, and a collection of transition kernels (transition probability matrices), one for each step. At the beginning of each episode, an initial state is sampled from an initial distribution. Then, at each step $h$, the agent takes an action after observing the current state, receives a reward, and incurs a utility. The environment then moves to a new state sampled from the corresponding transition kernel. Similar to [14], we assume that the reward and utility functions are deterministic for convenience.
Given a policy, which is a collection of per-step decision rules, the reward value function at step $h$ is the expected cumulative reward from step $h$ to the end of the episode under the policy.
The (reward) Q-function at step $h$ is the expected cumulative reward when the agent starts from a state-action pair at step $h$ and then follows the policy.
Similarly, we define the utility value function and the utility Q-function at step $h$:
For simplicity, we adopt the following notation (some used in [14, 9]):
From the definitions above, the value functions and the Q-functions satisfy the standard Bellman relations.
Given the model defined above, the objective of the agent is to find a policy that maximizes the expected cumulative reward subject to a constraint on the expected utility:
(1)
where the constraint level is assumed to be positive to avoid triviality, and the expectation is taken with respect to the initial distribution.
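In assumed notation (with $\pi$ a policy, $V_1^{\pi}$ and $W_1^{\pi}$ the reward and utility value functions at step 1, $x_1$ the initial state, and $\rho$ the required utility level; these symbol choices are ours, not necessarily the paper's), problem (1) has the standard episodic-CMDP form

$$ \max_{\pi}\ \mathbb{E}\big[V_1^{\pi}(x_1)\big] \quad \text{subject to} \quad \mathbb{E}\big[W_1^{\pi}(x_1)\big] \ \ge\ \rho, $$

with the expectation taken over $x_1$ drawn from the initial distribution.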
Remark 1.
The results in the paper can be directly applied to a constraint stated as an upper bound (e.g., on a cumulative cost):
(2)
Without loss of generality, the quantity bounded in (2) can be replaced by a complementary utility, so that the constraint in (2) can be written as
(3)
where the transformed utility and constraint level are defined from the original ones.
We use superscript $*$ to denote the optimal solution to the CMDP problem defined in (1). We evaluate our model-free RL algorithm using the regret and the constraint violation defined below:
(4)
(5)
where the value functions in the sums are those of the policy used in episode $k$, and the expectation is taken with respect to the distribution of the initial state.
In this paper, we assume that the following standard Slater's condition holds.
Assumption 1.
(Slater’s Condition). Given the initial distribution, there exist a constant $\delta > 0$ and a policy under which the expected cumulative utility exceeds the required level by at least $\delta$.
In this paper, Slater’s condition simply means there exists a feasible policy that can satisfy the constraint with a slackness $\delta$. This condition has been commonly used in the literature [9, 10, 11, 19]. We call $\delta$ Slater’s constant. While the regret and constraint violation bounds depend on $\delta$, our algorithm does not need to know $\delta$ under the assumption that $K$ is large (the exact condition can be found in Theorem 1). This is a noticeable difference from some of the works on CMDPs in which the agent needs to know the value of this constant (e.g. [9]) or, alternatively, a feasible policy (e.g. [2]).
3 Triple-Q
In this section, we introduce Triple-Q for CMDPs. The design of our algorithm is based on the primal-dual approach in optimization. While RL algorithms based on the primal-dual approach have been developed for CMDPs [9, 10, 21, 11], a model-free RL algorithm with sublinear regret and zero constraint violation is new.
Given a Lagrange multiplier, we consider the Lagrangian of problem (1) from a given initial state:
(6)
which is an unconstrained MDP whose reward at each step is the original reward plus the utility weighted by the Lagrange multiplier. Assuming we solve this unconstrained MDP and obtain the optimal policy, we can then update the dual variable (the Lagrange multiplier) using a gradient method:
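As a sketch under assumed notation (Lagrange multiplier $\lambda \ge 0$, step size $\theta > 0$, constraint level $\rho$, and $W_1^{\pi_\lambda}$ the utility value of the policy $\pi_\lambda$ that solves the unconstrained MDP for the current $\lambda$; the symbols are ours), a standard dual (sub)gradient step takes the form

$$ \lambda \ \leftarrow\ \Big(\lambda - \theta\,\big(\mathbb{E}\big[W_1^{\pi_\lambda}(x_1)\big] - \rho\big)\Big)^{+}, $$

where $(\cdot)^{+}$ denotes projection onto the nonnegative reals: the multiplier increases when the constraint is violated and decreases otherwise.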
While primal-dual is a standard approach, analyzing the finite-time performance, such as regret or sample complexity, is particularly challenging. For example, over a finite learning horizon, we will not be able to exactly solve the unconstrained MDP for a given Lagrange multiplier. Therefore, we need to carefully design how often the Lagrange multiplier should be updated. If we update it too often, the algorithm may not have sufficient time to solve the unconstrained MDP, which leads to divergence; on the other hand, if we update it too slowly, the iterates converge slowly to the optimal solution, which leads to large regret and constraint violation. Another challenge is that, for a given multiplier, the primal-dual algorithm solves a problem with an objective different from the original one and does not consider any constraint violation. Therefore, even when asymptotic convergence can be established, establishing finite-time regret is still difficult because we need to evaluate the difference between the policy used at each step and the optimal policy.
Next we will show that a low-complexity primal-dual algorithm can converge and have sublinear regret and zero constraint violation when carefully designed. In particular, Triple-Q includes the following key ideas:
- A sub-gradient algorithm for estimating the Lagrange multiplier, which is updated at the beginning of each frame as follows:
  (7)
  where the update uses the sum of the utility estimates over the episodes in the previous frame. We call this quantity a virtual queue because this terminology has been widely used in stochastic networks (see e.g. [17, 24]): if we view the required utility level (plus a small slack) as the jobs that arrive at a queue within each frame and the accumulated utility as the jobs that leave the queue within each frame, then the virtual queue is the backlog of jobs waiting at the queue. Note that we add a small extra slack to the required utility; by choosing the slack appropriately, the virtual queue pessimistically estimates the constraint violation, so Triple-Q achieves zero constraint violation when the number of episodes is large.
- A carefully chosen scaling parameter so that, when the scaled virtual-queue length is used as the estimated Lagrange multiplier, it balances the trade-off between maximizing the cumulative reward and satisfying the constraint.
- Carefully chosen learning rates and Upper Confidence Bound (UCB) bonuses that guarantee the estimated Q-values do not significantly deviate from the actual Q-values. We remark that the learning rate and UCB bonus proposed for unconstrained MDPs [14] do not work here. Our learning rate depends on the number of visits to a given (state, action) pair at a particular step and decays much more slowly than the classic learning rates used in [14]. The learning rate is further reset from frame to frame, so Triple-Q can continue to learn the pseudo-Q-values, which vary from frame to frame due to the change of the virtual-Queue (the Lagrange multiplier).
We now formally introduce Triple-Q. A detailed description is presented in Algorithm 1. The algorithm only needs to know a small number of basic problem parameters; no other problem-specific values are needed. Furthermore, Triple-Q performs two Q-function updates per step, one for the reward and one for the utility, and one simple virtual-queue update per frame, so its computational complexity is similar to that of SARSA.
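For intuition, the following is a minimal tabular sketch of the ideas above; it is not a reproduction of Algorithm 1. The environment interface, the learning-rate and bonus forms, and all constants (`frame_len`, `eta`, `epsilon`, `chi`, `bonus_scale`) are placeholders of our own choosing, and the frame-boundary extra bonuses and Q-value thresholding are omitted.

```python
import numpy as np

def triple_q_sketch(env, S, A, H, K, frame_len, eta, epsilon, rho,
                    chi=1.0, bonus_scale=1.0):
    """Minimal tabular sketch of the Triple-Q ideas (NOT the paper's Algorithm 1).

    Assumed interfaces: env.reset() -> initial state index, and
    env.step(h, s, a) -> (next_state, reward, utility).
    """
    Q = np.full((H, S, A), float(H))    # optimistic reward Q-table
    C = np.full((H, S, A), float(H))    # optimistic utility Q-table
    N = np.zeros((H, S, A), dtype=int)  # per-frame visit counts
    Z = 0.0                             # virtual queue (scaled dual variable)
    frame_utility = 0.0                 # total utility collected in the current frame

    for k in range(K):
        # Frame boundary: update the virtual queue once, then reset the
        # visit counts so the learning rates restart in the new frame.
        if k > 0 and k % frame_len == 0:
            avg_utility = frame_utility / frame_len
            Z = max(Z + rho + epsilon - avg_utility, 0.0)
            frame_utility = 0.0
            N[:] = 0

        s = env.reset()
        for h in range(H):
            # Greedy action w.r.t. the pseudo-Q-value Q + (Z / eta) * C.
            a = int(np.argmax(Q[h, s] + (Z / eta) * C[h, s]))
            s_next, r, g = env.step(h, s, a)

            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (chi + 1.0) / (chi + t)          # slowly decaying rate (placeholder form)
            b = bonus_scale * np.sqrt((H ** 2) / t)  # UCB bonus (placeholder form)

            # SARSA-style targets: use the value of the action the pseudo-greedy
            # policy takes at the next state, not a max over the reward Q alone.
            if h + 1 < H:
                a_next = int(np.argmax(Q[h + 1, s_next] + (Z / eta) * C[h + 1, s_next]))
                v_r, v_c = Q[h + 1, s_next, a_next], C[h + 1, s_next, a_next]
            else:
                v_r, v_c = 0.0, 0.0

            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_r + b)
            C[h, s, a] = (1 - alpha) * C[h, s, a] + alpha * (g + v_c + b)

            frame_utility += g
            s = s_next

    return Q, C, Z
```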
The next theorem summarizes the regret and constraint violation bounds guaranteed under Triple-Q.
Theorem 1.
Assume that $K$ is sufficiently large (the precise condition depends on the problem parameters). Then Triple-Q achieves the following regret and constraint violation bounds:
If $K$ is further sufficiently large, then Triple-Q guarantees zero constraint violation both in expectation and with a high probability.
4 Proof of the Main Theorem
We now present the complete proof of the main theorem.
4.1 Notation
In the proof, we explicitly include the episode index in our notation. In particular,
- the state and the action taken at step $h$ of episode $k$;
- the reward Q-function, the utility Q-function, the virtual-Queue, and related quantities at the beginning of episode $k$;
- the visit count, the reward value-function, and the utility value-function after they are updated at step $h$ of episode $k$ (i.e. after line 9 of Triple-Q).
We also use a shorthand notation in which, when two functions take the same argument value, we put the functions inside brackets and the common argument(s) outside.
A summary of notations used throughout this paper can be found in Table 3 in the appendix.
4.2 Regret
To bound the regret, we consider the following offline optimization problem as our regret baseline [3, 20]:
(8)
s.t.: (9)
(10)
(11)
(12)
(13)
Recall that the transition kernel gives the probability of transitioning to a state upon taking an action in a state at each step. This optimization problem is a linear program (LP): the decision variable is the probability that a (state, action) pair occurs at a given step, its sum over actions is the probability that the environment is in a given state at that step, and
the ratio of the two is the probability of taking the action in the state at that step, which defines the policy. We can see that (9) is the utility constraint, (10) is the global-balance equation for the MDP, (11) is the normalization condition so that the occupancy measure is a valid probability distribution, and (12) states that the initial state is sampled from the initial distribution. Therefore, the optimal solution to this LP solves the CMDP (if the model is known), so we use the optimal solution to this LP as our baseline.
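For concreteness, a standard occupancy-measure LP matching this description, written in assumed notation (occupancy measure $q_h(x,a)$, reward $r_h$, utility $g_h$, transition kernel $\mathbb{P}_h$, initial distribution $\mu_0$, and constraint level $\rho$; these symbols are ours), is

$$
\begin{aligned}
\max_{q \ge 0}\quad & \sum_{h=1}^{H} \sum_{x,a} q_h(x,a)\, r_h(x,a) \\
\text{s.t.}\quad & \sum_{h=1}^{H} \sum_{x,a} q_h(x,a)\, g_h(x,a) \ \ge\ \rho, \\
& \sum_{a} q_{h+1}(x',a) \ =\ \sum_{x,a} \mathbb{P}_h(x' \mid x, a)\, q_h(x,a) \quad \forall\, x',\ h, \\
& \sum_{x,a} q_h(x,a) \ =\ 1 \quad \forall\, h, \qquad \sum_{a} q_1(x,a) \ =\ \mu_0(x) \quad \forall\, x,
\end{aligned}
$$

and the induced policy takes action $a$ in state $x$ at step $h$ with probability $q_h(x,a)\big/\sum_{a'} q_h(x,a')$.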
To analyze the performance of Triple-Q, we need to consider a tightened version of the LP, which is defined below:
(14)
s.t.: the constraints of the LP (8), with the utility constraint tightened by a constant $\epsilon > 0$,
where $\epsilon$ is called the tightness constant. When $\epsilon$ is no greater than Slater's constant, this problem has a feasible solution due to Slater's condition. We use superscript $*$ to denote the optimal value/policy related to the original CMDP (1) or the solution to the corresponding LP (8), and superscript $\epsilon,*$ to denote the optimal value/policy related to the $\epsilon$-tightened version of the CMDP (defined in (14)).
Following the definition of the regret in (4), we have
Now by adding and subtracting the corresponding terms, we obtain
(15)
(16)
(17)
Next, we establish the regret bound by analyzing the three terms above. We first present a brief outline.
4.2.1 Outline of the Regret Analysis
- Step 1: Term (15) is the difference between the optimal value of the original CMDP and that of its $\epsilon$-tightened version. Lemma 1 shows that this difference grows at most linearly in the tightness constant, so it can be made small by choosing the tightness constant appropriately.
- Step 2: The learned value function is an estimate of the true value function, and the estimation error (17) is controlled by the learning rates and the UCB bonuses. In Lemma 2, we will show that the cumulative estimation error over one frame is bounded. Therefore, under our choices of the learning-rate and bonus parameters, the cumulative estimation error over the $K$ episodes contributes only a sublinear term to the regret. The proof of Lemma 2 is based on a recursive formula that relates the estimation error at one step to the estimation error at the next step, similar to the one used in [14], but with different learning rates and UCB bonuses.
- Step 3: Bounding (16) is the most challenging part of the proof. For unconstrained MDPs, the optimistic Q-learning in [14] guarantees that the learned Q-function is an overestimate of the optimal Q-function (and hence also of the Q-function of any policy) for all episodes and steps simultaneously with a high probability. However, this result does not hold under Triple-Q because Triple-Q takes greedy actions with respect to the pseudo-Q-function instead of the reward Q-function. To overcome this challenge, we first add and subtract additional terms to obtain
  (18)
  (19)
  We can see that (18) is the difference of two pseudo-Q-functions. Using a three-dimensional induction (on step, episode, and frame), we will prove in Lemma 3 that the learned pseudo-Q-function is an overestimate (i.e. inequality (27) holds) for all steps, episodes, and frames simultaneously with a high probability. Since the virtual queue changes from frame to frame, Triple-Q adds the extra bonus in line 21 so that the induction can be carried out over frames.
  Finally, to bound (19), we use the Lyapunov-drift method and consider a Lyapunov function of the virtual-queue length at the beginning of each frame (a sketch of the standard quadratic form we have in mind appears right after this outline). We will show in Lemma 4 that the Lyapunov drift satisfies
  (20)
  Inequality (20) will be established by showing that Triple-Q takes actions that almost greedily reduce the virtual-Queue when the queue is large, which results in the negative drift in (20). From (20), we observe that
  (21)
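For reference, the standard quadratic form we have in mind is the following (a sketch in assumed notation; $Z_T$ denotes the virtual-queue length at the beginning of frame $T$, and the symbols are ours):

$$ L_T = \tfrac{1}{2} Z_T^2, \qquad \Delta_T = L_{T+1} - L_T, $$

so a negative conditional drift $\mathbb{E}[\Delta_T \mid Z_T]$ for large $Z_T$ keeps the virtual queue, and hence the accumulated constraint deficit, from growing without bound.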
Combining the results in the three steps above, we obtain the regret bound in Theorem 1.
4.2.2 Detailed Proof
We next present the detailed proof. The first lemma bounds the difference between the original CMDP and its $\epsilon$-tightened version. The result is intuitive because the $\epsilon$-tightened version is a small perturbation of the original problem.
Lemma 1.
Given , we have
Proof.
Given is the optimal solution, we have
Under Assumption 1, we know that there exists a feasible solution such that
We construct which satisfies that
Also we have for all Thus is a feasible solution to the -tightened optimization problem (14). Then given is the optimal solution to the -tightened optimization problem, we have
where the last inequality holds because under our assumption. Therefore the result follows because
∎
The next lemma bounds the difference between the estimated Q-functions and actual Q-functions in a frame. The bound on (17) is an immediate result of this lemma.
Lemma 2.
Under Triple-Q, we have for any
Proof.
We will prove the result for the reward Q-function; the proof for the utility Q-function is almost identical. We first establish a recursive equation that relates a Q-function to the value-functions in the earlier episodes of the same frame. Recall that under Triple-Q, the reward Q-function in an episode of a given frame is updated as follows:
Define the index of the episode in which the agent visits the given (state, action) pair at step $h$ for the $i$-th time in the current frame.
The update equation above can be written as:
Repeatedly using the equation above, we obtain
(22)
where and From the inequality above, we further obtain
(23)
The notation becomes rather cumbersome because, for each (state, action) pair, we need to consider a corresponding sequence of episode indices in which the agent visits that pair. Next we will analyze a given sample path (i.e. a specific realization of the episodes in a frame), so we simplify the notation in this proof as follows:
the index denotes the episode in which the agent visits the state-action pair for the $i$-th time. Since, on a given sample path, the visit count uniquely determines this index, the simplified notation introduces no ambiguity. Furthermore, we drop the frame index because we only consider episodes in one frame in this proof.
We note that
(24)
where the first inequality holds because each term appears in the summation on the left-hand side once for every later episode in the same frame in which the environment visits the same (state, action) pair again, and the second inequality holds due to the property of the learning rate proved in Lemma 7-(d). By substituting (24) into (23) and using Lemma 7-(b), we obtain
where the last inequality holds because (i) we have
(ii) by using Lemma 8, and (iii) we know that
where the last inequality above holds because the left-hand side is a summation whose terms are given by a decreasing function,
so the sum is maximized by picking the largest terms. Thus we can obtain
Taking the expectation on both sides yields
Then, by using the inequality repeatedly, we obtain for any
so the lemma holds.
∎
From the lemma above, we can immediately conclude:
We now focus on (16), and further expand it as follows:
(25)
(26)
where
We first show (25) can be bounded using the following lemma. This result holds because the choices of the UCB bonuses and the additional bonuses added at the beginning of each frame guarantee that is an over-estimate of for all and with a high probability.
Lemma 3.
With probability at least the following inequality holds simultaneously for all
(27)
which further implies that
(28)
Proof.
Consider a frame and the episodes in that frame. Since the value of the virtual queue does not change during a frame, we write it without the episode index. We further define/recall the following notation:
According to Lemma 9 in the appendix, we have
(29)
where inequality holds because of the concentration result in Lemma 10 in the appendix and
by using Lemma 7-(c), and equality holds because Triple-Q selects the action that maximizes so .
The inequality above suggests that we can prove the desired bound for any episode if (i) the bound holds at the beginning of the frame, and (ii) the bound holds at the corresponding step in all the previous episodes of the same frame.
We now prove the lemma using induction. We first consider and i.e. the last step in the first frame. In this case, inequality (29) becomes
(30)
Based on induction, we can first conclude that
for all steps and episodes, where the values are those before line 20, i.e. before adding the extra bonuses and thresholding the Q-values at the end of a frame. Now suppose that (27) holds for any episode in a given frame and any step, and consider
(31)
Note that the update in lines 21-23 distinguishes two cases. Here, we use superscripts to indicate the Q-values before and after lines 21-24 of Triple-Q. Therefore, at the beginning of the frame, we have
(32)
where inequality holds due to the induction assumption and the fact and holds because according to Lemma 8,
Next we bound (26) using a Lyapunov drift analysis of the virtual queue. Since the virtual queue is updated once per frame, we abuse notation and index it by the frame. We further define
Therefore, under Triple-Q, we have
(37)
Define the Lyapunov function to be
The next lemma bounds the expected Lyapunov drift conditioned on
Lemma 4.
Assume The expected Lyapunov drift satisfies
(38)
Proof.
Based on the definition, the Lyapunov drift is
where the first inequality is a result of the upper bound on in Lemma 8.
Let be a feasible solution to the tightened LP (14). Then the expected Lyapunov drift conditioned on is
(39)
Now we focus on the term inside the summation and obtain that
where the inequality holds because the action is chosen to maximize the pseudo-Q-value under Triple-Q, and the last equality holds because we use a feasible solution to the optimization problem (14), so
Therefore, we can conclude the lemma by substituting in the optimal solution. ∎
After taking the expectation, dividing both sides by the appropriate constant, and then applying the telescoping sum, we obtain
(40)
where the last inequality holds because the Lyapunov function is non-negative (and is zero initially).
Now combining Lemma 3 and inequality (40), we conclude that
Further combining the inequality above with Lemma 1 and Lemma 2, we obtain
(41)
By choosing the frame length, the tightness constant, and the remaining parameters appropriately, we conclude that when $K$ is sufficiently large,
(42)
4.3 Constraint Violation
4.3.1 Outline of the Constraint Violation Analysis
Again, we consider the value of the virtual-Queue at the beginning of each frame. According to the virtual-Queue update defined in Triple-Q, we have
which implies that
Summing the inequality above over all frames and taking the expectation on both sides, we obtain the following upper bound on the constraint violation:
(43)
where we also used the non-negativity of the virtual queue.
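The underlying argument is the standard virtual-queue bound, sketched here in assumed notation ($Z_T$ is the queue at the start of frame $T$, $\hat{g}_T$ the per-frame utility estimate entering the update, $\rho$ the constraint level, $\epsilon$ the added slack, and $T_{\max}$ the number of frames; the symbols are ours): since

$$ Z_{T+1} = \big(Z_T + \rho + \epsilon - \hat{g}_T\big)^{+} \ \ge\ Z_T + \rho + \epsilon - \hat{g}_T, $$

summing over $T = 1, \dots, T_{\max}$ and rearranging gives

$$ \sum_{T=1}^{T_{\max}} \big(\rho - \hat{g}_T\big) \ \le\ Z_{T_{\max}+1} - Z_1 - T_{\max}\,\epsilon, $$

so the accumulated constraint deficit is controlled by the final queue length minus the slack accumulated over all frames.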
In Lemma 2, we already established an upper bound on the estimation error of the utility Q-functions:
(44)
Next, we study the moment generating function of the virtual-queue length. Based on a Lyapunov drift analysis of this moment generating function and Jensen's inequality, we will establish the following upper bound, which holds for any frame:
(45)
Under our parameter choices, it can be easily verified that the accumulated slack term dominates the upper bounds in (44) and (45), which leads to the conclusion in Theorem 1 that the constraint violation becomes zero when $K$ is sufficiently large.
4.3.2 Detailed Proof
To complete the proof, we need to establish the following upper bound on the expected virtual-queue length, based on a bound on its moment generating function.
Lemma 5.
Assuming we have for any
(46)
The proof will also use the following lemma from [16].
Lemma 6.
Consider a Markov chain with a Lyapunov function and its associated drift. Suppose that the expected drift satisfies the following conditions:
- (1) There exist constants such that the expected drift is at most a negative constant whenever the Lyapunov function exceeds a given threshold.
- (2) The magnitude of the drift is bounded with probability one.
Then we have
where
Proof of Lemma 5.
We apply Lemma 6 to a new Lyapunov function:
We substitute the results from Lemmas 2 and 5 into (43) under the assumption on $K$ stated in the theorem, which guarantees the required relation among the parameters. Then, using elementary bounds, we can easily verify that
If further we have we can obtain
Now to prove the high probability bound, recall that from inequality (37), we have
(48)
According to inequality (47), we have
which implies that
(49)
where the first inequality follows from Markov's inequality.
In the proof of Lemma 2, we have shown
(50)
Following a proof similar to that of Lemma 10, we can show that the corresponding concentration bound
holds with probability at least the stated level. By iteratively using inequality (50) and summing over all frames, we conclude that, with probability at least the stated level,
(51)
where the last inequality holds by our choice of parameters.
5 Convergence and $\epsilon$-Optimal Policy
The Triple-Q algorithm is an online learning algorithm, not a stationary policy. In theory, we can obtain a near-optimal, stationary policy following the idea proposed in [14]. Assume the agent stores all the policies used during learning; note that each policy is defined by the two Q tables and the value of the virtual queue. At the end of the learning horizon, the agent constructs a stochastic policy such that, at the beginning of each episode, it selects one of the stored policies uniformly at random.
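A minimal sketch of this construction (assumed interface: each stored policy is a callable mapping a step index and a state to an action; the class and names are ours):

```python
import random

class MixturePolicy:
    """Uniform mixture over the policies stored during learning.

    At the beginning of each episode, one stored policy is drawn uniformly at
    random and used for the entire episode.
    """

    def __init__(self, stored_policies):
        self.stored_policies = list(stored_policies)
        self.current = None

    def start_episode(self):
        # Sample one stored policy uniformly at random for this episode.
        self.current = random.choice(self.stored_policies)

    def act(self, h, state):
        return self.current(h, state)
```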
We note that given any initial state
Therefore, under policy we have
and
Therefore, for any given accuracy level, the constructed mixture policy is a near-optimal policy when $K$ is sufficiently large.
While the mixture policy is near-optimal, in practice it may not be possible to store all policies during learning due to memory constraints. A heuristic approach to obtain a near-optimal, stationary policy is to fix the two Q-functions (reward and utility) after the learning horizon and continue to adapt the virtual queue with the same frame size. This is a stochastic policy whose randomness comes from the virtual queue; when the virtual queue reaches its steady state, we obtain a stationary policy. The resulting policy performs well in our experiments (see Section 6).
6 Evaluation
We evaluated Triple-Q using a grid-world environment [8]. We further implemented Triple-Q with neural network approximations, called Deep Triple-Q, for an environment with continuous state and action spaces, called the Dynamic Gym benchmark [30]. In both cases, Triple-Q and Deep Triple-Q quickly learn a safe policy with a high reward.
6.1 A Tabular Case
We first evaluated our algorithm using a grid-world environment studied in [8], shown in Figure 3-(a). The objective of the agent is to travel to the destination as quickly as possible while avoiding obstacles for safety. Hitting an obstacle incurs a cost. The destination yields a fixed reward, and the reward at any other location is the longest distance minus the Euclidean distance between that location and the destination. The cost constraint is set so that the agent is only allowed to hit the obstacles at most six times (we converted the utility constraint into a cost constraint as discussed in the paper). To account for statistical significance, the result of each experiment was averaged over multiple independent trials.
The results are shown in Figure 2, from which we can observe that Triple-Q quickly learns a well-performing policy while satisfying the safety constraint. Triple-Q-stop is a stationary policy obtained by stopping learning (i.e. fixing the Q tables) after a fixed number of training steps (note that the virtual-Queue continues to be updated, so the policy is stochastic). We can see that Triple-Q-stop has performance similar to Triple-Q, which shows that Triple-Q yields a near-optimal, stationary policy after the learning stops.
[Figure 2: Performance of Triple-Q and Triple-Q-stop in the grid-world environment.]
6.2 Triple-Q with Neural Networks
We further evaluated our algorithm on the Dynamic Gym benchmark (DynamicEnv) [30], shown in Figure 3-(b). In this environment, a point agent (one actuator for turning and another for moving) navigates a 2D map to reach the goal position while trying to avoid hazardous areas. The initial state of the agent, the goal position, and the hazards are randomly generated in each episode. At each step, the agent incurs a cost if it stays in a hazardous area; otherwise, there is no cost. The constraint is that the expected cost should not exceed 15. Since both the state and action spaces are continuous in this environment, we implemented the key ideas of Triple-Q with neural network approximations and the actor-critic method. In particular, two Q-functions are trained simultaneously, the virtual queue is updated slowly every few episodes, and the policy network is trained by optimizing the combination of the three “Q”s (Triple-Q). The details can be found in Table 2. We call this algorithm Deep Triple-Q. The simulation results in Figure 3 show that Deep Triple-Q learns a safe policy with a high reward much faster than WCSAC [30], requiring substantially fewer (millions of) training steps.
Table 2: Hyperparameters of Deep Triple-Q.

Parameter | Value
---|---
optimizer | Adam
learning rate | 
discount | 0.99
replay buffer size | 
number of hidden layers (all networks) | 2
batch size | 256
nonlinearity | ReLU
number of hidden units per layer (Critic) | 256
number of hidden units per layer (Actor) | 256
virtual queue update frequency | every 3 episodes
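For concreteness, a minimal PyTorch sketch of how the three “Q”s can be combined into the actor objective in a Deep Triple-Q-style implementation (module interfaces, names, and the sign convention are our assumptions, not the paper's code):

```python
import torch
from torch import nn, Tensor

def deep_triple_q_actor_loss(
    actor: nn.Module,
    reward_critic: nn.Module,
    cost_critic: nn.Module,
    states: Tensor,
    z: float,
    eta: float,
) -> Tensor:
    """Sketch of an actor objective in the spirit of Deep Triple-Q.

    Assumed interfaces: actor(states) -> actions, and each
    critic(states, actions) -> per-sample Q estimates. z is the current
    virtual-queue value and eta the scaling constant. Because this benchmark
    uses a cost constraint (expected cost must stay below a budget), the cost
    critic enters with a negative sign.
    """
    actions = actor(states)
    q_reward = reward_critic(states, actions)
    q_cost = cost_critic(states, actions)
    # Maximize the combined pseudo-Q-value, i.e. minimize its negation.
    return -(q_reward - (z / eta) * q_cost).mean()
```

In a full implementation, the virtual queue `z` would be updated every few episodes (cf. Table 2) based on the accumulated episode cost relative to the budget, and the two critics would be trained with standard temporal-difference losses.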
[Figure 3: (a) The grid-world environment; (b) the Dynamic Gym (DynamicEnv) benchmark and the learning curves of Deep Triple-Q and WCSAC.]
7 Conclusions
This paper considered CMDPs and proposed a model-free RL algorithm without a simulator, named Triple-Q. From a theoretical perspective, Triple-Q achieves sublinear regret and zero constraint violation. We believe it is the first model-free RL algorithm for CMDPs with provable sublinear regret and without a simulator. From an algorithmic perspective, Triple-Q has computational complexity similar to that of SARSA and can easily incorporate recent deep Q-learning algorithms to obtain a deep Triple-Q algorithm, which makes our method particularly appealing for complex and challenging CMDPs in practice.
While we only considered a single constraint in this paper, it is straightforward to extend the algorithm and the analysis to multiple constraints. Assuming there are multiple constraints in total, Triple-Q can maintain a virtual queue and a utility Q-function for each constraint and then select an action at each step by solving the following problem:
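As a sketch in assumed notation (with $Q_h$ the reward Q-function, $C_h^{(j)}$ and $Z_j$ the utility Q-function and virtual queue for constraint $j$, and $\eta$ a scaling constant; the symbols are ours), the action selection becomes

$$ a_h \in \arg\max_{a}\ \Big\{ Q_h(x_h, a) + \frac{1}{\eta} \sum_{j} Z_j\, C_h^{(j)}(x_h, a) \Big\}, $$

with each virtual queue updated once per frame from the accumulated slack of its own constraint.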
References
- [1] Naoki Abe, Prem Melville, Cezar Pendus, Chandan K Reddy, David L Jensen, Vince P Thomas, James J Bennett, Gary F Anderson, Brent R Cooley, Melissa Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 75–84, 2010.
- [2] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
- [3] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
- [4] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In Int. Conf. Machine Learning (ICML), Madison, WI, USA, 2012.
- [5] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349, June 2013.
- [6] Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.
- [7] Yi Chen, Jing Dong, and Zhaoran Wang. A primal-dual approach to constrained Markov decision processes. arXiv preprint arXiv:2101.10895, 2021.
- [8] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.
- [9] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.
- [10] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
- [11] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
- [12] Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
- [13] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
- [14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances Neural Information Processing Systems (NeurIPS), volume 31, pages 4863–4873, 2018.
- [15] Krishna C. Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon MDP with constraints. arXiv preprint arXiv:2009.11348, 2020.
- [16] M. J. Neely. Energy-aware wireless scheduling with near-optimal backlog and convergence time tradeoffs. IEEE/ACM Transactions on Networking, 24(4):2223–2236, 2016.
- [17] Michael J. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010.
- [18] Masahiro Ono, Marco Pavone, Yoshiaki Kuwata, and J Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
- [19] Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, 2019.
- [20] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- [21] Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. Upper confidence primal-dual reinforcement learning for CMDP with adversarial loss. In Advances in Neural Information Processing Systems, 2020.
- [22] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
- [23] Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
- [24] R. Srikant and Lei Ying. Communication Networks: An Optimization, Control and Stochastic Networks Perspective. Cambridge University Press, 2014.
- [25] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [26] Yuanhao Wang, Kefan Dong, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations, 2020.
- [27] Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, King’s College, Cambridge United Kingdom, May 1989.
- [28] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.
- [29] Tengyu Xu, Yingbin Liang, and Guanghui Lan. A primal approach to constrained policy optimization: Global optimality and finite-time analysis. arXiv preprint arXiv:2011.05869, 2020.
- [30] Qisong Yang, Thiago D Simão, Simon H Tindemans, and Matthijs TJ Spaan. Wcsac: Worst-case soft actor critic for safety-constrained reinforcement learning. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press, online, 2021.
- [31] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2020.
In the appendix, we summarize notations used throughout the paper in Table 3, and present a few lemmas used to prove the main theorem.
Appendix A Notation Table
Notation | Definition |
---|---|
The total number of episodes | |
The number of states | |
The number of actions | |
The length of each episode | |
Set | |
The estimated reward Q-function at step in episode | |
The reward Q-function at step in episode under policy | |
The estimated reward value-function at step in episode . | |
The value-function at step in episode under policy | |
The estimated utility Q-function at step in episode | |
The utility Q-function at step in episode under policy | |
The estimated utility value-function at step in episode | |
The utility value-function at step in episode under policy | |
The reward of (state, action) pair at step | |
The utility of (state, action) pair at step | |
The number of visits to when at step in episode (not including ) | |
The dual estimation (virtual queue) in episode | |
The optimal solution to the LP of the CMDP (8). | |
The optimal solution to the tightened LP (14). | |
Slater’s constant. | |
the UCB bonus for given | |
The indicator function |
Appendix B Useful Lemmas
The first lemma establishes some key properties of the learning rates used in Triple-Q. The proof closely follows the proof of Lemma 4.1 in [14].
Lemma 7.
Recall the learning rate used in Triple-Q and the associated aggregate weights:
(53)
The following properties hold for
-
(a)
for for
-
(b)
for for
-
(c)
-
(d)
for every
-
(e)
for every
Proof.
The proofs of (a) and (b) follow directly from the definition of the learning rate. The proof of (d) is the same as that in [14].
For we have so (c) holds for .
The next lemma establishes upper bounds on and under Triple-Q.
Lemma 8.
For any we have the following bounds on and
Proof.
We first consider the last step of an episode, i.e. Recall that for any and by its definition and Suppose for any and any Then,
where we use the number of visits to the state-action pair at this step before episode $k$ (not including episode $k$) and the index of the episode of the most recent visit. Therefore, the upper bound holds for the last step.
Now suppose the upper bound holds for the later steps and the earlier episodes, and consider step $h$ in episode $k$.
where, again, we use the number of visits to the state-action pair at this step before episode $k$ (not including episode $k$) and the index of the episode of the most recent visit. Therefore, we obtain
Therefore, we can conclude that for any and The proof for is identical. ∎
Next, we present the following lemma from [14], which establishes a recursive relationship between and for any We include the proof so the paper is self-contained.
Lemma 9.
Consider any (state, action) pair, any step, and any policy. Let $t$ be the number of visits to the pair at that step in the current frame before episode $k$, and let the corresponding indices denote the episodes in which these visits occurred. We have the following two equations:
(56)
(57) |
where the hat denotes the empirical (sample-based) counterpart of the corresponding transition operator. The same definition applies to the utility Q-function as well.
Proof.
We will prove (56). The proof for (57) is identical. Recall that under Triple-Q, is updated as follows:
From the update equation above, we have in episode
Lemma 10.
Consider any frame. Let $t$ be the number of visits to the state-action pair at the given step before episode $k$ in the current frame, and let the corresponding indices denote these episodes. Under any policy, with probability at least the specified level, the following inequalities hold simultaneously for all state-action pairs and steps:
Proof.
Without loss of generality, we consider Fix any For any define
Consider the $\sigma$-algebra generated by all the random variables up to step $h$ in episode $k$. Then
which shows that is a martingale. We also have for
Then let By applying the Azuma-Hoeffding inequality, we have with probability at least
where the last inequality holds due to Lemma 7-(e). Because this inequality holds for any fixed value, it also holds for the realized (random) one. Applying the union bound, we obtain that, with probability at least the specified level, the following inequality holds simultaneously for all state-action pairs and steps:
Following a similar analysis we also have that with probability at least the following inequality holds simultaneously for all :
∎
Lemma 11.
Given under Triple-Q, the conditional expected drift is
(60)
Proof.
Recall that and the virtual queue is updated by using
From inequality (39), we have
The first inequality holds because of the design of our algorithm. The second inequality holds because the virtual queue is non-negative and, under Slater's condition, we can find a policy that satisfies the constraint with slackness.
Finally, the last inequality is obtained similarly to (36), together with the fact that the value functions are bounded by Lemma 8. ∎