
A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes

Honghao Wei, Xin Liu, and Lei Ying
University of Michigan, Ann Arbor
Abstract

This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility for the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on the pseudo-Q-value that is a combination of the three “Q” values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts of the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves \tilde{\cal O}\left(\frac{1}{\delta}H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right) regret, where K is the total number of episodes, H is the number of steps in each episode, S is the number of states, A is the number of actions, and \delta is Slater’s constant. (Notation: f(n)=\tilde{\mathcal{O}}(g(n)) denotes f(n)={\mathcal{O}}(g(n){\log}^{k}n) with k>0; the same applies to \tilde{\Omega}. \mathbb{R}^{+} denotes the non-negative real numbers, and [H] denotes the set \{1,2,\cdots,H\}.) Furthermore, Triple-Q guarantees zero constraint violation, both on expectation and with a high probability, when K is sufficiently large. Finally, the computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, making it computationally efficient.

1 Introduction

Reinforcement learning (RL), with its success in gaming and robotics, has been widely viewed as one of the most important technologies for next-generation, AI-driven complex systems such as autonomous driving, digital healthcare, and smart cities. However, despite the significant advances (such as deep RL) over the last few decades, a major obstacle in applying RL in practice is the lack of “safety” guarantees. Here “safety” refers to a broad range of operational constraints. The objective of a traditional RL problem is to maximize the expected cumulative reward, but in practice, many applications need to be operated under a variety of constraints, such as collision avoidance in robotics and autonomous driving [18, 13, 12], legal and business restrictions in financial engineering [1], and resource and budget constraints in healthcare systems [31]. These applications with operational constraints can often be modeled as Constrained Markov Decision Processes (CMDPs), in which the agent’s goal is to learn a policy that maximizes the expected cumulative reward subject to the constraints.

Earlier studies on CMDPs assume the model is known. A comprehensive study of these early results can be found in [3]. RL for unknown CMDPs has been a topic of great interest recently because of its importance in Artificial Intelligence (AI) and Machine Learning (ML). The most noticeable advances recently are model-based RL for CMDPs, where the transition kernels are learned and used to solve the linear programming (LP) problem for the CMDP [23, 6, 15, 11], or the LP problem in the primal component of a primal-dual algorithm [21, 11]. If the transition kernel is linear, then it can be learned in a sample efficient manner even for infinite state and action spaces, and then be used in the policy evaluation and improvement in a primal-dual algorithm [9]. [9] also proposes a model-based algorithm (Algorithm 3) for the tabular setting (without assuming a linear transition kernel).

Table 1: The Exploration-Exploitation Tradeoff in Episodic CMDPs.
Algorithm | Regret | Constraint Violation
OPDOP [9] (model-based) | \tilde{\mathcal{O}}(H^{3}\sqrt{S^{2}AK}) | \tilde{\mathcal{O}}(H^{3}\sqrt{S^{2}AK})
OptDual-CMDP [11] (model-based) | \tilde{\mathcal{O}}(H^{2}\sqrt{S^{3}AK}) | \tilde{\mathcal{O}}(H^{2}\sqrt{S^{3}AK})
OptPrimalDual-CMDP [11] (model-based) | \tilde{\mathcal{O}}(H^{2}\sqrt{S^{3}AK}) | \tilde{\mathcal{O}}(H^{2}\sqrt{S^{3}AK})
Triple-Q (model-free) | \tilde{\cal O}\left(\frac{1}{\delta}H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right) | 0

The performance of a model-based RL algorithm depends on how accurately the model can be estimated. For some complex environments, building accurate models is challenging both computationally and data-wise [25]. For such environments, model-free RL algorithms are often more desirable. However, there has been little development on model-free RL algorithms for CMDPs with provable optimality or regret guarantees, with the exceptions of [10, 29, 7], all of which require simulators. In particular, the sample-based NPG-PD algorithm in [10] requires a simulator that can simulate the MDP from any initial state x, and the algorithms in [29, 7] both require a simulator for policy evaluation. It has been argued in [4, 5, 14] that with a perfect simulator, exploration is not needed and sample efficiency can be easily achieved because the agent can query any (state, action) pair as it wishes. Unfortunately, for complex environments, building a perfect simulator is often as difficult as deriving the model of the CMDP. For those environments, sample efficiency and the exploration-exploitation tradeoff are critical and are among the most important considerations in RL algorithm design.

1.1 Main Contributions

In this paper, we consider the online learning problem of an episodic CMDP with a model-free approach without a simulator. We develop the first model-free RL algorithm for CMDPs with sublinear regret and zero constraint violation (for large K). The algorithm is named Triple-Q because it has three key components: (i) a Q-function (also called action-value function) for the expected cumulative reward, denoted by Q_{h}(x,a), where h is the step index and (x,a) denotes a state-action pair, (ii) a Q-function for the expected cumulative utility for the constraint, denoted by C_{h}(x,a), and (iii) a virtual-Queue, denoted by Z, which (over)estimates the cumulative constraint violation so far. At step h in the current episode, when observing state x, the agent selects action a^{*} based on a pseudo-Q-value that is a combination of the three “Q” values:

a^{*}\in\underbrace{\arg\max_{a}Q_{h}(x,a)+\frac{Z}{\eta}C_{h}(x,a)}_{\text{pseudo-Q-value of state $(x,a)$ at step $h$}},

where \eta is a constant. Triple-Q uses UCB-exploration when learning the Q-values, where the UCB bonus and the learning rate at each update both depend on the visit count of the corresponding (state, action) pair, as in [14]. Different from the optimistic Q-learning for unconstrained MDPs (e.g. [14, 26, 28]), the learning rates in Triple-Q need to be periodically reset at the beginning of each frame, where a frame consists of K^{\alpha} consecutive episodes. The value of the virtual-Queue (the dual variable) is updated once in every frame. So Triple-Q can be viewed as a two-time-scale algorithm: the virtual-Queue is updated at a slow time scale, and Triple-Q learns the pseudo-Q-value for a fixed Z at a fast time scale within each frame. Furthermore, it is critical to update the two Q-functions (Q_{h}(x,a) and C_{h}(x,a)) following a rule similar to SARSA [22] instead of Q-learning [27], in other words, using the Q-functions of the action that is actually taken instead of using the \max function.
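As a concrete illustration of this action-selection rule, here is a minimal Python sketch (the array names Q and C for the two tabular Q-functions are hypothetical); it shows only the greedy selection step with respect to the pseudo-Q-value, not the full algorithm.

import numpy as np

def select_action(Q, C, Z, eta, h, x):
    # Q, C: arrays of shape [H, S, A] holding the reward and utility Q-values.
    # Z: current virtual-Queue value; Z / eta plays the role of the Lagrange multiplier.
    pseudo_q = Q[h, x] + (Z / eta) * C[h, x]  # combine the three "Q" values
    return int(np.argmax(pseudo_q))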

We prove that Triple-Q achieves \tilde{\cal O}\left(\frac{1}{\delta}H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right) reward regret and guarantees zero constraint violation when the total number of episodes K\geq\left(\frac{16\sqrt{SAH^{6}\iota^{3}}}{\delta}\right)^{5}, where \iota is logarithmic in K. Therefore, in terms of constraint violation, our bound is sharp for large K. To the best of our knowledge, this is the first model-free, simulator-free RL algorithm with sublinear regret and zero constraint violation. For model-based approaches, it has been shown that a model-based algorithm achieves both \tilde{\cal O}(\sqrt{H^{4}SAK}) regret and \tilde{\cal O}(\sqrt{H^{4}SAK}) constraint violation (see, e.g. [11]). It remains open what the fundamental lower bound on the regret under model-free algorithms for CMDPs is, and whether the regret bound under Triple-Q is order-wise sharp or can be further improved. Table 1 summarizes the key results on the exploration-exploitation tradeoff of CMDPs in the literature.

As with many other model-free RL algorithms, a major advantage of Triple-Q is its low computational complexity. The computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so it retains both the effectiveness and the efficiency of SARSA while solving a much harder problem. While we consider a tabular setting in this paper, Triple-Q can easily incorporate function approximation (linear function approximation or neural networks) by replacing Q(x,a) and C(x,a) with their function-approximation versions, making the algorithm a very appealing approach for solving complex CMDPs in practice. We will demonstrate this by applying Deep Triple-Q, i.e. Triple-Q with neural networks, to the Dynamic Gym benchmark [30] in Section 6.
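To illustrate how the tabular Q-functions can be replaced by function approximators, here is a small PyTorch-style Python sketch (all class and variable names are hypothetical, and this is not the exact Deep Triple-Q architecture used in Section 6): a single network with two heads approximates Q(x,\cdot) and C(x,\cdot), and the action is still chosen greedily with respect to the pseudo-Q-value.

import torch
import torch.nn as nn

class PseudoQNet(nn.Module):
    # Two-head network: one head approximates the reward Q-values, the other the utility Q-values.
    def __init__(self, state_dim, num_actions, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_head = nn.Linear(hidden, num_actions)  # approximates Q(x, .)
        self.c_head = nn.Linear(hidden, num_actions)  # approximates C(x, .)

    def forward(self, x):
        z = self.body(x)
        return self.q_head(z), self.c_head(z)

def greedy_action(net, state, Z, eta):
    # Greedy with respect to the pseudo-Q-value Q(x, a) + (Z / eta) * C(x, a).
    with torch.no_grad():
        q, c = net(state)
        return int(torch.argmax(q + (Z / eta) * c, dim=-1))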

2 Problem Formulation

We consider an episodic CMDP, denoted by (\mathcal{S},\mathcal{A},H,\mathbb{P},r,g), where \mathcal{S} is the state space with |\mathcal{S}|=S, \mathcal{A} is the action space with |\mathcal{A}|=A, H is the number of steps in each episode, and \mathbb{P}=\{\mathbb{P}_{h}\}_{h=1}^{H} is a collection of transition kernels (transition probability matrices). At the beginning of each episode, an initial state x_{1} is sampled from distribution \mu_{0}. At step h, the agent takes action a_{h} after observing state x_{h}, receives a reward r_{h}(x_{h},a_{h}), and incurs a utility g_{h}(x_{h},a_{h}). The environment then moves to a new state x_{h+1} sampled from distribution \mathbb{P}_{h}(\cdot|x_{h},a_{h}). Similar to [14], we assume for convenience that the reward and utility functions r_{h},g_{h}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1] are deterministic.

Given a policy \pi, which is a collection of H functions \{\pi_{h}:\mathcal{S}\rightarrow\mathcal{A}\}_{h=1}^{H}, the reward value function V_{h}^{\pi} at step h is the expected cumulative reward from step h to the end of the episode under policy \pi:

V_{h}^{\pi}(x)=\mathbb{E}\left[\left.\sum_{i=h}^{H}r_{i}(x_{i},\pi_{i}(x_{i}))\right|x_{h}=x\right].

The (reward) Q-function Q_{h}^{\pi}(x,a) at step h is the expected cumulative reward when the agent starts from a state-action pair (x,a) at step h and then follows policy \pi:

Q_{h}^{\pi}(x,a)=r_{h}(x,a)+\mathbb{E}\left[\left.\sum_{i=h+1}^{H}r_{i}(x_{i},\pi_{i}(x_{i}))\right|x_{h}=x,a_{h}=a\right].

Similarly, we use W_{h}^{\pi}:\mathcal{S}\rightarrow\mathbb{R}^{+} and C_{h}^{\pi}:\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}^{+} to denote the utility value function and the utility Q-function at step h:

W_{h}^{\pi}(x) =\mathbb{E}\left[\left.\sum_{i=h}^{H}g_{i}(x_{i},\pi_{i}(x_{i}))\right|x_{h}=x\right],
C_{h}^{\pi}(x,a) =g_{h}(x,a)+\mathbb{E}\left[\left.\sum_{i=h+1}^{H}g_{i}(x_{i},\pi_{i}(x_{i}))\right|x_{h}=x,a_{h}=a\right].

For simplicity, we adopt the following notation (some used in [14, 9]):

\mathbb{P}_{h}V_{h+1}^{\pi}(x,a)=\mathbb{E}_{x^{\prime}\sim\mathbb{P}_{h}(\cdot|x,a)}V^{\pi}_{h+1}(x^{\prime}),\quad Q_{h}^{\pi}(x,\pi_{h}(x))=\sum_{a}Q_{h}^{\pi}(x,a)\mathbb{P}(\pi_{h}(x)=a),
\mathbb{P}_{h}W_{h+1}^{\pi}(x,a)=\mathbb{E}_{x^{\prime}\sim\mathbb{P}_{h}(\cdot|x,a)}W^{\pi}_{h+1}(x^{\prime}),\quad C_{h}^{\pi}(x,\pi_{h}(x))=\sum_{a}C_{h}^{\pi}(x,a)\mathbb{P}(\pi_{h}(x)=a).

From the definitions above, we have

V_{h}^{\pi}(x)=Q_{h}^{\pi}(x,\pi_{h}(x)),\quad Q_{h}^{\pi}(x,a)=r_{h}(x,a)+\mathbb{P}_{h}V_{h+1}^{\pi}(x,a),

and

W_{h}^{\pi}(x)=C_{h}^{\pi}(x,\pi_{h}(x)),\quad C_{h}^{\pi}(x,a)=g_{h}(x,a)+\mathbb{P}_{h}W_{h+1}^{\pi}(x,a).

Given the model defined above, the objective of the agent is to find a policy that maximizes the expected cumulative reward subject to a constraint on the expected utility:

\underset{\pi\in\Pi}{\text{maximize}}\ \mathbb{E}\left[V_{1}^{\pi}(x_{1})\right]\quad\text{subject to: }\mathbb{E}\left[W_{1}^{\pi}(x_{1})\right]\geq\rho, (1)

where we assume \rho\in[0,H] to avoid triviality and the expectation is taken with respect to the initial distribution x_{1}\sim\mu_{0}.

Remark 1.

The results in the paper can be directly applied to a constraint in the form of

\mathbb{E}\left[W_{1}^{\pi}(x_{1})\right]\leq\rho. (2)

Without loss of generality, assume \rho\leq H. Define \tilde{g}_{h}(x,a)=1-g_{h}(x,a)\in[0,1] and \tilde{\rho}=H-\rho\geq 0; then the constraint in (2) can be written as

\mathbb{E}\left[\tilde{W}_{1}^{\pi}(x_{1})\right]\geq\tilde{\rho}, (3)

where

\mathbb{E}\left[\tilde{W}_{1}^{\pi}(x_{1})\right]=\mathbb{E}\left[\sum_{i=1}^{H}\tilde{g}_{i}(x_{i},\pi_{i}(x_{i}))\right]=H-\mathbb{E}\left[W_{1}^{\pi}(x_{1})\right].

Let \pi^{*} denote the optimal solution to the CMDP problem defined in (1). We evaluate our model-free RL algorithm using the regret and the constraint violation defined below:

\text{Regret}(K) =\mathbb{E}\left[\sum_{k=1}^{K}\left(V_{1}^{*}(x_{k,1})-V_{1}^{\pi_{k}}(x_{k,1})\right)\right], (4)
\text{Violation}(K) =\mathbb{E}\left[\sum_{k=1}^{K}\left(\rho-W_{1}^{\pi_{k}}(x_{k,1})\right)\right], (5)

where V_{1}^{*}(x)=V_{1}^{\pi^{*}}(x), \pi_{k} is the policy used in episode k, and the expectation is taken with respect to the distribution of the initial state x_{k,1}\sim\mu_{0}.
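In experiments, the expectations in (4) and (5) are typically replaced by empirical averages over the observed episodes. The small Python helper below (hypothetical, not part of Triple-Q itself) tracks the running empirical constraint violation \sum_{k}(\rho-W_{1}^{\pi_{k}}), using the realized cumulative utility of each episode as a proxy for W_{1}^{\pi_{k}}; computing the regret in (4) additionally requires the benchmark V_{1}^{*}, which is available only when the CMDP model is known.

class ViolationTracker:
    # Accumulates sum_k (rho - W_k), where W_k is the realized cumulative utility of episode k.
    def __init__(self, rho):
        self.rho = rho
        self.total_violation = 0.0

    def update(self, episode_utilities):
        # episode_utilities: list of g_h(x_h, a_h) values collected over one episode
        self.total_violation += self.rho - sum(episode_utilities)
        return self.total_violation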

In this paper, we assume that the following standard Slater's condition holds.

Assumption 1.

(Slater’s Condition). Given the initial distribution \mu_{0}, there exist \delta>0 and a policy \pi such that

\mathbb{E}\left[W_{1}^{\pi}(x_{1})\right]-\rho\geq\delta.

In this paper, Slater’s condition simply means that there exists a feasible policy that satisfies the constraint with a slackness \delta. This condition has been commonly used in the literature [9, 10, 11, 19]. We call \delta Slater’s constant. While the regret and constraint violation bounds depend on \delta, our algorithm does not need to know \delta under the assumption that K is large (the exact condition can be found in Theorem 1). This is a noticeable difference from some works on CMDPs in which the agent needs to know the value of this constant (e.g. [9]) or, alternatively, a feasible policy (e.g. [2]).

3 Triple-Q

In this section, we introduce Triple-Q for CMDPs. The design of our algorithm is based on the primal-dual approach in optimization. While RL algorithms based on the primal-dual approach have been developed for CMDPs [9, 10, 21, 11], a model-free RL algorithm with sublinear regret and zero constraint violation is new.

Given a Lagrange multiplier \lambda, we consider the Lagrangian of problem (1) from a given initial state x_{1}:

\max_{\pi}V_{1}^{\pi}(x_{1})+\lambda\left(W_{1}^{\pi}(x_{1})-\rho\right) (6)
= \max_{\pi}\mathbb{E}\left[\sum_{h=1}^{H}r_{h}(x_{h},\pi_{h}(x_{h}))+\lambda g_{h}(x_{h},\pi_{h}(x_{h}))\right]-\lambda\rho,

which is an unconstrained MDP with reward r_{h}(x_{h},\pi_{h}(x_{h}))+\lambda g_{h}(x_{h},\pi_{h}(x_{h})) at step h. Assuming we solve the unconstrained MDP and obtain the optimal policy, denoted by \pi^{*}_{\lambda}, we can then update the dual variable (the Lagrange multiplier) using a gradient method:

\lambda\leftarrow\left(\lambda+\rho-\mathbb{E}\left[W_{1}^{\pi_{\lambda}^{*}}(x_{1})\right]\right)^{+}.
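In code, one iteration of this idealized primal-dual scheme looks as follows. This is only a sketch: it assumes access to a hypothetical oracle solve_unconstrained(lam) that returns the optimal policy \pi^{*}_{\lambda} together with an estimate of \mathbb{E}[W_{1}^{\pi^{*}_{\lambda}}(x_{1})], which is exactly what is not available in the online, model-free setting studied in this paper.

def dual_update(lam, rho, expected_utility, step_size=1.0):
    # Projected (sub)gradient step: lambda <- (lambda + rho - E[W_1^{pi*_lambda}(x_1)])^+
    return max(lam + step_size * (rho - expected_utility), 0.0)

# One idealized primal-dual iteration (oracle-based, for illustration only):
# pi_lam, expected_utility = solve_unconstrained(lam)  # hypothetical oracle
# lam = dual_update(lam, rho, expected_utility)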

While primal-dual is a standard approach, analyzing its finite-time performance, such as regret or sample complexity, is particularly challenging. For example, over a finite learning horizon, we will not be able to exactly solve the unconstrained MDP for a given \lambda. Therefore, we need to carefully design how often the Lagrange multiplier should be updated. If we update it too often, the algorithm may not have sufficient time to solve the unconstrained MDP, which leads to divergence; on the other hand, if we update it too slowly, the solution converges slowly to the optimal one, which leads to large regret and constraint violation. Another challenge is that for a given \lambda, the primal-dual algorithm solves a problem whose objective differs from the original objective and does not account for constraint violation. Therefore, even when asymptotic convergence can be established, establishing finite-time regret is still difficult because we need to evaluate the difference between the policy used at each step and the optimal policy.

Next we will show that a low-complexity primal-dual algorithm can converge and have sublinear regret and zero constraint violation when carefully designed. In particular, Triple-Q includes the following key ideas:

  • A sub-gradient algorithm for estimating the Lagrange multiplier, which is updated at the beginning of each frame as follows:

    Z\leftarrow\left(Z+\rho+\epsilon-\frac{\bar{C}}{K^{\alpha}}\right)^{+}, (7)

    where (x)^{+}=\max\{x,0\} and \bar{C} is the summation of all C_{1}(x_{1},a_{1}) values over the episodes in the previous frame. We call Z a virtual queue because this is terminology that has been widely used in stochastic networks (see e.g. [17, 24]). If we view \rho+\epsilon as the number of jobs that arrive at a queue within each frame and \bar{C} as the number of jobs that leave the queue within each frame, then Z is the number of jobs waiting at the queue. Note that we add an extra utility \epsilon to \rho. By choosing \epsilon=\frac{8\sqrt{SAH^{6}\iota^{3}}}{K^{0.2}}, the virtual queue pessimistically estimates the constraint violation, so Triple-Q achieves zero constraint violation when the number of episodes is large.

  • A carefully chosen parameter \eta=K^{0.2} so that when \frac{Z}{\eta} is used as the estimated Lagrange multiplier, it balances the trade-off between maximizing the cumulative reward and satisfying the constraint.

  • Carefully chosen learning rate \alpha_{t} and Upper Confidence Bound (UCB) bonus b_{t} to guarantee that the estimated Q-values do not significantly deviate from the actual Q-values; a concrete sketch of these schedules is given right after this list. We remark that the learning rate and UCB bonus proposed for unconstrained MDPs [14] do not work here. Our learning rate is chosen to be \frac{K^{0.2}+1}{K^{0.2}+t}, where t is the number of visits to a given (state, action) pair at a particular step. This rate decays much more slowly than the classic learning rate \frac{1}{t} or the rate \frac{H+1}{H+t} used in [14]. The learning rate is further reset from frame to frame, so Triple-Q can continue to learn the pseudo-Q-values, which vary from frame to frame due to the change of the virtual-Queue (the Lagrange multiplier).
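The sketch below (Python, function names hypothetical) collects these schedules exactly as they appear in Algorithm 1: the learning rate \alpha_{t}=(\chi+1)/(\chi+t) and UCB bonus b_{t} for the t-th within-frame visit of a (state, action) pair, and the once-per-frame virtual-queue update with the tightening constant \epsilon.

import math

def triple_q_schedules(K, H, S, A, t):
    chi = K ** 0.2
    iota = 128 * math.log(math.sqrt(2 * S * A * H) * K)
    alpha_t = (chi + 1) / (chi + t)                                # learning rate (reset every frame)
    b_t = 0.25 * math.sqrt(H ** 2 * iota * (chi + 1) / (chi + t))  # UCB bonus
    eps = 8 * math.sqrt(S * A * H ** 6 * iota ** 3) / K ** 0.2     # constraint-tightening constant
    return alpha_t, b_t, eps

def update_virtual_queue(Z, rho, eps, C_bar, frame_len):
    # Z <- (Z + rho + eps - C_bar / K^alpha)^+, applied once per frame
    return max(Z + rho + eps - C_bar / frame_len, 0.0)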

We now formally introduce Triple-Q. A detailed description is presented in Algorithm 1. The algorithm only needs to know the values of H, A, S, and K; no other problem-specific values are needed. Furthermore, Triple-Q includes updates of two Q-functions per step (one for Q_{h} and one for C_{h}) and one simple virtual-queue update per frame, so its computational complexity is similar to that of SARSA.

1 Choose \chi=K^{0.2}, \eta=K^{0.2}, \iota=128\log\left(\sqrt{2SAH}K\right), \alpha=0.6, and \epsilon=\frac{8\sqrt{SAH^{6}\iota^{3}}}{K^{0.2}};
2 Initialize Q_{h}(x,a)=C_{h}(x,a)\leftarrow H and Z=\bar{C}=N_{h}(x,a)=V_{H+1}(x)=W_{H+1}(x)\leftarrow 0 for all (x,a,h)\in{\cal S}\times{\cal A}\times[H];
3 for episode k=1,\dots,K do
4       Sample the initial state for episode k: x_{1}\sim\mu_{0};
5       for step h=1,\dots,H+1 do
6             if h\leq H then  // take a greedy action based on the pseudo-Q-function
7                   Take action a_{h}\leftarrow\arg\max_{a}\left(Q_{h}(x_{h},a)+\frac{Z}{\eta}C_{h}(x_{h},a)\right);
8                   Observe r_{h}(x_{h},a_{h}), g_{h}(x_{h},a_{h}), and x_{h+1};
9                   N_{h}(x_{h},a_{h})\leftarrow N_{h}(x_{h},a_{h})+1, V_{h}(x_{h})\leftarrow Q_{h}(x_{h},a_{h}), W_{h}(x_{h})\leftarrow C_{h}(x_{h},a_{h});
10            if h\geq 2 then  // update the Q-values for (x_{h-1},a_{h-1}) after observing (x_{h},a_{h})
11                  Set t=N_{h-1}(x_{h-1},a_{h-1}), b_{t}=\frac{1}{4}\sqrt{\frac{H^{2}\iota\left(\chi+1\right)}{\chi+t}}, \alpha_{t}=\frac{\chi+1}{\chi+t};
12                  Update the reward Q-value: Q_{h-1}(x_{h-1},a_{h-1})\leftarrow(1-\alpha_{t})Q_{h-1}(x_{h-1},a_{h-1})+\alpha_{t}\left(r_{h-1}(x_{h-1},a_{h-1})+V_{h}(x_{h})+b_{t}\right);
13                  Update the utility Q-value: C_{h-1}(x_{h-1},a_{h-1})\leftarrow(1-\alpha_{t})C_{h-1}(x_{h-1},a_{h-1})+\alpha_{t}\left(g_{h-1}(x_{h-1},a_{h-1})+W_{h}(x_{h})+b_{t}\right);
15            if h=1 then
16                  \bar{C}\leftarrow\bar{C}+C_{1}(x_{1},a_{1});  // add C_{1}(x_{1},a_{1}) to \bar{C}
17      if k\mod(K^{\alpha})=0 then  // reset the visit counts, add extra bonuses, and update the virtual-queue at the beginning of each frame
18            N_{h}(x,a)\leftarrow 0, Q_{h}(x,a)\leftarrow Q_{h}(x,a)+\frac{2H^{3}\sqrt{\iota}}{\eta}, \forall(x,a,h);
19            if Q_{h}(x,a)\geq H or C_{h}(x,a)\geq H then
20                  Q_{h}(x,a)\leftarrow H and C_{h}(x,a)\leftarrow H;
21            Z\leftarrow\left(Z+\rho+\epsilon-\frac{\bar{C}}{K^{\alpha}}\right)^{+}, and \bar{C}\leftarrow 0;  // update the virtual-queue length
Algorithm 1 Triple-Q
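For readers who prefer executable code over pseudocode, the following is a compact Python transcription of Algorithm 1 for the tabular setting. It is an unofficial sketch: the environment object env and its methods reset() and step(h, x, a) (returning the reward, the utility, and the next state) are assumed interfaces rather than anything specified in the paper, and K^{\alpha} is assumed to be an integer.

import math
import numpy as np

def triple_q(env, S, A, H, K, rho):
    chi = eta = K ** 0.2
    alpha = 0.6
    iota = 128 * math.log(math.sqrt(2 * S * A * H) * K)
    eps = 8 * math.sqrt(S * A * H ** 6 * iota ** 3) / K ** 0.2
    frame_len = int(round(K ** alpha))

    Q = np.full((H, S, A), float(H))   # reward Q-values
    C = np.full((H, S, A), float(H))   # utility Q-values
    N = np.zeros((H, S, A))            # visit counts, reset every frame
    Z, C_bar = 0.0, 0.0                # virtual queue and per-frame sum of C_1(x_1, a_1)

    for k in range(1, K + 1):
        x = env.reset()
        prev = None                    # (x_{h-1}, a_{h-1}, r_{h-1}, g_{h-1})
        for h in range(H + 1):
            if h < H:
                a = int(np.argmax(Q[h, x] + (Z / eta) * C[h, x]))  # greedy in the pseudo-Q-value
                r, g, x_next = env.step(h, x, a)
                N[h, x, a] += 1
                V_h, W_h = Q[h, x, a], C[h, x, a]
                if h == 0:
                    C_bar += C[0, x, a]
            else:
                V_h, W_h = 0.0, 0.0    # V_{H+1} = W_{H+1} = 0
            if h >= 1:                 # SARSA-style update for step h-1
                xp, ap, rp, gp = prev
                t = N[h - 1, xp, ap]
                a_t = (chi + 1) / (chi + t)
                b_t = 0.25 * math.sqrt(H ** 2 * iota * (chi + 1) / (chi + t))
                Q[h - 1, xp, ap] = (1 - a_t) * Q[h - 1, xp, ap] + a_t * (rp + V_h + b_t)
                C[h - 1, xp, ap] = (1 - a_t) * C[h - 1, xp, ap] + a_t * (gp + W_h + b_t)
            if h < H:
                prev = (x, a, r, g)
                x = x_next
        if k % frame_len == 0:         # end of a frame
            N[:] = 0.0
            Q += 2 * H ** 3 * math.sqrt(iota) / eta   # extra bonus on the reward Q-values
            cap = (Q >= H) | (C >= H)
            Q[cap] = float(H)
            C[cap] = float(H)
            Z = max(Z + rho + eps - C_bar / frame_len, 0.0)
            C_bar = 0.0
    return Q, C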

The next theorem summarizes the regret and constraint violation bounds guaranteed under Triple-Q.

Theorem 1.

Assume K\geq\left(\frac{16\sqrt{SAH^{6}\iota^{3}}}{\delta}\right)^{5}, where \iota=128\log(\sqrt{2SAH}K). Triple-Q achieves the following regret and constraint violation bounds:

\text{Regret}(K) \leq\frac{13}{\delta}H^{4}\sqrt{SA\iota^{3}}K^{0.8}+\frac{4H^{4}\iota}{K^{1.2}}
\text{Violation}(K) \leq\frac{54H^{4}\iota K^{0.6}}{\delta}\log{\frac{16H^{2}\sqrt{\iota}}{\delta}}+\frac{4\sqrt{H^{2}\iota}}{\delta}K^{0.8}-5\sqrt{SAH^{6}\iota^{3}}K^{0.8}.

If we further have K\geq e^{\frac{1}{\delta}}, then \text{Violation}(K)\leq 0 and

\Pr\left(\sum^{K}_{k=1}\left(\rho-W_{1}^{\pi_{k}}(x_{k,1})\right)\leq 0\right)=1-\tilde{\mathcal{O}}\left(e^{-K^{0.2}}+\frac{1}{K^{2}}\right),

in other words, Triple-Q guarantees zero constraint violation both on expectation and with a high probability.

4 Proof of the Main Theorem

We now present the complete proof of the main theorem.

4.1 Notation

In the proof, we explicitly include the episode index in our notation. In particular,

  • x_{k,h} and a_{k,h} are the state and the action taken at step h of episode k.

  • Q_{k,h}, C_{k,h}, Z_{k}, and \bar{C}_{k} are the reward Q-function, the utility Q-function, the virtual-Queue, and the value of \bar{C} at the beginning of episode k.

  • N_{k,h}, V_{k,h}, and W_{k,h} are the visit count, reward value-function, and utility value-function after they are updated at step h of episode k (i.e. after line 9 of Triple-Q).

We also use shorthand notation

\{f-g\}(x)=f(x)-g(x),

when f(\cdot) and g(\cdot) take the same argument value. Similarly,

\{(f-g)q\}(x)=(f(x)-g(x))q(x).

In this shorthand notation, we put the functions inside \{\ \} and the common argument(s) outside.

A summary of notations used throughout this paper can be found in Table 3 in the appendix.

4.2 Regret

To bound the regret, we consider the following offline optimization problem as our regret baseline [3, 20]:

\underset{q_{h}}{\max}\quad \sum_{h,x,a}q_{h}(x,a)r_{h}(x,a) (8)
\text{s.t.:}\quad \sum_{h,x,a}q_{h}(x,a)g_{h}(x,a)\geq\rho (9)
\sum_{a}q_{h}(x,a)=\sum_{x^{\prime},a^{\prime}}\mathbb{P}_{h-1}(x|x^{\prime},a^{\prime})q_{h-1}(x^{\prime},a^{\prime}) (10)
\sum_{x,a}q_{h}(x,a)=1,\forall h\in[H] (11)
\sum_{a}q_{1}(x,a)=\mu_{0}(x) (12)
q_{h}(x,a)\geq 0,\forall x\in\mathcal{S},\forall a\in\mathcal{A},\forall h\in[H]. (13)

Recall that \mathbb{P}_{h-1}(x|x^{\prime},a^{\prime}) is the probability of transitioning to state x upon taking action a^{\prime} in state x^{\prime} at step h-1. This optimization problem is a linear program (LP), where q_{h}(x,a) is the probability that the (state, action) pair (x,a) occurs at step h, \sum_{a}q_{h}(x,a) is the probability that the environment is in state x at step h, and

\frac{q_{h}(x,a)}{\sum_{a^{\prime}}q_{h}(x,a^{\prime})}

is the probability of taking action a in state x at step h, which defines the policy. We can see that (9) is the utility constraint, (10) is the global-balance equation for the MDP, (11) is the normalization condition so that q_{h} is a valid probability distribution, and (12) states that the initial state is sampled from \mu_{0}. Therefore, the optimal solution to this LP solves the CMDP (if the model is known), so we use the optimal solution to this LP as our baseline.
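When the transition kernels are known, this baseline can be computed directly by solving the LP. The sketch below does so in Python with the cvxpy package (a tooling assumption, not something used in the paper); setting eps > 0 yields the \epsilon-tightened variant introduced next.

import cvxpy as cp
import numpy as np

def solve_cmdp_lp(P, r, g, mu0, rho, eps=0.0):
    # P: transition kernels with shape [H, S, A, S]; r, g: rewards/utilities with shape [H, S, A].
    # Returns the optimal occupancy measures q[h, x, a] of the LP (8)-(13).
    H, S, A = r.shape
    q = [cp.Variable(S * A, nonneg=True) for _ in range(H)]   # q[h][x * A + a] = q_h(x, a)
    r_flat, g_flat = r.reshape(H, S * A), g.reshape(H, S * A)
    agg = np.kron(np.eye(S), np.ones(A))                      # (S, S*A): sums q_h(x, .) over actions

    constraints = [sum(g_flat[h] @ q[h] for h in range(H)) >= rho + eps]   # (9), tightened if eps > 0
    constraints += [cp.sum(q[h]) == 1 for h in range(H)]                   # (11)
    constraints += [agg @ q[0] == mu0]                                     # (12)
    for h in range(1, H):                                                  # (10) global balance
        constraints += [agg @ q[h] == P[h - 1].reshape(S * A, S).T @ q[h - 1]]

    problem = cp.Problem(cp.Maximize(sum(r_flat[h] @ q[h] for h in range(H))), constraints)
    problem.solve()
    return np.stack([q[h].value.reshape(S, A) for h in range(H)])

The policy is then recovered from the occupancy measures as \pi_{h}(a|x)=q_{h}(x,a)/\sum_{a^{\prime}}q_{h}(x,a^{\prime}) whenever the denominator is positive.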

To analyze the performance of Triple-Q, we need to consider a tightened version of the LP, which is defined below:

\underset{q_{h}}{\max}\quad \sum_{h,x,a}q_{h}(x,a)r_{h}(x,a) (14)
\text{s.t.:}\quad \sum_{h,x,a}q_{h}(x,a)g_{h}(x,a)\geq\rho+\epsilon
(10)-(13),

where \epsilon>0 is called a tightness constant. When \epsilon\leq\delta, this problem has a feasible solution due to Slater's condition. We use superscript * to denote the optimal value/policy related to the original CMDP (1) or the solution to the corresponding LP (8), and superscript \epsilon,* to denote the optimal value/policy related to the \epsilon-tightened version of the CMDP (defined in (14)).

Following the definition of the regret in (4), we have

\hbox{Regret}(K)=\mathbb{E}\left[\sum_{k=1}^{K}V^{*}_{1}(x_{k,1})-V_{1}^{\pi_{k}}(x_{k,1})\right]=\mathbb{E}\left[\sum_{k=1}^{K}\left(\sum_{a}\left\{Q^{*}_{1}q_{1}^{*}\right\}(x_{k,1},a)\right)-Q_{1}^{\pi_{k}}(x_{k,1},a_{k,1})\right].

Now by adding and subtracting the corresponding terms, we obtain

\hbox{Regret}(K)
= \mathbb{E}\left[\sum_{k=1}^{K}\left(\sum_{a}\left\{{Q}_{1}^{*}{q}^{*}_{1}-{Q}_{1}^{\epsilon,*}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right)\right]+ (15)
\mathbb{E}\left[\sum_{k=1}^{K}\left(\sum_{a}\left\{{Q}_{1}^{\epsilon,*}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]+ (16)
\mathbb{E}\left[\sum_{k=1}^{K}\left\{Q_{k,1}-Q_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right]. (17)

Next, we establish the regret bound by analyzing the three terms above. We first present a brief outline.

4.2.1 Outline of the Regret Analysis

  • Step 1: First, by comparing the LP associated with the original CMDP (8) and the tightened LP (14), Lemma 1 will show

    \mathbb{E}\left[\sum_{a}\left\{Q_{1}^{*}q_{1}^{*}-{Q}_{1}^{\epsilon,*}{q}_{1}^{\epsilon,*}\right\}(x_{k,1},a)\right]\leq\frac{H\epsilon}{\delta},

    which implies that under our choices of \epsilon, \delta, and \iota,

    (15)\leq\frac{KH\epsilon}{\delta}=\tilde{\cal O}\left(\frac{1}{\delta}H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right).
  • Step 2: Note that Q_{k,h} is an estimate of Q^{\pi_{k}}_{h}, and the estimation error (17) is controlled by the learning rates and the UCB bonuses. In Lemma 2, we will show that the cumulative estimation error over one frame is upper bounded by

    H^{2}SA+\frac{H^{3}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{4}SA\iota K^{\alpha}(\chi+1)}.

    Therefore, under our choices of \alpha, \chi, and \iota, the cumulative estimation error over K episodes satisfies

    (17)\leq H^{2}SAK^{1-\alpha}+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}=\tilde{\cal O}\left(H^{3}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right).

    The proof of Lemma 2 is based on a recursive formula that relates the estimation error at step h to the estimation error at step h+1, similar to the one used in [14], but with different learning rates and UCB bonuses.

  • Step 3: Bounding (16) is the most challenging part of the proof. For unconstrained MDPs, the optimistic Q-learning in [14] guarantees that Q_{k,h}(x,a) is an overestimate of Q^{*}_{h}(x,a) (so also an overestimate of Q^{\epsilon,*}_{h}(x,a)) for all (x,a,h,k) simultaneously with a high probability. However, this result does not hold under Triple-Q because Triple-Q takes greedy actions with respect to the pseudo-Q-function instead of the reward Q-function. To overcome this challenge, we first add and subtract additional terms to obtain

    \mathbb{E}\left[\sum_{k=1}^{K}\left(\sum_{a}\left\{{Q}_{1}^{\epsilon,*}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]
    = \mathbb{E}\left[\sum_{k}\sum_{a}\left(\left\{{Q}_{1}^{\epsilon,*}{q}^{\epsilon,*}_{1}+\frac{Z_{k}}{\eta}C_{1}^{\epsilon,*}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-\left\{Q_{k,1}{q}^{\epsilon,*}_{1}+\frac{Z_{k}}{\eta}C_{k,1}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right)\right] (18)
    +\mathbb{E}\left[\sum_{k}\left(\sum_{a}\left\{Q_{k,1}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]+\mathbb{E}\left[\sum_{k}\frac{Z_{k}}{\eta}\sum_{a}\left\{\left(C_{k,1}-{C}^{\epsilon,*}_{1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right]. (19)

    We can see that (18) is the difference of two pseudo-Q-functions. Using a three-dimensional induction (on step, episode, and frame), we will prove in Lemma 3 that \left\{Q_{k,h}+\frac{Z_{k}}{\eta}C_{k,h}\right\}(x,a) is an overestimate of \left\{{Q}_{h}^{\epsilon,*}+\frac{Z_{k}}{\eta}C_{h}^{\epsilon,*}\right\}(x,a) (i.e. (18)\leq 0) for all (x,a,h,k) simultaneously with a high probability. Since Z_{k} changes from frame to frame, Triple-Q adds the extra bonus in line 21 so that the induction can be carried out over frames.

    Finally, to bound (19), we use the Lyapunov-drift method and consider the Lyapunov function L_{T}=\frac{1}{2}Z_{T}^{2}, where T is the frame index and Z_{T} is the value of the virtual queue at the beginning of the Tth frame. We will show in Lemma 4 that the Lyapunov drift satisfies

    \mathbb{E}[L_{T+1}-L_{T}]\leq\hbox{a negative drift}+H^{4}\iota+\epsilon^{2}-\frac{\eta}{K^{\alpha}}\sum_{k=TK^{\alpha}+1}^{(T+1)K^{\alpha}}\Phi_{k}, (20)

    where

    \Phi_{k}=\mathbb{E}\left[\left(\sum_{a}\left\{Q_{k,1}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]+\mathbb{E}\left[\frac{Z_{k}}{\eta}\sum_{a}\left\{\left(C_{k,1}-{C}^{\epsilon,*}_{1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right],

    and we note that (19)=\sum_{k}\Phi_{k}. Inequality (20) will be established by showing that Triple-Q takes actions to almost greedily reduce the virtual-Queue Z when Z is large, which results in the negative drift in (20). From (20), we observe that

    \mathbb{E}[L_{T+1}-L_{T}]\leq H^{4}\iota+\epsilon^{2}-\frac{\eta}{K^{\alpha}}\sum_{k=TK^{\alpha}+1}^{(T+1)K^{\alpha}}\Phi_{k}. (21)

    So we can bound (19) by applying the telescoping sum over the K^{1-\alpha} frames to the inequality above:

    (19)=\sum_{k}\Phi_{k}\leq\frac{K^{\alpha}\mathbb{E}\left[L_{1}-L_{K^{1-\alpha}+1}\right]}{\eta}+\frac{K(H^{4}\iota+\epsilon^{2})}{\eta}\leq\frac{K(H^{4}\iota+\epsilon^{2})}{\eta},

    where the last inequality holds because L_{1}=0 and L_{T}\geq 0 for all T. Combining the bounds on (18) and (19), we conclude that under our choices of \iota, \epsilon, and \eta,

    (16)=\tilde{\cal O}(H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}).

Combining the results in the three steps above, we obtain the regret bound in Theorem 1.
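For concreteness, substituting the parameter choices \epsilon=\frac{8\sqrt{SAH^{6}\iota^{3}}}{K^{0.2}}, \eta=\chi=K^{0.2}, and \alpha=0.6 into the three bounds above yields

\text{Regret}(K)\leq\underbrace{\frac{KH\epsilon}{\delta}}_{\text{bound on }(15)}+\underbrace{\frac{4H^{4}\iota}{\eta K}+\frac{K(H^{4}\iota+\epsilon^{2})}{\eta}}_{\text{bound on }(16)}+\underbrace{H^{2}SAK^{1-\alpha}+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}}_{\text{bound on }(17)},

and, as argued in Steps 1-3, the three groups of terms are \tilde{\cal O}\left(\frac{1}{\delta}H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right), \tilde{\cal O}\left(H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right), and \tilde{\cal O}\left(H^{3}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{\frac{4}{5}}\right), respectively; the precise constants are tracked in the detailed proof below.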

4.2.2 Detailed Proof

We next present the detailed proof. The first lemma bounds the difference between the original CMDP and its \epsilon-tightened version. The result is intuitive because the \epsilon-tightened version is a perturbation of the original problem and \epsilon\leq\delta.

Lemma 1.

Given \epsilon\leq\delta, we have

\mathbb{E}\left[\sum_{a}\left\{Q_{1}^{*}q_{1}^{*}-{Q}_{1}^{\epsilon,*}{q}_{1}^{\epsilon,*}\right\}(x_{k,1},a)\right]\leq\frac{H\epsilon}{\delta}.

\square

Proof.

Given that {q}_{h}^{*}(x,a) is the optimal solution, we have

\sum_{h,x,a}{q}_{h}^{*}(x,a)g_{h}(x,a)\geq\rho.

Under Assumption 1, we know that there exists a feasible solution \{q^{\xi_{1}}_{h}(x,a)\}_{h=1}^{H} such that

\sum_{h,x,a}q_{h}^{\xi_{1}}(x,a)g_{h}(x,a)\geq\rho+\delta.

We construct q_{h}^{\xi_{2}}(x,a)=(1-\frac{\epsilon}{\delta})q^{*}_{h}(x,a)+\frac{\epsilon}{\delta}q^{\xi_{1}}_{h}(x,a), which satisfies

\sum_{h,x,a}q_{h}^{\xi_{2}}(x,a)g_{h}(x,a) =\sum_{h,x,a}\left((1-\frac{\epsilon}{\delta})q^{*}_{h}(x,a)+\frac{\epsilon}{\delta}q^{\xi_{1}}_{h}(x,a)\right)g_{h}(x,a)\geq\rho+\epsilon,
\sum_{a}q_{h}^{\xi_{2}}(x,a) =\sum_{x^{\prime},a^{\prime}}\mathbb{P}_{h-1}(x|x^{\prime},a^{\prime})q_{h-1}^{\xi_{2}}(x^{\prime},a^{\prime}),
\sum_{x,a}q_{h}^{\xi_{2}}(x,a) =1.

We also have q_{h}^{\xi_{2}}(x,a)\geq 0 for all (h,x,a). Thus \{q^{\xi_{2}}_{h}(x,a)\}_{h=1}^{H} is a feasible solution to the \epsilon-tightened optimization problem (14). Then, given that \{{q}^{\epsilon,*}_{h}(x,a)\}_{h=1}^{H} is the optimal solution to the \epsilon-tightened optimization problem, we have

\sum_{h,x,a}\left({q}_{h}^{*}(x,a)-{q}_{h}^{\epsilon,*}(x,a)\right)r_{h}(x,a)
\leq \sum_{h,x,a}\left(q_{h}^{*}(x,a)-q_{h}^{\xi_{2}}(x,a)\right)r_{h}(x,a)
\leq \sum_{h,x,a}\left(q_{h}^{*}(x,a)-\left(1-\frac{\epsilon}{\delta}\right)q_{h}^{*}(x,a)-\frac{\epsilon}{\delta}q_{h}^{\xi_{1}}(x,a)\right)r_{h}(x,a)
\leq \sum_{h,x,a}\left(q_{h}^{*}(x,a)-\left(1-\frac{\epsilon}{\delta}\right)q_{h}^{*}(x,a)\right)r_{h}(x,a)
\leq \frac{\epsilon}{\delta}\sum_{h,x,a}q_{h}^{*}(x,a)r_{h}(x,a)
\leq \frac{H\epsilon}{\delta},

where the last inequality holds because 0\leq r_{h}(x,a)\leq 1 under our assumption. Therefore the result follows because

\sum_{a}Q_{1}^{*}(x_{k,1},a)q_{1}^{*}(x_{k,1},a)= \sum_{h,x,a}{q}_{h}^{*}(x,a)r_{h}(x,a),
\sum_{a}{Q}_{1}^{\epsilon,*}(x_{k,1},a){q}_{1}^{\epsilon,*}(x_{k,1},a)= \sum_{h,x,a}{q}_{h}^{\epsilon,*}(x,a)r_{h}(x,a).

The next lemma bounds the difference between the estimated Q-functions and actual Q-functions in a frame. The bound on (17) is an immediate result of this lemma.

Lemma 2.

Under Triple-Q, we have for any T\in[K^{1-\alpha}],

\mathbb{E}\left[\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left\{{Q}_{k,1}-Q_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right]\leq H^{2}SA+\frac{H^{3}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{4}SA\iota K^{\alpha}(\chi+1)},
\mathbb{E}\left[\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left\{{C}_{k,1}-C_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right]\leq H^{2}SA+\frac{H^{3}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{4}SA\iota K^{\alpha}(\chi+1)}.
Proof.

We will prove the result for the reward Q-function; the proof for the utility Q-function is almost identical. We first establish a recursive equation that relates a Q-function to the value functions of earlier episodes in the same frame. Recall that under Triple-Q, {Q}_{k+1,h}(x,a), where k is an episode in frame T, is updated as follows:

{Q}_{k+1,h}(x,a) =\begin{cases}(1-\alpha_{t}){Q}_{k,h}(x,a)+\alpha_{t}\left(r_{h}(x,a)+{V}_{k,h+1}(x_{k,h+1})+b_{t}\right)&\text{if $(x,a)=(x_{k,h},a_{k,h})$}\\ {Q}_{k,h}(x,a)&\text{otherwise,}\end{cases}

where t=N_{k,h}(x,a). Define k_{t} to be the index of the episode in which the agent visits (x,a) in step h for the tth time in the current frame.

The update equation above can be written as:

{Q}_{k,h}(x,a)= (1-\alpha_{t}){Q}_{k_{t},h}(x,a)+\alpha_{t}\left(r_{h}(x,a)+{V}_{k_{t},h+1}(x_{k_{t},h+1})+b_{t}\right).

Repeatedly using the equation above, we obtain

{Q}_{k,h}(x,a)= (1-\alpha_{t})(1-\alpha_{t-1}){Q}_{k_{t-1},h}(x,a)+(1-\alpha_{t})\alpha_{t-1}\left(r_{h}(x,a)+{V}_{k_{t-1},h+1}(x_{k_{t-1},h+1})+b_{t-1}\right)
+\alpha_{t}\left(r_{h}(x,a)+{V}_{k_{t},h+1}(x_{k_{t},h+1})+b_{t}\right)
= \cdots
= \alpha_{t}^{0}Q_{(T-1)K^{\alpha}+1,h}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left(r_{h}(x,a)+{V}_{k_{i},h+1}(x_{k_{i},h+1})+b_{i}\right)
\leq \alpha_{t}^{0}H+\sum_{i=1}^{t}\alpha_{t}^{i}\left(r_{h}(x,a)+{V}_{k_{i},h+1}(x_{k_{i},h+1})+b_{i}\right), (22)

where \alpha_{t}^{0}=\prod_{j=1}^{t}(1-\alpha_{j}) and \alpha_{t}^{i}=\alpha_{i}\prod_{j=i+1}^{t}(1-\alpha_{j}). From the inequality above, we further obtain

\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}{Q}_{k,h}(x,a)\leq\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\alpha_{t}^{0}H+\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\sum_{i=1}^{N_{k,h}(x,a)}\alpha_{N_{k,h}}^{i}\left(r_{h}(x,a)+{V}_{k_{i},h+1}(x_{k_{i},h+1})+b_{i}\right). (23)

The notation becomes rather cumbersome because for each (x_{k,h},a_{k,h}), we need to consider a corresponding sequence of episode indices in which the agent sees (x_{k,h},a_{k,h}). We will next analyze a given sample path (i.e. a specific realization of the episodes in a frame), so we simplify the notation in this proof as follows:

N_{k,h}= N_{k,h}(x_{k,h},a_{k,h}),
k^{(k,h)}_{i}= k_{i}(x_{k,h},a_{k,h}),

where k^{(k,h)}_{i} is the index of the episode in which the agent visits state-action pair (x_{k,h},a_{k,h}) for the ith time. Since in a given sample path (k,h) uniquely determines (x_{k,h},a_{k,h}), this notation introduces no ambiguity. Furthermore, we will replace \sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}} with \sum_{k} because we only consider episodes in frame T in this proof.

We note that

\sum_{k}\sum_{i=1}^{N_{k,h}}\alpha_{N_{k,h}}^{i}V_{k_{i}^{(k,h)},h+1}\left(x_{k_{i}^{(k,h)},h+1}\right)\leq\sum_{k}V_{k,h+1}(x_{k,h+1})\sum_{t=N_{k,h}}^{\infty}\alpha_{t}^{N_{k,h}}\leq\left(1+\frac{1}{\chi}\right)\sum_{k}V_{k,h+1}(x_{k,h+1}), (24)

where the first inequality holds because V_{k,h+1}(x_{k,h+1}) appears in the summation on the left-hand side each time the environment visits (x_{k,h},a_{k,h}) again in a later episode k^{\prime}>k of the same frame, i.e. (x_{k^{\prime},h},a_{k^{\prime},h})=(x_{k,h},a_{k,h}), and the second inequality holds due to the property of the learning rate proved in Lemma 7-(d). By substituting (24) into (23) and noting that \sum_{i=1}^{N_{k,h}(x,a)}\alpha_{N_{k,h}}^{i}=1 according to Lemma 7-(b), we obtain

\sum_{k}Q_{k,h}(x_{k,h},a_{k,h})
\leq \sum_{k}\alpha_{t}^{0}H+\sum_{k}\left(r_{h}(x_{k,h},a_{k,h})+V_{k,h+1}(x_{k,h+1})\right)+\frac{1}{\chi}\sum_{k}V_{k,h+1}(x_{k,h+1})+\sum_{k}\sum_{i=1}^{N_{k,h}}\alpha_{N_{k,h}}^{i}b_{i}
\leq \sum_{k}\left(r_{h}(x_{k,h},a_{k,h})+V_{k,h+1}(x_{k,h+1})\right)+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\frac{1}{2}\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)},

where the last inequality holds because (i) we have

\sum_{k}\alpha_{N_{k,h}}^{0}H=\sum_{k}H\mathbb{I}_{\{N_{k,h}=0\}}\leq HSA,

(ii) V_{k,h+1}(x_{k,h+1})\leq H^{2}\sqrt{\iota} by Lemma 8, and (iii) we know that

\sum_{k}\sum_{i=1}^{N_{k,h}}\alpha_{N_{k,h}}^{i}b_{i}=\frac{1}{4}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\sum_{i=1}^{N_{k,h}}\alpha_{N_{k,h}}^{i}\sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+i}}\leq\frac{1}{2}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+N_{k,h}}}
= \frac{1}{2}\sum_{x,a}\sum_{n=1}^{N_{TK^{\alpha},h}(x,a)}\sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+n}}\leq\frac{1}{2}\sum_{x,a}\sum_{n=1}^{N_{TK^{\alpha},h}(x,a)}\sqrt{\frac{H^{2}\iota(\chi+1)}{n}}\overset{(1)}{\leq}\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)},

where inequality (1) holds because its left-hand side is a summation of K^{\alpha} terms and \sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+n}} is a decreasing function of n.

Therefore, the sum is maximized when N_{TK^{\alpha},h}(x,a)=K^{\alpha}/(SA) for all (x,a), i.e. by picking the largest K^{\alpha} terms. Thus we obtain

\sum_{k}Q_{k,h}(x_{k,h},a_{k,h})-\sum_{k}Q_{h}^{\pi_{k}}(x_{k,h},a_{k,h})
\leq \sum_{k}\left(V_{k,h+1}(x_{k,h+1})-\mathbb{P}_{h}V_{h+1}^{\pi_{k}}(x_{k,h},a_{k,h})\right)+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)}
\leq \sum_{k}\left(V_{k,h+1}(x_{k,h+1})-\mathbb{P}_{h}V_{h+1}^{\pi_{k}}(x_{k,h},a_{k,h})+V_{h+1}^{\pi_{k}}(x_{k,h+1})-V_{h+1}^{\pi_{k}}(x_{k,h+1})\right)
+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)}
= \sum_{k}\left(V_{k,h+1}(x_{k,h+1})-V_{h+1}^{\pi_{k}}(x_{k,h+1})-\mathbb{P}_{h}V_{h+1}^{\pi_{k}}(x_{k,h},a_{k,h})+\hat{\mathbb{P}}_{h}^{k}V^{\pi_{k}}_{h+1}(x_{k,h},a_{k,h})\right)
+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)}
= \sum_{k}\left(Q_{k,h+1}(x_{k,h+1},a_{k,h+1})-Q_{h+1}^{\pi_{k}}(x_{k,h+1},a_{k,h+1})-\mathbb{P}_{h}V_{h+1}^{\pi_{k}}(x_{k,h},a_{k,h})+\hat{\mathbb{P}}_{h}^{k}V^{\pi_{k}}_{h+1}(x_{k,h},a_{k,h})\right)
+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)}.

Taking the expectation on both sides yields

\mathbb{E}\left[\sum_{k}Q_{k,h}(x_{k,h},a_{k,h})-\sum_{k}Q_{h}^{\pi_{k}}(x_{k,h},a_{k,h})\right]
\leq \mathbb{E}\left[\sum_{k}\left(Q_{k,h+1}(x_{k,h+1},a_{k,h+1})-Q_{h+1}^{\pi_{k}}(x_{k,h+1},a_{k,h+1})\right)\right]+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)}.

Then by using the inequality repeatedly, we obtain for any h\in[H],

\mathbb{E}\left[\sum_{k}Q_{k,h}(x_{k,h},a_{k,h})-\sum_{k}Q_{h}^{\pi_{k}}(x_{k,h},a_{k,h})\right]\leq H^{2}SA+\frac{H^{3}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{4}SA\iota K^{\alpha}(\chi+1)},

so the lemma holds.

From the lemma above, we can immediately conclude:

\mathbb{E}\left[\sum_{k=1}^{K}\left\{Q_{k,1}-Q_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right]\leq H^{2}SAK^{1-\alpha}+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)},
\mathbb{E}\left[\sum_{k=1}^{K}\left\{C_{k,1}-C_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right]\leq H^{2}SAK^{1-\alpha}+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}.

We now focus on (16), and further expand it as follows:

(16)
= \mathbb{E}\left[\sum_{k=1}^{K}\left(\sum_{a}\left\{{Q}_{1}^{\epsilon,*}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]
= \mathbb{E}\left[\sum_{k}\sum_{a}\left\{\left({F}^{\epsilon,*}_{k,1}-F_{k,1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right] (25)
+\mathbb{E}\left[\sum_{k}\left(\sum_{a}\left\{Q_{k,1}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]+\mathbb{E}\left[\sum_{k}\frac{Z_{k}}{\eta}\sum_{a}\left\{\left(C_{k,1}-{C}^{\epsilon,*}_{1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right], (26)

where

F_{k,h}(x,a) =Q_{k,h}(x,a)+\frac{Z_{k}}{\eta}C_{k,h}(x,a),
{F}_{h}^{\epsilon,*}(x,a) ={Q}_{h}^{\epsilon,*}(x,a)+\frac{Z_{k}}{\eta}C_{h}^{\epsilon,*}(x,a).

We first show (25) can be bounded using the following lemma. This result holds because the choices of the UCB bonuses and the additional bonuses added at the beginning of each frame guarantee that F_{k,h}(x,a) is an over-estimate of {F}_{h}^{\epsilon,*}(x,a) for all k, h, and (x,a) with a high probability.

Lemma 3.

With probability at least 1-\frac{1}{K^{3}}, the following inequality holds simultaneously for all (x,a,h,k)\in\mathcal{S}\times\mathcal{A}\times[H]\times[K]:

\left\{F_{k,h}-F_{h}^{\pi}\right\}(x,a)\geq 0, (27)

which further implies that

\mathbb{E}\left[\sum_{k=1}^{K}\sum_{a}\left\{\left({F}^{\epsilon,*}_{k,1}-F_{k,1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right]\leq\frac{4H^{4}\iota}{\eta K}. (28)
Proof.

Consider frame T and the episodes in frame T. Define Z=Z_{(T-1)K^{\alpha}+1} because the value of the virtual queue does not change during each frame. We further define/recall the following notation:

F_{k,h}(x,a)=Q_{k,h}(x,a)+\frac{Z}{\eta}C_{k,h}(x,a),\quad U_{k,h}(x)=V_{k,h}(x)+\frac{Z}{\eta}W_{k,h}(x),
F_{h}^{\pi}(x,a)=Q_{h}^{\pi}(x,a)+\frac{Z}{\eta}C_{h}^{\pi}(x,a),\quad U_{h}^{\pi}(x)=V_{h}^{\pi}(x)+\frac{Z}{\eta}W_{h}^{\pi}(x).

According to Lemma 9 in the appendix, we have

\{F_{k,h}-F_{h}^{\pi}\}(x,a)
= \alpha_{t}^{0}\left\{F_{(T-1)K^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)
+\sum_{i=1}^{t}\alpha_{t}^{i}\left(\left\{U_{k_{i},h+1}-U_{h+1}^{\pi}\right\}(x_{k_{i},h+1})+\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})U_{h+1}^{\pi}\}(x,a)+\left(1+\frac{Z}{\eta}\right)b_{i}\right)
\geq_{(a)} \alpha_{t}^{0}\left\{F_{(T-1)K^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left\{U_{k_{i},h+1}-U_{h+1}^{\pi}\right\}(x_{k_{i},h+1})
=_{(b)} \alpha_{t}^{0}\left\{F_{(T-1)K^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left(\max_{a}F_{k_{i},h+1}(x_{k_{i},h+1},a)-F_{h+1}^{\pi}(x_{k_{i},h+1},\pi(x_{k_{i},h+1}))\right)
\geq \alpha_{t}^{0}\left\{F_{(T-1)K^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left\{F_{k_{i},h+1}-F_{h+1}^{\pi}\right\}(x_{k_{i},h+1},\pi(x_{k_{i},h+1})), (29)

where inequality (a) holds because of the concentration result in Lemma 10 in the appendix and

\sum_{i=1}^{t}\alpha_{t}^{i}\left(1+\frac{Z}{\eta}\right)b_{i}=\frac{1}{4}\sum_{i=1}^{t}\alpha_{t}^{i}\left(1+\frac{Z}{\eta}\right)\sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+i}}\geq\frac{\eta+Z}{4\eta}\sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+t}}

by Lemma 7-(c), and equality (b) holds because Triple-Q selects the action that maximizes F_{k_{i},h+1}(x_{k_{i},h+1},a), so U_{k_{i},h+1}(x_{k_{i},h+1})=\max_{a}F_{k_{i},h+1}(x_{k_{i},h+1},a).

The inequality above suggests that we can prove \{F_{k,h}-F_{h}^{\pi}\}(x,a)\geq 0 for any (x,a) if (i)

\left\{F_{(T-1)K^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)\geq 0,

i.e. the result holds at the beginning of the frame, and (ii)

\left\{F_{k^{\prime},h+1}-F_{h+1}^{\pi}\right\}(x,a)\geq 0\quad\hbox{ for any }\quad k^{\prime}<k

and (x,a), i.e. the result holds for step h+1 in all the previous episodes of the same frame.

We now prove the lemma using induction. We first consider T=1 and h=H, i.e. the last step in the first frame. In this case, inequality (29) becomes

\{F_{k,H}-F_{H}^{\pi}\}(x,a)\geq\alpha_{t}^{0}\left\{H+\frac{Z_{1}}{\eta}H-F_{H}^{\pi}\right\}(x,a)\geq 0. (30)

Based on induction, we can first conclude that

\{F_{k,h}-F_{h}^{\pi}\}(x,a)\geq 0

for all h and k\leq K^{\alpha}+1, where \{F_{K^{\alpha}+1,h}\}_{h=1,\cdots,H} are the values before line 20, i.e. before adding the extra bonuses and thresholding the Q-values at the end of the frame. Now suppose that (27) holds for any episode k in frame T, any step h, and any (x,a). Consider

\left\{F_{TK^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)=Q_{TK^{\alpha}+1,h}(x,a)+\frac{Z_{TK^{\alpha}+1}}{\eta}C_{TK^{\alpha}+1,h}(x,a)-Q_{h}^{\pi}(x,a)-\frac{Z_{TK^{\alpha}+1}}{\eta}C_{h}^{\pi}(x,a). (31)

Note that if Q^{+}_{TK^{\alpha}+1,h}(x,a)=C^{+}_{TK^{\alpha}+1,h}(x,a)=H, then (31)\geq 0. Otherwise, from lines 21-23, we have Q^{+}_{TK^{\alpha}+1,h}(x,a)=Q^{-}_{TK^{\alpha}+1,h}(x,a)+\frac{2H^{3}\sqrt{\iota}}{\eta}<H and C^{+}_{TK^{\alpha}+1,h}(x,a)=C^{-}_{TK^{\alpha}+1,h}(x,a)<H. Here, we use superscripts - and + to indicate the Q-values before and after lines 21-24 of Triple-Q. Therefore, at the beginning of frame T+1, we have

\left\{F_{TK^{\alpha}+1,h}-F_{h}^{\pi}\right\}(x,a)= Q^{-}_{TK^{\alpha}+1,h}(x,a)+\frac{Z}{\eta}C^{-}_{TK^{\alpha}+1,h}(x,a)-Q_{h}^{\pi}(x,a)-\frac{Z}{\eta}C_{h}^{\pi}(x,a)
+ \frac{2H^{3}\sqrt{\iota}}{\eta}+\frac{Z_{TK^{\alpha}+1}-Z}{\eta}C^{-}_{TK^{\alpha}+1,h}(x,a)-\frac{Z_{TK^{\alpha}+1}-Z}{\eta}C_{h}^{\pi}(x,a)
\geq_{(a)} \frac{2H^{3}\sqrt{\iota}}{\eta}-2\frac{|Z_{TK^{\alpha}+1}-Z|}{\eta}H
\geq_{(b)} 0, (32)

where inequality (a) holds due to the induction assumption and the fact that C^{-}_{TK^{\alpha}+1,h}(x,a)<H, and (b) holds because, according to Lemma 8,

|Z_{TK^{\alpha}+1}-Z_{TK^{\alpha}}|\leq\max\left\{\rho+\epsilon,\frac{\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}C_{k,1}(x_{k,1},a_{k,1})}{K^{\alpha}}\right\}\leq H^{2}\sqrt{\iota}.

Therefore, by substituting inequality (32) into inequality (29), we obtain for any TKα+1k(T+1)Kα+1,TK^{\alpha}+1\leq k\leq(T+1)K^{\alpha}+1,

{Fk,hFhπ}(x,a)i=1tαti{Fki,h+1Fh+1π}(xki,h+1,π(xki,h+1)).\displaystyle\{F_{k,h}-F_{h}^{\pi}\}(x,a)\geq\sum_{i=1}^{t}\alpha_{t}^{i}\left\{F_{k_{i},h+1}-F_{h+1}^{\pi}\right\}\left(x_{k_{i},h+1},\pi(x_{k_{i},h+1})\right). (33)

Considering h=H,h=H, the inequality becomes

{Fk,HFHπ}(x,a)0.\displaystyle\{F_{k,H}-F_{H}^{\pi}\}(x,a)\geq 0. (34)

By applying induction on hh, we conclude that

{Fk,hFhπ}(x,a)0.\displaystyle\{F_{k,h}-F_{h}^{\pi}\}(x,a)\geq 0. (35)

holds for any TKα+1k(T+1)Kα+1,TK^{\alpha}+1\leq k\leq(T+1)K^{\alpha}+1, h,h, and (x,a),(x,a), which completes the proof of (27).

Let \cal E denote the event that (27) holds for all k,k, hh and (x,a).(x,a). Then based on Lemma 8, we conclude that

𝔼[k=1Ka{(Fk,1ϵ,Fk,1)q1ϵ,}(xk,1,a)]\displaystyle\mathbb{E}\left[\sum_{k=1}^{K}\sum_{a}\left\{\left({F}^{\epsilon,*}_{k,1}-F_{k,1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right]
=\displaystyle= 𝔼[k=1Ka{(Fk,1ϵ,Fk,1)q1ϵ,}(xk,1,a)|]Pr()\displaystyle\mathbb{E}\left[\left.\sum_{k=1}^{K}\sum_{a}\left\{\left({F}^{\epsilon,*}_{k,1}-F_{k,1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right|{\cal E}\right]\Pr({\cal E})
+𝔼[k=1Ka{(Fk,1ϵ,Fk,1)q1ϵ,}(xk,1,a)|c]Pr(c)\displaystyle+\mathbb{E}\left[\left.\sum_{k=1}^{K}\sum_{a}\left\{\left({F}^{\epsilon,*}_{k,1}-F_{k,1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right|{\cal E}^{c}\right]\Pr({\cal E}^{c})
\displaystyle\leq 2K(1+K1αH2ιη)H2ι1K34H4ιηK.\displaystyle 2K\left(1+\frac{K^{1-\alpha}H^{2}\sqrt{\iota}}{\eta}\right)H^{2}\sqrt{\iota}\frac{1}{K^{3}}\leq\frac{4H^{4}\iota}{\eta K}. (36)

Next we bound (26) using the Lyapunov drift analysis on virtual queue ZZ. Since the virtual queue is updated every frame, we abuse the notation and define ZTZ_{T} to be the virtual queue used in frame T.T. In particular, ZT=Z(T1)Kα+1.Z_{T}=Z_{(T-1)K^{\alpha}+1}. We further define

C¯T=k=(T1)Kα+1TKαCk,1(xk,1,ak,1).\bar{C}_{T}=\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}C_{k,1}(x_{k,1},a_{k,1}).

Therefore, under Triple-Q, we have

ZT+1=(ZT+ρ+ϵC¯TKα)+\displaystyle Z_{T+1}=\left(Z_{T}+\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}}\right)^{+} (37)

Define the Lyapunov function to be

LT=12ZT2.L_{T}=\frac{1}{2}Z_{T}^{2}.
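To make the frame-level bookkeeping concrete, the following Python sketch (with our own variable names, not the authors' implementation) mirrors the update (37) and the Lyapunov function L_T = Z_T^2/2.

```python
import numpy as np

def frame_queue_update(Z, rho, eps, frame_utilities):
    """One frame-level virtual-queue update, mirroring (37):
    Z_{T+1} = max(Z_T + rho + eps - C_bar_T / K^alpha, 0),
    where frame_utilities collects the estimates C_{k,1}(x_{k,1}, a_{k,1})
    over the K^alpha episodes of frame T, so its mean equals C_bar_T / K^alpha."""
    Z_next = max(Z + rho + eps - float(np.mean(frame_utilities)), 0.0)
    drift = 0.5 * Z_next ** 2 - 0.5 * Z ** 2   # L_{T+1} - L_T
    return Z_next, drift

# Example: if the empirical utility in the frame exceeds rho + eps, the queue shrinks.
Z_next, drift = frame_queue_update(Z=1.0, rho=3.0, eps=0.1, frame_utilities=[3.5, 3.2, 3.4])
```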

The next lemma bounds the expected Lyapunov drift conditioned on ZT.Z_{T}.

Lemma 4.

Assume ϵδ.\epsilon\leq\delta. The expected Lyapunov drift satisfies

𝔼[LT+1LT|ZT=z]\displaystyle\mathbb{E}\left[L_{T+1}-L_{T}|Z_{T}=z\right]
\displaystyle\leq 1Kαk=(T1)Kα+1TKα(η𝔼[a{Qk,1q1ϵ,}(xk,1,a)Qk,1(xk,1,ak,1)|ZT=z]\displaystyle\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left(-\eta\mathbb{E}\left[\left.\sum_{a}\left\{Q_{k,1}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]\right.
+z𝔼[a{(C1ϵ,Ck,1)q1ϵ,}(xk,1,a)|ZT=z])+H4ι+ϵ2.\displaystyle\left.+z\mathbb{E}\left[\left.\sum_{a}\left\{\left({C}^{\epsilon,*}_{1}-C_{k,1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right|Z_{T}=z\right]\right)+H^{4}\iota+\epsilon^{2}. (38)
Proof.

Based on the definition of LT,L_{T}, the Lyapunov drift is

LT+1LT\displaystyle L_{T+1}-L_{T}\leq ZT(ρ+ϵC¯TKα)+(C¯TKα+ϵρ)22\displaystyle Z_{T}\left(\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}}\right)+\frac{\left(\frac{\bar{C}_{T}}{K^{\alpha}}+\epsilon-\rho\right)^{2}}{2}
\displaystyle\leq ZT(ρ+ϵC¯TKα)+H4ι+ϵ2\displaystyle Z_{T}\left(\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}}\right)+H^{4}\iota+\epsilon^{2}
\displaystyle\leq \frac{Z_{T}}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left(\rho+\epsilon-C_{k,1}(x_{k,1},a_{k,1})\right)+H^{4}\iota+\epsilon^{2},

where the second inequality follows from the upper bound on |C_{k,1}(x_{k,1},a_{k,1})| in Lemma 8, and the last step uses the definition of \bar{C}_{T}.

Let {qhϵ}h=1H\{q^{\epsilon}_{h}\}_{h=1}^{H} be a feasible solution to the tightened LP (14). Then the expected Lyapunov drift conditioned on ZT=zZ_{T}=z is

𝔼[LT+1LT|ZT=z]\displaystyle\mathbb{E}\left[L_{T+1}-L_{T}|Z_{T}=z\right]
\displaystyle\leq 1Kαk=(T1)Kα+1TKα(𝔼[z(ρ+ϵCk,1(xk,1,ak,1))ηQk,1(xk,1,ak,1)|ZT=z]+η𝔼[Qk,1(xk,1,ak,1)|ZT=z])\displaystyle\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left(\mathbb{E}\left[\left.z\left(\rho+\epsilon-C_{k,1}(x_{k,1},a_{k,1})\right)-\eta Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]+\eta\mathbb{E}\left[\left.Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]\right)
+H4ι+ϵ2.\displaystyle+H^{4}\iota+\epsilon^{2}. (39)

Now we focus on the term inside the summation and obtain that

(𝔼[z(ρ+ϵCk,1(xk,1,ak,1))ηQk,1(xk,1,ak,1)|ZT=z]+η𝔼[Qk,1(xk,1,ak,1)|ZT=z])\displaystyle\left(\mathbb{E}\left[\left.z\left(\rho+\epsilon-C_{k,1}(x_{k,1},a_{k,1})\right)-\eta Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]+\eta\mathbb{E}\left[\left.Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]\right)
\displaystyle\leq z(a)(ρ+ϵ)𝔼[η(a{zηCk,1q1ϵ+Qk,1q1ϵ}(xk,1,a))|ZT=z]+η𝔼[Qk,1(xk,1,ak,1)|ZT=z]{}_{(a)}z(\rho+\epsilon)-\mathbb{E}\left[\left.\eta\left(\sum_{a}\left\{\frac{z}{\eta}C_{k,1}q^{\epsilon}_{1}+Q_{k,1}q^{\epsilon}_{1}\right\}(x_{k,1},a)\right)\right|Z_{T}=z\right]+\eta\mathbb{E}\left[\left.Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]
=\displaystyle= 𝔼[z(ρ+ϵaCk,1(xk,1,a)q1ϵ(xk,1,a))|ZT=z]\displaystyle\mathbb{E}\left[\left.z\left(\rho+\epsilon-\sum_{a}C_{k,1}(x_{k,1},a)q^{\epsilon}_{1}(x_{k,1},a)\right)\right|Z_{T}=z\right]
𝔼[ηaQk,1(xk,1,a)q1ϵ(xk,1,a)ηQk,1(xk,1,ak,1)|ZT=z]\displaystyle-\mathbb{E}\left[\left.\eta\sum_{a}Q_{k,1}(x_{k,1},a)q^{\epsilon}_{1}(x_{k,1},a)-\eta Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]
=\displaystyle= 𝔼[z(ρ+ϵaC1ϵ(xk,1,a)q1ϵ(xk,1,a))|ZT=z]\displaystyle\mathbb{E}\left[z\left(\left.\rho+\epsilon-\sum_{a}C^{\epsilon}_{1}(x_{k,1},a)q^{\epsilon}_{1}(x_{k,1},a)\right)\right|Z_{T}=z\right]
𝔼[ηaQk,1(xk,1,a)q1ϵ(xk,1,a)ηQk,1(xk,1,ak,1)|ZT=z]+𝔼[za{(C1ϵCk,1)q1ϵ}(xk,1,a)|ZT=z]\displaystyle-\mathbb{E}\left[\left.\eta\sum_{a}Q_{k,1}(x_{k,1},a)q^{\epsilon}_{1}(x_{k,1},a)-\eta Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]+\mathbb{E}\left[\left.z\sum_{a}\left\{(C^{\epsilon}_{1}-C_{k,1})q^{\epsilon}_{1}\right\}(x_{k,1},a)\right|Z_{T}=z\right]
\displaystyle\leq η𝔼[aQk,1(xk,1,a)q1ϵ(xk,1,a)Qk,1(xk,1,ak,1)|ZT=z]+𝔼[za{(C1ϵCk,1)q1ϵ}(xk,1,a)|ZT=z],\displaystyle-\eta\mathbb{E}\left[\left.\sum_{a}Q_{k,1}(x_{k,1},a)q^{\epsilon}_{1}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right|Z_{T}=z\right]+\mathbb{E}\left[\left.z\sum_{a}\left\{(C^{\epsilon}_{1}-C_{k,1})q^{\epsilon}_{1}\right\}(x_{k,1},a)\right|Z_{T}=z\right],

where inequality (a)(a) holds because ak,ha_{k,h} is chosen to maximize Qk,h(xk,h,a)+ZTηCk,h(xk,h,a)Q_{k,h}(x_{k,h},a)+\frac{Z_{T}}{\eta}C_{k,h}(x_{k,h},a) under Triple-Q, and the last equality holds due to that {qhϵ(x,a)}h=1H\{q^{\epsilon}_{h}(x,a)\}_{h=1}^{H} is a feasible solution to the optimization problem (14), so

(ρ+ϵaC1ϵ(xk,1,a)q1ϵ(xk,1,a))\displaystyle\left(\rho+\epsilon-\sum_{a}{C}_{1}^{\epsilon}(x_{k,1},a)q^{\epsilon}_{1}(x_{k,1},a)\right) =(ρ+ϵh,x,agh(x,a)qhϵ(x,a))0.\displaystyle=\left(\rho+\epsilon-\sum_{h,x,a}g_{h}(x,a){q}^{\epsilon}_{h}(x,a)\right)\leq 0.

Therefore, we can conclude the lemma by substituting qhϵ(x,a){q}^{\epsilon}_{h}(x,a) with the optimal solution qhϵ,(x,a){q}^{\epsilon,*}_{h}(x,a). ∎

After taking the expectation with respect to Z_{T}, multiplying both sides by K^{\alpha}, dividing by \eta, rearranging terms, and applying a telescoping sum over the frames, we obtain

𝔼[k=1K(a{Qk,1q1ϵ,}(xk,1,a)Qk,1(xk,1,ak,1))]+𝔼[k=1KZkηa{(Ck,1C1ϵ,)q1ϵ,}(xk,1,a)]\displaystyle\mathbb{E}\left[\sum_{k=1}^{K}\left(\sum_{a}\left\{Q_{k,1}{q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)-Q_{k,1}(x_{k,1},a_{k,1})\right)\right]+\mathbb{E}\left[\sum_{k=1}^{K}\frac{Z_{k}}{\eta}\sum_{a}\left\{\left(C_{k,1}-{C}^{\epsilon,*}_{1}\right){q}^{\epsilon,*}_{1}\right\}(x_{k,1},a)\right]
\displaystyle\leq Kα𝔼[L1LK1α+1]η+K(H4ι+ϵ2)ηK(H4ι+ϵ2)η,\displaystyle\frac{K^{\alpha}\mathbb{E}\left[L_{1}-L_{K^{1-\alpha}+1}\right]}{\eta}+\frac{K\left(H^{4}\iota+\epsilon^{2}\right)}{\eta}\leq\frac{K\left(H^{4}\iota+\epsilon^{2}\right)}{\eta}, (40)

where the last inequality holds because L_{1}=0 and L_{T+1} is non-negative.

Now combining Lemma 3 and inequality (40), we conclude that

(16)K(H4ι+ϵ2)η+4H4ιηK.\displaystyle\eqref{step(i)}\leq\frac{K\left(H^{4}\iota+\epsilon^{2}\right)}{\eta}+\frac{4H^{4}\iota}{\eta K}.

Further combining the inequality above with Lemma 1 and Lemma 2, we obtain

Regret(K)KHϵδ+H2SAK1α+H3ιKχ+H4SAιK2α(χ+1)+K(H4ι+ϵ2)η+4H4ιηK.\displaystyle\text{Regret}(K)\leq\frac{KH\epsilon}{\delta}+H^{2}SAK^{1-\alpha}+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}+\frac{K\left(H^{4}\iota+\epsilon^{2}\right)}{\eta}+\frac{4H^{4}\iota}{\eta K}. (41)

By choosing \alpha=0.6, i.e. each frame has K^{0.6} episodes, \chi=K^{0.2}, \eta=K^{0.2}, and \epsilon=\frac{8\sqrt{SAH^{6}\iota^{3}}}{K^{0.2}}, we conclude that when K\geq\left(\frac{8\sqrt{SAH^{6}\iota^{3}}}{\delta}\right)^{5}, which guarantees that \epsilon\leq\delta (as required in Lemma 4), we have

Regret(K)13δH4SAι3K0.8+4H4ιK1.2=𝒪~(1δH4S12A12K0.8).\displaystyle\text{Regret}(K)\leq\frac{13}{\delta}H^{4}{\sqrt{SA\iota^{3}}}{K^{0.8}}+\frac{4H^{4}\iota}{K^{1.2}}=\tilde{\cal O}\left(\frac{1}{\delta}H^{4}S^{\frac{1}{2}}A^{\frac{1}{2}}K^{0.8}\right). (42)
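As a sanity check on how (41) collapses to (42), substituting \alpha=0.6, \chi=\eta=K^{0.2}, and \epsilon=\frac{8\sqrt{SAH^{6}\iota^{3}}}{K^{0.2}} into (41) term by term gives

\frac{KH\epsilon}{\delta}=\frac{8H^{4}\sqrt{SA\iota^{3}}}{\delta}K^{0.8},\quad H^{2}SAK^{1-\alpha}=H^{2}SAK^{0.4},\quad\frac{H^{3}\sqrt{\iota}K}{\chi}=H^{3}\sqrt{\iota}K^{0.8},

\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}\leq\sqrt{2H^{4}SA\iota K^{1.6}}=\sqrt{2}H^{2}\sqrt{SA\iota}K^{0.8},\quad\frac{K(H^{4}\iota+\epsilon^{2})}{\eta}=H^{4}\iota K^{0.8}+64SAH^{6}\iota^{3}K^{0.4},

and the remaining term \frac{4H^{4}\iota}{\eta K}=\frac{4H^{4}\iota}{K^{1.2}}. Every term therefore grows at most as K^{0.8} in K, and collecting the dominant factors (using K\geq\left(\frac{8\sqrt{SAH^{6}\iota^{3}}}{\delta}\right)^{5} and \delta\leq 1 to absorb the lower-order terms) gives (42).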

4.3 Constraint Violation

4.3.1 Outline of the Constraint Violation Analysis

Again, we use ZTZ_{T} to denote the value of virtual-Queue in frame T.T. According to the virtual-Queue update defined in Triple-Q, we have

ZT+1=(ZT+ρ+ϵC¯TKα)+ZT+ρ+ϵC¯TKα,\displaystyle Z_{T+1}=\left(Z_{T}+\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}}\right)^{+}\geq Z_{T}+\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}},

which implies that

k=(T1)Kα+1TKα(C1πk(xk,1,ak,1)+ρ)Kα(ZT+1ZT)+k=(T1)Kα+1TKα({Ck,1C1πk}(xk,1,ak,1)ϵ).\displaystyle\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left(-C_{1}^{\pi_{k}}(x_{k,1},a_{k,1})+\rho\right)\leq K^{\alpha}\left(Z_{T+1}-Z_{T}\right)+\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\left(\left\{C_{k,1}-C_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})-\epsilon\right).

Summing the inequality above over all frames and taking expectation on both sides, we obtain the following upper bound on the constraint violation:

𝔼[k=1KρC1πk(xk,1,ak,1)]Kϵ+Kα𝔼[ZK1α+1]+𝔼[k=1K{Ck,1C1πk}(xk,1,ak,1)],\displaystyle\mathbb{E}\left[\sum^{K}_{k=1}\rho-C_{1}^{\pi_{k}}(x_{k,1},a_{k,1})\right]\leq-K\epsilon+K^{\alpha}\mathbb{E}\left[Z_{K^{1-\alpha}+1}\right]+\mathbb{E}\left[\sum_{k=1}^{K}\left\{C_{k,1}-C_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right], (43)

where we used the fact Z1=0.Z_{1}=0.

In Lemma 2, we already established an upper bound on the estimation error of Ck,1:C_{k,1}:

𝔼[k=1K{Ck,1C1πk}(xk,1,ak,1)]H2SAK1α+H3ιKχ+H4SAιK2α(χ+1).\displaystyle\mathbb{E}\left[\sum_{k=1}^{K}\left\{C_{k,1}-C_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1})\right]\leq H^{2}SAK^{1-\alpha}+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}. (44)

Next, we study the moment generating function of ZT,Z_{T}, i.e. 𝔼[erZT]\mathbb{E}\left[e^{rZ_{T}}\right] for some r>0.r>0. Based on a Lyapunov drift analysis of this moment generating function and Jensen’s inequality, we will establish the following upper bound on ZTZ_{T} that holds for any 1TK1α+11\leq T\leq K^{1-\alpha}+1

𝔼[ZT]\displaystyle\mathbb{E}[Z_{T}]\leq 54H4ιδlog(16H2ιδ)+16H2ιK2δ+4ηH2ιδ.\displaystyle\frac{54H^{4}\iota}{\delta}\log\left(\frac{16H^{2}\sqrt{\iota}}{\delta}\right)+\frac{16H^{2}\iota}{K^{2}\delta}+\frac{4\eta\sqrt{H^{2}\iota}}{\delta}. (45)

Under our choices of \epsilon, \alpha, \chi, \eta and \iota, it can be easily verified that K\epsilon dominates the upper bounds in (44) and (45), which leads to the conclusion that the constraint violation becomes zero when K is sufficiently large, as stated in Theorem 1.

4.3.2 Detailed Proof

To complete the proof, we need to establish the following upper bound on \mathbb{E}[Z_{T}] based on a bound on the moment generating function.

Lemma 5.

Assuming ϵδ2,\epsilon\leq\frac{\delta}{2}, we have for any 1TK1α1\leq T\leq K^{1-\alpha}

𝔼[ZT]\displaystyle\mathbb{E}[Z_{T}]\leq 54H4ιδlog(16H2ιδ)+16H2ιK2δ+4ηH2ιδ.\displaystyle\frac{54H^{4}\iota}{\delta}\log\left(\frac{16H^{2}\sqrt{\iota}}{\delta}\right)+\frac{16H^{2}\iota}{K^{2}\delta}+\frac{4\eta\sqrt{H^{2}\iota}}{\delta}. (46)

The proof will also use the following lemma from [16].

Lemma 6.

Let StS_{t} be the state of a Markov chain, LtL_{t} be a Lyapunov function with L0=l0,L_{0}=l_{0}, and its drift Δt=Lt+1Lt.\Delta_{t}=L_{t+1}-L_{t}. Given the constant γ\gamma and vv with 0<γv,0<\gamma\leq v, suppose that the expected drift 𝔼[Δt|St=s]\mathbb{E}[\Delta_{t}|S_{t}=s] satisfies the following conditions:

  • (1)

    There exists constant γ>0\gamma>0 and θt>0\theta_{t}>0 such that 𝔼[Δt|St=s]γ\mathbb{E}[\Delta_{t}|S_{t}=s]\leq-\gamma when Ltθt.L_{t}\geq\theta_{t}.

  • (2)

    |Lt+1Lt|v|L_{t+1}-L_{t}|\leq v holds with probability one.

Then we have

𝔼[erLt]erl0+2er(v+θt)rγ,\mathbb{E}[e^{rL_{t}}]\leq e^{rl_{0}}+\frac{2e^{r(v+\theta_{t})}}{r\gamma},

where r=γv2+vγ/3.r=\frac{\gamma}{v^{2}+v\gamma/3}. \square

Proof of Lemma 5.

We apply Lemma 6 to a new Lyapunov function:

L¯T=ZT.\bar{L}_{T}=Z_{T}.

To verify condition (1) in Lemma 6, consider L¯T=ZTθT=4(4H2ιK2+ηH2ι+H4ι+ϵ2)δ\bar{L}_{T}=Z_{T}\geq\theta_{T}=\frac{4(\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2})}{\delta} and 2ϵδ.2\epsilon\leq\delta. The conditional expected drift of L¯T\bar{L}_{T} is

𝔼[ZT+1ZT|ZT=z]\displaystyle\mathbb{E}\left[Z_{T+1}-Z_{T}|Z_{T}=z\right]
=\displaystyle= 𝔼[ZT+12z2|ZT=z]\displaystyle\mathbb{E}\left[\left.\sqrt{Z_{T+1}^{2}}-\sqrt{z^{2}}\right|Z_{T}=z\right]
\displaystyle\leq 12z𝔼[ZT+12z2|ZT=z]\displaystyle\frac{1}{2z}\mathbb{E}\left[\left.Z_{T+1}^{2}-z^{2}\right|Z_{T}=z\right]
(a)\displaystyle\leq_{(a)} δ2+4H2ιK2+ηH2ι+H4ι+ϵ2z\displaystyle-\frac{\delta}{2}+\frac{\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2}}{z}
\displaystyle\leq δ2+4H2ιK2+ηH2ι+H4ι+ϵ2θT\displaystyle-\frac{\delta}{2}+\frac{\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2}}{\theta_{T}}
=\displaystyle= δ4,\displaystyle-\frac{\delta}{4},

where inequality (aa) is obtained according to Lemma 11; and the last inequality holds given zθT.z\geq\theta_{T}.

To verify condition (2) in Lemma 6, we have

Z_{T+1}-Z_{T}\leq|Z_{T+1}-Z_{T}|\leq\left|\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}}\right|\leq(H^{2}+\sqrt{H^{4}\iota})+\epsilon\leq 2\sqrt{H^{4}\iota},

where the last inequality holds because 2ϵδ1.2\epsilon\leq\delta\leq 1.

Now choose γ=δ4\gamma=\frac{\delta}{4} and v=2H4ι.v=2\sqrt{H^{4}\iota}. From Lemma 6, we obtain

𝔼[erZT]erZ1+2er(v+θT)rγ,wherer=γv2+vγ/3.\displaystyle\mathbb{E}\left[e^{rZ_{T}}\right]\leq e^{rZ_{1}}+\frac{2e^{r(v+\theta_{T})}}{r\gamma},\quad\hbox{where}\quad r=\frac{\gamma}{v^{2}+v\gamma/3}. (47)

By Jensen’s inequality, we have

er𝔼[ZT]𝔼[erZT],e^{r\mathbb{E}\left[Z_{T}\right]}\leq\mathbb{E}\left[e^{rZ_{T}}\right],

which implies that

𝔼[ZT]1rlog(1+2er(v+θT)rγ)\displaystyle\mathbb{E}[Z_{T}]\leq\frac{1}{r}\log{\left(1+\frac{2e^{r(v+\theta_{T})}}{r\gamma}\right)}
=\displaystyle= 1rlog(1+6v2+2vγ3γ2er(v+θT))\displaystyle\frac{1}{r}\log{\left(1+\frac{6v^{2}+2v\gamma}{3\gamma^{2}}e^{r(v+\theta_{T})}\right)}
\displaystyle\leq 1rlog(1+8v23γ2er(v+θT))\displaystyle\frac{1}{r}\log{\left(1+\frac{8v^{2}}{3\gamma^{2}}e^{r(v+\theta_{T})}\right)}
\displaystyle\leq 1rlog(11v23γ2er(v+θT))\displaystyle\frac{1}{r}\log{\left(\frac{11v^{2}}{3\gamma^{2}}e^{r(v+\theta_{T})}\right)}
\displaystyle\leq 4v23γlog(11v23γ2er(v+θT))\displaystyle\frac{4v^{2}}{3\gamma}\log{\left(\frac{11v^{2}}{3\gamma^{2}}e^{r(v+\theta_{T})}\right)}
\displaystyle\leq 3v2γlog(2vγ)+v+θT\displaystyle\frac{3v^{2}}{\gamma}\log\left(\frac{2v}{\gamma}\right)+v+\theta_{T}
\displaystyle\leq 3v2γlog(2vγ)+v+4(4H2ιK2+ηH2ι+H4ι+ϵ2)δ\displaystyle\frac{3v^{2}}{\gamma}\log\left(\frac{2v}{\gamma}\right)+v+\frac{4(\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2})}{\delta}
=\displaystyle= 48H4ιδlog(16H2ιδ)+2H4ι+4(4H2ιK2+ηH2ι+H4ι+ϵ2)δ\displaystyle\frac{48H^{4}\iota}{\delta}\log\left(\frac{16H^{2}\sqrt{\iota}}{\delta}\right)+2\sqrt{H^{4}\iota}+\frac{4(\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2})}{\delta}
\displaystyle\leq 54H4ιδlog(16H2ιδ)+16H2ιK2δ+4ηH2ιδ=𝒪~(ηHδ),\displaystyle\frac{54H^{4}\iota}{\delta}\log\left(\frac{16H^{2}\sqrt{\iota}}{\delta}\right)+\frac{16H^{2}\iota}{K^{2}\delta}+\frac{4\eta\sqrt{H^{2}\iota}}{\delta}=\tilde{\cal O}\left(\frac{\eta H}{\delta}\right),

which completes the proof of Lemma 5. ∎

Substituting the results from Lemmas 2 and 5 into (43), under the assumption K\geq\left(\frac{16\sqrt{SAH^{6}\iota^{3}}}{\delta}\right)^{5}, which guarantees \epsilon\leq\frac{\delta}{2}, and using the fact that \epsilon=\frac{8\sqrt{SAH^{6}\iota^{3}}}{K^{0.2}}, we can verify that

Violation(K)54H4ιK0.6δlog16H2ιδ+4H2ιδK0.85SAH6ι3K0.8.\displaystyle\hbox{Violation}(K)\leq\frac{54H^{4}{\iota}K^{0.6}}{\delta}\log{\frac{16H^{2}\sqrt{\iota}}{\delta}}+\frac{4\sqrt{H^{2}\iota}}{\delta}K^{0.8}-5\sqrt{SAH^{6}\iota^{3}}K^{0.8}.

If further we have Ke1δ,K\geq e^{\frac{1}{\delta}}, we can obtain

Violation(K)54H4ιK0.6δlog16H2ιδSAH6ι3K0.8=0.\hbox{Violation}(K)\leq\frac{54H^{4}{\iota}K^{0.6}}{\delta}\log{\frac{16H^{2}\sqrt{\iota}}{\delta}}-\sqrt{SAH^{6}\iota^{3}}K^{0.8}=0.

Now to prove the high probability bound, recall that from inequality (37), we have

k=1KρC1πk(xk,1,ak,1)Kϵ+KαZK1α+1+k=1K{Ck,1C1πk}(xk,1,ak,1).\displaystyle\sum^{K}_{k=1}\rho-C_{1}^{\pi_{k}}(x_{k,1},a_{k,1})\leq-K\epsilon+K^{\alpha}Z_{K^{1-\alpha}+1}+\sum_{k=1}^{K}\left\{C_{k,1}-C_{1}^{\pi_{k}}\right\}(x_{k,1},a_{k,1}). (48)

According to inequality (47), we have

𝔼[erZT]erZ1+2er(v+θT)rγ11v23γ2er(v+θT),\mathbb{E}\left[e^{rZ_{T}}\right]\leq e^{rZ_{1}}+\frac{2e^{r(v+\theta_{T})}}{r\gamma}\leq\frac{11v^{2}}{3\gamma^{2}}e^{r(v+\theta_{T})},

which implies that

Pr(ZT1rlog(11v23γ2)+2(v+θT))\displaystyle\Pr\left(Z_{T}\geq\frac{1}{r}\log\left(\frac{11v^{2}}{3\gamma^{2}}\right)+2(v+\theta_{T})\right)
=\displaystyle= Pr(erZTelog(11v23γ2)+2r(v+θT))\displaystyle\Pr(e^{rZ_{T}}\geq e^{\log\left(\frac{11v^{2}}{3\gamma^{2}}\right)+2r(v+\theta_{T})})
\displaystyle\leq 𝔼[erZT]11v23γ2e2r(v+θT)\displaystyle\frac{\mathbb{E}[e^{rZ_{T}}]}{\frac{11v^{2}}{3\gamma^{2}}e^{2r(v+\theta_{T})}}
\displaystyle\leq 1er(v+θT)=𝒪~(eη),\displaystyle\frac{1}{e^{r(v+\theta_{T})}}=\tilde{\mathcal{O}}\left(e^{-\eta}\right), (49)

where the first inequality is from the Markov inequality.

In the proof of Lemma 2, we have shown

|k=(T1)Kα+1TKαCk,h(xk,h,ak,h)Chπk(xk,h,ak,h)|\displaystyle\left|\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}C_{k,h}(x_{k,h},a_{k,h})-C_{h}^{\pi_{k}}(x_{k,h},a_{k,h})\right|
\displaystyle\leq |k=(T1)Kα+1TKαCk,h+1(xk,h+1,ak,h+1)Ch+1πk(xk,h+1,ak,h+1)|+|k=(T1)Kα+1TKα(^hkh)Vh+1πk(xk,h,ak,h)|\displaystyle\left|\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}C_{k,h+1}(x_{k,h+1},a_{k,h+1})-C_{h+1}^{\pi_{k}}(x_{k,h+1},a_{k,h+1})\right|+\left|\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})V_{h+1}^{\pi_{k}}(x_{k,h},a_{k,h})\right|
+HSA+H2ιKαχ+H2SAιKα(χ+1)\displaystyle+HSA+\frac{H^{2}\sqrt{\iota}K^{\alpha}}{\chi}+\sqrt{H^{2}SA\iota K^{\alpha}(\chi+1)} (50)

Following a similar proof as the proof of Lemma 10, we can prove that

|k=(T1)Kα+1TKα(^hkh)Vh+1πk(xk,h,ak,h)|14H2ιKα\left|\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}(\hat{\mathbb{P}}_{h}^{k}-\mathbb{P}_{h})V_{h+1}^{\pi_{k}}(x_{k,h},a_{k,h})\right|\leq\frac{1}{4}\sqrt{H^{2}\iota K^{\alpha}}

holds with probability at least 1-\frac{1}{K^{3}}. By iteratively applying inequality (50) over h and summing over all frames, we conclude that with probability at least 1-\frac{1}{K^{2}},

|k=1K{Ck,1C1πk}(xk,1,ak,1)|\displaystyle\left|\sum_{k=1}^{K}\{C_{k,1}-C_{1}^{\pi_{k}}\}(x_{k,1},a_{k,1})\right|\leq K1αH2SA+H3ιKχ+H4SAιK2α(χ+1)+14H4ιK2α\displaystyle K^{1-\alpha}H^{2}SA+\frac{H^{3}\sqrt{\iota}K}{\chi}+\sqrt{H^{4}SA\iota K^{2-\alpha}(\chi+1)}+\frac{1}{4}\sqrt{H^{4}\iota K^{2-\alpha}}
\displaystyle\leq 4H4SAιK0.8,\displaystyle 4\sqrt{H^{4}SA\iota}K^{0.8}, (51)

where the last inequality holds because α=0.6\alpha=0.6 and χ=K0.2.\chi=K^{0.2}.

Now, by combining inequalities (49) and (51), and using the union bound, we can show that when Kmax{(8SAH6ι3δ)5,e1δ},K\geq\max\left\{\left(\frac{8\sqrt{SAH^{6}\iota^{3}}}{\delta}\right)^{5},e^{\frac{1}{\delta}}\right\}, with probability at least 1𝒪~(eK0.2+1K2)1-\tilde{\mathcal{O}}\left(e^{-K^{0.2}}+\frac{1}{K^{2}}\right)

k=1KρC1πk(xk,1,ak,1)\displaystyle\sum^{K}_{k=1}\rho-C_{1}^{\pi_{k}}(x_{k,1},a_{k,1})
Kϵ+Kα(1rlog(11v23γ2)+2(v+θT))+4H4SAιK0.8\displaystyle\leq-K\epsilon+K^{\alpha}\left(\frac{1}{r}\log\left(\frac{11v^{2}}{3\gamma^{2}}\right)+2(v+\theta_{T})\right)+4\sqrt{H^{4}SA\iota}K^{0.8}
SAH6ι3K0.80,\displaystyle\leq-\sqrt{SAH^{6}\iota^{3}}K^{0.8}\leq 0, (52)

which completes the proof of our main result.

5 Convergence and ϵ\epsilon-Optimal Policy

The Triple-Q algorithm is an online learning algorithm and is not a stationary policy. In theory, we can obtain a near-optimal, stationary policy following the idea proposed in [14]. Assume the agent stores all the policies used during learning. Note that each policy is defined by the two Q tables and the value of the virtual queue. At the end of the learning horizon K, the agent constructs a stochastic policy \bar{\pi} such that at the beginning of each episode, the agent selects one of the K policies uniformly at random, i.e. \bar{\pi}=\pi_{k} with probability \frac{1}{K}.
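A minimal sketch of this construction is given below (hypothetical data structures; each stored policy is represented by its two Q tables and its virtual-queue value, and the sampled policy acts greedily with respect to the corresponding pseudo-Q-value):

```python
import numpy as np

def sample_mixture_policy(snapshots, eta):
    """Uniformly sample one of the K stored per-episode policies.
    Each snapshot is a tuple (Q, C, Z): the reward Q-table of shape (H, S, A),
    the utility Q-table of the same shape, and the virtual-queue value."""
    Q, C, Z = snapshots[np.random.randint(len(snapshots))]

    def policy(h, x):
        # Greedy action with respect to the pseudo-Q-value Q + (Z / eta) * C.
        return int(np.argmax(Q[h, x] + (Z / eta) * C[h, x]))

    return policy
```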

We note that given any initial state x,x,

1Kk=1KV1πk(x)=V1π¯(x).\frac{1}{K}\sum_{k=1}^{K}V_{1}^{\pi_{k}}(x)=V^{\bar{\pi}}_{1}(x).
1Kk=1KW1πk(x)=W1π¯(x).\frac{1}{K}\sum_{k=1}^{K}W_{1}^{\pi_{k}}(x)=W^{\bar{\pi}}_{1}(x).

Therefore, under policy π¯,\bar{\pi}, we have

𝔼[V1(xk,1)V1π¯(xk,1)]\displaystyle\mathbb{E}\left[V_{1}^{*}(x_{k,1})-V_{1}^{\bar{\pi}}(x_{k,1})\right]
=\displaystyle= \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left(V_{1}^{*}(x_{k,1})-V_{1}^{\bar{\pi}}(x_{k,1})\right)\right]
=\displaystyle= \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left(V_{1}^{*}(x_{k,1})-V_{1}^{\pi_{k}}(x_{k,1})\right)\right]
=\displaystyle= O~(H4SAδK0.2),\displaystyle\tilde{O}\left(\frac{H^{4}\sqrt{SA}}{\delta K^{0.2}}\right),

and

𝔼[ρW1π¯(xk,1)]\displaystyle\mathbb{E}\left[\rho-W_{1}^{\bar{\pi}}(x_{k,1})\right]
=\displaystyle= \mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left(\rho-W_{1}^{\bar{\pi}}(x_{k,1})\right)\right]
=\displaystyle= 𝔼[1Kk=1K(ρW1πk(xk,1))]\displaystyle\mathbb{E}\left[\frac{1}{K}\sum_{k=1}^{K}\left(\rho-W_{1}^{\pi_{k}}(x_{k,1})\right)\right]
\displaystyle\leq 0.\displaystyle 0.

Therefore, given any ϵ,\epsilon, π¯\bar{\pi} is an ϵ\epsilon-optimal policy when KK is sufficiently large.

While \bar{\pi} is a near-optimal policy, in practice it may not be possible to store all policies during learning due to memory constraints. A heuristic approach to obtain a near-optimal, stationary policy is to fix the two Q functions (reward and utility) after the learning horizon K and continue to adapt the virtual queue with frame size \sqrt{K}, as sketched below. This is a stochastic policy where the randomness comes from the virtual queue. When the virtual queue reaches its steady state, we obtain a stationary policy. The resulting policy performs well in our experiments (see Section 6).
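A minimal sketch of this heuristic, under the assumption of an episodic environment interface env with reset() and step() returning (next state, reward, utility) (variable names are ours; Q and C are the frozen reward and utility Q-tables):

```python
import numpy as np

def run_frozen_triple_q(env, Q, C, Z, eta, rho, eps, H, frame_size, num_frames):
    """The two Q-tables are frozen after training; only the virtual queue Z keeps
    adapting, once per frame of `frame_size` (about sqrt(K)) episodes."""
    for _ in range(num_frames):
        frame_utilities = []
        for _ in range(frame_size):
            x = env.reset()
            for h in range(H):
                # Greedy action w.r.t. the frozen pseudo-Q-value.
                a = int(np.argmax(Q[h, x] + (Z / eta) * C[h, x]))
                if h == 0:
                    frame_utilities.append(C[0, x, a])
                x, reward, utility = env.step(a)  # reward/utility are not used here
        # Same frame-level queue update as during training, cf. (37).
        Z = max(Z + rho + eps - float(np.mean(frame_utilities)), 0.0)
    return Z
```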

6 Evaluation

We evaluated Triple-Q using a grid-world environment [8]. We further implemented Triple-Q with neural network approximations, called Deep Triple-Q, for an environment with continuous state and action spaces, called the Dynamic Gym benchmark [30]. In both cases, Triple-Q and Deep Triple-Q quickly learn a safe policy with a high reward.

6.1 A Tabular Case

We first evaluated our algorithm using a grid-world environment studied in [8]. The environment is shown in Figure 1-(a). The objective of the agent is to travel to the destination as quickly as possible while avoiding obstacles for safety. Hitting an obstacle incurs a cost of 1. The reward for the destination is 100, and the reward for any other location is the longest distance on the grid minus the Euclidean distance between that location and the destination. The cost constraint is set to be 6 (we transferred utility to cost as we discussed in the paper), which means the agent is allowed to hit the obstacles at most six times. For statistical significance, the result of each experiment was averaged over 5 trials.
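For concreteness, the reward and cost tables of such a grid world can be generated as in the sketch below (the grid size and obstacle locations are placeholders and do not reproduce the exact environment of [8]):

```python
import numpy as np

def make_grid_world(n=8, goal=(7, 7), obstacles=((2, 3), (4, 4), (5, 1))):
    """Reward: 100 at the destination; elsewhere, the longest distance on the
    grid minus the Euclidean distance to the destination. Cost: 1 for a cell
    containing an obstacle (the episodic cost budget in the experiment is 6)."""
    xs, ys = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    dist = np.sqrt((xs - goal[0]) ** 2 + (ys - goal[1]) ** 2)
    reward = dist.max() - dist
    reward[goal] = 100.0
    cost = np.zeros((n, n))
    for obstacle in obstacles:
        cost[obstacle] = 1.0
    return reward, cost
```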

The result is shown in Figure 2, from which we can observe that Triple-Q quickly learns a well-performing policy (within about 20,000 episodes) while satisfying the safety constraint. Triple-Q-stop is a stationary policy obtained by stopping learning (i.e. fixing the Q tables) at 40,000 training steps (note that the virtual-Queue continues to be updated, so the policy is a stochastic policy). We can see that Triple-Q-stop performs similarly to Triple-Q, which shows that Triple-Q yields a near-optimal, stationary policy after learning stops.

Figure 1: Grid World and DynamicEnv with Safety Constraints (image source: the environments are generated using safety-gym: https://github.com/openai/safety-gym)
Figure 2: The average reward and cost under Triple-Q during training. The shaded region represents the standard deviations

6.2 Triple-Q with Neural Networks

We further evaluated our algorithm on the Dynamic Gym benchmark (DynamicEnv) [30], as shown in Figure 1-(b). In this environment, a point agent (one actuator for turning and another for moving) navigates on a 2D map to reach the goal position while trying to avoid hazardous areas. The initial state of the agent, the goal position, and the hazards are randomly generated in each episode. At each step, the agent incurs a cost of 1 if it stays in a hazardous area; otherwise, there is no cost. The constraint is that the expected cost should not exceed 15. Since both the state and action spaces are continuous in this environment, we implemented the key ideas of Triple-Q with neural network approximations and the actor-critic method. In particular, two Q functions are trained simultaneously, the virtual queue is updated slowly every few episodes, and the policy network is trained by optimizing the combined three “Q”s (Triple-Q). The details can be found in Table 2. We call this algorithm Deep Triple-Q. The simulation results in Figure 3 show that Deep Triple-Q learns a safe policy with a high reward much faster than WCSAC [30]. In particular, doing so took around 0.45 million training steps under Deep Triple-Q, but 4.5 million training steps under WCSAC.
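A sketch of the actor update in Deep Triple-Q is shown below (PyTorch-style; network definitions, the replay buffer, and the critic updates are omitted, and the variable names are ours): the policy is trained by maximizing the two critics combined through the virtual queue, in analogy with the tabular pseudo-Q-value.

```python
import torch

def actor_loss(policy_net, reward_critic, utility_critic, states, Z, eta):
    """Deep Triple-Q actor objective (sketch): maximize the pseudo-Q-value
    Q_r(s, a) + (Z / eta) * Q_c(s, a) at the actions proposed by the policy,
    where Q_c estimates the cumulative utility used in the constraint
    (the environment's cost is converted to a utility as in the tabular case)."""
    actions = policy_net(states)
    pseudo_q = reward_critic(states, actions) + (Z / eta) * utility_critic(states, actions)
    return -pseudo_q.mean()  # minimize the negative, i.e. ascend the pseudo-Q-value
```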

Table 2: Hyperparameters
Parameter Value
   optimizer Adam
   learning rate 3\times 10^{-3}
   discount 0.99
   replay buffer size 10610^{6}
   number of hidden layers (all networks) 2
   batch Size 256
   nonlinearity ReLU
   number of hidden units per layer (Critic) 256
   number of hidden units per layer (Actor) 256
   virtual queue update frequency 3 episodes
Figure 3: The rewards and costs of Deep Triple-Q versus WCSAC during Training

7 Conclusions

This paper considered CMDPs and proposed a model-free RL algorithm without a simulator, named Triple-Q. From a theoretical perspective, Triple-Q achieves sublinear regret and zero constraint violation. We believe it is the first model-free RL algorithm for CMDPs with provable sublinear regret, without a simulator. From an algorithmic perspective, Triple-Q has a computational complexity similar to that of SARSA, and can easily incorporate recent deep Q-learning algorithms to obtain a deep Triple-Q algorithm, which makes our method particularly appealing for complex and challenging CMDPs in practice.

While we only considered a single constraint in the paper, it is straightforward to extend the algorithm and the analysis to multiple constraints. Assuming there are J constraints in total, Triple-Q can maintain a virtual queue and a utility Q-function for each constraint, and then select an action at each step by solving the following problem:

maxa(Qh(xh,a)+1ηj=1JZ(j)Ch(j)(xh,a)).\max_{a}\left(Q_{h}(x_{h},a)+\frac{1}{\eta}\sum_{j=1}^{J}{Z^{(j)}}C^{(j)}_{h}(x_{h},a)\right).
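For illustration, this action-selection step can be written as follows (a sketch with our own variable names):

```python
import numpy as np

def select_action(Q, C_list, Z_list, eta, h, x):
    """Pick argmax_a of Q_h(x, a) + (1/eta) * sum_j Z^(j) * C_h^(j)(x, a), where
    C_list[j] is the utility Q-table of constraint j and Z_list[j] its virtual queue."""
    pseudo_q = Q[h, x].astype(float)
    for Z_j, C_j in zip(Z_list, C_list):
        pseudo_q = pseudo_q + (Z_j / eta) * C_j[h, x]
    return int(np.argmax(pseudo_q))
```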

References

  • [1] Naoki Abe, Prem Melville, Cezar Pendus, Chandan K Reddy, David L Jensen, Vince P Thomas, James J Bennett, Gary F Anderson, Brent R Cooley, Melissa Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 75–84, 2010.
  • [2] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
  • [3] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
  • [4] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In Int. Conf. Machine Learning (ICML), Madison, WI, USA, 2012.
  • [5] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349, June 2013.
  • [6] Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.
  • [7] Yi Chen, Jing Dong, and Zhaoran Wang. A primal-dual approach to constrained Markov decision processes. arXiv preprint arXiv:2101.10895, 2021.
  • [8] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.
  • [9] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.
  • [10] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
  • [11] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
  • [12] Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
  • [13] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
  • [14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances Neural Information Processing Systems (NeurIPS), volume 31, pages 4863–4873, 2018.
  • [15] Krishna C. Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon MDP with constraints. arXiv preprint arXiv:2009.11348, 2020.
  • [16] M. J. Neely. Energy-aware wireless scheduling with near-optimal backlog and convergence time tradeoffs. IEEE/ACM Transactions on Networking, 24(4):2223–2236, 2016.
  • [17] Michael J. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010.
  • [18] Masahiro Ono, Marco Pavone, Yoshiaki Kuwata, and J Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
  • [19] Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, 2019.
  • [20] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
  • [21] Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. Upper confidence primal-dual reinforcement learning for CMDP with adversarial loss. In Advances in Neural Information Processing Systems, 2020.
  • [22] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
  • [23] Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
  • [24] R. Srikant and Lei Ying. Communication Networks: An Optimization, Control and Stochastic Networks Perspective. Cambridge University Press, 2014.
  • [25] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
  • [26] Yuanhao Wang, Kefan Dong, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations, 2020.
  • [27] Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, King’s College, Cambridge United Kingdom, May 1989.
  • [28] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.
  • [29] Tengyu Xu, Yingbin Liang, and Guanghui Lan. A primal approach to constrained policy optimization: Global optimality and finite-time analysis. arXiv preprint arXiv:2011.05869, 2020.
  • [30] Qisong Yang, Thiago D Simão, Simon H Tindemans, and Matthijs TJ Spaan. Wcsac: Worst-case soft actor critic for safety-constrained reinforcement learning. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press, online, 2021.
  • [31] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2020.

In the appendix, we summarize notations used throughout the paper in Table 3, and present a few lemmas used to prove the main theorem.

Appendix A Notation Table

Table 3: Notation Table
Notation Definition
KK The total number of episodes
SS The number of states
AA The number of actions
HH The length of each episode
[H][H] Set {1,2,,H}\{1,2,\dots,H\}
Qk,h(x,a)Q_{k,h}(x,a) The estimated reward Q-function at step hh in episode kk
Q_{h}^{\pi}(x,a) The reward Q-function at step h under policy \pi
V_{k,h}(x) The estimated reward value-function at step h in episode k
V_{h}^{\pi}(x) The reward value-function at step h under policy \pi
Ck,h(x,a)C_{k,h}(x,a) The estimated utility Q-function at step hh in episode kk
C_{h}^{\pi}(x,a) The utility Q-function at step h under policy \pi
W_{k,h}(x) The estimated utility value-function at step h in episode k
W_{h}^{\pi}(x) The utility value-function at step h under policy \pi
Fk,h(x,a)F_{k,h}(x,a) Fk,h(x,a)=Qk,h(x,a)+ZkηCk,h(x,a)F_{k,h}(x,a)=Q_{k,h}(x,a)+\frac{Z_{k}}{\eta}C_{k,h}(x,a)
Uk,h(x)U_{k,h}(x) Uk,h(x)=Vk,h(x)+ZkηWk,h(x)U_{k,h}(x)=V_{k,h}(x)+\frac{Z_{k}}{\eta}W_{k,h}(x)
rh(x,a)r_{h}(x,a) The reward of (state, action) pair (x,a)(x,a) at step h.h.
gh(x,a)g_{h}(x,a) The utility of (state, action) pair (x,a)(x,a) at step h.h.
Nk,h(x,a)N_{k,h}(x,a) The number of visits to (x,a)(x,a) when at step hh in episode kk (not including kk)
ZkZ_{k} The dual estimation (virtual queue) in episode k.k.
qhq_{h}^{*} The optimal solution to the LP of the CMDP (8).
qhϵ,{q}_{h}^{\epsilon,*} The optimal solution to the tightened LP (14).
δ\delta Slater’s constant.
btb_{t} the UCB bonus for given tt
𝕀()\mathbb{I}(\cdot) The indicator function

Appendix B Useful Lemmas

The first lemma establishes some key properties of the learning rates used in Triple-Q. The proof closely follows the proof of Lemma 4.1 in [14].

Lemma 7.

Recall that the learning rate used in Triple-Q is αt=χ+1χ+t,\alpha_{t}=\frac{\chi+1}{\chi+t}, and

αt0=j=1t(1αj)andαti=αij=i+1t(1αj).\alpha_{t}^{0}=\prod_{j=1}^{t}(1-\alpha_{j})\quad\hbox{and}\quad\alpha_{t}^{i}=\alpha_{i}\prod_{j=i+1}^{t}(1-\alpha_{j}). (53)

The following properties hold for αti:\alpha_{t}^{i}:

  1. (a)

    αt0=0\alpha_{t}^{0}=0 for t1,αt0=1t\geq 1,\alpha_{t}^{0}=1 for t=0.t=0.

  2. (b)

    i=1tαti=1\sum_{i=1}^{t}\alpha_{t}^{i}=1 for t1,t\geq 1, i=1tαti=0\sum_{i=1}^{t}\alpha_{t}^{i}=0 for t=0.t=0.

  3. (c)

    1χ+ti=1tαtiχ+i2χ+t.\frac{1}{\sqrt{\chi+t}}\leq\sum_{i=1}^{t}\frac{\alpha_{t}^{i}}{\sqrt{\chi+i}}\leq\frac{2}{\sqrt{\chi+t}}.

  4. (d)

    t=iαti=1+1χ\sum_{t=i}^{\infty}\alpha_{t}^{i}=1+\frac{1}{\chi} for every i1.i\geq 1.

  5. (e)

    i=1t(αti)2χ+1χ+t\sum_{i=1}^{t}(\alpha_{t}^{i})^{2}\leq\frac{\chi+1}{\chi+t} for every t1.t\geq 1.

\square

Proof.

The proofs of (a) and (b) are straightforward by using the definition of \alpha_{t}^{i}. The proof of (d) is the same as that in [14].

(c): We next prove (c) by induction.

For t=1,t=1, we have i=1tαtiχ+i=α11χ+1=1χ+1,\sum_{i=1}^{t}\frac{\alpha_{t}^{i}}{\sqrt{\chi+i}}=\frac{\alpha_{1}^{1}}{\sqrt{\chi+1}}=\frac{1}{\sqrt{\chi+1}}, so (c) holds for t=1t=1.

Now suppose that (c) holds for t1t-1 for t2,t\geq 2, i.e.

\frac{1}{\sqrt{\chi+t-1}}\leq\sum_{i=1}^{t-1}\frac{\alpha_{t-1}^{i}}{\sqrt{\chi+i}}\leq\frac{2}{\sqrt{\chi+t-1}}.

From the relationship αti=(1αt)αt1i\alpha_{t}^{i}=(1-\alpha_{t})\alpha_{t-1}^{i} for i=1,2,,t1,i=1,2,\dots,t-1, we have

\sum_{i=1}^{t}\frac{\alpha_{t}^{i}}{\sqrt{\chi+i}}=\frac{\alpha_{t}}{\sqrt{\chi+t}}+(1-\alpha_{t})\sum_{i=1}^{t-1}\frac{\alpha_{t-1}^{i}}{\sqrt{\chi+i}}.

Now we apply the induction assumption. To prove the lower bound in (c), we have

αtχ+t+(1αt)i=1t1αt1iχ+iαtχ+t+1αtχ+t1αtχ+t+1αtχ+t1χ+t.\frac{\alpha_{t}}{\sqrt{\chi+t}}+(1-\alpha_{t})\sum_{i=1}^{t-1}\frac{\alpha_{t-1}^{i}}{\sqrt{\chi+i}}\geq\frac{\alpha_{t}}{\sqrt{\chi+t}}+\frac{1-\alpha_{t}}{\sqrt{\chi+t-1}}\geq\frac{\alpha_{t}}{\sqrt{\chi+t}}+\frac{1-\alpha_{t}}{\sqrt{\chi+t}}\geq\frac{1}{\sqrt{\chi+t}}.

To prove the upper bound in (c), we have

αtχ+t+(1αt)i=1t1αt1iχ+i\displaystyle\frac{\alpha_{t}}{\sqrt{\chi+t}}+(1-\alpha_{t})\sum_{i=1}^{t-1}\frac{\alpha_{t-1}^{i}}{\sqrt{\chi+i}}\leq αtχ+t+2(1αt)χ+t1=χ+1(χ+t)χ+t+2(t1)(χ+t)χ+t1,\displaystyle\frac{\alpha_{t}}{\sqrt{\chi+t}}+\frac{2(1-\alpha_{t})}{\sqrt{\chi+t-1}}=\frac{\chi+1}{(\chi+t)\sqrt{\chi+t}}+\frac{2(t-1)}{(\chi+t)\sqrt{\chi+t-1}},
=\displaystyle= 1χ2t(χ+t)χ+t+2(t1)(χ+t)χ+t1+2χ+t\displaystyle\frac{1-\chi-2t}{(\chi+t)\sqrt{\chi+t}}+\frac{2(t-1)}{(\chi+t)\sqrt{\chi+t-1}}+\frac{2}{\sqrt{\chi+t}}
\displaystyle\leq χ1(χ+t)χ+t1+2χ+t2χ+t.\displaystyle\frac{-\chi-1}{(\chi+t)\sqrt{\chi+t-1}}+\frac{2}{\sqrt{\chi+t}}\leq\frac{2}{\sqrt{\chi+t}}. (54)

(e) According to its definition, we have

αti=\displaystyle\alpha_{t}^{i}= χ+1i+χ(ii+1+χi+1i+2+χt1t+χ)\displaystyle\frac{\chi+1}{i+\chi}\cdot\left(\frac{i}{i+1+\chi}\frac{i+1}{i+2+\chi}\cdots\frac{t-1}{t+\chi}\right)
=\displaystyle= χ+1t+χ(ii+χi+1i+1+χt1t1+χ)χ+1χ+t.\displaystyle\frac{\chi+1}{t+\chi}\cdot\left(\frac{i}{i+\chi}\frac{i+1}{i+1+\chi}\cdots\frac{t-1}{t-1+\chi}\right)\leq\frac{\chi+1}{\chi+t}. (55)

Therefore, we have

i=1t(αti)2[maxi[t]αti]i=1tαtiχ+1χ+t,\sum_{i=1}^{t}(\alpha_{t}^{i})^{2}\leq[\max_{i\in[t]}\alpha_{t}^{i}]\cdot\sum_{i=1}^{t}\alpha_{t}^{i}\leq\frac{\chi+1}{\chi+t},

because i=1tαti=1.\sum_{i=1}^{t}\alpha_{t}^{i}=1.
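These properties can also be checked numerically; the short script below computes \alpha_{t}^{i} directly from (53) and verifies (b), (c), and (e) for a few values of \chi and t (a sanity check only, not part of the analysis):

```python
import numpy as np

def alpha_weights(t, chi):
    """Return [alpha_t^1, ..., alpha_t^t] with alpha_i = (chi + 1) / (chi + i)."""
    alpha = (chi + 1.0) / (chi + np.arange(1, t + 1))
    weights = alpha.copy()
    for i in range(t):
        weights[i] *= np.prod(1.0 - alpha[i + 1:])  # alpha_i * prod_{j > i} (1 - alpha_j)
    return weights

for chi, t in [(2, 5), (10, 50)]:
    w = alpha_weights(t, chi)
    steps = np.sqrt(chi + np.arange(1, t + 1))
    assert abs(w.sum() - 1.0) < 1e-9                                          # property (b)
    assert 1 / np.sqrt(chi + t) <= np.sum(w / steps) <= 2 / np.sqrt(chi + t)  # property (c)
    assert np.sum(w ** 2) <= (chi + 1) / (chi + t)                            # property (e)
```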

The next lemma establishes upper bounds on Qk,hQ_{k,h} and Ck,hC_{k,h} under Triple-Q.

Lemma 8.

For any (x,a,h,k)𝒮×𝒜×[H]×[K],(x,a,h,k)\in\mathcal{S}\times\mathcal{A}\times[H]\times[K], we have the following bounds on Qk,h(x,a)Q_{k,h}(x,a) and Ck,h(x,a):C_{k,h}(x,a):

0Qk,h(x,a)H2ι\displaystyle 0\leq Q_{k,h}(x,a)\leq H^{2}\sqrt{\iota}
0Ck,h(x,a)H2ι.\displaystyle 0\leq C_{k,h}(x,a)\leq H^{2}\sqrt{\iota}.
Proof.

We first consider the last step of an episode, i.e. h=H.h=H. Recall that Vk,H+1(x)=0V_{k,H+1}(x)=0 for any kk and xx by its definition and Q0,H=HHι.Q_{0,H}=H\leq H\sqrt{\iota}. Suppose Qk,H(x,a)HιQ_{k^{\prime},H}(x,a)\leq H\sqrt{\iota} for any kk1k^{\prime}\leq k-1 and any (x,a).(x,a). Then,

Qk,H(x,a)=(1αt)Qkt,H(x,a)+αt(rH(x,a)+bt)max{Hι,1+Hι4}Hι,{Q}_{k,H}(x,a)=(1-\alpha_{t})Q_{k_{t},H}(x,a)+\alpha_{t}\left(r_{H}(x,a)+b_{t}\right)\leq\max\left\{H\sqrt{\iota},1+\frac{H\sqrt{\iota}}{4}\right\}\leq H\sqrt{\iota},

where t=N_{k,H}(x,a) is the number of visits to state-action pair (x,a) at step H before episode k (not including episode k), and k_{t} is the index of the episode of the most recent visit. Therefore, the upper bound holds for h=H.

Note that Q0,h=HH(Hh+1)ι.Q_{0,h}=H\leq H(H-h+1)\sqrt{\iota}. Now suppose the upper bound holds for h+1,h+1, and also holds for kk1.k^{\prime}\leq k-1. Consider step hh in episode k:k:

Qk,h(x,a)=\displaystyle{Q}_{k,h}(x,a)= (1αt)Qkt,h(x,a)+αt(rh(x,a)+Vkt,h+1(xkt,h+1)+bt),\displaystyle(1-\alpha_{t})Q_{k_{t},h}(x,a)+\alpha_{t}\left(r_{h}(x,a)+V_{k_{t},{h}+1}(x_{k_{t},{h}+1})+b_{t}\right),

where t=N_{k,h}(x,a) is the number of visits to state-action pair (x,a) at step h before episode k (not including episode k), and k_{t} is the index of the episode of the most recent visit. We also note that V_{k,h+1}(x)\leq\max_{a}Q_{k,h+1}(x,a)\leq H(H-h)\sqrt{\iota}. Therefore, we obtain

Qk,h(x,a)max{H(Hh+1)ι,1+H(Hh)ι+Hι4}H(Hh+1)ι.\displaystyle{Q}_{k,h}(x,a)\leq\max\left\{H(H-h+1)\sqrt{\iota},1+H(H-h)\sqrt{\iota}+\frac{H\sqrt{\iota}}{4}\right\}\leq H(H-h+1)\sqrt{\iota}.

Therefore, we can conclude that Qk,h(x,a)H2ιQ_{k,h}(x,a)\leq H^{2}\sqrt{\iota} for any k,k, hh and (x,a).(x,a). The proof for Ck,h(x,a)C_{k,h}(x,a) is identical. ∎
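For reference, the update analyzed above takes the following form in code (a sketch of a single reward-Q update with our own variable names; the utility-Q update is identical with g_h and W in place of r_h and V):

```python
import numpy as np

def update_reward_q(Q, V, N, h, x, a, r, x_next, chi, H, iota):
    """One visit-count-based update of Q_h(x, a):
        Q <- (1 - alpha_t) * Q + alpha_t * (r + V_{h+1}(x') + b_t),
    with learning rate alpha_t = (chi + 1) / (chi + t) and UCB bonus
    b_t = (1/4) * sqrt(H^2 * iota * (chi + 1) / (chi + t)).
    Here the visit count is incremented before computing alpha_t."""
    N[h, x, a] += 1
    t = N[h, x, a]
    alpha_t = (chi + 1.0) / (chi + t)
    b_t = 0.25 * np.sqrt(H ** 2 * iota * (chi + 1.0) / (chi + t))
    Q[h, x, a] = (1 - alpha_t) * Q[h, x, a] + alpha_t * (r + V[h + 1, x_next] + b_t)
    return Q
```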

Next, we present the following lemma from [14], which establishes a recursive relationship between Qk,hQ_{k,h} and QhπQ^{\pi}_{h} for any π.\pi. We include the proof so the paper is self-contained.

Lemma 9.

Consider any (x,a,h,k)𝒮×𝒜×[H]×[K],(x,a,h,k)\in\mathcal{S}\times\mathcal{A}\times[H]\times[K], and any policy π.\pi. Let t=Nk,h(x,a)N_{k,h}(x,a) be the number of visits to (x,a)(x,a) when at step hh in frame TT before episode k,k, and k1,,ktk_{1},\dots,k_{t} be the indices of the episodes in which these visits occurred. We have the following two equations:

(Qk,hQhπ)(x,a)=\displaystyle(Q_{k,h}-Q_{h}^{\pi})(x,a)= αt0{Q(T1)Kα+1,hQhπ}(x,a)\displaystyle\alpha_{t}^{0}\left\{Q_{(T-1)K^{\alpha}+1,h}-Q_{h}^{\pi}\right\}(x,a)
+i=1tαti({Vki,h+1Vh+1π}(xki,h+1)+{^hkiVh+1πhVh+1π}(x,a)+bi),\displaystyle+\sum_{i=1}^{t}\alpha_{t}^{i}\left(\left\{V_{k_{i},h+1}-V_{h+1}^{\pi}\right\}(x_{k_{i},h+1})+\left\{\hat{\mathbb{P}}_{h}^{k_{i}}V_{h+1}^{\pi}-\mathbb{P}_{h}V_{h+1}^{\pi}\right\}(x,a)+b_{i}\right), (56)
(Ck,hChπ)(x,a)=\displaystyle(C_{k,h}-C_{h}^{\pi})(x,a)= αt0{C(T1)Kα+1,hChπ}(x,a)\displaystyle\alpha_{t}^{0}\left\{C_{(T-1)K^{\alpha}+1,h}-C_{h}^{\pi}\right\}(x,a)
+i=1tαti({Wki,h+1Wh+1π}(xki,h+1)+{^hkiWh+1πhWh+1π}(x,a)+bi),\displaystyle+\sum_{i=1}^{t}\alpha_{t}^{i}\left(\left\{W_{k_{i},h+1}-W_{h+1}^{\pi}\right\}(x_{k_{i},h+1})+\left\{\hat{\mathbb{P}}_{h}^{k_{i}}W_{h+1}^{\pi}-\mathbb{P}_{h}W_{h+1}^{\pi}\right\}(x,a)+b_{i}\right), (57)

where ^hkVh+1(x,a):=Vh+1(xk,h+1)\hat{\mathbb{P}}_{h}^{k}V_{h+1}(x,a):=V_{h+1}(x_{k,h+1}) is the empirical counterpart of hVh+1π(x,a)=𝔼xh(|x,a)Vh+1π(x).\mathbb{P}_{h}V_{h+1}^{\pi}(x,a)=\mathbb{E}_{x^{\prime}\sim\mathbb{P}_{h}(\cdot|x,a)}V_{h+1}^{\pi}(x^{\prime}). This definition can also be applied to WhπW_{h}^{\pi} as well.

Proof.

We will prove (56). The proof for (57) is identical. Recall that under Triple-Q, Qk+1,h(x,a)Q_{k+1,h}(x,a) is updated as follows:

Qk+1,h(x,a)\displaystyle Q_{k+1,h}(x,a) ={(1αt)Qk,h(x,a)+αt(rh(x,a)+Vk,h+1(xh+1,k)+bt)if (x,a)=(xk,h,ak,h)Qk,h(x,a)otherwise.\displaystyle=\begin{cases}(1-\alpha_{t})Q_{k,h}(x,a)+\alpha_{t}\left(r_{h}(x,a)+V_{k,h+1}(x_{h+1,k})+b_{t}\right)&\text{if $(x,a)=(x_{k,h},a_{k,h})$}\\ Q_{k,h}(x,a)&\text{otherwise}\end{cases}.

From the update equation above, we have in episode k,k,

Qk,h(x,a)=\displaystyle Q_{k,h}(x,a)= (1αt)Qkt,h(x,a)+αt(rh(x,a)+Vkt,h+1(xkt,h+1)+bt).\displaystyle(1-\alpha_{t})Q_{k_{t},h}(x,a)+\alpha_{t}\left(r_{h}(x,a)+V_{k_{t},h+1}(x_{k_{t},h+1})+b_{t}\right).

Repeatedly using the equation above, we obtain

Qk,h(x,a)=\displaystyle Q_{k,h}(x,a)= (1αt)(1αt1)Qkt1,h(x,a)+(1αt)αt1(rh(x,a)+Vkt1,h+1(xkt1,h+1)+bt1)\displaystyle(1-\alpha_{t})(1-\alpha_{t-1})Q_{k_{t-1},h}(x,a)+(1-\alpha_{t})\alpha_{t-1}\left(r_{h}(x,a)+V_{k_{t-1},h+1}(x_{k_{t-1},h+1})+b_{t-1}\right)
+αt(rh(x,a)+Vkt,h+1(xkt,h+1)+bt)\displaystyle+\alpha_{t}\left(r_{h}(x,a)+V_{k_{t},h+1}(x_{k_{t},h+1})+b_{t}\right)
=\displaystyle= \displaystyle\cdots
=\displaystyle= αt0Q(T1)Kα+1,h(x,a)+i=1tαti(rh(x,a)+Vki,h+1(xki,h+1)+bi),\displaystyle\alpha_{t}^{0}Q_{(T-1)K^{\alpha}+1,h}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left(r_{h}(x,a)+V_{k_{i},h+1}(x_{k_{i},h+1})+b_{i}\right), (58)

where the last equality holds due to the definition of αti\alpha_{t}^{i} in (53) and the fact that all Q1,h(x,a)Q_{1,h}(x,a)s are initialized to be H.H. Now applying the Bellman equation Qhπ(x,a)={rh+hVh+1π}(x,a)Q_{h}^{\pi}(x,a)=\left\{r_{h}+\mathbb{P}_{h}V_{h+1}^{\pi}\right\}(x,a) and the fact that i=1tαti=1,\sum_{i=1}^{t}\alpha_{t}^{i}=1, we can further obtain

Qhπ(x,a)\displaystyle Q_{h}^{\pi}(x,a) =αt0Qhπ(x,a)+(1αt0)Qhπ(x,a)\displaystyle=\alpha_{t}^{0}Q_{h}^{\pi}(x,a)+(1-\alpha_{t}^{0})Q_{h}^{\pi}(x,a)
=αt0Qhπ(x,a)+i=1tαti(r(x,a)+hVh+1π(x,a)+Vh+1π(xki,h+1)Vh+1π(xki,h+1))\displaystyle=\alpha_{t}^{0}Q_{h}^{\pi}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left(r(x,a)+\mathbb{P}_{h}V_{h+1}^{\pi}(x,a)+V_{h+1}^{\pi}(x_{k_{i},h+1})-V_{h+1}^{\pi}(x_{k_{i},h+1})\right)
=αt0Qhπ(x,a)+i=1tαti(rh(x,a)+hVh+1π(x,a)+Vh+1π(xki,h+1)^hkiVh+1π(x,a))\displaystyle=\alpha_{t}^{0}Q_{h}^{\pi}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left(r_{h}(x,a)+\mathbb{P}_{h}V_{h+1}^{\pi}(x,a)+V_{h+1}^{\pi}(x_{k_{i},h+1})-\hat{\mathbb{P}}_{h}^{k_{i}}V_{h+1}^{\pi}(x,a)\right)
=αt0Qhπ(x,a)+i=1tαti(rh(x,a)+Vh+1π(xki,h+1)+{hVh+1π^hkiVh+1π}(x,a)).\displaystyle=\alpha_{t}^{0}Q_{h}^{\pi}(x,a)+\sum_{i=1}^{t}\alpha_{t}^{i}\left(r_{h}(x,a)+V_{h+1}^{\pi}(x_{k_{i},h+1})+\left\{\mathbb{P}_{h}V_{h+1}^{\pi}-\hat{\mathbb{P}}_{h}^{k_{i}}V_{h+1}^{\pi}\right\}(x,a)\right). (59)

Then subtracting (59) from (58) yields

(Qk,hQhπ)(x,a)=\displaystyle(Q_{k,h}-Q_{h}^{\pi})(x,a)= αt0{Q(T1)Kα+1,hQhπ}(x,a)\displaystyle\alpha_{t}^{0}\left\{Q_{(T-1)K^{\alpha}+1,h}-Q_{h}^{\pi}\right\}(x,a)
+i=1tαti({Vki,h+1Vh+1π}(xki,h+1)+{^hkiVh+1πhVh+1π}(x,a)+bi).\displaystyle+\sum_{i=1}^{t}\alpha_{t}^{i}\left(\left\{V_{k_{i},h+1}-V_{h+1}^{\pi}\right\}(x_{k_{i},h+1})+\left\{\hat{\mathbb{P}}_{h}^{k_{i}}V_{h+1}^{\pi}-\mathbb{P}_{h}V_{h+1}^{\pi}\right\}(x,a)+b_{i}\right).

Lemma 10.

Consider any frame T.T. Let t=Nk,h(x,a)N_{k,h}(x,a) be the number of visits to (x,a)(x,a) at step hh before episode kk in the current frame and let k1,,kt<kk_{1},\dots,k_{t}<k be the indices of these episodes. Under any policy π,\pi, with probability at least 11K3,1-\frac{1}{K^{3}}, the following inequalities hold simultaneously for all (x,a,h,k)𝒮×𝒜×[H]×[K](x,a,h,k)\in\mathcal{S}\times\mathcal{A}\times[H]\times[K]

|i=1tαti{(^hkih)Vh+1π}(x,a)|\displaystyle\left|\sum_{i=1}^{t}\alpha_{t}^{i}\left\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})V_{h+1}^{\pi}\right\}(x,a)\right|\leq 14H2ι(χ+1)(χ+t),\displaystyle\frac{1}{4}\sqrt{\frac{H^{2}\iota(\chi+1)}{(\chi+t)}},
|i=1tαti{(^hkih)Wh+1π}(x,a)|\displaystyle\left|\sum_{i=1}^{t}\alpha_{t}^{i}\left\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})W_{h+1}^{\pi}\right\}(x,a)\right|\leq 14H2ι(χ+1)(χ+t).\displaystyle\frac{1}{4}\sqrt{\frac{H^{2}\iota(\chi+1)}{(\chi+t)}}.
Proof.

Without loss of generality, we consider T=1.T=1. Fix any (x,a,h)𝒮×𝒜×[H].(x,a,h)\in\mathcal{S}\times\mathcal{A}\times[H]. For any n[Kα],n\in[K^{\alpha}], define

X(n)=i=1nατi𝕀{kiK}{(^hkih)Vh+1π}(x,a).X(n)=\sum_{i=1}^{n}\alpha_{\tau}^{i}\cdot\mathbb{I}_{\{k_{i}\leq K\}}\left\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})V_{h+1}^{\pi}\right\}(x,a).

Let i\mathcal{F}_{i} be the σ\sigma-algebra generated by all the random variables until step hh in episode ki.k_{i}. Then

𝔼[X(n+1)|n]=X(n)+𝔼[ατn+1𝕀{kn+1K}{(^hkn+1h)Vh+1π}(x,a)|n]=X(n),\mathbb{E}[X(n+1)|\mathcal{F}_{n}]=X(n)+\mathbb{E}\left[\alpha_{\tau}^{n+1}\mathbb{I}_{\{k_{n+1}\leq K\}}\left\{(\hat{\mathbb{P}}_{h}^{k_{n+1}}-\mathbb{P}_{h})V_{h+1}^{\pi}\right\}(x,a)|\mathcal{F}_{n}\right]=X(n),

which shows that X(n)X(n) is a martingale. We also have for 1in,1\leq i\leq n,

\displaystyle|X(i)-X(i-1)|\leq\alpha_{\tau}^{i}\left|\left\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})V_{h+1}^{\pi}\right\}(x,a)\right|\leq\alpha_{\tau}^{i}H.

Then let σ=8log(2SAHK)i=1τ(ατiH)2.\sigma=\sqrt{8\log\left(\sqrt{2SAH}K\right)\sum_{i=1}^{\tau}(\alpha_{\tau}^{i}H)^{2}}. By applying the Azuma-Hoeffding inequality, we have with probability at least 12exp(σ22i=1τ(ατiH)2)=11SAHK4,1-2\exp\left(-\frac{\sigma^{2}}{2\sum_{i=1}^{\tau}(\alpha^{i}_{\tau}H)^{2}}\right)=1-\frac{1}{SAHK^{4}},

|X(τ)|8log(2SAHK)i=1τ(ατiH)2ι16H2i=1τ(ατi)214H2ι(χ+1)χ+τ,|X(\tau)|\leq\sqrt{8\log\left(\sqrt{2SAH}K\right)\sum_{i=1}^{\tau}(\alpha_{\tau}^{i}H)^{2}}\leq\sqrt{\frac{\iota}{16}H^{2}\sum_{i=1}^{\tau}(\alpha_{\tau}^{i})^{2}}\leq\frac{1}{4}\sqrt{\frac{H^{2}\iota(\chi+1)}{\chi+\tau}},

where the last inequality holds due to \sum_{i=1}^{\tau}(\alpha_{\tau}^{i})^{2}\leq\frac{\chi+1}{\chi+\tau} from Lemma 7-(e). Because this inequality holds for any \tau\in[K], it also holds for \tau=t=N_{k,h}(x,a)\leq K. Applying the union bound, we obtain that with probability at least 1-\frac{1}{K^{3}} the following inequality holds simultaneously for all (x,a,h,k)\in\mathcal{S}\times\mathcal{A}\times[H]\times[K]:

|i=1tαti{(^hkih)Vh+1π}(x,a)|14H2ι(χ+1)(χ+t).\left|\sum_{i=1}^{t}\alpha_{t}^{i}\left\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})V_{h+1}^{\pi}\right\}(x,a)\right|\leq\frac{1}{4}\sqrt{\frac{H^{2}\iota(\chi+1)}{(\chi+t)}}.

Following a similar analysis, we also have that with probability at least 1-\frac{1}{K^{3}} the following inequality holds simultaneously for all (x,a,h,k)\in\mathcal{S}\times\mathcal{A}\times[H]\times[K]:

|i=1tαti{(^hkih)Wh+1π}(x,a)|14H2ι(χ+1)(χ+t).\left|\sum_{i=1}^{t}\alpha_{t}^{i}\left\{(\hat{\mathbb{P}}_{h}^{k_{i}}-\mathbb{P}_{h})W_{h+1}^{\pi}\right\}(x,a)\right|\leq\frac{1}{4}\sqrt{\frac{H^{2}\iota(\chi+1)}{(\chi+t)}}.

Lemma 11.

Given δ2ϵ,\delta\geq 2\epsilon, under Triple-Q, the conditional expected drift is

𝔼[LT+1LT|ZT=z]δ2ZT+4H2ιK2+ηH2ι+H4ι+ϵ2\displaystyle\mathbb{E}\left[L_{T+1}-L_{T}|Z_{T}=z\right]\leq-\frac{\delta}{2}Z_{T}+\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2} (60)
Proof.

Recall that LT=12ZT2,L_{T}=\frac{1}{2}Z_{T}^{2}, and the virtual queue is updated by using

ZT+1=(ZT+ρ+ϵC¯TKα)+.Z_{T+1}=\left(Z_{T}+\rho+\epsilon-\frac{\bar{C}_{T}}{K^{\alpha}}\right)^{+}.

From inequality (39), we have

𝔼[LT+1LT|ZT=z]\displaystyle\mathbb{E}\left[L_{T+1}-L_{T}|Z_{T}=z\right]
\displaystyle\leq 1Kαk=(T1)Kα+1TKα𝔼[ZT(ρ+ϵCk,1(xk,1,ak,1))ηQk,1(xk,1,ak,1)+ηQk,1(xk,1,ak,1)|ZT=z]+H4ι+ϵ2\displaystyle\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\mathbb{E}\left[Z_{T}\left(\rho+\epsilon-C_{k,1}(x_{k,1},a_{k,1})\right)-\eta Q_{k,1}(x_{k,1},a_{k,1})+\eta Q_{k,1}(x_{k,1},a_{k,1})|Z_{T}=z\right]+H^{4}\iota+\epsilon^{2}
(a)\displaystyle\leq_{(a)} 1Kαk=(T1)Kα+1TKα𝔼[ZT(ρ+ϵa{Ck,1q1π}(xk,1,a))ηa{Qk,1q1π}(xk,1,a)+ηQk,1(xk,1,ak,1)|ZT=z]\displaystyle\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\mathbb{E}\left[Z_{T}\left(\rho+\epsilon-\sum_{a}\left\{C_{k,1}q_{1}^{\pi}\right\}(x_{k,1},a)\right)-\eta\sum_{a}\{Q_{k,1}q_{1}^{\pi}\}(x_{k,1},a)+\eta Q_{k,1}(x_{k,1},a_{k,1})|Z_{T}=z\right]
+ϵ2+H4ι\displaystyle+\epsilon^{2}+H^{4}\iota
\displaystyle\leq 1Kαk=(T1)Kα+1TKα𝔼[ZT(ρ+ϵa{C1πq1π}(xk,1,a))ηa{Qk,1q1π}(xk,1,a)+ηQk,1(xk,1,ak,1)|ZT=z]\displaystyle\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\mathbb{E}\left[Z_{T}\left(\rho+\epsilon-\sum_{a}\left\{C_{1}^{\pi}q_{1}^{\pi}\right\}(x_{k,1},a)\right)-\eta\sum_{a}\{Q_{k,1}q_{1}^{\pi}\}(x_{k,1},a)+\eta Q_{k,1}(x_{k,1},a_{k,1})|Z_{T}=z\right]
+1Kαk=(T1)Kα+1TKα𝔼[ZTa{C1πq1π}(xk,1,a)ZTa{Ck,1q1π}(xk,1,a)|ZT=z]\displaystyle+\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\mathbb{E}\left[Z_{T}\sum_{a}\left\{C_{1}^{\pi}q_{1}^{\pi}\right\}(x_{k,1},a)-Z_{T}\sum_{a}\left\{C_{k,1}q_{1}^{\pi}\right\}(x_{k,1},a)|Z_{T}=z\right]
+1Kαk=(T1)Kα+1TKα𝔼[ηa{Q1πq1π}(xk,1,a)ηa{Q1πq1π}(xk,1,a)|ZT=z]+H4ι+ϵ2\displaystyle+\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\mathbb{E}\left[\eta\sum_{a}\left\{Q_{1}^{\pi}q_{1}^{\pi}\right\}(x_{k,1},a)-\eta\sum_{a}\left\{Q_{1}^{\pi}q_{1}^{\pi}\right\}(x_{k,1},a)|Z_{T}=z\right]+H^{4}\iota+\epsilon^{2}
(b)\displaystyle\leq_{(b)} δ2z+1Kαk=(T1)Kα+1TKα𝔼[ηa{(F1πFk,1)q1π}(xk,1,a)+ηQk,1(xk,1,ak,1)|ZT=z]+H4ι+ϵ2\displaystyle-\frac{\delta}{2}z+\frac{1}{K^{\alpha}}\sum_{k=(T-1)K^{\alpha}+1}^{TK^{\alpha}}\mathbb{E}\left[\eta\sum_{a}\left\{(F_{1}^{\pi}-F_{k,1})q_{1}^{\pi}\right\}(x_{k,1},a)\ +\eta Q_{k,1}(x_{k,1},a_{k,1})|Z_{T}=z\right]+H^{4}\iota+\epsilon^{2}
(c)\displaystyle\leq_{(c)} δ2z+4H2ιK2+ηH2ι+H4ι+ϵ2.\displaystyle-\frac{\delta}{2}z+\frac{4H^{2}\iota}{K^{2}}+\eta\sqrt{H^{2}\iota}+H^{4}\iota+\epsilon^{2}.

Inequality (a) holds because a_{k,1} is chosen to maximize Q_{k,1}(x_{k,1},a)+\frac{Z_{T}}{\eta}C_{k,1}(x_{k,1},a) under Triple-Q. Inequality (b) holds because \sum_{a}\left\{Q_{1}^{\pi}q_{1}^{\pi}\right\}(x_{k,1},a) is non-negative, and under Slater’s condition, we can find a policy \pi such that

ϵ+ρ𝔼[aC1π(xk,1,a)q1π(xk,1,a)]=ρ+ϵ𝔼[h,x,aqhπ(x,a)gh(x,a)]δ+ϵδ2.\epsilon+\rho-\mathbb{E}\left[\sum_{a}{C}^{\pi}_{1}(x_{k,1},a){q}^{\pi}_{1}(x_{k,1},a)\right]=\rho+\epsilon-\mathbb{E}\left[\sum_{h,x,a}{q}^{\pi}_{h}(x,a)g_{h}(x,a)\right]\leq-\delta+\epsilon\leq-\frac{\delta}{2}.

Finally, inequality (c) is obtained similarly to (36), together with the fact that Q_{k,1}(x_{k,1},a_{k,1}) is bounded according to Lemma 8. ∎