A Provably-Efficient Model-Free Algorithm for Constrained Markov Decision Processes
Abstract
This paper presents the first model-free, simulator-free reinforcement learning algorithm for Constrained Markov Decision Processes (CMDPs) with sublinear regret and zero constraint violation. The algorithm is named Triple-Q because it includes three key components: a Q-function (also called action-value function) for the cumulative reward, a Q-function for the cumulative utility of the constraint, and a virtual-Queue that (over)-estimates the cumulative constraint violation. Under Triple-Q, at each step, an action is chosen based on a pseudo-Q-value that is a combination of the three “Q” values. The algorithm updates the reward and utility Q-values with learning rates that depend on the visit counts to the corresponding (state, action) pairs and are periodically reset. In the episodic CMDP setting, Triple-Q achieves regret that is sublinear in the total number of episodes $K$, with a bound that also depends on the number of steps $H$ in each episode, the number of states $S$, the number of actions $A$, and Slater’s constant $\delta$. Furthermore, Triple-Q guarantees zero constraint violation, both in expectation and with a high probability, when $K$ is sufficiently large. Finally, the computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so the algorithm is computationally efficient.
1 Introduction
Reinforcement learning (RL), with its success in gaming and robotics, has been widely viewed as one of the most important technologies for next-generation, AI-driven complex systems such as autonomous driving, digital healthcare, and smart cities. However, despite the significant advances (such as deep RL) over the last few decades, a major obstacle in applying RL in practice is the lack of “safety” guarantees. Here “safety” refers to a broad range of operational constraints. The objective of a traditional RL problem is to maximize the expected cumulative reward, but in practice, many applications need to be operated under a variety of constraints, such as collision avoidance in robotics and autonomous driving [18, 13, 12], legal and business restrictions in financial engineering [1], and resource and budget constraints in healthcare systems [31]. These applications with operational constraints can often be modeled as Constrained Markov Decision Processes (CMDPs), in which the agent’s goal is to learn a policy that maximizes the expected cumulative reward subject to the constraints.
Earlier studies on CMDPs assume the model is known. A comprehensive study of these early results can be found in [3]. RL for unknown CMDPs has been a topic of great interest recently because of its importance in Artificial Intelligence (AI) and Machine Learning (ML). The most noticeable advances recently are model-based RL for CMDPs, where the transition kernels are learned and used to solve the linear programming (LP) problem for the CMDP [23, 6, 15, 11], or the LP problem in the primal component of a primal-dual algorithm [21, 11]. If the transition kernel is linear, then it can be learned in a sample efficient manner even for infinite state and action spaces, and then be used in the policy evaluation and improvement in a primal-dual algorithm [9]. [9] also proposes a model-based algorithm (Algorithm 3) for the tabular setting (without assuming a linear transition kernel).
Table 1: Regret and constraint violation of RL algorithms for CMDPs.

| | Algorithm | Regret | Constraint Violation |
|---|---|---|---|
| Model-based | OPDOP [9] | | |
| | OptDual-CMDP [11] | | |
| | OptPrimalDual-CMDP [11] | | |
| Model-free | Triple-Q | | 0 |
The performance of a model-based RL algorithm depends on how accurately a model can be estimated. For some complex environments, building accurate models is challenging both computationally and data-wise [25]. For such environments, model-free RL algorithms are often more desirable. However, there has been little development on model-free RL algorithms for CMDPs with provable optimality or regret guarantees, with the exceptions of [10, 29, 7], all of which require simulators. In particular, the sample-based NPG-PD algorithm in [10] requires a simulator that can simulate the MDP from any initial state, and the algorithms in [29, 7] both require a simulator for policy evaluation. It has been argued in [4, 5, 14] that with a perfect simulator, exploration is not needed and sample efficiency can be easily achieved because the agent can query any (state, action) pair as it wishes. Unfortunately, for complex environments, building a perfect simulator is often as difficult as deriving the model for the CMDP. For those environments, sample efficiency and the exploration-exploitation tradeoff are critical and are among the most important considerations in RL algorithm design.
1.1 Main Contributions
In this paper, we consider the online learning problem of an episodic CMDP with a model-free approach and without a simulator. We develop the first model-free RL algorithm for CMDPs with sublinear regret and zero constraint violation (for large $K$). The algorithm is named Triple-Q because it has three key components: (i) a Q-function (also called action-value function) for the expected cumulative reward, (ii) a Q-function for the expected cumulative utility of the constraint, and (iii) a virtual-Queue, which (over)estimates the cumulative constraint violation so far. At each step of the current episode, after observing the current state, the agent selects an action based on a pseudo-Q-value that is a combination of the three “Q” values:
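(in assumed notation, with $Q_h$, $C_h$, $Z$, and $\eta$ as our own labels for the reward Q-function, the utility Q-function, the virtual-Queue length, and a positive scaling constant, the combination takes the form)

$$ a_h \in \arg\max_{a} \left\{ Q_h(x_h, a) + \frac{Z}{\eta}\, C_h(x_h, a) \right\}, $$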
where $\eta$ is a positive constant. Triple-Q uses UCB-exploration when learning the Q-values, where the UCB bonus and the learning rate at each update both depend on the visit count to the corresponding (state, action) pair (as in [14]). Different from the optimistic Q-learning algorithms for unconstrained MDPs (e.g. [14, 26, 28]), the learning rates in Triple-Q need to be periodically reset at the beginning of each frame, where a frame consists of a fixed number of consecutive episodes. The value of the virtual-Queue (the dual variable) is updated once in every frame. So Triple-Q can be viewed as a two-time-scale algorithm: the virtual-Queue is updated at a slow time scale, and the pseudo-Q-values for a fixed virtual-Queue are learned at a fast time scale within each frame. Furthermore, it is critical to update the two Q-functions (reward and utility) following a rule similar to SARSA [22] instead of Q-learning [27]; in other words, using the Q-values of the action that is actually taken instead of the maximum over actions.
We prove that Triple-Q achieves sublinear reward regret and guarantees zero constraint violation when the total number of episodes $K$ exceeds a threshold (which involves a factor logarithmic in $K$). Therefore, in terms of constraint violation, our bound is sharp for large $K$. To the best of our knowledge, this is the first model-free, simulator-free RL algorithm with sublinear regret and zero constraint violation. For model-based approaches, it has been shown that a model-based algorithm can achieve both sublinear regret and sublinear constraint violation (see, e.g. [11]). It remains open what the fundamental lower bound on the regret under model-free algorithms for CMDPs is, and whether the regret bound under Triple-Q is order-wise sharp or can be further improved. Table 1 summarizes the key results on the exploration-exploitation tradeoff of CMDPs in the literature.
As with many other model-free RL algorithms, a major advantage of Triple-Q is its low computational complexity. The computational complexity of Triple-Q is similar to that of SARSA for unconstrained MDPs, so it retains both the effectiveness and the efficiency of SARSA while solving a much harder problem. While we consider a tabular setting in this paper, Triple-Q can easily incorporate function approximations (linear function approximations or neural networks) by replacing the Q-functions with their function-approximation versions, making the algorithm a very appealing approach for solving complex CMDPs in practice. We will demonstrate this by applying Deep Triple-Q, i.e. Triple-Q with neural networks, to the Dynamic Gym benchmark [30] in Section 6.
2 Problem Formulation
We consider an episodic CMDP with a finite state space of size $S$, a finite action space of size $A$, $H$ steps in each episode, and a collection of transition kernels (transition probability matrices), one for each step. At the beginning of each episode, an initial state is sampled from an initial distribution. Then, at each step $h$, the agent takes an action after observing the current state, receives a reward, and incurs a utility. The environment then moves to a new state sampled from the corresponding transition kernel. Similar to [14], we assume that the reward and utility functions are deterministic for convenience.
Given a policy, which is a collection of per-step decision rules, the reward value function at step $h$ is the expected cumulative reward from step $h$ to the end of the episode under the policy.
The (reward) Q-function at step $h$ is the expected cumulative reward when the agent starts from a state-action pair at step $h$ and then follows the policy.
Similarly, we define the utility value function and the utility Q-function at step $h$:
For simplicity, we adopt the following notation (some used in [14, 9]):
From the definitions above, the value functions and the Q-functions satisfy the standard Bellman relations.
Given the model defined above, the objective of the agent is to find a policy that maximizes the expected cumulative reward subject to a constraint on the expected utility:
(1)
where the constraint level is assumed to be positive to avoid triviality, and the expectation is taken with respect to the initial distribution.
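In assumed notation (with $\pi$ a policy, $V_1^{\pi}$ and $W_1^{\pi}$ the reward and utility value functions at step 1, $x_1$ the initial state, and $\rho$ the required utility level; these symbol choices are ours, not necessarily the paper's), problem (1) has the standard episodic-CMDP form

$$ \max_{\pi}\ \mathbb{E}\big[V_1^{\pi}(x_1)\big] \quad \text{subject to} \quad \mathbb{E}\big[W_1^{\pi}(x_1)\big] \ \ge\ \rho, $$

with the expectation taken over $x_1$ drawn from the initial distribution.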
Remark 1.
The results in the paper can be directly applied to a constraint stated as an upper bound (e.g., on a cumulative cost):
(2)
Without loss of generality, the quantity bounded in (2) can be replaced by a complementary utility, so that the constraint in (2) can be written as
(3)
where the transformed utility and constraint level are defined from the original ones.
We use superscript $*$ to denote the optimal solution to the CMDP problem defined in (1). We evaluate our model-free RL algorithm using the regret and the constraint violation defined below:
(4)
(5)
where the value functions in the sums are those of the policy used in episode $k$, and the expectation is taken with respect to the distribution of the initial state.
In this paper, we assume that the following standard Slater's condition holds.
Assumption 1.
(Slater’s Condition). Given the initial distribution, there exist a constant $\delta > 0$ and a policy under which the expected cumulative utility exceeds the required level by at least $\delta$.
In this paper, Slater’s condition simply means there exists a feasible policy that can satisfy the constraint with a slackness $\delta$. This condition has been commonly used in the literature [9, 10, 11, 19]. We call $\delta$ Slater’s constant. While the regret and constraint violation bounds depend on $\delta$, our algorithm does not need to know $\delta$ under the assumption that $K$ is large (the exact condition can be found in Theorem 1). This is a noticeable difference from some of the works on CMDPs in which the agent needs to know the value of this constant (e.g. [9]) or, alternatively, a feasible policy (e.g. [2]).
3 Triple-Q
In this section, we introduce Triple-Q for CMDPs. The design of our algorithm is based on the primal-dual approach in optimization. While RL algorithms based on the primal-dual approach have been developed for CMDPs [9, 10, 21, 11], a model-free RL algorithm with sublinear regret and zero constraint violation is new.
Given a Lagrange multiplier, we consider the Lagrangian of problem (1) from a given initial state:
(6)
which is an unconstrained MDP whose reward at each step is the original reward plus the utility weighted by the Lagrange multiplier. Assuming we solve this unconstrained MDP and obtain the optimal policy, we can then update the dual variable (the Lagrange multiplier) using a gradient method:
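As a sketch under assumed notation (Lagrange multiplier $\lambda \ge 0$, step size $\theta > 0$, constraint level $\rho$, and $W_1^{\pi_\lambda}$ the utility value of the policy $\pi_\lambda$ that solves the unconstrained MDP for the current $\lambda$; the symbols are ours), a standard dual (sub)gradient step takes the form

$$ \lambda \ \leftarrow\ \Big(\lambda - \theta\,\big(\mathbb{E}\big[W_1^{\pi_\lambda}(x_1)\big] - \rho\big)\Big)^{+}, $$

where $(\cdot)^{+}$ denotes projection onto the nonnegative reals: the multiplier increases when the constraint is violated and decreases otherwise.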
While primal-dual is a standard approach, analyzing the finite-time performance, such as regret or sample complexity, is particularly challenging. For example, over a finite learning horizon, we will not be able to exactly solve the unconstrained MDP for a given Lagrange multiplier. Therefore, we need to carefully design how often the Lagrange multiplier should be updated. If we update it too often, the algorithm may not have sufficient time to solve the unconstrained MDP, which leads to divergence; on the other hand, if we update it too slowly, the iterates converge slowly to the optimal solution, which leads to large regret and constraint violation. Another challenge is that, for a given multiplier, the primal-dual algorithm solves a problem with an objective different from the original one and does not consider any constraint violation. Therefore, even when asymptotic convergence can be established, establishing finite-time regret is still difficult because we need to evaluate the difference between the policy used at each step and the optimal policy.
Next we will show that a low-complexity primal-dual algorithm can converge and have sublinear regret and zero constraint violation when carefully designed. In particular, Triple-Q includes the following key ideas:
- A sub-gradient algorithm for estimating the Lagrange multiplier, which is updated at the beginning of each frame as follows:
  (7)
  where the update uses the sum of the utility estimates over the episodes in the previous frame. We call this quantity a virtual queue because this terminology has been widely used in stochastic networks (see e.g. [17, 24]): if we view the required utility level (plus a small slack) as the jobs that arrive at a queue within each frame and the accumulated utility as the jobs that leave the queue within each frame, then the virtual queue is the backlog of jobs waiting at the queue. Note that we add a small extra slack to the required utility; by choosing the slack appropriately, the virtual queue pessimistically estimates the constraint violation, so Triple-Q achieves zero constraint violation when the number of episodes is large.
- A carefully chosen scaling parameter so that, when the scaled virtual-queue length is used as the estimated Lagrange multiplier, it balances the trade-off between maximizing the cumulative reward and satisfying the constraint.
- Carefully chosen learning rates and Upper Confidence Bound (UCB) bonuses that guarantee the estimated Q-values do not significantly deviate from the actual Q-values. We remark that the learning rate and UCB bonus proposed for unconstrained MDPs [14] do not work here. Our learning rate depends on the number of visits to a given (state, action) pair at a particular step and decays much more slowly than the classic learning rates used in [14]. The learning rate is further reset from frame to frame, so Triple-Q can continue to learn the pseudo-Q-values, which vary from frame to frame due to the change of the virtual-Queue (the Lagrange multiplier).
We now formally introduce Triple-Q. A detailed description is presented in Algorithm 1. The algorithm only needs to know a small number of basic problem parameters; no other problem-specific values are needed. Furthermore, Triple-Q performs two Q-function updates per step, one for the reward and one for the utility, and one simple virtual-queue update per frame, so its computational complexity is similar to that of SARSA.
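For intuition, the following is a minimal tabular sketch of the ideas above; it is not a reproduction of Algorithm 1. The environment interface, the learning-rate and bonus forms, and all constants (`frame_len`, `eta`, `epsilon`, `chi`, `bonus_scale`) are placeholders of our own choosing, and the frame-boundary extra bonuses and Q-value thresholding are omitted.

```python
import numpy as np

def triple_q_sketch(env, S, A, H, K, frame_len, eta, epsilon, rho,
                    chi=1.0, bonus_scale=1.0):
    """Minimal tabular sketch of the Triple-Q ideas (NOT the paper's Algorithm 1).

    Assumed interfaces: env.reset() -> initial state index, and
    env.step(h, s, a) -> (next_state, reward, utility).
    """
    Q = np.full((H, S, A), float(H))    # optimistic reward Q-table
    C = np.full((H, S, A), float(H))    # optimistic utility Q-table
    N = np.zeros((H, S, A), dtype=int)  # per-frame visit counts
    Z = 0.0                             # virtual queue (scaled dual variable)
    frame_utility = 0.0                 # total utility collected in the current frame

    for k in range(K):
        # Frame boundary: update the virtual queue once, then reset the
        # visit counts so the learning rates restart in the new frame.
        if k > 0 and k % frame_len == 0:
            avg_utility = frame_utility / frame_len
            Z = max(Z + rho + epsilon - avg_utility, 0.0)
            frame_utility = 0.0
            N[:] = 0

        s = env.reset()
        for h in range(H):
            # Greedy action w.r.t. the pseudo-Q-value Q + (Z / eta) * C.
            a = int(np.argmax(Q[h, s] + (Z / eta) * C[h, s]))
            s_next, r, g = env.step(h, s, a)

            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (chi + 1.0) / (chi + t)          # slowly decaying rate (placeholder form)
            b = bonus_scale * np.sqrt((H ** 2) / t)  # UCB bonus (placeholder form)

            # SARSA-style targets: use the value of the action the pseudo-greedy
            # policy takes at the next state, not a max over the reward Q alone.
            if h + 1 < H:
                a_next = int(np.argmax(Q[h + 1, s_next] + (Z / eta) * C[h + 1, s_next]))
                v_r, v_c = Q[h + 1, s_next, a_next], C[h + 1, s_next, a_next]
            else:
                v_r, v_c = 0.0, 0.0

            Q[h, s, a] = (1 - alpha) * Q[h, s, a] + alpha * (r + v_r + b)
            C[h, s, a] = (1 - alpha) * C[h, s, a] + alpha * (g + v_c + b)

            frame_utility += g
            s = s_next

    return Q, C, Z
```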
The next theorem summarizes the regret and constraint violation bounds guaranteed under Triple-Q.
Theorem 1.
Assume that $K$ is sufficiently large (the precise condition depends on the problem parameters). Then Triple-Q achieves the following regret and constraint violation bounds:
If $K$ is further sufficiently large, then Triple-Q guarantees zero constraint violation both in expectation and with a high probability.
4 Proof of the Main Theorem
We now present the complete proof of the main theorem.
4.1 Notation
In the proof, we explicitly include the episode index in our notation. In particular,
- the state and the action taken at step $h$ of episode $k$;
- the reward Q-function, the utility Q-function, the virtual-Queue, and related quantities at the beginning of episode $k$;
- the visit count, the reward value-function, and the utility value-function after they are updated at step $h$ of episode $k$ (i.e. after line 9 of Triple-Q).
We also use a shorthand notation in which, when two functions take the same argument value, we put the functions inside brackets and the common argument(s) outside.
A summary of notations used throughout this paper can be found in Table 3 in the appendix.
4.2 Regret
To bound the regret, we consider the following offline optimization problem as our regret baseline [3, 20]:
(8)
s.t.: (9)
(10)
(11)
(12)
(13)
Recall that the transition kernel gives the probability of transitioning to a state upon taking an action in a state at each step. This optimization problem is a linear program (LP): the decision variable is the probability that a (state, action) pair occurs at a given step, its sum over actions is the probability that the environment is in a given state at that step, and
the ratio of the two is the probability of taking the action in the state at that step, which defines the policy. We can see that (9) is the utility constraint, (10) is the global-balance equation for the MDP, (11) is the normalization condition so that the occupancy measure is a valid probability distribution, and (12) states that the initial state is sampled from the initial distribution. Therefore, the optimal solution to this LP solves the CMDP (if the model is known), so we use the optimal solution to this LP as our baseline.
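For concreteness, a standard occupancy-measure LP matching this description, written in assumed notation (occupancy measure $q_h(x,a)$, reward $r_h$, utility $g_h$, transition kernel $\mathbb{P}_h$, initial distribution $\mu_0$, and constraint level $\rho$; these symbols are ours), is

$$
\begin{aligned}
\max_{q \ge 0}\quad & \sum_{h=1}^{H} \sum_{x,a} q_h(x,a)\, r_h(x,a) \\
\text{s.t.}\quad & \sum_{h=1}^{H} \sum_{x,a} q_h(x,a)\, g_h(x,a) \ \ge\ \rho, \\
& \sum_{a} q_{h+1}(x',a) \ =\ \sum_{x,a} \mathbb{P}_h(x' \mid x, a)\, q_h(x,a) \quad \forall\, x',\ h, \\
& \sum_{x,a} q_h(x,a) \ =\ 1 \quad \forall\, h, \qquad \sum_{a} q_1(x,a) \ =\ \mu_0(x) \quad \forall\, x,
\end{aligned}
$$

and the induced policy takes action $a$ in state $x$ at step $h$ with probability $q_h(x,a)\big/\sum_{a'} q_h(x,a')$.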
To analyze the performance of Triple-Q, we need to consider a tightened version of the LP, which is defined below:
(14)
s.t.: the constraints of the LP (8), with the utility constraint tightened by a constant $\epsilon > 0$,
where $\epsilon$ is called the tightness constant. When $\epsilon$ is no greater than Slater's constant, this problem has a feasible solution due to Slater's condition. We use superscript $*$ to denote the optimal value/policy related to the original CMDP (1) or the solution to the corresponding LP (8), and superscript $\epsilon,*$ to denote the optimal value/policy related to the $\epsilon$-tightened version of the CMDP (defined in (14)).
Following the definition of the regret in (4), we have
Now by adding and subtracting the corresponding terms, we obtain
(15)
(16)
(17)
Next, we establish the regret bound by analyzing the three terms above. We first present a brief outline.
4.2.1 Outline of the Regret Analysis
- Step 1: Term (15) is the difference between the optimal value of the original CMDP and that of its $\epsilon$-tightened version. Lemma 1 shows that this difference grows at most linearly in the tightness constant, so it can be made small by choosing the tightness constant appropriately.
- Step 2: The learned value function is an estimate of the true value function, and the estimation error (17) is controlled by the learning rates and the UCB bonuses. In Lemma 2, we will show that the cumulative estimation error over one frame is bounded. Therefore, under our choices of the learning-rate and bonus parameters, the cumulative estimation error over the $K$ episodes contributes only a sublinear term to the regret. The proof of Lemma 2 is based on a recursive formula that relates the estimation error at one step to the estimation error at the next step, similar to the one used in [14], but with different learning rates and UCB bonuses.
- Step 3: Bounding (16) is the most challenging part of the proof. For unconstrained MDPs, the optimistic Q-learning in [14] guarantees that the learned Q-function is an overestimate of the optimal Q-function (and hence also of the Q-function of any policy) for all episodes and steps simultaneously with a high probability. However, this result does not hold under Triple-Q because Triple-Q takes greedy actions with respect to the pseudo-Q-function instead of the reward Q-function. To overcome this challenge, we first add and subtract additional terms to obtain
  (18)
  (19)
  We can see that (18) is the difference of two pseudo-Q-functions. Using a three-dimensional induction (on step, episode, and frame), we will prove in Lemma 3 that the learned pseudo-Q-function is an overestimate (i.e. inequality (27) holds) for all steps, episodes, and frames simultaneously with a high probability. Since the virtual queue changes from frame to frame, Triple-Q adds the extra bonus in line 21 so that the induction can be carried out over frames.
  Finally, to bound (19), we use the Lyapunov-drift method and consider a Lyapunov function of the virtual-queue length at the beginning of each frame (a sketch of the standard quadratic form we have in mind appears right after this outline). We will show in Lemma 4 that the Lyapunov drift satisfies
  (20)
  Inequality (20) will be established by showing that Triple-Q takes actions that almost greedily reduce the virtual-Queue when the queue is large, which results in the negative drift in (20). From (20), we observe that
  (21)
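For reference, the standard quadratic form we have in mind is the following (a sketch in assumed notation; $Z_T$ denotes the virtual-queue length at the beginning of frame $T$, and the symbols are ours):

$$ L_T = \tfrac{1}{2} Z_T^2, \qquad \Delta_T = L_{T+1} - L_T, $$

so a negative conditional drift $\mathbb{E}[\Delta_T \mid Z_T]$ for large $Z_T$ keeps the virtual queue, and hence the accumulated constraint deficit, from growing without bound.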
Combining the results in the three steps above, we obtain the regret bound in Theorem 1.
4.2.2 Detailed Proof
We next present the detailed proof. The first lemma bounds the difference between the original CMDP and its $\epsilon$-tightened version. The result is intuitive because the $\epsilon$-tightened version is a small perturbation of the original problem.
Lemma 1.
Given , we have
Proof.
Given is the optimal solution, we have
Under Assumption 1, we know that there exists a feasible solution such that
We construct which satisfies that
Also we have for all Thus is a feasible solution to the -tightened optimization problem (14). Then given is the optimal solution to the -tightened optimization problem, we have
where the last inequality holds because under our assumption. Therefore the result follows because
∎
The next lemma bounds the difference between the estimated Q-functions and actual Q-functions in a frame. The bound on (17) is an immediate result of this lemma.
Lemma 2.
Under Triple-Q, we have for any
Proof.
We will prove the result for the reward Q-function; the proof for the utility Q-function is almost identical. We first establish a recursive equation that relates a Q-function to the value-functions in the earlier episodes of the same frame. Recall that under Triple-Q, the reward Q-function in an episode of a given frame is updated as follows:
Define the index of the episode in which the agent visits the given (state, action) pair at step $h$ for the $i$-th time in the current frame.
The update equation above can be written as:
Repeatedly using the equation above, we obtain
(22)
where and From the inequality above, we further obtain
(23)
The notation becomes rather cumbersome because, for each (state, action) pair, we need to consider a corresponding sequence of episode indices in which the agent visits that pair. Next we will analyze a given sample path (i.e. a specific realization of the episodes in a frame), so we simplify the notation in this proof as follows:
the index denotes the episode in which the agent visits the state-action pair for the $i$-th time. Since, on a given sample path, the visit count uniquely determines this index, the simplified notation introduces no ambiguity. Furthermore, we drop the frame index because we only consider episodes in one frame in this proof.
We note that
(24)
where the first inequality holds because each term appears in the summation on the left-hand side once for every later episode in the same frame in which the environment visits the same (state, action) pair again, and the second inequality holds due to the property of the learning rate proved in Lemma 7-(d). By substituting (24) into (23) and using Lemma 7-(b), we obtain
where the last inequality holds because (i) we have
(ii) by using Lemma 8, and (iii) we know that
where the last inequality above holds because the left-hand side is a summation whose terms are given by a decreasing function,
so the sum is maximized by picking the largest terms. Thus we can obtain
Taking the expectation on both sides yields
Then, by using the inequality repeatedly, we obtain for any
so the lemma holds.
∎
From the lemma above, we can immediately conclude:
We now focus on (16), and further expand it as follows:
(25)
(26)
where
We first show (25) can be bounded using the following lemma. This result holds because the choices of the UCB bonuses and the additional bonuses added at the beginning of each frame guarantee that is an over-estimate of for all and with a high probability.
Lemma 3.
With probability at least the following inequality holds simultaneously for all
(27)
which further implies that
(28)
Proof.
Consider a frame and the episodes in that frame. Since the value of the virtual queue does not change during a frame, we write it without the episode index. We further define/recall the following notation:
According to Lemma 9 in the appendix, we have
(29)
where inequality holds because of the concentration result in Lemma 10 in the appendix and
by using Lemma 7-(c), and equality holds because Triple-Q selects the action that maximizes so .
The inequality above suggests that we can prove the desired bound for any episode if (i) the bound holds at the beginning of the frame, and (ii) the bound holds at the corresponding step in all the previous episodes of the same frame.
We now prove the lemma using induction. We first consider and i.e. the last step in the first frame. In this case, inequality (29) becomes
(30)
Based on induction, we can first conclude that
for all steps and episodes, where the values are those before line 20, i.e. before adding the extra bonuses and thresholding the Q-values at the end of a frame. Now suppose that (27) holds for any episode in a given frame and any step, and consider
(31)
Note that the update in lines 21-23 distinguishes two cases. Here, we use superscripts to indicate the Q-values before and after lines 21-24 of Triple-Q. Therefore, at the beginning of the frame, we have
(32)
where inequality holds due to the induction assumption and the fact and holds because according to Lemma 8,
Next we bound (26) using a Lyapunov drift analysis of the virtual queue. Since the virtual queue is updated once per frame, we abuse notation and index it by the frame. We further define
Therefore, under Triple-Q, we have
(37)
Define the Lyapunov function to be
The next lemma bounds the expected Lyapunov drift conditioned on
Lemma 4.
Assume The expected Lyapunov drift satisfies
(38)
Proof.
Based on the definition, the Lyapunov drift is
where the first inequality is a result of the upper bound on in Lemma 8.
Let be a feasible solution to the tightened LP (14). Then the expected Lyapunov drift conditioned on is
(39)
Now we focus on the term inside the summation and obtain that
where the inequality holds because the action is chosen to maximize the pseudo-Q-value under Triple-Q, and the last equality holds because we use a feasible solution to the optimization problem (14), so
Therefore, we can conclude the lemma by substituting in the optimal solution. ∎
After taking the expectation, dividing both sides by the appropriate constant, and then applying the telescoping sum, we obtain
(40)
where the last inequality holds because the Lyapunov function is non-negative (and is zero initially).
Now combining Lemma 3 and inequality (40), we conclude that
Further combining the inequality above with Lemma 1 and Lemma 2, we obtain
(41)
By choosing the frame length, the tightness constant, and the remaining parameters appropriately, we conclude that when $K$ is sufficiently large,
(42)
4.3 Constraint Violation
4.3.1 Outline of the Constraint Violation Analysis
Again, we consider the value of the virtual-Queue at the beginning of each frame. According to the virtual-Queue update defined in Triple-Q, we have
which implies that
Summing the inequality above over all frames and taking the expectation on both sides, we obtain the following upper bound on the constraint violation:
(43)
where we also used the non-negativity of the virtual queue.
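The underlying argument is the standard virtual-queue bound, sketched here in assumed notation ($Z_T$ is the queue at the start of frame $T$, $\hat{g}_T$ the per-frame utility estimate entering the update, $\rho$ the constraint level, $\epsilon$ the added slack, and $T_{\max}$ the number of frames; the symbols are ours): since

$$ Z_{T+1} = \big(Z_T + \rho + \epsilon - \hat{g}_T\big)^{+} \ \ge\ Z_T + \rho + \epsilon - \hat{g}_T, $$

summing over $T = 1, \dots, T_{\max}$ and rearranging gives

$$ \sum_{T=1}^{T_{\max}} \big(\rho - \hat{g}_T\big) \ \le\ Z_{T_{\max}+1} - Z_1 - T_{\max}\,\epsilon, $$

so the accumulated constraint deficit is controlled by the final queue length minus the slack accumulated over all frames.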
In Lemma 2, we already established an upper bound on the estimation error of the utility Q-functions:
(44)
Next, we study the moment generating function of the virtual-queue length. Based on a Lyapunov drift analysis of this moment generating function and Jensen's inequality, we will establish the following upper bound, which holds for any frame:
(45)
Under our parameter choices, it can be easily verified that the accumulated slack term dominates the upper bounds in (44) and (45), which leads to the conclusion in Theorem 1 that the constraint violation becomes zero when $K$ is sufficiently large.
4.3.2 Detailed Proof
To complete the proof, we need to establish the following upper bound on the expected virtual-queue length, based on a bound on its moment generating function.
Lemma 5.
Assuming we have for any
(46)
The proof will also use the following lemma from [16].
Lemma 6.
Consider a Markov chain with a Lyapunov function and its associated drift. Suppose that the expected drift satisfies the following conditions:
- (1) There exist constants such that the expected drift is at most a negative constant whenever the Lyapunov function exceeds a given threshold.
- (2) The magnitude of the drift is bounded with probability one.
Then we have
where
Proof of Lemma 5.
We apply Lemma 6 to a new Lyapunov function:
We substitute the results from Lemmas 2 and 5 into (43) under the assumption on $K$ stated in the theorem, which guarantees the required relation among the parameters. Then, using elementary bounds, we can easily verify that
If further we have we can obtain
Now to prove the high probability bound, recall that from inequality (37), we have
(48)
According to inequality (47), we have
which implies that
(49)
where the first inequality follows from Markov's inequality.
In the proof of Lemma 2, we have shown
(50)
Following a proof similar to that of Lemma 10, we can show that the corresponding concentration bound
holds with probability at least the stated level. By iteratively using inequality (50) and summing over all frames, we conclude that, with probability at least the stated level,
(51)
where the last inequality holds by our choice of parameters.
5 Convergence and $\epsilon$-Optimal Policy
The Triple-Q algorithm is an online learning algorithm, not a stationary policy. In theory, we can obtain a near-optimal, stationary policy following the idea proposed in [14]. Assume the agent stores all the policies used during learning; note that each policy is defined by the two Q tables and the value of the virtual queue. At the end of the learning horizon, the agent constructs a stochastic policy such that, at the beginning of each episode, it selects one of the stored policies uniformly at random.
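A minimal sketch of this construction (assumed interface: each stored policy is a callable mapping a step index and a state to an action; the class and names are ours):

```python
import random

class MixturePolicy:
    """Uniform mixture over the policies stored during learning.

    At the beginning of each episode, one stored policy is drawn uniformly at
    random and used for the entire episode.
    """

    def __init__(self, stored_policies):
        self.stored_policies = list(stored_policies)
        self.current = None

    def start_episode(self):
        # Sample one stored policy uniformly at random for this episode.
        self.current = random.choice(self.stored_policies)

    def act(self, h, state):
        return self.current(h, state)
```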
We note that given any initial state
Therefore, under policy we have
and
Therefore, for any given accuracy level, the constructed mixture policy is a near-optimal policy when $K$ is sufficiently large.
While the mixture policy is near-optimal, in practice it may not be possible to store all policies during learning due to memory constraints. A heuristic approach to obtain a near-optimal, stationary policy is to fix the two Q-functions (reward and utility) after the learning horizon and continue to adapt the virtual queue with the same frame size. This is a stochastic policy whose randomness comes from the virtual queue; when the virtual queue reaches its steady state, we obtain a stationary policy. The resulting policy performs well in our experiments (see Section 6).
6 Evaluation
We evaluated Triple-Q using a grid-world environment [8]. We further implemented Triple-Q with neural network approximations, called Deep Triple-Q, for an environment with continuous state and action spaces, called the Dynamic Gym benchmark [30]. In both cases, Triple-Q and Deep Triple-Q quickly learn a safe policy with a high reward.
6.1 A Tabular Case
We first evaluated our algorithm using a grid-world environment studied in [8], shown in Figure 3-(a). The objective of the agent is to travel to the destination as quickly as possible while avoiding obstacles for safety. Hitting an obstacle incurs a cost. The destination yields a fixed reward, and the reward at any other location is the longest distance minus the Euclidean distance between that location and the destination. The cost constraint is set so that the agent is only allowed to hit the obstacles at most six times (we converted the utility constraint into a cost constraint as discussed in the paper). To account for statistical significance, the result of each experiment was averaged over multiple independent trials.
The results are shown in Figure 2, from which we can observe that Triple-Q quickly learns a well-performing policy while satisfying the safety constraint. Triple-Q-stop is a stationary policy obtained by stopping learning (i.e. fixing the Q tables) after a fixed number of training steps (note that the virtual-Queue continues to be updated, so the policy is stochastic). We can see that Triple-Q-stop has performance similar to Triple-Q, which shows that Triple-Q yields a near-optimal, stationary policy after the learning stops.
[Figure 2: Performance of Triple-Q and Triple-Q-stop in the grid-world environment.]
6.2 Triple-Q with Neural Networks
We further evaluated our algorithm on the Dynamic Gym benchmark (DynamicEnv) [30], shown in Figure 3-(b). In this environment, a point agent (one actuator for turning and another for moving) navigates a 2D map to reach the goal position while trying to avoid hazardous areas. The initial state of the agent, the goal position, and the hazards are randomly generated in each episode. At each step, the agent incurs a cost if it stays in a hazardous area; otherwise, there is no cost. The constraint is that the expected cost should not exceed 15. Since both the state and action spaces are continuous in this environment, we implemented the key ideas of Triple-Q with neural network approximations and the actor-critic method. In particular, two Q-functions are trained simultaneously, the virtual queue is updated slowly every few episodes, and the policy network is trained by optimizing the combination of the three “Q”s (Triple-Q). The details can be found in Table 2. We call this algorithm Deep Triple-Q. The simulation results in Figure 3 show that Deep Triple-Q learns a safe policy with a high reward much faster than WCSAC [30], requiring substantially fewer (millions of) training steps.
Table 2: Hyperparameters of Deep Triple-Q.

Parameter | Value
---|---
optimizer | Adam
learning rate | 
discount | 0.99
replay buffer size | 
number of hidden layers (all networks) | 2
batch size | 256
nonlinearity | ReLU
number of hidden units per layer (Critic) | 256
number of hidden units per layer (Actor) | 256
virtual queue update frequency | every 3 episodes
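For concreteness, a minimal PyTorch sketch of how the three “Q”s can be combined into the actor objective in a Deep Triple-Q-style implementation (module interfaces, names, and the sign convention are our assumptions, not the paper's code):

```python
import torch
from torch import nn, Tensor

def deep_triple_q_actor_loss(
    actor: nn.Module,
    reward_critic: nn.Module,
    cost_critic: nn.Module,
    states: Tensor,
    z: float,
    eta: float,
) -> Tensor:
    """Sketch of an actor objective in the spirit of Deep Triple-Q.

    Assumed interfaces: actor(states) -> actions, and each
    critic(states, actions) -> per-sample Q estimates. z is the current
    virtual-queue value and eta the scaling constant. Because this benchmark
    uses a cost constraint (expected cost must stay below a budget), the cost
    critic enters with a negative sign.
    """
    actions = actor(states)
    q_reward = reward_critic(states, actions)
    q_cost = cost_critic(states, actions)
    # Maximize the combined pseudo-Q-value, i.e. minimize its negation.
    return -(q_reward - (z / eta) * q_cost).mean()
```

In a full implementation, the virtual queue `z` would be updated every few episodes (cf. Table 2) based on the accumulated episode cost relative to the budget, and the two critics would be trained with standard temporal-difference losses.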
[Figure 3: (a) The grid-world environment; (b) the Dynamic Gym (DynamicEnv) benchmark and the learning curves of Deep Triple-Q and WCSAC.]
7 Conclusions
This paper considered CMDPs and proposed a model-free RL algorithm without a simulator, named Triple-Q. From a theoretical perspective, Triple-Q achieves sublinear regret and zero constraint violation. We believe it is the first model-free RL algorithm for CMDPs with provable sublinear regret and without a simulator. From an algorithmic perspective, Triple-Q has computational complexity similar to that of SARSA and can easily incorporate recent deep Q-learning algorithms to obtain a deep Triple-Q algorithm, which makes our method particularly appealing for complex and challenging CMDPs in practice.
While we only considered a single constraint in this paper, it is straightforward to extend the algorithm and the analysis to multiple constraints. Assuming there are multiple constraints in total, Triple-Q can maintain a virtual queue and a utility Q-function for each constraint and then select an action at each step by solving the following problem:
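As a sketch in assumed notation (with $Q_h$ the reward Q-function, $C_h^{(j)}$ and $Z_j$ the utility Q-function and virtual queue for constraint $j$, and $\eta$ a scaling constant; the symbols are ours), the action selection becomes

$$ a_h \in \arg\max_{a}\ \Big\{ Q_h(x_h, a) + \frac{1}{\eta} \sum_{j} Z_j\, C_h^{(j)}(x_h, a) \Big\}, $$

with each virtual queue updated once per frame from the accumulated slack of its own constraint.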
References
- [1] Naoki Abe, Prem Melville, Cezar Pendus, Chandan K Reddy, David L Jensen, Vince P Thomas, James J Bennett, Gary F Anderson, Brent R Cooley, Melissa Kowalczyk, et al. Optimizing debt collections using constrained reinforcement learning. In Proceedings of the 16th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 75–84, 2010.
- [2] Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017.
- [3] Eitan Altman. Constrained Markov decision processes, volume 7. CRC Press, 1999.
- [4] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. On the sample complexity of reinforcement learning with a generative model. In Int. Conf. Machine Learning (ICML), Madison, WI, USA, 2012.
- [5] Mohammad Gheshlaghi Azar, Rémi Munos, and Hilbert J. Kappen. Minimax pac bounds on the sample complexity of reinforcement learning with a generative model. Mach. Learn., 91(3):325–349, June 2013.
- [6] Kianté Brantley, Miroslav Dudik, Thodoris Lykouris, Sobhan Miryoosefi, Max Simchowitz, Aleksandrs Slivkins, and Wen Sun. Constrained episodic reinforcement learning in concave-convex and knapsack settings. arXiv preprint arXiv:2006.05051, 2020.
- [7] Yi Chen, Jing Dong, and Zhaoran Wang. A primal-dual approach to constrained Markov decision processes. arXiv preprint arXiv:2101.10895, 2021.
- [8] Yinlam Chow, Ofir Nachum, Edgar Duenez-Guzman, and Mohammad Ghavamzadeh. A lyapunov-based approach to safe reinforcement learning. arXiv preprint arXiv:1805.07708, 2018.
- [9] Dongsheng Ding, Xiaohan Wei, Zhuoran Yang, Zhaoran Wang, and Mihailo Jovanovic. Provably efficient safe exploration via primal-dual policy optimization. In Proceedings of The 24th International Conference on Artificial Intelligence and Statistics, 2021.
- [10] Dongsheng Ding, Kaiqing Zhang, Tamer Basar, and Mihailo Jovanovic. Natural policy gradient primal-dual method for constrained markov decision processes. Advances in Neural Information Processing Systems, 33, 2020.
- [11] Yonathan Efroni, Shie Mannor, and Matteo Pirotta. Exploration-exploitation in constrained MDPs. arXiv preprint arXiv:2003.02189, 2020.
- [12] Jaime F Fisac, Anayo K Akametalu, Melanie N Zeilinger, Shahab Kaynama, Jeremy Gillula, and Claire J Tomlin. A general safety framework for learning-based control in uncertain robotic systems. IEEE Transactions on Automatic Control, 64(7):2737–2752, 2018.
- [13] Javier Garcia and Fernando Fernández. Safe exploration of state and action spaces in reinforcement learning. Journal of Artificial Intelligence Research, 45:515–564, 2012.
- [14] Chi Jin, Zeyuan Allen-Zhu, Sebastien Bubeck, and Michael I Jordan. Is q-learning provably efficient? In Advances Neural Information Processing Systems (NeurIPS), volume 31, pages 4863–4873, 2018.
- [15] Krishna C. Kalagarla, Rahul Jain, and Pierluigi Nuzzo. A sample-efficient algorithm for episodic finite-horizon MDP with constraints. arXiv preprint arXiv:2009.11348, 2020.
- [16] M. J. Neely. Energy-aware wireless scheduling with near-optimal backlog and convergence time tradeoffs. IEEE/ACM Transactions on Networking, 24(4):2223–2236, 2016.
- [17] Michael J. Neely. Stochastic network optimization with application to communication and queueing systems. Synthesis Lectures on Communication Networks, 3(1):1–211, 2010.
- [18] Masahiro Ono, Marco Pavone, Yoshiaki Kuwata, and J Balaram. Chance-constrained dynamic programming with application to risk-aware robotic space exploration. Autonomous Robots, 39(4):555–571, 2015.
- [19] Santiago Paternain, Luiz Chamon, Miguel Calvo-Fullana, and Alejandro Ribeiro. Constrained reinforcement learning has zero duality gap. In Advances in Neural Information Processing Systems, 2019.
- [20] Martin L Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, 2014.
- [21] Shuang Qiu, Xiaohan Wei, Zhuoran Yang, Jieping Ye, and Zhaoran Wang. Upper confidence primal-dual reinforcement learning for CMDP with adversarial loss. In Advances in Neural Information Processing Systems, 2020.
- [22] Gavin A Rummery and Mahesan Niranjan. On-line Q-learning using connectionist systems, volume 37. University of Cambridge, Department of Engineering Cambridge, UK, 1994.
- [23] Rahul Singh, Abhishek Gupta, and Ness B Shroff. Learning in markov decision processes under constraints. arXiv preprint arXiv:2002.12435, 2020.
- [24] R. Srikant and Lei Ying. Communication Networks: An Optimization, Control and Stochastic Networks Perspective. Cambridge University Press, 2014.
- [25] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
- [26] Yuanhao Wang, Kefan Dong, Xiaoyu Chen, and Liwei Wang. Q-learning with UCB exploration is sample efficient for infinite-horizon MDP. In International Conference on Learning Representations, 2020.
- [27] Christopher John Cornish Hellaby Watkins. Learning from Delayed Rewards. PhD thesis, King’s College, King’s College, Cambridge United Kingdom, May 1989.
- [28] Chen-Yu Wei, Mehdi Jafarnia Jahromi, Haipeng Luo, Hiteshi Sharma, and Rahul Jain. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020.
- [29] Tengyu Xu, Yingbin Liang, and Guanghui Lan. A primal approach to constrained policy optimization: Global optimality and finite-time analysis. arXiv preprint arXiv:2011.05869, 2020.
- [30] Qisong Yang, Thiago D Simão, Simon H Tindemans, and Matthijs TJ Spaan. Wcsac: Worst-case soft actor critic for safety-constrained reinforcement learning. In Proceedings of the Thirty-Fifth AAAI Conference on Artificial Intelligence. AAAI Press, online, 2021.
- [31] Chao Yu, Jiming Liu, and Shamim Nemati. Reinforcement learning in healthcare: A survey. arXiv preprint arXiv:1908.08796, 2020.
In the appendix, we summarize notations used throughout the paper in Table 3, and present a few lemmas used to prove the main theorem.
Appendix A Notation Table
Notation | Definition |
---|---|
The total number of episodes | |
The number of states | |
The number of actions | |
The length of each episode | |
Set | |
The estimated reward Q-function at step in episode | |
The reward Q-function at step in episode under policy | |
The estimated reward value-function at step in episode . | |
The value-function at step in episode under policy | |
The estimated utility Q-function at step in episode | |
The utility Q-function at step in episode under policy | |
The estimated utility value-function at step in episode | |
The utility value-function at step in episode under policy | |
The reward of (state, action) pair at step | |
The utility of (state, action) pair at step | |
The number of visits to when at step in episode (not including ) | |
The dual estimation (virtual queue) in episode | |
The optimal solution to the LP of the CMDP (8). | |
The optimal solution to the tightened LP (14). | |
Slater’s constant. | |
the UCB bonus for given | |
The indicator function |
Appendix B Useful Lemmas
The first lemma establishes some key properties of the learning rates used in Triple-Q. The proof closely follows the proof of Lemma 4.1 in [14].
Lemma 7.
Recall the learning rate used in Triple-Q and the associated aggregate weights:
(53)
The following properties hold for
-
(a)
for for
-
(b)
for for
-
(c)
-
(d)
for every
-
(e)
for every
Proof.
The proofs of (a) and (b) follow directly from the definition of the learning rate. The proof of (d) is the same as that in [14].
For we have so (c) holds for .
The next lemma establishes upper bounds on and under Triple-Q.
Lemma 8.
For any we have the following bounds on and
Proof.
We first consider the last step of an episode, i.e. Recall that for any and by its definition and Suppose for any and any Then,
where we use the number of visits to the state-action pair at this step before episode $k$ (not including episode $k$) and the index of the episode of the most recent visit. Therefore, the upper bound holds for the last step.
Now suppose the upper bound holds for the later steps and the earlier episodes, and consider step $h$ in episode $k$.
where, again, we use the number of visits to the state-action pair at this step before episode $k$ (not including episode $k$) and the index of the episode of the most recent visit. Therefore, we obtain
Therefore, we can conclude that for any and The proof for is identical. ∎
Next, we present the following lemma from [14], which establishes a recursive relationship between and for any We include the proof so the paper is self-contained.
Lemma 9.
Consider any (state, action) pair, any step, and any policy. Let $t$ be the number of visits to the pair at that step in the current frame before episode $k$, and let the corresponding indices denote the episodes in which these visits occurred. We have the following two equations:
(56)
(57) |
where the hat denotes the empirical (sample-based) counterpart of the corresponding transition operator. The same definition applies to the utility Q-function as well.
Proof.
We will prove (56). The proof for (57) is identical. Recall that under Triple-Q, is updated as follows:
From the update equation above, we have in episode
Lemma 10.
Consider any frame. Let $t$ be the number of visits to the state-action pair at the given step before episode $k$ in the current frame, and let the corresponding indices denote these episodes. Under any policy, with probability at least the specified level, the following inequalities hold simultaneously for all state-action pairs and steps:
Proof.
Without loss of generality, we consider Fix any For any define
Consider the $\sigma$-algebra generated by all the random variables up to step $h$ in episode $k$. Then
which shows that is a martingale. We also have for
Then let By applying the Azuma-Hoeffding inequality, we have with probability at least
where the last inequality holds due to Lemma 7-(e). Because this inequality holds for any fixed value, it also holds for the realized (random) one. Applying the union bound, we obtain that, with probability at least the specified level, the following inequality holds simultaneously for all state-action pairs and steps:
Following a similar analysis we also have that with probability at least the following inequality holds simultaneously for all :
∎
Lemma 11.
Given under Triple-Q, the conditional expected drift is
(60)
Proof.
Recall that and the virtual queue is updated by using
From inequality (39), we have
The first inequality holds because of the design of our algorithm. The second inequality holds because the virtual queue is non-negative and, under Slater's condition, we can find a policy that satisfies the constraint with slackness.
Finally, the last inequality is obtained similarly to (36), together with the fact that the value functions are bounded by Lemma 8. ∎