Upper Confidence Primal-Dual Reinforcement Learning for CMDP with Adversarial Loss
Abstract
We consider online learning for episodic stochastically constrained Markov decision processes (CMDPs), which plays a central role in ensuring the safety of reinforcement learning. Here the loss function can vary arbitrarily across the episodes, and both the loss received and the budget consumption are revealed at the end of each episode. Previous works solve this problem under the restrictive assumption that the transition model of the Markov decision process (MDP) is known a priori and establish regret bounds that depend polynomially on the cardinalities of the state space and the action space. In this work, we propose a new upper confidence primal-dual algorithm, which only requires the trajectories sampled from the transition model. In particular, we prove that the proposed algorithm achieves $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upper bounds on both the regret and the constraint violation, where $L$ is the length of each episode, $|\mathcal{S}|$ and $|\mathcal{A}|$ are the cardinalities of the state and action spaces, and $T$ is the number of episodes. Our analysis incorporates a new high-probability drift analysis of Lagrange multiplier processes into the celebrated regret analysis of upper confidence reinforcement learning, which demonstrates the power of “optimism in the face of uncertainty” in constrained online learning.
1 Introduction
Constrained Markov decision processes (CMDPs) play an important role in control and planning. They aim at maximizing a reward or minimizing a penalty metric over the set of all available policies, subject to constraints on other relevant metrics. The constraints enforce fairness or safety of the policies so that, over time, the behavior of the chosen policy remains under control. For example, in an edge cloud serving network (Urgaonkar et al., 2015; Wang et al., 2015), one would like to minimize the average cost of serving moving targets subject to a constraint on the average serving delay. In an autonomous vehicle control problem (Le et al., 2019), one might be interested in minimizing the driving time subject to certain fuel efficiency or driving safety constraints.
The classical treatment of CMDPs dates back to Fox (1966) and Altman (1999), who reformulate the problem as a linear program (LP) via stationary state-action occupancy measures. However, formulating such an LP requires full knowledge of the transition model, reward, and constraint functions, and also assumes them to be fixed. Leveraging the episodic structure of a class of MDPs, Neely (2012) develops online renewal optimization, which potentially allows the loss and constraint functions to be stochastically varying and unknown, while still relying on the transition model to solve the subproblem within each episode.
More recently, policy-search type algorithms have received much attention, attaining state-of-the-art performance in various tasks without knowledge of the transition model, e.g., Williams (1992); Baxter and Bartlett (2000); Konda and Tsitsiklis (2000); Kakade (2002); Schulman et al. (2015); Lillicrap et al. (2015); Schulman et al. (2017); Sutton and Barto (2018); Fazel et al. (2018); Abbasi-Yadkori et al. (2019a, b); Bhandari and Russo (2019); Cai et al. (2019); Wang et al. (2019); Liu et al. (2019); Agarwal et al. (2019). While most of these algorithms focus on unconstrained reinforcement learning problems, there are efforts to develop policy-based methods for CMDPs with known constraints, albeit with limited theoretical guarantees. Chow et al. (2017) develop a primal-dual type algorithm which is shown to converge to a constraint-satisfying policy. Achiam et al. (2017) develop a trust-region type algorithm which requires solving an optimization problem with both trust-region and safety constraints during each update. Generalizing ideas from fitted-Q iteration, Le et al. (2019) develop a batch offline primal-dual type algorithm which guarantees only that the time-average primal-dual gap converges.
The goal of this paper is to solve constrained episodic MDPs in greater generality, where not only the transition model is unknown, but also the loss and constraint functions can change online. In particular, the losses can be arbitrarily time-varying and adversarial. Let $T$ be the number of episodes (in the non-episodic, infinite-horizon setting, $T$ denotes the total number of steps, which is slightly different from the aforementioned $T$ for the episodic setting). When the transition model is assumed to be known, Even-Dar et al. (2009) achieve an $O(\sqrt{T})$ regret (with a multiplicative dependence on the mixing time $\tau$ of the MDP), and Yu et al. (2009) achieve an $O(T^{3/4+\epsilon})$ regret. These two papers consider a continuing setting that is slightly different from the episodic setting considered in this paper. Zimin and Neu (2013) further study the episodic MDP and achieve an $O(\sqrt{T})$ regret. For the constrained case with known transitions, Wei et al. (2018) achieve an $O(\sqrt{T})$ regret and constraint violation, and Zheng and Ratliff (2020) attain $\widetilde{O}(T^{3/4})$ bounds for the non-episodic setting.
After the first version of this work was completed, several concurrent works appeared which also focus on CMDPs with unknown transitions and losses. Efroni et al. (2020a) study episodic tabular MDPs with unknown but fixed loss and constraint functions, where the feedback is stochastic bandit feedback. Leveraging upper confidence bounds (UCB) on the reward, constraints, and transitions, they obtain $\widetilde{O}(\sqrt{T})$ regret and constraint violation via linear programming as well as primal-dual optimization. In another work, Ding et al. (2020) study constrained episodic MDPs with a linear structure and adversarial losses via a primal-dual-type policy optimization algorithm, achieving $\widetilde{O}(\sqrt{T})$ regret and constraint violation. While the scenario in Ding et al. (2020) is more general than ours, their dependence on the length of the episode is worse when applied to the tabular case. Both of these works rely on the Slater condition, which is also more restrictive than the assumption adopted in this work.
On the other hand, for unconstrained online MDPs, the idea of UCB has been shown to be effective and helps to achieve tight regret bounds without knowing the transition model, e.g., Jaksch et al. (2010); Azar et al. (2017); Rosenberg and Mansour (2019a, b); Jin et al. (2019). The main idea there is to sequentially refine a confidence set of the transition model and choose the model in the confidence set that performs best for optimizing the current value.
The main contribution of this paper is to show that UCB is also effective when incorporated into primal-dual type approaches, achieving $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ regret and constraint violation simultaneously in online CMDPs when the transition model is unknown, the loss is adversarial, and the constraints are stochastic. This almost matches the lower bound for the regret (Jaksch et al., 2010) up to a $\sqrt{|\mathcal{S}|}$ factor. Under the hood is a new Lagrange multiplier condition together with a new drift analysis on the Lagrange multipliers, leading to low constraint violation. Our setup is challenging compared with classical constrained optimization, in particular due to (1) the unknown loss and constraint functions in the online setup, and (2) the time-varying decision sets resulting from the moving confidence-set estimation of UCB. The decision sets can potentially be much larger than, or even inconsistent with, the true decision set under the known model, resulting in a potentially large constraint violation. The main idea is to utilize a Lagrange multiplier condition as well as a confidence set of the model to construct a probabilistic bound on an online dual multiplier. We then explicitly take into account the lazy nature of the confidence-set estimation in our algorithm to argue that the bound on the dual multiplier yields the bound on the constraint violation.
2 Related Work
In this paper, we are most interested in a class of online MDP problems where the loss functions are arbitrarily changing, or adversarial. With a known transition model, adversarial losses, and full-information feedback (as opposed to bandit feedback), Even-Dar et al. (2009) achieve an $O(\sqrt{T})$ regret (with a multiplicative dependence on the mixing time $\tau$ of the MDP), and Yu et al. (2009) achieve an $O(T^{3/4+\epsilon})$ regret. These two papers consider a non-episodic setting that is different from the episodic setting considered in this paper. Zimin and Neu (2013) further study the episodic MDP and achieve an $O(\sqrt{T})$ regret.
A more challenging setting is that in which the transition model is unknown. Under this setting, several works study online episodic MDP problems with adversarial losses and full-information feedback. Neu et al. (2012) obtain an $\widetilde{O}(\sqrt{T})$ regret by proposing a Follow the Perturbed Optimistic Policy (FPOP) algorithm. The recent work of Rosenberg and Mansour (2019a) improves the regret to $\widetilde{O}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ by proposing an online upper confidence mirror descent algorithm. This regret bound nearly matches the lower bound of $\Omega(L\sqrt{|\mathcal{S}||\mathcal{A}|T})$ (Jaksch et al., 2010) up to $\sqrt{|\mathcal{S}|}$ and some logarithmic factors. Our work is along this line of research, and further considers the setting in which stochastic constraints are observed at each episode during the learning process.
In addition, a number of papers investigate online episodic MDPs with bandit feedback. Assuming the transition model is known and the losses are adversarial, Neu et al. (2010) achieve an $O(\sqrt{T}/\alpha)$ regret, where $\alpha$ is the probability that all states are reachable under all policies. Under the same setting, Neu et al. (2010) achieve an $O(T^{2/3})$ regret without the dependence on $\alpha$, and Zimin and Neu (2013) obtain an $O(\sqrt{T})$ regret. Furthermore, assuming the transition model is unknown and the losses are adversarial, Rosenberg and Mansour (2019b) obtain an $O(T^{3/4})$ regret in general, and an $O(\sqrt{T}/\alpha)$ regret when all states are reachable with probability $\alpha$ under any policy. Jin et al. (2019) further achieve an $\widetilde{O}(\sqrt{T})$ regret under the same setting of unknown transition model and adversarial losses. We remark that our algorithm can be extended to the setting of constrained episodic MDPs where both the loss and constraint functions are time-varying and only bandit feedback is received. We leave such an extension as future work.
On the other hand, instead of adversarial losses, extensive works have studied the setting where the feedbacks of the losses are stochastic with fixed expectations, e.g., Jaksch et al. (2010); Azar et al. (2017); Ouyang et al. (2017); Jin et al. (2018); Fruit et al. (2018); Wei et al. (2019a); Zhang and Ji (2019); Dong et al. (2019). Assuming that the transition model is known, Zheng and Ratliff (2020) study online CMDPs under the non-episodic setting and attain an $\widetilde{O}(T^{3/4})$ regret, which is suboptimal in terms of $T$. The concurrent work of Efroni et al. (2020a) studies episodic MDPs with unknown transitions and stochastic bandit feedback of the losses and the constraints, and obtains an $\widetilde{O}(\sqrt{T})$ regret and constraint violation.
In addition to the aforementioned papers, there is also a line of policy-search type works, focusing on solving online MDP problems via directly optimizing policies without knowing the transition model, e.g., Williams (1992); Baxter and Bartlett (2000); Konda and Tsitsiklis (2000); Kakade (2002); Schulman et al. (2015); Lillicrap et al. (2015); Schulman et al. (2017); Sutton and Barto (2018); Fazel et al. (2018); Abbasi-Yadkori et al. (2019a, b); Bhandari and Russo (2019); Cai et al. (2019); Wang et al. (2019); Liu et al. (2019); Agarwal et al. (2019); Efroni et al. (2020b). Efforts have also been made in several works (Chow et al., 2017; Achiam et al., 2017; Le et al., 2019) to investigate CMDP problems via policy-based methods, but with known transition models. In another concurrent work, assuming the transition model is unknown, Ding et al. (2020) studies the episodic CMDPs with linear structures and proposes a primal-dual type policy optimization algorithm.
3 Problem Formulation
Consider an episodic loop-free MDP with a finite state space $\mathcal{S}$ and a finite action space $\mathcal{A}$ at each state, over a finite horizon of $T$ episodes. Each episode starts with a fixed initial state $s_0$ and ends with a terminal state $s_L$. The transition model is $P : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, where $P(s'|s,a)$ gives the probability of transitioning from $s$ to $s'$ under an action $a$. This underlying transition model is assumed to be unknown. The state space is loop-free, i.e., it is divided into $L+1$ layers $\mathcal{S}_0, \mathcal{S}_1, \dots, \mathcal{S}_L$ with a singleton initial layer $\mathcal{S}_0 = \{s_0\}$ and terminal layer $\mathcal{S}_L = \{s_L\}$. Furthermore, transitions are only allowed between consecutive layers, i.e., $P(s'|s,a) > 0$ only if $s \in \mathcal{S}_\ell$ and $s' \in \mathcal{S}_{\ell+1}$ for some $\ell \in \{0, 1, \dots, L-1\}$. Note that such an assumption enforces that each path from the initial state to the terminal state has a fixed length $L$. This is not an excessively restrictive assumption, as any loop-free MDP with bounded, varying path lengths can be transformed into one with a fixed path length (see György et al. (2007) for details).
The loss function for each episode $k$ is $f_k : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $f_k(s,a)$ denotes the loss received at episode $k$ for any $s \in \mathcal{S}$ and $a \in \mathcal{A}$. We assume $f_k$ can be arbitrarily varying with potentially no fixed probability distribution. There are $I$ stochastic constraint (or budget consumption) functions $g_{k,1}, \dots, g_{k,I} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, where $g_{k,i}(s,a)$ denotes the price to pay at episode $k$ for any $(s,a) \in \mathcal{S} \times \mathcal{A}$. Each stochastic function $g_{k,i}$ at episode $k$ is sampled according to a random variable $W_k$. Then, we define $\bar{g}_i := \mathbb{E}[g_{k,i}]$, where the expectation is taken over the randomness of $W_k$. For abbreviation, we denote $\bar{g} := (\bar{g}_1, \dots, \bar{g}_I)$. In addition, the functions $f_k$ and $g_{k,i}$, $i = 1, \dots, I$, are mutually independent and independent of the Markov transitions. Both the loss functions and the budget consumption functions are revealed at the end of each episode.
Remark 3.1.
It might be tempting to consider the more general scenario in which both the losses and the constraints are arbitrarily time-varying. For such a setting, however, there exist counterexamples (Mannor et al., 2009), even in the arguably simpler constrained online learning scenario, showing that no algorithm can achieve sublinear regret and constraint violation simultaneously. Therefore, we impose extra assumptions on the problem so that obtaining sublinear regret and constraint violation is feasible, one of which is to require the constraints to be stochastic instead of arbitrarily varying.
For any episode $k$, a policy $\pi_k(a|s)$ is the conditional probability of choosing an action $a$ at any given state $s$. Let $\{(s_\ell, a_\ell)\}_{\ell=0}^{L-1}$ denote a random trajectory generated according to the transition model $P$ and the policy $\pi_k$. The corresponding expected loss is $\mathbb{E}\big[\sum_{\ell=0}^{L-1} f_k(s_\ell, a_\ell)\big]$, while the budget costs are $\mathbb{E}\big[\sum_{\ell=0}^{L-1} g_{k,i}(s_\ell, a_\ell)\big]$, $i = 1, \dots, I$, where the expectations are taken w.r.t. the randomness of the trajectories.
In this paper, we adopt the occupancy measure for our analysis. In general, an occupancy measure is a joint distribution of the tuple $(s, a, s')$ under a certain policy and transition model. In particular, under the true transition $P$, we define the set $\Delta(P)$ of occupancy measures $q$ (Altman, 1999) satisfying:
(a) .
(b) .
(c) .
We can further recover a policy $\pi$ from an occupancy measure $q$ via $\pi(a|s) = \frac{\sum_{s'} q(s, a, s')}{\sum_{a'} \sum_{s'} q(s, a', s')}$.
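As a concrete illustration of this normalization, the following is a minimal sketch (not taken from the paper) of recovering a policy from a tabular occupancy measure stored as a NumPy array with entries $q[s,a] = \sum_{s'} q(s,a,s')$; assigning a uniform policy to states with zero visitation mass is an assumption made here only so the sketch is well defined.

```python
import numpy as np

def policy_from_occupancy(q):
    """Recover pi(a|s) from a tabular occupancy measure q[s, a] = sum_{s'} q(s, a, s').

    States with zero visitation mass receive a uniform policy; this fallback is an
    illustrative convention, not something specified by the paper.
    """
    num_states, num_actions = q.shape
    pi = np.full((num_states, num_actions), 1.0 / num_actions)  # uniform fallback
    state_mass = q.sum(axis=1)                                  # sum_a q(s, a)
    visited = state_mass > 0
    pi[visited] = q[visited] / state_mass[visited, None]
    return pi
```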
We define $q^{\pi_k}_k$, $k = 1, \dots, T$, to be the occupancy measure at episode $k$ w.r.t. the true transition $P$, resulting from the policy $\pi_k$ at episode $k$. Given the definition of the occupancy measure, we can rewrite the expected loss as $\langle f_k, q^{\pi_k}_k \rangle$ and the budget costs as $\langle \bar{g}_i, q^{\pi_k}_k \rangle$, $i = 1, \dots, I$, where $\langle f, q \rangle := \sum_{s, a, s'} f(s, a)\, q(s, a, s')$. We aim to solve the following constrained optimization problem, and let $q^*$ be one solution, which is further viewed as a reference point to define the regret:
$$\min_{q \in \Delta(P)} \ \sum_{k=1}^{T} \langle f_k, q \rangle \quad \text{subject to} \quad \langle \bar{g}_i, q \rangle \le 0, \quad i = 1, \dots, I, \qquad (1)$$
where $\sum_{k=1}^{T} \langle f_k, q \rangle$ is the overall loss in $T$ episodes and the constraints are enforced on the budget cost based on the expected budget consumption functions $\bar{g}_i$. To measure the regret and the constraint violation, respectively, for solving (1) in an online setting, we define the following two metrics:
$$\mathrm{Regret}(T) := \sum_{k=1}^{T} \langle f_k, q^{\pi_k}_k \rangle - \sum_{k=1}^{T} \langle f_k, q^* \rangle, \qquad \mathrm{Violation}(T) := \Big\| \Big[ \sum_{k=1}^{T} \bar{g}\big(q^{\pi_k}_k\big) \Big]_+ \Big\|, \qquad (2)$$
where the notation $[\cdot]_+$ denotes the entry-wise application of $\max\{\cdot, 0\}$ for any vector, and $\bar{g}(q) := \big(\langle \bar{g}_1, q \rangle, \dots, \langle \bar{g}_I, q \rangle\big)^\top$. For abbreviation, we also let $g_k(q) := \big(\langle g_{k,1}, q \rangle, \dots, \langle g_{k,I}, q \rangle\big)^\top$.
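The two metrics can be evaluated directly from logged per-episode quantities. The sketch below assumes the linear forms written in (2) above (which are themselves a reconstruction): `episode_losses[k]` stands for $\langle f_k, q^{\pi_k}_k\rangle$, `best_losses[k]` for $\langle f_k, q^*\rangle$, and `episode_budgets[k]` for the vector $\bar{g}(q^{\pi_k}_k)$; collapsing the clipped cumulative budget with the Euclidean norm is also an assumption.

```python
import numpy as np

def regret_and_violation(episode_losses, best_losses, episode_budgets):
    """Compute the two metrics in (2) from logged per-episode quantities.

    episode_losses, best_losses: arrays of shape (T,)
    episode_budgets: array of shape (T, I), one row of expected budget costs per episode
    """
    regret = float(np.sum(episode_losses) - np.sum(best_losses))
    # Entry-wise [.]_+ on the accumulated budget consumption, then a Euclidean norm
    # (the choice of norm is an assumption).
    cumulative_budget = np.sum(episode_budgets, axis=0)
    violation = float(np.linalg.norm(np.maximum(cumulative_budget, 0.0)))
    return regret, violation
```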
The goal is to attain a sublinear regret bound and constraint violation on this problem w.r.t. any fixed stationary policy $\pi$, which does not change over episodes. In other words, we compare to the best policy in hindsight, whose corresponding occupancy measure $q^*$ solves problem (1). We make the following assumption on the existence of a solution to (1).
Assumption 3.2.
There exists at least one fixed policy such that the corresponding occupancy measure is feasible for the constraints in problem (1).
Then, we assume the following boundedness of the function values, for simplicity of notation and without loss of generality.
Assumption 3.3.
We assume the following quantities are bounded. For any , (1) , (2) , (3) .
When the transition model is known and Slater's condition holds (i.e., there exists a policy which satisfies all stochastic inequality constraints with a constant slackness), this stochastically constrained online linear program can be solved via methods similar to those of Wei et al. (2018) and Yu et al. (2017), with a regret bound that depends polynomially on the cardinalities of the state and action spaces, which is highly suboptimal, especially when the state or action space is large. The main challenge addressed in this paper is to solve this problem without knowing the model $P$, or the losses and constraints before making decisions, while tightening the dependency on both the state and action spaces in the resulting performance bound.
4 Proposed Algorithm
In this section, we introduce our proposed algorithm, namely, the upper confidence primal-dual (UCPD) algorithm, presented in Algorithm 1. It adopts a primal-dual mirror-descent-type scheme for solving constrained problems, but with an important difference: we maintain a confidence set via past sample trajectories, which contains the true MDP model with high probability, and choose the policy that minimizes the proximal Lagrangian using the most optimistic model from the confidence set. Such an idea, known as optimism in the face of uncertainty, is reminiscent of the upper confidence bound (UCB) algorithm (Auer et al., 2002) for the stochastic multi-armed bandit (MAB) problem and was first proposed by Jaksch et al. (2010) to obtain a near-optimal regret for reinforcement learning problems.
In the algorithm, we introduce epochs, which are back-to-back time intervals that span several episodes. We use $i$ to index the epochs and use $e(k)$ to denote the mapping from the episode index to the epoch index, indicating which epoch the $k$-th episode lies in. Next, let $N_i(s, a)$ and $M_i(s, a, s')$ be two global counters which indicate the number of times the tuples $(s, a)$ and $(s, a, s')$ appear before the $i$-th epoch. Let $n_i(s, a)$ and $m_i(s, a, s')$ be two local counters which indicate the number of times the tuples $(s, a)$ and $(s, a, s')$ appear in the $i$-th epoch. We start a new epoch whenever there exists $(s, a)$ such that $n_i(s, a) \ge \max\{1, N_i(s, a)\}$; otherwise, we set $e(k+1) = e(k)$. Such an update rule follows Jaksch et al. (2010). Then, we define the empirical transition model at any epoch $i$ as $\widehat{P}_i(s'|s, a) = M_i(s, a, s') / \max\{1, N_i(s, a)\}$.
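The counter bookkeeping above can be summarized by the following sketch; the class and method names are illustrative, and the dictionary-based storage is an implementation choice rather than the paper's pseudocode.

```python
from collections import defaultdict

class EpochStatistics:
    """UCRL2-style epoch and counter bookkeeping for the UCPD algorithm (a sketch)."""

    def __init__(self):
        self.N = defaultdict(int)   # visits to (s, a) before the current epoch
        self.M = defaultdict(int)   # visits to (s, a, s') before the current epoch
        self.n = defaultdict(int)   # visits to (s, a) inside the current epoch
        self.m = defaultdict(int)   # visits to (s, a, s') inside the current epoch

    def record(self, s, a, s_next):
        self.n[(s, a)] += 1
        self.m[(s, a, s_next)] += 1

    def should_start_new_epoch(self):
        # A new epoch starts once some local count reaches the global count,
        # i.e. n_i(s, a) >= max{1, N_i(s, a)} for some (s, a).
        return any(cnt >= max(1, self.N[sa]) for sa, cnt in self.n.items())

    def start_new_epoch(self):
        # Fold the local counters into the global ones and reset them.
        for sa, cnt in self.n.items():
            self.N[sa] += cnt
        for sas, cnt in self.m.items():
            self.M[sas] += cnt
        self.n.clear()
        self.m.clear()

    def empirical_transition(self, s, a, s_next):
        # \hat P_i(s' | s, a) = M_i(s, a, s') / max{1, N_i(s, a)}
        return self.M[(s, a, s_next)] / max(1, self.N[(s, a)])
```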
As shown in Remark 6.8, introducing the notion of ‘epoch’ is necessary to achieve an $\widetilde{\mathcal{O}}(\sqrt{T})$ constraint violation.
The next lemma shows that, with high probability, the true transition model is contained in a confidence interval around the empirical one, no matter what sequence of policies is taken; it is adapted from Lemma 1 of Neu et al. (2012).
Lemma 4.1 (Lemma 1 of Neu et al. (2012)).
For any $\delta \in (0, 1)$, we have that with probability at least $1 - \delta$, for all epochs $i$ and any state-action pair $(s, a)$,
$$\big\| \widehat{P}_i(\cdot \,|\, s, a) - P(\cdot \,|\, s, a) \big\|_1 \le \epsilon_i(s, a),$$
with the error being
$$\epsilon_i(s, a) := \sqrt{\frac{2\, |\mathcal{S}_{\ell(s)+1}| \ln\big(T |\mathcal{S}| |\mathcal{A}| / \delta\big)}{\max\{1, N_i(s, a)\}}}, \qquad (3)$$
where $\ell(\cdot)$ is a mapping from a state to the layer that the state belongs to.
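For concreteness, a confidence width matching the reconstructed form of (3) can be computed as in the sketch below; the exact constant and the argument of the logarithm are assumptions patterned after Lemma 1 of Neu et al. (2012), so they should be checked against the intended statement before use.

```python
import math

def confidence_width(next_layer_size, visits, num_episodes, num_states, num_actions, delta):
    """epsilon_i(s, a), assuming the form
    sqrt( 2 * |S_{l(s)+1}| * ln(T * |S| * |A| / delta) / max(1, N_i(s, a)) )."""
    log_term = math.log(num_episodes * num_states * num_actions / delta)
    return math.sqrt(2.0 * next_layer_size * log_term / max(1, visits))
```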
4.1 Computing Optimistic Policies
Next, we show how to compute the policy at each episode. Formally, we introduce a new occupancy measure at episode $k$, namely $q_k$. It should be emphasized that this is different from the $q^{\pi_k}_k$ defined in the previous section, as $q_k$ is chosen by the decision maker at episode $k$ to construct the policy. In particular, $q_k$ does not have to satisfy the local balance equation (c) w.r.t. the true transition $P$. Once $q_k$ is obtained (which will be detailed below), we construct the policy as follows:
$$\pi_k(a \,|\, s) = \frac{\sum_{s'} q_k(s, a, s')}{\sum_{a'} \sum_{s'} q_k(s, a', s')}. \qquad (4)$$
Next, we demonstrate the proposed method for computing $q_k$. First, we introduce an online dual multiplier $Q_k(i)$ for each constraint $i = 1, \dots, I$ in (1), which is $Q_1(i) = 0$ when $k = 1$ and is updated as follows for $k \ge 1$:
$$Q_{k+1}(i) = \max\big\{ Q_k(i) + \langle g_{k,i}, q_k \rangle,\; 0 \big\}, \quad i = 1, \dots, I. \qquad (5)$$
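The dual update (5), as reconstructed above, is a standard virtual-queue step and can be written in vector form as below; evaluating the budget consumption at the chosen occupancy measure $q_k$ is an assumption about (5) rather than a verbatim transcription.

```python
import numpy as np

def update_dual_multipliers(Q, budget_consumption):
    """One virtual-queue step: Q_{k+1}(i) = max(Q_k(i) + <g_{k,i}, q_k>, 0).

    Q: array of shape (I,) holding the current multipliers.
    budget_consumption: array of shape (I,) holding <g_{k,i}, q_k> for each constraint.
    """
    return np.maximum(Q + budget_consumption, 0.0)
```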
At each episode, we compute the new occupancy measure $q_{k+1}$ by solving an optimistic regularized linear program (ORLP) with the tuning parameters specified in Theorem 5.3. Specifically, we update $q_{k+1}$ for all $k \ge 1$ by solving
(6) |
For $k = 1$, we initialize $q_1$ as the uniform distribution over all admissible tuples $(s, a, s')$. The above updating rule introduces extra notation that will be elaborated below. Specifically, we denote by $D(q \,\|\, q')$ the unnormalized Kullback-Leibler (KL) divergence between two different occupancy measures $q$ and $q'$, which is defined as
$$D(q \,\|\, q') := \sum_{s, a, s'} q(s, a, s') \ln \frac{q(s, a, s')}{q'(s, a, s')} - \sum_{s, a, s'} \big( q(s, a, s') - q'(s, a, s') \big). \qquad (7)$$
In addition, for and , we compute via
where the mixing weight is one of the tuning parameters. This step introduces a probability mixing, pushing the update away from the boundary and encouraging exploration.
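A minimal sketch of the mixing step follows; blending with a single global uniform measure of the same total mass (rather than, say, a per-layer uniform measure) and calling the weight `gamma` are illustrative assumptions.

```python
import numpy as np

def mix_with_uniform(q, gamma):
    """Push q away from the boundary: (1 - gamma) * q + gamma * uniform.

    The uniform measure is given the same total mass as q so that the mixture
    remains a valid (unnormalized) occupancy measure.
    """
    uniform = np.full_like(q, q.sum() / q.size)
    return (1.0 - gamma) * q + gamma * uniform
```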
Furthermore, since for any epoch $i$ we can compute the empirical transition model $\widehat{P}_i$ with the confidence interval size $\epsilon_i(s, a)$ as defined in (3), we let every plausible transition model $P'$ satisfy
(8) |
such that we can define the feasible set $\Delta_k$ for the optimization problem (6) as follows
(9) |
By this definition, the feasible set $\Delta_k$ only changes when a new epoch starts. On the other hand, according to Lemma 4.1, we have that with probability at least $1 - \delta$, for all epochs, $\Delta(P) \subseteq \Delta_k$ holds. By Rosenberg and Mansour (2019a), the problem (6) is essentially a linear program with a special structure that can be solved efficiently (see details in Section A of the supplementary material).
5 Main Results
Before presenting our results, we first make an assumption on the existence of Lagrange multipliers. We define a partial average function starting from any time slot as . Then, we consider the following static optimization problem (recalling ):
(10) |
Denote the solution to this program as . Define the Lagrangian dual function of (10) as , where is a dual variable. We are ready to state our assumption:
Assumption 5.1.
For any time slot and any time period , the set of primal optimal solutions to (10) is non-empty. Furthermore, the set of Lagrange multipliers, which is , is non-empty and bounded. Any vector in this set is called a Lagrange multiplier associated with (10). Furthermore, let be a constant such that for any and , the dual optimal set defined above satisfies .
As is discussed in Section B of the supplementary material, Assumption 5.1 proposes a weaker condition than the Slater condition commonly adopted in previous constrained online learning works. The following lemma further shows the relation between Assumption 5.1 and the dual function.
Lemma 5.2.
Suppose Assumption 5.1 holds; then for any and , there exist constants such that for any satisfying (we let denote the Euclidean distance between a point and the set ), we have
Based on the above assumptions and lemmas, we present our main results on the regret and the constraint violation in Theorem 5.3: with the parameters chosen as specified there, Algorithm 1 attains, with high probability, an $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ regret and an $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ constraint violation.
6 Theoretical Analysis
In this section, we provide lemmas and proof sketches for the regret bound and constraint violation bound in Theorem 5.3. The detailed proofs for Section 6.1 and Section 6.2 are in Section C and Section D of the supplementary material respectively.
6.1 Proof of Regret Bound
Lemma 6.1.
The updating rules in Algorithm 1 ensure that with probability at least ,
Lemma 6.2.
The updating rules in Algorithm 1 ensure that with probability at least ,
Here we let . Next, we present Lemma 6.3, which is one of the key lemmas in our proof. This lemma indicates that the dual multipliers are bounded by $\widetilde{\mathcal{O}}(\sqrt{T})$ with high probability when setting the parameters as in Theorem 5.3. Thus, introducing stochastic constraints retains the regret bound. Moreover, this lemma leads to a constraint violation at the level of $\widetilde{\mathcal{O}}(\sqrt{T})$. Lemma 6.3 is proved by using Assumption 5.1 and Lemma 5.2.
Lemma 6.3.
Letting and satisfy , the updating rules in Algorithm 1 ensure that with probability at least , the following inequality holds for all ,
where we define and .
The upper bound in the above lemma actually holds for any choice of the parameter and is a convex function of it, which indicates that a tighter upper bound could be obtained by choosing the parameter as the minimizer of this upper bound. In this paper, we directly set , which suffices to give an upper bound.
Remark 6.4.
We discuss the upper bound of the term in the following way: (1) if , then this term is bounded by ; (2) if , then the term is bounded by . Thus, combining the two cases, we have
This discussion shows that the term in the result of Lemma 6.3 will not introduce extra dependency on except a term.
With the bound of in Lemma 6.3, we further obtain the following lemma.
Lemma 6.5.
Proof of the Regret Bound in Theorem 5.3.
Recall that is the probability vector chosen by the decision maker, and is the true occupancy measure at time while is the solution to the problem (1). The main idea is to decompose the regret as follows:
(11) |
where we use Assumption 3.3 such that . Thus, it suffices to bound Term (I) and Term (II).
We first show the bound for Term (I). According to Lemma 6.1, by the fact that and , we have that with probability at least , the following holds
(12) |
For Term (II), setting , , , and , by Lemma 6.2, we obtain
where we use the inequality that due to . Thus, we further need to bound the last term of the above inequality. By Lemma 6.5 and Remark 6.4, with probability at least for all , we have
by the facts that , , , and the assumption , as well as the computation of as
Therefore, with probability at least , the following holds
(13) |
Combining (12) and (13) with (11), and letting , by union bound, we eventually obtain that with probability at least , the regret bound is
where the notation $\widetilde{\mathcal{O}}(\cdot)$ absorbs the logarithmic factors. Further let (such that is guaranteed). This completes the proof. ∎
6.2 Proof of Constraint Violation Bound
Lemma 6.6.
The updating rules in Algorithm 1 ensure
Lemma 6.7.
The updating rules in Algorithm 1 ensure
Remark 6.8.
The proof of Lemma 6.7 uses the fact that the confidence set of the transition model changes only $\mathcal{O}(|\mathcal{S}||\mathcal{A}| \log T)$ times due to the doubling of the epoch lengths in Algorithm 1. Within each epoch, where the confidence set is unchanged, we further show that the change between consecutive solutions is sufficiently small. Thus, the epoch-length doubling eventually ensures that the constraint violation is at the level of $\widetilde{\mathcal{O}}(\sqrt{T})$.
Proof of the Constraint Violation Bound in Theorem 5.3.
We decompose as
(14) |
where the second inequality is due to Assumption 3.3 that . Thus, it suffices to bound Terms (III) and (IV).
For Term (III), we already have its bound as (12). Then, we focus on proving the upper bound of Term (IV). Set , , , and as in the proof of the regret bound. By Lemma 6.6, we know that to bound Term (IV) requires bounding the terms and . By Lemma 6.3, combining it with Remark 6.4 and as shown in the proof of the regret bound, letting , with probability , for all , the following inequality holds
(15) |
where we use . This gives the bound of that .
Furthermore, by Lemma 6.7, we know that the key to bounding is also the drift bound for . Therefore, by (15) and the settings of the parameters , we have
(16) |
by the facts that , , and the condition . Thus combining (15) and (16) with Lemma 6.6, and letting , then with probability at least , we have
Combining the results for Term (III) and Term (IV) with (14), by the union bound, with probability at least , the constraint violation is bounded as stated in Theorem 5.3. This finishes the proof. ∎
7 Conclusion
In this paper, we propose a new upper confidence primal-dual algorithm to solve online constrained episodic MDPs with arbitrarily varying losses and stochastically changing constraints. In particular, our algorithm does not require the transition model of the MDP and delivers $\widetilde{\mathcal{O}}(L|\mathcal{S}|\sqrt{|\mathcal{A}|T})$ upper bounds on both the regret and the constraint violation.
References
- Abbasi-Yadkori et al. (2019a) Abbasi-Yadkori, Y., Bartlett, P., Bhatia, K., Lazic, N., Szepesvari, C. and Weisz, G. (2019a). Politex: Regret bounds for policy iteration using expert prediction. In International Conference on Machine Learning.
- Abbasi-Yadkori et al. (2019b) Abbasi-Yadkori, Y., Lazic, N., Szepesvari, C. and Weisz, G. (2019b). Exploration-enhanced politex. arXiv preprint arXiv:1908.10479.
- Achiam et al. (2017) Achiam, J., Held, D., Tamar, A. and Abbeel, P. (2017). Constrained policy optimization. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org.
- Agarwal et al. (2019) Agarwal, A., Kakade, S. M., Lee, J. D. and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in markov decision processes. arXiv preprint arXiv:1908.00261.
- Altman (1999) Altman, E. (1999). Constrained Markov decision processes, vol. 7. CRC Press.
- Auer et al. (2002) Auer, P., Cesa-Bianchi, N. and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine learning, 47 235–256.
- Azar et al. (2017) Azar, M. G., Osband, I. and Munos, R. (2017). Minimax regret bounds for reinforcement learning. In Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR. org.
- Baxter and Bartlett (2000) Baxter, J. and Bartlett, P. L. (2000). Direct gradient-based reinforcement learning. In 2000 IEEE International Symposium on Circuits and Systems. Emerging Technologies for the 21st Century. Proceedings (IEEE Cat No. 00CH36353), vol. 3. IEEE.
- Bhandari and Russo (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
- Cai et al. (2019) Cai, Q., Yang, Z., Jin, C. and Wang, Z. (2019). Provably efficient exploration in policy optimization. arXiv preprint arXiv:1912.05830.
- Chow et al. (2017) Chow, Y., Ghavamzadeh, M., Janson, L. and Pavone, M. (2017). Risk-constrained reinforcement learning with percentile risk criteria. The Journal of Machine Learning Research, 18 6070–6120.
- Ding et al. (2020) Ding, D., Wei, X., Yang, Z., Wang, Z. and Jovanović, M. (2020). Provably efficient safe exploration via primal-dual policy optimization. Manuscript.
- Dong et al. (2019) Dong, K., Wang, Y., Chen, X. and Wang, L. (2019). Q-learning with ucb exploration is sample efficient for infinite-horizon mdp. arXiv preprint arXiv:1901.09311.
- Efroni et al. (2020a) Efroni, Y., Mannor, S. and Pirotta, M. (2020a). Exploration-exploitation in constrained mdps. arXiv preprint arXiv:2003.02189.
- Efroni et al. (2020b) Efroni, Y., Shani, L., Rosenberg, A. and Mannor, S. (2020b). Optimistic policy optimization with bandit feedback. arXiv preprint arXiv:2002.08243.
- Even-Dar et al. (2009) Even-Dar, E., Kakade, S. M. and Mansour, Y. (2009). Online markov decision processes. Mathematics of Operations Research, 34 726–736.
- Fazel et al. (2018) Fazel, M., Ge, R., Kakade, S. M. and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. arXiv preprint arXiv:1801.05039.
- Fox (1966) Fox, B. (1966). Markov renewal programming by linear fractional programming. SIAM Journal on Applied Mathematics, 14 1418–1432.
- Fruit et al. (2018) Fruit, R., Pirotta, M., Lazaric, A. and Ortner, R. (2018). Efficient bias-span-constrained exploration-exploitation in reinforcement learning. arXiv preprint arXiv:1802.04020.
- György et al. (2007) György, A., Linder, T., Lugosi, G. and Ottucsák, G. (2007). The on-line shortest path problem under partial monitoring. Journal of Machine Learning Research, 8 2369–2403.
- Jaksch et al. (2010) Jaksch, T., Ortner, R. and Auer, P. (2010). Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11 1563–1600.
- Jin et al. (2018) Jin, C., Allen-Zhu, Z., Bubeck, S. and Jordan, M. I. (2018). Is q-learning provably efficient? In Advances in Neural Information Processing Systems.
- Jin et al. (2019) Jin, C., Jin, T., Luo, H., Sra, S. and Yu, T. (2019). Learning adversarial mdps with bandit feedback and unknown transition. arXiv preprint arXiv:1912.01192.
- Kakade (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in neural information processing systems.
- Konda and Tsitsiklis (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in neural information processing systems.
- Le et al. (2019) Le, H. M., Voloshin, C. and Yue, Y. (2019). Batch policy learning under constraints. arXiv preprint arXiv:1903.08738.
- Lillicrap et al. (2015) Lillicrap, T. P., Hunt, J. J., Pritzel, A., Heess, N., Erez, T., Tassa, Y., Silver, D. and Wierstra, D. (2015). Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.
- Liu et al. (2019) Liu, B., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural proximal/trust region policy optimization attains globally optimal policy. arXiv preprint arXiv:1906.10306.
- Mannor et al. (2009) Mannor, S., Tsitsiklis, J. N. and Yu, J. Y. (2009). Online learning with sample path constraints. Journal of Machine Learning Research, 10 569–590.
- Nedić and Ozdaglar (2009) Nedić, A. and Ozdaglar, A. (2009). Approximate primal solutions and rate analysis for dual subgradient methods. SIAM Journal on Optimization, 19 1757–1780.
- Neely (2012) Neely, M. J. (2012). Dynamic optimization and learning for renewal systems. IEEE Transactions on Automatic Control, 58 32–46.
- Neu et al. (2010) Neu, G., György, A. and Szepesvári, C. (2010). The online loop-free stochastic shortest-path problem. In COLT, vol. 2010. Citeseer.
- Neu et al. (2012) Neu, G., Gyorgy, A. and Szepesvári, C. (2012). The adversarial stochastic shortest path problem with unknown transition probabilities. In Artificial Intelligence and Statistics.
- Ouyang et al. (2017) Ouyang, Y., Gagrani, M., Nayyar, A. and Jain, R. (2017). Learning unknown markov decision processes: A thompson sampling approach. In Advances in Neural Information Processing Systems.
- Rosenberg and Mansour (2019a) Rosenberg, A. and Mansour, Y. (2019a). Online convex optimization in adversarial markov decision processes. arXiv preprint arXiv:1905.07773.
- Rosenberg and Mansour (2019b) Rosenberg, A. and Mansour, Y. (2019b). Online stochastic shortest path with bandit feedback and unknown transition function. In Advances in Neural Information Processing Systems.
- Schulman et al. (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M. and Moritz, P. (2015). Trust region policy optimization. In International conference on machine learning.
- Schulman et al. (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A. and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Sutton and Barto (2018) Sutton, R. S. and Barto, A. G. (2018). Reinforcement learning: An introduction. MIT press.
- Urgaonkar et al. (2015) Urgaonkar, R., Wang, S., He, T., Zafer, M., Chan, K. and Leung, K. K. (2015). Dynamic service migration and workload scheduling in edge-clouds. Performance Evaluation, 91 205–228.
- Wang et al. (2019) Wang, L., Cai, Q., Yang, Z. and Wang, Z. (2019). Neural policy gradient methods: Global optimality and rates of convergence. arXiv preprint arXiv:1909.01150.
- Wang et al. (2015) Wang, S., Urgaonkar, R., Zafer, M., He, T., Chan, K. and Leung, K. K. (2015). Dynamic service migration in mobile edge-clouds. In 2015 IFIP Networking Conference (IFIP Networking). IEEE.
- Wei et al. (2019a) Wei, C.-Y., Jafarnia-Jahromi, M., Luo, H., Sharma, H. and Jain, R. (2019a). Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. arXiv preprint arXiv:1910.07072.
- Wei et al. (2018) Wei, X., Yu, H. and Neely, M. J. (2018). Online learning in weakly coupled markov decision processes: A convergence time study. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 2 12.
- Wei et al. (2019b) Wei, X., Yu, H. and Neely, M. J. (2019b). Online primal-dual mirror descent under stochastic constraints. arXiv preprint arXiv:1908.00305.
- Williams (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8 229–256.
- Yu et al. (2017) Yu, H., Neely, M. and Wei, X. (2017). Online convex optimization with stochastic constraints. In Advances in Neural Information Processing Systems.
- Yu et al. (2009) Yu, J. Y., Mannor, S. and Shimkin, N. (2009). Markov decision processes with arbitrary reward processes. Mathematics of Operations Research, 34 737–757.
- Zhang and Ji (2019) Zhang, Z. and Ji, X. (2019). Regret minimization for reinforcement learning by evaluating the optimal bias function. In Advances in Neural Information Processing Systems.
- Zheng and Ratliff (2020) Zheng, L. and Ratliff, L. J. (2020). Constrained upper confidence reinforcement learning. arXiv preprint arXiv:2001.09377.
- Zimin and Neu (2013) Zimin, A. and Neu, G. (2013). Online learning in episodic markovian decision processes by relative entropy policy search. In Advances in neural information processing systems.
Supplementary Material
Appendix A Efficient Solver for Subproblem (6)
In this section, we provide the details on how to efficiently solve the subproblem (6). We can further rewrite (6) into the following equivalent form
where we let $z_k$ denote the combined (Lagrangian) loss vector appearing in the linear term. According to Rosenberg and Mansour (2019a), solving the above problem decomposes into the following two steps:
$$q'_{k+1} = \operatorname*{argmin}_{q} \; \langle z_k, q \rangle + D\big(q \,\|\, \tilde{q}_k\big), \qquad (17)$$
$$q_{k+1} = \operatorname*{argmin}_{q \in \Delta_k} D\big(q \,\|\, q'_{k+1}\big). \qquad (18)$$
Note that the first step, i.e., (17), is an unconstrained problem, which has a closed-form solution
$$q'_{k+1}(s, a, s') = \tilde{q}_k(s, a, s')\, e^{-z_k(s, a, s')}. \qquad (19)$$
The second step, i.e., (18), can be viewed as a projection of $q'_{k+1}$ onto the feasible set $\Delta_k$. With the definition of the feasible set as in (9), further by Theorem 4.2 of Rosenberg and Mansour (2019a) and Lemma 7 of Jin et al. (2019), and plugging in $q'_{k+1}$ computed as in (19), we have
(20) |
where is a mapping from state to its associated layer index, and and are defined as follows
where and with . Specifically, the variables in (20) are obtained by solving a convex optimization with only non-negativity constraints, which is
(21) |
Therefore, by solving (21), we can eventually compute $q_{k+1}$ by (20). Since (21) is a convex optimization problem with only non-negativity constraints, it can be solved very efficiently.
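Putting the two steps together, the following sketch performs the exponentiated update of (19) followed by the projection of (18); since the projection depends on the confidence-set structure through (20)-(21), it is passed in as a user-supplied callable, and forming the combined loss as $f_k + \sum_i Q_k(i)\, g_{k,i}$ scaled by a step size `alpha` is an assumption about the definition of $z_k$.

```python
import numpy as np

def orlp_step(q_tilde, f_k, g_k, Q, alpha, project_onto_confidence_set):
    """Two-step solve of (6) under the reconstruction in this appendix.

    q_tilde: mixed occupancy measure from the previous episode (array over (s, a, s')).
    f_k:     loss array broadcast to the same shape as q_tilde.
    g_k:     stacked constraint arrays, shape (I,) + q_tilde.shape.
    Q:       current dual multipliers, shape (I,).
    project_onto_confidence_set: callable implementing the KL projection (18).
    """
    z = alpha * (f_k + np.tensordot(Q, g_k, axes=1))  # combined Lagrangian loss z_k
    q_unconstrained = q_tilde * np.exp(-z)            # closed form (19)
    return project_onto_confidence_set(q_unconstrained)
```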
Appendix B Structure of Optimization Problem Sequence
We have the following simple sufficient condition which is a direct corollary of Lemma 1 in Nedić and Ozdaglar (2009):
Lemma B.1.
In fact, it can be shown that certain constraint qualification conditions more general than the Slater condition imply the boundedness of the Lagrange multipliers (see, for example, Lemma 18 of Wei et al. (2019b)). Thus, Assumption 5.1 is weaker than the Slater condition commonly adopted in previous constrained online learning works. The motivation for such a Lagrange multiplier condition is that it is a sufficient condition for a key structural property of the dual function , namely, the error bound condition. Formally, we have the following definition:
Definition B.2 (Error Bound Condition (EBC)).
Let be a concave function over , where is closed and convex. Suppose is non-empty. The function satisfies the EBC if there exist constants such that, for any satisfying (we let denote the Euclidean distance between a point and the set ),
Note that in Definition B.2, is a closed convex set, which follows from the fact that is a concave function and thus all superlevel sets are closed and convex. The following lemma, whose proof can be found in Lemma 5 of Wei et al. (2019b), shows the relation between the Lagrange multiplier condition and the dual function:
Lemma B.3.
Fix . Suppose Assumption 5.1 holds, then for any and , the dual function satisfies the EBC with and .
This lemma is equivalent to Lemma 5.2 in the main text.
Appendix C Proofs of the Lemmas in Section 6.1
C.1 Proof of Lemma 6.1
We first provide Lemmas C.1 and C.2 below. Then, we give the proof of Lemma 6.1 based on these lemmas.
Lemma C.1 (Lemma 19 in Jaksch et al. (2010)).
For any sequence of numbers with , the following inequality holds
Lemma C.2.
Let and be the state stationary distributions associated with and respectively, and and be the corresponding transition distributions. Denote as the policy at episode . There are and . On the other hand, there are also , and . Then, we have the following inequality
where we let .
Proof of Lemma C.2.
By the definitions of , , , , and shown in Lemma C.2, we have
Thus, with the above equalities, and by triangle inequality for , we can bound the term in the following way
(22) |
Then we need to bound the last two terms of (22) respectively. For the first term of RHS in (22), we have
(23) |
since denotes the joint distribution probability of .
Next, we bound the last term of RHS in (22), which is
since . Furthermore, we can bound the last term above as
where the first equality is due to , the second equality is due to , and the third equality is by the relations and , . Further bounding the last term of the above equation gives
which eventually implies that the last term on RHS of (22) can be bounded as
(24) |
Therefore, plugging the bounds (23) and (24) into (22), we have
Recursively applying the above inequality, we obtain
which completes the proof. ∎
Now, we are in position to give the proof of Lemma 6.1.
Proof of Lemma 6.1.
We already know and , , . By Lemma C.2, one can show that
where we denote as the indicator random variable that equals with probability and otherwise. Denote for abbreviation. We can see that . Summing both sides of the above inequality over time slots, we obtain
(25) |
Next, we bound the first term on RHS of (25). Let be the system history up to -th episode. Then, by the definition of , we have
since is only associated with system randomness history up to episodes. Thus, the term is a martingale difference sequence with respect to . Furthermore, by and , there will be
Thus, by Azuma’s inequality, we obtain that with probability at least ,
According to union bound, we further have that with probability at least , the above inequality holds for all . This implies that with probability at least , the following inequality holds
(26) |
On the other hand, we adopt the same argument as the first part of the proof of Lemma 5 in Neu et al. (2012) to show the upper bound of in (25). Recall that denotes the epoch that the -th episode belongs to. By the definition of the state-action pair counter and , we have
According to Lemma C.1, we have
(27) |
Since we can rewrite
then by Lemma 4.1 and , the following holds with probability at least ,
where the first inequality is due to Lemma 4.1, the second inequality is by the definition of the counters and , and the last inequality is by (27). Thus, further bounding the last term of the above inequality yields
where the first inequality is due to Jensen’s inequality, the second inequality is due to , and the last inequality is by bounding the term . The above results imply that with probability at least ,
(28) |
By union bound, combining (25), (26) and (28), we obtain with probability at least ,
This completes the proof. ∎
C.2 Proof of Lemma 6.2
Lemma C.3 (Lemma 14 in Wei et al. (2019b)).
Let and denote the probability simplex and the set of the probability simplex excluding the boundary respectively. Assuming , and letting , then the following inequality holds
where , is a convex function, and is the unnormalized KL divergence for two distributions.
Lemma C.3 is an extension of Lemma 14 in Wei et al. (2019b), whose proof follows the one in Wei et al. (2019b). Specifically, the unnormalized KL divergence is a special case of the Bregman divergence studied in Wei et al. (2019b).
Lemma C.4.
For any and satisfying and , we let denote the vector formed by the elements for all . We also let similarly denote a vector formed by . Then, we have
where is defined as in (7).
Proof of Lemma C.4.
We prove the lemma by the following inequality
where the inequality is due to Pinsker's inequality, since and are two probability distributions such that and . This completes the proof. ∎
Lemma C.5.
For any and satisfying and , letting with , then we have
Proof of Lemma C.5.
We start our proof as follows
where the last equality is by substituting . Thus, by bounding the last term above, we further have
where the first inequality is by Jensen’s inequality and the second inequality is due to since , and the last inequality is due to Hölder’s inequality that and .
Moreover, we have
where the first inequality is due to , the second inequality is due to the monotonicity of logarithm function, and the third inequality is by as well as . This completes the proof. ∎
Now we are ready to provide the proof of Lemma 6.2.
Proof of Lemma 6.2.
First of all, by Lemma 4.1, we know that
with probability at least , for all epochs and any state and action pair . Thus, we have that for any epoch ,
holds with probability at least .
This can be easily proved in the following way: If any , then for all , and ,
Then, we obtain with probability at least ,
where the inequality is by Lemma 4.1. Therefore, we know that , which proves the above claim.
We define the event as follows:
(29) |
by which we have
Thus, for any that is a solution to problem (1), we have . If event happens, then . Now we have that the updating rule of follows as shown in (6), and also holds with probability at least . According to Lemma C.3, letting , , and , we have that with probability at least , the following holds for all episodes
(30) |
which means that once the event happens, the inequality (30) holds.
On the other hand, according to the updating rule of in (5), which is , we know that
which further leads to
Taking summation on both sides of the above inequality from to , we have
(31) |
where we let and , and the last inequality is due to
by Assumption 3.3 and the facts that and . Thus, summing up (30) and (31), and then subtracting from both sides, we have
We further need to show the lower bound of the term on LHS of the above inequality. Specifically, we have
where the first inequality uses Hölder's inequality and Lemma C.4 that with , the second inequality is due to , the third inequality is due to , and the last inequality is by finding the minimal value of a quadratic function . Thus, we can show that with probability , the following inequality holds for all ,
(32) | |||
Note that according to Lemma C.5, we have
Therefore, plugging the above inequality into (32) and rearranging the terms, we further get
Thus, by taking summation on both sides of the above inequality from to and assuming , we would obtain that with probability at least , the following inequality holds
(33) |
It is not difficult to compute that according to the initialization of by the uniform distribution. Then, by rearranging the terms and shifting the index, we rewrite (33) as
This completes the proof. ∎
C.3 Proof of Lemma 6.3
Lemma C.6 (Lemma 5 of Yu et al. (2017)).
Let be a discrete time stochastic process adapted to a filtration with and . Suppose there exists an integer , real constants , , and such that
hold for all . Then for any constant , with probability at least , we have
Now, we are in position to give the proof of Lemma 6.3.
Proof of Lemma 6.3.
The proof of this lemma is based on applying Lemma C.6 to our problem. Thus, this proof mainly focuses on showing that the variable satisfies the conditions of Lemma C.6. By the updating rule of , i.e., , we have
where the first inequality is due to triangle inequality, and the second inequality is by the fact that if . Then, by Assumption 3.3, we further have
which therefore implies
(34) |
Thus, with the above inequality, we have
(35) |
such that
(36) |
where represents the system randomness up to the -th episode and depends on according to its updating rule. Note that (35) in fact indicates that the random process is bounded by the value .
Next, we need to show that there exist and such that if . Recall the definition of the event in (29). Therefore, we have that with probability at least , the event happens, such that for all and any , the following holds
which adopts similar proof techniques to (32). Then, the above inequality further leads to the following inequality by rearranging the terms
Taking summation from to on both sides of the above inequality, the following inequality holds with probability for any and satisfying
(37) |
Particularly, in (37), the term due to the non-negativity of unnormalized KL divergence. By Lemma C.5, we can bound
For the term , by Lemma C.5, we can bound it as
Moreover, we can decompose the term in (37) as
where the last inequality is due to
as well as
by and if for the first inequality, and Assumption 3.3 for the last inequality.
Therefore, taking conditional expectation on both sides of (37) and combining the above upper bounds for certain terms in (37), we can obtain for any ,
(38) |
Thus, it remains to bound the term so as to give an upper bound of the right-hand side of (38). Given that the event happens such that , and since is any vector in the set , we can give an upper bound of (38) by bounding a term , which is by
where the inequality is due to given happens and the last equality is obtained according to the definition of the dual function in Section 5. We can bound in the following way.
According to Assumption 5.1, we assume that one dual solution is . We let be the maximum of all and be the minimum of all . Thus, when , we have
where the first inequality is due to the weak error bound in Lemma 5.2 and weak duality with being one primal solution, the second inequality is by triangle inequality, and the third inequality is by Assumption 3.3 and Assumption 5.1. On the other hand, when , we have
where the first inequality is by the definition of and Cauchy-Schwarz inequality, and the second inequality is due to weak duality and Assumption 3.3 such that
Now we can combine the two cases as follows
(39) |
The bound in (39) is derived by the following discussion:
(1) When , we have
(2) When , we have
since .
Therefore, plugging (39) into (38), we can obtain that, given that the event happens, the following holds
(40) |
where we define
We can see that if , then according to (40), there is
Due to and by Jensen’s inequality, we have
(41) |
Then we can compute the expectation according to the law of total expectation. With (35) and (41), we can obtain that
where we let .
Summarizing the above results, we know that if , then
where we let
Directly by Lemma C.6, for a certain , the following inequality holds with probability at least ,
(42) |
Further employing union bound for probabilities, we have that with probability at least , for any , the above inequality (42) holds. Note that (42) only holds when . For , when (42) holds for , combining (42) and (34), we have
(43) |
Thus, with probability at least , for any k satisfying , the inequality (43) holds. We can understand the upper bound of the term in the following way: (1) if , then this term is bounded by ; (2) if , then the term is bounded by . Thus, we have
This discussion shows that the term in (42) will not introduce extra dependency on except a term. This completes our proof. ∎
C.4 Proof of Lemma 6.5
Lemma C.7 (Lemma 9 of Yu et al. (2017)).
Let be a supermartingale adapted to a filtration with and , i.e., , . Suppose there exists a constant such that , where is a process adapted to for all . Then, for all , we have
We are in position to give the proof of Lemma 6.5.
Proof of Lemma 6.5.
Now we compute the upper bound of the term . Note that is a supermartingale, which is verified by
where and are independent variables with and . On the other hand, we know that the random process has bounded drifts:
where the first inequality is by Cauchy-Schwarz inequality, and the last inequality is by Assumption 3.3. This also implies that for an arbitrary , we have since implies according to the above inequality. Thus, by Lemma C.7, we have
(44) |
where we can see that bounding is the key to obtaining the bound of .
Next, we will show the upper bound of the term . According to the proof of Lemma 6.3, if , setting
we have that with probability at least , for a certain , the following inequality holds
Thus, combining (34) and the above inequality at , with probability at least , for a certain satisfying , the above inequality also holds. The above inequality is equivalent to
Setting and in (44), the following inequality holds with probability at least , with
which completes the proof. ∎
Appendix D Proofs of the Lemmas in Section 6.2
D.1 Proof of Lemma 6.6
Proof of Lemma 6.6.
We start our proof with the updating rule of as follows
Rearranging the terms in the above inequality further leads to
Thus, taking summation on both sides of the above inequality from to leads to
where the second inequality is due to Hölder’s inequality. Note that the right-hand side of the above inequality is no less than since . Thus, we have
where is an entry-wise application of the operation for any vector.
Defining and , we obtain
where the third inequality is due to Assumption 3.3. This completes the proof. ∎
D.2 Proof of Lemma 6.7
Lemma D.1 (Proposition 18 of Jaksch et al. (2010)).
The number of epochs up to episode with is upper bounded by
where is a mapping from a certain episode to the epoch where it lives.
We are ready to give the proof of Lemma 6.7.
Proof of Lemma 6.7.
We need to discuss the upper bound of the term in the following two cases:
(1) $e(k+1) = e(k)$, i.e., episodes $k$ and $k+1$ are in the same epoch;
(2) $e(k+1) \neq e(k)$, i.e., episodes $k$ and $k+1$ are in two different epochs.
For the first case where , according to Lemma C.3, letting , , and , with and , we have
Rearranging the terms and dropping the last term (due to ) yield
where the second inequality is by Hölder’s inequality and triangle inequality, the third inequality is by Cauchy–Schwarz inequality and Assumption 3.3, and the last inequality is due to Assumption 3.3 and the first inequality in Lemma C.5 with setting and . Note that by Lemma C.4, there is
Thus, combining the previous two inequalities, we obtain
which further leads to
Since there is
where , combining it with the last inequality, we further have
where the last inequality is due to . Rearranging the terms in the above inequality gives for with ,
Shifting the index in the above inequality, we further have for with ,
(45) |
For the second case where with , it is difficult to know whether the two solutions and are in the same feasible set since . Thus, the above derivation does not hold. Then, we give a bound for the term as follows
(46) |
However, we can observe that only happens when is a starting episode for a new epoch. The number of starting episodes for new epochs in episodes is bounded by , namely the total number of epochs in episodes. According to Lemma D.1, the total number of epochs is bounded as which only grows in the order of .
Thus, we can decompose the term in the following way
where the inequality is due to (46) and the fact that . By (45), we can further bound the last term in the above inequality as
where we relax the summation on the right-hand side to . Thus, we eventually obtain
where we use the result in Lemma D.1 to bound the number of epochs, i.e., . This completes the proof. ∎