Regret Analysis of Policy Gradient Algorithm for Infinite Horizon Average Reward Markov Decision Processes
Abstract
In this paper, we consider an infinite horizon average reward Markov Decision Process (MDP). Distinguishing itself from existing works within this context, our approach harnesses the power of a general policy gradient-based algorithm, liberating it from the constraints of assuming a linear MDP structure. We propose a policy gradient-based algorithm and show its global convergence property. We then prove that the proposed algorithm has $\tilde{\mathcal{O}}(T^{3/4})$ regret. Remarkably, this paper marks a pioneering effort by presenting the first exploration into regret-bound computation for the general parameterized policy gradient algorithm in the context of average reward scenarios.
Introduction
Reinforcement Learning (RL) describes a class of problems where a learner repeatedly interacts with an unknown environment with the intention of maximizing the cumulative sum of rewards. This model has found its application in a wide array of areas, ranging from networking to transportation to epidemic control (Geng et al. 2020; Al-Abbasi, Ghosh, and Aggarwal 2019; Ling, Mondal, and Ukkusuri 2023). RL problems are typically analyzed via three distinct setups: episodic, infinite horizon discounted reward, and infinite horizon average reward. Among these, the infinite horizon average reward setup holds particular significance in real-world applications (including those mentioned above) due to its alignment with many practical scenarios and its ability to capture essential long-term behaviors. However, scalable algorithms in this setup have not been widely studied. This paper provides an algorithm in the infinite horizon average reward setup with general parametrized policies which yields sub-linear regret guarantees. We would like to mention that this result is the first of its kind in the average reward setting.
There are two major approaches to solving an RL problem. The first one, known as the model-based approach, involves constructing an estimate of the transition probabilities of the underlying Markov Decision Process (MDP). This estimate is subsequently leveraged to derive policies (Auer, Jaksch, and Ortner 2008; Agrawal and Jia 2017; Ouyang et al. 2017; Fruit et al. 2018). It is worth noting that model-based techniques encounter a significant challenge: these algorithms demand a substantial amount of memory to house the model parameters. Consequently, their practical application is hindered when dealing with large state spaces. An alternative strategy is referred to as model-free algorithms. These methods either directly estimate the policy function or maintain an estimate of the $Q$ function, which is subsequently employed for policy generation (Mnih et al. 2015; Schulman et al. 2015; Mnih et al. 2016). The advantage of these algorithms lies in their adaptability to handle large state spaces.
In the average reward MDP, which is the setting considered in our paper, one of the key performance indicators of an algorithm is the expected regret. It has been theoretically demonstrated in (Auer, Jaksch, and Ortner 2008) that the expected regret of any algorithm for a broad class of MDPs is lower bounded by $\Omega(\sqrt{DSAT})$, where $D$ denotes the diameter of the MDP, $S$ and $A$ denote the sizes of the state and action spaces, and $T$ denotes the length of the time horizon. Many model-based algorithms, such as (Auer, Jaksch, and Ortner 2008; Agrawal and Jia 2017), achieve this bound. Unfortunately, the above algorithms are designed to be applicable solely in the tabular setup. Recently, (Wei et al. 2021) proposed a model-based algorithm for the linear MDP setup that is shown to achieve the optimal $\tilde{\mathcal{O}}(\sqrt{T})$ regret bound. On the other hand, (Wei et al. 2020) proposed a model-free $Q$-estimation-based algorithm that achieves the optimal regret in the tabular setup.
One way to extend algorithms beyond the tabular setting is via policy parameterization. Here, the policies are indexed by parameters (via, for example, neural networks), and the learning process is manifested by updating these parameters using some update rule (such as gradient descent). Such algorithms are referred to as policy gradient (PG) algorithms. Interestingly, the analysis of PG algorithms is typically restricted to the discounted reward setup. For example, (Agarwal et al. 2021) characterized the sample complexity of PG and Natural PG (NPG) with softmax and tabular parameterization. Sample complexity results for general parameterization are given by (Liu et al. 2020; Ding et al. 2020). However, the sub-linear regret analysis of a PG-based algorithm with general parameterization in the average reward setup, to the best of our knowledge, has not been studied in the literature. This paper aims to bridge this gap by addressing this crucial problem.
Challenges and Contribution
We propose a PG-based algorithm with general parameterization in the average reward setup and establish a sublinear regret bound for the proposed algorithm. In particular, within the class of ergodic MDPs, we first show that our PG-based algorithm achieves an average optimality error of $\tilde{\mathcal{O}}(T^{-1/4})$. Utilizing this convergence result, we establish that the algorithm achieves a regret bound of $\tilde{\mathcal{O}}(T^{3/4})$.
Despite the availability of sample complexity analyses of PG algorithms in the discounted reward setup, obtaining a sublinear regret bound for their average reward counterpart is quite difficult. This is because in the average reward case, the value function estimators, which are crucial to estimate the gradient, can become unbounded, unlike their discounted reward counterparts. Indeed, the sample complexity results in discounted MDPs are often associated with a factor of $\frac{1}{1-\gamma}$ (where $\gamma$ denotes the discount factor, which equals $1$ in the average reward case), indicating that a naive adaptation of these estimators will not work for the average reward setup. Also, discounted setups typically assume access to a simulator to generate unbiased value estimates. On the contrary, our paper deals with a single sample trajectory and does not assume the availability of a simulator. To obtain a good estimator of the gradient, we design an epoch-based algorithm where the length of each epoch is $H$. The algorithm estimates the value functions within a given epoch by sampling rewards of sub-trajectories of length $N$ that are at least $N$ distance apart. The separation between these sub-trajectories ensures that their reward samples are sufficiently independent. The key challenge of this paper is to bound a second-order term which is related to the variance of the difference between the estimated gradient and the true gradient. We show that by judiciously controlling the growth rate of $H$ and $N$ with $T$, it is possible to obtain a gradient estimator that has an asymptotically decreasing variance.
Related Works
As discussed in the introduction, the reinforcement learning problem has been widely studied recently for infinite horizon discounted reward cases or the episodic setting. For example, (Jin et al. 2018) proposed the model-free UCB-Q learning algorithm and showed a $\tilde{\mathcal{O}}(\sqrt{T})$ regret in the episodic setting. In the discounted reward setting, (Ding et al. 2020) achieved $\tilde{\mathcal{O}}(\epsilon^{-2})$ sample complexity for the softmax parametrization using the Natural Policy Gradient algorithm, whereas (Mondal and Aggarwal 2023; Fatkhullin et al. 2023) exhibited the same complexity for the general parameterization. However, the regret analysis and the global convergence of the average reward infinite horizon case are much less investigated.
Table 1: Comparison of related works on regret bounds for infinite horizon average reward MDPs.

| Algorithm | Regret | Ergodic | Model-free | Setting |
|---|---|---|---|---|
| UCRL2 (Auer, Jaksch, and Ortner 2008) | $\tilde{\mathcal{O}}(DS\sqrt{AT})$ | No | No | Tabular |
| PSRL (Agrawal and Jia 2017) | $\tilde{\mathcal{O}}(D\sqrt{SAT})$ | No | No | Tabular |
| OPTIMISTIC Q-LEARNING (Wei et al. 2020) | $\tilde{\mathcal{O}}(T^{2/3})$ | No | Yes | Tabular |
| MDP-OOMD (Wei et al. 2020) | $\tilde{\mathcal{O}}(\sqrt{T})$ | Yes | Yes | Tabular |
| FOPO (Wei et al. 2021)¹ | $\tilde{\mathcal{O}}(\sqrt{T})$ | No | No | Linear MDP |
| OLSVI.FH (Wei et al. 2021) | $\tilde{\mathcal{O}}(T^{3/4})$ | No | No | Linear MDP |
| MDP-EXP2 (Wei et al. 2021) | $\tilde{\mathcal{O}}(\sqrt{T})$ | Yes | No | Linear MDP |
| This paper | $\tilde{\mathcal{O}}(T^{3/4})$ | Yes | Yes | General parametrization |
| Lower bound (Auer, Jaksch, and Ortner 2008) | $\Omega(\sqrt{DSAT})$ | N/A | N/A | N/A |

¹ FOPO is computationally inefficient.
For infinite horizon average reward MDPs, (Auer, Jaksch, and Ortner 2008) proposed a model-based Upper Confidence Reinforcement Learning (UCRL2) algorithm and established that it obeys a $\tilde{\mathcal{O}}(DS\sqrt{AT})$ regret bound. (Agrawal and Jia 2017) proposed posterior sampling-based approaches for average reward MDPs. (Wei et al. 2020) proposed the optimistic Q-learning algorithm, which connects the discounted reward and average reward settings to show $\tilde{\mathcal{O}}(T^{2/3})$ regret in the weakly communicating average reward case, and another online mirror descent algorithm, which achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret in the ergodic setting. For the linear MDP setting, (Wei et al. 2021) proposed three algorithms, including the MDP-EXP2 algorithm which achieves $\tilde{\mathcal{O}}(\sqrt{T})$ regret under the ergodicity assumption. These works are summarized in Table 1. We note that the assumption of a weakly communicating MDP is the minimal assumption needed to obtain sub-linear regret results. However, it is much more challenging to work with this assumption in the general parametrized setting for the following reasons. Firstly, there is no guarantee that the state distribution will converge to the stationary distribution exponentially fast, which is the property required to show that the value functions are bounded by the mixing time. Secondly, it is unclear how to obtain an asymptotically unbiased estimate of the policy gradient. Thus, we assume ergodic MDPs in this work, following other works in the literature (Pesquerel and Maillard 2022; Gong and Wang 2020). MDPs with constraints have also been recently studied in the model-based (Agarwal, Bai, and Aggarwal 2022b, a), model-free tabular (Wei, Liu, and Ying 2022; Chen, Jain, and Luo 2022), and linear MDP setups (Ghosh, Zhou, and Shroff 2022).
However, all of the above algorithms are designed for the tabular setting or under the linear MDP assumption, and none of them uses a PG algorithm with general parametrization. In this paper, we propose a PG algorithm for ergodic MDPs with general parametrization and analyze its regret. Our algorithm can, therefore, be applied beyond the tabular or linear MDP settings.
Formulation
In this paper, we consider an infinite horizon reinforcement learning problem with an average reward criterion, which is modeled by a Markov Decision Process (MDP) written as a tuple $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P, \rho)$ where $\mathcal{S}$ is the state space, $\mathcal{A}$ is the action space of size $A$, $r : \mathcal{S}\times\mathcal{A} \to [0,1]$ is the reward function, $P : \mathcal{S}\times\mathcal{A} \to \Delta^{|\mathcal{S}|}$ is the state transition function, where $\Delta^{|\mathcal{S}|}$ denotes the probability simplex with dimension $|\mathcal{S}|$, and $\rho \in \Delta^{|\mathcal{S}|}$ is the initial distribution of states. A policy $\pi : \mathcal{S} \to \Delta^{|\mathcal{A}|}$ decides the distribution of the action to be taken given the current state. For a given policy $\pi$, we define the long-term average reward as follows.
$J^{\pi}_{\rho} \triangleq \lim_{T\to\infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=0}^{T-1} r(s_t, a_t) \,\middle|\, s_0 \sim \rho\right]$  (1)
where the expectation is taken over all state-action trajectories that are generated by following the action execution process $a_t \sim \pi(\cdot|s_t)$ and the state transition rule $s_{t+1} \sim P(\cdot|s_t, a_t)$. To simplify notations, we shall drop the dependence on $\rho$ whenever there is no confusion. We consider a parametrized class of policies $\Pi = \{\pi_\theta : \theta \in \Theta\}$, each element of which is indexed by a $\mathrm{d}$-dimensional parameter $\theta \in \Theta$, where $\Theta \subseteq \mathbb{R}^{\mathrm{d}}$. Our goal is to solve the following optimization problem.
$\max_{\theta \in \Theta} \; J^{\pi_\theta}_{\rho}$  (2)
A policy $\pi_\theta$ induces a transition function $P^{\pi_\theta}$ as $P^{\pi_\theta}(s'|s) \triangleq \sum_{a\in\mathcal{A}} \pi_\theta(a|s) P(s'|s,a)$, $\forall s, s' \in \mathcal{S}$. If the MDP $\mathcal{M}$ is such that for every policy $\pi_\theta$, $\theta \in \Theta$, the induced function $P^{\pi_\theta}$ is irreducible and aperiodic, then $\mathcal{M}$ is called ergodic.
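As a concrete illustration, the short sketch below (Python; the array shapes and function names are our own assumptions, not part of the paper) forms the induced transition matrix $P^{\pi_\theta}$ from a tabular $P$ and $\pi_\theta$ and runs a crude numerical primitivity test for irreducibility and aperiodicity.

```python
import numpy as np

def induced_transition_matrix(P, pi):
    """P: (S, A, S) transition tensor, pi: (S, A) policy matrix.
    Returns P_pi with P_pi[s, s'] = sum_a pi[s, a] * P[s, a, s']."""
    return np.einsum('sa,sap->sp', pi, P)

def is_ergodic(P_pi, tol=1e-12):
    """Crude check: the chain is irreducible and aperiodic iff some power
    of P_pi has all entries strictly positive (primitive matrix)."""
    S = P_pi.shape[0]
    M = np.eye(S)
    for _ in range(S * S):  # S^2 powers suffice for a primitivity test
        M = M @ P_pi
        if np.all(M > tol):
            return True
    return False
```

A chain passing this test is ergodic, so the stationary distribution and mixing time introduced below are well defined.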
Assumption 1.
The MDP is ergodic.
Ergodicity is commonly assumed in the analysis of MDPs (Pesquerel and Maillard 2022; Gong and Wang 2020). It is well known that if $\mathcal{M}$ is ergodic, then, $\forall \theta \in \Theta$, there exists a unique stationary distribution, $d^{\pi_\theta} \in \Delta^{|\mathcal{S}|}$, defined as,
$d^{\pi_\theta}(s) \triangleq \lim_{T\to\infty} \frac{1}{T} \sum_{t=0}^{T-1} \Pr\left(s_t = s \mid s_0 \sim \rho, \pi_\theta\right)$  (3)
Note that under the assumption of ergodicity, $d^{\pi_\theta}$ is independent of the initial distribution $\rho$ and satisfies $(d^{\pi_\theta})^{\top} P^{\pi_\theta} = (d^{\pi_\theta})^{\top}$. In this case, we can write the average reward as follows.
$J^{\pi_\theta} = \mathbb{E}_{s\sim d^{\pi_\theta},\, a\sim\pi_\theta(\cdot|s)}\left[r(s,a)\right]$  (4)
Hence, the average reward is also independent of the initial distribution $\rho$. Furthermore, $\forall \theta \in \Theta$, there exist functions $Q^{\pi_\theta} : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ and $V^{\pi_\theta} : \mathcal{S} \to \mathbb{R}$ such that the following Bellman equation is satisfied $\forall (s,a) \in \mathcal{S}\times\mathcal{A}$.
$Q^{\pi_\theta}(s,a) = r(s,a) - J^{\pi_\theta} + \mathbb{E}_{s'\sim P(\cdot|s,a)}\left[V^{\pi_\theta}(s')\right]$  (5)
where the state value function, $V^{\pi_\theta}$, is defined as,
$V^{\pi_\theta}(s) = \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[Q^{\pi_\theta}(s,a)\right]$  (6)
Note that if (5) is satisfied by $(Q^{\pi_\theta}, V^{\pi_\theta})$, then it is also satisfied by $(Q^{\pi_\theta}+c, V^{\pi_\theta}+c)$ for any arbitrary constant $c$. To define these functions uniquely, we assume that $\sum_{s\in\mathcal{S}} d^{\pi_\theta}(s) V^{\pi_\theta}(s) = 0$. In this case, $V^{\pi_\theta}$ can be written as follows.
$V^{\pi_\theta}(s) = \sum_{t=0}^{\infty} \mathbb{E}_{\pi_\theta}\left[ r(s_t, a_t) - J^{\pi_\theta} \,\middle|\, s_0 = s \right]$  (7)
where $\mathbb{E}_{\pi_\theta}$ denotes expectation over all trajectories induced by the policy $\pi_\theta$. Similarly, $Q^{\pi_\theta}$ can be uniquely written as,
$Q^{\pi_\theta}(s,a) = \sum_{t=0}^{\infty} \mathbb{E}_{\pi_\theta}\left[ r(s_t, a_t) - J^{\pi_\theta} \,\middle|\, s_0 = s, a_0 = a \right]$  (8)
Additionally, we define the advantage function $A^{\pi_\theta} : \mathcal{S}\times\mathcal{A} \to \mathbb{R}$ such that, $\forall (s,a) \in \mathcal{S}\times\mathcal{A}$,
$A^{\pi_\theta}(s,a) \triangleq Q^{\pi_\theta}(s,a) - V^{\pi_\theta}(s)$  (9)
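When the model is known (e.g., for unit tests on small MDPs), the quantities in (3)-(9) can be computed exactly by solving linear systems. The sketch below is a minimal illustration under that assumption, using the normalization $\sum_s d^{\pi_\theta}(s)V^{\pi_\theta}(s)=0$ adopted above; all names and array layouts are hypothetical.

```python
import numpy as np

def stationary_distribution(P_pi):
    """Solve d^T P_pi = d^T with sum(d) = 1 for an ergodic chain."""
    S = P_pi.shape[0]
    A_mat = np.vstack([P_pi.T - np.eye(S), np.ones((1, S))])
    b = np.zeros(S + 1); b[-1] = 1.0
    d, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    return d

def average_reward_quantities(P, R, pi):
    """P: (S, A, S), R: (S, A) rewards, pi: (S, A).
    Returns (J, V, Q, Adv) under the convention sum_s d(s) V(s) = 0."""
    P_pi = np.einsum('sa,sap->sp', pi, P)       # induced chain
    r_pi = np.einsum('sa,sa->s', pi, R)          # expected reward per state
    d = stationary_distribution(P_pi)
    J = float(d @ r_pi)                          # Eq. (4)
    # Bellman equation (5)-(6): (I - P_pi) V = r_pi - J, with d^T V = 0.
    S = P_pi.shape[0]
    A_mat = np.vstack([np.eye(S) - P_pi, d[None, :]])
    b = np.concatenate([r_pi - J, [0.0]])
    V, *_ = np.linalg.lstsq(A_mat, b, rcond=None)
    Q = R - J + np.einsum('sap,p->sa', P, V)     # Eq. (5)
    Adv = Q - V[:, None]                         # Eq. (9)
    return J, V, Q, Adv
```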
Ergodicity also implies the existence of a finite mixing time. In particular, if $\mathcal{M}$ is ergodic, then the mixing time is defined as follows.
Definition 1.
The mixing time of an MDP $\mathcal{M}$ with respect to a policy parameter $\theta$ is defined as,
$t_{\mathrm{mix}}^{\theta} \triangleq \min\left\{ t \geq 1 \,\middle|\, \left\| (P^{\pi_\theta})^{t}(\cdot\,|\,s) - d^{\pi_\theta} \right\|_{\mathrm{TV}} \leq \frac{1}{4},\; \forall s \in \mathcal{S} \right\}$  (10)
We also define $t_{\mathrm{mix}} \triangleq \sup_{\theta\in\Theta} t_{\mathrm{mix}}^{\theta}$ as the overall mixing time, where $\|\cdot\|_{\mathrm{TV}}$ in (10) denotes the total variation distance. In this paper, $t_{\mathrm{mix}}$ is finite due to ergodicity.
The mixing time measures how fast the MDP approaches its stationary distribution when the same policy is executed repeatedly. We also define the hitting time as follows.
Definition 2.
The hitting time of an MDP $\mathcal{M}$ with respect to a policy parameter $\theta$ is defined as,
$t_{\mathrm{hit}}^{\theta} \triangleq \max_{s\in\mathcal{S}} \frac{1}{d^{\pi_\theta}(s)}$  (11)
We also define $t_{\mathrm{hit}} \triangleq \sup_{\theta\in\Theta} t_{\mathrm{hit}}^{\theta}$ as the overall hitting time. In this paper, $t_{\mathrm{hit}}$ is finite due to ergodicity.
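The sketch below illustrates Definitions 1 and 2 for a known induced chain; the $1/4$ threshold and the total variation distance follow Definition 1, while treating $P^{\pi_\theta}$ as a dense matrix is purely an assumption for illustration.

```python
import numpy as np

def mixing_time(P_pi, d, eps=0.25, t_max=10_000):
    """Smallest t with max_s || P_pi^t(. | s) - d ||_TV <= eps (Definition 1)."""
    M = P_pi.copy()
    for t in range(1, t_max + 1):
        tv = 0.5 * np.abs(M - d[None, :]).sum(axis=1).max()
        if tv <= eps:
            return t
        M = M @ P_pi
    return None  # did not mix within t_max steps

def hitting_time(d):
    """t_hit = max_s 1 / d(s) (Definition 2)."""
    return 1.0 / d.min()
```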
Let $J^* \triangleq \max_{\theta\in\Theta} J^{\pi_\theta}$. For a given MDP and a time horizon $T$, the regret of an algorithm $\mathbb{A}$ is defined as follows.
$\mathrm{Reg}_T(\mathbb{A}) \triangleq \sum_{t=0}^{T-1}\left( J^* - r(s_t, a_t) \right)$  (12)
where the action $a_t$ is chosen by following the algorithm $\mathbb{A}$, based on the trajectory up to time $t$, and the state $s_{t+1}$ is obtained by following the state transition function $P(\cdot|s_t, a_t)$. Wherever there is no confusion, we shall simplify the notation of regret to $\mathrm{Reg}_T$. The goal of maximizing the average reward can be accomplished by designing an algorithm that minimizes the regret.
Algorithm
In this section, we discuss a policy-gradient-based algorithm in the average reward RL setting. For simplicity, we assume that the set of all policy parameters is $\Theta = \mathbb{R}^{\mathrm{d}}$. The standard policy gradient algorithm iterates the policy parameter as follows, starting with an initial guess $\theta_1$.
$\theta_{k+1} = \theta_k + \alpha \nabla_\theta J^{\pi_{\theta_k}}$  (13)
where $\alpha > 0$ is the learning rate. The following result is well-known in the literature (Sutton et al. 1999).
Lemma 1.
The gradient of the long-term average reward can be expressed as follows.
$\nabla_\theta J^{\pi_\theta} = \mathbb{E}_{s\sim d^{\pi_\theta},\, a\sim\pi_\theta(\cdot|s)}\left[ A^{\pi_\theta}(s,a)\, \nabla_\theta \log \pi_\theta(a|s) \right]$  (14)
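As a sanity check for implementations, when the model is known, the expectation in (14) can be evaluated exactly on a small MDP. The sketch below does this for a tabular softmax policy (an illustrative parameterization, not the one assumed by the paper), reusing the hypothetical helpers `average_reward_quantities` and `stationary_distribution` from the earlier sketch.

```python
import numpy as np
# assumes average_reward_quantities / stationary_distribution from the earlier sketch

def softmax_policy(theta):
    """theta: (S, A) logits -> pi: (S, A) with rows summing to one."""
    z = theta - theta.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def exact_policy_gradient(P, R, theta):
    """Evaluate (14) exactly for a tabular softmax policy on a small MDP."""
    pi = softmax_policy(theta)
    J, V, Q, Adv = average_reward_quantities(P, R, pi)
    d = stationary_distribution(np.einsum('sa,sap->sp', pi, P))
    S, A = pi.shape
    grad = np.zeros_like(theta)
    for s in range(S):
        for a in range(A):
            # grad_theta log pi(a|s) for softmax: indicator minus pi on row s
            score = np.zeros_like(theta)
            score[s, a] += 1.0
            score[s, :] -= pi[s, :]
            grad += d[s] * pi[s, a] * Adv[s, a] * score
    return grad
```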
Typically, we have access neither to the state transition function $P$, which is required to compute the above expectation, nor to the functions $Q^{\pi_\theta}$, $V^{\pi_\theta}$. In the absence of this knowledge, computing the gradient is therefore difficult. In the subsequent discussion, we shall demonstrate how the gradient can be estimated using sampled trajectories. Our policy gradient-based algorithm is described in Algorithm 1.
The algorithm proceeds in multiple epochs, with the length of each epoch being $H$. Observe that the algorithm is assumed to be aware of the horizon $T$. This assumption can be easily relaxed by invoking the well-known doubling trick (Lattimore and Szepesvári 2020). We also assume that the values of $t_{\mathrm{mix}}$ and $t_{\mathrm{hit}}$ are known to the algorithm. Similar presumptions have been used in the previous literature (Wei et al. 2020). In the $k$th epoch, the algorithm generates a trajectory of length $H$, denoted as $\mathcal{T}_k$, by following the policy $\pi_{\theta_k}$. We utilise the policy parameter $\theta_k$ and the trajectory $\mathcal{T}_k$ in Algorithm 2 to compute the estimates $\hat{Q}^{\pi_{\theta_k}}(s,a)$ and $\hat{V}^{\pi_{\theta_k}}(s)$ for a given state-action pair $(s,a)$. The algorithm searches the trajectory to locate disjoint sub-trajectories of length $N$ that start with the given state $s$ and are at least $N$ distance apart. Let $i(s)$ be the number of such sub-trajectories and let the sum of rewards in the $i$th such sub-trajectory be $y_i$. Then $\hat{V}^{\pi_{\theta_k}}(s)$ is computed as,
$\hat{V}^{\pi_{\theta_k}}(s) = \frac{1}{i(s)} \sum_{i=1}^{i(s)} y_i$  (15)
The sub-trajectories are kept at least $N$ distance apart to ensure that the samples are fairly independent. The estimate $\hat{Q}^{\pi_{\theta_k}}(s,a)$, on the other hand, is given as,
$\hat{Q}^{\pi_{\theta_k}}(s,a) = \frac{1}{i(s)} \sum_{i=1}^{i(s)} \frac{\mathbf{1}(a_{\tau_i} = a)}{\pi_{\theta_k}(a|s)}\, y_i$  (16)
where $\tau_i$ is the starting time of the $i$th chosen sub-trajectory and $\mathbf{1}(\cdot)$ denotes the indicator function. Finally, the advantage value is estimated as,
$\hat{A}^{\pi_{\theta_k}}(s,a) = \hat{Q}^{\pi_{\theta_k}}(s,a) - \hat{V}^{\pi_{\theta_k}}(s)$  (17)
This allows us to compute an estimate of the policy gradient as follows.
$\omega_k \triangleq \frac{1}{H} \sum_{t=t_k}^{t_k+H-1} \hat{A}^{\pi_{\theta_k}}(s_t, a_t)\, \nabla_\theta \log \pi_{\theta_k}(a_t|s_t)$  (18)
where $t_k$ is the starting time of the $k$th epoch. The policy parameters are updated via (20). In the following lemma, we show that $\hat{A}^{\pi_{\theta_k}}$ is a good estimator of the true advantage function $A^{\pi_{\theta_k}}$.
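A minimal sketch of the estimation procedure described above is given below. The helper names (`find_subtrajectories`, `estimate_advantage`, `policy_grad_epoch`) are our own, and the importance-weighted form of $\hat{Q}$ mirrors (15)-(18) as reconstructed here; treat the details as assumptions rather than a verbatim transcription of Algorithms 1 and 2.

```python
import numpy as np

def find_subtrajectories(states, s, N):
    """Starting indices of disjoint length-N sub-trajectories beginning at state s,
    with consecutive starts at least 2N apart (an N-step gap after each block)."""
    starts, t, H = [], 0, len(states)
    while t + N <= H:
        if states[t] == s:
            starts.append(t)
            t += 2 * N          # skip the sub-trajectory plus an N-step gap
        else:
            t += 1
    return starts

def estimate_advantage(states, actions, rewards, pi, s, a, N):
    """Sub-trajectory based estimates of V(s), Q(s,a), A(s,a) -- cf. (15)-(17)."""
    starts = find_subtrajectories(states, s, N)
    if not starts:
        return 0.0
    y = np.array([np.sum(rewards[t:t + N]) for t in starts])
    V_hat = y.mean()
    w = np.array([(actions[t] == a) / pi[s, a] for t in starts])
    Q_hat = (w * y).mean()
    return Q_hat - V_hat

def policy_grad_epoch(states, actions, rewards, theta, pi, score_fn, N):
    """Gradient estimate omega_k of (18) from one epoch of length H."""
    H = len(actions)
    omega = np.zeros_like(theta)
    for t in range(H):
        s, a = states[t], actions[t]
        A_hat = estimate_advantage(states, actions, rewards, pi, s, a, N)
        omega += A_hat * score_fn(theta, s, a)   # grad log pi_theta(a|s)
    return omega / H

# One epoch of the outer loop would then read (alpha is the learning rate):
#   omega_k = policy_grad_epoch(states, actions, rewards, theta, pi, score_fn, N)
#   theta = theta + alpha * omega_k
```

Keeping consecutive sub-trajectory starts at least $2N$ apart enforces the $N$-step gap that makes the reward sums nearly independent.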
Lemma 2.
The following inequalities hold $\forall (s,a) \in \mathcal{S}\times\mathcal{A}$, $\forall k$, and for sufficiently large $T$.
(19) |
Lemma 2 establishes that the error of our proposed advantage estimator can be bounded above as shown in (19). As we shall see later, this result can be used to bound the estimation error of the gradient. It is worthwhile to point out that $\hat{Q}^{\pi_{\theta_k}}$ and $\hat{V}^{\pi_{\theta_k}}$, defined in (16) and (15), respectively, may not themselves be good estimators of their target quantities, although their difference is one. We would also like to mention that our Algorithm 2 is inspired by Algorithm 2 of (Wei et al. 2020). The main difference is that we choose the episode length to be larger than the one used in (Wei et al. 2020) by an additional factor; this extra factor makes the estimation error a decreasing function of $T$.
Global Convergence Analysis
In this section, we show that our proposed Algorithm 1 converges globally. This essentially means that the parameters $\{\theta_k\}$ are updated such that the sequence of average rewards $\{J^{\pi_{\theta_k}}\}$, in a certain sense, approaches the optimal average reward $J^*$. Such convergence will later be useful in bounding the regret of our algorithm. Before delving into the analysis, we would like to first point out a few assumptions that are needed to establish the results.
Assumption 2.
The log-likelihood function $\log \pi_\theta(a|s)$ is $G$-Lipschitz and $B$-smooth. Formally, $\forall \theta, \theta_1, \theta_2 \in \Theta$ and $\forall (s,a) \in \mathcal{S}\times\mathcal{A}$,
$\left\| \nabla_\theta \log \pi_\theta(a|s) \right\| \leq G, \qquad \left\| \nabla_\theta \log \pi_{\theta_1}(a|s) - \nabla_\theta \log \pi_{\theta_2}(a|s) \right\| \leq B \left\| \theta_1 - \theta_2 \right\|$  (21)
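As an illustration (not a construction used in the paper), a log-linear softmax policy with bounded features admits explicit constants of this kind; the sketch below computes its score function and numerically spot-checks the bound $G \le 2\max_a\|\phi(s,a)\|$.

```python
import numpy as np

def log_linear_policy(theta, phi_s):
    """phi_s: (A, d) features of all actions at state s; returns pi(.|s)."""
    z = phi_s @ theta
    z -= z.max()
    e = np.exp(z)
    return e / e.sum()

def score(theta, phi_s, a):
    """grad_theta log pi_theta(a|s) = phi(s,a) - E_{a'~pi}[phi(s,a')]."""
    pi_s = log_linear_policy(theta, phi_s)
    return phi_s[a] - pi_s @ phi_s

# Crude numerical check of the Lipschitz constant G <= 2 * max_a ||phi(s,a)||:
rng = np.random.default_rng(0)
phi_s = rng.normal(size=(4, 8))              # 4 actions, d = 8 (toy numbers)
G = 2 * np.linalg.norm(phi_s, axis=1).max()
for _ in range(100):
    theta = rng.normal(size=8)
    a = rng.integers(4)
    assert np.linalg.norm(score(theta, phi_s, a)) <= G + 1e-9
```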
Remark 1.
One can immediately see that by combining Assumption 2 with Lemma 2 and using the definition of the gradient estimator as given in (18), we arrive at the following important result.
Lemma 3.
The following relation holds for every epoch index $k$ and sufficiently large $T$.
(22) |
Lemma 3 claims that the error in estimating the gradient can be bounded above as shown in (22). This result will be used in proving the global convergence of our algorithm.
Assumption 3.
Define the transferred function approximation error
$L_{d^{\pi^*},\pi^*}(\omega^*_\theta, \theta) \triangleq \mathbb{E}_{s\sim d^{\pi^*}}\, \mathbb{E}_{a\sim\pi^*(\cdot|s)}\left[ \left( \nabla_\theta \log \pi_\theta(a|s) \cdot \omega^*_\theta - A^{\pi_\theta}(s,a) \right)^2 \right]$  (23)
where $\pi^*$ is the optimal policy and $\omega^*_\theta$ is given as,
$\omega^*_\theta \triangleq \arg\min_{\omega\in\mathbb{R}^{\mathrm{d}}} \mathbb{E}_{s\sim d^{\pi_\theta}}\, \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[ \left( \nabla_\theta \log \pi_\theta(a|s) \cdot \omega - A^{\pi_\theta}(s,a) \right)^2 \right]$  (24)
We assume that the error satisfies $L_{d^{\pi^*},\pi^*}(\omega^*_\theta, \theta) \leq \epsilon_{\mathrm{bias}}$ for any $\theta \in \Theta$, where $\epsilon_{\mathrm{bias}}$ is a positive constant.
Remark 2.
The transferred function approximation error, defined by (23) and (24), quantifies the expressivity of the policy class in consideration. It has been shown that the softmax parameterization (Agarwal et al. 2021) or a linear MDP structure (Jin et al. 2020) admits $\epsilon_{\mathrm{bias}} = 0$. When parameterized by a restricted policy class that does not contain all the policies, $\epsilon_{\mathrm{bias}}$ turns out to be strictly positive. However, for a rich neural network parameterization, $\epsilon_{\mathrm{bias}}$ is small (Wang et al. 2019). A similar assumption has been adopted in (Liu et al. 2020) and (Agarwal et al. 2021).
Remark 3.
It is to be mentioned that $\omega^*_\theta$ defined in (24) can be alternatively written as,
$\omega^*_\theta = F(\theta)^{\dagger}\, \mathbb{E}_{s\sim d^{\pi_\theta}}\, \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[ \nabla_\theta \log \pi_\theta(a|s)\, A^{\pi_\theta}(s,a) \right],$
where $\dagger$ symbolizes the Moore-Penrose pseudoinverse operation and $F(\theta)$ is the Fisher information matrix as defined below.
$F(\theta) \triangleq \mathbb{E}_{s\sim d^{\pi_\theta}}\, \mathbb{E}_{a\sim\pi_\theta(\cdot|s)}\left[ \nabla_\theta \log \pi_\theta(a|s)\, \nabla_\theta \log \pi_\theta(a|s)^{\top} \right]$  (25)
Assumption 4.
There exists a constant $\mu_F > 0$ such that $F(\theta) - \mu_F I_{\mathrm{d}}$ is positive semidefinite for every $\theta \in \Theta$, where $I_{\mathrm{d}}$ denotes the $\mathrm{d}\times\mathrm{d}$ identity matrix.
Assumption 4 is also commonly used in the policy gradient analysis (Liu et al. 2020). This is satisfied by the Gaussian policy with a linearly parameterized mean.
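A Gaussian policy with a linearly parameterized mean can be written as $\pi_\theta(\cdot|s) = \mathcal{N}(\theta^\top \phi(s), \sigma^2)$. The sketch below (illustrative names and shapes; the samples are assumed to be drawn from $d^{\pi_\theta}$ and $\pi_\theta$) estimates the Fisher matrix of (25), checks the eigenvalue condition of Assumption 4 for a candidate $\mu_F$, and forms the pseudoinverse-based direction of Remark 3.

```python
import numpy as np

def gaussian_score(theta, phi_s, a, sigma=1.0):
    """grad_theta log N(a; theta^T phi(s), sigma^2) = (a - theta @ phi_s) * phi_s / sigma^2."""
    return (a - theta @ phi_s) * phi_s / sigma**2

def estimate_fisher(theta, states_phi, actions, sigma=1.0):
    """Sample average of score outer products, cf. (25)."""
    d = theta.shape[0]
    F = np.zeros((d, d))
    for phi_s, a in zip(states_phi, actions):
        g = gaussian_score(theta, phi_s, a, sigma)
        F += np.outer(g, g)
    return F / len(actions)

def check_assumption_4(F, mu_F):
    """Assumption 4: F - mu_F * I is positive semidefinite."""
    return np.linalg.eigvalsh(F).min() >= mu_F

def npg_direction(F, grad_J):
    """omega* = F^dagger grad J (Remark 3)."""
    return np.linalg.pinv(F) @ grad_J
```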
In the discounted reward setup, one key result is the performance difference lemma. In the average reward setting, it takes the following form.
Lemma 4.
The difference between the performance of any two policies $\pi$ and $\pi'$ can be expressed as follows.
$J^{\pi} - J^{\pi'} = \mathbb{E}_{s\sim d^{\pi}}\, \mathbb{E}_{a\sim\pi(\cdot|s)}\left[ A^{\pi'}(s,a) \right]$  (26)
Using Lemma 4, we present a general framework for the convergence analysis of the policy gradient algorithm in the average reward case, as stated below. This is inspired by the convergence analysis of (Liu et al. 2020) for discounted reward MDPs.
Lemma 5.
Lemma 5 bounds the optimality error of any gradient ascent algorithm as a function of intermediate gradient norms. Note the presence of $\sqrt{\epsilon_{\mathrm{bias}}}$ in the upper bound. Clearly, for a severely restricted policy class where $\epsilon_{\mathrm{bias}}$ is significant, the optimality bound becomes poor. Consider the expectation of the second term in (28). Note that,
(29) |
where uses Assumption 4. The expectation of the third term in (28) can be bounded as follows.
(30) |
In both (29) and (30), the terms related to the gradient estimation error are bounded by Lemma 3. To bound the remaining term, we use the following lemma.
Lemma 6.
Let the objective function $J^{\pi_\theta}$ be $L$-smooth in $\theta$. Then the following inequality holds.
Using Lemma 3, we obtain the following inequality.
(31) |
Similarly, using the above results, we deduce the following.
(33) |
Theorem 1.
Theorem 1 dictates that the sequence $\{J^{\pi_{\theta_k}}\}$ generated by Algorithm 1 converges to a neighborhood of $J^*$ (of size governed by $\sqrt{\epsilon_{\mathrm{bias}}}$) at a rate of $\tilde{\mathcal{O}}(T^{-1/4})$. Alternatively, one can say that in order to achieve an optimality error of $\epsilon$, it is sufficient to choose $T = \tilde{\mathcal{O}}(\epsilon^{-4})$. It matches the state-of-the-art sample complexity bound of the policy gradient algorithm with general parameterization in the discounted reward setup (Liu et al. 2020).
Regret Analysis
In this section, we demonstrate how the convergence analysis in the previous section can be used to bound the expected regret of our proposed algorithm. Note that the regret can be decomposed as follows.
$\mathrm{Reg}_T = \sum_{k=1}^{K}\sum_{t=t_k}^{t_k+H-1}\left( J^* - J^{\pi_{\theta_k}} \right) + \sum_{k=1}^{K}\sum_{t=t_k}^{t_k+H-1}\left( J^{\pi_{\theta_k}} - r(s_t, a_t) \right)$  (35)
where $K = T/H$ denotes the number of epochs. Note that the expectation of the first term in (35) can be bounded using Theorem 1. The expectation of the second term can be expressed as follows,
(36) |
where (a) follows from the Bellman equation and utilises facts detailed in the appendix. The first term in (36) can be bounded using Lemma 7 (stated below). Moreover, the remaining term can be appropriately upper bounded, as clarified in the appendix.
Lemma 7.
Lemma 7 can be interpreted as a stability result for our algorithm. It essentially states that the policy parameters are updated in such a way that the average difference between consecutive average rewards and value functions decreases with the horizon $T$. Using the above result, we now prove our regret guarantee.
Theorem 2.
Conclusion
In this paper, we proposed an algorithm based on the vanilla policy gradient for reinforcement learning in the infinite horizon average reward setting. Unlike recent works in this setting, which require the MDP to be tabular or to have a linear structure, we assume the framework of general parametrization of the policy. We show that the proposed algorithm converges to a neighborhood of the global optimum at a rate of $\tilde{\mathcal{O}}(T^{-1/4})$, which matches the result of the vanilla policy gradient with general parametrization in the discounted reward setting. We use this convergence result to further show that our algorithm achieves a regret of $\tilde{\mathcal{O}}(T^{3/4})$.
We note that this paper unveils numerous promising directions for future research. These avenues encompass exploring the possibility of relaxing the assumption of ergodic MDPs to weakly communicating MDPs, refining regret bounds for enhanced performance, deriving more robust lower bounds for the general parametrization, and extending the problem domain to incorporate constraints.
Acknowledgements
This work was supported in part by the U.S. National Science Foundation under Grant CCF-2149588 and Cisco Inc.
References
- Agarwal et al. (2020) Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2020. Optimality and Approximation with Policy Gradient Methods in Markov Decision Processes. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, 64–66. PMLR.
- Agarwal et al. (2021) Agarwal, A.; Kakade, S. M.; Lee, J. D.; and Mahajan, G. 2021. On the theory of policy gradient methods: Optimality, approximation, and distribution shift. The Journal of Machine Learning Research, 22(1): 4431–4506.
- Agarwal, Bai, and Aggarwal (2022a) Agarwal, M.; Bai, Q.; and Aggarwal, V. 2022a. Concave Utility Reinforcement Learning with Zero-Constraint Violations. Transactions on Machine Learning Research.
- Agarwal, Bai, and Aggarwal (2022b) Agarwal, M.; Bai, Q.; and Aggarwal, V. 2022b. Regret guarantees for model-based reinforcement learning with long-term average constraints. In Uncertainty in Artificial Intelligence, 22–31. PMLR.
- Agrawal and Jia (2017) Agrawal, S.; and Jia, R. 2017. Optimistic posterior sampling for reinforcement learning: worst-case regret bounds. Advances in Neural Information Processing Systems, 30.
- Al-Abbasi, Ghosh, and Aggarwal (2019) Al-Abbasi, A. O.; Ghosh, A.; and Aggarwal, V. 2019. Deeppool: Distributed model-free algorithm for ride-sharing using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 20(12): 4714–4727.
- Auer, Jaksch, and Ortner (2008) Auer, P.; Jaksch, T.; and Ortner, R. 2008. Near-optimal regret bounds for reinforcement learning. Advances in neural information processing systems, 21.
- Chen, Jain, and Luo (2022) Chen, L.; Jain, R.; and Luo, H. 2022. Learning infinite-horizon average-reward Markov decision process with constraints. In International Conference on Machine Learning, 3246–3270. PMLR.
- Ding et al. (2020) Ding, D.; Zhang, K.; Basar, T.; and Jovanovic, M. 2020. Natural Policy Gradient Primal-Dual Method for Constrained Markov Decision Processes. In Advances in Neural Information Processing Systems, volume 33, 8378–8390. Curran Associates, Inc.
- Dorfman and Levy (2022) Dorfman, R.; and Levy, K. Y. 2022. Adapting to mixing time in stochastic optimization with Markovian data. In International Conference on Machine Learning, 5429–5446. PMLR.
- Fatkhullin et al. (2023) Fatkhullin, I.; Barakat, A.; Kireeva, A.; and He, N. 2023. Stochastic policy gradient methods: Improved sample complexity for fisher-non-degenerate policies. arXiv:2302.01734.
- Fruit et al. (2018) Fruit, R.; Pirotta, M.; Lazaric, A.; and Ortner, R. 2018. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, 1578–1586. PMLR.
- Geng et al. (2020) Geng, N.; Lan, T.; Aggarwal, V.; Yang, Y.; and Xu, M. 2020. A multi-agent reinforcement learning perspective on distributed traffic engineering. In 2020 IEEE 28th International Conference on Network Protocols (ICNP), 1–11. IEEE.
- Ghosh, Zhou, and Shroff (2022) Ghosh, A.; Zhou, X.; and Shroff, N. 2022. Achieving Sub-linear Regret in Infinite Horizon Average Reward Constrained MDP with Linear Function Approximation. In The Eleventh International Conference on Learning Representations.
- Gong and Wang (2020) Gong, H.; and Wang, M. 2020. A Duality Approach for Regret Minimization in Average-Award Ergodic Markov Decision Processes. In Learning for Dynamics and Control, 862–883. PMLR.
- Jin et al. (2018) Jin, C.; Allen-Zhu, Z.; Bubeck, S.; and Jordan, M. I. 2018. Is Q-Learning Provably Efficient? In Advances in Neural Information Processing Systems, volume 31. Curran Associates, Inc.
- Jin et al. (2020) Jin, C.; Yang, Z.; Wang, Z.; and Jordan, M. I. 2020. Provably efficient reinforcement learning with linear function approximation. In Proceedings of Thirty Third Conference on Learning Theory, volume 125 of Proceedings of Machine Learning Research, 2137–2143. PMLR.
- Lattimore and Szepesvári (2020) Lattimore, T.; and Szepesvári, C. 2020. Bandit algorithms. Cambridge University Press.
- Ling, Mondal, and Ukkusuri (2023) Ling, L.; Mondal, W. U.; and Ukkusuri, S. V. 2023. Cooperating Graph Neural Networks with Deep Reinforcement Learning for Vaccine Prioritization. arXiv preprint arXiv:2305.05163.
- Liu et al. (2020) Liu, Y.; Zhang, K.; Basar, T.; and Yin, W. 2020. An improved analysis of (variance-reduced) policy gradient and natural policy gradient methods. Advances in Neural Information Processing Systems, 33: 7624–7636.
- Mnih et al. (2016) Mnih, V.; Badia, A. P.; Mirza, M.; Graves, A.; Lillicrap, T.; Harley, T.; Silver, D.; and Kavukcuoglu, K. 2016. Asynchronous methods for deep reinforcement learning. In International conference on machine learning, 1928–1937. PMLR.
- Mnih et al. (2015) Mnih, V.; Kavukcuoglu, K.; Silver, D.; Rusu, A. A.; Veness, J.; Bellemare, M. G.; Graves, A.; Riedmiller, M.; Fidjeland, A. K.; Ostrovski, G.; et al. 2015. Human-level control through deep reinforcement learning. nature, 518(7540): 529–533.
- Mondal and Aggarwal (2023) Mondal, W. U.; and Aggarwal, V. 2023. Improved Sample Complexity Analysis of Natural Policy Gradient Algorithm with General Parameterization for Infinite Horizon Discounted Reward Markov Decision Processes. arXiv preprint arXiv:2310.11677.
- Ouyang et al. (2017) Ouyang, Y.; Gagrani, M.; Nayyar, A.; and Jain, R. 2017. Learning unknown markov decision processes: A thompson sampling approach. Advances in neural information processing systems, 30.
- Pesquerel and Maillard (2022) Pesquerel, F.; and Maillard, O.-A. 2022. IMED-RL: Regret optimal learning of ergodic Markov decision processes. In NeurIPS 2022-Thirty-sixth Conference on Neural Information Processing Systems.
- Schulman et al. (2015) Schulman, J.; Levine, S.; Abbeel, P.; Jordan, M.; and Moritz, P. 2015. Trust region policy optimization. In International conference on machine learning, 1889–1897. PMLR.
- Suttle et al. (2023) Suttle, W. A.; Bedi, A.; Patel, B.; Sadler, B. M.; Koppel, A.; and Manocha, D. 2023. Beyond Exponentially Fast Mixing in Average-Reward Reinforcement Learning via Multi-Level Monte Carlo Actor-Critic. In International Conference on Machine Learning, 33240–33267. PMLR.
- Sutton et al. (1999) Sutton, R. S.; McAllester, D.; Singh, S.; and Mansour, Y. 1999. Policy gradient methods for reinforcement learning with function approximation. Advances in neural information processing systems, 12.
- Wang et al. (2019) Wang, L.; Cai, Q.; Yang, Z.; and Wang, Z. 2019. Neural Policy Gradient Methods: Global Optimality and Rates of Convergence. arXiv:1909.01150.
- Wei et al. (2021) Wei, C.-Y.; Jahromi, M. J.; Luo, H.; and Jain, R. 2021. Learning infinite-horizon average-reward mdps with linear function approximation. In International Conference on Artificial Intelligence and Statistics, 3007–3015. PMLR.
- Wei et al. (2020) Wei, C.-Y.; Jahromi, M. J.; Luo, H.; Sharma, H.; and Jain, R. 2020. Model-free reinforcement learning in infinite-horizon average-reward markov decision processes. In International conference on machine learning, 10170–10180. PMLR.
- Wei, Liu, and Ying (2022) Wei, H.; Liu, X.; and Ying, L. 2022. A provably-efficient model-free algorithm for infinite-horizon average-reward constrained Markov decision processes. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, 3868–3876.
- Zhang et al. (2021) Zhang, J.; Ni, C.; Yu, Z.; Szepesvari, C.; and Wang, M. 2021. On the Convergence and Sample Efficiency of Variance-Reduced Policy Gradient Method. Advances in Neural Information Processing Systems, 34: 2228–2240.
Proofs for the Section of Global Convergence Analysis
Proof of Lemma 1
Proof.
Using (6), we arrive at the following.
(38) | ||||
where step (a) is a consequence of (6) and the Bellman equation. Multiplying both sides by $d^{\pi_\theta}(s)$, taking a sum over $s \in \mathcal{S}$, and rearranging the terms, we obtain the following.
(39) |
where (a) uses the fact that $d^{\pi_\theta}$ is a stationary distribution. Note that,
(40) | ||||
We can, therefore, replace the $Q$ function in the policy gradient with the advantage function $A^{\pi_\theta}$. Thus,
(41) |
∎
Proof of Lemma 2
Proof.
The proof runs along the lines of (Wei et al. 2020, Lemma 6). Consider the $k$th epoch and assume that $\pi_{\theta_k}$ is denoted as $\pi$ for notational convenience. Let $i(s)$ be the number of disjoint sub-trajectories of length $N$ that start with the state $s$ and are at least $N$ distance apart (found by Algorithm 2). Let $y_i$ be the sum of observed rewards of the $i$th sub-trajectory and $\tau_i$ be its starting time. The advantage function estimate is,
(42) |
Note the following,
(43) |
where follows from the definition of as given in , is an application of the definition of given in , and follows from the Bellman equation. Define the following quantity.
(44) |
Using Lemma 9, we get which implies, . Observe that,
(45) |
where . Using the bound on , one can show that, , which implies,
(46) |
Note that (46) cannot be directly used to bound the bias of . This is because the random variable is correlated with the reward variables . To decorrelate these two random variables, imagine an MDP where the state distribution rejuvenates to the stationary distribution, after exactly time steps since the completion of a sub-trajectory. In other words, if a sub-trajectory starts at , and ends at , then the system ‘rests’ for additional steps before rejuvenating with the state distribution, at . Clearly, the wait time between the rejuvenation after the th sub-trajectory and the start of the th sub-trajectory is, , . Let be the time between the start time of the th epoch and the start time of the first sub-trajectory. Note that,
only depends on the initial state, and the induced transition function, ,
, where , depends on the stationary distribution, , and the induced transition function, ,
only depends on as other segments of the epoch have fixed length, .
Clearly, in this imaginary MDP, the sequence, , and hence, is independent of . Let, denote the expectation operation and denote the probability of events in this imaginary system. Define the following.
(47) |
where is defined in . Note that we have suppressed the dependence on , , and while defining to remove clutter. Using , one can write . Moreover,
(48) |
where uses the bound derived in , and the fact that are zero mean independent random variables conditioned on . Note that almost surely, via Lemma 8, and as shown in . Combining, we get, (see the definition of in (47)). Invoking this bound into , we get the following result.
(49) |
Note that, one can use Lemma 10 to bound the following violation probability.
(50) |
where follows from the fact that for sufficiently large . Finally, note that, if , where is defined as,
(51) |
then there exists at least one that is larger than which can happen with the following maximum probability according to Lemma 10.
(52) |
The above probability bound can be used to obtain the following result,
(53) |
Injecting and into , we finally obtain the following.
(54) |
Eq. shows that our desired inequality is satisfied in the imaginary system. We now need a mechanism to translate this result to our actual MDP. Notice that, we can write where , and . We have,
(55) |
The last inequality uses the non-negativity of . Observe that, for a fixed sequence, , we have,
(56) | ||||
(57) |
Thus, the difference between and arises because , . Note that the ratio of these two terms can be bounded as follows,
(58) |
where is a consequence of Lemma 9. We have,
(59) |
where uses the fact that . Combining and , we get,
(60) |
where follows from . This concludes the lemma. ∎
Proof of Lemma 3
Proof.
Recall from Eq. (18) that,
(61) |
Define the following quantity,
(62) |
where $t_k$ is the starting time of the $k$th epoch. Note that the true gradient is given by,
(63) |
Moreover, using Assumption 2 and Lemma 9, one can prove that , which implies . Applying Lemma 12, we, therefore, arrive at,
(64) |
We would like to point out that Lemma 12 was also used in (Suttle et al. 2023) to analyze their actor-critic algorithm. Finally, the difference, can be bounded as follows.
(65) |
where (a) follows from Assumption 2 and Jensen's inequality, whereas (b) follows from Lemma 2. Combining (64) and (65), we conclude the result. ∎
Proof of Lemma 4
Proof.
Proof of Lemma 5
Proof.
We start with the definition of KL divergence.
(67) | ||||
where step (a) holds by Assumption 2 and step (b) holds by Lemma 4. Step (c) uses a convexity argument. Finally, step (d) comes from Assumption 3. Rearranging the terms, we have
(68) |
Summing from $k=1$ to $K$, using the non-negativity of the KL divergence, and dividing the resulting expression by $K$, we get the desired result. ∎
Proof of Lemma 6
Proof.
By the smoothness of the objective function, we have
(69) |
where step (a) follows from the parameter update rule and step (b) holds due to the Cauchy-Schwarz inequality. Rearranging the terms yields the following.
(70) |
Choosing the learning rate $\alpha$ appropriately and summing over $k$ results in the following.
(71) |
Using the boundedness of $J^{\pi_\theta}$ (due to the bounded reward), we conclude the result. ∎
Proofs for the Section of Regret Analysis
Proof of Lemma 7
Proof.
Using Taylor’s expansion, we can write the following , .
(72) |
where $\bar{\theta}$ is some convex combination of $\theta_k$ and $\theta_{k+1}$, and (a) follows from Assumption 2. This concludes the first statement. Applying (72) and Lemma 11, we obtain,
(73) |
Inequality (a) uses Lemma 8 and the parameter update rule. Step (b) holds by the Cauchy-Schwarz inequality and Jensen's inequality, whereas (c) can be derived using Lemma 3 and by substituting the value of the learning rate. This establishes the second statement. Next, recall that the Bellman equation (5) holds for any policy. Note that, for any policy parameter $\theta$ and any state $s$, the following holds.
(74) |
Define the following quantity.
(75) |
Lemma 9 states that, for sufficiently large $T$, the above quantity can be suitably bounded for any policy and state. Combining this result with the fact that the reward function is bounded in $[0,1]$, we obtain,
(76) |
where follows from (73) and substituting . For the first term, note that,
(77) |
Inequality (a) holds since every row of $P^{\pi_\theta}$ sums to $1$. Moreover, invoking (72) and the parameter update rule, we get,
Plugging the above result into and using a recursive argument, we get,
Proof of Theorem 2
Some Auxiliary Lemmas for the Proofs
Lemma 8.
(Wei et al. 2020, Lemma 14) For any ergodic MDP with mixing time $t_{\mathrm{mix}}$, the following holds for all $s \in \mathcal{S}$ and any policy $\pi$.
Lemma 9.
(Wei et al. 2020, Corollary 13.2) Let be defined as written below for an arbitrary policy .
(82) |
If , we have the following inequality : .
Lemma 10.
Lemma 11.
(Wei et al. 2020, Lemma 15) For two different policies $\pi$ and $\pi'$, the difference of the objective functions is
(84) |
Lemma 12.
(Dorfman and Levy 2022, Lemma A.6) Let be a policy parameter. Fix a trajectory generated by following the policy starting from some initial state . Let, be the gradient that we wish to estimate over , and is a function such that . Assume that , , . Define . If , then the following holds as long as ,
(85) |