Improved Regret Bound and Experience Replay in Regularized Policy Iteration
Abstract
In this work, we study algorithms for learning in infinite-horizon undiscounted Markov decision processes (MDPs) with function approximation. We first show that the regret analysis of the Politex algorithm (a version of regularized policy iteration) can be sharpened from $O(T^{3/4})$ to $O(\sqrt{T})$ under nearly identical assumptions, and instantiate the bound with linear function approximation. Our result provides the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting. The exact implementation of Politex with neural network function approximation is inefficient in terms of memory and computation. Since our analysis suggests that we need to approximate the average of the action-value functions of past policies well, we propose a simple efficient implementation where we train a single Q-function on a replay buffer with past data. We show that this often leads to superior performance over other implementation choices, especially in terms of wall-clock time. Our work also provides a novel theoretical justification for using experience replay within policy iteration algorithms.
1 Introduction
Model-free reinforcement learning (RL) algorithms combined with powerful function approximation have achieved impressive performance in a variety of application domains over the last decade. Unfortunately, the theoretical understanding of such methods is still quite limited. In this work, we study single-trajectory learning in infinite-horizon undiscounted Markov decision processes (MDPs), also known as average-reward MDPs, which capture tasks such as routing and the control of physical systems.
One line of work with performance guarantees for the average-reward setting follows the “online MDP” approach proposed by Even-Dar et al. (2009), where the agent selects policies by running an online learning algorithm in each state, typically mirror descent. The resulting algorithm is a version of approximate policy iteration (API), which alternates between (1) estimating the state-action value function (or Q-function) of the current policy and (2) setting the next policy to be optimal w.r.t. the sum of all previous Q-functions plus a regularizer. Note that, by contrast, standard API sets the next policy only based on the most recent Q-function. The policy update can also be written as maximizing the most recent Q-function minus the KL-divergence to the previous policy, which is somewhat similar to recently popular versions of API (Schulman et al., 2015, 2017; Achiam et al., 2017; Abdolmaleki et al., 2018; Song et al., 2019a).
The original work of Even-Dar et al. (2009) studied this scheme with known dynamics, tabular representation, and adversarial reward functions. More recent works (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Wei et al., 2020a) have adapted the approach to the case of unknown dynamics, stochastic rewards, and value function approximation. With linear value functions, the Politex algorithm of Abbasi-Yadkori et al. (2019) achieves $O(T^{3/4})$ high-probability regret in ergodic MDPs, and the results only scale in the number of features rather than states. Wei et al. (2020a) later show an $O(\sqrt{T})$ bound on expected regret for a similar algorithm named MDP-EXP2. In this work, we revisit the analysis of Politex and show that it can be sharpened to $O(\sqrt{T})$ under nearly identical assumptions, resulting in the first high-probability $O(\sqrt{T})$ regret bound for a computationally efficient algorithm in this setting.
In addition to improved analysis, our work also addresses practical implementation of Politex with neural networks. The policies produced by Politex in each iteration require access to the sum of all previous Q-function estimates. With neural network function approximation, exact implementation requires us to keep all past networks in memory and evaluate them at each step, which is inefficient in terms of memory and computation. Some practical implementation choices include subsampling Q-functions and/or optimizing a KL-divergence regularized objective w.r.t. a parametric policy at each iteration. We propose an alternative approach, where we approximate the average of all past Q-functions by training a single network on a replay buffer with past data. We demonstrate that this choice often outperforms other approximate implementations, especially in terms of run-time. When available memory is constrained, we propose to subsample transitions using the notion of coresets (Bachem et al., 2017).
Our work also provides a novel perspective on the benefits of experience replay. Experience replay is a standard tool for stabilizing learning in modern deep RL, and is typically used in off-policy methods like Deep Q-Networks (Mnih et al., 2013), as well as “value gradient” methods such as DDPG (Lillicrap et al., 2016) and SVG (Heess et al., 2015). A different line of on-policy methods typically does not rely on experience replay; instead, learning is stabilized by constraining consecutive policies to be close in terms of KL divergence (Schulman et al., 2015; Song et al., 2019a; Degrave et al., 2019). We observe that both experience replay and KL-divergence regularization can be viewed as approximate implementations of Politex. Thus, we provide a theoretical justification for using experience replay in API, as an approximate implementation of online learning in each state. Note that this online-learning view differs from the commonly used justifications for experience replay, namely that it “breaks temporal correlations” (Schaul et al., 2016; Mnih et al., 2013). Our analysis also suggests a new objective for subsampling or priority-sampling transitions in the replay buffer, which differs from the priority-sampling objectives of previous work (Schaul et al., 2016).
In summary, our main contributions are (1) an improved analysis of Politex, showing an $O(\sqrt{T})$ regret bound under the same assumptions as the original work, and (2) an efficient implementation that also offers a new perspective on the benefits of experience replay.
2 Setting and Notation
We consider learning in infinite-horizon undiscounted ergodic MDPs $\mathcal{M} = (\mathcal{S}, \mathcal{A}, r, P)$, where $\mathcal{S}$ is the state space, $\mathcal{A}$ is a finite action space, $r: \mathcal{S} \times \mathcal{A} \to [0, 1]$ is an unknown reward function, and $P: \mathcal{S} \times \mathcal{A} \to \Delta_{\mathcal{S}}$ is the unknown probability transition function. A policy $\pi$ is a mapping from a state to a distribution over actions. Let $(s_t, a_t)_{t \ge 1}$ denote the state-action sequence obtained by following policy $\pi$. The expected average reward of policy $\pi$ is defined as
$$\lambda^{\pi} := \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}\left[\sum_{t=1}^{T} r(s_t, a_t)\right]. \qquad (2.1)$$
Let $\mu_{\pi}$ denote the stationary state distribution of a policy $\pi$, satisfying $\mu_{\pi}^{\top} P_{\pi} = \mu_{\pi}^{\top}$, where $P_{\pi}$ is the state transition matrix under $\pi$. We will sometimes write $\mu_{\pi}$ as a vector, and use $\nu_{\pi}(s, a) = \mu_{\pi}(s)\pi(a|s)$ to denote the stationary state-action distribution. In ergodic MDPs, $\mu_{\pi}$ and $\lambda^{\pi}$ are well-defined and independent of the initial state. The optimal policy $\pi^*$ is a policy that maximizes the expected average reward. We will denote by $\lambda^*$ and $\mu^*$ the expected average reward and stationary state distribution of $\pi^*$, respectively.
The value function of a policy $\pi$ is defined as:
$$V^{\pi}(s) := \mathbb{E}\left[\sum_{t=1}^{\infty} \big(r(s_t, a_t) - \lambda^{\pi}\big) \,\Big|\, s_1 = s\right]. \qquad (2.2)$$
The state-action value function is defined as
$$Q^{\pi}(s, a) := r(s, a) - \lambda^{\pi} + \mathbb{E}_{s' \sim P(\cdot|s, a)}\left[V^{\pi}(s')\right]. \qquad (2.3)$$
Notice that
$$V^{\pi}(s) = \mathbb{E}_{a \sim \pi(\cdot|s)}\left[Q^{\pi}(s, a)\right]. \qquad (2.4)$$
Equations (2.3) and (2.4) are known as the Bellman equation. If we do not use the definition of $V^{\pi}$ in Eq. (2.2) and instead solve for $V^{\pi}$ and $Q^{\pi}$ using the Bellman equation, we can see that the solutions are unique up to an additive constant. Therefore, in the following, we may use $V^{\pi}$ and $Q^{\pi}$ to denote the same functions up to a constant. We will use the shorthand notation $Q^{\pi}(s, \pi') := \mathbb{E}_{a \sim \pi'(\cdot|s)}\left[Q^{\pi}(s, a)\right]$; note that $V^{\pi}(s) = Q^{\pi}(s, \pi)$.
The agent interacts with the environment as follows: at each round $t$, the agent observes a state $s_t$, chooses an action $a_t$, and receives a reward $r(s_t, a_t)$. The environment then transitions to the next state $s_{t+1}$ with probability $P(s_{t+1}|s_t, a_t)$. Recall that $\pi^*$ is the optimal policy and $\lambda^*$ is its expected average reward. The regret of an algorithm with respect to this fixed policy is defined as
$$R_T := \sum_{t=1}^{T} \big(\lambda^* - r(s_t, a_t)\big). \qquad (2.5)$$
The learning goal is to find an algorithm that minimizes the long-term regret $R_T$.
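To make these definitions concrete, the following sketch (our own illustration, not code from the paper) computes $\lambda^{\pi}$, $V^{\pi}$, and $Q^{\pi}$ for a small synthetic MDP by solving the Bellman equations (2.3)–(2.4), pinning the additive constant by setting $V^{\pi}$ to zero in the first state.

```python
import numpy as np

# A toy 2-state, 2-action ergodic MDP (illustrative numbers, not from the paper).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[s, a, s']
              [[0.3, 0.7], [0.6, 0.4]]])
r = np.array([[1.0, 0.0],                  # r[s, a]
              [0.2, 0.5]])
pi = np.array([[0.5, 0.5],                 # pi[a | s]
               [0.1, 0.9]])

P_pi = np.einsum('sa,sap->sp', pi, P)      # state transition matrix under pi
r_pi = np.einsum('sa,sa->s', pi, r)        # expected one-step reward under pi

# Stationary distribution mu_pi: left eigenvector of P_pi for eigenvalue 1.
evals, evecs = np.linalg.eig(P_pi.T)
mu = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
mu = mu / mu.sum()

lam = float(mu @ r_pi)                     # expected average reward, Eq. (2.1)

# Solve (I - P_pi) V = r_pi - lam, fixing V[0] = 0 to remove the additive constant.
n = P_pi.shape[0]
A, b = np.eye(n) - P_pi, r_pi - lam
A[0, :], b[0] = 0.0, 0.0
A[0, 0] = 1.0
V = np.linalg.solve(A, b)

Q = r - lam + P @ V                        # Bellman equation (2.3)
assert np.allclose(V, np.einsum('sa,sa->s', pi, Q))   # consistency with Eq. (2.4)
print(lam, V, Q)
```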
Our analysis will require the following assumption on the mixing rate of policies.
Assumption 2.1 (Uniform mixing).
Let $P_{\pi}$ be the state-action transition matrix of a policy $\pi$. Let $\kappa(P_{\pi})$ be the corresponding ergodicity coefficient (Seneta, 1979), defined as
$$\kappa(P_{\pi}) = \sup_{x:\, x^{\top}\mathbf{1} = 0,\ \|x\|_1 = 1} \big\|x^{\top} P_{\pi}\big\|_1 .$$
We assume that there exists a scalar $\kappa < 1$ such that for any policy $\pi$, $\kappa(P_{\pi}) \le \kappa$.
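For intuition, the ergodicity coefficient in Assumption 2.1 has the standard equivalent form of half the largest $\ell_1$ distance between rows of the transition matrix, which makes it easy to compute for a small chain. A minimal sketch (our own, illustrative):

```python
import numpy as np

def ergodicity_coefficient(P: np.ndarray) -> float:
    """Ergodicity (Dobrushin) coefficient of a row-stochastic matrix P.

    Equivalent to sup_{x^T 1 = 0, ||x||_1 = 1} ||x^T P||_1, computed here via the
    identity kappa(P) = 0.5 * max_{i,j} ||P[i] - P[j]||_1.
    """
    diffs = np.abs(P[:, None, :] - P[None, :, :]).sum(axis=-1)
    return 0.5 * float(diffs.max())

# Example: a fast-mixing two-state chain has a coefficient well below 1.
P = np.array([[0.6, 0.4],
              [0.3, 0.7]])
print(ergodicity_coefficient(P))  # 0.5 * (|0.6-0.3| + |0.4-0.7|) = 0.3
```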
3 Related Work
Regret bounds for average-reward MDPs. Most no-regret algorithms for infinite-horizon undiscounted MDPs are model-based and only applicable to tabular representations (Bartlett, 2009; Jaksch et al., 2010; Ouyang et al., 2017; Fruit et al., 2018; Jian et al., 2019; Talebi and Maillard, 2018). For weakly-communicating MDPs with diameter $D$, these algorithms nearly achieve the minimax lower bound $\Omega(\sqrt{DSAT})$ (Jaksch et al., 2010) with high probability. Wei et al. (2020b) provide model-free algorithms with sublinear regret bounds in the tabular setting. In the model-free setting with function approximation, the Politex algorithm (Abbasi-Yadkori et al., 2019) achieves $O(d^{1/2}T^{3/4})$ regret in uniformly ergodic MDPs, where $d$ is the size of the compressed state-action space. Hao et al. (2020) improve these results to $O(T^{2/3})$. More recently, Wei et al. (2020a) present three algorithms for average-reward MDPs with linear function approximation. Among these, FOPO achieves $O(\sqrt{T})$ regret but is computationally inefficient, and OLSVI.FH is efficient but obtains $O(T^{3/4})$ regret. The MDP-EXP2 algorithm is computationally efficient, and under similar assumptions as in Abbasi-Yadkori et al. (2019) it obtains an $O(\sqrt{T})$ bound on expected regret (a weaker guarantee than the high-probability bounds in other works). Our analysis shows a high-probability $O(\sqrt{T})$ regret bound under the same assumptions.
KL-regularized approximate policy iteration. Our work is also related to approximate policy iteration algorithms which constrain each policy to be close to the previous policy in the sense of KL divergence. This approach was popularized by TRPO (Schulman et al., 2015), where it was motivated as an approximate implementation of conservative policy iteration (Kakade and Langford, 2002). Some of the related subsequent works include PPO (Schulman et al., 2017), MPO (Abdolmaleki et al., 2018), V-MPO (Song et al., 2019a), and CPO (Achiam et al., 2017). While these algorithms place a constraint on consecutive policies and are mostly heuristic, another line of research shows that using KL divergence as a regularizer has a theoretical justification in terms of either regret guarantees (Abbasi-Yadkori et al., 2019; Hao et al., 2020; Wei et al., 2020b, a) or error propagation (Vieillard et al., 2020a, b).
Experience replay. Experience replay (Lin, 1992) is one of the central tools for achieving good performance in deep reinforcement learning. While it is mostly used in off-policy methods such as deep Q-learning (Mnih et al., 2013, 2015), it has also shown benefits in value-gradient methods (Heess et al., 2015; Lillicrap et al., 2016), and has been used in some variants of KL-regularized policy iteration (Abdolmaleki et al., 2018; Tomar et al., 2020). Its success has been attributed to removing some temporal correlations from data fed to standard gradient-based optimization algorithms. Schaul et al. (2016) have shown that non-uniform replay sampling based on the Bellman error can improve performance. Unlike these works, we motivate experience replay from the perspective of online learning in MDPs (Even-Dar et al., 2009) with the goal of approximating the average of past value functions well.
Continual learning. Continual learning (CL) is the paradigm of learning a classifier or regressor that performs well on a set of tasks, where each task corresponds to a different data distribution. The tasks are observed sequentially, and the learning goal is to avoid forgetting past tasks without storing all the data in memory. This is quite similar to our goal of approximating the average of sequentially-observed Q-functions, where the data for approximating each Q-function has a different distribution. In general, approaches to CL can be categorized as regularization-based (Kirkpatrick et al., 2017; Zenke et al., 2017; Farajtabar et al., 2020; Yin et al., 2020), expansion-based (Rusu et al., 2016), and replay-based (Lopez-Paz and Ranzato, 2017; Chaudhry et al., 2018; Borsos et al., 2020), with the approaches based on experience replay typically having superior performance over other methods.
4 Algorithm
Our algorithm is similar to the Politex schema (Abbasi-Yadkori et al., 2019). In each phase $k$, Politex obtains an estimate $\widehat{Q}_k$ of the action-value function of the current policy $\pi_k$, and then sets the next policy using the mirror descent update rule:
$$\pi_{k+1}(\cdot|s) = \arg\max_{\pi \in \Delta_{\mathcal{A}}} \; \eta \sum_{i=1}^{k} \big\langle \pi, \widehat{Q}_i(s, \cdot) \big\rangle + H(\pi), \qquad (4.1)$$
where $H(\pi) = -\sum_{a} \pi(a) \log \pi(a)$ is the entropy function; the solution is the softmax policy $\pi_{k+1}(a|s) \propto \exp\big(\eta \sum_{i=1}^{k} \widehat{Q}_i(s, a)\big)$. When the functions $\widehat{Q}_i$ are tabular or linear, the above update can be implemented efficiently by simply summing all the table entries or weight vectors. However, with neural network function approximation, we need to keep all past networks in memory and evaluate them at each step, which quickly becomes inefficient in terms of storage and computation. Some of the efficient implementations proposed in the literature include keeping a subset of the action-value functions (Abbasi-Yadkori et al., 2019), and using a parametric policy and optimizing the KL-regularized objective w.r.t. the parameters over the available data (Tomar et al., 2020).
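For tabular or linear Q-estimates, the update (4.1) therefore reduces to a Boltzmann policy over the summed table entries or weight vectors. A minimal sketch of the linear case (function and variable names are ours):

```python
import numpy as np

def politex_action_probs(phi_sa: np.ndarray, summed_w: np.ndarray, eta: float) -> np.ndarray:
    """Politex policy at one state for linear Q-functions.

    phi_sa:   (num_actions, d) features phi(s, a) for each action at the current state.
    summed_w: sum of the weight vectors of all previous Q-estimates, shape (d,).
    Returns pi(a|s) proportional to exp(eta * sum_i Q_i(s, a)).
    """
    logits = eta * phi_sa @ summed_w
    logits -= logits.max()            # numerical stability
    p = np.exp(logits)
    return p / p.sum()
```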
Our proposed method, presented in Algorithm 1, attempts to directly approximate the average of all previous action-value functions,
$$\bar{Q}_k = \frac{1}{k} \sum_{i=1}^{k} Q^{\pi_i}.$$
To do so, we only use a single network, and continually train it to approximate $\bar{Q}_k$. At each iteration $k$, we obtain a dataset $D_k$ of tuples $(s_t, a_t, \widehat{R}_t)$, where $\widehat{R}_t$ is the empirical return from the state-action pair $(s_t, a_t)$. We initialize the estimate $\widehat{\bar{Q}}_k$ to $\widehat{\bar{Q}}_{k-1}$ and update it by minimizing the squared error over the union of $D_k$ and the replay buffer $\mathcal{R}$. We then update the replay buffer with all or a subset of the data in $D_k$.
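A minimal sketch of the resulting loop, where `collect_phase` (run the Boltzmann policy of the current network for one phase and return $(s, a, \widehat{R})$ tuples) and `fit_q` (a few epochs of squared-error regression) are hypothetical helpers standing in for the corresponding steps of Algorithm 1:

```python
def replay_politex(env, q_net, num_phases, eta, subsample=None):
    """Sketch of Algorithm 1: a single Q-network trained on a growing replay buffer.

    `collect_phase` and `fit_q` are hypothetical placeholders: the former runs the
    Boltzmann policy pi_k(a|s) proportional to exp(eta * q_net(s, a)) for one phase
    and returns (s, a, empirical_return) tuples; the latter regresses the network
    onto the given targets with a squared-error loss.
    """
    replay = []
    for _ in range(num_phases):
        data_k = collect_phase(env, q_net, eta)    # interact with the MDP
        q_net = fit_q(q_net, replay + data_k)      # approximate the average Q-function
        # Keep everything, or only a subset (uniform or coreset) of the new tuples.
        replay += data_k if subsample is None else subsample(data_k)
    return q_net
```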
In the sequel, in Section 5, we first show that by focusing on estimating the average Q-function, we can improve the regret bound of Politex under nearly identical assumptions; in Section 6, we instantiate this bound for linear value functions, where we can estimate the average Q-function simply by weight averaging; in Section 7, we focus on practical implementations: in particular, we discuss the limitations of weight averaging, and provide details on how to leverage replay data when using non-linear function approximation; in Section 8, we present our experimental results; and in Section 9, we make final remarks.
5 Regret Analysis of Politex
In this section, we revisit the regret analysis of Politex (Abbasi-Yadkori et al., 2019) in ergodic average-reward MDPs, and show that it can be improved from $O(T^{3/4})$ to $O(\sqrt{T})$ under similar assumptions. Our analysis relies in part on a simple modification of the regret decomposition. Namely, instead of including the estimation error of each value function in the regret, we consider the error in the running average $\bar{Q}_k = \frac{1}{k}\sum_{i=1}^{k} Q^{\pi_i}$. When this error scales as $O(1/\sqrt{k})$ in a particular weighted norm, the regret of Politex is $O(\sqrt{T})$. As we show in the following section, this bound can be instantiated for linear value functions under the same assumptions as Abbasi-Yadkori et al. (2019).
Assumption 5.1 (Boundedness).
Let $Q_k := Q^{\pi_k}$ and let $\widehat{Q}_k$ denote its estimate. We assume that there exists a constant $b > 0$ such that for all $k$ and for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, $|Q_k(s, a)| \le b$ and $|\widehat{Q}_k(s, a)| \le b$.
For the purpose of our algorithm, the functions $Q_k$ are unique up to a constant. Thus we can equivalently assume that the bound holds after shifting each $Q_k$ by an appropriate constant, for all $k$.
Define $\bar{Q}_k = \frac{1}{k}\sum_{i=1}^{k} Q_i$ and $\widehat{\bar{Q}}_k = \frac{1}{k}\sum_{i=1}^{k} \widehat{Q}_i$. Let $\bar{V}_k$ and $\widehat{\bar{V}}_k$ be the average of the state-value functions and its estimate. We will require the estimation error of the running averages to scale as $O(1/\sqrt{k})$, as in the following assumption.
Assumption 5.2 (Estimation error).
Let $\mu^*$ be the stationary state distribution of the optimal policy $\pi^*$, and let $\nu^*(s, a) = \mu^*(s)\pi^*(a|s)$ be the corresponding state-action distribution. With probability at least $1 - \delta$, for a problem-dependent constant $C_{5.2}$, the errors in $\widehat{\bar{Q}}_k$ and $\widehat{\bar{V}}_k$ are bounded as
$$\big\|\bar{Q}_k - \widehat{\bar{Q}}_k\big\|_{\nu^*} \le \frac{C_{5.2}}{\sqrt{k}}, \qquad \big\|\bar{V}_k - \widehat{\bar{V}}_k\big\|_{\mu^*} \le \frac{C_{5.2}}{\sqrt{k}}.$$
Define as in Abbasi-Yadkori et al. (2019):
We bound the regret of Politex in the following theorem. Here, recall that $\tau$ is the length of each phase, and $\kappa$ and $\eta$ are defined in Assumption 2.1 and Eq. (4.1), respectively.
Theorem 5.3 (Regret of Politex).
Proof.
We start by decomposing the cumulative regret, following similar steps as Abbasi-Yadkori et al. (2019). With $K = T/\tau$ phases of length $\tau$,
$$R_T = \sum_{t=1}^{T}\big(\lambda^* - r(s_t, a_t)\big) = \tau \sum_{k=1}^{K}\big(\lambda^* - \lambda^{\pi_k}\big) + \sum_{k=1}^{K} \sum_{t=(k-1)\tau+1}^{k\tau} \big(\lambda^{\pi_k} - r(s_t, a_t)\big). \qquad (5.1)$$
The second term captures the sum of differences between observed rewards and their long-term averages. In previous work, this term was shown to scale as $O(K\sqrt{\tau})$. We show that the analysis can in fact be tightened to $O(\sqrt{T})$ using improved concentration bounds and the slow-changing nature of the policies. See Lemma B.1 in Appendix B for precise details.
The first term, which is also called the pseudo-regret in the literature, measures the difference between the expected reward of the reference policy and the policies produced by the algorithm. Applying the performance difference lemma (Cao, 1999), we can write each pseudo-regret term as
$$\lambda^* - \lambda^{\pi_k} = \mathbb{E}_{s \sim \mu^*}\big[Q^{\pi_k}(s, \pi^*) - Q^{\pi_k}(s, \pi_k)\big].$$
Now, notice that Politex is running exact mirror descent in each state, with loss functions $-\widehat{Q}_k(s, \cdot)$. We bridge the pseudo-regret by the terms:
$$\tau \sum_{k=1}^{K}\big(\lambda^* - \lambda^{\pi_k}\big) = \tau \sum_{k=1}^{K} \mathbb{E}_{s \sim \mu^*}\big[Q^{\pi_k}(s, \pi^*) - \widehat{Q}_k(s, \pi^*)\big] \qquad (5.2)$$
$$\qquad + \tau \sum_{k=1}^{K} \mathbb{E}_{s \sim \mu^*}\big[\widehat{Q}_k(s, \pi^*) - \widehat{Q}_k(s, \pi_k)\big] \qquad (5.3)$$
$$\qquad + \tau \sum_{k=1}^{K} \mathbb{E}_{s \sim \mu^*}\big[\widehat{Q}_k(s, \pi_k) - Q^{\pi_k}(s, \pi_k)\big]. \qquad (5.4)$$
The term (5.3) can be bounded using the regret of mirror descent as in previous work (Abbasi-Yadkori et al., 2019). Setting $\eta = \sqrt{2\log|\mathcal{A}|/K}\,/\,b$, and using a union bound over all states, with probability at least $1 - \delta$, (5.3) is bounded as $O\big(\tau b \sqrt{K \log|\mathcal{A}|}\big)$. The remaining terms (5.2) and (5.4) are controlled by the estimation error of the running averages via Assumption 5.2.
6 Linear Value Functions
In this section, we show that the estimation error condition in Assumption 5.2 (and thus the $O(\sqrt{T})$ regret) can be achieved under similar assumptions as in Abbasi-Yadkori et al. (2019) and Wei et al. (2020a), which we state next.
Assumption 6.1 (Linear value functions).
The action-value function of any policy $\pi$ is linear: $Q^{\pi}(s, a) = w_{\pi}^{\top}\phi(s, a)$, where $\phi: \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ is a known feature function such that $\|\phi(s, a)\|_2 \le C_{\Phi}$ for all $(s, a)$.
Assumption 6.2 (Feature excitation).
There exists a constant $\sigma > 0$ such that for any policy $\pi$,
$$\lambda_{\min}\Big(\mathbb{E}_{(s,a) \sim \nu_{\pi}}\big[\phi(s, a)\phi(s, a)^{\top}\big]\Big) \ge \sigma^2,$$
where $\nu_{\pi}$ is the stationary state-action distribution of $\pi$ and $\lambda_{\min}(\cdot)$ denotes the smallest eigenvalue.
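Assumption 6.2 can be probed empirically by checking the smallest eigenvalue of the empirical feature second-moment matrix on data gathered under a policy. A small illustrative helper (our own, not from the paper):

```python
import numpy as np

def min_feature_eigenvalue(features: np.ndarray) -> float:
    """Smallest eigenvalue of the empirical second-moment matrix (1/n) sum phi phi^T.

    features: (n, d) array of phi(s, a) observed under some policy.
    Assumption 6.2 asks this quantity to stay bounded away from zero for every policy.
    """
    cov = features.T @ features / features.shape[0]
    return float(np.linalg.eigvalsh(cov).min())
```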
We now describe a simple procedure for estimating the average action-value functions $\bar{Q}_k = \frac{1}{k}\sum_{i=1}^{k} Q_i$, where $Q_i = Q^{\pi_i}$, such that the conditions of Assumption 5.2 are satisfied. Essentially, we estimate each $Q_i$ using least-squares Monte Carlo and then average the weights. We will use the shorthand notation $\phi_t = \phi(s_t, a_t)$. Let $\mathcal{H}_k$ and $\mathcal{H}'_k$ be subsets of time indices in phase $k$ (defined later). We estimate the weights $w_k$ of $Q_k$ as follows:
$$\widehat{w}_k = \Big(\sum_{t \in \mathcal{H}_k} \phi_t \phi_t^{\top}\Big)^{-1} \sum_{t \in \mathcal{H}_k} \phi_t \widehat{R}_t, \qquad (6.1)$$
where $\widehat{R}_t$ are the empirical $n$-step returns ($n$ is specified later), computed as
$$\widehat{R}_t = \sum_{i=t}^{t+n-1} \big(r(s_i, a_i) - \widehat{\lambda}_k\big), \qquad \widehat{\lambda}_k = \frac{1}{|\mathcal{H}'_k|}\sum_{t' \in \mathcal{H}'_k} r(s_{t'}, a_{t'}). \qquad (6.2)$$
We then estimate $\bar{Q}_k$ as $\widehat{\bar{Q}}_k(s, a) = \widehat{\bar{w}}_k^{\top}\phi(s, a)$ with $\widehat{\bar{w}}_k = \frac{1}{k}\sum_{i=1}^{k} \widehat{w}_i$. Note that for this special case of linear value functions, we do not need to use the replay buffer in Algorithm 1. For analysis purposes, we divide each phase of length $\tau$ into blocks of size $B$ and let $\mathcal{H}_k$ ($\mathcal{H}'_k$) be the starting indices of odd (even) blocks in phase $k$. Due to the gaps between indices and fast mixing, this makes the data almost independent (we make this precise in Appendix C) and the error easier to analyze. In practice, one may simply want to use all data.
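A sketch of the per-phase least-squares Monte Carlo estimate (6.1)–(6.2) and of the weight averaging, using all time steps of a phase rather than the odd/even block split (in line with the "in practice" remark above); the small ridge term is our addition for numerical stability:

```python
import numpy as np

def phase_weight_estimate(phis, rewards, n, ridge=1e-6):
    """Least-squares Monte Carlo weights for one phase (sketch of Eqs. (6.1)-(6.2)).

    phis:    (tau, d) features phi(s_t, a_t) for the phase.
    rewards: (tau,) observed rewards for the phase.
    n:       horizon of the n-step empirical returns.
    """
    lam_hat = rewards.mean()                     # empirical average reward of the phase
    tau, d = phis.shape
    idx = np.arange(tau - n + 1)                 # time steps with a full n-step window
    returns = np.array([np.sum(rewards[t:t + n] - lam_hat) for t in idx])
    X = phis[idx]
    A = X.T @ X + ridge * np.eye(d)
    return np.linalg.solve(A, X.T @ returns)

def average_q_weights(phase_weights):
    """Average of per-phase weight vectors, i.e. the linear estimate of the average Q."""
    return np.mean(phase_weights, axis=0)
```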
For a distribution $\nu$, let $\|\cdot\|_{\nu}$ denote the distribution-weighted norm such that $\|f\|_{\nu}^2 = \mathbb{E}_{(s,a) \sim \nu}\big[f(s, a)^2\big]$. Using Jensen’s inequality, we have that $\mathbb{E}_{\nu}|f| \le \|f\|_{\nu}$. Thus, it suffices to bound the Q-function error in the distribution-weighted norm, $\|\bar{Q}_k - \widehat{\bar{Q}}_k\|_{\nu^*}$. Furthermore, given bounded features,
$$\big\|\bar{Q}_k - \widehat{\bar{Q}}_k\big\|_{\nu^*} \le C_{\Phi}\,\big\|\bar{w}_k - \widehat{\bar{w}}_k\big\|_2, \qquad \text{where } \bar{w}_k = \frac{1}{k}\sum_{i=1}^{k} w_i,$$
so it suffices to bound the error in the weights. We bound this error in the following lemma, proven in Appendix C.
Lemma 6.3 (Estimation error for linear functions).
Furthermore, in Appendix D, we show that the error in the average state-value function satisfies the following:
We have demonstrated that Assumption 5.2 can be satisfied with linear value functions. For Assumption 5.1, it suffices for the weight estimates to be bounded. This will be true, since we assume that the true weights are bounded and we can bound the error in the weight space. Thus Politex has an $O(\sqrt{T})$ regret in this setting (though note that we incur an extra dimension-dependent factor coming from the error bound).
7 Practical Implementation
As mentioned earlier, the key idea in our policy update is to obtain an estimate of the average of all the Q-functions in previous phases. We have seen that when we use linear functions to approximate the Q-functions, we can simply average the weights in order to get an estimate of the average Q-function. However, in practice, we often need to use non-linear functions, especially neural networks, to approximate complex Q-functions. In this section, we discuss how to efficiently implement our algorithm with non-linear function approximation.
7.1 Weight Averaging
The simplest idea may be averaging the weights of the neural networks. However, a crucial difference from the linear setting is that averaging the weights of neural networks is not equivalent to averaging the functions that they represent: the function that a neural network represents is invariant to permutations of its hidden units, and thus two networks with very different weights can represent similar functions. Therefore, this implementation may only succeed when all the Q-function approximations lie in the same local region of weight space. Thus, when we learn the new Q-function in phase $k$, we should initialize its weights with the running-average weights from phase $k-1$, run SGD on the new data, and then average the resulting weights with the previous running average.
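A minimal sketch of this scheme, assuming network parameters are exposed as a dictionary of arrays (our own convention):

```python
def running_weight_average(avg_params, new_params, k):
    """Update the running average of network weights after phase k (1-indexed).

    avg_params, new_params: dicts mapping parameter names to arrays of equal shape.
    new_params should come from a network that was initialized at avg_params before
    being trained on the new phase, so both weight sets stay in the same local region
    and averaging them remains meaningful.
    """
    return {name: ((k - 1) * avg_params[name] + new_params[name]) / k
            for name in avg_params}
```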
7.2 Experience Replay
Another natural idea is to leverage the data in the replay buffer to obtain an estimate of the average Q-function. We elaborate on the details below. We use a simple $n$-step Monte Carlo estimate for the value of each state-action pair: for any $(s_t, a_t)$ collected in phase $k$, we estimate its state-action value by the $n$-step cumulative reward $\widehat{R}_t = \sum_{i=t}^{t+n-1}\big(r(s_i, a_i) - \widehat{\lambda}_k\big)$, where $\widehat{\lambda}_k$ is the empirical average reward in phase $k$. This is a practical implementation of Eq. (6.2), i.e., we do not split the data in each phase into blocks.
In the following, we denote by $m$ the maximum number of data points that we can collect in every phase. At the end of each phase $k$, we store all or a subset of the tuples in $D_k$ in our replay buffer $\mathcal{R}$. We extract a feature $\phi(s, a)$ for each state-action pair and let $\mathcal{F}$ be a class of functions that we use to approximate the Q-functions. For phase $k$, we propose the following method to estimate $Q_k$: $\widehat{Q}_k = \arg\min_{f \in \mathcal{F}} \ell_k(f)$, where
$$\ell_k(f) = \frac{1}{|D_k|} \sum_{(s_t, a_t, \widehat{R}_t) \in D_k} \big(f(\phi(s_t, a_t)) - \widehat{R}_t\big)^2. \qquad (7.1)$$
Suppose that we store all the data from the previous phases in the replay buffer. Then, in order to estimate the average of the Q-functions of the first $k$ phases, i.e., $\bar{Q}_k = \frac{1}{k}\sum_{i=1}^{k} Q_i$, we propose to use the heuristic that minimizes the average of the squared loss functions defined in Eq. (7.1), i.e., $\widehat{\bar{Q}}_k = \arg\min_{f \in \mathcal{F}} \frac{1}{k}\sum_{i=1}^{k} \ell_i(f)$.
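A sketch of this averaged objective, assuming each stored phase contributes a (features, returns) pair and `q_values_fn` is the current function approximator's prediction routine (names are ours):

```python
import numpy as np

def average_replay_loss(q_values_fn, phase_buffers):
    """Average of the per-phase squared losses of Eq. (7.1), i.e. (1/k) sum_i ell_i(f).

    q_values_fn:   maps an array of state-action features to predicted Q-values.
    phase_buffers: list with one (features, returns) pair per stored phase.
    Minimizing this over the function class approximates the average Q-function.
    """
    per_phase = [np.mean((q_values_fn(feats) - rets) ** 2)
                 for feats, rets in phase_buffers]
    return float(np.mean(per_phase))
```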
Subsampling and coreset. In practice, due to the high memory cost, it may be hard to store all the data from the previous phases in the replay buffer. We found that a simple strategy for resolving this issue is to begin with storing all the data from every phase, and to add a limit on the size of the replay buffer. When the buffer size exceeds the limit, we evict a subset of the data uniformly at random.
Another approach is to sample a subset of size $m' < m$ from the data collected in each phase. Denote this subset by $B_i$ for phase $i$. Thus, in the $k$-th phase, we have the data $D_k$ from the current phase as well as the data $B_1, \ldots, B_{k-1}$ from the replay buffer. First, suppose that the $m'$ data points are sampled uniformly at random. We can then minimize the following objective:
$$\frac{1}{k}\bigg(\ell_k(f) + \sum_{i=1}^{k-1} \widehat{\ell}_i(f)\bigg), \qquad \widehat{\ell}_i(f) := \frac{1}{m'} \sum_{(s_t, a_t, \widehat{R}_t) \in B_i} \big(f(\phi(s_t, a_t)) - \widehat{R}_t\big)^2, \qquad (7.2)$$
where $\widehat{\ell}_i(f)$ is an unbiased estimate of $\ell_i(f)$. Further, uniform sampling is not the only way to construct an unbiased estimate. In fact, for any discrete distribution with PMF $p$ over the data in $D_i$, we can sample the $m'$ data points according to $p$ and construct
$$\widehat{\ell}_i(f) = \frac{1}{m'} \sum_{(s_t, a_t, \widehat{R}_t) \in B_i} \frac{1}{|D_i|\, p(s_t, a_t, \widehat{R}_t)} \big(f(\phi(s_t, a_t)) - \widehat{R}_t\big)^2 \qquad (7.3)$$
in order to obtain an unbiased estimate of $\ell_i(f)$. As shown by Bachem et al. (2017), by choosing $p$ proportional to the per-point squared error $\big(f(\phi(s_t, a_t)) - \widehat{R}_t\big)^2$, we can minimize the variance of $\widehat{\ell}_i(f)$ for any fixed $f$. In the following, we call the subset of data sampled according to this distribution a coreset of the data. In the experiments in Section 8, we show that thanks to the variance reduction effect, sampling a coreset often produces better performance than sampling a subset uniformly at random, especially when the rewards are sparse.
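A sketch of coreset subsampling and the corresponding unbiased weighted loss of Eq. (7.3); using the per-point squared error under the current estimate as the sampling probability is our reading of the variance-minimizing choice above and should be treated as an assumption:

```python
import numpy as np

def coreset_sample(feats, rets, q_values_fn, m_prime, rng=None):
    """Sample m_prime transitions from one phase with probability proportional to the
    per-point squared error, and return importance weights 1 / (N * p) so that the
    subsampled loss stays an unbiased estimate of the full per-phase loss."""
    rng = np.random.default_rng() if rng is None else rng
    errors = (q_values_fn(feats) - rets) ** 2 + 1e-8   # avoid zero probabilities
    p = errors / errors.sum()
    N = len(rets)
    idx = rng.choice(N, size=m_prime, replace=True, p=p)
    return feats[idx], rets[idx], 1.0 / (N * p[idx])

def weighted_phase_loss(q_values_fn, feats, rets, weights):
    """Unbiased estimate of the per-phase loss ell_i(f), as in Eq. (7.3)."""
    return float(np.mean(weights * (q_values_fn(feats) - rets) ** 2))
```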
Comparison to Abbasi-Yadkori et al. (2019). Our algorithm can be considered as an efficient implementation of Politex (Abbasi-Yadkori et al., 2019) via experience replay. In the original Politex algorithm, the Q-functions are estimated using Eq. (7.1) for each phase, and all the functions are stored in memory. When an agent interacts with the environment, it needs to evaluate all the Q-functions in order to obtain the probability of each action. This implies that the time complexity of computing action probabilities increases with the number of phases, and as a result the algorithm is hard to scale to a large number of phases. Although we can choose to evaluate a random subset of the Q-functions to estimate the action probabilities in Politex, our implementation via experience replay can still be faster since we only need to evaluate a single function, trained with replay data, to take actions.
8 Experiments
In this section, we evaluate our implementations empirically. We make comparisons with several baselines in two control environments.
Environments. We use two control environments with the simulators described in Tassa et al. (2018). Both environments are episodic with a fixed episode length. The environments we evaluate are:
• Cart-pole (Barto et al., 1983): The goal of this environment is to balance an unactuated pole by applying forces to a cart at its base. We discretize the continuous force into a finite set of values. The reward at each time step is a real number in $[0, 1]$. We end episodes early when the pole falls (detected by the reward dropping below a threshold), and assume zero reward for the remaining steps when reporting results.
• Ball-in-cup: Here, an actuated planar receptacle can translate in the vertical plane in order to swing and catch a ball attached to its bottom. This task has a sparse reward: $1$ when the ball is in the cup, and $0$ otherwise. We discretize the two-dimensional continuous action space into a grid of discrete actions.
[Figure 1: Average reward on Cart-pole and Ball-in-cup, plotted (a) per training phase, (b) against wall-clock training time, and (c) for different replay subsampling strategies (uniform vs. coreset).]
Algorithms. We compare the following algorithms: our proposed implementation of Politex using experience replay, Politex using weight averaging, the original Politex algorithm, a variant of Politex that averages a random subset of the Q-functions, the mirror descent policy optimization (MDPO) algorithm (Tomar et al., 2020), constrained policy optimization (CPO) (Achiam et al., 2017), and the V-MPO algorithm (Song et al., 2019b).
For all the algorithms, we extract Fourier basis features (Konidaris et al., 2011) from the raw observations of both environments, using a different number of bases for each environment. For the variants of Politex, we construct the state-action features by block one-hot encoding: we partition the feature vector into $|\mathcal{A}|$ blocks, set the block corresponding to the chosen action to the Fourier features of the state, and set all other blocks to zero. We approximate Q-functions using neural networks with one hidden layer and ReLU activation, with the hidden-layer width chosen per environment. MDPO, CPO, and V-MPO use a policy network whose input and output are the state features and action probabilities, respectively; we again use one-hidden-layer networks. These three algorithms also need a value network, which takes the state features as input and outputs the value of the state under the current policy; for both environments, we use a one-hidden-layer network.
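A minimal sketch of the feature construction described above (our own simplification; the frequency matrix of the Fourier basis is left to the caller):

```python
import numpy as np

def fourier_features(obs, freq_matrix):
    """Fourier basis (Konidaris et al., 2011): cos(pi * c^T s) for each frequency row c.
    obs is assumed to be rescaled to [0, 1]^d."""
    return np.cos(np.pi * freq_matrix @ obs)

def block_one_hot(state_feat, action_idx, num_actions):
    """Block one-hot state-action features: the block of the chosen action holds the
    state features; all other blocks are zero."""
    d = state_feat.shape[0]
    out = np.zeros(num_actions * d)
    out[action_idx * d:(action_idx + 1) * d] = state_feat
    return out
```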
We choose the phase length $\tau$ separately for Cart-pole and Ball-in-cup. Since the environments are episodic, we have multiple episodes in each phase. For Cart-pole, since training usually does not require a large number of phases, we do not set a limit on the size of the replay buffer, whereas for Ball-in-cup we cap the buffer size. For our algorithm and the variants of Politex, we sweep the parameter $\eta$ over a grid of values and report the result of each algorithm using the value of $\eta$ that produces the best average reward. For MDPO, CPO, and V-MPO, we note that according to Eq. (4.1) the KL regularization coefficient between policies is the reciprocal of the parameter $\eta$ in our algorithm, so we run these baselines over the same range of values and report the best result. In addition, for CPO, we set the limit on the KL divergence between adjacent policies to a value that leads to good performance in our experiments. We treat the length $n$ of the Monte Carlo estimate as a hyperparameter, tune it over a small grid, and report the best performance.
Results. We run each algorithm in each environment multiple times and report the average results as well as the standard deviation across runs (shaded area). We report the average reward in each phase before training on the data collected during that phase. The results are presented in Figure 1. Every run is conducted on a single P100 GPU.
From Fig. 1(a), we can see that the iteration complexity (average reward vs. number of phases) and final performance of the experience replay implementation using all the data from past phases (blue curve) are similar to many state-of-the-art algorithms, such as Politex and V-MPO. Meanwhile, according to Fig. 1(b), the experience replay implementation achieves strong performance in terms of training speed, i.e., the best performance as a function of wall-clock time on Cart-pole and the second best on Ball-in-cup. We note here that the training time includes both the time for data collection, i.e., interacting with the environment simulator, and the training of a new policy. The observation that the experience-replay-based implementation is faster than the original Politex and the variant that averages a random subset of the Q-functions can be explained by the fact that the replay-based algorithm only uses a single Q-function and thus is faster at data collection, as discussed in Section 7.2. In addition, the fact that the replay-based algorithm runs faster than MDPO, CPO, and V-MPO can be explained by its simplicity: in variants of Politex, we only need to approximate the Q-function, whereas in MDPO, CPO, and V-MPO, we need to train both the policy and value networks.
The Politex weight averaging scheme achieves the best performance on Ball-in-cup in both iteration complexity and training speed, but does not converge to a solution that is sufficiently close to the optimum on Cart-pole. Notice that for weight averaging, we used the initialization technique mentioned in Section 7.1. Considering the consistent strong performance of experience replay in both environments, we still recommend applying experience replay when using non-linear function approximation.
In Fig. 1(c), we can see that when we only store a small subset of the data from each phase in the replay buffer, the coreset technique described in Section 7.2 achieves better performance due to the variance reduction effect. This improvement is quite significant in the Ball-in-cup environment, which has sparse rewards. Notice that sampling a coreset adds only negligible computational overhead during training, and thus we recommend the coreset technique when there is a strict memory constraint.
9 Discussion
The main contributions of our work are an improved analysis and a practical implementation of Politex. On the theoretical side, we show that Politex obtains an $O(\sqrt{T})$ high-probability regret bound in uniformly mixing average-reward MDPs with linear function approximation, which is the first such bound for a computationally efficient algorithm. The main limitation of these results is that, similarly to previous works, they hold under somewhat strong assumptions which circumvent the need for exploration. An interesting future work direction would be to relax these assumptions and incorporate explicit exploration instead.
On the practical side, we propose an efficient implementation with neural networks that relies on experience replay, a standard tool in modern deep RL. Our work shows that experience replay and KL regularization can both be viewed as approximately implementing mirror descent policy updates within a policy iteration scheme. This provides an online-learning justification for using replay buffers in policy iteration, which is different from the standard explanation for their success. Our work also suggests a new objective for storing and prioritizing replay samples, with the goal of approximating the average of past value functions well. This goal has some similarities with continual learning, where experience replay has also led to empirical successes. One interesting direction for future work would be to explore other continual learning techniques in the approximate implementation of mirror descent policy updates.
Acknowledgements
We would like to thank Mehrdad Farajtabar, Dilan Görür, Nir Levine, and Ang Li for helpful discussions.
References
- Abbasi-Yadkori et al. (2019) Y. Abbasi-Yadkori, P. Bartlett, K. Bhatia, N. Lazic, C. Szepesvari, and G. Weisz. Politex: Regret bounds for policy iteration using expert prediction. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 3692–3702. PMLR, 09–15 Jun 2019.
- Abdolmaleki et al. (2018) A. Abdolmaleki, J. T. Springenberg, Y. Tassa, R. Munos, N. Heess, and M. Riedmiller. Maximum a posteriori policy optimisation. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=S1ANxQW0b.
- Achiam et al. (2017) J. Achiam, D. Held, A. Tamar, and P. Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31, 2017.
- Bachem et al. (2017) O. Bachem, M. Lucic, and A. Krause. Practical coreset constructions for machine learning. arXiv preprint arXiv:1703.06476, 2017.
- Bartlett (2009) P. L. Bartlett. Regal: A regularization based algorithm for reinforcement learning in weakly communicating mdps. In In Proceedings of the 25th Annual Conference on Uncertainty in Artificial Intelligence, 2009.
- Barto et al. (1983) A. G. Barto, R. S. Sutton, and C. W. Anderson. Neuronlike adaptive elements that can solve difficult learning control problems. IEEE Transactions on Systems, Man, and Cybernetics, (5):834–846, 1983.
- Borsos et al. (2020) Z. Borsos, M. Mutnỳ, and A. Krause. Coresets via bilevel optimization for continual learning and streaming. arXiv preprint arXiv:2006.03875, 2020.
- Cao (1999) X.-R. Cao. Single sample path-based optimization of Markov chains. Journal of optimization theory and applications, 100(3):527–548, 1999.
- Cesa-Bianchi and Lugosi (2006) N. Cesa-Bianchi and G. Lugosi. Prediction, learning, and games. Cambridge university press, 2006.
- Chaudhry et al. (2018) A. Chaudhry, M. Ranzato, M. Rohrbach, and M. Elhoseiny. Efficient lifelong learning with a-gem. arXiv preprint arXiv:1812.00420, 2018.
- Cho and Meyer (2001) G. E. Cho and C. D. Meyer. Comparison of perturbation bounds for the stationary distribution of a Markov chain. Linear Algebra and its Applications, 335(1-3):137–150, 2001.
- Degrave et al. (2019) J. Degrave, A. Abdolmaleki, J. T. Springenberg, N. Heess, and M. Riedmiller. Quinoa: a Q-function you infer normalized over actions. arXiv preprint arXiv:1911.01831, 2019.
- Even-Dar et al. (2009) E. Even-Dar, S. M. Kakade, and Y. Mansour. Online Markov decision processes. Mathematics of Operations Research, 34(3):726–736, 2009.
- Farajtabar et al. (2020) M. Farajtabar, N. Azizan, A. Mott, and A. Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773, 2020.
- Fruit et al. (2018) R. Fruit, M. Pirotta, A. Lazaric, and R. Ortner. Efficient bias-span-constrained exploration-exploitation in reinforcement learning. In International Conference on Machine Learning, 2018.
- Hao et al. (2020) B. Hao, N. Lazic, Y. Abbasi-Yadkori, P. Joulani, and C. Szepesvari. Provably efficient adaptive approximate policy iteration. arXiv preprint arXiv:2002.03069, 2020.
- Heess et al. (2015) N. Heess, G. Wayne, D. Silver, T. Lillicrap, T. Erez, and Y. Tassa. Learning continuous control policies by stochastic value gradients. Advances in Neural Information Processing Systems, 28:2944–2952, 2015.
- Jaksch et al. (2010) T. Jaksch, R. Ortner, and P. Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11(Apr):1563–1600, 2010.
- Jian et al. (2019) Q. Jian, R. Fruit, M. Pirotta, and A. Lazaric. Exploration bonus for regret minimization in discrete and continuous average reward mdps. In Advances in Neural Information Processing Systems, pages 4891–4900, 2019.
- Jin et al. (2019) C. Jin, P. Netrapalli, R. Ge, S. M. Kakade, and M. I. Jordan. A short note on concentration inequalities for random vectors with subgaussian norm. arXiv preprint arXiv:1902.03736, 2019.
- Kakade and Langford (2002) S. Kakade and J. Langford. Approximately optimal approximate reinforcement learning. In ICML, volume 2, pages 267–274, 2002.
- Kirkpatrick et al. (2017) J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A. A. Rusu, K. Milan, J. Quan, T. Ramalho, A. Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national academy of sciences, 114(13):3521–3526, 2017.
- Konidaris et al. (2011) G. Konidaris, S. Osentoski, and P. Thomas. Value function approximation in reinforcement learning using the fourier basis. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 25, 2011.
- Lillicrap et al. (2016) T. P. Lillicrap, J. J. Hunt, A. Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra. Continuous control with deep reinforcement learning. In ICLR (Poster), 2016.
- Lin (1992) L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine learning, 8(3-4):293–321, 1992.
- Lopez-Paz and Ranzato (2017) D. Lopez-Paz and M. Ranzato. Gradient episodic memory for continual learning. In Advances in neural information processing systems, pages 6467–6476, 2017.
- Mnih et al. (2013) V. Mnih, K. Kavukcuoglu, D. Silver, A. Graves, I. Antonoglou, D. Wierstra, and M. Riedmiller. Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602, 2013.
- Mnih et al. (2015) V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. Human-level control through deep reinforcement learning. nature, 518(7540):529–533, 2015.
- Neu et al. (2014) G. Neu, A. György, C. Szepesvári, and A. Antos. Online Markov decision processes under bandit feedback. IEEE Transactions on Automatic Control, 59(3):676–691, December 2014.
- Ouyang et al. (2017) Y. Ouyang, M. Gagrani, A. Nayyar, and R. Jain. Learning unknown Markov decision processes: A Thompson sampling approach. In Advances in Neural Information Processing Systems, pages 1333–1342, 2017.
- Rusu et al. (2016) A. A. Rusu, N. C. Rabinowitz, G. Desjardins, H. Soyer, J. Kirkpatrick, K. Kavukcuoglu, R. Pascanu, and R. Hadsell. Progressive neural networks. arXiv preprint arXiv:1606.04671, 2016.
- Schaul et al. (2016) T. Schaul, J. Quan, I. Antonoglou, and D. Silver. Prioritized experience replay. In International Conference on Learrning Representations (ICLR), 2016.
- Schulman et al. (2015) J. Schulman, S. Levine, P. Abbeel, M. Jordan, and P. Moritz. Trust region policy optimization. In International conference on machine learning, pages 1889–1897, 2015.
- Schulman et al. (2017) J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- Seneta (1979) E. Seneta. Coefficients of ergodicity: structure and applications. Advances in applied probability, pages 576–590, 1979.
- Seneta (1988) E. Seneta. Perturbation of the stationary distribution measured by ergodicity coefficients. Advances in Applied Probability, 20(1):228–230, 1988.
- Song et al. (2019a) H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, N. Heess, D. Belov, M. A. Riedmiller, and M. M. Botvinick. V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. CoRR, abs/1909.12238, 2019a. URL http://arxiv.org/abs/1909.12238.
- Song et al. (2019b) H. F. Song, A. Abdolmaleki, J. T. Springenberg, A. Clark, H. Soyer, J. W. Rae, S. Noury, A. Ahuja, S. Liu, D. Tirumala, et al. V-MPO: on-policy maximum a posteriori policy optimization for discrete and continuous control. arXiv preprint arXiv:1909.12238, 2019b.
- Talebi and Maillard (2018) M. S. Talebi and O.-A. Maillard. Variance-aware regret bounds for undiscounted reinforcement learning in mdps. In Algorithmic Learning Theory, 2018.
- Tassa et al. (2018) Y. Tassa, Y. Doron, A. Muldal, T. Erez, Y. Li, D. d. L. Casas, D. Budden, A. Abdolmaleki, J. Merel, A. Lefrancq, et al. DeepMind control suite. arXiv preprint arXiv:1801.00690, 2018.
- Tomar et al. (2020) M. Tomar, L. Shani, Y. Efroni, and M. Ghavamzadeh. Mirror descent policy optimization. arXiv preprint arXiv:2005.09814, 2020.
- Tropp (2012) J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of computational mathematics, 12(4):389–434, 2012.
- Vieillard et al. (2020a) N. Vieillard, T. Kozuno, B. Scherrer, O. Pietquin, R. Munos, and M. Geist. Leverage the average: an analysis of regularization in rl. arXiv preprint arXiv:2003.14089, 2020a.
- Vieillard et al. (2020b) N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. In International Conference on Artificial Intelligence and Statistics, pages 2529–2538, 2020b.
- Wei et al. (2020a) C.-Y. Wei, M. Jafarnia-Jahromi, H. Luo, and R. Jain. Learning infinite-horizon average-reward mdps with linear function approximation. arXiv preprint arXiv:2007.11849, 2020a.
- Wei et al. (2020b) C.-Y. Wei, M. J. Jahromi, H. Luo, H. Sharma, and R. Jain. Model-free reinforcement learning in infinite-horizon average-reward Markov decision processes. In International Conference on Machine Learning, pages 10170–10180. PMLR, 2020b.
- Yin et al. (2020) D. Yin, M. Farajtabar, and A. Li. SOLA: Continual learning with second-order loss approximation. arXiv preprint arXiv:2006.10974, 2020.
- Yu (1994) B. Yu. Rates of convergence for empirical processes of stationary mixing sequences. The Annals of Probability, 22(1):94–116, 1994.
- Zenke et al. (2017) F. Zenke, B. Poole, and S. Ganguli. Continual learning through synaptic intelligence. Proceedings of machine learning research, 70:3987, 2017.
Appendix
Appendix A Preliminaries
We state some useful definitions and lemmas in this section.
Lemma A.1.
Let $\nu_1$ and $\nu_2$ be a pair of distribution vectors. Let $P$ be the transition matrix of an ergodic Markov chain with a stationary distribution $\mu$, and ergodicity coefficient (defined in Assumption 2.1) upper-bounded by $\kappa$. Then
$$\big\|(\nu_1 - \nu_2)^{\top} P\big\|_1 \le \kappa\, \|\nu_1 - \nu_2\|_1.$$
Proof.
Let be the normalized left eigenvectors of corresponding to ordered eigenvalues . Then , , and for all , we have that (since the chain is ergodic) and . Write in terms of the eigenvector basis as:
Since and , it is easy to see that . Thus we have
where the inequality follows from the definition of the ergodicity coefficient and the fact that for all . Since
the inequality also holds for powers of . ∎
Lemma A.2 (Doob martingale).
Let Assumption 2.1 hold, and let be the state-action sequence obtained when following policies for steps each from an initial distribution . For , let be a binary indicator vector with a non-zero element at the linear index of the state-action pair . Define for ,
Then, is a vector-valued martingale: for , and holds for .
The constructed martingale is known as the Doob martingale underlying the sum .
Proof.
That is a martingale follows from the definition. We now bound its difference sequence. Let be the state-action transition matrix at time , and let , and define . Then, for , and by the Markov property, for any ,
For any ,
(A.1) |
Let be a filtration and define . We will make use of the following concentration results for the sum of random matrices and vectors.
Theorem A.3 (Matrix Azuma, Tropp (2012) Thm 7.1).
Consider a finite -adapted sequence of Hermitian matrices of dimension , and a fixed sequence of Hermitian matrices that satisfy and almost surely. Let . Then with probability at least , .
A version of Theorem A.3 for non-Hermitian matrices of dimension can be obtained by applying the theorem to a Hermitian dilation of , , which satisfies and . In this case, we have that .
Lemma A.4 (Hoeffding-type inequality for norm-subGaussian random vectors, Jin et al. (2019)).
Consider random vectors and corresponding filtrations , such that is zero-mean norm-subGaussian with . That is:
If the condition is satisfied for fixed , there exists a constant such that for any , with probability at least ,
Appendix B Bounding the Difference Between Empirical and Average Rewards
In this section, we bound the second term in Equation 5.1, corresponding to the difference between empirical and average rewards.
Lemma B.1.
Let Assumption 2.1 hold, and assume that and that for all . Then, by choosing , we have with probability at least ,
Proof.
Let denote the vector of rewards, and recall that . Let be the indicator vector for the state-action pair at time , as in Lemma A.2, and let . We have the following:
We slightly abuse the notation above by letting denote the state-action distribution at time , and the stationary distribution of policy . Let be the Doob martingale in Lemma A.2. Then and , and the first term can be expressed as
By Lemma A.2, . Hence by Azuma’s inequality, with probability at least ,
(B.1) |
For the second term we have
For , the first term is upper-bounded by .
Using results on perturbations of Markov chains (Seneta, 1988; Cho and Meyer, 2001), we have that
Note that the policies are generated by running mirror descent on reward functions . A well-known property of mirror descent updates with entropy regularization (or equivalently, the exponentially-weighted-average algorithm) is that the difference between consecutive policies is bounded as
See e.g. Neu et al. (2014) Section V.A for a proof, which involves applying Pinsker’s inequality and Hoeffding’s lemma (Cesa-Bianchi and Lugosi (2006) Section A.2 and Lemma A.6). Since we assume that , we can obtain
By choosing , we can bound the second term as
(B.2) |
Appendix C Proof of Lemma 6.3
Proof.
Recall that we split each phase into blocks of size and let and denote the starting indices of odd and even blocks, respectively. We let denote the empirical -step returns from the state action pair in phase :
We start by bounding the error in . Let be a binary indicator vector for a state-action pair . Let be the state-action transition kernel for policy , and let be the corresponding stationary state-action distribution. We can write the action-value function at as
Let be a version of truncated to steps. Under uniform mixing, the difference to the true is bounded as
(C.1) |
Let denote the truncation bias at time , and let denote the reward noise. We will write
Note that and let
We estimate the value function of each policy using data from phase as
Our estimate of can thus be written as follows:
We proceed to upper-bound the norm of the RHS.
Set . Let be an upper-bound on the norm of the true value-function weights for . In Appendix C.3, we show that with probability at least , for , . Thus with probability at least , the last error term is upper-bounded as
(C.2) |
Similarly, for
(C.3) |
the norm of the truncation bias term is upper-bounded as
(C.4) |
To bound the error terms corresponding to reward noise and average-error noise , we rely on the independent blocks techniques of Yu (1994). We show in Sections C.1 and C.2 that with probability , for constants and , each of these terms can be bounded as:
Thus, putting terms together, we have for an absolute constant , with probability at least ,
Note that this result holds for every and thus also holds for . ∎
C.1 Bounding
Let $\|\cdot\|_{\mathrm{TV}}$ denote the total variation norm.
Definition C.1 (-mixing).
Let be a stochastic process. Denote by the collection , where we allow . Let denote the sigma-algebra generated by (). The -mixing coefficient of , , is defined by
is said to be -mixing if as . In particular, we say that a -mixing process mixes at an exponential rate with parameters if holds for all .
Let be the indicator vector for the state-action pair as in Lemma A.2. Note that the distribution of given can be written as . Let be the state-action transition matrix at time , let , and define . Then we have that and , where is the initial state distribution. Thus, under the uniform mixing Assumption 2.1, the -mixing coefficient is bounded as:
We bound the noise terms using the independent blocks technique of Yu (1994). Recall that we partition each phase into blocks of size . Thus, after phases we have a total of blocks. Let denote the joint distribution of state-action pairs in odd blocks. Let denote the set of indices in the block, and let denote the corresponding states and actions. We factorize the joint distribution according to blocks:
Let be the marginal distribution over the variables in block , and let be the product of marginals of odd blocks.
Corollary 2.7 of Yu (1994) implies that for any Borel-measurable set ,
(C.5) |
where is the -mixing coefficient of the process. The result follows since the size of the “gap” between successive blocks is ; see Appendix E for more details.
Recall that our estimates are based only on data in odd blocks in each phase. Let denote the expectation w.r.t. the product-of-marginals distribution . Then because for and under , is zero-mean given and is independent of other feature vectors outside of the block. Furthermore, by Hoeffding’s inequality . Since and for large enough , we have that
Since are norm-subGaussian vectors, using Lemma A.4, there exists a constant such that for any
C.2 Bounding
Recall that the average-reward estimates are computed using time indices corresponding to the starts of even blocks, . Thus this error term is only a function of the indices corresponding to block starts. Now let denote the distribution over state-action pairs for indices corresponding to block starts, i.e. . We again factorize the distribution over blocks as . Let be a product-of-marginals distribution defined as follows. For odd , let be the marginal of over . For even in phase , let correspond to the stationary distribution of the corresponding policy . Using arguments similar to independent blocks, we show in Appendix E that
Let denote expectation w.r.t. the product-of-marginals distribution . Then , since under , is the sum of rewards for state-action pairs distributed according to , and these state-action pairs are independent of other data. Using a similar argument as in the previous section, for , there exists a constant such that with probability at least ,
C.3 Bounding
In this subsection, we show that with probability at least , for , .
Let be a matrix of all features. Let , and let , where is a state-action indicator as in Lemma A.2. Let . We can write as
Thus
For , the above norm is upper-bounded by .
C.4 Bounding
For any matrix ,
(C.7) |
where denotes the sum of absolute entries of . Using the same notation for as in Lemma A.2,
Under the fast-mixing assumption 2.1, the second term is bounded by .
For the first term, we can define a martingale similar to the Doob martingale in Lemma A.2, but defined only on the indices . Note that . Thus we can use matrix-Azuma to bound the difference sequence. Given that
combining the two terms, we have that with probability at least ,
Appendix D Bounding
We write the value function error as follows:
Note that for any set of scalars with , the term has the same upper bound as . The reason is as follows. One part of the error includes bias terms (C.2) and (C.4), whose upper bounds are only smaller when reweighted by scalars in . Thus we can simply upper-bound the bias by setting all to 1. Another part of the error, analyzed in Appendices C.1 and C.2 involves sums of norm-subGaussian vectors. In this case, applying the weights only results in these vectors potentially having smaller norm bounds. We keep the same bounds for simplicity, again corresponding to all equal to 1. Thus, reusing the results of the previous section, we have
Appendix E Independent Blocks
Blocks. Recall that we partition each phase into blocks of size . Thus, after phases we have a total of blocks. Let denote the joint distribution of state-action pairs in odd blocks. Let denote the set of indices in the block, and let denote the corresponding states and actions. We factorize the joint distribution according to blocks:
Let be the marginal distribution over the variables in block , and let be the product of marginals. Then the difference between the distributions and can be written as
Under -mixing, since the gap between the blocks is of size , we have that
Thus the difference between the joint distribution and the product of marginals is bounded as
Block starts. Now let denote the distribution over state-action pairs for indices corresponding to block starts, i.e. . We again factorize the distribution over blocks:
Define a product-of-marginals distribution over the block-start variables as follows. For odd , let be the marginal of over . For even in phase , let correspond to the stationary distribution of the policy . Using the same notation as in Appendix A, let be the indicator vector for and let be the product of state-action transition matrices at times . For odd blocks , we have
Slightly abusing notation, let be the state-action transition matrix under policy . For even blocks in phase , since they always follow an odd block in the same phase,
Thus, using a similar distribution decomposition as before, we have that .