Non-asymptotic Convergence of Adam-type Reinforcement Learning Algorithms under Markovian Sampling
Abstract
Despite the wide applications of Adam in reinforcement learning (RL), the theoretical convergence of Adam-type RL algorithms has not been established. This paper provides the first such convergence analysis for two fundamental RL algorithms of policy gradient (PG) and temporal difference (TD) learning that incorporate AMSGrad updates (a standard alternative to Adam in theoretical analyses), referred to as PG-AMSGrad and TD-AMSGrad, respectively. Moreover, our analysis focuses on Markovian sampling for both algorithms. We show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize converges to a neighborhood of a stationary point at the rate of (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at the rate of . Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at the rate of , and with a diminishing stepsize converges exactly to the global optimum at the rate of . Our study develops new techniques for analyzing Adam-type RL algorithms under Markovian sampling.
1 Introduction
Reinforcement learning (RL) studies how an agent learns a policy by interacting with its environment to maximize the cumulative reward of a task. RL has achieved tremendous success in various applications such as playing video games Mnih et al., (2013), bipedal walking Castillo et al., (2019), and online advertising Pednault et al., (2002), to name a few. In general, there are two widely used classes of RL algorithms: policy-based methods and value-function-based methods.
For the first class, policy gradient (PG) Sutton et al., (2000) is a basic algorithm that has motivated many advanced policy-based algorithms, including actor-critic Konda and Tsitsiklis, (2000), DPG Silver et al., (2014), TRPO Schulman et al., (2015), PPO Schulman et al., (2017), etc. The idea of the vanilla PG Sutton et al., (2000) is to parameterize the policy and optimize a target cumulative reward function by (stochastic) gradient descent. Both asymptotic convergence and finite-time (i.e., non-asymptotic) analyses have been established for PG in various scenarios, which will be further discussed in Section 1.2.
For the second class of value-function-based algorithms, temporal difference (TD) learning Sutton, (1988) is a fundamental algorithm that has motivated more advanced algorithms such as Q-learning Watkins and Dayan, (1992), SARSA Rummery and Niranjan, (1994), etc. The vanilla TD Sutton, (1988) typically parameterizes the value function of an unknown policy and iteratively finds the true value function or its estimator by following the (projected) Bellman operation, which can be interpreted as analogous to a stochastic gradient descent (SGD) update. Both asymptotic convergence and finite-time analyses have been established for TD in various scenarios, which will be discussed in Section 1.2.
Despite extensive exploration, all the existing theoretical studies of PG and TD have focused on algorithms that adopt SGD-type updates without adaptation of the stepsize. In practice, however, the adaptive momentum estimation (Adam) method Kingma and Ba, (2015) has been more commonly used in RL Bello et al., (2017); Stooke and Abbeel, (2018). It is so far unclear whether RL algorithms that incorporate Adam-type updates have provable convergence guarantees. The goal of this paper is to provide the first non-asymptotic analysis that characterizes the convergence rate of Adam-type PG and TD algorithms. In addition, we develop new technical tools to analyze Adam-type algorithms under Markovian sampling. Such analysis is not available in the existing studies of Adam-type algorithms in optimization, which usually assume independent and identically distributed (i.i.d.) sampling.
1.1 Our Contribution
The main contribution of the paper lies in establishing the first non-asymptotic analysis for the Adam-type PG and TD algorithms under the Markovian sampling model.
From a technical standpoint, although the existing analysis of Adam-type algorithms in conventional optimization provides useful tools, analyzing such algorithms in RL problems poses new challenges. (1) Samples in TD learning are drawn from an unknown transition kernel in a non-i.i.d. manner, and hence a bias arises in the gradient estimation. Such a bias also depends on the AMSGrad updates, which further complicates the analysis. We develop techniques to bound the bias by the stepsize even under adaptive momentum updates. (2) The sampling process in PG (with a nonlinear approximation architecture) is not only non-i.i.d., but also follows time-varying parametric policies. Thus, it is even more challenging to bound such a bias under the AMSGrad updates. In response, we develop new techniques to characterize the bias error, which can also be controlled by the stepsize.
As for the results, we provide the first convergence guarantee for RL algorithms (including PG and TD) incorporating the update rule of AMSGrad (referred to as PG-AMSGrad and TD-AMSGrad, respectively), and our techniques also lead to an improved result for the vanilla PG. (1) First, we show that under general nonlinear function approximation, PG-AMSGrad with a constant stepsize (here, the "stepsize" refers to the basic stepsize $\alpha$ in the AMSGrad update (4); the overall learning rate of the algorithm is determined by $\alpha$, the hyper-parameters $\beta_1$ and $\beta_2$, and the first and second moments of the gradients as given in (1)-(4), and is hence adaptive during the AMSGrad iterations) converges to a neighborhood of a stationary point at a rate of (where $T$ denotes the number of iterations), and with a diminishing stepsize converges exactly to a stationary point at a rate of . (2) Furthermore, under linear function approximation, TD-AMSGrad with a constant stepsize converges to a neighborhood of the global optimum at a rate of , and with a diminishing stepsize converges exactly to the global optimum at a rate of . (3) In addition, by adapting our technical tools to analyze the vanilla PG with the SGD update under Markovian sampling, we obtain an orderwise better computational complexity than the existing works.
1.2 Related Work
Given the rapidly growing body of theoretical studies on RL, we review only the most relevant works below.
Convergence analysis of PG: Asymptotic convergence of PG based on stochastic approximation (SA) was established in Williams, (1992); Baxter and Bartlett, (2001); Sutton et al., (2000); Kakade, (2002); Pirotta et al., (2015); Tadić et al., (2017). In specific RL problems such as the LQR, PG has been proved to converge to the optimal policy Fazel et al., (2018); Malik et al., (2019); Tu and Recht, (2019). Under specific policy function approximation classes, Bhandari and Russo, (2019); Agarwal et al., (2019) also showed that PG can find the optimal policy. Under general nonconvex approximation, Shen et al., (2019); Papini et al., (2018, 2019); Xu et al., 2019a; Xu et al., (2020) characterized the convergence rate of PG and variance-reduced PG to a stationary point in the finite-horizon scenario, and Zhang et al., (2019); Karimi et al., (2019) provided the convergence rate of PG in the infinite-horizon scenario. Wang et al., (2020) studied PG with neural network function approximation in an overparameterized regime. Convergence analysis has also been established for variants of PG, such as actor-critic algorithms (Bhatnagar et al., 2008, 2009; Bhatnagar, 2010; Luo et al., 2019; Yang et al., 2019; Fu et al., 2020), TRPO (Shani et al., 2020), TRPO/PPO (Liu et al., 2019), etc. This paper studies the infinite-horizon scenario, but focuses on the Adam-type PG, which has not been studied in the literature. Our analysis is also applicable to other variants of PG.
Convergence analysis of TD: Originally proposed in Sutton, (1988), TD learning with function approximation aroused great interest in analyzing its convergence. While a general TD may not converge as pointed out in Baird, (1995); Györfi and Walk, (1996), the authors in Tsitsiklis and Van Roy, (1997) provided conditions to ensure asymptotic convergence of TD with linear function approximation under i.i.d. sampling. Other results on asymptotic convergence using the tools from linear SA were provided in Kushner and Yin, (2003); Benveniste et al., (2012). Non-asymptotic convergence was established for TD under i.i.d. sampling in, e.g., Dalal et al., (2018); Bhandari et al., (2018); Lakshminarayanan and Szepesvari, (2018), and under Markovian sampling in, e.g., Bhandari et al., (2018); Srikant and Ying, (2019); Hu and Syed, (2019). The convergence rate of TD with nonlinear function approximation has recently been studied in Cai et al., (2019) for overparameterized neural networks using i.i.d. samples. In contrast to the aforementioned work on TD with the SGD updates, this paper studies Adam-type TD under Markovian sampling.
Adaptive reinforcement learning algorithms: Adaptivity has been applied to RL algorithms to improve their performance. (Shani et al., 2020) used an adaptive proximity term to study the convergence of TRPO. An adaptive batch size was adopted to improve the policy performance (Papini et al., 2017) and to reduce the variance (Ji et al., 2019) of PG. The aforementioned papers did not study how adaptive learning rates affect the performance of PG or TD. More recently, a concurrent work (Sun et al., 2020), which was posted a few days after our paper, provided an analysis of TD(0) and TD($\lambda$) when incorporating adaptive gradient descent (AdaGrad) updates. In contrast, this paper focuses on a more popular class of adaptive algorithms, namely Adam-type methods, and provides the first convergence guarantee when they are applied to PG and TD.
Convergence analysis of Adam-type algorithms in conventional optimization: Adam was proposed in Kingma and Ba, (2015) for speeding up the training of deep neural networks, but the vanilla Adam was shown not to converge in Reddi et al., (2018). Instead, AMSGrad was proposed as a slightly modified version to justify the theoretical performance of Adam, and its regret bounds were characterized in Reddi et al., (2018); Tran and Phong, (2019) for online convex optimization. Recently, Adam/AMSGrad was proved to converge to a stationary point for certain general nonconvex optimization problems Zou et al., 2019a; Zhou et al., (2018); Chen et al., 2019a. Our study provides the first convergence guarantee for Adam-type algorithms in RL settings, where Markovian sampling poses the key difference and challenge in our analysis compared with conventional optimization.
2 Preliminary
In this section, we provide the necessary background for the problems that we study in this paper.
2.1 Markov Decision Process
We consider the standard RL setting, where an agent interacts with a (possibly stochastic) environment (e.g., a process or system dynamics). This interaction is usually modeled as a discrete-time discounted Markov decision process (MDP), described by a tuple $(\mathcal S, \mathcal A, \mathsf P, r, \gamma, \xi)$, where $\mathcal S$ is the state space, $\mathcal A$ is the action space, and $\mathsf P$ is the probability kernel for the state transitions, e.g., $\mathsf P(s'\,|\,s,a)$ denotes the probability distribution of the next state $s'$ given the current state $s$ and action $a$. In addition, $r:\mathcal S\times\mathcal A\to\mathbb R$ is the reward function mapping state-action pairs to a bounded subset of $\mathbb R$, $\gamma\in(0,1)$ is the discount factor, and $\xi$ denotes the initial state distribution. The agent's decision is captured by the policy $\pi(\cdot\,|\,s)$, which characterizes the density function over the action space at the state $s$. We denote $\mu_\pi$ as the stationary distribution of the transition kernel for a given policy $\pi$. In addition, we define the $\gamma$-discounted stationary visitation distribution of the policy $\pi$ as $\nu_\pi(s) := (1-\gamma)\sum_{t=0}^{\infty}\gamma^t\,\Pr(s_t = s\,|\,s_0\sim\xi,\pi)$. Further, we denote $\nu_\pi(s,a) := \nu_\pi(s)\,\pi(a\,|\,s)$ as the (discounted) state-action visitation distribution.
In this paper, we assume that the state space $\mathcal S$ is countably infinite and the action space $\mathcal A$ is finite with possibly large cardinality $|\mathcal A|$. Hence, we estimate the policy and the value function corresponding to a certain unknown policy by parameterized function classes, as we introduce in Sections 3.1 and 4.1, respectively.
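To make the objects in this tuple concrete, the following minimal Python sketch stores a small tabular MDP and samples a trajectory under a fixed policy. The class, method, and variable names are illustrative conventions of ours and do not appear in the paper.

```python
import numpy as np

class TabularMDP:
    """Minimal container for a finite MDP tuple (S, A, P, r, gamma, xi) (illustrative only)."""

    def __init__(self, P, r, gamma, xi, rng=None):
        self.P = np.asarray(P)        # transition kernel, shape (S, A, S)
        self.r = np.asarray(r)        # bounded reward table, shape (S, A)
        self.gamma = gamma            # discount factor in (0, 1)
        self.xi = np.asarray(xi)      # initial state distribution, shape (S,)
        self.rng = rng or np.random.default_rng(0)

    def reset(self):
        # Draw an initial state from xi.
        return self.rng.choice(len(self.xi), p=self.xi)

    def step(self, s, a):
        # Sample the next state from P(.|s, a) and return it with the reward r(s, a).
        s_next = self.rng.choice(self.P.shape[2], p=self.P[s, a])
        return s_next, self.r[s, a]

def sample_trajectory(mdp, policy, horizon):
    """Roll out `horizon` steps under `policy`, an (S, A) array of action probabilities."""
    s, path = mdp.reset(), []
    for _ in range(horizon):
        a = mdp.rng.choice(policy.shape[1], p=policy[s])
        s_next, rew = mdp.step(s, a)
        path.append((s, a, rew))
        s = s_next
    return path
```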
2.2 Update Rule of AMSGrad
Although Adam Kingma and Ba, (2015) has achieved great success in practice since it was proposed, it has been shown not to converge even in simple convex settings Reddi et al., (2018). Instead, a slightly modified version called AMSGrad Reddi et al., (2018) is widely used to understand the success of adaptive momentum optimization algorithms. Given a stochastic gradient $g_t$ at time $t$, the generic form of AMSGrad is given by
$m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t,$   (1)
$v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t \odot g_t,$   (2)
$\hat v_t = \max(\hat v_{t-1},\, v_t),$   (3)
$\theta_{t+1} = \theta_t - \alpha_t\, m_t / \sqrt{\hat v_t},$   (4)
where $\alpha_t$ is the stepsize, and $\beta_1,\beta_2\in[0,1)$ are two hyper-parameters. In addition, $m_t$ and $v_t$ given in (1) and (2) are viewed as estimates of the first moment and the second moment of the gradient, respectively, which play important roles in adapting the learning rate as in (4). Compared to Adam, the main difference of AMSGrad lies in (3), which guarantees the sequence $\{\hat v_t\}$ to be non-decreasing, whereas Adam does not enforce this. This difference is considered to be a central reason for the non-convergent behavior of Adam Reddi et al., (2018); Chen et al., 2019a .
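As a concrete reference for the update rule, the following Python sketch runs (1)-(4) against a generic stochastic-gradient oracle. It is a minimal illustration under our own naming conventions; the small constant `eps` is a standard numerical safeguard that is not part of (4), and the toy noisy quadratic objective is ours.

```python
import numpy as np

def amsgrad(grad_fn, theta0, alpha=0.01, beta1=0.9, beta2=0.999, eps=1e-8, T=1000):
    """Generic AMSGrad loop implementing updates (1)-(4); `grad_fn(theta, t)` returns g_t."""
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta)          # first-moment estimate, Eq. (1)
    v = np.zeros_like(theta)          # second-moment estimate, Eq. (2)
    v_hat = np.zeros_like(theta)      # running element-wise max, Eq. (3)
    for t in range(T):
        g = grad_fn(theta, t)
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)  # keeps v_hat non-decreasing (key difference from Adam)
        theta = theta - alpha * m / (np.sqrt(v_hat) + eps)  # Eq. (4)
    return theta

# Toy usage: minimize a noisy quadratic f(theta) = ||theta||^2 / 2.
rng = np.random.default_rng(0)
noisy_quad_grad = lambda theta, t: theta + 0.1 * rng.standard_normal(theta.shape)
print(amsgrad(noisy_quad_grad, np.ones(3)))
```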
2.3 Notations
We use $\|x\|$ to denote the $\ell_2$-norm of a vector $x$, and use $\|x\|_\infty$ to denote the infinity norm. When $x, y$ are both vectors, $x/y$, $x^2$, $\sqrt{x}$, and $\max(x,y)$ are all calculated in an element-wise manner, which are used in the update of AMSGrad. We also denote $\lfloor x\rfloor$ as the integer such that $\lfloor x\rfloor \le x < \lfloor x\rfloor + 1$.
3 Convergence of PG-AMSGrad under Markovian Sampling
In this section, we study the convergence of an Adam-type policy gradient algorithm (PG-AMSGrad) with nonlinear function approximation and under non-i.i.d. sampling.
3.1 Policy Gradient and PG-AMSGrad
Consider a reinforcement learning problem that aims to find a policy maximizing the expected cumulative reward. We assume that the policy is parameterized by $\theta\in\mathbb R^d$ and forms a policy class $\{\pi_\theta:\theta\in\mathbb R^d\}$, which in general is a nonlinear function class. The policy gradient method is usually used to solve the following infinite-horizon optimization problem:
$\max_{\theta\in\mathbb{R}^d}\ J(\theta) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\Big], \quad s_0\sim\xi,\ a_t\sim\pi_\theta(\cdot\,|\,s_t),\ s_{t+1}\sim\mathsf{P}(\cdot\,|\,s_t,a_t).$   (5)
The gradient of $J(\theta)$ with respect to $\theta$ is captured by the policy gradient theorem for infinite-horizon MDPs with the discounted reward Sutton et al., (2000), and is given by
$\nabla J(\theta) = \frac{1}{1-\gamma}\,\mathbb{E}_{(s,a)\sim\nu_{\pi_\theta}}\big[Q_{\pi_\theta}(s,a)\,\nabla_\theta\log\pi_\theta(a\,|\,s)\big],$   (6)
where the expectation is taken over the discounted state-action visitation distribution $\nu_{\pi_\theta}$, and $Q_{\pi_\theta}(s,a)$ denotes the Q-function for an initial state-action pair $(s,a)$ defined as
$Q_{\pi_\theta}(s,a) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\,\Big|\,s_0=s,\ a_0=a,\ \pi_\theta\Big].$
In addition, we refer to $\psi_\theta(s,a):=\nabla_\theta\log\pi_\theta(a\,|\,s)$ as the score function corresponding to the policy $\pi_\theta$.
Since the transition probability is unknown, the policy gradient in (6) needs to be estimated via sampling. The Q-function and the score function are typically estimated by independent samples. First, at each time $t$, we draw a sample trajectory to provide an estimated Q-function $\hat Q_{\pi_{\theta_t}}$ based on the algorithm EstQ Zhang et al., (2019) (see Algorithm 3 in Appendix A for details). Such an estimator has been shown to be unbiased Zhang et al., (2019). That is, if we use $\mathbb{E}_Q$ to denote the expectation over the randomness (including the samples and the horizon) in EstQ, then we have
$\mathbb{E}_Q\big[\hat Q_{\pi_{\theta_t}}(s,a)\big] = Q_{\pi_{\theta_t}}(s,a).$   (7)
Second, we take samples to estimate the score function by following the policy $\pi_{\theta_t}$ and the modified transition kernel proposed in Konda, (2002), which is given by
$\widehat{\mathsf P}(s'\,|\,s,a) = \gamma\,\mathsf P(s'\,|\,s,a) + (1-\gamma)\,\xi(s'),$
where $\xi$ denotes the initial distribution and $\mathsf P$ is the transition kernel of the original MDP. It has been shown in Konda, (2002) that sampling with this transition kernel guarantees the induced Markov chain to converge to the (discounted) state-action visitation distribution.
Then, the gradient estimator to approximate $\nabla J(\theta_t)$ at time $t$ is given by
$\hat g(\theta_t) = \frac{1}{1-\gamma}\,\hat Q_{\pi_{\theta_t}}(s_t,a_t)\,\nabla_\theta\log\pi_{\theta_t}(a_t\,|\,s_t).$   (8)
We then apply such a gradient estimator to update the policy parameter by the AMSGrad update given in (1)-(4), and obtain PG-AMSGrad as in Algorithm 1.
We note that the gradient estimator obtained in (8) is biased, because the score function is estimated by a sequence of Markovian samples. We will show that such a biased gradient estimator is in fact computationally more efficient than the unbiased estimator used in the existing literature Zhang et al., (2019) (see Section 3.4). Our main technical novelty here lies in developing techniques to analyze the biased estimator under the AMSGrad update for PG.
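A hedged sketch of how the pieces of Algorithm 1 fit together is given below: the estimator (8) is formed from an assumed EstQ-style oracle and the score function, the parameter is updated by ascent with the AMSGrad rule (1)-(4), and the next sample is drawn from the resetting kernel described above. The interfaces (`env`, `sample_action`, `log_pi_grad`, `est_q`) and the hyper-parameter defaults are our assumptions rather than the paper's specification.

```python
import numpy as np

def pg_amsgrad(env, log_pi_grad, sample_action, est_q, theta0, gamma=0.99,
               alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, T=10_000, rng=None):
    """Sketch of PG-AMSGrad: biased estimator (8) plugged into the AMSGrad updates (1)-(4).

    Assumed interfaces: `env.reset()` samples s ~ xi; `env.step(s, a)` returns (s_next, reward);
    `sample_action(theta, s)` draws a ~ pi_theta(.|s); `log_pi_grad(theta, s, a)` is the score;
    `est_q(theta, s, a)` returns an (assumed unbiased) Q estimate, e.g. from an EstQ-style rollout.
    """
    rng = rng or np.random.default_rng(0)
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta); v = np.zeros_like(theta); v_hat = np.zeros_like(theta)
    s = env.reset()
    for t in range(T):
        a = sample_action(theta, s)
        q_hat = est_q(theta, s, a)                            # rollout-based Q estimate
        g = q_hat * log_pi_grad(theta, s, a) / (1 - gamma)    # gradient estimator (8)
        # AMSGrad updates (1)-(4); eps is a numerical safeguard only.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        theta = theta + alpha * m / (np.sqrt(v_hat) + eps)    # ascent, since J is maximized
        # Resetting kernel: with probability 1 - gamma restart from xi, otherwise follow P,
        # so that (s, a) is asymptotically drawn from the discounted visitation distribution.
        if rng.random() < 1 - gamma:
            s = env.reset()
        else:
            s, _ = env.step(s, a)
    return theta
```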
3.2 Technical Assumptions
In the following, we specify some technical assumptions in our convergence analysis.
We consider a general class of parameterized policy functions that satisfy the following assumption.
Assumption 1.
Assume that the parameterized policy $\pi_\theta(a\,|\,s)$ is differentiable with respect to $\theta$, and the score function $\psi_\theta(s,a)=\nabla_\theta\log\pi_\theta(a\,|\,s)$ corresponding to $\pi_\theta$ exists. In addition, we assume both the policy function and the score function are Lipschitz continuous with parameters $L_\pi$ and $L_\psi$, respectively, i.e., for all $\theta_1,\theta_2\in\mathbb{R}^d$ and all $s\in\mathcal S$, $a\in\mathcal A$,
$|\pi_{\theta_1}(a\,|\,s) - \pi_{\theta_2}(a\,|\,s)| \le L_\pi\,\|\theta_1-\theta_2\|, \qquad \|\psi_{\theta_1}(s,a) - \psi_{\theta_2}(s,a)\| \le L_\psi\,\|\theta_1-\theta_2\|.$
Further, we also assume that the score function is uniformly bounded by a constant $C_\psi$ for any $\theta$, i.e., $\|\psi_\theta(s,a)\|\le C_\psi$ for all $s\in\mathcal S$ and $a\in\mathcal A$.
This assumption is standard in the existing literature that studies PG with nonconvex function approximation Zhang et al., (2019); Xu et al., 2019a ; Papini et al., (2018).
In Algorithm 1, we sample a data trajectory using the transition kernel $\widehat{\mathsf P}$ and the policy $\pi_{\theta_t}$. Such a sequence of samples is non-i.i.d. and follows a Markovian distribution. We assume that the MDP and the policies we consider satisfy the following standard mixing property.
Assumption 2.
For any $\theta$, there exist constants $\kappa>0$ and $\rho\in(0,1)$ such that
$\sup_{s\in\mathcal S}\ d_{TV}\big(\mathbb P(s_t\in\cdot\,|\,s_0=s,\pi_\theta),\ \mu_{\pi_\theta}\big)\ \le\ \kappa\,\rho^{t},\qquad \forall\, t\ge 0,$
where $d_{TV}(\mu,\nu)$ denotes the total-variation norm (i.e., the total-variation distance between two probability measures $\mu$ and $\nu$).
This assumption holds for irreducible and aperiodic Markov chains Mitrophanov, (2005), and is widely adopted in the theoretical analysis of RL algorithms under Markovian sampling settings Bhandari et al., (2018); Chen et al., 2019b ; Zou et al., 2019b ; Karimi et al., (2019).
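As a quick numerical illustration of this geometric mixing property, the following toy two-state chain (chosen by us, not taken from the paper) shows the total-variation distance to the stationary distribution decaying geometrically:

```python
import numpy as np

# Toy 2-state chain: TV distance to the stationary distribution decays geometrically.
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])
mu = np.array([2 / 3, 1 / 3])              # stationary distribution of P
dist = np.array([1.0, 0.0])                # start deterministically in state 0
for t in range(1, 6):
    dist = dist @ P
    tv = 0.5 * np.abs(dist - mu).sum()     # total-variation distance at time t
    print(f"t={t}: TV={tv:.4f}")           # decays like C * rho^t with rho = 0.7 here
```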
3.3 Convergence of PG-AMSGrad
In this section, we provide the convergence analysis of PG-AMSGrad as given in Algorithm 1. We first consider the case with a constant stepsize, and then provide the result with a diminishing stepsize.
Although AMSGrad has been studied in conventional optimization, our analysis of PG-AMSGrad mainly deals with the following new challenges arising in RL. First, samples here are generated via an MDP and distributed in a non-i.i.d. fashion. Thus the gradient estimator is biased and we need to control the bias with a certain upper bound scaled by the stepsize. Second, the sampling distribution also changes over time, which causes additional complication. Thus, our technical development mainly handles the above two challenges under the adaptive momentum update rule of AMSGrad. We next provide the convergence results that we obtain and relegate all the proofs to the appendices.
We first provide the Lipschitz properties for the true policy gradient and its estimator, which are useful for establishing the convergence. Recall that in Algorithm 1, the gradient estimator $\hat g(\theta_t)$ at time $t$ is obtained by using the Q-function estimator $\hat Q_{\pi_{\theta_t}}$ generated by the EstQ algorithm (see Appendix A). Note that $\hat Q_{\pi_{\theta_t}}(s,a)$ is an unbiased estimator of $Q_{\pi_{\theta_t}}(s,a)$ for all $(s,a)$ Zhang et al., (2019), and the samples used for this estimation are independent of those used in the other steps of PG-AMSGrad except the initial sample. Taking the expectation over the randomness in EstQ at time $t$ (denoted as $\mathbb{E}_Q$), we obtain an estimator $\bar g(\theta_t)$ defined as
$\bar g(\theta_t) := \mathbb{E}_Q\big[\hat g(\theta_t)\big] = \frac{1}{1-\gamma}\,Q_{\pi_{\theta_t}}(s_t,a_t)\,\nabla_\theta\log\pi_{\theta_t}(a_t\,|\,s_t).$   (9)
We next obtain the Lipschitz properties of $\nabla J(\theta)$ and $\bar g(\theta)$ in the following lemma.
Lemma 1.
(Lipschitz property of policy gradient) Under Assumptions 1 and 2, the policy gradient $\nabla J(\theta)$ defined in (6) is Lipschitz continuous with a parameter $L_J$, i.e., for all $\theta_1,\theta_2\in\mathbb{R}^d$,
$\|\nabla J(\theta_1) - \nabla J(\theta_2)\| \le L_J\,\|\theta_1 - \theta_2\|,$   (10)
where the constant $L_J$ depends only on the discount factor $\gamma$, the reward bound, and the constants in Assumptions 1 and 2. Further, the policy gradient estimator $\bar g(\theta)$ defined in (9) is also Lipschitz continuous with a parameter $L_{\bar g}$, i.e., for all $\theta_1,\theta_2\in\mathbb{R}^d$,
$\|\bar g(\theta_1) - \bar g(\theta_2)\| \le L_{\bar g}\,\|\theta_1 - \theta_2\|,$   (11)
where $L_{\bar g}$ is determined by the same set of constants.
The following theorem characterizes the convergence of PG-AMSGrad with a constant stepsize. Recall that the stepsize refers to the parameter in the AMSGrad update (4), not the overall learning rate of the algorithm.
Theorem 1.
Theorem 1 indicates that under the constant stepsize, PG-AMSGrad converges to a neighborhood of a stationary point at a rate of . The size of the neighborhood can be controlled by the stepsize $\alpha$. One can observe that $\alpha$ controls a tradeoff between the convergence rate and the convergence accuracy. Decreasing $\alpha$ improves the convergence accuracy, but slows down the convergence, since the corresponding coefficient contains $\alpha$ in the denominator. To balance such a tradeoff, we set the stepsize . In this case, the mixing time becomes and thus PG-AMSGrad converges to a stationary point with a rate of .
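To visualize why the stepsize controls this tradeoff, the following is a minimal LaTeX sketch under an assumed generic bound shape; the constants $C_1$, $C_2$ and the exact dependence on the mixing time $\tau(\alpha)$ are illustrative placeholders and need not match the precise statement of Theorem 1.

```latex
% Assumed generic shape of a constant-stepsize bound under geometric mixing
% (placeholder constants C_1, C_2; consult Theorem 1 for the precise statement):
\[
  \min_{t\le T}\ \mathbb{E}\,\big\|\nabla J(\theta_t)\big\|^2
  \;\lesssim\;
  \underbrace{\frac{C_1}{\alpha T}}_{\text{optimization error}}
  \;+\;
  \underbrace{C_2\,\alpha\,\tau(\alpha)}_{\text{Markovian-sampling bias}},
  \qquad \tau(\alpha)=\mathcal{O}\big(\log(1/\alpha)\big).
\]
% With a constant \alpha, the first term vanishes as 1/T while the second remains a
% fixed-size neighborhood; choosing \alpha = 1/\sqrt{T} balances the two terms, since
% C_1/(\alpha T) = C_1/\sqrt{T} and C_2\,\alpha\,\tau(\alpha) = \mathcal{O}(\log T/\sqrt{T}).
```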
In the following, we adopt a diminishing stepsize to eliminate the convergence error and obtain the exact convergence.
Theorem 2.
Under the diminishing stepsize, PG-AMSGrad can converge exactly to a stationary point. Since , the convergence rate is given by .
Theorems 1 and 2 indicate that under both a constant stepsize and diminishing stepsize, PG-AMSGrad finds a stationary point efficiently with guaranteed convergence. However, there is a tradeoff between the convergence rate and the convergence accuracy. With a constant stepsize, PG-AMSGrad can converge faster but only to a neighborhood of a stationary point whose size is controlled by the stepsize, whereas a diminishing stepsize yields a better convergence accuracy, but attains a lower convergence rate.
3.4 Implication on Vanilla PG under Markovian Data
Although our focus in this paper is on the Adam-type PG, slight modification of our analysis yields the convergence rate of the vanilla PG under infinite horizon Markovian sampling. In the following, we first present such results and then compare them with two recent studies Zhang et al., (2019); Karimi et al., (2019) on the same model to illustrate the novelty of our analysis and benefit of our sampling strategy.
We consider the vanilla PG algorithm that uses the same gradient estimator and sampling strategy as those of Algorithm 1, but adopts the SGD update (i.e., $\theta_{t+1}=\theta_t+\alpha_t\,\hat g(\theta_t)$) rather than the AMSGrad update. We refer to such an algorithm as PG-SGD. The following proposition characterizes the convergence rate of PG-SGD.
Proposition 1.
We next compare Proposition 1 with two recent studies on the same non-i.i.d. model. First, Karimi et al., (2019) studied infinite-horizon PG with a biased gradient estimator. In their analysis, they did not bound the gradient bias using the stepsize, and hence their convergence has a non-zero error even with a diminishing stepsize. In contrast, we obtain a fine-grained bound on the bias and characterize its dependence on the stepsize. We show that PG exactly converges to a stationary point under the diminishing stepsize.
Another closely related study on the infinite-horizon PG under non-i.i.d. sampling was by Zhang et al., (2019), but their algorithm adopts an unbiased gradient estimator at the cost of using more samples. As a comparison, Proposition 1 indicates that PG-SGD with a biased gradient estimator attains a better convergence rate and accuracy. More specifically, under the constant stepsize, (Zhang et al.,, 2019, Corollary 4.4) showed that their PG algorithm converges with an error bound of , whereas PG-SGD with a biased gradient estimator achieves a much smaller error bound by taking . Similarly, under the diminishing stepsize, (Zhang et al.,, 2019, Theorem 4.3) showed that their PG algorithm converges at a rate of , whereas our PG-SGD converges at a rate of , which is much faster since is usually close to 1, and is considered to be less influential in practice.
4 Convergence of TD-AMSGrad under Markovian Sampling
In this section, we adopt AMSGrad to TD learning and analyze its convergence under Markovian sampling. The proof techniques of bounding the bias and the nature of the convergence are very different from those of PG-AMSGrad.
4.1 TD Learning and TD-AMSGrad
Policy evaluation is a fundamental task in RL, and often plays a critical role in other algorithms such as the PG method that we study in Section 3. The goal of policy evaluation is to obtain an accurate estimate of the expected cumulative reward, known as the value function, for a given policy $\pi$, defined as
$V_\pi(s) := \mathbb{E}\Big[\sum_{t=0}^{\infty}\gamma^t\, r(s_t,a_t)\,\Big|\, s_0 = s\Big],\quad a_t\sim\pi(\cdot\,|\,s_t),\ s_{t+1}\sim\mathsf P(\cdot\,|\,s_t,a_t).$
The Bellman operator $T_\pi$ applied to a value function $V$ is defined by, for every state $s\in\mathcal S$,
$(T_\pi V)(s) = \mathbb{E}_{a\sim\pi(\cdot|s),\ s'\sim\mathsf P(\cdot|s,a)}\big[r(s,a) + \gamma\, V(s')\big].$   (12)
The key to approximating the value function is the observation that it satisfies the projected Bellman equation when projected onto some convex space. Under function approximation, the value function is parameterized by $\theta\in\mathbb{R}^d$ and denoted by $V_\theta$. As in many recent finite-time analyses of TD Bhandari et al., (2018); Xu et al., 2019b; Srikant and Ying, (2019), we consider the linear approximation class of the value function defined as
$V_\theta(s) = \phi(s)^\top\theta,$   (13)
where $\theta\in\mathbb{R}^d$, and $\phi(s) = (\phi_1(s),\dots,\phi_d(s))^\top$ is a vector function with dimension $d$ whose elements represent the nonlinear kernel (feature) functions. Then the vanilla TD algorithm follows a stochastic iterative method given by
$\theta_{t+1} = \theta_t + \alpha_t\, g_t(\theta_t),$   (14)
where $\alpha_t$ is the stepsize, and $g_t(\theta_t)$ is defined as
$g_t(\theta) := \big(r(s_t,a_t) + \gamma\,\phi(s_{t+1})^\top\theta - \phi(s_t)^\top\theta\big)\,\phi(s_t).$   (15)
Here, $g_t(\theta)$ serves as a stochastic pseudo-gradient, and is an estimator of the full pseudo-gradient $\bar g(\theta)$ given by
$\bar g(\theta) := \mathbb{E}_{s\sim\mu_\pi,\ a\sim\pi(\cdot|s),\ s'\sim\mathsf P(\cdot|s,a)}\big[\big(r(s,a) + \gamma\,\phi(s')^\top\theta - \phi(s)^\top\theta\big)\,\phi(s)\big],$   (16)
where the expectation is taken over the stationary distribution $\mu_\pi$ of the states. We note that $\bar g(\theta)$ is not the gradient of a loss function, but plays a role similar to that of the gradient in the gradient descent algorithm.
Then TD-AMSGrad is obtained by replacing the update (14) by the AMSGrad update given in (1)-(4) as in Algorithm 2.
As seen in Algorithm 2, the state-action pairs are sampled as a single trajectory under the transition kernel $\mathsf P$ with the unknown policy $\pi$. Therefore, the samples along the trajectory are dependent, and hence we need to analyze the convergence of TD-AMSGrad under Markovian sampling.
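For concreteness, the following Python sketch mirrors the structure of Algorithm 2: each transition of the Markovian trajectory yields the pseudo-gradient (15), which is fed into the AMSGrad update (1)-(4), followed by a projection onto a bounded parameter ball (cf. the bounded-domain assumption in Section 4.2). The sampler interface, the feature map, the projection radius, and the hyper-parameter defaults are assumptions of the sketch.

```python
import numpy as np

def td_amsgrad(sample_transition, phi, theta0, s0, gamma=0.99, alpha=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8, radius=10.0, T=10_000):
    """Sketch of TD-AMSGrad with linear value approximation V_theta(s) = phi(s)^T theta.

    `sample_transition(s)` returns (reward, s_next) from one step of the Markovian
    trajectory under the evaluated policy; `phi(s)` returns the feature vector of s.
    """
    theta = np.asarray(theta0, dtype=float)
    m = np.zeros_like(theta); v = np.zeros_like(theta); v_hat = np.zeros_like(theta)
    s = s0
    for _ in range(T):
        reward, s_next = sample_transition(s)
        feat, feat_next = np.asarray(phi(s)), np.asarray(phi(s_next))
        # Pseudo-gradient (15): TD error times the current feature vector.
        g = (reward + gamma * feat_next @ theta - feat @ theta) * feat
        # AMSGrad updates (1)-(4); eps is a numerical safeguard only.
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        v_hat = np.maximum(v_hat, v)
        theta = theta + alpha * m / (np.sqrt(v_hat) + eps)
        # Projection onto a ball of radius `radius` (assumed bounded domain, Section 4.2).
        norm = np.linalg.norm(theta)
        if norm > radius:
            theta *= radius / norm
        s = s_next
    return theta
```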
4.2 Technical Assumptions
In this section, we introduce some standard technical assumptions for our analysis.
We first give the following standard assumption on the kernel function Tsitsiklis and Van Roy, (1997); Bhandari et al., (2018); Xu et al., 2019b ; Chen et al., 2019b .
Assumption 3.
For any state $s\in\mathcal S$, the kernel function is uniformly bounded, i.e., $\|\phi(s)\|\le 1$. In addition, we define the feature matrix $\Phi$ as the matrix whose $s$-th row is the feature vector $\phi(s)^\top$, and assume that the columns of $\Phi$ are linearly independent.
The boundedness assumption is mild since we can always normalize the kernel functions.
We next define the following norm of the value function: $\|V_\theta\|_D := \sqrt{V_\theta^\top D\, V_\theta}$, where $D$ is a diagonal matrix whose diagonal entries are given by the stationary distribution $\mu_\pi$, and $\Sigma := \Phi^\top D\,\Phi = \mathbb{E}_{s\sim\mu_\pi}\big[\phi(s)\phi(s)^\top\big]$ is the steady-state feature covariance matrix.
The next assumption of the bounded domain is standard in theoretical analysis of the Adam-type algorithms Reddi et al., (2018); Tran and Phong, (2019).
Assumption 4.
The domain $\mathcal D$ of the approximation parameters is a bounded ball that contains the TD fixed point $\theta^*$ (defined in Section 4.3). That is, there exists $R>0$ such that $\|\theta^*\|\le R$ and $\|\theta\|\le R$ for any $\theta\in\mathcal D$.
The boundedness parameter $R$ can be chosen as discussed in Bhandari et al., (2018).
4.3 Convergence of TD-AMSGrad
In the following, we provide the convergence results for TD-AMSGrad with linear function approximation under Markovian sampling.
First consider the full pseudo-gradient $\bar g(\theta)$ in (16). We define $\theta^*$ as the point satisfying $\bar g(\theta^*)=0$, i.e., the fixed point of the projected Bellman operator. Then $\theta^*$ is the unique such point under Assumption 3, following from the contraction property of the projected Bellman operator Tsitsiklis and Van Roy, (1997). The following lemma is useful for our convergence analysis.
Lemma 2.
Lemma 2 indicates that the update of TD learning exhibits a property similar to strong convexity in optimization.
Since our analysis is under Markovian sampling, a non-zero bias arises in our convergence analysis when estimating the pseudo-gradient, i.e., in general $\mathbb{E}[g_t(\theta)]\neq\bar g(\theta)$. The following lemma provides an upper bound on such a bias error for TD-AMSGrad.
Lemma 3.
The following theorem provides the convergence of TD-AMSGrad under a constant stepsize coupled with diminishing hyper-parameters in the AMSGrad update.
Theorem 3.
In Theorem 3, the coefficients in the bound are time-independent constants. Therefore, under the choice of the stepsize and hyper-parameters in the theorem, Algorithm 2 converges to a neighborhood of the global optimum at a rate of . The size of the neighborhood is controlled by the stepsize $\alpha$. Similarly to Theorem 1, we can balance the tradeoff between the convergence rate and the convergence accuracy, i.e., the size of the neighborhood, by setting the stepsize , which yields convergence to the global optimal solution at the rate of .
Next, we provide the convergence result with a diminishing stepsize in the following theorem.
Theorem 4.
Comparing with Theorem 3 and observing , we conclude that TD-AMSGrad with the diminishing stepsize converges exactly to the global optimum at the rate of , rather than only to a neighborhood.
5 Conclusion
This paper provides the first convergence analysis of Adam-type RL algorithms under Markovian sampling. For PG-AMSGrad with nonlinear function approximation of the policy, we show that the algorithm converges to a stationary point with a diminishing stepsize. With a constant stepsize, we show that PG-AMSGrad converges only to a neighborhood of a stationary point, but at a faster rate. Furthermore, we also provide finite-time convergence results for TD-AMSGrad to the global optimum under proper choices of the stepsize.
Several future directions along this topic are interesting. For example, it is important to study the optimality of the convergence results of PG-AMSGrad. Extending the analysis to more general value function approximation and establishing convergence results for constant hyper-parameters in TD-AMSGrad are also of interest.
Acknowledgements
The work was supported in part by the U.S. National Science Foundation under Grants CCF-1761506, ECCS-1818904, CCF-1909291 and CCF-1900145.
References
- Agarwal et al., (2019) Agarwal, A., Kakade, S. M., Lee, J. D., and Mahajan, G. (2019). Optimality and approximation with policy gradient methods in Markov decision processes. arXiv preprint arXiv:1908.00261.
- Baird, (1995) Baird, L. (1995). Residual algorithms: Reinforcement learning with function approximation. In Machine Learning Proceedings 1995, pages 30–37. Elsevier.
- Baxter and Bartlett, (2001) Baxter, J. and Bartlett, P. L. (2001). Infinite-horizon policy-gradient estimation. Journal of Artificial Intelligence Research, 15:319–350.
- Bello et al., (2017) Bello, I., Zoph, B., Vasudevan, V., and Le, Q. V. (2017). Neural optimizer search with reinforcement learning. In International Conference on Machine Learning (ICML), pages 459–468.
- Benveniste et al., (2012) Benveniste, A., Métivier, M., and Priouret, P. (2012). Adaptive algorithms and stochastic approximations, volume 22. Springer Science & Business Media.
- Bhandari and Russo, (2019) Bhandari, J. and Russo, D. (2019). Global optimality guarantees for policy gradient methods. arXiv preprint arXiv:1906.01786.
- Bhandari et al., (2018) Bhandari, J., Russo, D., and Singal, R. (2018). A finite time analysis of temporal difference learning with linear function approximation. In Conference on Learning Theory (COLT).
- Bhatnagar, (2010) Bhatnagar, S. (2010). An actor-critic algorithm with function approximation for discounted cost constrained markov decision processes. Systems & Control Letters, 59(12):760–766.
- Bhatnagar et al., (2008) Bhatnagar, S., Ghavamzadeh, M., Lee, M., and Sutton, R. S. (2008). Incremental natural actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 105–112.
- Bhatnagar et al., (2009) Bhatnagar, S., Sutton, R. S., Ghavamzadeh, M., and Lee, M. (2009). Natural actor-critic algorithms. Automatica, 45(11):2471–2482.
- Cai et al., (2019) Cai, Q., Yang, Z., Lee, J. D., and Wang, Z. (2019). Neural temporal-difference learning converges to global optima. In Advances in Neural Information Processing Systems (NeurIPS), pages 11312–11322.
- Castillo et al., (2019) Castillo, G. A., Weng, B., Hereid, A., Wang, Z., and Zhang, W. (2019). Reinforcement learning meets hybrid zero dynamics: A case study for rabbit. In 2019 International Conference on Robotics and Automation (ICRA), pages 284–290.
- (13) Chen, X., Liu, S., Sun, R., and Hong, M. (2019a). On the convergence of a class of Adam-type algorithms for non-convex optimization. In International Conference on Learning Representations (ICLR).
- (14) Chen, Z., Zhang, S., Doan, T. T., Maguluri, S. T., and Clarke, J.-P. (2019b). Finite-time analysis of Q-learning with linear function approximation. arXiv preprint arXiv:1905.11425.
- Dalal et al., (2018) Dalal, G., Szörényi, B., Thoppe, G., and Mannor, S. (2018). Finite sample analyses for TD(0) with function approximation. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Fazel et al., (2018) Fazel, M., Ge, R., Kakade, S., and Mesbahi, M. (2018). Global convergence of policy gradient methods for the linear quadratic regulator. In International Conference on Machine Learning (ICML), pages 1467–1476.
- Fu et al., (2020) Fu, Z., Yang, Z., Chen, Y., and Wang, Z. (2020). Actor-critic provably finds nash equilibria of linear-quadratic mean-field games. In International Conference on Learning Representations (ICLR).
- Györfi and Walk, (1996) Györfi, L. and Walk, H. (1996). On the averaged stochastic approximation for linear regression. SIAM Journal on Control and Optimization, 34(1):31–61.
- Hu and Syed, (2019) Hu, B. and Syed, U. (2019). Characterizing the exact behaviors of temporal difference learning algorithms using Markov jump linear system theory. In Advances in Neural Information Processing Systems (NeurIPS), pages 8477–8488.
- Ji et al., (2019) Ji, K., Wang, Z., Zhou, Y., and Liang, Y. (2019). Faster stochastic algorithms via history-gradient aided batch size adaptation. arXiv preprint arXiv:1910.09670.
- Kakade, (2002) Kakade, S. M. (2002). A natural policy gradient. In Advances in Neural Information Processing Systems (NeurIPS), pages 1531–1538.
- Karimi et al., (2019) Karimi, B., Miasojedow, B., Moulines, E., and Wai, H.-T. (2019). Non-asymptotic analysis of biased stochastic approximation scheme. In Conference on Learning Theory (COLT).
- Kingma and Ba, (2015) Kingma, D. P. and Ba, J. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Konda, (2002) Konda, V. (2002). Actor-critic algorithms. PhD thesis, Massachusetts Institute of Technology.
- Konda and Tsitsiklis, (2000) Konda, V. R. and Tsitsiklis, J. N. (2000). Actor-critic algorithms. In Advances in Neural Information Processing Systems (NeurIPS), pages 1008–1014.
- Kushner and Yin, (2003) Kushner, H. and Yin, G. G. (2003). Stochastic approximation and recursive algorithms and applications, volume 35. Springer Science & Business Media.
- Lakshminarayanan and Szepesvari, (2018) Lakshminarayanan, C. and Szepesvari, C. (2018). Linear stochastic approximation: How far does constant step-size and iterate averaging go? In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 1347–1355.
- Liu et al., (2019) Liu, B., Cai, Q., Yang, Z., and Wang, Z. (2019). Neural trust region/proximal policy optimization attains globally optimal policy. In Advances in Neural Information Processing Systems (NeurIPS), pages 10564–10575.
- Luo et al., (2019) Luo, Y., Yang, Z., Wang, Z., and Kolar, M. (2019). Natural actor-critic converges globally for hierarchical linear quadratic regulator. arXiv preprint arXiv:1912.06875.
- Malik et al., (2019) Malik, D., Pananjady, A., Bhatia, K., Khamaru, K., Bartlett, P., and Wainwright, M. (2019). Derivative-free methods for policy optimization: Guarantees for linear quadratic systems. In International Conference on Artificial Intelligence and Statistics (AISTATS), pages 2916–2925.
- Mitrophanov, (2005) Mitrophanov, A. Y. (2005). Sensitivity and convergence of uniformly ergodic Markov chains. Journal of Applied Probability, 42(4):1003–1014.
- Mnih et al., (2013) Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., and Riedmiller, M. (2013). Playing atari with deep reinforcement learning. arXiv preprint arXiv:1312.5602.
- Papini et al., (2018) Papini, M., Binaghi, D., Canonaco, G., Pirotta, M., and Restelli, M. (2018). Stochastic variance-reduced policy gradient. In International Conference on Machine Learning (ICML), pages 4026–4035.
- Papini et al., (2017) Papini, M., Pirotta, M., and Restelli, M. (2017). Adaptive batch size for safe policy gradients. In Advances in Neural Information Processing Systems (NeurIPS), pages 3591–3600.
- Papini et al., (2019) Papini, M., Pirotta, M., and Restelli, M. (2019). Smoothing policies and safe policy gradients. arXiv preprint arXiv:1905.03231.
- Pednault et al., (2002) Pednault, E., Abe, N., and Zadrozny, B. (2002). Sequential cost-sensitive decision making with reinforcement learning. In Proceedings of the eighth ACM SIGKDD International conference on Knowledge Discovery and Data Mining, pages 259–268.
- Pirotta et al., (2015) Pirotta, M., Restelli, M., and Bascetta, L. (2015). Policy gradient in lipschitz Markov decision processes. Machine Learning, 100(2-3):255–283.
- Reddi et al., (2018) Reddi, S. J., Kale, S., and Kumar, S. (2018). On the convergence of Adam and beyond. In International Conference on Learning Representations (ICLR).
- Rummery and Niranjan, (1994) Rummery, G. A. and Niranjan, M. (1994). On-line Q-learning using connectionist systems, volume 37.
- Schulman et al., (2015) Schulman, J., Levine, S., Abbeel, P., Jordan, M., and Moritz, P. (2015). Trust region policy optimization. In International Conference on Machine Learning (ICML), pages 1889–1897.
- Schulman et al., (2017) Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347.
- Shani et al., (2020) Shani, L., Efroni, Y., and Mannor, S. (2020). Adaptive trust region policy optimization: Global convergence and faster rates for regularized MDPs. In Thirty-Fourth AAAI Conference on Artificial Intelligence.
- Shen et al., (2019) Shen, Z., Ribeiro, A., Hassani, H., Qian, H., and Mi, C. (2019). Hessian aided policy gradient. In International Conference on Machine Learning (ICML), pages 5729–5738.
- Silver et al., (2014) Silver, D., Lever, G., Heess, N., Degris, T., Wierstra, D., and Riedmiller, M. (2014). Deterministic policy gradient algorithms. In International Conference on Machine Learning (ICML), pages 387–395.
- Srikant and Ying, (2019) Srikant, R. and Ying, L. (2019). Finite-time error bounds for linear stochastic approximation and TD learning. In Conference on Learning Theory (COLT), pages 2803–2830.
- Stooke and Abbeel, (2018) Stooke, A. and Abbeel, P. (2018). Accelerated methods for deep reinforcement learning. arXiv preprint arXiv:1803.02811.
- Sun et al., (2020) Sun, T., Shen, H., Chen, T., and Li, D. (2020). Adaptive temporal difference learning with linear function approximation. arXiv preprint arXiv:2002.08537.
- Sutton, (1988) Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1):9–44.
- Sutton et al., (2000) Sutton, R. S., McAllester, D. A., Singh, S. P., and Mansour, Y. (2000). Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 1057–1063.
- Tadić et al., (2017) Tadić, V. B., Doucet, A., et al. (2017). Asymptotic bias of stochastic gradient search. The Annals of Applied Probability, 27(6):3255–3304.
- Tran and Phong, (2019) Tran, P. T. and Phong, L. T. (2019). On the convergence proof of AMSGrad and a new version. IEEE Access, 7:61706–61716.
- Tsitsiklis and Van Roy, (1997) Tsitsiklis, J. N. and Van Roy, B. (1997). An analysis of temporal-difference learning with function approximation. IEEE Transactions on Automatic Control, 42(5):674–690.
- Tu and Recht, (2019) Tu, S. and Recht, B. (2019). The gap between model-based and model-free methods on the linear quadratic regulator: An asymptotic viewpoint. In Conference on Learning Theory (COLT), pages 3036–3083.
- Wang et al., (2020) Wang, L., Cai, Q., Yang, Z., and Wang, Z. (2020). Neural policy gradient methods: Global optimality and rates of convergence. In International Conference on Learning Representations (ICLR).
- Watkins and Dayan, (1992) Watkins, C. J. and Dayan, P. (1992). Q-learning. Machine Learning, 8(3-4):279–292.
- Williams, (1992) Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3-4):229–256.
- (57) Xu, P., Gao, F., and Gu, Q. (2019a). An improved convergence analysis of stochastic variance-reduced policy gradient. In International Conference on Uncertainty in Artificial Intelligence (UAI).
- Xu et al., (2020) Xu, P., Gao, F., and Gu, Q. (2020). Sample efficient policy gradient methods with recursive variance reduction. In International Conference on Learning Representations (ICLR).
- (59) Xu, T., Zou, S., and Liang, Y. (2019b). Two time-scale off-policy td learning: Non-asymptotic analysis over markovian samples. In Advances in Neural Information Processing Systems (NeurIPS), pages 10633–10643.
- Yang et al., (2019) Yang, Z., Chen, Y., Hong, M., and Wang, Z. (2019). Provably global convergence of actor-critic: A case for linear quadratic regulator with ergodic cost. In Advances in Neural Information Processing Systems (NeurIPS), pages 8351–8363.
- Zhang et al., (2019) Zhang, K., Koppel, A., Zhu, H., and Başar, T. (2019). Global convergence of policy gradient methods to (almost) locally optimal policies. arXiv preprint arXiv:1906.08383.
- Zhou et al., (2018) Zhou, D., Tang, Y., Yang, Z., Cao, Y., and Gu, Q. (2018). On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671.
- (63) Zou, F., Shen, L., Jie, Z., Zhang, W., and Liu, W. (2019a). A sufficient condition for convergences of Adam and RMSProp. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11127–11135.
- (64) Zou, S., Xu, T., and Liang, Y. (2019b). Finite-sample analysis for SARSA with linear function approximation. In Advances in Neural Information Processing Systems (NeurIPS), pages 8665–8675.
Supplementary Materials
Appendix A EstQ algorithm
We introduce the unbiased Q-function estimating algorithm EstQ below.
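As a complement to Algorithm 3, the following is a minimal Python sketch of one standard construction of an unbiased Q-estimate via a geometrically sampled rollout length. The environment interface is assumed, and the exact EstQ procedure of Zhang et al. (2019) may differ in its details (e.g., in how the discounting is split between the horizon distribution and the reward weights).

```python
import numpy as np

def est_q(env, sample_action, theta, s, a, gamma, rng=None):
    """Unbiased estimate of Q_{pi_theta}(s, a) via a geometric rollout horizon.

    With H ~ Geometric(1 - gamma) (support {1, 2, ...}), summing the first H rewards of a
    rollout started at (s, a) gives E[sum] = sum_t gamma^t E[r_t] = Q(s, a), since
    P(H >= t + 1) = gamma^t.  Assumed interface: `env.step(s, a)` returns (s_next, reward)
    and `sample_action(theta, s)` draws a ~ pi_theta(.|s).
    """
    rng = rng or np.random.default_rng()
    horizon = rng.geometric(1 - gamma)   # P(H = h) = (1 - gamma) * gamma**(h - 1)
    q_hat, state, action = 0.0, s, a
    for _ in range(horizon):
        state_next, reward = env.step(state, action)
        q_hat += reward
        state, action = state_next, sample_action(theta, state_next)
    return q_hat
```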
Appendix B Proof of Lemma 1
For notational simplicity, we denote and in the sequel. We first introduce two technical lemmas which are useful for the proof of Lemma 1.
Lemma 4.
The gradient estimator is uniformly bounded, that is,
(17) |
Similarly, we also have
Proof.
The proof proceeds from the definition of the gradient estimator.
where (i) follows from Assumption 1 and from the fact that the reward function is uniformly bounded. The proof of the second claim is similar to the first one. ∎
Lemma 5.
(Zou et al., 2019b, Lemma 3) For any , under Assumption 2 we have
Proof of Lemma 1: To derive the Lipschitz continuity of , we first use the policy gradient theorem Sutton et al., (2000) which gives
where is denoted as the discounted visitation distribution. We further proceed the proof of Lipschitz continuity of as
(18) |
where (i) follows from Assumption 1 and boundedness of , and (ii) follows since . In the following, we bound the last two terms in the right hand side. First, we show that is also Lipschitz continuous with respect to . To see that, we have
where is the distribution of the state-action trajectory corresponding to the transition probability and the policy . Although is different from the state-action visitation distribution , both of them share the same property. That is, for a different , is determined by the same transition kernel and only differs in the policy. Therefore, Lemma 5 applies to two distributions and , which yields
(19) |
Appendix C Proof of Theorem 1
In this appendix we will first introduce some useful lemmas in Appendix C.1, and then provide technical proofs in Appendix C.2.
C.1 Supporting Lemmas
We first provide two lemmas to deal with the bias of the gradient estimation.
Lemma 6.
Given and a positive diagonal matrix , then
(21) |
Proof.
Since is a diagonal matrix with , we have
where (i) follows because . ∎
Lemma 7.
Proof.
For notational brevity we denote . Observe that
where (i) follows from the boundedness of . In the following, we bound . To this end, we need to build a new Markov chain in which the policy is fixed after time . To be specific, in Algorithm 1, we sample actions via a changing policy , which yields a Markov chain:
(23) |
Now, from time , we repeatedly use the policy , which yields an auxiliary Markov chain:
(24) |
where we denote correspondingly. Clearly we have for ,
(25) |
Since we use the same policy from time , the Markov chain given in (24) in this time slot is uniformly ergodic, and thus satisfies Assumption 2. Therefore, we can bound as
(26) |
Observe that
(27) |
It remains to deal with . We proceed as follows
(28) |
where (i) follows from Assumption 1 and (ii) follows from Lemma 9. It remains to bound the second term of the right hand side. To this end, we first write as
(29) |
Similarly, for we have
(30) |
Then we obtain the following bound
Then we have a dynamical form as
(31) |
Applying (31) recursively yields
(32) |
∎
The next few lemmas are closely related to the update rule of AMSGrad.
Lemma 8.
Lemma 9.
Let for be sequences generated by Algorithm 1 and denote . Then
(34) |
Proof.
Lemma 10.
Proof.
We refer the reader to the proof of Lemma A.2 in Zhou et al., (2018) for more details by reducing their proof to the case where . ∎
Since the update rule of AMSGrad is complicated, it is often useful to introduce an auxiliary sequence . If we define , then for , let
(38) |
The following lemma captures the property of and its connection with .
C.2 Proof of Theorem 1
In the remainder, we complete this proof by taking the following three steps.
Step 1: Establishing a convergent sequence
First, we observe that is Lipschitz continuous according to Lemma 1. Then we have
(43) |
The term is the key to deal with under non-i.i.d. sampling, where a non-zero bias arises in the gradient estimation. We bound this term as
where (i) follows from Lemma 6 and (ii) follows because is positive diagonal matrix and each entry is bounded as in Lemma 8.
Next we bound the term as
where (i) follows from Cauchy-Schwarz inequality, (ii) follows from Lemma 11, (iii) follows because and (iv) follows by the update rule of Algorithm 1.
Substituting the upper bounds of the terms and in (C.2) yields
Next we rearrange the above inequality and take expectation over all the randomness to obtain
(44) |
where in the last equation we denote for brevity, and this notation is used in the sequel as well. We emphasize that the filtration contains all the samples up to time except the samples for estimating . Thus we have where the expectation is taken over the randomness in EstQ algorithm.
Step 2: Bounding the bias term
In the following, we bound the bias term . Observe that
It turns out that the terms are easier to bound and the term is the key to bound the bias. For the clarity of presentation, we first bound the terms .
To bound the term , we have
where (i) follows from Cauchy-Schwarz inequality, (ii) follows because , (iii) follows from Lemma 1 and Lemma 8, (iv) follows by the triangle inequality, and (v) follows due to Lemma 9.
Similarly, we can bound the term as follows:
Next, we bound the term and obtain
Last, it remains to bound the term . Observe that . We first deal with as
where (i) follows from Lemma 6. The key to bound the bias is to bound the term , which has been done in Lemma 7.
To conclude, the bias term can be bounded for a fixed as:
(45) |
Step 3: Establishing convergence to a stationary point
For the case with a constant stepsize , we choose . To take the summation over the time steps, notice that the bound in (C.2) holds only when , and hence we separate the summation of the bias term into two parts as follows:
(46) |
Then applying (C.2) to (C.2) yields
(47) |
where (i) follows from Lemma 10, (ii) follows from the definition of and (iii) follows since
Appendix D Proof of Theorem 2
The proof of Theorem 2 starts from steps similar to those of Theorem 1. The difference starts from (C.2). Here we consider a stepsize that is not constant. Then we divide both sides of (C.2) by and obtain
where
We choose . Again we choose if and if . If , then choice of yields
where the last inequality follows because
(48) |
By taking the summation over time steps, we obtain
(49) |
where (i) follows since is decreasing and from the definition of , and (ii) follows from Lemma 10.
For clarity we bound separately. The key observation is that is uniformly bounded as
(50) |
Thus
(51) |
Appendix E Proof of Proposition 1
In the following, we show how to adapt the proof techniques in analyzing PG-AMSGrad to PG-SGD. We first reduce Lemma 7 to the vanilla PG case.
Proof.
Since the major part of the proof is similar to that of Lemma 7, we only emphasize the steps that differ.
For notational brevity, we denote . Then we still have
(53) |
Then we use the steps similar to those in building an auxiliary Markov chain (24) which is generated by the policy from time . Similarly to (C.1), we have
(54) |
where (i) follows from (26), (C.1) and (C.1). Observe that in PG-SGD, for any we have,
Thus, we complete the proof by substituting the above observation into (54) and then into (53).
∎
Proof of Proposition 1: Following from the Lipschitz condition of in Lemma 1, we obtain
Then we rearrange the above inequality and take expectation over all the randomness to have
(55) |
where the filtration contains all the samples up to time except the samples for estimating in the EstQ algorithm. Thus we have where the expectation is taken over the randomness in EstQ algorithm. Observe that if , we have
Next we bound the terms and .
Observe that . We deal with and have
Next we bound and obtain
Similarly, we can bound term as
Last, we bound term and obtain
Thus for any , we have
(56) |
Proof of convergence under constant stepsize:
We first consider a constant stepsize for . Choose . We rearrange (55) and take the summation over the time steps to obtain
where (i) follows because , and (ii) follows from (56). Then we divide both sides of the above inequality by , and have
Proof of convergence under diminishing stepsize:
Appendix F Proof of Lemma 3
We bound the expectation of the bias by constructing a new Markov chain and applying useful techniques from information theory. Before deriving the bound, we first introduce some technical lemmas.
Proof.
Lemma 14.
Lemma 15.
(Bhandari et al., 2018, Lemma 11) Let . Fix . Then is uniformly bounded by
and it is Lipschitz continuous as given by
Lemma 16.
Remark 1.
The notation indicates that the random variable and are independent conditioned on , which is a standard notation in information theory.
Proof of Lemma 3: Now we bound the bias of the gradient estimation. We first develop the connection between and using Lemma 15. For notational brevity, we define a random tuple and clearly . We denote
where as defined in (16).
Next, we apply Lemma 9 to obtain
Thus we can relate and by using the Lipschitz property in Lemma 15 as follows.
(59) |
Next, we seek to bound using Lemma 16. Observe that given any deterministic , we have
Since is non-random, we have . Now we are ready to bound with Lemma 16 via constructing a random process satisfying (58). To do so, consider random variables and drawn independently from the marginal distribution of and , thus . Further, we can obtain since and are independent. Combining Lemma 15 and Lemma 16 yields
(60) |
Appendix G Proof of Theorem 3
Different from the regret-bound analysis of AMSGrad in conventional optimization Reddi et al., (2018), our analysis here focuses on the convergence rate. In fact, a slight modification of our proof also provides the convergence rate of AMSGrad for conventional strongly convex optimization, which can be of independent interest. Moreover, we provide results under a constant stepsize and under Markovian sampling, neither of which has been studied in Reddi et al., (2018).
To proceed with the proof, we first observe that
Clearly due to Assumption 4. We start from the update of when .
where (i) follows from Cauchy-Schwarz inequality, and (ii) holds because .
Next, we take the expectation over all samples used up to time step on both sides, which still preserves the inequality. Since we consider the Markovian sampling case, the estimation of the gradient is biased. That is,
(61) |
where and this notation will be used in the remainder for simplicity.
Thus we have
where (i) follows from Lemma 2 and because , (ii) follows because and , and (iii) follows from Lemma 14 and Assumption 4. By rearranging the terms in the above inequality and taking the summation over time steps, we have
where (i) follows from . By further implementing the first term in the right-hand side of the last inequality, we can then bound the sum as
Next, we further bound the above term as
(62) |
where (i) follows from Assumption 4 and because .
Since we consider the case with a constant stepsize , we continue to bound (G) as follows
(63) |
where (i) follows because
Then we are ready to obtain the upper bound by applying Lemma 3. Choosing if and if , we obtain
Finally, applying Jensen’s inequality yields
(64) |
where
Appendix H Proof of Theorem 4
We first provide the following useful lemma.
Lemma 17.
Let and for . Then
(65) |
Proof.
The proof follows from taking the standard sum of geometric sequences.
(66) |
∎
Proof of Theorem 4: The proof starts with steps similar to those of Theorem 3. The difference begins from (G), where we now consider a diminishing stepsize . We then have
Then we are ready to obtain the upper bound by applying Lemma 3. Choosing if and if , we obtain
Finally, applying Jensen’s inequality completes our proof as
where