Accelerating Value Iteration with Anchoring
Abstract
Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at a $\mathcal{O}(\gamma^k)$-rate, where $\gamma$ is the discount factor. Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an anchoring mechanism (distinct from Nesterov’s acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits a $\mathcal{O}(1/k)$-rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\ge 1-1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of $4$, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.
1 Introduction
Value Iteration (VI) is foundational to the theory and practice of modern dynamic programming (DP) and reinforcement learning (RL). It is well known that when a discount factor $\gamma<1$ is used, (exact) VI is a contractive iteration in the $\|\cdot\|_\infty$-norm and therefore converges. The progress of VI is measured by the Bellman error in practice (as the distance to the fixed point is not computable), and much prior work has been dedicated to analyzing the rates of convergence of VI and its variants.
Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. The classical $\mathcal{O}(\gamma^k)$-rate of VI is inadequate as many practical setups use $\gamma\approx 1$ or even $\gamma=1$ for the discount factor. (Not to mention that VI may not converge when $\gamma=1$.) Moreover, most prior works on accelerating VI focused on the Bellman consistency operator (policy evaluation) as its linearity allows eigenvalue analyses, but the Bellman optimality operator (control) is the more relevant object in modern RL.
Contribution.
In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an “anchoring” mechanism (distinct from Nesterov’s acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits a $\mathcal{O}(1/k)$-rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\ge 1-1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of $4$, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.
1.1 Notations and preliminaries
We quickly review basic definitions and concepts of Markov decision processes (MDP) and reinforcement learning (RL). For further details, refer to standard references such as [69, 84, 81].
Markov Decision Process.
Let $\Delta(\mathcal{X})$ be the space of probability distributions over a set $\mathcal{X}$. Write $(\mathcal{S},\mathcal{A},P,r,\gamma)$ to denote the MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $P\colon\mathcal{S}\times\mathcal{A}\to\Delta(\mathcal{S})$, reward $r\colon\mathcal{S}\times\mathcal{A}\to\mathbb{R}$, and discount factor $\gamma\in(0,1]$. Denote $\pi\colon\mathcal{S}\to\Delta(\mathcal{A})$ for a policy, and
$$V^\pi(s)=\mathbb{E}^\pi\Big[\textstyle\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,s_0=s\Big],\qquad Q^\pi(s,a)=\mathbb{E}^\pi\Big[\textstyle\sum_{t=0}^{\infty}\gamma^t r(s_t,a_t)\,\Big|\,s_0=s,\ a_0=a\Big]$$
for the $V$- and $Q$-value functions, where $\mathbb{E}^\pi$ denotes the expected value over all trajectories induced by $P$ and $\pi$. We say $V^\star$ and $Q^\star$ are the optimal $V$- and $Q$-value functions if $V^\star=\sup_\pi V^\pi$ and $Q^\star=\sup_\pi Q^\pi$. We say $\pi^\star$ is an optimal policy if $V^{\pi^\star}=V^\star$ and $Q^{\pi^\star}=Q^\star$. (If the argmax is not unique, break ties arbitrarily.)
Value Iteration.
Let $F(\mathcal{X})$ denote the space of bounded measurable real-valued functions over $\mathcal{X}$. With the given MDP $(\mathcal{S},\mathcal{A},P,r,\gamma)$, for $V\in F(\mathcal{S})$ and $Q\in F(\mathcal{S}\times\mathcal{A})$, define the Bellman consistency operators $T^\pi$ as
$$T^\pi V(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s)}\Big[r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big]\Big],\qquad T^\pi Q(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')}\big[Q(s',a')\big]$$
for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, and the Bellman optimality operators $T^\star$ as
$$T^\star V(s)=\sup_{a\in\mathcal{A}}\Big\{r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big]\Big\},\qquad T^\star Q(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\Big[\sup_{a'\in\mathcal{A}}Q(s',a')\Big]$$
for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. For notational conciseness, we write $T^\pi V=r^\pi+\gamma P^\pi V$ and $T^\pi Q=r^\pi+\gamma P^\pi Q$, where $r^\pi$ is the reward induced by policy $\pi$ and $P^\pi$, defined as
$$(P^\pi V)(s)=\mathbb{E}_{a\sim\pi(\cdot\mid s),\,s'\sim P(\cdot\mid s,a)}\big[V(s')\big],\qquad (P^\pi Q)(s,a)=\mathbb{E}_{s'\sim P(\cdot\mid s,a),\,a'\sim\pi(\cdot\mid s')}\big[Q(s',a')\big],$$
are the transition probabilities induced by policy $\pi$. We define VI for the Bellman consistency and optimality operators as
$$V^{k+1}=T^\pi V^k,\qquad Q^{k+1}=T^\pi Q^k,\qquad V^{k+1}=T^\star V^k,\qquad Q^{k+1}=T^\star Q^k\qquad\text{for }k=0,1,\dots,$$
where $V^0$ and $Q^0$ are initial points. VI for control, after executing $k$ iterations, returns the near-optimal policy $\pi_k$ as a greedy policy satisfying $T^{\pi_k}V^k=T^\star V^k$ (or $T^{\pi_k}Q^k=T^\star Q^k$).
For $\gamma<1$, both the Bellman consistency and optimality operators are $\gamma$-contractions in the $\|\cdot\|_\infty$-norm, and, by Banach’s fixed-point theorem [5], the VIs converge to the unique fixed points $V^\pi$, $Q^\pi$, $V^\star$, and $Q^\star$ with $\mathcal{O}(\gamma^k)$-rate. For notational unity, we use the symbol $U$ when both $V$ and $Q$ can be used. Since $\|U^k-U^\star\|_\infty\le\gamma^k\|U^0-U^\star\|_\infty$, VI exhibits the rate
$$\|TU^k-U^k\|_\infty\;\le\;(1+\gamma)\,\|U^k-U^\star\|_\infty\;\le\;\gamma^k(1+\gamma)\,\|U^0-U^\star\|_\infty \tag{$\star$}$$
on the Bellman error, where $T$ is the Bellman consistency or optimality operator, $U^0$ is a starting point, and $U^\star$ is the fixed point of $T$. We say $U\le U'$ or $U\ge U'$ if $U(s)\le U'(s)$ or $U(s)\ge U'(s)$ for all $s$ (and, for $Q$-type functions, for all $s$ and $a$), respectively.
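As a concrete reference point, the following is a minimal sketch of exact VI on a small made-up tabular MDP, tracking the Bellman error $\|T^\star V^k-V^k\|_\infty$ and extracting a greedy policy at the end; the MDP instance, array shapes, and iteration budget are illustrative choices and not taken from the paper.

```python
# Minimal sketch: exact value iteration on a toy tabular MDP, tracking the Bellman error.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)            # P[s, a, :] is a distribution over next states
r = rng.random((n_states, n_actions))        # reward r(s, a)

def bellman_optimality(V):
    """(T* V)(s) = max_a [ r(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]."""
    return (r + gamma * P @ V).max(axis=1)

V = np.zeros(n_states)
for k in range(200):
    TV = bellman_optimality(V)
    bellman_error = np.max(np.abs(TV - V))   # ||T* V^k - V^k||_inf
    V = TV                                   # standard VI step: V^{k+1} = T* V^k

greedy_policy = (r + gamma * P @ V).argmax(axis=1)   # greedy (near-optimal) policy
print(bellman_error, greedy_policy)
```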
Fixed-point iterations.
Given an operator $T$, we say $U^\star$ is a fixed point if $TU^\star=U^\star$. Since Banach [5], the standard fixed-point iteration
$$U^{k+1}=TU^k\qquad\text{for }k=0,1,\dots$$
has been commonly used to find fixed points. Note that VI for policy evaluation and control are fixed-point iterations with the Bellman consistency and optimality operators, respectively. In this work, we also consider the Halpern iteration
$$U^{k+1}=\beta_{k+1}U^0+(1-\beta_{k+1})\,TU^k\qquad\text{for }k=0,1,\dots,$$
where $U^0$ is an initial point and $\beta_{k+1}\in(0,1)$.
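The following sketch contrasts the standard fixed-point iteration with the Halpern iteration on a generic affine contraction; the operator and the anchor schedule $\beta_{k+1}=1/(k+2)$ are illustrative choices here (the anchor schedule used by Anc-VI is given in Section 2).

```python
# Minimal sketch: standard fixed-point iteration vs. Halpern iteration for a toy contraction.
import numpy as np

A = np.array([[0.5, 0.2], [0.1, 0.6]])       # row sums < 1, so x -> A x + b is an inf-norm contraction
b = np.array([1.0, -1.0])
T = lambda x: A @ x + b

x0 = np.zeros(2)

x = x0.copy()
for k in range(100):                          # standard fixed-point iteration: x^{k+1} = T x^k
    x = T(x)

y = x0.copy()
for k in range(100):                          # Halpern iteration: y^{k+1} = beta y^0 + (1 - beta) T y^k
    beta = 1.0 / (k + 2)                      # illustrative anchor schedule, decreasing to 0
    y = beta * x0 + (1 - beta) * T(y)

print(np.max(np.abs(T(x) - x)), np.max(np.abs(T(y) - y)))   # fixed-point residuals
```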
1.2 Prior works
Value Iteration.
Value iteration (VI) was first introduced in the DP literature [8] for finding the optimal value function, and its variant, approximate VI [11, 30, 56, 32, 19, 90, 81], considers approximate evaluations of the Bellman optimality operator. In RL, VI and approximate VI have served as the basis of RL algorithms such as fitted value iteration [29, 57, 52, 87, 50, 36] and temporal difference learning [80, 89, 41, 94, 54]. There is a line of research that emulates VI by learning a model of the MDP dynamics [85, 83, 62] and applying a modified Bellman operator [7, 33]. Asynchronous VI, another variant of VI that updates the coordinates of the value function in an asynchronous manner, has also been studied in both the RL and DP literatures [11, 9, 88, 100].
Fixed-point iterations.
The Banach fixed-point theorem [5] establishes the convergence of the standard fixed-point iteration with a contractive operator. The Halpern iteration [39] converges for nonexpansive operators on Hilbert spaces [96] and uniformly smooth Banach spaces [70, 97]. (To clarify, the $\|\cdot\|_\infty$-norm is not uniformly smooth.)
The fixed-point residual $\|TU^k-U^k\|$ is a commonly used error measure for fixed-point problems. In general normed spaces, the Halpern iteration was shown to exhibit a $\mathcal{O}(1/\log k)$-rate for (nonlinear) nonexpansive operators [48] and a $\mathcal{O}(1/k)$-rate for linear nonexpansive operators [17] on the fixed-point residual. In Hilbert spaces, [72] first established a $\mathcal{O}(1/k)$-rate for the Halpern iteration, and the constant was later improved by [49, 43]. For contractive operators, [65] proved exact optimality of the Halpern iteration through an exactly matching complexity lower bound.
Acceleration.
Since Nesterov’s seminal work [61], there has been a large body of research on acceleration in convex minimization. Gradient descent [15] can be accelerated to efficiently reduce function value and squared gradient magnitude for smooth convex minimization problems [61, 44, 45, 46, 102, 21, 60] and smooth strongly convex minimization problems [59, 91, 64, 86, 73]. Motivated by Nesterov acceleration, inertial fixed-point iterations [51, 22, 75, 70, 42] have also been suggested to accelerate fixed-point iterations. Anderson acceleration [2], another acceleration scheme for fixed-point iterations, has recently been studied with interest [6, 74, 93, 101].
In DP and RL, prioritized sweeping [55] is a well-known method that changes the order of updates to accelerate convergence, and several variants [68, 53, 95, 3, 18] have been proposed. Speedy Q-learning [4] modifies the update rule of Q-learning and uses aggressive learning rates for acceleration. Recently, there has been a line of research that applies acceleration techniques from other areas to VI: [34, 79, 28, 67, 27, 76] use Anderson acceleration of fixed-point iterations, [92, 37, 38, 12, 1] use Nesterov acceleration of convex optimization, and [31] uses ideas inspired by PID controllers in control theory. Among those works, [37, 38, 1] applied Nesterov acceleration to obtain theoretically accelerated convergence rates, but those analyses require certain reversibility conditions or restrictions on the eigenvalues of the transition probability induced by the policy.
The anchor acceleration, a new acceleration mechanism distinct from Nesterov’s, has lately gained attention in convex optimization and fixed-point theory. The anchoring mechanism, which retracts iterates towards the initial point, has been used to accelerate algorithms for minimax optimization and fixed-point problems [71, 47, 98, 65, 43, 20, 99, 78], and we focus on it in this paper.
Complexity lower bound.
With the information-based complexity analysis [58], complexity lower bounds on first-order methods for convex minimization problems have been thoroughly studied [59, 23, 25, 13, 14, 24]. If a complexity lower bound matches an algorithm’s convergence rate, it establishes optimality of the algorithm [58, 44, 73, 86, 26, 65]. In fixed-point problems, [16] established a lower bound on the distance to solution for the Halpern iteration with a nonexpansive operator in uniformly smooth Banach spaces. In [17], a lower bound on the fixed-point residual for the general Mann iteration with a nonexpansive linear operator, which includes the standard fixed-point iteration and the Halpern iteration, was provided in the $\ell_\infty$-space. In Hilbert spaces, [65] showed an exact complexity lower bound on the fixed-point residual for deterministic fixed-point iterations with $\gamma$-contractive and nonexpansive operators. Finally, [37] provided a lower bound on the distance to the optimal value function for fixed-point iterations satisfying a span condition with the Bellman consistency and optimality operators; we discuss this lower bound in Section 4.
2 Anchored Value Iteration
Let $T$ be a $\gamma$-contractive (in the $\|\cdot\|_\infty$-norm) Bellman consistency or optimality operator. The Anchored Value Iteration (Anc-VI) is
$$U^k=\beta_kU^0+(1-\beta_k)\,TU^{k-1} \tag{Anc-VI}$$
for $k=1,2,\dots$, where $\beta_k=1/\big(\sum_{i=0}^{k}\gamma^{-2i}\big)$ and $U^0$ is an initial point. In this section, we present accelerated convergence rates of Anc-VI for both the Bellman consistency and optimality operators and for both $V$- and $Q$-value iterations. For the control setup, where the Bellman optimality operator is used, Anc-VI returns the near-optimal policy $\pi_k$ as a greedy policy satisfying $T^{\pi_k}U^k=T^\star U^k$ after executing $k$ iterations.
Notably, Anc-VI obtains the next iterate as a convex combination between the output of $T$ and the starting point $U^0$. We call the $\beta_kU^0$ term the anchor term since, loosely speaking, it serves to pull the iterates toward the starting point $U^0$. The strength of the anchor mechanism diminishes as the iteration progresses since $\{\beta_k\}_{k\ge1}$ is a decreasing sequence.
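For concreteness, the following is a minimal sketch of Anc-VI with the Bellman optimality operator on a made-up toy MDP, using the anchor coefficients $\beta_k=1/\sum_{i=0}^{k}\gamma^{-2i}$ written above; the MDP instance and iteration budget are illustrative.

```python
# Minimal sketch: Anc-VI with the Bellman optimality operator on a toy tabular MDP.
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma = 5, 3, 0.99

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def T(V):                                     # Bellman optimality operator
    return (r + gamma * P @ V).max(axis=1)

U0 = np.zeros(n_states)                       # anchor / initial point
U = U0.copy()
for k in range(1, 201):
    beta = 1.0 / np.sum(gamma ** (-2.0 * np.arange(k + 1)))  # beta_k = 1 / sum_{i=0}^k gamma^{-2i}
    U = beta * U0 + (1 - beta) * T(U)         # anchored update (Anc-VI)

print(np.max(np.abs(T(U) - U)))               # Bellman error ||T U^k - U^k||_inf
```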
The anchor mechanism was introduced [39, 72, 49, 65, 17, 48] for general nonexpansive operators and for $\ell_2$-nonexpansive and $\ell_2$-contractive operators. The optimal method for $\ell_2$-nonexpansive and $\ell_2$-contractive operators in [65] shares the same coefficients with Anc-VI, and the convergence results for general nonexpansive operators in [17, 48] are applicable to Anc-VI with nonexpansive Bellman optimality and consistency operators. While our anchor mechanism does bear a formal resemblance to those of prior works, our convergence rates and point convergence are neither a direct application nor a direct adaptation of the prior convergence analyses. The prior analyses for $\ell_2$-nonexpansive and $\ell_2$-contractive operators do not apply to Bellman operators, and the prior analyses for general nonexpansive operators have slower rates and do not provide point convergence, while our Theorem 3 does. Our analyses specifically utilize the structure of the Bellman operators to obtain the faster rates and point convergence.
The accelerated rate of Anc-VI for the Bellman optimality operator is more technically challenging and is, in our view, the stronger contribution. However, we start by presenting the result for the Bellman consistency operator because it is commonly studied in the prior RL theory literature on accelerating value iteration [37, 38, 1, 31] and because the analysis in the Bellman consistency setup will serve as a good conceptual stepping stone towards the analysis in the Bellman optimality setup.
2.1 Accelerated rate for Bellman consistency operator
First, for general state-action spaces, we present the accelerated convergence rate of Anc-VI for the Bellman consistency operator.
Theorem 1.
For $\gamma<1$, both rates of Theorem 1 are strictly faster than the standard rate $(\star)$ of VI.
The second rate of Theorem 1, which has the additional requirement $TU^0\ge U^0$ or $TU^0\le U^0$ on the initial point, is faster than the standard rate $(\star)$ of VI for all $0<\gamma\le1$. Interestingly, in the $\gamma\approx1$ regime, Anc-VI achieves a $\mathcal{O}(1/k)$-rate while VI has a $\mathcal{O}(1)$-rate. We briefly note that the conditions $TU^0\ge U^0$ and $TU^0\le U^0$ have been used in analyses of variants of VI [69, Theorem 6.3.11], [77, p.3].
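To make the comparison concrete, consider the illustrative regime $\gamma=1-1/k$ with $k=1000$, so $\gamma=0.999$ (an arithmetic illustration, not a statement of the theorems): the contraction factor of standard VI after $k$ iterations is
$$\gamma^k=(0.999)^{1000}\approx e^{-1}\approx 0.37,$$
so the guarantee $(\star)$ has reduced the Bellman error only by a constant factor, whereas a $\mathcal{O}(1/k)$-rate corresponds to a reduction on the order of $10^{-3}$.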
In the following, we briefly outline the proof of Theorem 1 while deferring the full description to Appendix B. In the outline, we highlight a particular step that crucially relies on the linearity of the Bellman consistency operator. In the analysis for the Bellman optimality operator in Theorem 2, resolving this step despite the nonlinearity is the key technical challenge.
Proof outline of Theorem 1.
Recall that we can write the Bellman consistency operator as $T^\pi V=r^\pi+\gamma P^\pi V$ and $T^\pi Q=r^\pi+\gamma P^\pi Q$. Since $T^\pi$ is a linear operator (arguably, $T^\pi$ is affine, not linear, but we follow the convention of [69] and say $T^\pi$ is linear), we obtain an expression for the Bellman error $T^\pi U^k-U^k$ in which the first equality follows from the definition of Anc-VI and the fixed-point property of $U^\star$, while the last equality follows from induction. Taking the $\|\cdot\|_\infty$-norm of both sides, we conclude the stated rate.
∎
2.2 Accelerated rate for Bellman optimality operator
We now present the accelerated convergence rate of Anc-VI for the Bellman optimality operator.
Our analysis uses what we call the Bellman anti-optimality operator, defined as
$$\check{T}^\star V(s)=\inf_{a\in\mathcal{A}}\Big\{r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\big[V(s')\big]\Big\},\qquad \check{T}^\star Q(s,a)=r(s,a)+\gamma\,\mathbb{E}_{s'\sim P(\cdot\mid s,a)}\Big[\inf_{a'\in\mathcal{A}}Q(s',a')\Big]$$
for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, $V\in F(\mathcal{S})$, and $Q\in F(\mathcal{S}\times\mathcal{A})$. (The sup is replaced with an inf.) When $\gamma<1$, the Bellman anti-optimality operator is $\gamma$-contractive and has a unique fixed point $\check{U}^\star$ by the exact same arguments that establish $\gamma$-contractiveness of the standard Bellman optimality operator.
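For concreteness, the anti-optimality operator differs from the optimality operator only in replacing the maximization over actions with a minimization; the following minimal sketch, on a made-up toy MDP, shows the two side by side.

```python
# Minimal sketch: Bellman optimality operator (max over actions) vs. anti-optimality operator (min).
import numpy as np

rng = np.random.default_rng(2)
n_states, n_actions, gamma = 4, 2, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def T_opt(V):        # Bellman optimality operator:      sup over actions
    return (r + gamma * P @ V).max(axis=1)

def T_anti(V):       # Bellman anti-optimality operator: inf over actions
    return (r + gamma * P @ V).min(axis=1)

# Both are gamma-contractions in the sup-norm; iterating either one converges to its fixed point.
```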
Theorem 2.
Anc-VI with the Bellman optimality operator exhibits the same accelerated convergence rate as Anc-VI with the Bellman consistency operator. As in Theorem 1, the rate of Theorem 2 also becomes $\mathcal{O}(1/k)$ when $\gamma\approx1$, while VI has a $\mathcal{O}(1)$-rate.
Proof outline of Theorem 2.
The key technical challenge of the proof comes from the fact that the Bellman optimality operator is non-linear. Similar to the Bellman consistency operator case, we have
where $\pi_k$ is the greedy policy satisfying $T^{\pi_k}U^k=T^\star U^k$, and the last inequality follows by induction and the monotonicity of the Bellman optimality operator. The key step uses the greedy policies $\pi_k$, which are well defined when the action space is finite. When the action space is infinite, greedy policies may not exist, so we use the Hahn–Banach extension theorem to overcome this technicality. The full argument is provided in Appendix B.
To lower-bound the Bellman error, we use a similar line of reasoning with the Bellman anti-optimality operator. Combining the upper and lower bounds, we conclude the accelerated rate of Theorem 2. ∎
For $\gamma<1$, the rates of Theorems 1 and 2 can be translated to a bound on the distance to solution $\|U^k-U^\star\|_\infty$. The resulting rate is worse than the rate of (classical) VI by a constant factor. Therefore, Anc-VI is better than VI in terms of the Bellman error, but it is not better than VI in terms of distance to solution.
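For completeness, the translation from Bellman error to distance to solution follows from a routine contraction argument (a standard derivation, not specific to Anc-VI): for $\gamma<1$,
$$\|U^k-U^\star\|_\infty\;\le\;\|U^k-TU^k\|_\infty+\|TU^k-TU^\star\|_\infty\;\le\;\|TU^k-U^k\|_\infty+\gamma\,\|U^k-U^\star\|_\infty,$$
which rearranges to $\|U^k-U^\star\|_\infty\le\tfrac{1}{1-\gamma}\,\|TU^k-U^k\|_\infty$.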
3 Convergence when $\gamma=1$
Undiscounted MDPs are not commonly studied in the DP and RL theory literature due to the following difficulties: the Bellman consistency and optimality operators may not have fixed points, VI is a nonexpansive (not contractive) fixed-point iteration and may not converge to a fixed point even if one exists, and the interpretation of a fixed point as the (optimal) value function becomes unclear when the fixed point is not unique. However, many modern deep RL setups actually do not use discounting (as a specific example, the classical policy gradient theorem [82] calls for a factor $\gamma^t$ weighting the gradient contribution at time $t$, but many modern deep policy gradient methods use $\gamma=1$ in this first instance, dropping the factor, while using $\gamma<1$ in the return estimation [63]), and this empirical practice makes the theoretical analysis with $\gamma=1$ relevant.
In this section, we show that Anc-VI converges to fixed points of the Bellman consistency and optimality operators of undiscounted MDPs. While a full treatment of undiscounted MDPs is beyond the scope of this paper, we show that fixed points, if they exist, can be found, and we therefore argue that the inability to find fixed points should not be considered an obstacle to studying the $\gamma=1$ setup.
We first state our convergence result for finite state-action spaces.
Theorem 3.
Let $\gamma=1$. Let $T$ be the nonexpansive Bellman consistency or optimality operator for $V$ or $Q$. Assume a fixed point exists (not necessarily unique). If $U^0\le TU^0$, then Anc-VI exhibits a $\mathcal{O}\big(\|U^0-U^\star\|_\infty/k\big)$ rate on the Bellman error
for any fixed point $U^\star$ satisfying $U^0\le U^\star$. Furthermore, $U^k\to U^\infty$ for some fixed point $U^\infty$.
If rewards are nonnegative, then the condition $U^0\le TU^0$ is satisfied with $U^0=0$. So, under this mild condition, Anc-VI with $\gamma=1$ converges with a $\mathcal{O}(1/k)$-rate on the Bellman error. To clarify, the point convergence $U^k\to U^\infty$ has no rate, i.e., we provide no rate on $\|U^k-U^\infty\|_\infty$, while the Bellman error $\|TU^k-U^k\|_\infty$ converges at a $\mathcal{O}(1/k)$-rate. In contrast, standard VI does not guarantee convergence in this setup.
We also point out that the convergence of the Bellman error does not immediately imply point convergence, i.e., $\|TU^k-U^k\|_\infty\to0$ does not immediately imply $U^k\to U^\infty$, when $\gamma=1$. Rather, we show (i) $\{U^k\}_{k\ge0}$ is a bounded sequence, (ii) any convergent subsequence converges to a fixed point, and (iii) $\{U^k\}_{k\ge0}$ is elementwise monotonically nondecreasing and therefore has a single limit.
Next, we state our convergence result for general state-action spaces.
Theorem 4.
Let $\gamma=1$. Let the state and action spaces be general (possibly infinite) sets. Let $T$ be the nonexpansive Bellman consistency or optimality operator for $V$ or $Q$, and assume $T$ is well defined (well-definedness of $T$ requires a $\sigma$-algebra on the state and action spaces, expectations with respect to the transition probability and the policy to be well defined, boundedness and measurability of the output of the Bellman operators, etc.). Assume a fixed point exists (not necessarily unique). If $U^0\le TU^0$, then Anc-VI exhibits a $\mathcal{O}\big(\|U^0-U^\star\|_\infty/k\big)$ rate on the Bellman error
for any fixed point $U^\star$ satisfying $U^0\le U^\star$. Furthermore, $U^k\to U^\infty$ pointwise monotonically for some fixed point $U^\infty$.
The pointwise convergence in infinite state-action spaces is, in our view, a non-trivial contribution. When the state-action space is finite, pointwise convergence directly implies convergence in $\|\cdot\|_\infty$, and in this sense, Theorem 4 is a generalization of Theorem 3. However, when the state-action space is infinite, pointwise convergence does not necessarily imply uniform convergence, i.e., $U^k\to U^\infty$ pointwise does not necessarily imply $\|U^k-U^\infty\|_\infty\to0$.
4 Complexity lower bound
We now present a complexity lower bound establishing optimality of Anc-VI.
Theorem 5.
Let $k\ge0$ and $\gamma\in(0,1]$. Then there exists an MDP with $|\mathcal{A}|=1$ (which implies that the Bellman consistency and optimality operators for $V$ and $Q$ all coincide as a single operator $T$) such that $T$ has a fixed point $U^\star$ and the Bellman error $\|TU^k-U^k\|_\infty$ is lower bounded, matching the accelerated rate of Anc-VI up to a constant factor of $4$,
for any iterates satisfying the span condition
$$U^i\in U^0+\operatorname{span}\big\{TU^0-U^0,\,TU^1-U^1,\,\dots,\,TU^{i-1}-U^{i-1}\big\},\qquad i=1,\dots,k.$$
Proof outline of Theorem 5.
Without loss of generality, assume and . Consider the MDP such that
Then, , and . Under the span condition, we can show that . Then, we get
and this implies
Taking the absolute value on both sides,
Therefore, we conclude
∎
Note that the case $\gamma=1$ is included in Theorem 5. When $\gamma=1$, the lower bound of Theorem 5 exactly matches the upper bound of Theorem 3.
The lower bound establishes optimality of the second rates of Theorems 1 and 2 up to a constant factor of $4$. Theorem 5 improves upon the prior state-of-the-art complexity lower bound established in the proof of [37, Theorem 3]. (In [37, Theorem 3], a lower bound on the distance to the optimal value function is provided. Their result has an implicit dependence on the initial distance to the optimal value function, so we make the dependence explicit, and we translate their result to a lower bound on the Bellman error. Once this is done, the difference between our lower bound of Theorem 5 and that of [37, Theorem 3] is a multiplicative factor. The worst-case MDP of [37, Theorem 3] and our worst-case MDP primarily differ in the rewards, while the states and the transition probabilities are almost the same.)
The so-called “span condition” of Theorem 5 is arguably very natural and is satisfied by standard VI and Anc-VI. The span condition is commonly used in the construction of complexity lower bounds on first-order optimization methods [59, 23, 25, 13, 14, 65] and has been used in the prior state-of-the-art lower bound for standard VI [37, Theorem 3]. However, designing an algorithm that breaks the lower bound of Theorem 5 by violating the span condition remains a possibility. In optimization theory, there is precedent for lower bounds being broken by violating seemingly natural and minute conditions [40, 35, 98].
5 Approximate Anchored Value Iteration
In this section, we show that the anchoring mechanism is robust against evaluation errors of the Bellman operator, just as much as the standard approximate VI.
Let $\gamma<1$ and let $T^\star$ be the Bellman optimality operator. The Approximate Anchored Value Iteration (Apx-Anc-VI) is
$$U^k=\beta_kU^0+(1-\beta_k)\big(T^\star U^{k-1}+\epsilon^{k-1}\big) \tag{Apx-Anc-VI}$$
for $k=1,2,\dots$, where $\beta_k=1/\big(\sum_{i=0}^{k}\gamma^{-2i}\big)$, $U^0$ is an initial point, and $\{\epsilon^k\}_{k\ge0}$ is the error sequence modeling approximate evaluations of $T^\star U^k$.
Of course, the classical Approximate Value Iteration (Apx-VI) is
$$U^k=T^\star U^{k-1}+\epsilon^{k-1} \tag{Apx-VI}$$
for $k=1,2,\dots$, where $U^0$ is an initial point.
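The following sketch runs Apx-VI and Apx-Anc-VI side by side when the Bellman optimality operator is evaluated with additive bounded noise; the toy MDP, the noise model, and the reuse of the Anc-VI anchor coefficients are illustrative assumptions of this sketch, not the paper's experimental setup.

```python
# Minimal sketch: Apx-VI vs. Apx-Anc-VI with noisy evaluations of the Bellman optimality operator.
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions, gamma, eps = 6, 3, 0.95, 1e-2

P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

T = lambda V: (r + gamma * P @ V).max(axis=1)                  # exact operator
T_approx = lambda V: T(V) + rng.uniform(-eps, eps, n_states)   # bounded evaluation error

U0 = np.zeros(n_states)

V = U0.copy()
for k in range(100):                                           # Apx-VI
    V = T_approx(V)

U = U0.copy()
for k in range(1, 101):                                        # Apx-Anc-VI
    beta = 1.0 / np.sum(gamma ** (-2.0 * np.arange(k + 1)))
    U = beta * U0 + (1 - beta) * T_approx(U)

print(np.max(np.abs(T(V) - V)), np.max(np.abs(T(U) - U)))      # exact Bellman errors
```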
Fact 1 (Classical result, [11, p.333]).
Let $\gamma<1$ be the discount factor. Let $T^\star$ be the Bellman optimality operator for $V$ or $Q$. Let $U^\star$ be the fixed point of $T^\star$. Then Apx-VI exhibits the rate
Theorem 6.
Let $\gamma<1$ be the discount factor. Let $T^\star$ and $\check{T}^\star$ respectively be the Bellman optimality and anti-optimality operators for $V$ or $Q$. Let $U^\star$ and $\check{U}^\star$ respectively be the fixed points of $T^\star$ and $\check{T}^\star$. Then Apx-Anc-VI exhibits the rate
If, furthermore, , then (Apx-Anc-VI) exhibits the rate
for .
The dependence on the errors $\{\epsilon^k\}_{k\ge0}$ of Apx-Anc-VI is no worse than that of Apx-VI. In this sense, Apx-Anc-VI is robust against evaluation errors of the Bellman operator, just as much as the standard Apx-VI. Finally, we note that a similar analysis can be done for Apx-Anc-VI with the Bellman consistency operator.
6 Gauss–Seidel Anchored Value Iteration
In this section, we show that the anchoring mechanism can be combined with Gauss–Seidel-type updates in finite state-action spaces. Let $\gamma<1$ and let $T^\star$ be the Bellman optimality operator. Define $\hat{T}^\star$ as
where is defined as
for .
Fact 2 (Classical result, [69, Theorem 6.3.4]).
$\hat{T}^\star$ is a $\gamma$-contractive operator and has the same fixed point as $T^\star$.
The Gauss–Seidel Anchored Value Iteration (GS-Anc-VI) is
$$U^k=\beta_kU^0+(1-\beta_k)\,\hat{T}^\star U^{k-1} \tag{GS-Anc-VI}$$
for $k=1,2,\dots$, where $\beta_k=1/\big(\sum_{i=0}^{k}\gamma^{-2i}\big)$ and $U^0$ is an initial point.
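A minimal sketch of a Gauss–Seidel sweep of the Bellman optimality operator (states updated in place within a sweep, so later states see the already-updated values) combined with anchoring; the toy MDP and the reuse of the Anc-VI anchor coefficients in the Gauss–Seidel setting are assumptions of this sketch.

```python
# Minimal sketch: one Gauss-Seidel sweep of the Bellman optimality operator, and GS-Anc-VI.
import numpy as np

rng = np.random.default_rng(4)
n_states, n_actions, gamma = 5, 3, 0.9
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((n_states, n_actions))

def gauss_seidel_sweep(V):
    """One in-place sweep: V(s) <- max_a [ r(s, a) + gamma * sum_s' P(s' | s, a) V(s') ]."""
    V = V.copy()
    for s in range(n_states):
        V[s] = np.max(r[s] + gamma * P[s] @ V)   # uses the already-updated entries V[0..s-1]
    return V

U0 = np.zeros(n_states)
U = U0.copy()
for k in range(1, 201):                           # GS-Anc-VI: anchoring applied to the GS sweep
    beta = 1.0 / np.sum(gamma ** (-2.0 * np.arange(k + 1)))
    U = beta * U0 + (1 - beta) * gauss_seidel_sweep(U)

TU = (r + gamma * P @ U).max(axis=1)
print(np.max(np.abs(TU - U)))                     # Bellman error of the anchored iterate
```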
Theorem 7.
Let the state and action spaces be finite sets. Let $\gamma<1$ be the discount factor. Let $T^\star$ and $\check{T}^\star$ respectively be the Bellman optimality and anti-optimality operators for $V$ or $Q$. Let $U^\star$ and $\check{U}^\star$ respectively be the fixed points of $T^\star$ and $\check{T}^\star$. Then GS-Anc-VI exhibits the rate
for $k=1,2,\dots$. If, furthermore, $\hat{T}^\star U^0\ge U^0$ or $\hat{T}^\star U^0\le U^0$, then GS-Anc-VI exhibits the rate
for $k=1,2,\dots$.
7 Conclusion
We show that the classical value iteration (VI) is, in fact, suboptimal and that the anchoring mechanism accelerates VI to be optimal, in the sense that the accelerated rate matches a complexity lower bound up to a constant factor of $4$. We also show that the accelerated iteration provably converges to a fixed point even when $\gamma=1$, if a fixed point exists. Being able to provide a substantive improvement upon the classical VI is, in our view, a surprising contribution.
One direction of future work is to study the empirical effectiveness of Anc-VI. Another direction is to analyze Anc-VI in a model-free setting and, more broadly, to investigate the effectiveness of the anchor mechanism in more practical RL methods.
Our results lead us to believe that many of the classical foundations of dynamic programming and reinforcement learning may be improved with a careful examination based on an optimization complexity theory perspective. The theory of optimal optimization algorithms has recently enjoyed significant developments [44, 43, 45, 98, 66], the anchoring mechanism being one such example [49, 65], and the classical DP and RL theory may benefit from a similar line of investigation on iteration complexity.
Acknowledgments and Disclosure of Funding
This work was supported by the Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] and the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02). We thank Jisun Park for providing valuable feedback.
References
- [1] M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for non-symmetric affine fixed point problems and application to Markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199–232, 2022.
- [2] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the Association for Computing Machinery, 12(4):547–560, 1965.
- [3] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997.
- [4] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011.
- [5] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
- [6] M. Barré, A. Taylor, and A. d’Aspremont. Convergence of a constrained vector extrapolation scheme. SIAM Journal on Mathematics of Data Science, 4(3):979–1002, 2022.
- [7] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. Association for the Advancement of Artificial Intelligence, 2016.
- [8] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
- [9] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015.
- [10] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. 4th edition, 2012.
- [11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1995.
- [12] W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021.
- [13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1–2):71–120, 2020.
- [14] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1–2):315–355, 2021.
- [15] A.-L. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comptes rendus de l’Académie des Sciences, 25:536–538, 1847.
- [16] V. Colao and G. Marino. On the rate of convergence of Halpern iterations. Journal of Nonlinear and Convex Analysis, 22(12):2639–2646, 2021.
- [17] J. P. Contreras and R. Cominetti. Optimal error bounds for non-expansive fixed-point iterations in normed spaces. Mathematical Programming, 199(1–2):343–374, 2022.
- [18] P. Dai, D. S. Weld, J. Goldsmith, et al. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209, 2011.
- [19] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105:589–608, 2000.
- [20] J. Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. Conference on Learning Theory, 2020.
- [21] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
- [22] Q. Dong, H. Yuan, Y. Cho, and T. M. Rassias. Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings. Optimization Letters, 12(1):87–102, 2018.
- [23] Y. Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1–16, 2017.
- [24] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent. International Conference on Machine Learning, 2020.
- [25] Y. Drori and A. Taylor. On the oracle complexity of smooth strongly convex minimization. Journal of Complexity, 68, 2022.
- [26] Y. Drori and M. Teboulle. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming, 160(1–2):321–351, 2016.
- [27] M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021.
- [28] M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020.
- [29] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
- [30] D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electric Power Systems, 3(1), 2005.
- [31] A.-m. Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021.
- [32] A.-m. Farahmand, C. Szepesvári, and R. Munos. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010.
- [33] M. Fellows, K. Hartikainen, and S. Whiteson. Bayesian Bellman operators. Neural Information Processing Systems, 2021.
- [34] M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018.
- [35] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. Conference on Learning Theory, 2020.
- [36] G. J. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
- [37] V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2022.
- [38] J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-newton methods for MDPs. arXiv:2104.10677, 2021.
- [39] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957–961, 1967.
- [40] R. Hannah, Y. Liu, D. O’Connor, and W. Yin. Breaking the span assumption yields fast finite-sum minimization. Neural Information Processing Systems, 2018.
- [41] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence, 2018.
- [42] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):383–405, 2019.
- [43] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1–2):57–87, 2021.
- [44] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1–2):81–107, 2016.
- [45] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, 2021.
- [46] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. Neural Information Processing Systems, 2021.
- [47] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021.
- [48] L. Leustean. Rates of asymptotic regularity for Halpern iterations of nonexpansive mappings. Journal of Universal Computer Science, 13(11):1680–1691, 2007.
- [49] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2):405–418, 2021.
- [50] M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg. Value iteration in continuous actions, states and time. International Conference on Machine Learning, 2021.
- [51] P.-E. Maingé. Convergence theorems for inertial KM-type algorithms. Journal of Computational and Applied Mathematics, 219(1):223–236, 2008.
- [52] A. massoud Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. American Control Conference, 2009.
- [53] H. B. McMahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005.
- [54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
- [55] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.
- [56] R. Munos. Error bounds for approximate value iteration. Association for the Advancement of Artificial Intelligence, 2005.
- [57] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815–857, 2008.
- [58] A. S. Nemirovski. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
- [59] Y. Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
- [60] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, 2021.
- [61] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $O(1/k^2)$. Doklady Akademii Nauk SSSR, 269(3):543–547, 1983.
- [62] S. Niu, S. Chen, H. Guo, C. Targonski, M. Smith, and J. Kovačević. Generalized value iteration networks: Life beyond lattices. Association for the Advancement of Artificial Intelligence, 2018.
- [63] C. Nota and P. Thomas. Is the policy gradient a gradient? International Conference on Autonomous Agents and Multiagent Systems, 2020.
- [64] C. Park, J. Park, and E. K. Ryu. Factor-$\sqrt{2}$ acceleration of accelerated gradient methods. Applied Mathematics & Optimization, 2023.
- [65] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022.
- [66] J. Park and E. K. Ryu. Accelerated infeasibility detection of constrained optimization and fixed-point iterations. International Conference on Machine Learning, 2023.
- [67] M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. arXiv:2211.14998, 2022.
- [68] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
- [69] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
- [70] S. Reich. Strong convergence theorems for resolvents of accretive operators in Banach spaces. Journal of Mathematical Analysis and Applications, 75(1):287–292, 1980.
- [71] E. K. Ryu, K. Yuan, and W. Yin. Ode analysis of stochastic gradient methods with optimism and anchoring for minimax problems. arXiv:1905.10899, 2019.
- [72] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640–660, 2017.
- [73] A. Salim, L. Condat, D. Kovalev, and P. Richtárik. An optimal algorithm for strongly convex minimization under affine constraints. International Conference on Artificial Intelligence and Statistics, 2022.
- [74] D. Scieur, A. d’Aspremont, and F. Bach. Regularized nonlinear acceleration. Mathematical Programming, 179(1–2):47–83, 2020.
- [75] Y. Shehu. Convergence rate analysis of inertial Krasnoselskii–Mann type iteration with applications. Numerical Functional Analysis and Optimization, 39(10):1077–1091, 2018.
- [76] W. Shi, S. Song, H. Wu, Y.-C. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. Neural Information Processing Systems, 2019.
- [77] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for Markov decision processes. Operations Research, 58(1):193–202, 2010.
- [78] J. J. Suh, J. Park, and E. K. Ryu. Continuous-time analysis of anchor acceleration. Neural Information Processing Systems, 2023.
- [79] K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021.
- [80] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
- [81] R. S. Sutton and A. G. Barto. Reinforcement Learning: An introduction. MIT press, 2nd edition, 2018.
- [82] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 1999.
- [83] Q. Sykora, M. Ren, and R. Urtasun. Multi-agent routing value iteration network. International Conference on Machine Learning, 2020.
- [84] C. Szepesvári. Algorithms for Reinforcement Learning. Springer, 1st edition, 2010.
- [85] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. Neural Information Processing Systems, 2016.
- [86] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1-2):557–594, 2023.
- [87] S. Tosatto, M. Pirotta, C. d’Eramo, and M. Restelli. Boosted fitted Q-iteration. International Conference on Machine Learning, 2017.
- [88] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
- [89] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence, 2016.
- [90] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
- [91] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2018.
- [92] N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.
- [93] H. F. Walker and P. Ni. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011.
- [94] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
- [95] D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6(25):851–881, 2005.
- [96] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Archiv der Mathematik, 58(5):486–491, 1992.
- [97] H.-K. Xu. Iterative algorithms for nonlinear operators. Journal of the London Mathematical Society, 66(1):240–256, 2002.
- [98] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with $\mathcal{O}(1/k^2)$ rate on squared gradient norm. International Conference on Machine Learning, 2021.
- [99] T. Yoon and E. K. Ryu. Accelerated minimax algorithms flock together. arXiv:2205.11093, 2022.
- [100] Y. Zeng, F. Feng, and W. Yin. AsyncQVI: Asynchronous-parallel Q-value iteration for discounted Markov decision processes with near-optimal sample complexity. International Conference on Artificial Intelligence and Statistics, 2020.
- [101] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.
- [102] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. International Conference on Artificial Intelligence and Statistics, 2022.
Appendix A Preliminaries
For notational unity, we use the symbol $U$ when both $V$ and $Q$ can be used.
Lemma 1.
[10, Lemma 1.1.1] Let . If , then .
Lemma 2.
Let . For any policy , is a nonexpansive linear operator such that if , .
Proof.
If for all and , . Then by Lemma 1 and -contraction of , we have the desired result. ∎
Lemma 3.
Let . Let and respectively be the Bellman optimality and anti-optimality operators. Let and respectively be the fixed points of and . Then .
Proof.
By definition, . Thus, . ∎
Appendix B Omitted proofs in Section 2
First, we prove the following lemma by induction.
Lemma 4.
Proof.
If , we have
If , since is a linear operator,
∎
Now, we prove the first rate of Theorem 1.
Proof of first rate in Theorem 1.
For the second rate of Theorem 1, we introduce following lemma.
Lemma 5.
Let . Let be Bellman consistency or optimality operator. For the iterates of Anc-VI, if , then for . Also, if , then for .
Proof.
First, let . If , by assumption. Since , by monotonicity of Bellman consistency and optimality operators.
By induction,
and since ,
Also, implies by monotonicity of Bellman consistency and optimality operators, and implies that for all .
Now, suppose . If , by assumption. Since , by monotonicity of Bellman consistency and optimality operators.
By induction,
and since ,
Also, implies by monotonicity of Bellman consistency and optimality operators, and implies that for all . ∎
Now, we prove following key lemmas.
Lemma 6.
Lemma 7.
Now, we prove the second rates of Theorem 1.
Proof of second rates in Theorem 1.
Let . By Lemma 5, if , then . Hence,
by Lemma 6. Taking -norm both sides, we have
Otherwise, if , by Lemma 5. Since
by Lemma 7, taking -norm both sides, we obtain same rate as before.
Lastly, Taylor series expansion for both rates at is
∎
For the analyses of Anc-VI for Bellman optimality operator, we first prove following two lemmas.
Lemma 8.
Let . If , assume a fixed point exists. Then, if and , there exist nonexpansive linear operator such that
Lemma 9.
Let . If and , then there exist nonexpansive linear operator such that
Proof of Lemma 8.
First, let , and .
If action space is finite,
where is the greedy policy satisfying , first inequality is from and , and second inequality comes from Lemma 1. Thus, we can conclude .
Otherwise, if action space is infinite, define for and previously given . Let be linear space spanned by with -norm. Then, is linear functional on and since . Due to Hahn–Banach extension Theorem, there exist linear functional with and . Furthermore, we can define such that for all . Then, since for , we have . Therefore, is nonexpansive linear operator in -norm. Then,
for all . Therefore, we have
Similarly, let , and .
If action space is finite,
where is the greedy policy satisfying , first inequality is from and , and second inequality comes from Lemma 1. Then, we can conclude .
Otherwise, if action space is infinite, define for and previously given . Let be linear space spanned by with -norm. Then, is linear functional on and . Due to Hahn–Banach extension Theorem, there exist linear functional with and . Furthermore, we can define such that for all and . Therefore, is nonexpansive linear operator in -norm. Then,
for all . Therefore, we have
∎
Proof of Lemma 9.
Note that is Bellman anti-optimality operators for or , and is the fixed point of . First, let , and . Then,
Then, if action space is finite,
where is the policy satisfying and second inequality comes from Lemma 1. Thus, we can conclude .
Otherwise, if action space is infinite, define for and previously given . Let be linear space spanned by with -norm. Then, is linear functional on and since . Due to Hahn–Banach extension Theorem, there exist linear functional with and . Furthermore, we can define such that for all . Then since for . . Thus, is nonexpansive linear operator in -norm. Then, we have
for all . Therefore, we have
Similarly, let , and . Then,
Hence, if action space is finite,
where is the policy satisfying and second inequality comes from Lemma 1. Then, we can conclude .
Otherwise, if action space is infinite, define for and previously given . Let be linear space spanned by with -norm. Then, is linear functional on with . Due to Hahn–Banach extension Theorem, there exist linear functional with and . Furthermore, we can define such that for all and . Thus is nonexpansive linear operator in -norm. Then, we have
for all . Therefore, we have
∎
Now, we present our key lemmas for the first rate of Theorem 2.
Lemma 10.
Let . If , assume a fixed point exists. For the iterates of Anc-VI, there exist nonexpansive linear operators such that
where and .
Lemma 11.
We prove previous lemmas by induction.
Proof of Lemma 10.
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from first inequality in Lemma 8 with , and previously defined . ∎
Proof of Lemma 11.
Note that is Bellman anti-optimality operators for or , and is the fixed point of . If ,
where inequality comes from second inequality in Lemma 9 with .
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from second inequality in Lemma 9 with , and previously defined . ∎
Now, we prove the first rate of Theorem 2.
Proof of first rate in Theorem 2.
Since implies for , if we take right side first inequality of Lemma 10, we have
where the first inequality comes from the triangle inequality, the second inequality is from the nonexpansiveness of , and the last equality comes from calculations.
If we take right side of second inequality of Lemma 10, similarly, we have
where the first inequality comes from the triangle inequality, the second inequality is from the nonexpansiveness of , and the last equality comes from calculations. Therefore, we conclude
∎
Next, for the second rate in Theorem 2, we prove following lemmas by induction.
Lemma 12.
Let . If , assume a fixed point exists. For the iterates of Anc-VI, if , there exist nonexpansive linear operators such that
where and .
Lemma 13.
Let . For the iterates of Anc-VI, if , there exist nonexpansive linear operators such that
where and .
Proof of Lemma 12.
If ,
where the second inequality is from the condition.
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from first inequality in Lemma 8 with , and previously defined . ∎
Proof of Lemma 13.
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from second inequality in Lemma 9 with , and previously defined . ∎
Now, we prove the second rates of Theorem 2.
Appendix C Omitted proofs in Section 3
First, we present the following lemma.
Lemma 14.
Let . Assume a fixed point exists. For the iterates of Anc-VI, .
Proof.
If , it is obvious. By induction,
where the second inequality comes from the nonexpansiveness of . ∎
Now, we present the proof of Theorem 3.
Proof of Theorem 3.
First, if , with the same argument as in the proof of Lemma 5, we can show that for
Since fixed point exists by assumption, Lemma 4 and 10 hold. Note that implies and if we take -norm both sides for those inequalities in lemmas, by simple calculation, we have
for any fixed point (since , we can get upper bound of from Lemma 10).
Suppose that there exist such that converges to some . Then, since is continuous. This implies that is a fixed point. By Lemma 14 and previous argument, is increasing and bounded sequence in . Thus, has single limit point, some fixed point . Furthermore, the fact that implies that Lemma 6 and 12 hold. Therefore, we have
∎
Next, we prove the Theorem 4.
Proof of Theorem 4.
By same argument in the proof of Theorem 3, if , we can show that for , and
for any fixed point . Since is increasing and bounded by Lemma 14 and the previous argument, converges pointwise to some in the general state-action space. We now show that also converges pointwise to . First, let be the Bellman consistency operator and . By the monotone convergence theorem,
for any fixed . With same argument, case also holds. If is Bellman optimality operator, we use following lemma.
Lemma 15.
Let for . If for all , and converge pointwise to , then .
Proof.
implies that . If , there exist which satisfying , and by definition of , there exist such that for any . ∎
If and , by previous lemma and monotone convergence theorem, we have
for any fixed . With similar argument, case also holds.
Appendix D Omitted proofs in Section 4
We present the proof of Theorem 5.
Proof of Theorem 5.
First, we prove the case for . Consider the MDP such that
Then, , and . Under the span condition, we can show that for by following lemma.
Lemma 16.
Let be defined as before. Then, under span condition, for , and for and .
Proof.
Case is obvious. By induction, for . Then for . This implies that for . Hence . Again, by induction, for , . Then for , and this implies that for , . Therefore, for . ∎
Then, we get
and this implies
Taking the absolute value on both sides,
Therefore, we conclude
Now, we show that for any initial point , there exists an MDP which exhibits the same lower bound as the case . Denote by MDP() and the worst-case MDP and Bellman consistency or optimality operator constructed for . Define an MDP() for as
Then, Bellman consistency or optimality operator satisfies
Let be fixed point of . Then, if , is fixed point of . Furthermore, if satisfies span condition
is a sequence satisfying
which is the same span condition in Theorem 5 with respect to . This is because
for . Thus, is a sequence starting from and satisfy the span condition for . This implies that
Hence, MDP() is indeed our desired worst-case instance. Lastly, the fact that implies .
∎
Appendix E Omitted proofs in Section 5
First, we prove following key lemma.
Lemma 17.
Let . For the iterates of Anc-VI, there exist nonexpansive linear operators and such that
for , where , , and .
Proof of Lemma 17.
First, we prove the first inequality in Lemma 17 by induction.
If ,
where inequality comes from Lemma 8 with , and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 8 with , and previously defined .
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 8 with , and previously defined .
Now, we prove second inequality in Lemma 17 by induction.
If ,
where inequality comes from Lemma 9 with , and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 9 with , and previously defined .
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 9 with , and previously defined ∎
Now, we prove the first rate in Theorem 6.
Proof of first rate in Theorem 6.
Now, for the second rate in Theorem 6, we present following key lemma.
Lemma 18.
Let . For the iterates of Anc-VI, if , there exist nonexpansive linear operators and such that
for , where , , and .
Proof of Lemma 18.
First, we prove first inequality in Lemma 18 by induction. If ,
where the second inequality is from the , and first inequality comes from Lemma 8 with , and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 8 with , and previously defined .
By induction,
where the second inequality is from and let be the entire right hand side of inequality. Then, we have
where the first inequality comes from Lemma 8 with , and previously defined .
For the second inequality in Lemma 18, if ,
where the second inequality is from , and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 9 with , and previously defined .
By induction,
and let be the entire right hand side of inequality. Then, we have
where inequality comes from Lemma 9 with , and previously defined . ∎
Now, we prove the second rate in Theorem 6.
Appendix F Omitted proofs in Section 6
For the analyses, we first define as
where is defined as
for , where is Bellman anti-optimality operator.
Fact 3.
[Classical result, [10, Proposition 1.3.2]] is a -contractive operator and has the same fixed point as .
Now, we introduce the following lemmas.
Lemma 19.
Let . If , then there exist -contractive nonnegative matrix such that
Lemma 20.
Let . If , then there exist -contractive nonnegative matrix such that
Proof of Lemma 19.
First let . For , we have
where is the greedy policy satisfying and first inequality is from and . Then, define matrix as
for . Note that is nonnegative matrix since is nonnegative matrix. Then, we have
By induction, there exist a sequence of matrices satisfying
since for all . Denote as . Then, is -contractive nonnegative matrix since
for , where first equality is from definition of for , inequality comes from definition of for , and last equality is induced by definition of . Therefore, this implies that .
If , with similar argument of case , let be the greedy policy, define matrix as
and denote as . Then, is -contractive nonnegative matrix satisfying
∎
Proof of Lemma 20.
First let . For , we have
Let and define matrix as
for . Note that is nonnegative matrix since is nonnegative matrix. Then, we have
By induction, there exist a sequence of matrices satisfying
and denote as . With same argument in proof of Lemma 19, is -contractive nonnegative matrix.
If , with similar argument, let and define matrix as
Denote as . Then, with same argument in proof of Lemma 19, is -contractive nonnegative matrix satisfying
∎
Next, we prove following key lemma.
Lemma 21.
Let . For the iterates of (GS-Anc-VI), there exist -contractive nonnegative matrices and such that
where and .
Proof of Lemma 21.
First, we prove first inequality in Lemma 21 by induction.
By induction,
where the first inequality comes from Lemma 19 with , and second inequality comes from nonnegativeness of .
First, we prove second inequality in Lemma 21 by induction.
Now, we prove the first rate in Theorem 7.
Proof of first rate in Theorem 7.
Since implies for , if we take right side of first inequality in Lemma 21, we have
where the first inequality comes from triangular inequality, second inequality is from -contraction of , and last equality comes from calculations. If we take right side of second inequality in Lemma 21, we have
where the first inequality comes from the triangle inequality, the second inequality is from the -contraction of , and the last equality comes from calculations. Therefore, we conclude
∎
For the second rates of Theorem 7, we introduce following lemma.
Lemma 22.
Let . For the iterates of (GS-Anc-VI), if , then for . Also, if , then for .
Proof.
Now, we prove the second rates in Theorem 7.
Proof of second rates in Theorem 7.
Appendix G Broader Impacts
Our work focuses on the theoretical aspects of reinforcement learning. There are no negative social impacts that we anticipate from our theoretical results.
Appendix H Limitations
Our analysis concerns value iteration. While value iteration is of theoretical interest, the analysis of value iteration is not sufficient to understand modern deep reinforcement learning practices.