
Accelerating Value Iteration with Anchoring

Jongmin Lee1     Ernest K. Ryu1,2
1Department of Mathematical Sciences, Seoul National University
2Interdisciplinary Program in Artificial Intelligence, Seoul National University
Abstract

Value Iteration (VI) is foundational to the theory and practice of modern reinforcement learning, and it is known to converge at a $\mathcal{O}(\gamma^{k})$-rate, where $\gamma$ is the discount factor. Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an anchoring mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits a $\mathcal{O}(1/k)$-rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\geq 1-1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of $4$, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.

1 Introduction

Value Iteration (VI) is foundational to the theory and practice of modern dynamic programming (DP) and reinforcement learning (RL). It is well known that when a discount factor $\gamma<1$ is used, (exact) VI is a contractive iteration in the $\|\cdot\|_{\infty}$-norm and therefore converges. The progress of VI is measured by the Bellman error in practice (as the distance to the fixed point is not computable), and much prior work has been dedicated to analyzing the rates of convergence of VI and its variants.

Surprisingly, however, the optimal rate in terms of Bellman error for the VI setup was not known, and finding a general acceleration mechanism has been an open problem. The classical $\mathcal{O}(\gamma^{k})$-rate of VI is inadequate as many practical setups use $\gamma\approx 1$ or $\gamma=1$ for the discount factor. (Not to mention that VI may not converge when $\gamma=1$.) Moreover, most prior works on accelerating VI focused on the Bellman consistency operator (policy evaluation) as its linearity allows eigenvalue analyses, but the Bellman optimality operator (control) is the more relevant object in modern RL.

Contribution.

In this paper, we present the first accelerated VI for both the Bellman consistency and optimality operators. Our method, called Anc-VI, is based on an “anchoring” mechanism (distinct from Nesterov's acceleration), and it reduces the Bellman error faster than standard VI. In particular, Anc-VI exhibits a $\mathcal{O}(1/k)$-rate for $\gamma\approx 1$ or even $\gamma=1$, while standard VI has rate $\mathcal{O}(1)$ for $\gamma\geq 1-1/k$, where $k$ is the iteration count. We also provide a complexity lower bound matching the upper bound up to a constant factor of $4$, thereby establishing optimality of the accelerated rate of Anc-VI. Finally, we show that the anchoring mechanism provides the same benefit in the approximate VI and Gauss–Seidel VI setups as well.

1.1 Notations and preliminaries

We quickly review basic definitions and concepts of Markov decision processes (MDP) and reinforcement learning (RL). For further details, refer to standard references such as [69, 84, 81].

Markov Decision Process.

Let $\mathcal{M}(\mathcal{X})$ be the space of probability distributions over $\mathcal{X}$. Write $(\mathcal{S},\mathcal{A},P,r,\gamma)$ to denote the MDP with state space $\mathcal{S}$, action space $\mathcal{A}$, transition probability $P\colon\mathcal{S}\times\mathcal{A}\rightarrow\mathcal{M}(\mathcal{S})$, reward $r\colon\mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$, and discount factor $\gamma\in(0,1]$. Denote $\pi\colon\mathcal{S}\rightarrow\mathcal{M}(\mathcal{A})$ for a policy, and let $V^{\pi}(s)=\mathbb{E}_{\pi}[\sum^{\infty}_{t=0}\gamma^{t}r(s_{t},a_{t})\,|\,s_{0}=s]$ and $Q^{\pi}(s,a)=\mathbb{E}_{\pi}[\sum^{\infty}_{t=0}\gamma^{t}r(s_{t},a_{t})\,|\,s_{0}=s,a_{0}=a]$ be the $V$- and $Q$-value functions, where $\mathbb{E}_{\pi}$ denotes the expected value over all trajectories $(s_{0},a_{0},s_{1},a_{1},\dots)$ induced by $P$ and $\pi$. We say $V^{\star}$ and $Q^{\star}$ are optimal $V$- and $Q$-value functions if $V^{\star}=\sup_{\pi}V^{\pi}$ and $Q^{\star}=\sup_{\pi}Q^{\pi}$. We say $\pi_{V}^{\star}$ and $\pi_{Q}^{\star}$ are optimal policies if $\pi_{V}^{\star}=\operatorname{argmax}_{\pi}V^{\pi}$ and $\pi_{Q}^{\star}=\operatorname{argmax}_{\pi}Q^{\pi}$. (If the argmax is not unique, break ties arbitrarily.)

Value Iteration.

Let $\mathcal{F}(\mathcal{X})$ denote the space of bounded measurable real-valued functions over $\mathcal{X}$. With the given MDP $(\mathcal{S},\mathcal{A},P,r,\gamma)$, for $V\in\mathcal{F}(\mathcal{S})$ and $Q\in\mathcal{F}(\mathcal{S}\times\mathcal{A})$, define the Bellman consistency operators $T^{\pi}$ as

T^{\pi}V(s) = \mathbb{E}_{a\sim\pi(\cdot\,|\,s),\,s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma V(s^{\prime})\right],
T^{\pi}Q(s,a) = r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a),\,a^{\prime}\sim\pi(\cdot\,|\,s^{\prime})}\left[Q(s^{\prime},a^{\prime})\right]

for all $s\in\mathcal{S}$, $a\in\mathcal{A}$, and the Bellman optimality operators $T^{\star}$ as

T^{\star}V(s) = \sup_{a\in\mathcal{A}}\left\{r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})\right]\right\},
T^{\star}Q(s,a) = r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right]

for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$. For notational conciseness, we write $T^{\pi}V=r^{\pi}+\gamma\mathcal{P}^{\pi}V$ and $T^{\pi}Q=r+\gamma\mathcal{P}^{\pi}Q$, where $r^{\pi}(s)=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\left[r(s,a)\right]$ is the reward induced by policy $\pi$ and $\mathcal{P}^{\pi}$ denotes the transition operator induced by $\pi$, defined as

\mathcal{P}^{\pi}(s\rightarrow s^{\prime}) = \mathrm{Prob}(s\rightarrow s^{\prime}\,|\,a\sim\pi(\cdot\,|\,s),\,s^{\prime}\sim P(\cdot\,|\,s,a)),
\mathcal{P}^{\pi}((s,a)\rightarrow(s^{\prime},a^{\prime})) = \mathrm{Prob}((s,a)\rightarrow(s^{\prime},a^{\prime})\,|\,s^{\prime}\sim P(\cdot\,|\,s,a),\,a^{\prime}\sim\pi(\cdot\,|\,s^{\prime}))

for the $V$- and $Q$-cases, respectively. We define VI for the Bellman consistency and optimality operators as

V^{k+1}=T^{\pi}V^{k},\quad Q^{k+1}=T^{\pi}Q^{k},\quad V^{k+1}=T^{\star}V^{k},\quad Q^{k+1}=T^{\star}Q^{k}\qquad\text{ for }k=0,1,\dots,

where $V^{0},Q^{0}$ are initial points. VI for control, after executing $K$ iterations, returns the near-optimal policy $\pi_{K}$ as a greedy policy satisfying

T^{\pi_{K}}V^{K}=T^{\star}V^{K},\quad T^{\pi_{K}}Q^{K}=T^{\star}Q^{K}.

For $\gamma<1$, both the Bellman consistency and optimality operators are contractions, and, by Banach's fixed-point theorem [5], the VIs converge to the unique fixed points $V^{\pi}$, $Q^{\pi}$, $V^{\star}$, and $Q^{\star}$ with $\mathcal{O}(\gamma^{k})$-rate. For notational unity, we use the symbol $U$ when both $V$ and $Q$ can be used. Since $\|TU^{k}-U^{k}\|_{\infty}\leq\|TU^{k}-U^{\star}\|_{\infty}+\|U^{k}-U^{\star}\|_{\infty}\leq(1+\gamma)\|U^{k}-U^{\star}\|_{\infty}$, VI exhibits the following rate on the Bellman error:

\|TU^{k}-U^{k}\|_{\infty} \leq (1+\gamma)\gamma^{k}\|U^{0}-U^{\star}\|_{\infty}\qquad\text{ for }k=0,1,\dots, \qquad (1)

where $T$ is the Bellman consistency or optimality operator, $U^{0}$ is the starting point, and $U^{\star}$ is the fixed point of $T$. We say $V\leq V^{\prime}$ or $Q\leq Q^{\prime}$ if $V(s)\leq V^{\prime}(s)$ or $Q(s,a)\leq Q^{\prime}(s,a)$ for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$, respectively.
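As a concrete illustration of the finite (tabular) setting, the following is a minimal NumPy sketch (ours, not from the paper) of standard VI with the Bellman optimality operator for $Q$-functions; the transition tensor `P` (shape $|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$) and reward array `r` (shape $|\mathcal{S}|\times|\mathcal{A}|$) are hypothetical inputs.

```python
import numpy as np

def bellman_optimality_Q(Q, P, r, gamma):
    """T*Q(s,a) = r(s,a) + gamma * sum_{s'} P(s'|s,a) * max_{a'} Q(s',a')."""
    return r + gamma * np.einsum("sat,t->sa", P, Q.max(axis=1))

def value_iteration(P, r, gamma, tol=1e-8, max_iter=100_000):
    """Standard VI: U^{k+1} = T* U^k, stopped on the Bellman error ||T*Q - Q||_inf."""
    Q = np.zeros_like(r)                      # U^0 = 0
    for _ in range(max_iter):
        TQ = bellman_optimality_Q(Q, P, r, gamma)
        if np.max(np.abs(TQ - Q)) < tol:      # Bellman error
            return TQ
        Q = TQ
    return Q
```

A greedy (near-optimal) policy is then recovered as `Q.argmax(axis=1)`, matching the greedy-policy extraction described above.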

Fixed-point iterations.

Given an operator $T$, we say $x^{\star}$ is a fixed point if $Tx^{\star}=x^{\star}$. Since Banach [5], the standard fixed-point iteration

x^{k+1}=Tx^{k}\qquad\text{ for }k=0,1,\dots

has been commonly used to find fixed points. Note that VI for policy evaluation and control are fixed-point iterations with the Bellman consistency and optimality operators. In this work, we also consider the Halpern iteration

x^{k+1}=\beta_{k+1}x^{0}+(1-\beta_{k+1})Tx^{k}\qquad\text{ for }k=0,1,\dots,

where $x^{0}$ is an initial point and $\beta_{k}\in(0,1)$ for all $k\in\mathbb{N}$.

1.2 Prior works

Value Iteration.

Value iteration (VI) was first introduced in the DP literature [8] for finding the optimal value function, and its variant, approximate VI [11, 30, 56, 32, 19, 90, 81], considers approximate evaluations of the Bellman optimality operator. In RL, VI and approximate VI have served as the basis of RL algorithms such as fitted value iteration [29, 57, 52, 87, 50, 36] and temporal difference learning [80, 89, 41, 94, 54]. There is a line of research that emulates VI by learning a model of the MDP dynamics [85, 83, 62] and applying a modified Bellman operator [7, 33]. Asynchronous VI, another variant of VI that updates the coordinates of the value function in an asynchronous manner, has also been studied in both the RL and DP literature [11, 9, 88, 100].

Fixed-point iterations.

The Banach fixed-point theorem [5] establishes the convergence of the standard fixed-point iteration with a contractive operator. The Halpern iteration [39] converges for nonexpansive operators on Hilbert spaces [96] and uniformly smooth Banach spaces [70, 97]. (To clarify, the $\|\cdot\|_{\infty}$-norm in $\mathbb{R}^{n}$ is not uniformly smooth.)

The fixed-point residual $\|Tx_{k}-x_{k}\|$ is a commonly used error measure for fixed-point problems. In general normed spaces, the Halpern iteration was shown to exhibit a $\mathcal{O}(1/\log(k))$-rate for (nonlinear) nonexpansive operators [48] and a $\mathcal{O}(1/k)$-rate for linear nonexpansive operators [17] on the fixed-point residual. In Hilbert spaces, [72] first established a $\mathcal{O}(1/k)$-rate for the Halpern iteration, and the constant was later improved by [49, 43]. For contractive operators, [65] proved exact optimality of the Halpern iteration through an exactly matching complexity lower bound.

Acceleration.

Since Nesterov’s seminal work [61], there has been a large body of research on acceleration in convex minimization. Gradient descent [15] can be accelerated to efficiently reduce function value and squared gradient magnitude for smooth convex minimization problems [61, 44, 45, 46, 102, 21, 60] and smooth strongly convex minimization problems [59, 91, 64, 86, 73]. Motivated by Nesterov acceleration, inertial fixed-point iterations [51, 22, 75, 70, 42] have also been suggested to accelerate fixed-point iterations. Anderson acceleration [2], another acceleration scheme for fixed-point iterations, has recently been studied with interest [6, 74, 93, 101].

In DP and RL, prioritized sweeping [55] is a well-known method that changes the order of updates to accelerate convergence, and several variants [68, 53, 95, 3, 18] have been proposed. Speedy Q-learning [4] modifies the update rule of Q-learning and uses aggressive learning rates for acceleration. Recently, there has been a line of research that applies acceleration techniques from other areas to VI: [34, 79, 28, 67, 27, 76] use Anderson acceleration of fixed-point iterations, [92, 37, 38, 12, 1] use Nesterov acceleration of convex optimization, and [31] uses ideas inspired by PID controllers in control theory. Among those works, [37, 38, 1] applied Nesterov acceleration to obtain theoretically accelerated convergence rates, but those analyses require certain reversibility conditions or restrictions on the eigenvalues of the transition probability induced by the policy.

The anchor acceleration, a new acceleration mechanism distinct from Nesterov's, has lately gained attention in convex optimization and fixed-point theory. The anchoring mechanism, which retracts iterates towards the initial point, has been used to accelerate algorithms for minimax optimization and fixed-point problems [71, 47, 98, 65, 43, 20, 99, 78], and we focus on it in this paper.

Complexity lower bound.

With information-based complexity analysis [58], complexity lower bounds on first-order methods for convex minimization problems have been thoroughly studied [59, 23, 25, 13, 14, 24]. If a complexity lower bound matches an algorithm's convergence rate, it establishes optimality of the algorithm [58, 44, 73, 86, 26, 65]. For fixed-point problems, [16] established a $\Omega(1/k^{1-\sqrt{2/q}})$ lower bound on the distance to solution for the Halpern iteration with a nonexpansive operator in $q$-uniformly smooth Banach spaces. In [17], a $\Omega(1/k)$ lower bound on the fixed-point residual was provided for the general Mann iteration with a nonexpansive linear operator, which includes the standard fixed-point and Halpern iterations, in the $\ell^{\infty}$-space. In Hilbert spaces, [65] showed an exact complexity lower bound on the fixed-point residual for deterministic fixed-point iterations with $\gamma$-contractive and nonexpansive operators. Finally, [37] provided a lower bound on the distance to the optimal value function for fixed-point iterations satisfying a span condition with the Bellman consistency and optimality operators; we discuss this lower bound in Section 4.

2 Anchored Value Iteration

Let $T$ be a $\gamma$-contractive (in the $\|\cdot\|_{\infty}$-norm) Bellman consistency or optimality operator. The Anchored Value Iteration (Anc-VI) is

U^{k}=\beta_{k}U^{0}+(1-\beta_{k})TU^{k-1}\qquad\text{(Anc-VI)}

for $k=1,2,\dots$, where $\beta_{k}=1/(\sum_{i=0}^{k}\gamma^{-2i})$ and $U^{0}$ is an initial point. In this section, we present accelerated convergence rates of Anc-VI for both the Bellman consistency and optimality operators and for both $V$- and $Q$-value iterations. For the control setup, where the Bellman optimality operator is used, Anc-VI returns the near-optimal policy $\pi_{K}$ as a greedy policy satisfying $T^{\pi_{K}}U^{K}=T^{\star}U^{K}$ after executing $K$ iterations.
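For reference, summing the geometric series gives a closed form for the anchoring coefficients (a routine computation, stated here for the reader's convenience):

\beta_{k}=\frac{1}{\sum_{i=0}^{k}\gamma^{-2i}}=\frac{\gamma^{2k}\left(1-\gamma^{2}\right)}{1-\gamma^{2(k+1)}},\qquad \beta_{k}\rightarrow\frac{1}{k+1}\ \text{ as }\ \gamma\rightarrow 1,

so for $\gamma\approx 1$ the anchor weights behave like the $1/(k+1)$ weights of the classical Halpern iteration.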

Notably, Anc-VI obtains the next iterate as a convex combination of the output of $T$ and the starting point $U^{0}$. We call the $\beta_{k}U^{0}$ term the anchor term since, loosely speaking, it serves to pull the iterates toward the starting point $U^{0}$. The strength of the anchor mechanism diminishes as the iteration progresses since $\beta_{k}$ is a decreasing sequence.
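As a concrete illustration, here is a minimal NumPy sketch of the recursion above (ours, not from the paper); `bellman_op` stands for any implementation of $T^{\pi}$ or $T^{\star}$, and the arrays it acts on are hypothetical.

```python
import numpy as np

def anchor_weight(k, gamma):
    """beta_k = 1 / sum_{i=0}^{k} gamma^{-2i}; reduces to 1/(k+1) when gamma = 1."""
    return 1.0 / sum(gamma ** (-2 * i) for i in range(k + 1))

def anc_vi(bellman_op, U0, gamma, num_iter):
    """Anchored VI: U^k = beta_k U^0 + (1 - beta_k) T U^{k-1}."""
    U = U0.copy()
    for k in range(1, num_iter + 1):
        beta = anchor_weight(k, gamma)
        U = beta * U0 + (1.0 - beta) * bellman_op(U)
    return U
```

For instance, with the tabular Bellman optimality operator sketched earlier in Section 1.1, `anc_vi(lambda Q: bellman_optimality_Q(Q, P, r, gamma), np.zeros_like(r), gamma, 100)` runs 100 anchored iterations from $U^{0}=0$.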

The anchor mechanism was introduced [39, 72, 49, 65, 17, 48] for general nonexpansive operators and for $\|\cdot\|_{2}$-nonexpansive and contractive operators. The optimal method for $\|\cdot\|_{2}$-nonexpansive and contractive operators in [65] shares the same coefficients with Anc-VI, and the convergence results for general nonexpansive operators in [17, 48] are applicable to Anc-VI with nonexpansive Bellman optimality and consistency operators. While our anchor mechanism does bear a formal resemblance to those of prior works, our convergence rates and point convergence are neither a direct application nor a direct adaptation of the prior convergence analyses. The prior analyses for $\|\cdot\|_{2}$-nonexpansive and contractive operators do not apply to Bellman operators, and the prior analyses for general nonexpansive operators have slower rates and do not provide point convergence, while our Theorem 3 does. Our analyses specifically utilize the structure of Bellman operators to obtain the faster rates and point convergence.

The accelerated rate of Anc-VI for the Bellman optimality operator is more technically challenging and is, in our view, the stronger contribution. However, we start by presenting the result for the Bellman consistency operator because it is commonly studied in the prior RL theory literature on accelerating value iteration [37, 38, 1, 31] and because the analysis in the Bellman consistency setup will serve as a good conceptual stepping stone towards the analysis in the Bellman optimality setup.

2.1 Accelerated rate for Bellman consistency operator

First, for general state-action spaces, we present the accelerated convergence rate of Anc-VI for the Bellman consistency operator.

Theorem 1.

Let $0<\gamma<1$ be the discount factor and $\pi$ be a policy. Let $T^{\pi}$ be the Bellman consistency operator for $V$ or $Q$. Then, Anc-VI exhibits the rate

\|T^{\pi}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-U^{\pi}\|_{\infty}
 = \left(\frac{2}{k+1}+\frac{k-1}{k+1}\epsilon+O(\epsilon^{2})\right)\|U^{0}-U^{\pi}\|_{\infty}\qquad\text{ for }k=0,1,\dots,

where $\epsilon=1-\gamma$ and the big-$\mathcal{O}$ notation considers the limit $\epsilon\rightarrow 0$. If, furthermore, $U^{0}\leq T^{\pi}U^{0}$ or $U^{0}\geq T^{\pi}U^{0}$, then Anc-VI exhibits the rate

\|T^{\pi}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-U^{\pi}\|_{\infty}
 = \left(\frac{1}{k+1}+\frac{k}{k+1}\epsilon+O(\epsilon^{2})\right)\|U^{0}-U^{\pi}\|_{\infty}\qquad\text{ for }k=0,1,\dots.

If $\gamma\geq\frac{1}{2}$, both rates of Theorem 1 are strictly faster than the standard rate (1) of VI, since

\frac{\left(\gamma^{-1}-\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}=\gamma^{k}\frac{\left(1-\gamma^{2}\right)\left(1+2\gamma-\gamma^{k+1}\right)}{1-\gamma^{2k+2}}<\gamma^{k}(1+\gamma).

The second rate of Theorem 1, which has the additional requirement, is faster than the standard rate (1) of VI for all $0<\gamma<1$. Interestingly, in the $\gamma\approx 1$ regime, Anc-VI achieves a $\mathcal{O}(1/k)$-rate while VI has a $\mathcal{O}(1)$-rate. We briefly note that the conditions $U^{0}\leq TU^{0}$ and $U^{0}\geq TU^{0}$ have been used in analyses of variants of VI [69, Theorem 6.3.11], [77, p.3].
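To make the comparison concrete, the following snippet (illustrative only; the choice $\gamma=0.99$ is ours) evaluates the first bound of Theorem 1 against the standard VI bound $(1+\gamma)\gamma^{k}$ from (1):

```python
def vi_bound(gamma, k):
    # Standard VI bound (1): (1 + gamma) * gamma^k.
    return (1 + gamma) * gamma ** k

def anc_vi_bound(gamma, k):
    # First rate of Theorem 1:
    # (gamma^{-1} - gamma)(1 + 2*gamma - gamma^{k+1}) / (gamma^{-(k+1)} - gamma^{k+1}).
    num = (1 / gamma - gamma) * (1 + 2 * gamma - gamma ** (k + 1))
    den = gamma ** (-(k + 1)) - gamma ** (k + 1)
    return num / den

gamma = 0.99
for k in (10, 100, 1000):
    print(k, vi_bound(gamma, k), anc_vi_bound(gamma, k))
```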

In the following, we briefly outline the proof of Theorem 1 while deferring the full description to Appendix B. In the outline, we highlight a particular step, labeled $\blacktriangle$, that crucially relies on the linearity of the Bellman consistency operator. In the analysis for the Bellman optimality operator of Theorem 2, resolving the $\blacktriangle$ step despite the nonlinearity is the key technical challenge.

Proof outline of Theorem 1.

Recall that we can write the Bellman consistency operator as $T^{\pi}V=r^{\pi}+\gamma\mathcal{P}^{\pi}V$ and $T^{\pi}Q=r+\gamma\mathcal{P}^{\pi}Q$. Since $T^{\pi}$ is a linear operator (arguably, $T^{\pi}$ is affine, not linear, but we follow the convention of [69] and say $T^{\pi}$ is linear), we get

T^{\pi}U^{k}-U^{k} = T^{\pi}U^{k}-(1-\beta_{k})T^{\pi}U^{k-1}-\beta_{k}T^{\pi}U^{\pi}-\beta_{k}(U^{0}-U^{\pi})
 \stackrel{\blacktriangle}{=} \gamma\mathcal{P}^{\pi}\left(U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\pi}\right)-\beta_{k}(U^{0}-U^{\pi})
 = \gamma\mathcal{P}^{\pi}\left(\beta_{k}(U^{0}-U^{\pi})+(1-\beta_{k})(T^{\pi}U^{k-1}-U^{k-1})\right)-\beta_{k}(U^{0}-U^{\pi})
 = \sum_{i=1}^{k}\left[\left(\beta_{i}-\beta_{i-1}(1-\beta_{i})\right)\left(\prod^{k}_{j=i+1}(1-\beta_{j})\right)\left(\gamma\mathcal{P}^{\pi}\right)^{k-i+1}(U^{0}-U^{\pi})\right]
 \quad-\beta_{k}(U^{0}-U^{\pi})+\left(\prod^{k}_{j=1}(1-\beta_{j})\right)\left(\gamma\mathcal{P}^{\pi}\right)^{k+1}(U^{0}-U^{\pi}),

where the first equality follows from the definition of Anc-VI and the fixed-point property $U^{\pi}=T^{\pi}U^{\pi}$, while the last equality follows from induction. Taking the $\|\cdot\|_{\infty}$-norm of both sides, we conclude

\|T^{\pi}U^{k}-U^{k}\|_{\infty}\leq\frac{\left(\gamma^{-1}-\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-U^{\pi}\|_{\infty}. ∎

2.2 Accelerated rate for Bellman optimality operator

We now present the accelerated convergence rate of Anc-VI for the Bellman optimality operator.

Our analysis uses what we call the Bellman anti-optimality operator, defined as

\hat{T}^{\star}V(s) = \inf_{a\in\mathcal{A}}\left\{r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})\right]\right\},
\hat{T}^{\star}Q(s,a) = r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\inf_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right],

for all $s\in\mathcal{S}$ and $a\in\mathcal{A}$. (The sup is replaced with an inf.) When $0<\gamma<1$, the Bellman anti-optimality operator is $\gamma$-contractive and has a unique fixed point $\hat{U}^{\star}$ by the exact same arguments that establish $\gamma$-contractiveness of the standard Bellman optimality operator.

Theorem 2.

Let $0<\gamma<1$ be the discount factor. Let $T^{\star}$ and $\hat{T}^{\star}$ respectively be the Bellman optimality and anti-optimality operators for $V$ or $Q$. Let $U^{\star}$ and $\hat{U}^{\star}$ respectively be the fixed points of $T^{\star}$ and $\hat{T}^{\star}$. Then, Anc-VI exhibits the rate

\|T^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\max\left\{\|U^{0}-U^{\star}\|_{\infty},\|U^{0}-\hat{U}^{\star}\|_{\infty}\right\}

for $k=0,1,\dots$. If, furthermore, $U^{0}\leq T^{\star}U^{0}$ or $U^{0}\geq T^{\star}U^{0}$, then Anc-VI exhibits the rate

\|T^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-U^{\star}\|_{\infty}\quad\text{if }U^{0}\leq T^{\star}U^{0},
\|T^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-\hat{U}^{\star}\|_{\infty}\quad\text{if }U^{0}\geq T^{\star}U^{0}

for $k=0,1,\dots$.

Anc-VI with the Bellman optimality operator exhibits the same accelerated convergence rate as Anc-VI with the Bellman consistency operator. As in Theorem 1, the rate of Theorem 2 also becomes $\mathcal{O}(1/k)$ when $\gamma\approx 1$, while VI has a $\mathcal{O}(1)$-rate.

Proof outline of Theorem 2.

The key technical challenge of the proof comes from the fact that the Bellman optimality operator is nonlinear. Similar to the Bellman consistency operator case, we have

T^{\star}U^{k}-U^{k} = T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}T^{\star}U^{\star}-\beta_{k}(U^{0}-U^{\star})
 \stackrel{\blacktriangle}{\leq} \gamma\mathcal{P}^{\pi_{k}}\left(U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\star}\right)-\beta_{k}(U^{0}-U^{\star})
 = \gamma\mathcal{P}^{\pi_{k}}\left(\beta_{k}(U^{0}-U^{\star})+(1-\beta_{k})(T^{\star}U^{k-1}-U^{k-1})\right)-\beta_{k}(U^{0}-U^{\star})
 \leq \sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left(\prod^{k}_{j=i+1}(1-\beta_{j})\right)\left(\prod^{i}_{l=k}\gamma\mathcal{P}^{\pi_{l}}\right)(U^{0}-U^{\star})\right]
 \quad-\beta_{k}(U^{0}-U^{\star})+\left(\prod^{k}_{j=1}(1-\beta_{j})\right)\left(\prod^{0}_{l=k}\gamma\mathcal{P}^{\pi_{l}}\right)(U^{0}-U^{\star}),

where $\pi_{k}$ is the greedy policy satisfying $T^{\pi_{k}}U^{k}=T^{\star}U^{k}$, we define $\prod^{i}_{l=k}\gamma\mathcal{P}^{\pi_{l}}=\gamma\mathcal{P}^{\pi_{k}}\gamma\mathcal{P}^{\pi_{k-1}}\cdots\gamma\mathcal{P}^{\pi_{i}}$, and the last inequality follows from induction and the monotonicity of the Bellman optimality operator. The key step $\blacktriangle$ uses the greedy policies $\{\pi_{l}\}_{l=0,1,\dots,k}$, which are well defined when the action space is finite. When the action space is infinite, greedy policies may not exist, so we use the Hahn–Banach extension theorem to overcome this technicality. The full argument is provided in Appendix B.

To lower bound $T^{\star}U^{k}-U^{k}$, we use a similar line of reasoning with the Bellman anti-optimality operator. Combining the upper and lower bounds of $T^{\star}U^{k}-U^{k}$, we conclude the accelerated rate of Theorem 2. ∎

For $\gamma<1$, the rates of Theorems 1 and 2 can be translated to a bound on the distance to solution:

\|U^{k}-U^{\star}\|_{\infty}\leq\gamma^{k}\frac{\left(1+\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{1-\gamma^{2k+2}}\|U^{0}-U^{\star}\|_{\infty}

for $k=1,2,\dots$. This $\mathcal{O}(\gamma^{k})$ rate is worse than the rate of (classical) VI by a constant factor. Therefore, Anc-VI is better than VI in terms of the Bellman error, but it is not better than VI in terms of distance to solution.

3 Convergence when $\gamma=1$

Undiscounted MDPs are not commonly studied in the DP and RL theory literature due to the following difficulties: the Bellman consistency and optimality operators may not have fixed points, VI is a nonexpansive (not contractive) fixed-point iteration and may not converge to a fixed point even if one exists, and the interpretation of a fixed point as the (optimal) value function becomes unclear when the fixed point is not unique. However, many modern deep RL setups actually do not use discounting, and this empirical practice makes the theoretical analysis with $\gamma=1$ relevant. (As a specific example, the classical policy gradient theorem [82] calls for the use of $\nabla J(\theta)=\mathbb{E}\left[\sum^{\infty}_{t=0}\gamma^{t}\nabla_{\theta}\log\pi_{\theta}(a_{t}\,|\,s_{t})Q^{\phi}_{\gamma}(s_{t},a_{t})\right]$, but many modern deep policy gradient methods use $\gamma=1$ in the first instance of $\gamma$ (so $\gamma^{t}=1$) while using $\gamma<1$ in $Q^{\phi}_{\gamma}(s_{t},a_{t})$ [63].)

In this section, we show that Anc-VI converges to fixed points of the Bellman consistency and optimality operators of undiscounted MDPs. While a full treatment of undiscounted MDPs is beyond the scope of this paper, we show that fixed points, if they exist, can be found, and we therefore argue that the inability to find fixed points should not be considered an obstacle in studying the $\gamma=1$ setup.

We first state our convergence result for finite state-action spaces.

Theorem 3.

Let $\gamma=1$. Let $T\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ be the nonexpansive Bellman consistency or optimality operator for $V$ or $Q$. Assume a fixed point exists (not necessarily unique). If $U^{0}\leq TU^{0}$, then Anc-VI exhibits the rate

\|TU^{k}-U^{k}\|_{\infty}\leq\frac{1}{k+1}\|U^{0}-U^{\star}\|_{\infty}\qquad\text{ for }k=0,1,\dots

for any fixed point $U^{\star}$ satisfying $U^{0}\leq U^{\star}$. Furthermore, $U^{k}\rightarrow U^{\infty}$ for some fixed point $U^{\infty}$.

If rewards are nonnegative, then the condition $U^{0}\leq TU^{0}$ is satisfied with $U^{0}=0$. So, under this mild condition, Anc-VI with $\gamma=1$ converges with a $\mathcal{O}(1/k)$-rate on the Bellman error. To clarify, the convergence $U^{k}\rightarrow U^{\infty}$ has no rate, i.e., $\|U^{k}-U^{\infty}\|_{\infty}=o(1)$, while $\|TU^{k}-U^{k}\|_{\infty}=\mathcal{O}(1/k)$. In contrast, standard VI does not guarantee convergence in this setup.

We also point out that the convergence of the Bellman error does not immediately imply point convergence, i.e., $TU^{k}-U^{k}\rightarrow 0$ does not immediately imply $U^{k}\rightarrow U^{\star}$, when $\gamma=1$. Rather, we show (i) $U^{k}$ is a bounded sequence, (ii) any convergent subsequence $U^{k_{j}}$ converges to a fixed point $U^{\infty}$, and (iii) $U^{k}$ is elementwise monotonically nondecreasing and therefore has a single limit.

Next, we state our convergence result for general state-action spaces.

Theorem 4.

Let $\gamma=1$. Let the state and action spaces be general (possibly infinite) sets. Let $T$ be the nonexpansive Bellman consistency or optimality operator for $V$ or $Q$, and assume $T$ is well defined. (Well-definedness of $T$ requires a $\sigma$-algebra on the state and action spaces, expectations with respect to the transition probability and policy to be well defined, boundedness and measurability of the output of the Bellman operators, etc.) Assume a fixed point exists (not necessarily unique). If $U^{0}\leq TU^{0}$, then Anc-VI exhibits the rate

\|TU^{k}-U^{k}\|_{\infty}\leq\frac{1}{k+1}\|U^{0}-U^{\star}\|_{\infty}\qquad\text{ for }k=0,1,\dots

for any fixed point $U^{\star}$ satisfying $U^{0}\leq U^{\star}$. Furthermore, $U^{k}\rightarrow U^{\infty}$ pointwise monotonically for some fixed point $U^{\infty}$.

The convergence $U^{k}\rightarrow U^{\infty}$ pointwise in infinite state-action spaces is, in our view, a non-trivial contribution. When the state-action space is finite, pointwise convergence directly implies convergence in $\|\cdot\|_{\infty}$, and in this sense, Theorem 4 is a generalization of Theorem 3. However, when the state-action space is infinite, pointwise convergence does not necessarily imply uniform convergence, i.e., $U^{k}\rightarrow U^{\infty}$ pointwise does not necessarily imply $U^{k}\rightarrow U^{\infty}$ in $\|\cdot\|_{\infty}$.

4 Complexity lower bound

We now present a complexity lower bound establishing optimality of Anc-VI.

Theorem 5.

Let $k\geq 0$, $n\geq k+2$, $0<\gamma\leq 1$, and $U^{0}\in\mathbb{R}^{n}$. Then there exists an MDP with $|\mathcal{S}|=n$ and $|\mathcal{A}|=1$ (which implies that the Bellman consistency and optimality operators for $V$ and $Q$ all coincide as $T\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$) such that $T$ has a fixed point $U^{\star}$ satisfying $U^{0}\leq U^{\star}$ and

\|TU^{k}-U^{k}\|_{\infty}\geq\frac{\gamma^{k}}{\sum_{i=0}^{k}\gamma^{i}}\|U^{0}-U^{\star}\|_{\infty}

for any iterates $\{U^{i}\}^{k}_{i=0}$ satisfying

U^{i}\in U^{0}+\mathrm{span}\{TU^{0}-U^{0},TU^{1}-U^{1},\dots,TU^{i-1}-U^{i-1}\}\qquad\text{ for }i=1,\dots,k.
Proof outline of Theorem 5.

Without loss of generality, assume $n=k+2$ and $U^{0}=0$. Consider the MDP $(\mathcal{S},\mathcal{A},P,r,\gamma)$ with

\mathcal{S}=\{s_{1},\dots,s_{k+2}\},\quad\mathcal{A}=\{a_{1}\},\quad P(s_{i}\,|\,s_{j},a_{1})=\mathds{1}_{\{i=j=1\ \text{or}\ j=i+1\}},\quad r(s_{i},a_{1})=\mathds{1}_{\{i=2\}}.

Then, $TU=\gamma\mathcal{P}^{\pi}U+[0,1,0,\dots,0]^{\intercal}$, $U^{\star}=[0,1,\gamma,\dots,\gamma^{k}]^{\intercal}$, and $\|U^{0}-U^{\star}\|_{\infty}=1$. Under the span condition, we can show that $\left(U^{k}\right)_{1}=\left(U^{k}\right)_{k+2}=0$. Then, we get

TU^{k}-U^{k}=\left(0,\,1-\left(U^{k}\right)_{2},\,\gamma\left(U^{k}\right)_{2}-\left(U^{k}\right)_{3},\,\dots,\,\gamma\left(U^{k}\right)_{k}-\left(U^{k}\right)_{k+1},\,\gamma\left(U^{k}\right)_{k+1}\right)

and this implies

\left(TU^{k}-U^{k}\right)_{1}+\left(TU^{k}-U^{k}\right)_{2}+\gamma^{-1}\left(TU^{k}-U^{k}\right)_{3}+\cdots+\gamma^{-k}\left(TU^{k}-U^{k}\right)_{k+2}=1.

Since the first coordinate is zero, taking absolute values and applying the triangle inequality gives

(1+\gamma^{-1}+\cdots+\gamma^{-k})\max_{1\leq i\leq k+2}\left\{\left|\left(TU^{k}-U^{k}\right)_{i}\right|\right\}\geq 1.

Therefore, we conclude

\|TU^{k}-U^{k}\|_{\infty}\geq\frac{\gamma^{k}}{\sum_{i=0}^{k}\gamma^{i}}\|U^{0}-U^{\star}\|_{\infty}. ∎
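The construction above is easy to instantiate numerically. The sketch below (illustrative only; the choices $k=20$ and $\gamma=0.9$ are ours) builds the worst-case MDP, computes its fixed point, and compares the Bellman error of Anc-VI, which satisfies the span condition, against the lower-bound quantity:

```python
import numpy as np

def worst_case_mdp(k, gamma):
    """n = k+2 states, one action: s_1 -> s_1 and s_j -> s_{j-1} for j >= 2; reward 1 only at s_2."""
    n = k + 2
    P = np.zeros((n, n))
    P[0, 0] = 1.0
    for j in range(1, n):
        P[j, j - 1] = 1.0
    r = np.zeros(n)
    r[1] = 1.0
    T = lambda U: r + gamma * P @ U
    U_star = np.linalg.solve(np.eye(n) - gamma * P, r)  # fixed point (gamma < 1)
    return T, U_star

k, gamma = 20, 0.9
T, U_star = worst_case_mdp(k, gamma)
U0 = np.zeros(k + 2)

U = U0.copy()                                   # run Anc-VI for k iterations
for i in range(1, k + 1):
    beta = 1.0 / sum(gamma ** (-2 * j) for j in range(i + 1))
    U = beta * U0 + (1.0 - beta) * T(U)

bellman_error = np.max(np.abs(T(U) - U))
lower_bound = gamma ** k / sum(gamma ** i for i in range(k + 1)) * np.max(np.abs(U0 - U_star))
print(bellman_error, lower_bound)               # bellman_error is at least lower_bound
```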

Note that the case $\gamma=1$ is included in Theorem 5. When $\gamma=1$, the lower bound of Theorem 5 exactly matches the upper bound of Theorem 3.

Since

\frac{\gamma^{k}}{\sum_{i=0}^{k}\gamma^{i}}\leq\frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\leq\frac{4\gamma^{k}}{\sum_{i=0}^{k}\gamma^{i}}\qquad\text{ for all }0<\gamma<1,

the lower bound establishes optimality of the second rates of Theorems 1 and 2 up to a constant factor of $4$. Theorem 5 improves upon the prior state-of-the-art complexity lower bound established in the proof of [37, Theorem 3] by a factor of $1-\gamma^{k+1}$. (In [37, Theorem 3], a lower bound on the distance to the optimal value function is provided. Their result has an implicit dependence on the initial distance to the optimal value function $\|U^{0}-U^{\star}\|_{\infty}$, so we make the dependence explicit, and we translate their result to a lower bound on the Bellman error. Once this is done, the difference between our lower bound of Theorem 5 and that of [37, Theorem 3] is a factor of $1-\gamma^{k+1}$. The worst-case MDP of [37, Theorem 3] and our worst-case MDP primarily differ in the rewards, while the states and the transition probabilities are almost the same.)

The so-called “span condition” of Theorem 5 is arguably very natural and is satisfied by standard VI and Anc-VI. The span condition is commonly used in the construction of complexity lower bounds on first-order optimization methods [59, 23, 25, 13, 14, 65] and has been used in the prior state-of-the-art lower bound for standard VI [37, Theorem 3]. However, designing an algorithm that breaks the lower bound of Theorem 5 by violating the span condition remains a possibility. In optimization theory, there is precedent for lower bounds being broken by violating seemingly natural and minute conditions [40, 35, 98].

5 Approximate Anchored Value Iteration

In this section, we show that the anchoring mechanism is robust against evaluation errors of the Bellman operator, just as much as the standard approximate VI.

Let $0<\gamma<1$ and let $T^{\star}$ be the Bellman optimality operator. The Approximate Anchored Value Iteration (Apx-Anc-VI) is

U_{\epsilon}^{k}=T^{\star}U^{k-1}+\epsilon^{k-1},\qquad U^{k}=\beta_{k}U^{0}+(1-\beta_{k})U_{\epsilon}^{k}\qquad\text{(Apx-Anc-VI)}

for $k=1,2,\dots$, where $\beta_{k}=1/(\sum_{i=0}^{k}\gamma^{-2i})$, $U^{0}$ is an initial point, and $\{\epsilon^{k}\}_{k=0}^{\infty}$ is the error sequence modeling approximate evaluations of $T^{\star}$.
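A minimal sketch of Apx-Anc-VI (ours, not from the paper); the noise model, `noise_scale`, and `bellman_op` are illustrative assumptions, with `bellman_op` standing for any implementation of $T^{\star}$:

```python
import numpy as np

def apx_anc_vi(bellman_op, U0, gamma, num_iter, noise_scale=1e-3, rng=None):
    """Apx-Anc-VI: U_eps^k = T* U^{k-1} + eps^{k-1},  U^k = beta_k U^0 + (1 - beta_k) U_eps^k."""
    rng = np.random.default_rng() if rng is None else rng
    U = U0.copy()
    for k in range(1, num_iter + 1):
        eps = noise_scale * rng.uniform(-1.0, 1.0, size=U.shape)  # inexact evaluation of T*
        U_eps = bellman_op(U) + eps
        beta = 1.0 / sum(gamma ** (-2 * i) for i in range(k + 1))
        U = beta * U0 + (1.0 - beta) * U_eps
    return U
```

Theorem 6 below controls the effect of the errors through $\max_{0\leq i\leq k-1}\|\epsilon^{i}\|_{\infty}$, which in this sketch is bounded by `noise_scale`.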

Of course, the classical Approximate Value Iteration (Apx-VI) is

U^{k}=T^{\star}U^{k-1}+\epsilon^{k-1}\qquad\text{(Apx-VI)}

for $k=1,2,\dots$, where $U^{0}$ is an initial point.

Fact 1 (Classical result, [11, p.333]).

Let $0<\gamma<1$ be the discount factor. Let $T^{\star}$ be the Bellman optimality operator for $V$ or $Q$. Let $U^{\star}$ be the fixed point of $T^{\star}$. Then Apx-VI exhibits the rate

\|T^{\star}U^{k}-U^{k}\|_{\infty}\leq(1+\gamma)\gamma^{k}\|U^{0}-U^{\star}\|_{\infty}+\left(1+\gamma\right)\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\|\epsilon^{i}\|_{\infty}\qquad\text{ for }k=1,2,\dots.
Theorem 6.

Let $0<\gamma<1$ be the discount factor. Let $T^{\star}$ and $\hat{T}^{\star}$ respectively be the Bellman optimality and anti-optimality operators for $V$ or $Q$. Let $U^{\star}$ and $\hat{U}^{\star}$ respectively be the fixed points of $T^{\star}$ and $\hat{T}^{\star}$. Then Apx-Anc-VI exhibits the rate

\|T^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\max\left\{\|U^{0}-U^{\star}\|_{\infty},\|U^{0}-\hat{U}^{\star}\|_{\infty}\right\}
 +\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\|\epsilon^{i}\|_{\infty}\qquad\text{ for }k=1,2,\dots.

If, furthermore, $U^{0}\geq T^{\star}U^{0}$, then (Apx-Anc-VI) exhibits the rate

\|T^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-\hat{U}^{\star}\|_{\infty}+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\|\epsilon^{i}\|_{\infty}

for $k=1,2,\dots$.

The dependence on $\max_{i}\|\epsilon^{i}\|_{\infty}$ of Apx-Anc-VI is no worse than that of Apx-VI. In this sense, Apx-Anc-VI is robust against evaluation errors of the Bellman operator, just as much as the standard Apx-VI. Finally, we note that a similar analysis can be done for Apx-Anc-VI with the Bellman consistency operator.

6 Gauss–Seidel Anchored Value Iteration

In this section, we show that the anchoring mechanism can be combined with Gauss–Seidel-type updates in finite state-action spaces. Let $0<\gamma<1$ and let $T^{\star}\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ be the Bellman optimality operator. Define $T_{GS}^{\star}\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ as

T_{GS}^{\star}=T^{\star}_{n}\cdots T^{\star}_{2}T^{\star}_{1},

where $T_{j}^{\star}\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n}$ is defined as

T^{\star}_{j}(U)=(U_{1},\dots,U_{j-1},\left(T^{\star}(U)\right)_{j},U_{j+1},\dots,U_{n})

for $j=1,\dots,n$.

Fact 2 (Classical result, [69, Theorem 6.3.4]).

$T_{GS}^{\star}$ is a $\gamma$-contractive operator and has the same fixed point as $T^{\star}$.

The Gauss–Seidel Anchored Value Iteration (GS-Anc-VI) is

U^{k}=\beta_{k}U^{0}+(1-\beta_{k})T_{GS}^{\star}U^{k-1}\qquad\text{(GS-Anc-VI)}

for $k=1,2,\dots$, where $\beta_{k}=1/(\sum_{i=0}^{k}\gamma^{-2i})$ and $U^{0}$ is an initial point.
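A minimal sketch of $T_{GS}^{\star}$ and GS-Anc-VI for a finite MDP (ours, not from the paper); the tensors `P` of shape $|\mathcal{S}|\times|\mathcal{A}|\times|\mathcal{S}|$ and `r` of shape $|\mathcal{S}|\times|\mathcal{A}|$ are hypothetical inputs, and $U$ is taken to be a $V$-function updated one coordinate at a time:

```python
import numpy as np

def gauss_seidel_bellman_V(V, P, r, gamma):
    """T*_GS = T*_n ... T*_2 T*_1: sweep the coordinates, each T*_j updating V[j] in place."""
    V = V.copy()
    for j in range(V.shape[0]):
        # (T* V)(s_j) = max_a { r(s_j, a) + gamma * sum_{s'} P(s'|s_j, a) V(s') },
        # evaluated with the coordinates already updated earlier in this sweep.
        V[j] = np.max(r[j] + gamma * P[j] @ V)
    return V

def gs_anc_vi(V0, P, r, gamma, num_iter):
    """GS-Anc-VI: U^k = beta_k U^0 + (1 - beta_k) T*_GS U^{k-1}."""
    V = V0.copy()
    for k in range(1, num_iter + 1):
        beta = 1.0 / sum(gamma ** (-2 * i) for i in range(k + 1))
        V = beta * V0 + (1.0 - beta) * gauss_seidel_bellman_V(V, P, r, gamma)
    return V
```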

Theorem 7.

Let the state and action spaces be finite sets. Let $0<\gamma<1$ be the discount factor. Let $T^{\star}$ and $\hat{T}^{\star}$ respectively be the Bellman optimality and anti-optimality operators for $V$ or $Q$. Let $U^{\star}$ and $\hat{U}^{\star}$ respectively be the fixed points of $T^{\star}$ and $\hat{T}^{\star}$. Then GS-Anc-VI exhibits the rate

\|T_{GS}^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+2\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\max\left\{\|U^{0}-U^{\star}\|_{\infty},\|U^{0}-\hat{U}^{\star}\|_{\infty}\right\}

for $k=0,1,\dots$. If, furthermore, $U^{0}\leq T_{GS}^{\star}U^{0}$ or $U^{0}\geq T_{GS}^{\star}U^{0}$, then GS-Anc-VI exhibits the rate

\|T_{GS}^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-U^{\star}\|_{\infty}\quad\text{if }U^{0}\leq T_{GS}^{\star}U^{0},
\|T_{GS}^{\star}U^{k}-U^{k}\|_{\infty} \leq \frac{\left(\gamma^{-1}-\gamma\right)\left(1+\gamma-\gamma^{k+1}\right)}{\left(\gamma^{k+1}\right)^{-1}-\gamma^{k+1}}\|U^{0}-\hat{U}^{\star}\|_{\infty}\quad\text{if }U^{0}\geq T_{GS}^{\star}U^{0}

for $k=0,1,\dots$.

We point out that GS-Anc-VI cannot be directly extended to infinite action spaces since the Hahn–Banach extension theorem is not applicable in the Gauss–Seidel setup. Furthermore, we note that a similar analysis can be carried out for GS-Anc-VI with the Bellman consistency operator.

7 Conclusion

We show that the classical value iteration (VI) is, in fact, suboptimal and that the anchoring mechanism accelerates VI to be optimal in the sense that the accelerated rate matches a complexity lower bound up to a constant factor of $4$. We also show that the accelerated iteration provably converges to a fixed point even when $\gamma=1$, if a fixed point exists. Being able to provide a substantive improvement upon the classical VI is, in our view, a surprising contribution.

One direction of future work is to study the empirical effectiveness of Anc-VI. Another direction is to analyze Anc-VI in a model-free setting and, more broadly, to investigate the effectiveness of the anchor mechanism in more practical RL methods.

Our results lead us to believe that many of the classical foundations of dynamic programming and reinforcement learning may be improved with a careful examination based on an optimization complexity theory perspective. The theory of optimal optimization algorithms has recently enjoyed significant developments [44, 43, 45, 98, 66], the anchoring mechanism being one such example [49, 65], and the classical DP and RL theory may benefit from a similar line of investigation on iteration complexity.

Acknowledgments and Disclosure of Funding

This work was supported by the Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) [No. 2021-0-01343, Artificial Intelligence Graduate School Program (Seoul National University)] and the Samsung Science and Technology Foundation (Project Number SSTF-BA2101-02). We thank Jisun Park for providing valuable feedback.

References

  • [1] M. Akian, S. Gaubert, Z. Qu, and O. Saadi. Multiply accelerated value iteration for non-symmetric affine fixed point problems and application to Markov decision processes. SIAM Journal on Matrix Analysis and Applications, 43(1):199–232, 2022.
  • [2] D. G. Anderson. Iterative procedures for nonlinear integral equations. Journal of the Association for Computing Machinery, 12(4):547–560, 1965.
  • [3] D. Andre, N. Friedman, and R. Parr. Generalized prioritized sweeping. Neural Information Processing Systems, 1997.
  • [4] M. G. Azar, R. Munos, M. Ghavamzadeh, and H. Kappen. Speedy Q-learning. Neural Information Processing Systems, 2011.
  • [5] S. Banach. Sur les opérations dans les ensembles abstraits et leur application aux équations intégrales. Fundamenta Mathematicae, 3(1):133–181, 1922.
  • [6] M. Barré, A. Taylor, and A. d’Aspremont. Convergence of a constrained vector extrapolation scheme. SIAM Journal on Mathematics of Data Science, 4(3):979–1002, 2022.
  • [7] M. G. Bellemare, G. Ostrovski, A. Guez, P. Thomas, and R. Munos. Increasing the action gap: New operators for reinforcement learning. Association for the Advancement of Artificial Intelligence, 2016.
  • [8] R. Bellman. A Markovian decision process. Journal of Mathematics and Mechanics, 6(5):679–684, 1957.
  • [9] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical Methods. Athena Scientific, 2015.
  • [10] D. P. Bertsekas. Dynamic Programming and Optimal Control, volume II. 4th edition, 2012.
  • [11] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1995.
  • [12] W. Bowen, X. Huaqing, Z. Lin, L. Yingbin, and Z. Wei. Finite-time theory for momentum Q-learning. Conference on Uncertainty in Artificial Intelligence, 2021.
  • [13] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points I. Mathematical Programming, 184(1–2):71–120, 2020.
  • [14] Y. Carmon, J. C. Duchi, O. Hinder, and A. Sidford. Lower bounds for finding stationary points II: first-order methods. Mathematical Programming, 185(1–2):315–355, 2021.
  • [15] A.-L. Cauchy. Méthode générale pour la résolution des systemes d’équations simultanées. Comptes rendus de l’Académie des Sciences, 25:536–538, 1847.
  • [16] V. Colao and G. Marino. On the rate of convergence of Halpern iterations. Journal of Nonlinear and Convex Analysis, 22(12):2639–2646, 2021.
  • [17] J. P. Contreras and R. Cominetti. Optimal error bounds for non-expansive fixed-point iterations in normed spaces. Mathematical Programming, 199(1–2):343–374, 2022.
  • [18] P. Dai, D. S. Weld, J. Goldsmith, et al. Topological value iteration algorithms. Journal of Artificial Intelligence Research, 42:181–209, 2011.
  • [19] D. P. De Farias and B. Van Roy. On the existence of fixed points for approximate value iteration and temporal-difference learning. Journal of Optimization theory and Applications, 105:589–608, 2000.
  • [20] J. Diakonikolas. Halpern iteration for near-optimal and parameter-free monotone inclusion and strong solutions to variational inequalities. Conference on Learning Theory, 2020.
  • [21] J. Diakonikolas and P. Wang. Potential function-based framework for minimizing gradients in convex and min-max optimization. SIAM Journal on Optimization, 32(3):1668–1697, 2022.
  • [22] Q. Dong, H. Yuan, Y. Cho, and T. M. Rassias. Modified inertial Mann algorithm and inertial CQ-algorithm for nonexpansive mappings. Optimization Letters, 12(1):87–102, 2018.
  • [23] Y. Drori. The exact information-based complexity of smooth convex minimization. Journal of Complexity, 39:1–16, 2017.
  • [24] Y. Drori and O. Shamir. The complexity of finding stationary points with stochastic gradient descent. International Conference on Machine Learning, 2020.
  • [25] Y. Drori and A. Taylor. On the oracle complexity of smooth strongly convex minimization. Journal of Complexity, 68, 2022.
  • [26] Y. Drori and M. Teboulle. An optimal variant of Kelley’s cutting-plane method. Mathematical Programming, 160(1–2):321–351, 2016.
  • [27] M. Ermis, M. Park, and I. Yang. On Anderson acceleration for partially observable Markov decision processes. IEEE Conference on Decision and Control, 2021.
  • [28] M. Ermis and I. Yang. A3DQN: Adaptive Anderson acceleration for deep Q-networks. IEEE Symposium Series on Computational Intelligence, 2020.
  • [29] D. Ernst, P. Geurts, and L. Wehenkel. Tree-based batch mode reinforcement learning. Journal of Machine Learning Research, 6:503–556, 2005.
  • [30] D. Ernst, M. Glavic, P. Geurts, and L. Wehenkel. Approximate value iteration in the reinforcement learning context. Application to electrical power system control. International Journal of Emerging Electric Power Systems, 3(1), 2005.
  • [31] A.-m. Farahmand and M. Ghavamzadeh. PID accelerated value iteration algorithm. International Conference on Machine Learning, 2021.
  • [32] A.-m. Farahmand, C. Szepesvári, and R. Munos. Error propagation for approximate policy and value iteration. Neural Information Processing Systems, 2010.
  • [33] M. Fellows, K. Hartikainen, and S. Whiteson. Bayesian Bellman operators. Neural Information Processing Systems, 2021.
  • [34] M. Geist and B. Scherrer. Anderson acceleration for reinforcement learning. European Workshop on Reinforcement Learning, 2018.
  • [35] N. Golowich, S. Pattathil, C. Daskalakis, and A. Ozdaglar. Last iterate is slower than averaged iterate in smooth convex-concave saddle point problems. Conference on Learning Theory, 2020.
  • [36] G. J. Gordon. Stable function approximation in dynamic programming. International Conference on Machine Learning, 1995.
  • [37] V. Goyal and J. Grand-Clément. A first-order approach to accelerated value iteration. Operations Research, 71(2):517–535, 2022.
  • [38] J. Grand-Clément. From convex optimization to MDPs: A review of first-order, second-order and quasi-newton methods for MDPs. arXiv:2104.10677, 2021.
  • [39] B. Halpern. Fixed points of nonexpanding maps. Bulletin of the American Mathematical Society, 73(6):957–961, 1967.
  • [40] R. Hannah, Y. Liu, D. O’Connor, and W. Yin. Breaking the span assumption yields fast finite-sum minimization. Neural Information Processing Systems, 2018.
  • [41] M. Hessel, J. Modayil, H. Van Hasselt, T. Schaul, G. Ostrovski, W. Dabney, D. Horgan, B. Piot, M. Azar, and D. Silver. Rainbow: Combining improvements in deep reinforcement learning. Association for the Advancement of Artificial Intelligence, 2018.
  • [42] F. Iutzeler and J. M. Hendrickx. A generic online acceleration scheme for optimization algorithms via relaxation and inertia. Optimization Methods and Software, 34(2):383–405, 2019.
  • [43] D. Kim. Accelerated proximal point method for maximally monotone operators. Mathematical Programming, 190(1–2):57–87, 2021.
  • [44] D. Kim and J. A. Fessler. Optimized first-order methods for smooth convex minimization. Mathematical Programming, 159(1–2):81–107, 2016.
  • [45] D. Kim and J. A. Fessler. Optimizing the efficiency of first-order methods for decreasing the gradient of smooth convex functions. Journal of Optimization Theory and Applications, 188(1):192–219, 2021.
  • [46] J. Lee, C. Park, and E. K. Ryu. A geometric structure of acceleration and its role in making gradients small fast. Neural Information Processing Systems, 2021.
  • [47] S. Lee and D. Kim. Fast extra gradient methods for smooth structured nonconvex-nonconcave minimax problems. Neural Information Processing Systems, 2021.
  • [48] L. Leustean. Rates of asymptotic regularity for Halpern iterations of nonexpansive mappings. Journal of Universal Computer Science, 13(11):1680–1691, 2007.
  • [49] F. Lieder. On the convergence rate of the Halpern-iteration. Optimization Letters, 15(2):405–418, 2021.
  • [50] M. Lutter, S. Mannor, J. Peters, D. Fox, and A. Garg. Value iteration in continuous actions, states and time. International Conference on Machine Learning, 2021.
  • [51] P.-E. Maingé. Convergence theorems for inertial KM-type algorithms. Journal of Computational and Applied Mathematics, 219(1):223–236, 2008.
  • [52] A. massoud Farahmand, M. Ghavamzadeh, C. Szepesvári, and S. Mannor. Regularized fitted Q-iteration for planning in continuous-space Markovian decision problems. American Control Conference, 2009.
  • [53] H. B. McMahan and G. J. Gordon. Fast exact planning in Markov decision processes. International Conference on Automated Planning and Scheduling, 2005.
  • [54] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, S. Petersen, C. Beattie, A. Sadik, I. Antonoglou, H. King, D. Kumaran, D. Wierstra, S. Legg, and D. Hassabis. Human-level control through deep reinforcement learning. Nature, 518(7540):529–533, 2015.
  • [55] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103–130, 1993.
  • [56] R. Munos. Error bounds for approximate value iteration. Association for the Advancement of Artificial Intelligence, 2005.
  • [57] R. Munos and C. Szepesvári. Finite-time bounds for fitted value iteration. Journal of Machine Learning Research, 9(27):815–857, 2008.
  • [58] A. S. Nemirovski. Information-based complexity of linear operator equations. Journal of Complexity, 8(2):153–175, 1992.
  • [59] Y. Nesterov. Lectures on Convex Optimization. Springer, 2nd edition, 2018.
  • [60] Y. Nesterov, A. Gasnikov, S. Guminov, and P. Dvurechensky. Primal–dual accelerated gradient methods with small-dimensional relaxation oracle. Optimization Methods and Software, 36(4):773–810, 2021.
  • [61] Y. E. Nesterov. A method for solving the convex programming problem with convergence rate $\mathcal{O}(1/k^{2})$. Doklady Akademii Nauk SSSR, 269(3):543–547, 1983.
  • [62] S. Niu, S. Chen, H. Guo, C. Targonski, M. Smith, and J. Kovačević. Generalized value iteration networks: Life beyond lattices. Association for the Advancement of Artificial Intelligence, 2018.
  • [63] C. Nota and P. Thomas. Is the policy gradient a gradient? International Conference on Autonomous Agents and Multiagent Systems, 2020.
  • [64] C. Park, J. Park, and E. K. Ryu. Factor-$\sqrt{2}$ acceleration of accelerated gradient methods. Applied Mathematics & Optimization, 2023.
  • [65] J. Park and E. K. Ryu. Exact optimal accelerated complexity for fixed-point iterations. International Conference on Machine Learning, 2022.
  • [66] J. Park and E. K. Ryu. Accelerated infeasibility detection of constrained optimization and fixed-point iterations. International Conference on Machine Learning, 2023.
  • [67] M. Park, J. Shin, and I. Yang. Anderson acceleration for partially observable Markov decision processes: A maximum entropy approach. arXiv:2211.14998, 2022.
  • [68] J. Peng and R. J. Williams. Efficient learning and planning within the Dyna framework. Adaptive Behavior, 1(4):437–454, 1993.
  • [69] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley and Sons, 1994.
  • [70] S. Reich. Strong convergence theorems for resolvents of accretive operators in Banach spaces. Journal of Mathematical Analysis and Applications, 75(1):287–292, 1980.
  • [71] E. K. Ryu, K. Yuan, and W. Yin. ODE analysis of stochastic gradient methods with optimism and anchoring for minimax problems. arXiv:1905.10899, 2019.
  • [72] S. Sabach and S. Shtern. A first order method for solving convex bilevel optimization problems. SIAM Journal on Optimization, 27(2):640–660, 2017.
  • [73] A. Salim, L. Condat, D. Kovalev, and P. Richtárik. An optimal algorithm for strongly convex minimization under affine constraints. International Conference on Artificial Intelligence and Statistics, 2022.
  • [74] D. Scieur, A. d’Aspremont, and F. Bach. Regularized nonlinear acceleration. Mathematical Programming, 179(1–2):47–83, 2020.
  • [75] Y. Shehu. Convergence rate analysis of inertial Krasnoselskii–Mann type iteration with applications. Numerical Functional Analysis and Optimization, 39(10):1077–1091, 2018.
  • [76] W. Shi, S. Song, H. Wu, Y.-C. Hsu, C. Wu, and G. Huang. Regularized Anderson acceleration for off-policy deep reinforcement learning. Neural Information Processing Systems, 2019.
  • [77] O. Shlakhter, C.-G. Lee, D. Khmelev, and N. Jaber. Acceleration operators in the value iteration algorithms for Markov decision processes. Operations Research, 58(1):193–202, 2010.
  • [78] J. J. Suh, J. Park, and E. K. Ryu. Continuous-time analysis of anchor acceleration. Neural Information Processing Systems, 2023.
  • [79] K. Sun, Y. Wang, Y. Liu, B. Pan, S. Jui, B. Jiang, L. Kong, et al. Damped Anderson mixing for deep reinforcement learning: Acceleration, convergence, and stabilization. Neural Information Processing Systems, 2021.
  • [80] R. S. Sutton. Learning to predict by the methods of temporal differences. Machine Learning, 3:9–44, 1988.
  • [81] R. S. Sutton and A. G. Barto. Reinforcement Learning: An introduction. MIT press, 2nd edition, 2018.
  • [82] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour. Policy gradient methods for reinforcement learning with function approximation. Neural Information Processing Systems, 1999.
  • [83] Q. Sykora, M. Ren, and R. Urtasun. Multi-agent routing value iteration network. International Conference on Machine Learning, 2020.
  • [84] C. Szepesvári. Algorithms for Reinforcement Learning. Springer, 1st edition, 2010.
  • [85] A. Tamar, Y. Wu, G. Thomas, S. Levine, and P. Abbeel. Value iteration networks. Neural Information Processing Systems, 2016.
  • [86] A. Taylor and Y. Drori. An optimal gradient method for smooth strongly convex minimization. Mathematical Programming, 199(1-2):557–594, 2023.
  • [87] S. Tosatto, M. Pirotta, C. d’Eramo, and M. Restelli. Boosted fitted Q-iteration. International Conference on Machine Learning, 2017.
  • [88] J. N. Tsitsiklis. Asynchronous stochastic approximation and Q-learning. Machine Learning, 16:185–202, 1994.
  • [89] H. Van Hasselt, A. Guez, and D. Silver. Deep reinforcement learning with double Q-learning. Association for the Advancement of Artificial Intelligence, 2016.
  • [90] B. Van Roy. Performance loss bounds for approximate value iteration with state aggregation. Mathematics of Operations Research, 31(2):234–244, 2006.
  • [91] B. Van Scoy, R. A. Freeman, and K. M. Lynch. The fastest known globally convergent first-order method for minimizing strongly convex functions. IEEE Control Systems Letters, 2(1):49–54, 2018.
  • [92] N. Vieillard, B. Scherrer, O. Pietquin, and M. Geist. Momentum in reinforcement learning. International Conference on Artificial Intelligence and Statistics, 2020.
  • [93] H. F. Walker and P. Ni. Anderson acceleration for fixed-point iterations. SIAM Journal on Numerical Analysis, 49(4):1715–1735, 2011.
  • [94] C. J. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
  • [95] D. Wingate, K. D. Seppi, and S. Mahadevan. Prioritization methods for accelerating MDP solvers. Journal of Machine Learning Research, 6(25):851–881, 2005.
  • [96] R. Wittmann. Approximation of fixed points of nonexpansive mappings. Archiv der Mathematik, 58(5):486–491, 1992.
  • [97] H.-K. Xu. Iterative algorithms for nonlinear operators. Journal of the London Mathematical Society, 66(1):240–256, 2002.
  • [98] T. Yoon and E. K. Ryu. Accelerated algorithms for smooth convex-concave minimax problems with 𝒪(1/k2)\mathcal{O}(1/k^{2}) rate on squared gradient norm. International Conference on Machine Learning, 2021.
  • [99] T. Yoon and E. K. Ryu. Accelerated minimax algorithms flock together. arXiv:2205.11093, 2022.
  • [100] Y. Zeng, F. Feng, and W. Yin. AsyncQVI: Asynchronous-parallel Q-value iteration for discounted Markov decision processes with near-optimal sample complexity. International Conference on Artificial Intelligence and Statistics, 2020.
  • [101] J. Zhang, B. O’Donoghue, and S. Boyd. Globally convergent type-I Anderson acceleration for nonsmooth fixed-point iterations. SIAM Journal on Optimization, 30(4):3170–3197, 2020.
  • [102] K. Zhou, L. Tian, A. M.-C. So, and J. Cheng. Practical schemes for finding near-stationary points of convex finite-sums. International Conference on Artificial Intelligence and Statistics, 2022.

Appendix A Preliminaries

To unify notation, we use the symbol U whenever both V and Q can be used.

Lemma 1.

[10, Lemma 1.1.1] Let 0<γ10<\gamma\leq 1. If UU~U\leq\tilde{U}, then TπUTπU~,TUTU~T^{\pi}U\leq T^{\pi}\tilde{U},T^{\star}U\leq T^{\star}\tilde{U}.

Lemma 2.

Let 0<γ10<\gamma\leq 1. For any policy π\pi, 𝒫π\mathcal{P}^{\pi} is a nonexpansive linear operator such that if UU~U\leq\tilde{U}, 𝒫πU𝒫πU~\mathcal{P}^{\pi}U\leq\mathcal{P}^{\pi}\tilde{U}.

Proof.

If r(s,a)=0 for all s\in\mathcal{S} and a\in\mathcal{A}, then T^{\pi}=\gamma\mathcal{P}^{\pi}. Monotonicity of \mathcal{P}^{\pi} then follows from Lemma 1, and nonexpansiveness of \mathcal{P}^{\pi} follows from the \gamma-contractivity of T^{\pi} (nonexpansiveness when \gamma=1). ∎

Lemma 3.

Let 0<γ<10<\gamma<1. Let TT^{\star} and T^\hat{T}^{\star} respectively be the Bellman optimality and anti-optimality operators. Let UU^{\star} and U^\hat{U}^{\star} respectively be the fixed points of TT^{\star} and T^\hat{T}^{\star}. Then U^U\hat{U}^{\star}\leq U^{\star}.

Proof.

By definition, U^=T^U^TU^\hat{U}^{\star}=\hat{T}^{\star}\hat{U}^{\star}\leq T^{\star}\hat{U}^{\star}. Thus, U^limm(T)mU^=U\hat{U}^{\star}\leq\lim_{m\rightarrow\infty}\left({T^{\star}}\right)^{m}\hat{U}^{\star}=U^{\star}. ∎

Appendix B Omitted proofs in Section 2

First, we prove the following lemma by induction.

Lemma 4.

Let 0<γ10<\gamma\leq 1, and if γ=1\gamma=1, assume a fixed point UπU^{\pi} exists. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI,

TπUkUk\displaystyle T^{\pi}U^{k}-U^{k} =i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]\displaystyle=\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]
βk(U0Uπ)+(Πj=1k(1βj))(γ𝒫π)k+1(U0Uπ)\displaystyle\quad-\beta_{k}(U^{0}-U^{\pi})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k+1}(U^{0}-U^{\pi})

where (Πj=k+1k(1βj))=1\left({\Pi^{k}_{j=k+1}(1-\beta_{j})}\right)=1 and β0=1\beta_{0}=1.

Proof.

If k=0k=0, we have

TπU0U0\displaystyle T^{\pi}U^{0}-U^{0} =TπU0Uπ(U0Uπ)\displaystyle=T^{\pi}U^{0}-U^{\pi}-(U^{0}-U^{\pi})
=TπU0TπUπ(U0Uπ)\displaystyle=T^{\pi}U^{0}-T^{\pi}U^{\pi}-(U^{0}-U^{\pi})
=γ𝒫π(U0Uπ)(U0Uπ)\displaystyle=\gamma\mathcal{P}^{\pi}\left({U^{0}-U^{\pi}}\right)-(U^{0}-U^{\pi})

For the inductive step, assume the identity holds for k=m-1. Since T^{\pi} is an affine operator, the reward terms cancel in any combination whose coefficients sum to zero, and for k=m we have

TπUmUm\displaystyle T^{\pi}U^{m}-U^{m} =TπUm(1βm)TπUm1βmU0\displaystyle=T^{\pi}U^{m}-(1-\beta_{m})T^{\pi}U^{m-1}-\beta_{m}U^{0}
=TπUm(1βm)TπUm1βmUπβm(U0Uπ)\displaystyle=T^{\pi}U^{m}-(1-\beta_{m})T^{\pi}U^{m-1}-\beta_{m}U^{\pi}-\beta_{m}(U^{0}-U^{\pi})
=TπUm(1βm)TπUm1βmTπUπβm(U0Uπ)\displaystyle=T^{\pi}U^{m}-(1-\beta_{m})T^{\pi}U^{m-1}-\beta_{m}T^{\pi}U^{\pi}-\beta_{m}(U^{0}-U^{\pi})
=γ𝒫π(Um(1βm)Um1βmUπ)βm(U0Uπ)\displaystyle=\gamma\mathcal{P}^{\pi}(U^{m}-(1-\beta_{m})U^{m-1}-\beta_{m}U^{\pi})-\beta_{m}(U^{0}-U^{\pi})
=γ𝒫π(βm(U0Uπ)+(1βm)(TπUm1Um1))βm(U0Uπ)\displaystyle=\gamma\mathcal{P}^{\pi}(\beta_{m}(U^{0}-U^{\pi})+(1-\beta_{m})(T^{\pi}U^{m-1}-U^{m-1}))-\beta_{m}(U^{0}-U^{\pi})
=(1βm)γ𝒫πi=1m1[(βiβi1(1βi))(Πj=i+1m1(1βj))(γ𝒫π)m1i+1(U0Uπ)]\displaystyle=(1-\beta_{m})\gamma\mathcal{P}^{\pi}\sum_{i=1}^{m-1}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{m-1}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{m-1-i+1}(U^{0}-U^{\pi})\right]
(1βm)γ𝒫πβm1(U0Uπ)+(1βm)γ𝒫π(Πj=1m1(1βj))(γ𝒫π)m(U0Uπ)\displaystyle\quad-(1-\beta_{m})\gamma\mathcal{P}^{\pi}\beta_{m-1}(U^{0}-U^{\pi})+(1-\beta_{m})\gamma\mathcal{P}^{\pi}\left({\Pi^{m-1}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{m}(U^{0}-U^{\pi})
+βmγ𝒫π(U0Uπ)βm(U0Uπ)\displaystyle\quad+\beta_{m}\gamma\mathcal{P}^{\pi}(U^{0}-U^{\pi})-\beta_{m}(U^{0}-U^{\pi})
=i=1m1[(βiβi1(1βi))(Πj=i+1m(1βj))(γ𝒫π)mi+1(U0Uπ)]\displaystyle=\sum_{i=1}^{m-1}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{m}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{m-i+1}(U^{0}-U^{\pi})\right]
βm1(1βm)γ𝒫π(U0Uπ)+βmγ𝒫π(U0Uπ)\displaystyle\quad-\beta_{m-1}(1-\beta_{m})\gamma\mathcal{P}^{\pi}(U^{0}-U^{\pi})+\beta_{m}\gamma\mathcal{P}^{\pi}(U^{0}-U^{\pi})
βm(U0Uπ)+(Πj=1m(1βj))(γ𝒫π)m+1(U0Uπ)\displaystyle\quad-\beta_{m}(U^{0}-U^{\pi})+\left({\Pi^{m}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{m+1}(U^{0}-U^{\pi})
=i=1m[(βiβi1(1βi))(Πj=i+1m(1βj))(γ𝒫π)mi+1(U0Uπ)]\displaystyle=\sum_{i=1}^{m}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{m}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{m-i+1}(U^{0}-U^{\pi})\right]
βm(U0Uπ)+(Πj=1m(1βj))(γ𝒫π)m+1(U0Uπ)\displaystyle\quad-\beta_{m}(U^{0}-U^{\pi})+\left({\Pi^{m}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{m+1}(U^{0}-U^{\pi})
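As a quick sanity check of this identity, the case k=1 (with \beta_{0}=1 and the empty-product convention) reads

T^{\pi}U^{1}-U^{1}=\left({2\beta_{1}-1}\right)\gamma\mathcal{P}^{\pi}(U^{0}-U^{\pi})-\beta_{1}(U^{0}-U^{\pi})+\left({1-\beta_{1}}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{2}(U^{0}-U^{\pi}),

which can also be verified directly by expanding U^{1}=\beta_{1}U^{0}+(1-\beta_{1})T^{\pi}U^{0}.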

Now, we prove the first rate of Theorem 1.

Proof of first rate in Theorem 1.

Taking the \left\|{\cdot}\right\|_{\infty}-norm of both sides of the equality in Lemma 4, we get

TπUkUk\displaystyle\left\|{T^{\pi}U^{k}-U^{k}}\right\|_{\infty} i=1k|βiβi1(1βi)|(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)\displaystyle\leq\sum_{i=1}^{k}\left|\beta_{i}-\beta_{i-1}(1-\beta_{i})\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left\|{\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})}\right\|_{\infty}
+βkU0Uπ+(Πi=1k(1βi))(γ𝒫π)k+1(U0Uπ)\displaystyle\quad+\beta_{k}\left\|{U^{0}-U^{\pi}}\right\|_{\infty}+\left({\Pi^{k}_{i=1}(1-\beta_{i})}\right)\left\|{\left({\gamma\mathcal{P}^{\pi}}\right)^{k+1}(U^{0}-U^{\pi})}\right\|_{\infty}
(i=1kγki+1|βiβi1(1βi)|(Πj=i+1k(1βj))+βk+γk+1Πj=1k(1βj))\displaystyle\leq\bigg{(}\sum_{i=1}^{k}\gamma^{k-i+1}\left|\beta_{i}-\beta_{i-1}(1-\beta_{i})\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)+\beta_{k}+\gamma^{k+1}\Pi^{k}_{j=1}(1-\beta_{j})\bigg{)}
\displaystyle\quad\left\|{U^{0}-U^{\pi}}\right\|_{\infty}
=(i=1kγk+i1(1γ2)21γ(2k+2)+γ2k1γ21γ2k+2+γk+11γ21γ2k+2)U0Uπ\displaystyle=\left({\sum_{i=1}^{k}\gamma^{k+i-1}\frac{\left({1-\gamma^{2}}\right)^{2}}{1-\gamma^{(2k+2)}}+\gamma^{2k}\frac{1-\gamma^{2}}{1-\gamma^{2k+2}}+\gamma^{k+1}\frac{1-\gamma^{2}}{1-\gamma^{2k+2}}}\right)\left\|{U^{0}-U^{\pi}}\right\|_{\infty}
=(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0Uπ,\displaystyle=\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\pi}}\right\|_{\infty},

where the first inequality comes from the triangle inequality, the second inequality from Lemma 2, and the equalities from direct calculation. ∎
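For the reader's convenience, we record the closed form of the anchoring coefficients substituted in the calculation above (it can be read off from the display, and is consistent with \beta_{k}=\frac{1}{k+1} at \gamma=1 used in Appendix C):

\beta_{k}=\left({\textstyle\sum_{i=0}^{k}\gamma^{-2i}}\right)^{-1}=\frac{\gamma^{2k}\left({1-\gamma^{2}}\right)}{1-\gamma^{2k+2}},\qquad\Pi^{k}_{j=1}(1-\beta_{j})=\frac{1-\gamma^{2}}{1-\gamma^{2k+2}}.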

For the second rate of Theorem 1, we introduce the following lemma.

Lemma 5.

Let 0<\gamma<1. Let T be the Bellman consistency or optimality operator and U^{\star} its fixed point. For the iterates \{U^{k}\}_{k=0,1,\dots} of Anc-VI, if U^{0}\leq TU^{0}, then U^{k-1}\leq U^{k}\leq TU^{k-1}\leq TU^{k}\leq U^{\star} for k\geq 1. Also, if U^{0}\geq TU^{0}, then U^{k-1}\geq U^{k}\geq TU^{k-1}\geq TU^{k}\geq U^{\star} for k\geq 1.

Proof.

First, let U^{0}\leq TU^{0}. If k=1, then U^{0}\leq\beta_{1}U^{0}+(1-\beta_{1})TU^{0}=U^{1}\leq TU^{0} by the assumption U^{0}\leq TU^{0}. Since U^{0}\leq U^{1}, TU^{0}\leq TU^{1} by monotonicity of the Bellman consistency and optimality operators.

By induction,

Uk=βkU0+(1βk)TUk1TUk1,U^{k}=\beta_{k}U^{0}+(1-\beta_{k})TU^{k-1}\leq TU^{k-1},

and since βkβk1\beta_{k}\leq\beta_{k-1},

βkU0+(1βk)TUk1\displaystyle\beta_{k}U^{0}+(1-\beta_{k})TU^{k-1} βk1U0+(1βk1)TUk1\displaystyle\geq\beta_{k-1}U^{0}+(1-\beta_{k-1})TU^{k-1}
βk1U0+(1βk1)TUk2\displaystyle\geq\beta_{k-1}U^{0}+(1-\beta_{k-1})TU^{k-2}
=Uk1.\displaystyle=U^{k-1}.

Also, Uk1UkU^{k-1}\leq U^{k} implies TUk1TUkTU^{k-1}\leq TU^{k} by monotonicity of Bellman consistency and optimality operators, and UkTUkU^{k}\leq TU^{k} implies that Uklimm(T)mUk=UU^{k}\leq\lim_{m\rightarrow\infty}\left({T}\right)^{m}U^{k}=U^{\star} for all k=0,1,k=0,1,\dots.

Now, suppose U^{0}\geq TU^{0}. If k=1, then U^{0}\geq\beta_{1}U^{0}+(1-\beta_{1})TU^{0}=U^{1}\geq TU^{0} by the assumption U^{0}\geq TU^{0}. Since U^{0}\geq U^{1}, TU^{0}\geq TU^{1} by monotonicity of the Bellman consistency and optimality operators.

By induction,

Uk=βkU0+(1βk)TUk1TUk1,U^{k}=\beta_{k}U^{0}+(1-\beta_{k})TU^{k-1}\geq TU^{k-1},

and since βkβk1\beta_{k}\leq\beta_{k-1},

βkU0+(1βk)TUk1\displaystyle\beta_{k}U^{0}+(1-\beta_{k})TU^{k-1} βk1U0+(1βk1)TUk1\displaystyle\leq\beta_{k-1}U^{0}+(1-\beta_{k-1})TU^{k-1}
βk1U0+(1βk1)TUk2\displaystyle\leq\beta_{k-1}U^{0}+(1-\beta_{k-1})TU^{k-2}
=Uk1.\displaystyle=U^{k-1}.

Also, U^{k-1}\geq U^{k} implies TU^{k-1}\geq TU^{k} by monotonicity of the Bellman consistency and optimality operators, and U^{k}\geq TU^{k} implies that U^{k}\geq\lim_{m\rightarrow\infty}\left({T}\right)^{m}U^{k}=U^{\star} for all k=0,1,\dots. ∎

Now, we prove the following key lemmas.

Lemma 6.

Let 0<γ10<\gamma\leq 1, and assume a fixed point UπU^{\pi} exists if γ=1\gamma=1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, if U0UπU^{0}\leq U^{\pi},

TπUkUk\displaystyle T^{\pi}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]\displaystyle\leq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]
βk(U0Uπ),\displaystyle\quad-\beta_{k}(U^{0}-U^{\pi}),

where (Πj=k+1k(1βj))=1\left({\Pi^{k}_{j=k+1}(1-\beta_{j})}\right)=1 and β0=1\beta_{0}=1.

Lemma 7.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, if U0TπU0U^{0}\geq T^{\pi}U^{0},

TπUkUk\displaystyle T^{\pi}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]\displaystyle\geq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]
βk(U0Uπ),\displaystyle\quad-\beta_{k}(U^{0}-U^{\pi}),

where (Πj=k+1k(1βj))=1\left({\Pi^{k}_{j=k+1}(1-\beta_{j})}\right)=1 and β0=1\beta_{0}=1.

Proof of Lemma 6.

If U0Uπ,U^{0}\leq U^{\pi}, we get

TπUkUk\displaystyle T^{\pi}U^{k}-U^{k} =i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]\displaystyle=\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]
βk(U0Uπ)+(Πj=1k(1βj))(γ𝒫π)k+1(U0Uπ)\displaystyle\quad-\beta_{k}(U^{0}-U^{\pi})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k+1}(U^{0}-U^{\pi})
i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]βk(U0Uπ),\displaystyle\leq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]-\beta_{k}(U^{0}-U^{\pi}),

by Lemma 4 and the fact that (Πj=1k(1βj))(γ𝒫π)k+1(U0Uπ)0\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k+1}(U^{0}-U^{\pi})\leq 0. ∎

Proof of Lemma 7.

If U^{0}\geq T^{\pi}U^{0}, then U^{0}-U^{\pi}\geq 0 by Lemma 5. Hence, by Lemma 4, we have

TπUkUki=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]βk(U0Uπ),\displaystyle T^{\pi}U^{k}-U^{k}\geq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]-\beta_{k}(U^{0}-U^{\pi}),

since 0(Πj=1k(1βj))(γ𝒫π)k+1(U0Uπ)0\leq\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k+1}(U^{0}-U^{\pi}). ∎

Now, we prove the second rates of Theorem 1.

Proof of second rates in Theorem 1.

Let 0<γ<10<\gamma<1. By Lemma 5, if U0TπU0U^{0}\leq T^{\pi}U^{0}, then U0UπU^{0}\leq U^{\pi}. Hence,

0\displaystyle 0 TπUkUk\displaystyle\leq T^{\pi}U^{k}-U^{k}
i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]βk(U0Uπ),\displaystyle\leq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]-\beta_{k}(U^{0}-U^{\pi}),

by Lemma 6. Taking the \left\|{\cdot}\right\|_{\infty}-norm of both sides, we have

TπUkUk(γ1γ)(1+γγk+1)(γk+1)1γk+1U0Uπ.\left\|{T^{\pi}U^{k}-U^{k}}\right\|_{\infty}\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\pi}}\right\|_{\infty}.

Otherwise, if U^{0}\geq T^{\pi}U^{0}, then U^{k}\geq T^{\pi}U^{k} by Lemma 5. Since

0\displaystyle 0 TπUkUk\displaystyle\geq T^{\pi}U^{k}-U^{k}
i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(γ𝒫π)ki+1(U0Uπ)]βk(U0Uπ),\displaystyle\geq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\gamma\mathcal{P}^{\pi}}\right)^{k-i+1}(U^{0}-U^{\pi})\right]-\beta_{k}(U^{0}-U^{\pi}),

by Lemma 7, taking the \left\|{\cdot}\right\|_{\infty}-norm of both sides, we obtain the same rate as before.

Lastly, the Taylor series expansions of both rates around \gamma=1 are

(γ1γ)(1+2γγk+1)(γk+1)1γk+1\displaystyle\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}} =2k+1k1k+1(γ1)+O((γ1)2),\displaystyle=\frac{2}{k+1}-\frac{k-1}{k+1}(\gamma-1)+O((\gamma-1)^{2}),
(γ1γ)(1+γγk+1)(γk+1)1γk+1\displaystyle\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}} =1k+1kk+1(γ1)+O((γ1)2).\displaystyle=\frac{1}{k+1}-\frac{k}{k+1}(\gamma-1)+O((\gamma-1)^{2}).
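In particular, evaluating both expansions at \gamma=1 gives the leading terms \frac{2}{k+1} and \frac{1}{k+1}, respectively, which are exactly the bounds used in the \gamma=1 analysis of Appendix C.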

For the analysis of Anc-VI with the Bellman optimality operator, we first prove the following two lemmas.

Lemma 8.

Let 0<\gamma\leq 1. If \gamma=1, assume a fixed point U^{\star} exists. Then, if 0\leq\alpha\leq 1 and U-(1-\alpha)\tilde{U}-\alpha U^{\star}\leq\bar{U}, there exists a nonexpansive linear operator \mathcal{P}_{H} such that

TU(1α)TU~αTU\displaystyle T^{\star}U-(1-\alpha)T^{\star}\tilde{U}-\alpha T^{\star}U^{\star} γ𝒫HU¯.\displaystyle\leq\gamma\mathcal{P}_{H}\bar{U}.
Lemma 9.

Let 0<\gamma<1. If 0\leq\alpha\leq 1 and \bar{U}\leq U-(1-\alpha)\tilde{U}-\alpha\hat{U}^{\star}, then there exists a nonexpansive linear operator \hat{\mathcal{P}}_{H} such that

\gamma\hat{\mathcal{P}}_{H}(\bar{U})\leq T^{\star}U-(1-\alpha)T^{\star}\tilde{U}-\alpha\hat{T}^{\star}\hat{U}^{\star}.
Proof of Lemma 8.

First, let U=V,U~=V~,U=V,U¯=V¯U=V,\tilde{U}=\tilde{V},U^{\star}=V^{\star},\bar{U}=\bar{V}, and V(1α)V~αVV¯V-(1-\alpha)\tilde{V}-\alpha V^{\star}\leq\bar{V}.

If action space is finite,

TV(1α)TV~αTV\displaystyle T^{\star}V-(1-\alpha)T^{\star}\tilde{V}-\alpha T^{\star}V^{\star} TπV(1α)TπV~αTπV\displaystyle\leq T^{\pi}V-(1-\alpha)T^{\pi}\tilde{V}-\alpha T^{\pi}V^{\star}
=γ𝒫π(V(1α)V~αV)\displaystyle=\gamma\mathcal{P}^{\pi}\left({V-(1-\alpha)\tilde{V}-\alpha V^{\star}}\right)
γ𝒫πV¯\displaystyle\leq\gamma\mathcal{P}^{\pi}\bar{V}

where \pi is the greedy policy satisfying T^{\pi}V=T^{\star}V, the first inequality follows from T^{\pi}\tilde{V}\leq T^{\star}\tilde{V} and T^{\pi}V^{\star}\leq T^{\star}V^{\star}, and the second inequality comes from the monotonicity of \mathcal{P}^{\pi} (Lemma 2). Thus, we can take \mathcal{P}_{H}=\mathcal{P}^{\pi}.

Otherwise, if the action space is infinite, define \mathcal{P}(c\bar{V})=c\sup_{s\in\mathcal{S}}\bar{V}(s) for c\in\mathbb{R} and the previously given \bar{V}. Let M be the linear space spanned by \bar{V}, equipped with the \left\|{\cdot}\right\|_{\infty}-norm. Then \mathcal{P} is a linear functional on M with \|\mathcal{P}\|_{\text{op}}\leq 1, since \frac{|c\sup_{s\in\mathcal{S}}\bar{V}(s)|}{\left\|{c\bar{V}}\right\|_{\infty}}\leq 1. By the Hahn–Banach extension theorem, there exists a linear functional \mathcal{P}_{h}\colon\mathcal{F}(\mathcal{S})\rightarrow\mathbb{R} with \mathcal{P}_{h}(\bar{V})=\sup_{s\in\mathcal{S}}\bar{V}(s) and \|\mathcal{P}_{h}\|_{\text{op}}\leq 1. Furthermore, we can define \mathcal{P}_{H}\colon\mathcal{F}(\mathcal{S})\rightarrow\mathcal{F}(\mathcal{S}) by \mathcal{P}_{H}V(s)=\mathcal{P}_{h}(V) for all s\in\mathcal{S}. Since \|\mathcal{P}_{H}(V)\|_{\infty}=|\mathcal{P}_{h}(V)|\leq\|\mathcal{P}_{h}\|_{\text{op}}\leq 1 whenever \left\|{V}\right\|_{\infty}\leq 1, we have \|\mathcal{P}_{H}\|_{\infty}\leq 1. Therefore, \mathcal{P}_{H} is a nonexpansive linear operator in the \left\|{\cdot}\right\|_{\infty}-norm. Then,

TV(s)(1α)TV~(s)αTV(s)\displaystyle T^{\star}V(s)-(1-\alpha)T^{\star}\tilde{V}(s)-\alpha T^{\star}V^{\star}(s)
=supa𝒜{r(s,a)+γ𝔼sP(|s,a)[V(s)]}supa𝒜{(1α)r(s,a)+(1α)γ𝔼sP(|s,a)[V~(s)]}\displaystyle=\sup_{a\in\mathcal{A}}\bigg{\{}r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})\right]\bigg{\}}-\sup_{a\in\mathcal{A}}\bigg{\{}(1-\alpha)r(s,a)+(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\tilde{V}(s^{\prime})\right]\bigg{\}}
supa𝒜{αr(s,a)+αγ𝔼sP(|s,a)[V(s)]}\displaystyle\quad-\sup_{a\in\mathcal{A}}\bigg{\{}\alpha r(s,a)+\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V^{\star}(s^{\prime})\right]\bigg{\}}
supa𝒜{r(s,a)+γ𝔼sP(|s,a)[V(s)](1α)r(s,a)(1α)γ𝔼sP(|s,a)[V~(s)]}\displaystyle\leq\sup_{a\in\mathcal{A}}\bigg{\{}r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})\right]-(1-\alpha)r(s,a)-(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\tilde{V}(s^{\prime})\right]\bigg{\}}
supa𝒜{αr(s,a)+αγ𝔼sP(|s,a)[V(s)]}\displaystyle\quad-\sup_{a\in\mathcal{A}}\bigg{\{}\alpha r(s,a)+\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V^{\star}(s^{\prime})\right]\bigg{\}}
γsupa𝒜{𝔼sP(|s,a)[V(s)(1α)V~(s)αV(s)]}\displaystyle\leq\gamma\sup_{a\in\mathcal{A}}\bigg{\{}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha V^{\star}(s^{\prime})\right]\bigg{\}}
γsups𝒮{V(s)(1α)V~(s)αV(s)}\displaystyle\leq\gamma\sup_{s^{\prime}\in\mathcal{S}}\{V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha V^{\star}(s^{\prime})\}
γsups𝒮V¯(s).\displaystyle\leq\gamma\sup_{s^{\prime}\in\mathcal{S}}\bar{V}(s^{\prime}).

for all s𝒮s\in\mathcal{S}. Therefore, we have

TV(1α)TV~αTVγ𝒫H(V¯).\displaystyle T^{\star}V-(1-\alpha)T^{\star}\tilde{V}-\alpha T^{\star}V^{\star}\leq\gamma\mathcal{P}_{H}(\bar{V}).
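For intuition, when \mathcal{S} is finite the Hahn–Banach extension above can be taken to be evaluation at a maximizer of \bar{V}: setting \mathcal{P}_{H}V(s)=V(\bar{s}) for all s\in\mathcal{S}, where \bar{s}\in\operatorname*{argmax}_{s^{\prime}\in\mathcal{S}}\bar{V}(s^{\prime}), gives a linear operator with \|\mathcal{P}_{H}\|_{\infty}\leq 1 that agrees with \mathcal{P} on the span of \bar{V}.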

Similarly, let U=Q,U~=Q~,U=Q,U¯=Q¯U=Q,\tilde{U}=\tilde{Q},U^{\star}=Q^{\star},\bar{U}=\bar{Q}, and Q(1α)Q~αQQ¯Q-(1-\alpha)\tilde{Q}-\alpha Q^{\star}\leq\bar{Q}.

If action space is finite,

TQ(1α)TQ~αTQ\displaystyle T^{\star}Q-(1-\alpha)T^{\star}\tilde{Q}-\alpha T^{\star}Q^{\star} γ𝒫π(Q(1α)Q~αQ)\displaystyle\leq\gamma\mathcal{P}^{\pi}\left({Q-(1-\alpha)\tilde{Q}-\alpha Q^{\star}}\right)
γ𝒫πQ¯\displaystyle\leq\gamma\mathcal{P}^{\pi}\bar{Q}

where \pi is the greedy policy satisfying T^{\pi}Q=T^{\star}Q, the first inequality follows from T^{\pi}\tilde{Q}\leq T^{\star}\tilde{Q} and T^{\pi}Q^{\star}\leq T^{\star}Q^{\star}, and the second inequality comes from the monotonicity of \mathcal{P}^{\pi} (Lemma 2). Then, we can take \mathcal{P}_{H}=\mathcal{P}^{\pi}.

Otherwise, if the action space is infinite, define \mathcal{P}(c\bar{Q})=c\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\bar{Q}(s^{\prime},a^{\prime}) for c\in\mathbb{R} and the previously given \bar{Q}. Let M be the linear space spanned by \bar{Q}, equipped with the \left\|{\cdot}\right\|_{\infty}-norm. Then \mathcal{P} is a linear functional on M with \|\mathcal{P}\|_{\text{op}}\leq 1. By the Hahn–Banach extension theorem, there exists a linear functional \mathcal{P}_{h}\colon\mathcal{F}(\mathcal{S}\times\mathcal{A})\rightarrow\mathbb{R} with \mathcal{P}_{h}(\bar{Q})=\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\bar{Q}(s^{\prime},a^{\prime}) and \|\mathcal{P}_{h}\|_{\text{op}}\leq 1. Furthermore, we can define \mathcal{P}_{H}\colon\mathcal{F}(\mathcal{S}\times\mathcal{A})\rightarrow\mathcal{F}(\mathcal{S}\times\mathcal{A}) by \mathcal{P}_{H}Q(s,a)=\mathcal{P}_{h}(Q) for all (s,a)\in\mathcal{S}\times\mathcal{A}, and \|\mathcal{P}_{H}\|_{\infty}\leq 1. Therefore, \mathcal{P}_{H} is a nonexpansive linear operator in the \left\|{\cdot}\right\|_{\infty}-norm. Then,

TQ(s,a)(1α)TQ~(s,a)αTQ(s,a)\displaystyle T^{\star}Q(s,a)-(1-\alpha)T^{\star}\tilde{Q}(s,a)-\alpha T^{\star}Q^{\star}(s,a)
=r(s,a)+γ𝔼sP(|s,a)[supa𝒜Q(s,a)](1α)r(s,a)(1α)γ𝔼sP(|s,a)[supa𝒜Q~(s,a)]\displaystyle=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right]-(1-\alpha)r(s,a)-(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}\tilde{Q}(s^{\prime},a^{\prime})\right]
αr(s,a)αγ𝔼sP(|s,a)[supa𝒜Q(s,a)]\displaystyle\quad-\alpha r(s,a)-\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}Q^{\star}(s^{\prime},a^{\prime})\right]
\displaystyle\leq\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})\right\}\right]-\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}\alpha Q^{\star}(s^{\prime},a^{\prime})\right]
γ𝔼sP(|s,a)[supa𝒜{Q(s,a)(1α)Q~(s,a)αQ(s,a)}]\displaystyle\leq\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})-\alpha Q^{\star}(s^{\prime},a^{\prime})\right\}\right]
γsup(s,a)𝒮×𝒜{Q(s,a)(1α)Q~(s,a)αQ(s,a)},\displaystyle\leq\gamma\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})-\alpha Q^{\star}(s^{\prime},a^{\prime})\right\},
γsup(s,a)𝒮×𝒜Q¯(s,a)\displaystyle\leq\gamma\sup_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\bar{Q}(s^{\prime},a^{\prime})

for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Therefore, we have

TQ(1α)TQ~αTQγ𝒫H(Q¯).\displaystyle T^{\star}Q-(1-\alpha)T^{\star}\tilde{Q}-\alpha T^{\star}Q^{\star}\leq\gamma\mathcal{P}_{H}(\bar{Q}).

Proof of Lemma 9.

Note that \hat{T}^{\star} is the Bellman anti-optimality operator for V or Q, and \hat{U}^{\star} is the fixed point of \hat{T}^{\star}. First, let U=V, \tilde{U}=\tilde{V}, \hat{U}^{\star}=\hat{V}^{\star}, \bar{U}=\bar{V}, and \bar{V}\leq V-(1-\alpha)\tilde{V}-\alpha\hat{V}^{\star}. Then,

TV(s)(1α)TV~(s)αT^V^(s)\displaystyle T^{\star}V(s)-(1-\alpha)T^{\star}\tilde{V}(s)-\alpha\hat{T}^{\star}\hat{V}^{\star}(s)
=supa𝒜{r(s,a)+γ𝔼sP(|s,a)[V(s)]}supa𝒜{(1α)r(s,a)+(1α)γ𝔼sP(|s,a)[V~(s)]}\displaystyle=\sup_{a\in\mathcal{A}}\bigg{\{}r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})\right]\bigg{\}}-\sup_{a\in\mathcal{A}}\bigg{\{}(1-\alpha)r(s,a)+(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\tilde{V}(s^{\prime})\right]\bigg{\}}
infa𝒜{αr(s,a)+αγ𝔼sP(|s,a)[V^(s)]}\displaystyle\quad-\inf_{a\in\mathcal{A}}\bigg{\{}\alpha r(s,a)+\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\hat{V}^{\star}(s^{\prime})\right]\bigg{\}}
infa𝒜{r(s,a)+γ𝔼sP(|s,a)[V(s)](1α)r(s,a)(1α)γ𝔼sP(|s,a)[V~(s)]}\displaystyle\geq\inf_{a\in\mathcal{A}}\bigg{\{}r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})\right]-(1-\alpha)r(s,a)-(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\tilde{V}(s^{\prime})\right]\bigg{\}}
infa𝒜{αr(s,a)+αγ𝔼sP(|s,a)[V^(s)]}\displaystyle\quad-\inf_{a\in\mathcal{A}}\bigg{\{}\alpha r(s,a)+\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\hat{V}^{\star}(s^{\prime})\right]\bigg{\}}
γinfa𝒜{𝔼sP(|s,a)[V(s)(1α)V~(s)αV^(s)]}.\displaystyle\geq\gamma\inf_{a\in\mathcal{A}}\bigg{\{}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha\hat{V}^{\star}(s^{\prime})\right]\bigg{\}}.

Then, if action space is finite,

\displaystyle T^{\star}V-(1-\alpha)T^{\star}\tilde{V}-\alpha\hat{T}^{\star}\hat{V}^{\star}\geq\gamma\mathcal{P}^{\hat{\pi}}\left({V-(1-\alpha)\tilde{V}-\alpha\hat{V}^{\star}}\right)
γ𝒫π^V¯\displaystyle\geq\gamma\mathcal{P}^{\hat{\pi}}\bar{V}

where \hat{\pi} is the policy satisfying \hat{\pi}(\cdot\,|\,s)=\operatorname*{argmin}_{a\in\mathcal{A}}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha\hat{V}^{\star}(s^{\prime})\right], the first inequality holds since \hat{\pi} attains the infimum in the preceding display, and the second inequality comes from the monotonicity of \mathcal{P}^{\hat{\pi}} (Lemma 2). Thus, we can take \hat{\mathcal{P}}_{H}=\mathcal{P}^{\hat{\pi}}.

Otherwise, if the action space is infinite, define \hat{\mathcal{P}}(c\bar{V})=c\inf_{s\in\mathcal{S}}\bar{V}(s) for c\in\mathbb{R} and the previously given \bar{V}. Let M be the linear space spanned by \bar{V}, equipped with the \left\|{\cdot}\right\|_{\infty}-norm. Then \hat{\mathcal{P}} is a linear functional on M with \|\hat{\mathcal{P}}\|_{\text{op}}\leq 1, since \frac{|c\inf_{s\in\mathcal{S}}\bar{V}(s)|}{\left\|{c\bar{V}}\right\|_{\infty}}\leq 1. By the Hahn–Banach extension theorem, there exists a linear functional \hat{\mathcal{P}}_{h}\colon\mathcal{F}(\mathcal{S})\rightarrow\mathbb{R} with \hat{\mathcal{P}}_{h}(\bar{V})=\inf_{s\in\mathcal{S}}\bar{V}(s) and \|\hat{\mathcal{P}}_{h}\|_{\text{op}}\leq 1. Furthermore, we can define \hat{\mathcal{P}}_{H}\colon\mathcal{F}(\mathcal{S})\rightarrow\mathcal{F}(\mathcal{S}) by \hat{\mathcal{P}}_{H}V(s)=\hat{\mathcal{P}}_{h}(V) for all s\in\mathcal{S}. Then \|\hat{\mathcal{P}}_{H}\|_{\infty}\leq 1, since \|\hat{\mathcal{P}}_{H}(V)\|_{\infty}=|\hat{\mathcal{P}}_{h}(V)|\leq\|\hat{\mathcal{P}}_{h}\|_{\text{op}}\leq 1 whenever \left\|{V}\right\|_{\infty}\leq 1. Thus, \hat{\mathcal{P}}_{H} is a nonexpansive linear operator in the \left\|{\cdot}\right\|_{\infty}-norm. Then, we have

TV(s)(1α)TV~(s)αT^V^(s)\displaystyle T^{\star}V(s)-(1-\alpha)T^{\star}\tilde{V}(s)-\alpha\hat{T}^{\star}\hat{V}^{\star}(s) γinfa𝒜{𝔼sP(|s,a)[V(s)(1α)V~(s)αV^(s)]}\displaystyle\geq\gamma\inf_{a\in\mathcal{A}}\bigg{\{}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha\hat{V}^{\star}(s^{\prime})\right]\bigg{\}}
γinfs𝒮{V(s)(1α)V~(s)αV^(s)}\displaystyle\geq\gamma\inf_{s^{\prime}\in\mathcal{S}}\{V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha\hat{V}^{\star}(s^{\prime})\}
γinfs𝒮{V¯(s)}\displaystyle\geq\gamma\inf_{s^{\prime}\in\mathcal{S}}\{\bar{V}(s^{\prime})\}

for all s𝒮s\in\mathcal{S}. Therefore, we have

\displaystyle\gamma\hat{\mathcal{P}}_{H}(\bar{V})\leq T^{\star}V-(1-\alpha)T^{\star}\tilde{V}-\alpha\hat{T}^{\star}\hat{V}^{\star}.

Similarly, let U=Q,U~=Q~,U^=Q^,U¯=Q¯U=Q,\tilde{U}=\tilde{Q},\hat{U}^{\star}=\hat{Q}^{\star},\bar{U}=\bar{Q}, and Q¯Q(1α)Q~αQ^\bar{Q}\leq Q-(1-\alpha)\tilde{Q}-\alpha\hat{Q}^{\star}. Then,

\displaystyle T^{\star}Q(s,a)-(1-\alpha)T^{\star}\tilde{Q}(s,a)-\alpha\hat{T}^{\star}\hat{Q}^{\star}(s,a)
=r(s,a)+γ𝔼sP(|s,a)[supa𝒜Q(s,a)](1α)r(s,a)(1α)γ𝔼sP(|s,a)[supa𝒜Q~(s,a)]\displaystyle=r(s,a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}Q(s^{\prime},a^{\prime})\right]-(1-\alpha)r(s,a)-(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\sup_{a^{\prime}\in\mathcal{A}}\tilde{Q}(s^{\prime},a^{\prime})\right]
αr(s,a)αγ𝔼sP(|s,a)[infa𝒜Q^(s,a)]\displaystyle\quad-\alpha r(s,a)-\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\inf_{a^{\prime}\in\mathcal{A}}\hat{Q}^{\star}(s^{\prime},a^{\prime})\right]
\displaystyle\geq\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\inf_{a^{\prime}\in\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})\right\}\right]-\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\inf_{a^{\prime}\in\mathcal{A}}\alpha\hat{Q}^{\star}(s^{\prime},a^{\prime})\right]
γ𝔼sP(|s,a)[infa𝒜{Q(s,a)(1α)Q~(s,a)αQ^(s,a)}].\displaystyle\geq\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\inf_{a^{\prime}\in\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})-\alpha\hat{Q}^{\star}(s^{\prime},a^{\prime})\right\}\right].

Hence, if action space is finite,

\displaystyle T^{\star}Q-(1-\alpha)T^{\star}\tilde{Q}-\alpha\hat{T}^{\star}\hat{Q}^{\star}\geq\gamma\mathcal{P}^{\hat{\pi}}\left({Q-(1-\alpha)\tilde{Q}-\alpha\hat{Q}^{\star}}\right),
γ𝒫π^Q¯,\displaystyle\geq\gamma\mathcal{P}^{\hat{\pi}}\bar{Q},

where \hat{\pi} is the policy satisfying \hat{\pi}(\cdot\,|\,s^{\prime})=\operatorname*{argmin}_{a^{\prime}\in\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})-\alpha\hat{Q}^{\star}(s^{\prime},a^{\prime})\right\}, the first inequality holds since \hat{\pi} attains the infimum in the preceding display, and the second inequality comes from the monotonicity of \mathcal{P}^{\hat{\pi}} (Lemma 2). Then, we can take \hat{\mathcal{P}}_{H}=\mathcal{P}^{\hat{\pi}}.

Otherwise, if the action space is infinite, define \hat{\mathcal{P}}(c\bar{Q})=c\inf_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\bar{Q}(s^{\prime},a^{\prime}) for c\in\mathbb{R} and the previously given \bar{Q}. Let M be the linear space spanned by \bar{Q}, equipped with the \left\|{\cdot}\right\|_{\infty}-norm. Then \hat{\mathcal{P}} is a linear functional on M with \|\hat{\mathcal{P}}\|_{\text{op}}\leq 1. By the Hahn–Banach extension theorem, there exists a linear functional \hat{\mathcal{P}}_{h}\colon\mathcal{F}(\mathcal{S}\times\mathcal{A})\rightarrow\mathbb{R} with \hat{\mathcal{P}}_{h}(\bar{Q})=\inf_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\bar{Q}(s^{\prime},a^{\prime}) and \|\hat{\mathcal{P}}_{h}\|_{\text{op}}\leq 1. Furthermore, we can define \hat{\mathcal{P}}_{H}\colon\mathcal{F}(\mathcal{S}\times\mathcal{A})\rightarrow\mathcal{F}(\mathcal{S}\times\mathcal{A}) by \hat{\mathcal{P}}_{H}Q(s,a)=\hat{\mathcal{P}}_{h}(Q) for all (s,a)\in\mathcal{S}\times\mathcal{A}, and \|\hat{\mathcal{P}}_{H}\|_{\infty}\leq 1. Thus, \hat{\mathcal{P}}_{H} is a nonexpansive linear operator in the \left\|{\cdot}\right\|_{\infty}-norm. Then, we have

TQ(s,a)αTQ~(s,a)(1α)T^Q^(s,a)\displaystyle T^{\star}Q(s,a)-\alpha T^{\star}\tilde{Q}(s,a)-(1-\alpha)\hat{T}^{\star}\hat{Q}^{\star}(s,a)
γ𝔼sP(|s,a)[infa𝒜{Q(s,a)(1α)Q~(s,a)αQ^(s,a)}]\displaystyle\geq\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[\inf_{a^{\prime}\in\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})-\alpha\hat{Q}^{\star}(s^{\prime},a^{\prime})\right\}\right]
γinf(s,a)𝒮×𝒜{Q(s,a)(1α)Q~(s,a)αQ^(s,a)}\displaystyle\geq\gamma\inf_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\left\{Q(s^{\prime},a^{\prime})-(1-\alpha)\tilde{Q}(s^{\prime},a^{\prime})-\alpha\hat{Q}^{\star}(s^{\prime},a^{\prime})\right\}
γinf(s,a)𝒮×𝒜Q¯(s,a),\displaystyle\geq\gamma\inf_{(s^{\prime},a^{\prime})\in\mathcal{S}\times\mathcal{A}}\bar{Q}(s^{\prime},a^{\prime}),

for all (s,a)𝒮×𝒜(s,a)\in\mathcal{S}\times\mathcal{A}. Therefore, we have

γ𝒫^H(Q¯)TQ(1α)TQ~αT^Q^.\displaystyle\gamma\hat{\mathcal{P}}_{H}(\bar{Q})\leq T^{\star}Q-(1-\alpha)T^{\star}\tilde{Q}-\alpha\hat{T}^{\star}\hat{Q}^{\star}.

Now, we present our key lemmas for the first rate of Theorem 2.

Lemma 10.

Let 0<γ10<\gamma\leq 1. If γ=1\gamma=1, assume a fixed point UU^{\star} exists. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, there exist nonexpansive linear operators {𝒫l}l=0,1,,k\{\mathcal{P}^{l}\}_{l=0,1,\dots,k} such that

TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]\displaystyle\leq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
βk(U0U)+(Πj=1k(1βj))(Πl=k0γ𝒫l)(U0U)\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})

where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1 and β0=1\beta_{0}=1.

Lemma 11.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, there exist nonexpansive linear operators {𝒫^l}l=0,1,,k\{\hat{\mathcal{P}}^{l}\}_{l=0,1,\dots,k} such that

TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]\displaystyle\geq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
βk(U0U^)+(Πj=1k(1βj))(Πl=k0γ𝒫^l)(U0U^),\displaystyle\quad-\beta_{k}(U^{0}-\hat{U}^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star}),

where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1 and β0=1\beta_{0}=1.

We prove the two preceding lemmas by induction.

Proof of Lemma 10.

If k=0k=0,

TU0U0\displaystyle T^{\star}U^{0}-U^{0} =TU0U(U0U)\displaystyle=T^{\star}U^{0}-U^{\star}-(U^{0}-U^{\star})
=TU0TU(U0U)\displaystyle=T^{\star}U^{0}-T^{\star}U^{\star}-(U^{0}-U^{\star})
γ𝒫0(U0U)(U0U).\displaystyle\leq\gamma\mathcal{P}^{0}(U^{0}-U^{\star})-(U^{0}-U^{\star}).

where the inequality comes from Lemma 8 with \alpha=1, U=U^{0}, and \bar{U}=U^{0}-U^{\star}.

By induction,

Uk(1βk)Uk1βkU\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\star}
=βk(U0U)+(1βk)(TUk1Uk1)\displaystyle=\beta_{k}\left({U^{0}-U^{\star}}\right)+(1-\beta_{k})(T^{\star}U^{k-1}-U^{k-1})
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫l)(U0U)]\displaystyle\leq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
(1βk)βk1(U0U)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})
+βk(U0U),\displaystyle\quad+\beta_{k}(U^{0}-U^{\star}),

and let \bar{U} be the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkU0\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}U^{0}
=TUk(1βk)TUk1βkUβk(U0U)\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}U^{\star}-\beta_{k}(U^{0}-U^{\star})
=TUk(1βk)TUk1βkTUβk(U0U)\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}T^{\star}U^{\star}-\beta_{k}(U^{0}-U^{\star})
γ𝒫k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫l)(U0U)]\displaystyle\leq\gamma\mathcal{P}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
(1βk)βk1(U0U)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})
+βk(U0U))βk(U0U)\displaystyle\quad+\beta_{k}(U^{0}-U^{\star})\bigg{)}-\beta_{k}(U^{0}-U^{\star})
=i=1k1[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]\displaystyle=\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
βk1(1βk)γ𝒫k(U0U)+βkγ𝒫k(U0U)\displaystyle\quad-\beta_{k-1}(1-\beta_{k})\gamma\mathcal{P}^{k}(U^{0}-U^{\star})+\beta_{k}\gamma\mathcal{P}^{k}\left({U^{0}-U^{\star}}\right)
βk(U0U)+(Πj=1k(1βj))(Πl=k0γ𝒫l)(U0U)\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
βk(U0U)+(Πj=1k(1βj))(Πl=k0γ𝒫l)(U0U).\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star}).

where the inequality comes from Lemma 8 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}. ∎

Proof of Lemma 11.

Note that \hat{T}^{\star} is the Bellman anti-optimality operator for V or Q, and \hat{U}^{\star} is the fixed point of \hat{T}^{\star}. If k=0,

TU0U0\displaystyle T^{\star}U^{0}-U^{0} =TU0U^(U0U^)\displaystyle=T^{\star}U^{0}-\hat{U}^{\star}-(U^{0}-\hat{U}^{\star})
=TU0T^U^(U0U^)\displaystyle=T^{\star}U^{0}-\hat{T}^{\star}\hat{U}^{\star}-(U^{0}-\hat{U}^{\star})
γ𝒫^0(U0U^)(U0U^).\displaystyle\geq\gamma\hat{\mathcal{P}}^{0}(U^{0}-\hat{U}^{\star})-(U^{0}-\hat{U}^{\star}).

where the inequality comes from Lemma 9 with \alpha=1, U=U^{0}, and \bar{U}=U^{0}-\hat{U}^{\star}.

By induction,

Uk(1βk)Uk1βkU^\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}\hat{U}^{\star}
=βk(U0U^)+(1βk)(TUk1Uk1)\displaystyle=\beta_{k}(U^{0}-\hat{U}^{\star})+(1-\beta_{k})(T^{\star}U^{k-1}-U^{k-1})
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
(1βk)βk1(U0U^)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫^l)(U0U^)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-\hat{U}^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})
+βk(U0U^),\displaystyle\quad+\beta_{k}(U^{0}-\hat{U}^{\star}),

and let \bar{U} be the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkU0\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}U^{0}
=TUk(1βk)TUk1βkU^βk(U0U^)\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}\hat{U}^{\star}-\beta_{k}(U^{0}-\hat{U}^{\star})
=TUk(1βk)TUk1βkT^U^βk(U0U^)\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}\hat{T}^{\star}\hat{U}^{\star}-\beta_{k}(U^{0}-\hat{U}^{\star})
γ𝒫^k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq\gamma\hat{\mathcal{P}}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
(1βk)βk1(U0U^)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫^l)(U0U^)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-\hat{U}^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})
+βk(U0U^))βk(U0U^)\displaystyle\quad+\beta_{k}(U^{0}-\hat{U}^{\star})\bigg{)}-\beta_{k}(U^{0}-\hat{U}^{\star})
=i=1k1[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]\displaystyle=\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
βk1(1βk)γ𝒫^k(U0U^)+βkγ𝒫^k(U0U^)\displaystyle\quad-\beta_{k-1}(1-\beta_{k})\gamma\hat{\mathcal{P}}^{k}(U^{0}-\hat{U}^{\star})+\beta_{k}\gamma\hat{\mathcal{P}}^{k}\left({U^{0}-\hat{U}^{\star}}\right)
βk(U0U^)+(Πj=1k(1βj))(Πl=k0γ𝒫^l)(U0U^)\displaystyle\quad-\beta_{k}(U^{0}-\hat{U}^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
βk(U0U^)+(Πj=1k(1βj))(Πl=k0γ𝒫^l)(U0U^).\displaystyle\quad-\beta_{k}(U^{0}-\hat{U}^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star}).

where the inequality comes from Lemma 9 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}. ∎

Now, we prove the first rate of Theorem 2.

Proof of first rate in Theorem 2.

Since B_{1}\leq A\leq B_{2} implies \left\|{A}\right\|_{\infty}\leq\sup\{\left\|{B_{1}}\right\|_{\infty},\left\|{B_{2}}\right\|_{\infty}\} for A,B_{1},B_{2}\in\mathcal{F}(\mathcal{X}), it suffices to bound the \left\|{\cdot}\right\|_{\infty}-norms of the right-hand sides of the inequalities in Lemmas 10 and 11. Taking the \left\|{\cdot}\right\|_{\infty}-norm of the right-hand side of the inequality in Lemma 10, we have

i=1k|βiβi1(1βi)|(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)\displaystyle\sum_{i=1}^{k}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})}\right\|_{\infty}
\displaystyle\quad+\beta_{k}\left\|{U^{0}-U^{\star}}\right\|_{\infty}+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})}\right\|_{\infty}
(i=1kγki+1|βiβi1(1βi)|(Πj=i+1k(1βj))+βk+γk+1Πj=1k(1βj))\displaystyle\leq\left({\sum_{i=1}^{k}\gamma^{k-i+1}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)+\beta_{k}+\gamma^{k+1}\Pi^{k}_{j=1}(1-\beta_{j})}\right)
U0U\displaystyle\quad\left\|{U^{0}-U^{\star}}\right\|_{\infty}
=(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0U,\displaystyle=\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty},

where the first inequality comes from the triangle inequality, the second inequality from the nonexpansiveness of the \mathcal{P}^{l}, and the last equality from direct calculation.

Similarly, taking the \left\|{\cdot}\right\|_{\infty}-norm of the right-hand side of the inequality in Lemma 11, we have

i=1k|βiβi1(1βi)|(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)\displaystyle\sum_{i=1}^{k}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})}\right\|_{\infty}
\displaystyle\quad+\beta_{k}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{0}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})}\right\|_{\infty}
\displaystyle\leq\left({\sum_{i=1}^{k}\gamma^{k-i+1}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)+\beta_{k}+\gamma^{k+1}\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}
\displaystyle=\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty},

where the first inequality comes from the triangle inequality, the second inequality from the nonexpansiveness of the \hat{\mathcal{P}}^{l}, and the last equality from direct calculation. Therefore, we conclude

TUkUk\displaystyle\left\|{T^{\star}U^{k}-U^{k}}\right\|_{\infty} (γ1γ)(1+2γγk+1)(γk+1)1γk+1max{U0U,U0U^}.\displaystyle\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\max{\left\{\left\|{U^{0}-U^{\star}}\right\|_{\infty},\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}\right\}}.

Next, for the second rate in Theorem 2, we prove the following lemmas by induction.

Lemma 12.

Let 0<γ10<\gamma\leq 1. If γ=1\gamma=1, assume a fixed point UU^{\star} exists. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, if TU0UT^{\star}U^{0}\leq U^{\star}, there exist nonexpansive linear operators {𝒫l}l=0,1,,k\{\mathcal{P}^{l}\}_{l=0,1,\dots,k} such that

TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]βk(U0U)\displaystyle\leq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]-\beta_{k}(U^{0}-U^{\star})

where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1 and β0=1\beta_{0}=1.

Lemma 13.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, if U0TU0U^{0}\geq T^{\star}U^{0}, there exist nonexpansive linear operators {𝒫^l}l=0,1,,k\{\hat{\mathcal{P}}^{l}\}_{l=0,1,\dots,k} such that

TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]βk(U0U^),\displaystyle\geq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star}),

where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1 and β0=1\beta_{0}=1.

Proof of Lemma 12.

If k=0k=0,

TU0U0\displaystyle T^{\star}U^{0}-U^{0} =TU0U(U0U)\displaystyle=T^{\star}U^{0}-U^{\star}-(U^{0}-U^{\star})
(U0U)\displaystyle\leq-(U^{0}-U^{\star})

where the inequality follows from the assumption T^{\star}U^{0}\leq U^{\star}.

By induction,

Uk(1βk)Uk1βkU\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\star}
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫l)(U0U)]\displaystyle\leq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
(1βk)βk1(U0U)+βk(U0U),\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+\beta_{k}(U^{0}-U^{\star}),

and let \bar{U} be the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkTUβk(U0U)\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}T^{\star}U^{\star}-\beta_{k}(U^{0}-U^{\star})
γ𝒫k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫l)(U0U)]\displaystyle\leq\gamma\mathcal{P}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
(1βk)βk1(U0U)+βk(U0U))βk(U0U)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+\beta_{k}(U^{0}-U^{\star})\bigg{)}-\beta_{k}(U^{0}-U^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]βk(U0U),\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]-\beta_{k}(U^{0}-U^{\star}),

where the inequality comes from Lemma 8 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}. ∎

Proof of Lemma 13.

If k=0k=0,

TU0U0\displaystyle T^{\star}U^{0}-U^{0} =TU0U^(U0U^)\displaystyle=T^{\star}U^{0}-\hat{U}^{\star}-(U^{0}-\hat{U}^{\star})
(U0U^).\displaystyle\geq-(U^{0}-\hat{U}^{\star}).

where the inequality follows from the fact that U^{0}\geq T^{\star}U^{0} implies T^{\star}U^{0}\geq U^{\star} by Lemma 5 and U^{\star}\geq\hat{U}^{\star} by Lemma 3.

By induction,

Uk(1βk)Uk1βkU^\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}\hat{U}^{\star}
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
(1βk)βk1(U0U^)+βk(U0U^),\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-\hat{U}^{\star})+\beta_{k}(U^{0}-\hat{U}^{\star}),

and let \bar{U} be the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkT^U^βk(U0U^)\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}\hat{T}^{\star}\hat{U}^{\star}-\beta_{k}(U^{0}-\hat{U}^{\star})
γ𝒫^k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq\gamma\hat{\mathcal{P}}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
(1βk)βk1(U0U^)+βk(U0U^))βk(U0U^)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-\hat{U}^{\star})+\beta_{k}(U^{0}-\hat{U}^{\star})\bigg{)}-\beta_{k}(U^{0}-\hat{U}^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]βk(U0U^),\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star}),

where the inequality comes from Lemma 9 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}. ∎

Now, we prove the second rates of Theorem 2.

Proof of second rates in Theorem 2.

Let 0<\gamma<1. If U^{0}\leq T^{\star}U^{0}, then T^{\star}U^{0}\leq U^{\star} and U^{k}\leq T^{\star}U^{k} by Lemma 5. Hence, taking the \left\|{\cdot}\right\|_{\infty}-norm of both sides of the inequality in Lemma 12, we have

TUkUk(γ1γ)(1+γγk+1)(γk+1)1γk+1U0U.\left\|{T^{\star}U^{k}-U^{k}}\right\|_{\infty}\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}.

Otherwise, if U^{0}\geq T^{\star}U^{0}, then U^{k}\geq T^{\star}U^{k} by Lemma 5. Taking the \left\|{\cdot}\right\|_{\infty}-norm of both sides of the inequality in Lemma 13, we have

TUkUk(γ1γ)(1+γγk+1)(γk+1)1γk+1U0U^.\left\|{T^{\star}U^{k}-U^{k}}\right\|_{\infty}\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}.

Appendix C Omitted proofs in Section 3

First, we present the following lemma.

Lemma 14.

Let γ=1\gamma=1. Assume a fixed point UU^{\star} exists. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, UkUU0U\left\|{U^{k}-U^{\star}}\right\|_{\infty}\leq\left\|{U^{0}-U^{\star}}\right\|_{\infty}.

Proof.

If k=0k=0, it is obvious. By induction,

UkU\displaystyle\left\|{U^{k}-U^{\star}}\right\|_{\infty} =βkU0+(1βk)TUk1U\displaystyle=\left\|{\beta_{k}U^{0}+(1-\beta_{k})TU^{k-1}-U^{\star}}\right\|_{\infty}
=(1βk)(TUk1U)+βk(U0U)\displaystyle=\left\|{(1-\beta_{k})(TU^{k-1}-U^{\star})+\beta_{k}\left({U^{0}-U^{\star}}\right)}\right\|_{\infty}
(1βk)TUk1U+βkU0U\displaystyle\leq(1-\beta_{k})\left\|{TU^{k-1}-U^{\star}}\right\|_{\infty}+\beta_{k}\left\|{U^{0}-U^{\star}}\right\|_{\infty}
(1βk)Uk1U+βkU0U\displaystyle\leq(1-\beta_{k})\left\|{U^{k-1}-U^{\star}}\right\|_{\infty}+\beta_{k}\left\|{U^{0}-U^{\star}}\right\|_{\infty}
\displaystyle\leq\left\|{U^{0}-U^{\star}}\right\|_{\infty}

where the second inequality comes from the nonexpansiveness of T, and the final bound follows from the induction hypothesis \left\|{U^{k-1}-U^{\star}}\right\|_{\infty}\leq\left\|{U^{0}-U^{\star}}\right\|_{\infty}. ∎

Now, we present the proof of Theorem 3.

Proof of Theorem 3.

First, if U^{0}\leq TU^{0}, then by the same argument as in the proof of Lemma 5, we can show that U^{k-1}\leq U^{k}\leq TU^{k-1}\leq TU^{k} for k=1,2,\dots.

Since a fixed point U^{\star} exists by assumption, Lemmas 4 and 10 hold. Note that \gamma=1 implies \beta_{k}=\frac{1}{k+1}, and taking the \left\|{\cdot}\right\|_{\infty}-norm of both sides of the inequalities in those lemmas, a simple calculation gives

TUkUk2k+1U0U\displaystyle\left\|{TU^{k}-U^{k}}\right\|_{\infty}\leq\frac{2}{k+1}\left\|{U^{0}-U^{\star}}\right\|_{\infty}

for any fixed point U^{\star} (since 0\leq T^{\star}U^{k}-U^{k}, an upper bound on \left\|{T^{\star}U^{k}-U^{k}}\right\|_{\infty} follows from Lemma 10 alone).

By Lemma 14 and the previous argument, \{U^{k}\} is an increasing and bounded sequence in \mathbb{R}^{n}, so it converges to some \tilde{U}^{\star}. Since T-I is continuous and \left\|{TU^{k}-U^{k}}\right\|_{\infty}\rightarrow 0, we get (T-I)\tilde{U}^{\star}=\lim_{k\rightarrow\infty}(T-I)U^{k}=0, so \tilde{U}^{\star} is a fixed point of T. Furthermore, the fact that U^{0}\leq TU^{0}\leq\tilde{U}^{\star} implies that Lemmas 6 and 12 hold with \tilde{U}^{\star} in place of the fixed point. Therefore, we have

TUkUk1k+1U0U~.\displaystyle\left\|{TU^{k}-U^{k}}\right\|_{\infty}\leq\frac{1}{k+1}\left\|{U^{0}-\tilde{U}^{\star}}\right\|_{\infty}.

Next, we prove Theorem 4.

Proof of Theorem 4.

By the same argument as in the proof of Theorem 3, if U^{0}\leq TU^{0}, we can show that U^{k-1}\leq U^{k}\leq TU^{k-1}\leq TU^{k} for k=1,2,\dots, and

TUkUk2k+1U0U\displaystyle\left\|{TU^{k}-U^{k}}\right\|_{\infty}\leq\frac{2}{k+1}\left\|{U^{0}-U^{\star}}\right\|_{\infty}

for any fixed point U^{\star}. Since U^{k} is increasing and bounded by Lemma 14 and the previous argument, U^{k} converges pointwise to some \tilde{U}^{\star} in the general state-action space. We now show that TU^{k} also converges pointwise to T\tilde{U}^{\star}. First, let T be the Bellman consistency operator, U=V, and \tilde{U}^{\star}=\tilde{V}^{\pi}. By the monotone convergence theorem,

limkTπVk(s)\displaystyle\lim_{k\rightarrow\infty}T^{\pi}V^{k}(s) =limk𝔼aπ(|s)[𝔼sP(|s,a)[r(s,a)+γVk(s)]]\displaystyle=\lim_{k\rightarrow\infty}\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\left[\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma V^{k}(s^{\prime})\right]\right]
=𝔼aπ(|s)[limk𝔼sP(|s,a)[r(s,a)+γVk(s)]]\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\left[\lim_{k\rightarrow\infty}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma V^{k}(s^{\prime})\right]\right]
=𝔼aπ(|s)[𝔼sP(|s,a)[r(s,a)+γlimkVk(s)]]\displaystyle=\mathbb{E}_{a\sim\pi(\cdot\,|\,s)}\left[\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma\lim_{k\rightarrow\infty}V^{k}(s^{\prime})\right]\right]
=TπV~π(s)\displaystyle=T^{\pi}\tilde{V}^{\pi}(s)

for any fixed s\in\mathcal{S}. The same argument applies to the case U=Q. If T is the Bellman optimality operator, we use the following lemma.

Lemma 15.

Let W,W^{k}\in\mathcal{F}(\mathcal{X}) for k=0,1,\dots. If W^{k}(x)\leq W^{k+1}(x) for all x\in\mathcal{X} and \{W^{k}\}_{k=0,1,\dots} converges pointwise to W, then \lim_{k\rightarrow\infty}\sup_{x}W^{k}(x)=\sup_{x}W(x).

Proof.

Since W^{k}(x)\leq W^{k+1}(x)\leq W(x) for all x, the sequence \sup_{x}W^{k}(x) is nondecreasing and bounded above by \sup_{x}W(x). Let a=\sup_{x}W(x) and fix \epsilon>0. There exists x satisfying a-W(x)<\frac{\epsilon}{2}, and by pointwise convergence, there exists K such that W(x)-W^{k}(x)<\frac{\epsilon}{2} for all k\geq K. Then a-W^{k}(x)<\epsilon, so \sup_{x}W^{k}(x)>a-\epsilon for all k\geq K. Since \epsilon>0 was arbitrary, \lim_{k\rightarrow\infty}\sup_{x}W^{k}(x)=a (the case a=\infty is analogous). ∎

If U=V and \tilde{U}^{\star}=\tilde{V}^{\star}, then by the previous lemma and the monotone convergence theorem, we have

limkTVk(s)\displaystyle\lim_{k\rightarrow\infty}T^{\star}V^{k}(s) =limksupa{𝔼sP(|s,a)[r(s,a)+γVk(s)]}\displaystyle=\lim_{k\rightarrow\infty}\sup_{a}\left\{\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma V^{k}(s^{\prime})\right]\right\}
=supa{limk𝔼sP(|s,a)[r(s,a)+γVk(s)]}\displaystyle=\sup_{a}\left\{\lim_{k\rightarrow\infty}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma V^{k}(s^{\prime})\right]\right\}
=supa{𝔼sP(|s,a)[r(s,a)+γlimkVk(s)]}\displaystyle=\sup_{a}\left\{\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[r(s,a)+\gamma\lim_{k\rightarrow\infty}V^{k}(s^{\prime})\right]\right\}
=TU~(s)\displaystyle=T^{\star}\tilde{U}^{\star}(s)

for any fixed s\in\mathcal{S}. A similar argument covers the case U=Q.

Since TU^{k}\rightarrow T\tilde{U}^{\star} and U^{k}\rightarrow\tilde{U}^{\star} pointwise, TU^{k}-U^{k} converges pointwise to T\tilde{U}^{\star}-\tilde{U}^{\star}, and the bound above forces this limit to be 0. Thus, \tilde{U}^{\star} is indeed a fixed point of T. Furthermore, the fact that U^{0}\leq TU^{0}\leq\tilde{U}^{\star} implies that Lemmas 6 and 12 hold. Therefore, we have

TUkUk1k+1U0U~.\displaystyle\left\|{TU^{k}-U^{k}}\right\|_{\infty}\leq\frac{1}{k+1}\left\|{U^{0}-\tilde{U}^{\star}}\right\|_{\infty}.

Appendix D Omitted proofs in Section 4

We present the proof of Theorem 5.

Proof of Theorem 5.

First, we prove the case U0=0U^{0}=0 for nk+2n\geq k+2. Consider the MDP (𝒮,𝒜,P,r,γ)(\mathcal{S},\mathcal{A},P,r,\gamma) such that

\displaystyle\mathcal{S}=\{s_{1},\dots,s_{n}\},\quad\mathcal{A}=\{a_{1}\},\quad P(s_{i}\,|\,s_{j},a_{1})=\mathds{1}_{\{i=j=1\}}+\mathds{1}_{\{j=i+1\}},\quad r(s_{i},a_{1})=\mathds{1}_{\{i=2\}}.

Then, TU=\gamma\mathcal{P}^{\pi}U+[0,1,0,\dots,0]^{\intercal}, U^{\star}=[0,1,\gamma,\dots,\gamma^{n-2}]^{\intercal}, and \left\|{U^{0}-U^{\star}}\right\|_{\infty}=1. Under the span condition, we can show that \left({U^{k}}\right)_{1}=\left({U^{k}}\right)_{l}=0 for k+2\leq l\leq n by the following lemma.

Lemma 16.

Let T\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} be defined as above. Then, under the span condition, \left({U^{i}}\right)_{1}=0 for 0\leq i\leq k, and \left({U^{i}}\right)_{j}=0 for 0\leq i\leq k and i+2\leq j\leq n.

Proof.

The case i=0 is obvious since U^{0}=0. By induction, suppose \left({U^{l}}\right)_{1}=0 for 0\leq l\leq i-1. Then \left({TU^{l}}\right)_{1}=0 for 0\leq l\leq i-1, which implies \left({TU^{l}-U^{l}}\right)_{1}=0 for 0\leq l\leq i-1. Hence, by the span condition, \left({U^{i}}\right)_{1}=\left({U^{0}}\right)_{1}=0. Similarly, suppose \left({U^{l}}\right)_{j}=0 for 0\leq l\leq i-1 and l+2\leq j\leq n. Then \left({TU^{l}}\right)_{j}=0 for 0\leq l\leq i-1 and l+3\leq j\leq n, which implies \left({TU^{l}-U^{l}}\right)_{j}=0 for 0\leq l\leq i-1 and l+3\leq j\leq n. Therefore, by the span condition, \left({U^{i}}\right)_{j}=0 for i+2\leq j\leq n. ∎

Then, we get

TUkUk=(0,1(Uk)2,γ(Uk)2(Uk)3,,γ(Uk)k(Uk)k+1,γ(Uk)k+1,0,,0nk2),\displaystyle TU^{k}-U^{k}=\Big{(}0,1-\left({U^{k}}\right)_{2},\gamma\left({U^{k}}\right)_{2}-\left({U^{k}}\right)_{3},\dots,\gamma\left({U^{k}}\right)_{k}-\left({U^{k}}\right)_{k+1},\gamma\left({U^{k}}\right)_{k+1},\underbrace{0,\dots,0}_{n-k-2}\Big{)},

and this implies

(TUkUk)2+γ1(TUkUk)3++γk(TUkUk)k+2=1.\left({TU^{k}-U^{k}}\right)_{2}+\gamma^{-1}\left({TU^{k}-U^{k}}\right)_{3}+\cdots+\gamma^{-k}\left({TU^{k}-U^{k}}\right)_{k+2}=1.

Applying the triangle inequality to the left-hand side,

\displaystyle(1+\gamma^{-1}+\cdots+\gamma^{-k})\max_{1\leq i\leq n}{\{|TU^{k}-U^{k}|_{i}\}}\geq 1.

Therefore, we conclude

TUkUkγki=0kγiU0U.\displaystyle\|TU^{k}-U^{k}\|_{\infty}\geq\frac{\gamma^{k}}{\sum_{i=0}^{k}\gamma^{i}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}.

Now, we show that for any initial point U^{0}\in\mathbb{R}^{n}, there exists an MDP which exhibits the same lower bound as the case U^{0}=0. Denote by MDP(0) and T_{0} the worst-case MDP and the Bellman consistency or optimality operator constructed for U^{0}=0. Define an MDP(U^{0}) (\mathcal{S},\mathcal{A},P,r,\gamma) for U^{0}\neq 0 as

\displaystyle\mathcal{S}=\{s_{1},\dots,s_{n}\},\,\,\mathcal{A}=\{a_{1}\},\,\,P(s_{i}\,|\,s_{j},a_{1})=\mathds{1}_{\{i=j=1\}}+\mathds{1}_{\{j=i+1\}},\,\,r(s_{i},a_{1})=\left({U^{0}-\gamma\mathcal{P}^{\pi}U^{0}}\right)_{i}+\mathds{1}_{\{i=2\}}.

Then, Bellman consistency or optimality operator TT satisfies

TU=T0(UU0)+U0.TU=T_{0}(U-U^{0})+U^{0}.

Let \tilde{U}^{\star} be the fixed point of T_{0}. Then U^{\star}=\tilde{U}^{\star}+U^{0} is a fixed point of T. Furthermore, if \{U^{i}\}^{k}_{i=0} satisfies the span condition

UiU0+span{TU0U0,TU1U1,,TUi1Ui1},i=1,,k,U^{i}\in U^{0}+span\{TU^{0}-U^{0},TU^{1}-U^{1},\dots,TU^{i-1}-U^{i-1}\},\qquad i=1,\dots,k,

then \tilde{U}^{i}=U^{i}-U^{0} is a sequence satisfying

U~iU~0=0+span{T0U~0U~0,T0U~1U~1,,T0U~i1U~i1},i=1,,k,\tilde{U}^{i}\in\underbrace{\,\tilde{U}^{0}}_{=0}+span\{T_{0}\tilde{U}^{0}-\tilde{U}^{0},T_{0}\tilde{U}^{1}-\tilde{U}^{1},\dots,T_{0}\tilde{U}^{i-1}-\tilde{U}^{i-1}\},\qquad i=1,\dots,k,

which is the same span condition in Theorem 5 with respect to T0T_{0}. This is because

TU^{i}-U^{i}=T_{0}(U^{i}-U^{0})-(U^{i}-U^{0})=T_{0}\tilde{U}^{i}-\tilde{U}^{i}

for i=0,\dots,k. Thus, \{\tilde{U}^{i}\}^{k}_{i=0} is a sequence starting from 0 that satisfies the span condition for T_{0}. This implies that

\displaystyle\left\|{TU^{k}-U^{k}}\right\|_{\infty} \displaystyle=\left\|{T_{0}\tilde{U}^{k}-\tilde{U}^{k}}\right\|_{\infty}
γki=0kγiU~0U~\displaystyle\geq\frac{\gamma^{k}}{\sum^{k}_{i=0}\gamma^{i}}\left\|{\tilde{U}^{0}-\tilde{U}^{\star}}\right\|_{\infty}
=γki=0kγiU0U.\displaystyle=\frac{\gamma^{k}}{\sum^{k}_{i=0}\gamma^{i}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}.

Hence, MDP(U^{0}) is indeed the desired worst-case instance. Lastly, the fact that U^{0}-U^{\star}=\tilde{U}^{0}-\tilde{U}^{\star}=-(0,1,\gamma,\dots,\gamma^{n-2})^{\intercal} implies U^{0}\leq U^{\star}. ∎
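To make the construction concrete, the following minimal Python sketch (ours, purely illustrative) builds the worst-case chain MDP above with U^{0}=0 and checks that standard VI, which satisfies the span condition, has Bellman error no smaller than \frac{\gamma^{k}}{\sum_{i=0}^{k}\gamma^{i}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}. The particular values of \gamma, n, and K are assumptions for the demonstration.

```python
import numpy as np

# Minimal sketch of the worst-case chain MDP from the proof of Theorem 5
# (U^0 = 0, single action); the numerical check itself is ours.
gamma, n, K = 0.95, 64, 40          # take n >= K + 2, as the construction requires
P = np.zeros((n, n))                # P[j, i] = P(s_{i+1} | s_{j+1}, a_1)
P[0, 0] = 1.0                       # s_1 is absorbing
for j in range(1, n):
    P[j, j - 1] = 1.0               # s_{j+1} moves to s_j
r = np.zeros(n)
r[1] = 1.0                          # reward only at s_2
T = lambda U: r + gamma * P @ U     # one action: consistency = optimality operator

U_star = np.zeros(n)
U_star[1:] = gamma ** np.arange(n - 1)   # U^* = (0, 1, gamma, ..., gamma^{n-2})
assert np.allclose(T(U_star), U_star)

U = np.zeros(n)                     # U^0 = 0, so ||U^0 - U^*||_inf = 1
for k in range(1, K + 1):
    U = T(U)                        # standard VI satisfies the span condition
    err = np.max(np.abs(T(U) - U))
    lower = gamma**k / np.sum(gamma ** np.arange(k + 1))
    assert err >= lower - 1e-12     # the Theorem 5 lower bound is respected
```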

Appendix E Omitted proofs in Section 5

First, we prove the following key lemma.

Lemma 17.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, there exist nonexpansive linear operators {𝒫l}l=0,1,,k\{\mathcal{P}^{l}\}_{l=0,1,\dots,k} and {𝒫^l}l=0,1,,k\{\hat{\mathcal{P}}^{l}\}_{l=0,1,\dots,k} such that

TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]βk(U0U)\displaystyle\leq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]-\beta_{k}(U^{0}-U^{\star})
+Πj=1k(1βj)Πl=k0γ𝒫l(U0U)i=1kΠj=ik(1βj)Πl=ki+1γ𝒫l(Iγ𝒫i)ϵi1,\displaystyle\quad+\Pi^{k}_{j=1}(1-\beta_{j})\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}(U^{0}-U^{\star})-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1},
TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]βk(U0U^)\displaystyle\geq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star})
+Πj=1k(1βj)Πl=k0γ𝒫^l(U0U^)i=1kΠj=ik(1βj)Πl=ki+1γ𝒫^l(Iγ𝒫^i)ϵi1,\displaystyle\quad+\Pi^{k}_{j=1}(1-\beta_{j})\Pi^{0}_{l=k}\gamma\hat{\mathcal{P}}^{l}(U^{0}-\hat{U}^{\star})-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1},

for 1k1\leq k, where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1, Πl=kk+1γ𝒫l=Πl=kk+1γ𝒫^l=I\Pi^{k+1}_{l=k}\gamma\mathcal{P}^{l}=\Pi^{k+1}_{l=k}\gamma\hat{\mathcal{P}}^{l}=I, and β0=1\beta_{0}=1.

Proof of Lemma 17.

First, we prove the first inequality in Lemma 17 by induction.

If k=1k=1,

U1(1β1)U0β1U\displaystyle U^{1}-(1-\beta_{1})U^{0}-\beta_{1}U^{\star} =(1β1)ϵ0+β1(U0U)+(1β1)(TU0U0)\displaystyle=(1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-U^{\star})+(1-\beta_{1})(T^{\star}U^{0}-U^{0})
(1β1)ϵ0+(1β1)γ𝒫0(U0U)+(2β11)(U0U),\displaystyle\leq(1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\mathcal{P}^{0}(U^{0}-U^{\star})+(2\beta_{1}-1)(U^{0}-U^{\star}),

where the inequality comes from Lemma 8 with \alpha=1, U=U^{0}, \bar{U}=U^{0}-U^{\star}. Now let \bar{U} denote the entire right-hand side of this inequality. Then, we have

TU1U1\displaystyle T^{\star}U^{1}-U^{1} =TU1(1β1)TU0β1Uβ1(U0U)(1β1)ϵ0\displaystyle=T^{\star}U^{1}-(1-\beta_{1})T^{\star}U^{0}-\beta_{1}U^{\star}-\beta_{1}(U^{0}-U^{\star})-(1-\beta_{1})\epsilon^{0}
γ𝒫1((1β1)ϵ0+(1β1)γ𝒫0(U0U)+(2β11)(U0U))β1(U0U)\displaystyle\leq\gamma\mathcal{P}^{1}((1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\mathcal{P}^{0}(U^{0}-U^{\star})+(2\beta_{1}-1)(U^{0}-U^{\star}))-\beta_{1}(U^{0}-U^{\star})
(1β1)ϵ0\displaystyle\quad-(1-\beta_{1})\epsilon^{0}
=(1β1)γ𝒫1γ𝒫0(U0U)+γ𝒫1(2β11)(U0U)β1(U0U)\displaystyle=(1-\beta_{1})\gamma\mathcal{P}^{1}\gamma\mathcal{P}^{0}(U^{0}-U^{\star})+\gamma\mathcal{P}^{1}(2\beta_{1}-1)(U^{0}-U^{\star})-\beta_{1}(U^{0}-U^{\star})
(Iγ𝒫1)(1β1)ϵ0.\displaystyle\quad-(I-\gamma\mathcal{P}^{1})(1-\beta_{1})\epsilon^{0}.

where the inequality comes from Lemma 8 with \alpha=\beta_{1}, U=U^{1}, \tilde{U}=U^{0}, and the previously defined \bar{U}.

By induction,

Uk(1βk)Uk1βkU\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\star}
=βk(U0U)+(1βk)(TUk1Uk1)+(1βk)ϵk1\displaystyle=\beta_{k}\left({U^{0}-U^{\star}}\right)+(1-\beta_{k})(T^{\star}U^{k-1}-U^{k-1})+(1-\beta_{k})\epsilon^{k-1}
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫l)(U0U)]\displaystyle\leq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
(1βk)βk1(U0U)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})
+βk(U0U)(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫l(Iγ𝒫i)ϵi1+(1βk)ϵk1,\displaystyle\quad+\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1}+(1-\beta_{k})\epsilon^{k-1},

and let \bar{U} denote the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkU0(1βk)ϵk1\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}U^{0}-(1-\beta_{k})\epsilon^{k-1}
=TUk(1βk)TUk1βkTUβk(U0U)(1βk)ϵk1\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}T^{\star}U^{\star}-\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\epsilon^{k-1}
γ𝒫k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫l)(U0U)]\displaystyle\leq\gamma\mathcal{P}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]
(1βk)βk1(U0U)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})
+βk(U0U)(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫l(Iγ𝒫i)ϵi1+(1βk)ϵk1)\displaystyle\quad+\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1}+(1-\beta_{k})\epsilon^{k-1}\bigg{)}
βk(U0U)(1βk)ϵk1\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\epsilon^{k-1}
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫l)(U0U)]βk(U0U)\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})\right]-\beta_{k}(U^{0}-U^{\star})
+Πj=1k(1βj)Πl=k0γ𝒫l(U0U)i=1kΠj=ik(1βj)Πl=ki+1γ𝒫l(Iγ𝒫i)ϵi1,\displaystyle\quad+\Pi^{k}_{j=1}(1-\beta_{j})\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}(U^{0}-U^{\star})-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1},

where the inequality comes from Lemma 8 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}.

Now, we prove the second inequality in Lemma 17 by induction.

If k=1k=1,

U1(1β1)U0β1U^\displaystyle U^{1}-(1-\beta_{1})U^{0}-\beta_{1}\hat{U}^{\star} =(1β1)ϵ0+β1(U0U^)+(1β1)(TU0U0)\displaystyle=(1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-\hat{U}^{\star})+(1-\beta_{1})(T^{\star}U^{0}-U^{0})
(1β1)ϵ0+(1β1)γ𝒫^0(U0U^)+(2β11)(U0U^),\displaystyle\geq(1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\hat{\mathcal{P}}^{0}(U^{0}-\hat{U}^{\star})+(2\beta_{1}-1)(U^{0}-\hat{U}^{\star}),

where the inequality comes from Lemma 9 with \alpha=1, U=U^{0}, \bar{U}=U^{0}-\hat{U}^{\star}. Now let \bar{U} denote the entire right-hand side of this inequality. Then, we have

TU1U1\displaystyle T^{\star}U^{1}-U^{1} =TU1(1β1)TU0β1U^β1(U0U^)(1β1)ϵ0\displaystyle=T^{\star}U^{1}-(1-\beta_{1})T^{\star}U^{0}-\beta_{1}\hat{U}^{\star}-\beta_{1}(U^{0}-\hat{U}^{\star})-(1-\beta_{1})\epsilon^{0}
γ𝒫^1((1β1)ϵ0+(1β1)γ𝒫^0(U0U^)+(2β11)(U0U^))β1(U0U^)\displaystyle\geq\gamma\hat{\mathcal{P}}^{1}((1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\hat{\mathcal{P}}^{0}(U^{0}-\hat{U}^{\star})+(2\beta_{1}-1)(U^{0}-\hat{U}^{\star}))-\beta_{1}(U^{0}-\hat{U}^{\star})
(1β1)ϵ0\displaystyle\quad-(1-\beta_{1})\epsilon^{0}
=(1β1)γ𝒫^1γ𝒫^0(U0U^)+γ𝒫^1(2β11)(U0U^)β1(U0U^)\displaystyle=(1-\beta_{1})\gamma\hat{\mathcal{P}}^{1}\gamma\hat{\mathcal{P}}^{0}(U^{0}-\hat{U}^{\star})+\gamma\hat{\mathcal{P}}^{1}(2\beta_{1}-1)(U^{0}-\hat{U}^{\star})-\beta_{1}(U^{0}-\hat{U}^{\star})
(Iγ𝒫^1)(1β1)ϵ0.\displaystyle\quad-(I-\gamma\hat{\mathcal{P}}^{1})(1-\beta_{1})\epsilon^{0}.

where the inequality comes from Lemma 9 with \alpha=\beta_{1}, U=U^{1}, \tilde{U}=U^{0}, and the previously defined \bar{U}.

By induction,

Uk(1βk)Uk1βkU^\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}\hat{U}^{\star}
=βk(U0U^)+(1βk)(TUk1Uk1)+(1βk)ϵk1\displaystyle=\beta_{k}\left({U^{0}-\hat{U}^{\star}}\right)+(1-\beta_{k})(T^{\star}U^{k-1}-U^{k-1})+(1-\beta_{k})\epsilon^{k-1}
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
(1βk)βk1(U0U^)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫^l)(U0U^)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-\hat{U}^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})
+βk(U0U^)(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫^l(Iγ𝒫^i)ϵi1+(1βk)ϵk1,\displaystyle\quad+\beta_{k}(U^{0}-\hat{U}^{\star})-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1}+(1-\beta_{k})\epsilon^{k-1},

and let \bar{U} denote the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkU0(1βk)ϵk1\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}U^{0}-(1-\beta_{k})\epsilon^{k-1}
\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}\hat{T}^{\star}\hat{U}^{\star}-\beta_{k}(U^{0}-\hat{U}^{\star})-(1-\beta_{k})\epsilon^{k-1}
γ𝒫^k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq\gamma\hat{\mathcal{P}}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
(1βk)βk1(U0U^)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫^l)(U0U^)\displaystyle\quad-(1-\beta_{k})\beta_{k-1}(U^{0}-\hat{U}^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})
+βk(U0U^)(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫^l(Iγ𝒫^i)ϵi1+(1βk)ϵk1)\displaystyle\quad+\beta_{k}(U^{0}-\hat{U}^{\star})-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1}+(1-\beta_{k})\epsilon^{k-1}\bigg{)}
βk(U0U^)(1βk)ϵk1\displaystyle\quad-\beta_{k}(U^{0}-\hat{U}^{\star})-(1-\beta_{k})\epsilon^{k-1}
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]βk(U0U^)\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star})
+Πj=1k(1βj)Πl=k0γ𝒫^l(U0U^)i=1kΠj=ik(1βj)Πl=ki+1γ𝒫^l(Iγ𝒫^i)ϵi1,\displaystyle\quad+\Pi^{k}_{j=1}(1-\beta_{j})\Pi^{0}_{l=k}\gamma\hat{\mathcal{P}}^{l}(U^{0}-\hat{U}^{\star})-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1},

where the inequality comes from Lemma 9 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}. ∎

Now, we prove the first rate in Theorem 6.

Proof of first rate in Theorem 6.

Since B_{1}\leq A\leq B_{2} implies \left\|{A}\right\|_{\infty}\leq\max\{\left\|{B_{1}}\right\|_{\infty},\left\|{B_{2}}\right\|_{\infty}\} for A,B_{1},B_{2}\in\mathcal{F}(\mathcal{X}), taking the \left\|{\cdot}\right\|_{\infty}-norm of the right-hand side of the first inequality in Lemma 17, we can bound it by

(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0U+(1+γ)i=1k(Πj=ik(1βj))γkiϵi1\displaystyle\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}+(1+\gamma)\sum_{i=1}^{k}\left({\Pi^{k}_{j=i}(1-\beta_{j})}\right)\gamma^{k-i}\left\|{\epsilon^{i-1}}\right\|_{\infty}
(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0U+1+γ1+γk+11γk1γmax0ik1ϵi.\displaystyle\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\left\|{\epsilon^{i}}\right\|_{\infty}.

Applying the same argument to the second inequality of Lemma 17 and taking the \left\|{\cdot}\right\|_{\infty}-norm of its right-hand side, we obtain the bound

(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0U^+1+γ1+γk+11γk1γmax0ik1ϵi.\displaystyle\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\left\|{\epsilon^{i}}\right\|_{\infty}.

Therefore, we get

TUkUk\displaystyle\left\|{T^{\star}U^{k}-U^{k}}\right\|_{\infty} (γ1γ)(1+2γγk+1)(γk+1)1γk+1max{U0U,U0U^}\displaystyle\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\max{\left\{\left\|{U^{0}-U^{\star}}\right\|_{\infty},\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}\right\}}
+1+γ1+γk+11γk1γmax0ik1ϵi.\displaystyle\quad+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\left\|{\epsilon^{i}}\right\|_{\infty}.
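For reference, the recursion analyzed in Lemma 17 and above is U^{k}=\beta_{k}U^{0}+(1-\beta_{k})(T^{\star}U^{k-1}+\epsilon^{k-1}), where \epsilon^{k-1} is the evaluation error. The short Python sketch below spells out this update; the error model and the externally supplied anchoring schedule `betas` are assumptions of ours, not the paper's exact specification.

```python
import numpy as np

def approximate_anc_vi(U0, betas, bellman_op, noise_scale, rng):
    """Sketch of the inexact Anc-VI recursion analyzed in Theorem 6:
    U^k = beta_k * U^0 + (1 - beta_k) * (T* U^{k-1} + eps^{k-1}),
    with ||eps^{k-1}||_inf <= noise_scale (our assumed error model)."""
    U0 = np.asarray(U0, dtype=float)
    U = U0.copy()
    for beta in betas:                        # betas: anchoring schedule (assumed given)
        eps = noise_scale * rng.uniform(-1.0, 1.0, size=U.shape)
        U = beta * U0 + (1.0 - beta) * (bellman_op(U) + eps)
    return U
```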

Now, for the second rate in Theorem 6, we present the following key lemma.

Lemma 18.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of Anc-VI, if U0TU0U^{0}\geq T^{\star}U^{0}, there exist nonexpansive linear operators {𝒫l}l=0,1,,k\{\mathcal{P}^{l}\}_{l=0,1,\dots,k} and {𝒫^l}l=0,1,,k\{\hat{\mathcal{P}}^{l}\}_{l=0,1,\dots,k} such that

TUkUk\displaystyle T^{\star}U^{k}-U^{k} Πj=1k(1βj)Πl=k0γ𝒫l(U0U)i=1kΠj=ik(1βj)Πl=ki+1γ𝒫l(Iγ𝒫i)ϵi1\displaystyle\leq\Pi^{k}_{j=1}(1-\beta_{j})\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}(U^{0}-U^{\star})-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1}
\displaystyle\quad-\beta_{k}(U^{0}-U^{\star}),
TUkUk\displaystyle T^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]βk(U0U^)\displaystyle\geq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star})
i=1kΠj=ik(1βj)Πl=ki+1γ𝒫^l(Iγ𝒫^i)ϵi1,\displaystyle\quad-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1},

for 1k1\leq k, where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1, Πl=kk+1γ𝒫l=Πl=kk+1γ𝒫^l=I\Pi^{k+1}_{l=k}\gamma\mathcal{P}^{l}=\Pi^{k+1}_{l=k}\gamma\hat{\mathcal{P}}^{l}=I, and β0=1\beta_{0}=1.

Proof of Lemma 18.

If U0TU0U^{0}\geq T^{\star}U^{0}, U0limm(T)mU0=UU^{0}\geq\lim_{m\rightarrow\infty}(T^{\star})^{m}U^{0}=U^{\star} by Lemma 1. By Lemma 3, this also implies U0U^U^{0}\geq\hat{U}^{\star}.

First, we prove the first inequality in Lemma 18 by induction. If k=1,

U1(1β1)U0β1U\displaystyle U^{1}-(1-\beta_{1})U^{0}-\beta_{1}U^{\star} =(1β1)ϵ0+β1(U0U)+(1β1)(TU0U0)\displaystyle=(1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-U^{\star})+(1-\beta_{1})(T^{\star}U^{0}-U^{0})
(1β1)ϵ0+(1β1)γ𝒫0(U0U)+(2β11)(U0U)\displaystyle\leq(1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\mathcal{P}^{0}(U^{0}-U^{\star})+(2\beta_{1}-1)(U^{0}-U^{\star})
(1β1)ϵ0+(1β1)γ𝒫0(U0U),\displaystyle\leq(1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\mathcal{P}^{0}(U^{0}-U^{\star}),

where the first inequality comes from Lemma 8 with \alpha=1, U=U^{0}, \bar{U}=U^{0}-U^{\star}, and the second inequality is from (2\beta_{1}-1)(U^{0}-U^{\star})\leq 0. Now let \bar{U} denote the entire right-hand side of the last inequality. Then, we have

TU1U1\displaystyle T^{\star}U^{1}-U^{1} =TU1(1β1)TU0β1Uβ1(U0U)(1β1)ϵ0\displaystyle=T^{\star}U^{1}-(1-\beta_{1})T^{\star}U^{0}-\beta_{1}U^{\star}-\beta_{1}(U^{0}-U^{\star})-(1-\beta_{1})\epsilon^{0}
γ𝒫1((1β1)ϵ0+(1β1)γ𝒫0(U0U))β1(U0U)(1β1)ϵ0\displaystyle\leq\gamma\mathcal{P}^{1}((1-\beta_{1})\epsilon^{0}+(1-\beta_{1})\gamma\mathcal{P}^{0}(U^{0}-U^{\star}))-\beta_{1}(U^{0}-U^{\star})-(1-\beta_{1})\epsilon^{0}
=(1β1)γ𝒫1γ𝒫0(U0U)β1(U0U)(Iγ𝒫1)(1β1)ϵ0.\displaystyle=(1-\beta_{1})\gamma\mathcal{P}^{1}\gamma\mathcal{P}^{0}(U^{0}-U^{\star})-\beta_{1}(U^{0}-U^{\star})-(I-\gamma\mathcal{P}^{1})(1-\beta_{1})\epsilon^{0}.

where the inequality comes from Lemma 8 with \alpha=\beta_{1}, U=U^{1}, \tilde{U}=U^{0}, and the previously defined \bar{U}.

By induction,

Uk(1βk)Uk1βkU\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\star}
=βk(U0U)+(1βk)(TUk1Uk1)+(1βk)ϵk1\displaystyle=\beta_{k}\left({U^{0}-U^{\star}}\right)+(1-\beta_{k})(T^{\star}U^{k-1}-U^{k-1})+(1-\beta_{k})\epsilon^{k-1}
βk(U0U)(1βk)βk1(U0U)+(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)\displaystyle\leq\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\beta_{k-1}(U^{0}-U^{\star})+(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})
+(1βk)ϵk1(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫l(Iγ𝒫i)ϵi1\displaystyle\quad+(1-\beta_{k})\epsilon^{k-1}-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1}
(1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)+(1βk)ϵk1\displaystyle\leq(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})+(1-\beta_{k})\epsilon^{k-1}
(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫l(Iγ𝒫i)ϵi1,\displaystyle-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1},

where the second inequality is from \beta_{k}-(1-\beta_{k})\beta_{k-1}\leq 0. Let \bar{U} denote the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkTUβk(U0U)(1βk)ϵk1\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}T^{\star}U^{\star}-\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\epsilon^{k-1}
γ𝒫k((1βk)(Πj=1k1(1βj))(Πl=k10γ𝒫l)(U0U)+(1βk)ϵk1\displaystyle\leq\gamma\mathcal{P}^{k}\bigg{(}(1-\beta_{k})\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\gamma\mathcal{P}^{l}}\right)(U^{0}-U^{\star})+(1-\beta_{k})\epsilon^{k-1}
(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫l(Iγ𝒫i)ϵi1)βk(U0U)(1βk)ϵk1\displaystyle-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1}\bigg{)}-\beta_{k}(U^{0}-U^{\star})-(1-\beta_{k})\epsilon^{k-1}
=Πj=1k(1βj)Πl=k0γ𝒫l(U0U)i=1kΠj=ik(1βj)Πl=ki+1γ𝒫l(Iγ𝒫i)ϵi1\displaystyle\quad=\Pi^{k}_{j=1}(1-\beta_{j})\Pi^{0}_{l=k}\gamma\mathcal{P}^{l}(U^{0}-U^{\star})-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\mathcal{P}^{l}\left({I-\gamma\mathcal{P}^{{i}}}\right)\epsilon^{i-1}
\displaystyle\quad-\beta_{k}(U^{0}-U^{\star}),

where the first inequality comes from Lemma 8 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}.

For the second inequality in Lemma 18, if k=1k=1,

U1(1β1)U0β1U^\displaystyle U^{1}-(1-\beta_{1})U^{0}-\beta_{1}\hat{U}^{\star} =(1β1)ϵ0+β1(U0U^)+(1β1)(TU0U0)\displaystyle=(1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-\hat{U}^{\star})+(1-\beta_{1})(T^{\star}U^{0}-U^{0})
=(1β1)ϵ0+β1(U0U^)+(1β1)(TU0U^(U0U^))\displaystyle=(1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-\hat{U}^{\star})+(1-\beta_{1})(T^{\star}U^{0}-\hat{U}^{\star}-(U^{0}-\hat{U}^{\star}))
(1β1)ϵ0+β1(U0U^)(1β1)(U0U^)\displaystyle\geq(1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-\hat{U}^{\star})-(1-\beta_{1})(U^{0}-\hat{U}^{\star})

where the inequality follows from T^{\star}U^{0}\geq\hat{U}^{\star} (as U^{0}\geq T^{\star}U^{0}\geq\hat{U}^{\star}). Let \bar{U} denote the entire right-hand side of this inequality. Then, we have

\displaystyle T^{\star}U^{1}-U^{1} \displaystyle=T^{\star}U^{1}-(1-\beta_{1})T^{\star}U^{0}-\beta_{1}\hat{U}^{\star}-\beta_{1}(U^{0}-\hat{U}^{\star})-(1-\beta_{1})\epsilon^{0}
γ𝒫1((1β1)ϵ0+β1(U0U^)(1β1)(U0U^))β1(U0U^)(1β1)ϵ0\displaystyle\geq\gamma\mathcal{P}^{1}((1-\beta_{1})\epsilon^{0}+\beta_{1}(U^{0}-\hat{U}^{\star})-(1-\beta_{1})(U^{0}-\hat{U}^{\star}))-\beta_{1}(U^{0}-\hat{U}^{\star})-(1-\beta_{1})\epsilon^{0}
=(2β11)γ𝒫1(U0U^)β1(U0U^)(Iγ𝒫1)(1β1)ϵ0.\displaystyle=(2\beta_{1}-1)\gamma\mathcal{P}^{1}(U^{0}-\hat{U}^{\star})-\beta_{1}(U^{0}-\hat{U}^{\star})-(I-\gamma\mathcal{P}^{1})(1-\beta_{1})\epsilon^{0}.

where the inequality comes from Lemma 9 with \alpha=\beta_{1}, U=U^{1}, \tilde{U}=U^{0}, and the previously defined \bar{U}.

By induction,

Uk(1βk)Uk1βkU^\displaystyle U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}\hat{U}^{\star}
(1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
+(βk(1βk)βk1)(U0U^)+(1βk)ϵk1\displaystyle\quad+(\beta_{k}-(1-\beta_{k})\beta_{k-1})(U^{0}-\hat{U}^{\star})+(1-\beta_{k})\epsilon^{k-1}
(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫^l(Iγ𝒫^i)ϵi1,\displaystyle\quad-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1},

and let \bar{U} denote the entire right-hand side of this inequality. Then, we have

TUkUk\displaystyle T^{\star}U^{k}-U^{k}
=TUk(1βk)TUk1βkT^U^βk(U0U^)(1βk)ϵk1\displaystyle=T^{\star}U^{k}-(1-\beta_{k})T^{\star}U^{k-1}-\beta_{k}\hat{T}^{\star}\hat{U}^{\star}-\beta_{k}(U^{0}-\hat{U}^{\star})-(1-\beta_{k})\epsilon^{k-1}
γ𝒫^k((1βk)i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1iγ𝒫^l)(U0U^)]\displaystyle\geq\gamma\hat{\mathcal{P}}^{k}\bigg{(}(1-\beta_{k})\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
+(βk(1βk)βk1)(U0U^)(1βk)i=1k1Πj=ik1(1βj)Πl=k1i+1γ𝒫^l(Iγ𝒫^i)ϵi1\displaystyle\quad+(\beta_{k}-(1-\beta_{k})\beta_{k-1})(U^{0}-\hat{U}^{\star})-(1-\beta_{k})\sum_{i=1}^{k-1}\Pi^{k-1}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k-1}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1}
+(1βk)ϵk1)(1βk)ϵk1βk(U0U^)\displaystyle\quad+(1-\beta_{k})\epsilon^{k-1}\bigg{)}-(1-\beta_{k})\epsilon^{k-1}-\beta_{k}(U^{0}-\hat{U}^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=kiγ𝒫^l)(U0U^)]βk(U0U^)\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\gamma\hat{\mathcal{P}}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star})
i=1kΠj=ik(1βj)Πl=ki+1γ𝒫^l(Iγ𝒫^i)ϵi1,\displaystyle\quad-\sum_{i=1}^{k}\Pi^{k}_{j=i}(1-\beta_{j})\Pi^{i+1}_{l=k}\gamma\hat{\mathcal{P}}^{l}\left({I-\gamma\hat{\mathcal{P}}^{{i}}}\right)\epsilon^{i-1},

where the inequality comes from Lemma 9 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the previously defined \bar{U}. ∎

Now, we prove the second rate in Theorem 6.

Proof of second rate in Theorem 6.

Taking the \left\|{\cdot}\right\|_{\infty}-norm of the right-hand side of the first inequality in Lemma 18, we can bound it by

(γ1γ)γ(γk+1)1γk+1U0U+1+γ1+γk+11γk1γmax0ik1ϵi.\displaystyle\frac{\left({\gamma^{-1}-\gamma}\right)\gamma}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\left\|{\epsilon^{i}}\right\|_{\infty}.

Applying the same argument to the second inequality of Lemma 18 and taking the \left\|{\cdot}\right\|_{\infty}-norm of its right-hand side, we obtain

(γ1γ)(1+γγk+1)(γk+1)1γk+1U0U^+1+γ1+γk+11γk1γmax0ik1ϵi.\displaystyle\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\left\|{\epsilon^{i}}\right\|_{\infty}.

Therefore, we get

TUkUk\displaystyle\left\|{T^{\star}U^{k}-U^{k}}\right\|_{\infty} (γ1γ)(1+γγk+1)(γk+1)1γk+1U0U^+1+γ1+γk+11γk1γmax0ik1ϵi,\displaystyle\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}+\frac{1+\gamma}{1+\gamma^{k+1}}\frac{1-\gamma^{k}}{1-\gamma}\max_{0\leq i\leq k-1}\left\|{\epsilon^{i}}\right\|_{\infty},

since U^UU0\hat{U}^{\star}\leq U^{\star}\leq U^{0} implies that

(γ1γ)γ(γk+1)1γk+1U0U(γ1γ)(1+γγk+1)(γk+1)1γk+1U0U^.\frac{\left({\gamma^{-1}-\gamma}\right)\gamma}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}.

Appendix F Omitted proofs in Section 6

For the analyses, we first define T^GS:nn\hat{T}_{GS}^{\star}\colon\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} as

T^GS=T^nT^2T^1,\hat{T}_{GS}^{\star}=\hat{T}^{\star}_{n}\cdots\hat{T}^{\star}_{2}\hat{T}^{\star}_{1},

where T^j:nn\hat{T}_{j}^{\star}:\mathbb{R}^{n}\rightarrow\mathbb{R}^{n} is defined as

T^j(U)=(U1,,Uj1,(T^(U))j,Uj+1,,Un)\displaystyle\hat{T}^{\star}_{j}(U)=(U_{1},\dots,U_{j-1},\left({\hat{T}^{\star}(U)}\right)_{j},U_{j+1},\dots,U_{n})

for j=1,\dots,n, where \hat{T}^{\star} is the Bellman anti-optimality operator.

Fact 3.

[Classical result, [10, Proposition 1.3.2]] T^GS\hat{T}_{GS}^{\star} is a γ\gamma-contractive operator and has the same fixed point as T^\hat{T}^{\star}.
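To fix ideas, the composition \hat{T}_{GS}^{\star}=\hat{T}^{\star}_{n}\cdots\hat{T}^{\star}_{1} amounts to a single in-place sweep over the coordinates: each coordinate is refreshed with the anti-optimality (infimum) backup using the components already updated earlier in the same sweep. A minimal tabular Python sketch is given below; the (A, S, S) transition tensor and (S, A) reward table are our assumed encoding, and the analogous sweep with a maximum yields T_{GS}^{\star}.

```python
import numpy as np

def anti_optimality_gs_sweep(V, P, R, gamma):
    """One Gauss-Seidel sweep of the Bellman anti-optimality operator,
    hat{T}_GS = hat{T}_n ... hat{T}_2 hat{T}_1 (minimal tabular sketch).
    P: (A, S, S) transition tensor, R: (S, A) rewards, V: (S,) values."""
    V = np.array(V, dtype=float)
    n_states = V.shape[0]
    for j in range(n_states):                         # hat{T}_j updates only coordinate j
        V[j] = (R[j] + gamma * P[:, j, :] @ V).min()  # inf over actions, using refreshed V
    return V
```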

Now, we introduce the following lemmas.

Lemma 19.

Let 0<\gamma<1. If 0\leq\alpha\leq 1, then there exists a \gamma-contractive nonnegative matrix \mathcal{P}_{GS} such that

TGSU(1α)TGSU~αTGSU\displaystyle T_{GS}^{\star}U-(1-\alpha)T_{GS}^{\star}\tilde{U}-\alpha T_{GS}^{\star}U^{\star} 𝒫GS(U(1α)U~αU).\displaystyle\leq\mathcal{P}_{GS}(U-(1-\alpha)\tilde{U}-\alpha U^{\star}).
Lemma 20.

Let 0<\gamma<1. If 0\leq\alpha\leq 1, then there exists a \gamma-contractive nonnegative matrix \hat{\mathcal{P}}_{GS} such that

𝒫^GS(U(1α)U~αU^)TGSU(1α)TGSU~αT^GSU^.\displaystyle\hat{\mathcal{P}}_{GS}(U-(1-\alpha)\tilde{U}-\alpha\hat{U}^{\star})\leq T_{GS}^{\star}U-(1-\alpha)T_{GS}^{\star}\tilde{U}-\alpha\hat{T}_{GS}^{\star}\hat{U}^{\star}.
Proof of Lemma 19.

First let U=V,U~=V~,U=VU=V,\tilde{U}=\tilde{V},U^{\star}=V^{\star}. For 1in1\leq i\leq n, we have

TiV(si)(1α)TiV~(si)αTiV(si)\displaystyle T_{i}^{\star}V(s_{i})-(1-\alpha)T_{i}^{\star}\tilde{V}(s_{i})-\alpha T_{i}^{\star}V^{\star}(s_{i}) TiπiV(si)(1α)TiπiV~(si)αTiπiV(si)\displaystyle\leq T_{i}^{\pi_{i}}V(s_{i})-(1-\alpha)T_{i}^{\pi_{i}}\tilde{V}(s_{i})-\alpha T_{i}^{\pi_{i}}V^{\star}(s_{i})
=γ𝒫πi(V(1α)V~αV)(si),\displaystyle=\gamma\mathcal{P}^{\pi_{i}}\left({V-(1-\alpha)\tilde{V}-\alpha V^{\star}}\right)(s_{i}),

where \pi_{i} is the greedy policy satisfying T^{\pi_{i}}V=T^{\star}V, and the inequality is from T^{\pi_{i}}\tilde{V}\leq T^{\star}\tilde{V} and T^{\pi_{i}}V^{\star}\leq T^{\star}V^{\star}. Then, define the matrix \mathcal{P}_{i} as

𝒫i(V)=(V1,,Vi1,(γ𝒫πi(V))i,Vi+1,,Vn)\displaystyle\mathcal{P}_{i}(V)=(V_{1},\dots,V_{i-1},\left({\gamma\mathcal{P}^{\pi_{i}}(V)}\right)_{i},V_{i+1},\dots,V_{n})

for i=1,\dots,n. Note that \mathcal{P}_{i} is a nonnegative matrix since \mathcal{P}^{\pi_{i}} is a nonnegative matrix. Then, we have

TiV(1α)TiV~αTiV𝒫i(V(1α)V~αV).T_{i}^{\star}V-(1-\alpha)T_{i}^{\star}\tilde{V}-\alpha T_{i}^{\star}V^{\star}\leq\mathcal{P}_{i}(V-(1-\alpha)\tilde{V}-\alpha V^{\star}).

By induction, there exists a sequence of matrices \{\mathcal{P}_{i}\}_{i=1,\dots,n} satisfying

TGSV(1α)TGSV~αTGSV𝒫n𝒫1(V(1α)V~αV)T_{GS}^{\star}V-(1-\alpha)T_{GS}^{\star}\tilde{V}-\alpha T_{GS}^{\star}V^{\star}\leq\mathcal{P}_{n}\cdots\mathcal{P}_{1}(V-(1-\alpha)\tilde{V}-\alpha V^{\star})

since T_{i}^{\star}V^{\star}=V^{\star} for all i. Denote \mathcal{P}_{GS}=\mathcal{P}_{n}\cdots\mathcal{P}_{1}. Then, \mathcal{P}_{GS} is a \gamma-contractive nonnegative matrix since

\sum^{n}_{j=1}\left({\mathcal{P}_{GS}}\right)_{ij}=\sum^{n}_{j=1}\left({\mathcal{P}_{i}\cdots\mathcal{P}_{1}}\right)_{ij}\leq\sum^{n}_{j=1}\left({\mathcal{P}_{i}}\right)_{ij}=\gamma

for 1\leq i\leq n, where the first equality is from the definition of \mathcal{P}_{l} for i+1\leq l\leq n, the inequality comes from the definition of \mathcal{P}_{l} for 1\leq l\leq i-1, and the last equality follows from the definition of \mathcal{P}_{i}. Therefore, \left\|{\mathcal{P}_{GS}}\right\|_{\infty}\leq\gamma.

If U=Q, by a similar argument to the case U=V, let \pi_{i} be the greedy policy and define the matrix \mathcal{P}_{i} as

𝒫i(Q)=(Q1,,Qi1,(γ𝒫πi(Q))i,Qi+1,,Qn),\displaystyle\mathcal{P}_{i}(Q)=(Q_{1},\dots,Q_{i-1},\left({\gamma\mathcal{P}^{\pi_{i}}(Q)}\right)_{i},Q_{i+1},\dots,Q_{n}),

and denote \mathcal{P}_{GS}=\mathcal{P}_{n}\cdots\mathcal{P}_{1}. Then, \mathcal{P}_{GS} is a \gamma-contractive nonnegative matrix satisfying

TGSQ(1α)TGSQ~αTGSQ𝒫GS(Q(1α)Q~αQ).T_{GS}^{\star}Q-(1-\alpha)T_{GS}^{\star}\tilde{Q}-\alpha T_{GS}^{\star}Q^{\star}\leq\mathcal{P}_{GS}(Q-(1-\alpha)\tilde{Q}-\alpha Q^{\star}).

Proof of Lemma 20.

First let U=V,U~=V~,U^=V^U=V,\tilde{U}=\tilde{V},\hat{U}^{\star}=\hat{V}^{\star}. For 1in1\leq i\leq n, we have

TiV(si)(1α)TiV~(si)αT^iV^(si)\displaystyle T^{\star}_{i}V(s_{i})-(1-\alpha)T_{i}^{\star}\tilde{V}(s_{i})-\alpha\hat{T}_{i}^{\star}\hat{V}^{\star}(s_{i})
=supa𝒜{r(si,a)+γ𝔼sP(|si,a)[V(s)]}supa𝒜{(1α)r(si,a)+(1α)γ𝔼sP(|si,a)[V~(s)]}\displaystyle=\sup_{a\in\mathcal{A}}\bigg{\{}r(s_{i},a)+\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s_{i},a)}\left[V(s^{\prime})\right]\bigg{\}}-\sup_{a\in\mathcal{A}}\bigg{\{}(1-\alpha)r(s_{i},a)+(1-\alpha)\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s_{i},a)}\left[\tilde{V}(s^{\prime})\right]\bigg{\}}
infa𝒜{αr(si,a)+αγ𝔼sP(|si,a)[V^(s)]}\displaystyle\quad-\inf_{a\in\mathcal{A}}\bigg{\{}\alpha r(s_{i},a)+\alpha\gamma\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s_{i},a)}\left[\hat{V}^{\star}(s^{\prime})\right]\bigg{\}}
γinfa𝒜{𝔼sP(|si,a)[V(s)(1α)V~(s)αV^(s)]}.\displaystyle\geq\gamma\inf_{a\in\mathcal{A}}\bigg{\{}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s_{i},a)}\left[V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha\hat{V}^{\star}(s^{\prime})\right]\bigg{\}}.

Let π^i(|s)=argmina𝒜𝔼sP(|s,a)[V(s)(1α)V~(s)αV^(s)]\hat{\pi}_{i}(\cdot\,|\,s)=\operatorname*{argmin}_{a\in\mathcal{A}}\mathbb{E}_{s^{\prime}\sim P(\cdot\,|\,s,a)}\left[V(s^{\prime})-(1-\alpha)\tilde{V}(s^{\prime})-\alpha\hat{V}^{\star}(s^{\prime})\right] and define matrix 𝒫^i\hat{\mathcal{P}}_{i} as

𝒫^i(V)=(V1,,Vi1,(γ𝒫π^i(V))i,Vi+1,,Vn)\displaystyle\hat{\mathcal{P}}_{i}(V)=(V_{1},\dots,V_{i-1},\left({\gamma\mathcal{P}^{\hat{\pi}_{i}}(V)}\right)_{i},V_{i+1},\dots,V_{n})

for i=1,\dots,n. Note that \hat{\mathcal{P}}_{i} is a nonnegative matrix since \mathcal{P}^{\hat{\pi}_{i}} is a nonnegative matrix. Then, we have

\hat{\mathcal{P}}_{i}(V-(1-\alpha)\tilde{V}-\alpha\hat{V}^{\star})\leq T_{i}^{\star}V-(1-\alpha)T_{i}^{\star}\tilde{V}-\alpha\hat{T}_{i}^{\star}\hat{V}^{\star}.

By induction, there exists a sequence of matrices \{\hat{\mathcal{P}}_{i}\}_{i=1,\dots,n} satisfying

𝒫^n𝒫^1(V(1α)V~αV^)TGSV(1α)TGSV~αT^GSV^,\hat{\mathcal{P}}_{n}\cdots\hat{\mathcal{P}}_{1}(V-(1-\alpha)\tilde{V}-\alpha\hat{V}^{\star})\leq T_{GS}^{\star}V-(1-\alpha)T_{GS}^{\star}\tilde{V}-\alpha\hat{T}_{GS}^{\star}\hat{V}^{\star},

and denote \hat{\mathcal{P}}_{GS}=\hat{\mathcal{P}}_{n}\cdots\hat{\mathcal{P}}_{1}. By the same argument as in the proof of Lemma 19, \hat{\mathcal{P}}_{GS} is a \gamma-contractive nonnegative matrix.

If U=Q, by a similar argument, let \hat{\pi}_{i}(\cdot\,|\,s)=\operatorname*{argmin}_{a\in\mathcal{A}}\{Q(s,a)-(1-\alpha)\tilde{Q}(s,a)-\alpha\hat{Q}^{\star}(s,a)\} and define the matrix \hat{\mathcal{P}}_{i} as

\displaystyle\hat{\mathcal{P}}_{i}(Q)=(Q_{1},\dots,Q_{i-1},\left({\gamma\mathcal{P}^{\hat{\pi}_{i}}(Q)}\right)_{i},Q_{i+1},\dots,Q_{n}).

Denote \hat{\mathcal{P}}_{GS}=\hat{\mathcal{P}}_{n}\cdots\hat{\mathcal{P}}_{1}. Then, by the same argument as in the proof of Lemma 19, \hat{\mathcal{P}}_{GS} is a \gamma-contractive nonnegative matrix satisfying

𝒫^GS(Q(1α)Q~αQ^)TGSQ(1α)TGSQ~αT^GSQ^.\hat{\mathcal{P}}_{GS}(Q-(1-\alpha)\tilde{Q}-\alpha\hat{Q}^{\star})\leq T_{GS}^{\star}Q-(1-\alpha)T_{GS}^{\star}\tilde{Q}-\alpha\hat{T}_{GS}^{\star}\hat{Q}^{\star}.

Next, we prove the following key lemma.

Lemma 21.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of (GS-Anc-VI), there exist γ\gamma-contractive nonnegative matrices {𝒫GSl}l=0,1,,k\{\mathcal{P}_{GS}^{l}\}_{l=0,1,\dots,k} and {𝒫^GSl}l=0,1,,k\{\hat{\mathcal{P}}_{GS}^{l}\}_{l=0,1,\dots,k} such that

TGSUkUk\displaystyle T_{GS}^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫GSl)(U0U)]\displaystyle\leq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\right]
βk(U0U)+(Πj=1k(1βj))(Πl=k0𝒫GSl)(U0U),\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star}),
TGSUkUk\displaystyle T_{GS}^{\star}U^{k}-U^{k} i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫^GSl)(U0U^)]\displaystyle\geq\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
βk(U0U^)+(Πj=1k(1βj))(Πl=k0𝒫^GSl)(U0U^),\displaystyle\quad-\beta_{k}(U^{0}-\hat{U}^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star}),

where Πj=k+1k(1βj)=1\Pi^{k}_{j=k+1}(1-\beta_{j})=1 and β0=1\beta_{0}=1.

Proof of Lemma 21.

First, we prove the first inequality in Lemma 21 by induction.

If k=0k=0,

TGSU0U0\displaystyle T_{GS}^{\star}U^{0}-U^{0} =TGSU0U(U0U)\displaystyle=T_{GS}^{\star}U^{0}-U^{\star}-(U^{0}-U^{\star})
=TGSU0TGSU(U0U)\displaystyle=T_{GS}^{\star}U^{0}-T_{GS}^{\star}U^{\star}-(U^{0}-U^{\star})
𝒫GS0(U0U)(U0U).\displaystyle\leq\mathcal{P}_{GS}^{0}(U^{0}-U^{\star})-(U^{0}-U^{\star}).

where the inequality comes from Lemma 19 with \alpha=1, U=U^{0}.

By induction,

TGSUkUk\displaystyle T_{GS}^{\star}U^{k}-U^{k}
=TGSUk(1βk)TGSUk1βkTGSUβk(U0U)\displaystyle=T_{GS}^{\star}U^{k}-(1-\beta_{k})T_{GS}^{\star}U^{k-1}-\beta_{k}T_{GS}^{\star}U^{\star}-\beta_{k}(U^{0}-U^{\star})
𝒫GSk(Uk(1βk)Uk1βkU)βk(U0U)\displaystyle\leq\mathcal{P}_{GS}^{k}(U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}U^{\star})-\beta_{k}(U^{0}-U^{\star})
=𝒫GSk(βk(U0U)+(1βk)(TGSUk1Uk1))βk(U0U)\displaystyle=\mathcal{P}_{GS}^{k}(\beta_{k}(U^{0}-U^{\star})+(1-\beta_{k})(T_{GS}^{\star}U^{k-1}-U^{k-1}))-\beta_{k}(U^{0}-U^{\star})
(1βk)𝒫GSk(i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1i𝒫GSl)(U0U)]\displaystyle\leq(1-\beta_{k})\mathcal{P}_{GS}^{k}\bigg{(}\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\right]
βk1(U0U)+(Πj=1k1(1βj))(Πl=k10𝒫GSl)(U0U))\displaystyle\quad-\beta_{k-1}(U^{0}-U^{\star})+\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\bigg{)}
+βk𝒫GSk(U0U)βk(U0U)\displaystyle\quad+\beta_{k}\mathcal{P}_{GS}^{k}(U^{0}-U^{\star})-\beta_{k}(U^{0}-U^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫GSl)(U0U)]\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\right]
βk(U0U)+(Πj=1k(1βj))(Πl=k0𝒫GSl)(U0U)\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})

where the first inequality comes from Lemma 19 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the second inequality comes from the nonnegativity of \mathcal{P}_{GS}^{k} and the induction hypothesis.

Next, we prove the second inequality in Lemma 21 by induction.

If k=0k=0,

TGSU0U0\displaystyle T_{GS}^{\star}U^{0}-U^{0} =TGSU0U^(U0U^)\displaystyle=T_{GS}^{\star}U^{0}-\hat{U}^{\star}-(U^{0}-\hat{U}^{\star})
=TGSU0T^GSU^(U0U^)\displaystyle=T_{GS}^{\star}U^{0}-\hat{T}_{GS}^{\star}\hat{U}^{\star}-(U^{0}-\hat{U}^{\star})
𝒫^GS0(U0U^)(U0U^),\displaystyle\geq\hat{\mathcal{P}}_{GS}^{0}(U^{0}-\hat{U}^{\star})-(U^{0}-\hat{U}^{\star}),

where the inequality comes from Lemma 20 with \alpha=1, U=U^{0}.

By induction,

TGSUkUk\displaystyle T_{GS}^{\star}U^{k}-U^{k}
=TGSUk(1βk)TGSUk1βkT^GSU^βk(U0U^)\displaystyle=T_{GS}^{\star}U^{k}-(1-\beta_{k})T_{GS}^{\star}U^{k-1}-\beta_{k}\hat{T}_{GS}^{\star}\hat{U}^{\star}-\beta_{k}(U^{0}-\hat{U}^{\star})
𝒫^GSk(Uk(1βk)Uk1βkU^)βk(U0U^)\displaystyle\geq\hat{\mathcal{P}}_{GS}^{k}(U^{k}-(1-\beta_{k})U^{k-1}-\beta_{k}\hat{U}^{\star})-\beta_{k}(U^{0}-\hat{U}^{\star})
=𝒫^GSk(βk(U0U^)+(1βk)(TGSUk1Uk1))βk(U0U^)\displaystyle=\hat{\mathcal{P}}_{GS}^{k}(\beta_{k}(U^{0}-\hat{U}^{\star})+(1-\beta_{k})(T_{GS}^{\star}U^{k-1}-U^{k-1}))-\beta_{k}(U^{0}-\hat{U}^{\star})
(1βk)𝒫^GSk(i=1k1[(βiβi1(1βi))(Πj=i+1k1(1βj))(Πl=k1i𝒫^GSl)(U0U^)]\displaystyle\geq(1-\beta_{k})\hat{\mathcal{P}}_{GS}^{k}\bigg{(}\sum_{i=1}^{k-1}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k-1}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k-1}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
βk1(U0U^)+(Πj=1k1(1βj))(Πl=k10𝒫^GSl)(U0U^))\displaystyle\quad-\beta_{k-1}(U^{0}-\hat{U}^{\star})+\left({\Pi^{k-1}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k-1}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})\bigg{)}
+βk𝒫^GSk(U0U^)βk(U0U^)\displaystyle\quad+\beta_{k}\hat{\mathcal{P}}_{GS}^{k}(U^{0}-\hat{U}^{\star})-\beta_{k}(U^{0}-\hat{U}^{\star})
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫^GSl)(U0U^)]\displaystyle=\sum_{i=1}^{k}\left[(\beta_{i}-\beta_{i-1}(1-\beta_{i}))\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]
βk(U0U^)+(Πj=1k(1βj))(Πl=k0𝒫^GSl)(U0U^)\displaystyle\quad-\beta_{k}(U^{0}-\hat{U}^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})

where the first inequality comes from Lemma 20 with \alpha=\beta_{k}, U=U^{k}, \tilde{U}=U^{k-1}, and the second inequality comes from the nonnegativity of \hat{\mathcal{P}}_{GS}^{k} and the induction hypothesis. ∎

Now, we prove the first rate in Theorem 7.

Proof of first rate in Theorem 7.

Since B_{1}\leq A\leq B_{2} implies \left\|{A}\right\|_{\infty}\leq\max\{\left\|{B_{1}}\right\|_{\infty},\left\|{B_{2}}\right\|_{\infty}\} for A,B_{1},B_{2}\in\mathcal{F}(\mathcal{X}), taking the \left\|{\cdot}\right\|_{\infty}-norm of the right-hand side of the first inequality in Lemma 21, we can bound it by

i=1k|βiβi1(1βi)|(Πj=i+1k(1βj))(Πl=ki𝒫GSl)(U0U)\displaystyle\sum_{i=1}^{k}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{i}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})}\right\|_{\infty}
\displaystyle\quad+\beta_{k}\left\|{U^{0}-U^{\star}}\right\|_{\infty}+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{0}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})}\right\|_{\infty}
(i=1kγki+1|βiβi1(1βi)|(Πj=i+1k(1βj))+βk+γk+1Πj=1k(1βj))\displaystyle\leq\left({\sum_{i=1}^{k}\gamma^{k-i+1}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)+\beta_{k}+\gamma^{k+1}\Pi^{k}_{j=1}(1-\beta_{j})}\right)
U0U\displaystyle\quad\left\|{U^{0}-U^{\star}}\right\|_{\infty}
=(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0U,\displaystyle=\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty},

where the first bound comes from the triangle inequality, the second inequality is from the \gamma-contractivity of \mathcal{P}_{GS}^{l}, and the last equality follows by direct calculation. Similarly, taking the \left\|{\cdot}\right\|_{\infty}-norm of the right-hand side of the second inequality in Lemma 21, we can bound it by

i=1k|βiβi1(1βi)|(Πj=i+1k(1βj))(Πl=ki𝒫^GSl)(U0U^)\displaystyle\sum_{i=1}^{k}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{i}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})}\right\|_{\infty}
\displaystyle\quad+\beta_{k}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left\|{\left({\Pi^{0}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})}\right\|_{\infty}
\displaystyle\leq\left({\sum_{i=1}^{k}\gamma^{k-i+1}\left|{\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right|\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)+\beta_{k}+\gamma^{k+1}\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}
=(γ1γ)(1+2γγk+1)(γk+1)1γk+1U0U^,\displaystyle=\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty},

where the first bound comes from the triangle inequality, the second inequality is from the \gamma-contractivity of \hat{\mathcal{P}}_{GS}^{l}, and the last equality follows by direct calculation. Therefore, we conclude

TGSUkUk\displaystyle\left\|{T_{GS}^{\star}U^{k}-U^{k}}\right\|_{\infty} (γ1γ)(1+2γγk+1)(γk+1)1γk+1max{U0U,U0U^}.\displaystyle\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+2\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\max{\left\{\left\|{U^{0}-U^{\star}}\right\|_{\infty},\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}\right\}}.

For the second rates of Theorem 7, we introduce the following lemma.

Lemma 22.

Let 0<γ<10<\gamma<1. For the iterates {Uk}k=0,1,\{U^{k}\}_{k=0,1,\dots} of (GS-Anc-VI), if U0TGSU0U^{0}\leq T_{GS}^{\star}U^{0}, then Uk1UkTGSUk1TGSUkUU^{k-1}\leq U^{k}\leq T^{\star}_{GS}U^{k-1}\leq T^{\star}_{GS}U^{k}\leq U^{\star} for 1k1\leq k. Also, if U0TGSU0U^{0}\geq T^{\star}_{GS}U^{0}, then Uk1UkTGSUk1TGSUkUU^{k-1}\geq U^{k}\geq T^{\star}_{GS}U^{k-1}\geq T^{\star}_{GS}U^{k}\geq U^{\star} for 1k1\leq k.

Proof.

By Fact 3, \lim_{m\rightarrow\infty}(T_{GS}^{\star})^{m}U=U^{\star}. By definition, if U\leq\tilde{U}, then T^{\star}_{i}U\leq T_{i}^{\star}\tilde{U} for any 1\leq i\leq n, and this implies that if U\leq\tilde{U}, then T_{GS}^{\star}U\leq T_{GS}^{\star}\tilde{U}. Hence, by the same argument as in the proof of Lemma 5, we obtain the desired results. ∎

Now, we prove the second rates in Theorem 7.

Proof of second rates in Theorem 7.

If U0TGSU0U^{0}\leq T_{GS}^{\star}U^{0}, then U0U0U^{0}-U^{\star}\leq 0 and UkTGSUkU^{k}\leq T_{GS}^{\star}U^{k} by Lemma 22. Hence, by Lemma 21, we get

0\displaystyle 0 TGSUkUk\displaystyle\leq T_{GS}^{\star}U^{k}-U^{k}
=i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫GSl)(U0U)]\displaystyle=\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\right]
\displaystyle\quad-\beta_{k}(U^{0}-U^{\star})+\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})
i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫GSl)(U0U)]βk(U0U),\displaystyle\leq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\right]-\beta_{k}(U^{0}-U^{\star}),

where the second inequality follows from \left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\mathcal{P}_{GS}^{l}}\right)(U^{0}-U^{\star})\leq 0. Taking the \left\|{\cdot}\right\|_{\infty}-norm on both sides, we have

TGSUkUk(γ1γ)(1+γγk+1)(γk+1)1γk+1U0U.\left\|{T_{GS}^{\star}U^{k}-U^{k}}\right\|_{\infty}\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-U^{\star}}\right\|_{\infty}.

Otherwise, if U^{0}\geq T_{GS}^{\star}U^{0}, then U^{k}\geq T_{GS}^{\star}U^{k} and U^{0}\geq U^{\star}\geq\hat{U}^{\star} by Lemmas 22 and 3. Thus, by Lemma 21, we get

0\displaystyle 0 TGSUkUk\displaystyle\geq T_{GS}^{\star}U^{k}-U^{k}
i=1k[(βiβi1(1βi))(Πj=i+1k(1βj))(Πl=ki𝒫^GSl)(U0U^)]βk(U0U^),\displaystyle\geq\sum_{i=1}^{k}\left[\left({\beta_{i}-\beta_{i-1}(1-\beta_{i})}\right)\left({\Pi^{k}_{j=i+1}(1-\beta_{j})}\right)\left({\Pi^{i}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star})\right]-\beta_{k}(U^{0}-\hat{U}^{\star}),

where the second inequality follows from 0\leq\left({\Pi^{k}_{j=1}(1-\beta_{j})}\right)\left({\Pi^{0}_{l=k}\hat{\mathcal{P}}_{GS}^{l}}\right)(U^{0}-\hat{U}^{\star}). Taking the \left\|{\cdot}\right\|_{\infty}-norm on both sides, we have

TGSUkUk(γ1γ)(1+γγk+1)(γk+1)1γk+1U0U^.\left\|{T_{GS}^{\star}U^{k}-U^{k}}\right\|_{\infty}\leq\frac{\left({\gamma^{-1}-\gamma}\right)\left({1+\gamma-\gamma^{k+1}}\right)}{\left({\gamma^{k+1}}\right)^{-1}-\gamma^{k+1}}\left\|{U^{0}-\hat{U}^{\star}}\right\|_{\infty}.
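Putting the pieces together, (GS-Anc-VI) applies the anchoring step to the Gauss–Seidel operator, U^{k}=\beta_{k}U^{0}+(1-\beta_{k})T_{GS}^{\star}U^{k-1}. The Python sketch below only illustrates this update rule; the tabular encoding and the externally supplied anchoring schedule `betas` are assumptions on our part, not the paper's exact specification.

```python
import numpy as np

def gs_optimality_sweep(V, P, R, gamma):
    """One Gauss-Seidel sweep of the Bellman optimality operator T_GS
    (minimal tabular sketch; P is (A, S, S), R is (S, A))."""
    V = np.array(V, dtype=float)
    for j in range(V.shape[0]):                       # T_j refreshes only coordinate j
        V[j] = (R[j] + gamma * P[:, j, :] @ V).max()  # sup over actions, using refreshed V
    return V

def gs_anc_vi(U0, betas, P, R, gamma):
    """Sketch of (GS-Anc-VI): U^k = beta_k U^0 + (1 - beta_k) T_GS(U^{k-1}),
    with an externally supplied anchoring schedule `betas`."""
    U0 = np.asarray(U0, dtype=float)
    U = U0.copy()
    for beta in betas:                                # anchor every iterate toward U^0
        U = beta * U0 + (1.0 - beta) * gs_optimality_sweep(U, P, R, gamma)
    return U
```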

Appendix G Broader Impacts

Our work focuses on the theoretical aspects of reinforcement learning. There are no negative social impacts that we anticipate from our theoretical results.

Appendix H Limitations

Our analysis concerns value iteration. While value iteration is of theoretical interest, the analysis of value iteration is not sufficient to understand modern deep reinforcement learning practices.