
Strong Polynomiality of the Value Iteration Algorithm for Computing Nearly Optimal Policies for Discounted Dynamic Programming

Eugene A. Feinberg (corresponding author)
Department of Mathematics and Statistics, Stony Brook University
Stony Brook, NY 11794-3600, USA
[email protected]

Gaojin He
Department of Mathematics, University of California, San Diego
La Jolla, CA 92093-0112, USA
[email protected]
Abstract

This note provides upper bounds on the number of operations required to compute by value iterations a nearly optimal policy for an infinite-horizon discounted Markov decision process with a finite number of states and actions. For a given discount factor, magnitude of the reward function, and desired closeness to optimality, these upper bounds are strongly polynomial in the number of state-action pairs, and one of the provided upper bounds has the property that it is a non-decreasing function of the value of the discount factor.

Keywords— Markov Decision Process, Discounting, Algorithm, Complexity, Optimal Policy.

1 Introduction

Value and policy iteration algorithms are the major tools for solving infinite-horizon discounted Markov decision processes (MDPs). Policy iteration algorithms can also be viewed as implementations of specific versions of the simplex method applied to linear programming problems corresponding to discounted MDPs [6, 17]. Ye [17] proved that for a given discount factor the policy iteration algorithm is strongly polynomial as a function of the total number of state-action pairs. Kitahara and Mizuno [7] extended Ye's [17] results by providing sufficient conditions for strong polynomiality of a simplex method for linear programming, and Scherrer [13] improved Ye's [17] bound for MDPs. For deterministic MDPs, Post and Ye [10] proved that for policy iterations there is a polynomial upper bound on the number of operations that does not depend on the value of the discount factor. Feinberg and Huang [4] showed that value iterations are not strongly polynomial for discounted MDPs. Earlier, Tseng [16] proved weak polynomiality of value iterations.

In this note we show that the value iteration algorithm computes $\epsilon$-optimal policies in strongly polynomial time. This is an important observation because value iterations are broadly used in applications, including reinforcement learning [1, 2, 15], for computing nearly optimal policies.

Let us consider an MDP with a finite state space $\mathbb{X}=\{1,2,\ldots,m\}$ and with finite nonempty sets of actions $A(x)$ available at each state $x\in\mathbb{X}$. Each action set $A(x)$ consists of $k_{x}$ actions. Thus, the total number of actions is $k=\sum_{x=1}^{m}k_{x}$, and this number can be interpreted as the total number of state-action pairs. Let $\alpha\in[0,1)$ be the discount factor. According to [13], the policy iteration algorithm finds an optimal policy within $N^{PI}_{I}(\alpha)$ iterations, and each iteration requires at most $N^{PI}_{O}$ operations with

$N^{PI}_{I}(\alpha):=O\left(\frac{k}{1-\alpha}\log\frac{1}{1-\alpha}\right),\qquad\qquad N^{PI}_{O}:=O\left(m^{3}+mk\right).$  (1.1)

This paper shows that for each $\epsilon>0$, the value iteration algorithm finds an $\epsilon$-optimal policy within $N^{VI(\epsilon)}_{I}(\alpha)$ iterations, and each iteration requires at most $N^{VI}_{O}$ operations, where

$N^{VI(\epsilon)}_{I}(\alpha):=\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon}{\left[R+(1+\alpha)V\right]}}{\log\alpha}\right\rceil,1\right\},\qquad\qquad N^{VI}_{O}:=O\left(mk\right),$  (1.2)

where $R$ and $V$ are constants determined by the one-step reward function $r$ and by the terminal reward function $v^{0}$ respectively; see (1.5) and (1.3) below. In addition, $N^{VI(\epsilon)}_{I}(\alpha)$ is non-decreasing in $\alpha\in[0,1)$.

To define the values of $R$ and $V$, let us denote by $sp(u)$ the span seminorm of $u\in\mathbb{R}^{m}$, where $\mathbb{R}^{m}$ is an $m$-dimensional Euclidean space,

$sp(u):=\max_{x\in\mathbb{X}}u(x)-\min_{x\in\mathbb{X}}u(x).$

The properties of this seminorm can be found in [11, p. 196].

Let $r(x,a)$ be the reward collected if the system is at the state $x\in\mathbb{X}$ and the action $a\in A(x)$ is chosen. Let $v^{0}(x)$ be the reward collected at the final state $x\in\mathbb{X}$. Then

$V:=sp(v^{0}).$  (1.3)

We denote by

$v^{1}(x):=\max_{a\in A(x)}r(x,a),\qquad x\in\mathbb{X},$  (1.4)

the maximal one-step reward that can be collected at the state $x$. Then

$R:=sp(v^{1})=\max_{x\in\mathbb{X}}\max_{a\in A(x)}r(x,a)-\min_{x\in\mathbb{X}}\max_{a\in A(x)}r(x,a).$  (1.5)

Observe that

$R\leq sp(r):=\max_{x\in\mathbb{X}}\max_{a\in A(x)}r(x,a)-\min_{x\in\mathbb{X}}\min_{a\in A(x)}r(x,a).$

Of course, the total number of operations to find an $\epsilon$-optimal policy by the value iteration algorithm is bounded above by $N^{VI(\epsilon)}_{I}(\alpha)\cdot N^{VI}_{O}$. The total number of operations to find an optimal policy by the policy iteration algorithm is bounded above by $N^{PI}_{I}(\alpha)\cdot N^{PI}_{O}$. Each iteration of the policy iteration algorithm requires solving a system of $m$ linear equations. This can be done by Gaussian elimination within $O(m^{3})$ operations. This is the reason the formula for $N^{PI}_{O}$ in (1.1) contains the term $m^{3}$, which can be reduced to $m^{\omega}$ with $\omega<3$ by using contemporary methods. For example, $\omega=2.807$ for Strassen's algorithm [14]. According to [8], the best currently available value is $\omega=2.37286$, but this method of solving linear equations is impractical due to the large value of the constant hidden in the $O(\cdot)$ notation. In addition to (1.2), more accurate upper bounds on the number of iterations are presented in (3.5), (3.8), and (3.9), but they require additional definitions and calculations.
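For illustration, the bound in (1.2) is easy to evaluate directly from the model data. The following Python sketch (the data layout, with reward lists r[x][a] and a terminal reward vector v0, and all names are assumptions of this illustration) computes $R$, $V$, and $N^{VI(\epsilon)}_{I}(\alpha)$:

```python
import math

def span(u):
    """Span seminorm sp(u) = max(u) - min(u)."""
    return max(u) - min(u)

def vi_iteration_bound(r, v0, alpha, eps):
    """Upper bound N_I^{VI(eps)}(alpha) from (1.2).

    r[x] lists the one-step rewards r(x, a) over a in A(x);
    v0[x] is the terminal reward.  Here R = sp(v^1) and V = sp(v^0)."""
    v1 = [max(rx) for rx in r]            # v^1(x) = max_a r(x, a), see (1.4)
    R, V = span(v1), span(v0)
    if R + (1 + alpha) * V == 0:          # degenerate case: one iteration suffices
        return 1
    n = math.log((1 - alpha) * eps / (R + (1 + alpha) * V)) / math.log(alpha)
    return max(math.ceil(n), 1)

# a tiny illustrative MDP: 2 states with 2 and 1 actions
r = [[1.0, 0.5], [0.0]]
v0 = [0.0, 0.0]
m, k = len(r), sum(len(rx) for rx in r)
print(vi_iteration_bound(r, v0, alpha=0.9, eps=1e-3),
      "iterations, each requiring O(mk) operations with mk =", m * k)
```

Multiplying the returned number of iterations by $N^{VI}_{O}=O(mk)$ gives the overall operation estimate discussed above.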

As is well known, and as is clear from (1.1) and (1.2), the number of operations at each step is larger for the policy iteration algorithm than for the value iteration algorithm. If the number of states $m$ is large, then the difference $(N^{PI}_{O}-N^{VI}_{O})$ can be significant. In order to accelerate policy iterations, the method of modified policy iterations was introduced in [12]. This method uses value iterations to solve linear equations. As shown in Feinberg et al. [5], modified policy iterations and their versions are not strongly polynomial algorithms for finding optimal policies.

2 Definitions

Let $\mathbb{N}$ and $\mathbb{R}$ be the sets of natural numbers and real numbers respectively. For a finite set $E$, let $|E|$ denote the number of elements in the set $E$. We consider an MDP with a finite state space $\mathbb{X}=\{1,2,\ldots,m\}$, where $m\in\mathbb{N}$ is the number of states, and nonempty finite action sets $A(x)$ available at states $x\in\mathbb{X}$. Let $\mathbb{A}:=\bigcup_{x\in\mathbb{X}}A(x)$ be the action set. We recall that $k=\sum_{x\in\mathbb{X}}|A(x)|$ is the total number of actions at all states or, in slightly different terms, the number of all state-action pairs. For each $x\in\mathbb{X}$, if an action $a\in A(x)$ is selected at the state $x$, then a one-step reward $r(x,a)$ is collected and the process moves to the next state $y\in\mathbb{X}$ with the probability $p(y|x,a)$, where $r(x,a)$ is a real number and $\sum_{y\in\mathbb{X}}p(y|x,a)=1$. The process continues over a finite or infinite planning horizon. For a finite-horizon problem, the terminal real-valued reward $v^{0}(x)$ is collected at the final state $x\in\mathbb{X}$.

A deterministic policy is a mapping $\phi:\mathbb{X}\to\mathbb{A}$ such that $\phi(x)\in A(x)$ for each $x\in\mathbb{X}$, and, if the process is at a state $x\in\mathbb{X}$, then the action $\phi(x)$ is selected. A general policy can be randomized and history-dependent; see, e.g., Puterman [11, p. 154] for definitions of various classes of policies. We denote by $\Pi$ the set of all policies.

Let $\alpha\in[0,1)$ be the discount factor. For a policy $\pi\in\Pi$ and for an initial state $x\in\mathbb{X}$, the expected total discounted reward for an $n$-horizon problem is

$v_{n,\alpha}^{\pi}(x):=\mathbb{E}_{x}^{\pi}\left[\sum_{t=0}^{n-1}\alpha^{t}r(x_{t},a_{t})+\alpha^{n}v^{0}(x_{n})\right],\qquad n\in\mathbb{N},$

and for the infinite-horizon problem it is

$v_{\alpha}^{\pi}(x):=\mathbb{E}_{x}^{\pi}\sum_{t=0}^{\infty}\alpha^{t}r(x_{t},a_{t}),$

where $\mathbb{E}_{x}^{\pi}$ is the expectation defined by the initial state $x$ and the policy $\pi$, and where $x_{0}a_{0}\cdots x_{t}a_{t}\ldots$ is a trajectory of the process consisting of states $x_{t}\in\mathbb{X}$ and actions $a_{t}\in A(x_{t})$ at epochs $t=0,1,\ldots$. The value functions are defined for initial states $x\in\mathbb{X}$ as

$v_{n,\alpha}(x):=\sup_{\pi\in\Pi}v_{n,\alpha}^{\pi}(x),\qquad n\in\mathbb{N},$

for $n$-horizon problems, and

$v_{\alpha}(x):=\sup_{\pi\in\Pi}v_{\alpha}^{\pi}(x)$

for infinite-horizon problems. Note that $v_{0,\alpha}^{\pi}=v_{0,\alpha}=v^{0}$ for all $\alpha\in[0,1)$ and $\pi\in\Pi$.

A policy $\pi$ is called optimal ($n$-horizon optimal for $n=1,2,\ldots$) if $v_{\alpha}^{\pi}(x)=v_{\alpha}(x)$ (respectively, $v_{n,\alpha}^{\pi}(x)=v_{n,\alpha}(x)$) for all $x\in\mathbb{X}$. It is well known that, for discounted MDPs with finite action sets, finite-horizon problems have optimal policies and infinite-horizon problems have deterministic optimal policies; see [11, p. 154].

A policy $\pi$ is called $\epsilon$-optimal for $\epsilon\geq 0$ if $v_{\alpha}^{\pi}(x)\geq v_{\alpha}(x)-\epsilon$ for all $x\in\mathbb{X}$. A $0$-optimal policy is optimal. The objective of this paper is to estimate the complexity of the value iteration algorithm for finding a deterministic $\epsilon$-optimal policy for $\epsilon>0$.
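As a side remark, for a fixed deterministic policy $\phi$ the infinite-horizon value $v_{\alpha}^{\phi}$ is the unique solution of the $m$ linear equations $v=r_{\phi}+\alpha P_{\phi}v$; solving them by Gaussian elimination is the $O(m^{3})$ step mentioned in Section 1. A minimal sketch, with an assumed data layout and illustrative names:

```python
import numpy as np

def evaluate_policy(P, r, phi, alpha):
    """Solve (I - alpha * P_phi) v = r_phi for a deterministic policy phi.

    P[x][a] is a length-m array with the transition probabilities p(.|x,a),
    r[x][a] is the one-step reward, and phi[x] is the action chosen at x."""
    m = len(phi)
    P_phi = np.array([P[x][phi[x]] for x in range(m)])   # m x m matrix
    r_phi = np.array([r[x][phi[x]] for x in range(m)])
    return np.linalg.solve(np.eye(m) - alpha * P_phi, r_phi)

# two-state example: action 0 stays put, action 1 moves to the other state
P = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
     [np.array([0.0, 1.0])]]
r = [[1.0, 0.0], [0.0]]
print(evaluate_policy(P, r, phi=[0, 0], alpha=0.9))      # [10.  0.]
```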

3 Main Results

For a real-valued function $v:\mathbb{X}\to\mathbb{R}$, let us define

$T_{\alpha}^{a}v(x):=r(x,a)+\alpha\sum_{y\in\mathbb{X}}p(y|x,a)v(y),\qquad x\in\mathbb{X},\ a\in A(x).$  (3.1)

We shall use the notation $T_{\alpha}^{\phi}v(x):=T_{\alpha}^{\phi(x)}v(x)$ for a deterministic policy $\phi$. For $v:\mathbb{X}\to\mathbb{R}$ we also define the optimality operator $T_{\alpha}$,

$T_{\alpha}v(x):=\max_{a\in A(x)}T_{\alpha}^{a}v(x),\qquad x\in\mathbb{X}.$  (3.2)

Every real-valued function $v$ on $\mathbb{X}$ can be identified with a vector $v=(v(1),\ldots,v(m))$. Therefore, all real-valued functions on $\mathbb{X}$ form the $m$-dimensional Euclidean space $\mathbb{R}^{m}$.
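The operators (3.1) and (3.2) can be applied directly; the sketch below (same assumed data layout as the earlier sketch, illustrative names) evaluates one application of $T_{\alpha}$:

```python
import numpy as np

def bellman_operator(v, r, P, alpha):
    """(T_alpha v)(x) = max_{a in A(x)} [r(x,a) + alpha * sum_y p(y|x,a) v(y)], see (3.1)-(3.2)."""
    return np.array([max(r[x][a] + alpha * np.dot(P[x][a], v)
                         for a in range(len(r[x])))
                     for x in range(len(v))])

# one application on a 2-state toy MDP (action 0 stays put, action 1 switches states)
P = [[np.array([1.0, 0.0]), np.array([0.0, 1.0])],
     [np.array([0.0, 1.0])]]
r = [[1.0, 0.0], [0.0]]
print(bellman_operator(np.zeros(2), r, P, alpha=0.9))    # [1. 0.]
```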

For each $n\in\mathbb{N}$ and all $v^{0}\in\mathbb{R}^{m}$, the expected total discounted rewards $v_{n,\alpha}^{\phi}$ and $v_{\alpha}^{\phi}$ satisfy the equations

$v^{\phi}_{n,\alpha}=T^{\phi}_{\alpha}v^{\phi}_{n-1,\alpha},\qquad v^{\phi}_{\alpha}=T^{\phi}_{\alpha}v^{\phi}_{\alpha},$

the value functions $v_{n,\alpha}$ and $v_{\alpha}$ satisfy the optimality equations

$v_{n,\alpha}=T_{\alpha}v_{n-1,\alpha},\qquad v_{\alpha}=T_{\alpha}v_{\alpha},$

and value iterations converge to the infinite-horizon expected total rewards and optimal values

$v_{\alpha}^{\phi}=\lim_{n\to\infty}\left(T_{\alpha}^{\phi}\right)^{n}v^{0}=T_{\alpha}^{\phi}v_{\alpha}^{\phi},\qquad v_{\alpha}=\lim_{n\to\infty}\left(T_{\alpha}\right)^{n}v^{0}=\lim_{n\to\infty}v_{n,\alpha}=T_{\alpha}v_{\alpha},$  (3.3)

and a deterministic policy $\phi$ is optimal if and only if $T^{\phi}_{\alpha}v_{\alpha}=T_{\alpha}v_{\alpha}$; see, e.g., [3], [11, pp. 146-151]. Therefore, if we consider the nonempty sets

$A_{\alpha}(x)=\{a\in A(x):v_{\alpha}(x)=T^{a}_{\alpha}v_{\alpha}(x)\},$

then a deterministic policy $\phi$ is optimal if and only if $\phi(x)\in A_{\alpha}(x)$ for all $x\in\mathbb{X}$.

For a given $\epsilon>0$, the following value iteration algorithm computes a deterministic $\epsilon$-optimal policy. It uses a stopping rule based on the value of the span $sp(v_{n,\alpha}-v_{n-1,\alpha})$. As mentioned in [11, p. 205], this algorithm generates the same number of iterations as the relative value iteration algorithm originally introduced to accelerate value iterations.

Algorithm 1.

(Computing a deterministic $\epsilon$-optimal policy by value iterations).

For a given MDP, discount factor $\alpha\in(0,1)$, and constant $\epsilon>0$:
1. select a vector $u:=v^{0}\in\mathbb{R}^{m}$ and a constant $\Delta>\frac{1-\alpha}{\alpha}\epsilon$ (e.g., choose $\Delta:=\epsilon/\alpha$);
2. while $\Delta>\frac{1-\alpha}{\alpha}\epsilon$ compute $v=T_{\alpha}u$, $\Delta:=sp(u-v)$ and set $u^{*}:=u$, $u:=v$ endwhile;
3. choose a deterministic policy $\phi$ such that $v=T_{\alpha}^{\phi}u^{*}$; this policy is $\epsilon$-optimal.

If $\alpha=0$, then a deterministic policy $\phi$ is optimal if and only if $r(x,\phi(x))=\max_{a\in A(x)}r(x,a)$. As is well known, Algorithm 1 converges within a finite number of iterations (e.g., this follows from (3.3)) and returns an $\epsilon$-optimal policy $\phi$ (e.g., this follows from [11, Proposition 6.6.5]).
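A possible implementation of Algorithm 1 is sketched below; the data layout and all names are assumptions of this illustration. It is run on the MDP of Example 1 below, so the number of iterations can be compared with the bound (3.5):

```python
import numpy as np

def span(u):
    return float(np.max(u) - np.min(u))

def greedy_bellman(u, r, P, alpha):
    """Return (T_alpha u, a deterministic policy phi with T_alpha^phi u = T_alpha u)."""
    v, phi = np.empty(len(u)), [0] * len(u)
    for x in range(len(u)):
        q = [r[x][a] + alpha * np.dot(P[x][a], u) for a in range(len(r[x]))]
        phi[x] = int(np.argmax(q))
        v[x] = q[phi[x]]
    return v, phi

def algorithm_1(r, P, v0, alpha, eps):
    """Value iteration with the span stopping rule of Algorithm 1."""
    u, delta, n = np.array(v0, dtype=float), eps / alpha, 0
    while delta > (1 - alpha) * eps / alpha:
        v, phi = greedy_bellman(u, r, P, alpha)    # v = T_alpha u, phi greedy w.r.t. u = u*
        delta = span(u - v)
        u, n = v, n + 1
    return phi, v, n                               # phi is eps-optimal

# MDP of Example 1 (states 1,2,3 are indices 0,1,2; actions b,c are indices 0,1)
P = [[np.array([0.0, 0.0, 1.0]), np.array([0.0, 1.0, 0.0])],   # state 1: b -> 3, c -> 2
     [np.array([0.0, 1.0, 0.0])],                              # state 2: b -> 2
     [np.array([0.0, 0.0, 1.0])]]                              # state 3: b -> 3
r = [[0.0, 0.0], [1.0], [-1.0]]
phi, v, n = algorithm_1(r, P, v0=[1.0, 2.0, -2.0], alpha=0.47, eps=0.02)
print(phi, n)   # expected: n = 4, which equals n*(0.47) computed in Example 1
```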

Following Puterman [11, Theorem 6.6.6], let us define

$\gamma:=\max_{x,y\in\mathbb{X},\ a\in A(x),\,b\in A(y)}\left[1-\sum_{z\in\mathbb{X}}\min\left\{p(z|x,a),p(z|y,b)\right\}\right].$  (3.4)

We notice that $0\leq\gamma\leq 1$. If $\gamma=0$, then $p(z|x,a)=p(z|y,b)$ for all $x,y,z\in\mathbb{X}$ and $a\in A(x)$, $b\in A(y)$, and we have again that a deterministic policy $\phi$ is optimal if and only if $r(x,\phi(x))=\max_{a\in A(x)}r(x,a)$. In this case Algorithm 1 stops at most after the second iteration and returns a deterministic optimal policy. Therefore, in what follows we mostly deal with the case $\gamma>0$. If an MDP has deterministic transition probabilities and there are two or more deterministic policies with nonidentical stochastic matrices, then $\gamma=1$.

Computing the value of $\gamma$ requires computing the sum in (3.4) for all couples $\{(x,a),(y,b)\}$ of state-action pairs such that $(x,a)\neq(y,b)$. The identical pairs $(x,a)=(y,b)$ can be excluded since for such pairs the value of the function to be maximized in (3.4) is $0$, which is the smallest possible value. The total number of such couples is $k(k-1)/2=O(k^{2})$. The number of arithmetic operations in (3.4), which are additions, is $m$ for each couple. Therefore, the straightforward computation of $\gamma$ requires $O(mk^{2})$ operations, which can be significantly larger than the complexity of computing a deterministic $\epsilon$-optimal policy, which is the product of $N^{VI(\epsilon)}_{I}(\alpha)$ and $N^{VI}_{O}$ defined in (1.2). Puterman [11, Eq. (6.6.16)] also provides an upper bound $\gamma^{\prime}\in[\gamma,1]$, where $\gamma^{\prime}:=1-\sum_{z\in\mathbb{X}}\min_{x\in\mathbb{X},a\in A(x)}p(z|x,a)$, whose computation requires $O(mk)$ operations.
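The straightforward $O(mk^{2})$ computation of $\gamma$ in (3.4) and the cheaper $O(mk)$ bound $\gamma^{\prime}$ can be sketched as follows, assuming the $k$ transition distributions $p(\cdot|x,a)$ are stored as the rows of a $k\times m$ array (names are illustrative):

```python
import numpy as np

def gamma_exact(P_pairs):
    """gamma from (3.4): the maximum over couples of distinct state-action pairs
    of 1 - sum_z min{p(z|x,a), p(z|y,b)}; straightforward O(m k^2) computation."""
    k, g = len(P_pairs), 0.0
    for i in range(k):
        for j in range(i + 1, k):
            g = max(g, 1.0 - np.minimum(P_pairs[i], P_pairs[j]).sum())
    return g

def gamma_upper_bound(P_pairs):
    """gamma' = 1 - sum_z min_{(x,a)} p(z|x,a) from [11, Eq. (6.6.16)]; O(mk) operations."""
    return 1.0 - np.min(P_pairs, axis=0).sum()

# rows are the distributions p(.|x,a), one row per state-action pair
P_pairs = np.array([[0.5, 0.5, 0.0],
                    [0.2, 0.3, 0.5],
                    [0.0, 1.0, 0.0]])
print(gamma_exact(P_pairs), gamma_upper_bound(P_pairs))   # both equal 0.7 here
```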

Theorem 1.

For $\alpha\in(0,1)$ and $\epsilon>0$, the number of iterations for Algorithm 1 to find a deterministic $\epsilon$-optimal policy is bounded above by

$n^{*}(\alpha):=\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon\gamma}{sp(v_{1,\alpha}-v^{0})}}{\log(\alpha\gamma)}\right\rceil,1\right\}$  (3.5)

if $sp(v_{1,\alpha}-v^{0})\neq 0$ and $\gamma>0$. If $sp(v_{1,\alpha}-v^{0})=0$, then $n^{*}(\alpha)=1$ and Algorithm 1 returns a deterministic optimal policy. If $\gamma=0$, then $n^{*}(\alpha)\in\{1,2\}$ and Algorithm 1 returns a deterministic optimal policy. In addition, each iteration uses at most $O(mk)$ operations.

Proof.

As follows from (3.1) and (3.2), each iteration uses at most $O(mk)$ arithmetic operations. Let $n$ be the actual number of iterations. According to its steps 2 and 3, the algorithm returns the deterministic policy $\phi$ for which $v_{n,\alpha}=T_{\alpha}v_{n-1,\alpha}=T_{\alpha}^{\phi}v_{n-1,\alpha}$ and

$sp(T_{\alpha}v_{n-1,\alpha}-v_{n-1,\alpha})\leq\frac{1-\alpha}{\alpha}\epsilon.$

In view of [11, Proposition 6.6.5, p. 201], this $\phi$ is $\epsilon$-optimal. By [11, Corollary 6.6.8, p. 204],

$sp(v_{n,\alpha}-v_{n-1,\alpha})\leq\left(\alpha\gamma\right)^{n-1}sp(v_{1,\alpha}-v^{0}).$  (3.6)

Therefore, the minimal number $n\in\mathbb{N}$ satisfying

$\left(\alpha\gamma\right)^{n-1}sp(v_{1,\alpha}-v^{0})\leq\frac{1-\alpha}{\alpha}\epsilon$  (3.7)

leads to (3.5). The case $sp(v_{1,\alpha}-v^{0})=0$ is obvious, and the case $\gamma=0$ is discussed above. ∎

Example 1.

This example shows that the bound in (3.5) can be exact, and it may not be monotone in the discount factor. Let the state space be $\mathbb{X}=\{1,2,3\}$ and the action space be $\mathbb{A}=\{b,c\}$. Let $A(1)=\mathbb{A}$ and $A(2)=A(3)=\{b\}$ be the sets of actions available at states $1$, $2$, and $3$ respectively. The transition probabilities are given by $p(3|1,b)=p(2|1,c)=p(2|2,b)=p(3|3,b)=1$. The one-step rewards are $r(1,b)=r(1,c)=0$, $r(2,b)=1$, and $r(3,b)=-1$; see Figure 1.


Figure 1: MDP diagram for Example 1.

We set $v^{0}(1)=1$, $v^{0}(2)=2$, $v^{0}(3)=-2$. As discussed above, $\gamma=1$ for this MDP with deterministic transitions. Straightforward calculations imply that

$v_{n,\alpha}=\left(\alpha^{n}+\sum_{k=1}^{n}\alpha^{k},\ \alpha^{n}+\sum_{k=0}^{n}\alpha^{k},\ -\alpha^{n}-\sum_{k=0}^{n}\alpha^{k}\right),$
$v_{n,\alpha}-v_{n-1,\alpha}=\left(2\alpha^{n}-\alpha^{n-1},\ 2\alpha^{n}-\alpha^{n-1},\ -2\alpha^{n}+\alpha^{n-1}\right),$
$sp(v_{n,\alpha}-v_{n-1,\alpha})=2\alpha^{n-1}\left|2\alpha-1\right|=\alpha^{n-1}sp(v_{1,\alpha}-v^{0}),$

where the $i$-th coordinates of the vectors correspond to the states $i=1,2,3$. The last displayed equality implies that inequality (3.6) holds in the form of an equality for this example. Therefore, the bound in (3.5) is also the actual number of iterations executed by Algorithm 1 for this MDP, which is

$n^{*}(\alpha)=\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon}{2\left|2\alpha-1\right|}}{\log\alpha}\right\rceil,1\right\}$

for $\alpha\neq 0.5$ and $\epsilon>0$. If $\alpha=0.5$, then Algorithm 1 stops after the first iteration. Let $\epsilon=0.02$. Then $n^{*}(0.24)=3$, $n^{*}(0.47)=4$, and $n^{*}(0.48)=3$, which shows that $n^{*}(\alpha)$ may not be monotone in $\alpha$. $\square$
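The values $n^{*}(0.24)=3$, $n^{*}(0.47)=4$, and $n^{*}(0.48)=3$ reported above can be checked with a few lines of Python (a sketch using $\gamma=1$ and the closed form $sp(v_{1,\alpha}-v^{0})=2|2\alpha-1|$ derived for this example):

```python
import math

def n_star_example_1(alpha, eps):
    """n*(alpha) from (3.5) for Example 1, where gamma = 1."""
    sp1 = 2.0 * abs(2.0 * alpha - 1.0)          # sp(v_{1,alpha} - v^0)
    if sp1 == 0.0:                              # alpha = 0.5: stop after one iteration
        return 1
    return max(math.ceil(math.log((1 - alpha) * eps / sp1) / math.log(alpha)), 1)

print([n_star_example_1(a, 0.02) for a in (0.24, 0.47, 0.48)])   # [3, 4, 3]
```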

Let us consider the vector $v^{1}\in\mathbb{R}^{m}$ defined in (1.4). Note that $v^{1}=v_{1,\alpha}$ if $v^{0}\equiv 0$. In the next theorem, we assume that $sp(v^{1})+sp(v^{0})>0$ and $\gamma>0$ since otherwise Algorithm 1 stops after at most two iterations and returns a deterministic optimal policy.

Theorem 2.

Let $\alpha\in(0,1)$. For fixed $\epsilon>0$, $\gamma\in(0,1]$, and $v^{0},v^{1}\in\mathbb{R}^{m}$ such that $sp(v^{1})+sp(v^{0})>0$, Algorithm 1 finds a deterministic $\epsilon$-optimal policy after a finite number of iterations bounded above by

$F(\alpha):=\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon\gamma}{sp(v^{1})+(1+\alpha)sp(v^{0})}}{\log(\alpha\gamma)}\right\rceil,1\right\}.$  (3.8)

Furthermore, the function $F(\alpha)$ defined in (3.8) for $\alpha\in[0,1)$ has the following properties for an arbitrary fixed parameter $\gamma\in(0,1]$:

(a) $\lim_{\alpha\downarrow 0}F(\alpha)=1$, $\lim_{\alpha\uparrow 1}F(\alpha)=+\infty$;
(b) $F(\alpha)$ is non-decreasing in $\alpha$.
Proof.

By (3.1) and (3.2),

$v_{1,\alpha}(x)=\max_{a\in A(x)}\left\{r(x,a)+\alpha\sum_{y\in\mathbb{X}}p(y|x,a)v^{0}(y)\right\}$
$\leq\max_{a\in A(x)}\left\{r(x,a)+\alpha\sum_{y\in\mathbb{X}}p(y|x,a)\max_{z\in\mathbb{X}}v^{0}(z)\right\}$
$=\max_{a\in A(x)}r(x,a)+\alpha\max_{z\in\mathbb{X}}v^{0}(z).$

Similarly, $v_{1,\alpha}(x)\geq\max_{a\in A(x)}r(x,a)+\alpha\min_{z\in\mathbb{X}}v^{0}(z)$. Therefore,

$sp(v_{1,\alpha})=\max_{x\in\mathbb{X}}v_{1,\alpha}(x)-\min_{x\in\mathbb{X}}v_{1,\alpha}(x)$
$\leq\max_{x\in\mathbb{X}}\max_{a\in A(x)}r(x,a)+\alpha\max_{z\in\mathbb{X}}v^{0}(z)-\min_{x\in\mathbb{X}}\max_{a\in A(x)}r(x,a)-\alpha\min_{z\in\mathbb{X}}v^{0}(z)$
$=sp(v^{1})+\alpha\,sp(v^{0}).$

By the properties of the span seminorm provided in [11, p. 196],

$sp(v_{1,\alpha}-v^{0})\leq sp(v_{1,\alpha})+sp(v^{0})\leq sp(v^{1})+(1+\alpha)sp(v^{0}),$

which leads to $F(\alpha)\geq n^{*}(\alpha)$, where $n^{*}(\alpha)$ is defined in (3.5). Therefore, by Theorem 1, $F(\alpha)$ is an upper bound on the number of iterations.
The formulae in (a) follow directly from (3.8). To prove (b), we recall that $R=sp(v^{1})$ and $V=sp(v^{0})$; see (1.5) and (1.3). By the assumption of the theorem, $R+(1+\alpha)V>0$. The function $\frac{(1-\alpha)\epsilon\gamma}{R+(1+\alpha)V}$ is positive and decreasing in $\alpha\in(0,1)$, and the function $\alpha\gamma$ is increasing in $\alpha\in(0,1)$. In addition, $\alpha\gamma\in(0,1)$. Therefore, the function

$f(\alpha):=\frac{\log\frac{(1-\alpha)\epsilon\gamma}{R+(1+\alpha)V}}{\log(\alpha\gamma)}$

is increasing when $\frac{(1-\alpha)\epsilon\gamma}{R+(1+\alpha)V}\leq 1$, which holds if and only if $\alpha\geq\frac{\epsilon\gamma-R-V}{\epsilon\gamma+V}$. This implies that $F(\alpha)$ is non-decreasing on the interval $(b,1)$, where $b:=\max\{\frac{\epsilon\gamma-R-V}{\epsilon\gamma+V},0\}$. Thus, if $\frac{\epsilon\gamma-R-V}{\epsilon\gamma+V}\leq 0$, then the theorem is proved. Now let $\frac{\epsilon\gamma-R-V}{\epsilon\gamma+V}>0$. We choose an arbitrary $\alpha\in(0,\frac{\epsilon\gamma-R-V}{\epsilon\gamma+V}]$. Then $\frac{(1-\alpha)\epsilon\gamma}{R+(1+\alpha)V}\geq 1$, which implies $f(\alpha)\leq 0$. In view of (3.8), $F(\alpha)=1$, and the function $F(\alpha)$ is non-decreasing on $(0,1)$. ∎
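For a numerical illustration of Theorem 2, the sketch below evaluates $F(\alpha)$ from (3.8) on a grid and checks property (b); the parameter values are arbitrary and purely illustrative:

```python
import math

def F(alpha, eps, gamma, sp_v1, sp_v0):
    """F(alpha) from (3.8)."""
    arg = (1 - alpha) * eps * gamma / (sp_v1 + (1 + alpha) * sp_v0)
    return max(math.ceil(math.log(arg) / math.log(alpha * gamma)), 1)

values = [F(a / 1000.0, eps=0.01, gamma=0.8, sp_v1=1.0, sp_v0=0.5)
          for a in range(1, 1000)]
print(values[1], values[500], values[-1])                 # F grows with alpha
print(all(x <= y for x, y in zip(values, values[1:])))    # expected: True, property (b)
```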

As explained in the paragraph preceding Theorem 1, it may be time-consuming to find the actual value of $\gamma$ for an MDP. The following corollaries provide additional bounds.

Corollary 1.

Let $\alpha\in(0,1)$. For a fixed $\epsilon>0$, if $sp(v^{1})+sp(v^{0})>0$, then

$n^{*}(\alpha)\leq\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon}{sp(v_{1,\alpha}-v^{0})}}{\log\alpha}\right\rceil,1\right\},$  (3.9)
$F(\alpha)\leq N^{VI(\epsilon)}_{I}(\alpha):=\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon}{sp(v^{1})+(1+\alpha)sp(v^{0})}}{\log\alpha}\right\rceil,1\right\}.$  (3.10)
Proof.

We recall that $n^{*}(\alpha)$ is defined as the smallest $n\in\mathbb{N}$ satisfying (3.7). The right-hand side of (3.9) is the smallest $n\in\mathbb{N}$ satisfying $\alpha^{n-1}sp(v_{1,\alpha}-v^{0})\leq\frac{1-\alpha}{\alpha}\epsilon$. Since $\alpha\gamma\leq\alpha$ for $\alpha\in(0,1)$ and $\gamma\in(0,1]$, every such $n$ also satisfies (3.7), and inequality (3.9) holds. Inequality (3.10) holds for similar reasons, since $F(\alpha)$ and $N^{VI(\epsilon)}_{I}(\alpha)$ are the smallest $n\in\mathbb{N}$ satisfying $\left(\alpha\gamma\right)^{n-1}[sp(v^{1})+(1+\alpha)sp(v^{0})]\leq\frac{1-\alpha}{\alpha}\epsilon$ and $\alpha^{n-1}[sp(v^{1})+(1+\alpha)sp(v^{0})]\leq\frac{1-\alpha}{\alpha}\epsilon$ respectively. ∎

We notice that, if the function $F$ from (3.8) is minimized in $v^{0}$, then the smallest value is attained when $sp(v^{0})=0$, that is, when $v^{0}$ is constant. The following corollary provides upper bounds for constant $v^{0}$, including $v^{0}\equiv 0$.

Corollary 2.

Let $\alpha\in(0,1)$, and let $v^{0}$ be constant. If $sp(v^{1})>0$, then Algorithm 1 finds a deterministic $\epsilon$-optimal policy after a finite number of iterations bounded above by

$F^{*}(\alpha):=\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon\gamma}{sp(v^{1})}}{\log(\alpha\gamma)}\right\rceil,1\right\}\leq\max\left\{\left\lceil\frac{\log\frac{(1-\alpha)\epsilon}{sp(v^{1})}}{\log\alpha}\right\rceil,1\right\}.$
Proof.

This corollary follows from Theorem 2 and inequality (3.10) applied with $sp(v^{0})=0$. ∎

Example 2.

This example illustrates the monotonicity of the polynomial upper bound $F(\alpha)$ for computing $\epsilon$-optimal policies and the non-monotonicity of the number of calculations to find an optimal policy by value iterations. The following MDP is taken from [4]. Let the state space be $\mathbb{X}=\{1,2,3\}$ and the action space be $\mathbb{A}=\{b,c\}$. Let $A(1)=\mathbb{A}$ and $A(2)=A(3)=\{b\}$ be the sets of actions available at states $1,2,3$ respectively; see Figure 2. The transition probabilities are given by $p(2|1,c)=p(3|1,b)=p(2|2,b)=p(3|3,b)=1$. The one-step rewards are $r(1,b)=r(2,b)=0$, $r(3,b)=1$, and $r(1,c)=1-\exp(-M)$, where $M>0$. We set $v^{0}(x)=0$ for $x\in\mathbb{X}$.


Figure 2: MDP diagram for Example 2.

As shown in [4], for $\alpha=0.5$ the number of value iterations required to find an optimal policy increases to infinity as $M$ increases to infinity. This shows that the value iteration algorithm for computing the optimal policy is not strongly polynomial. However, $sp(v^{1})=1$ does not change with increasing $M$ in this example. As follows from Corollary 2, for fixed $\epsilon>0$ and $\alpha\in(0,\alpha^{*}]$ with $\alpha^{*}\in(0,1)$, the number of required iterations $N^{VI(\epsilon)}_{I}$ for Algorithm 1 is uniformly bounded no matter how large $M$ is; see Figure 3. $\square$



Figure 3: Exact numbers of iterations for finding optimal policies and $N^{VI(10^{-5})}_{I}$ in Example 2. (The numbers of iterations are integers, so in this example the graphs display step functions; the plot was generated in Matlab, which automatically smooths the curves to clarify comparisons.)
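For completeness, the comparison behind Figure 3 can be reproduced with a short computation: for $\alpha=0.5$ in Example 2, the first iteration at which the greedy action at state $1$ becomes the optimal action $b$ grows with $M$, while the bound of Corollary 2 for computing an $\epsilon$-optimal policy does not depend on $M$ (a sketch with illustrative names):

```python
import math

def iterations_to_identify_optimal(M, alpha=0.5):
    """Example 2: first n at which the greedy action at state 1 is the optimal
    action b, i.e. alpha * v_{n-1,alpha}(3) >= r(1,c) = 1 - exp(-M)."""
    n, v3 = 1, 0.0                        # v3 = v_{n-1,alpha}(3); v_{0,alpha} = v^0 = 0
    while alpha * v3 < 1.0 - math.exp(-M):
        v3 = 1.0 + alpha * v3             # v_{n,alpha}(3) = 1 + alpha * v_{n-1,alpha}(3)
        n += 1
    return n

def eps_optimal_bound(alpha, eps, sp_v1=1.0):
    """The gamma-free bound of Corollary 2 with v^0 = 0; it does not depend on M."""
    return max(math.ceil(math.log((1 - alpha) * eps / sp_v1) / math.log(alpha)), 1)

for M in (1, 5, 10, 20):
    print(M, iterations_to_identify_optimal(M), eps_optimal_bound(0.5, 1e-5))
```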

Lewis and Paul [9] provide examples of MDPs for which the number of iterations required for computing an optimal policy by value iterations can tend to infinity even for discount factors bounded away from $1$. Here we provide a significantly simpler example.

Example 3.

This example shows that the required number of value iterations to compute an optimal policy may be unbounded as a function of the discount factor $\alpha$ when $\alpha\in(0,\alpha^{*}]$ for some $\alpha^{*}\in(0,1)$. Consider an MDP with the state space $\mathbb{X}=\{1,2,3\}$, with the action space $\mathbb{A}=\{b,c\}$, and with the sets of actions $A(1)=\mathbb{A}$ and $A(2)=A(3)=\{b\}$ available at states $1$, $2$, and $3$ respectively; see Figure 4. The transition probabilities are given by $p(2|1,c)=p(3|1,b)=p(2|2,b)=p(3|3,b)=1$. The one-step rewards are $r(1,c)=r(2,b)=1$, $r(1,b)=2$, and $r(3,b)=0$.

We set $v^{0}(x)=0$ for $x\in\mathbb{X}$.


Figure 4: MDP diagram for Example 3.

There are only two deterministic policies, denoted by $\phi$ and $\psi$, which differ only at state $1$ with $\phi(1)=c$ and $\psi(1)=b$. Observe that $v_{\alpha}^{\pi}(x)=v_{\alpha}(x)$ for $x=2,3$ and all $\pi\in\Pi$. In addition,

$v_{\alpha}^{\phi}(1)=\sum_{k=0}^{\infty}\alpha^{k}=\frac{1}{1-\alpha},\qquad v_{\alpha}^{\psi}(1)=2.$

Therefore, $\psi$ is optimal for $\alpha\in[0,0.5]$ and $\phi$ is optimal for $\alpha\in[0.5,1)$. Let us consider the number $\Delta$ computed at the $n$-th value iteration at step 2 of Algorithm 1, $n\in\mathbb{N}$. For this MDP, $\Delta=sp(v-u)$ with $v=v_{n,\alpha}=T_{\alpha}u$ and $u=v_{n-1,\alpha}$. Observe that $v_{n,\alpha}(3)=0$ and $v_{n,\alpha}(2)=\sum_{i=0}^{n-1}\alpha^{i}$ for all $n\in\mathbb{N}$. If $n=1,2$, then $v_{n,\alpha}(1)=2$. For $n\geq 3$,

$v_{n,\alpha}(1)=\max\left\{1+\alpha\sum_{i=0}^{n-2}\alpha^{i},\ 2+0\right\}=\begin{cases}2,&{\rm if\ }\alpha\in(0,0.5+\delta_{n}];\\ \sum_{i=0}^{n-1}\alpha^{i},&{\rm if\ }\alpha\in[0.5+\delta_{n},1),\end{cases}$

where $0<\delta_{n}<0.5$ and $\sum_{i=0}^{n-1}(0.5+\delta_{n})^{i}=2$. Clearly, $\delta_{n}$ is strictly decreasing as $n$ increases, and $\lim_{n\to\infty}\delta_{n}=0$. This implies that for $\alpha\in(0,0.5]$ the value iterations identify the optimal policy $\psi$ at the first iteration, while for every $n\geq 3$ and $\alpha\in(0.5+\delta_{n+1},0.5+\delta_{n})$ the value iterations identify the optimal policy $\phi$ only at the $(n+1)$-st iteration. $\square$
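The quantities $\delta_{n}$ and the iteration at which $\phi$ is identified can be computed directly; a short sketch (bisection for $\delta_{n}$, illustrative names):

```python
def delta(n):
    """delta_n: the root in (0, 0.5) of sum_{i=0}^{n-1} (0.5 + d)^i = 2."""
    lo, hi = 0.0, 0.5
    for _ in range(60):                              # bisection
        mid = (lo + hi) / 2.0
        if sum((0.5 + mid) ** i for i in range(n)) >= 2.0:
            hi = mid
        else:
            lo = mid
    return hi

def first_iteration_identifying_phi(alpha):
    """First n with sum_{i=0}^{n-1} alpha^i >= 2, i.e. c becomes greedy at state 1."""
    n, s = 1, 1.0
    while s < 2.0:
        s += alpha ** n
        n += 1
    return n

d3, d4 = delta(3), delta(4)
print(round(d3, 4), round(d4, 4))                    # approx 0.118 and 0.0437
alpha = 0.5 + (d3 + d4) / 2.0                        # a point in (0.5 + delta_4, 0.5 + delta_3)
print(first_iteration_identifying_phi(alpha))        # expected: 4, i.e. n + 1 for n = 3
```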

Examples 2 and 3 illustrate the main difficulties of running value iterations for computing optimal policies. Nevertheless, the results of this paper show that these difficulties can be easily overcome by using value iterations for computing $\epsilon$-optimal policies.

4 Acknowledgement

This research was partially supported by the National Science Foundation grant CMMI-1636193.

References

  • [1] D. P. Bertsekas, Reinforcement Learning and Optimal Control, Athena Scientific, Belmont, MA, 2019.
  • [2] D. P. Bertsekas, J. N. Tsitsiklis, Neuro-Dynamic Programming, Athena Scientific, Belmont, MA, 1996.
  • [3] E. A. Feinberg, Total reward criteria, in: E. A. Feinberg, A. Schwartz (Eds.), Handbook of Markov Decision Processes, Kluwer, Boston, MA, 2002, pp. 173-207.
  • [4] E. A. Feinberg and J. Huang, The value iteration algorithm is not strongly polynomial for discounted dynamic programming, Oper. Res. Lett. 42 (2014) 130-131.
  • [5] E. A. Feinberg, J. Huang, and B. Scherrer, Modified policy iteration algorithms are not strongly polynomial for discounted dynamic programming, Oper. Res. Lett. 42 (2014) 429-431.
  • [6] L.C.M. Kallenberg, Finite state and action MDPs, in: E. A. Feinberg, A. Schwartz (Eds.), Handbook of Markov Decision Processes, Kluwer, Boston, MA, 2002, pp. 21-87.
  • [7] T. Kitahara and S. Mizuno, A bound for the number of different basic solutions generated by the simplex method, Math. Program. 137 (2013) 579-586.
  • [8] F. Le Gall, Powers of tensors and fast matrix multiplication, Proc. 39th Int. Symp. on Symbolic and Algebraic Computation (2014) 296-303.
  • [9] M. E. Lewis and A. Paul, Uniform turnpike theorems for finite Markov decision processes, Math. Oper. Res. 44 (2019) 1145-1160.
  • [10] I. Post and Y. Ye, The simplex method is strongly polynomial for deterministic Markov decision processes, Math. Oper. Res. 40 (2015) 859-868.
  • [11] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming, John Wiley & Sons, Inc., New York, 1994.
  • [12] M. L. Puterman and M. C. Shin, Modified policy iteration algorithms for discounted Markov decision problems, Manag. Sci. 24 (1978) 1127-1137.
  • [13] B. Scherrer, Improved and generalized upper bounds on the complexity of policy iteration, Math. Oper. Res. 41 (2016) 758-774.
  • [14] V. Strassen, Gaussian elimination is not optimal, Numer. Math. 13 (1969) 354-356.
  • [15] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction, second ed., MIT Press, Cambridge, 2018.
  • [16] P. Tseng, Solving h-horizon, stationary Markov decision problems in time proportional to log(h), Oper. Res. Lett. 9 (1990) 287-297.
  • [17] Y. Ye, The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate, Math. Oper. Res. 36 (2011) 593-603.