
On Supervised On-line Rolling-Horizon Control for Infinite-Horizon Discounted Markov Decision Processes

Hyeong Soo Chang H.S. Chang is with the Department of Computer Science and Engineering at Sogang University, Seoul 121-742, Korea. (e-mail:[email protected]).
Abstract

This note revisits the rolling-horizon control approach to the problem of a Markov decision process (MDP) with the infinite-horizon discounted expected reward criterion. Distinguished from the classical value-iteration approach, we develop an asynchronous on-line algorithm based on policy iteration integrated with a multi-policy improvement method of policy switching. A sequence of monotonically improving solutions to the forecast-horizon sub-MDP is generated by updating the current solution only at the currently visited state, building in effect a rolling-horizon control policy for the MDP over the infinite horizon. Feedback from “supervisors,” if available, can also be incorporated while updating. We focus on the convergence issue in relation to the transition structure of the MDP. Depending on the structure, the algorithm achieves in finite time either a global convergence to an optimal forecast-horizon policy or a local convergence to a “locally optimal” fixed policy.

Index Terms:
rolling horizon control, policy iteration, policy switching, Markov decision process

I Introduction

Consider the rolling horizon control (see, e.g., [5]) with a fixed finite forecast-horizon $H$ applied to the problem of a Markov decision process (MDP) $M_{\infty}$ with the infinite-horizon discounted expected reward criterion. At discrete time $k\geq 1$, the system is at a state $x_{k}$ in a finite state set $X$. If the controller of the system takes an action $a$ in $A(x_{k})$ at $k$, then it obtains a reward of $R(x_{k},a)$ from a reward function $R:X\times A\rightarrow\Re$, where $A(x)$ denotes the admissible-action set of $x$ in $X$. The system then makes a random transition to a next state $x_{k+1}$ with the probability $P_{x_{k}x_{k+1}}^{a}$ specified in an $|X|\times|X|$-matrix $P^{a}=[P_{xy}^{a}]$, $x,y\in X$.

Let $B(X)$ be the set of all possible functions from $X$ to $\Re$. The zero function in $B(X)$ is referred to as $0_{X}$, where $0_{X}(x)=0$ for all $x\in X$. Let also $\Pi(X)$ be the set of all possible mappings from $X$ to $A$, where for any $\pi\in\Pi(X)$, $\pi(x)$ is constrained to be in $A(x)$ for all $x\in X$. Let an $h$-length policy of the controller be an element in $\Pi(X)^{h}$, the $h$-ary Cartesian product, $h\geq 1$. That is, $\pi\in\Pi(X)^{h}$ is an ordered tuple $(\pi_{1},...,\pi_{h})$ where the $m$th entry of $\pi$ is equal to $\pi_{m}\in\Pi(X)$, $m\geq 1$, and when $\pi_{m}$ is applied at $x$ in $X$, the controller is supposed to look ahead of (or forecast) the remaining horizon $h-m$ for control. An infinite-length policy is an element in the infinite Cartesian product of $\Pi(X)$, denoted by $\Pi(X)^{\infty}$, and referred to as just a policy. We say that a policy $\phi\in\Pi(X)^{\infty}$ is stationary if $\phi_{m}=\pi$ for all $m\geq 1$ for some $\pi\in\Pi(X)$. Given $\pi\in\Pi(X)$, $[\pi]$ denotes the stationary policy in $\Pi(X)^{\infty}$ constructed from $\pi$ such that $[\pi]_{m}=\pi$ for all $m\geq 1$.

Define the $h$-horizon value function $V^{\pi}_{h}$ of $\pi$ in $\Pi(X)^{H}$, $H\geq h$, such that

$$V^{\pi}_{h}(x)=E\left[\sum_{l=1}^{h}\gamma^{l-1}R(X_{l},\pi_{l}(X_{l}))\,\biggl|\,X_{1}=x\right],\quad x\in X,\ h\in[1,\infty),$$

where $X_{l}$ is a random variable that denotes the state at the level (of forecast) $l$ by following the $l$th-entry mapping $\pi_{l}$ of $\pi$, and a fixed discounting factor $\gamma$ is in $(0,1)$.

In the sequel, any operator is applied componentwise for the elements in $B(X)$ and in $\Pi(X)$, respectively. Given $\pi\in\Pi(X)^{h}$ and $x\in X$, $\pi(x)$ is set to be $(\pi_{1}(x),...,\pi_{h}(x))$, meaning the “$x$-coordinate” of $\pi$ here. As is well known then, there exists an optimal $h$-length policy $\pi^{*}(h)$ such that for any $x\in X$, $V^{\pi^{*}(h)}_{h}(x)=\max_{\pi\in\Pi(X)^{h}}V^{\pi}_{h}(x)=V^{*}_{h}(x)$, where $V^{*}_{h}$ is referred to as the optimal $h$-horizon value function. In particular, $\{V^{*}_{h},h\geq 1\}$ satisfies the optimality principle:

$$V^{*}_{h}(x)=\max_{a\in A(x)}\biggl(R(x,a)+\gamma\sum_{y\in X}P_{xy}^{a}V^{*}_{h-1}(y)\biggr),\quad x\in X,$$

for any fixed $V^{*}_{0}$ in $B(X)$. Furthermore, $V^{*}_{h}$ is equal to the function in $B(X)$ obtained after applying the value iteration (VI) operator $T:B(X)\rightarrow B(X)$ iteratively $h$ times with the initial function $V^{*}_{0}$: $T(\cdots T(T(T(V^{*}_{0}))))=T^{h}(V^{*}_{0})=V^{*}_{h}$, where

$$T(u)(x)=\max_{a\in A(x)}\biggl(R(x,a)+\gamma\sum_{y\in X}P_{xy}^{a}u(y)\biggr),\quad x\in X,\ u\in B(X).$$

This optimal substructure leads to a dynamic programming algorithm, backward induction, which computes $\{V^{*}_{h}\}$ off-line and returns an optimal $H$-horizon policy $\pi^{*}(H)$ that achieves the optimal value at any $x\in X$ for the $H$-horizon sub-MDP $M_{H}$ of $M_{\infty}$ by

$$\pi^{*}(H)_{H-h+1}(x)\in\mathop{\mathrm{arg\,max}}_{a\in A(x)}\left(R(x,a)+\gamma\sum_{y\in X}P_{xy}^{a}V^{*}_{h-1}(y)\right),\quad x\in X,\ h\in\{1,...,H\}.$$

Once obtained, the rolling $H$-horizon controller employs the first entry $\pi^{*}(H)_{1}$ of $\pi^{*}(H)$, or the stationary policy $[\pi^{*}(H)_{1}]$, over the system time: at each $k\geq 1$, $\pi^{*}(H)_{1}(x_{k})$ is taken at $x_{k}$.
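
For concreteness, this off-line computation can be sketched as follows. This is a minimal illustration rather than a prescribed implementation: the array layout (`R[x, a]` for rewards, `P[a]` for the transition matrix of action `a`) and the function names are assumptions made for the sketch, and $V^{*}_{0}$ is taken to be $0_{X}$.

```python
import numpy as np

def backward_induction(R, P, gamma, H):
    """Compute V*_H and an optimal H-length policy pi*(H) for the sub-MDP M_H
    via V*_h = T(V*_{h-1}) with V*_0 = 0_X.
    R: |X| x |A| reward array; P: |A| x |X| x |X| transition matrices."""
    num_X, num_A = R.shape
    V = np.zeros(num_X)                                    # V*_0 = 0_X
    pi_star = np.zeros((H, num_X), dtype=int)              # pi*(H)_1, ..., pi*(H)_H
    for h in range(1, H + 1):                              # levels h = 1, ..., H
        # Q[x, a] = R(x, a) + gamma * sum_y P^a_{xy} V*_{h-1}(y)
        Q = R + gamma * np.stack([P[a] @ V for a in range(num_A)], axis=1)
        pi_star[H - h] = Q.argmax(axis=1)                  # entry pi*(H)_{H-h+1}
        V = Q.max(axis=1)                                  # V*_h = T(V*_{h-1})
    return V, pi_star                                      # V = V*_H

def rolling_horizon_controller(pi_star):
    """Stationary rolling H-horizon controller [pi*(H)_1]:
    at each time k, take pi*(H)_1(x_k) at the current state x_k."""
    first_entry = pi_star[0]
    return lambda x_k: int(first_entry[x_k])
```

A call such as `V_H, pi_H = backward_induction(R, P, gamma, H)` followed by `act = rolling_horizon_controller(pi_H)` then yields the off-line rolling $H$-horizon controller described above.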

Given $\pi\in\Pi(X)^{\infty}$, $V^{\pi}_{\infty}$ denotes the value function of $\pi$ over the infinite horizon and it is obtained by letting $h$ approach infinity in $E[\sum_{l=1}^{h}\gamma^{l-1}R(X_{l},\pi_{l}(X_{l}))|X_{1}=x]$. The optimal value function $V^{*}_{\infty}$ of $M_{\infty}$ is then defined such that $V^{*}_{\infty}(x)=\sup_{\pi\in\Pi(X)^{\infty}}V^{\pi}_{\infty}(x)$, $x\in X$. A performance result for $[\pi^{*}(H)_{1}]$ applied to $M_{\infty}$ (see, e.g., [5]) is that the infinity-norm of the difference between the value function of $[\pi^{*}(H)_{1}]$ and $V^{*}_{\infty}$ is upper bounded by (an error of) $O(\gamma^{H}\|V^{*}_{\infty}-V^{*}_{0}\|_{\infty})$. The term $\|V^{*}_{\infty}-V^{*}_{0}\|_{\infty}$ can be loosely upper bounded by $C/(1-\gamma)$ with some constant $C$. Due to the dependence on $(1-\gamma)^{-1}$, the bound therefore worsens as $\gamma$ gets closer to one. Even though the error vanishes to zero exponentially fast in $H$ at the rate of $\gamma$, how to set $V^{*}_{0}$ is thus a critical issue in rolling horizon control.
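
As a rough, hedged illustration of this dependence (writing the hidden constant in the $O(\cdot)$ bound as some $c>0$, whose exact value depends on the statement in [5]), suppose $V^{*}_{0}=0_{X}$ and $|R(x,a)|\leq R_{\max}$ for all $x$ and $a$, so that $\|V^{*}_{\infty}-V^{*}_{0}\|_{\infty}\leq R_{\max}/(1-\gamma)$. Then

$$\bigl\|V^{[\pi^{*}(H)_{1}]}_{\infty}-V^{*}_{\infty}\bigr\|_{\infty}\;\leq\;c\,\gamma^{H}\,\|V^{*}_{\infty}-V^{*}_{0}\|_{\infty}\;\leq\;\frac{c\,\gamma^{H}R_{\max}}{1-\gamma},$$

so a target error of at most $\varepsilon$ is guaranteed once $H\geq\ln\bigl(cR_{\max}/(\varepsilon(1-\gamma))\bigr)/\ln(1/\gamma)$; the required $H$ grows only logarithmically in $1/\varepsilon$ but blows up as $\gamma\rightarrow 1$.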

This note develops an algorithm for on-line rolling $H$-horizon control. The sub-MDP $M_{H}$ is not solved in advance. Rather, with an arbitrarily selected $\pi_{1}(H)\in\Pi(X)^{H}$ for $M_{H}$, the algorithm generates a monotonically improving sequence $\{\pi_{k}(H)\}$ over time $k\geq 1$. To the algorithm, only $\pi_{k-1}(H)$ is available at $k>1$, and it updates $\pi_{k-1}(H)$ only at $x_{k}$ to obtain $\pi_{k}(H)$. Either we have $\pi_{k}=\pi_{k-1}$, or $\pi_{k}(x)=\pi_{k-1}(x)$ for all $x\in X\setminus\{x_{k}\}$ but $\pi_{k}(x_{k})\neq\pi_{k-1}(x_{k})$. The algorithm has a design flexibility in the aspect that a feedback of an action to be used at $x_{k}$ by some “supervisor” can be incorporated while generating $\pi_{k}(H)$. By setting $\phi_{k}=\pi_{k}(H)_{1}$ at each $k\geq 1$, a policy $\phi\in\Pi(X)^{\infty}$ is in effect built sequentially for the controller. Once $\phi_{k}$ is available to the controller, it takes $\phi_{k}(x_{k})$ to the system and the underlying system of $M_{\infty}$ moves to a next random state $x_{k+1}$ with the probability $P^{\phi_{k}(x_{k})}_{x_{k}x_{k+1}}$.

The behavior of such a control policy is discussed by focusing on the convergence issue in relation to the transition structure of $M_{\infty}$. We are concerned with the question of the existence of a finite time $K<\infty$ such that $\phi_{k}=\pi^{*}(H)_{1}$ for all $k>K$ for the infinite sequence $\{\phi_{k}\}$.

II Off-line Synchronous Policy Iteration with Policy Switching

To start with, we present an algorithm of off-line synchronous policy iteration (PI) combined with a multi-policy improvement method of policy switching for solving $M_{H}$. In what follows, we assume that $V^{\pi}_{0}=0_{X}$ for any $\pi\in\Pi(X)^{H}$ for simplicity. (The topic of how to set $V^{\pi}_{0}$ is beyond the scope of this note.)

Theorem II.1 (Theorem 2 [3])

Given a nonempty $\Delta\subseteq\Pi(X)^{H}$, construct policy switching with $\Delta$ in $\Pi(X)^{H}$ as $\pi_{\mathrm{ps}}(\Delta)$ such that for each possible pair of $x\in X$ and $h\in\{1,...,H\}$,

$$\pi_{\mathrm{ps}}(\Delta)_{h}(x)=\phi^{*}_{h}(x)\ \mbox{ where }\ \phi^{*}\in\mathop{\mathrm{arg\,max}}_{\phi\in\Delta}V^{\phi}_{H-h+1}(x).$$

Then $V^{\pi_{\mathrm{ps}}(\Delta)}_{H}\geq V^{\phi}_{H}$ for all $\phi\in\Delta$.
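
A direct (unoptimized) sketch of the construction in Theorem II.1 is given below, under the same hypothetical array layout as in the earlier sketch. Each $\phi\in\Delta$ is represented as an $H\times|X|$ integer array with `phi[m-1]` playing the role of $\phi_{m}$, and `finite_horizon_values` evaluates $V^{\phi}_{h}$, $h=0,...,H$, by the recursion $V^{\phi}_{h}(x)=R(x,\phi_{H-h+1}(x))+\gamma\sum_{y}P^{\phi_{H-h+1}(x)}_{xy}V^{\phi}_{h-1}(y)$ with $V^{\phi}_{0}=0_{X}$.

```python
import numpy as np

def finite_horizon_values(phi, R, P, gamma, H):
    """Return the list [V^phi_0, ..., V^phi_H] for an H-length policy phi,
    given as an H x |X| integer array with phi[m-1] = phi_m."""
    num_X = R.shape[0]
    V = [np.zeros(num_X)]                       # V^phi_0 = 0_X
    for h in range(1, H + 1):
        a = phi[H - h]                          # entry phi_{H-h+1}
        V_h = np.array([R[x, a[x]] + gamma * P[a[x], x] @ V[-1]
                        for x in range(num_X)])
        V.append(V_h)
    return V

def policy_switching(Delta, R, P, gamma, H):
    """Construct pi_ps(Delta): for each (x, h), copy the action phi*_h(x) of a
    policy phi* in Delta maximizing V^phi_{H-h+1}(x)."""
    values = [finite_horizon_values(phi, R, P, gamma, H) for phi in Delta]
    num_X = R.shape[0]
    pi_ps = np.zeros((H, num_X), dtype=int)
    for h in range(1, H + 1):
        for x in range(num_X):
            best = max(range(len(Delta)), key=lambda i: values[i][H - h + 1][x])
            pi_ps[h - 1, x] = Delta[best][h - 1, x]
    return pi_ps
```

By Theorem II.1, the returned policy is no worse over the horizon $H$ than every member of the given set $\Delta$.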

Given $\tilde{\pi}$ and $\pi$ in $\Pi(X)^{H}$, we say that $\tilde{\pi}$ strictly improves $\pi$ (over the horizon $H$) if there exists some $s\in X$ such that $V^{\tilde{\pi}}_{H}(s)>V^{\pi}_{H}(s)$ and $V^{\tilde{\pi}}_{H}\geq V^{\pi}_{H}$, in which case we write $\tilde{\pi}>_{H}\pi$. Define the switchable action set $S^{\pi}_{h}(x)$ of $\pi\in\Pi(X)^{H}$ at $x\in X$ for $h\in\{1,...,H\}$ as

$$S^{\pi}_{h}(x)=\biggl\{a\in A(x)\ \biggl|\ R(x,a)+\gamma\sum_{y\in X}P^{a}_{xy}V^{\pi}_{h-1}(y)>V^{\pi}_{h}(x)\biggr\}$$

and also the improvable-state set of $\pi$ for $h$ as

$$I^{\pi}_{h}=\{(h,x)\,|\,S^{\pi}_{h}(x)\neq\emptyset,\ x\in X\}.$$

Set $I^{\pi,H}=\bigcup_{h=1}^{H}I^{\pi}_{h}$. Observe that if $I^{\pi,H}=\emptyset$ for $\pi\in\Pi(X)^{H}$, then $\pi$ is an optimal $H$-length policy for $M_{H}$.
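
These sets can be computed directly from the finite-horizon values of $\pi$; a sketch under the same assumed conventions, with `finite_horizon_values` as in the earlier sketch:

```python
def switchable_actions(V_pi, R, P, gamma, h, x):
    """S^pi_h(x): actions a with R(x,a) + gamma * sum_y P^a_{xy} V^pi_{h-1}(y) > V^pi_h(x).
    V_pi is the list [V^pi_0, ..., V^pi_H], e.g., from finite_horizon_values."""
    num_A = R.shape[1]
    return [a for a in range(num_A)
            if R[x, a] + gamma * P[a, x] @ V_pi[h - 1] > V_pi[h][x]]

def improvable_state_set(V_pi, R, P, gamma, H):
    """I^{pi,H}: the union over h = 1..H of {(h, x) : S^pi_h(x) nonempty};
    it is empty if and only if pi is an optimal H-length policy for M_H."""
    num_X = R.shape[0]
    return {(h, x) for h in range(1, H + 1) for x in range(num_X)
            if switchable_actions(V_pi, R, P, gamma, h, x)}
```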

The following theorem provides a result for $M_{H}$ in analogy with the key single-policy improvement step (see, e.g., [7]) in PI for $M_{\infty}$. Because Banach’s fixed-point theorem is difficult to invoke in the finite-horizon case, unlike in the standard proof for the infinite-horizon case, we provide a proof for completeness.

Theorem II.2

Given $\pi\in\Pi(X)^{H}$ with $I^{\pi,H}\neq\emptyset$, construct $\tilde{\pi}\in\Pi(X)^{H}$ with any $I$ satisfying $\emptyset\subsetneq I\subseteq I^{\pi,H}$ such that $\tilde{\pi}_{H-h+1}(x)\in S^{\pi}_{h}(x)$ for all $(h,x)\in I$ and $\tilde{\pi}_{H-h+1}(x)=\pi_{H-h+1}(x)$ for all $(h,x)\in(\{1,...,H\}\times X)\setminus I$. Then $\tilde{\pi}>_{H}\pi$.

Proof:

The base case holds because $V^{\tilde{\pi}}_{0}=V^{\pi}_{0}$. For the induction step, assume that $V^{\tilde{\pi}}_{h-1}\geq V^{\pi}_{h-1}$. For all $x$ such that $\tilde{\pi}_{H-h+1}(x)=\pi_{H-h+1}(x)$,

\begin{align*}
V^{\tilde{\pi}}_{h}(x)&=R(x,\tilde{\pi}_{H-h+1}(x))+\gamma\sum_{y\in X}P^{\tilde{\pi}_{H-h+1}(x)}_{xy}V^{\tilde{\pi}}_{h-1}(y)\\
&\geq R(x,\pi_{H-h+1}(x))+\gamma\sum_{y\in X}P^{\pi_{H-h+1}(x)}_{xy}V^{\pi}_{h-1}(y)=V^{\pi}_{h}(x).
\end{align*}

On the other hand, for any $x\in X$ such that $\tilde{\pi}_{H-h+1}(x)\in S^{\pi}_{h}(x)$,

\begin{align*}
V^{\tilde{\pi}}_{h}(x)&=R(x,\tilde{\pi}_{H-h+1}(x))+\gamma\sum_{y\in X}P^{\tilde{\pi}_{H-h+1}(x)}_{xy}V^{\tilde{\pi}}_{h-1}(y)\\
&\geq R(x,\tilde{\pi}_{H-h+1}(x))+\gamma\sum_{y\in X}P^{\tilde{\pi}_{H-h+1}(x)}_{xy}V^{\pi}_{h-1}(y)\ \mbox{ by the induction hypothesis }V^{\tilde{\pi}}_{h-1}\geq V^{\pi}_{h-1}\\
&>R(x,\pi_{H-h+1}(x))+\gamma\sum_{y\in X}P^{\pi_{H-h+1}(x)}_{xy}V^{\pi}_{h-1}(y)=V^{\pi}_{h}(x)\ \mbox{ because }\tilde{\pi}_{H-h+1}(x)\in S^{\pi}_{h}(x).
\end{align*}

Putting the two cases together gives $V^{\tilde{\pi}}_{h}\geq V^{\pi}_{h}$. In particular, by our choice of $I$ there must exist some $y\in X$ such that $\tilde{\pi}_{H-h+1}(y)\in S^{\pi}_{h}(y)$, giving $V^{\tilde{\pi}}_{h}(y)>V^{\pi}_{h}(y)$. By an induction argument on the level from $h$, it then follows that $V^{\tilde{\pi}}_{H}(y)>V^{\pi}_{H}(y)$. ∎

The previous theorem states that if a policy is generated from a given $\pi$ by switching the action prescribed by $\pi$ at each improvable state, with its corresponding level, in a chosen nonempty subset of the improvable-state set of $\pi$, then the constructed policy strictly improves $\pi$ over the relevant finite horizon. However, in general, even if $\pi>_{H}\phi$ is known, for $\pi^{\prime}$ and $\phi^{\prime}$ obtained by the method of Theorem II.2, respectively, $\pi^{\prime}>_{H}\phi^{\prime}$ does not necessarily hold. (Note that this is also true for the infinite-horizon case.) It can merely be said that $\pi^{\prime}$ improves $\pi$ and $\phi^{\prime}$ improves $\phi$, respectively. Motivated by this, we consider the following. For a given $\pi$ in $\Pi(X)^{H}$, let $\beta^{\pi,H}$ be the set of all strictly better policies than $\pi$ obtainable from $I^{\pi,H}$: if $I^{\pi,H}=\emptyset$, then $\beta^{\pi,H}=\emptyset$. Otherwise,

\begin{align*}
\beta^{\pi,H}=\bigl\{\tilde{\pi}\in\Pi(X)^{H}\,|\,{}&I\in 2^{I^{\pi,H}}\setminus\{\emptyset\},\ \forall(h,x)\in I\ \ \tilde{\pi}_{H-h+1}(x)\in S^{\pi}_{h}(x)\\
&\mbox{ and }\forall(h,x)\in(\{1,...,H\}\times X)\setminus I\ \ \tilde{\pi}_{H-h+1}(x)=\pi_{H-h+1}(x)\bigr\}.
\end{align*}

Once obtained, policy switching with respect to $\beta^{\pi,H}$ is applied to find a single policy no worse than all policies in the set.

We are ready to derive an off-line synchronous algorithm, “policy iteration with policy switching” (PIPS), that generates a sequence of $H$-length policies for solving $M_{H}$: set arbitrarily $\pi_{1}\in\Pi(X)^{H}$. Loop with $n\in\{1,2,...\}$ until $I^{\pi_{n},H}=\emptyset$, where $\pi_{n+1}=\pi_{\mathrm{ps}}(\beta^{\pi_{n},H})$.

The convergence to an optimal $H$-length policy for $M_{H}$ is trivially guaranteed because $\pi_{n+1}>_{H}\pi_{n}$, $n\geq 1$, and both $X$ and $A$ are finite. Note that $\beta^{\pi_{n},H}$ in $\pi_{\mathrm{ps}}(\beta^{\pi_{n},H})$ can be substituted with any $\Delta_{n}\subseteq\Pi(X)^{H}$ as long as $\Delta_{n}\cap\beta^{\pi_{n},H}\neq\emptyset$. The generality of $\{\Delta_{n}\}$ then provides a broad design-flexibility for PIPS.
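
Exploiting this flexibility, the sketch below uses a tractable $\Delta_{n}$, namely the single-switch members of $\beta^{\pi_{n},H}$ (one improvable pair $(h,x)$ switched to one switchable action). This is only one admissible choice of $\{\Delta_{n}\}$, not the full set $\beta^{\pi_{n},H}$, and it reuses the helper routines sketched earlier.

```python
def single_switch_policies(pi, V_pi, R, P, gamma, H):
    """A tractable subset of beta^{pi,H}: for each improvable pair (h, x) and each
    action a in S^pi_h(x), the policy equal to pi except that entry H-h+1 maps x to a."""
    policies = []
    for (h, x) in sorted(improvable_state_set(V_pi, R, P, gamma, H)):
        for a in switchable_actions(V_pi, R, P, gamma, h, x):
            phi = pi.copy()
            phi[H - h, x] = a                  # switch entry pi_{H-h+1} at x
            policies.append(phi)
    return policies

def synchronous_pips(pi1, R, P, gamma, H):
    """Off-line synchronous PIPS: pi_{n+1} = pi_ps(Delta_n) with Delta_n the
    single-switch subset of beta^{pi_n,H}; stop when I^{pi_n,H} is empty."""
    pi = pi1.copy()
    while True:
        V_pi = finite_horizon_values(pi, R, P, gamma, H)
        Delta_n = single_switch_policies(pi, V_pi, R, P, gamma, H)
        if not Delta_n:                        # I^{pi_n,H} is empty: pi is optimal for M_H
            return pi
        pi = policy_switching(Delta_n, R, P, gamma, H)
```

Since $\Delta_{n}\subseteq\beta^{\pi_{n},H}$ and $\Delta_{n}\neq\emptyset$ whenever $I^{\pi_{n},H}\neq\emptyset$, the monotone improvement argument above still applies to this variant.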

The idea behind policy switching used in PIPS with $\beta^{\pi_{n},H}$ can be viewed as approximating the steepest ascent direction in a steepest ascent algorithm. At the current location $\pi_{n}$, we find ascent “directions” relative to $\pi_{n}$ over the local neighborhood $\beta^{\pi_{n},H}$. A steepest ascent direction, $\pi_{\mathrm{ps}}(\beta^{\pi_{n},H})$, is then obtained by “combining” all of the possible ascent directions. In particular, the greedy ascent direction $\phi$ that satisfies $T(V^{\pi_{n}}_{H-h})(x)=R(x,\phi_{h}(x))+\gamma\sum_{y\in X}P_{xy}^{\phi_{h}(x)}V^{\pi_{n}}_{H-h}(y)$ for all $x\in X$ and $h\in\{1,...,H\}$ is always included while combining.

III Off-line Asynchronous PIPS

An asynchronous version can be inferred from the synchronous PIPS by the following improvement result when a single $H$-length policy in $\Pi(X)^{H}$ is updated only at a single state:

Corollary III.1

Given $x\in X$ and $\pi\in\Pi(X)^{H}$, let $I^{\pi,H}_{x}=\{(h,x)\in I^{\pi,H}\,|\,h\in\{1,...,H\}\}$. Suppose that $I^{\pi,H}_{x}\neq\emptyset$. Then for any $\phi\in\beta^{\pi,H}_{x}$, $\phi>_{H}\pi$, where

\begin{align*}
\beta^{\pi,H}_{x}=\bigl\{\tilde{\pi}\in\Pi(X)^{H}\,|\,{}&I\in 2^{I^{\pi,H}_{x}}\setminus\{\emptyset\},\ \forall h\in\{1,...,H\}\mbox{ such that }(h,x)\in I\ \ \tilde{\pi}_{H-h+1}(x)\in S^{\pi}_{h}(x)\\
&\mbox{ and }\forall(h,x^{\prime})\in(\{1,...,H\}\times X)\setminus I\ \ \tilde{\pi}_{H-h+1}(x^{\prime})=\pi_{H-h+1}(x^{\prime})\bigr\}.
\end{align*}
Proof:

Because $\emptyset\neq I^{\pi,H}_{x}\subseteq I^{\pi,H}$, $\emptyset\neq\beta^{\pi,H}_{x}\subseteq\beta^{\pi,H}$. ∎

The set $\beta^{\pi,H}_{x}$, “projected to the $x$-coordinate direction” from $\beta^{\pi,H}$, contains all strictly better policies than $\pi$ that can be obtained by switching the action(s) prescribed by $\pi$ at the single state $x$. This result leads to an off-line convergent asynchronous PIPS for $M_{H}$: select $\pi_{1}\in\Pi(X)^{H}$ arbitrarily. Loop with $n\in\{1,2,...\}$: if $I^{\pi_{n},H}\neq\emptyset$, select $x_{n}\in I^{\pi_{n},H}$ and construct $\pi_{n+1}$ such that $\pi_{n+1}(x_{n})=\pi_{\mathrm{ps}}(\beta^{\pi_{n},H}_{x_{n}})(x_{n})$ and $\pi_{n+1}(x)=\pi_{n}(x)$ for all $x$ in $X\setminus\{x_{n}\}$. If $I^{\pi_{n},H}=\emptyset$, stop. Because $x_{n}$ is always selected to be an improvable state in $I^{\pi_{n},H}\neq\emptyset$, $\pi_{n+1}>_{H}\pi_{n}$ for all $n\geq 1$. Therefore, $\{\pi_{n}\}$ converges to an optimal $H$-length policy for $M_{H}$.
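
A sketch of one asynchronous update is given below, again reusing the helpers above; as in the synchronous sketch, the single-switch subset of $\beta^{\pi_{n},H}_{x_{n}}$ is used in place of the full set, which is an assumption of this sketch permitted by the remark after the synchronous algorithm.

```python
def asynchronous_pips_step(pi, x, R, P, gamma, H):
    """One asynchronous PIPS update of the H-length policy pi at state x:
    policy switching over single-switch members of beta^{pi,H}_x, with only the
    x-coordinate of the result copied into pi. Returns pi unchanged if x is
    not an improvable state of pi."""
    V_pi = finite_horizon_values(pi, R, P, gamma, H)
    candidates = []
    for h in range(1, H + 1):
        for a in switchable_actions(V_pi, R, P, gamma, h, x):
            phi = pi.copy()
            phi[H - h, x] = a                   # switch entry pi_{H-h+1} at x
            candidates.append(phi)
    if not candidates:                          # I^{pi,H}_x is empty
        return pi
    pi_next = pi.copy()
    pi_next[:, x] = policy_switching(candidates, R, P, gamma, H)[:, x]
    return pi_next
```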

Suppose that the state given at the current step of the previous algorithm is not guaranteed to be in the improvable-state set of the current policy. Such a scenario is possible with the following modified version: select $\pi_{1}\in\Pi(X)^{H}$ arbitrarily. Loop with $n\in\{1,2,...\}$: if $I^{\pi_{n},H}=\emptyset$, stop. Select $x_{n}\in X$. If $x_{n}\in I^{\pi_{n},H}$, then construct $\pi_{n+1}$ such that $\pi_{n+1}(x_{n})=\pi_{\mathrm{ps}}(\beta^{\pi_{n},H}_{x_{n}})(x_{n})$ and $\pi_{n+1}(x)=\pi_{n}(x)$ for all $x$ in $X\setminus\{x_{n}\}$. If $x_{n}\notin I^{\pi_{n},H}$, set $\pi_{n+1}=\pi_{n}$. Unlike the previous version, this algorithm’s convergence depends on the selected sequence $\{x_{n}\}$. Even though $\pi_{n+1}>_{H}\pi_{n}$ whenever $\pi_{n+1}\neq\pi_{n}$, the stopping condition that checks for optimality may never be satisfied. In other words, an infinite loop is possible. The immediate problem is then how to choose an update-state sequence that achieves a global convergence. The reason for bringing this issue up with the modified algorithm is that the situation is closely related to the on-line algorithm to be discussed in the next section. Dealing with the issue here helps in understanding the convergence behaviour of the on-line algorithm. We discuss a pedagogical example of choosing an update-state sequence for the modified off-line algorithm below.

One way of enforcing a global convergence is to “embed” backward induction into the update-state sequence. For example, we generate a sequence $\{x_{n}\}=\{x_{0},...,x_{n_{1}},...,x_{n_{2}},...,x_{n_{h}},....,x_{n_{H}},...\}$ whose subsequence $\{x_{n_{h}},h=1,...,H\}$ produces $\{\pi^{n_{h}}\}$ that solves $M_{h}$. We need to follow the optimality principle such that $M_{h-1}$ is solved before $M_{h}$, and so forth, until $M_{H}$ is finally solved. Therefore, the entries of $\pi^{*}(H)$ are searched from $\pi^{*}(H)_{H}$ to $\pi^{*}(H)_{1}$ over $\{x_{n}\}$ such that $\pi^{n_{1}}=(\pi^{*}(H)_{H},\pi^{n_{1}}_{2},...,\pi^{n_{1}}_{H-1},\pi^{n_{1}}_{H})$ where $V^{*}_{1}=V^{\pi^{n_{1}}}_{1}$, and then $\pi^{n_{2}}=(\pi^{*}(H)_{H-1},\pi^{*}(H)_{H},\pi^{n_{2}}_{3},...,\pi^{n_{2}}_{H})$ where $V^{*}_{2}=V^{\pi^{n_{2}}}_{2}$, ..., $\pi^{n_{h}}=(\pi^{*}(H)_{H-h+1},...,\pi^{*}(H)_{H},...,\pi^{n_{h}}_{H})$ where $V^{*}_{h}=V^{\pi^{n_{h}}}_{h}$, ..., and then finally, $\pi^{n_{H}}=(\pi^{*}(H)_{1},...,\pi^{*}(H)_{H-1},\pi^{*}(H)_{H})$ with $V^{*}_{H}=V^{\pi^{n_{H}}}_{H}$.

Once $M_{h-1}$ has been solved, an optimal $h$-length policy $\pi^{n_{h}}$ for $M_{h}$ can be found exhaustively. The corresponding update-state subsequence from $x_{n_{h-1}+1}$ to $x_{n_{h}}$ can be any permutation of the states in $X$. Visiting each $x$ in $X$ at least once for updating causes an optimal $h$-length policy for $M_{h}$ to be found because, if not empty, $\beta^{\pi^{m},H}_{x}$, where $m\in\{n_{h-1}+1,...,n_{h}\}$, includes an $H$-length policy whose first entry mapping maps $x$ to an action in $\mathop{\mathrm{arg\,max}}_{a\in A(x)}(R(x,a)+\gamma\sum_{y\in X}P_{xy}^{a}V^{*}_{h-1}(y))$. Even though visiting each state at least once makes the approach enumerative, our point is to show that there exists an update-state sequence that makes a global convergence possible.

IV On-line Asynchronous PIPS

We are now in the position of addressing the main subject of this note: solving $M_{\infty}$ within an on-line rolling $H$-horizon control by solving $M_{H}$ not in advance but over time. We assume that a sequence $\{\Delta_{k}\}$ is available, where $\Delta_{k}\subseteq\Pi(X)^{H}$, $k\geq 1$, stands for a set of supervisors at $k$. Any feedback of an action to take at a state can be represented by an element in $\Pi(X)^{H}$.

The controller applies to the system of $M_{\infty}$ a policy $\phi\in\Pi(X)^{\infty}$ built from the sequence of the $H$-length policies generated by the on-line asynchronous PIPS algorithm: PIPS first selects $\pi_{1}(H)\in\Pi(X)^{H}$ arbitrarily and sets $\phi_{1}=\pi_{1}(H)_{1}$. At $k>1$, if $\beta^{\pi_{k-1}(H),H}_{x_{k}}=\emptyset$, then $\pi_{k}(H)=\pi_{k-1}(H)$. Otherwise, PIPS updates $\pi_{k-1}(H)$ only at $x_{k}$ such that $\pi_{k}(H)(x_{k})=\pi_{\mathrm{ps}}(\beta^{\pi_{k-1}(H),H}_{x_{k}}\cup\Delta_{k})(x_{k})$ and $\pi_{k}(H)(x)=\pi_{k-1}(H)(x)$ for all $x$ in $X\setminus\{x_{k}\}$. After finishing the update, PIPS sets $\pi_{k}(H)_{1}$ as the $k$th entry $\phi_{k}$ of $\phi$. Once $\phi_{k}$ is available to the controller, $\phi_{k}(x_{k})$ is taken to the system and the system makes a random transition to $x_{k+1}$.
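
A sketch of one time step of this on-line update, under the same assumptions as the earlier sketches (the supervisor set $\Delta_{k}$ is passed as a list of $H\times|X|$ policy arrays, and single-switch members stand in for the full set $\beta^{\pi_{k-1}(H),H}_{x_{k}}$):

```python
def online_pips_step(pi_prev, x_k, Delta_k, R, P, gamma, H):
    """One on-line asynchronous PIPS step at time k: update pi_{k-1}(H) at the
    currently visited state x_k only, switching over (single-switch members of)
    beta^{pi_{k-1}(H),H}_{x_k} together with the supervisor set Delta_k.
    Returns (pi_k(H), phi_k), where phi_k = pi_k(H)_1."""
    V_prev = finite_horizon_values(pi_prev, R, P, gamma, H)
    candidates = []
    for h in range(1, H + 1):
        for a in switchable_actions(V_prev, R, P, gamma, h, x_k):
            phi = pi_prev.copy()
            phi[H - h, x_k] = a
            candidates.append(phi)
    if not candidates:                           # beta^{pi_{k-1}(H),H}_{x_k} is empty
        return pi_prev, pi_prev[0]
    pi_k = pi_prev.copy()
    switched = policy_switching(candidates + list(Delta_k), R, P, gamma, H)
    pi_k[:, x_k] = switched[:, x_k]
    return pi_k, pi_k[0]
```

The controller then takes `phi_k[x_k]`, observes the next state $x_{k+1}$ from the system, and repeats the step at time $k+1$.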

Before we present a general convergence result, an intuitive consequence about a global convergence from a sufficient condition related to the transition structure of $M_{\infty}$ is given below as a theorem. Note that $M_{\infty}$ is communicating if every Markov chain induced by fixing each stationary policy $[\pi]$, $\pi\in\Pi(X)$, in $M_{\infty}$ is communicating [6].

Theorem IV.1

Suppose that $M_{\infty}$ is communicating. Then for $\{\pi_{k}(H),k\geq 1\}$ generated by on-line asynchronous PIPS, there exists some $K<\infty$ such that $\pi_{k}(H)=\pi^{*}(H)$ for all $k>K$, where $V^{\pi^{*}(H)}_{H}=V^{*}_{H}$ in $M_{H}$, for any $\pi_{1}(H)\in\Pi(X)^{H}$ and $\{\Delta_{k}\}$ with $\Delta_{k}\subseteq\Pi(X)^{H}$, $k\geq 1$.

Proof:

Because both $X$ and $A$ are finite, $B(X)$ and $\Pi(X)^{H}$ are finite. For any given $\{\Delta_{k}\}$ and any $\pi_{1}(H)$, the monotonicity of $\{V^{\pi_{k}(H)}\}$ for $\{\pi_{k}(H)\}$ holds because $\{\pi_{k}(H)\}$ satisfies $\pi_{k}(H)>_{H}\pi_{k-1}(H)$ if $\beta^{\pi_{k-1}(H),H}_{x_{k}}\neq\emptyset$ and $V^{\pi_{k}(H)}\geq V^{\pi_{k-1}(H)}$ otherwise. The assumption that $M_{\infty}$ is communicating ensures that every state $x$ in $X$ is visited infinitely often within $\{x_{k}\}$. It follows that at some sufficiently large finite time $K$, $\beta^{\pi_{K}(H),H}=\emptyset$, implying $I^{\pi_{K}(H),H}=\emptyset$. Therefore, $\pi_{k}(H)=\pi^{*}(H)$ for all $k>K$, where $V^{\pi^{*}(H)}_{H}=V^{*}_{H}$ in $M_{H}$. ∎

The policy $\phi$ of the controller becomes stable in the sense that the sequence $\{\phi_{k}\}$ converges to $\pi^{*}(H)_{1}$. The question of whether checking that $M_{\infty}$ is communicating, without enumerating all stationary policies in $\Pi(X)^{\infty}$, is possible with polynomial time-complexity was raised in [6]. Unfortunately, this problem is in general NP-complete [8]. A simple and obvious sufficient condition for such connectivity is that $P^{a}_{xy}>0$ for all $x,y$ in $X$ and $a$ in $A(x)$. The key to the convergence here is which states in $X$ are visited “sufficiently often” by following $\{\phi_{k}\}$, which ensures that an optimal action at each such visited state is eventually found. The following result, stated in Theorem IV.2, reflects this rationale. Given a stationary policy $\phi\in\Pi(X)^{\infty}$, the connectivity relation $\chi^{\phi}$ is defined on $X$ from the Markov chain $M_{\infty}^{\phi}$ induced by fixing $\phi$ in $M_{\infty}$: if $x$ and $y$ in $X$ communicate with each other in $M_{\infty}^{\phi}$, then $(x,y)$ is an element of $\chi^{\phi}$. Given $x\in X$, the equivalence class of $x$ with respect to $\chi^{\phi}$ is denoted by $[x]_{\chi^{\phi}}$. Note that for any $x\neq y$, either $[x]_{\chi^{\phi}}=[y]_{\chi^{\phi}}$ or $[x]_{\chi^{\phi}}\cap[y]_{\chi^{\phi}}=\emptyset$. The collection of $[x]_{\chi^{\phi}}$, $x\in X$, partitions $X$.
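
The classes $[x]_{\chi^{\phi}}$ coincide with the communicating classes of the induced chain $M^{\phi}_{\infty}$ and can be obtained as the strongly connected components of the directed graph with an edge $x\rightarrow y$ whenever $P^{\phi(x)}_{xy}>0$. A sketch is given below; the SciPy routine is one possible choice and is not prescribed by the note.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def communicating_classes(phi, P, tol=0.0):
    """Equivalence classes [x]_{chi^phi} of the chain induced by the stationary
    policy [phi]: strongly connected components of the graph with an edge
    x -> y whenever P^{phi(x)}_{xy} > tol."""
    num_X = P.shape[1]
    adj = np.array([[1 if P[phi[x], x, y] > tol else 0 for y in range(num_X)]
                    for x in range(num_X)])
    n_comp, labels = connected_components(csr_matrix(adj), directed=True,
                                          connection='strong')
    return [set(np.flatnonzero(labels == c)) for c in range(n_comp)]
```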

Theorem IV.2

For any $\pi_{1}(H)\in\Pi(X)^{H}$ and any $\{\Delta_{k}\}$ with $\Delta_{k}\subseteq\Pi(X)^{H}$, $k\geq 1$, $\{\pi_{k}(H)\}$ generated by on-line asynchronous PIPS converges to some $\lambda(H)$ in $\Pi(X)^{H}$ such that for some $K<\infty$, $\pi_{k}(H)=\lambda(H)$ for all $k>K$. Furthermore, $\lambda(H)$ satisfies $V^{\lambda(H)}_{H}\geq V^{\pi}_{H}$ for all $\pi\in\bigcup_{x\in[x^{*}]_{\chi^{[\lambda(H)_{1}]}}}\beta^{[\lambda(H)_{1}],H}_{x}$, where $x^{*}$ is any visited state at $k>K$.

Proof:

By the same reasoning as in the proof of Theorem IV.1, $\{\pi_{k}(H)\}$ converges to an element $\lambda(H)$ in $\Pi(X)^{H}$ in a finite time $K$. Because every state $x$ in $[x^{*}]_{\chi^{[\lambda(H)_{1}]}}$ is visited infinitely often within $\{x_{k}\}$ for $k>K$, $I^{\lambda(H),H}_{x}=\emptyset$ for such $x$. No more improvement is possible at all states in $[x^{*}]_{\chi^{[\lambda(H)_{1}]}}$; otherwise, the convergence to $\lambda(H)$ would be contradicted. ∎

The theorem states that $\lambda(H)$ is “locally optimal” over $[x^{*}]_{\chi^{[\lambda(H)_{1}]}}$ in the sense that no more improvement is possible at all states in $[x^{*}]_{\chi^{[\lambda(H)_{1}]}}$.

We remark that the above local convergence result is different from that of Proposition 2.2 by Bertsekas [2]. In our case, a subset of $X$ in which every state is visited infinitely often is not assumed to be given in advance. The sequence of policies generated by on-line PIPS eventually converges to a policy, and Theorem IV.2 characterizes its local optimality with respect to the communicating classes, in which every state is visited infinitely often, induced by the policy. Note further that Bertsekas’ result is within the context of rolling infinite-horizon control.

Unfortunately, we cannot provide a useful result here about the performance of $\lambda(H)$ in Theorem IV.2, e.g., an upper bound on $\|V^{[\lambda(H)_{1}]}_{\infty}-V^{*}_{\infty}\|_{\infty}$, because it is difficult to bound $\|V^{\lambda(H)}_{H}-V^{*}_{H}\|_{\infty}$. The degree to which $\lambda(H)$ approximates $\pi^{*}(H)$ determines the performance of the rolling-horizon control policy built by the on-line asynchronous PIPS algorithm.

V Concluding Remarks

The off-line and on-line PIPS algorithms can play the role of frameworks for solving MDPs with supervisors. While the disposition of the algorithms (and their development) was done mainly from the perspective of theoretical soundness and results, both can easily be implemented by simulation if the MDP model is generative. In particular, for the on-line case, each $\pi\in\beta^{\pi_{k},H}_{x_{k}}\cup\Delta_{k}$ is simply followed (rolled out) over a relevant forecast-horizon (see, e.g., [1] for related approaches and references) in order to generate its sample paths. If $\beta^{\pi_{k},H}_{x_{k}}\cup\Delta_{k}$ is large, some policies from $\beta^{\pi_{k},H}_{x_{k}}$ and $\Delta_{k}$ can be sampled “randomly,” for example, without losing the improvement of $\pi_{k}$. A study on an actual implementation is important and is left as future work.
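
For instance, when only a generative model of the MDP is available, $V^{\pi}_{H}(x_{k})$ for each candidate $\pi$ can be estimated by rolling $\pi$ out over the forecast horizon; a hedged sketch in which the sampler `step(x, a)`, returning a next state drawn from $P^{a}_{x\cdot}$, is an assumed interface rather than part of the note:

```python
def rollout_value(pi, x0, R, step, gamma, H, num_paths=100):
    """Monte Carlo estimate of V^pi_H(x0): average discounted H-step return of
    sample paths that follow the entry mapping of pi at each forecast level,
    using a generative model step(x, a) that samples a next state."""
    total = 0.0
    for _ in range(num_paths):
        x, ret = x0, 0.0
        for l in range(H):
            a = pi[l][x]                   # action pi_{l+1}(x) at level l+1
            ret += (gamma ** l) * R[x, a]  # reward discounted by gamma^l
            x = step(x, a)
        total += ret
    return total / num_paths
```

Replacing the exact `finite_horizon_values` evaluation in the earlier sketches by such estimates is one way to obtain a simulation-based variant, at the cost of the estimation error discussed below.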

On-line PIPS also falls into the category of “learning” control. Essentially, $V^{*}_{0}$ can be thought of as initial knowledge of control for the system, e.g., as in the value function of a self-learned Go-playing policy of AlphaZero obtained from a neural-network based learning system. The monotonically improving value-function sequence generated by PIPS can be interpreted as learning, or obtaining better knowledge about, controlling the system.

There exists an algorithm, “adaptive multi-stage sampling” (AMS) [4], for $M_{H}$ whose random estimate converges to $V^{*}_{H}(x)$ for a given $x$, in the expected sense, as the number of samplings approaches infinity. AMS employed within the rolling-horizon control is closer to its original spirit, compared with on-line PIPS, because the value of $V^{*}_{H}(x_{k})$ is approximated at each visited $x_{k}$, as if solving $M_{H}$ in advance. In contrast with the PI-based approach here, the idea of AMS is to replace the inside of the maximization over the admissible action set in the $T$-operator of VI such that the maximum selection is done with a different utility for each action, or the “necessity” measure of sampling, which is estimated over a set of currently sampled next states with support from a certain upper-confidence bound that controls the degree of search by the action. Because the replacement is applied at each level while emulating the process of backward induction, AMS requires a recursive process in a depth-first-search manner that effectively builds a sampled tree whose size is exponential in $H$. It is therefore not surprising that, similar to the complexity of backward induction, AMS also has a (sample) complexity that depends exponentially on $H$. On the other hand, estimating the value of a policy, rather than the optimal value, is in general a much easier task by simulation. Generating random sample paths of a policy has a polynomial dependence on $H$. More importantly, it seems difficult to discuss the convergence behaviour of rolling-horizon AMS control because some characterization of a stochastic policy, due to random estimates of $V^{*}_{H}(x_{k})$ at each $k$, needs to be made with a finite sampling complexity. Arguably, this would be true for any algorithm that produces a random estimate of the optimal value, e.g., Monte-Carlo Tree Search (MCTS) used in AlphaGo or AlphaZero [1]. However, it is worthwhile to note that these algorithms’ output can act as feedback from a supervisor in the framework of on-line PIPS.

It can be checked that another multi-policy improvement method, parallel rollout [3], does not preserve the monotonicity property under asynchronous update when a set of more than one policy is applied to the method for the improvement. Even with synchronous update, the parallel-rollout approach requires estimating a double expectation for each action: one for the next-state distribution and another inside the value function (to be evaluated at the next state). In contrast, in policy switching a single expectation per policy needs to be estimated, leading to a lower simulation complexity.

References

  • [1] Bertsekas, D. P. (2022). Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific.
  • [2] Bertsekas, D. P. (2021). On-Line policy iteration for infinite horizon dynamic programming. arXiv:2106.00746.
  • [3] Chang, H. S., Givan, R. L., & Chong, E. K. P. (2004). Parallel rollout for online solution of partially observable Markov decision processes. Discrete Event Dynamic Systems, 14, 309–341.
  • [4] Chang, H. S., Fu, M., Hu, J., & Marcus, S. I. (2005). An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53, 126–139.
  • [5] Hernandez-Lerma, O. (1989). Adaptive Markov Control Processes. Springer-Verlag.
  • [6] Kallenberg, L.C.M. (2002). Classification Problems in MDPs. Markov Processes and Controlled Markov Chains, Hou Z., Filar J.A., Chen A. (eds), Springer, Boston, MA.
  • [7] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.
  • [8] Tsitsiklis, J. N. (2007). NP-Hardness of checking the unichain condition in average cost MDPs. Operations Research Letters, 35, 319–323.