On Supervised On-line Rolling-Horizon Control for Infinite-Horizon Discounted Markov Decision Processes
Abstract
This note revisits the rolling-horizon control approach to the problem of a Markov decision process (MDP) with the infinite-horizon discounted expected reward criterion. Distinguished from the classical value-iteration approach, we develop an asynchronous on-line algorithm based on policy iteration integrated with a multi-policy improvement method of policy switching. A sequence of monotonically improving solutions to the forecast-horizon sub-MDP is generated by updating the current solution only at the currently visited state, building in effect a rolling-horizon control policy for the MDP over the infinite horizon. Feedback from “supervisors,” if available, can also be incorporated while updating. We focus on the convergence issue in relation to the transition structure of the MDP. Depending on the structure, the algorithm achieves in a finite time either a global convergence to an optimal forecast-horizon policy or a local convergence to a “locally optimal” fixed policy.
Index Terms:
rolling horizon control, policy iteration, policy switching, Markov decision process

I Introduction
Consider the rolling horizon control (see, e.g., [5]) with a fixed finite forecast-horizon $H$ to the problem of a Markov decision process (MDP) $M$ with the infinite-horizon discounted expected reward criterion. At discrete time $t \in \{0,1,\dots\}$, the system is at a state $x_t$ in a finite state set $X$. If the controller of the system takes an action $a$ in $A(x_t)$ at $t$, then it obtains a reward of $R(x_t,a)$ from a reward function $R: X \times A \to \mathbb{R}$, where $A(x) \subseteq A$ denotes an admissible-action set of $x$ in a finite action set $A$. The system then makes a random transition to a next state $y$ by the probability of $P_{x_t y}(a)$ specified in an $|X| \times |X|$-matrix $P(a)$.
Let $B(X)$ be the set of all possible functions from $X$ to $\mathbb{R}$. The zero function in $B(X)$ is referred to as $\mathbf{0}$ where $\mathbf{0}(x)=0$ for all $x \in X$. Let also $\Pi$ be the set of all possible mappings from $X$ to $A$ where for any $\pi \in \Pi$, $\pi(x)$ is constrained to be in $A(x)$ for all $x \in X$. Let an $H$-length policy of the controller be an element in $\Pi^H$, the $H$-ary Cartesian product of $\Pi$, $H \ge 1$. That is, an $H$-length policy $\pi$ is an ordered tuple $(\pi_1,\dots,\pi_H)$ where the $i$th entry of $\pi$ is equal to $\pi_i \in \Pi$ and when $\pi_i$ is applied at $x$ in $X$, then the controller is supposed to look ahead of (or forecast) the remaining horizon of $H-i$ levels for control. An infinite-length policy is an element in the infinite Cartesian product of $\Pi$, denoted by $\Pi^\infty$, and referred to as just a policy. We say that a policy is stationary if $\pi_i = \phi$ for all $i$ for some $\phi \in \Pi$. Given $\phi \in \Pi$, $\phi^\infty$ denotes a stationary policy in $\Pi^\infty$ constructed from $\phi$ such that $(\phi^\infty)_i = \phi$ for all $i$.
Define the $H$-horizon value function $V^\pi_H \in B(X)$ of $\pi$ in $\Pi^H$ such that
$$V^\pi_H(x) = E\Bigl[\sum_{i=1}^{H} \gamma^{\,i-1} R\bigl(X^\pi_i, \pi_i(X^\pi_i)\bigr) \Bigm| X^\pi_1 = x\Bigr], \quad x \in X,$$
where $X^\pi_i$ is a random variable that denotes a state at the level $i$ (of forecast) by following the $i$th-entry mapping of $\pi$ and a fixed discounting factor $\gamma$ is in $(0,1)$.
In the sequel, any operator is applied componentwise for the elements in $B(X)$ and in $\Pi^H$, respectively. Given $V \in B(X)$ and $x \in X$, $V(x)$ is set to be the value of $V$ at $x$, meaning the “$x$-coordinate” of $V$ here. As is well known then, there exists an optimal $H$-length policy $\pi^*_H \in \Pi^H$ such that for any $\pi \in \Pi^H$, $V^{\pi^*_H}_H \ge V^\pi_H$, where $V^*_H := V^{\pi^*_H}_H$ is referred to as the optimal $H$-horizon value function. In particular, $V^*_H$ satisfies the optimality principle:
$$V^*_h(x) = \max_{a \in A(x)} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^*_{h-1}(y) \Bigr], \quad 1 \le h \le H,\ V^*_0 := \mathbf{0},$$
for any fixed $x$ in $X$, where $V^*_h$, $1 \le h < H$, denotes the optimal $h$-horizon value function defined analogously. Furthermore, $V^*_H$ is equal to the function in $B(X)$ obtained after applying the value iteration (VI) operator $T$ iteratively $H$ times with the initial function of $\mathbf{0}$: $V^*_H = T^H(\mathbf{0})$, where for $u \in B(X)$ and $x \in X$,
$$T(u)(x) = \max_{a \in A(x)} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, u(y) \Bigr].$$
This optimal substructure leads to a dynamic programming algorithm, backward induction, which computes $V^*_H$ off-line and returns an optimal $H$-horizon policy $\pi^*_H = (\pi^*_{H,1},\dots,\pi^*_{H,H})$ that achieves the optimal value at any $x \in X$ for the $H$-horizon sub-MDP $M_H$ of $M$ by
$$\pi^*_{H,i}(x) \in \arg\max_{a \in A(x)} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, T^{H-i}(\mathbf{0})(y) \Bigr], \quad 1 \le i \le H,\ x \in X.$$
Once obtained, the rolling $H$-horizon controller employs the first entry of $\pi^*_H$, or the stationary policy $(\pi^*_{H,1})^\infty$, over the system time: At each $t$, $\pi^*_{H,1}(x_t)$ is taken at $x_t$.
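As a concrete illustration, the following is a minimal sketch of backward induction and of the rolling $H$-horizon action selection just described, assuming the model is available as dense arrays R[x, a] and P[x, a, y] (illustrative names and layout, not prescribed by this note):

```python
import numpy as np

def backward_induction(R, P, gamma, H):
    """Compute V*_h = T^h(0) for h = 0..H and an optimal H-length policy.

    R: array of shape (n_states, n_actions), R[x, a] = reward.
    P: array of shape (n_states, n_actions, n_states), P[x, a, y] = transition prob.
    Returns V of shape (H+1, n_states) with V[h] = T^h(0), and pi of shape
    (H, n_states), where pi[i-1] is the i-th entry mapping (greedy w.r.t. T^{H-i}(0)).
    """
    n_states, n_actions = R.shape
    V = np.zeros((H + 1, n_states))
    for h in range(1, H + 1):
        Q = R + gamma * P @ V[h - 1]   # Q[x, a] = R(x, a) + gamma * sum_y P[x, a, y] * V[h-1][y]
        V[h] = Q.max(axis=1)
    pi = np.zeros((H, n_states), dtype=int)
    for i in range(1, H + 1):
        Q = R + gamma * P @ V[H - i]   # one-step lookahead with H - i levels remaining
        pi[i - 1] = Q.argmax(axis=1)
    return V, pi

def rolling_horizon_action(pi, x):
    """Rolling H-horizon control: apply the first entry of pi at the current state."""
    return pi[0][x]
```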
Given $\phi \in \Pi$, $V^{\phi^\infty} \in B(X)$ denotes the value function of $\phi^\infty$ over the infinite horizon and it is obtained by letting $H$ approach infinity in $V^{(\phi,\dots,\phi)}_H$. The optimal value function $V^*$ of $M$ is then defined such that $V^*(x) = \max_{\phi \in \Pi} V^{\phi^\infty}(x)$ for $x \in X$. A performance result of $(\pi^*_{H,1})^\infty$ applied to $M$ (see, e.g., [5]) is that the infinity-norm of the difference between the value function of $(\pi^*_{H,1})^\infty$ and $V^*$ is upper bounded by an error term that vanishes to zero exponentially fast in $H$ with the rate of $\gamma$, but that can only be loosely upper bounded with a constant factor involving $1/(1-\gamma)$. Due to this dependence, the performance guarantee worsens as $\gamma$ gets closer to one. How to set $H$ is therefore a critical issue in the rolling horizon control even though the error vanishes exponentially fast in $H$.
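For illustration only, suppose the error admits a bound of the form $C\gamma^{H}/(1-\gamma)$ for some constant $C > 0$; then a target accuracy $\epsilon$ translates into a horizon requirement
$$\frac{C\gamma^{H}}{1-\gamma} \le \epsilon \quad \Longleftrightarrow \quad H \;\ge\; \frac{\ln\bigl(C/(\epsilon(1-\gamma))\bigr)}{\ln(1/\gamma)}.$$
For example, with $C = 1$ and $\epsilon = 10^{-2}$, $\gamma = 0.95$ gives $H \ge \ln(2000)/\ln(1/0.95) \approx 148.2$, so $H = 149$ suffices, whereas $\gamma = 0.99$ already requires $H \ge \ln(10^{4})/\ln(1/0.99) \approx 916.4$, i.e., $H = 917$.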
This note develops an algorithm for on-line rolling $H$-horizon control. The sub-MDP $M_H$ is not solved in advance. Rather, with an arbitrarily selected $\pi^0 \in \Pi^H$ for $t=0$, the algorithm generates a monotonically improving sequence $\{\pi^t\}$ of $H$-length policies over time $t$. To the algorithm, only $x_t$ is available at $t$, and it updates $\pi^t$ only at $x_t$ to obtain $\pi^{t+1}$. Either we have that $\pi^{t+1}$ strictly improves $\pi^t$, or $\pi^{t+1}$ coincides with $\pi^t$; in both cases $\pi^{t+1}$ can differ from $\pi^t$ only at $x_t$. The algorithm has a design flexibility in the aspect that a feedback of an action to be used at $x_t$ by some “supervisor” can be incorporated while generating $\pi^{t+1}$. By setting $\phi_t = \pi^{t+1}_1$ at each $t$, a policy $(\phi_0,\phi_1,\dots)$ is in effect built sequentially for the controller. Once $\pi^{t+1}$ is available to the controller, it takes $\pi^{t+1}_1(x_t)$ to the system and the underlying system of $M$ moves to a next random state $x_{t+1}$ by the probability of $P_{x_t x_{t+1}}(\pi^{t+1}_1(x_t))$.
The behavior of such a control policy is discussed by focusing on the convergence issue in relation to the transition structure of $M$. We are concerned with the question of the existence of a finite time $T$ such that $\pi^t = \pi^T$ for all $t \ge T$ for the infinite sequence $\{\pi^t\}$.
II Off-line Synchronous Policy Iteration with Policy Switching
To start with, we present an algorithm of off-line synchronous policy iteration (PI) combined with a multi-policy improvement method of policy switching for solving $M_H$. In what follows, we assume that $A(x) = A$ for any $x \in X$ for simplicity. (The topic about how to set $H$ is beyond the scope of this note.) We also use, for $\pi \in \Pi^H$ and $0 \le h \le H$, the $h$-horizon value function $V^\pi_h \in B(X)$ of the last $h$ entries $(\pi_{H-h+1},\dots,\pi_H)$ of $\pi$, given recursively by $V^\pi_0 = \mathbf{0}$ and $V^\pi_h(x) = R(x,\pi_{H-h+1}(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_{H-h+1}(x))\,V^\pi_{h-1}(y)$ for $x \in X$; in particular, $V^\pi_H$ coincides with the $H$-horizon value function of $\pi$ defined in Section I.
Theorem II.1 (Theorem 2 [3])
Given a nonempty $\Lambda \subseteq \Pi^H$, construct the policy switching $\pi_{ps}$ with respect to $\Lambda$ in $\Pi^H$ such that for each possible pair of $i \in \{1,\dots,H\}$ and $x \in X$,
$$\pi_{ps,i}(x) = \pi^{i,x}_i(x), \qquad \pi^{i,x} \in \arg\max_{\pi \in \Lambda} V^\pi_{H-i+1}(x).$$
Then $V^{\pi_{ps}}_{H-i+1}(x) \ge \max_{\pi \in \Lambda} V^\pi_{H-i+1}(x)$ for all $i \in \{1,\dots,H\}$ and $x \in X$; in particular, $V^{\pi_{ps}}_H \ge \max_{\pi \in \Lambda} V^\pi_H$.
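The construction of Theorem II.1 can be transcribed directly, for instance as in the sketch below, which assumes the same dense-array layout as the earlier sketch and computes, for each candidate policy, the values $V^\pi_h$ of its last $h$ entries (“tail values”) for $h = 0,\dots,H$:

```python
import numpy as np

def policy_switching(R, P, gamma, policies):
    """Construct pi_ps from a nonempty list of H-length policies: at each pair
    (i, x), copy the i-th entry of a policy whose value-to-go from level i at x
    is maximal (a transcription of the construction in Theorem II.1)."""
    policies = np.asarray(policies)          # shape (m, H, n_states)
    m, H, n_states = policies.shape
    xs = np.arange(n_states)
    # tails[k][h] = h-horizon value of the last h entries of policies[k]
    tails = np.zeros((m, H + 1, n_states))
    for k in range(m):
        for h in range(1, H + 1):
            a = policies[k, H - h]           # entry i = H - h + 1, applied with h levels to go
            tails[k, h] = R[xs, a] + gamma * (P[xs, a] @ tails[k, h - 1])
    pi_ps = np.zeros((H, n_states), dtype=int)
    for i in range(1, H + 1):
        h = H - i + 1                        # horizon-to-go from level i
        best = tails[:, h, :].argmax(axis=0) # index of a best policy per state
        pi_ps[i - 1] = policies[best, i - 1, xs]
    return pi_ps
```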
Given $\pi$ and $\pi'$ in $\Pi^H$, we say that $\pi'$ strictly improves $\pi$ (over the horizon $H$) if there exists some pair of $h \in \{1,\dots,H\}$ and $x \in X$ such that $V^{\pi'}_h(x) > V^\pi_h(x)$ and $V^{\pi'}_h \ge V^\pi_h$ for all $h$, in which case we write $\pi' \succ \pi$. Define the switchable action set of $\pi$ at $x$ for $i \in \{1,\dots,H\}$ as
$$A_i(\pi,x) = \Bigl\{ a \in A(x) : R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^\pi_{H-i}(y) > V^\pi_{H-i+1}(x) \Bigr\},$$
and also the improvable-state set of $\pi$ for $i$ as
$$I_i(\pi) = \bigl\{ x \in X : A_i(\pi,x) \ne \emptyset \bigr\}.$$
Set $I(\pi) = \bigcup_{i=1}^{H}\{(x,i) : x \in I_i(\pi)\}$. Observe that if $I(\pi) = \emptyset$ for $\pi \in \Pi^H$, then $\pi$ is an optimal $H$-length policy for $M_H$.
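A minimal sketch, under the same assumed array layout as the earlier sketches, evaluates the tail value functions $V^\pi_h$ of an $H$-length policy and collects the switchable action sets and the improvable pairs (a small numerical tolerance stands in for the strict inequality):

```python
import numpy as np

def tail_values(R, P, gamma, pi):
    """V[h] = value function of the last h entries of the H-length policy pi
    (V[0] = 0, V[H] = the H-horizon value function of pi)."""
    H, n_states = pi.shape
    xs = np.arange(n_states)
    V = np.zeros((H + 1, n_states))
    for h in range(1, H + 1):
        a = pi[H - h]                       # entry i = H - h + 1, applied with h levels to go
        V[h] = R[xs, a] + gamma * (P[xs, a] @ V[h - 1])
    return V

def improvable_pairs(R, P, gamma, pi, tol=1e-12):
    """Return A_i(pi, x) for every pair and the set I(pi) of improvable pairs (x, i)."""
    H, n_states = pi.shape
    V = tail_values(R, P, gamma, pi)
    switchable, I = {}, []
    for i in range(1, H + 1):
        h = H - i + 1                       # horizon-to-go at entry i
        Q = R + gamma * P @ V[h - 1]        # Q[x, a] = one-step lookahead at entry i
        for x in range(n_states):
            better = np.flatnonzero(Q[x] > V[h][x] + tol)
            if better.size > 0:
                switchable[(x, i)] = better
                I.append((x, i))
    return switchable, I
```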
The following theorem provides a result for $M_H$ in analogy with the key step of the single-policy improvement (see, e.g., [7]) in PI for $M$. Because Banach’s fixed-point theorem is difficult to invoke in the finite-horizon case, unlike in the standard proof for the infinite-horizon case, we provide a proof for completeness.
Theorem II.2
Given $\pi \in \Pi^H$ with $I(\pi) \ne \emptyset$, construct $\pi'$ with any nonempty $I$ satisfying $I \subseteq I(\pi)$ such that $\pi'_i(x) \in A_i(\pi,x)$ for all $(x,i) \in I$ and $\pi'_i(x) = \pi_i(x)$ for all $(x,i) \notin I$. Then $\pi' \succ \pi$.
Proof:
The base case holds because $V^{\pi'}_0 = V^\pi_0 = \mathbf{0}$. For the induction step on the horizon $h \in \{1,\dots,H\}$ with $i = H-h+1$, assume that $V^{\pi'}_{h-1} \ge V^\pi_{h-1}$. For all $x$ such that $(x,i) \in I$,
$$V^{\pi'}_h(x) = R(x,\pi'_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi'_i(x))\, V^{\pi'}_{h-1}(y) \ge R(x,\pi'_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi'_i(x))\, V^\pi_{h-1}(y) > V^\pi_h(x),$$
where the last inequality holds because $\pi'_i(x) \in A_i(\pi,x)$.
On the other hand, for any $x$ such that $(x,i) \notin I$,
$$V^{\pi'}_h(x) = R(x,\pi_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_i(x))\, V^{\pi'}_{h-1}(y) \ge R(x,\pi_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_i(x))\, V^\pi_{h-1}(y) = V^\pi_h(x).$$
Putting the two cases together makes $V^{\pi'}_h \ge V^\pi_h$. In particular, we see that there must exist some pair of $h$ and $x$ such that $V^{\pi'}_h(x) > V^\pi_h(x)$ because of our choice of $I$, having $I \ne \emptyset$. By an induction argument on the level from $H$ down to $1$ then, it follows that $\pi' \succ \pi$. ∎
The previous theorem states that if a policy was generated from a given $\pi$ by switching the action prescribed by $\pi$ at each improvable state, with its corresponding level, in a particularly chosen nonempty subset of the improvable-state set of $\pi$, then the policy constructed strictly improves $\pi$ over the relevant finite horizon. However, in general, even if $I(\pi)$ is known, for $\pi'$ and $\pi''$ obtained by the method of Theorem II.2 with two different subsets of $I(\pi)$, respectively, $\pi' \succ \pi''$ or $\pi'' \succ \pi'$ does not hold necessarily. (Note that this is also true for the infinite-horizon case.) It can be merely said that $\pi'$ improves $\pi$ and $\pi''$ improves $\pi$, respectively. Motivated by this, we consider the following. For a given $\pi$ in $\Pi^H$, let $\Delta(\pi)$ be the set of all strictly better policies than $\pi$ obtainable from $I(\pi)$: If $I(\pi) = \emptyset$, $\Delta(\pi) = \{\pi\}$. Otherwise,
$$\Delta(\pi) = \bigl\{ \pi' \in \Pi^H : \pi' \text{ is constructed by the method of Theorem II.2 with some nonempty } I \subseteq I(\pi) \bigr\}.$$
Once $\Delta(\pi)$ is obtained, policy switching with respect to $\Delta(\pi)$ is applied to find a single policy no worse than all policies in the set.
We are ready to derive an off-line synchronous algorithm, “policy iteration with policy switching” (PIPS), that generates a sequence $\{\pi^k\}$ of $H$-length policies for solving $M_H$: Set $\pi^0 \in \Pi^H$ arbitrarily. Loop with $k = 0,1,2,\dots$ until $I(\pi^k) = \emptyset$, where $\pi^{k+1}$ is the policy obtained by applying policy switching with respect to $\Delta(\pi^k)$.
The convergence to an optimal $H$-length policy for $M_H$ is trivially guaranteed because $\pi^{k+1} \succ \pi^k$ whenever $I(\pi^k) \ne \emptyset$, and both $X$ and $A$ are finite. Note that $\Delta(\pi^k)$ in the loop can be substituted with any nonempty $\Lambda^k$ as long as $\Lambda^k \subseteq \Delta(\pi^k)$ for each $k$. The generality of the choice of $\Lambda^k$ then provides a broad design-flexibility of PIPS.
The idea behind policy switching used in PIPS with $\Delta(\pi^k)$ can be attributed to approximating the steepest ascent direction while applying the steepest ascent algorithm. At the current location $\pi^k$, we find ascent “directions” relative to $V^{\pi^k}_H$ over the local neighborhood $\Delta(\pi^k)$ of $\pi^k$. A steepest ascent direction, $\pi^{k+1}$, is then obtained by “combining” all of the possible ascent directions. In particular, the greedy ascent direction $\pi_g$ that satisfies that $\pi_{g,i}(x) \in \arg\max_{a \in A(x)} [R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^{\pi^k}_{H-i}(y)]$ for all $(x,i) \in I(\pi^k)$ and $\pi_{g,i}(x) = \pi^k_i(x)$ otherwise is always included while combining.
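As an illustration of this view, the following sketch runs synchronous PIPS with the singleton choice $\Lambda^k = \{\pi_g\}$ consisting of the greedy ascent direction only, which is permitted by the remark above; it assumes the same array layout as the earlier sketches:

```python
import numpy as np

def greedy_direction(R, P, gamma, pi, tol=1e-12):
    """The greedy ascent direction in Delta(pi): at every improvable pair (x, i),
    switch to an action maximizing the one-step lookahead; keep pi elsewhere."""
    H, n_states = pi.shape
    xs = np.arange(n_states)
    V = np.zeros((H + 1, n_states))          # tail values of pi
    for h in range(1, H + 1):
        a = pi[H - h]
        V[h] = R[xs, a] + gamma * (P[xs, a] @ V[h - 1])
    new_pi, improved = pi.copy(), False
    for i in range(1, H + 1):
        h = H - i + 1
        Q = R + gamma * P @ V[h - 1]
        best = Q.argmax(axis=1)
        gain = Q[xs, best] > V[h] + tol      # improvable states for entry i
        new_pi[i - 1, gain] = best[gain]
        improved = improved or bool(gain.any())
    return new_pi, improved

def synchronous_pips(R, P, gamma, H):
    """PIPS with Lambda^k = {greedy ascent direction}: iterate until I(pi^k) is empty."""
    n_states, _ = R.shape
    pi = np.zeros((H, n_states), dtype=int)  # arbitrary initial H-length policy
    while True:
        pi, improved = greedy_direction(R, P, gamma, pi)
        if not improved:
            return pi
```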
III Off-line Asynchronous PIPS
An asynchronous version can be inferred from the synchronous PIPS by the following improvement result when a single $H$-length policy in $\Pi^H$ is updated only at a single state:
Corollary III.1
Given $\pi \in \Pi^H$ and $x \in X$, let $I_x(\pi) = \{(y,i) \in I(\pi) : y = x\}$. Suppose that $I_x(\pi) \ne \emptyset$. Then $\pi' \succ \pi$ for any $\pi' \in \Delta_x(\pi)$, where
$$\Delta_x(\pi) = \bigl\{ \pi' \in \Pi^H : \pi' \text{ is constructed by the method of Theorem II.2 with some nonempty } I \subseteq I_x(\pi) \bigr\}.$$
Proof:
Because any nonempty $I \subseteq I_x(\pi)$ is also a nonempty subset of $I(\pi)$, $\Delta_x(\pi) \subseteq \Delta(\pi)$ and Theorem II.2 applies. ∎
The set $\Delta_x(\pi)$, “projected to the $x$-coordinate direction” from $\Delta(\pi)$, contains all strictly better policies than $\pi$ that can be obtained by switching the action(s) prescribed by $\pi$ at the single state $x$. This result leads to an off-line convergent asynchronous PIPS for $M_H$: Select $\pi^0 \in \Pi^H$ arbitrarily. Loop with $k = 0,1,2,\dots$: If $I(\pi^k) \ne \emptyset$, select $x^k \in \{x \in X : I_x(\pi^k) \ne \emptyset\}$ and construct $\pi^{k+1}$ by policy switching with respect to $\Delta_{x^k}(\pi^k)$ such that $\pi^{k+1}$ possibly differs from $\pi^k$ only at $x^k$ and $\pi^{k+1}_i(y) = \pi^k_i(y)$ for all $i$ and all $y \ne x^k$ in $X$. If $I(\pi^k) = \emptyset$, stop. Because $x^k$ is always selected to be an improvable state in $\{x : I_x(\pi^k) \ne \emptyset\}$, $\pi^{k+1} \succ \pi^k$ for all $k$ before termination. Therefore, $\{\pi^k\}$ converges to an optimal $H$-length policy for $M_H$.
Suppose that the state given at the current step of the previous algorithm is not guaranteed to be in the improvable-state set of the current policy. Such a scenario is possible with the following modified version: Select $\pi^0 \in \Pi^H$ arbitrarily. Loop with $k = 0,1,2,\dots$: If $I(\pi^k) = \emptyset$, stop. Select $x^k \in X$. If $I_{x^k}(\pi^k) \ne \emptyset$, then construct $\pi^{k+1}$ by policy switching with respect to $\Delta_{x^k}(\pi^k)$ such that $\pi^{k+1}_i(y) = \pi^k_i(y)$ for all $i$ and all $y \ne x^k$ in $X$. If $I_{x^k}(\pi^k) = \emptyset$, $\pi^{k+1} = \pi^k$. Unlike the previous version, this algorithm’s convergence depends on the sequence $\{x^k\}$ selected. Even if $I(\pi^k) \ne \emptyset$, when no state in $\{x : I_x(\pi^k) \ne \emptyset\}$ is ever selected afterwards, the stopping condition that checks for the optimality can never be satisfied. In other words, an infinite loop is possible. The immediate problem is then how to choose an update-state sequence to achieve a global convergence. The reason for bringing this issue up with the modified algorithm is that the situation is closely related to the on-line algorithm to be discussed in the next section. Dealing with this issue here helps in understanding the convergence behaviour of the on-line algorithm. We discuss a pedagogical example of choosing an update-state sequence for the modified off-line algorithm below.
One way of enforcing a global convergence is to “embed” backward induction into the update-state sequence. For example, we generate a sequence $\{x^k\}$ whose induced subsequence of $\{\pi^k\}$ solves $M_H$. We need to follow the optimality principle such that the 1-horizon problem is solved before the 2-horizon problem, and so forth, until the $H$-horizon problem is finally solved. Therefore, the entries of the current policy are searched from the $H$th entry down to the first entry over $X$: first $\pi_H$ is fixed so that it achieves $V^*_1$, where $V^*_1 = T(\mathbf{0})$, and then $\pi_{H-1}$ so that it achieves $V^*_2$, where $V^*_2 = T^2(\mathbf{0})$, and so on, and then finally $\pi_1$ so that it achieves $V^*_H$, with $V^*_H = T^H(\mathbf{0})$.
Once the $h$-horizon problem has been solved, an optimal choice of the entry for the $(h+1)$-horizon problem can be found exhaustively. The corresponding update-state subsequence can be any permutation of the states in $X$. Visiting each $x$ in $X$ at least once for updating causes an optimal tail policy for the $(h+1)$-horizon problem to be found because, if not empty, $\Delta_x(\pi^k)$, where $\pi^k$ is the current policy, includes an $H$-length policy whose relevant entry mapping maps $x$ to an action in $\arg\max_{a \in A(x)}[R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^*_h(y)]$. Even though visiting each state at least once makes the approach enumerative, our point is showing that there exists an update-state sequence that makes a global convergence possible.
IV On-line Asynchronous PIPS
We are now at the position of the main subject of this note about solving $M$ within an on-line rolling $H$-horizon control by solving $M_H$ not in advance but over time. We assume that a sequence $\{S_t\}$ is available where $S_t \subseteq \Pi$, $t \ge 0$, stands for a set of supervisors at $t$. Any feedback of an action to take at a state can be represented by an element in $\Pi$.
The controller applies a policy to the system of $M$ built from the sequence $\{\pi^t\}$ of the $H$-length policies generated by the on-line asynchronous PIPS algorithm: PIPS first selects $\pi^0 \in \Pi^H$ arbitrarily and sets $t = 0$. At $t$, if $I_{x_t}(\pi^t) = \emptyset$, $\pi^{t+1} = \pi^t$. Otherwise, PIPS updates $\pi^t$ only at $x_t$ by applying policy switching with respect to $\Delta_{x_t}(\pi^t)$, where any action suggested by the supervisors in $S_t$ for $x_t$ is incorporated whenever it is switchable, such that $\pi^{t+1}_i(y) = \pi^t_i(y)$ for all $i$ and all $y \ne x_t$ in $X$. After finishing the update, PIPS sets $a_t$ to the first entry of $\pi^{t+1}$ at $x_t$, i.e., $a_t = \pi^{t+1}_1(x_t)$. Once $a_t$ is available to the controller, $a_t$ is taken to the system and the system makes a random transition to $x_{t+1}$.
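A minimal sketch of this on-line loop is given below, under the same assumed model layout. For simplicity it evaluates the current policy exactly from the model (a simulation-based estimate could be substituted), samples only a few candidate actions at the visited state rather than enumerating $A(x_t)$, and represents the supervisors by a hypothetical callable supervisors(t) returning state-to-action maps; a suggested action is adopted only when it is switchable, i.e., strictly improving, so monotonicity is preserved:

```python
import numpy as np

def online_pips(R, P, gamma, H, x0, num_steps, supervisors=None, n_sampled=3, rng=None):
    """On-line asynchronous PIPS (sketch): update the current H-length policy
    only at the visited state, then apply its first-entry action to the system."""
    rng = np.random.default_rng() if rng is None else rng
    n_states, n_actions = R.shape
    xs = np.arange(n_states)
    pi = np.zeros((H, n_states), dtype=int)       # arbitrary initial policy pi^0
    x, history = x0, []
    for t in range(num_steps):
        # tail values of the current policy (exact evaluation from the model)
        V = np.zeros((H + 1, n_states))
        for h in range(1, H + 1):
            a = pi[H - h]
            V[h] = R[xs, a] + gamma * (P[xs, a] @ V[h - 1])
        # candidate switch actions at the visited state: a few sampled actions
        # plus supervisor feedback, if any (enumerating all of A(x) would
        # recover the exhaustive single-state update)
        cand = {int(a) for a in rng.choice(n_actions, size=min(n_sampled, n_actions), replace=False)}
        if supervisors is not None:
            cand |= {int(sigma[x]) for sigma in supervisors(t)}
        # switch entry i at x only when a candidate is switchable (strictly better)
        for i in range(1, H + 1):
            h = H - i + 1
            q = R[x] + gamma * (P[x] @ V[h - 1])  # q[a], one-step lookahead at x
            best = max(cand, key=lambda a: q[a])
            if q[best] > V[h][x] + 1e-12:
                pi[i - 1, x] = best
        # act with the first entry of the updated policy; the system transitions
        a_t = pi[0, x]
        history.append((x, a_t, R[x, a_t]))
        x = int(rng.choice(n_states, p=P[x, a_t]))
    return pi, history
```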
Before we present a general convergence result, an intuitive consequence about a global convergence from a sufficient condition related to the transition structure of $M$ is given below as a theorem. Note that $M$ is communicating if every Markov chain induced by fixing each stationary policy in $\Pi^\infty$ is communicating [6].
Theorem IV.1
Suppose that $M$ is communicating. Then for $\{\pi^t\}$ generated by on-line asynchronous PIPS, there exists some finite $T$ such that $\pi^t = \pi^T$ for all $t \ge T$, where $\pi^T$ is an optimal $H$-length policy in $\Pi^H$ for any $\pi^0$ and any $\{S_t\}$, and where $V^{\pi^T}_H = V^*_H$.
Proof:
Because both $X$ and $A$ are finite, $\Pi$ and $\Pi^H$ are finite. For any given $t$ and any $S_t$, the monotonicity of $\{V^{\pi^t}_H\}$ in $t$ holds because $\pi^{t+1}$ satisfies that $\pi^{t+1} \succ \pi^t$ if $I_{x_t}(\pi^t) \ne \emptyset$ and $\pi^{t+1} = \pi^t$ otherwise. The assumption that $M$ is communicating ensures that every state in $X$ is visited infinitely often within $\{x_t\}$. It follows that at some sufficiently large finite time $T$, $I(\pi^T) = \emptyset$, implying $V^{\pi^T}_H = V^*_H$. Therefore, $\pi^t = \pi^T$ for all $t \ge T$, where $\pi^T$ is an optimal $H$-length policy in $\Pi^H$. ∎
The policy of the controller becomes stable in the sense that the sequence $\{\pi^t\}$ converges to $\pi^T$ in a finite time. The question of whether checking that $M$ is communicating, without enumerating all stationary policies in $\Pi^\infty$, is possible with a polynomial time-complexity was raised in [6]. Unfortunately, this problem is in general NP-hard [8]. A simple and obvious sufficient condition for such connectivity is that $P_{xy}(a) > 0$ for all $x, y$ in $X$ and $a$ in $A$. The key to the convergence here is which states in $X$ are visited “sufficiently often” by following the built policy, so as to ensure that an optimal action at each visited state is eventually found. The following result stated in Theorem IV.2 reflects this rationale. Given a stationary policy $\phi^\infty$ with $\phi \in \Pi$, the connectivity relation $\leftrightarrow_\phi$ is defined on $X$ from the Markov chain induced by fixing $\phi^\infty$ in $M$: If $x$ and $y$ in $X$ communicate with each other in the chain, $(x,y)$ is an element of $\leftrightarrow_\phi$. Given $x \in X$, the equivalence class of $x$ with respect to $\leftrightarrow_\phi$ is denoted by $C_\phi(x)$. Note that for any $x, y \in X$, either $C_\phi(x) = C_\phi(y)$ or $C_\phi(x) \cap C_\phi(y) = \emptyset$. The collection $\{C_\phi(x) : x \in X\}$ partitions $X$.
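The classes $C_\phi(x)$ just introduced can be computed, for instance, from the positive-probability transition graph of the induced chain, as in the following sketch (same assumed array layout; phi[x] is the action prescribed at x):

```python
import numpy as np

def communicating_classes(P, phi):
    """Partition X into communicating classes of the Markov chain induced by
    fixing the stationary policy phi (phi[x] = action at x) in the model P."""
    n_states = P.shape[0]
    # one-step positive-probability edges of the induced chain
    step = P[np.arange(n_states), phi] > 0.0
    # reach[x, y]: y reachable from x in zero or more steps (repeated squaring)
    reach = np.eye(n_states, dtype=bool) | step
    for _ in range(max(1, int(np.ceil(np.log2(n_states))))):
        reach = reach | ((reach.astype(np.int64) @ reach.astype(np.int64)) > 0)
    classes, assigned = [], np.full(n_states, -1)
    for x in range(n_states):
        if assigned[x] >= 0:
            continue
        cls = [y for y in range(n_states) if reach[x, y] and reach[y, x]]
        for y in cls:
            assigned[y] = len(classes)
        classes.append(cls)
    return classes
```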
Theorem IV.2
For any $\pi^0 \in \Pi^H$ and any $\{S_t\}$ where $S_t \subseteq \Pi$, $\{\pi^t\}$ generated by on-line asynchronous PIPS converges to some $\bar\pi$ in $\Pi^H$ such that for some finite $T$, $\pi^t = \bar\pi$ for all $t \ge T$. Furthermore, $\bar\pi$ satisfies that $I_x(\bar\pi) = \emptyset$ for all $x \in C_{\bar\pi_1}(x_t)$, where $x_t$ is any visited state at a sufficiently large $t \ge T$.
Proof:
By the same reasoning as in the proof of Theorem IV.1, $\{\pi^t\}$ converges to an element $\bar\pi$ in $\Pi^H$ in a finite time $T$. Because every state in $C_{\bar\pi_1}(x_t)$ is visited infinitely often within $\{x_t\}$ for sufficiently large $t \ge T$, $I_x(\bar\pi) = \emptyset$ for all $x$ in $C_{\bar\pi_1}(x_t)$ for such $t$. No more improvement is possible at all states in $C_{\bar\pi_1}(x_t)$. Otherwise, it contradicts the convergence to $\bar\pi$. ∎
The theorem states that $\bar\pi$ is “locally optimal” over $C_{\bar\pi_1}(x_t)$ in the sense that no more improvement is possible at all states in $C_{\bar\pi_1}(x_t)$.
We remark that the above local convergence result is different from that of Proposition 2.2 by Bertsekas [2]. In our case, a subset of $X$ in which every state is visited infinitely often is not assumed to be given in advance. The sequence of policies generated by on-line PIPS will eventually converge to a policy, and Theorem IV.2 characterizes its local optimality with respect to the communicating class, induced by the converged policy, in which every state is visited infinitely often. Note further that Bertsekas’ result is set within the context of on-line policy iteration over the infinite horizon rather than a rolling finite forecast-horizon.
Unfortunately, we cannot provide a useful result here about the performance of $\bar\pi$ in Theorem IV.2, e.g., an upper bound on $\|V^{(\bar\pi_1)^\infty} - V^*\|_\infty$, because it is difficult to bound $\|V^*_H - V^{\bar\pi}_H\|_\infty$. The degree of approximation of $V^*_H$ by $V^{\bar\pi}_H$ will determine the performance of the rolling horizon control policy built by the on-line asynchronous PIPS algorithm.
V Concluding Remarks
The off-line and on-line PIPS algorithms can play the role of frameworks for solving MDPs with supervisors. While the disposition of the algorithms (and their developments) was made mainly from the perspective of theoretical soundness and results, both can be easily implemented by simulation if the MDP model is generative. In particular, for the on-line case, each policy in $\Delta_{x_t}(\pi^t)$ is simply followed (rolled out) over the relevant forecast-horizon (see, e.g., [1] for related approaches and references) in order to generate its sample-paths. If $\Delta_{x_t}(\pi^t)$ is large, some policies from $\Delta_{x_t}(\pi^t)$ and some feedback from $S_t$ can be “randomly” sampled, for example, without losing the improvement of $\pi^{t+1}$ over $\pi^t$. A study on an actual implementation is important and is left as future work.
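As a sketch of the simulation-based evaluation mentioned above, the $H$-horizon value of an $H$-length policy at a state can be estimated by averaging discounted returns over rolled-out sample paths from a generative model; step(x, a, rng) below is a hypothetical interface returning a sampled reward and next state:

```python
import numpy as np

def rollout_value(step, pi, x, gamma, num_paths=100, rng=None):
    """Monte Carlo estimate of the H-horizon value of the H-length policy pi at
    state x, using only a generative model step(x, a, rng) -> (reward, next_state)
    (a hypothetical interface): average of discounted returns over sample paths."""
    rng = np.random.default_rng() if rng is None else rng
    H = len(pi)
    total = 0.0
    for _ in range(num_paths):
        s, ret, disc = x, 0.0, 1.0
        for i in range(H):                 # follow the i-th entry mapping at level i
            r, s = step(s, pi[i][s], rng)
            ret += disc * r
            disc *= gamma
        total += ret
    return total / num_paths
```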
On-line PIPS is also in the category of “learning” control. Essentially, $\pi^0$ can be thought of as an initial knowledge of control for the system, e.g., as in the value function of a self-learned Go-playing policy of AlphaZero obtained from a neural-network based learning system. The monotonically improving value-function sequence $\{V^{\pi^t}_H\}$ generated by PIPS can be interpreted as learning, or obtaining a better knowledge about, controlling the system.
There exists an algorithm, “adaptive multi-stage sampling” (AMS) [4], for $M_H$ whose random estimate of $V^*_H(x)$ converges to $V^*_H(x)$ for a given $x \in X$, in the expected sense, as the number of samplings approaches infinity. AMS employed within the rolling-horizon control is closer to its original spirit, compared with on-line PIPS, because the value of $V^*_H(x_t)$ is approximated at each visited $x_t$, as if $M_H$ were solved in advance. In contrast with the PI-based approach here, the idea of AMS is to replace the inside of the maximization over the admissible action set in the $T$-operator of VI such that the maximum selection is done with a different utility for each action, or the “necessity” measure of sampling, which is estimated over a set of currently sampled next states with a support from a certain upper-confidence-bound that controls a degree of search by the action. Because the replacement is applied at each level while emulating the process of backward induction, AMS requires a recursive process in a depth-first-search manner that effectively builds a sampled tree whose size is exponential in $H$. It is therefore not surprising that, similar to the complexity of such sampling-based backward induction, AMS also has a (sample) complexity that depends exponentially on $H$. On the other hand, in general, estimating the value of a given policy, not the optimal value, is a much easier task by simulation. Generating random sample-paths of a policy has a polynomial dependence on $H$. More importantly, it seems difficult to discuss the convergence behaviour of the rolling-horizon AMS-control because some characterization of a stochastic policy, induced by the random estimates of $V^*_H(x_t)$ at each $x_t$, needs to be made with a finite sampling-complexity. Arguably, this would be true for any algorithm that produces a random estimate of the optimal value, e.g., Monte-Carlo Tree Search (MCTS) used in AlphaGo or AlphaZero [1]. However, it is worthwhile to note that these algorithms’ outputs can act as feedback from a supervisor in the framework of on-line PIPS.
It can be checked that another multi-policy improvement method, parallel rollout [3], does not work for preserving the monotonicity property with an asynchronous update when a set of more than one policy is applied to the method for the improvement. Even with a synchronous update, the parallel-rollout approach requires estimating a double expectation for each action: one over the next-state distribution and another inside the value function (to be evaluated at the next state). In contrast, in policy switching a single expectation for each policy needs to be estimated, leading to a lower simulation-complexity.
References
- [1] Bertsekas, D. P. (2022). Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific.
- [2] Bertsekas, D. P. (2021). On-Line policy iteration for infinite horizon dynamic programming. arXiv:2106.00746.
- [3] Chang, H. S., Givan, R. L., & Chong, E. K. P. (2004). Parallel rollout for online solution of partially observable Markov decision processes. Discrete Event Dynamic Systems, 14, 309–341.
- [4] Chang, H. S., Fu, M. C., Hu, J., & Marcus, S. I. (2005). An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53, 126–139.
- [5] Hernandez-Lerma, O. (1989). Adaptive Markov Control Processes. Springer-Verlag.
- [6] Kallenberg, L.C.M. (2002). Classification Problems in MDPs. Markov Processes and Controlled Markov Chains, Hou Z., Filar J.A., Chen A. (eds), Springer, Boston, MA.
- [7] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.
- [8] Tsitsiklis, J. N. (2007). NP-Hardness of checking the unichain condition in average cost MDPs. Operations Research Letters, 35, 319–323.