On Supervised On-line Rolling-Horizon Control for Infinite-Horizon Discounted Markov Decision Processes
Abstract
This note revisits the rolling-horizon control approach to the problem of a Markov decision process (MDP) with the infinite-horizon discounted expected reward criterion. Distinguished from the classical value-iteration approach, we develop an asynchronous on-line algorithm based on policy iteration integrated with a multi-policy improvement method of policy switching. A sequence of monotonically improving solutions to the forecast-horizon sub-MDP is generated by updating the current solution only at the currently visited state, building in effect a rolling-horizon control policy for the MDP over the infinite horizon. Feedback from “supervisors,” if available, can also be incorporated while updating. We focus on the convergence issue in relation to the transition structure of the MDP. Depending on the structure, the algorithm achieves in a finite time either a global convergence to an optimal forecast-horizon policy or a local convergence to a “locally optimal” fixed policy.
Index Terms:
rolling horizon control, policy iteration, policy switching, Markov decision process

I Introduction
Consider the rolling horizon control (see, e.g., [5]) with a fixed finite forecast-horizon $H$ to the problem of a Markov decision process (MDP) $M$ with the infinite-horizon discounted expected reward criterion. At discrete time $t \in \{0,1,\dots\}$, the system is at a state $x_t$ in a finite state set $X$. If the controller of the system takes an action $a$ in $A(x_t)$ at $t$, then it obtains a reward of $R(x_t,a)$ from a reward function $R: X \times A \to \mathbb{R}$, where $A(x) \subseteq A$ denotes an admissible-action set of $x$ in a finite action set $A$. The system then makes a random transition to a next state $y$ by the probability of $P_{x_t y}(a)$ specified in an $|X| \times |X|$-matrix $P(a)$.
Let $B(X)$ be the set of all possible functions from $X$ to $\mathbb{R}$. The zero function in $B(X)$ is referred to as $\mathbf{0}$ where $\mathbf{0}(x)=0$ for all $x \in X$. Let also $\Pi$ be the set of all possible mappings from $X$ to $A$ where for any $\pi \in \Pi$, $\pi(x)$ is constrained to be in $A(x)$ for all $x \in X$. Let an $H$-length policy of the controller be an element in $\Pi^H$, the $H$-ary Cartesian product of $\Pi$, $H \ge 1$. That is, an $H$-length policy $\pi$ is an ordered tuple $(\pi_1,\dots,\pi_H)$ where the $i$th entry of $\pi$ is equal to $\pi_i \in \Pi$ and when $\pi_i$ is applied at $x$ in $X$, then the controller is supposed to look ahead of (or forecast) the remaining horizon of $H-i$ levels for control. An infinite-length policy is an element in the infinite Cartesian product of $\Pi$, denoted by $\Pi^\infty$, and referred to as just a policy. We say that a policy is stationary if $\pi_i = \phi$ for all $i$ for some $\phi \in \Pi$. Given $\phi \in \Pi$, $\phi^\infty$ denotes a stationary policy in $\Pi^\infty$ constructed from $\phi$ such that $(\phi^\infty)_i = \phi$ for all $i$.
Define the $H$-horizon value function $V^\pi_H \in B(X)$ of $\pi$ in $\Pi^H$ such that
$$V^\pi_H(x) = E\Bigl[\sum_{i=1}^{H} \gamma^{\,i-1} R\bigl(X^\pi_i, \pi_i(X^\pi_i)\bigr) \Bigm| X^\pi_1 = x\Bigr], \quad x \in X,$$
where $X^\pi_i$ is a random variable that denotes a state at the level $i$ (of forecast) by following the $i$th-entry mapping of $\pi$ and a fixed discounting factor $\gamma$ is in $(0,1)$.
In the sequel, any operator is applied componentwise for the elements in $B(X)$ and in $\Pi^H$, respectively. Given $V \in B(X)$ and $x \in X$, $V(x)$ is set to be the value of $V$ at $x$, meaning the “$x$-coordinate” of $V$ here. As is well known then, there exists an optimal $H$-length policy $\pi^*_H \in \Pi^H$ such that for any $\pi \in \Pi^H$, $V^{\pi^*_H}_H \ge V^\pi_H$, where $V^*_H := V^{\pi^*_H}_H$ is referred to as the optimal $H$-horizon value function. In particular, $V^*_H$ satisfies the optimality principle:
$$V^*_h(x) = \max_{a \in A(x)} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^*_{h-1}(y) \Bigr], \quad 1 \le h \le H,\ V^*_0 := \mathbf{0},$$
for any fixed $x$ in $X$, where $V^*_h$, $1 \le h < H$, denotes the optimal $h$-horizon value function defined analogously. Furthermore, $V^*_H$ is equal to the function in $B(X)$ obtained after applying the value iteration (VI) operator $T$ iteratively $H$ times with the initial function of $\mathbf{0}$: $V^*_H = T^H(\mathbf{0})$, where for $u \in B(X)$ and $x \in X$,
$$T(u)(x) = \max_{a \in A(x)} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, u(y) \Bigr].$$
This optimal substructure leads to a dynamic programming algorithm, backward induction, which computes $V^*_H$ off-line and returns an optimal $H$-horizon policy $\pi^*_H = (\pi^*_{H,1},\dots,\pi^*_{H,H})$ that achieves the optimal value at any $x \in X$ for the $H$-horizon sub-MDP $M_H$ of $M$ by
$$\pi^*_{H,i}(x) \in \arg\max_{a \in A(x)} \Bigl[ R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, T^{H-i}(\mathbf{0})(y) \Bigr], \quad 1 \le i \le H,\ x \in X.$$
Once obtained, the rolling $H$-horizon controller employs the first entry of $\pi^*_H$, or the stationary policy $(\pi^*_{H,1})^\infty$, over the system time: At each $t$, $\pi^*_{H,1}(x_t)$ is taken at $x_t$.
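As a concrete illustration, the following is a minimal sketch of backward induction and of the rolling $H$-horizon action selection just described, assuming the model is available as dense arrays R[x, a] and P[x, a, y] (illustrative names and layout, not prescribed by this note):

```python
import numpy as np

def backward_induction(R, P, gamma, H):
    """Compute V*_h = T^h(0) for h = 0..H and an optimal H-length policy.

    R: array of shape (n_states, n_actions), R[x, a] = reward.
    P: array of shape (n_states, n_actions, n_states), P[x, a, y] = transition prob.
    Returns V of shape (H+1, n_states) with V[h] = T^h(0), and pi of shape
    (H, n_states), where pi[i-1] is the i-th entry mapping (greedy w.r.t. T^{H-i}(0)).
    """
    n_states, n_actions = R.shape
    V = np.zeros((H + 1, n_states))
    for h in range(1, H + 1):
        Q = R + gamma * P @ V[h - 1]   # Q[x, a] = R(x, a) + gamma * sum_y P[x, a, y] * V[h-1][y]
        V[h] = Q.max(axis=1)
    pi = np.zeros((H, n_states), dtype=int)
    for i in range(1, H + 1):
        Q = R + gamma * P @ V[H - i]   # one-step lookahead with H - i levels remaining
        pi[i - 1] = Q.argmax(axis=1)
    return V, pi

def rolling_horizon_action(pi, x):
    """Rolling H-horizon control: apply the first entry of pi at the current state."""
    return pi[0][x]
```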
Given $\phi \in \Pi$, $V^{\phi^\infty} \in B(X)$ denotes the value function of $\phi^\infty$ over the infinite horizon and it is obtained by letting $H$ approach infinity in $V^{(\phi,\dots,\phi)}_H$. The optimal value function $V^*$ of $M$ is then defined such that $V^*(x) = \max_{\phi \in \Pi} V^{\phi^\infty}(x)$ for $x \in X$. A performance result of $(\pi^*_{H,1})^\infty$ applied to $M$ (see, e.g., [5]) is that the infinity-norm of the difference between the value function of $(\pi^*_{H,1})^\infty$ and $V^*$ is upper bounded by an error term that vanishes to zero exponentially fast in $H$ with the rate of $\gamma$, but that can only be loosely upper bounded with a constant factor involving $1/(1-\gamma)$. Due to this dependence, the performance guarantee worsens as $\gamma$ gets closer to one. How to set $H$ is therefore a critical issue in the rolling horizon control even though the error vanishes exponentially fast in $H$.
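For illustration only, suppose the error admits a bound of the form $C\gamma^{H}/(1-\gamma)$ for some constant $C > 0$; then a target accuracy $\epsilon$ translates into a horizon requirement
$$\frac{C\gamma^{H}}{1-\gamma} \le \epsilon \quad \Longleftrightarrow \quad H \;\ge\; \frac{\ln\bigl(C/(\epsilon(1-\gamma))\bigr)}{\ln(1/\gamma)}.$$
For example, with $C = 1$ and $\epsilon = 10^{-2}$, $\gamma = 0.95$ gives $H \ge \ln(2000)/\ln(1/0.95) \approx 148.2$, so $H = 149$ suffices, whereas $\gamma = 0.99$ already requires $H \ge \ln(10^{4})/\ln(1/0.99) \approx 916.4$, i.e., $H = 917$.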
This note develops an algorithm for on-line rolling $H$-horizon control. The sub-MDP $M_H$ is not solved in advance. Rather, with an arbitrarily selected $\pi^0 \in \Pi^H$ for $t=0$, the algorithm generates a monotonically improving sequence $\{\pi^t\}$ of $H$-length policies over time $t$. To the algorithm, only $x_t$ is available at $t$, and it updates $\pi^t$ only at $x_t$ to obtain $\pi^{t+1}$. Either we have that $\pi^{t+1}$ strictly improves $\pi^t$, or $\pi^{t+1}$ coincides with $\pi^t$; in both cases $\pi^{t+1}$ can differ from $\pi^t$ only at $x_t$. The algorithm has a design flexibility in the aspect that a feedback of an action to be used at $x_t$ by some “supervisor” can be incorporated while generating $\pi^{t+1}$. By setting $\phi_t = \pi^{t+1}_1$ at each $t$, a policy $(\phi_0,\phi_1,\dots)$ is in effect built sequentially for the controller. Once $\pi^{t+1}$ is available to the controller, it takes $\pi^{t+1}_1(x_t)$ to the system and the underlying system of $M$ moves to a next random state $x_{t+1}$ by the probability of $P_{x_t x_{t+1}}(\pi^{t+1}_1(x_t))$.
The behavior of such a control policy is discussed by focusing on the convergence issue in relation to the transition structure of $M$. We are concerned with the question of the existence of a finite time $T$ such that $\pi^t = \pi^T$ for all $t \ge T$ for the infinite sequence $\{\pi^t\}$.
II Off-line Synchronous Policy Iteration with Policy Switching
To start with, we present an algorithm of off-line synchronous policy iteration (PI) combined with a multi-policy improvement method of policy switching for solving $M_H$. In what follows, we assume that $A(x) = A$ for any $x \in X$ for simplicity. (The topic about how to set $H$ is beyond the scope of this note.) We also use, for $\pi \in \Pi^H$ and $0 \le h \le H$, the $h$-horizon value function $V^\pi_h \in B(X)$ of the last $h$ entries $(\pi_{H-h+1},\dots,\pi_H)$ of $\pi$, given recursively by $V^\pi_0 = \mathbf{0}$ and $V^\pi_h(x) = R(x,\pi_{H-h+1}(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_{H-h+1}(x))\,V^\pi_{h-1}(y)$ for $x \in X$; in particular, $V^\pi_H$ coincides with the $H$-horizon value function of $\pi$ defined in Section I.
Theorem II.1 (Theorem 2 [3])
Given a nonempty $\Lambda \subseteq \Pi^H$, construct the policy switching $\pi_{ps}$ with respect to $\Lambda$ in $\Pi^H$ such that for each possible pair of $i \in \{1,\dots,H\}$ and $x \in X$,
$$\pi_{ps,i}(x) = \pi^{i,x}_i(x), \qquad \pi^{i,x} \in \arg\max_{\pi \in \Lambda} V^\pi_{H-i+1}(x).$$
Then $V^{\pi_{ps}}_{H-i+1}(x) \ge \max_{\pi \in \Lambda} V^\pi_{H-i+1}(x)$ for all $i \in \{1,\dots,H\}$ and $x \in X$; in particular, $V^{\pi_{ps}}_H \ge \max_{\pi \in \Lambda} V^\pi_H$.
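The construction of Theorem II.1 can be transcribed directly, for instance as in the sketch below, which assumes the same dense-array layout as the earlier sketch and computes, for each candidate policy, the values $V^\pi_h$ of its last $h$ entries (“tail values”) for $h = 0,\dots,H$:

```python
import numpy as np

def policy_switching(R, P, gamma, policies):
    """Construct pi_ps from a nonempty list of H-length policies: at each pair
    (i, x), copy the i-th entry of a policy whose value-to-go from level i at x
    is maximal (a transcription of the construction in Theorem II.1)."""
    policies = np.asarray(policies)          # shape (m, H, n_states)
    m, H, n_states = policies.shape
    xs = np.arange(n_states)
    # tails[k][h] = h-horizon value of the last h entries of policies[k]
    tails = np.zeros((m, H + 1, n_states))
    for k in range(m):
        for h in range(1, H + 1):
            a = policies[k, H - h]           # entry i = H - h + 1, applied with h levels to go
            tails[k, h] = R[xs, a] + gamma * (P[xs, a] @ tails[k, h - 1])
    pi_ps = np.zeros((H, n_states), dtype=int)
    for i in range(1, H + 1):
        h = H - i + 1                        # horizon-to-go from level i
        best = tails[:, h, :].argmax(axis=0) # index of a best policy per state
        pi_ps[i - 1] = policies[best, i - 1, xs]
    return pi_ps
```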
Given $\pi$ and $\pi'$ in $\Pi^H$, we say that $\pi'$ strictly improves $\pi$ (over the horizon $H$) if there exists some pair of $h \in \{1,\dots,H\}$ and $x \in X$ such that $V^{\pi'}_h(x) > V^\pi_h(x)$ and $V^{\pi'}_h \ge V^\pi_h$ for all $h$, in which case we write $\pi' \succ \pi$. Define the switchable action set of $\pi$ at $x$ for $i \in \{1,\dots,H\}$ as
$$A_i(\pi,x) = \Bigl\{ a \in A(x) : R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^\pi_{H-i}(y) > V^\pi_{H-i+1}(x) \Bigr\},$$
and also the improvable-state set of $\pi$ for $i$ as
$$I_i(\pi) = \bigl\{ x \in X : A_i(\pi,x) \ne \emptyset \bigr\}.$$
Set $I(\pi) = \bigcup_{i=1}^{H}\{(x,i) : x \in I_i(\pi)\}$. Observe that if $I(\pi) = \emptyset$ for $\pi \in \Pi^H$, then $\pi$ is an optimal $H$-length policy for $M_H$.
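A minimal sketch, under the same assumed array layout as the earlier sketches, evaluates the tail value functions $V^\pi_h$ of an $H$-length policy and collects the switchable action sets and the improvable pairs (a small numerical tolerance stands in for the strict inequality):

```python
import numpy as np

def tail_values(R, P, gamma, pi):
    """V[h] = value function of the last h entries of the H-length policy pi
    (V[0] = 0, V[H] = the H-horizon value function of pi)."""
    H, n_states = pi.shape
    xs = np.arange(n_states)
    V = np.zeros((H + 1, n_states))
    for h in range(1, H + 1):
        a = pi[H - h]                       # entry i = H - h + 1, applied with h levels to go
        V[h] = R[xs, a] + gamma * (P[xs, a] @ V[h - 1])
    return V

def improvable_pairs(R, P, gamma, pi, tol=1e-12):
    """Return A_i(pi, x) for every pair and the set I(pi) of improvable pairs (x, i)."""
    H, n_states = pi.shape
    V = tail_values(R, P, gamma, pi)
    switchable, I = {}, []
    for i in range(1, H + 1):
        h = H - i + 1                       # horizon-to-go at entry i
        Q = R + gamma * P @ V[h - 1]        # Q[x, a] = one-step lookahead at entry i
        for x in range(n_states):
            better = np.flatnonzero(Q[x] > V[h][x] + tol)
            if better.size > 0:
                switchable[(x, i)] = better
                I.append((x, i))
    return switchable, I
```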
The following theorem provides a result for $M_H$ in analogy with the key step of the single-policy improvement (see, e.g., [7]) in PI for $M$. Because Banach’s fixed-point theorem is difficult to invoke in the finite-horizon case, unlike in the standard proof for the infinite-horizon case, we provide a proof for completeness.
Theorem II.2
Given $\pi \in \Pi^H$ with $I(\pi) \ne \emptyset$, construct $\pi'$ with any nonempty $I$ satisfying $I \subseteq I(\pi)$ such that $\pi'_i(x) \in A_i(\pi,x)$ for all $(x,i) \in I$ and $\pi'_i(x) = \pi_i(x)$ for all $(x,i) \notin I$. Then $\pi' \succ \pi$.
Proof:
The base case holds because $V^{\pi'}_0 = V^\pi_0 = \mathbf{0}$. For the induction step on the horizon $h \in \{1,\dots,H\}$ with $i = H-h+1$, assume that $V^{\pi'}_{h-1} \ge V^\pi_{h-1}$. For all $x$ such that $(x,i) \in I$,
$$V^{\pi'}_h(x) = R(x,\pi'_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi'_i(x))\, V^{\pi'}_{h-1}(y) \ge R(x,\pi'_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi'_i(x))\, V^\pi_{h-1}(y) > V^\pi_h(x),$$
where the last inequality holds because $\pi'_i(x) \in A_i(\pi,x)$.
On the other hand, for any $x$ such that $(x,i) \notin I$,
$$V^{\pi'}_h(x) = R(x,\pi_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_i(x))\, V^{\pi'}_{h-1}(y) \ge R(x,\pi_i(x)) + \gamma \sum_{y \in X} P_{xy}(\pi_i(x))\, V^\pi_{h-1}(y) = V^\pi_h(x).$$
Putting the two cases together makes $V^{\pi'}_h \ge V^\pi_h$. In particular, we see that there must exist some pair of $h$ and $x$ such that $V^{\pi'}_h(x) > V^\pi_h(x)$ because of our choice of $I$, having $I \ne \emptyset$. By an induction argument on the level from $H$ down to $1$ then, it follows that $\pi' \succ \pi$. ∎
The previous theorem states that if a policy was generated from a given $\pi$ by switching the action prescribed by $\pi$ at each improvable state, with its corresponding level, in a particularly chosen nonempty subset of the improvable-state set of $\pi$, then the policy constructed strictly improves $\pi$ over the relevant finite horizon. However, in general, even if $I(\pi)$ is known, for $\pi'$ and $\pi''$ obtained by the method of Theorem II.2 with two different subsets of $I(\pi)$, respectively, $\pi' \succ \pi''$ or $\pi'' \succ \pi'$ does not hold necessarily. (Note that this is also true for the infinite-horizon case.) It can be merely said that $\pi'$ improves $\pi$ and $\pi''$ improves $\pi$, respectively. Motivated by this, we consider the following. For a given $\pi$ in $\Pi^H$, let $\Delta(\pi)$ be the set of all strictly better policies than $\pi$ obtainable from $I(\pi)$: If $I(\pi) = \emptyset$, $\Delta(\pi) = \{\pi\}$. Otherwise,
$$\Delta(\pi) = \bigl\{ \pi' \in \Pi^H : \pi' \text{ is constructed by the method of Theorem II.2 with some nonempty } I \subseteq I(\pi) \bigr\}.$$
Once $\Delta(\pi)$ is obtained, policy switching with respect to $\Delta(\pi)$ is applied to find a single policy no worse than all policies in the set.
We are ready to derive an off-line synchronous algorithm, “policy iteration with policy switching” (PIPS), that generates a sequence $\{\pi^k\}$ of $H$-length policies for solving $M_H$: Set $\pi^0 \in \Pi^H$ arbitrarily. Loop with $k = 0,1,2,\dots$ until $I(\pi^k) = \emptyset$, where $\pi^{k+1}$ is the policy obtained by applying policy switching with respect to $\Delta(\pi^k)$.
The convergence to an optimal $H$-length policy for $M_H$ is trivially guaranteed because $\pi^{k+1} \succ \pi^k$ whenever $I(\pi^k) \ne \emptyset$, and both $X$ and $A$ are finite. Note that $\Delta(\pi^k)$ in the loop can be substituted with any nonempty $\Lambda^k$ as long as $\Lambda^k \subseteq \Delta(\pi^k)$ for each $k$. The generality of the choice of $\Lambda^k$ then provides a broad design-flexibility of PIPS.
The idea behind policy switching used in PIPS with $\Delta(\pi^k)$ can be attributed to approximating the steepest ascent direction while applying the steepest ascent algorithm. At the current location $\pi^k$, we find ascent “directions” relative to $V^{\pi^k}_H$ over the local neighborhood $\Delta(\pi^k)$ of $\pi^k$. A steepest ascent direction, $\pi^{k+1}$, is then obtained by “combining” all of the possible ascent directions. In particular, the greedy ascent direction $\pi_g$ that satisfies that $\pi_{g,i}(x) \in \arg\max_{a \in A(x)} [R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^{\pi^k}_{H-i}(y)]$ for all $(x,i) \in I(\pi^k)$ and $\pi_{g,i}(x) = \pi^k_i(x)$ otherwise is always included while combining.
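As an illustration of this view, the following sketch runs synchronous PIPS with the singleton choice $\Lambda^k = \{\pi_g\}$ consisting of the greedy ascent direction only, which is permitted by the remark above; it assumes the same array layout as the earlier sketches:

```python
import numpy as np

def greedy_direction(R, P, gamma, pi, tol=1e-12):
    """The greedy ascent direction in Delta(pi): at every improvable pair (x, i),
    switch to an action maximizing the one-step lookahead; keep pi elsewhere."""
    H, n_states = pi.shape
    xs = np.arange(n_states)
    V = np.zeros((H + 1, n_states))          # tail values of pi
    for h in range(1, H + 1):
        a = pi[H - h]
        V[h] = R[xs, a] + gamma * (P[xs, a] @ V[h - 1])
    new_pi, improved = pi.copy(), False
    for i in range(1, H + 1):
        h = H - i + 1
        Q = R + gamma * P @ V[h - 1]
        best = Q.argmax(axis=1)
        gain = Q[xs, best] > V[h] + tol      # improvable states for entry i
        new_pi[i - 1, gain] = best[gain]
        improved = improved or bool(gain.any())
    return new_pi, improved

def synchronous_pips(R, P, gamma, H):
    """PIPS with Lambda^k = {greedy ascent direction}: iterate until I(pi^k) is empty."""
    n_states, _ = R.shape
    pi = np.zeros((H, n_states), dtype=int)  # arbitrary initial H-length policy
    while True:
        pi, improved = greedy_direction(R, P, gamma, pi)
        if not improved:
            return pi
```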
III Off-line Asynchronous PIPS
An asynchronous version can be inferred from the synchronous PIPS by the following improvement result when a single $H$-length policy in $\Pi^H$ is updated only at a single state:
Corollary III.1
Given $\pi \in \Pi^H$ and $x \in X$, let $I_x(\pi) = \{(y,i) \in I(\pi) : y = x\}$. Suppose that $I_x(\pi) \ne \emptyset$. Then $\pi' \succ \pi$ for any $\pi' \in \Delta_x(\pi)$, where
$$\Delta_x(\pi) = \bigl\{ \pi' \in \Pi^H : \pi' \text{ is constructed by the method of Theorem II.2 with some nonempty } I \subseteq I_x(\pi) \bigr\}.$$
Proof:
Because any nonempty $I \subseteq I_x(\pi)$ is also a nonempty subset of $I(\pi)$, $\Delta_x(\pi) \subseteq \Delta(\pi)$ and Theorem II.2 applies. ∎
The set $\Delta_x(\pi)$, “projected to the $x$-coordinate direction” from $\Delta(\pi)$, contains all strictly better policies than $\pi$ that can be obtained by switching the action(s) prescribed by $\pi$ at the single state $x$. This result leads to an off-line convergent asynchronous PIPS for $M_H$: Select $\pi^0 \in \Pi^H$ arbitrarily. Loop with $k = 0,1,2,\dots$: If $I(\pi^k) \ne \emptyset$, select $x^k \in \{x \in X : I_x(\pi^k) \ne \emptyset\}$ and construct $\pi^{k+1}$ by policy switching with respect to $\Delta_{x^k}(\pi^k)$ such that $\pi^{k+1}$ possibly differs from $\pi^k$ only at $x^k$ and $\pi^{k+1}_i(y) = \pi^k_i(y)$ for all $i$ and all $y \ne x^k$ in $X$. If $I(\pi^k) = \emptyset$, stop. Because $x^k$ is always selected to be an improvable state in $\{x : I_x(\pi^k) \ne \emptyset\}$, $\pi^{k+1} \succ \pi^k$ for all $k$ before termination. Therefore, $\{\pi^k\}$ converges to an optimal $H$-length policy for $M_H$.
Suppose that the state given at the current step of the previous algorithm is not guaranteed to be in the improvable-state set of the current policy. Such a scenario is possible with the following modified version: Select $\pi^0 \in \Pi^H$ arbitrarily. Loop with $k = 0,1,2,\dots$: If $I(\pi^k) = \emptyset$, stop. Select $x^k \in X$. If $I_{x^k}(\pi^k) \ne \emptyset$, then construct $\pi^{k+1}$ by policy switching with respect to $\Delta_{x^k}(\pi^k)$ such that $\pi^{k+1}_i(y) = \pi^k_i(y)$ for all $i$ and all $y \ne x^k$ in $X$. If $I_{x^k}(\pi^k) = \emptyset$, $\pi^{k+1} = \pi^k$. Unlike the previous version, this algorithm’s convergence depends on the sequence $\{x^k\}$ selected. Even if $I(\pi^k) \ne \emptyset$, when no state in $\{x : I_x(\pi^k) \ne \emptyset\}$ is ever selected afterwards, the stopping condition that checks for the optimality can never be satisfied. In other words, an infinite loop is possible. The immediate problem is then how to choose an update-state sequence to achieve a global convergence. The reason for bringing this issue up with the modified algorithm is that the situation is closely related to the on-line algorithm to be discussed in the next section. Dealing with this issue here helps in understanding the convergence behaviour of the on-line algorithm. We discuss a pedagogical example of choosing an update-state sequence for the modified off-line algorithm below.
One way of enforcing a global convergence is to “embed” backward induction into the update-state sequence. For example, we generate a sequence $\{x^k\}$ whose induced subsequence of $\{\pi^k\}$ solves $M_H$. We need to follow the optimality principle such that the 1-horizon problem is solved before the 2-horizon problem, and so forth, until the $H$-horizon problem is finally solved. Therefore, the entries of the current policy are searched from the $H$th entry down to the first entry over $X$: first $\pi_H$ is fixed so that it achieves $V^*_1$, where $V^*_1 = T(\mathbf{0})$, and then $\pi_{H-1}$ so that it achieves $V^*_2$, where $V^*_2 = T^2(\mathbf{0})$, and so on, and then finally $\pi_1$ so that it achieves $V^*_H$, with $V^*_H = T^H(\mathbf{0})$.
Once the $h$-horizon problem has been solved, an optimal choice of the entry for the $(h+1)$-horizon problem can be found exhaustively. The corresponding update-state subsequence can be any permutation of the states in $X$. Visiting each $x$ in $X$ at least once for updating causes an optimal tail policy for the $(h+1)$-horizon problem to be found because, if not empty, $\Delta_x(\pi^k)$, where $\pi^k$ is the current policy, includes an $H$-length policy whose relevant entry mapping maps $x$ to an action in $\arg\max_{a \in A(x)}[R(x,a) + \gamma \sum_{y \in X} P_{xy}(a)\, V^*_h(y)]$. Even though visiting each state at least once makes the approach enumerative, our point is showing that there exists an update-state sequence that makes a global convergence possible.
IV On-line Asynchronous PIPS
We are now at the position of the main subject of this note about solving $M$ within an on-line rolling $H$-horizon control by solving $M_H$ not in advance but over time. We assume that a sequence $\{S_t\}$ is available where $S_t \subseteq \Pi$, $t \ge 0$, stands for a set of supervisors at $t$. Any feedback of an action to take at a state can be represented by an element in $\Pi$.
The controller applies a policy to the system of $M$ built from the sequence $\{\pi^t\}$ of the $H$-length policies generated by the on-line asynchronous PIPS algorithm: PIPS first selects $\pi^0 \in \Pi^H$ arbitrarily and sets $t = 0$. At $t$, if $I_{x_t}(\pi^t) = \emptyset$, $\pi^{t+1} = \pi^t$. Otherwise, PIPS updates $\pi^t$ only at $x_t$ by applying policy switching with respect to $\Delta_{x_t}(\pi^t)$, where any action suggested by the supervisors in $S_t$ for $x_t$ is incorporated whenever it is switchable, such that $\pi^{t+1}_i(y) = \pi^t_i(y)$ for all $i$ and all $y \ne x_t$ in $X$. After finishing the update, PIPS sets $a_t$ to the first entry of $\pi^{t+1}$ at $x_t$, i.e., $a_t = \pi^{t+1}_1(x_t)$. Once $a_t$ is available to the controller, $a_t$ is taken to the system and the system makes a random transition to $x_{t+1}$.
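A minimal sketch of this on-line loop is given below, under the same assumed model layout. For simplicity it evaluates the current policy exactly from the model (a simulation-based estimate could be substituted), samples only a few candidate actions at the visited state rather than enumerating $A(x_t)$, and represents the supervisors by a hypothetical callable supervisors(t) returning state-to-action maps; a suggested action is adopted only when it is switchable, i.e., strictly improving, so monotonicity is preserved:

```python
import numpy as np

def online_pips(R, P, gamma, H, x0, num_steps, supervisors=None, n_sampled=3, rng=None):
    """On-line asynchronous PIPS (sketch): update the current H-length policy
    only at the visited state, then apply its first-entry action to the system."""
    rng = np.random.default_rng() if rng is None else rng
    n_states, n_actions = R.shape
    xs = np.arange(n_states)
    pi = np.zeros((H, n_states), dtype=int)       # arbitrary initial policy pi^0
    x, history = x0, []
    for t in range(num_steps):
        # tail values of the current policy (exact evaluation from the model)
        V = np.zeros((H + 1, n_states))
        for h in range(1, H + 1):
            a = pi[H - h]
            V[h] = R[xs, a] + gamma * (P[xs, a] @ V[h - 1])
        # candidate switch actions at the visited state: a few sampled actions
        # plus supervisor feedback, if any (enumerating all of A(x) would
        # recover the exhaustive single-state update)
        cand = {int(a) for a in rng.choice(n_actions, size=min(n_sampled, n_actions), replace=False)}
        if supervisors is not None:
            cand |= {int(sigma[x]) for sigma in supervisors(t)}
        # switch entry i at x only when a candidate is switchable (strictly better)
        for i in range(1, H + 1):
            h = H - i + 1
            q = R[x] + gamma * (P[x] @ V[h - 1])  # q[a], one-step lookahead at x
            best = max(cand, key=lambda a: q[a])
            if q[best] > V[h][x] + 1e-12:
                pi[i - 1, x] = best
        # act with the first entry of the updated policy; the system transitions
        a_t = pi[0, x]
        history.append((x, a_t, R[x, a_t]))
        x = int(rng.choice(n_states, p=P[x, a_t]))
    return pi, history
```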
Before we present a general convergence result, an intuitive consequence about a global convergence from a sufficient condition related to the transition structure of $M$ is given below as a theorem. Note that $M$ is communicating if every Markov chain induced by fixing each stationary policy in $\Pi^\infty$ is communicating [6].
Theorem IV.1
Suppose that $M$ is communicating. Then for $\{\pi^t\}$ generated by on-line asynchronous PIPS, there exists some finite $T$ such that $\pi^t = \pi^T$ for all $t \ge T$, where $\pi^T$ is an optimal $H$-length policy in $\Pi^H$ for any $\pi^0$ and any $\{S_t\}$, and where $V^{\pi^T}_H = V^*_H$.
Proof:
Because both $X$ and $A$ are finite, $\Pi$ and $\Pi^H$ are finite. For any given $t$ and any $S_t$, the monotonicity of $\{V^{\pi^t}_H\}$ in $t$ holds because $\pi^{t+1}$ satisfies that $\pi^{t+1} \succ \pi^t$ if $I_{x_t}(\pi^t) \ne \emptyset$ and $\pi^{t+1} = \pi^t$ otherwise. The assumption that $M$ is communicating ensures that every state in $X$ is visited infinitely often within $\{x_t\}$. It follows that at some sufficiently large finite time $T$, $I(\pi^T) = \emptyset$, implying $V^{\pi^T}_H = V^*_H$. Therefore, $\pi^t = \pi^T$ for all $t \ge T$, where $\pi^T$ is an optimal $H$-length policy in $\Pi^H$. ∎
The policy of the controller becomes stable in the sense that the sequence $\{\pi^t\}$ converges to $\pi^T$ in a finite time. The question of whether checking that $M$ is communicating, without enumerating all stationary policies in $\Pi^\infty$, is possible with a polynomial time-complexity was raised in [6]. Unfortunately, this problem is in general NP-hard [8]. A simple and obvious sufficient condition for such connectivity is that $P_{xy}(a) > 0$ for all $x, y$ in $X$ and $a$ in $A$. The key to the convergence here is which states in $X$ are visited “sufficiently often” by following the built policy, so as to ensure that an optimal action at each visited state is eventually found. The following result stated in Theorem IV.2 reflects this rationale. Given a stationary policy $\phi^\infty$ with $\phi \in \Pi$, the connectivity relation $\leftrightarrow_\phi$ is defined on $X$ from the Markov chain induced by fixing $\phi^\infty$ in $M$: If $x$ and $y$ in $X$ communicate with each other in the chain, $(x,y)$ is an element of $\leftrightarrow_\phi$. Given $x \in X$, the equivalence class of $x$ with respect to $\leftrightarrow_\phi$ is denoted by $C_\phi(x)$. Note that for any $x, y \in X$, either $C_\phi(x) = C_\phi(y)$ or $C_\phi(x) \cap C_\phi(y) = \emptyset$. The collection $\{C_\phi(x) : x \in X\}$ partitions $X$.
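The classes $C_\phi(x)$ just introduced can be computed, for instance, from the positive-probability transition graph of the induced chain, as in the following sketch (same assumed array layout; phi[x] is the action prescribed at x):

```python
import numpy as np

def communicating_classes(P, phi):
    """Partition X into communicating classes of the Markov chain induced by
    fixing the stationary policy phi (phi[x] = action at x) in the model P."""
    n_states = P.shape[0]
    # one-step positive-probability edges of the induced chain
    step = P[np.arange(n_states), phi] > 0.0
    # reach[x, y]: y reachable from x in zero or more steps (repeated squaring)
    reach = np.eye(n_states, dtype=bool) | step
    for _ in range(max(1, int(np.ceil(np.log2(n_states))))):
        reach = reach | ((reach.astype(np.int64) @ reach.astype(np.int64)) > 0)
    classes, assigned = [], np.full(n_states, -1)
    for x in range(n_states):
        if assigned[x] >= 0:
            continue
        cls = [y for y in range(n_states) if reach[x, y] and reach[y, x]]
        for y in cls:
            assigned[y] = len(classes)
        classes.append(cls)
    return classes
```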
Theorem IV.2
For any $\pi^0 \in \Pi^H$ and any $\{S_t\}$ where $S_t \subseteq \Pi$, $\{\pi^t\}$ generated by on-line asynchronous PIPS converges to some $\bar\pi$ in $\Pi^H$ such that for some finite $T$, $\pi^t = \bar\pi$ for all $t \ge T$. Furthermore, $\bar\pi$ satisfies that $I_x(\bar\pi) = \emptyset$ for all $x \in C_{\bar\pi_1}(x_t)$, where $x_t$ is any visited state at a sufficiently large $t \ge T$.
Proof:
By the same reasoning as in the proof of Theorem IV.1, $\{\pi^t\}$ converges to an element $\bar\pi$ in $\Pi^H$ in a finite time $T$. Because every state in $C_{\bar\pi_1}(x_t)$ is visited infinitely often within $\{x_t\}$ for sufficiently large $t \ge T$, $I_x(\bar\pi) = \emptyset$ for all $x$ in $C_{\bar\pi_1}(x_t)$ for such $t$. No more improvement is possible at all states in $C_{\bar\pi_1}(x_t)$. Otherwise, it contradicts the convergence to $\bar\pi$. ∎
The theorem states that $\bar\pi$ is “locally optimal” over $C_{\bar\pi_1}(x_t)$ in the sense that no more improvement is possible at all states in $C_{\bar\pi_1}(x_t)$.
We remark that the above local convergence result is different from that of Proposition 2.2 by Bertsekas [2]. In our case, a subset of $X$ in which every state is visited infinitely often is not assumed to be given in advance. The sequence of policies generated by on-line PIPS will eventually converge to a policy, and Theorem IV.2 characterizes its local optimality with respect to the communicating class, induced by the converged policy, in which every state is visited infinitely often. Note further that Bertsekas’ result is set within the context of on-line policy iteration over the infinite horizon rather than a rolling finite forecast-horizon.
Unfortunately, we cannot provide a useful result here about the performance of $\bar\pi$ in Theorem IV.2, e.g., an upper bound on $\|V^{(\bar\pi_1)^\infty} - V^*\|_\infty$, because it is difficult to bound $\|V^*_H - V^{\bar\pi}_H\|_\infty$. The degree of approximation of $V^*_H$ by $V^{\bar\pi}_H$ will determine the performance of the rolling horizon control policy built by the on-line asynchronous PIPS algorithm.
V Concluding Remarks
The off-line and on-line PIPS algorithms can play the role of frameworks for solving MDPs with supervisors. While the disposition of the algorithms (and their developments) was made mainly from the perspective of theoretical soundness and results, both can be easily implemented by simulation if the MDP model is generative. In particular, for the on-line case, each policy in $\Delta_{x_t}(\pi^t)$ is simply followed (rolled out) over the relevant forecast-horizon (see, e.g., [1] for related approaches and references) in order to generate its sample-paths. If $\Delta_{x_t}(\pi^t)$ is large, some policies from $\Delta_{x_t}(\pi^t)$ and some feedback from $S_t$ can be “randomly” sampled, for example, without losing the improvement of $\pi^{t+1}$ over $\pi^t$. A study on an actual implementation is important and is left as future work.
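As a sketch of the simulation-based evaluation mentioned above, the $H$-horizon value of an $H$-length policy at a state can be estimated by averaging discounted returns over rolled-out sample paths from a generative model; step(x, a, rng) below is a hypothetical interface returning a sampled reward and next state:

```python
import numpy as np

def rollout_value(step, pi, x, gamma, num_paths=100, rng=None):
    """Monte Carlo estimate of the H-horizon value of the H-length policy pi at
    state x, using only a generative model step(x, a, rng) -> (reward, next_state)
    (a hypothetical interface): average of discounted returns over sample paths."""
    rng = np.random.default_rng() if rng is None else rng
    H = len(pi)
    total = 0.0
    for _ in range(num_paths):
        s, ret, disc = x, 0.0, 1.0
        for i in range(H):                 # follow the i-th entry mapping at level i
            r, s = step(s, pi[i][s], rng)
            ret += disc * r
            disc *= gamma
        total += ret
    return total / num_paths
```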
On-line PIPS is also in the category of “learning” control. Essentially, $\pi^0$ can be thought of as an initial knowledge of control for the system, e.g., as in the value function of a self-learned Go-playing policy of AlphaZero obtained from a neural-network based learning system. The monotonically improving value-function sequence $\{V^{\pi^t}_H\}$ generated by PIPS can be interpreted as learning, or obtaining a better knowledge about, controlling the system.
There exists an algorithm, “adaptive multi-stage sampling” (AMS) [4], for $M_H$ whose random estimate of $V^*_H(x)$ converges to $V^*_H(x)$ for a given $x \in X$, in the expected sense, as the number of samplings approaches infinity. AMS employed within the rolling-horizon control is closer to its original spirit, compared with on-line PIPS, because the value of $V^*_H(x_t)$ is approximated at each visited $x_t$, as if $M_H$ were solved in advance. In contrast with the PI-based approach here, the idea of AMS is to replace the inside of the maximization over the admissible action set in the $T$-operator of VI such that the maximum selection is done with a different utility for each action, or the “necessity” measure of sampling, which is estimated over a set of currently sampled next states with a support from a certain upper-confidence-bound that controls a degree of search by the action. Because the replacement is applied at each level while emulating the process of backward induction, AMS requires a recursive process in a depth-first-search manner that effectively builds a sampled tree whose size is exponential in $H$. It is therefore not surprising that, similar to the complexity of such sampling-based backward induction, AMS also has a (sample) complexity that depends exponentially on $H$. On the other hand, in general, estimating the value of a given policy, not the optimal value, is a much easier task by simulation. Generating random sample-paths of a policy has a polynomial dependence on $H$. More importantly, it seems difficult to discuss the convergence behaviour of the rolling-horizon AMS-control because some characterization of a stochastic policy, induced by the random estimates of $V^*_H(x_t)$ at each $x_t$, needs to be made with a finite sampling-complexity. Arguably, this would be true for any algorithm that produces a random estimate of the optimal value, e.g., Monte-Carlo Tree Search (MCTS) used in AlphaGo or AlphaZero [1]. However, it is worthwhile to note that these algorithms’ outputs can act as feedback from a supervisor in the framework of on-line PIPS.
It can be checked that another multi-policy improvement method, parallel rollout [3], does not work for preserving the monotonicity property with an asynchronous update when a set of more than one policy is applied to the method for the improvement. Even with a synchronous update, the parallel-rollout approach requires estimating a double expectation for each action: one over the next-state distribution and another inside the value function (to be evaluated at the next state). In contrast, in policy switching a single expectation for each policy needs to be estimated, leading to a lower simulation-complexity.
References
- [1] Bertsekas, D. P. (2022). Lessons from AlphaZero for Optimal, Model Predictive, and Adaptive Control. Athena Scientific.
- [2] Bertsekas, D. P. (2021). On-Line policy iteration for infinite horizon dynamic programming. arXiv:2106.00746.
- [3] Chang, H. S., Givan, R. L., & Chong, E. K. P. (2004). Parallel rollout for online solution of partially observable Markov decision processes. Discrete Event Dynamic Systems, 14, 309–341.
- [4] Chang, H. S., Fu, M. C., Hu, J., & Marcus, S. I. (2005). An adaptive sampling algorithm for solving Markov decision processes. Operations Research, 53, 126–139.
- [5] Hernandez-Lerma, O. (1989). Adaptive Markov Control Processes. Springer-Verlag.
- [6] Kallenberg, L.C.M. (2002). Classification Problems in MDPs. Markov Processes and Controlled Markov Chains, Hou Z., Filar J.A., Chen A. (eds), Springer, Boston, MA.
- [7] Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York.
- [8] Tsitsiklis, J. N. (2007). NP-Hardness of checking the unichain condition in average cost MDPs. Operations Research Letters, 35, 319–323.