Transience in Countable MDPs
Stefan Kiefer (Department of Computer Science, University of Oxford, UK) · Richard Mayr (School of Informatics, University of Edinburgh, UK) · Mahsa Shirmohammadi (Université de Paris, CNRS, IRIF, F-75013 Paris, France) · Patrick Totzke (Department of Computer Science, University of Liverpool, UK)
This is the full version of a CONCUR 2021 paper [13].
Abstract
The Transience objective is not to visit any state infinitely often. While this is not possible in any finite Markov Decision Process (MDP), it can be satisfied in countably infinite ones, e.g., if the transition graph is acyclic.
We prove the following fundamental properties of Transience in countably infinite MDPs.
1. There exist uniformly ε-optimal MD strategies (memoryless deterministic) for Transience, even in infinitely branching MDPs.
2. Optimal strategies for Transience need not exist, even if the MDP is finitely branching. However, if an optimal strategy exists then there is also an optimal MD strategy.
3. If an MDP is universally transient (i.e., almost surely transient under all strategies), then many other objectives have a lower strategy complexity than in general MDPs. E.g., ε-optimal strategies for Safety and co-Büchi, and optimal strategies (where they exist) for certain parity objectives, can be chosen MD, even if the MDP is infinitely branching.
Keywords: Markov decision processes, Parity, Transience
1 Introduction
Those who cannot remember the past are condemned to repeat it.
George Santayana (1905) [22]
The famous aphorism above has often been cited (with small variations), e.g., by Winston Churchill in a 1948 speech to the House of Commons, and carved into several monuments all over the world [22].
We prove that the aphorism is false. In fact, even those who cannot remember anything at all are not condemned to repeat the past. With the right strategy, they can avoid repeating the past just as well as everyone else. More formally, playing for Transience does not require any memory. We show that there always exist ε-optimal memoryless deterministic strategies for Transience, and if optimal strategies exist then there also exist optimal memoryless deterministic strategies. (Our result applies to MDPs, also called games against nature. It is an open question whether it generalizes to countable stochastic 2-player games. However, it is easy to see that the adversary needs infinite memory in general, even if the player is passive [14, 16].)
Background. We study Markov decision processes (MDPs), a standard model for dynamic systems that exhibit both stochastic and controlled behavior [21]. MDPs play a prominent role in many domains, e.g., artificial intelligence and machine learning [26, 24], control theory [5, 1], operations research and finance [25, 12, 6, 23], and formal verification [2, 25, 11, 8, 3, 7].
An MDP is a directed graph where states are either random or controlled. Its observed behavior is described by runs, which are infinite paths that are, in part, determined by the choices of a controller. If the current state is random then the next state is chosen according to a fixed probability distribution. Otherwise, if the current state is controlled, the controller can choose a distribution over all possible successor states. By fixing a strategy for the controller (and initial state), one obtains a probability space of runs of the MDP. The goal of the controller is to optimize the expected value of some objective function on the runs.
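To make the model concrete, the following small Python sketch simulates a finite prefix of a run under a memoryless deterministic strategy. It is illustrative only; the toy state space, transition structure, and all identifiers are ours, not from the paper.

```python
import random
from typing import Callable, Dict, List

# A toy MDP fragment: controlled states choose a successor, random states
# follow a fixed probability distribution.
CONTROLLED = {"c0", "c1"}
SUCCESSORS: Dict[str, List[str]] = {"c0": ["r0", "c1"], "c1": ["r0"], "r0": ["c0", "c1"]}
P: Dict[str, Dict[str, float]] = {"r0": {"c0": 0.5, "c1": 0.5}}  # distributions of random states

def simulate(strategy: Callable[[str], str], s0: str, steps: int) -> List[str]:
    """Produce a finite prefix of a run: the strategy resolves controlled states,
    random states are resolved according to P."""
    run, s = [s0], s0
    for _ in range(steps):
        if s in CONTROLLED:
            s = strategy(s)                       # memoryless deterministic (MD) choice
        else:
            succ = list(P[s])
            s = random.choices(succ, [P[s][t] for t in succ])[0]
        run.append(s)
    return run

print(simulate(lambda s: SUCCESSORS[s][0], "c0", 10))
```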
The strategy complexity of a given objective characterizes the type of strategy necessary to achieve an optimal (resp. ε-optimal) value for the objective. General strategies can take the whole history of the run into account (history-dependent; (H)), while others use only bounded information about it (finite memory; (F)) or base decisions only on the current state (memoryless; (M)). Moreover, the strategy type depends on whether the controller can randomize (R) or is limited to deterministic choices (D). The simplest type, MD, refers to memoryless deterministic strategies.
Acyclicity and Transience. An MDP is called acyclic iff its transition graph is acyclic. While finite MDPs cannot be acyclic (unless they have deadlocks), countable MDPs can. In acyclic countable MDPs, the strategy complexity of Büchi/Parity objectives is lower than in the general case: ε-optimal strategies for Büchi/Parity objectives require only one bit of memory in acyclic MDPs, while they require infinite memory (an unbounded step-counter, plus one bit) in general countable MDPs [14, 15].
The concept of transience can be seen as a generalization of acyclicity. In a Markov chain, a state s is called transient iff the probability of returning from s to s is strictly less than 1 (otherwise the state is called recurrent). This means that a transient state is almost surely visited only finitely often. The concepts of transience and recurrence are naturally lifted from Markov chains to MDPs, where they depend on the chosen strategy.
We define the Transience objective as the set of runs that do not visit any state infinitely often. We call an MDP universally transient iff it almost-surely satisfies Transience under every strategy. Thus every acyclic MDP is universally transient, but not vice-versa; cf. Figure 1. In particular, universal transience does not just depend on the structure of the transition graph, but also on the transition probabilities. Universally transient MDPs have interesting properties. Many objectives (e.g., Safety, Büchi, co-Büchi) have a lower strategy complexity than in general MDPs; see below.
We also study the strategy complexity of the Transience objective itself, and how it interacts with other objectives, e.g., how to attain a Büchi objective in a transient way.
Our contributions.
1. We show that there exist uniformly ε-optimal MD strategies (memoryless deterministic) for Transience, even in infinitely branching MDPs. This is unusual, since (apart from reachability objectives) most other objectives require infinite memory if the MDP is infinitely branching, e.g., all objectives generalizing Safety [17].
Our result is shown in several steps. First we show that there exist ε-optimal deterministic 1-bit strategies for Transience. Then we show how to dispense with the 1-bit memory and obtain ε-optimal MD strategies for Transience. Finally, we make these MD strategies uniform, i.e., independent of the start state.
2. We show that optimal strategies for Transience need not exist, even if the MDP is finitely branching. If they do exist, then there are also MD optimal strategies. More generally, there exists a single MD strategy that is optimal from every state that allows optimal strategies for Transience.
3. If an MDP is universally transient (i.e., almost surely transient under all strategies), then many other objectives have a lower strategy complexity than in general MDPs, e.g., ε-optimal strategies for Safety and co-Büchi, and optimal strategies (where they exist) for certain parity objectives, can be chosen MD, even if the MDP is infinitely branching.
For our proofs we develop some technical results that are of independent interest. We generalize Ornstein's plastering construction [20] from reachability to tail objectives and thus obtain a general tool to infer uniformly ε-optimal MD strategies from non-uniform ones (cf. Theorem 4.7). Secondly, in Section 6 we develop the notion of the conditioned MDP (cf. [17]). For tail objectives, this allows us to obtain uniformly ε-optimal MD strategies w.r.t. multiplicative errors from those with merely additive errors.
2 Preliminaries
A probability distribution over a countable set S is a function f: S → [0,1] with Σ_{s∈S} f(s) = 1. We write D(S) for the set of all probability distributions over S.
Markov Decision Processes. We define Markov decision processes (MDPs for short) over countably infinite state spaces as tuples M = (S, S_□, S_◯, →, P), where S is the countable set of states, partitioned into a set S_□ of controlled states and a set S_◯ of random states. The transition relation is → ⊆ S × S, and P: S_◯ → D(S) is a probability function. We write s → s′ if (s, s′) ∈ →, and refer to s′ as a successor of s. We assume that every state has at least one successor. The probability function assigns to each random state s ∈ S_◯ a probability distribution P(s) over its set of successors. A sink is a subset T ⊆ S closed under the → relation.
An MDP is acyclic if the underlying graph (S, →) is acyclic. It is finitely branching if every state has finitely many successors, and infinitely branching otherwise. An MDP without controlled states (S_□ = ∅) is a Markov chain.
Strategies and Probability Measures. A run is an infinite sequence ρ = s_0 s_1 ⋯ of states such that s_i → s_{i+1} for all i; a partial run is a finite prefix of a run. We write ρ(i) = s_i and say that a (partial) run visits s if s = ρ(i) for some i. It starts in s if s = ρ(0).
A strategy is a function σ that assigns to every partial run ρs ending in a controlled state s a distribution over the successors of s. We write Σ_M for the set of all strategies in M. A strategy σ and an initial state s_0 induce a standard probability measure on sets of infinite runs. We write P_{M,s_0,σ} for the probability of a measurable set of runs starting from s_0. It is defined for the cylinder sets s_0 s_1 ⋯ s_n S^ω as P_{M,s_0,σ}(s_0 s_1 ⋯ s_n S^ω) = Π_{i<n} σ̄(s_0 ⋯ s_i)(s_{i+1}), where σ̄ is the map that extends σ by σ̄(ρs) = P(s) for all partial runs ρs ending in a random state s. By Carathéodory's theorem [4], the measure for cylinders extends uniquely to a probability measure on all measurable subsets of s_0 S^ω. We write E_{M,s_0,σ} for the expectation w.r.t. P_{M,s_0,σ}.
Strategy Classes. Strategies are in general randomized (R) in the sense that they take values in D(S). A strategy σ is deterministic (D) if σ(ρ) is a Dirac distribution for all partial runs ρ.
We formalize the amount of memory needed to implement strategies in Appendix A. The two classes of memoryless and 1-bit strategies are central to this paper. A strategy σ is memoryless (M) if it bases its decision only on the last state of the run: σ(ρs) = σ(s) for all partial runs ρs. We may view M-strategies as functions σ: S_□ → D(S). A 1-bit strategy may base its decision also on a memory mode m ∈ {0,1}. Formally, a 1-bit strategy is given as a tuple (u, m_0), where m_0 ∈ {0,1} is the initial memory mode and u: {0,1} × S → D({0,1} × S) is an update function such that
• for all controlled states s ∈ S_□, the distribution u(m, s) is over {0,1} × {s′ | s → s′};
• for all random states s ∈ S_◯, the marginal of u(m, s) on the successor states coincides with P(s), i.e., Σ_{m′∈{0,1}} u(m, s)(m′, s′) = P(s)(s′) for every successor s′.
Note that this definition allows for updating the memory mode upon visiting random states. We write σ[m] for the strategy obtained from σ by setting the initial memory mode to m.
MD strategies are both memoryless and deterministic; and deterministic 1-bit strategies are both deterministic and 1-bit.
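The two memory regimes can be represented very simply in code. The following Python sketch (illustrative; the class and example choices are ours, not from the paper) shows a deterministic 1-bit strategy as an initial mode plus an update function, in contrast to an MD strategy, which is just a function from states to successors.

```python
from typing import Callable, Tuple

State, Mode = str, int            # the memory mode is a single bit: 0 or 1

class OneBitStrategy:
    """A deterministic 1-bit strategy: given the current memory mode and state,
    the update function returns the new mode and the chosen successor.  At
    random states the successor is drawn by the MDP itself; the strategy may
    still update its memory bit, as allowed by the definition above."""
    def __init__(self, initial_mode: Mode,
                 update: Callable[[Mode, State], Tuple[Mode, State]]):
        self.mode = initial_mode
        self.update = update

    def choose(self, s: State) -> State:
        self.mode, successor = self.update(self.mode, s)
        return successor

# Example: alternate between two successors "a" and "b" of every controlled state,
# flipping the bit on each visit (purely illustrative).
toggle = OneBitStrategy(0, lambda m, s: (1 - m, "a" if m == 0 else "b"))
print(toggle.choose("c0"), toggle.choose("c0"))   # prints: a b
```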
Objectives. The objective of the controller is determined by a predicate on infinite runs. We assume familiarity with the syntax and semantics of the temporal logic LTL [9]. Formulas are interpreted on the underlying structure of the MDP M. We use [[φ]]_{M,s} to denote the set of runs starting from s that satisfy the LTL formula φ, which is a measurable set [27]. We also write P_{M,s,σ}(φ) for P_{M,s,σ}([[φ]]_{M,s}). Where it does not cause confusion, we identify φ and [[φ]]_{M,s} and just write P_{M,s,σ}(φ).
Given a set T ⊆ S of states, the reachability objective Reach(T) is the set of runs that visit T at least once. The safety objective Safety(T) is the set of runs that never visit T.
Let C ⊆ ℕ be a finite set of colors. A color function Col: S → C assigns to each state s its color Col(s). The parity objective, written as Parity(Col), is the set of infinite runs such that the largest color that occurs infinitely often along the run is even. To define this formally, for c ∈ C let S_c be the set of states in S with color c, and let S_{>c} be the set of states with a color strictly greater than c. Then
Parity(Col) = ⋃_{c ∈ C, c even} ( GF S_c ∧ FG ¬S_{>c} ).
We write {0,1}-Parity, {1,2}-Parity, {0,1,2}-Parity, etc., for the parity objectives with the respective set of colors. The classical Büchi and co-Büchi objectives correspond to {1,2}-Parity and {0,1}-Parity, respectively.
An objective φ is called a tail objective (in M) iff membership of a run in φ does not depend on any finite prefix of the run, i.e., removing or prepending a finite prefix does not affect membership. For every coloring Col, Parity(Col) is tail. Reachability objectives are not always tail, but in MDPs where the target set T is a sink, Reach(T) is tail.
Optimal and ε-optimal Strategies. Given an objective φ, the value of a state s in an MDP M, denoted by val_{M,φ}(s), is the supremum probability of achieving φ. Formally, we have val_{M,φ}(s) = sup_{σ∈Σ} P_{M,s,σ}(φ), where Σ is the set of all strategies. For ε ≥ 0 and a state s, we say that a strategy σ is ε-optimal from s iff P_{M,s,σ}(φ) ≥ val_{M,φ}(s) − ε. A 0-optimal strategy is called optimal. An optimal strategy is almost-surely winning iff val_{M,φ}(s) = 1.
Considering an MD strategy as a function σ: S_□ → S, and ε ≥ 0, σ is uniformly ε-optimal (resp. uniformly optimal) if it is ε-optimal (resp. optimal) from every state s ∈ S.
Throughout the paper, we may drop subscripts and superscripts from the notation if they are understood from the context. The missing proofs can be found in the appendix.
3 Transience and Universally Transient MDPs
In this section we define the transience property for MDPs, a natural generalization of the well-understood concept of transient Markov chains. We enumerate crucial characteristics of this objective and define the notion of universally transient MDPs.
Fix a countable MDP M = (S, S_□, S_◯, →, P). Define the transience objective, denoted by Transience, to be the set of runs that do not visit any state of S infinitely often, i.e., Transience = ⋀_{s∈S} FG ¬s.
The objective Transience is tail, as it is closed under removing finite prefixes of runs. Also note that Transience cannot be encoded in a parity objective.
We call M universally transient iff, for all states s and for all strategies σ, the property Transience holds almost-surely from s, i.e., P_{M,s,σ}(Transience) = 1.
The MDP in Figure 1 models the classical Gambler's Ruin Problem with restart; see [10, Chapter 14]. Let p denote the probability of increasing the wealth by one in each step. It is well-known that if the controller starts with wealth k ≥ 1 and if p ≤ 1/2, the probability of ruin (visiting the state 0) is 1. Consequently, due to the restart, the probability of re-visiting 0 infinitely often is 1, implying that Transience is violated almost surely. In contrast, in the case p > 1/2, for all states s, the probability of re-visiting s is strictly below 1. Hence, the property Transience holds almost-surely. This example indicates that the transience property depends on the probability values of the transitions and not just on the underlying transition graph, and thus may require arithmetic reasoning. In particular, the MDP in Figure 1 is universally transient iff p > 1/2.
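The Gambler's Ruin facts used above can be checked numerically. In the sketch below, the up-step probability p and the start wealth k are our own parametrisation of Figure 1 (which is not reproduced here); the closed-form ruin probability is the classical one.

```python
import random

def ruin_probability(p: float, k: int) -> float:
    """Probability of ever hitting wealth 0 from wealth k for the random walk that
    gains 1 with probability p and loses 1 with probability 1 - p
    (classical formula: 1 if p <= 1/2, else ((1-p)/p)**k)."""
    return 1.0 if p <= 0.5 else ((1 - p) / p) ** k

def estimate_ruin(p: float, k: int, runs: int = 5_000, horizon: int = 2_000) -> float:
    """Monte Carlo estimate, truncated at a finite horizon (so it slightly
    underestimates the true ruin probability)."""
    ruined = 0
    for _ in range(runs):
        wealth = k
        for _ in range(horizon):
            wealth += 1 if random.random() < p else -1
            if wealth == 0:
                ruined += 1
                break
    return ruined / runs

for p in (0.4, 0.6):
    print(p, ruin_probability(p, 3), estimate_ruin(p, 3))
# For p <= 1/2 ruin is certain, so with the restart edge the state 0 is revisited
# infinitely often almost surely and Transience fails; for p > 1/2 the ruin
# probability (0.4/0.6)^3 ~ 0.30 is below 1 and every state is transient.
```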
In general, optimal strategies for Transience need not exist:
Lemma 3.1.
There exists a finitely branching countable MDP M with an initial state s_0 such that
• val_{M,Transience}(s) = 1 for all controlled states s, but
• there does not exist any optimal strategy σ, i.e., no strategy σ with P_{M,s_0,σ}(Transience) = 1.
Proof 3.2.
Consider the countable MDP sketched in Figure 2. One family of states forms an acyclic ladder: each of these states has a unique successor, the next state up the ladder, so staying on this ladder guarantees Transience; hence all these states have value 1. There is a sink state, whose value is 0. A further family of random exit states connects the construction below to the acyclic ladder and to the sink: the n-th exit state moves to the acyclic ladder with a probability that tends to 1 as n grows, and to the sink with the remaining probability, so its value also tends to 1.
The remaining states form a second, "recurrent" ladder of decisions, obtained by interleaving controlled and random states: the n-th controlled state on this ladder has two successors, the next random state on the ladder and the n-th exit state. In the random states, as in Gambler's Ruin with a fair coin, the run moves one controlled state up or down the ladder, each with probability 1/2. Thus, in each controlled state, the controller decides either to stay on the ladder or to leave it via the corresponding exit state. As in Figure 1, if the controller stays on the ladder forever, the probability of Transience is 0.
Starting in the first controlled state of the recurrent ladder, for every n, the strategy that stays on the ladder until visiting the n-th controlled state (which happens eventually almost surely) and then leaves the ladder achieves Transience with a probability that tends to 1 as n grows. Hence, the value of the initial state is 1.
Recall that Transience cannot be achieved with a positive probability by staying on the ladder forever. But any strategy that leaves the ladder with positive probability incurs a positive probability of falling into the sink, and is therefore not optimal either. Thus there is no optimal strategy for Transience.
Reduction to Finitely Branching MDPs. In our main results, we will prove that for the Transience property there always exist ε-optimal MD strategies in finitely branching countable MDPs, and that if an optimal strategy exists, then there also exists an optimal MD strategy. We generalize these results to infinitely branching countable MDPs by the following reduction:
Lemma 3.5.
Given an infinitely branching countable MDP M with an initial state s_0, there exists a finitely branching countable MDP M′ with a set of states S′ ⊇ S such that
1. each strategy σ in M is mapped to a strategy σ′ in M′ that achieves Transience from s_0 with at least the same probability, and
2. conversely, every MD strategy σ′ in M′ is mapped to an MD strategy σ in M that achieves Transience from s_0 with at least the probability that σ′ achieves in M′.
Proof 3.6 (Proof sketch).
See Appendix B for the complete construction. In order to construct M′ from M, for each controlled state in M that has infinitely many successors, a "recurrent ladder" is introduced; see Figure 3. Since the probability of Transience is 0 for all runs that eventually stay forever on a recurrent ladder, the controller has to exit such ladders in order to play well for Transience. Infinitely branching random states can be dealt with in a simpler way.
Properties of Universally Transient MDPs.
Notice that acyclicity implies universal transience, but not vice-versa.
Lemma 3.7.
For every countable MDP M, the following conditions are equivalent.
1. M is universally transient, i.e., Transience holds almost surely from every state under every strategy.
2. For every initial state s_0 and state s, the objective of re-visiting s infinitely often has value zero.
3. For every state s, the value of the objective of re-visiting s (when starting from s) is strictly below 1.
4. For every state s there exists a finite bound K_s such that, for every initial state s_0 and every strategy σ from s_0, the expected number of visits to s is at most K_s.
5. For all states s_0 and s, under every strategy from s_0, the expected number of visits to s is finite.
Proof 3.8.
Towards (1) ⇒ (2), consider an arbitrary strategy σ from the initial state s_0 and some state s. By (1), Transience holds almost surely under σ, and thus the probability of visiting s infinitely often is 0, which implies (2).
Towards (2) ⇒ (1), consider an arbitrary strategy σ from the initial state s_0. By (2), every individual state is visited infinitely often with probability 0, and thus, since the state space is countable, with probability 1 no state is visited infinitely often, i.e., (1) holds.
We now show the implications (2) ⇒ (3) ⇒ (4) ⇒ (5) ⇒ (2).
Towards (2) ⇒ (3), suppose that (3) does not hold, i.e., the value of re-visiting some state s (from s) equals 1. Then for every k there is a strategy σ_k from s that returns to s with probability at least 1 − 2^{-k}ε. Let ε > 0. We define the strategy σ to play like σ_k between the k-th and (k+1)-th visit to s. Then the probability of visiting s infinitely often is at least Π_{k≥1}(1 − 2^{-k}ε) ≥ 1 − ε > 0, which contradicts (2), where the initial state is taken to be s itself.
Towards (3) ⇒ (4), let c_s < 1 be the value of re-visiting s. Regardless of s_0 and the chosen strategy, the probability of visiting s at least k+1 times is at most c_s^k, so the expected number of visits to s is upper-bounded by K_s = Σ_{k≥0} c_s^k = 1/(1 − c_s).
The implication (4) ⇒ (5) holds trivially.
Towards (5) ⇒ (2), suppose that (2) does not hold, i.e., there exist states s_0, s and a strategy σ such that s is visited infinitely often with positive probability. Thus the expected number of visits to s is infinite, which contradicts (5).
We remark that if an MDP is not universally transient (unlike in Lemma 3.7(5)), then for a strategy σ the expected number of visits to some state can be infinite, even if σ attains Transience almost surely.
Consider an MDP with controlled states s, t_1, t_2, …, initial state s, and transitions s → s, s → t_1 and t_i → t_{i+1} for every i. We define a strategy σ that, while in state s, proceeds in rounds i = 1, 2, …. In the i-th round it tosses a fair coin. If Heads, then it goes to t_1. If Tails, then it takes the self-loop at s a certain number of times, depending on i, and then proceeds to round i+1. In every round the probability of going to t_1 is 1/2, and therefore the probability of staying in s forever is 0. Thus σ attains Transience almost surely. However, since the number of self-loops taken in round i grows rapidly with i, the expected number of visits to s is infinite.
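The computation behind this remark is short. The exact number of self-loops per round is elided above; the following sketch assumes it is 2^i in round i (any sufficiently fast-growing choice works equally well).

\[
  \Pr(\text{stay in } s \text{ forever}) \;=\; \prod_{i\ge 1}\Pr(\text{Tails in round } i)\;=\;\lim_{n\to\infty}2^{-n}\;=\;0,
\]
\[
  \mathbb{E}(\#\text{visits to } s)\;\ge\;\sum_{i\ge 1}\Pr(\text{Tails in rounds } 1,\dots,i)\cdot 2^{i}\;=\;\sum_{i\ge 1}2^{-i}\cdot 2^{i}\;=\;\infty.
\]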
4 MD Strategies for Transience
We show that there exist uniformly ε-optimal MD strategies for Transience and that optimal strategies, where they exist, can also be chosen MD.
First we show that there exist ε-optimal deterministic 1-bit strategies for Transience (in Corollary 4.3) and then we show how to dispense with the 1-bit memory (in Lemma 4.5).
It was shown in [14] that there exist ε-optimal deterministic 1-bit strategies for Büchi objectives in acyclic countable MDPs (though not in general MDPs). These 1-bit strategies will be similar to the 1-bit strategies for Transience that we aim for in (not necessarily acyclic) countable MDPs. In Lemma 4.1 below we first strengthen the result from [14] and construct ε-optimal deterministic 1-bit strategies for objectives of the form Transience ∧ Büchi(F). From this we obtain deterministic 1-bit strategies for Transience (Corollary 4.3).
Lemma 4.1.
Let M be a countable MDP, I a finite set of initial states, F ⊆ S a set of states, and ε > 0. Then there exists a deterministic 1-bit strategy for Transience ∧ Büchi(F) that is ε-optimal from every s ∈ I.
Proof 4.2 (Proof sketch).
The full proof can be found in Appendix C. It follows the proof of [14, Theorem 5], which considers Büchi conditions for acyclic (and hence universally transient) MDPs. The only part of that proof that requires modification is [14, Lemma 10], which is replaced here by Lemma C.2 to deal with general MDPs.
In short, from every s ∈ I there exists an ε-optimal strategy σ_s for Transience ∧ Büchi(F). We observe the behavior of the finitely many σ_s for s ∈ I on an infinite, increasing sequence of finite subsets of S. Based on Lemma C.2, we can define a second, stronger objective and show that the σ_s attain it with only a small loss in probability. We then construct a deterministic 1-bit strategy that is optimal for the stronger objective from all s ∈ I and thus close to optimal for Transience ∧ Büchi(F). Since the loss can be made arbitrarily small, the result follows.
Unlike for the objective Transience alone (see below), the 1-bit memory is strictly necessary for the objective Transience ∧ Büchi(F) in Lemma 4.1. The 1-bit lower bound for Büchi objectives in [14] holds even for acyclic MDPs, where Transience is trivially true.
Corollary 4.3.
Let M be a countable MDP, I a finite set of initial states, F ⊆ S a set of states and ε > 0.
1. If the values of Büchi(F) and Transience ∧ Büchi(F) coincide on I, then there exists a deterministic 1-bit strategy for Büchi(F) that is ε-optimal from every s ∈ I.
2. If M is universally transient, then there exists a deterministic 1-bit strategy for Büchi(F) that is ε-optimal from every s ∈ I.
3. There exists a deterministic 1-bit strategy for Transience that is ε-optimal from every s ∈ I.
Proof 4.4.
Towards (1), since the values of Büchi(F) and Transience ∧ Büchi(F) coincide on I, strategies that are ε-optimal for Transience ∧ Büchi(F) are also ε-optimal for Büchi(F). Thus the result follows from Lemma 4.1.
Item (2) follows directly from (1), since the precondition always holds in universally transient MDPs.
Towards (3), let F = S. Then we have Transience ∧ Büchi(S) = Transience, and we obtain from Lemma 4.1 that there exists a deterministic 1-bit strategy for Transience that is ε-optimal from every s ∈ I.
Note that every acyclic MDP is universally transient, and thus Corollary 4.3(2) implies the upper bound on the strategy complexity of Büchi objectives from [14] (but not vice-versa).
In the next step we show how to dispense with the 1-bit memory and obtain non-uniform ε-optimal MD strategies for Transience.
Lemma 4.5.
Let M be a countable MDP with initial state s_0, and ε > 0. There exists an MD strategy σ′ that is ε-optimal for Transience from s_0, i.e., P_{M,s_0,σ′}(Transience) ≥ val_{M,Transience}(s_0) − ε.
Proof 4.6.
By Lemma 3.5 it suffices to prove the property for finitely branching MDPs. Thus, without restriction, in the rest of the proof we assume that M is finitely branching.
We instantiate Corollary 4.3(3) with I = {s_0} and obtain a deterministic 1-bit strategy σ for Transience that is ε-optimal from s_0.
We now construct a slightly modified MDP M′ as follows. Let Z ⊆ S be the subset of states where σ attains value zero for Transience in both memory modes, i.e., where both σ[0] and σ[1] achieve Transience with probability 0. We obtain M′ from M by making all states in Z losing sinks (for Transience), by deleting all outgoing edges and adding a self-loop instead. It follows that
(1) P_{M′,s_0,σ}(Transience) = P_{M,s_0,σ}(Transience), and
(2) P_{M,s_0,τ}(Transience) ≥ P_{M′,s_0,τ}(Transience) for every strategy τ on M′ (extended arbitrarily to M).
In the following we show that it is possible to play in M′ in such a way that, for every state s, the expected number of visits to s is finite. We obtain a deterministic 1-bit strategy σ″ in M′ by modifying σ as follows. In every state s and memory mode m where σ[m] attains probability 0 for Transience but σ[1−m] attains a positive probability, the strategy σ″ sets the memory bit to 1−m. (Note that only states outside Z can be affected by this change.) It follows that
(3) P_{M′,s_0,σ″}(Transience) ≥ P_{M′,s_0,σ}(Transience).
Moreover, from all states in S ∖ Z, the strategy σ″ attains a strictly positive probability of Transience in M′ in both memory modes, i.e., for all s ∈ S ∖ Z and m ∈ {0,1} we have P_{M′,s,σ″[m]}(Transience) > 0.
Let p(s,m) be the probability, when playing σ″[m] from state s, of reaching s again in the same memory mode m. For every s ∈ S ∖ Z we have p(s,m) < 1, since otherwise s would almost surely be re-visited infinitely often, contradicting the positive probability of Transience.
Let χ(s) be the expected number of visits to state s when playing σ″ from s_0 in M′, and χ(s,m) the expected number of visits to s in memory mode m. For all s ∈ S ∖ Z we have that
(4) χ(s) = χ(s,0) + χ(s,1) ≤ 1/(1 − p(s,0)) + 1/(1 − p(s,1)) < ∞,
where the first equality holds by linearity of expectation. Thus the expected number of visits to s is finite.
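The middle inequality of (4), as reconstructed above, is the standard geometric-series bound; a sketch of the argument, in the notation of the proof, reads as follows.

\[
  \chi(s,m)\;=\;\sum_{k\ge 1}\Pr(\text{at least } k \text{ visits to } (s,m))\;\le\;\sum_{k\ge 1}p(s,m)^{\,k-1}\;=\;\frac{1}{1-p(s,m)}\;<\;\infty,
\]
\[
  \chi(s)\;=\;\chi(s,0)+\chi(s,1)\;\le\;\frac{1}{1-p(s,0)}+\frac{1}{1-p(s,1)}.
\]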
Now we upper-bound the probability of visiting Z. We have P_{M′,s_0,σ″}(Transience) ≥ val_{M,Transience}(s_0) − ε by (3), (1) and the ε-optimality of σ. Since states in Z are losing sinks in M′, it follows that
(5) P_{M′,s_0,σ″}(Reach(Z)) ≤ 1 − val_{M,Transience}(s_0) + ε.
We now augment the MDP M′ by assigning costs to transitions as follows. Let s_1, s_2, … be an enumeration of the state space, i.e., a bijection between the states and the positive integers. Let V ⊆ S ∖ Z be the subset of states that are visited with non-zero probability when playing σ″ from s_0. Each transition (s, s′) is assigned a cost:
• If s′ ∈ Z and s ∉ Z, we assign cost 1.
• If s′ = s_i ∉ Z and s_i ∉ V, we assign cost ε·2^{−i}.
• If s′ = s_i ∉ Z and s_i ∈ V, we assign cost ε·2^{−i}/χ(s_i). This is well defined, since 0 < χ(s_i) < ∞.
• If s, s′ ∈ Z, we assign cost 0.
Note that all transitions leading to states in V are assigned a non-zero cost, since χ(s_i) is finite by (4).
When playing σ″ from s_0 in M′, the expected total cost is upper-bounded by P_{M′,s_0,σ″}(Reach(Z)) plus Σ_i χ(s_i)·ε·2^{−i}/χ(s_i). The first part is at most 1 − val_{M,Transience}(s_0) + ε by (5), and the second part is at most ε, since χ(s_i) < ∞ by (4). Therefore the expected total cost is at most 1 − val_{M,Transience}(s_0) + 2ε, i.e., σ″ witnesses that it is possible to attain a finite expected total cost, upper-bounded by this quantity.
Now we define our MD strategy σ′. Let σ′ be an MD strategy on M′ (from s_0) that minimizes the expected total cost. It exists, as a finite expected cost is attainable and M′ is finitely branching; see [21, Theorem 7.3.6].
We now show that σ′ attains Transience with high probability in M′ (and in M). Since σ′ is cost-optimal, its expected cost from s_0 is upper-bounded by that of σ″, i.e., by 1 − val_{M,Transience}(s_0) + 2ε. Since the cost of entering Z is 1, this expected cost also bounds the probability of ever entering Z, and thus
(6) P_{M′,s_0,σ′}(Reach(Z)) ≤ 1 − val_{M,Transience}(s_0) + 2ε.
For every state s ∉ Z, all transitions into s have the same fixed non-zero cost. Thus every run that visits some state outside Z infinitely often has infinite cost. Since the expected cost of playing σ′ from s_0 is finite, such runs form a null set, i.e.,
(7) P_{M′,s_0,σ′}(some state outside Z is visited infinitely often) = 0.
Thus
P_{M,s_0,σ′}(Transience) ≥ P_{M′,s_0,σ′}(Transience)   (by (2))
  ≥ 1 − P_{M′,s_0,σ′}(Reach(Z))   (by (7): runs in M′ that avoid Z almost surely visit no state infinitely often)
  ≥ val_{M,Transience}(s_0) − 2ε   (by (6)).
Since ε > 0 can be chosen arbitrarily small, the claim follows.
Now we lift the result of Lemma 4.5 from non-uniform to uniform strategies (and to optimal strategies) and obtain the following theorem. The proof is a generalization of a “plastering” construction by Ornstein [20] (see also [16]) from reachability to tail objectives, which works by fixing MD strategies on ever expanding subsets of the state space.
Theorem 4.7.
Let M be a countable MDP, and let φ be an objective that is tail in M. Suppose that for every state s and every ε > 0 there exist ε-optimal MD strategies for φ from s. Then:
1. There exist uniformly ε-optimal MD strategies for φ, for every ε > 0.
2. There exists a single MD strategy that is optimal from every state that has an optimal strategy.
Theorem 4.8.
In every countable MDP, for every ε > 0, there exist uniformly ε-optimal MD strategies for Transience. Moreover, there exists a single MD strategy that is optimal for Transience from every state that has an optimal strategy.
Proof 4.9.
Immediate from Lemma 4.5 and Theorem 4.7, since Transience is a tail objective.
5 Strategy Complexity in Universally Transient MDPs
The strategy complexity of parity objectives in general countable MDPs is known [15]. Here we show that some parity objectives have a lower strategy complexity in universally transient MDPs. It is known [14] that there are acyclic (and hence universally transient) MDPs where ε-optimal strategies for Büchi objectives (and optimal strategies, where they exist, resp.) require at least 1 bit of memory.
We show that, for all simpler parity objectives in the Mostowski hierarchy [19], universally transient MDPs admit uniformly (ε-)optimal MD strategies (unlike general MDPs [15]). These results (Theorems 5.3 and 5.5) ultimately rely on the existence of uniformly ε-optimal MD strategies for safety objectives. While such strategies always exist in finitely branching MDPs (simply pick a value-maximal successor), this is not the case for infinitely branching MDPs [17]. However, we show that universal transience implies the existence of uniformly ε-optimal MD strategies for safety objectives even in infinitely branching MDPs.
Theorem 5.1.
For every universally transient countable MDP, every safety objective and every ε > 0, there exists a uniformly ε-optimal MD strategy.
Proof 5.2.
Let M be a universally transient MDP and ε > 0. Assume w.l.o.g. that the target set T of the safety objective Safety(T) is a (losing) sink, and let s_1, s_2, … be an enumeration of the state space S.
By Lemma 3.7(3), for every state s_i the value c_i of re-visiting s_i satisfies c_i < 1, and thus 1/(1 − c_i) < ∞. This means that, independent of the chosen strategy, c_i upper-bounds the chance of returning to s_i, and 1/(1 − c_i) bounds the expected number of visits to s_i.
Suppose that σ is an MD strategy which, at any controlled state s_i, picks a successor s′ with val(s′) ≥ val(s_i) − ε(1 − c_i)2^{−i}. This is possible even if M is infinitely branching, by the definition of the value and since the slack ε(1 − c_i)2^{−i} is strictly positive (a sketch of this successor choice is given below). We show that P_{M,s,σ}(Safety(T)) ≥ val(s) − ε holds for every initial state s, which implies the claim of the theorem.
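The following Python sketch spells out this successor choice. The dictionaries for the values, the return values from Lemma 3.7(3) and the enumeration are assumed as inputs and restricted to a finite fragment, so this is an illustration of the rule rather than an executable construction of the full strategy.

```python
from typing import Callable, Dict, List

def md_safety_strategy(
    successors: Dict[str, List[str]],   # successors of each controlled state (finite fragment)
    val: Dict[str, float],              # value w.r.t. the safety objective
    c_ret: Dict[str, float],            # value of re-visiting the state; < 1 by Lemma 3.7(3)
    index: Dict[str, int],              # enumeration s_1, s_2, ... of the state space (from 1)
    epsilon: float,
) -> Callable[[str], str]:
    """Return an MD strategy: at state s_i, pick a successor whose value is within
    the slack epsilon*(1-c_i)*2^-i of val(s_i).  The expected total value loss is
    then at most epsilon, since the expected number of visits to s_i is at most
    1/(1 - c_i)."""
    def choose(s: str) -> str:
        slack = epsilon * (1 - c_ret[s]) * 2 ** (-index[s])
        for t in successors[s]:
            if val[t] >= val[s] - slack:
                return t                # such a successor exists by the definition of the value
        raise AssertionError("no suitable successor in the given finite fragment")
    return choose
```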
Towards this, we define a function that labels each transition in the MDP with a real-valued cost: for every controlled transition (s, s′), let cost(s, s′) := val(s) − val(s′); random transitions have cost zero. We will argue that, when playing σ from any start state s, its attainment w.r.t. the safety objective is at least the value of s minus the expected total cost, and that this expected total cost is bounded by ε.
For any k, let us write X_k for the random variable denoting the state just after step k, and c_k for the cost of step k in a random run. We observe that under σ the expected total cost is bounded in the limit, i.e.,
(8) lim_{n→∞} E(Σ_{k<n} c_k) ≤ ε.
We moreover note that for every n,
(9) E(val(X_n)) ≥ val(s) − E(Σ_{k<n} c_k).
Full proofs of the above two equations can be found in the appendix. Together they imply
(10) lim_{n→∞} E(val(X_n)) ≥ val(s) − ε.
Finally, to show the claim, let Y_n be the random variable that indicates that the n-th state is not in the target set T, i.e., Y_n = 1 if X_n ∉ T and Y_n = 0 otherwise. Note that val(X_n) ≤ Y_n, because target states have value 0. We have:
P_{M,s,σ}(Safety(T)) = P_{M,s,σ}(∀n: X_n ∉ T)   (semantics of Safety(T))
  = lim_{n→∞} P_{M,s,σ}(X_n ∉ T)   (continuity of measures; T is a sink)
  = lim_{n→∞} E(Y_n)   (definition of Y_n)
  ≥ lim_{n→∞} E(val(X_n))   (as val(X_n) ≤ Y_n)
  ≥ val(s) − ε   (Equation 10).
We can now combine Theorem 5.1 with the results from [15] to show the existence of MD strategies assuming universal transience.
Theorem 5.3.
For universally transient MDPs, optimal strategies for the parity objectives considered here, where they exist, can be chosen uniformly MD.
Formally, let M be a universally transient MDP with state set S, let Col be a coloring defining such a parity objective φ. There exists an MD strategy σ that is optimal from every state that has an optimal strategy, i.e., P_{M,s,σ}(φ) = val_{M,φ}(s) for every such state s.
Proof 5.4.
Let M̂ be the conditioned version of M w.r.t. the parity objective in the sense of [15, Def. 19]. By Lemma 6.8, M̂ is still a universally transient MDP, and therefore, by Theorem 5.1, there exist uniformly ε-optimal MD strategies in M̂ for every safety objective and every ε > 0. The claim now follows from [15, Theorem 22].
Theorem 5.5.
For every universally transient countable MDP M, every co-Büchi objective and every ε > 0, there exists a uniformly ε-optimal MD strategy.
Formally, let M be a universally transient countable MDP with states S, let Col: S → {0,1} be a coloring, and let ε > 0.
There exists an MD strategy σ such that, for every state s, P_{M,s,σ}(Parity(Col)) ≥ val_{M,Parity(Col)}(s) − ε.
Proof 5.6.
This directly follows from Theorem 5.1 and [15, Theorem 25].
6 The Conditioned MDP
Given an MDP M and an objective φ that is tail in M, a construction of a conditioned MDP was provided in [17, Lemma 6] that, very loosely speaking, "scales up" the probability of φ so that any strategy is optimal in M if it is almost surely winning in the conditioned MDP. For certain tail objectives, this construction was used in [17] to reduce the sufficiency of MD strategies for optimal strategies to the sufficiency of MD strategies for almost surely winning strategies, which is a special case that may be easier to handle.
However, the construction was restricted to states that have an optimal strategy. In fact, states in M that do not have an optimal strategy do not appear in the conditioned MDP of [17]. In the following, we lift this restriction by constructing a more general version of the conditioned MDP, called C. The MDP C contains all states from M that have a positive value w.r.t. φ in M. Moreover, all these states have value 1 in C. It then follows from Lemma 6.2(3) below that an ε-optimal strategy in C is multiplicatively ε-optimal in M. This allows us to reduce the sufficiency of MD strategies for ε-optimal strategies to the sufficiency of MD strategies for ε-optimal strategies for states with value 1. In fact, it also follows that if an MD strategy σ is uniformly ε-optimal in C, then it is multiplicatively uniformly ε-optimal in M, i.e., P_{M,s,σ}(φ) ≥ (1 − ε)·val_{M,φ}(s) holds for all states s.
Definition 6.1.
For an MDP M = (S, S_□, S_◯, →, P) and an objective φ that is tail in M, define the conditioned version of M w.r.t. φ to be the MDP C whose state space is
S′ = {s ∈ S | val_{M,φ}(s) > 0} ∪ {⊥}
for a fresh state ⊥, and whose transitions and probabilities are obtained from those of M by conditioning on φ (the probabilities of random transitions are reweighted by the values of their target states; see the sketch below).
The conditioned MDP is well-defined. Indeed, as φ is tail in M, for any random state s we have val_{M,φ}(s) = Σ_{s′} P(s)(s′)·val_{M,φ}(s′), and so if val_{M,φ}(s) > 0 then the reweighted probabilities of the successors with positive value sum up to 1.
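To make the reweighting concrete, the following display spells out the conditioning at random states. It is our reconstruction following [17] and the well-definedness remark above; the treatment of controlled states and of the fresh state ⊥ is omitted, so this should be read as a sketch rather than the verbatim definition.

\[
  S' \;=\; \{\, s \in S \mid \mathtt{val}_{\mathcal{M},\varphi}(s) > 0 \,\} \cup \{\bot\},
  \qquad
  P'(s)(s') \;=\; \frac{P(s)(s')\cdot \mathtt{val}_{\mathcal{M},\varphi}(s')}{\mathtt{val}_{\mathcal{M},\varphi}(s)}
  \quad\text{for random } s \in S',\ s' \in S' \setminus \{\bot\}.
\]

Since φ is tail, these probabilities sum to 1 over S′ ∖ {⊥}, so no probability mass needs to be diverted to ⊥ at random states.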
Lemma 6.2.
Let M be an MDP, and let φ be an objective that is tail in M. Let C be the conditioned version of M w.r.t. φ. Let s_0 be a state with val_{M,φ}(s_0) > 0. Let σ be a strategy in C, and note that σ can be transformed to a strategy in M in a natural way. Then:
1. For every partial run ρ in C starting at s_0, the probability of ρ under σ in C is determined by the probability of its contraction under σ in M together with the values (w.r.t. φ) of the states involved; here, the contraction of a partial run in C is the partial run in M obtained by deleting all auxiliary states introduced in the construction of C.
2. Likewise, for all measurable sets of runs, the measure under σ in C is determined by the measure of the contracted runs under σ in M, where the contraction deletes, in all runs, all auxiliary states of C.
3. We have P_{M,s_0,σ}(φ) = val_{M,φ}(s_0)·P_{C,s_0,σ}(φ). In particular, val_{C,φ}(s_0) = 1, and, for any ε ≥ 0, strategy σ is ε-optimal from s_0 in C if and only if it is multiplicatively ε-optimal from s_0 in M.
Lemma 6.2(3) provides a way of proving the existence of MD strategies that attain, for each state s, a fixed fraction (arbitrarily close to 1) of the value of s:
Theorem 6.3.
Let M be an MDP, and let φ be an objective that is tail in M. Let C be the conditioned version of M w.r.t. φ. Let ε > 0. Any MD strategy σ that is uniformly ε-optimal in C (i.e., P_{C,s,σ}(φ) ≥ 1 − ε holds for all states s of C other than ⊥) is multiplicatively ε-optimal in M (i.e., P_{M,s,σ}(φ) ≥ (1 − ε)·val_{M,φ}(s) holds for all s ∈ S).
Proof 6.4.
Immediate from Lemma 6.2.3.
As an application of Theorem 6.3, we can strengthen the first statement of Theorem 4.8 towards multiplicatively (see Theorem 6.3) uniformly ε-optimal MD strategies for Transience.
Corollary 6.5.
In every countable MDP, for every ε > 0, there exist multiplicatively uniformly ε-optimal MD strategies for Transience.
Proof 6.6.
Let M be a countable MDP, and let C be its conditioned version w.r.t. Transience. Let ε > 0. By Theorem 4.8, there is a uniformly ε-optimal MD strategy σ for Transience in C. By Theorem 6.3, strategy σ is multiplicatively uniformly ε-optimal in M.
The following lemma, stating that universal transience is closed under “conditioning”, is needed for the proof of Lemma 6.8 below.
Lemma 6.7.
Let M be an MDP, and let φ be an objective that is tail in M. Let C′ be the conditioned version of M w.r.t. φ, where the fresh state ⊥ is replaced by an infinite chain of fresh states ⊥_1 → ⊥_2 → ⋯. If M is universally transient, then so is C′.
In [17, Lemma 6] a variant of the conditioned MDP from Definition 6.1 was proposed. This variant differs from C in that it contains only those states from M that have an optimal strategy, i.e., a strategy σ with P_{M,s,σ}(φ) = val_{M,φ}(s). Further, for any transition (s, s′) in this variant where s is a controlled state, we have val_{M,φ}(s′) = val_{M,φ}(s), i.e., the variant does not have value-decreasing transitions emanating from controlled states. The following lemma was used in the proof of Theorem 5.3:
Lemma 6.8.
Let M be an MDP, and let φ be an objective that is tail in M. Let M̂ be the conditioned version of M w.r.t. φ in the sense of [17, Lemma 6]. If M is universally transient, then so is M̂.
7 Conclusion
The Transience objective admits ε-optimal (resp. optimal) MD strategies even in infinitely branching MDPs. This is unusual, since ε-optimal strategies for most other objectives require infinite memory if the MDP is infinitely branching (in particular, all objectives generalizing Safety [17]).
Transience encodes a notion of continuous progress, which can be used as a tool to reason about the strategy complexity of other objectives in countable MDPs. E.g., our result on Transience is used in [18] as a building block to show upper bounds on the strategy complexity of certain threshold objectives w.r.t. mean payoff, total payoff and point payoff.
References
- [1] Pieter Abbeel and Andrew Y. Ng. Learning first-order Markov models for control. In Advances in Neural Information Processing Systems 17. MIT Press, 2004. URL: http://papers.nips.cc/paper/2569-learning-first-order-markov-models-for-control.
- [2] Galit Ashkenazi-Golan, János Flesch, Arkadi Predtetchinski, and Eilon Solan. Reachability and safety objectives in Markov decision processes on long but finite horizons. Journal of Optimization Theory and Applications, 2020.
- [3] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
- [4] Patrick Billingsley. Probability and Measure. Wiley, 1995. Third Edition.
- [5] Vincent D. Blondel and John N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 2000.
- [6] Nicole Bäuerle and Ulrich Rieder. Markov Decision Processes with Applications to Finance. Springer-Verlag Berlin Heidelberg, 2011.
- [7] K. Chatterjee and T. Henzinger. A survey of stochastic ω-regular games. Journal of Computer and System Sciences, 2012.
- [8] Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors. Handbook of Model Checking. Springer, 2018. doi:10.1007/978-3-319-10575-8.
- [9] E.M. Clarke, O. Grumberg, and D. Peled. Model Checking. MIT Press, Dec. 1999.
- [10] William Feller. An Introduction to Probability Theory and Its Applications. Wiley & Sons, second edition, 1966.
- [11] János Flesch, Arkadi Predtetchinski, and William Sudderth. Simplifying optimal strategies in limsup and liminf stochastic games. Discrete Applied Mathematics, 2018.
- [12] T.P. Hill and V.C. Pestien. The existence of good Markov strategies for decision processes with general payoffs. Stoch. Processes and Appl., 1987.
- [13] S. Kiefer, R. Mayr, M. Shirmohammadi, and P. Totzke. Transience in countable MDPs. In International Conference on Concurrency Theory, LIPIcs, 2021. Full version at https://arxiv.org/abs/2012.13739.
- [14] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Patrick Totzke. Büchi objectives in countable MDPs. In International Colloquium on Automata, Languages and Programming, LIPIcs, 2019. Full version at https://arxiv.org/abs/1904.11573. doi:10.4230/LIPIcs.ICALP.2019.119.
- [15] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Patrick Totzke. Strategy Complexity of Parity Objectives in Countable MDPs. In International Conference on Concurrency Theory, 2020. doi:10.4230/LIPIcs.CONCUR.2020.7.
- [16] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, Patrick Totzke, and Dominik Wojtczak. How to play in infinite MDPs (invited talk). In International Colloquium on Automata, Languages and Programming, 2020. doi:10.4230/LIPIcs.ICALP.2020.3.
- [17] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Dominik Wojtczak. Parity Objectives in Countable MDPs. In Annual IEEE Symposium on Logic in Computer Science, 2017. doi:10.1109/LICS.2017.8005100.
- [18] Richard Mayr and Eric Munday. Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff Objectives in Countable MDPs. In International Conference on Concurrency Theory, LIPIcs, 2021. The full version is available on arXiv.
- [19] A. Mostowski. Regular expressions for infinite trees and a standard form of automata. In Computation Theory, LNCS, 1984.
- [20] Donald Ornstein. On the existence of stationary optimal strategies. Proceedings of the American Mathematical Society, 1969. doi:10.2307/2035700.
- [21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition, 1994.
- [22] George Santayana. Reason in common sense. In Volume 1 of The Life of Reason. 1905. URL: https://en.wikipedia.org/wiki/George_Santayana.
- [23] Manfred Schäl. Markov decision processes in finance and dynamic options. In Handbook of Markov Decision Processes. Springer, 2002.
- [24] Olivier Sigaud and Olivier Buffet. Markov Decision Processes in Artificial Intelligence. John Wiley & Sons, 2013.
- [25] William D. Sudderth. Optimal Markov strategies. Decisions in Economics and Finance, 2020.
- [26] R.S. Sutton and A.G Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 2018.
- [27] Moshe Y. Vardi. Automatic verification of probabilistic concurrent finite-state programs. In Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 1985. doi:10.1109/SFCS.1985.12.
Appendix A Strategy Classes
We formalize the amount of memory needed to implement strategies. Let Mem be a countable set of memory modes. An update function is a function u: Mem × S → D(Mem × S) that meets the following two conditions, for all modes m ∈ Mem:
• for all controlled states s ∈ S_□, the distribution u(m, s) is over Mem × {s′ | s → s′};
• for all random states s ∈ S_◯, the marginal of u(m, s) on the successor states coincides with P(s), i.e., Σ_{m′∈Mem} u(m, s)(m′, s′) = P(s)(s′) for every successor s′.
An update function u together with an initial memory mode m_0 induce a strategy σ as follows. Consider the Markov chain with state set Mem × S whose transition probabilities are given by u on pairs of memory mode and state. Any partial run ρ in M gives rise to the set of partial runs of this Markov chain whose projection to S equals ρ. For a partial run ρ ending in a state s, this induces a probability distribution over the memory modes: the probability of mode m is the probability of the chain being in (m, s), conditioned on having taken some partial run from this set. We define σ(ρ) as the corresponding mixture: the distribution over successors obtained by first drawing a memory mode m according to this distribution and then playing according to u(m, s).
We say that a strategy σ can be implemented with memory Mem (and initial memory mode m_0) if there exists an update function u such that the induced strategy coincides with σ. In this case we may also write σ[m] to explicitly specify the initial memory mode m. Based on this, we can define several classes of strategies:
A strategy is memoryless (M) (also called positional) if it can be implemented with a memory of size 1. We may view M-strategies as functions σ: S_□ → D(S). A strategy is finite memory (F) if there exists a finite memory Mem implementing it. More specifically, a strategy is k-bit if it can be implemented with a memory of size 2^k. Such a strategy is then determined by an update function u: {0,1}^k × S → D({0,1}^k × S) together with an initial memory mode. Deterministic 1-bit strategies are both deterministic and 1-bit.
Appendix B Missing Proofs from Section 3
In this section, we prove Lemma 3.5 from the main body.
Lemma 3.5 (restated).
Proof B.1.
Given an infinitely branching MDP M with set of states S and an initial state s_0, we construct a finitely branching MDP M′ with set of states S′ ⊇ S such that the two items of the lemma hold. The reduction uses the concept of "recurrent ladders"; see Figure 2.
The reduction is as follows.
• For every controlled state s in M that has infinitely many successors s_1, s_2, …, we introduce a recurrent ladder in M′, consisting of fresh controlled states x_1, x_2, … and fresh random states y_1, y_2, …. The transitions include s → x_1 and, for every i, the two transitions x_i → s_i (leaving the ladder to the i-th successor of s) and x_i → y_i (staying on the ladder). Moreover, each random state y_i moves one controlled state up or down the ladder with probability 1/2 each, as in Gambler's Ruin with a fair coin. All states of the recurrent ladder are fresh states.
• For every random state s in M with infinitely many successors s_1, s_2, …, we use a gadget with fresh random states r_1, r_2, … and suitably adjusted probabilities to ensure that the gadget is left towards s_i with exactly the probability P(s)(s_i) (see the sketch after this list).
See Figure 3 for a partial illustration.
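One standard way to realise the "suitably adjusted probabilities" of the second item is a chain of fresh binary random states r_1, r_2, … with r_i → s_i and r_i → r_{i+1}; the concrete probabilities below are our reconstruction of the gadget sketched in Figure 3, which is not reproduced here.

\[
  P'(r_i)(s_i) \;=\; \frac{P(s)(s_i)}{1-\sum_{j<i}P(s)(s_j)},
  \qquad
  P'(r_i)(r_{i+1}) \;=\; 1-P'(r_i)(s_i).
\]

With this choice, the probability of leaving the chain at s_i is Π_{j<i}(1 − P′(r_j)(s_j))·P′(r_i)(s_i) = P(s)(s_i), as required, by a telescoping product.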
Given a partial run ρ, denote by last(ρ) its last state.
For the first item, let σ be a general strategy in M. We define σ′ in M′ with the use of memory and an update function; see Appendix A. The update function is defined by a case distinction on the current state and memory; the cases mirror the intuitive description given next.
The strategy σ′ consists of the above update function and an initial memory whose first component is the empty run. Intuitively speaking, in every step σ′ considers the memory and the current state to simulate what σ would have played in M. The memory invariantly stores the history of the run projected onto the state space of M (omitting the states introduced by the reduction). The second memory component is a special symbol if the current state is in S, and otherwise it is a natural number i. Such a natural number indicates that the controller is currently on a recurrent ladder and must leave the ladder at its i-th controlled state. Starting with the empty history and the special symbol, σ′ behaves as follows:
• when the current state is a random state in S, σ′ only appends it to the history, to keep track of the run;
• when the current state is a finitely branching controlled state in S, σ′ plays as σ and appends the state to the history;
• when the current state is an infinitely branching controlled state s in S with successors s_1, s_2, …, then, for every i, the strategy σ′ moves to the first state of the recurrent ladder for s while flipping the memory from the special symbol to i with the probability that σ would have assigned to s_i. This requires the ladder to be traversed to its i-th controlled state and left from there to s_i, the i-th successor of s in M. Furthermore, s is appended to the history;
• when the current state is the j-th controlled state of a recurrent ladder and the memory holds the number i with j ≠ i, σ′ continues to stay on the ladder by picking the adjacent random ladder state;
• when the current state is the i-th controlled state of a recurrent ladder and the memory holds the number i, σ′ leaves the ladder to s_i, the i-th successor of the corresponding state s in M. In addition, the memory is flipped back to the special symbol.
By the construction of M′ and σ′, it follows that σ′ in M′ faithfully simulates σ in M, and thus achieves Transience from s_0 with at least the same probability.
For the second item, let σ′ be an MD strategy in M′. We define an MD strategy σ in M as follows. For every finitely branching controlled state s, let σ(s) := σ′(s). For every infinitely branching controlled state s, let σ(s) := s_i if σ′ exits the recurrent ladder of s at its i-th controlled state; if σ′ never exits the ladder of s, let σ(s) be some fixed successor of s.
Note that the above strategy is well-defined, as in every recurrent ladder in M′, either there exists some i such that σ′ exits the ladder at its i-th controlled state, or σ′ chooses to stay on the ladder forever. In the latter case, by a Gambler's Ruin argument, the probability of Transience for the runs staying on the ladder forever is 0. By construction, σ faithfully simulates σ′, except when σ′ stays on a ladder forever and the prospect of Transience becomes 0; in those cases, σ continues playing from the fixed successor chosen above.
It follows that σ achieves Transience from s_0 with at least the probability that σ′ achieves in M′.
Appendix C 1-Bit Strategies for Transience ∧ Büchi(F)
Lemma 4.1 (restated).
Proof C.1.
We prove the claim for finitely branching MDPs first and transfer the result to general MDPs at the end.
Let M be a finitely branching countable MDP, I a finite set of initial states, F ⊆ S a set of goal states, and consider the objective Transience ∧ Büchi(F).
For every s ∈ I and every ε > 0 there exists an ε-optimal strategy σ_s, i.e.,
(11) P_{M,s,σ_s}(Transience ∧ Büchi(F)) ≥ val_{M,Transience ∧ Büchi(F)}(s) − ε.
However, the strategies σ_s might differ from each other and might use randomization and a large (or even infinite) amount of memory. We will construct a single deterministic strategy σ that uses only 1 bit of memory and attains, from every s ∈ I, a probability that is within a small multiple of ε of the value. This proves the claim, as ε can be chosen arbitrarily small.
In order to construct σ, we first observe the behavior of the finitely many σ_s for s ∈ I on an infinite, increasing sequence of finite subsets of S. Based on this, we define a second, stronger objective φ′ with
(12) φ′ ⊆ Transience ∧ Büchi(F)
and show that all σ_s attain φ′ with only a small additional loss, i.e.,
(13) P_{M,s,σ_s}(φ′) ≥ P_{M,s,σ_s}(Transience ∧ Büchi(F)) − ε for every s ∈ I.
We construct σ as a deterministic 1-bit strategy that is optimal w.r.t. φ′ from all s ∈ I and obtain, for every s ∈ I,
P_{M,s,σ}(Transience ∧ Büchi(F)) ≥ P_{M,s,σ}(φ′)   (by Equation 12)
  ≥ P_{M,s,σ_s}(φ′)   (by optimality of σ for φ′)
  ≥ val_{M,Transience ∧ Büchi(F)}(s) − 2ε   (by Equations 13 and 11).
Behavior of the σ_s, objective φ′ and properties Equation 12 and Equation 13. We start with some notation. For a set of states U ⊆ S and k ∈ ℕ, let Post^{≤k}(U) be the set of states that can be reached from some state in U within at most k steps. Since M is finitely branching, Post^{≤k}(U) is finite if U is finite. Let Visit_{≤k}(U) and Visit_{≥k}(U) denote the property of visiting the set U (at least once) within at most (resp. after at least) k steps. Moreover, let Visit_{inf}(U) denote the property of visiting U infinitely often.
Lemma C.2.
Assume the setup of Lemma 4.1, and a strategy σ_s from each s ∈ I. Let U ⊆ S be a finite set of states and δ > 0.
1. There is k ∈ ℕ such that, for every s ∈ I, P_{M,s,σ_s}(Visit_{≤k}(F)) ≥ P_{M,s,σ_s}(Reach(F)) − δ.
2. There is k ∈ ℕ such that, for every s ∈ I, P_{M,s,σ_s}(Visit_{≥k}(U)) ≤ P_{M,s,σ_s}(Visit_{inf}(U)) + δ.
Proof C.3.
It suffices to show the properties for a single s ∈ I, since one can then take the maximum of the resulting bounds k over the finitely many s ∈ I.
We observe that Visit_{inf}(U) = ⋂_{k∈ℕ} Visit_{≥k}(U) = ⋃_{u∈U} GF u, where the last equivalence is due to the finiteness of U.
Towards 1, we have Reach(F) = ⋃_{k∈ℕ} Visit_{≤k}(F), and this union is increasing in k. It follows from the continuity of measures that lim_{k→∞} P_{M,s,σ_s}(Visit_{≤k}(F)) = P_{M,s,σ_s}(Reach(F)), so a suitable k exists.
Towards 2, we have Visit_{inf}(U) = ⋂_{k∈ℕ} Visit_{≥k}(U), and this intersection is decreasing in k. By continuity of measures we obtain lim_{k→∞} P_{M,s,σ_s}(Visit_{≥k}(U)) = P_{M,s,σ_s}(Visit_{inf}(U)), so a suitable k exists.
In the following, let us write R̄ to denote the complement of a set R of runs.