Transience in Countable MDPs
Stefan Kiefer (Department of Computer Science, University of Oxford, UK) · Richard Mayr (School of Informatics, University of Edinburgh, UK) · Mahsa Shirmohammadi (Université de Paris, CNRS, IRIF, F-75013 Paris, France) · Patrick Totzke (Department of Computer Science, University of Liverpool, UK)
This is the full version of a CONCUR 2021 paper [13].
Abstract
The Transience objective is not to visit any state infinitely often. While this is not possible in any finite Markov Decision Process (MDP), it can be satisfied in countably infinite ones, e.g., if the transition graph is acyclic.
We prove the following fundamental properties of Transience in countably infinite MDPs.
1. There exist uniformly ε-optimal MD strategies (memoryless deterministic) for Transience, even in infinitely branching MDPs.
2. Optimal strategies for Transience need not exist, even if the MDP is finitely branching. However, if an optimal strategy exists then there is also an optimal MD strategy.
3. If an MDP is universally transient (i.e., almost surely transient under all strategies), then many other objectives have a lower strategy complexity than in general MDPs. E.g., ε-optimal strategies for Safety and co-Büchi, and optimal strategies (where they exist) for certain parity objectives, can be chosen MD, even if the MDP is infinitely branching.
Keywords: Markov decision processes, Parity, Transience
1 Introduction
Those who cannot remember the past are condemned to repeat it.
George Santayana (1905) [22]
The famous aphorism above has often been cited (with small variations), e.g., by Winston Churchill in a 1948 speech to the House of Commons, and carved into several monuments all over the world [22].
We prove that the aphorism is false. In fact, even those who cannot remember anything at all are not condemned to repeat the past. With the right strategy, they can avoid repeating the past just as well as everyone else. More formally, playing for Transience does not require any memory. We show that there always exist ε-optimal memoryless deterministic strategies for Transience, and if optimal strategies exist then there also exist optimal memoryless deterministic strategies. (Our result applies to MDPs, also called games against nature. It is an open question whether it generalizes to countable stochastic 2-player games. However, it is easy to see that the adversary needs infinite memory in general, even if the player is passive [14, 16].)
Background. We study Markov decision processes (MDPs), a standard model for dynamic systems that exhibit both stochastic and controlled behavior [21]. MDPs play a prominent role in many domains, e.g., artificial intelligence and machine learning [26, 24], control theory [5, 1], operations research and finance [25, 12, 6, 23], and formal verification [2, 25, 11, 8, 3, 7].
An MDP is a directed graph where states are either random or controlled. Its observed behavior is described by runs, which are infinite paths that are, in part, determined by the choices of a controller. If the current state is random then the next state is chosen according to a fixed probability distribution. Otherwise, if the current state is controlled, the controller can choose a distribution over all possible successor states. By fixing a strategy for the controller (and initial state), one obtains a probability space of runs of the MDP. The goal of the controller is to optimize the expected value of some objective function on the runs.
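To make the model concrete, the following small Python sketch simulates a finite prefix of a run under a memoryless deterministic strategy. It is illustrative only; the toy state space, transition structure, and all identifiers are ours, not from the paper.

```python
import random
from typing import Callable, Dict, List

# A toy MDP fragment: controlled states choose a successor, random states
# follow a fixed probability distribution.
CONTROLLED = {"c0", "c1"}
SUCCESSORS: Dict[str, List[str]] = {"c0": ["r0", "c1"], "c1": ["r0"], "r0": ["c0", "c1"]}
P: Dict[str, Dict[str, float]] = {"r0": {"c0": 0.5, "c1": 0.5}}  # distributions of random states

def simulate(strategy: Callable[[str], str], s0: str, steps: int) -> List[str]:
    """Produce a finite prefix of a run: the strategy resolves controlled states,
    random states are resolved according to P."""
    run, s = [s0], s0
    for _ in range(steps):
        if s in CONTROLLED:
            s = strategy(s)                       # memoryless deterministic (MD) choice
        else:
            succ = list(P[s])
            s = random.choices(succ, [P[s][t] for t in succ])[0]
        run.append(s)
    return run

print(simulate(lambda s: SUCCESSORS[s][0], "c0", 10))
```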
The strategy complexity of a given objective characterizes the type of strategy necessary to achieve an optimal (resp. ε-optimal) value for the objective. General strategies can take the whole history of the run into account (history-dependent; (H)), while others use only bounded information about it (finite memory; (F)) or base decisions only on the current state (memoryless; (M)). Moreover, the strategy type depends on whether the controller can randomize (R) or is limited to deterministic choices (D). The simplest type, MD, refers to memoryless deterministic strategies.
Acyclicity and Transience. An MDP is called acyclic iff its transition graph is acyclic. While finite MDPs cannot be acyclic (unless they have deadlocks), countable MDPs can. In acyclic countable MDPs, the strategy complexity of Büchi/Parity objectives is lower than in the general case: ε-optimal strategies for Büchi/Parity objectives require only one bit of memory in acyclic MDPs, while they require infinite memory (an unbounded step-counter, plus one bit) in general countable MDPs [14, 15].
The concept of transience can be seen as a generalization of acyclicity. In a Markov chain, a state s is called transient iff the probability of returning from s to s is strictly less than 1 (otherwise the state is called recurrent). This means that a transient state is almost surely visited only finitely often. The concepts of transience and recurrence are naturally lifted from Markov chains to MDPs, where they depend on the chosen strategy.
We define the Transience objective as the set of runs that do not visit any state infinitely often. We call an MDP universally transient iff it almost-surely satisfies Transience under every strategy. Thus every acyclic MDP is universally transient, but not vice-versa; cf. Figure 1. In particular, universal transience does not just depend on the structure of the transition graph, but also on the transition probabilities. Universally transient MDPs have interesting properties. Many objectives (e.g., Safety, Büchi, co-Büchi) have a lower strategy complexity than in general MDPs; see below.
We also study the strategy complexity of the Transience objective itself, and how it interacts with other objectives, e.g., how to attain a Büchi objective in a transient way.
Our contributions.
1. We show that there exist uniformly ε-optimal MD strategies (memoryless deterministic) for Transience, even in infinitely branching MDPs. This is unusual, since (apart from reachability objectives) most other objectives require infinite memory if the MDP is infinitely branching, e.g., all objectives generalizing Safety [17].
Our result is shown in several steps. First we show that there exist ε-optimal deterministic 1-bit strategies for Transience. Then we show how to dispense with the 1-bit memory and obtain ε-optimal MD strategies for Transience. Finally, we make these MD strategies uniform, i.e., independent of the start state.
2. We show that optimal strategies for Transience need not exist, even if the MDP is finitely branching. If they do exist, then there are also MD optimal strategies. More generally, there exists a single MD strategy that is optimal from every state that allows optimal strategies for Transience.
3. If an MDP is universally transient (i.e., almost surely transient under all strategies), then many other objectives have a lower strategy complexity than in general MDPs, e.g., ε-optimal strategies for Safety and co-Büchi, and optimal strategies (where they exist) for certain parity objectives, can be chosen MD, even if the MDP is infinitely branching.
For our proofs we develop some technical results that are of independent interest. We generalize Ornstein's plastering construction [20] from reachability to tail objectives and thus obtain a general tool to infer uniformly ε-optimal MD strategies from non-uniform ones (cf. Theorem 4.7). Secondly, in Section 6 we develop the notion of the conditioned MDP (cf. [17]). For tail objectives, this allows us to obtain uniformly ε-optimal MD strategies w.r.t. multiplicative errors from those with merely additive errors.
2 Preliminaries
A probability distribution over a countable set S is a function f: S → [0,1] with Σ_{s∈S} f(s) = 1. We write D(S) for the set of all probability distributions over S.
Markov Decision Processes. We define Markov decision processes (MDPs for short) over countably infinite state spaces as tuples M = (S, S_□, S_◯, →, P), where S is the countable set of states, partitioned into a set S_□ of controlled states and a set S_◯ of random states. The transition relation is → ⊆ S × S, and P: S_◯ → D(S) is a probability function. We write s → s′ if (s, s′) ∈ →, and refer to s′ as a successor of s. We assume that every state has at least one successor. The probability function assigns to each random state s ∈ S_◯ a probability distribution P(s) over its set of successors. A sink is a subset T ⊆ S closed under the → relation.
An MDP is acyclic if the underlying graph (S, →) is acyclic. It is finitely branching if every state has finitely many successors, and infinitely branching otherwise. An MDP without controlled states (S_□ = ∅) is a Markov chain.
Strategies and Probability Measures. A run is an infinite sequence ρ = s_0 s_1 ⋯ of states such that s_i → s_{i+1} for all i; a partial run is a finite prefix of a run. We write ρ(i) = s_i and say that a (partial) run visits s if s = ρ(i) for some i. It starts in s if s = ρ(0).
A strategy is a function σ that assigns to every partial run ρs ending in a controlled state s a distribution over the successors of s. We write Σ_M for the set of all strategies in M. A strategy σ and an initial state s_0 induce a standard probability measure on sets of infinite runs. We write P_{M,s_0,σ} for the probability of a measurable set of runs starting from s_0. It is defined for the cylinder sets s_0 s_1 ⋯ s_n S^ω as P_{M,s_0,σ}(s_0 s_1 ⋯ s_n S^ω) = Π_{i<n} σ̄(s_0 ⋯ s_i)(s_{i+1}), where σ̄ is the map that extends σ by σ̄(ρs) = P(s) for all partial runs ρs ending in a random state s. By Carathéodory's theorem [4], the measure for cylinders extends uniquely to a probability measure on all measurable subsets of s_0 S^ω. We write E_{M,s_0,σ} for the expectation w.r.t. P_{M,s_0,σ}.
Strategy Classes. Strategies are in general randomized (R) in the sense that they take values in D(S). A strategy σ is deterministic (D) if σ(ρ) is a Dirac distribution for all partial runs ρ.
We formalize the amount of memory needed to implement strategies in Appendix A. The two classes of memoryless and 1-bit strategies are central to this paper. A strategy σ is memoryless (M) if it bases its decision only on the last state of the run: σ(ρs) = σ(s) for all partial runs ρs. We may view M-strategies as functions σ: S_□ → D(S). A 1-bit strategy may base its decision also on a memory mode m ∈ {0,1}. Formally, a 1-bit strategy is given as a tuple (u, m_0), where m_0 ∈ {0,1} is the initial memory mode and u: {0,1} × S → D({0,1} × S) is an update function such that
• for all controlled states s ∈ S_□, the distribution u(m, s) is over {0,1} × {s′ | s → s′};
• for all random states s ∈ S_◯, the marginal of u(m, s) on the successor states coincides with P(s), i.e., Σ_{m′∈{0,1}} u(m, s)(m′, s′) = P(s)(s′) for every successor s′.
Note that this definition allows for updating the memory mode upon visiting random states. We write σ[m] for the strategy obtained from σ by setting the initial memory mode to m.
MD strategies are both memoryless and deterministic; and deterministic 1-bit strategies are both deterministic and 1-bit.
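The two memory regimes can be represented very simply in code. The following Python sketch (illustrative; the class and example choices are ours, not from the paper) shows a deterministic 1-bit strategy as an initial mode plus an update function, in contrast to an MD strategy, which is just a function from states to successors.

```python
from typing import Callable, Tuple

State, Mode = str, int            # the memory mode is a single bit: 0 or 1

class OneBitStrategy:
    """A deterministic 1-bit strategy: given the current memory mode and state,
    the update function returns the new mode and the chosen successor.  At
    random states the successor is drawn by the MDP itself; the strategy may
    still update its memory bit, as allowed by the definition above."""
    def __init__(self, initial_mode: Mode,
                 update: Callable[[Mode, State], Tuple[Mode, State]]):
        self.mode = initial_mode
        self.update = update

    def choose(self, s: State) -> State:
        self.mode, successor = self.update(self.mode, s)
        return successor

# Example: alternate between two successors "a" and "b" of every controlled state,
# flipping the bit on each visit (purely illustrative).
toggle = OneBitStrategy(0, lambda m, s: (1 - m, "a" if m == 0 else "b"))
print(toggle.choose("c0"), toggle.choose("c0"))   # prints: a b
```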
Objectives. The objective of the controller is determined by a predicate on infinite runs. We assume familiarity with the syntax and semantics of the temporal logic LTL [9]. Formulas are interpreted on the underlying structure of the MDP M. We use [[φ]]_{M,s} to denote the set of runs starting from s that satisfy the LTL formula φ, which is a measurable set [27]. We also write P_{M,s,σ}(φ) for P_{M,s,σ}([[φ]]_{M,s}). Where it does not cause confusion, we identify φ and [[φ]]_{M,s} and just write P_{M,s,σ}(φ).
Given a set T ⊆ S of states, the reachability objective Reach(T) is the set of runs that visit T at least once. The safety objective Safety(T) is the set of runs that never visit T.
Let C ⊆ ℕ be a finite set of colors. A color function Col: S → C assigns to each state s its color Col(s). The parity objective, written as Parity(Col), is the set of infinite runs such that the largest color that occurs infinitely often along the run is even. To define this formally, for c ∈ C let S_c be the set of states in S with color c, and let S_{>c} be the set of states with a color strictly greater than c. Then
Parity(Col) = ⋃_{c ∈ C, c even} ( GF S_c ∧ FG ¬S_{>c} ).
We write {0,1}-Parity, {1,2}-Parity, {0,1,2}-Parity, etc., for the parity objectives with the respective set of colors. The classical Büchi and co-Büchi objectives correspond to {1,2}-Parity and {0,1}-Parity, respectively.
An objective φ is called a tail objective (in M) iff membership of a run in φ does not depend on any finite prefix of the run, i.e., removing or prepending a finite prefix does not affect membership. For every coloring Col, Parity(Col) is tail. Reachability objectives are not always tail, but in MDPs where the target set T is a sink, Reach(T) is tail.
Optimal and ε-optimal Strategies. Given an objective φ, the value of a state s in an MDP M, denoted by val_{M,φ}(s), is the supremum probability of achieving φ. Formally, we have val_{M,φ}(s) = sup_{σ∈Σ} P_{M,s,σ}(φ), where Σ is the set of all strategies. For ε ≥ 0 and a state s, we say that a strategy σ is ε-optimal from s iff P_{M,s,σ}(φ) ≥ val_{M,φ}(s) − ε. A 0-optimal strategy is called optimal. An optimal strategy is almost-surely winning iff val_{M,φ}(s) = 1.
Considering an MD strategy as a function σ: S_□ → S, and ε ≥ 0, σ is uniformly ε-optimal (resp. uniformly optimal) if it is ε-optimal (resp. optimal) from every state s ∈ S.
Throughout the paper, we may drop subscripts and superscripts from the notation if they are understood from the context. The missing proofs can be found in the appendix.
3 Transience and Universally Transient MDPs
In this section we define the transience property for MDPs, a natural generalization of the well-understood concept of transient Markov chains. We enumerate crucial characteristics of this objective and define the notion of universally transient MDPs.
Fix a countable MDP M = (S, S_□, S_◯, →, P). Define the transience objective, denoted by Transience, to be the set of runs that do not visit any state of S infinitely often, i.e., Transience = ⋀_{s∈S} FG ¬s.
The objective Transience is tail, as it is closed under removing finite prefixes of runs. Also note that Transience cannot be encoded in a parity objective.
We call M universally transient iff, for all states s and for all strategies σ, the property Transience holds almost-surely from s, i.e., P_{M,s,σ}(Transience) = 1.
The MDP in Figure 1 models the classical Gambler's Ruin Problem with restart; see [10, Chapter 14]. Let p denote the probability of increasing the wealth by one in each step. It is well-known that if the controller starts with wealth k ≥ 1 and if p ≤ 1/2, the probability of ruin (visiting the state 0) is 1. Consequently, due to the restart, the probability of re-visiting 0 infinitely often is 1, implying that Transience is violated almost surely. In contrast, in the case p > 1/2, for all states s, the probability of re-visiting s is strictly below 1. Hence, the property Transience holds almost-surely. This example indicates that the transience property depends on the probability values of the transitions and not just on the underlying transition graph, and thus may require arithmetic reasoning. In particular, the MDP in Figure 1 is universally transient iff p > 1/2.
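The Gambler's Ruin facts used above can be checked numerically. In the sketch below, the up-step probability p and the start wealth k are our own parametrisation of Figure 1 (which is not reproduced here); the closed-form ruin probability is the classical one.

```python
import random

def ruin_probability(p: float, k: int) -> float:
    """Probability of ever hitting wealth 0 from wealth k for the random walk that
    gains 1 with probability p and loses 1 with probability 1 - p
    (classical formula: 1 if p <= 1/2, else ((1-p)/p)**k)."""
    return 1.0 if p <= 0.5 else ((1 - p) / p) ** k

def estimate_ruin(p: float, k: int, runs: int = 5_000, horizon: int = 2_000) -> float:
    """Monte Carlo estimate, truncated at a finite horizon (so it slightly
    underestimates the true ruin probability)."""
    ruined = 0
    for _ in range(runs):
        wealth = k
        for _ in range(horizon):
            wealth += 1 if random.random() < p else -1
            if wealth == 0:
                ruined += 1
                break
    return ruined / runs

for p in (0.4, 0.6):
    print(p, ruin_probability(p, 3), estimate_ruin(p, 3))
# For p <= 1/2 ruin is certain, so with the restart edge the state 0 is revisited
# infinitely often almost surely and Transience fails; for p > 1/2 the ruin
# probability (0.4/0.6)^3 ~ 0.30 is below 1 and every state is transient.
```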
In general, optimal strategies for Transience need not exist:
Lemma 3.1.
There exists a finitely branching countable MDP M with an initial state s_0 such that
• val_{M,Transience}(s) = 1 for all controlled states s, but
• there does not exist any optimal strategy σ, i.e., no strategy σ with P_{M,s_0,σ}(Transience) = 1.
Proof 3.2.
Consider the countable MDP sketched in Figure 2. One family of states forms an acyclic ladder: each of these states has a unique successor, the next state up the ladder, so staying on this ladder guarantees Transience; hence all these states have value 1. There is a sink state, whose value is 0. A further family of random exit states connects the construction below to the acyclic ladder and to the sink: the n-th exit state moves to the acyclic ladder with a probability that tends to 1 as n grows, and to the sink with the remaining probability, so its value also tends to 1.
The remaining states form a second, "recurrent" ladder of decisions, obtained by interleaving controlled and random states: the n-th controlled state on this ladder has two successors, the next random state on the ladder and the n-th exit state. In the random states, as in Gambler's Ruin with a fair coin, the run moves one controlled state up or down the ladder, each with probability 1/2. Thus, in each controlled state, the controller decides either to stay on the ladder or to leave it via the corresponding exit state. As in Figure 1, if the controller stays on the ladder forever, the probability of Transience is 0.
Starting in the first controlled state of the recurrent ladder, for every n, the strategy that stays on the ladder until visiting the n-th controlled state (which happens eventually almost surely) and then leaves the ladder achieves Transience with a probability that tends to 1 as n grows. Hence, the value of the initial state is 1.
Recall that Transience cannot be achieved with a positive probability by staying on the ladder forever. But any strategy that leaves the ladder with positive probability incurs a positive probability of falling into the sink, and is therefore not optimal either. Thus there is no optimal strategy for Transience.
Reduction to Finitely Branching MDPs. In our main results, we will prove that for the Transience property there always exist ε-optimal MD strategies in finitely branching countable MDPs, and that if an optimal strategy exists, then there also exists an optimal MD strategy. We generalize these results to infinitely branching countable MDPs by the following reduction:
Lemma 3.5.
Given an infinitely branching countable MDP M with an initial state s_0, there exists a finitely branching countable MDP M′ with a set of states S′ ⊇ S such that
1. each strategy σ in M is mapped to a strategy σ′ in M′ that achieves Transience from s_0 with at least the same probability, and
2. conversely, every MD strategy σ′ in M′ is mapped to an MD strategy σ in M that achieves Transience from s_0 with at least the probability that σ′ achieves in M′.
Proof 3.6 (Proof sketch).
See Appendix B for the complete construction. In order to construct M′ from M, for each controlled state in M that has infinitely many successors, a "recurrent ladder" is introduced; see Figure 3. Since the probability of Transience is 0 for all runs that eventually stay forever on a recurrent ladder, the controller has to exit such ladders in order to play well for Transience. Infinitely branching random states can be dealt with in a simpler way.
Properties of Universally Transient MDPs.
Notice that acyclicity implies universal transience, but not vice-versa.
Lemma 3.7.
For every countable MDP M, the following conditions are equivalent.
1. M is universally transient, i.e., Transience holds almost surely from every state under every strategy.
2. For every initial state s_0 and state s, the objective of re-visiting s infinitely often has value zero.
3. For every state s, the value of the objective of re-visiting s (when starting from s) is strictly below 1.
4. For every state s there exists a finite bound K_s such that, for every initial state s_0 and every strategy σ from s_0, the expected number of visits to s is at most K_s.
5. For all states s_0 and s, under every strategy from s_0, the expected number of visits to s is finite.
Proof 3.8.
Towards (1) ⇒ (2), consider an arbitrary strategy σ from the initial state s_0 and some state s. By (1), Transience holds almost surely under σ, and thus the probability of visiting s infinitely often is 0, which implies (2).
Towards (2) ⇒ (1), consider an arbitrary strategy σ from the initial state s_0. By (2), every individual state is visited infinitely often with probability 0, and thus, since the state space is countable, with probability 1 no state is visited infinitely often, i.e., (1) holds.
We now show the implications (2) ⇒ (3) ⇒ (4) ⇒ (5) ⇒ (2).
Towards (2) ⇒ (3), suppose that (3) does not hold, i.e., the value of re-visiting some state s (from s) equals 1. Then for every k there is a strategy σ_k from s that returns to s with probability at least 1 − 2^{-k}ε. Let ε > 0. We define the strategy σ to play like σ_k between the k-th and (k+1)-th visit to s. Then the probability of visiting s infinitely often is at least Π_{k≥1}(1 − 2^{-k}ε) ≥ 1 − ε > 0, which contradicts (2), where the initial state is taken to be s itself.
Towards (3) ⇒ (4), let c_s < 1 be the value of re-visiting s. Regardless of s_0 and the chosen strategy, the probability of visiting s at least k+1 times is at most c_s^k, so the expected number of visits to s is upper-bounded by K_s = Σ_{k≥0} c_s^k = 1/(1 − c_s).
The implication (4) ⇒ (5) holds trivially.
Towards (5) ⇒ (2), suppose that (2) does not hold, i.e., there exist states s_0, s and a strategy σ such that s is visited infinitely often with positive probability. Thus the expected number of visits to s is infinite, which contradicts (5).
We remark that if an MDP is not universally transient (unlike in Lemma 3.7(5)), then for a strategy σ the expected number of visits to some state can be infinite, even if σ attains Transience almost surely.
Consider an MDP with controlled states s, t_1, t_2, …, initial state s, and transitions s → s, s → t_1 and t_i → t_{i+1} for every i. We define a strategy σ that, while in state s, proceeds in rounds i = 1, 2, …. In the i-th round it tosses a fair coin. If Heads, then it goes to t_1. If Tails, then it takes the self-loop at s a certain number of times, depending on i, and then proceeds to round i+1. In every round the probability of going to t_1 is 1/2, and therefore the probability of staying in s forever is 0. Thus σ attains Transience almost surely. However, since the number of self-loops taken in round i grows rapidly with i, the expected number of visits to s is infinite.
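The computation behind this remark is short. The exact number of self-loops per round is elided above; the following sketch assumes it is 2^i in round i (any sufficiently fast-growing choice works equally well).

\[
  \Pr(\text{stay in } s \text{ forever}) \;=\; \prod_{i\ge 1}\Pr(\text{Tails in round } i)\;=\;\lim_{n\to\infty}2^{-n}\;=\;0,
\]
\[
  \mathbb{E}(\#\text{visits to } s)\;\ge\;\sum_{i\ge 1}\Pr(\text{Tails in rounds } 1,\dots,i)\cdot 2^{i}\;=\;\sum_{i\ge 1}2^{-i}\cdot 2^{i}\;=\;\infty.
\]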
4 MD Strategies for Transience
We show that there exist uniformly ε-optimal MD strategies for Transience and that optimal strategies, where they exist, can also be chosen MD.
First we show that there exist ε-optimal deterministic 1-bit strategies for Transience (in Corollary 4.3) and then we show how to dispense with the 1-bit memory (in Lemma 4.5).
It was shown in [14] that there exist ε-optimal deterministic 1-bit strategies for Büchi objectives in acyclic countable MDPs (though not in general MDPs). These 1-bit strategies will be similar to the 1-bit strategies for Transience that we aim for in (not necessarily acyclic) countable MDPs. In Lemma 4.1 below we first strengthen the result from [14] and construct ε-optimal deterministic 1-bit strategies for objectives of the form Transience ∧ Büchi(F). From this we obtain deterministic 1-bit strategies for Transience (Corollary 4.3).
Lemma 4.1.
Let M be a countable MDP, I a finite set of initial states, F ⊆ S a set of states, and ε > 0. Then there exists a deterministic 1-bit strategy for Transience ∧ Büchi(F) that is ε-optimal from every s ∈ I.
Proof 4.2 (Proof sketch).
The full proof can be found in Appendix C. It follows the proof of [14, Theorem 5], which considers Büchi conditions for acyclic (and hence universally transient) MDPs. The only part of that proof that requires modification is [14, Lemma 10], which is replaced here by Lemma C.2 to deal with general MDPs.
In short, from every s ∈ I there exists an ε-optimal strategy σ_s for Transience ∧ Büchi(F). We observe the behavior of the finitely many σ_s for s ∈ I on an infinite, increasing sequence of finite subsets of S. Based on Lemma C.2, we can define a second, stronger objective and show that the σ_s attain it with only a small loss in probability. We then construct a deterministic 1-bit strategy that is optimal for the stronger objective from all s ∈ I and thus close to optimal for Transience ∧ Büchi(F). Since the loss can be made arbitrarily small, the result follows.
Unlike for the objective Transience alone (see below), the 1-bit memory is strictly necessary for the objective Transience ∧ Büchi(F) in Lemma 4.1. The 1-bit lower bound for Büchi objectives in [14] holds even for acyclic MDPs, where Transience is trivially true.
Corollary 4.3.
Let M be a countable MDP, I a finite set of initial states, F ⊆ S a set of states and ε > 0.
1. If the values of Büchi(F) and Transience ∧ Büchi(F) coincide on I, then there exists a deterministic 1-bit strategy for Büchi(F) that is ε-optimal from every s ∈ I.
2. If M is universally transient, then there exists a deterministic 1-bit strategy for Büchi(F) that is ε-optimal from every s ∈ I.
3. There exists a deterministic 1-bit strategy for Transience that is ε-optimal from every s ∈ I.
Proof 4.4.
Towards (1), since the values of Büchi(F) and Transience ∧ Büchi(F) coincide on I, strategies that are ε-optimal for Transience ∧ Büchi(F) are also ε-optimal for Büchi(F). Thus the result follows from Lemma 4.1.
Item (2) follows directly from (1), since the precondition always holds in universally transient MDPs.
Towards (3), let F = S. Then we have Transience ∧ Büchi(S) = Transience, and we obtain from Lemma 4.1 that there exists a deterministic 1-bit strategy for Transience that is ε-optimal from every s ∈ I.
Note that every acyclic MDP is universally transient, and thus Corollary 4.3(2) implies the upper bound on the strategy complexity of Büchi objectives from [14] (but not vice-versa).
In the next step we show how to dispense with the 1-bit memory and obtain non-uniform ε-optimal MD strategies for Transience.
Lemma 4.5.
Let M be a countable MDP with initial state s_0, and ε > 0. There exists an MD strategy σ′ that is ε-optimal for Transience from s_0, i.e., P_{M,s_0,σ′}(Transience) ≥ val_{M,Transience}(s_0) − ε.
Proof 4.6.
By Lemma 3.5 it suffices to prove the property for finitely branching MDPs. Thus, without restriction, in the rest of the proof we assume that M is finitely branching.
We instantiate Corollary 4.3(3) with I = {s_0} and obtain a deterministic 1-bit strategy σ for Transience that is ε-optimal from s_0.
We now construct a slightly modified MDP M′ as follows. Let Z ⊆ S be the subset of states where σ attains value zero for Transience in both memory modes, i.e., where both σ[0] and σ[1] achieve Transience with probability 0. We obtain M′ from M by making all states in Z losing sinks (for Transience), by deleting all outgoing edges and adding a self-loop instead. It follows that
(1) P_{M′,s_0,σ}(Transience) = P_{M,s_0,σ}(Transience), and
(2) P_{M,s_0,τ}(Transience) ≥ P_{M′,s_0,τ}(Transience) for every strategy τ on M′ (extended arbitrarily to M).
In the following we show that it is possible to play in M′ in such a way that, for every state s, the expected number of visits to s is finite. We obtain a deterministic 1-bit strategy σ″ in M′ by modifying σ as follows. In every state s and memory mode m where σ[m] attains probability 0 for Transience but σ[1−m] attains a positive probability, the strategy σ″ sets the memory bit to 1−m. (Note that only states outside Z can be affected by this change.) It follows that
(3) P_{M′,s_0,σ″}(Transience) ≥ P_{M′,s_0,σ}(Transience).
Moreover, from all states in S ∖ Z, the strategy σ″ attains a strictly positive probability of Transience in M′ in both memory modes, i.e., for all s ∈ S ∖ Z and m ∈ {0,1} we have P_{M′,s,σ″[m]}(Transience) > 0.
Let p(s,m) be the probability, when playing σ″[m] from state s, of reaching s again in the same memory mode m. For every s ∈ S ∖ Z we have p(s,m) < 1, since otherwise s would almost surely be re-visited infinitely often, contradicting the positive probability of Transience.
Let χ(s) be the expected number of visits to state s when playing σ″ from s_0 in M′, and χ(s,m) the expected number of visits to s in memory mode m. For all s ∈ S ∖ Z we have that
(4) χ(s) = χ(s,0) + χ(s,1) ≤ 1/(1 − p(s,0)) + 1/(1 − p(s,1)) < ∞,
where the first equality holds by linearity of expectation. Thus the expected number of visits to s is finite.
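The middle inequality of (4), as reconstructed above, is the standard geometric-series bound; a sketch of the argument, in the notation of the proof, reads as follows.

\[
  \chi(s,m)\;=\;\sum_{k\ge 1}\Pr(\text{at least } k \text{ visits to } (s,m))\;\le\;\sum_{k\ge 1}p(s,m)^{\,k-1}\;=\;\frac{1}{1-p(s,m)}\;<\;\infty,
\]
\[
  \chi(s)\;=\;\chi(s,0)+\chi(s,1)\;\le\;\frac{1}{1-p(s,0)}+\frac{1}{1-p(s,1)}.
\]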
Now we upper-bound the probability of visiting Z. We have P_{M′,s_0,σ″}(Transience) ≥ val_{M,Transience}(s_0) − ε by (3), (1) and the ε-optimality of σ. Since states in Z are losing sinks in M′, it follows that
(5) P_{M′,s_0,σ″}(Reach(Z)) ≤ 1 − val_{M,Transience}(s_0) + ε.
We now augment the MDP M′ by assigning costs to transitions as follows. Let s_1, s_2, … be an enumeration of the state space, i.e., a bijection between the states and the positive integers. Let V ⊆ S ∖ Z be the subset of states that are visited with non-zero probability when playing σ″ from s_0. Each transition (s, s′) is assigned a cost:
• If s′ ∈ Z and s ∉ Z, we assign cost 1.
• If s′ = s_i ∉ Z and s_i ∉ V, we assign cost ε·2^{−i}.
• If s′ = s_i ∉ Z and s_i ∈ V, we assign cost ε·2^{−i}/χ(s_i). This is well defined, since 0 < χ(s_i) < ∞.
• If s, s′ ∈ Z, we assign cost 0.
Note that all transitions leading to states in V are assigned a non-zero cost, since χ(s_i) is finite by (4).
When playing σ″ from s_0 in M′, the expected total cost is upper-bounded by P_{M′,s_0,σ″}(Reach(Z)) plus Σ_i χ(s_i)·ε·2^{−i}/χ(s_i). The first part is at most 1 − val_{M,Transience}(s_0) + ε by (5), and the second part is at most ε, since χ(s_i) < ∞ by (4). Therefore the expected total cost is at most 1 − val_{M,Transience}(s_0) + 2ε, i.e., σ″ witnesses that it is possible to attain a finite expected total cost, upper-bounded by this quantity.
Now we define our MD strategy σ′. Let σ′ be an MD strategy on M′ (from s_0) that minimizes the expected total cost. It exists, as a finite expected cost is attainable and M′ is finitely branching; see [21, Theorem 7.3.6].
We now show that σ′ attains Transience with high probability in M′ (and in M). Since σ′ is cost-optimal, its expected cost from s_0 is upper-bounded by that of σ″, i.e., by 1 − val_{M,Transience}(s_0) + 2ε. Since the cost of entering Z is 1, this expected cost also bounds the probability of ever entering Z, and thus
(6) P_{M′,s_0,σ′}(Reach(Z)) ≤ 1 − val_{M,Transience}(s_0) + 2ε.
For every state s ∉ Z, all transitions into s have the same fixed non-zero cost. Thus every run that visits some state outside Z infinitely often has infinite cost. Since the expected cost of playing σ′ from s_0 is finite, such runs form a null set, i.e.,
(7) P_{M′,s_0,σ′}(some state outside Z is visited infinitely often) = 0.
Thus
P_{M,s_0,σ′}(Transience) ≥ P_{M′,s_0,σ′}(Transience)   (by (2))
  ≥ 1 − P_{M′,s_0,σ′}(Reach(Z))   (by (7): runs in M′ that avoid Z almost surely visit no state infinitely often)
  ≥ val_{M,Transience}(s_0) − 2ε   (by (6)).
Since ε > 0 can be chosen arbitrarily small, the claim follows.
Now we lift the result of Lemma 4.5 from non-uniform to uniform strategies (and to optimal strategies) and obtain the following theorem. The proof is a generalization of a “plastering” construction by Ornstein [20] (see also [16]) from reachability to tail objectives, which works by fixing MD strategies on ever expanding subsets of the state space.
Theorem 4.7.
Let M be a countable MDP, and let φ be an objective that is tail in M. Suppose that for every state s and every ε > 0 there exist ε-optimal MD strategies for φ from s. Then:
1. There exist uniformly ε-optimal MD strategies for φ, for every ε > 0.
2. There exists a single MD strategy that is optimal from every state that has an optimal strategy.
Theorem 4.8.
In every countable MDP, for every ε > 0, there exist uniformly ε-optimal MD strategies for Transience. Moreover, there exists a single MD strategy that is optimal for Transience from every state that has an optimal strategy.
Proof 4.9.
Immediate from Lemma 4.5 and Theorem 4.7, since Transience is a tail objective.
5 Strategy Complexity in Universally Transient MDPs
The strategy complexity of parity objectives in general countable MDPs is known [15]. Here we show that some parity objectives have a lower strategy complexity in universally transient MDPs. It is known [14] that there are acyclic (and hence universally transient) MDPs where ε-optimal strategies for Büchi objectives (and optimal strategies, where they exist, resp.) require at least 1 bit of memory.
We show that, for all simpler parity objectives in the Mostowski hierarchy [19], universally transient MDPs admit uniformly (ε-)optimal MD strategies (unlike general MDPs [15]). These results (Theorems 5.3 and 5.5) ultimately rely on the existence of uniformly ε-optimal MD strategies for safety objectives. While such strategies always exist in finitely branching MDPs (simply pick a value-maximal successor), this is not the case for infinitely branching MDPs [17]. However, we show that universal transience implies the existence of uniformly ε-optimal MD strategies for safety objectives even in infinitely branching MDPs.
Theorem 5.1.
For every universally transient countable MDP, every safety objective and every ε > 0, there exists a uniformly ε-optimal MD strategy.
Proof 5.2.
Let M be a universally transient MDP and ε > 0. Assume w.l.o.g. that the target set T of the safety objective Safety(T) is a (losing) sink, and let s_1, s_2, … be an enumeration of the state space S.
By Lemma 3.7(3), for every state s_i the value c_i of re-visiting s_i satisfies c_i < 1, and thus 1/(1 − c_i) < ∞. This means that, independent of the chosen strategy, c_i upper-bounds the chance of returning to s_i, and 1/(1 − c_i) bounds the expected number of visits to s_i.
Suppose that σ is an MD strategy which, at any controlled state s_i, picks a successor s′ with val(s′) ≥ val(s_i) − ε(1 − c_i)2^{−i}. This is possible even if M is infinitely branching, by the definition of the value and since the slack ε(1 − c_i)2^{−i} is strictly positive (a sketch of this successor choice is given below). We show that P_{M,s,σ}(Safety(T)) ≥ val(s) − ε holds for every initial state s, which implies the claim of the theorem.
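The following Python sketch spells out this successor choice. The dictionaries for the values, the return values from Lemma 3.7(3) and the enumeration are assumed as inputs and restricted to a finite fragment, so this is an illustration of the rule rather than an executable construction of the full strategy.

```python
from typing import Callable, Dict, List

def md_safety_strategy(
    successors: Dict[str, List[str]],   # successors of each controlled state (finite fragment)
    val: Dict[str, float],              # value w.r.t. the safety objective
    c_ret: Dict[str, float],            # value of re-visiting the state; < 1 by Lemma 3.7(3)
    index: Dict[str, int],              # enumeration s_1, s_2, ... of the state space (from 1)
    epsilon: float,
) -> Callable[[str], str]:
    """Return an MD strategy: at state s_i, pick a successor whose value is within
    the slack epsilon*(1-c_i)*2^-i of val(s_i).  The expected total value loss is
    then at most epsilon, since the expected number of visits to s_i is at most
    1/(1 - c_i)."""
    def choose(s: str) -> str:
        slack = epsilon * (1 - c_ret[s]) * 2 ** (-index[s])
        for t in successors[s]:
            if val[t] >= val[s] - slack:
                return t                # such a successor exists by the definition of the value
        raise AssertionError("no suitable successor in the given finite fragment")
    return choose
```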
Towards this, we define a function that labels each transition in the MDP with a real-valued cost: for every controlled transition (s, s′), let cost(s, s′) := val(s) − val(s′); random transitions have cost zero. We will argue that, when playing σ from any start state s, its attainment w.r.t. the safety objective is at least the value of s minus the expected total cost, and that this expected total cost is bounded by ε.
For any k, let us write X_k for the random variable denoting the state just after step k, and c_k for the cost of step k in a random run. We observe that under σ the expected total cost is bounded in the limit, i.e.,
(8) lim_{n→∞} E(Σ_{k<n} c_k) ≤ ε.
We moreover note that for every n,
(9) E(val(X_n)) ≥ val(s) − E(Σ_{k<n} c_k).
Full proofs of the above two equations can be found in the appendix. Together they imply
(10) lim_{n→∞} E(val(X_n)) ≥ val(s) − ε.
Finally, to show the claim, let Y_n be the random variable that indicates that the n-th state is not in the target set T, i.e., Y_n = 1 if X_n ∉ T and Y_n = 0 otherwise. Note that val(X_n) ≤ Y_n, because target states have value 0. We have:
P_{M,s,σ}(Safety(T)) = P_{M,s,σ}(∀n: X_n ∉ T)   (semantics of Safety(T))
  = lim_{n→∞} P_{M,s,σ}(X_n ∉ T)   (continuity of measures; T is a sink)
  = lim_{n→∞} E(Y_n)   (definition of Y_n)
  ≥ lim_{n→∞} E(val(X_n))   (as val(X_n) ≤ Y_n)
  ≥ val(s) − ε   (Equation 10).
We can now combine Theorem 5.1 with the results from [15] to show the existence of MD strategies assuming universal transience.
Theorem 5.3.
For universally transient MDPs, optimal strategies for the parity objectives considered here, where they exist, can be chosen uniformly MD.
Formally, let M be a universally transient MDP with state set S, let Col be a coloring defining such a parity objective φ. There exists an MD strategy σ that is optimal from every state that has an optimal strategy, i.e., P_{M,s,σ}(φ) = val_{M,φ}(s) for every such state s.
Proof 5.4.
Let M̂ be the conditioned version of M w.r.t. the parity objective in the sense of [15, Def. 19]. By Lemma 6.8, M̂ is still a universally transient MDP, and therefore, by Theorem 5.1, there exist uniformly ε-optimal MD strategies in M̂ for every safety objective and every ε > 0. The claim now follows from [15, Theorem 22].
Theorem 5.5.
For every universally transient countable MDP M, every co-Büchi objective and every ε > 0, there exists a uniformly ε-optimal MD strategy.
Formally, let M be a universally transient countable MDP with states S, let Col: S → {0,1} be a coloring, and let ε > 0.
There exists an MD strategy σ such that, for every state s, P_{M,s,σ}(Parity(Col)) ≥ val_{M,Parity(Col)}(s) − ε.
Proof 5.6.
This directly follows from Theorem 5.1 and [15, Theorem 25].
6 The Conditioned MDP
Given an MDP M and an objective φ that is tail in M, a construction of a conditioned MDP was provided in [17, Lemma 6] that, very loosely speaking, "scales up" the probability of φ so that any strategy is optimal in M if it is almost surely winning in the conditioned MDP. For certain tail objectives, this construction was used in [17] to reduce the sufficiency of MD strategies for optimal strategies to the sufficiency of MD strategies for almost surely winning strategies, which is a special case that may be easier to handle.
However, the construction was restricted to states that have an optimal strategy. In fact, states in M that do not have an optimal strategy do not appear in the conditioned MDP of [17]. In the following, we lift this restriction by constructing a more general version of the conditioned MDP, called C. The MDP C contains all states from M that have a positive value w.r.t. φ in M. Moreover, all these states have value 1 in C. It then follows from Lemma 6.2(3) below that an ε-optimal strategy in C is multiplicatively ε-optimal in M. This allows us to reduce the sufficiency of MD strategies for ε-optimal strategies to the sufficiency of MD strategies for ε-optimal strategies for states with value 1. In fact, it also follows that if an MD strategy σ is uniformly ε-optimal in C, then it is multiplicatively uniformly ε-optimal in M, i.e., P_{M,s,σ}(φ) ≥ (1 − ε)·val_{M,φ}(s) holds for all states s.
Definition 6.1.
For an MDP M = (S, S_□, S_◯, →, P) and an objective φ that is tail in M, define the conditioned version of M w.r.t. φ to be the MDP C whose state space is
S′ = {s ∈ S | val_{M,φ}(s) > 0} ∪ {⊥}
for a fresh state ⊥, and whose transitions and probabilities are obtained from those of M by conditioning on φ (the probabilities of random transitions are reweighted by the values of their target states; see the sketch below).
The conditioned MDP is well-defined. Indeed, as φ is tail in M, for any random state s we have val_{M,φ}(s) = Σ_{s′} P(s)(s′)·val_{M,φ}(s′), and so if val_{M,φ}(s) > 0 then the reweighted probabilities of the successors with positive value sum up to 1.
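To make the reweighting concrete, the following display spells out the conditioning at random states. It is our reconstruction following [17] and the well-definedness remark above; the treatment of controlled states and of the fresh state ⊥ is omitted, so this should be read as a sketch rather than the verbatim definition.

\[
  S' \;=\; \{\, s \in S \mid \mathtt{val}_{\mathcal{M},\varphi}(s) > 0 \,\} \cup \{\bot\},
  \qquad
  P'(s)(s') \;=\; \frac{P(s)(s')\cdot \mathtt{val}_{\mathcal{M},\varphi}(s')}{\mathtt{val}_{\mathcal{M},\varphi}(s)}
  \quad\text{for random } s \in S',\ s' \in S' \setminus \{\bot\}.
\]

Since φ is tail, these probabilities sum to 1 over S′ ∖ {⊥}, so no probability mass needs to be diverted to ⊥ at random states.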
Lemma 6.2.
Let M be an MDP, and let φ be an objective that is tail in M. Let C be the conditioned version of M w.r.t. φ. Let s_0 be a state with val_{M,φ}(s_0) > 0. Let σ be a strategy in C, and note that σ can be transformed to a strategy in M in a natural way. Then:
1. For every partial run ρ in C starting at s_0, the probability of ρ under σ in C is determined by the probability of its contraction under σ in M together with the values (w.r.t. φ) of the states involved; here, the contraction of a partial run in C is the partial run in M obtained by deleting all auxiliary states introduced in the construction of C.
2. Likewise, for all measurable sets of runs, the measure under σ in C is determined by the measure of the contracted runs under σ in M, where the contraction deletes, in all runs, all auxiliary states of C.
3. We have P_{M,s_0,σ}(φ) = val_{M,φ}(s_0)·P_{C,s_0,σ}(φ). In particular, val_{C,φ}(s_0) = 1, and, for any ε ≥ 0, strategy σ is ε-optimal from s_0 in C if and only if it is multiplicatively ε-optimal from s_0 in M.
Lemma 6.2(3) provides a way of proving the existence of MD strategies that attain, for each state s, a fixed fraction (arbitrarily close to 1) of the value of s:
Theorem 6.3.
Let M be an MDP, and let φ be an objective that is tail in M. Let C be the conditioned version of M w.r.t. φ. Let ε > 0. Any MD strategy σ that is uniformly ε-optimal in C (i.e., P_{C,s,σ}(φ) ≥ 1 − ε holds for all states s of C other than ⊥) is multiplicatively ε-optimal in M (i.e., P_{M,s,σ}(φ) ≥ (1 − ε)·val_{M,φ}(s) holds for all s ∈ S).
Proof 6.4.
Immediate from Lemma 6.2.3.
As an application of Theorem 6.3, we can strengthen the first statement of Theorem 4.8 towards multiplicatively (see Theorem 6.3) uniformly ε-optimal MD strategies for Transience.
Corollary 6.5.
In every countable MDP, for every ε > 0, there exist multiplicatively uniformly ε-optimal MD strategies for Transience.
Proof 6.6.
Let M be a countable MDP, and let C be its conditioned version w.r.t. Transience. Let ε > 0. By Theorem 4.8, there is a uniformly ε-optimal MD strategy σ for Transience in C. By Theorem 6.3, strategy σ is multiplicatively uniformly ε-optimal in M.
The following lemma, stating that universal transience is closed under “conditioning”, is needed for the proof of Lemma 6.8 below.
Lemma 6.7.
Let M be an MDP, and let φ be an objective that is tail in M. Let C′ be the conditioned version of M w.r.t. φ, where the fresh state ⊥ is replaced by an infinite chain of fresh states ⊥_1 → ⊥_2 → ⋯. If M is universally transient, then so is C′.
In [17, Lemma 6] a variant of the conditioned MDP from Definition 6.1 was proposed. This variant differs from C in that it contains only those states from M that have an optimal strategy, i.e., a strategy σ with P_{M,s,σ}(φ) = val_{M,φ}(s). Further, for any transition (s, s′) in this variant where s is a controlled state, we have val_{M,φ}(s′) = val_{M,φ}(s), i.e., the variant does not have value-decreasing transitions emanating from controlled states. The following lemma was used in the proof of Theorem 5.3:
Lemma 6.8.
Let M be an MDP, and let φ be an objective that is tail in M. Let M̂ be the conditioned version of M w.r.t. φ in the sense of [17, Lemma 6]. If M is universally transient, then so is M̂.
7 Conclusion
The Transience objective admits ε-optimal (resp. optimal) MD strategies even in infinitely branching MDPs. This is unusual, since ε-optimal strategies for most other objectives require infinite memory if the MDP is infinitely branching (in particular, all objectives generalizing Safety [17]).
Transience encodes a notion of continuous progress, which can be used as a tool to reason about the strategy complexity of other objectives in countable MDPs. E.g., our result on Transience is used in [18] as a building block to show upper bounds on the strategy complexity of certain threshold objectives w.r.t. mean payoff, total payoff and point payoff.
References
- [1] Pieter Abbeel and Andrew Y. Ng. Learning first-order Markov models for control. In Advances in Neural Information Processing Systems 17. MIT Press, 2004. URL: http://papers.nips.cc/paper/2569-learning-first-order-markov-models-for-control.
- [2] Galit Ashkenazi-Golan, János Flesch, Arkadi Predtetchinski, and Eilon Solan. Reachability and safety objectives in Markov decision processes on long but finite horizons. Journal of Optimization Theory and Applications, 2020.
- [3] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
- [4] Patrick Billingsley. Probability and Measure. Wiley, 1995. Third Edition.
- [5] Vincent D. Blondel and John N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 2000.
- [6] Nicole Bäuerle and Ulrich Rieder. Markov Decision Processes with Applications to Finance. Springer-Verlag Berlin Heidelberg, 2011.
- [7] K. Chatterjee and T. Henzinger. A survey of stochastic ω-regular games. Journal of Computer and System Sciences, 2012.
- [8] Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors. Handbook of Model Checking. Springer, 2018. doi:10.1007/978-3-319-10575-8.
- [9] E.M. Clarke, O. Grumberg, and D. Peled. Model Checking. MIT Press, Dec. 1999.
- [10] William Feller. An Introduction to Probability Theory and Its Applications. Wiley & Sons, second edition, 1966.
- [11] János Flesch, Arkadi Predtetchinski, and William Sudderth. Simplifying optimal strategies in limsup and liminf stochastic games. Discrete Applied Mathematics, 2018.
- [12] T.P. Hill and V.C. Pestien. The existence of good Markov strategies for decision processes with general payoffs. Stoch. Processes and Appl., 1987.
- [13] S. Kiefer, R. Mayr, M. Shirmohammadi, and P. Totzke. Transience in countable MDPs. In International Conference on Concurrency Theory, LIPIcs, 2021. Full version at https://arxiv.org/abs/2012.13739.
- [14] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Patrick Totzke. Büchi objectives in countable MDPs. In International Colloquium on Automata, Languages and Programming, LIPIcs, 2019. Full version at https://arxiv.org/abs/1904.11573. doi:10.4230/LIPIcs.ICALP.2019.119.
- [15] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Patrick Totzke. Strategy Complexity of Parity Objectives in Countable MDPs. In International Conference on Concurrency Theory, 2020. doi:10.4230/LIPIcs.CONCUR.2020.7.
- [16] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, Patrick Totzke, and Dominik Wojtczak. How to play in infinite MDPs (invited talk). In International Colloquium on Automata, Languages and Programming, 2020. doi:10.4230/LIPIcs.ICALP.2020.3.
- [17] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Dominik Wojtczak. Parity Objectives in Countable MDPs. In Annual IEEE Symposium on Logic in Computer Science, 2017. doi:10.1109/LICS.2017.8005100.
- [18] Richard Mayr and Eric Munday. Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff Objectives in Countable MDPs. In International Conference on Concurrency Theory, LIPIcs, 2021. The full version is available on arXiv.
- [19] A. Mostowski. Regular expressions for infinite trees and a standard form of automata. In Computation Theory, LNCS, 1984.
- [20] Donald Ornstein. On the existence of stationary optimal strategies. Proceedings of the American Mathematical Society, 1969. doi:10.2307/2035700.
- [21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition, 1994.
- [22] George Santayana. Reason in common sense. In Volume 1 of The Life of Reason. 1905. URL: https://en.wikipedia.org/wiki/George_Santayana.
- [23] Manfred Schäl. Markov decision processes in finance and dynamic options. In Handbook of Markov Decision Processes. Springer, 2002.
- [24] Olivier Sigaud and Olivier Buffet. Markov Decision Processes in Artificial Intelligence. John Wiley & Sons, 2013.
- [25] William D. Sudderth. Optimal Markov strategies. Decisions in Economics and Finance, 2020.
- [26] R.S. Sutton and A.G Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 2018.
- [27] Moshe Y. Vardi. Automatic verification of probabilistic concurrent finite-state programs. In Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 1985. doi:10.1109/SFCS.1985.12.
Appendix A Strategy Classes
We formalize the amount of memory needed to implement strategies. Let Mem be a countable set of memory modes. An update function is a function u: Mem × S → D(Mem × S) that meets the following two conditions, for all modes m ∈ Mem:
• for all controlled states s ∈ S_□, the distribution u(m, s) is over Mem × {s′ | s → s′};
• for all random states s ∈ S_◯, the marginal of u(m, s) on the successor states coincides with P(s), i.e., Σ_{m′∈Mem} u(m, s)(m′, s′) = P(s)(s′) for every successor s′.
An update function u together with an initial memory mode m_0 induce a strategy σ as follows. Consider the Markov chain with state set Mem × S whose transition probabilities are given by u on pairs of memory mode and state. Any partial run ρ in M gives rise to the set of partial runs of this Markov chain whose projection to S equals ρ. For a partial run ρ ending in a state s, this induces a probability distribution over the memory modes: the probability of mode m is the probability of the chain being in (m, s), conditioned on having taken some partial run from this set. We define σ(ρ) as the corresponding mixture: the distribution over successors obtained by first drawing a memory mode m according to this distribution and then playing according to u(m, s).
We say that a strategy σ can be implemented with memory Mem (and initial memory mode m_0) if there exists an update function u such that the induced strategy coincides with σ. In this case we may also write σ[m] to explicitly specify the initial memory mode m. Based on this, we can define several classes of strategies:
A strategy is memoryless (M) (also called positional) if it can be implemented with a memory of size 1. We may view M-strategies as functions σ: S_□ → D(S). A strategy is finite memory (F) if there exists a finite memory Mem implementing it. More specifically, a strategy is k-bit if it can be implemented with a memory of size 2^k. Such a strategy is then determined by an update function u: {0,1}^k × S → D({0,1}^k × S) together with an initial memory mode. Deterministic 1-bit strategies are both deterministic and 1-bit.
Appendix B Missing Proofs from Section 3
In this section, we prove Lemma 3.5 from the main body.
Lemma 3.5 (restated).
Proof B.1.
Given an infinitely branching MDP M with set of states S and an initial state s_0, we construct a finitely branching MDP M′ with set of states S′ ⊇ S such that the two items of the lemma hold. The reduction uses the concept of "recurrent ladders"; see Figure 2.
The reduction is as follows.
• For every controlled state s in M that has infinitely many successors s_1, s_2, …, we introduce a recurrent ladder in M′, consisting of fresh controlled states x_1, x_2, … and fresh random states y_1, y_2, …. The transitions include s → x_1 and, for every i, the two transitions x_i → s_i (leaving the ladder to the i-th successor of s) and x_i → y_i (staying on the ladder). Moreover, each random state y_i moves one controlled state up or down the ladder with probability 1/2 each, as in Gambler's Ruin with a fair coin. All states of the recurrent ladder are fresh states.
• For every random state s in M with infinitely many successors s_1, s_2, …, we use a gadget with fresh random states r_1, r_2, … and suitably adjusted probabilities to ensure that the gadget is left towards s_i with exactly the probability P(s)(s_i) (see the sketch after this list).
See Figure 3 for a partial illustration.
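One standard way to realise the "suitably adjusted probabilities" of the second item is a chain of fresh binary random states r_1, r_2, … with r_i → s_i and r_i → r_{i+1}; the concrete probabilities below are our reconstruction of the gadget sketched in Figure 3, which is not reproduced here.

\[
  P'(r_i)(s_i) \;=\; \frac{P(s)(s_i)}{1-\sum_{j<i}P(s)(s_j)},
  \qquad
  P'(r_i)(r_{i+1}) \;=\; 1-P'(r_i)(s_i).
\]

With this choice, the probability of leaving the chain at s_i is Π_{j<i}(1 − P′(r_j)(s_j))·P′(r_i)(s_i) = P(s)(s_i), as required, by a telescoping product.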
Given a partial run ρ, denote by last(ρ) its last state.
For the first item, let σ be a general strategy in M. We define σ′ in M′ with the use of memory and an update function; see Appendix A. The update function is defined by a case distinction on the current state and memory; the cases mirror the intuitive description given next.
The strategy σ′ consists of the above update function and an initial memory whose first component is the empty run. Intuitively speaking, in every step σ′ considers the memory and the current state to simulate what σ would have played in M. The memory invariantly stores the history of the run projected onto the state space of M (omitting the states introduced by the reduction). The second memory component is a special symbol if the current state is in S, and otherwise it is a natural number i. Such a natural number indicates that the controller is currently on a recurrent ladder and must leave the ladder at its i-th controlled state. Starting with the empty history and the special symbol, σ′ behaves as follows:
• when the current state is a random state in S, σ′ only appends it to the history, to keep track of the run;
• when the current state is a finitely branching controlled state in S, σ′ plays as σ and appends the state to the history;
• when the current state is an infinitely branching controlled state s in S with successors s_1, s_2, …, then, for every i, the strategy σ′ moves to the first state of the recurrent ladder for s while flipping the memory from the special symbol to i with the probability that σ would have assigned to s_i. This requires the ladder to be traversed to its i-th controlled state and left from there to s_i, the i-th successor of s in M. Furthermore, s is appended to the history;
• when the current state is the j-th controlled state of a recurrent ladder and the memory holds the number i with j ≠ i, σ′ continues to stay on the ladder by picking the adjacent random ladder state;
• when the current state is the i-th controlled state of a recurrent ladder and the memory holds the number i, σ′ leaves the ladder to s_i, the i-th successor of the corresponding state s in M. In addition, the memory is flipped back to the special symbol.
By the construction of M′ and σ′, it follows that σ′ in M′ faithfully simulates σ in M, and thus achieves Transience from s_0 with at least the same probability.
For the second item, let σ′ be an MD strategy in M′. We define an MD strategy σ in M as follows. For every finitely branching controlled state s, let σ(s) := σ′(s). For every infinitely branching controlled state s, let σ(s) := s_i if σ′ exits the recurrent ladder of s at its i-th controlled state; if σ′ never exits the ladder of s, let σ(s) be some fixed successor of s.
Note that the above strategy is well-defined, as in every recurrent ladder in M′, either there exists some i such that σ′ exits the ladder at its i-th controlled state, or σ′ chooses to stay on the ladder forever. In the latter case, by a Gambler's Ruin argument, the probability of Transience for the runs staying on the ladder forever is 0. By construction, σ faithfully simulates σ′, except when σ′ stays on a ladder forever and the prospect of Transience becomes 0; in those cases, σ continues playing from the fixed successor chosen above.
It follows that σ achieves Transience from s_0 with at least the probability that σ′ achieves in M′.
Appendix C 1-Bit Strategies for Transience ∧ Büchi(F)
Lemma 4.1 (restated).
Proof C.1.
We prove the claim for finitely branching MDPs first and transfer the result to general MDPs at the end.
Let M be a finitely branching countable MDP, I a finite set of initial states, F ⊆ S a set of goal states, and consider the objective Transience ∧ Büchi(F).
For every s ∈ I and every ε > 0 there exists an ε-optimal strategy σ_s, i.e.,
(11) P_{M,s,σ_s}(Transience ∧ Büchi(F)) ≥ val_{M,Transience ∧ Büchi(F)}(s) − ε.
However, the strategies σ_s might differ from each other and might use randomization and a large (or even infinite) amount of memory. We will construct a single deterministic strategy σ that uses only 1 bit of memory and attains, from every s ∈ I, a probability that is within a small multiple of ε of the value. This proves the claim, as ε can be chosen arbitrarily small.
In order to construct σ, we first observe the behavior of the finitely many σ_s for s ∈ I on an infinite, increasing sequence of finite subsets of S. Based on this, we define a second, stronger objective φ′ with
(12) φ′ ⊆ Transience ∧ Büchi(F)
and show that all σ_s attain φ′ with only a small additional loss, i.e.,
(13) P_{M,s,σ_s}(φ′) ≥ P_{M,s,σ_s}(Transience ∧ Büchi(F)) − ε for every s ∈ I.
We construct σ as a deterministic 1-bit strategy that is optimal w.r.t. φ′ from all s ∈ I and obtain, for every s ∈ I,
P_{M,s,σ}(Transience ∧ Büchi(F)) ≥ P_{M,s,σ}(φ′)   (by Equation 12)
  ≥ P_{M,s,σ_s}(φ′)   (by optimality of σ for φ′)
  ≥ val_{M,Transience ∧ Büchi(F)}(s) − 2ε   (by Equations 13 and 11).
Behavior of the σ_s, objective φ′ and properties Equation 12 and Equation 13. We start with some notation. For a set of states U ⊆ S and k ∈ ℕ, let Post^{≤k}(U) be the set of states that can be reached from some state in U within at most k steps. Since M is finitely branching, Post^{≤k}(U) is finite if U is finite. Let Visit_{≤k}(U) and Visit_{≥k}(U) denote the property of visiting the set U (at least once) within at most (resp. after at least) k steps. Moreover, let Visit_{inf}(U) denote the property of visiting U infinitely often.
Lemma C.2.
Assume the setup of Lemma 4.1, and a strategy σ_s from each s ∈ I. Let U ⊆ S be a finite set of states and δ > 0.
1. There is k ∈ ℕ such that, for every s ∈ I, P_{M,s,σ_s}(Visit_{≤k}(F)) ≥ P_{M,s,σ_s}(Reach(F)) − δ.
2. There is k ∈ ℕ such that, for every s ∈ I, P_{M,s,σ_s}(Visit_{≥k}(U)) ≤ P_{M,s,σ_s}(Visit_{inf}(U)) + δ.
Proof C.3.
It suffices to show the properties for a single s ∈ I, since one can then take the maximum of the resulting bounds k over the finitely many s ∈ I.
We observe that Visit_{inf}(U) = ⋂_{k∈ℕ} Visit_{≥k}(U) = ⋃_{u∈U} GF u, where the last equivalence is due to the finiteness of U.
Towards 1, we have Reach(F) = ⋃_{k∈ℕ} Visit_{≤k}(F), and this union is increasing in k. It follows from the continuity of measures that lim_{k→∞} P_{M,s,σ_s}(Visit_{≤k}(F)) = P_{M,s,σ_s}(Reach(F)), so a suitable k exists.
Towards 2, we have Visit_{inf}(U) = ⋂_{k∈ℕ} Visit_{≥k}(U), and this intersection is decreasing in k. By continuity of measures we obtain lim_{k→∞} P_{M,s,σ_s}(Visit_{≥k}(U)) = P_{M,s,σ_s}(Visit_{inf}(U)), so a suitable k exists.
In the following, let us write R̄ to denote the complement of a set R of runs.