
Department of Computer Science, University of Oxford, UK
School of Informatics, University of Edinburgh, UK
Université de Paris, CNRS, IRIF, F-75013 Paris, France
Department of Computer Science, University of Liverpool, UK

This is the full version of a CONCUR 2021 paper [13].


Transience in Countable MDPs

Stefan Kiefer    Richard Mayr    Mahsa Shirmohammadi    Patrick Totzke
Abstract

The 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} objective is not to visit any state infinitely often. While this is not possible in any finite Markov Decision Process (MDP), it can be satisfied in countably infinite ones, e.g., if the transition graph is acyclic.

We prove the following fundamental properties of 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} in countably infinite MDPs.

1. There exist uniformly ε-optimal MD strategies (memoryless deterministic) for Transience, even in infinitely branching MDPs.

2. Optimal strategies for Transience need not exist, even if the MDP is finitely branching. However, if an optimal strategy exists then there is also an optimal MD strategy.

3. If an MDP is universally transient (i.e., almost surely transient under all strategies) then many other objectives have a lower strategy complexity than in general MDPs. E.g., ε-optimal strategies for Safety and co-Büchi and optimal strategies for {0,1,2}-Parity (where they exist) can be chosen MD, even if the MDP is infinitely branching.

keywords:
Markov decision processes, Parity, Transience

1 Introduction

Those who cannot remember the past are condemned to repeat it.

George Santayana (1905) [22]

The famous aphorism above has often been cited (with small variations), e.g., by Winston Churchill in a 1948 speech to the House of Commons, and carved into several monuments all over the world [22].

We prove that the aphorism is false. In fact, even those who cannot remember anything at all are not condemned to repeat the past. With the right strategy they can avoid repeating the past just as well as everyone else. More formally, playing for Transience does not require any memory. We show that there always exist ε-optimal memoryless deterministic strategies for Transience, and if optimal strategies exist then there also exist optimal memoryless deterministic strategies. (Our result applies to MDPs, also called games against nature. It is an open question whether it generalizes to countable stochastic 2-player games. However, it is easy to see that the adversary needs infinite memory in general, even if the player is passive [14, 16].)

Background. We study Markov decision processes (MDPs), a standard model for dynamic systems that exhibit both stochastic and controlled behavior [21]. MDPs play a prominent role in many domains, e.g., artificial intelligence and machine learning [26, 24], control theory [5, 1], operations research and finance [25, 12, 6, 23], and formal verification [2, 25, 11, 8, 3, 7].

An MDP is a directed graph where states are either random or controlled. Its observed behavior is described by runs, which are infinite paths that are, in part, determined by the choices of a controller. If the current state is random then the next state is chosen according to a fixed probability distribution. Otherwise, if the current state is controlled, the controller can choose a distribution over all possible successor states. By fixing a strategy for the controller (and initial state), one obtains a probability space of runs of the MDP. The goal of the controller is to optimize the expected value of some objective function on the runs.

The strategy complexity of a given objective characterizes the type of strategy necessary to achieve an optimal (resp. ε\varepsilon-optimal) value for the objective. General strategies can take the whole history of the run into account (history-dependent; (H)), while others use only bounded information about it (finite memory; (F)) or base decisions only on the current state (memoryless; (M)). Moreover, the strategy type depends on whether the controller can randomize (R) or is limited to deterministic choices (D). The simplest type, MD, refers to memoryless deterministic strategies.

Acyclicity and Transience. An MDP is called acyclic iff its transition graph is acyclic. While finite MDPs cannot be acyclic (unless they have deadlocks), countable MDPs can. In acyclic countable MDPs, the strategy complexity of Büchi/Parity objectives is lower than in the general case: ε\varepsilon-optimal strategies for Büchi/Parity objectives require only one bit of memory in acyclic MDPs, while they require infinite memory (an unbounded step-counter, plus one bit) in general countable MDPs [14, 15].

The concept of transience can be seen as a generalization of acyclicity. In a Markov chain, a state s is called transient iff the probability of returning from s to s is strictly below 1 (otherwise the state is called recurrent). This means that a transient state is almost surely visited only finitely often. The notions of transience and recurrence lift naturally from Markov chains to MDPs, where they depend on the chosen strategy.

We define the 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} objective as the set of runs that do not visit any state infinitely often. We call an MDP universally transient iff it almost-surely satisfies 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} under every strategy. Thus every acyclic MDP is universally transient, but not vice-versa; cf. Figure 1. In particular, universal transience does not just depend on the structure of the transition graph, but also on the transition probabilities. Universally transient MDPs have interesting properties. Many objectives (e.g., Safety, Büchi, co-Büchi) have a lower strategy complexity than in general MDPs; see below.

We also study the strategy complexity of the 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} objective itself, and how it interacts with other objectives, e.g., how to attain a Büchi objective in a transient way.

Our contributions.

1. We show that there exist uniformly ε-optimal MD strategies (memoryless deterministic) for Transience, even in infinitely branching MDPs. This is unusual, since (apart from reachability objectives) most other objectives require infinite memory if the MDP is infinitely branching, e.g., all objectives generalizing Safety [17].

Our result is shown in several steps. First we show that there exist ε-optimal deterministic 1-bit strategies for Transience. Then we show how to dispense with the 1-bit memory and obtain ε-optimal MD strategies for Transience. Finally, we make these MD strategies uniform, i.e., independent of the start state.

2. We show that optimal strategies for Transience need not exist, even if the MDP is finitely branching. If they do exist then there are also optimal MD strategies. More generally, there exists a single MD strategy that is optimal from every state that allows optimal strategies for Transience.

3. If an MDP is universally transient (i.e., almost surely transient under all strategies) then many other objectives have a lower strategy complexity than in general MDPs, e.g., ε-optimal strategies for Safety and co-Büchi and optimal strategies for {0,1,2}-Parity (where they exist) can be chosen MD, even if the MDP is infinitely branching.

For our proofs we develop some technical results that are of independent interest. We generalize Ornstein's plastering construction [20] from reachability to tail objectives and thus obtain a general tool to infer uniformly ε-optimal MD strategies from non-uniform ones (cf. Theorem 4.7). Secondly, in Section 6 we develop the notion of the conditioned MDP (cf. [17]). For tail objectives, this allows us to obtain uniformly ε-optimal MD strategies w.r.t. multiplicative errors from those with merely additive errors.

2 Preliminaries

A probability distribution over a countable set SS is a function f:S[0,1]f:S\to[0,1] with sSf(s)=1\sum_{s\in S}f(s)=1. We write 𝒟(S)\mathcal{D}(S) for the set of all probability distributions over SS.

Markov Decision Processes. We define Markov decision processes (MDPs for short) over countably infinite state spaces as tuples =(S,S,S,,P){\mathcal{M}}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P) where SS is the countable set of states partitioned into a set SS_{\Box} of controlled states and a set SS_{\ocircle} of random states. The transition relation is S×S{\longrightarrow}\subseteq S\times S, and P:S𝒟(S)P:S_{\ocircle}\to\mathcal{D}(S) is a probability function. We write sss{\longrightarrow}{}s^{\prime} if (s,s)(s,s^{\prime})\in{\longrightarrow}, and refer to ss^{\prime} as a successor of ss. We assume that every state has at least one successor. The probability function PP assigns to each random state sSs\in S_{\ocircle} a probability distribution P(s)P(s) over its set of successors. A sink is a subset TST\subseteq S closed under the {\longrightarrow} relation.

An MDP is acyclic if the underlying graph (S,)(S,{\longrightarrow}) is acyclic. It is finitely branching if every state has finitely many successors and infinitely branching otherwise. An MDP without controlled states (S=S_{\Box}=\emptyset) is a Markov chain.

Strategies and Probability Measures. A run ρ\rho is an infinite sequence s0s1s_{0}s_{1}\cdots of states such that sisi+1s_{i}{\longrightarrow}{}s_{i+1} for all ii\in\mathbb{N}; a partial run is a finite prefix of a run. We write ρ(i)=si\rho(i)=s_{i} and say that (partial) run s0s1s_{0}s_{1}\cdots visits ss if s=sis=s_{i} for some ii. It starts in ss if s=s0s=s_{0}.

A strategy is a function σ:SS𝒟(S)\sigma:S^{*}S_{\Box}\to\mathcal{D}(S) that assigns to partial runs ρsSS\rho s\in S^{*}S_{\Box} a distribution over the successors of ss. We write Σ\Sigma_{{{\mathcal{M}}}} for the set of all strategies in {\mathcal{M}}. A strategy σ\sigma and an initial state s0Ss_{0}\in S induce a standard probability measure on sets of infinite runs. We write 𝒫,s0,σ(){\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}({{\mathfrak{R}}}) for the probability of a measurable set s0Sω{\mathfrak{R}}\subseteq s_{0}S^{\omega} of runs starting from s0s_{0}. It is defined for the cylinders s0s1snSωSωs_{0}s_{1}\ldots s_{n}S^{\omega}\in S^{\omega} as 𝒫,s0,σ(s0s1snSω)=defi=0n1σ¯(s0s1si)(si+1){\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(s_{0}s_{1}\ldots s_{n}S^{\omega})\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\prod_{i=0}^{n-1}\bar{\sigma}(s_{0}s_{1}\ldots s_{i})(s_{i+1}), where σ¯\bar{\sigma} is the map that extends σ\sigma by σ¯(ws)=P(s)\bar{\sigma}(ws)=P(s) for all wsSSws\in S^{*}S_{\ocircle}. By Carathéodory’s theorem [4], the measure for cylinders extends uniquely to a probability measure 𝒫,s0,σ{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma} on all measurable subsets of s0Sωs_{0}S^{\omega}. We will write ,s0,σ{\mathcal{E}}_{{\mathcal{M}},s_{0},\sigma} for the expectation w.r.t. 𝒫,s0,σ{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}.
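To make the induced probability measure concrete, the following minimal Python sketch (all names are hypothetical and not from the paper) samples finite run prefixes of an MDP under a given strategy; the product of the chosen probabilities along a prefix is exactly the cylinder probability defined above.

```python
import random

# Minimal illustrative sketch (hypothetical names): states are hashable objects,
# `controlled(s)` says whether s is a controlled state, `succ_dist(s)` gives P(s)
# for random states, and `strategy(history)` returns a distribution over the
# successors of the last (controlled) state of the history.

def sample_prefix(s0, controlled, succ_dist, strategy, steps):
    """Sample a partial run s0 s1 ... s_steps and return it with its cylinder probability."""
    history, prob = [s0], 1.0
    for _ in range(steps):
        s = history[-1]
        dist = strategy(history) if controlled(s) else succ_dist(s)
        states, weights = zip(*dist.items())
        nxt = random.choices(states, weights=weights)[0]
        prob *= dist[nxt]   # the factor sigma-bar(s0...si)(s_{i+1}) from the definition above
        history.append(nxt)
    return history, prob

# Toy example: controlled state 'c' with successors 'a' and 'b'; random state 'a'
# returns to 'c' or moves to the absorbing state 'b' with probability 1/2 each.
controlled = lambda s: s == 'c'
succ_dist = lambda s: {'c': 0.5, 'b': 0.5} if s == 'a' else {'b': 1.0}
strategy = lambda history: {'a': 1.0}   # an MD strategy: from 'c', always go to 'a'
run, pr = sample_prefix('c', controlled, succ_dist, strategy, steps=8)
print(run, pr)
```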

Strategy Classes. Strategies σ:SS𝒟(S)\sigma:S^{*}S_{\Box}\to\mathcal{D}(S) are in general randomized (R) in the sense that they take values in 𝒟(S)\mathcal{D}(S). A strategy σ\sigma is deterministic (D) if σ(ρ)\sigma(\rho) is a Dirac distribution for all partial runs ρSS\rho\in S^{*}S_{\Box}.

We formalize the amount of memory needed to implement strategies in Appendix A. The two classes of memoryless and 1-bit strategies are central to this paper. A strategy σ\sigma is memoryless (M) if σ\sigma bases its decision only on the last state of the run: σ(ρs)=σ(ρs)\sigma(\rho s)=\sigma(\rho^{\prime}s) for all ρ,ρS\rho,\rho^{\prime}\in S^{*}. We may view M-strategies as functions σ:S𝒟(S)\sigma:S_{\Box}\to\mathcal{D}(S). A 1-bit strategy σ\sigma may base its decision also on a memory mode 𝗆{0,1}{\sf m}\in\{0,1\}. Formally, a 1-bit strategy σ\sigma is given as a tuple (u,𝗆0)(u,{\sf m}_{0}) where 𝗆0{0,1}{\sf m}_{0}\in\{0,1\} is the initial memory mode and u:{0,1}×S𝒟({0,1}×S)u:\{0,1\}\times S\to\mathcal{D}(\{0,1\}\times S) is an update function such that

  • for all controlled states sSs\in S_{\Box}, the distribution u((𝗆,s))u(({\sf m},s)) is over {0,1}×{sss}\{0,1\}\times\{s^{\prime}\mid s{\longrightarrow}{}s^{\prime}\}.

  • for all random states sSs\in S_{\ocircle}, we have that 𝗆{0,1}u((𝗆,s))(𝗆,s)=P(s)(s)\sum_{{\sf m}^{\prime}\in\{0,1\}}u(({\sf m},s))({\sf m}^{\prime},s^{\prime})=P(s)(s^{\prime}).

Note that this definition allows for updating the memory mode upon visiting random states. We write σ[𝗆0]\sigma[{\sf m}_{0}] for the strategy obtained from σ\sigma by setting the initial memory mode to 𝗆0{\sf m}_{0}.

MD strategies are both memoryless and deterministic; and deterministic 1-bit strategies are both deterministic and 1-bit.

Objectives. The objective of the controller is determined by a predicate on infinite runs. We assume familiarity with the syntax and semantics of the temporal logic LTL [9]. Formulas are interpreted on the underlying structure (S,)(S,{\longrightarrow}) of the MDP {\mathcal{M}}. We use φ,ssSω\llbracket{\varphi}\rrbracket^{{\mathcal{M}},s}\subseteq sS^{\omega} to denote the set of runs starting from ss that satisfy the LTL formula φ{\varphi}, which is a measurable set [27]. We also write φ\llbracket{\varphi}\rrbracket^{{\mathcal{M}}} for sSφ,s\bigcup_{s\in S}\llbracket{\varphi}\rrbracket^{{\mathcal{M}},s}. Where it does not cause confusion we will identify φ\varphi and φ\llbracket{\varphi}\rrbracket and just write 𝒫,s,σ(φ){\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\varphi}) instead of 𝒫,s,σ(φ,s){\mathcal{P}}_{{\mathcal{M}},s,\sigma}(\llbracket{\varphi}\rrbracket^{{\mathcal{M}},s}).

Given a set TST\subseteq S of states, the reachability objective 𝚁𝚎𝚊𝚌𝚑(T)=def𝖥T\mathtt{Reach}(T)\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}{\sf F}T is the set of runs that visit TT at least once. The safety objective 𝚂𝚊𝚏𝚎𝚝𝚢(T)=def𝖦¬T\mathtt{Safety}(T)\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}{\sf G}\neg T is the set of runs that never visit TT.

Let 𝒞 ⊆ ℕ be a finite set of colors. A color function Col: S → 𝒞 assigns to each state s its color Col(s). The parity objective, written as Parity(Col), is the set of infinite runs such that the largest color that occurs infinitely often along the run is even. To define this formally, let even(𝒞) = {i ∈ 𝒞 ∣ i ≡ 0 mod 2}. For ▷ ∈ {<, ≤, =, ≥, >}, n ∈ ℕ, and Q ⊆ S, let [Q]^{Col▷n} := {s ∈ Q ∣ Col(s) ▷ n} be the set of states in Q with color ▷ n. Then

𝙿𝚊𝚛𝚒𝚝𝚢(Col)=defieven(𝒞)(𝖦𝖥[S]Col=i𝖥𝖦[S]Coli).\mathtt{Parity}({\mathit{C}ol})\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\bigvee_{i\in{\mathit{e}ven}({\mathcal{C}})}\left({\sf G}{\sf F}[S]^{{\mathit{C}ol}=i}\wedge{\sf F}{\sf G}[S]^{{\mathit{C}ol}\leq i}\right).

We write 𝒞-Parity for the parity objectives with the set of colors 𝒞 ⊆ ℕ. The classical Büchi and co-Büchi objectives correspond to {1,2}-Parity and {0,1}-Parity, respectively.
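As a small worked instance of this definition (not from the paper, but a direct instantiation of it): for 𝒞 = {1,2} we have even(𝒞) = {2}, and the conjunct FG[S]^{Col≤2} is trivially true, so
\[
\mathtt{Parity}(\mathit{Col}) \;=\; {\sf G}{\sf F}[S]^{\mathit{Col}=2},
\]
i.e., the runs that visit color-2 states infinitely often, which is exactly the Büchi objective with accepting set F = [S]^{Col=2}. Similarly, for 𝒞 = {0,1} the definition collapses to FG[S]^{Col=0}, the co-Büchi objective.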

An objective φ{\varphi} is called a tail objective (in {\mathcal{M}}) iff for every run ρρ\rho^{\prime}\rho with some finite prefix ρ\rho^{\prime} we have ρρφρφ\rho^{\prime}\rho\in{\varphi}{}\Leftrightarrow\rho\in{\varphi}{}. For every coloring Col{\mathit{C}ol}, 𝙿𝚊𝚛𝚒𝚝𝚢(Col)\mathtt{Parity}({\mathit{C}ol}) is tail. Reachability objectives are not always tail but in MDPs where the target set TT is a sink 𝚁𝚎𝚊𝚌𝚑(T)\mathtt{Reach}(T) is tail.

Optimal and ε\varepsilon-optimal Strategies. Given an objective φ{\varphi}, the value of state ss in an MDP {\mathcal{M}}, denoted by 𝚟𝚊𝚕,φ(s){\mathtt{val}_{{\mathcal{M}},{\varphi}}(s)}, is the supremum probability of achieving φ{\varphi}. Formally, we have 𝚟𝚊𝚕,φ(s)=defsupσΣ𝒫,s,σ(φ){\mathtt{val}_{{\mathcal{M}},{\varphi}}(s)}\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\sup_{\sigma\in\Sigma}{\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\varphi}) where Σ\Sigma is the set of all strategies. For ε0\varepsilon\geq 0 and state sSs\in S, we say that a strategy is ε\varepsilon-optimal from ss iff 𝒫,s,σ(φ)𝚟𝚊𝚕,φ(s)ε{\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\varphi})\geq{\mathtt{val}_{{\mathcal{M}},{\varphi}}(s)}-\varepsilon. A 0-optimal strategy is called optimal. An optimal strategy is almost-surely winning iff 𝚟𝚊𝚕,φ(s)=1{\mathtt{val}_{{\mathcal{M}},{\varphi}}(s)}=1.

An MD strategy can be viewed as a function σ: S_□ → S. For ε ≥ 0, an MD strategy σ is uniformly ε-optimal (resp. uniformly optimal) if it is ε-optimal (resp. optimal) from every s ∈ S.

Throughout the paper we may drop the subscripts and superscripts from the notation when they are understood from the context. The missing proofs can be found in the appendix.

3 Transience and Universally Transient MDPs

In this section we define the transience property for MDPs, a natural generalization of the well-understood concept of transient Markov chains. We enumerate crucial characteristics of this objective and define the notion of universally transient MDPs.

Fix a countable MDP =(S,S,S,,P){\mathcal{M}}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P). Define the transience objective, denoted by 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience}, to be the set of runs that do not visit any state of {\mathcal{M}} infinitely often, i.e.,

𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎=defsS𝖥𝖦¬s.\mathtt{Transience}\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\bigwedge_{s\in S}{\sf F}{\sf G}\;\neg s.

The 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} objective is tail, as it is closed under removing finite prefixes of runs. Also note that 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} cannot be encoded in a parity objective.

We call {\mathcal{M}} universally transient iff for all states s0s_{0}, for all strategies σ\sigma, the 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} property holds almost-surely from s0s_{0}, i.e.,

s0SσΣ𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=1.\forall s_{0}\in S\leavevmode\nobreak\ \leavevmode\nobreak\ \forall\sigma\in\Sigma\leavevmode\nobreak\ \leavevmode\nobreak\ {\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Transience})=1.
Figure 1: Gambler's Ruin with restart: state w_i represents that the controller's wealth is i, and the coin tosses are in the controller's favor with probability p. For all i, P_{w_i}(Transience) = 0 if p ≤ 1/2, and P_{w_i}(Transience) = 1 otherwise.

The MDP in Figure 1 models the classical Gambler’s Ruin Problem with restart; see [10, Chapter 14]. It is well-known that if the controller starts with wealth ii and if p12p\leq\frac{1}{2}, the probability of ruin (visiting the state w0w_{0}) is 𝒫wi(𝖥w0)=1{\mathcal{P}}_{w_{i}}({\sf F}\,w_{0})=1. Consequently, the probability of re-visiting w0w_{0} infinitely often is 11, implying that 𝒫wi(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=0{\mathcal{P}}_{w_{i}}(\mathtt{Transience})=0. In contrast, for the case with p>12p>\frac{1}{2}, for all states wiw_{i}, the probability of re-visiting wiw_{i} is strictly below 11. Hence, the 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} property holds almost-surely. This example indicates that the transience property depends on the probability values of the transitions and not just on the underlying transition graph, and thus may require arithmetic reasoning. In particular, the MDP in Figure 1 is universally transient iff p>12p>\frac{1}{2}.
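To see where the threshold p > 1/2 comes from, here is a sketch of the standard gambler's-ruin computation, under our reading of Figure 1 (w_0 moves to w_1 with probability 1; each w_i with i ≥ 1 moves up with probability p and down with probability 1−p). For p > 1/2, the probability that the walk started at w_{i+1} ever drops back down to w_i is (1−p)/p, while from w_{i−1} the upward drift (and the restart at w_0) makes the walk reach w_i almost surely. Hence, for i ≥ 1, the return probability is
\[
\mathit{Re}(w_i) \;=\; p\cdot\frac{1-p}{p} \;+\; (1-p)\cdot 1 \;=\; 2(1-p) \;<\; 1,
\]
and similarly Re(w_0) = (1−p)/p < 1, so every state is almost surely visited only finitely often. For p ≤ 1/2 the walk returns with probability 1 and the chain is recurrent.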

In general, optimal strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} need not exist:

Lemma 3.1.

There exists a finitely branching countable MDP with initial state s0s_{0} such that

  • 𝚟𝚊𝚕𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s)=1{\mathtt{val}_{\mathtt{Transience}}(s)}=1 for all controlled states ss,

  • there does not exist any optimal strategy σ\sigma such that 𝒫s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=1{\mathcal{P}}_{s_{0},\sigma}(\mathtt{Transience})=1.

Proof 3.2.

Consider a countable MDP M with the set of states S = {ℓ_i, ℓ′_i, r_i, x_i ∣ i ≥ 1} ∪ {ℓ_0, ⊥}; see Figure 2. For all i ≥ 1, the state x_{i+1} is the unique successor of x_i, so that the states (x_i)_{i≥1} form an acyclic ladder; the value of Transience is 1 for all x_i. The state ⊥ is a sink, and its value is 0. The states (r_i)_{i≥1} are all random: from r_i the run moves to x_i with probability 1−2^{−i} and falls into ⊥ with probability 2^{−i}. Observe that the value of Transience is 1−2^{−i} for r_i.

The states (ℓ_i)_{i∈ℕ} are controlled, whereas the states (ℓ′_i)_{i≥1} are random. By interleaving these states, we construct a "recurrent ladder" of decisions: ℓ_0 → ℓ_1 and, for all i ≥ 1, state ℓ_i has the two successors ℓ′_i and r_i. In the random states ℓ′_i, as in Gambler's Ruin with a fair coin, the successors are ℓ_{i−1} and ℓ_{i+1}, each with probability 1/2. In each state ℓ_i with i ≥ 1, the controller decides either to stay on the ladder by going to ℓ′_i or to leave the ladder to r_i. As in Figure 1, if the controller stays on the ladder forever, the probability of Transience is 0.

[Figure 2: the controlled states ℓ_0, ℓ_1, ℓ_2, … and the random states ℓ′_1, ℓ′_2, … form the recurrent ladder; each ℓ_i (i ≥ 1) can exit to r_i, from which the run moves to x_i with probability 1−1/2^i and to ⊥ with probability 1/2^i; the states x_1, x_2, … form an acyclic ladder.]
Figure 2: A partial illustration of the MDP in Lemma 3.1, in which there is no optimal strategy for Transience starting from the states ℓ_i. For readability, the state ⊥ is drawn in three copies. We call the ladder consisting of the interleaved controlled states ℓ_i and random states ℓ′_i a "recurrent ladder": if the controller stays on this ladder forever, it faithfully simulates a Gambler's Ruin with a fair coin, and the probability of Transience will be 0.

Starting in ℓ_0, for each i > 0, the strategy σ_i that stays on the ladder until visiting ℓ_i (which happens eventually almost surely) and then leaves the ladder to r_i achieves Transience with probability 1−2^{−i}. Hence val_Transience(ℓ_0) = 1.

Recall that Transience cannot be achieved with a positive probability by staying on the recurrent ladder forever. But any strategy that leaves the ladder with a positive probability comes with a positive probability of falling into ⊥: if it exits to r_i with probability p_i > 0 for some i, then ⊥ is reached with probability at least p_i · 2^{−i} > 0, so the strategy attains Transience with probability strictly below 1 and is not optimal either. Thus there is no optimal strategy for Transience.

Reduction to Finitely Branching MDPs. In our main results, we will prove that for the 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} property there always exist ε\varepsilon-optimal MD strategies in finitely branching countable MDPs; and if an optimal strategy exists, there will exist an optimal MD strategy. We generalize these results to infinitely branching countable MDPs by the following reduction:

Lemma 3.5.

Given an infinitely branching countable MDP M with an initial state s_0, there exists a finitely branching countable MDP M′ with a set S′ of states such that s_0 ∈ S′ and

1. each strategy α_1 in M is mapped to a unique strategy β_1 in M′ where
P_{s_0,α_1}(Transience) = P_{s_0,β_1}(Transience),

2. and conversely, every MD strategy β_2 in M′ is mapped to an MD strategy α_2 in M where
P_{s_0,α_2}(Transience) ≥ P_{s_0,β_2}(Transience).
Proof 3.6 (Proof sketch).

See Appendix B for the complete construction. In order to construct {\mathcal{M}}^{\prime} from {\mathcal{M}}, for each controlled state sSs\in S in {\mathcal{M}} that has infinitely many successors (si)i1(s_{i})_{i\geq 1}, a “recurrent ladder” is introduced; see Figure 3. Since the probability of 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} is 0 for all those runs that eventually stay forever on a recurrent ladder, the controller should exit such ladders to play optimally for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience}. Infinitely branching random states can be dealt with in an easier way.

[Figure 3: in M, the controlled state s has infinitely many successors s_1, s_2, …; in M′, s instead moves to a recurrent ladder of controlled states ℓ_0, ℓ_1, ℓ_2, … interleaved with random states ℓ′_1, ℓ′_2, …, where each ℓ_i (i ≥ 1) can either continue to ℓ′_i or exit to the original successor s_i, and each ℓ′_i moves to ℓ_{i−1} or ℓ_{i+1} with probability 1/2 each.]
Figure 3: A partial illustration of the reduction in Lemma 3.5.
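The ladder gadget of Figure 3 can be written down explicitly. The following Python sketch (hypothetical names; a lazy encoding of the transformation under our reading of the figure) returns the successor structure of M′ for a single infinitely branching controlled state s with successors s_1, s_2, ….

```python
# Minimal sketch of the Lemma 3.5 ladder gadget for one infinitely branching
# controlled state s (names hypothetical).  Gadget states are encoded as
# ('L', i) for the controlled ladder states l_i and ('R', i) for the random l'_i.
# Controlled states return a set of successors, random states a distribution.

def gadget_successors(state, original_successor):
    """Successors in M' of a gadget state; original_successor(i) returns s_i of M."""
    kind, i = state
    if kind == 'L':                                  # controlled ladder state l_i
        if i == 0:
            return {('L', 1)}                        # l_0 has the single successor l_1
        return {('R', i), original_successor(i)}     # stay on the ladder or exit to s_i
    if kind == 'R':                                  # random state l'_i: fair coin
        return {('L', i - 1): 0.5, ('L', i + 1): 0.5}
    raise ValueError(state)

# In M', the original state s gets the single successor ('L', 0) of its gadget.
print(gadget_successors(('L', 3), lambda i: f's_{i}'))
print(gadget_successors(('R', 3), lambda i: f's_{i}'))
```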

Properties of Universally Transient MDPs.

Notice that acyclicity implies universal transience, but not vice-versa.

Lemma 3.7.

For every countable MDP =(S,S,S,,P){\mathcal{M}}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P), the following conditions are equivalent.

1. M is universally transient, i.e., ∀s_0 ∀σ. P_{M,s_0,σ}(Transience) = 1.

2. For every initial state s_0 and state s, the objective of re-visiting s infinitely often has value zero, i.e., ∀s_0, s. sup_σ P_{M,s_0,σ}(GF(s)) = 0.

3. For every state s, the value of the objective to re-visit s is strictly below 1, i.e., Re(s) := sup_σ P_{M,s,σ}(XF(s)) < 1.

4. For every state s there exists a finite bound B(s) such that for every state s_0 and strategy σ from s_0 the expected number of visits to s is ≤ B(s).

5. For all states s_0, s, under every strategy σ from s_0 the expected number of visits to s is finite.

Proof 3.8.

Towards (1)(2)(1)\Rightarrow(2), consider an arbitrary strategy σ\sigma from the initial state s0s_{0} and some state ss. By (1) we have σ.𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=1\forall\sigma.{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Transience})=1 and thus 0=𝒫,s0,σ(¬𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=𝒫,s0,σ(sS𝖦𝖥(s))𝒫,s0,σ(𝖦𝖥(s))0={\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\neg\mathtt{Transience})={\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\bigcup_{s^{\prime}\in S}{\sf G}{\sf F}(s^{\prime}))\geq{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}({\sf G}{\sf F}(s)) which implies (2).

Towards (2)(1)(2)\Rightarrow(1), consider an arbitrary strategy σ\sigma from the initial state s0s_{0}. By (2) we have 0=sS𝒫,s0,σ(𝖦𝖥(s))𝒫,s0,σ(sS𝖦𝖥(s))=𝒫,s0,σ(¬𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)0=\sum_{s\in S}{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}({\sf G}{\sf F}(s))\geq{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\bigcup_{s\in S}{\sf G}{\sf F}(s))={\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\neg\mathtt{Transience}) and thus 𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=1{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Transience})=1.

We now show the implications (2)(3)(4)(5)(2)(2)\Rightarrow(3)\Rightarrow(4)\Rightarrow(5)\Rightarrow(2).

Towards ¬(3)¬(2)\neg(3)\Rightarrow\neg(2), ¬(3)\neg(3) implies s.𝑅𝑒(s)=1\exists s.{\it Re}(s)=1 and thus ε>0.σε𝒫,s,σε(𝖷𝖥(s))1ε\forall\varepsilon>0.\exists\sigma_{\varepsilon}\,{\mathcal{P}}_{{\mathcal{M}},s,\sigma_{\varepsilon}}({\sf X}{\sf F}(s))\geq 1-\varepsilon. Let εi=def2(i+1)\varepsilon_{i}\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}2^{-(i+1)}. We define the strategy σ\sigma to play like σεi\sigma_{\varepsilon_{i}} between the ii-th and (i+1)(i+1)th visit to ss. Since i=1εi<\sum_{i=1}^{\infty}\varepsilon_{i}<\infty, we have i=1(1εi)>0\prod_{i=1}^{\infty}(1-\varepsilon_{i})>0. Therefore 𝒫,s,σ(𝖦𝖥(s))i=1(1εi)>0{\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\sf G}{\sf F}(s))\geq\prod_{i=1}^{\infty}(1-\varepsilon_{i})>0, which implies ¬(2)\neg(2), where s0=ss_{0}=s.

Towards (3)(4)(3)\Rightarrow(4), regardless of s0s_{0} and the chosen strategy, the expected number of visits to ss is upper-bounded by B(s)=defn=0(n+1)(𝑅𝑒(s))n<B(s)\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\sum_{n=0}^{\infty}(n+1)\cdot({\it Re}(s))^{n}<\infty.
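For completeness (a standard geometric-series identity, not spelled out in the text): since Re(s) < 1,
\[
B(s)\;=\;\sum_{n=0}^{\infty}(n+1)\cdot \mathit{Re}(s)^{n}\;=\;\frac{1}{\bigl(1-\mathit{Re}(s)\bigr)^{2}}\;<\;\infty .
\]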

The implication (4)(5)(4)\Rightarrow(5) holds trivially.

Towards ¬(2)¬(5)\neg(2)\Rightarrow\neg(5), by ¬(2)\neg(2) there exist states s0,ss_{0},s and a strategy σ\sigma such that 𝒫,s0,σ(𝖦𝖥(s))>0{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}({\sf G}{\sf F}(s))>0. Thus the expected number of visits to ss is infinite, which implies ¬(5)\neg(5).

We remark that if an MDP is not universally transient, then, unlike in Lemma 3.7(5), the expected number of visits to some state can be infinite under a strategy σ even if σ attains Transience almost surely.

Consider the MDP M with controlled states {s_0, s_1, …}, initial state s_0 and transitions s_0 → s_0 and s_k → s_{k+1} for every k ≥ 0. We define a strategy σ that, while in state s_0, proceeds in rounds i = 1, 2, …. In the i-th round it tosses a fair coin. If Heads, then it goes to s_1. If Tails, then it loops around s_0 exactly 2^i times and then goes to round i+1. In every round the probability of going to s_1 is 1/2, and therefore the probability of staying in s_0 forever is (1/2)^∞ = 0. Thus P_{M,s_0,σ}(Transience) = 1. However, the expected number of visits to s_0 is at least ∑_{i=1}^{∞} (1/2)^i · 2^i = ∞.
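The divergence can be checked concretely. The sketch below (a hypothetical Monte Carlo illustration, not from the paper) simulates the round-based strategy just described and also prints the exact expected contribution of the first k rounds, which is ∑_{i=1}^{k}(1/2)^i · 2^i = k, confirming that the expectation is infinite even though the run leaves s_0 almost surely.

```python
import random

def visits_to_s0(max_rounds=60):
    """Simulate the round-based strategy; return the number of visits to s_0."""
    visits = 1                        # the initial visit to s_0
    for i in range(1, max_rounds + 1):
        if random.random() < 0.5:     # Heads: move on to s_1 and never return
            return visits
        visits += 2 ** i              # Tails: loop around s_0 exactly 2^i times
    return visits

# Exact expected visits contributed by the first k rounds: sum_{i=1}^k (1/2)^i * 2^i = k.
for k in (10, 20, 40):
    print(k, sum((0.5 ** i) * (2 ** i) for i in range(1, k + 1)))

# Empirical means keep growing with the sample size (heavy tail, infinite mean).
samples = [visits_to_s0() for _ in range(10_000)]
print(sum(samples) / len(samples))
```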

4 MD Strategies for Transience

We show that there exist uniformly ε\varepsilon-optimal MD strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} and that optimal strategies, where they exist, can also be chosen MD.

First we show that there exist ε\varepsilon-optimal deterministic 1-bit strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} (in Corollary 4.3) and then we show how to dispense with the 1-bit memory (in Lemma 4.5).

It was shown in [14] that there exist ε\varepsilon-optimal deterministic 1-bit strategies for Büchi objectives in acyclic countable MDPs (though not in general MDPs). These 1-bit strategies will be similar to the 1-bit strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} that we aim for in (not necessarily acyclic) countable MDPs. In Lemma 4.1 below we first strengthen the result from [14] and construct ε\varepsilon-optimal deterministic 1-bit strategies for objectives Büchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience}. From this we obtain deterministic 1-bit strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} (Corollary 4.3).

Lemma 4.1.

Let {\mathcal{M}} be a countable MDP, II a finite set of initial states, FF a set of states and ε>0\varepsilon>0. Then there exists a deterministic 1-bit strategy for Büchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience} that is ε\varepsilon-optimal from every sIs\in I.

Proof 4.2 (Proof sketch).

The full proof can be found in Appendix C. It follows the proof of [14, Theorem 5], which considers Büchi(F)\mbox{{B\"{u}chi}}(F) conditions for acyclic (and hence universally transient) MDPs. The only part of that proof that requires modification is [14, Lemma 10], which is replaced here by Lemma C.2 to deal with general MDPs.

In short, from every sIs\in I there exists an ε\varepsilon-optimal strategy σs\sigma_{s} for φ=defBüchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎{\varphi}\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience}. We observe the behavior of the finitely many σs\sigma_{s} for sIs\in I on an infinite, increasing sequence of finite subsets of SS. Based on Lemma C.2, we can define a second stronger objective φφ{\varphi}^{\prime}\subseteq{\varphi} and show sI𝒫,s,σs(φ)𝚟𝚊𝚕,φ(s)2ε\forall_{s\in I}\,{\mathcal{P}}_{{\mathcal{M}},s,\sigma_{s}}({\varphi}^{\prime})\geq{\mathtt{val}_{{\mathcal{M}},{\varphi}}(s)}-2\varepsilon. We then construct a deterministic 1-bit strategy σ\sigma^{\prime} that is optimal for φ{\varphi}^{\prime} from all sIs\in I and thus 2ε2\varepsilon-optimal for φ{\varphi}. Since ε\varepsilon can be chosen arbitrarily small, the result follows.

Unlike for the 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} objective alone (see below), the 1-bit memory is strictly necessary for the Büchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience} objective in Lemma 4.1. The 1-bit lower bound for Büchi(F)\mbox{{B\"{u}chi}}(F) objectives in [14] holds even for acyclic MDPs where 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} is trivially true.

Corollary 4.3.

Let {\mathcal{M}} be a countable MDP, II a finite set of initial states, FF a set of states and ε>0\varepsilon>0.

1. If ∀s ∈ I: val_{M,Büchi(F)}(s) = val_{M,Büchi(F)∩Transience}(s), then there exists a deterministic 1-bit strategy for Büchi(F) that is ε-optimal from every s ∈ I.

2. If M is universally transient then there exists a deterministic 1-bit strategy for Büchi(F) that is ε-optimal from every s ∈ I.

3. There exists a deterministic 1-bit strategy for Transience that is ε-optimal from every s ∈ I.

Proof 4.4.

Towards (1), since sI𝚟𝚊𝚕,Büchi(F)(s)=𝚟𝚊𝚕,Büchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s)\forall s\in I\ {\mathtt{val}_{{\mathcal{M}},\mbox{{B\"{u}chi}}(F)}(s)}={\mathtt{val}_{{\mathcal{M}},\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience}}(s)}, strategies that are ε\varepsilon-optimal for Büchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience} are also ε\varepsilon-optimal for Büchi(F)\mbox{{B\"{u}chi}}(F). Thus the result follows from Lemma 4.1.

Item (2) follows directly from (1), since the precondition always holds in universally transient MDPs.

Towards (3), let F=defSF\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}S. Then we have Büchi(F)𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎=𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mbox{{B\"{u}chi}}(F)\cap\mathtt{Transience}=\mathtt{Transience} and we obtain from Lemma 4.1 that there exists a deterministic 1-bit strategy for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} that is ε\varepsilon-optimal from every sIs\in I.

Note that every acyclic MDP is universally transient and thus Corollary 4.3(2) implies the upper bound on the strategy complexity of Büchi(F)\mbox{{B\"{u}chi}}(F) from [14] (but not vice-versa).

In the next step we show how to dispense with the 1-bit memory and obtain non-uniform ε\varepsilon-optimal MD strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience}.

Lemma 4.5.

Let =(S,S,S,,P){\mathcal{M}}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P) be a countable MDP with initial state s0s_{0}, and ε>0\varepsilon>0. There exists an MD strategy σ\sigma that is ε\varepsilon-optimal for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} from s0s_{0}, i.e., 𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)ε{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Transience})\geq{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}-\varepsilon.

Proof 4.6.

By Lemma 3.5 it suffices to prove the property for finitely branching MDPs. Thus without restriction in the rest of the proof we assume that {\mathcal{M}} is finitely branching.

Let ε=defε/2\varepsilon^{\prime}\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\varepsilon/2. We instantiate Corollary 4.3(3) with I=def{s0}I\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\{s_{0}\} and obtain that there exists an ε\varepsilon^{\prime}-optimal deterministic 1-bit strategy σ^\hat{\sigma} for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} from s0s_{0}.

We now construct a slightly modified MDP M′ as follows. Let S_bad ⊆ S be the subset of states where σ̂ attains zero for Transience in both memory modes, i.e., S_bad := {s ∈ S ∣ P_{M,s,σ̂[0]}(Transience) = P_{M,s,σ̂[1]}(Transience) = 0}. Let S_good := S ∖ S_bad. We obtain M′ from M by making all states in S_bad losing sinks (for Transience), i.e., by deleting all their outgoing edges and adding a self-loop instead. It follows that

𝒫,s0,σ^(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=𝒫,s0,σ^(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎){\mathcal{P}}_{{\mathcal{M}},s_{0},\hat{\sigma}}(\mathtt{Transience})={\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\hat{\sigma}}(\mathtt{Transience}) (1)
σ.𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)\forall\sigma.\ {\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Transience})\geq{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma}(\mathtt{Transience}) (2)

In the following we show that it is possible to play in such a way that, for every sS𝑔𝑜𝑜𝑑s\in S_{\it good}, the expected number of visits to ss is finite. We obtain the deterministic 1-bit strategy σ\sigma^{\prime} in {\mathcal{M}}^{\prime} by modifying σ^\hat{\sigma} as follows. In every state ss and memory mode x{0,1}x\in\{0,1\} where σ^[x]\hat{\sigma}[x] attains 0 for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} and σ^[1x]\hat{\sigma}[1-x] attains >0>0 the strategy σ\sigma^{\prime} sets the memory bit to 1x1-x. (Note that only states sS𝑔𝑜𝑜𝑑s\in S_{\it good} can be affected by this change.) It follows that

sS.𝒫,s,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)𝒫,s,σ^(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)\forall s\in S.\ {\mathcal{P}}_{{\mathcal{M}}^{\prime},s,\sigma^{\prime}}(\mathtt{Transience})\geq{\mathcal{P}}_{{\mathcal{M}}^{\prime},s,\hat{\sigma}}(\mathtt{Transience}) (3)

Moreover, from all states in S𝑔𝑜𝑜𝑑S_{\it good} in {\mathcal{M}}^{\prime} the strategy σ\sigma^{\prime} attains a strictly positive probability of 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} in both memory modes, i.e., for all sS𝑔𝑜𝑜𝑑s\in S_{\it good} we have

t(s,σ′) := min_{x∈{0,1}} P_{M′,s,σ′[x]}(Transience) > 0.

Let r(s,σ,x)r(s,\sigma^{\prime},x) be the probability, when playing σ[x]\sigma^{\prime}[x] from state ss, of reaching ss again in the same memory mode xx. For every sS𝑔𝑜𝑜𝑑s\in S_{\it good} we have r(s,σ,x)<1r(s,\sigma^{\prime},x)<1, since t(s,σ)>0t(s,\sigma^{\prime})>0.

Let R(s)R(s) be the expected number of visits to state ss when playing σ\sigma^{\prime} from s0s_{0} in {\mathcal{M}}^{\prime}, and Rx(s)R_{x}(s) the expected number of visits to ss in memory mode x{0,1}x\in\{0,1\}. For all sS𝑔𝑜𝑜𝑑s\in S_{\it good} we have that

R(s)=R0(s)+R1(s)n=1nr(s,σ,0)n1+n=1nr(s,σ,1)n1<R(s)=R_{0}(s)+R_{1}(s)\leq\sum_{n=1}^{\infty}n\cdot r(s,\sigma^{\prime},0)^{n-1}+\sum_{n=1}^{\infty}n\cdot r(s,\sigma^{\prime},1)^{n-1}<\infty (4)

where the first equality holds by linearity of expectations. Thus the expected number of visits to ss is finite.

Now we upper-bound the probability of visiting S𝑏𝑎𝑑S_{\it bad}. We have 𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)𝒫,s0,σ^(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=𝒫,s0,σ^(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)ε{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma^{\prime}}(\mathtt{Transience})\geq{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\hat{\sigma}}(\mathtt{Transience})={\mathcal{P}}_{{\mathcal{M}},s_{0},\hat{\sigma}}(\mathtt{Transience})\geq{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}-\varepsilon^{\prime} by (3), (1) and the ε\varepsilon^{\prime}-optimality of σ^\hat{\sigma}. Since states in S𝑏𝑎𝑑S_{\it bad} are losing sinks in {\mathcal{M}}^{\prime}, it follows that

𝒫,s0,σ(𝖥S𝑏𝑎𝑑)1𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma^{\prime}}({\sf F}S_{\it bad})\leq 1-{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma^{\prime}}(\mathtt{Transience})\leq 1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime} (5)

We now augment the MDP {\mathcal{M}}^{\prime} by assigning costs to transitions as follows. Let i:Si:S\to\mathbb{N} be an enumeration of the state space, i.e., a bijection. Let S𝑔𝑜𝑜𝑑=def{sS𝑔𝑜𝑜𝑑R(s)>0}S_{\it good}^{\prime}\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\{s\in S_{\it good}\mid R(s)>0\} be the subset of states in S𝑔𝑜𝑜𝑑S_{\it good} that are visited with non-zero probability when playing σ\sigma^{\prime} from s0s_{0}. Each transition sss^{\prime}\to s is assigned a cost:

  • If sS𝑏𝑎𝑑s^{\prime}\in S_{\it bad} then sS𝑏𝑎𝑑s\in S_{\it bad} by def. of {\mathcal{M}}^{\prime}. We assign cost 0.

  • If sS𝑔𝑜𝑜𝑑s^{\prime}\in S_{\it good} and sS𝑏𝑎𝑑s\in S_{\it bad} we assign cost K/(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε)K/(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime}) for K=def(1+ε)/εK\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}(1+\varepsilon^{\prime})/\varepsilon^{\prime}.

  • If sS𝑔𝑜𝑜𝑑s^{\prime}\in S_{\it good} and sS𝑔𝑜𝑜𝑑s\in S_{\it good}^{\prime} we assign cost 2i(s)/R(s)2^{-i(s)}/R(s). This is well defined, since R(s)>0R(s)>0.

  • If s′ ∈ S_good and s ∈ S_good ∖ S′_good, we assign cost 1.

Note that all transitions leading to states in S𝑔𝑜𝑜𝑑S_{\it good} are assigned a non-zero cost, since R(s)R(s) is finite by (4).
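To summarize the case distinction, the following sketch (hypothetical names, not from the paper) spells out the cost labeling as a function; `val0` stands for val_{M,Transience}(s_0), `R` for the map s ↦ R(s), and `idx` for the enumeration i.

```python
def make_cost(S_bad, S_good_prime, R, idx, val0, eps_prime):
    """Cost of a transition s_prime -> s in M', following the four cases above."""
    K = (1 + eps_prime) / eps_prime

    def cost(s_prime, s):
        if s_prime in S_bad:                 # then s is in S_bad as well, by def. of M'
            return 0.0
        if s in S_bad:                       # entering the losing sinks is expensive
            return K / (1 - val0 + eps_prime)
        if s in S_good_prime:                # reachable good states: weighted by expected visits
            return 2.0 ** (-idx(s)) / R(s)
        return 1.0                           # good states never visited under sigma'

    return cost
```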

When playing σ\sigma^{\prime} from s0s_{0} in {\mathcal{M}}^{\prime}, the expected total cost is upper-bounded by

𝒫,s0,σ(𝖥S𝑏𝑎𝑑)K/(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε)+sS𝑔𝑜𝑜𝑑R(s)2i(s)/R(s){\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma^{\prime}}({\sf F}S_{\it bad})\cdot K/(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime})+\sum_{s\in S_{\it good}^{\prime}}R(s)\cdot 2^{-i(s)}/R(s)

The first part is K\leq K by (5) and the second part is 1\leq 1, since R(s)<R(s)<\infty by (4). Therefore the expected total cost is K+1\leq K+1, i.e., σ\sigma^{\prime} witnesses that it is possible to attain a finite expected cost that is upper-bounded by K+1K+1.

Now we define our MD strategy σ\sigma. Let σ\sigma be an optimal MD strategy on {\mathcal{M}}^{\prime} (from s0s_{0}) that minimizes the expected cost. It exists, as a finite expected cost is attainable and {\mathcal{M}}^{\prime} is finitely branching; see [21, Theorem 7.3.6].

We now show that σ\sigma attains 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} with high probability in {\mathcal{M}}^{\prime} (and in {\mathcal{M}}). Since σ\sigma is cost-optimal, its attained cost from s0s_{0} is upper-bounded by that of σ\sigma^{\prime}, i.e., K+1\leq K+1. Since the cost of entering S𝑏𝑎𝑑S_{\it bad} is K/(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε)K/(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime}), we have 𝒫,s0,σ(𝖥S𝑏𝑎𝑑)K/(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε)K+1{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma}({\sf F}S_{\it bad})\cdot K/(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime})\leq K+1 and thus

𝒫,s0,σ(𝖥S𝑏𝑎𝑑)K+1K(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε){\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma}({\sf F}S_{\it bad})\leq\frac{K+1}{K}(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime}) (6)

For every state sS𝑔𝑜𝑜𝑑s\in S_{\it good}, all transitions into ss have the same fixed non-zero cost. Thus every run that visits some state sS𝑔𝑜𝑜𝑑s\in S_{\it good} infinitely often has infinite cost. Since the expected cost of playing σ\sigma from s0s_{0} is K+1\leq K+1, such runs must be a null-set, i.e.,

𝒫,s0,σ(¬𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎𝖦S𝑔𝑜𝑜𝑑)=0{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma}(\neg\mathtt{Transience}\,\wedge\,{\sf G}S_{\it good})=0 (7)

Thus

𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)\displaystyle{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Transience})
𝒫,s0,σ(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)\displaystyle\geq{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma}(\mathtt{Transience}) by (2)
=1𝒫,s0,σ(𝖥S𝑏𝑎𝑑)\displaystyle=1-{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\sigma}({\sf F}S_{\it bad}) by (7)
1K+1K(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε)\displaystyle\geq 1-\frac{K+1}{K}(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime}) by (6)
=𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)ε(1/K)(1𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)+ε)\displaystyle={\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}-\varepsilon^{\prime}-(1/K)(1-{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}+\varepsilon^{\prime})
𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)ε(1/K)(1+ε)\displaystyle\geq{\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}-\varepsilon^{\prime}-(1/K)(1+\varepsilon^{\prime})
=𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)2ε\displaystyle={\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}-2\varepsilon^{\prime} def. of KK
=𝚟𝚊𝚕,𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎(s0)ε\displaystyle={\mathtt{val}_{{\mathcal{M}},\mathtt{Transience}}(s_{0})}-\varepsilon def. of ε\varepsilon^{\prime}

Now we lift the result of Lemma 4.5 from non-uniform to uniform strategies (and to optimal strategies) and obtain the following theorem. The proof is a generalization of a “plastering” construction by Ornstein [20] (see also [16]) from reachability to tail objectives, which works by fixing MD strategies on ever expanding subsets of the state space.

Theorem 4.7.

Let =(S,S,S,,P){\mathcal{M}}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P) be a countable MDP, and let φ{\varphi} be an objective that is tail in {\mathcal{M}}. Suppose for every sSs\in S there exist ε\varepsilon-optimal MD strategies for φ{\varphi}. Then:

1. There exist uniform ε-optimal MD strategies for φ.

2. There exists a single MD strategy that is optimal from every state that has an optimal strategy.

Theorem 4.8.

In every countable MDP there exist uniform ε\varepsilon-optimal MD strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience}. Moreover, there exists a single MD strategy that is optimal for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} from every state that has an optimal strategy.

Proof 4.9.

Immediate from Lemma 4.5 and Theorem 4.7, since Transience is a tail objective.

5 Strategy Complexity in Universally Transient MDPs

The strategy complexity of parity objectives in general MDPs is known [15]. Here we show that some parity objectives have a lower strategy complexity in universally transient MDPs. It is known [14] that there are acyclic (and hence universally transient) MDPs where ε\varepsilon-optimal strategies for {1,2}-𝙿𝚊𝚛𝚒𝚝𝚢{\{1,2\}}\text{-}\mathtt{Parity} (and optimal strategies for {1,2,3}-𝙿𝚊𝚛𝚒𝚝𝚢{\{1,2,3\}}\text{-}\mathtt{Parity}, resp.) require 11 bit.

We show that, for all simpler parity objectives in the Mostowski hierarchy [19], universally transient MDPs admit uniformly (ε\varepsilon-)optimal MD strategies (unlike general MDPs [15]). These results (Theorems 5.3 and 5.5) ultimately rely on the existence of uniformly ε\varepsilon-optimal strategies for safety objectives. While such strategies always exist for finitely branching MDPs – simply pick a value-maximal successor – this is not the case for infinitely branching MDPs [17]. However, we show that universal transience implies the existence of uniformly ε\varepsilon-optimal strategies for safety objectives even for infinitely branching MDPs.

Theorem 5.1.

For every universally transient countable MDP, safety objective and ε > 0 there exists a uniformly ε-optimal MD strategy.

Proof 5.2.

Let =(S,S,S,,P){\mathcal{M}}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P) be a universally transient MDP and ε>0\varepsilon>0. Assume w.l.o.g. that the target TST\subseteq S of the objective φ=𝚂𝚊𝚏𝚎𝚝𝚢(T){\varphi}=\mathtt{Safety}(T) is a (losing) sink and let ι:S\iota:S\to\mathbb{N} be an enumeration of the state space SS.

By Lemma 3.7(3), for every state ss we have 𝑅𝑒(s)=defsupσ𝒫,s,σ(𝖷𝖥(s))<1{\it Re}(s)\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\sup_{\sigma}{\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\sf X}{\sf F}(s))<1 and thus R(s)=defi=0𝑅𝑒(s)i<{\it R}(s)\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}\sum_{i=0}^{\infty}{\it Re}(s)^{i}<\infty. This means that, independent of the chosen strategy, 𝑅𝑒(s){\it Re}(s) upper-bounds the chance to return to ss, and R(s){\it R}(s) bounds the expected number of visits to ss.

Suppose that σ\sigma is an MD strategy which, at any state sSs\in S_{\Box}, picks a successor ss^{\prime} with

𝚟𝚊𝚕(s)𝚟𝚊𝚕(s)ε2ι(s)+1R(s).{\mathtt{val}(s^{\prime})}\quad\geq\quad{\mathtt{val}(s)}-\frac{\varepsilon}{2^{\iota(s)+1}\cdot{\it R}(s)}.

This is possible even if {\mathcal{M}} is infinitely branching, by the definition of value and the fact that R(s)<{\it R}(s)<\infty. We show that 𝒫,s0,σ(𝚂𝚊𝚏𝚎𝚝𝚢(T))𝚟𝚊𝚕(s0)ε{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Safety}(T))\geq{\mathtt{val}(s_{0})}-\varepsilon holds for every initial state s0s_{0}, which implies the claim of the theorem.

Towards this, we define a function cost that labels each transition in the MDP with a real-valued cost: for every controlled transition s ⟶ s′ let cost((s,s′)) := val(s) − val(s′) ≥ 0. Random transitions have cost zero. We will argue that when playing σ from any start state s_0, its attainment w.r.t. the objective Safety(T) is at least the value of s_0 minus the expected total cost, and that this expected total cost is bounded by ε.

For any ii\in\mathbb{N} let us write sis_{i} for the random variable denoting the state just after step ii, and 𝙲𝚘𝚜𝚝(i)=def𝚌𝚘𝚜𝚝(si,si+1){\mathtt{Cost}}(i)\stackrel{{\scriptstyle\text{{\tiny{def}}}}}{{=}}{\mathtt{cost}}(s_{i},s_{i+1}) for the cost of step ii in a random run. We observe that under σ\sigma the expected total cost is bounded in the limit, i.e.,

limn(i=0n1𝙲𝚘𝚜𝚝(i))ε.\lim_{n\to\infty}{\mathcal{E}}\left(\sum_{i=0}^{n-1}{\mathtt{Cost}}(i)\right)\leq\varepsilon. (8)
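A brief sketch of why (8) holds (the appendix gives the full argument): by the choice of σ, each visit to a controlled state s incurs cost at most ε/(2^{ι(s)+1}·R(s)), random steps are free, and the expected number of visits to s is at most R(s) under every strategy, so
\[
\lim_{n\to\infty}{\mathcal{E}}\Bigl(\sum_{i=0}^{n-1}{\mathtt{Cost}}(i)\Bigr)
\;\le\;\sum_{s\in S_{\Box}} R(s)\cdot\frac{\varepsilon}{2^{\iota(s)+1}\cdot R(s)}
\;\le\;\varepsilon\sum_{s\in S}2^{-(\iota(s)+1)}
\;\le\;\varepsilon .
\]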

We moreover note that for every nn,

(𝚟𝚊𝚕(sn))=(𝚟𝚊𝚕(s0))(i=0n1𝙲𝚘𝚜𝚝(i)).{\mathcal{E}}({\mathtt{val}(s_{n})})={\mathcal{E}}({\mathtt{val}(s_{0})})-{\mathcal{E}}\left(\sum_{i=0}^{n-1}{\mathtt{Cost}}(i)\right). (9)

Full proofs of the above two equations can be found in the appendix. Together they imply

\liminf_{n\to\infty}{\mathcal{E}}({\mathtt{val}(s_{n})})\;=\;{\mathtt{val}(s_{0})}-\lim_{n\to\infty}{\mathcal{E}}\left(\sum_{i=0}^{n-1}{\mathtt{Cost}}(i)\right)\;\geq\;{\mathtt{val}(s_{0})}-\varepsilon. (10)

Finally, to show the claim let [snT]:Sω{0,1}[s_{n}\notin T]:S^{\omega}\to\{0,1\} be the random variable that indicates that the nn-th state is not in the target set TT. Note that [snT]𝚟𝚊𝚕(sn)[s_{n}\notin T]\geq{\mathtt{val}(s_{n})} because target states have value 0. We have:

𝒫,s0,σ(𝚂𝚊𝚏𝚎𝚝𝚢(T))\displaystyle{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}(\mathtt{Safety}(T))\quad =𝒫,s0,σ(i=0𝖷i¬T)\displaystyle=\quad{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}\left(\bigwedge_{i=0}^{\infty}{{\sf X}^{i}\neg T}\right) semantics of 𝚂𝚊𝚏𝚎𝚝𝚢(T)=𝖦¬T\mathtt{Safety}(T)={\sf G}\neg T
=limn𝒫,s0,σ(i=0n𝖷i¬T)\displaystyle=\quad\smashoperator[]{\lim_{n\to\infty}^{}}{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}\left(\bigwedge_{i=0}^{n}{\sf X}^{i}\neg T\right) continuity of measures
=limn𝒫,s0,σ(𝖷n¬T)\displaystyle=\quad\smashoperator[]{\lim_{n\to\infty}^{}}{\mathcal{P}}_{{\mathcal{M}},s_{0},\sigma}({\sf X}^{n}\neg T) TT is a sink
=limn([snT])\displaystyle=\quad\smashoperator[]{\lim_{n\to\infty}^{}}{\mathcal{E}}([s_{n}\notin T]) definition of [snT][s_{n}\notin T]
lim infn(𝚟𝚊𝚕(sn))\displaystyle\geq\quad\liminf_{n\to\infty}{\mathcal{E}}({\mathtt{val}(s_{n})}) as [snT]𝚟𝚊𝚕(sn)[s_{n}\notin T]\geq{\mathtt{val}(s_{n})}
𝚟𝚊𝚕(s0)ε\displaystyle\geq\quad{\mathtt{val}(s_{0})}-\varepsilon Equation 10.

We can now combine Theorem 5.1 with the results from [15] to show the existence of MD strategies assuming universal transience.

Theorem 5.3.

For universally transient MDPs optimal strategies for {0,1,2}-𝙿𝚊𝚛𝚒𝚝𝚢{\{0,1,2\}}\text{-}\mathtt{Parity}, where they exist, can be chosen uniformly MD.

Formally, let {\mathcal{M}} be a universally transient MDP with states SS, Col:S{0,1,2}{\mathit{C}ol}:S\to\{0,1,2\}, and φ=𝙿𝚊𝚛𝚒𝚝𝚢(Col){\varphi}=\mathtt{Parity}({\mathit{C}ol}). There exists an MD strategy σ\sigma^{\prime} that is optimal for all states ss that have an optimal strategy: (σΣ.𝒫,s,σ(φ)=𝚟𝚊𝚕(s))𝒫,s,σ(φ)=𝚟𝚊𝚕(s)\big{(}\exists\sigma\in\Sigma.\,{\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\varphi})={\mathtt{val}_{{\mathcal{M}}}(s)}\big{)}\implies{\mathcal{P}}_{{\mathcal{M}},s,\sigma^{\prime}}({\varphi})={\mathtt{val}_{{\mathcal{M}}}(s)}.

Proof 5.4.

Let +{\mathcal{M}}_{+} be the conditioned version of {\mathcal{M}} w.r.t. φ{\varphi} (see [15, Def. 19] for a precise definition). By Lemma 6.8, +{\mathcal{M}}_{+} is still a universally transient MDP and therefore by Theorem 5.1, there exist uniformly ε\varepsilon-optimal MD strategies for every safety objective and every ε>0\varepsilon>0. The claim now follows from [15, Theorem 22].

Theorem 5.5.

For every universally transient countable MDP {\mathcal{M}}, co-Büchi objective and ε>0\varepsilon>0 there exists a uniformly ε\varepsilon-optimal MD strategy.

Formally, let {\mathcal{M}} be a universally transient countable MDP with states SS, Col:S{0,1}{\mathit{C}ol}:S\to\{0,1\} be a coloring, φ=𝙿𝚊𝚛𝚒𝚝𝚢(Col){\varphi}=\mathtt{Parity}({\mathit{C}ol}) and ε>0\varepsilon>0.

There exists an MD strategy σ\sigma^{\prime} s.t. for every state ss, 𝒫,s,σ(φ)𝚟𝚊𝚕(s)ε{\mathcal{P}}_{{\mathcal{M}},s,\sigma^{\prime}}({\varphi})\geq{\mathtt{val}_{{\mathcal{M}}}(s)}-\varepsilon.

Proof 5.6.

This directly follows from Theorem 5.1 and [15, Theorem 25].

6 The Conditioned MDP

Given an MDP {\mathcal{M}} and an objective φ{\varphi} that is tail in {\mathcal{M}}, a construction of a conditioned MDP +{\mathcal{M}}_{+} was provided in [17, Lemma 6] that, very loosely speaking, “scales up” the probability of φ{\varphi} so that any strategy σ\sigma is optimal in {\mathcal{M}} if it is almost surely winning in +{\mathcal{M}}_{+}. For certain tail objectives, this construction was used in [17] to reduce the sufficiency of MD strategies for optimal strategies to the sufficiency of MD strategies for almost surely winning strategies, which is a special case that may be easier to handle.

However, the construction was restricted to states that have an optimal strategy. In fact, states in $\mathcal{M}$ that do not have an optimal strategy do not appear in $\mathcal{M}_{+}$. In the following, we lift this restriction by constructing a more general version of the conditioned MDP, called $\mathcal{M}_{*}$. The MDP $\mathcal{M}_{*}$ will contain all states from $\mathcal{M}$ that have a positive value w.r.t. $\varphi$ in $\mathcal{M}$. Moreover, all these states will have value $1$ in $\mathcal{M}_{*}$. It will then follow from Lemma 6.2(3) below that an $\varepsilon$-optimal strategy in $\mathcal{M}_{*}$ is $\varepsilon\,\mathtt{val}_{\mathcal{M}}(s_{0})$-optimal in $\mathcal{M}$. This allows us to reduce the sufficiency of MD strategies for $\varepsilon$-optimal strategies to the sufficiency of MD strategies for $\varepsilon$-optimal strategies from states with value $1$. In fact, it also follows that if an MD strategy $\sigma$ is uniformly $\varepsilon$-optimal in $\mathcal{M}_{*}$, then it is multiplicatively uniform $\varepsilon$-optimal in $\mathcal{M}$, i.e., $\mathcal{P}_{\mathcal{M},s,\sigma}(\varphi)\geq(1-\varepsilon)\cdot\mathtt{val}_{\mathcal{M}}(s)$ holds for all states $s$.

Definition 6.1.

For an MDP $\mathcal{M}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P)$ and an objective $\varphi$ that is tail in $\mathcal{M}$, define the conditioned version of $\mathcal{M}$ w.r.t. $\varphi$ to be the MDP $\mathcal{M}_{*}=(S_{*},S_{*\Box},S_{*\ocircle},{\longrightarrow}_{*},P_{*})$ with

\begin{align*}
S_{*\Box} &= \{s\in S_{\Box}\mid \mathtt{val}_{\mathcal{M}}(s)>0\}\\
S_{*\ocircle} &= \{s\in S_{\ocircle}\mid \mathtt{val}_{\mathcal{M}}(s)>0\}\cup\{s_{\bot}\}\cup\{(s,t)\in\mathord{\longrightarrow}\mid s\in S_{\Box},\ \mathtt{val}_{\mathcal{M}}(s)>0\}\\
{\longrightarrow}_{*} &= \{(s,(s,t))\in S_{\Box}\times\mathord{\longrightarrow}\mid \mathtt{val}_{\mathcal{M}}(s)>0,\ s{\longrightarrow}t\}\\
&\quad\cup\ \{(s,t)\in S_{\ocircle}\times S\mid \mathtt{val}_{\mathcal{M}}(s)>0,\ \mathtt{val}_{\mathcal{M}}(t)>0\}\\
&\quad\cup\ \{((s,t),t)\in\mathord{\longrightarrow}\times S\mid \mathtt{val}_{\mathcal{M}}(s)>0,\ \mathtt{val}_{\mathcal{M}}(t)>0\}\\
&\quad\cup\ \{((s,t),s_{\bot})\in\mathord{\longrightarrow}\times\{s_{\bot}\}\mid \mathtt{val}_{\mathcal{M}}(s)>\mathtt{val}_{\mathcal{M}}(t)\}\\
&\quad\cup\ \{(s_{\bot},s_{\bot})\}\\
P_{*}(s,t) &= P(s,t)\cdot\frac{\mathtt{val}_{\mathcal{M}}(t)}{\mathtt{val}_{\mathcal{M}}(s)}, \qquad
P_{*}((s,t),t) = \frac{\mathtt{val}_{\mathcal{M}}(t)}{\mathtt{val}_{\mathcal{M}}(s)},\\
P_{*}((s,t),s_{\bot}) &= 1-\frac{\mathtt{val}_{\mathcal{M}}(t)}{\mathtt{val}_{\mathcal{M}}(s)}, \qquad
P_{*}(s_{\bot},s_{\bot}) = 1,
\end{align*}

for a fresh state $s_{\bot}$.

The conditioned MDP is well-defined. Indeed, as $\varphi$ is tail in $\mathcal{M}$, for any $s\in S_{\ocircle}$ we have $\mathtt{val}_{\mathcal{M}}(s)=\sum_{s{\longrightarrow}t}P(s,t)\,\mathtt{val}_{\mathcal{M}}(t)$, and so if $\mathtt{val}_{\mathcal{M}}(s)>0$ then $\sum_{s{\longrightarrow}t}P_{*}(s,t)=1$.
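To make the rescaling concrete, here is a minimal Python sketch (the MDP fragment, its value function and the fraction-based encoding are illustrative assumptions, not taken from the paper) that computes $P_{*}(s,\cdot)$ for a random state $s$ and checks the normalization argument above.

```python
from fractions import Fraction as F

# Hypothetical random state s of M with three successors and a value function
# satisfying val(s) = sum_t P(s,t) * val(t), as guaranteed for tail objectives.
P_s = {"t1": F(1, 2), "t2": F(1, 4), "t3": F(1, 4)}              # P(s, .)
val = {"s": F(7, 16), "t1": F(3, 4), "t2": F(1, 4), "t3": F(0)}  # val_M(.)

assert val["s"] == sum(p * val[t] for t, p in P_s.items())

# P_*(s,t) = P(s,t) * val(t) / val(s); successors of value 0 are not states
# of M_* and are therefore dropped.
P_star_s = {t: p * val[t] / val["s"] for t, p in P_s.items() if val[t] > 0}

assert sum(P_star_s.values()) == 1   # well-definedness of the conditioned MDP
print(P_star_s)                      # {'t1': Fraction(6, 7), 't2': Fraction(1, 7)}
```

Controlled states of $\mathcal{M}$ are handled via the intermediate states $(s,t)$ and the fresh sink $s_{\bot}$ as in the definition; the sketch only covers the random-state case.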

Lemma 6.2.

Let $\mathcal{M}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P)$ be an MDP, and let $\varphi$ be an objective that is tail in $\mathcal{M}$. Let $\mathcal{M}_{*}=(S_{*},S_{*\Box},S_{*\ocircle},{\longrightarrow}_{*},P_{*})$ be the conditioned version of $\mathcal{M}$ w.r.t. $\varphi$. Let $s_{0}\in S_{*}\cap S$. Let $\sigma\in\Sigma_{\mathcal{M}_{*}}$, and note that $\sigma$ can be transformed to a strategy in $\mathcal{M}$ in a natural way. Then:

1. For all $n\geq 0$ and all partial runs $s_{0}s_{1}\cdots s_{n}\in s_{0}S_{*}^{*}$ in $\mathcal{M}_{*}$ with $s_{n}\in S$:

   $$\mathtt{val}_{\mathcal{M}}(s_{0})\cdot\mathcal{P}_{\mathcal{M}_{*},s_{0},\sigma}(s_{0}s_{1}\cdots s_{n}S_{*}^{\omega})\ =\ \mathcal{P}_{\mathcal{M},s_{0},\sigma}(\overline{s_{0}s_{1}\cdots s_{n}}\,S^{\omega})\cdot\mathtt{val}_{\mathcal{M}}(s_{n})\,,$$

   where $\overline{w}$ for a partial run $w$ in $\mathcal{M}_{*}$ refers to its natural contraction to a partial run in $\mathcal{M}$; i.e., $\overline{w}$ is obtained from $w$ by deleting all states of the form $(s,t)$.

2. For all measurable $\mathfrak{R}\subseteq s_{0}(S_{*}\setminus\{s_{\bot}\})^{\omega}$ we have

   $$\mathcal{P}_{\mathcal{M},s_{0},\sigma}(\overline{\mathfrak{R}})\ \geq\ \mathtt{val}_{\mathcal{M}}(s_{0})\cdot\mathcal{P}_{\mathcal{M}_{*},s_{0},\sigma}(\mathfrak{R})\ \geq\ \mathcal{P}_{\mathcal{M},s_{0},\sigma}(\overline{\mathfrak{R}}\cap\llbracket\varphi\rrbracket^{s_{0}})\,,$$

   where $\overline{\mathfrak{R}}$ is obtained from $\mathfrak{R}$ by deleting, in all runs, all states of the form $(s,t)$.

3. We have $\mathtt{val}_{\mathcal{M}}(s_{0})\cdot\mathcal{P}_{\mathcal{M}_{*},s_{0},\sigma}(\varphi)=\mathcal{P}_{\mathcal{M},s_{0},\sigma}(\varphi)$. In particular, $\mathtt{val}_{\mathcal{M}_{*}}(s_{0})=1$, and, for any $\varepsilon\geq 0$, strategy $\sigma$ is $\varepsilon$-optimal in $\mathcal{M}_{*}$ if and only if it is $\varepsilon\,\mathtt{val}_{\mathcal{M}}(s_{0})$-optimal in $\mathcal{M}$.

Lemma 6.2.3 provides a way of proving the existence of MD strategies that attain, for each state $s$, a fixed fraction (arbitrarily close to $1$) of the value of $s$:

Theorem 6.3.

Let $\mathcal{M}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P)$ be an MDP, and let $\varphi$ be an objective that is tail in $\mathcal{M}$. Let $\mathcal{M}_{*}=(S_{*},S_{*\Box},S_{*\ocircle},{\longrightarrow}_{*},P_{*})$ be the conditioned version of $\mathcal{M}$ w.r.t. $\varphi$, and let $\varepsilon\geq 0$. Any MD strategy $\sigma$ that is uniformly $\varepsilon$-optimal in $\mathcal{M}_{*}$ (i.e., $\mathcal{P}_{\mathcal{M}_{*},s,\sigma}(\varphi)\geq\mathtt{val}_{\mathcal{M}_{*}}(s)-\varepsilon$ holds for all $s\in S_{*}$) is multiplicatively $\varepsilon$-optimal in $\mathcal{M}$ (i.e., $\mathcal{P}_{\mathcal{M},s,\sigma}(\varphi)\geq(1-\varepsilon)\,\mathtt{val}_{\mathcal{M}}(s)$ holds for all $s\in S$).

Proof 6.4.

Immediate from Lemma 6.2.3: for every $s\in S$ with $\mathtt{val}_{\mathcal{M}}(s)>0$ we have $\mathcal{P}_{\mathcal{M},s,\sigma}(\varphi)=\mathtt{val}_{\mathcal{M}}(s)\cdot\mathcal{P}_{\mathcal{M}_{*},s,\sigma}(\varphi)\geq\mathtt{val}_{\mathcal{M}}(s)\cdot(1-\varepsilon)$, using $\mathtt{val}_{\mathcal{M}_{*}}(s)=1$; for states $s$ with $\mathtt{val}_{\mathcal{M}}(s)=0$ the claim holds trivially.

As an application of Theorem 6.3, we can strengthen the first statement of Theorem 4.8 to multiplicatively uniform $\varepsilon$-optimal MD strategies (in the sense of Theorem 6.3) for $\mathtt{Transience}$.

Corollary 6.5.

In every countable MDP there exist multiplicatively uniform ε\varepsilon-optimal MD strategies for 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience}.

Proof 6.6.

Let $\mathcal{M}$ be a countable MDP, and $\mathcal{M}_{*}$ its conditioned version w.r.t. $\mathtt{Transience}$. Let $\varepsilon>0$. By Theorem 4.8, there is a uniformly $\varepsilon$-optimal MD strategy $\sigma$ for $\mathtt{Transience}$ in $\mathcal{M}_{*}$. By Theorem 6.3, the strategy $\sigma$ is multiplicatively uniform $\varepsilon$-optimal in $\mathcal{M}$.

The following lemma, stating that universal transience is closed under “conditioning”, is needed for the proof of Lemma 6.8 below.

Lemma 6.7.

Let $\mathcal{M}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P)$ be an MDP, and let $\varphi$ be an objective that is tail in $\mathcal{M}$. Let $\mathcal{M}_{*}=(S_{*},S_{*\Box},S_{*\ocircle},{\longrightarrow}_{*},P_{*})$ be the conditioned version of $\mathcal{M}$ w.r.t. $\varphi$, where $s_{\bot}$ is replaced by an infinite chain $s_{\bot}^{1}{\longrightarrow}s_{\bot}^{2}{\longrightarrow}\cdots$. If $\mathcal{M}$ is universally transient, then so is $\mathcal{M}_{*}$.

In [17, Lemma 6] a variant, say +{\mathcal{M}}_{+}, of the conditioned MDP {\mathcal{M}}_{*} from Definition 6.1 was proposed. This variant +{\mathcal{M}}_{+} differs from {\mathcal{M}}_{*} in that +{\mathcal{M}}_{+} has only those states ss from {\mathcal{M}} that have an optimal strategy, i.e., a strategy σ\sigma with 𝒫,s,σ(φ)=𝚟𝚊𝚕(s){\mathcal{P}}_{{\mathcal{M}},s,\sigma}({\varphi})={\mathtt{val}_{{\mathcal{M}}}(s)}. Further, for any transition sts{\longrightarrow}t in +{\mathcal{M}}_{+} where ss is a controlled state, we have 𝚟𝚊𝚕(s)=𝚟𝚊𝚕(t){\mathtt{val}_{{\mathcal{M}}}(s)}={\mathtt{val}_{{\mathcal{M}}}(t)}, i.e., +{\mathcal{M}}_{+} does not have value-decreasing transitions emanating from controlled states. The following lemma was used in the proof of Theorem 5.3:

Lemma 6.8.

Let {\mathcal{M}} be an MDP, and let φ{\varphi} be an objective that is tail in {\mathcal{M}}. Let +{\mathcal{M}}_{+} be the conditioned version w.r.t. φ{\varphi} in the sense of [17, Lemma 6]. If {\mathcal{M}} is universally transient, then so is +{\mathcal{M}}_{+}.

7 Conclusion

The 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} objective admits ε\varepsilon-optimal (resp. optimal) MD strategies even in infinitely branching MDPs. This is unusual, since ε\varepsilon-optimal strategies for most other objectives require infinite memory if the MDP is infinitely branching (in particular all objectives generalizing Safety [17]).

𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} encodes a notion of continuous progress, which can be used as a tool to reason about the strategy complexity of other objectives in countable MDPs. E.g., our result on 𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎\mathtt{Transience} is used in [18] as a building block to show upper bounds on the strategy complexity of certain threshold objectives w.r.t. mean payoff, total payoff and point payoff.

References

  • [1] Pieter Abbeel and Andrew Y. Ng. Learning first-order Markov models for control. In Advances in Neural Information Processing Systems 17. MIT Press, 2004. URL: http://papers.nips.cc/paper/2569-learning-first-order-markov-models-for-control.
  • [2] Galit Ashkenazi-Golan, János Flesch, Arkadi Predtetchinski, and Eilon Solan. Reachability and safety objectives in Markov decision processes on long but finite horizons. Journal of Optimization Theory and Applications, 2020.
  • [3] Christel Baier and Joost-Pieter Katoen. Principles of Model Checking. MIT Press, 2008.
  • [4] Patrick Billingsley. Probability and Measure. Wiley, 1995. Third Edition.
  • [5] Vincent D. Blondel and John N. Tsitsiklis. A survey of computational complexity results in systems and control. Automatica, 2000.
  • [6] Nicole Bäuerle and Ulrich Rieder. Markov Decision Processes with Applications to Finance. Springer-Verlag Berlin Heidelberg, 2011.
  • [7] K. Chatterjee and T. Henzinger. A survey of stochastic ω\omega-regular games. Journal of Computer and System Sciences, 2012.
  • [8] Edmund M. Clarke, Thomas A. Henzinger, Helmut Veith, and Roderick Bloem, editors. Handbook of Model Checking. Springer, 2018. doi:10.1007/978-3-319-10575-8.
  • [9] E.M. Clarke, O. Grumberg, and D. Peled. Model Checking. MIT Press, Dec. 1999.
  • [10] William Feller. An Introduction to Probability Theory and Its Applications. Wiley & Sons, second edition, 1966.
  • [11] János Flesch, Arkadi Predtetchinski, and William Sudderth. Simplifying optimal strategies in limsup and liminf stochastic games. Discrete Applied Mathematics, 2018.
  • [12] T.P. Hill and V.C. Pestien. The existence of good Markov strategies for decision processes with general payoffs. Stoch. Processes and Appl., 1987.
  • [13] S. Kiefer, R. Mayr, M. Shirmohammadi, and P. Totzke. Transience in countable MDPs. In International Conference on Concurrency Theory, LIPIcs, 2021. Full version at https://arxiv.org/abs/2012.13739.
  • [14] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Patrick Totzke. Büchi objectives in countable MDPs. In International Colloquium on Automata, Languages and Programming, LIPIcs, 2019. Full version at https://arxiv.org/abs/1904.11573. doi:10.4230/LIPIcs.ICALP.2019.119.
  • [15] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Patrick Totzke. Strategy Complexity of Parity Objectives in Countable MDPs. In International Conference on Concurrency Theory, 2020. doi:10.4230/LIPIcs.CONCUR.2020.7.
  • [16] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, Patrick Totzke, and Dominik Wojtczak. How to play in infinite MDPs (invited talk). In International Colloquium on Automata, Languages and Programming, 2020. doi:10.4230/LIPIcs.ICALP.2020.3.
  • [17] Stefan Kiefer, Richard Mayr, Mahsa Shirmohammadi, and Dominik Wojtczak. Parity Objectives in Countable MDPs. In Annual IEEE Symposium on Logic in Computer Science, 2017. doi:10.1109/LICS.2017.8005100.
  • [18] Richard Mayr and Eric Munday. Strategy Complexity of Mean Payoff, Total Payoff and Point Payoff Objectives in Countable MDPs. In International Conference on Concurrency Theory, LIPIcs, 2021. The full version is available on arXiv.
  • [19] A. Mostowski. Regular expressions for infinite trees and a standard form of automata. In Computation Theory, LNCS, 1984.
  • [20] Donald Ornstein. On the existence of stationary optimal strategies. Proceedings of the American Mathematical Society, 1969. doi:10.2307/2035700.
  • [21] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., 1st edition, 1994.
  • [22] George Santayana. Reason in common sense. In Volume 1 of The Life of Reason. 1905. URL: https://en.wikipedia.org/wiki/George_Santayana.
  • [23] Manfred Schäl. Markov decision processes in finance and dynamic options. In Handbook of Markov Decision Processes. Springer, 2002.
  • [24] Olivier Sigaud and Olivier Buffet. Markov Decision Processes in Artificial Intelligence. John Wiley & Sons, 2013.
  • [25] William D. Sudderth. Optimal Markov strategies. Decisions in Economics and Finance, 2020.
  • [26] R.S. Sutton and A.G Barto. Reinforcement Learning: An Introduction. Adaptive Computation and Machine Learning. MIT Press, 2018.
  • [27] Moshe Y. Vardi. Automatic verification of probabilistic concurrent finite-state programs. In Annual Symposium on Foundations of Computer Science. IEEE Computer Society, 1985. doi:10.1109/SFCS.1985.12.

Appendix A Strategy Classes

We formalize the amount of memory needed to implement strategies. Let 𝖬{\sf M} be a countable set of memory modes. An update function is a function u:𝖬×S𝒟(𝖬×S)u:{\sf M}\times S\to\mathcal{D}({\sf M}\times S) that meets the following two conditions, for all modes 𝗆𝖬{\sf m}\in{\sf M}:

  • for all controlled states sSs\in S_{\Box}, the distribution u((𝗆,s))u(({\sf m},s)) is over 𝖬×{sss}{\sf M}\times\{s^{\prime}\mid s{\longrightarrow}{}s^{\prime}\}.

  • for all random states sSs\in S_{\ocircle}, we have that 𝗆𝖬u((𝗆,s))(𝗆,s)=P(s)(s)\sum_{{\sf m}^{\prime}\in{\sf M}}u(({\sf m},s))({\sf m}^{\prime},s^{\prime})=P(s)(s^{\prime}).

An update function $u$ together with an initial memory $\mathsf{m}_{0}$ induces a strategy $u[\mathsf{m}_{0}]:S^{*}S_{\Box}\to\mathcal{D}(S)$ as follows. Consider the Markov chain with state set $\mathsf{M}\times S$, transition relation $(\mathsf{M}\times S)^{2}$ and probability function $u$. Any partial run $\rho=s_{0}\cdots s_{i}$ in $\mathcal{M}$ gives rise to a set $H(\rho)=\{(\mathsf{m}_{0},s_{0})\cdots(\mathsf{m}_{i},s_{i})\mid \mathsf{m}_{0},\ldots,\mathsf{m}_{i}\in\mathsf{M}\}$ of partial runs in this Markov chain. Each $\rho s\in s_{0}S^{*}S_{\Box}$ induces a probability distribution $\mu_{\rho s}\in\mathcal{D}(\mathsf{M})$, where $\mu_{\rho s}(\mathsf{m})$ is the probability of being in state $(\mathsf{m},s)$ conditioned on having taken some partial run from $H(\rho s)$. We define $u[\mathsf{m}_{0}]$ such that $u[\mathsf{m}_{0}](\rho s)(s^{\prime})\stackrel{\text{def}}{=}\sum_{\mathsf{m},\mathsf{m}^{\prime}\in\mathsf{M}}\mu_{\rho s}(\mathsf{m})\,u((\mathsf{m},s))(\mathsf{m}^{\prime},s^{\prime})$ for all $\rho s\in S^{*}S_{\Box}$ and $s^{\prime}\in S$.
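As an illustration of how $u[\mathsf{m}_{0}]$ can be computed, the following Python sketch tracks the conditional distribution $\mu_{\rho s}$ over memory modes along an observed partial run and returns the induced distribution over successor states. The dictionary encoding of $u$ and the function name are assumptions made for this sketch only.

```python
from collections import defaultdict

def induced_action(u, m0, run):
    """u maps (mode, state) to a dict {(mode', state'): probability};
    run = [s0, ..., sn] is an observed partial run ending in a controlled state
    and is assumed to have positive probability under u.
    Returns the distribution u[m0](run) over successor states."""
    mu = {m0: 1.0}                                   # point mass on the initial mode
    for s, s_next in zip(run, run[1:]):              # filter modes along the run
        new_mu = defaultdict(float)
        for m, w in mu.items():
            for (m2, t), p in u.get((m, s), {}).items():
                if t == s_next:                      # condition on the observed successor
                    new_mu[m2] += w * p
        total = sum(new_mu.values())
        mu = {m: w / total for m, w in new_mu.items()}
    s, out = run[-1], defaultdict(float)             # sum over current and next modes
    for m, w in mu.items():
        for (m2, t), p in u.get((m, s), {}).items():
            out[t] += w * p
    return dict(out)
```

For a deterministic update function the mode distribution $\mu$ is always a point mass, which is the case relevant for the MD and deterministic 1-bit strategies below.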

We say that a strategy σ\sigma can be implemented with memory 𝖬{\sf M} (and initial memory 𝗆0{\sf m}_{0}) if there exists an update function uu such that σ=u[𝗆0]\sigma=u[{\sf m}_{0}]. In this case we may also write σ[𝗆0]\sigma[{\sf m}_{0}] to explicitly specify the initial memory mode 𝗆0{\sf m}_{0}. Based on this, we can define several classes of strategies:

A strategy $\sigma$ is memoryless (M) (also called positional) if it can be implemented with a memory of size $1$. We may view M-strategies as functions $\sigma:S_{\Box}\to\mathcal{D}(S)$. A strategy $\sigma$ is finite memory (F) if there exists a finite memory $\mathsf{M}$ implementing $\sigma$. More specifically, a strategy is 1-bit if it can be implemented with a memory of size $2$. Such a strategy is then determined by a function $u:\{0,1\}\times S\to\mathcal{D}(\{0,1\}\times S)$. Deterministic 1-bit strategies are both deterministic and 1-bit.

Appendix B Missing Proofs from Section 3

In this section, we prove Lemma 3.5 from the main body.

See 3.5

Proof B.1.

Given an infinitely branching MDP $\mathcal{M}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P)$ with set $S$ of states and an initial state $s_{0}\in S$, we construct a finitely branching $\mathcal{M}^{\prime}$ with set $S^{\prime}$ of states such that $s_{0}\in S^{\prime}$. The reduction uses the concept of “recurrent ladders”; see Figure 2.

The reduction is as follows.

  • For every controlled state $s$ in $\mathcal{M}$ with infinite branching $s\to s_{i}$ for all $i\geq 1$, we introduce a recurrent ladder in $\mathcal{M}^{\prime}$, consisting of the controlled states $(\ell_{s,i})_{i\in\mathbb{N}}$ and the random states $(\ell^{\prime}_{s,i})_{i\geq 1}$. The set of transitions includes $s\to\ell_{s,0}$ and $\ell_{s,0}\to\ell_{s,1}$, and for all $i\geq 1$ two transitions $\ell_{s,i}\to\ell^{\prime}_{s,i}$ and $\ell_{s,i}\to s_{i}$. Moreover, each random state $\ell^{\prime}_{s,i}$ leads back into the ladder in such a way that the induced random walk on the ladder is recurrent (cf. Figure 2). Here, all states of the recurrent ladder are fresh states.

  • For every random state $s$ in $\mathcal{M}$ with infinite branching $s\to s_{i}$, with probability $p_{i}$, for all $i\geq 1$, we use a gadget with fresh random states $z_{i}$ and suitably adjusted probabilities $p_{i}^{\prime}$ to ensure that the gadget is left at state $s_{i}$ with exactly probability $p_{i}$, i.e., $p_{i}^{\prime}=p_{i}/\big(\prod_{j=1}^{i-1}(1-p_{j}^{\prime})\big)$; a small numerical check of these probabilities is given below.

See Figure 3 for a partial illustration.
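The following Python sketch (with an assumed example distribution $p_{i}=2^{-i}$, not part of the construction) checks that the adjusted probabilities $p_{i}^{\prime}$ indeed make the gadget exit at $s_{i}$ with probability exactly $p_{i}$.

```python
from fractions import Fraction as F

p = [F(1, 2**i) for i in range(1, 11)]        # assumed example: p_i = 2^{-i}

p_prime, stay = [], F(1)                      # `stay` = probability of reaching z_i
for p_i in p:
    p_i_prime = p_i / stay                    # p'_i = p_i / prod_{j<i} (1 - p'_j)
    p_prime.append(p_i_prime)
    assert stay * p_i_prime == p_i            # gadget exits at s_i with probability p_i
    stay *= 1 - p_i_prime                     # otherwise continue to z_{i+1}

print([str(x) for x in p_prime])              # for this example every p'_i equals 1/2
```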

Given ρ=q0q1qnS+\rho=q_{0}q_{1}\cdots q_{n}\in S^{+} denote by last(ρ)=qn\mathrm{last}(\rho)=q_{n} the last state of ρ\rho.


For the first item, let $\alpha_{1}$ be a general strategy $\alpha_{1}:S^{*}S_{\Box}\to\mathcal{D}(S)$ in $\mathcal{M}$. We define $\beta_{1}$ in $\mathcal{M}^{\prime}$ using the memory $\mathsf{M}=S^{*}\times\big(\{\bot\}\cup\{i\in\mathbb{N}\mid i\geq 1\}\big)$ and an update function $u$; see Appendix A. The definition of $u:\mathsf{M}\times S^{\prime}\to\mathcal{D}(\mathsf{M}\times S^{\prime})$ is as follows. For all $q,q^{\prime}\in S^{\prime}$ and $\rho\in S^{*}$,

  • for all $\mathsf{m}=(\rho,\bot)$ and $\mathsf{m}^{\prime}=(\rho q,\bot)$,

    $$u(\mathsf{m},q)(\mathsf{m}^{\prime},q^{\prime})=\begin{cases}P(q)(q^{\prime})&\text{if }q\in S_{\ocircle};\\ \alpha_{1}(\rho q)(q^{\prime})&\text{if }q\in S_{\Box}\text{ is finitely branching in }\mathcal{M};\end{cases}$$

  • for all $\mathsf{m}=(\rho,\bot)$ and $\mathsf{m}^{\prime}=(\rho q,j)$ with $j\geq 1$,

    $$u(\mathsf{m},q)(\mathsf{m}^{\prime},q^{\prime})=\alpha_{1}(\rho q)(q_{j})\quad\text{if }q\in S_{\Box}\text{ is infinitely branching in }\mathcal{M}\text{ with }q\to q_{i}\text{ for all }i\geq 1,\text{ and }q^{\prime}=\ell_{q,0};$$

  • for all $\mathsf{m}=\mathsf{m}^{\prime}=(\rho,j)$ with $j\geq 1$,

    $$u(\mathsf{m},q)(\mathsf{m}^{\prime},q^{\prime})=1\quad\text{if }s=\mathrm{last}(\rho)\text{ was infinitely branching in }\mathcal{M},\text{ and if }q=\ell_{s,i},\ q^{\prime}=\ell^{\prime}_{s,i}\text{ and }i<j;$$

  • for all $\mathsf{m}=(\rho,j)$ and $\mathsf{m}^{\prime}=(\rho,\bot)$ with $j\geq 1$,

    $$u(\mathsf{m},q)(\mathsf{m}^{\prime},q^{\prime})=1\quad\text{if }s=\mathrm{last}(\rho)\text{ was infinitely branching in }\mathcal{M}\text{ with }s\to s_{i}\text{ for all }i\geq 1,\text{ and if }q=\ell_{s,j}\text{ and }q^{\prime}=s_{j};$$

  • and $u(\mathsf{m},q)(\mathsf{m}^{\prime},q^{\prime})=0$ otherwise.

The strategy $\beta_{1}$ consists of the above update function $u$ and initial memory $\mathsf{m}_{0}=(\epsilon,\bot)$, where $\epsilon$ is the empty run. Intuitively speaking, in every step $\beta_{1}$ considers the memory $(\rho,x)$ and the current state $q$ to simulate what $\alpha_{1}$ would have played in $\mathcal{M}$. The memory $(\rho,x)$ is such that $\rho$ always records the history of the run projected onto the state space $S$ of $\mathcal{M}$ (omitting the states introduced by the reduction). The second component $x$ of the memory is $\bot$ if the current state is in $S$, and otherwise it is a natural number $j\geq 1$. Such a number $j$ indicates that the controller is currently on a recurrent ladder and must leave the ladder at its $j$-th controlled state. Starting with memory $(\epsilon,\bot)$ and $q=s_{0}$, the strategy $\beta_{1}$ behaves as follows:

  • when $q$ is a random state in $\mathcal{M}$, $\beta_{1}$ only appends $q$ to $\rho$ to keep track of the history;

  • when $q$ is a finitely branching controlled state in $\mathcal{M}$, $\beta_{1}$ plays as $\alpha_{1}(\rho q)$ and appends $q$ to $\rho$;

  • when $q$ is an infinitely branching controlled state in $\mathcal{M}$ with successors $(q_{j})_{j\geq 1}$, then for every $j\geq 1$ the strategy $\beta_{1}$ chooses the first state $\ell_{q,0}$ of the recurrent ladder for $q$ while flipping the memory from $(\rho,\bot)$ to $(\rho,j)$ with probability $\alpha_{1}(\rho q)(q_{j})$. This requires the ladder to be traversed up to state $\ell_{q,j}$ and left from there to $q_{j}$, the $j$-th successor of $q$ in $\mathcal{M}$. Furthermore, $\beta_{1}$ appends $q$ to $\rho$;

  • when $q$ is $\ell_{s,i}$ and the memory is $(\rho,j)$ with $\mathrm{last}(\rho)=s$ and $i<j$, then $\beta_{1}$ stays on the recurrent ladder by picking $\ell^{\prime}_{s,i}$;

  • when $q$ is $\ell_{s,j}$ and the memory is $(\rho,j)$ with $\mathrm{last}(\rho)=s$, then $\beta_{1}$ leaves the ladder from $\ell_{s,j}$ to $s_{j}$, the $j$-th successor of state $s$ in $\mathcal{M}$. In addition, the memory $(\rho,j)$ is flipped back to $(\rho,\bot)$.

By the construction of {\mathcal{M}}^{\prime} and β1\beta_{1}, it follows that β1\beta_{1} in {\mathcal{M}}^{\prime} faithfully simulates α1\alpha_{1} in {\mathcal{M}} and thus 𝒫,s0,α1(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)=𝒫,s0,β1(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎){\mathcal{P}}_{{\mathcal{M}},s_{0},\alpha_{1}}(\mathtt{Transience})={\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\beta_{1}}(\mathtt{Transience}).


For the second item, let $\beta_{2}:S^{\prime}_{\Box}\to S^{\prime}$ be an MD strategy in $\mathcal{M}^{\prime}$, where $S^{\prime}_{\Box}\subseteq S^{\prime}$ is the set of controlled states of $\mathcal{M}^{\prime}$. We define an MD strategy $\alpha_{2}:S_{\Box}\to S$ in $\mathcal{M}$ as follows. For all controlled states $s\in S_{\Box}$,

$$\alpha_{2}(s)=\begin{cases}
\beta_{2}(s)&\text{if }s\in S_{\Box}\text{ is finitely branching in }\mathcal{M};\\[1ex]
s_{j}&\text{if }s\in S_{\Box}\text{ is infinitely branching in }\mathcal{M}\text{ with successors }(s_{i})_{i\geq 1},\text{ and there exists }j\in\mathbb{N}\text{ such that }\beta_{2}(s)=\ell_{s,0},\ \beta_{2}(\ell_{s,0})=\ell_{s,1},\ \beta_{2}(\ell_{s,i})=\ell^{\prime}_{s,i}\text{ for all }0<i<j,\text{ and }\beta_{2}(\ell_{s,j})=s_{j};\\[1ex]
s_{1}&\text{if }s\in S_{\Box}\text{ is infinitely branching in }\mathcal{M}\text{ with successors }(s_{i})_{i\geq 1},\text{ and }\beta_{2}(s)=\ell_{s,0},\ \beta_{2}(\ell_{s,0})=\ell_{s,1}\text{ and }\beta_{2}(\ell_{s,i})=\ell^{\prime}_{s,i}\text{ for all }i>0.
\end{cases}$$

Note that the above strategy is well-defined: in every recurrent ladder in $\mathcal{M}^{\prime}$, either there exists some $j$ such that $\beta_{2}$ exits the ladder at its $j$-th controlled state, or $\beta_{2}$ chooses to stay on the ladder forever. In the latter case, by a Gambler's Ruin argument, the probability of $\mathtt{Transience}$ for those runs that stay on the ladder forever is $0$. By the construction of $\mathcal{M}^{\prime}$, $\alpha_{2}$ faithfully simulates $\beta_{2}$, except when $\beta_{2}$ stays on a ladder forever and the probability of $\mathtt{Transience}$ becomes $0$. In those cases, $\alpha_{2}$ plays what $\beta_{2}$ would have played if it exited the $s$-ladder at $\ell_{s,1}$.

It follows that 𝒫,s0,α2(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎)𝒫,s0,β2(𝚃𝚛𝚊𝚗𝚜𝚒𝚎𝚗𝚌𝚎){\mathcal{P}}_{{\mathcal{M}},s_{0},\alpha_{2}}(\mathtt{Transience})\geq{\mathcal{P}}_{{\mathcal{M}}^{\prime},s_{0},\beta_{2}}(\mathtt{Transience}).
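The case distinction defining $\alpha_{2}$ can be read as a small procedure that follows $\beta_{2}$ along the $s$-ladder until it exits. The Python sketch below is only an illustration: the dictionary encoding of $\beta_{2}$, the state labels and the cut-off bound (standing in for the "stays on the ladder forever" case) are assumptions, not part of the proof.

```python
def alpha_2(s, successors, beta_2, bound=10_000):
    """successors[s] = [s_1, s_2, ...] lists (a finite prefix of) the successors
    of the infinitely branching controlled state s in M; ladder states are
    encoded as ("l", s, i) and ("l'", s, i)."""
    i = 1
    while i <= bound:
        if beta_2[("l", s, i)] == ("l'", s, i):   # beta_2 stays on the ladder
            i += 1
        else:                                     # beta_2 exits at the i-th rung
            return successors[s][i - 1]           # alpha_2(s) = s_i
    # beta_2 never exits (up to the bound): such runs have Transience
    # probability 0 by the Gambler's Ruin argument, and the proof sets
    # alpha_2(s) = s_1 in this case.
    return successors[s][0]
```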

Appendix C 1-Bit Strategy for $\mbox{B\"uchi}(F)\cap\mathtt{Transience}$

See 4.1

Proof C.1.

We prove the claim for finitely branching {\mathcal{M}} first and transfer the result to general MDPs at the end.

Let $\mathcal{M}=(S,S_{\Box},S_{\ocircle},{\longrightarrow},P)$ be a finitely branching countable MDP, $I\subseteq S$ a finite set of initial states, $F\subseteq S$ a set of goal states, and $\varphi\stackrel{\text{def}}{=}\mbox{B\"uchi}(F)\cap\mathtt{Transience}$ the objective.

For every ε>0\varepsilon>0 and every sIs\in I there exists an ε\varepsilon-optimal strategy σs\sigma_{s} such that

$$\mathcal{P}_{\mathcal{M},s,\sigma_{s}}(\varphi)\ \geq\ \mathtt{val}_{\mathcal{M},\varphi}(s)-\varepsilon. \qquad (11)$$

However, the strategies $\sigma_{s}$ might differ from each other and might use randomization and a large (or even infinite) amount of memory. We will construct a single deterministic strategy $\sigma^{\prime}$ that uses only 1 bit of memory such that $\forall_{s\in I}\ \mathcal{P}_{\mathcal{M},s,\sigma^{\prime}}(\varphi)\geq\mathtt{val}_{\mathcal{M},\varphi}(s)-2\varepsilon$. This proves the claim, as $\varepsilon$ can be chosen arbitrarily small.

In order to construct σ\sigma^{\prime}, we first observe the behavior of the finitely many σs\sigma_{s} for sIs\in I on an infinite, increasing sequence of finite subsets of SS. Based on this, we define a second stronger objective φ{\varphi}^{\prime} with

$$\varphi^{\prime}\subseteq\varphi, \qquad (12)$$

and show that all σs\sigma_{s} attain at least 𝚟𝚊𝚕,φ(s)2ε{\mathtt{val}_{{\mathcal{M}},{\varphi}}(s)}-2\varepsilon w.r.t. φ{\varphi}^{\prime}, i.e.,

$$\forall_{s\in I}\ \mathcal{P}_{\mathcal{M},s,\sigma_{s}}(\varphi^{\prime})\geq\mathtt{val}_{\mathcal{M},\varphi}(s)-2\varepsilon. \qquad (13)$$

We construct σ\sigma^{\prime} as a deterministic 1-bit optimal strategy w.r.t. φ{\varphi}^{\prime} from all sIs\in I and obtain

\begin{align*}
\mathcal{P}_{\mathcal{M},s,\sigma^{\prime}}(\varphi)\ &\geq\ \mathcal{P}_{\mathcal{M},s,\sigma^{\prime}}(\varphi^{\prime}) && \text{by Equation 12}\\
&\geq\ \mathcal{P}_{\mathcal{M},s,\sigma_{s}}(\varphi^{\prime}) && \text{by optimality of $\sigma^{\prime}$ for $\varphi^{\prime}$}\\
&\geq\ \mathtt{val}_{\mathcal{M},\varphi}(s)-2\varepsilon && \text{by Equation 13.}
\end{align*}

Behavior of the $\sigma_{s}$, objective $\varphi^{\prime}$, and properties Equation 12 and Equation 13. We start with some notation. Let $\mathsf{bubble}_{k}(X)$ be the set of states that can be reached from some state in the set $X$ within at most $k$ steps. Since $\mathcal{M}$ is finitely branching, $\mathsf{bubble}_{k}(X)$ is finite if $X$ is finite. Let $\mathsf{F}^{\leq k}(X)\stackrel{\text{def}}{=}\{\rho\in S^{\omega}\mid\exists t\leq k.\,\rho(t)\in X\}$ and $\mathsf{F}^{\geq k}(X)\stackrel{\text{def}}{=}\{\rho\in S^{\omega}\mid\exists t\geq k.\,\rho(t)\in X\}$ denote the properties of visiting the set $X$ (at least once) within at most (resp. after at least) $k$ steps. Moreover, let $\varepsilon_{i}\stackrel{\text{def}}{=}\varepsilon\cdot 2^{-(i+1)}$.
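Since $\mathcal{M}$ is finitely branching, $\mathsf{bubble}_{k}(X)$ can be computed by a bounded breadth-first search; a minimal Python sketch, assuming the transition relation is given as a dictionary `succ` from states to their (finitely many) successors:

```python
def bubble(succ, X, k):
    """States reachable from the finite set X in at most k steps."""
    reached = set(X)
    frontier = set(X)
    for _ in range(k):
        frontier = {t for s in frontier for t in succ.get(s, [])} - reached
        if not frontier:
            break
        reached |= frontier
    return reached
```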

Figure 4: Illustration of the bubble construction. The green region in $K_{1}$ is $F_{1}$, and for all $i\geq 2$, the green region in $K_{i}\setminus L_{i-1}$ is $F_{i}$.
Lemma C.2.

Assume the setup of Lemma 4.1, let $\varphi\stackrel{\text{def}}{=}\mbox{B\"uchi}(F)\cap\mathtt{Transience}$, and let $\sigma_{s}$ be a strategy from each $s\in I$. Let $X\subseteq S$ be a finite set of states and $\varepsilon^{\prime}>0$.

1. There is $k\in\mathbb{N}$ such that $\forall_{s\in I}\ \mathcal{P}_{\mathcal{M},s,\sigma_{s}}\big(\varphi\cap\lnot(\mathsf{F}^{\leq k}(F\setminus X))\big)\ \leq\ \varepsilon^{\prime}$.

2. There is $l\in\mathbb{N}$ such that $\forall_{s\in I}\ \mathcal{P}_{\mathcal{M},s,\sigma_{s}}\big(\varphi\cap\mathsf{F}^{\geq l}(X)\big)\ \leq\ \varepsilon^{\prime}$.

Proof C.3.

It suffices to show the properties for a single s,σss,\sigma_{s} since one can take the maximal k,lk,l over the finitely many sIs\in I.

We observe that $\varphi\subseteq\mathtt{Transience}=\bigcap_{s\in S}\mathsf{F}\mathsf{G}\neg(s)\subseteq\bigcap_{s\in X}\mathsf{F}\mathsf{G}\neg(s)=\mathsf{F}\mathsf{G}\neg(X)$, where the last equivalence is due to the finiteness of $X$.

Towards 1, we have $\varphi=\mathsf{G}\mathsf{F}F\cap\mathtt{Transience}\subseteq\mathsf{G}\mathsf{F}F\cap\mathsf{F}\mathsf{G}\neg(X)\subseteq\mathsf{G}\mathsf{F}(F\setminus X)\subseteq\mathsf{F}(F\setminus X)=\bigcup_{k\in\mathbb{N}}\mathsf{F}^{\leq k}(F\setminus X)$, and therefore $\varphi\cap\bigcap_{k\in\mathbb{N}}\lnot(\mathsf{F}^{\leq k}(F\setminus X))=\emptyset$. It follows from the continuity of measures that $\lim_{k\to\infty}\mathcal{P}_{\mathcal{M},s,\sigma_{s}}(\varphi\cap\lnot(\mathsf{F}^{\leq k}(F\setminus X)))=0$.

Towards 2, we have $\varphi\cap\bigcap_{l}\mathsf{F}^{\geq l}(X)\subseteq\mathsf{F}\mathsf{G}\neg(X)\cap\bigcap_{l}\mathsf{F}^{\geq l}(X)=\emptyset$. By continuity of measures we obtain $\lim_{l\to\infty}\mathcal{P}_{\mathcal{M},s,\sigma_{s}}(\varphi\cap\mathsf{F}^{\geq l}(X))=0$.

In the following, let us write $\overline{X}$ to denote the complement of a set $X\subseteq S^{\omega}$ of runs.

By Lemma C.2(1) there is a $k_{1}$ such that for $K_{1}\stackrel{\text{def}}{=}\mathsf{bubble}_{k_{1}}(I)$ and $F_{1}\stackrel{\text{def}}{=}F\cap K_{1}$ we have $\forall_{s\in I}\ \mathcal{P}_{\mathcal{M},s,\sigma_{s}}(\varphi\cap\overline{K_{1}^{*}F_{1}S^{\omega}})\leq\varepsilon_{1}$. We define the pattern

$$R_{1}\stackrel{\text{def}}{=}(K_{1}\setminus F_{1})^{*}F_{1}$$

and obtain sI𝒫,s,σs(φR1Sω¯)ε1\forall_{s\in I}\,{\mathcal{P}}_{{\mathcal{M}},s,\sigma_{s}}({\varphi}\cap\overline{R_{1}S^{\omega}})\leq\varepsilon_{1}. By Lemma C.2(2) there is an l1>k1l_{1}>k_{1} such that sI𝒫,s,σs(𝖥l1(K1